Retrieving fashion products based on a query image
Have you ever seen a picture on Instagram and thought, “Oh, wow! I want these shoes”? Or been inspired by your favourite fashion blogger and looked for similar products (for example, on Zalando)? Visual search for fashion, the task of identifying fashion articles in an image and finding them in an online store, has been the subject of an ever-growing body of scientific literature over the last few years (see for example [1-11]).
At Zalando, we have many outlets where this search is possible: our app, our Facebook chatbot, etc. We want to provide our customers with the best shopping experience possible, and words are not always enough to describe fashion.
Visual search poses some interesting challenges: how to deal with variations in image quality, lighting, background, human pose and article distortion, and how to find the right product in a large database in real time.
Our working scenario so far has been to build on our home-grown FashionDNA to retrieve blazers, dresses, jumpers, shirts, skirts, trousers, t-shirts and tops in fashion images, with or without backgrounds.
Our Data Source
As a fashion company, Zalando creates outfits every day and therefore generates many fashion images annotated with their corresponding products. This means that we can use state-of-the-art learning techniques such as deep nets, which have revolutionized computer vision. As can be seen in Figure 1, these images include full-body poses, half-body close-ups and detailed close-ups of a garment of interest. Although model poses are usually standardized and do not really reflect the more natural poses found on Instagram, having these different kinds of shots allows us to handle different scales. These images also display occlusions (shirts occluded by jackets, for example) and back views.
Figure 1: Examples of images in our dataset. Image types (a-d) are query images featuring models, image type (e) represents the articles we retrieve from.
Unfortunately, an overwhelming majority of our fashion images have standardised clean backgrounds, as shown in Figure 1, which means we have to think of a workaround to learn how to handle images with natural backgrounds.
Studio2Shop: matching model
We have designed a ConvNet model that takes as input a fashion image with a clean Zalando-style background and an assortment of interest, and returns a ranking of the products in the assortment for the eight categories mentioned above.
The products in the assortment are not represented by images, as is common in the literature, but by their FashionDNA. In other words, only a feature representation of the article is needed.
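As a rough illustration of ranking an assortment from feature vectors alone, the sketch below scores each product's feature vector against a query embedding and sorts by score. The cosine score, the `rank_assortment` name and the toy vectors are assumptions made for illustration: Studio2Shop learns its matching function rather than applying a fixed similarity, and real FashionDNA vectors are far larger.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_assortment(query_vec, assortment, top_k=50):
    """Return up to top_k (product_id, score) pairs, best first.

    `assortment` maps a product id to its precomputed feature vector
    (a hypothetical stand-in for FashionDNA; no product image is needed).
    """
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in assortment.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data: three products with 3-dimensional feature vectors.
toy_assortment = {
    "dress_a": [1.0, 0.0, 0.0],
    "shirt_b": [0.0, 1.0, 0.0],
    "dress_c": [0.9, 0.1, 0.0],
}
ranking = rank_assortment([1.0, 0.05, 0.0], toy_assortment, top_k=2)
print([pid for pid, _ in ranking])
```

Because the products are represented only by vectors, the assortment can be swapped without touching any product images.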
Figure 2 below illustrates the setting and the results we can get. On the left is the image of a person wearing an outfit; on the right are the 50 top-ranking products in the assortment. The articles that are actually present in the outfit are marked in green.
Figure 2: Random examples of the retrieval test using 20,000 queries against 50,000 Zalando articles. Query images are in the left-most column. Each query image is next to two rows displaying the top 50 retrieved articles, from left to right, top to bottom. Green boxes show exact hits.
To show its generalization capabilities, we have tested our model, without fine-tuning it, on part of an independent dataset published in [7]. Results are shown in Figure 3 below. Unfortunately, the dataset was modified to fit our setting, so our performance is not comparable with that reported in [7].
Figure 3: Random examples of outcomes of the retrieval test on query images from DeepFashion In-Shop-Retrieval [7]. Query images are in the left-most column. Each query image is next to two rows displaying the top 50 retrieved articles, from left to right, top to bottom. Green boxes show exact hits.
Note that this exercise is a little academic: we focus on finding the exact products because it allows us to assess models quantitatively. In practice, retrieving exact matches is not critical for two reasons: a) it is quite unlikely that the exact product is part of the assortment, and b) the customer is usually looking for inspiration, so a similar item can feel just as rewarding, if not more so, for example because it has a rounder neckline.
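One simple way to turn exact hits into a number is a top-k hit rate: the fraction of queries for which at least one product actually worn in the image appears among the k best-ranked articles. The sketch below is an assumption about how such a score could be computed, not necessarily the metric used in our experiments; the function name and toy ids are hypothetical.

```python
def top_k_hit_rate(rankings, ground_truth, k=50):
    """Fraction of queries with at least one true product in the top k.

    rankings: dict query_id -> list of product ids, best first.
    ground_truth: dict query_id -> set of product ids worn in the image.
    """
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked_ids in rankings.items():
        true_ids = ground_truth.get(query_id, set())
        if true_ids.intersection(ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)


# Toy example: q1's true product is ranked third, q2's is never retrieved.
toy_rankings = {
    "q1": ["p3", "p7", "p1"],
    "q2": ["p2", "p9", "p4"],
}
toy_truth = {"q1": {"p1"}, "q2": {"p5"}}
rate_at_3 = top_k_hit_rate(toy_rankings, toy_truth, k=3)
rate_at_2 = top_k_hit_rate(toy_rankings, toy_truth, k=2)
```

A score like this rewards only exact matches, which is why it says little about how satisfying the similar items in the ranking are.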
Thanks to how this model is built, it is able to provide similar items as a by-product. Figures 2 and 3 show that the style of the 50 top-ranking garments fits the style of the outfit, and that these garments are quite similar to one another.
This means that we can also retrieve similar products from other assortments. Figure 4 below shows the 50 top-ranking garments from a Zalando assortment on query images from [7], without our model being fine-tuned for such images.
Figure 4: Random examples of outcomes of the retrieval test on query images from DeepFashion In-Shop-Retrieval [7] against 50,000 Zalando articles. Query images are in the left-most column. Each query image is next to two rows showing the top 50 retrieved articles, from left to right, top to bottom.
The details of this work can be found in [12].
Extension to images with backgrounds
Unfortunately, training a similar model for natural images would require large amounts of natural fashion images annotated with products, which we don’t have. However, we do have large amounts of fashion images without product annotations, both from public datasets such as Chictopia (10k) and from our own in-house collection. The advantage of public datasets is that the ground-truth segmentation is provided, whereas we have to segment our in-house images ourselves.
Using these images and their segmentation, we have designed and trained Street2Fashion, a U-Net-like segmentation model that finds the person in the image and simply replaces the background with white pixels. The results shown in Figure 5 below are good enough for us to focus on the fashion in the image.
Figure 5: Examples of segmentation results on test images.
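The background replacement step itself is simple once a person mask is available. The sketch below assumes the segmentation model outputs a per-pixel probability of "person" and that a fixed threshold decides what is kept; the data layout, the threshold and the `whiten_background` name are illustrative assumptions, not a description of Street2Fashion's internals.

```python
WHITE = (255, 255, 255)


def whiten_background(image, person_mask, threshold=0.5):
    """Return a copy of `image` with background pixels set to white.

    image: rows of (r, g, b) tuples.
    person_mask: rows of floats in [0, 1], same shape as `image`;
        values at or above `threshold` count as "person".
    """
    if len(image) != len(person_mask):
        raise ValueError("image and mask must have the same height")
    result = []
    for pixel_row, mask_row in zip(image, person_mask):
        if len(pixel_row) != len(mask_row):
            raise ValueError("image and mask must have the same width")
        result.append([
            pixel if prob >= threshold else WHITE
            for pixel, prob in zip(pixel_row, mask_row)
        ])
    return result


# Toy 2x2 image: the mask keeps the top-left and bottom-right pixels.
toy_image = [
    [(10, 20, 30), (40, 50, 60)],
    [(70, 80, 90), (15, 25, 35)],
]
toy_mask = [
    [0.9, 0.1],
    [0.2, 0.8],
]
segmented = whiten_background(toy_image, toy_mask)
```

The output then resembles the clean-background studio images that the matching model was designed for.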
We use Street2Fashion as a preprocessing step, and build Fashion2Shop, a model with the same architecture as Studio2Shop but trained on segmented images. We refer to the full pipeline described in Figure 6 as Street2Fashion2Shop. In practice, a query fashion image is processed by the segmentation model to remove the background, and can then go through the matching model described above to be matched with appropriate products.
Figure 6: Street2Fashion2Shop. The query image (top row) is segmented by Street2Fashion, while FashionDNA is run on the title images of the products in the assortment (bottom row) to obtain static feature vectors. The results of these two operations form the input of Fashion2Shop, which handles the product matching.
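The two-stage flow of Figure 6 can be sketched as a small composition of functions. Everything here is hypothetical scaffolding: `build_pipeline`, `segment` and `match_score` are placeholder names, and the toy stand-ins below only exist so that the sketch runs end to end. The point is the structure: product features are computed once per assortment, while segmentation and matching run once per query.

```python
def build_pipeline(segment, match_score, product_features):
    """Compose a two-stage retrieval function.

    segment: callable(image) -> image with the background removed.
    match_score: callable(segmented_image, feature_vector) -> float.
    product_features: dict product_id -> static feature vector,
        computed once per assortment rather than once per query.
    """
    def retrieve(query_image, top_k=50):
        segmented = segment(query_image)  # stage 1: background removal
        scored = [
            (pid, match_score(segmented, features))  # stage 2: matching
            for pid, features in product_features.items()
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return [pid for pid, _ in scored[:top_k]]
    return retrieve


# Toy stand-ins so the sketch runs; neither is a real model.
def toy_segment(image):
    return [value for value in image if value != "background"]


def toy_match(segmented, features):
    return float(len(set(segmented) & set(features)))


retrieve = build_pipeline(
    toy_segment,
    toy_match,
    {"red_dress": ["red", "dress"], "blue_shirt": ["blue", "shirt"]},
)
results = retrieve(["background", "red", "dress", "background"], top_k=1)
```

Keeping the two stages behind separate callables mirrors the design above, where the matching model keeps the Studio2Shop architecture and only the preprocessing changes.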
Figure 7 shows results obtained using Street2Fashion2Shop.
(a) Random examples of Zalando products retrieval using query images from LookBook [13].
(b) Random examples of Zalando products retrieval using query images from street shots.
Figure 7: Qualitative results on external datasets. For each example, the query image is displayed on the far left, followed by the segmented image and the top 50 product suggestions. Best viewed zoomed in.
The details of this work will shortly be available in [14].
References
[1] X. Wang and T. Zhang. Clothes search in consumer photos via color matching and attribute learning. Multimedia Conference (MM), 2011.
[2] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu and S. Yan. Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set. Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[3] J. Fu, J. Wang, Z. Li, M. Xu and H. Lu. Efficient clothing retrieval with semantic-preserving visual phrases. Asian Conference on Computer Vision (ACCV), 2012.
[4] Y. Kalantidis, L. Kennedy and L.J. Li. Getting the look: Clothing recognition and segmentation for automatic product suggestions in everyday photos. International Conference on Multimedia Retrieval (ICMR), 2013.
[5] K. Yamaguchi, M.H. Kiapour and T.L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. International Conference on Computer Vision (ICCV), 2013.
[6] J. Huang, R.S. Feris, Q. Chen and S. Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. International Conference on Computer Vision (ICCV), 2015.
[7] Z. Liu, P. Luo, S. Qiu, X. Wang and X. Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] E. Simo-Serra and H. Ishikawa. Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction. Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] X. Wang, Z. Sun, W. Zhang, Y. Zhou and Y.G. Jiang. Matching user photos to online products with robust deep features. International Conference on Multimedia Retrieval (ICMR), 2016.
[10] D. Shankar, S. Narumanchi, H.A. Ananya, P. Kompalli and K. Chaudhury. Deep learning based large scale visual recommendation and search for e-commerce. CoRR, 2017.
[11] X. Ji, W. Wang, M. Zhang and Y. Yang. Cross-domain image retrieval with attention modeling. Multimedia Conference (MM), 2017.
[12] J. Lasserre, K. Rasch and R. Vollgraf. Studio2Shop: from studio photo shoots to fashion articles. International Conference on Pattern Recognition Applications and Methods (ICPRAM), 2018.
[13] D. Yoo, N. Kim, S. Park, A.S. Paek and I. Kweon. Pixel-level domain transfer. European Conference on Computer Vision (ECCV), 2016.
[14] J. Lasserre, C. Bracher and R. Vollgraf. To appear in Lecture Notes in Computer Science, 2018.
-- Our visual search engines are currently powered by the company Fashwell; this work is at the research stage. --