Decoding Fashion Contexts Using Word Embeddings


Sagar Arora, Myntra Designs, India ([email protected])

Deepak Warrier, Myntra Designs, India ([email protected])

ABSTRACT
Personalisation in e-commerce hinges on dynamically uncovering the user's context via his/her interactions on the portal. The harder the context is to identify, the less effective the personalisation. Our work attempts to uncover and understand the user's context to effectively render personalisation for fashion e-commerce. We highlight fashion-domain-specific gaps in typical implementations of personalised recommendation systems and present an alternative approach. Our approach hinges on user sessions (clickstream) as a proxy for the context and explores the "session vector" as an atomic unit for personalisation. The approach to learning the context vector incorporates both the fashion product (style) attributes and the users' browsing signals. We establish various possible user contexts (product clusters), and a style can have a fuzzy membership in multiple contexts. We predict the user's context using the skip-gram model with negative sampling introduced by Mikolov et al. [1]. We are able to decode the context with high accuracy even for non-coherent sessions.

Keywords
word2vec; personalization; context; product clusters

1. INTRODUCTION
Internet-based services strive to render personalised offerings that truly address each consumer's need. In addition, each domain poses unique challenges and often necessitates a customised approach. The need to decode the "context", i.e. the implicit need state of the user, from explicit signals of purchases and browse history (clickstream) becomes even more important in the fashion domain.

While the fashion domain can be thought of as similar to movies, music and books as far as personal choice and social-proofing elements are concerned, there are striking differences between fashion and the others. Fashion products have a very short life span; freshness and diversity are indispensable. Ephemeral, seasonal and fast-changing products result in two issues in the fashion domain: a) a manually curated fashion taxonomy is inherently unscalable, and b) transaction signals at a product level are sparse and volatile.

The other notable difference is that fashion products can appear in multiple contexts. A song, movie or book has significantly more of its context baked into the product itself than a T-shirt, which could have been bought to match existing jeans, to pair up with new chinos, or just as a new entry into a user's collection. Pure content-based approaches that do not factor in the context will hence fall short. Our approach addresses the above differences, which are critical to building a personalised recommender system for fashion.

Our approach hinges on user sessions as a proxy for the context and explores a vector representation of this context as an atomic unit for personalisation. These contexts are further clustered to form product groups. Hereafter, we will use the terms product groups, product clusters and contexts interchangeably. For personalisation, we predict the current context of the user, specifically the product group from which the user is likely to browse in the session. A main challenge our approach addresses is that the user's intent is not explicitly clear from the session. Sessions can be either coherent or non-coherent, as shown in Fig 1(a) and Fig 1(b) below. This work also quantifies the coherency of a session based on word embeddings; the better the coherency, the better we can understand the context.

The rest of the paper is organised as follows. In Section 2, we discuss related work. Section 3 formally discusses the various approaches. Section 4 discusses the experimental results. We finally conclude in Section 5.

2. RELATED WORK
Recommender systems have usually been studied in terms of content-based recommendations, collaborative filtering and other hybrid approaches. Recommendations based on content [5] need an accurate vector representation of the content (the context is often missing in the fashion domain), miss the exploration element, and are difficult for new users with very few signals. Collaborative filtering and matrix factorisation [8] have also been studied in the fashion domain. Yang Hu suggested recommendations based on a functional tensor factorisation approach to suggest outfits to users [6]. There have also been recommender systems in fashion using computer vision [7]. Collaborative filtering often suffers from the cold-start problem and has drawbacks in fashion, where the entire context is not baked into the product and the user-item matrix is very sparse due to the temporal nature of items. Apart from these traditional approaches, there has been some work on recommendation using ontologies [9], which again does not scale due to the ever-increasing entities in the fashion domain. An approach similar to ours has been tried in industry by the likes of Stitch Fix [10] and Lyst [11]. However, in our work, we consider user sessions as the document, rather than an individual style, to incorporate the context, and we also create product groups with each style having fuzzy membership in product groups.

3. METHODOLOGY
For the rest of the paper, we will use the following terminology. Let S be the set of all possible styles and C be the set of all possible sessions involving sequences of clicks on those styles (the styles in each session are a subset of S). Each style also has a few catalogued features (attributes). Let A be the set of all possible attributes of a given article type.

S = ∪ S_i, such that each S_i can be represented as ∪ a_j, where a_j ∈ A

C = ∪ K, where K ⊆ S

Typically, content-based personalised recommender systems [5] rely on a notion of similarity between styles i.e. given a style, what are the similar styles. The problem boils down to arriving at a suitable vector representation of style. We present a few of the typical approaches before presenting the skip-gram based approach.

Fig 1(a). Coherent session clearly revealing the context of the user, who wants to buy a Nike sports shoe of a bright colour in a premium price range.

Fig 1(b). Non-coherent session involving casual browsing (Mango Dress, FabAlley Dress, W Tunic, Puma Cap, Roadster Jeans, Replay Jeans) that spans multiple brands, article types and price points, and hence does not reveal any context/intent.

3.1 Straw-man Approaches
One could either go for a feature-based approach, which represents styles based on their attributes, or use clickstream data to create an item-item graph. We can form style vectors either explicitly using attribute vectors (physical space) or using approaches like word2vec [1], GloVe [4], doc2vec [3] and Latent Dirichlet Allocation [2] (latent space).

For approaches like word2vec and GloVe, we learn vectors for each attribute of the style and later aggregate those vectors to form the style vector as follows:

Let a style S_x have the attributes {a_i, a_{i+1}, ..., a_j} and let v_{a_k} be the vector of attribute a_k. Then the vector v_{S_x} for the style can be written as:

v_{S_x} = (1 / (j - i + 1)) * Σ_{k=i..j} v_{a_k}
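As a concrete illustration, the aggregation above is just an element-wise average of the learnt attribute vectors. A minimal sketch, where the attribute vectors and their dimensionality are made up for illustration:

```python
def style_vector(attribute_vectors):
    """Element-wise average of the learnt attribute vectors of one style."""
    n = len(attribute_vectors)
    dim = len(attribute_vectors[0])
    return [sum(v[k] for v in attribute_vectors) / n for k in range(dim)]

# Hypothetical 4-dimensional vectors for three attributes of a single style.
v_brand = [0.2, 0.1, 0.0, 0.4]
v_colour = [0.0, 0.3, 0.1, 0.1]
v_fit = [0.1, 0.2, 0.2, 0.1]

v_style = style_vector([v_brand, v_colour, v_fit])
print(v_style)  # approximately [0.1, 0.2, 0.1, 0.2], up to floating point
```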

For doc2vec, we consider each style as the document and the style’s attributes as words. We directly learn the style vector using this approach.

These approaches suffer from the absence of contextual representation of style - the vector representations are learnt agnostic of user contexts.

In contrast, in the second approach, an item-item graph is an undirected graph with all styles as its nodes, and the edges represent the co-similarity index. We use pointwise mutual information (PMI) as the co-similarity index c_{xy}, defined as:

c_{xy} = log( p(x, y) / (p(x) * p(y)) )

where p(x, y) is the probability of style x and style y co-occurring in the entire corpus of sessions (x, y ∈ S), and p(x) (respectively p(y)) is the probability of style x (style y) occurring in the entire corpus of sessions. This definition of the co-similarity index has a drawback due to the ephemeral nature of styles: each style interacts with only a small subset of the styles in the catalogue. We further reiterate the lack of contextual information in typical approaches in Section 4.2.2.
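A sketch of how c_{xy} could be estimated from a session corpus, assuming probabilities are simple occurrence counts over the number of sessions (the toy session data and style ids are hypothetical):

```python
import math
from collections import Counter
from itertools import combinations

def pmi(sessions):
    """PMI between every pair of styles that co-occur in at least one session.
    Probabilities are estimated as occurrence counts over the session count."""
    n = len(sessions)
    occ = Counter()   # sessions in which a style occurs
    co = Counter()    # sessions in which a pair of styles co-occurs
    for session in sessions:
        styles = set(session)
        occ.update(styles)
        for x, y in combinations(sorted(styles), 2):
            co[(x, y)] += 1
    return {(x, y): math.log((c / n) / ((occ[x] / n) * (occ[y] / n)))
            for (x, y), c in co.items()}

# Toy corpus of four sessions over three hypothetical styles.
sessions = [["s1", "s2"], ["s1", "s2"], ["s1", "s3"], ["s2", "s3"]]
scores = pmi(sessions)
print(scores[("s1", "s2")])  # log((2/4) / ((3/4)*(3/4))) = log(8/9), slightly negative
```

The slightly negative score shows how PMI discounts pairs whose individual popularity already explains much of their co-occurrence.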

3.2 Skip-Gram Based Approach
Our approach tries to marry the above two approaches, i.e. use both clickstream data and style attributes to understand the user's context. We use our large database of users' sessions as the primary dataset. A session is considered a proxy for the context, and our approach attempts to uncover cohesive contexts from sessions. Each style in the session is encoded into its various attributes (for instance, a women's dress can have attributes like brand, price, fabric, type (skater/bodycon/sweater/maxi etc.), pattern (solid/stripes/polka dots etc.), length, sleeves etc.). We also use the title and description of each style as attributes. The data cleansing on title and description involves stemming, stop-word removal and extracting various nouns and adjectives using a POS tagger. Unlike the typical approach of using a style as a document, we consider each session as a document; the words in the document correspond to the attributes of all styles observed in that session. We then train a word2vec model on this dataset to learn a vector representation of attributes. We aggregate these vector representations to arrive at vector representations of styles, sessions (contexts) and users.
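The session-as-document construction can be sketched as follows; the catalogue, style ids and attribute tokens are hypothetical, and the resulting documents would then be fed to a skip-gram trainer such as gensim's word2vec:

```python
# Hypothetical catalogue: each style id maps to its bag of attribute tokens.
catalogue = {
    "dress_1": ["brand_mango", "dress_shape_maxi", "colour_pink"],
    "dress_2": ["brand_faballey", "dress_shape_maxi", "occasion_party"],
}

def session_to_document(session, catalogue):
    """Flatten one clickstream session into a 'document' whose words are the
    attribute tokens of every style clicked in that session."""
    return [attr for style in session for attr in catalogue[style]]

doc = session_to_document(["dress_1", "dress_2"], catalogue)
print(doc)
```

Attributes of styles browsed together thus share a context window, which is what lets the embeddings pick up session-level co-occurrence rather than only per-style structure.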

As above, we represent each style in the context by its attributes and learn the vector v_{a_k} of each attribute a_k; we can then aggregate the vectors v_{a_k} to get the vector for a session. That is, if session C_i has the attributes {a_x, a_{x+1}, ..., a_y}, then v_{C_i} can be written as:

v_{C_i} = (1 / (y - x + 1)) * Σ_{k=x..y} v_{a_k}

Now, we cluster the session vectors using K-means with cosine distance to form homogeneous groups representing user contexts. Non-coherent sessions need to be removed before clustering. To retain organic/coherent sessions, we define the coherency score c_s of a session as the minimum cosine similarity of any style in the session to the session vector. We prune all sessions with score c_s below a certain threshold. Since each style can be part of multiple sessions, a style can now potentially fall into multiple clusters. The vector for each cluster is the centroid vector of the styles that fall into that cluster.
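The coherency score and pruning step can be sketched as follows, assuming style vectors have already been aggregated as above (the toy vectors and the threshold of 0.9 are illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def session_vector(style_vectors):
    """Session vector as the mean of the style vectors in the session."""
    n = len(style_vectors)
    return [sum(v[k] for v in style_vectors) / n for k in range(len(style_vectors[0]))]

def coherency_score(style_vectors):
    """Minimum cosine similarity of any style in the session to the session vector."""
    sv = session_vector(style_vectors)
    return min(cosine(v, sv) for v in style_vectors)

def prune(sessions, threshold=0.9):
    """Retain only coherent sessions before clustering."""
    return [s for s in sessions if coherency_score(s) >= threshold]

coherent = [[1.0, 0.1], [0.9, 0.2]]  # two similar style vectors
noisy = [[1.0, 0.0], [0.0, 1.0]]     # orthogonal style vectors
print(prune([coherent, noisy]) == [coherent])  # only the coherent session survives
```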

We leveraged this knowledge, in the form of vectors, in our internal personalisation system. We personalise the list page for our users based on their clickstream data: each user sees a different sort order of products. Fig 2 shows this hyper-personalisation in action.


4. RESULTS
4.1 Model Training
4.1.1 Data
The dataset used for training our models (for a specific article type, e.g. T-shirts, Jeans, Dresses, Sports Shoes etc.) has ~2M sessions (filtered based on the article type) from ~1M unique users. Each session is composed of 5 styles on average. Our typical catalogue for an article type has ~50K styles, each of which is represented by various attributes. A typical attribute set (key-value pairs) for an article type has a cardinality of 1000. Moreover, as described in Fig 1, the sessions are often noisy. Fig 3 below shows the distribution of the coherency score over a set of ~2M sessions involving T-shirts.

Fig 2. Two individuals seeing different rank-ordered styles on the same page, based on their preferences.

Fig 3. Distribution of the number of sessions over coherency scores

4.1.2 Training
Each model is trained separately per article type. For the item-item graph, we consider the set of ~2M sessions and create the graph as described above. Graph creation also involves removing edges whose nodes have a co-occurrence frequency < 10.

The word2vec (skip-gram) model (considering the session as a document) and the doc2vec model (considering each style as a document) were trained using gensim [13, 14], keeping the window size at 20 and the dimensionality (vector length) at 200. The window size is large enough to capture the entire context in our sessions, since each session has 5 styles on average and each style has 8 attributes on average.

4.2 Similarity
4.2.1 Attribute Based Similarity
We consider attributes of various products and find attributes similar to them in the semantic space, as shown in Table 1. The similarity is established based on the vectors learnt using the word2vec model described in Section 3.2.

Table 1. Attributes similar to a query attribute, with cosine similarity

Query attribute: brand_replay (Jeans)
Similar attributes: mrp_greater_than_10000Rs (0.67), brand_scotch_&_soda (0.52), brand_gas (0.51), brand_calvin_klein_jeans (0.46), brand_diesel (0.44), brand_guess (0.43), brand_antony_morato (0.36), closure_button_fly (0.35), brand_superdry (0.34), brand_883_police (0.34), feature_turned_up_cuff (0.31)

Query attribute: pattern_solid/plain (Casual Shirts)
Similar attributes: fabric_cotton (0.82), long_sleeves (0.82), spread_collar (0.80), slim_fit (0.79), colour_blue (0.77), regular_fit (0.72), mrp_1500Rs_2000Rs (0.70)

Query attribute: dress_shape_maxi (Dresses)
Similar attributes: dress_length_maxi (0.94), occasion_party (0.53), brand_athena (0.53), sleeveless (0.50), shoulder_straps (0.50), v-neck (0.49), brand_eavan (0.48), brand_suhi_designer (0.48), brand_mysin (0.45)

Query attribute: cleats_fixed (Sports Shoes)
Similar attributes: sport_football (0.85), stud (0.72), fastening_asymmetric_laceup (0.72), width_narrow (0.66), multi_coloured (0.61), colour_orange (0.60), brand_diadora (0.50), upper_material_pu/synthetic (0.44), flouroscent_green (0.34), brand_puma (0.23)

4.2.2 Style Similarity
We consider a particular style and find various styles similar to it (as shown in Table 2), based on (i) the skip-gram based approach, which uses both user sessions and style attributes, and (ii) the doc2vec approach, which uses the style as a document and its attributes as words.

Table 2. Styles similar to a query style, with cosine similarity

Query style: (Brooks Brothers)
Skip-gram based similarity using context: GANT (0.88), Brooks Brothers (0.86), Ferrari (0.80), Nautica (0.80), Scotch & Soda (0.797), Tommy Hilfiger (0.791)
Doc2vec similarity using style as a document: Pepe Jeans (0.74), Scotch & Soda (0.69), Nike (0.68), Nautica (0.68), Duke (0.68), United Colors of Benetton (0.67)

Query style: (Antony Morato)
Skip-gram based similarity using context: Antony Morato (0.98), Superdry (0.975), Tommy Hilfiger (0.952), Sisley (0.951), Being Human (0.942), French Connection (0.942)
Doc2vec similarity using style as a document: Locomotive (0.68), Jack & Jones (0.64), Slub (0.63), John Pride (0.63), SPYKAR (0.63), GAS (0.63)

We validated the above results based on feedback from our stylists. As can be clearly seen, the presence of context in the skip-gram based approach improved the results by surfacing brands quite similar to the query style's brand. For the 'Brooks Brothers T-shirt' query, most T-shirts in the skip-gram results have long sleeves, premium price points and stripes, whereas the doc2vec results also include a few printed T-shirts spanning multiple price points. Similarly, for the 'Antony Morato Jeans' query, most jeans in the skip-gram results are super-skinny fit, slightly washed and distressed, whereas the doc2vec results have varied fits and wash/torn intensity.

4.2.3 Session Based Similarity
We consider a particular user session and find various sessions similar to it, based on the session vectors from word2vec. Consider the query session involving superhero T-shirts shown in Fig 4(a). Figs 4(b) and 4(c) demonstrate two sessions similar to it.

4.3 Style Having Fuzzy Membership Into Multiple Contexts
We now demonstrate the concept of a style falling into multiple contexts. To form product clusters, we prune our dataset of 2M sessions using a coherency score threshold of 0.9, which reduces the dataset to ~200K sessions. This reduced and cleansed dataset was used to form 100 product groups. We used CLUTO [15] for the clustering. Fig 5 illustrates the distribution of styles across multiple clusters. On average, a style belongs to 2.67 clusters.

Fig 4(a). Query session: Basics, Avengers, Joker, Batman, Batman, Batman T-shirts

Fig 4(b). Similar Session 1 (similarity = 0.988): Superman, DC Comics, Batman, Batman, Batman, Superman, Batman, Batman, Avengers, Batman, Batman, Batman

Fig 4(c). Similar Session 2 (similarity = 0.987): Superman, Batman, Batman, Joker, Batman, Joker

Fig 5. Distribution of number of styles into multiple clusters

Consider the following G-STAR RAW T-shirt as a query.

It has membership in two contexts, as described in Table 3(a) below. We describe the contexts based on their centroids' similarity to our attribute set A. The given style is at a distance of 0.08 and 0.1 from Context I and Context II respectively.
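Fuzzy membership can be read off from the cosine distances between a style vector and the cluster centroids, for example by admitting every context whose centroid lies within a distance threshold. A minimal sketch; the centroids, vectors and threshold are made up (the real system uses the 100 CLUTO cluster centroids):

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def fuzzy_memberships(style_vec, centroids, threshold=0.2):
    """Every context whose centroid is within a cosine-distance threshold of
    the style; a style can therefore belong to several contexts at once."""
    return sorted(name for name, c in centroids.items()
                  if cosine_distance(style_vec, c) <= threshold)

# Hypothetical 2-d centroids: the style sits close to two of the three contexts.
centroids = {
    "context_1": [0.9, 0.4],
    "context_2": [0.8, 0.5],
    "context_3": [0.0, 1.0],
}
print(fuzzy_memberships([1.0, 0.0], centroids))  # ['context_1', 'context_2']
```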

Table 3(a): Membership of Style (T-shirt) Into Multiple Contexts

Context I: brand_replay, pattern_printed, occasion_casual, neck_round_neck, fit_regular, brand_guess, print_coverage_chest_print, fabric_type_single_jersey, fabric_cotton, brand_g-star_raw, sleeve_length_short_sleeves, brand_calvin_klein_jeans, brand_scotch_&_soda, print_and_pattern_type_typography_-_brand_/_logo, print_and_pattern_type_graphic_-_others, print_coverage_full_front, brand_harley-davidson®, brand_desigual, colour_off_white, brand_antony_morato, brand_883_police

Context II: brand_replay, brand_g-star_raw, fabric_type_single_jersey, brand_scotch_&_soda, brand_harley-davidson®, brand_calvin_klein_jeans, mrp_band_3000-4000, fit_regular, mrp_band_4000-6000, brand_guess, occasion_casual, brand_antony_morato, brand_brooks_brothers, fabric_cotton, mrp_band_2000-3000, brand_gas, brand_desigual, brand_superdry, brand_selected, print_coverage_chest_print, brand_emporio_armani, print_coverage_placement_print, brand_gant, brand_883_police, brand_ferrari, brand_h.e._by_mango, pattern_self_design

A few differences between the contexts are:

• Context I has a high affinity towards round-neck T-shirts, whereas Context II has both round-neck and V-neck T-shirts.
• Context II is more coherent in terms of MRP bands.
• All T-shirts in Context I are short-sleeved, while Context II has both long and short sleeves.
• The brand distributions differ between the two contexts.

As another example, consider the following Maxi Dress by Mango

It also has membership in two contexts, as described in Table 3(b) below. The given style is at a distance of 0.09 and 0.12 from Context I and Context II respectively.

Table 3(b): Membership of Style (Dress) Into Multiple Contexts

Context I: pattern_solid, neck_round_neck, fabric_polyester, neck_v-neck, fabric_viscose, dress_length_maxi, colour_pink, dress_shape_maxi, occasion_party, colour_white, pattern_printed, mrp_band_3000-4000, colour_beige, neck_boat_neck, fabric_cotton, brand_eavan, pattern_self_design, sleeve_length_long_sleeves, desc_beaded, mrp_band_4000-6000, dress_shape_tailored_dress, brand_vero_moda, neck_scoop_neck, brand_athena, brand_only, brand_dressberry, brand_faballey, brand_mango, pattern_striped, colour_cream, colour_purple, mrp_band_6000-8000, brand_harpa, neck_halter_neck

Context II: occasion_casual, sleeve_length_sleeveless, fabric_polyester, pattern_solid, neck_round_neck, colour_black, neck_v-neck, mrp_band_1500-2000, colour_pink, pattern_printed, dress_length_maxi, dress_shape_maxi, mrp_band_1000-1500, occasion_party, sleeve_length_shoulder_straps, brand_eavan, colour_off_white, pattern_self_design, neck_scoop_neck, brand_athena, colour_yellow, mrp_band_3000-4000, sleeve_length_long_sleeves, brand_dressberry, dress_shape_sheath, dress_length_below_knee, desc_beaded, brand_harpa, dress_shape_tailored_dress, brand_faballey, colour_cream, dress_shape_skater, brand_belle_fille, mrp_band_8000-10000, brand_and_by_anita_dongre, brand_mysin, neck_deep_neck, brand_only, desc_ribbed, name_polka, mrp_band_4000-6000

Even though both contexts are mainly focused on maxi dresses, there are a few differences between them:

• Context I leans towards the 'party' occasion, while Context II has both 'party' and 'casual' dresses (including a few polka-dot dresses that are absent from Context I).
• Context I is more coherent in terms of MRP bands (all high price points).
• The contexts differ in their brand, colour, neck-type and dress-shape distributions; for example, there are no deep-necked dresses in Context I.

4.4 Prediction of User's Context
We further experimented with predicting the user's context, i.e. predicting which product group the user is likely to browse in the current session. For this, we consider the first n − 1 styles of a user's session as the current context and try to predict the context of the n-th style. The fundamental assumption in this experiment is that there is no context switch in the user's session. A context switch is indeed captured by the coherency score of the session (based only on the first n − 1 styles); hence, we report our accuracy metric as a function of session coherency. Moreover, through this experiment we try to capture the semantic cohesiveness in the session, rather than focusing on syntactic relationships and sequence modelling.

4.4.1 Preparation of Dataset
We considered a set of 200K coherent sessions (with coherency score >= 0.9) to form 100 product clusters for the article type T-shirts. In practice, user sessions are often quite noisy (non-coherent), as described in Fig 1 and Fig 3, and also involve context switches.

As a test set, we fetched 30K unseen sessions with a uniform distribution of coherency scores. All of these sessions had at least 8 clicks (to reveal enough context).

4.4.2 Approach
We consider the first n − 1 styles of the user's session and calculate the current context vector as the centroid of those n − 1 style vectors. We then find the contexts most similar to the resulting vector. For a positive (+1) outcome, the actual context of the n-th style should be among the top 3 predicted contexts (out of 100 possible contexts). We use the top 3 contexts because of the membership distribution described above in Fig 5, i.e. each style on average has membership in 2.67 contexts. For each coherency score (rounded off), we calculate the average accuracy, as shown in Fig 6. Instead of the accuracy metric based on the top 3 predicted contexts, we can also use the NDCG metric [12] to measure how relevant the predicted context sort order is. Both metrics behave similarly: the most coherent sessions have accuracy close to 95% and an NDCG score close to 0.95.
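The top-3 prediction step can be sketched as follows; the style vectors and context centroids are hypothetical (the real system ranks 100 cluster centroids in a 200-dimensional space):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def predict_contexts(style_vectors, context_centroids, top_k=3):
    """Rank contexts by cosine similarity to the centroid of the first n-1 styles."""
    dim = len(style_vectors[0])
    centroid = [sum(v[k] for v in style_vectors) / len(style_vectors)
                for k in range(dim)]
    ranked = sorted(context_centroids,
                    key=lambda c: cosine(centroid, context_centroids[c]),
                    reverse=True)
    return ranked[:top_k]

def hit(actual_context, predicted):
    """Positive (+1) outcome if the actual context of the n-th style is in the top k."""
    return 1 if actual_context in predicted else 0

# Hypothetical 2-d vectors: the first n-1 styles point towards "context_a".
first_styles = [[0.9, 0.1], [1.0, 0.2], [0.8, 0.0]]
centroids = {"context_a": [1.0, 0.1], "context_b": [0.0, 1.0], "context_c": [0.5, 0.5]}
predicted = predict_contexts(first_styles, centroids)
print(predicted[0], hit("context_a", predicted))  # context_a 1
```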


Fig 6. Accuracy curve for prediction of the user's context based on the nearest-neighbour approach

4.5 Complex Queries In Vector Space
We also experimented with other complex queries, specifically inclusion/exclusion filters. Consider the query "Nike Sports Shoes with upper material as mesh". In vector space, this essentially amounts to finding styles similar to "brand_nike + upper_material_mesh". The corresponding output is shown in Fig 7(a).

Fig 7(a). Styles similar to resultant vector of “brand_nike + upper_material_mesh”

Now consider that we want to exclude black shoes. Hence the query in vector space would be “brand_nike + upper_material_mesh - colour_black”. The similar styles are shown in Fig 7(b).

Fig 7(b). Styles similar to resultant vector of “brand_nike + upper_material_mesh – colour_black”
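Such inclusion/exclusion filters reduce to vector arithmetic over the attribute embeddings. A toy sketch with made-up 3-dimensional vectors and style names (the real attribute vectors come from the trained word2vec model):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical attribute and style vectors in a tiny 3-d embedding space.
attrs = {"brand_nike": [1.0, 0.0, 0.0],
         "upper_material_mesh": [0.0, 1.0, 0.0],
         "colour_black": [0.0, 0.0, 1.0]}
styles = {"nike_mesh_black": [1.0, 1.0, 0.5],
          "nike_mesh_white": [0.2, 1.0, 0.0],
          "puma_leather_black": [0.1, 0.0, 1.0]}

def query(include, exclude=()):
    """Inclusion/exclusion filter as vector arithmetic: add the included
    attribute vectors, subtract the excluded ones, then rank styles by cosine."""
    dim = len(next(iter(attrs.values())))
    q = [sum(attrs[a][k] for a in include) - sum(attrs[a][k] for a in exclude)
         for k in range(dim)]
    return max(styles, key=lambda s: cosine(q, styles[s]))

print(query(["brand_nike", "upper_material_mesh"]))                    # nike_mesh_black
print(query(["brand_nike", "upper_material_mesh"], ["colour_black"]))  # nike_mesh_white
```

Subtracting colour_black pushes the query vector away from black styles, so the exclusion changes the top result without any explicit filtering logic.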

5. CONCLUSIONS
We showed how skip-gram modelling can be used to project all attributes, styles and sessions into a high-dimensional vector space, using both the style attributes and explicit users' browsing signals. This enabled us to form more cohesive contexts. The projection into vector space allows a style to be part of multiple contexts and provides a better notion of similarity. We demonstrated how we can leverage these vectors to predict the current context of the user, and how word embeddings can be used to power search based on inclusion/exclusion filters.

6. ACKNOWLEDGMENTS
We would like to thank Debdoot Mukherjee, Ashay Tamhane, Ullas Nambiar and Kunal Sachdeva for their contributions in reviewing and drafting the paper, and for providing various inputs on algorithm design and evaluation.

7. REFERENCES
[1] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
[2] David M. Blei, Andrew Y. Ng, Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research.
[3] Quoc Le, Tomas Mikolov. Distributed Representations of Sentences and Documents.
[4] Jeffrey Pennington, Richard Socher, Christopher D. Manning. GloVe: Global Vectors for Word Representation. EMNLP 2014.
[5] G. Adomavicius, A. Tuzhilin. 2005. Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge & Data Engineering, 17(6), 734-7.
[6] Yang Hu, Xi Yi, Larry S. Davis. Collaborative Fashion Recommendation: A Functional Tensor Factorization Approach. Proceedings of the 23rd ACM International Conference on Multimedia.
[7] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, N. Sundaresan. Large Scale Visual Recommendations from Street Fashion Images. KDD 2014.
[8] Deepak Agarwal, Bee-Chung Chen. fLDA: Matrix Factorization through Latent Dirichlet Allocation. WSDM '10: Proceedings of the Third ACM International Conference on Web Search and Data Mining.
[9] Dimitrios Vogiatzis, Dimitrios Pierrakos, Georgios Paliouras, Sue Jenkyn-Jones. Exploiting Knowledge about Fashion to Provide Personalised Clothing Recommendations.
[10] http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
[11] http://developers.lyst.com/2014/11/11/word-embeddings-for-fashion/
[12] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Tie-Yan Liu, Wei Chen. A Theoretical Analysis of NDCG Type Ranking Measures.
[13] Radim Rehurek, Petr Sojka. Software Framework for Topic Modelling with Large Corpora.
[14] https://radimrehurek.com/gensim/index.html
[15] http://glaros.dtc.umn.edu/gkhome/cluto/cluto