The current project aims to build a recommendation model for cosmetic products.
More specifically, the current project aims to:
The reasons for choosing this approach are as follows:
Since the input data already exists in the form of a list of ingredients, a separate tokenization process is not required.
Therefore, the focus is on researching different methods to convert a text sequence to a vector (Vectorization).
Word vector aggregation is a popular baseline approach when vectors for the individual words are available or easily obtainable.
'Averaging' and 'Max-pooling' are the two most common aggregation operations.
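A minimal sketch of both aggregation operations, assuming pretrained word vectors are already available as a Python dict (the ingredient tokens and random stand-in vectors are placeholders for illustration):

```python
import numpy as np

# Hypothetical pretrained word vectors (e.g. loaded from GloVe/word2vec);
# random vectors stand in for real embeddings here.
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=4) for w in ["aqua", "glycerin", "niacinamide"]}

def aggregate(tokens, vectors, mode="mean"):
    """Aggregate per-word vectors into a single text vector."""
    mat = np.stack([vectors[t] for t in tokens if t in vectors])
    if mode == "mean":   # averaging
        return mat.mean(axis=0)
    if mode == "max":    # max-pooling (element-wise maximum over words)
        return mat.max(axis=0)
    raise ValueError(mode)

ingredients = ["aqua", "glycerin", "niacinamide"]
print(aggregate(ingredients, word_vectors, "mean"))
print(aggregate(ingredients, word_vectors, "max"))
```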
Topic modeling obtains a hidden vector whose dimensions each represent a topic.
Example of Topic Modeling
A vector: ["dinosaurs", "amusement parks", "extinction"]
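Below is a minimal sketch of topic modeling using scikit-learn's LatentDirichletAllocation; the toy corpus, the number of topics, and the stop-word setting are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus; in this project each "document" would be an ingredient list.
docs = [
    "dinosaurs roamed before their extinction",
    "amusement parks feature dinosaur themed rides",
    "the extinction event wiped out the dinosaurs",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is a document's hidden vector: one weight per topic.
doc_topic = lda.fit_transform(counts)
print(doc_topic)   # shape: (n_docs, n_topics)
```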
Recurrent models utilise the innate ability of recurrent neural networks to process ordered sequences.
They are composed of an encoder and a decoder: the encoder accumulates the meaning of the sequence, and its final internal state is used as the embedding.
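A minimal PyTorch sketch of such an encoder, assuming a GRU and arbitrary embedding/hidden sizes; the final hidden state serves as the sequence embedding:

```python
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    """Encodes a token-id sequence; the final hidden state is the embedding."""
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        _, h_n = self.gru(self.embed(token_ids))   # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)                      # (batch, hidden_dim)

encoder = GRUEncoder()
batch = torch.randint(0, 1000, (2, 7))   # 2 sequences of 7 tokens
print(encoder(batch).shape)              # torch.Size([2, 64])
```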
The BOW (bag-of-words) technique vectorizes a text by using one dimension per word, where the value represents the weight of the word in the text. It disregards the order and the syntax of the text.
Example of BOW
(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.
The vocabulary for the above sentences, after removing stopwords (e.g. "to"), would be:
V ={John, likes, watch, movies, Mary, too, also, football, games}
Count vectors (raw term frequencies):
(1) [1,2,1,2,1,1,0,0,0]
(2) [0,1,1,0,1,0,1,1,1]
Binary vectors:
(1) [1,1,1,1,1,1,0,0,0]
(2) [0,1,1,0,1,0,1,1,1]
- Indicates only the presence of each word with 1/0
- Treats all words as equally relevant
TF (counts normalised by sentence length):
(1) [1,2,1,2,1,1,0,0,0]
(2) [0,1,1,0,1,0,1,1,1]
(1) [1/8, 2/8, 1/8, 2/8, 1/8, 1/8, 0/8, 0/8, 0/8]
→ [0.12, 0.25, 0.12, 0.25, 0.12, 0.12, 0.00, 0.00, 0.00]
(2) [0/6, 1/6, 1/6, 0/6, 1/6, 0/6, 1/6, 1/6, 1/6]
→ [0.00, 0.16, 0.16, 0.00, 0.16, 0.00, 0.16, 0.16, 0.16]
Example of IDF values, computed as ln(N/df) with N = 2 documents
[ln(2/1), ln(2/2), ln(2/2), ln(2/1), ln(2/2), ln(2/1), ln(2/1), ln(2/1), ln(2/1)]
→ [0.69, 0.0, 0.0, 0.69, 0.0, 0.69, 0.69, 0.69, 0.69]
TF-IDF:
(1) [0.12*0.69, 0.25*0.00, 0.12*0.00, 0.25*0.69, 0.12*0.00, 0.12*0.69, 0.00*0.69, 0.00*0.69, 0.00*0.69]
→[0.08, 0.00, 0.00, 0.17, 0.00, 0.08, 0.00, 0.00, 0.00]
(2) [0.00*0.69, 0.16*0.00, 0.16*0.00, 0.00*0.69, 0.16*0.00, 0.00*0.69, 0.16*0.69, 0.16*0.69, 0.16*0.69]
→[0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.11, 0.11, 0.11]
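The worked example above can be reproduced with the short script below; the only stopword is "to", TF is the count divided by sentence length, and IDF is ln(N/df) as in the example. Small differences arise because the text rounds TF before multiplying.

```python
import math
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]
stopwords = {"to"}
vocab = ["John", "likes", "watch", "movies", "Mary", "too", "also", "football", "games"]

def tokenize(text):
    # Strip trailing periods and drop stopwords.
    words = [w.strip(".") for w in text.split()]
    return [w for w in words if w not in stopwords]

tokens = [tokenize(d) for d in docs]
counts = [Counter(t) for t in tokens]

# Term frequency: count divided by the number of words in the sentence.
tf = [[c[w] / len(t) for w in vocab] for c, t in zip(counts, tokens)]

# Inverse document frequency: ln(N / df).
N = len(docs)
idf = [math.log(N / sum(w in c for c in counts)) for w in vocab]

tfidf = [[round(t * i, 2) for t, i in zip(row, idf)] for row in tf]
print(tfidf[0])   # [0.09, 0.0, 0.0, 0.17, 0.0, 0.09, 0.0, 0.0, 0.0]
print(tfidf[1])   # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.12, 0.12, 0.12]
```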
Bag of n-grams is an extended model that generates a vector from a text using combinations of words that appear in a specific order (n-grams) instead of single words.
→ using larger n-grams results in extensively lengthened vectors (as shown in the sketch below)
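A minimal scikit-learn sketch of this effect, using CountVectorizer on the two example sentences (the exact vocabulary sizes depend on the default tokenizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)   # unigrams + bigrams

# Larger n-grams inflate the vocabulary, and therefore the vector length.
print(len(unigrams.vocabulary_))   # 10 distinct unigrams here
print(len(bigrams.vocabulary_))    # 22 uni+bi-grams here
```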
Doc2Vec uses a special token D at the beginning of the text which represents the whole sequence, in accordance with the distributional hypothesis:
"words that occur in the same contexts tend to have similar meanings".
BERT produces a vector for every token in the sequence, as well as a vector that represents the whole sequence via the [CLS] token.
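A minimal Hugging Face transformers sketch; "bert-base-uncased" is just one choice of checkpoint, and the example string is a placeholder ingredient list:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("aqua, glycerin, niacinamide", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state       # one vector per token
cls_vector = outputs.last_hidden_state[:, 0]    # [CLS] vector for the whole sequence
print(token_vectors.shape, cls_vector.shape)
```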
SBERT is a variant of BERT which specialises in the efficient comparison of sentences: sentence pairs are passed through siamese BERT networks with a pooling layer, so that the resulting sentence embeddings can be compared directly (e.g. with cosine similarity).
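A minimal sentence-transformers sketch; "all-MiniLM-L6-v2" is one publicly available SBERT-style checkpoint, chosen here only for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "aqua, glycerin, niacinamide",
    "water, glycerine, vitamin b3",
])

# Sentence embeddings can be compared directly, e.g. with cosine similarity.
print(util.cos_sim(embeddings[0], embeddings[1]))
```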
InferSent is a sentence embedding method that provides semantic sentence representations.
It consists of training neural network encoders of different architectures, such as GRUs, LSTMs, and BiLSTMs, on the Stanford Natural Language Inference (SNLI) task.
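The released InferSent models require separately downloaded weights, so below is only a minimal PyTorch sketch of the BiLSTM-with-max-pooling encoder that InferSent popularised; all sizes are placeholders:

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """BiLSTM encoder with max-pooling over time, as used in InferSent-style models."""
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        out, _ = self.lstm(self.embed(token_ids))   # (batch, seq, 2 * hidden_dim)
        return out.max(dim=1).values                # element-wise max over time steps

encoder = BiLSTMMaxEncoder()
print(encoder(torch.randint(0, 1000, (2, 7))).shape)   # torch.Size([2, 128])
```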
Universal Sentence Encoder includes the two model variants below, trained with multi-task learning, for sentence representation learning:
DAN
→ 1. Average the input embeddings for the words and bi-grams
→ 2. Pass the averaged embedding through a feed-forward NN (see the sketch below)
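The other USE variant is a Transformer encoder. Below is a minimal PyTorch sketch of just the DAN steps listed above (bi-gram embeddings are omitted for brevity, and all sizes are placeholders):

```python
import torch
import torch.nn as nn

class DAN(nn.Module):
    """Deep Averaging Network: average input embeddings, then a feed-forward net."""
    def __init__(self, vocab_size=1000, emb_dim=32, hidden_dim=64, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, token_ids):
        avg = self.embed(token_ids).mean(dim=1)   # 1. average the input embeddings
        return self.ff(avg)                       # 2. pass through a feed-forward NN

dan = DAN()
print(dan(torch.randint(0, 1000, (2, 7))).shape)   # torch.Size([2, 128])
```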