
The process of analyzing text data to extract meaningful information
The process of converting text data into numerical data
Definition : Bag of Words (BoW) represents a document as a vector based on the frequency of the words that appear in it, ignoring word order
Key Concepts
Procedure
Vocabulary: ['I', 'love', 'sports', 'is', 'fun', 'enjoy', 'and', 'data']
['I', 'love', 'sports'] -> [1, 1, 1, 0, 0, 0, 0, 0]
['Sports', 'is', 'fun'] -> [0, 0, 1, 1, 1, 0, 0, 0]
['I', 'enjoy', 'sports', 'and', 'data'] -> [1, 0, 1, 0, 0, 1, 1, 1]
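The procedure can be sketched in plain Python. This is a minimal illustration, not a reference implementation: whitespace tokenization and lowercasing are simplifying assumptions, and the helper names `tokenize`, `build_vocabulary`, and `to_bow_vector` are made up for this example.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace (a simplifying assumption)."""
    return text.lower().split()

def build_vocabulary(documents):
    """Collect the unique tokens in order of first appearance."""
    vocabulary = []
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def to_bow_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    tokens = tokenize(document)
    return [tokens.count(word) for word in vocabulary]

documents = ["I love sports", "Sports is fun", "I enjoy sports and data"]
vocabulary = build_vocabulary(documents)
vectors = [to_bow_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for doc, vec in zip(documents, vectors):
    print(doc, "->", vec)
```

Each vector has one position per vocabulary word, so every document maps to a vector of the same length regardless of how many words it contains.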
Code Example
```Python
from sklearn.feature_extraction.text import CountVectorizer

# Documents
documents = [
    "I love sports",
    "Sports is fun",
    "I enjoy sports and data",
]

# Create a Bag of Words model.
# By default, CountVectorizer lowercases the text and ignores
# single-character tokens such as "I".
vectorizer = CountVectorizer()

# Convert the documents to BoW vectors by counting word frequencies
X = vectorizer.fit_transform(documents)

# Print the BoW vectors
print(X.toarray())

# Print the vocabulary (one column per word, in alphabetical order)
print(vectorizer.get_feature_names_out())
```
```Python
# Output
[[0 0 0 0 0 1 1]
 [0 0 0 1 1 0 1]
 [1 1 1 0 0 0 1]]
['and' 'data' 'enjoy' 'fun' 'is' 'love' 'sports']
```
Limitation
Definition : TF-IDF (Term Frequency-Inverse Document Frequency) evaluates the importance of a word by combining two criteria: a word is considered important if it appears frequently within a document, but it is treated as less important if it also appears in many documents across the collection
Formula
TF (Term Frequency) : A value that indicates how often a particular word appears in a document
IDF (Inverse Document Frequency) : Measures how rarely a term appears across all documents; the more common a term is, the less informative it is, and the rarer it is, the more important it is considered
-> 1 is added to the denominator to avoid division by zero if a term does not appear in any document
TF-IDF Formula : The TF-IDF weight of a specific word is calculated by multiplying TF and IDF
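Written out, one common textbook form that is consistent with these descriptions is shown below. The notation is an assumption made here for illustration: t is a term, d a document, N the number of documents, f_{t,d} the raw count of t in d, and DF(t) the number of documents containing t. Some variants normalize TF by document length or smooth IDF differently.

```latex
\mathrm{TF}(t, d) = f_{t,d}

\mathrm{IDF}(t) = \log \frac{N}{1 + \mathrm{DF}(t)}

\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
```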
Example
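As a worked example, the following sketch computes TF-IDF by hand using the definitions above: raw counts for TF and log(N / (1 + DF)) for IDF. The function names and the whitespace tokenization are assumptions made for this illustration.

```python
import math

def term_frequency(term, tokens):
    """Raw count of the term in one tokenized document."""
    return tokens.count(term)

def inverse_document_frequency(term, corpus):
    """log(N / (1 + DF)), with 1 added to the denominator as described above."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / (1 + df))

def tf_idf(term, tokens, corpus):
    """TF-IDF weight: the product of TF and IDF."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

corpus = [doc.lower().split() for doc in [
    "I love natural language processing",
    "I love machine learning",
    "machine learning is great",
]]

# "love" appears in 2 of 3 documents: log(3 / (1 + 2)) = 0.0
print(tf_idf("love", corpus[0], corpus))
# "natural" appears in 1 of 3 documents: log(3 / (1 + 1)), about 0.405
print(tf_idf("natural", corpus[0], corpus))
```

The rarer word receives the higher weight. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values differ from this hand calculation.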
Code Example
```Python
from sklearn.feature_extraction.text import TfidfVectorizer

# Document list
documents = [
    "I love natural language processing",
    "I love machine learning",
    "machine learning is great",
]

# Calculate TF-IDF (default settings: smoothed IDF, L2-normalized rows;
# single-character tokens such as "I" are ignored)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Output results
print("Word list:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:")
print(tfidf_matrix.toarray())

# Output (matrix values rounded to 4 decimal places)
# Word list: ['great' 'is' 'language' 'learning' 'love' 'machine' 'natural' 'processing']
# TF-IDF matrix:
# [[0.     0.     0.5286 0.     0.4020 0.     0.5286 0.5286]
#  [0.     0.     0.     0.5774 0.5774 0.5774 0.     0.    ]
#  [0.5628 0.5628 0.     0.4280 0.     0.4280 0.     0.    ]]
```
A method to represent words as vectors in a high-dimensional space, reflecting the semantic relationships between words.
Word2Vec : Converts words in text into fixed-size real number vectors, learning vectors that reflect semantic similarity between words.
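In its skip-gram form, Word2Vec is commonly trained by maximizing the average log-probability of the context words given each target word, with the probability defined by a softmax over the vocabulary. The equations below are a reconstruction of that standard formulation using the symbols defined in the list that follows; the exact notation is an assumption, not necessarily the author's original.

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

p(w_0 \mid w_1) = \frac{\exp\left(v_{w_0}^{\top} v_{w_1}\right)}{\sum_{w=1}^{V} \exp\left(v_{w}^{\top} v_{w_1}\right)}
```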
- Here, T is the total number of words in the text, and c is the context window size.
- w_0: Output word (context word)
- w_1: Input word (target word)
- v_{w_0}, v_{w_1}: Vectors of the words w_0 and w_1, respectively
- V: The size of the vocabulary

GloVe : A method that learns word vectors using the co-occurrence probability between words.
Global Co-occurrence Matrix: A matrix that records how often word pairs appear together in a given corpus, which is used to learn the relationship between words.
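A co-occurrence matrix of this kind can be sketched with a sliding window. This is a simplified illustration that stores plain counts in a dictionary keyed by (word, context word); the function name `cooccurrence_counts`, the whitespace tokenization, and the window size are assumptions for the example.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair occurs within `window` tokens."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["I love machine learning", "I love natural language processing"]
X = cooccurrence_counts(corpus, window=1)

print(X[("i", "love")])        # 2: adjacent in both sentences
print(X[("love", "machine")])  # 1: adjacent in the first sentence only
```

Because every pair is counted from both sides, the resulting counts are symmetric: `X[(a, b)]` equals `X[(b, a)]`.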
Loss Function: Designed to reflect the relationship between words on a logarithmic scale, so that word pairs with a large co-occurrence frequency X_{ij} have similar vectors.
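The standard GloVe objective can be written as a weighted least-squares loss over all word pairs. The form below is a reconstruction using the symbols defined in the list that follows; the weighting function f, which limits the influence of very frequent pairs, and the vocabulary size V are part of the usual formulation and are assumed here rather than taken from the text.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```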
- w_i, w_j are the vectors for words i and j
- \tilde{w}_i, \tilde{w}_j are the context vectors for those words
- X_{ij} is the co-occurrence frequency between the two words
- b_i, \tilde{b}_j are the bias terms for the word and context vectors, respectively
FastText : An extension of Word2Vec that learns vectors for character n-grams by breaking each word into subword units.
For example, the word "apple" is broken into app, ppl, ple in 3-gram form.
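The n-gram split can be sketched in a few lines, using "apple" as the example word. The helper name `char_ngrams` and its `boundaries` flag are hypothetical. The flag mimics the `<` and `>` word-boundary markers that FastText adds before extracting n-grams, which is why the library's actual n-gram set also includes pieces such as `<ap` and `le>`.

```python
def char_ngrams(word, n=3, boundaries=False):
    """Return the character n-grams of a word, from left to right."""
    if boundaries:
        word = f"<{word}>"  # FastText-style word boundary markers
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("apple"))                   # ['app', 'ppl', 'ple']
print(char_ngrams("apple", boundaries=True))  # ['<ap', 'app', 'ppl', 'ple', 'le>']
```

Because a word vector is built from its n-gram vectors, words that were not seen during training can still receive a vector from the n-grams they share with known words.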
Unsupervised learning technique used to automatically discover the topics within a large collection of documents
- Each document is composed of a mixture of topics.
- Each topic is composed of a probability distribution over words.
- The meaning of a word is determined by its context -> Words that appear in similar contexts are likely to have similar meanings.
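The first two assumptions can be illustrated by running them forward as a generative process: pick a topic for each word position according to the document's topic mixture, then pick a word from that topic's distribution. This is a toy sketch in the spirit of models such as LDA, not a topic modeling algorithm; the topics, words, and probabilities are all hypothetical.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "tech": {"data": 0.5, "model": 0.3, "code": 0.2},
}

def generate_document(topic_mixture, length, rng):
    """Draw a topic for each word position, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(topic_mixture), weights=list(topic_mixture.values()))[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
document = generate_document({"sports": 0.7, "tech": 0.3}, length=10, rng=rng)
print(document)
```

Topic modeling works in the opposite direction: given only the documents, it infers the topic-word distributions and each document's topic mixture.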