Text Mining

been_29 · September 25, 2024

The Korea Economic Daily with Toss Bank MLOps course

💡 Text Mining

The process of analyzing text data to extract meaningful information

🎨 Feature Extraction

The process of converting text data into numerical data

Bag of Words (BoW)

Definition : Representing a document as a vector based on the frequency of words that appear in the document
Key Concepts
- Vocabulary : A set of unique words from the given text is created, and this list of words is called a vocabulary
- Vectorization : Counting the frequency of words appearing in a document and converting it into a vector
Procedure
1. Creating a Vocabulary from the Document Set : Extract unique words from the given documents to create a word dictionary
  - For example, assuming the following three documents:
    Document 1: "I love sports"
    Document 2: "Sports is fun"
    Document 3: "I enjoy sports and data"
  - In this case, treating "Sports" and "sports" as the same word, the vocabulary would be ['I', 'love', 'sports', 'is', 'fun', 'enjoy', 'and', 'data']
2. Calculating Word Frequency : For each document, count how many times the words in the vocabulary appear, and represent each document as a vector -> each vector becomes the BoW representation of the document
  - The above documents represented in BoW would be:
    Document 1: ['I', 'love', 'sports'] → [1, 1, 1, 0, 0, 0, 0, 0]
    Document 2: ['Sports', 'is', 'fun'] → [0, 0, 1, 1, 1, 0, 0, 0]
    Document 3: ['I', 'enjoy', 'sports', 'and', 'data'] → [1, 0, 1, 0, 0, 1, 1, 1]
Formula
- In BoW, the vector representation for a document $d$ is denoted as $v_d$
- Each element of this vector is the frequency $f(t,d)$ of word $t$ in document $d$, that is, how often $t$ appears in $d$
  $v_d = [f(t_1,d), f(t_2,d), \ldots, f(t_n,d)]$
- Here, $n$ is the size of the vocabulary
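
To make the procedure and formula concrete, here is a minimal from-scratch sketch that rebuilds the vocabulary and the three BoW vectors from the walkthrough. The helper names (`tokenize`, `build_vocabulary`, `bow_vector`) are made up for illustration, and the tokenizer is assumed to be a plain lowercase-and-split, so "I" shows up as `i`.

```python
from collections import Counter

documents = [
    "I love sports",
    "Sports is fun",
    "I enjoy sports and data",
]

def tokenize(text):
    # Assumption: lowercase, then split on whitespace (a deliberately simple tokenizer)
    return text.lower().split()

def build_vocabulary(docs):
    # Keep unique words in order of first appearance, as in the walkthrough
    vocab, seen = [], set()
    for doc in docs:
        for token in tokenize(doc):
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab

def bow_vector(doc, vocab):
    # v_d = [f(t_1, d), ..., f(t_n, d)]
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

vocabulary = build_vocabulary(documents)
vectors = [bow_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Because every document is projected onto the same vocabulary, all vectors have the same length and can be stacked into a document-term matrix.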

Code Example

```python
from sklearn.feature_extraction.text import CountVectorizer

# Documents
documents = [
    "I love sports",
    "Sports is fun",
    "I enjoy sports and data",
]

# Create a Bag of Words model
# Note: by default, CountVectorizer lowercases the text and keeps only tokens of
# two or more characters, so the single-letter word "I" is dropped
vectorizer = CountVectorizer()

# Convert documents to BoW by counting the word frequencies in each document
X = vectorizer.fit_transform(documents)

# Print the BoW vectors
print(X.toarray())

# Print the vocabulary (sorted alphabetically)
print(vectorizer.get_feature_names_out())
```
```python
# Output
[[0 0 0 0 0 1 1]
 [0 0 0 1 1 0 1]
 [1 1 1 0 0 0 1]]
['and' 'data' 'enjoy' 'fun' 'is' 'love' 'sports']
```

Limitation
- BoW ignores the context or the order of words, so it cannot distinguish between different meanings of the same word used in different contexts
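- Raw counts also give very common words as much weight as informative ones. Methods like TF-IDF address this second issue (they still ignore word order) -> TF-IDF considers not only the frequency of a word within a document but also how many other documents contain it when calculating its importance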

TF-IDF (Term Frequency-Inverse Document Frequency)

Definition : A method that evaluates the importance of a word by combining two criteria: a word is considered important if it appears frequently in a document, but if it appears too often across all documents, it is treated as less important
Formula
- TF (Term Frequency) : A value that indicates how often a particular word appears in a document
  $\text{TF}(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d}$
- IDF (Inverse Document Frequency) : Measures how rarely a term appears across all documents; the more common a term is, the less informative it is, and the rarer it is, the more important it is considered
  
  $\text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t + 1} \right)$
  -> 1 is added to the denominator to avoid division by zero if a term does not appear in any document
- TF-IDF Formula : The TF-IDF weight of a specific word is calculated by multiplying TF and IDF
  $\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$
Example
- Assume we have the following three documents:
  - Document 1: "I love natural language processing"
  - Document 2: "I love machine learning"
  - Document 3: "machine learning is great"
- Calculate TF : Calculate how frequently each word appears in each document
  - In Document 1, "love" appears once, and there are 5 total words -> TF("love", Document 1) = 1/5 = 0.2
  - In Document 2, "machine" appears once, and there are 4 total words -> TF("machine", Document 2) = 1/4 = 0.25
  - In Document 3, "learning" appears once, and there are 4 total words -> TF("learning", Document 3) = 1/4 = 0.25
- Calculate IDF : Calculate how rarely a term appears across all documents
  - For readability, this example uses the unsmoothed base-10 form $\log_{10}\left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)$; with the extra 1 in the denominator, a term found in 2 of 3 documents would get $\log(3/3) = 0$
  - "love" appears in Document 1 and Document 2 -> IDF("love") = $\log_{10}\left( \frac{3}{2} \right) \approx 0.176$
  - "machine" appears in Document 2 and Document 3 -> IDF("machine") = $\log_{10}\left( \frac{3}{2} \right) \approx 0.176$
  - "learning" appears in Document 2 and Document 3 -> IDF("learning") = $\log_{10}\left( \frac{3}{2} \right) \approx 0.176$
- Calculate TF-IDF
  - TF-IDF("love", Document 1) = 0.2 $\times$ 0.176 $\approx$ 0.035
  - TF-IDF("machine", Document 2) = 0.25 $\times$ 0.176 $\approx$ 0.044
  - TF-IDF("learning", Document 3) = 0.25 $\times$ 0.176 $\approx$ 0.044
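
The hand calculation can be checked with a few lines of plain Python. This is only a sketch under stated assumptions: whitespace tokenization, and the unsmoothed base-10 IDF $\log_{10}(N / \text{df})$ rather than the form with 1 added to the denominator (which would give $\log(3/3) = 0$ for a term found in 2 of 3 documents). The function names `tf`, `idf`, and `tf_idf` are illustrative. Counting tokens directly gives TF("love", Document 1) = 1/5.

```python
import math

documents = [
    "I love natural language processing",
    "I love machine learning",
    "machine learning is great",
]
tokenized = [doc.lower().split() for doc in documents]

def tf(term, tokens):
    # Occurrences of the term divided by the document length
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Assumption: unsmoothed base-10 IDF; the term must occur in at least one document
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

for term, doc_index in [("love", 0), ("machine", 1), ("learning", 2)]:
    score = tf_idf(term, tokenized[doc_index], tokenized)
    print(f"TF-IDF({term!r}, Document {doc_index + 1}) = {score:.4f}")
```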

Code Example

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Document list
documents = [
    "I love natural language processing",
    "I love machine learning",
    "machine learning is great",
]

# Calculate TF-IDF
# Note: with default settings, TfidfVectorizer drops one-character tokens such as "I",
# uses the smoothed IDF ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row,
# so the values differ from a hand calculation with the plain formula
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Output results
print("Word list:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:")
print(tfidf_matrix.toarray())

# Output (values rounded to four decimal places)
# Word list: ['great' 'is' 'language' 'learning' 'love' 'machine' 'natural' 'processing']
# TF-IDF matrix:
# [[0.     0.     0.5286 0.     0.402  0.     0.5286 0.5286]
#  [0.     0.     0.     0.5774 0.5774 0.5774 0.     0.    ]
#  [0.5628 0.5628 0.     0.428  0.     0.428  0.     0.    ]]
```
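
scikit-learn's numbers do not follow the plain TF $\times$ IDF formula, because the default settings use raw counts as TF, a smoothed IDF of $\ln\frac{1 + N}{1 + \text{df}} + 1$, and L2 normalization of each row. The sketch below re-implements those three steps in plain Python to show where the values come from. The regular expression is an assumption that approximates the default tokenizer (lowercase, tokens of two or more characters), and `smoothed_idf` and `tfidf_row` are illustrative names.

```python
import math
import re
from collections import Counter

documents = [
    "I love natural language processing",
    "I love machine learning",
    "machine learning is great",
]

# Assumption: approximate the default tokenizer (lowercase, tokens of 2+ word characters)
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)

def smoothed_idf(term):
    # ln((1 + N) / (1 + df)) + 1
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw count times smoothed IDF, then scale the row to unit L2 norm
    counts = Counter(tokens)
    raw = [counts[term] * smoothed_idf(term) for term in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(tokens) for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print([round(x, 4) for x in row])
```

In Document 2 every remaining word ("love", "machine", "learning") appears in exactly two documents, so all three get the same weight, $1/\sqrt{3} \approx 0.5774$, after normalization.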

Word Embeddings

A method to represent words as vectors in a high-dimensional space, reflecting the semantic relationships between words.
Word2Vec : Converts words in text into fixed-size real number vectors, learning vectors that reflect semantic similarity between words.
- Skip-Gram : A method that predicts Context Words from a Target Word.
  - The model learns by probabilistically predicting which words will appear around a given target word.
  - The goal is to predict the surrounding words $w_{t+j}$ ($-c \leq j \leq c$, $j \neq 0$) from the target word $w_t$ by maximizing the log-likelihood:
    $\mathcal{L} = \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} \mid w_t)$
    - Here, $T$ is the total number of words in the text, and $c$ is the context window size.
  - This probability is calculated using the softmax function.
    $P(w_O \mid w_I) = \frac{\exp(\mathbf{v}_{w_O} \cdot \mathbf{v}_{w_I})}{\sum_{w=1}^{V} \exp(\mathbf{v}_w \cdot \mathbf{v}_{w_I})}$
    - $w_O$: Output word (context word)
    - $w_I$: Input word (target word)
    - $\mathbf{v}_{w_O}$, $\mathbf{v}_{w_I}$: Vectors of the words $w_O$ and $w_I$, respectively
    - $V$: The size of the vocabulary
  - This softmax calculation is done by computing the dot product between word vectors.
- CBOW (Continuous Bag of Words) Model : Opposite of Skip-Gram, it predicts the target word from surrounding words.
  - The goal of the CBOW model is to maximize $P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})$ .
  - The model calculates the probability of the target word given the context words.
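
The difference between the two models is easiest to see in the training examples they produce. The sketch below is illustrative only: `skipgram_pairs`, `cbow_examples`, and `softmax_prob` are made-up helper names, the word vectors are random and untrained, and a single vector per word is used exactly as in the softmax formula above (real implementations typically keep separate input and output vectors and avoid the full softmax for speed).

```python
import math
import random

def skipgram_pairs(tokens, window):
    # (target, context) pairs for every offset j in [-window, window], j != 0
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs

def cbow_examples(tokens, window):
    # (context words, target) examples: the reverse direction of Skip-Gram
    examples = []
    for t, target in enumerate(tokens):
        context = [tokens[t + j] for j in range(-window, window + 1)
                   if j != 0 and 0 <= t + j < len(tokens)]
        examples.append((context, target))
    return examples

def softmax_prob(context, target, vectors):
    # P(w_O | w_I) = exp(v_O . v_I) / sum over w of exp(v_w . v_I)
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = {w: math.exp(dot(v, vectors[target])) for w, v in vectors.items()}
    return scores[context] / sum(scores.values())

tokens = "i love machine learning".split()
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
print(pairs)
print(examples)

# Hypothetical, untrained 3-dimensional vectors, just to exercise the formula
rng = random.Random(0)
vectors = {w: [rng.uniform(-0.5, 0.5) for _ in range(3)] for w in sorted(set(tokens))}
print(softmax_prob("machine", "love", vectors))
```

Training adjusts the vectors so that the probabilities of the observed pairs go up; the probabilities over the whole vocabulary always sum to 1 for a given input word.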
GloVe : A method that learns word vectors using the co-occurrence probability between words.
- Global Co-occurrence Matrix: A matrix that records how often word pairs appear together in a given corpus, which is used to learn the relationship between words.
- Loss Function: Designed so that the dot product of a word vector and a context vector approximates the logarithm of the pair's co-occurrence count $X_{ij}$. The squared error for a single pair $(i, j)$ is:
  
  $f(w_i, \tilde{w}_j) = (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}))^2$
  - $w_i$ is the word vector for word $i$
  - $\tilde{w}_j$ is the context vector for word $j$
  - $X_{ij}$ is the co-occurrence frequency between the two words
  - $b_i$, $\tilde{b}_j$ are the bias terms for the word and context vectors, respectively
  - The full GloVe objective sums this term over all word pairs with $X_{ij} > 0$, weighting each pair by a function of $X_{ij}$ that limits the influence of very frequent pairs
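
As a rough illustration of the two ingredients, the sketch below counts co-occurrences within a fixed window and evaluates the squared-error term for one pair. It assumes plain, unweighted counts and a symmetric window, and the names `cooccurrence_counts` and `pair_loss` are hypothetical; it is not the reference GloVe implementation and leaves out the weighting and the training loop.

```python
import math
from collections import defaultdict

def cooccurrence_counts(corpus, window):
    # X[(i, j)]: how often word j occurs within `window` positions of word i
    counts = defaultdict(float)
    for tokens in corpus:
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[(word, tokens[ctx_pos])] += 1.0
    return counts

def pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    # (w_i . w~_j + b_i + b~_j - log X_ij)^2, defined only for X_ij > 0
    dot = sum(a * b for a, b in zip(w_i, w_tilde_j))
    return (dot + b_i + b_tilde_j - math.log(x_ij)) ** 2

corpus = [
    "i love machine learning".split(),
    "machine learning is great".split(),
]
X = cooccurrence_counts(corpus, window=1)
print(X[("machine", "learning")])

# Hypothetical vectors and biases for one pair
loss = pair_loss([0.1, 0.2], [0.3, -0.1], 0.0, 0.0, X[("machine", "learning")])
print(loss)
```

The loss for a pair is zero exactly when the dot product plus the two biases equals $\log X_{ij}$, which is the target the vectors are trained toward.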
FastText : An extended version of Word2Vec, learning by breaking words into n-grams.
- Subword Representation: Words are split into character n-grams for learning.
  - For example, the word "apple" can be split into app, ppl, ple in 3-gram form.
  - These subwords are learned, so even if a word never appears in the training data, its vector can be built from the subwords it shares with other words.
- Word2Vec assigns a vector only to each word, while FastText splits words into multiple subwords, assigns a vector to each subword, and then combines them to create the word vector.
- Therefore, FastText can handle rare words or newly coined words more effectively.
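
A short sketch of the subword idea: split a word into character n-grams, then build the word vector by summing the vectors of its n-grams. The function names and the tiny `subword_vectors` table are hypothetical. The `boundary` option reflects that FastText wraps words in `<` and `>` before splitting, so prefixes and suffixes get their own n-grams; the actual library also mixes several n-gram lengths, which is left out here.

```python
def char_ngrams(word, n, boundary=False):
    # Optionally wrap the word in boundary markers before slicing
    token = f"<{word}>" if boundary else word
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def word_vector(word, subword_vectors, n=3):
    # Sum the vectors of the n-grams that are known; unknown n-grams are skipped
    dim = len(next(iter(subword_vectors.values())))
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        for k, value in enumerate(subword_vectors.get(gram, [0.0] * dim)):
            total[k] += value
    return total

print(char_ngrams("apple", 3))                 # ['app', 'ppl', 'ple']
print(char_ngrams("apple", 3, boundary=True))  # ['<ap', 'app', 'ppl', 'ple', 'le>']

# Hypothetical 2-dimensional subword vectors learned from the word "apple"
subword_vectors = {"app": [1.0, 0.0], "ppl": [0.0, 1.0], "ple": [1.0, 1.0]}
print(word_vector("apple", subword_vectors))   # [2.0, 2.0]
print(word_vector("apply", subword_vectors))   # [1.0, 1.0]
```

"apply" was never seen as a whole word, yet it still gets a non-zero vector because it shares the n-grams app and ppl with "apple".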

🎨 Topic Modeling

Unsupervised learning technique used to automatically discover the topics within a large collection of documents

Key Concepts

Topic: A probability distribution over words. Within a given topic, certain words appear with higher probability than others.
Document: A probability distribution of topics.

LDA (Latent Dirichlet Allocation)

Definition: A probabilistic generative model that learns latent variables to represent the topic distribution of documents and the word distribution of each topic.
Basic idea:
- Each document is composed of a mixture of topics.
- Each topic is composed of a probabilistic distribution of words.
Mathematical expression:
- Document's topic distribution $\theta_d$ : The probability distribution of topics appearing in document $d$ .
- Topic's word distribution $\phi_k$ : The probability distribution of words appearing in topic $k$ .
- Topic assignment of words in a document $z_{dn}$ : Indicates which topic the $n$-th word in document $d$ is assigned to.
- Word generation $w_{dn}$ : The $n$-th word in document $d$ is sampled from the word distribution $\phi_{z_{dn}}$ of the topic $z_{dn}$ assigned to the word.
LDA can be described mathematically as:
- For each document $d$ : Topic distribution $\theta_d \sim \text{Dir}(\alpha)$
- For each topic $k$ : Word distribution $\phi_k \sim \text{Dir}(\beta)$
- For each word $n$ in document $d$ :
  - Topic assignment $z_{dn} \sim \text{Multinomial}(\theta_d)$ (Choose a topic from the topic distribution of the document)
  - Word generation $w_{dn} \sim \text{Multinomial}(\phi_{z_{dn}})$ (Generate a word from the word distribution of the selected topic)
- Here, $\alpha$ and $\beta$ are hyperparameters influencing the topic distribution of documents and the word distribution of topics, respectively.
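
The generative story can be simulated directly. The sketch below only generates a synthetic document by following the steps above; it does not perform inference (recovering $\theta$ and $\phi$ from real documents), which is what fitting LDA actually involves. The vocabulary, the hyperparameter values, and the function names are made up for illustration, and the Dirichlet samples are drawn by normalizing Gamma variates.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet sample is a set of independent Gamma draws divided by their sum
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, phi, alpha, vocab, rng):
    theta = sample_dirichlet(alpha, rng)  # theta_d ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(phi)), weights=theta)[0]  # z_dn ~ Multinomial(theta_d)
        w = rng.choices(vocab, weights=phi[z])[0]           # w_dn ~ Multinomial(phi_{z_dn})
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["sports", "game", "team", "data", "model", "learning"]  # hypothetical vocabulary
n_topics = 2
alpha = [0.5] * n_topics    # hypothetical hyperparameter for document-topic distributions
beta = [0.5] * len(vocab)   # hypothetical hyperparameter for topic-word distributions

phi = [sample_dirichlet(beta, rng) for _ in range(n_topics)]  # phi_k ~ Dir(beta)
theta, words = generate_document(8, phi, alpha, vocab, rng)

print(theta)
print(words)
```

Smaller values of $\alpha$ tend to produce documents dominated by a single topic, while larger values spread each document more evenly across topics; $\beta$ plays the same role for words within a topic.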

LSA (Latent Semantic Analysis)

Definition: A dimensionality reduction technique used to extract latent semantic structures from text data.
- Primarily uses a linear algebra technique called SVD (Singular Value Decomposition) to transform the document-term matrix into a lower-dimensional space to learn the latent semantic relationships between documents and words.
Basic assumptions:
- The meaning of a word is determined by its context -> Words that appear in similar contexts are likely to have similar meanings.
- By learning the relationship between documents and words in a lower-dimensional space, we can understand how each document and word represents a specific topic.
Mathematical expression:
- Document-term matrix $A$ : Size $m \times n$ , composed of $m$ documents and $n$ words.
  - Each element $A_{ij}$ is the weight of word $j$ in document $i$, either a raw frequency or a TF-IDF value.
- Decompose the document-term matrix $A$ using SVD:
  $A = U\Sigma V^T$
  - $A$: Document-term matrix of size $m \times n$
  - $U$: Matrix of size $m \times r$ with orthonormal columns (Represents the latent semantic space of documents)
  - $\Sigma$ : Diagonal matrix of size $r \times r$ (Composed of singular values in decreasing order)
  - $V^T$ : Matrix of size $r \times n$ with orthonormal rows (Represents the latent semantic space of words)
  - Here, $r$ is the rank of $A$, so the decomposition is exact.
- Dimensionality reduction: In LSA, only the top $k$ singular values ($k < r$) are kept, together with the first $k$ columns of $U$ and the first $k$ rows of $V^T$, to construct the latent semantic space of documents and words. Here, $k$ is the size of the reduced dimensional space.
  - After dimensionality reduction, the resulting matrix is the rank-$k$ approximation $A_k = U_k\Sigma_k V_k^T \approx A$ , where $U_k$ is $m \times k$ , $\Sigma_k$ is $k \times k$ , and $V_k^T$ is $k \times n$ .
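
The decomposition and truncation can be sketched with NumPy's `np.linalg.svd`. The 4 x 5 count matrix below is hypothetical (two documents about one theme, two about another) and the choice $k = 2$ is arbitrary, so treat this as an illustration rather than a tuned LSA pipeline.

```python
import numpy as np

# Hypothetical document-term matrix: 4 documents (rows) x 5 terms (columns)
A = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Thin SVD: singular values in s are returned in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular values and the matching columns/rows
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
A_k = U_k @ S_k @ Vt_k

# Coordinates of each document in the k-dimensional latent space
doc_coords = U_k @ S_k

print(doc_coords.shape, A_k.shape)
print(np.linalg.norm(A - A_k))
```

The Frobenius norm of $A - A_k$ equals the square root of the sum of the squared discarded singular values, so it measures how much information the truncation throws away.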