Text Mining

been_29 · September 25, 2024

💡 Text Mining

The process of analyzing text data to extract meaningful information


🎨 Feature Extraction

The process of converting text data into numerical data

Bag of Words (BoW)

  • Definition : Representing a document as a vector based on the frequency of words that appear in the document

  • Key Concepts

    • Vocabulary : The set of unique words collected from the given text
    • Vectorization : Counting the frequency of words appearing in a document and converting it into a vector
  • Procedure

    1. Creating a Vocabulary from the Document Set : Extract unique words from the given documents to create a word dictionary
      • For example, assuming the following three documents:
        Document 1: "I love sports"
        Document 2: "Sports is fun"
        Document 3: "I enjoy sports and data"
      • In this case, the vocabulary would be ['I', 'love', 'sports', 'is', 'fun', 'enjoy', 'and', 'data'] (case is ignored, so "Sports" and "sports" count as the same word)
    2. Calculating Word Frequency : For each document, count how many times the words in the vocabulary appear, and represent each document as a vector -> each vector becomes the BoW representation of the document
      • The above documents represented in BoW would be:
        Document 1: ['I', 'love', 'sports'] → [1, 1, 1, 0, 0, 0, 0, 0]
        Document 2: ['Sports', 'is', 'fun'] → [0, 0, 1, 1, 1, 0, 0, 0]
        Document 3: ['I', 'enjoy', 'sports', 'and', 'data'] → [1, 0, 1, 0, 0, 1, 1, 1]
  • Formula

    • In BoW, the vector representation of a document $d$ is denoted $v_d$
    • Each element of this vector is the frequency $f(t, d)$ of word $t$ in document $d$, that is, how often $t$ appears in $d$
      $$v_d = [f(t_1, d), f(t_2, d), \ldots, f(t_n, d)]$$
    • Here, $t_1, \ldots, t_n$ are the $n$ words of the vocabulary
  • Code Example

    from sklearn.feature_extraction.text import CountVectorizer
    
    # Documents
    documents = [
        "I love sports",
        "Sports is fun",
        "I enjoy sports and data"
    ]
    
    # Create a Bag of Words model.
    # Note: by default CountVectorizer lowercases the text and drops
    # one-character tokens, so "I" is not part of its vocabulary.
    vectorizer = CountVectorizer()
    
    # Convert documents to BoW by counting the word frequencies in each document
    X = vectorizer.fit_transform(documents)
    
    # Print the vocabulary (columns are in alphabetical order)
    print(vectorizer.get_feature_names_out().tolist())
    
    # Print the BoW vectors
    print(X.toarray())
    
    # Output
    # ['and', 'data', 'enjoy', 'fun', 'is', 'love', 'sports']
    # [[0 0 0 0 0 1 1]
    #  [0 0 0 1 1 0 1]
    #  [1 1 1 0 0 0 1]]
  • Limitation

    • BoW ignores context and word order, so it cannot distinguish sentences that use the same words in a different order, nor different senses of the same word
    • BoW also weights every word by its raw count, so very common words dominate the vector. Methods like TF-IDF address this second problem (not word order) -> TF-IDF considers not only the frequency of a word in a document but also how many other documents contain it when calculating its importance
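
The hand-worked BoW procedure above can be sketched in plain Python. This is only an illustrative sketch: the helper names `build_vocabulary` and `bow_vector` are made up for this example, and tokenization is assumed to be lowercase plus whitespace splitting (which keeps the one-letter word "I", unlike CountVectorizer's default). The last lines also illustrate the word-order limitation.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect unique lowercase words in order of first appearance."""
    vocabulary = []
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def bow_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


documents = ["I love sports", "Sports is fun", "I enjoy sports and data"]
vocabulary = build_vocabulary(documents)
vectors = [bow_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)

# Word order is lost: two sentences with opposite meanings get the same vector.
order_vocab = build_vocabulary(["dog bites man"])
same = bow_vector("dog bites man", order_vocab) == bow_vector("man bites dog", order_vocab)
print(same)
```

Keeping the vocabulary in order of first appearance is a choice made here so that the vectors line up with the hand-worked ones; library implementations typically sort the vocabulary instead.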

TF-IDF (Term Frequency-Inverse Document Frequency)

  • Definition : A method that evaluates the importance of a word by combining two criteria: a word is considered important if it appears frequently in a document, but if it appears too often across all documents, it is treated as less important

  • Formula

    • TF (Term Frequency) : A value that indicates how often a particular word appears in a document

      $$\text{TF}(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of terms in document } d}$$
    • IDF (Inverse Document Frequency) : Measures how rarely a term appears across all documents; the more common a term is, the less informative it is, and the rarer it is, the more important it is considered

      $$\text{IDF}(t, D) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t + 1} \right)$$
      -> 1 is added to the denominator to avoid division by zero if a term does not appear in any document
    • TF-IDF Formula : The TF-IDF weight of a specific word is calculated by multiplying TF and IDF

      $$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$
  • Example

    • Assume we have the following three documents:
      • Document 1: "I love natural language processing"
      • Document 2: "I love machine learning"
      • Document 3: "machine learning is great"
    • Calculate TF : Calculate how frequently each word appears in each document
      • In Document 1, "love" appears once, and there are 5 total words -> TF("love", Document 1) = 1/5 = 0.2
      • In Document 2, "machine" appears once, and there are 4 total words -> TF("machine", Document 2) = 1/4 = 0.25
      • In Document 3, "learning" appears once, and there are 4 total words -> TF("learning", Document 3) = 1/4 = 0.25
    • Calculate IDF : Calculate how rarely a term appears across all documents (for simplicity, this example uses a base-10 logarithm and omits the +1 in the denominator)
      • "love" appears in Document 1 and Document 2 -> IDF("love") = $\log_{10}\left( \frac{3}{2} \right) \approx 0.18$
      • "machine" appears in Document 2 and Document 3 -> IDF("machine") = $\log_{10}\left( \frac{3}{2} \right) \approx 0.18$
      • "learning" appears in Document 2 and Document 3 -> IDF("learning") = $\log_{10}\left( \frac{3}{2} \right) \approx 0.18$
    • Calculate TF-IDF
      • TF-IDF("love", Document 1) = 0.2 $\times$ 0.18 = 0.036
      • TF-IDF("machine", Document 2) = 0.25 $\times$ 0.18 = 0.045
      • TF-IDF("learning", Document 3) = 0.25 $\times$ 0.18 = 0.045
  • Code Example

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Document list
    documents = [
       "I love natural language processing",
       "I love machine learning",
       "machine learning is great"
    ]
    
    # Calculate TF-IDF.
    # Note: TfidfVectorizer's defaults differ from the formula above: it drops
    # one-character tokens such as "I", uses a smoothed natural-log IDF, and
    # L2-normalizes each row, so the values do not match the hand calculation.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    
    # Output results (rounded to 3 decimals for readability)
    print("Word list:", vectorizer.get_feature_names_out().tolist())
    print("TF-IDF matrix:")
    print(tfidf_matrix.toarray().round(3))
    
    # Output
    # Word list: ['great', 'is', 'language', 'learning', 'love', 'machine', 'natural', 'processing']
    # TF-IDF matrix:
    # [[0.    0.    0.529 0.    0.402 0.    0.529 0.529]
    #  [0.    0.    0.    0.577 0.577 0.577 0.    0.   ]
    #  [0.563 0.563 0.    0.428 0.    0.428 0.    0.   ]]
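
scikit-learn's TfidfVectorizer does not use the textbook formula from this section: by default it applies a smoothed natural-log IDF and L2-normalizes each row. To follow the formula as written here, a minimal plain-Python sketch could look like the following. The function names and the `smooth` flag are made up for this example, and the base-10 logarithm is an assumption (the formula does not state a base).

```python
import math


def term_frequency(term, document):
    """TF: occurrences of the term divided by the number of words in the document."""
    words = document.lower().split()
    return words.count(term) / len(words)


def inverse_document_frequency(term, documents, smooth=True):
    """IDF with the +1 in the denominator (smooth=True) or without it.

    Assumption: base-10 logarithm. With smooth=False the term must occur
    in at least one document, otherwise this divides by zero.
    """
    containing = sum(1 for doc in documents if term in doc.lower().split())
    denominator = containing + 1 if smooth else containing
    return math.log10(len(documents) / denominator)


def tf_idf(term, document, documents, smooth=True):
    """TF-IDF: product of TF and IDF."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents, smooth)


documents = [
    "I love natural language processing",
    "I love machine learning",
    "machine learning is great",
]

print(term_frequency("love", documents[0]))
print(inverse_document_frequency("love", documents, smooth=False))
print(inverse_document_frequency("love", documents, smooth=True))
print(tf_idf("natural", documents[0], documents))
```

Note that with the +1 in the denominator, a term that appears in two of the three documents gets an IDF of log(3/3) = 0, which is why tiny corpora are often illustrated without the smoothing term.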

Word Embeddings

  • A method to represent words as vectors in a high-dimensional space, reflecting the semantic relationships between words.

  • Word2Vec : Converts words in text into fixed-size real number vectors, learning vectors that reflect semantic similarity between words.

    • Skip-Gram : A method that predicts Context Words from a Target Word.
      • The model learns by probabilistically predicting which words will appear around a given target word.
      • The goal is to predict the surrounding words $w_{t-i}, w_{t+i}$ from the target word $w_t$ by maximizing the log-likelihood
        $$\mathcal{L} = \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\, j \neq 0} \log P(w_{t+j} \mid w_t)$$
        - Here, $T$ is the total number of words in the text, and $c$ is the context window size.
      • This probability is calculated using the softmax function.
        $$P(w_O \mid w_I) = \frac{\exp(\mathbf{v}_{w_O} \cdot \mathbf{v}_{w_I})}{\sum_{w=1}^{V} \exp(\mathbf{v}_w \cdot \mathbf{v}_{w_I})}$$
           - $w_O$: Output word (context word)
           - $w_I$: Input word (target word)
           - $\mathbf{v}_{w_O}, \mathbf{v}_{w_I}$: Vectors of the words $w_O$ and $w_I$, respectively
           - $V$: The size of the vocabulary
      • The score of each candidate word is the dot product between its vector and the target word's vector; the softmax turns these scores into probabilities.
    • CBOW (Continuous Bag of Words) Model : The opposite of Skip-Gram; it predicts the target word from the surrounding words.
      • The goal of the CBOW model is to maximize $P(w_t \mid w_{t-i}, w_{t+i})$.
      • The model calculates the probability of the target word given the context words.
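
To make the Skip-Gram objective concrete, here is a small sketch that builds (target, context) pairs for a window size c and evaluates the softmax probability and the log-likelihood. The function names and the 2-dimensional vectors are hypothetical and hand-picked for illustration; nothing is trained. Using separate input and output vector tables is an assumption following the usual word2vec setup.

```python
import math


def skipgram_pairs(tokens, window):
    """Build (target, context) pairs for every position t and offset j != 0."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs


def softmax_probabilities(target_vector, output_vectors):
    """Softmax over dot products between the target vector and each output vector."""
    scores = [sum(a * b for a, b in zip(vec, target_vector)) for vec in output_vectors]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


tokens = ["i", "love", "sports"]
pairs = skipgram_pairs(tokens, window=1)

# Hypothetical vectors (not learned), one input and one output vector per word.
input_vectors = {"i": [0.1, 0.3], "love": [0.5, -0.2], "sports": [0.4, 0.6]}
output_vectors = {"i": [0.2, 0.1], "love": [-0.3, 0.4], "sports": [0.7, 0.2]}
vocab = list(output_vectors)
output_matrix = [output_vectors[word] for word in vocab]

# P(w_O | w_I = "love") for every word in the vocabulary.
probs = softmax_probabilities(input_vectors["love"], output_matrix)

# Log-likelihood: sum of log P(context | target) over all pairs.
log_likelihood = sum(
    math.log(softmax_probabilities(input_vectors[target], output_matrix)[vocab.index(context)])
    for target, context in pairs
)
print(pairs)
print(dict(zip(vocab, probs)))
print(log_likelihood)
```

CBOW would use the same pairs in the opposite direction: the context vectors are combined (typically averaged) and used to predict the target word.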
  • GloVe : A method that learns word vectors using the co-occurrence probability between words.

    • Global Co-occurrence Matrix: A matrix that records how often word pairs appear together in a given corpus, which is used to learn the relationship between words.

    • Loss Function: Designed so that the dot product of a word vector and a context vector (plus bias terms) approximates the logarithm of their co-occurrence count $X_{ij}$; word pairs that co-occur often therefore end up with a large dot product.

      $$f(w_i, w_j, \tilde{w}_i, \tilde{w}_j) = (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij}))^2$$
       - $w_i, w_j$ are vectors for words $i$ and $j$
       - $\tilde{w}_i, \tilde{w}_j$ are the context vectors for those words
       - $X_{ij}$ is the co-occurrence frequency between the two words
       - $b_i, \tilde{b}_j$ are the bias terms for the word and context vectors, respectively
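
As a rough illustration of the squared-error term above, the sketch below evaluates it for a single word pair. The function name and all numbers are made up for this example. The published GloVe objective additionally multiplies each term by a weighting function of X_ij and sums over all pairs; that part is left out here because the formula above omits it.

```python
import math


def glove_pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij):
    """Squared error between (dot product + biases) and log of the co-occurrence count."""
    dot = sum(a * b for a, b in zip(w_i, w_tilde_j))
    return (dot + b_i + b_tilde_j - math.log(x_ij)) ** 2


# Hypothetical word vector, context vector and biases.
w_i = [1.0, 2.0]
w_tilde_j = [0.5, 0.5]
b_i, b_tilde_j = 0.1, 0.2

# dot + biases = 1.8, close to log(6), so the loss is small for X_ij = 6
# and much larger for X_ij = 100.
loss_close = glove_pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij=6)
loss_far = glove_pair_loss(w_i, w_tilde_j, b_i, b_tilde_j, x_ij=100)
print(loss_close, loss_far)
```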
  • FastText : An extension of Word2Vec that learns from character n-grams of each word.

    • Subword Representation: Words are split into character n-grams for learning.
      • For example, the word "apple" can be split into app, ppl, ple in 3-gram form (the FastText implementation also adds word-boundary markers, giving <ap, app, ppl, ple, le>).
      • Vectors are learned for these subwords, so even if a word never appears in the training data, a vector can be built for it from subwords it shares with other words.
    • Word2Vec assigns one vector to each whole word, while FastText splits words into multiple subwords, assigns a vector to each subword, and then combines them to create the word vector.
    • Therefore, FastText can handle rare words or newly coined words more effectively.
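
The n-gram splitting can be sketched in a few lines of Python. The function name `character_ngrams` and its `boundaries` flag are hypothetical and not part of the FastText library; the sketch only shows how the subwords are formed and why an unseen word still shares subwords with known ones.

```python
def character_ngrams(word, n=3, boundaries=False):
    """Split a word into overlapping character n-grams.

    With boundaries=True the word is wrapped in '<' and '>' first,
    mirroring the boundary markers used by FastText.
    """
    text = f"<{word}>" if boundaries else word
    return [text[i:i + n] for i in range(len(text) - n + 1)]


plain = character_ngrams("apple")
marked = character_ngrams("apple", boundaries=True)

# An unseen word such as "apples" shares most of its n-grams with "apple".
shared = sorted(set(character_ngrams("apple")) & set(character_ngrams("apples")))

print(plain)
print(marked)
print(shared)
```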






🎨 Topic Modeling

Unsupervised learning technique used to automatically discover the topics within a large collection of documents

Key Concepts

  • Topic: A probability distribution over words; words that are characteristic of the topic have a high probability.
  • Document: A probability distribution over topics.

LDA (Latent Dirichlet Allocation)

  • Definition: A probabilistic generative model that learns latent variables to represent the topic distribution of documents and the word distribution of each topic.
  • Basic idea:
    • Each document is composed of a mixture of topics.
    • Each topic is composed of a probabilistic distribution of words.
  • Mathematical expression:
    • Document's topic distribution $\theta_d$: The probability distribution of topics appearing in document $d$.
    • Topic's word distribution $\phi_k$: The probability distribution of words appearing in topic $k$.
    • Topic assignment of words in a document $z_{dn}$: Indicates which topic $z$ the $n$th word in document $d$ is assigned to.
    • Word generation $w_{dn}$: The $n$th word in document $d$ is sampled from the word distribution $\phi_{z_{dn}}$ of the topic $z_{dn}$ assigned to the word.
  • LDA can be described mathematically as:
    • For each document $d$: Topic distribution $\theta_d \sim \text{Dir}(\alpha)$
    • For each topic $k$: Word distribution $\phi_k \sim \text{Dir}(\beta)$
    • For each word $n$ in document $d$:
      • Topic assignment $z_{dn} \sim \text{Multinomial}(\theta_d)$ (Choose a topic from the topic distribution of the document)
      • Word generation $w_{dn} \sim \text{Multinomial}(\phi_{z_{dn}})$ (Generate a word from the word distribution of the selected topic)
    • Here, $\alpha$ and $\beta$ are hyperparameters influencing the topic distribution of documents and the word distribution of topics, respectively.
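
The generative process above can be simulated directly. The following is a minimal sketch using only the standard library; the vocabulary, the number of topics and the hyperparameter values are made up, and symmetric Dirichlet priors are assumed. It only generates a document from the model and does not perform inference (fitting the topics to real data).

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [value / total for value in draws]


def generate_document(num_words, phi, alpha, vocabulary, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, len(phi), rng)  # theta_d ~ Dir(alpha)
    words = []
    for _ in range(num_words):
        z = rng.choices(range(len(phi)), weights=theta)[0]  # z_dn ~ Multinomial(theta_d)
        w = rng.choices(vocabulary, weights=phi[z])[0]      # w_dn ~ Multinomial(phi_{z_dn})
        words.append(w)
    return theta, words


rng = random.Random(0)
vocabulary = ["game", "team", "score", "model", "data", "train"]  # hypothetical
num_topics, alpha, beta = 2, 0.5, 0.5                              # hypothetical

# phi_k ~ Dir(beta): one word distribution per topic.
phi = [sample_dirichlet(beta, len(vocabulary), rng) for _ in range(num_topics)]

theta, words = generate_document(8, phi, alpha, vocabulary, rng)
print(theta)
print(words)
```

Smaller values of alpha and beta tend to produce sparser distributions, so each document concentrates on fewer topics and each topic on fewer words.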

LSA (Latent Semantic Analysis)

  • Definition: A dimensionality reduction technique used to extract latent semantic structures from text data.
    • Primarily uses a linear algebra technique called SVD (Singular Value Decomposition) to transform the document-term matrix into a lower-dimensional space to learn the latent semantic relationships between documents and words.
  • Basic assumptions:
    • The meaning of a word is determined by its context -> Words that appear in similar contexts are likely to have similar meanings.
    • By learning the relationship between documents and words in a lower-dimensional space, we can understand how each document and word represents a specific topic.
  • Mathematical expression:
    • Document-term matrix $A$: Size $m \times n$, composed of $m$ documents and $n$ words.
      • Each element $A_{ij}$ represents the weight of word $j$ in document $i$ (a raw frequency or a TF-IDF value).
    • Decompose the document-term matrix $A$ using SVD:
      $$A = U \Sigma V^T$$
      • $A$: Document-term matrix of size $m \times n$
      • $U$: Matrix of size $m \times k$ with orthonormal columns (Represents the latent semantic space of documents)
      • $\Sigma$: Diagonal matrix of size $k \times k$ (Composed of singular values)
      • $V^T$: Matrix of size $k \times n$ with orthonormal rows (Represents the latent semantic space of words)
      • Here, $k$ is the number of singular values kept, that is, the size of the reduced dimensional space. The decomposition is exact when $k$ equals the rank of $A$.
    • Dimensionality reduction: In LSA, only the top $k$ singular values are used to construct the latent semantic space of documents and words.
      • After dimensionality reduction, the resulting matrix is a rank-$k$ approximation of $A$:
        $$A_k = U_k \Sigma_k V_k^T$$
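
The truncation step can be sketched with NumPy. The small document-term matrix below is hypothetical (4 documents, 5 terms, raw counts), and k = 2 is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical document-term matrix: rows are documents, columns are terms.
A = np.array([
    [2.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 1.0, 2.0],
])

# Thin SVD: A = U @ diag(singular_values) @ Vt, singular values in descending order.
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top k singular values and the matching vectors.
k = 2
U_k = U[:, :k]
S_k = np.diag(singular_values[:k])
Vt_k = Vt[:k, :]

A_k = U_k @ S_k @ Vt_k        # rank-k approximation of A
doc_coords = U_k @ S_k        # documents in the k-dimensional latent space
term_coords = Vt_k.T @ S_k    # terms in the k-dimensional latent space

print(singular_values.round(3))
print(A_k.round(2))
print(doc_coords.round(2))
```

The approximation error (Frobenius norm of A minus A_k) equals the square root of the sum of the squared discarded singular values, which is one way to choose k in practice.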