Text Preprocessing

been_29 · September 23, 2024

💡 Text Preprocessing

The process of converting text data into a format that can be analyzed or processed by a machine learning model


🎨 Corpus

A structured collection of text data that a model can learn from and analyze

Types of Corpus

  • Monolingual Corpus: A collection of text data in a single language
  • Bilingual Corpus: A collection of text data in two languages
  • Multilingual Corpus: A collection of text data in multiple languages
  • Parallel Corpus: A collection of sentences in two or more languages, aligned so that each sentence is paired with its translation
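
To make the difference between these types concrete, here is a minimal sketch of how a monolingual and a parallel Corpus might be held in memory. The sentences and variable names are made up for illustration; real corpora are usually loaded from files.

```python
# Hypothetical toy data, for illustration only.
monolingual_corpus = [
    "I love natural language processing.",
    "Tokenization splits text into units.",
]

# A parallel corpus keeps aligned sentence pairs: (source, target).
parallel_corpus = [
    ("I love NLP.", "J'aime le NLP."),
    ("Good morning.", "Bonjour."),
]

# Alignment means index i on one side corresponds to index i on the other.
english_side = [source for source, _ in parallel_corpus]
french_side = [target for _, target in parallel_corpus]

for source, target in zip(english_side, french_side):
    print(f"{source} -> {target}")
```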

How to use Corpus

  1. Data Collection: Gather text data that fits the subject you want to analyze and build the Corpus
  2. Text Preprocessing: Most corpora go through a cleaning process, such as removing extra whitespace, punctuation, and stopwords, to convert raw text into a structure suitable for analysis
  3. Feature Extraction and Analysis: Extract and analyze features from the cleaned Corpus by performing tasks like Tokenization, stemming, and lemmatization
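
The three steps above can be sketched with the standard library alone. The documents and the tiny stopword set below are assumptions chosen for illustration, and word counting stands in for feature extraction.

```python
import re
from collections import Counter

# 1. Data collection: a hypothetical in-memory corpus of three documents.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
    "Dogs and cats are pets.",
]

# A tiny illustrative stopword set (an assumption, not NLTK's list).
STOPWORDS = {"the", "on", "and", "are"}

def clean(document):
    """2. Preprocessing: lowercase, strip punctuation, drop stopwords."""
    document = re.sub(r"[^\w\s]", "", document.lower())
    return [word for word in document.split() if word not in STOPWORDS]

# 3. Feature extraction: count how often each remaining word appears.
word_counts = Counter(word for doc in corpus for word in clean(doc))
print(word_counts["cat"])  # 2
```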

Usage of Corpus

  • Sentiment Analysis: Use a Corpus of customer reviews or social media data to classify text as positive or negative
  • Topic Modeling: Use algorithms like LDA to analyze text data and extract topics from documents
  • Document Classification: Classify texts such as legal documents or research papers into specific categories, often using a domain-specific Corpus
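
As a rough illustration of the sentiment analysis use case, the sketch below trains a word-count classifier on a labeled Corpus. The source does not name a classifier; multinomial Naive Bayes with add-one smoothing is used here only as a familiar stand-in, and the four reviews are invented.

```python
import math
from collections import Counter

# Hypothetical labeled corpus of reviews (made up for illustration).
labeled_corpus = [
    ("great product love it", "positive"),
    ("really great service", "positive"),
    ("terrible product hate it", "negative"),
    ("awful service really terrible", "negative"),
]

# Count word occurrences and documents per label.
word_counts = {"positive": Counter(), "negative": Counter()}
doc_counts = Counter()
for text, label in labeled_corpus:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}

def classify(text):
    """Return the label with the highest Naive Bayes log-probability."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / len(labeled_corpus))
        for word in text.split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("love this great product"))  # positive
print(classify("terrible awful service"))   # negative
```

With only four training sentences this is a toy; a real Corpus would need far more data and proper preprocessing before the counts mean anything.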






🎨 Text Preprocessing

Converting text data into a format that can be processed by a model

Main Steps

  • Lowercasing : Convert all text to lowercase for consistency

    text = "Apple is looking at buying U.K. startup for $1 billion"
    text = text.lower()
    print(text)
    
    # Output
    # "apple is looking at buying u.k. startup for $1 billion"
  • Stopword Removal: Remove stopwords such as articles, conjunctions, and prepositions that do not add significant meaning

    import nltk
    from nltk.corpus import stopwords
    
    nltk.download('stopwords')  # one-time download of the stopword lists
    
    stop_words = set(stopwords.words('english'))
    tokens = ["I", "am", "learning", "natural", "language", "processing"]
    # NLTK's stopword list is lowercase, so compare the lowercased token
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    print(filtered_tokens)
    
    # Output
    # ['learning', 'natural', 'language', 'processing']
  • Punctuation Removal : Remove punctuation, which can add noise in user-generated content like tweets or reviews

    import re
    text = "I love NLP! It's amazing, isn't it?"
    text = re.sub(r'[^\w\s]', '', text)
    print(text)
    
    # Output
    # "I love NLP Its amazing isnt it"
  • Number Removal : Remove numbers if they are not relevant

    text = "The company earned $1 billion in 2020"
    text = re.sub(r'\d+', '', text)
    print(text)
    
    # Output
    # "The company earned $ billion in "
  • Tokenization : Split text into words, sentences, or sub-word units

    import nltk
    from nltk.tokenize import word_tokenize
    nltk.download('punkt')  # tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
    
    text = "I love NLP and Machine Learning"
    tokens = word_tokenize(text)
    print(tokens)
    
    # Output
    # ['I', 'love', 'NLP', 'and', 'Machine', 'Learning']
  • Stemming : Remove affixes from words to extract the word stem

    from nltk.stem import PorterStemmer
    
    ps = PorterStemmer()
    words = ["running", "ran", "runs", "easily", "fairly"]
    stemmed_words = [ps.stem(word) for word in words]
    print(stemmed_words)
    
    # Output
    # ['run', 'ran', 'run', 'easili', 'fairli']
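  • Lemmatization: Similar to stemming, but returns the dictionary form (lemma) of a word, taking its part of speech into account

    import nltk
    from nltk.stem import WordNetLemmatizer
    
    nltk.download('wordnet')  # one-time download of the WordNet data
    
    lemmatizer = WordNetLemmatizer()
    words = ["better", "running", "cats", "geese"]
    # lemmatize() treats every word as a noun unless a pos argument is passed,
    # which is why "better" and "running" come back unchanged
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    print(lemmatized_words)
    
    # Output
    # ['better', 'running', 'cat', 'goose']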
  • Whitespace Removal : Collapse repeated spaces and strip leading and trailing whitespace

    text = "I    love   NLP   "
    text = " ".join(text.split())
    print(text)
    
    # Output
    # "I love NLP"
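  • Special Character and URL Removal : Remove URLs and similar non-linguistic strings

    import re
    text = "Visit https://www.example.com for more information!"
    text = re.sub(r'http\S+\s*', '', text)  # also drop the whitespace that follows the URL
    print(text)
    
    # Output
    # "Visit for more information!"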
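
The individual steps above are usually chained into one function. The following is a minimal sketch using only the `re` module; the function name `preprocess` and the order of the steps are assumptions, not something the steps themselves prescribe. URLs are removed first so that the pattern still sees them intact.

```python
import re

def preprocess(text):
    """Apply several cleaning steps in sequence (the order is a design choice)."""
    text = text.lower()                    # lowercasing
    text = re.sub(r"http\S+", " ", text)   # URL removal
    text = re.sub(r"\d+", " ", text)       # number removal
    text = re.sub(r"[^\w\s]", "", text)    # punctuation removal
    return " ".join(text.split())          # whitespace normalization

cleaned = preprocess("Visit https://www.example.com NOW!  We earned $1 billion in 2020.")
print(cleaned)  # visit now we earned billion in
```

Stopword removal, stemming, and lemmatization are left out here because they need token lists and NLTK data; they would normally follow tokenization of the cleaned string.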

Regular Expression

Character         Meaning                                  Description
[expr]            One of                                   Match one character from within [ ]
[expr-expr]       One of, from-to                          Match one character in the range between the first and last character
[^expr]           None of                                  Match any one character except those in [ ]
(expr)            Group                                    Group a pattern for logical separation
|                 Or                                       Match either the pattern on the left or the one on the right
?                 Zero or one                              Match if a character appears 0 or 1 times
+                 At least once                            Match if a character appears 1 or more times
*                 Zero or more (greedy)                    Match if a character appears 0 or more times
{n}, {n,}, {n,m}  n times, at least n times, n to m times  Specify the number of times a pattern must appear
.                 Any character                            Match any single character (except a newline by default)
^, $              Start of line, end of line               Match the start and the end of a string
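
A few of these constructs in action, as a small sketch with Python's `re` module. The sample string is made up for illustration.

```python
import re

text = "color colour colr 2020-09-23 cat cot"

optional = re.findall(r"colou?r", text)           # ?      : 'u' appears 0 or 1 times
one_of = re.findall(r"c[ao]t", text)              # [expr] : one of 'a' or 'o'
counted = re.findall(r"\d{4}-\d{2}-\d{2}", text)  # {n}    : exactly n digits
either = re.findall(r"cat|cot", text)             # |      : either alternative
none_of = re.findall(r"[^a-z\s]+", text)          # [^expr]: anything but a-z or whitespace
starts = bool(re.search(r"^color", text))         # ^      : start of the string
ends = bool(re.search(r"cot$", text))             # $      : end of the string

print(optional)  # ['color', 'colour']
print(one_of)    # ['cat', 'cot']
print(counted)   # ['2020-09-23']
print(either)    # ['cat', 'cot']
print(none_of)   # ['2020-09-23']
print(starts, ends)  # True True
```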






🎨 Tokenization

The process of dividing sentences into units for text data analysis

Token

  • Definition: A small unit into which raw text is divided for analysis, representing the smallest meaningful element within a sentence
  • Characteristics:
    • Independence: Tokens are processed independently from the original text, and each token is considered the smallest meaningful unit in text analysis
    • Form: Each word, subword, or character in a text can be treated as a token. For example, "The quick brown fox" can be split into four tokens at the word level
  • Types:
    • Word-based Token: Each word is treated as a token.
      • Example: "I enjoy reading books." -> ["I", "enjoy", "reading", "books"]
    • Subword-based Token: Splits words into smaller units, allowing rare words to be processed.
      • Example: "unhappiness" -> ["un", "happi", "ness"]
    • Character-based Token: Each character is split into a token.
      • Example: "apple" -> ["a", "p", "p", "l", "e"]
    • Sentence-based Token: Splits the text into tokens by sentence.
      • Example: "The cat sat on the mat. It purred." -> ["The cat sat on the mat.", "It purred."]
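
The same text can be split at different granularities. The sketch below uses only the standard library; the regular expressions are simplifying assumptions and will not handle abbreviations or other edge cases the way a dedicated tokenizer does.

```python
import re

text = "The cat sat. It purred."

# Sentence-level: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Word-level: runs of word characters, punctuation dropped.
word_tokens = re.findall(r"\w+", text)

# Character-level: every character of a single word.
char_tokens = list("cat")

print(sentence_tokens)  # ['The cat sat.', 'It purred.']
print(word_tokens)      # ['The', 'cat', 'sat', 'It', 'purred']
print(char_tokens)      # ['c', 'a', 't']
```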

Tokenization

  • Definition : The process of dividing text into meaningful minimal units in natural language processing

  • Types

    • Word Tokenization : Simple and intuitive, but may encounter issues when handling punctuation, capitalization, and abbreviations

      from nltk.tokenize import word_tokenize
      
      text = "I love machine learning. It's amazing!"
      tokens = word_tokenize(text)
      print(tokens)
      
      
      # Output
      # ['I', 'love', 'machine', 'learning', '.', 'It', "'s", 'amazing', '!']
      
    • Sentence Tokenization : Requires additional rules to handle special cases like abbreviations or capitalization after punctuation

      from nltk.tokenize import sent_tokenize
      
      text = "I love machine learning. It's amazing! How about you?"
      tokens = sent_tokenize(text)
      print(tokens)
      
      
      # Output
      # ['I love machine learning.', "It's amazing!", 'How about you?']
      
    • Subword Tokenization : Effective for handling rare words or neologisms, but some tokens may not have meaning on their own

      import sentencepiece as spm
      
      # Load a pre-trained BPE model (the file name below is a placeholder)
      sp = spm.SentencePieceProcessor()
      sp.load('pretrained_model.model')
      
      text = "machine learning"
      tokens = sp.encode_as_pieces(text)
      print(tokens)
      
      # Output
      # ['โ–machine', 'โ–learning']
    • Character-level Tokenization : Very granular, but context is hard to capture when text is processed one character at a time

      text = "hello"
      tokens = list(text)
      print(tokens)
      
      # Output
      # ['h', 'e', 'l', 'l', 'o']
    • Whitespace Tokenization : Suitable for languages like English, where spaces clearly separate words

      text = "I love natural language processing"
      tokens = text.split()
      print(tokens)
      
      # Output
      # ['I', 'love', 'natural', 'language', 'processing']
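
Subword tokenization can also be illustrated without a pre-trained model. The sketch below uses greedy longest-match against a hand-written vocabulary; this is a simplification chosen for illustration, not how SentencePiece or BPE learn and apply their merges, and both the function name and the vocabulary are hypothetical.

```python
def subword_tokenize(word, vocabulary):
    """Greedy longest-match-first split; falls back to single characters."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocabulary or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens

# Hypothetical hand-written vocabulary; real ones are learned from a corpus.
vocabulary = {"un", "happi", "ness", "token", "ization"}

print(subword_tokenize("unhappiness", vocabulary))   # ['un', 'happi', 'ness']
print(subword_tokenize("tokenization", vocabulary))  # ['token', 'ization']
print(subword_tokenize("unzip", vocabulary))         # ['un', 'z', 'i', 'p']
```

The last call shows why subword methods cope with rare words: anything outside the vocabulary still decomposes into smaller known pieces, down to single characters.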