Text Preprocessing

been_29 · September 23, 2024

Korea Economic Daily with Toss bank MLOps course

💡 Text Preprocessing

The process of converting text data into a format that can be analyzed or processed by a machine learning model

🎨 Corpus

A structured collection of text data that a model can learn from and analyze

Types of Corpus

Monolingual Corpus: A collection of text data in a single language
Bilingual Corpus: A collection of text data in two languages
Multilingual Corpus: A collection of text data in multiple languages
Parallel Corpus: A collection of sentence pairs in different languages, aligned so that each sentence is matched with its translation
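
As a rough illustration of how these corpus types differ in shape, the sketch below stores a monolingual corpus as a list of sentences and a parallel corpus as a list of aligned sentence pairs. The sample sentences and variable names are made up for illustration only.

```python
# Hypothetical toy data; real corpora are far larger and usually stored in files.
monolingual_corpus = [
    "The weather is nice today.",
    "I am learning natural language processing.",
]

# A parallel corpus keeps each source sentence aligned with its translation.
parallel_corpus = [
    ("Hello", "Bonjour"),
    ("Thank you", "Merci"),
]

# Alignment means the i-th source sentence corresponds to the i-th target sentence.
sources = [src for src, _ in parallel_corpus]
targets = [tgt for _, tgt in parallel_corpus]
print(sources)
print(targets)
```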

How to use Corpus

Data Collection: Gather text data that fits the subject you want to analyze and build the Corpus
Text Preprocessing: Most corpora go through a cleaning process, such as removing whitespace, punctuation, and stopwords, to convert raw text into a structure suitable for analysis
Feature Extraction and Analysis: Extract and analyze features from the cleaned Corpus by performing tasks like tokenization, stemming, and lemmatization
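
The three stages above can be sketched with the standard library alone. This is a minimal sketch, assuming an in-memory list of documents stands in for collected data; the `STOPWORDS` set and the helper names `clean` and `bag_of_words` are hypothetical, and word counts are only one of many possible features.

```python
import re
from collections import Counter

# Stage 1: data collection (hypothetical documents standing in for real data)
raw_corpus = [
    "The movie was GREAT!",
    "The plot was great, but the ending was slow.",
]

# A tiny illustrative stopword list; real projects usually take one from a library
STOPWORDS = {"the", "was", "but"}

def clean(text):
    """Stage 2: lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def bag_of_words(text):
    """Stage 3: split on whitespace, drop stopwords, count what remains."""
    return Counter(tok for tok in text.split() if tok not in STOPWORDS)

cleaned = [clean(doc) for doc in raw_corpus]
# Sum the per-document counts into corpus-level word frequencies
total = sum((bag_of_words(doc) for doc in cleaned), Counter())
print(cleaned)
print(total)
```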

Usage of Corpus

Sentiment Analysis: Use a Corpus of customer reviews or social media data to classify text as positive or negative
Topic Modeling: Use algorithms like LDA to analyze text data and extract topics from documents
Document Classification: Classify texts such as legal documents or research papers into specific categories, often using a domain-specific Corpus
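
To make the sentiment analysis use case concrete, here is a deliberately naive sketch that scores a review by counting words from two hand-written word lists. The lists and the function name `classify_sentiment` are hypothetical; a practical system would learn from a labeled Corpus rather than rely on fixed keywords.

```python
# Hypothetical word lists for illustration only
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def classify_sentiment(review):
    """Label a review by comparing counts of positive and negative words."""
    tokens = review.lower().split()
    score = sum(tok in POSITIVE for tok in tokens) - sum(tok in NEGATIVE for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "I love this phone",
    "terrible battery and poor screen",
    "it arrived on monday",
]
labels = [classify_sentiment(r) for r in reviews]
print(labels)
```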

🎨 Text Preprocessing

Converting text data into a format that can be processed by a model

Main Steps

Lowercasing : Convert all text to lowercase for consistency

text = "Apple is looking at buying U.K. startup for $1 billion"
text = text.lower()
print(text)

# Output
# "apple is looking at buying u.k. startup for $1 billion"

Stopword Removal: Remove stopwords such as articles, conjunctions, and prepositions that do not add significant meaning

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
tokens = ["I", "am", "learning", "natural", "language", "processing"]
# NLTK's stopword list is lowercase, so compare the lowercased token
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

# Output
# ['learning', 'natural', 'language', 'processing']

Punctuation Removal : Remove punctuation which may add noise in user-generated content like tweets or reviews

import re
text = "I love NLP! It's amazing, isn't it?"
text = re.sub(r'[^\w\s]', '', text)
print(text)

# Output
# "I love NLP Its amazing isnt it"

Number Removal : Remove numbers if they are not relevant

import re

text = "The company earned $1 billion in 2020"
text = re.sub(r'\d+', '', text)
print(text)

# Output
# "The company earned $ billion in "

Tokenization : Split text into words, sentences, or sub-word units

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

text = "I love NLP and Machine Learning"
tokens = word_tokenize(text)
print(tokens)

# Output
# ['I', 'love', 'NLP', 'and', 'Machine', 'Learning']

Stemming : Remove affixes from words to extract the word stem

from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "ran", "runs", "easily", "fairly"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)

# Output
# ['run', 'ran', 'run', 'easili', 'fairli']

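Lemmatization: Similar to stemming, but maps each word to its dictionary form (lemma) using vocabulary and part of speech rather than simply cutting off affixes. NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed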

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["better", "running", "cats", "geese"]
# Without a pos argument, every word is treated as a noun
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)

# Output
# ['better', 'running', 'cat', 'goose']
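
The difference between stemming and lemmatization can be seen in a toy comparison that needs no external library. This is only a sketch under stated assumptions: `toy_stem` strips a few hard-coded suffixes and is not the Porter algorithm, and `LEMMA_TABLE` is a hypothetical lookup table standing in for a real dictionary such as WordNet.

```python
def toy_stem(word):
    """Strip a few common suffixes without checking that the result is a real word."""
    for suffix in ("ing", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table; a real lemmatizer consults a full dictionary
LEMMA_TABLE = {"geese": "goose", "ran": "run", "cats": "cat"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)

words = ["cats", "geese", "ran", "easily"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['cat', 'geese', 'ran', 'easi']
print(lemmas)  # ['cat', 'goose', 'run', 'easily']
```

Rule-based stripping handles regular forms like "cats" but misses irregular ones like "geese" and can produce non-words like "easi", while the lookup approach returns real words but only for entries it knows.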

Whitespace Removal : Collapse repeated whitespace and trim leading and trailing spaces

text = "I    love   NLP   "
text = " ".join(text.split())
print(text)

# Output
# "I love NLP"

Special Character and URL Removal : Strip URLs and similar non-linguistic strings that rarely help analysis

import re

text = "Visit https://www.example.com for more information!"
# Remove the URL together with the whitespace that follows it
text = re.sub(r'http\S+\s*', '', text)
print(text)

# Output
# "Visit for more information!"
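
The individual steps above are usually chained. The following is a minimal sketch of one possible ordering using only the standard library; the function name `preprocess` and the sample sentence are illustrative assumptions, and the right set of steps depends on the task.

```python
import re

def preprocess(text):
    """Apply several of the steps above in sequence and return a list of tokens."""
    text = text.lower()                    # lowercasing
    text = re.sub(r"http\S+", " ", text)   # URL removal (before punctuation is stripped)
    text = re.sub(r"\d+", " ", text)       # number removal
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation removal
    return text.split()                    # whitespace cleanup and tokenization

sample = "Visit https://www.example.com NOW!  We earned $1 billion in 2020."
tokens = preprocess(sample)
print(tokens)  # ['visit', 'now', 'we', 'earned', 'billion', 'in']
```

Order matters here: removing punctuation first would break the URL into fragments that the URL pattern no longer matches.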

Regular Expression

Character	Meaning	Description
[expr]	One of	Match any one character listed inside [ ]
[expr-expr]	Range	Match any one character in the range from the first to the last character
[^expr]	None of	Match any one character except those listed inside [ ]
(expr)	Group	Group a pattern so it can be treated as a single unit
|	Or	Match either the pattern on the left or the pattern on the right
?	Optional	Match the preceding character 0 or 1 times
+	At least once	Match the preceding character 1 or more times
*	Zero or more	Match the preceding character 0 or more times (greedy by default)
{n}, {n,}, {n,m}	Exactly n, at least n, n to m times	Specify how many times the preceding pattern must repeat
.	Any character	Match any single character
^, $	Start, end	Match the start and the end of a string
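
A few of these metacharacters in action with Python's `re` module. The sample strings are made up for illustration.

```python
import re

# [A-Z] is a character range; [0-9]+ is one or more digits
codes = re.findall(r"[A-Z][0-9]+", "Order A12 shipped, order B7 pending")

# {n} repeats the preceding element exactly n times
dates = re.findall(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", "Posted on 2024-09-23")

# ? makes the preceding character optional
spellings = re.findall(r"colou?r", "color and colour")

# | matches either alternative
animals = re.findall(r"cat|dog", "a cat and a dog")

# . matches any single character
words = re.findall(r"b.t", "bat bit bot")

# [^...] matches anything not listed, so this deletes every non-vowel
vowels_only = re.sub(r"[^aeiou]", "", "regular")

# ^ and $ anchor the match to the start and end of the string
is_lowercase_word = bool(re.search(r"^[a-z]+$", "pending"))

print(codes, dates, spellings, animals, words, vowels_only, is_lowercase_word)
```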

🎨 Tokenization

The process of dividing sentences into units for text data analysis

Token

Definition: A small unit into which raw text is divided for analysis, representing the smallest meaningful element within a sentence
Characteristics:
- Independence: Tokens are processed independently from the original text, and each token is considered the smallest meaningful unit in text analysis
- Form: Each word, subword, or character in a text can be treated as a token. For example, "The quick brown fox" can be split into four tokens at the word level
Types:
- Word-based Token: Each word is treated as a token.
  - Example: "I enjoy reading books." -> ["I", "enjoy", "reading", "books"]
- Subword-based Token: Splits words into smaller units, allowing rare words to be processed.
  - Example: "unhappiness" -> ["un", "happi", "ness"]
- Character-based Token: Each character is split into a token.
  - Example: "apple" -> ["a", "p", "p", "l", "e"]
- Sentence-based Token: Splits the text into tokens by sentence.
  - Example: "The cat sat on the mat. It fell asleep." -> ["The cat sat on the mat.", "It fell asleep."]

Tokenization

Definition : The process of dividing text into meaningful minimal units in natural language processing

Types

Word Tokenization : Simple and intuitive, but may encounter issues when handling punctuation, capitalization, and abbreviations

from nltk.tokenize import word_tokenize

text = "I love machine learning. It's amazing!"
tokens = word_tokenize(text)
print(tokens)


# Output
# ['I', 'love', 'machine', 'learning', '.', 'It', "'s", 'amazing', '!']

Sentence Tokenization : Requires additional rules to handle special cases like abbreviations or capitalization after punctuation

from nltk.tokenize import sent_tokenize

text = "I love machine learning. It's amazing! How about you?"
tokens = sent_tokenize(text)
print(tokens)


# Output
# ['I love machine learning.', "It's amazing!", 'How about you?']
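
To see why the extra rules matter, consider a naive splitter that breaks after every ".", "!" or "?" followed by whitespace. This is a sketch for contrast only; the function name `naive_sent_tokenize` is hypothetical and this is not how `sent_tokenize` works internally.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = naive_sent_tokenize("I love machine learning. It's amazing! How about you?")
tricky = naive_sent_tokenize("Dr. Smith arrived. He was late.")
print(simple)  # ['I love machine learning.', "It's amazing!", 'How about you?']
print(tricky)  # ['Dr.', 'Smith arrived.', 'He was late.']
```

The second result wrongly treats the abbreviation "Dr." as a sentence of its own, which is exactly the kind of case a trained or rule-based sentence tokenizer has to handle.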

Subword Tokenization : Effective for handling rare words or neologisms, but some tokens may not have meaning on their own

import sentencepiece as spm

# Load pre-trained BPE model
sp = spm.SentencePieceProcessor()
sp.load('pretrained_model.model')

text = "machine learning"
tokens = sp.encode_as_pieces(text)
print(tokens)

# Output (will vary with the loaded model's vocabulary)
# ['▁machine', '▁learning']
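
For intuition about how a word gets broken into subword pieces, here is a simplified greedy longest-match segmenter, similar in spirit to how WordPiece segments words. It is a sketch under stated assumptions and not the BPE procedure used by the SentencePiece model above: the vocabulary `vocab` and the function name `subword_tokenize` are hypothetical, and real tokenizers learn their vocabulary from a corpus.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first segmentation.

    Characters not covered by the vocabulary become single-character tokens.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        tokens.append(word[start:end])
        start = end
    return tokens

vocab = {"un", "happi", "ness", "happy"}  # hypothetical toy vocabulary
known = subword_tokenize("unhappiness", vocab)
rare = subword_tokenize("unzip", vocab)
print(known)  # ['un', 'happi', 'ness']
print(rare)   # ['un', 'z', 'i', 'p']
```

Even a word the vocabulary has never seen as a whole can be represented, which is why subword methods cope well with rare words, though pieces such as "happi" carry little meaning on their own.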

Character-level Tokenization : A very granular method, but context is hard to capture when processing at the character level
```
text = "hello"
tokens = list(text)
print(tokens)

# Output
# ['h', 'e', 'l', 'l', 'o']
```

Whitespace Tokenization : Suitable for languages like English, where spaces clearly separate words

text = "I love natural language processing"
tokens = text.split()
print(tokens)

# Output
# ['I', 'love', 'natural', 'language', 'processing']
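
One limitation worth seeing: splitting on whitespace leaves punctuation attached to words. The sketch below contrasts `str.split()` with a simple regular expression that separates punctuation into its own tokens; the pattern is just one possible choice and the sample sentence is made up.

```python
import re

text = "Hello, world! It's 2024."

# Whitespace tokenization keeps punctuation glued to the neighboring word
whitespace_tokens = text.split()

# Runs of word characters, or any single non-space, non-word character
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Hello,', 'world!', "It's", '2024.']
print(regex_tokens)       # ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']
```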
