
Text Preprocessing : The process of converting raw text data into a format that can be analyzed or processed by a machine learning model
Corpus : A structured collection of text data that a model can learn from and analyze
Common techniques for converting text data into a format that can be processed by a model:
Lowercasing : Convert all text to lowercase for consistency
text = "Apple is looking at buying U.K. startup for $1 billion"
text = text.lower()
print(text)
# Output
# "apple is looking at buying u.k. startup for $1 billion"
Stopword Removal: Remove stopwords such as articles, conjunctions, and prepositions that do not add significant meaning
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tokens = ["I", "am", "learning", "natural", "language", "processing"]
# NLTK's stopword list is lowercase, so compare case-insensitively (otherwise "I" is kept)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
# Output
# ['learning', 'natural', 'language', 'processing']
Punctuation Removal : Remove punctuation, which can add noise in user-generated content such as tweets or reviews
import re
text = "I love NLP! It's amazing, isn't it?"
text = re.sub(r'[^\w\s]', '', text)
print(text)
# Output
# "I love NLP Its amazing isnt it"
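As an alternative to the regular expression above, the same cleanup can be sketched with the standard library's `str.translate` and `string.punctuation`. This variant removes ASCII punctuation only, and unlike `[^\w\s]` it also removes the underscore. The helper name `strip_ascii_punctuation` is an assumption made for illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to None.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_ascii_punctuation(text: str) -> str:
    """Delete ASCII punctuation; every other character is kept as-is."""
    return text.translate(_PUNCT_TABLE)

cleaned = strip_ascii_punctuation("I love NLP! It's amazing, isn't it?")
print(cleaned)
# Output
# I love NLP Its amazing isnt it
```

Because apostrophes are punctuation too, contractions such as "It's" collapse to "Its" with either approach.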
Number Removal : Remove numbers if they are not relevant
text = "The company earned $1 billion in 2020"
text = re.sub(r'\d+', '', text)
print(text)
# Output
# "The company earned $ billion in "
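Deleting digits leaves fragments such as a dangling "$" and a trailing space. One possible alternative, sketched here under the assumption that the model only needs to know that a number was present, replaces each number with a placeholder token. The names `NUMBER_PATTERN`, `mask_numbers` and the `<NUM>` token are hypothetical.

```python
import re

# Optional "$", digits with optional thousands separators, optional decimals.
NUMBER_PATTERN = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?")

def mask_numbers(text: str, placeholder: str = "<NUM>") -> str:
    """Replace each number (with an optional leading "$") by a placeholder."""
    return NUMBER_PATTERN.sub(placeholder, text)

masked = mask_numbers("The company earned $1 billion in 2020")
print(masked)
# Output
# The company earned <NUM> billion in <NUM>
```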
Tokenization : Split text into words, sentences, or sub-word units
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # on newer NLTK releases the resource is named 'punkt_tab'
text = "I love NLP and Machine Learning"
tokens = word_tokenize(text)
print(tokens)
# Output
# ['I', 'love', 'NLP', 'and', 'Machine', 'Learning']
Stemming : Remove affixes from words to extract the word stem
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["running", "ran", "runs", "easily", "fairly"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
# Output
# ['run', 'ran', 'run', 'easili', 'fairli']
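Lemmatization : Similar to stemming, but maps each word to its dictionary form (lemma) using a vocabulary and the word's part of speech. NLTK's lemmatize() treats every word as a noun unless a pos argument is given, which is why "better" and "running" are unchanged in the output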
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # the lemmatizer needs the WordNet data
lemmatizer = WordNetLemmatizer()
words = ["better", "running", "cats", "geese"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
# Output
# ['better', 'running', 'cat', 'goose']
Whitespace Removal : Strip leading and trailing whitespace and collapse repeated spaces into one
text = "   I   love   NLP   "
text = " ".join(text.split())
print(text)
# Output
# "I love NLP"
Special Character and URL Removal : Remove URLs (and, where needed, special characters) that add noise
text = "Visit https://www.example.com for more information!"
text = re.sub(r'http\S+', '', text)
print(text)
# Output
# "Visit  for more information!" (the spaces on both sides of the removed URL remain)
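The snippet above removes only the URL. A minimal sketch that also handles special characters and the leftover double space is shown below; it assumes English text where only ASCII letters and digits should survive, and the function name `clean_text` is hypothetical.

```python
import re

def clean_text(text: str) -> str:
    """Remove URLs, then special characters, then surplus whitespace."""
    text = re.sub(r"http\S+", " ", text)         # drop http:// and https:// URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # keep ASCII letters, digits, whitespace
    return " ".join(text.split())                # collapse the gaps left behind

cleaned = clean_text("Visit https://www.example.com for more information! #NLP @home")
print(cleaned)
# Output
# Visit for more information NLP home
```

The order matters in this sketch: stripping special characters first would break the URL into fragments that `http\S+` no longer matches as a whole.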
| Character | Meaning | Description |
|---|---|---|
| [expr] | One of | Match one character from within [ ] |
| [expr-expr] | One of from to | Match a range of characters between the first and last character |
| [^expr] | None of | Match any one character except those in [ ] |
| (expr) | Group | Group a pattern for logical separation |
| \| | Or | Match either the expression on the left or the one on the right |
| ? | Optional | Match if the preceding character appears 0 or 1 times |
| + | At least once | Match if the preceding character appears 1 or more times |
| * | Zero or more (greedy) | Match if the preceding character appears 0 or more times |
| {n}, {n,}, {n,m} | n times, at least n times, n to m times | Specify how many times the preceding pattern must appear |
| . | Any character | Match any single character (except a newline by default) |
| ^, $ | Start of line, end of line | Match the start and the end of a string, respectively |
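The constructs in the table can be tried out with Python's `re` module. The sample string and the `results` dictionary below are made up for illustration; each comment shows the matches that the pattern returns on that string.

```python
import re

text = "color colour colouur cat bat rat 2020 42"

results = {
    "[cb]at":   re.findall(r"[cb]at", text),     # one of: ['cat', 'bat']
    "[a-c]at":  re.findall(r"[a-c]at", text),    # range: ['cat', 'bat']
    "[^cb]at":  re.findall(r"[^cb]at", text),    # none of: ['rat']
    "(c|r)at":  [m.group(0) for m in re.finditer(r"(c|r)at", text)],  # group + or: ['cat', 'rat']
    "colou?r":  re.findall(r"colou?r", text),    # 0 or 1 'u': ['color', 'colour']
    "colou+r":  re.findall(r"colou+r", text),    # 1 or more 'u': ['colour', 'colouur']
    "colou*r":  re.findall(r"colou*r", text),    # 0 or more 'u': ['color', 'colour', 'colouur']
    "[0-9]{4}": re.findall(r"[0-9]{4}", text),   # exactly 4 digits: ['2020']
    "[0-9]{2,}": re.findall(r"[0-9]{2,}", text), # at least 2 digits: ['2020', '42']
    "c.t":      re.findall(r"c.t", text),        # any character in the middle: ['cat']
    "^color":   re.findall(r"^color", text),     # only at the start: ['color']
    "42$":      re.findall(r"42$", text),        # only at the end: ['42']
}

for pattern, matches in results.items():
    print(f"{pattern:10} -> {matches}")
```

`re.finditer` is used for the grouped pattern because `re.findall` returns only the captured group when a pattern contains parentheses.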
The process of dividing sentences into units for text data analysis
Definition : The process of dividing text into meaningful minimal units in natural language processing
Types
Word Tokenization : Simple and intuitive, but may encounter issues when handling punctuation, capitalization, and abbreviations
from nltk.tokenize import word_tokenize
text = "I love machine learning. It's amazing!"
tokens = word_tokenize(text)
print(tokens)
# Output
# ['I', 'love', 'machine', 'learning', '.', 'It', "'s", 'amazing', '!']
Sentence Tokenization : Requires additional rules to handle special cases such as abbreviations or capitalization after punctuation
from nltk.tokenize import sent_tokenize
text = "I love machine learning. It's amazing! How about you?"
tokens = sent_tokenize(text)
print(tokens)
# Output
# ['I love machine learning.', "It's amazing!", 'How about you?']
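To see why those additional rules are needed, here is a deliberately naive splitter written with the standard library only. It is not how `sent_tokenize` works internally; the function name `naive_sentence_split` is hypothetical, and the rule is simply "split after `.`, `!` or `?` when whitespace follows".

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = naive_sentence_split("I love machine learning. It's amazing! How about you?")
print(simple)
# ['I love machine learning.', "It's amazing!", 'How about you?']

# Abbreviations end in a period too, so the naive rule over-splits.
tricky = naive_sentence_split("Dr. Smith works in the U.K. office. He likes NLP.")
print(tricky)
# ['Dr.', 'Smith works in the U.K.', 'office.', 'He likes NLP.']
```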
Subword Tokenization : Effective for handling rare words or neologisms, but some tokens may not have meaning on their own
import sentencepiece as spm
# Load pre-trained BPE model
sp = spm.SentencePieceProcessor()
sp.load('pretrained_model.model')
text = "machine learning"
tokens = sp.encode_as_pieces(text)
print(tokens)
# Output (the exact pieces depend on the vocabulary of the loaded model)
# ['▁machine', '▁learning']
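The SentencePiece example needs a trained model file. To illustrate how a rare word breaks into smaller pieces without one, the sketch below uses greedy longest-match segmentation over a tiny hand-written vocabulary. This is a WordPiece-style lookup, not the BPE algorithm used above, and the vocabulary, the `<unk>` token and the function name are all assumptions made for illustration.

```python
def greedy_subword_tokenize(word: str, vocab: set[str], unk: str = "<unk>") -> list[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

vocab = {"token", "ization", "un", "believ", "able"}  # toy vocabulary
print(greedy_subword_tokenize("tokenization", vocab))  # ['token', 'ization']
print(greedy_subword_tokenize("unbelievable", vocab))  # ['un', 'believ', 'able']
print(greedy_subword_tokenize("xyz", vocab))           # ['<unk>']
```

Pieces such as "ization" or "believ" show the trade-off mentioned above: they help cover unseen words but carry little meaning on their own.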
Character-level Tokenization : A very granular method, but it is hard to capture context when processing one character at a time
text = "hello"
tokens = list(text)
print(tokens)
# Output
# ['h', 'e', 'l', 'l', 'o']
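Character tokens are usually mapped to integer ids before they reach a model. A minimal sketch of that step follows, assuming the vocabulary is built from the text itself; the names `char_to_id` and `id_to_char` are hypothetical.

```python
text = "hello"

# Build a vocabulary from the sorted set of characters seen in the text.
char_to_id = {ch: i for i, ch in enumerate(sorted(set(text)))}
id_to_char = {i: ch for ch, i in char_to_id.items()}

ids = [char_to_id[ch] for ch in text]            # encode
decoded = "".join(id_to_char[i] for i in ids)    # decode back to text

print(char_to_id)  # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
print(ids)         # [1, 0, 2, 2, 3]
print(decoded)     # hello
```

The vocabulary stays tiny (four entries here), while the sequence is as long as the text, which is part of why context is harder to capture at this level.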
Whitespace Tokenization : Suitable for languages like English, where spaces clearly separate words
text = "I love natural language processing"
tokens = text.split()
print(tokens)
# Output
# ['I', 'love', 'natural', 'language', 'processing']
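Whitespace splitting works on the clean sentence above, but punctuation stays attached to words in ordinary text. The sketch below contrasts `split()` with a simple regular-expression alternative; the sample sentence and variable names are made up for illustration.

```python
import re

text = "I love NLP, don't you?"

# Plain split(): punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['I', 'love', 'NLP,', "don't", 'you?']

# Regex alternative: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
# ['I', 'love', 'NLP', ',', 'don', "'", 't', 'you', '?']
```

Neither result matches the `word_tokenize` output shown earlier, which keeps contractions in linguistically motivated pieces such as "'s".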