➡️ Text preprocessing is the practice of preparing text data for NLP through tokenisation, cleaning and normalisation.
➡️ In short: tokenise, clean, and normalise text data so that it is ready for natural language processing.
💡 Useful Python libraries: NLTK, re
➡️ Tokenisation is the practice of splitting a corpus (a collection of written texts) into "tokens" (predetermined units).
➡️ In short: divide the corpus into tokens (= meaningful units).
➡️ Token = Word
"Time is an illusion. Lunchtime double so!"
"Time", "is", "an", "illusion", "Lunchtime", "double", "so"
# punctuation removed
# split based on whitespace
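💡 A minimal sketch of the word-level split above, using only `re` from the standard library. The helper name `simple_word_tokenize` is hypothetical (it is not an NLTK function); it simply replaces punctuation with spaces and then splits on whitespace.

```python
import re

def simple_word_tokenize(text):
    # Hypothetical helper: replace every character that is neither a word
    # character nor whitespace (i.e. punctuation) with a space ...
    cleaned = re.sub(r"[^\w\s]", " ", text)
    # ... then split on runs of whitespace.
    return cleaned.split()

print(simple_word_tokenize("Time is an illusion. Lunchtime double so!"))
# ['Time', 'is', 'an', 'illusion', 'Lunchtime', 'double', 'so']
```

This naive rule works for the sentence above, but it also cuts words at apostrophes and hyphens, which is not always what you want.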
🚨🚨🚨 But what if the input is:
"Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."
🤔 What should we do with the apostrophe (a punctuation mark)?
Options for "Don't":
"Don't" or "Don","t" or "Dont" or "Do","n't"
Options for "Jone's":
"Jone's" or "Jone","s" or "Jone" or "Jones"
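# Note: word_tokenize needs NLTK's Punkt data; if it raises LookupError, run
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab' instead).
from nltk.tokenize import word_tokenize
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import TreebankWordTokenizer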
# word_tokenize
print(word_tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))
➡️ ['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']
# WordPunctTokenizer
print(WordPunctTokenizer().tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))
➡️ ['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr', '.', 'Jone', "'", 's', 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']
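💡 WordPunctTokenizer behaves like a regular-expression tokenizer that keeps runs of word characters and runs of punctuation as separate tokens, which is why the apostrophe ends up as its own token. A rough equivalent with plain `re` is sketched below. The pattern reflects my understanding of NLTK's default and `word_punct_split` is a hypothetical helper name, so treat it as an approximation rather than the library's code.

```python
import re

def word_punct_split(text):
    # \w+       : a run of letters, digits or underscores
    # [^\w\s]+  : a run of punctuation (neither word character nor whitespace)
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_split("Mr. Jone's Orphanage"))
# ['Mr', '.', 'Jone', "'", 's', 'Orphanage']
```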
Penn Treebank tokenisation is one of the standard tokenisation methods. Its main rules:
1. Keep hyphenated words as a single token (e.g. "Back-to-back")
2. Split off clitics from the words that carry them (e.g. "We're", "Doesn't")
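# TreebankWordTokenizer (expects one sentence at a time, so the mid-text period stays attached: 'ideal.')
print(TreebankWordTokenizer().tokenize("Starting a home-based restaurant may be an ideal. it doesn't have a food chain or restaurant of their own."))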
➡️ ['Starting', 'a', 'home-based', 'restaurant', 'may', 'be', 'an', 'ideal.', 'it', 'does', "n't", 'have', 'a', 'food', 'chain', 'or', 'restaurant', 'of', 'their', 'own', '.']
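💡 To make the two rules concrete, here is a small sketch that imitates them with `re`. It is NOT NLTK's implementation: `treebank_like_tokenize` is a hypothetical helper, and the clitic list ('s, 're, 've, 'll, 'd, 'm, n't) is my own assumption of the common cases. Like the real tokenizer, it only splits the sentence-final period.

```python
import re

def treebank_like_tokenize(sentence):
    # Rule 2a: split off the negative clitic, e.g. "doesn't" -> "does n't"
    s = re.sub(r"(\w)(n't)\b", r"\1 \2", sentence, flags=re.IGNORECASE)
    # Rule 2b: split off other common clitics, e.g. "We're" -> "We 're"
    s = re.sub(r"(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", s, flags=re.IGNORECASE)
    # Separate commas and similar marks; hyphens are untouched (rule 1)
    s = re.sub(r"([,;:?!])", r" \1 ", s)
    # Split only the final period of the sentence
    s = re.sub(r"\.\s*$", " .", s)
    return s.split()

print(treebank_like_tokenize("We're going back-to-back, but it doesn't matter."))
# ['We', "'re", 'going', 'back-to-back', ',', 'but', 'it', 'does', "n't", 'matter', '.']
```

For real work, prefer TreebankWordTokenizer itself; the sketch only shows why "home-based" survives intact while "doesn't" is split.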
# text_to_word_sequence (Keras): lowercases, drops punctuation, keeps apostrophes
from tensorflow.keras.preprocessing.text import text_to_word_sequence
print(text_to_word_sequence("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))
➡️ ["don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'mr', "jone's", 'orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
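💡 The Keras result above (lowercased, punctuation dropped, apostrophes kept) can be approximated without TensorFlow. The sketch below assumes the filter set I understand to be Keras's default, which does not include the apostrophe; `keras_like_word_sequence` is a hypothetical helper name, not part of Keras.

```python
# Assumed default filter characters (note: no apostrophe in the set)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def keras_like_word_sequence(text):
    # 1. Lowercase everything
    text = text.lower()
    # 2. Replace each filtered character with a space
    text = text.translate(str.maketrans({ch: " " for ch in FILTERS}))
    # 3. Split on whitespace, dropping empty strings
    return text.split()

print(keras_like_word_sequence("Mr. Jone's Orphanage"))
# ['mr', "jone's", 'orphanage']
```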