Text Preprocessing - Tokenisation 1

양나윤 Alyson · 27 May 2022

NLP


❓ What is text preprocessing?

➡️ Text preprocessing is the practice of preparing text data for NLP through tokenisation, cleaning and normalisation.
💡 Useful Python libraries: NLTK, re

Tokenisation

❓ What is tokenisation?

➡️ Tokenisation is the practice of splitting a corpus (a collection of written texts) into "tokens" (predetermined, meaningful units).

Word Tokenisation

➡️ Token = Word

INPUT:

"Time is an illusion. Lunchtime doubly so!"

OUTPUT:

"Time", "is", "an", "illusion", "Lunchtime", "doubly", "so"

# punctuation removed
# split based on whitespace
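
The naive approach above (remove punctuation, then split on whitespace) can be sketched in a few lines of plain Python. The function name `simple_word_tokenize` is made up for this illustration and is not part of any library.

```python
import string

def simple_word_tokenize(text):
    """Drop ASCII punctuation, then split on whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

print(simple_word_tokenize("Time is an illusion. Lunchtime doubly so!"))
# ['Time', 'is', 'an', 'illusion', 'Lunchtime', 'doubly', 'so']
```

This works for simple sentences, but removing every punctuation mark also removes apostrophes, which is exactly where the next example gets tricky.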

🚨🚨🚨 IF INPUT:

"Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."

🤔 What should we do with the apostrophe (a punctuation mark)?
Options for "Don't":

"Don't" or "Don","t" or "Dont" or "Do","n't"

Options for "Jone's":

"Jone's" or "Jone","s" or "Jone" or "Jones"

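Three of these options can be reproduced with plain regular expressions. This is only a rough sketch to show that each choice is a different rule; the helper name `apostrophe_options` and the labels "keep", "split" and "drop" are invented for this example.

```python
import re

def apostrophe_options(word):
    """Return three possible tokenisations of a word containing an apostrophe."""
    return {
        # Keep the apostrophe inside the token: "Don't"
        "keep": re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", word),
        # Treat the apostrophe as a separator: "Don", "t"
        "split": re.findall(r"[A-Za-z]+", word),
        # Delete the apostrophe before tokenising: "Dont"
        "drop": re.findall(r"[A-Za-z]+", word.replace("'", "")),
    }

for word in ["Don't", "Jone's"]:
    print(word, apostrophe_options(word))
```

The fourth option ("Do", "n't") needs knowledge of English contractions, which is what the library tokenisers provide.
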
NLTK (ENG)

from nltk.tokenize import word_tokenize
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import TreebankWordTokenizer

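# word_tokenize
# (needs NLTK's Punkt data; download it once with nltk.download('punkt'), or 'punkt_tab' on recent NLTK releases)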
print(word_tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))

➡️ ['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']

# WordPunctTokenizer
print(WordPunctTokenizer().tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))

➡️ ['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr', '.', 'Jone', "'", 's', 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']
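
WordPunctTokenizer is a regular-expression tokeniser: it keeps runs of word characters together and puts runs of punctuation into their own tokens. The sketch below imitates that behaviour with `re` only, assuming the pattern `\w+|[^\w\s]+`; the function name `word_punct_like` is hypothetical.

```python
import re

def word_punct_like(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = word_punct_like("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage")
print(tokens)
# ['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr', '.', 'Jone', "'", 's', 'Orphanage']
```

This explains why "Don't" becomes three tokens and why "Mr." loses its period to a separate token.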

Penn Treebank Tokenisation

Penn Treebank tokenisation is one of the standard tokenisation methods. It follows these rules:
1. Keep hyphenated words as a single token (e.g. "Back-to-back")
2. Split off clitics from the word they attach to (e.g. "We're", "Doesn't")

print(TreebankWordTokenizer().tokenize("Starting a home-based restaurant may be an ideal. it doesn't have a food chain or restaurant of their own."))

➡️ ['Starting', 'a', 'home-based', 'restaurant', 'may', 'be', 'an', 'ideal.', 'it', 'does', "n't", 'have', 'a', 'food', 'chain', 'or', 'restaurant', 'of', 'their', 'own', '.']
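
To make the two rules concrete, here is a small hand-written approximation using `re`. It is not NLTK's implementation and covers only a handful of clitics; the name `treebank_like_tokenize` is made up for this sketch.

```python
import re

def treebank_like_tokenize(text):
    """Very rough imitation of the two Penn Treebank rules described above."""
    # Rule 2: split off the negation clitic "n't" (doesn't -> does n't)
    text = re.sub(r"(n't)\b", r" \1", text, flags=re.IGNORECASE)
    # Rule 2: split off other common clitics ('s, 're, 've, 'll, 'd, 'm)
    text = re.sub(r"'(s|re|ve|ll|d|m)\b", r" '\1", text, flags=re.IGNORECASE)
    # Give commas and the sentence-final period their own tokens
    text = re.sub(r",", " , ", text)
    text = re.sub(r"\.\s*$", " .", text)
    # Rule 1: hyphens are never touched, so hyphenated words stay whole
    return text.split()

print(treebank_like_tokenize("We're working back-to-back and it doesn't matter."))
# ['We', "'re", 'working', 'back-to-back', 'and', 'it', 'does', "n't", 'matter', '.']
```

Like the NLTK output above, this sketch only separates the period at the very end of the text, so a mid-text token such as 'ideal.' keeps its period.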

TensorFlow.Keras

from tensorflow.keras.preprocessing.text import text_to_word_sequence

print(text_to_word_sequence("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."))

➡️ ["don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'mr', "jone's", 'orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
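
The output shows what text_to_word_sequence does with its default settings: it lowercases the text, replaces a fixed set of punctuation characters (which does not include the apostrophe) with spaces, and splits on spaces. The sketch below reproduces that in plain Python, assuming the default filter string; `keras_like_word_sequence` is a hypothetical name, not a Keras function.

```python
# Assumed default filter set: punctuation, tab and newline, but no apostrophe
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def keras_like_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, turn filtered characters into separators, then split."""
    if lower:
        text = text.lower()
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    return [tok for tok in text.split(split) if tok]

print(keras_like_word_sequence("Don't be fooled, Mr. Jone's Orphanage is cheery."))
# ["don't", 'be', 'fooled', 'mr', "jone's", 'orphanage', 'is', 'cheery']
```

Because the apostrophe is not filtered, "don't" and "jone's" survive as single tokens, unlike in the NLTK tokenisers above.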
