preparing text data : preprocess the text to suit its intended use, reducing it to only the words you need for your NLP goals.
w/ Regex & NLTK libraries
Removing unnecessary characters and formatting (noise removal) + Tokenization (breaking multi-word strings into smaller components) + Normalization (a catch-all term for further processing; includes stemming and lemmatization)
Noise removal strips unwanted information,
ex) punctuation and accents, special characters, numeric digits, whitespace, HTML formatting
.sub() method in Python's re library -> handles most noise removal
.sub() requires 3 arguments :
pattern : a regular expression searched for in the input string; the r prefix is required to mark it as a raw string
replacement_text : text that replaces all matches in the input string
input : the string edited by the .sub() method; every match of pattern is replaced by the replacement_text
import re
# tag removal: '<.?h1>' matches both the opening '<h1>' and closing '</h1>' tags
headline_one = '<h1>Nation\'s Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini</h1>'
headline_no_tag = re.sub(r'<.?h1>', '', headline_one)
# char removal: strip the '@' character
tweet = '@fat_meats, veggies are better than you think.'
tweet_no_at = re.sub(r'[@]', '', tweet)
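Printing the two cleaned strings confirms that the <h1> tags and the @ have been stripped:

print(headline_no_tag)  # Nation's Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini
print(tweet_no_at)      # fat_meats, veggies are better than you think.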
**nltk's word_tokenize() function**
from nltk.tokenize import word_tokenize : word-level tokenization
from nltk.tokenize import sent_tokenize : sentence-level tokenization
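A minimal sketch of both tokenizers (the text string below is an arbitrary example; nltk.download('punkt') may be needed the first time):

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# nltk.download('punkt')  # uncomment on first use to fetch the tokenizer models

text = "NLTK makes tokenization easy. It splits text into words or sentences."

word_tokens = word_tokenize(text)  # word-level tokens; punctuation becomes its own token
sent_tokens = sent_tokenize(text)  # sentence-level tokens

print(word_tokens)
# ['NLTK', 'makes', 'tokenization', 'easy', '.', 'It', 'splits', 'text', 'into', 'words', 'or', 'sentences', '.']
print(sent_tokens)
# ['NLTK makes tokenization easy.', 'It splits text into words or sentences.']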
string.upper() / string.lower() : built-in string methods for case normalization
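A quick sketch of case normalization (the my_string value is an arbitrary example):

my_string = 'tHiS HaS a MiX oF cAsEs'
print(my_string.upper())  # 'THIS HAS A MIX OF CASES'
print(my_string.lower())  # 'this has a mix of cases'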
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
text_no_stops = [word for word in tokenized_survey if word not in stop_words]
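A self-contained sketch of the stop-word filter above (tokenized_survey is not defined in these notes, so a small assumed token list is used; nltk.download('stopwords') may be needed the first time):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('stopwords')  # uncomment on first use (plus 'punkt' for word_tokenize)

stop_words = set(stopwords.words('english'))

survey_text = "I really enjoyed the workshop and the instructors were helpful"
tokenized_survey = word_tokenize(survey_text.lower())

# keep only tokens that are not English stop words
text_no_stops = [word for word in tokenized_survey if word not in stop_words]
print(text_no_stops)  # e.g. ['really', 'enjoyed', 'workshop', 'instructors', 'helpful']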
**PorterStemmer**
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]
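A self-contained sketch of the stemmer (the tokenized list is an assumed example):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tokenized = ['running', 'flies', 'studies', 'happiness']
stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)  # e.g. ['run', 'fli', 'studi', 'happi'] -- note that stems are not always dictionary words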
**WordNetLemmatizer() class & its .lemmatize() method**
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words = [ lemmatizer.lemmatize(word) for word in tokenized_string ]
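A self-contained sketch (the tokenized_string list is an assumed example; nltk.download('wordnet') may be needed the first time). Without a part-of-speech argument, .lemmatize() treats every word as a noun:

import nltk
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # uncomment on first use

lemmatizer = WordNetLemmatizer()

tokenized_string = ['geese', 'children', 'running', 'better']
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenized_string]
print(lemmatized_words)  # e.g. ['goose', 'child', 'running', 'better'] -- non-noun forms are left unchanged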
To improve the performance of lemmatization, we need to find the part of speech for each word in our string
Building a get_part_of_speech(word) helper
import nltk
from nltk.corpus import wordnet
from collections import Counter

def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)
    pos_counts = Counter()
    # count how many of the word's synsets fall under each part of speech
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])
    # return the tag ("n", "v", "a", or "r") with the highest count
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech
lemmatized_pos = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in tokenized_string]
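Putting it together with the helper above (same assumed tokenized_string as before; the printed results are indicative):

lemmatizer = WordNetLemmatizer()

tokenized_string = ['geese', 'children', 'running', 'better']
lemmatized_pos = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in tokenized_string]
print(lemmatized_pos)  # e.g. ['goose', 'child', 'run', 'good'] -- the verb and adjective forms are now resolved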
++ Is stemming more efficient? It can be, since it doesn't require looking up the part of speech of each word.