preparing text data : preprocess the text to suit its intended use, reducing it to only the words you need for your NLP goals.
w/ Regex & NLTK libraries
Removing unnecessary characters and formatting (noise removal) + Tokenization (breaking multi-word strings into smaller components) + Normalization (a catch-all term for further processing; includes stemming and lemmatization)
Noise removal strips unwanted information,
ex) punctuation and accents, special characters, numeric digits, whitespace, HTML formatting
.sub() method in Python's re library -> handles most noise removal
.sub() requires 3 arguments :
pattern : a regular expression searched for in the input string; the r prefix is required to mark it as a raw string
replacement_text : text that replaces all matches in the input string
input : the string edited by the .sub() method; every match of pattern is replaced by the replacement_text
import re
# tag removal: '<.?h1>' matches both the opening '<h1>' and closing '</h1>' tags
headline_one = '<h1>Nation\'s Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini</h1>'
headline_no_tag = re.sub(r'<.?h1>', '', headline_one)
# char removal: strip the '@' character
tweet = '@fat_meats, veggies are better than you think.'
tweet_no_at = re.sub(r'[@]', '', tweet)
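Printing the two cleaned strings confirms that the <h1> tags and the @ have been stripped:

print(headline_no_tag)  # Nation's Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini
print(tweet_no_at)      # fat_meats, veggies are better than you think.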
**nltk's word_tokenize() function**
from nltk.tokenize import word_tokenize : word-level tokenization
from nltk.tokenize import sent_tokenize : sentence-level tokenization
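A minimal sketch of both tokenizers (the text string below is an arbitrary example; nltk.download('punkt') may be needed the first time):

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# nltk.download('punkt')  # uncomment on first use to fetch the tokenizer models

text = "NLTK makes tokenization easy. It splits text into words or sentences."

word_tokens = word_tokenize(text)  # word-level tokens; punctuation becomes its own token
sent_tokens = sent_tokenize(text)  # sentence-level tokens

print(word_tokens)
# ['NLTK', 'makes', 'tokenization', 'easy', '.', 'It', 'splits', 'text', 'into', 'words', 'or', 'sentences', '.']
print(sent_tokens)
# ['NLTK makes tokenization easy.', 'It splits text into words or sentences.']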
string.upper() / string.lower() : built-in string methods for case normalization
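A quick sketch of case normalization (the my_string value is an arbitrary example):

my_string = 'tHiS HaS a MiX oF cAsEs'
print(my_string.upper())  # 'THIS HAS A MIX OF CASES'
print(my_string.lower())  # 'this has a mix of cases'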
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
text_no_stops = [word for word in tokenized_survey if word not in stop_words]
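A self-contained sketch of the stop-word filter above (tokenized_survey is not defined in these notes, so a small assumed token list is used; nltk.download('stopwords') may be needed the first time):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('stopwords')  # uncomment on first use (plus 'punkt' for word_tokenize)

stop_words = set(stopwords.words('english'))

survey_text = "I really enjoyed the workshop and the instructors were helpful"
tokenized_survey = word_tokenize(survey_text.lower())

# keep only tokens that are not English stop words
text_no_stops = [word for word in tokenized_survey if word not in stop_words]
print(text_no_stops)  # e.g. ['really', 'enjoyed', 'workshop', 'instructors', 'helpful']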
**PorterStemmer**
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokenized]
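A self-contained sketch of the stemmer (the tokenized list is an assumed example):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tokenized = ['running', 'flies', 'studies', 'happiness']
stemmed = [stemmer.stem(token) for token in tokenized]
print(stemmed)  # e.g. ['run', 'fli', 'studi', 'happi'] -- note that stems are not always dictionary words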
**WordNetLemmatizer() class & its .lemmatize() method**
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words = [ lemmatizer.lemmatize(word) for word in tokenized_string ]
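A self-contained sketch (the tokenized_string list is an assumed example; nltk.download('wordnet') may be needed the first time). Without a part-of-speech argument, .lemmatize() treats every word as a noun:

import nltk
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # uncomment on first use

lemmatizer = WordNetLemmatizer()

tokenized_string = ['geese', 'children', 'running', 'better']
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenized_string]
print(lemmatized_words)  # e.g. ['goose', 'child', 'running', 'better'] -- non-noun forms are left unchanged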
To improve the performance of lemmatization, we need to find the part of speech for each word in our string
Building a get_part_of_speech(word) helper
import nltk
from nltk.corpus import wordnet
from collections import Counter

def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)
    pos_counts = Counter()
    # count how many of the word's synsets fall under each part of speech
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])
    # return the tag ("n", "v", "a", or "r") with the highest count
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech
lemmatized_pos = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in tokenized_string]
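Putting it together with the helper above (same assumed tokenized_string as before; the printed results are indicative):

lemmatizer = WordNetLemmatizer()

tokenized_string = ['geese', 'children', 'running', 'better']
lemmatized_pos = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in tokenized_string]
print(lemmatized_pos)  # e.g. ['goose', 'child', 'run', 'good'] -- the verb and adjective forms are now resolved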
++ Is stemming more efficient? It can be, since it doesn't require looking up the part of speech of each word.