[NLP] 3. Text Preprocessing

Joy · July 6, 2020

Codecademy [Learn Natural Language Processing]

Text Preprocessing


1. Introduction

  • preparing text data : processing the text ahead of time so it fits the task at hand

  • to reduce the text to only the words that you need for your NLP goals.

  • w/ Regex & NLTK libraries

  • Removing unnecessary characters and formatting + Tokenization (breaking multi-word strings into smaller components) + Normalization (a catch-all term for processing data; includes stemming and lemmatization)

2. Noise Removal

  • to remove unwanted information
    e.g. punctuation and accents, special characters, numeric digits,
    whitespace, HTML formatting

  • .sub() method in Python’s re library -> handles most noise removal

    .sub()
    3 required arguments :

    1. pattern : a regular expression that is searched for in the input string; an r must precede the pattern string to mark it as a raw string
    2. replacement_text : text that replaces all matches in the input string
    3. input : the string edited by the .sub() method
      return : a string with all instances of the pattern replaced by the replacement_text
import re
# tag removal
headline_one = '<h1>Nation\'s Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini</h1>'

headline_no_tag = re.sub(r'<.?h1>', '', headline_one)

# char removal
tweet = '@fat_meats, veggies are better than you think.'

tweet_no_at = re.sub(r'[@]' , '', tweet)
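
Whitespace noise can be cleaned the same way; a minimal sketch (the sample string is made up, and re is already imported above):

# whitespace removal
messy_text = '    The    forecast  calls for rain.   '

# strip leading and trailing whitespace
text_no_edges = re.sub(r'^\s+|\s+$', '', messy_text)

# collapse runs of internal whitespace into single spaces
text_clean = re.sub(r'\s+', ' ', text_no_edges)
# 'The forecast calls for rain.'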

3. Tokenization

  • break the text into smaller components (individual components : tokens)
  • **nltk's word_tokenize() and sent_tokenize() functions** (usage sketch below)

    from nltk.tokenize import word_tokenize : word level
    from nltk.tokenize import sent_tokenize : sentence level
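
A minimal usage sketch, assuming NLTK's punkt tokenizer data has been downloaded (the sample text is made up):

from nltk.tokenize import word_tokenize, sent_tokenize

text = 'Veggies are great. They are better than you think!'

print(word_tokenize(text))
# ['Veggies', 'are', 'great', '.', 'They', 'are', 'better', 'than', 'you', 'think', '!']

print(sent_tokenize(text))
# ['Veggies are great.', 'They are better than you think!']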

4. Normalization

  • consolidating words that appear in different forms into a single, shared form
  • Upper / lowercasing, Stopword removal, Stemming, Lemmatization

string.upper()
string.lower()
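
For example, lowercasing before tokenization keeps "Veggies" and "veggies" from being counted as separate tokens; a tiny sketch with a made-up string:

my_string = 'Veggies Are Better Than You Think'
print(my_string.upper())   # 'VEGGIES ARE BETTER THAN YOU THINK'
print(my_string.lower())   # 'veggies are better than you think'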

5. Stopword Removal

  • removing frequently used words that carry little meaning, e.g. “a”, “an”, and “the”

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

text_no_stops = [word for word in tokenized_survey if word not in stop_words]
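
A self-contained sketch, assuming the stopwords corpus has been downloaded (the sample sentence and variable names are made up):

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# hypothetical survey sentence
survey_text = 'The rain in Spain falls mainly on the plain.'
tokenized_survey = word_tokenize(survey_text.lower())

# keep only tokens that are not stopwords
text_no_stops = [word for word in tokenized_survey if word not in stop_words]
# ['rain', 'spain', 'falls', 'mainly', 'plain', '.']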

6. Stemming

  • stem extraction
  • removing word affixes (prefixes and suffixes)
  • NLTK built-in stemmer **PorterStemmer**

    from nltk.stem import PorterStemmer
    stemmer = PorterStemmer()

stemmed = [stemmer.stem(token) for token in tokenized]
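
A short sketch showing how blunt stemming can be (the sentence is made up):

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

tokenized = word_tokenize('The crew was replacing the worn cables')

# PorterStemmer strips affixes and lowercases each token
stemmed = [stemmer.stem(token) for token in tokenized]
# ['the', 'crew', 'wa', 'replac', 'the', 'worn', 'cabl']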

7. Lemmatization

  • lemma extraction
    - even when words appear in different forms, trace each one back to its root word (lemma) and decide whether the number of distinct words can be reduced
  • NLTK’s WordNetLemmatizer() class and its .lemmatize() method

    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()

lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenized_string]
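
A short sketch, assuming the wordnet corpus has been downloaded (the sentence is made up). Without a part-of-speech hint, .lemmatize() treats every word as a noun, so verbs like "were" and "flying" are left unchanged:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

tokenized_string = word_tokenize('The geese were flying over the ponds')
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenized_string]
# ['The', 'goose', 'were', 'flying', 'over', 'the', 'pond']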

8. Part-of-speech Tagging

  • To improve the performance of lemmatization, we need to find the part of speech for each word in our string

  • building a get_part_of_speech(word) helper

import nltk
from nltk.corpus import wordnet
from collections import Counter

def get_part_of_speech(word):
  # all WordNet synsets the word appears in
  probable_part_of_speech = wordnet.synsets(word)

  # count how many of those synsets fall under each part of speech
  pos_counts = Counter()
  pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])  # noun
  pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])  # verb
  pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])  # adjective
  pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])  # adverb

  # return the tag with the most synsets (ties fall back to "n", the first key inserted)
  most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  return most_likely_part_of_speech

  • applying it to lemmatization
lemmatized_pos = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in tokenized_string]
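
Reusing the made-up sentence from section 7, the verbs are now reduced to their roots as well:

tokenized_string = word_tokenize('The geese were flying over the ponds')
lemmatized_pos = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in tokenized_string]
# ['The', 'goose', 'be', 'fly', 'over', 'the', 'pond']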

Review

  • Text preprocessing : cleaning and prepping text data -> ready for other NLP tasks.
  • Noise removal : removing unnecessary formatting from our text.
  • Tokenization : breaking up text into smaller units (usually words or discrete terms).
  • Normalization : the catch-all name for most other text preprocessing tasks, including stemming, lemmatization, upper and lowercasing, and stopword removal.
  • Stemming is the normalization preprocessing task focused on removing word affixes.
  • Lemmatization is the normalization preprocessing task that more carefully brings words down to their root forms.

Missed question

++ Why is stemming more efficient? Because it doesn't require the part of speech of each word.
