Introduction to NLP (Wk.2)

송종빈 · October 29, 2021


Ch. 2 Text Preprocessing

2-5) Regular Expression

Introduction to RegEx

Python's standard library includes the 're' module for regular expressions.
It helps clean up text data that follows a certain pattern.

Grammar of RegEx

Symbol	Explanation
.	any single character except \n
?	the preceding character occurs 0 or 1 times {0,1}
*	the preceding character occurs 0 or more times {0,}
+	the preceding character occurs 1 or more times {1,}
^	the string starts with the pattern that follows
$	the string ends with the pattern that precedes
{n}	repeat exactly n times
{n,m}	repeat at least n and at most m times
{n,}	repeat at least n times
[characters]	match any one of the characters in []
[range]	match any one character in the range, e.g. [a-z]
[^characters]	match any one character that is not listed in []
a|b	match a or b
\\	a literal backslash
\d	any digit [0-9]
\D	any character except digits [^0-9]
\s	any whitespace character [ \t\n\r\f\v]
\S	any character except whitespace [^ \t\n\r\f\v]
\w	any letter, digit, or underscore [a-zA-Z0-9_]
\W	any character except letters, digits, and underscore [^a-zA-Z0-9_]
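The quantifiers and character classes in the table are easier to compare side by side. The following is a small sketch using re.fullmatch(), which succeeds only when the whole string matches; the patterns and test strings are made up for illustration.

```python
import re

# Each entry: (pattern, text, whether the whole text should match)
cases = [
    (r"ab?c", "ac", True),           # ? : 'b' occurs 0 or 1 times
    (r"ab?c", "abbc", False),        # two 'b's are too many for ?
    (r"ab*c", "abbbc", True),        # * : 'b' occurs 0 or more times
    (r"ab+c", "ac", False),          # + : 'b' must occur at least once
    (r"ab{2}c", "abbc", True),       # {n} : exactly n repetitions
    (r"ab{2,3}c", "abbbbc", False),  # {n,m} : four 'b's exceed the upper bound
    (r"[^0-9]+", "abc", True),       # [^...] : any character not listed
    (r"\w+", "snake_case1", True),   # \w also matches the underscore
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text, _ in cases]

for (pattern, text, expected), got in zip(cases, results):
    print(f"{pattern!r:12} {text!r:14} -> {got}")
```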

RegEx Module Functions

Function	Explanation
re.compile()	compile a RegEx pattern into a reusable pattern object
re.search()	scan the whole string for the first match; return a Match object if found, otherwise None
re.match()	match only at the beginning of the string; return a Match object if found, otherwise None
re.split()	split the string at every RegEx match and return a list
re.findall()	return a list of every match in the string, or an empty list if there is none
re.finditer()	return an iterator of Match objects for every match in the string
re.sub()	replace every part of the string that matches the RegEx with another string

Example of RegEx in Python

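import re

# Code A and code B are equivalent.

# Code A: compile the pattern once, then search with it
r = re.compile('ab+c')
r.search('abc')

# Code B: pass the pattern directly to re.search()
re.search('ab+c', 'abc')

# Call shapes of the other functions (the string arguments are placeholders)
re.match('what_to_find.', 'from_where')
re.split(r'\s', 'from_where')
re.findall('what_to_find', 'from_where')
re.finditer('what_to_find', 'from_where')
re.sub('from_sth', 'to_sth', 'from_where')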

from nltk.tokenize import RegexpTokenizer

text = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"

# The pattern describes the tokens: runs of letters, digits, or underscores
tokenizer = RegexpTokenizer(r"[\w]+")
print(tokenizer.tokenize(text))

# With gaps=True the pattern describes the separators: split on whitespace
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
print(tokenizer.tokenize(text))
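To see what the two tokenizer settings do without installing NLTK, here is a sketch that uses only the 're' module. It assumes that RegexpTokenizer behaves like re.findall() by default and like re.split() with empty strings dropped when gaps=True; the sentence is shortened from the one in the post.

```python
import re

text = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage"

# The pattern describes the tokens themselves: runs of word characters.
# The apostrophe is not a word character, so "Don't" becomes 'Don' and 't'.
word_tokens = re.findall(r"[\w]+", text)

# The pattern describes the gaps between tokens: runs of whitespace.
# Punctuation stays attached to the neighbouring word.
gap_tokens = [token for token in re.split(r"\s+", text) if token]

print(word_tokens)
print(gap_tokens)
```

The first approach throws punctuation away, the second keeps it inside the tokens, so the choice depends on whether punctuation matters for the task.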

2-6) Integer Encoding

Introduction to Integer Encoding

Computers handle integers more efficiently than strings.
So we often map each word to an integer (an index); this is called 'mapping'.
Usually, we assign the indices after sorting the words by frequency.

Integer Encoding Using Python Dictionary

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

raw_text = "A barber is a person. a barber is good person. a barber is huge person. he Knew A Secret! The Secret He Kept is huge secret. Huge secret. His barber kept his word. a barber kept his word. His barber kept his secret. But keeping and keeping such a huge secret to himself was driving the barber crazy. the barber went up a huge mountain."

# sentence tokenization
sentences = sent_tokenize(raw_text)

vocab = {}
preprocessed_sentences = []
stop_words = set(stopwords.words('english'))

for sentence in sentences:
    # word tokenization
    tokenized_sentence = word_tokenize(sentence)
    result = []

    for word in tokenized_sentence:
        word = word.lower() # lowercase so that 'Secret' and 'secret' count as one word
        if word not in stop_words: # remove stop words
            if len(word) > 2: # drop words of length 2 or less
                result.append(word)
                if word not in vocab:
                    vocab[word] = 0 
                vocab[word] += 1
    preprocessed_sentences.append(result) 

# sort by frequency
vocab_sorted = sorted(vocab.items(), key = lambda x:x[1], reverse = True)

word_to_index = {}
i = 0
for (word, frequency) in vocab_sorted :
    if frequency > 1 : # remove words with small frequency
        i = i + 1
        word_to_index[word] = i
        
vocab_size = 5
words_frequency = [word for word, index in word_to_index.items() if index >= vocab_size + 1] # remove words whose index is more than 5
for w in words_frequency:
    del word_to_index[w] # remove index information

word_to_index['OOV'] = len(word_to_index) + 1

encoded_sentences = []
for sentence in preprocessed_sentences:
    encoded_sentence = []
    for word in sentence:
        try:
            encoded_sentence.append(word_to_index[word])
        except KeyError:
            encoded_sentence.append(word_to_index['OOV'])
    encoded_sentences.append(encoded_sentence)

Converting text to numbers marks the point where actual processing starts.
Therefore, any preprocessing that is only possible on text must be finished before this step.
A lower index means a higher frequency.
We remove low-frequency words because they often carry little useful information in NLP tasks.
As a result, some words are missing from the word_to_index dictionary; we call them OOV (Out-Of-Vocabulary).
We add OOV as the last index.
Then, we encode every word in the sentences with its mapped integer.
In practice, Counter, FreqDist, enumerate, or the Keras Tokenizer are often used instead of a hand-built Python dictionary.

In the code above,
vocab = (dictionary) {unique word: its frequency}
vocab_sorted = (list) [(unique word, its frequency)] / sorted by frequency, descending
word_to_index = (dictionary) {unique word: its index} / sorted by index, ascending
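The three structures can be traced on a smaller input. This is a simplified sketch of the same idea, not the code above: it skips the frequency > 1 filter, uses dict.get() instead of try/except for OOV, and the five sentences are a hand-picked subset of the tokenized barber text.

```python
sentences = [
    ["barber", "person"],
    ["barber", "good", "person"],
    ["knew", "secret"],
    ["secret", "kept", "huge", "secret"],
    ["barber", "kept", "secret"],
]

# 1) vocab: {word: frequency}
vocab = {}
for sentence in sentences:
    for word in sentence:
        vocab[word] = vocab.get(word, 0) + 1

# 2) vocab_sorted: [(word, frequency)], most frequent first
vocab_sorted = sorted(vocab.items(), key=lambda item: item[1], reverse=True)

# 3) word_to_index: keep the top `vocab_size` words, indices start at 1
vocab_size = 2
word_to_index = {
    word: index
    for index, (word, _) in enumerate(vocab_sorted[:vocab_size], start=1)
}
word_to_index["OOV"] = len(word_to_index) + 1

# 4) encode each sentence, sending unknown words to the OOV index
oov = word_to_index["OOV"]
encoded = [[word_to_index.get(word, oov) for word in sentence] for sentence in sentences]

print(word_to_index)
print(encoded)
```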

Integer Encoding Using Counter

from collections import Counter

all_words_list = sum(preprocessed_sentences, [])
# or you can use 'words = np.hstack(preprocessed_sentences)' instead

# count word frequency using the 'Counter' class from the collections module
vocab = Counter(all_words_list)

vocab_size = 5
vocab = vocab.most_common(vocab_size) # leave only top 5 words with higher frequency

word_to_index = {}
i = 0
for (word, frequency) in vocab :
    i = i + 1
    word_to_index[word] = i

In the code above, 'preprocessed_sentences' is already tokenized into words.
Counter() : counts how many times each unique word occurs
most_common(n) : returns the n most frequent words together with their counts
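A short sketch of the two calls on a made-up three-sentence input. It also flattens with a list comprehension, which is an alternative to sum(..., []) that avoids copying the list at every step.

```python
from collections import Counter

sentences = [["barber", "person"], ["barber", "huge", "person"], ["barber", "secret"]]

# Flatten the list of sentences into one list of words
all_words = [word for sentence in sentences for word in sentence]

counts = Counter(all_words)
top2 = counts.most_common(2)

print(counts["barber"])   # frequency of a word that was seen
print(counts["missing"])  # a Counter returns 0 for unseen words instead of raising KeyError
print(top2)
```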

Integer Encoding Using NLTK's FreqDist

from nltk import FreqDist
import numpy as np

# np.hstack flattens the list of sentences into a single array of words
vocab = FreqDist(np.hstack(preprocessed_sentences))

vocab_size = 5
vocab = vocab.most_common(vocab_size) # store only top 5 words with high frequency

word_to_index = {word[0] : index + 1 for index, word in enumerate(vocab)}

enumerate() is useful for assigning an index to each word in order.
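As a sketch of the same idea, enumerate() also accepts a start argument, which removes the need for index + 1. The ranked list below is illustrative and only mimics the shape that most_common() returns; the inverse dictionary is an extra that is handy for decoding.

```python
# Illustrative (word, frequency) pairs in the shape returned by most_common()
ranked = [("barber", 8), ("secret", 6), ("huge", 5)]

# start=1 keeps index 0 free, for example for padding
word_to_index = {word: index for index, (word, _) in enumerate(ranked, start=1)}

# The inverse mapping turns encoded sequences back into words
index_to_word = {index: word for word, index in word_to_index.items()}

decoded = [index_to_word[i] for i in [1, 3, 2]]
print(word_to_index)
print(decoded)
```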

Integer Encoding Using Keras

from tensorflow.keras.preprocessing.text import Tokenizer

preprocessed_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]

tokenizer = Tokenizer()

# fit_on_texts() takes the corpus as input and builds the vocabulary ordered by frequency
tokenizer.fit_on_texts(preprocessed_sentences)

vocab_size = 5
tokenizer = Tokenizer(num_words = vocab_size + 1) # use only the top 5 words
tokenizer.fit_on_texts(preprocessed_sentences)

# show how indexes assigned to words
tokenizer.word_index

# show unique words and their frequencies
tokenizer.word_counts

# change words in corpus to given index
tokenizer.texts_to_sequences(preprocessed_sentences)

### If we want to only use top 5 frequency words for texts_to_sequences ###
vocab_size = 5
tokenizer = Tokenizer(num_words = vocab_size + 1) # see description below for the reason
tokenizer.fit_on_texts(preprocessed_sentences)

### If we want to only use top 5 frequency words for word_index & word_counts as well ###

tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)

vocab_size = 5
words_frequency = [word for word, index in tokenizer.word_index.items() if index >= vocab_size + 1] # delete words whose index exceed 5
for word in words_frequency:
    del tokenizer.word_index[word] # delete index information
    del tokenizer.word_counts[word] # delete count information

### If we want to save words not in word list as OOV ###
vocab_size = 5
# + 2 because index 0 is reserved and the OOV token takes index 1
tokenizer = Tokenizer(num_words = vocab_size + 2, oov_token = 'OOV')
tokenizer.fit_on_texts(preprocessed_sentences)

Why we add 1 to the num_words value:
'num_words' counts from 0, so if we pass 5, only indices 0 to 4 are kept, which means only the words with index 1 to 4 remain.
Therefore, to keep the words with index 1 to 5, we need to pass 5+1 rather than just 5.
The Keras Tokenizer reserves index 0, even though no word is assigned to it, because of a process called 'padding'.
This is explained in the next section.
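The off-by-one can be checked without TensorFlow. The following is not Keras code; it is a plain-Python sketch of the rule described above, assuming that texts_to_sequences keeps a word only when its index is strictly below num_words. The function name to_sequence and the word_index values are made up for illustration.

```python
# Hypothetical word_index, ordered by frequency as fit_on_texts() would produce
word_index = {"barber": 1, "secret": 2, "huge": 3, "kept": 4, "person": 5, "word": 6}

def to_sequence(words, word_index, num_words=None):
    """Keep a word only if it is known and its index is strictly below num_words."""
    sequence = []
    for word in words:
        index = word_index.get(word)
        if index is None:
            continue  # unknown word (this sketch has no OOV token)
        if num_words is not None and index >= num_words:
            continue  # index falls outside the range 0 .. num_words - 1
        sequence.append(index)
    return sequence

sentence = ["barber", "kept", "person", "word"]
with_5 = to_sequence(sentence, word_index, num_words=5)      # 'person' (index 5) is dropped
with_6 = to_sequence(sentence, word_index, num_words=5 + 1)  # 'person' is kept
print(with_5, with_6)
```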

2-7) Padding

Introduction to Padding

In natural language processing, sentences (or documents) usually have different lengths.
If all documents have the same length, the computer can treat them as a single matrix and process them all together.
Therefore, we sometimes need to make the documents the same length.

Padding with NumPy

# What we did in the last chapter - encoding words to assigned integer
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

preprocessed_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
encoded = tokenizer.texts_to_sequences(preprocessed_sentences)

Next, we pad this encoded data.

max_len = max(len(item) for item in encoded) # length of the longest sentence

# Extend every other sentence to the length of the longest one
# We assume an imaginary word 'PAD' whose index is 0

for sentence in encoded:
    while len(sentence) < max_len:
        sentence.append(0)

padded_np = np.array(encoded)

For sentences shorter than 7, the number 0 has been appended until their length is 7.
Now the computer can treat them as one matrix and process them in parallel.
The word with index 0 carries no meaning, so it is ignored when processing natural language.
Adjusting the shape (or size) of data by filling in a certain value is called 'Padding'.
When the fill value is 0, as above, it is called 'Zero Padding'.

Padding with Keras Preprocessing Tools

from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(encoded)

# pad_sequences() adds 0 at the front by default; to add it at the back, as in the NumPy version:
padded = pad_sequences(encoded, padding = 'post')

# We do not have to pad to the longest sentence; we can set the maximum length ourselves
padded = pad_sequences(encoded, padding = 'post', maxlen = 5) # sentences longer than 5 are truncated, so data is lost

# We usually pad with 0, but other values work as well. The code below pads with a number that is 1 larger than the vocabulary size
last_value = len(tokenizer.word_index) + 1
padded = pad_sequences(encoded, padding = 'post', value = last_value)
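The options above can be tried out without TensorFlow. This is a plain-Python sketch, not the Keras implementation: the function pad and the sample sequences are made up, and it assumes the Keras defaults padding='pre' and truncating='pre' (pad at the front, cut from the front).

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Return new lists of equal length; the input lists are left unmodified."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate first: 'pre' drops items from the front, 'post' from the back
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

encoded = [[1, 5], [1, 8, 5], [7, 7, 3, 2, 10, 1, 11]]

front_padded = pad(encoded)                            # zeros go in front
back_padded = pad(encoded, padding="post", maxlen=5)   # zeros go behind, long rows are cut
print(front_padded)
print(back_padded)
```

Under these assumptions the longest sentence loses its first two words when maxlen = 5, which is the data loss mentioned in the comment above.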

2-8) One-Hot Encoding

Introduction to One-Hot Encoding

Computers handle numbers better than text.
Therefore, NLP uses many techniques for converting text into numbers.
'One-Hot Encoding' is the most basic technique for representing words.
Before one-hot encoding, we first build a vocabulary.
Then, we do integer encoding.
If there are 5,000 different words in the text, the size of the vocabulary is 5,000.
Each of these 5,000 words is assigned its own index.

A vocabulary is the set of distinct words in the text.
Different forms such as 'book' and 'books' count as different words.

We set the dimension of the vector to the size of the vocabulary.
Then, we put 1 at the index of the word we want to express and 0 everywhere else.
This vector is called a one-hot vector.

One-Hot Encoding Function in Python

def one_hot_encoding(word, word2index):
  # word2index must map words to indices 0 .. len(word2index) - 1
  one_hot_vector = [0]*(len(word2index))
  index = word2index[word]
  one_hot_vector[index] = 1
  return one_hot_vector
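A usage sketch for the function. It is repeated here so the block runs on its own, and the token list is made up. Note that the indices start at 0 in this example; with a 1-based word_to_index, as in the integer encoding section, the largest index would fall outside the vector.

```python
def one_hot_encoding(word, word2index):
    one_hot_vector = [0] * len(word2index)
    one_hot_vector[word2index[word]] = 1
    return one_hot_vector

# Build a 0-based index: 'I' -> 0, 'like' -> 1, ...
tokens = ["I", "like", "natural", "language", "processing"]
word2index = {token: index for index, token in enumerate(tokens)}

vector = one_hot_encoding("natural", word2index)
print(vector)  # [0, 0, 1, 0, 0]
```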

2-9) Splitting Data

Introduction to Supervised Learning

Data for supervised learning consists of 'question' data and 'answer' data (also known as labels).

We split data as below:

train data
X_train : question for train
y_train : answer for train
test data
X_test : question for test
y_test : answer for test

The computer trains on the train data and makes predictions for X_test.
Then, we compare its predictions with y_test and report the accuracy.
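The split and the accuracy check can be sketched in plain Python. The helper names split_train_test and accuracy, the 80/20 ratio, and the toy data are assumptions for illustration; in practice a library function such as scikit-learn's train_test_split is commonly used instead.

```python
import random

def split_train_test(X, y, test_ratio=0.2, seed=0):
    """Shuffle question/answer pairs together, then cut off the last part as test data."""
    pairs = list(zip(X, y))
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_ratio)
    train, test = pairs[:len(pairs) - n_test], pairs[len(pairs) - n_test:]
    X_train, y_train = [p[0] for p in train], [p[1] for p in train]
    X_test, y_test = [p[0] for p in test], [p[1] for p in test]
    return X_train, X_test, y_train, y_test

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the answer."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

X = list(range(10))
y = [x % 2 for x in X]  # hypothetical labels: 1 for odd, 0 for even

X_train, X_test, y_train, y_test = split_train_test(X, y)
predictions = [x % 2 for x in X_test]  # stand-in for a trained model's guesses
print(len(X_train), len(X_test), accuracy(y_test, predictions))
```

Shuffling the pairs together keeps every question attached to its own answer, which is the one thing a split must never break.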

2-10) Text Preprocessing Tools for Korean Text

PyKoSpacing

It converts sentences without spacing into sentences with proper spacing.

Py-Hanspell

It is based on the Naver Hangul Spell Checker.
It checks spacing as well as spelling.

SOYNLP

It is a word tokenizer that also supports POS tagging.
It is based on unsupervised learning and analyzes which words occur frequently in the data.
Internally, it works from a table of word scores.
These scores use 'cohesion probability' and 'branching entropy'.

It can handle new words, such as the name of a newly debuted idol group.

Ch. 3 Language Model

3-1) Introduction to Language Model

Introduction to Language Model

A language model is a model that assigns a probability to a word sequence (a sentence).
Nowadays, neural models are usually used rather than statistical models.
Recent technologies such as GPT and BERT are also based on neural-network language models.

The most common task is predicting the next word given the previous words. This is called 'Language Modeling'.
A model may also predict a word between given words.

Assigning Probability to a Word Sequence

This can be applied in fields like

Machine Translation
Spell Correction
Speech Recognition

3-2) Statistical Language Model, SLM

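Probability of a Sentence

It is the product of the probability of each word given the words before it.

For example, the probability of 'An adorable little boy is spreading smiles' is:

$$P(\text{An}) \times P(\text{adorable} \mid \text{An}) \times P(\text{little} \mid \text{An adorable}) \times P(\text{boy} \mid \text{An adorable little}) \times P(\text{is} \mid \text{An adorable little boy}) \times P(\text{spreading} \mid \text{An adorable little boy is}) \times P(\text{smiles} \mid \text{An adorable little boy is spreading})$$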
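In general form, this product is the chain rule of probability. The notation below is an assumption, since the post does not define symbols: w_i stands for the i-th word and count(·) for the number of times a word sequence appears in the training corpus. The second line is the usual count-based estimate of one conditional term.

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})

P(\text{is} \mid \text{An adorable little boy})
  = \frac{\operatorname{count}(\text{An adorable little boy is})}
         {\operatorname{count}(\text{An adorable little boy})}
```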

3-3) N-gram Language Model

Introduction to N-gram Language Model

It uses a count-based statistical approach.
Therefore, it is a form of SLM as well.
However, instead of considering every previous word, it considers only some of them.
We decide how many words to consider, and that number is the 'n' in 'n-gram'.

Reducing Cases That Cannot Be Counted in the Corpus

The longer the sentence whose probability we want to calculate, the more likely it is that it cannot be counted in the corpus,
since that exact word sequence may not exist in the corpus at all.
This is the limitation of SLM; the target sentence may not be in the training corpus.
Referring to fewer previous words increases the chance that the sequence can be counted.
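The effect of shortening the history can be seen on a toy corpus. This is a sketch under stated assumptions: the three sentences are made up, the helper names are hypothetical, and the probability is estimated as count(history + word) / count(history) with no smoothing.

```python
from collections import Counter

corpus = [
    "an adorable little boy is spreading smiles".split(),
    "a little boy is running".split(),
    "the boy is laughing".split(),
]

def ngram_counts(sentences, n):
    """Count every run of n consecutive words in the corpus."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def conditional_probability(sentences, history, word):
    """Estimate P(word | history) as count(history + word) / count(history)."""
    history = tuple(history)
    numerator = ngram_counts(sentences, len(history) + 1)[history + (word,)]
    denominator = ngram_counts(sentences, len(history))[history]
    return numerator / denominator if denominator else None  # None: history never seen

# Long history: 'the little boy is' never occurs, so nothing can be counted
full = conditional_probability(corpus, "the little boy is".split(), "spreading")

# Bigram (n = 2): only the previous word 'is' is used as history
bigram = conditional_probability(corpus, ["is"], "spreading")

print(full, bigram)
```

With the four-word history there is nothing to count, while the bigram still yields an estimate because 'is' occurs three times and 'is spreading' once.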

3-4) Language Model for Korean Sentences

Korean sentences are more difficult to model for the following reasons:

Word order is relatively free
It is an agglutinative language
Spacing is often incorrect

3-5) Perplexity

Extrinsic Evaluation

To compare the performance of different models, we can apply them to a task such as spell checking, machine translation, or speech recognition
and see which model does better.
However, this takes too much time when there are more than two models to compare.

Therefore, we use 'Intrinsic Evaluation', which can be less accurate than extrinsic evaluation but is faster.
It scores the model's performance numerically from the model itself, without running a downstream task.

Perplexity, PPL

Perplexity is an intrinsic evaluation metric for language models.
It is often shortened to PPL.
The lower the PPL, the better the language model's performance.
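The post does not give the formula, so the following sketch assumes the standard definition: for a sentence of N words, PPL is the sentence probability raised to the power -1/N. The per-word probabilities for the two models are hypothetical numbers chosen for illustration.

```python
import math

def perplexity(conditional_probs):
    """PPL = (p_1 * p_2 * ... * p_N) ** (-1 / N), computed in log space for stability."""
    n = len(conditional_probs)
    log_prob = sum(math.log(p) for p in conditional_probs)
    return math.exp(-log_prob / n)

# Hypothetical probabilities two models assign to each word of the same 4-word sentence
model_a = [0.5, 0.25, 0.5, 0.25]
model_b = [0.1, 0.1, 0.1, 0.1]

ppl_a = perplexity(model_a)
ppl_b = perplexity(model_b)
print(ppl_a, ppl_b)
```

Model A assigns higher probabilities to the actual words, so its perplexity comes out lower. A model that gives every word probability 0.1 has a perplexity of about 10, as if it were choosing among 10 equally likely words at each step.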

3-6) Conditional Probability

Nothing in particular to note here.
