Python includes a regular expression module, 're', in its standard library.
It helps clean up text data by matching patterns.
Symbol | Explanation |
---|---|
. | any single character except \n |
? | preceding character occurs zero or one time {0,1} |
* | preceding character occurs zero or more times {0,} |
+ | preceding character occurs one or more times {1,} |
^ | string starts with the pattern that follows |
$ | string ends with the pattern that precedes |
{n} | repeat exactly n times |
{n1,n2} | repeat at least n1 and at most n2 times |
{n,} | repeat at least n times |
[characters] | match any one of the characters in [] |
[range] | match any one character in the range, e.g. [a-z] |
[^characters] | match any one character not listed in [] |
a\|b | match a or b |
\\ | a literal backslash |
\d | any digit [0-9] |
\D | any non-digit [^0-9] |
\s | any whitespace character [ \t\n\r\f\v] |
\S | any non-whitespace character [^ \t\n\r\f\v] |
\w | any letter, digit, or underscore [a-zA-Z0-9_] |
\W | any character other than a letter, digit, or underscore [^a-zA-Z0-9_] |
Function | Explanation |
---|---|
re.compile() | compile a RegEx into a reusable pattern object |
re.search() | scan the whole string for the first match of the RegEx. Return a Match object if found, else None |
re.match() | check whether the RegEx matches at the beginning of the string. Return a Match object if so, else None |
re.split() | split the string wherever the RegEx matches, and return a list |
re.findall() | find every match of the RegEx in the string, and return them as a list. If there is none, return an empty list |
re.finditer() | find every match of the RegEx in the string, and return an iterator of Match objects |
re.sub() | replace the parts of the string that match the RegEx with a different string |
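import re
# code A and code B are equivalent
# code A: compile the pattern first, then search with it
r = re.compile('ab+c')
r.search('abc')
# code B: pass the pattern directly to re.search
re.search('ab+c', 'abc')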
re.match('what_to_find.', 'from_where')
re.split(r'\s', 'from_where')
re.findall('what_to_find', 'from_where')
re.finditer('what_to_find', 'from_where')
re.sub('from_sth', 'to_sth', 'from_where')
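As a quick illustration of the functions listed above, the sketch below runs each of them on a made-up sample string. The variable names and patterns are illustrative only and do not come from the original notes.

```python
import re

text = "cat bat rat"  # made-up sample string

# re.search scans the whole string and returns the first match (or None)
first = re.search(r"[br]at", text)

# re.match only succeeds if the pattern matches at the very beginning
at_start = re.match(r"[br]at", text)

# re.split cuts the string wherever the pattern matches
pieces = re.split(r"\s", text)

# re.findall returns every non-overlapping match as a list of strings
all_words = re.findall(r"\wat", text)

# re.finditer yields Match objects, which also carry positions
spans = [m.span() for m in re.finditer(r"\wat", text)]

# re.sub replaces every match with the given replacement
replaced = re.sub(r"[cbr]at", "dog", text)

print(first.group(), at_start, pieces, all_words, spans, replaced)
```

Note that re.match returns None here because the string starts with 'cat', while re.search still finds 'bat' further in.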
from nltk.tokenize import RegexpTokenizer
# the pattern describes the tokens: keep runs of word characters, so punctuation is dropped
tokenizer = RegexpTokenizer(r"[\w]+")
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))
# with gaps=True the pattern describes the separators: split on whitespace
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))
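For intuition, the two tokenizers above behave roughly like re.findall and re.split from the standard library. The sketch below is an approximation under that assumption, not NLTK's own code, and it uses a shortened sample sentence.

```python
import re

sentence = "Don't be fooled, Mr. Jone's Orphanage is cheery"

# Pattern describes the tokens themselves: keep runs of word characters
word_tokens = re.findall(r"\w+", sentence)

# Pattern describes the gaps: split on whitespace and drop empty strings
gap_tokens = [t for t in re.split(r"\s+", sentence) if t]

print(word_tokens)
print(gap_tokens)
```

The first approach splits "Don't" into 'Don' and 't' because the apostrophe is not a word character, while the second keeps punctuation attached to the words.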
Computers handle integers more efficiently than strings.
For this reason, we sometimes map each word to an integer (or index); this is called 'mapping'.
Usually, we assign the indexes after sorting the words by frequency.
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
raw_text = "A barber is a person. a barber is good person. a barber is huge person. he Knew A Secret! The Secret He Kept is huge secret. Huge secret. His barber kept his word. a barber kept his word. His barber kept his secret. But keeping and keeping such a huge secret to himself was driving the barber crazy. the barber went up a huge mountain."
# sentence tokenization
sentences = sent_tokenize(raw_text)
vocab = {}
preprocessed_sentences = []
stop_words = set(stopwords.words('english'))
for sentence in sentences:
    # word tokenization
    tokenized_sentence = word_tokenize(sentence)
    result = []
    for word in tokenized_sentence:
        word = word.lower()  # lowercase to reduce the number of distinct words
        if word not in stop_words:  # remove stop words
            if len(word) > 2:  # remove words of length 2 or less
                result.append(word)
                if word not in vocab:
                    vocab[word] = 0
                vocab[word] += 1
    preprocessed_sentences.append(result)
# sort by frequency
vocab_sorted = sorted(vocab.items(), key=lambda x: x[1], reverse=True)
word_to_index = {}
i = 0
for (word, frequency) in vocab_sorted:
    if frequency > 1:  # skip words that appear only once
        i = i + 1
        word_to_index[word] = i
vocab_size = 5
words_frequency = [word for word, index in word_to_index.items() if index >= vocab_size + 1]  # words whose index is greater than 5
for w in words_frequency:
    del word_to_index[w]  # remove index information
word_to_index['OOV'] = len(word_to_index) + 1
encoded_sentences = []
for sentence in preprocessed_sentences:
    encoded_sentence = []
    for word in sentence:
        try:
            encoded_sentence.append(word_to_index[word])
        except KeyError:
            # words missing from word_to_index are encoded as OOV
            encoded_sentence.append(word_to_index['OOV'])
    encoded_sentences.append(encoded_sentence)
Converting text to numbers marks the point where the actual processing begins.
Therefore, we have to finish all the preprocessing that is only possible in text form beforehand.
A lower index means a higher frequency.
We remove low-frequency words because they often carry little meaning in NLP.
As a result, some words are missing from the word_to_index dictionary; we call them OOV (Out-Of-Vocabulary).
We add OOV as the last index.
Then, we encode every word in the sentences with its mapped integer.
In practice, we often use Counter, FreqDist, enumerate, or the Keras Tokenizer rather than a plain Python dictionary.
In the code above,
vocab = (dictionary) {unique word: its frequency}
vocab_sorted = (list) [(unique word, its frequency)] /sorted by frequency, descending
word_to_index = (dictionary) {unique word: its index} /sorted by index, ascending
from collections import Counter
# flatten the list of tokenized sentences into a single list of words
all_words_list = sum(preprocessed_sentences, [])
# or you can use 'all_words_list = np.hstack(preprocessed_sentences)' instead
# count word frequencies with the 'Counter' class from the collections module
vocab = Counter(all_words_list)
vocab_size = 5
vocab = vocab.most_common(vocab_size)  # keep only the 5 most frequent words
word_to_index = {}
i = 0
for (word, frequency) in vocab:
    i = i + 1
    word_to_index[word] = i
In the code above, 'preprocessed_sentences' is already tokenized into words.
Counter() : merge duplicated words and count the frequency of each
most_common(n) : return the n most frequent words with their counts
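Putting these pieces together, here is a minimal sketch of frequency-based integer encoding with an OOV index. It reuses the preprocessed sentences from these notes; the encode helper and the use of dict.get instead of try/except are illustrative choices, not part of the original code.

```python
from collections import Counter

preprocessed_sentences = [
    ['barber', 'person'], ['barber', 'good', 'person'],
    ['barber', 'huge', 'person'], ['knew', 'secret'],
    ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'],
    ['barber', 'kept', 'word'], ['barber', 'kept', 'word'],
    ['barber', 'kept', 'secret'],
    ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'],
    ['barber', 'went', 'huge', 'mountain'],
]

vocab_size = 5
# Flatten the sentences and count every word
counts = Counter(word for sentence in preprocessed_sentences for word in sentence)

# Index 1 goes to the most frequent word, index 2 to the next, and so on
word_to_index = {
    word: i for i, (word, _) in enumerate(counts.most_common(vocab_size), start=1)
}
# Reserve one extra index for every word outside the top vocab_size
word_to_index['OOV'] = len(word_to_index) + 1


def encode(sentence):
    """Map each word to its index, falling back to the OOV index."""
    oov = word_to_index['OOV']
    return [word_to_index.get(word, oov) for word in sentence]


encoded_sentences = [encode(s) for s in preprocessed_sentences]
print(word_to_index)
print(encoded_sentences[:3])
```

Using dict.get with a default keeps the encoding step to one line and avoids raising an exception for every unknown word.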
from nltk import FreqDist
import numpy as np
# flatten the list of sentences into one array of words using np.hstack
vocab = FreqDist(np.hstack(preprocessed_sentences))
vocab_size = 5
vocab = vocab.most_common(vocab_size) # store only top 5 words with high frequency
word_to_index = {word[0] : index + 1 for index, word in enumerate(vocab)}
enumerate() is useful for assigning indexes in order.
from tensorflow.keras.preprocessing.text import Tokenizer
preprocessed_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]
tokenizer = Tokenizer()
# passing the corpus to fit_on_texts() builds the vocabulary based on word frequency
tokenizer.fit_on_texts(preprocessed_sentences)
vocab_size = 5
tokenizer = Tokenizer(num_words = vocab_size + 1) # use only the top 5 words
tokenizer.fit_on_texts(preprocessed_sentences)
# show the index assigned to each word
tokenizer.word_index
# show the unique words and their frequencies
tokenizer.word_counts
# convert the words in the corpus to their assigned indexes
tokenizer.texts_to_sequences(preprocessed_sentences)
### If we want to only use top 5 frequency words for texts_to_sequences ###
vocab_size = 5
tokenizer = Tokenizer(num_words = vocab_size + 1) # see description below for the reason
tokenizer.fit_on_texts(preprocessed_sentences)
### If we want to only use top 5 frequency words for word_index & word_counts as well ###
tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
vocab_size = 5
words_frequency = [word for word, index in tokenizer.word_index.items() if index >= vocab_size + 1]  # words whose index exceeds 5
for word in words_frequency:
    del tokenizer.word_index[word]  # delete index information
    del tokenizer.word_counts[word]  # delete count information
### If we want to save words not in word list as OOV ###
vocab_size = 5
tokenizer = Tokenizer(num_words = vocab_size + 2, oov_token = 'OOV')
tokenizer.fit_on_texts(preprocessed_sentences)
The reason why we add +1 to the num_words value:
num_words counts from index 0, so if we pass 5, only indexes 0 to 4 are kept, which means only the words with index 1 to 4 remain.
Therefore, if we want to keep the words with index 1 to 5, we need to pass 5+1 rather than just 5.
When oov_token is set, 'OOV' takes index 1 and the words start from index 2, which is why the OOV example above passes vocab_size + 2.
Keras Tokenizer reserves index 0, even though no word is assigned to it, for a process called 'padding'.
This will be explained in the next chapter.
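The +1 rule described above can be sketched in plain Python. The to_sequence function below only mimics the filtering rule (keep a word when its index is strictly below num_words); it is a hypothetical helper, not Keras code, and the word_index is a hand-written partial example in the same 1-based, frequency-ordered style.

```python
# Hand-written partial word_index: 1-based, most frequent word first,
# index 0 is never assigned to a word
word_index = {'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4,
              'person': 5, 'word': 6, 'keeping': 7}


def to_sequence(words, word_index, num_words):
    """Keep only words whose index is strictly below num_words."""
    return [word_index[w] for w in words
            if w in word_index and word_index[w] < num_words]


sentence = ['barber', 'kept', 'person', 'word']
with_five = to_sequence(sentence, word_index, num_words=5)  # index 5 is dropped
with_six = to_sequence(sentence, word_index, num_words=6)   # indexes 1 to 5 survive
print(with_five, with_six)
```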
When processing natural language, each sentence (or document) may have a different length.
A computer can treat documents of the same length as a single matrix and process them all together.
Therefore, we sometimes need to adjust the documents so that they all have the same length.
# What we did in the last chapter - encoding words to assigned integer
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
preprocessed_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
Then, we pad this encoded data.
max_len = max(len(item) for item in encoded) # the longest sentence's length
# We then pad the other sentences to the same length as the longest one
# We assume there is an imaginary word 'PAD' whose index is 0
for sentence in encoded:
    while len(sentence) < max_len:
        sentence.append(0)
padded_np = np.array(encoded)
Sentences shorter than 7 words have been padded with 0 at the end so that every sentence has length 7.
Now the computer can treat them as a single matrix and process them in parallel.
The word with index 0 carries no meaning, so we ignore it when processing natural language.
Adjusting the shape (or size) of data by filling in a certain value is called 'Padding'.
When the fill value is 0, as above, it is called 'Zero Padding'.
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(encoded)
# If we want to add 0 at the end, as in the NumPy version, rather than at the front
padded = pad_sequences(encoded, padding = 'post')
# We do not have to pad to the longest sentence; we can set the maximum length ourselves
padded = pad_sequences(encoded, padding = 'post', maxlen = 5) # sentences longer than 5 are truncated, so data is lost
# Usually we pad with 0, but other values work as well. The code below pads with a number that is 1 larger than the size of the vocabulary
last_value = len(tokenizer.word_index) + 1
padded = pad_sequences(encoded, padding = 'post', value = last_value)
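To make the padding and truncating options concrete, here is a plain-Python sketch that imitates the behaviour described above. The pad function is a hypothetical re-implementation for illustration, not the Keras source; it assumes the Keras defaults of padding and truncating at the front ('pre').

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Return new lists that all have length maxlen."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Drop extra items from the front ('pre') or the back ('post')
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        filler = [value] * (maxlen - len(seq))
        # Put the filler before ('pre') or after ('post') the real items
        padded.append(filler + list(seq) if padding == 'pre' else list(seq) + filler)
    return padded


encoded = [[1, 5], [1, 8, 5], [7, 7, 3, 2, 10, 1, 11]]
front_padded = pad(encoded)
back_padded = pad(encoded, padding='post', maxlen=5)
print(front_padded)
print(back_padded)
```

Unlike the while loop shown earlier, this version builds new lists and leaves the input untouched. With maxlen=5 the longest sentence loses its first two items, because truncation happens at the front by default.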
Computers handle numbers better than text.
Therefore, NLP uses many techniques to convert text into numbers.
'One-Hot Encoding' is the most basic technique for representing words.
Before moving on to one-hot encoding, we first build a vocabulary.
Then, we do integer encoding.
If there are 5,000 different words in the text, the size of the vocabulary is 5,000.
Each of those words is assigned its own index, so there are 5,000 indexes.
We set the dimension of the vector to the size of the vocabulary.
Then, we put 1 at the index of the word we want to represent, and 0 everywhere else.
This vector is called a One-Hot vector.
def one_hot_encoding(word, word2index):
    # word2index is assumed to map words to 0-based indexes
    one_hot_vector = [0]*(len(word2index))
    index = word2index[word]
    one_hot_vector[index] = 1
    return one_hot_vector
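A short usage sketch of the function above follows. The word2index dictionary is a made-up example and is assumed to be 0-based; with the 1-based indexes built earlier in these notes, the last word would fall outside the vector.

```python
def one_hot_encoding(word, word2index):
    # One slot per vocabulary entry, all zeros except the word's own index
    one_hot_vector = [0] * len(word2index)
    one_hot_vector[word2index[word]] = 1
    return one_hot_vector


# Hypothetical 0-based vocabulary
word2index = {'barber': 0, 'secret': 1, 'huge': 2, 'kept': 3, 'person': 4}

vector = one_hot_encoding('huge', word2index)
print(vector)
```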
Data for supervised learning consists of 'question' data, and 'answer' data (also known as label)
We split data as below:
train data
X_train : question for train
y_train : answer for train
test data
X_test : question for test
y_test : answer for test
The computer trains on the train data, then makes predictions for the X_test data.
Then, we compare its predictions with the y_test data to compute its accuracy.
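The split and the accuracy check can be sketched without any library. Both simple_split and accuracy below are hypothetical helpers written for this illustration, and the "model" is a stand-in that always predicts 0; a real project would typically shuffle the data and use a library routine instead.

```python
def simple_split(X, y, test_ratio=0.2):
    """Keep the last test_ratio share of the paired data for testing."""
    n_test = int(len(X) * test_ratio)
    split_at = len(X) - n_test
    return X[:split_at], X[split_at:], y[:split_at], y[split_at:]


def accuracy(predictions, answers):
    """Share of predictions that equal the corresponding answer."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


X = list(range(10))        # made-up questions
y = [x % 2 for x in X]     # made-up answers (labels)
X_train, X_test, y_train, y_test = simple_split(X, y)

# Stand-in for a trained model: always predicts 0
predictions = [0 for _ in X_test]
score = accuracy(predictions, y_test)
print(len(X_train), len(X_test), score)
```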
It converts sentences without spacing into sentences with proper spacing.
It is based on the Naver Hangul Spell Checker.
It checks spacing as well as spelling.
It is a word tokenizer that supports POS tagging and word tokenization.
It is based on unsupervised learning, and analyzes the words that appear frequently in the data.
Internally, it works with a word score table.
The score is based on 'cohesion probability' and 'branching entropy'.
It can handle newly coined words, such as the name of a newly debuted idol group.
A language model is a model that assigns a probability to a word sequence (sentence).
Nowadays, we usually use neural models rather than statistical models.
Recent technologies such as GPT and BERT are also based on neural-network language models.
The most common case is predicting the next word given the previous words. This is called 'Language Modeling'.
Alternatively, a model may predict a word between given words.
This can be applied in fields like
The probability of a sentence is the product of the probabilities of each word given the words before it.
For example, the probability of 'An adorable little boy is spreading smiles' is:
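Applying the chain rule of probability word by word, with $w_1, \dots, w_n$ denoting the words of a sentence, gives:

$$
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$

$$
\begin{aligned}
P(\text{An adorable little boy is spreading smiles}) ={}& P(\text{An}) \times P(\text{adorable} \mid \text{An}) \times P(\text{little} \mid \text{An adorable}) \\
&\times P(\text{boy} \mid \text{An adorable little}) \times P(\text{is} \mid \text{An adorable little boy}) \\
&\times P(\text{spreading} \mid \text{An adorable little boy is}) \\
&\times P(\text{smiles} \mid \text{An adorable little boy is spreading})
\end{aligned}
$$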
An n-gram language model uses a count-based statistical approach.
Therefore, it is a form of SLM (statistical language model) as well.
However, instead of considering every previous word, it only considers some of them.
We decide how many words to consider, and that number is the 'n' in 'n-gram'.
The longer the sentence whose probability we want to compute, the more likely it is that we cannot count it in the corpus,
since it may simply not appear there.
This is the limitation of SLM; the target sentence may not be in the training corpus.
Reducing the number of preceding words we refer to increases the chance that the sequence can be counted.
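As a minimal sketch of the count-based idea with n = 2 (a bigram model), the code below estimates the probability of a word from the single word before it. The toy corpus and the bigram_probability helper are made up for illustration, and no smoothing is applied.

```python
from collections import Counter

# Made-up toy corpus, already tokenized
corpus = [
    ['an', 'adorable', 'little', 'boy', 'is', 'spreading', 'smiles'],
    ['an', 'adorable', 'little', 'girl', 'is', 'smiling'],
    ['a', 'little', 'boy', 'is', 'insulted'],
]

# Count single words and adjacent word pairs
unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (prev, curr) for sent in corpus for prev, curr in zip(sent, sent[1:])
)


def bigram_probability(prev, curr):
    """Estimate P(curr | prev) as count(prev curr) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / unigram_counts[prev]


p_boy = bigram_probability('little', 'boy')
p_is = bigram_probability('boy', 'is')
p_unseen = bigram_probability('is', 'happy')
print(p_boy, p_is, p_unseen)
```

The last call returns 0.0 because 'is happy' never occurs in the toy corpus, which is the same sparsity problem described above, only on a smaller scale.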
Korean sentences are harder to model with a language model, for the following reasons:
To compare the performance of different models, we can apply each of them to a task such as spell checking, machine translation, or speech recognition,
and see which model does better.
However, this takes too much time when comparing more than two models.
Therefore, we use 'Intrinsic Evaluation', which can be less accurate than extrinsic evaluation, but is faster.
It quantifies the model's performance on its own, without an external task.
Perplexity is an intrinsic evaluation metric for language models.
It is often shortened to PPL.
The lower the PPL, the better the language model's performance.
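As a small sketch, PPL can be computed from the probabilities a model assigns to each word of a test sentence. This uses the standard definition (the inverse of the sentence probability, normalized by the number of words), which these notes do not spell out; the perplexity function and the probability values are illustrative.

```python
import math


def perplexity(word_probabilities):
    """PPL = (p_1 * p_2 * ... * p_N) ** (-1 / N), computed in log space."""
    n = len(word_probabilities)
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / n)


# A model that gives every word probability 0.1 is as uncertain as
# choosing among 10 equally likely words at each step
ppl_uniform = perplexity([0.1] * 4)
ppl_mixed = perplexity([0.5, 0.25])
print(ppl_uniform, ppl_mixed)
```

Working with logarithms avoids the underflow that multiplying many small probabilities directly would cause on long sentences.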
Nothing special to note.