[NLP 3] Tokenization 1: The NLTK Library

방선생 · January 10, 2025

Natural Language Processing


Tokenization

Tokenization: the task of splitting text into meaningful units

Sentence tokenization: split at sentence-ending punctuation (., !, ?)

Word tokenization: split on whitespace

Morpheme tokenization: split into morphemes
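
The first two definitions above can be sketched with the standard library alone. This is a deliberately naive illustration, not how NLTK works internally; the function names `naive_sent_tokenize` and `naive_word_tokenize` and the sample text are made up for this example.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(sentence):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return sentence.split()

sample = "Tokenization is fun! Is it hard? Not really."
sentences = naive_sent_tokenize(sample)
print(sentences)
print(naive_word_tokenize(sentences[0]))
```

A rule this simple breaks on abbreviations such as "Dr." or "e.g.", which is one reason to reach for a trained tokenizer instead.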

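Differences between English and Korean tokenization

The NLTK (Natural Language Toolkit) library

Import the library
import nltk

Download the additional resources you need
① Download the Punkt sentence tokenizer data, used to find sentence boundaries (recent NLTK releases expect 'punkt_tab', as used in Code 1 below)
nltk.download('punkt')
② Download the stopword lists
nltk.download('stopwords')
③ Call the stopword loader → build the English stopword list
stopwords_list = nltk.corpus.stopwords.words('english')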

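Sentence-level tokenization
Use sent_tokenize()

Word-level tokenization
Use word_tokenize()

punkt: the Punkt sentence tokenizer (the name is German for "period")
stopword: a very common word that is usually filtered out before analysis
morpheme: the smallest meaningful unit of a language

(All code in this series was written in Python on Google Colab.)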

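NLTK Code 1 (Checking the stopword list)

from google.colab import drive
drive.mount('/content/drive')

import nltk  # import the library

nltk.download('punkt_tab')  # download the Punkt sentence tokenizer data

nltk.download('stopwords')  # download the stopword lists

# Load the English stopword list so it is defined before printing
stopwords_list = nltk.corpus.stopwords.words('english')

print(f'English stopword list : \n{stopwords_list[:10]}')

print('-'*80)
print(len(stopwords_list))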

NLTK's stopwords corpus ships lists for English and a number of other languages, but Korean is not among them.
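
One common workaround is to supply your own list. The sketch below assumes a tiny hand-picked set, `korean_stopwords`, which is hypothetical and not an NLTK resource; a real project would need a much larger, task-specific list.

```python
# Hypothetical, hand-picked stopwords for illustration only (not from NLTK)
korean_stopwords = {"또", "및", "등"}

sentence = "또 검증 전문 인력 및 수요 측면 전문가들이 지원한다"

# Whitespace tokenization yields eojeol; drop those found in the custom set
tokens = sentence.split()
filtered = [tok for tok in tokens if tok not in korean_stopwords]

print(filtered)
```

Because Korean particles stay attached to each eojeol after a whitespace split, an exact-match list like this only catches standalone function words.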

NLTK Code 2 (Sentence-level tokenization)

# Create the text data
text_ko='''산업통상자원부는 지난 2개월간 공모절차를 진행한 결과, 시스템반도체 검증지원센터의 입지로 성남 판교가 최종 선정됐다고 9일 밝혔다.
시스템반도체 검증지원센터는 제2판교 테크노벨리에 위치한 성남 글로벌 융합센터 내에 조성될 계획이다.
올해부터 2028년까지 5년간 국비 150억원, 지방비 64억5000만원 등 총 214억5000만원의 예산을 투입해 한국팹리스산업협회, 한국반도체산업협회, 성남산업진흥원, 한국전자기술연구원 등이 함께 구축한다.
센터는 중소·중견기업이 확보하기 어려운 검증용 첨단장비를 구비하고, 전문 검증인력을 채용해 반도체 검증 환경을 구축할 예정이다.
또 검증 전문 인력 및 수요 측면 전문가들이 팹리스 기업에 설계의 취약점 분석하고, 해결방안을 제시해 ‘제품의 상용화’도 지원한다.
오는 8월까지 공간을 조성하고, 올해 하반기부터 기업들에게 검증지원 서비스를 제공할 예정이다.
산업부 관계자는 “설계 프로그램(EDA), 시제품 제작 등 반도체 설계를 중점 지원하는 ‘설계지원센터’와 검증·상용화를 지원하는 ‘검증지원센터’를 연계할 예정”이라며 “반도체 칩 설계-검증-상용화 전주기에 걸친 밀착 지원으로 팹리스들의 경쟁력을 높일 수 있을 것”이라고 기대했다.'''
# print(text_ko)

text_en='''I am happy to join with you today in what will go down in history as the greatest demonstration for freedom in the history of our nation.
Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation.
This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice.
It came as a joyous daybreak to end the long night of their captivity.But one hundred years later, the Negro still is not free. One hundred years later, the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. One hundred years later, the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. One hundred years later, the Negro is still languishing in the corners of American society and finds himself an exile in his own land. So we have come here today to dramatize a shameful condition.
In a sense we have come to our nation's capital to cash a check. When the architects of our republic wrote the magnificent words of the Constitution and the Declaration of Independence, they were signing a promissory note to which every American was to fall heir. This note was a promise that all men, -yes, black men as well as white men,- would be guaranteed the unalienable rights of life, liberty, and the pursuit of happiness. '''

# Sentence-level tokenization
result_ko = nltk.tokenize.sent_tokenize(text_ko)
result_en = nltk.tokenize.sent_tokenize(text_en)

print(result_ko, '\n')
print(result_en)


# Print the sentence tokenization results, one sentence at a time
for sent in result_ko:
  print(sent, '\n')

print('\n')

for sent in result_en:
  print(sent, '\n')

NLTK cannot remove Korean stopwords, but it can still tokenize Korean text.

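NLTK Code 3 (Word-level tokenization)

'''
What counts as a word
  1. English: words are separated by whitespace
  2. Korean: whitespace separates eojeol (word phrases), and each eojeol consists of one or more words
'''

# Use nltk.tokenize.word_tokenize(); the result is returned as a list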

word_ko = nltk.tokenize.word_tokenize(text_ko)
word_en = nltk.tokenize.word_tokenize(text_en)

print(word_ko, '\n')
print(word_en)
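
In the output above, punctuation marks show up as tokens of their own, which a plain whitespace split would not produce. The sketch below contrasts the two behaviors using only the standard library; the regular expression is a rough stand-in chosen for this example and is not NLTK's actual algorithm.

```python
import re

sentence = "So we have come here today to dramatize a shameful condition."

# Plain whitespace split: the final period stays glued to the last word
whitespace_tokens = sentence.split()

# Rough approximation of punctuation-aware tokenization:
# runs of word characters, or any single non-space, non-word character
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens[-1])
print(punct_aware_tokens[-2:])
print(len(whitespace_tokens), len(punct_aware_tokens))
```

This is also why a token list from a word tokenizer is usually longer than the whitespace word count of the same text.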

NLTK Code 4 (Removing stopwords from English text)

# English text > word tokenization > stopword removal

# Create a list to hold the result
cleaned_words = []

for word in word_en:
  if word.lower() not in stopwords_list:
    cleaned_words.append(word)

print(len(word_en), '\n')
print(cleaned_words, '\n')
print(len(cleaned_words))
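
The loop above keeps punctuation tokens such as "." and ",", because they are not in the stopword list. A possible variant, sketched here with stand-in data so it runs on its own, uses a set for faster membership checks and `str.isalpha()` to drop non-alphabetic tokens. `sample_stopwords` and `sample_tokens` are hypothetical; in the notebook you would use `set(stopwords_list)` and `word_en` instead.

```python
# Stand-in data (hypothetical); replace with set(stopwords_list) and word_en
sample_stopwords = {"so", "we", "have", "to", "a", "here"}
sample_tokens = ["So", "we", "have", "come", "here", "today", "to",
                 "dramatize", "a", "shameful", "condition", "."]

# Keep a token only if it is purely alphabetic and not a stopword
cleaned = [
    tok for tok in sample_tokens
    if tok.isalpha() and tok.lower() not in sample_stopwords
]

print(cleaned)
print(len(sample_tokens), len(cleaned))
```

Note that `isalpha()` also discards numbers and tokens containing apostrophes, so whether it fits depends on what the downstream analysis needs.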