[Python] Web Crawling - 형태소 분석, WordCloud

거친코딩·2021년 5월 3일

누구나 할 수 있는 크롤링

목록 보기

4/7

파이썬을 활용한 웹 크롤링 #4

휴잭맨 영화 리뷰 분석 및 워드클라우드

네이버 리뷰 데이터

형태소 분석(koNLPy)
1. 크롤링한 댓글파일을 불러와서 리트스 변수에 저장
1. konlpy 모듈 호출 및 Okt 객체 생성
2. 반복문을 사용하여 문장별 형태소구분 및 품사매칭(koLNPy함수)
3. [응용실습] 필요한 품사만 추출
4. [응용실습] 선별된 품사별 빈도수 계산하고 상위 빈도 10위 까지 출력

# konlpy 설치
pip install konlpy

import os
from konlpy.tag import Okt
from collections import Counter

# Okt 형태소 분석 객체 생성
ok_twitter = Okt()

# 저장된 파일의 위치 탐색 후, file변수에 저장
file = open('./drive/MyDrive/machine_learning_data/Naver_opinion.txt','r',encoding='utf-8')
total_lines = file.readlines() # txt파일을 줄 단위로 읽음
file.close()
print(total_lines)
>['휴잭맨의. 매력이 한껏 돋보인 영화입니다.다른 출연자들도 너무 멋지구요.사람은 누구나 소중하며, 현재를 멋지...

# 크롤링 댓글파일 가져와서 reply_text 리스트에 저장
reply_text = []
for line in total_lines:
  reply_text.append(line[:-1])

# 형태소 분류하고 확인 하기
sentences_tag = []
for sentence in reply_text:
  morph = ok_twitter.pos(sentence)
  sentences_tag.append(morph)
  print(morph) # 분석된 1건 결과 확인
  print('-' * 30)
>[('스토리', 'Noun'), ('는', 'Josa'), ('좀', 'Noun'), ('뻔하지만', 'Adjective'), ('화려하고', 'Adjective'), ('노래', 'Noun'), ('도', 'Josa'), ('너무', 'Adverb'), ('좋아', 'Adjective'), ('시간', 'Noun'), ('가는줄', 'Verb'), ('몰랐다', 'Verb')]
>------------------------------

# 명사만 출력해 보기
for my_sentence in sentences_tag:
	for word, tag in my_sentence:
		if tag in ['Noun']:
			print(word)
>남자배우
스토리
감동
작품
만
품위
즐거움
무조건

# 필요한 품사만 추출해보기(명사를 bucket list에 담기)
bucket_list = []
for my_sentence in sentences_tag:
  for word, tag in my_sentence:
    if tag in ['Noun']:
    bucket_list.append(word)

print(bucket_list)
>['휴잭맨', '매력', '영화', '다른', '출연자', '사람', '누구', '현재', '메세지', '정말', '인생', '최고', '영화'...

# 단어 빈도수 구하기
# 각 원소의 출현 횟수를 계산하는 Counter 모듈을 활용한다.
from collections import Counter
counts = Counter(bucket_list)
print(counts)
>Counter({'영화': 1106, '노래': 349, '최고': 320, '음악': 216, '정말': 192...

# 명사 빈도 순서대로 상위 30개 출력
print(counts.most_common(30))
>[('영화', 1106), ('노래', 349), ('최고', 320), ('음악', 216), ('정말', 192),...

# 명사와 형용사를 모두 추출하고 상위 50개를 출력
bucket_list_2 = []
for my_sentence in sentences_tag:
  for word, tag in my_sentence:
    if tag in ['Noun','Adjective']:
      bucket_list_2.append(word)
counts = Counter(bucket_list)
print(counts.most_common(50))
>Counter({'영화': 1106, '노래': 349, '최고': 320, '음악': 216, '정말': 192, '감동': 191, '인생': 183,...

WordCloud
1.워드클라우드(word cloud)는 자주 나타나는 단어들의 크기를 상대적인 크기로 시각화 하여 보여줌으로써 많이 사용되는 단어의 영향력을 한 눈에 알 수 있도록 제공
2.필요한 라이브러리
- konlpy.tag(형태소 분석을 통해 명사 추출)
- pytagclud 패키지(워드클라우드 그림 그리기) 또는 wordcloud 1.8.1 패키지
- pytagclud : 교육용으로 좋음, 하지만 복잡함
- wordcloud : 함수로 쉽게 구현 가능(옵션 조절을 통해)
- wordcloud에 배경이미지를 넣으려면 마스킹 이미지를 만들어야 한다는 피곤함 있음(파워포인트를 사용하면 그나마 쉽게 만들 수 있음)
- ⛑ 주의!! : 인터넷에 있는 무료 폰트라고 해서, 절대로 막 사용하지 않기!, 개인적으로 집에서 볼 것이라면 사용해도 되지만, 다른 곳에 활용한다면 절대 쓰지 말기
WordCloud 패키지 활용
- 단어 구름을 형태를 가지는 그림으로 만들기
  1.이미지를 마스크로 사용하여 수행
  2.흰색이 아닌 위치에 단어를 생성함
- 과정
  1.마스크용 이미지 준비(포토샵, 구글이미지 검색, 파워포인트) (돌고래 jpg)
  2.크롤링된 문장이 저장된 txt파일 (휴잭맨 영화 리뷰)
  3.사용할 폰트파일 준비 (NanumBarunGothic)
  4.패키지 설치:형태소 분석(Okt),wordcloud(PIL,numpy)
  5.wordcloud로 생성 후 결과 이미지 생성

# pip install wordcloud

# 크롤링한 댓글 파일 가져와서 리스트 변수에 저장
from konlpy.tag import Okt
from wordcloud import WordCloud
from PIL import Image
import numpy as np

with open('./drive/MyDrive/machine_learning_data/Naver_opinion.txt','r') as f:
  reply = f.readlines()

# 크롤링한 댓글파일 가져와서 리스트 변수에 저장
# 반드시 문자 사이 공백을 넣어야지 wordcloud에서 분석 가능!!
ok_twitter = Okt()
text = ''
for sentence in reply:
  for noun in ok_twitter.nouns(sentence):
    text += noun+' '
print(text)
> 휴잭맨 매력 영화 다른 출연자 사람 누구 현재 메세지 정말 인생 최고 영화 남 기립박수 뻔 강추 노래 진짜 쩐다...

# 크롤링한 댓글파일 가져와서 리스트 변수에 저장
# 배경이 흰색인 마스크 이미지
mask_image = np.array(Image.open('./drive/MyDrive/machine_learning_data/dolphine.jpg'))
wc = WordCloud(
    font_path='./drive/MyDrive/machine_learning_data/NanumBarunGothic.ttf', # 사용할 폰트
    background_color='white', # 배경색
    max_words=100, # 최대 빈도수를 기준으로 출력할 단어 수
    mask=mask_image, # 마스크 이미지
    max_font_size=70, # 최대 폰트 크기
    colormap='hsv' # 컬러 스타일 ex)'Accent', 'Accent_r', 'Blues', 'Blues_r' 등등
).generate(text)
wc.to_file('./drive/MyDrive/machine_learning_data/wc_result.png')

# 다른 스타일 적용해보기(Accent_r)
from konlpy.tag import Okt
from wordcloud import WordCloud
from PIL import Image
import numpy as np

with open('./drive/MyDrive/machine_learning_data/Naver_opinion.txt','r') as f:
  reply = f.readlines()
ok_twitter = Okt()
text = ''
# 필요없는  단어는 아래와 같이 제외 시킬 수 있다.
for sentence in reply:
  for noun in ok_twitter.nouns(sentence):
    if noun not in ['그','아주','나','거','임','위','이','수','때','것','남','더']:
      text += noun+' '

mask_image = np.array(Image.open('./drive/MyDrive/machine_learning_data/dolphine.jpg'))
wc = WordCloud(
    font_path='./drive/MyDrive/machine_learning_data/NanumBarunGothic.ttf', # 사용할 폰트
    background_color='white', # 배경색
    max_words=100, # 최대 빈도수를 기준으로 출력할 단어 수
    mask=mask_image, # 마스크 이미지
    max_font_size=70, # 최대 폰트 크기
    colormap='Accent_r' # 컬러 스타일 ex)'Accent', 'Accent_r', 'Blues', 'Blues_r' 등등
).generate(text)

거친코딩

데이터 분석 유튜버 "거친코딩"입니다.

이전 포스트

[Python] Web Crawling - 실전(네이버 영화 리뷰)

다음 포스트

[Python] Web Crawling - 형태소 분석, WordCloud

누구나 할 수 있는 크롤링

파이썬을 활용한 웹 크롤링 #4

휴잭맨 영화 리뷰 분석 및 워드클라우드

[Python] Web Crawling - 실전(네이버 영화 리뷰)

[Python] Web Crawling - 공공데이터 크롤링

0개의 댓글

관련 채용 정보