💭 KR-WordRank

서은서 · August 10, 2023

💡 While building a system that recommends books from Kyobo Book Centre, I wanted to extract meaningful keywords from the book descriptions, and that is how I came across WordRank. Here is a summary of the methods I looked into for extracting meaningful keywords from Korean text!


Text Summarization

Text summarization is broadly divided into extractive summarization and abstractive summarization.

Extractive summarization builds the summary by pulling out, unchanged, the sentences of the original text that are most important or central.
Abstractive summarization builds the summary by generating new words and new sentences.


✍🏻 Overview

TextRank is the most representative algorithm for extractive summarization, and it builds on PageRank.
🔗 Related paper : http://infolab.stanford.edu/~backrub/google.html
🔗 TextRank : https://sungmooncho.com/2012/08/26/pagerank/

📄 PageRank

PageRank is an algorithm that ranks hyperlinked web pages by how often they are referenced, how much traffic flows into them, and so on.

Instead of giving every link from another page the same weight, it 'normalizes' each link by the number of links on the page it comes from.

PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  • PR: short for PageRank.
  • PR(A): the PageRank of web page A.
  • T1, T2, ..., Tn: the other pages that link to page A.
  • PR(T1): the PageRank value of page T1.
  • d: the damping factor, i.e. the probability that a random web surfer who is not satisfied with the current page clicks a link to another page.
  • C(T1): the total number of links that page T1 contains.
  • N: the total number of pages.

For example, the PageRank of A gets higher as the PageRank values of the pages that link to A get higher (that is, as those pages get more important).

💡 Why normalize?

▶︎ Because the score is not a plain sum of PageRank values. For example, even if T1 has a high PageRank, its contribution to each page it links to shrinks if it contains thousands of links (that is, if C(T1) is large).
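To make the formula concrete, here is a minimal sketch that iterates it on a tiny link graph. The function name `pagerank`, the fixed iteration count, and the three-page toy graph are assumptions made for illustration; they do not come from the paper. The sketch also assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d)/N + d * sum(PR(T)/C(T)) over a small link graph.

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links to `page` contributes PR(T) / C(T),
            # i.e. its score normalized by its number of outgoing links.
            incoming = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) / n + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A and C both link to B, and B links back to A.
toy_links = {"A": ["B"], "B": ["A"], "C": ["B"]}
scores = pagerank(toy_links)
```

In this toy graph nothing links to C, so C keeps only the base score (1-d)/N, while B, which receives links from both A and C, ends up ranked highest.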


🗣️ TextRank

TextRank applies the PageRank algorithm with the notion of a page replaced by the notion of a word. In other words, it computes how strongly each word (or sentence) in a text is related to the others.
TextRank is a graph-based ranking model, developed from the idea that graph-based ranking is very effective for extractive summarization of a passage.

🔗 Paper : https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
🔗 Reference site : https://www.dinolabs.ai/288

🏅 How are the ranks decided?

▶︎ According to the paper, TextRank relies on the ideas of 'voting' and 'recommendation'. When one word (vertex) is linked to another word (vertex), it is regarded as casting a vote for that vertex. A vertex that receives many votes therefore becomes more important, and the votes it receives are what determine its rank.
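The voting idea can be sketched as follows. This is a simplified illustration, not the procedure from the paper: it links words that appear next to each other in an already tokenized text and then lets linked words vote for each other with a PageRank-style update. The function name `textrank_keywords`, the window size, the iteration count, and the toy sentence are all assumptions.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, d=0.85, iterations=30):
    """Rank words by letting co-occurring words 'vote' for each other."""
    # Link every pair of distinct words that appear within `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # A word's new score is the sum of its neighbors' votes, where each
        # neighbor splits its own score evenly among the words it links to.
        scores = {
            word: (1 - d)
            + d * sum(scores[v] / len(neighbors[v]) for v in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy token list in which "graph" co-occurs with most of the other words.
tokens = "graph ranking uses graph votes and graph links".split()
ranked = textrank_keywords(tokens)
```

Because "graph" is linked to every other word in the toy sentence, it collects the most votes and comes out on top of `ranked`.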


📄 KR-WordRank

WordRank was proposed for unsupervised word segmentation of Japanese and Chinese, and it does not produce good results when applied to Korean as is.

WordRank

WordRank is a method proposed for extracting words with a graph ranking algorithm from Chinese and Japanese, which are written without spaces. It first builds a substring graph and then runs a graph ranking algorithm on it.
The substring graph is built as in panels (a) and (b) of the figure below.

[Figure: substring graph construction, panels (a) and (b)]


When the WordRank algorithm is applied to Korean, single characters end up with high rankings. A single Korean character is often a word by itself, and many single characters serve as determiners or particles and so appear as words. This is why an algorithm that reflects the characteristics of Korean is needed.

KR-WordRank

Korean needs to make use of spacing information. Without it, substrings that straddle the boundary between two eojeols (space-separated units) are also included among the word candidates.

substring('이번봄에는') = [이번, 번봄, 봄에, 에는, 이번봄, 번봄에, ...]
substring('이번 봄에는') = [이번, 봄에, 에는, 봄에는]
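The two lists above can be reproduced with a short sketch. The helper name `candidate_substrings`, its parameters, and the minimum length of two characters are assumptions made for this illustration; this is not how the KR-WordRank package builds its candidates.

```python
def candidate_substrings(text, use_spacing=True, min_length=2):
    """List every substring of length >= min_length as a word candidate."""
    # With spacing, each eojeol (space-separated unit) is handled on its own,
    # so no candidate can straddle a space.
    units = text.split() if use_spacing else [text.replace(" ", "")]
    candidates = []
    for unit in units:
        for length in range(min_length, len(unit) + 1):
            for start in range(len(unit) - length + 1):
                candidates.append(unit[start : start + length])
    return candidates


with_spacing = candidate_substrings("이번 봄에는")
# -> ['이번', '봄에', '에는', '봄에는']
without_spacing = candidate_substrings("이번 봄에는", use_spacing=False)
```

Ignoring the space lets cross-boundary fragments such as '번봄' slip into `without_spacing`, which is exactly the noise that the spacing information removes.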

A characteristic of Korean is that the characters on the left side of an eojeol form the words that carry meaning, while the characters on the right side are particles and endings that serve grammatical functions. The word dictionary is therefore built from the left-hand parts of eojeols.
The WordRank algorithm is also capable of keyword extraction. A highly ranked node is not only a word but also a word that appears frequently in the dataset, so it can be used as a keyword that summarizes the data.

keyword extraction : the task of finding the important terms or phrases that best represent the source document

🔍 Overall process

💻 Using KR-WordRank

  • Each entry of df['책소개'] (the book description column) is a single string that contains '\n', so it has to be split into separate sentences.
text_list = []
for text in df['책소개']:
    # Split one book description into a list of lines (sentences)
    text_list.append(text.split('\n'))

text_list[:2]
  • Use the normalize function to remove unnecessary special symbols.
    e.g. '▶︎', '!', '★', etc.

from krwordrank.hangle import normalize
# Keep Hangul plus English letters and digits; other symbols are removed
texts = [[normalize(sentence, english=True, number=True) for sentence in sentences] for sentences in text_list]
texts[:2]
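To show what this cleaning step roughly does, here is a small standard-library sketch. It is not the package's implementation of normalize; `simple_normalize` and its regular expression are assumptions that only approximate the effect of keeping Hangul, English letters, and digits.

```python
import re


def simple_normalize(text):
    """Keep Hangul syllables, English letters and digits; drop everything else."""
    cleaned = re.sub(r"[^가-힣a-zA-Z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the removed symbols.
    return re.sub(r"\s+", " ", cleaned).strip()


sample = simple_normalize("★ 이번 봄에는 Python3 공부!")
# -> '이번 봄에는 Python3 공부'
```

The symbols '★' and '!' are replaced by spaces, and the leftover spaces are then collapsed.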
  • Add texts to df as a column named '책소개 전처리' (preprocessed book description).
  • Use the KRWordRank package to extract the important keywords from the '책소개 전처리' column and add them to the existing DataFrame as the '책소개 키워드 수정본' (revised book description keywords) column.
from krwordrank.word import KRWordRank
from konlpy.tag import Okt

def KeyWord(x):
    wordrank_extractor = KRWordRank(
        min_count=1,    # minimum word frequency (when building the graph)
        max_length=10,  # maximum word length
        verbose=True
    )

    beta = 0.85    # PageRank decaying factor beta
    max_iter = 20
    keywords, rank, graph = wordrank_extractor.extract(x, beta, max_iter)

    # Keep at most the 30 highest-ranked words whose score is at least 1
    word_list = []
    for word, r in sorted(keywords.items(), key=lambda item: item[1], reverse=True)[:30]:
        if r >= 1:
            word_list.append(word)

    # Join the words into a single sentence
    sent = ' '.join(word_list)

    # Morphological analysis (part-of-speech tagging)
    okt = Okt()
    tagged = okt.pos(sent)

    # Keep everything except particles (Josa) and suffixes (Suffix) as keywords
    keyword_list = []
    for word, tag in tagged:
        if tag not in ('Josa', 'Suffix'):
            keyword_list.append(word)
    return keyword_list

# Extract keywords from the preprocessed column described above
df['책소개 키워드 수정본'] = df['책소개 전처리'].apply(KeyWord)

🔗 Reference : https://lovit.github.io/nlp/2018/04/16/krwordrank/
