- LSA paper: Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2-3), 259-284.
- LDA paper: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
- BERTopic paper: Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
With keyword matching alone, the word '중국집' matches only document 1, so document 2 drops out of the search results.
Let's use co-occurrence to draw out the latent meaning in the DTM.
Using co-occurrence information == using semantics!!
In the example above, '중국집' co-occurs with '짜장면' and '짬뽕' in document 1. From this we can infer that document 2 is probably about the same thing as well (see the sketch below).
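To make the intuition concrete, here is a minimal sketch on a hypothetical two-document toy DTM (the documents and counts are made up for illustration, not the corpus used later): after a rank-1 truncated SVD, document 2 picks up a nonzero weight in the '중국집' column purely through its co-occurring words.
import numpy as np
# Hypothetical toy DTM: rows = doc 1, doc 2; columns = '중국집', '짜장면', '짬뽕'.
# Keyword matching on '중국집' finds only doc 1 (doc 2 has a 0 in that column).
A_toy = np.array([[1, 1, 1],
                  [0, 1, 1]], dtype=float)
U, s, VT = np.linalg.svd(A_toy, full_matrices=False)
# Rank-1 reconstruction: keep only the largest singular value
A_hat = s[0] * np.outer(U[:, 0], VT[0, :])
print(A_hat.round(2))  # doc 2 now has a nonzero weight for '중국집' as well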
Dimensionality reduction: Truncated SVD
- Original matrix: A (the m x n DTM)
- Decomposed matrices: U, Σ, V
- Formula: A = U Σ V^T
- U: orthogonal matrix (m x m) - a matrix whose product with its own transpose is the identity matrix
- V: orthogonal matrix (n x n)
- Σ: rectangular diagonal matrix (m x n) - all elements off the main diagonal are 0
*Singular values of A: the diagonal entries of the diagonal matrix Σ (the s / S in the code below)
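A quick numerical check of these properties (a minimal sketch on a random matrix, not the DTM used later; the names M and rng are just for this example):
import numpy as np
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 9))               # random m x n matrix
U, s, VT = np.linalg.svd(M)               # full SVD
print(np.allclose(U @ U.T, np.eye(4)))    # U is orthogonal -> True
print(np.allclose(VT @ VT.T, np.eye(9)))  # V is orthogonal -> True
S = np.zeros((4, 9))
S[:4, :4] = np.diag(s)                    # rectangular diagonal matrix Σ
print(np.allclose(M, U @ S @ VT))         # A = U Σ V^T -> True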
Document-document similarity example
- Similarity between document 1 and document 2: compute the cosine similarity between arr[:,1] and arr[:,2] (a minimal sketch follows after this list)
- Check that documents 4 and 5 have a high similarity
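A minimal sketch of that computation, using a made-up document-topic matrix (here each document is a row; in the referenced example the document vectors are the columns arr[:, i]):
import numpy as np
def cosine_sim(a, b):
    # cosine similarity between two document vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Hypothetical document-topic matrix after truncated SVD (values are made up):
# one row per document, one column per latent topic.
doc_topic = np.array([[0.90, 0.10],
                      [0.80, 0.20],
                      [0.10, 0.90],
                      [0.20, 0.80],
                      [0.15, 0.85]])
print(cosine_sim(doc_topic[0], doc_topic[1]).round(3))  # docs 1 and 2: high
print(cosine_sim(doc_topic[3], doc_topic[4]).round(3))  # docs 4 and 5: high
print(cosine_sim(doc_topic[0], doc_topic[3]).round(3))  # docs 1 and 4: lower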
Ref :
- The yellow topic contains words such as gene, etc.
- The document on the right contains many words related to the yellow topic == it is very likely to belong to the yellow topic.
Goal: follow the document-generation process above in reverse and use the observed words w_{d,n} to estimate the latent variables (z, ϕ, θ)
( ~ look at the words to infer which topic distribution they came from, and look at those topics to infer each document's topic distribution; the generative process being inverted is sketched below)
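For reference, a minimal sketch of that generative story (the sizes and hyperparameter values below are assumed for illustration): sample a word distribution ϕ_k ~ Dir(β) per topic and a topic distribution θ_d ~ Dir(α) per document, then draw each word's topic z_{d,n} from θ_d and the word w_{d,n} from ϕ_{z_{d,n}}.
import numpy as np
rng = np.random.default_rng(0)
K, V, N = 3, 10, 8                      # topics, vocabulary size, words in one document (assumed)
alpha, beta = 0.1, 0.1                  # Dirichlet hyperparameters (assumed)
phi = rng.dirichlet([beta] * V, size=K) # phi_k : word distribution of topic k
theta = rng.dirichlet([alpha] * K)      # theta_d : topic distribution of one document
doc = []
for n in range(N):
    z = rng.choice(K, p=theta)          # z_{d,n} : topic of the n-th word
    w = rng.choice(V, p=phi[z])         # w_{d,n} : word drawn from topic z
    doc.append((z, w))
print(doc)                              # (topic, word id) pairs for one generated document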
Collapsed Gibbs sampling
Process
0) Goal: compute p(z_{1,2}) (find the topic of the 2nd word, 'trade', in the 1st document)
1) Randomly initialize z_i and ϕ_k, and fix the number of topics k
2) Use Gibbs sampling to compute p(z_{1,2})
Delete only the target word's ('trade') information and use only the remaining assignments
A : how strongly the d-th document is associated with the k-th topic
Example: topic-wise association of the 1st document (A)
B : how strongly the n-th word of the d-th document (w_{d,n}) is associated with the k-th topic
Example: topic-wise association of the word 'trade' (B)
p(z_{1,2} = k) ∝ A × B (proportional to the product, rather than equal to it)
3) Repeat step 2 for every word over many iterations (a minimal sketch of the full sampler follows below)
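A minimal sketch of the sampler described above (the function name, default hyperparameters, and toy corpus are assumptions for illustration, not from the source). A is the document-topic count plus α, B is the smoothed word-topic proportion, and the new topic is drawn with probability proportional to A × B.
import numpy as np
def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    # docs: list of documents, each a list of word ids in [0, V)
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))  # document-topic counts (the "A" part)
    n_kw = np.zeros((K, V))          # topic-word counts (the "B" part)
    n_k = np.zeros(K)                # total words assigned to each topic
    # 1) random initialization of the topic assignments z
    z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # 3) repeat step 2 for every word, many times
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # 2) remove the target word's information from the counts
                k = z[d][n]
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                A = n_dk[d] + alpha                          # document-topic association
                B = (n_kw[:, w] + beta) / (n_k + V * beta)   # word-topic association
                p = A * B
                k = int(rng.choice(K, p=p / p.sum()))        # sample a new topic ∝ A × B
                z[d][n] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z
# toy usage: 2 documents over a vocabulary of 5 word ids, K=2 topics
print(collapsed_gibbs_lda([[0, 1, 2, 1], [3, 4, 3, 0]], V=5, K=2))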
Pros: fairly robust even when its assumptions are violated (not very sensitive to such changes).
Source: https://mambo-coding-note.tistory.com/205 [화학쟁이의 ㎚코딩노트, Tistory]
Cons:
=> QDA (Quadratic Discriminant Analysis) is said to have been introduced to compensate for the drawbacks (note: QDA extends Linear Discriminant Analysis, a different "LDA" from Latent Dirichlet Allocation).
https://heeya-stupidbutstudying.tistory.com/entry/DL-keyword-extraction-with-KeyBERT-%EA%B0%9C%EC%9A%94%EC%99%80-%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98-1
KeyBERT is by the same developer as BERTopic.
import numpy as np
# Define the DTM
# Document-Term Matrix (DTM)
A = np.array([[0,0,0,1,0,1,1,0,0],
              [0,0,0,1,1,0,1,0,0],
              [0,1,1,0,2,0,0,0,0],
              [1,0,0,0,0,0,0,1,1]])
print('DTM shape :', np.shape(A))
DTM shape : (4, 9)
"""
U, S, V
linalg.svd()
"""
U, s, VT = np.linalg.svd(A, full_matrices=True)
print(f'Matrix U : [shape: {np.shape(U)}]')
print(U.round(2))
Matrix U : [shape: (4, 4)]
[[-0.24 0.75 0. -0.62]
[-0.51 0.44 -0. 0.74]
[-0.83 -0.49 -0. -0.27]
[-0. -0. 1. 0. ]]
# list of singular values => build the diagonal matrix from it
print(f'Singular value vector s : [shape: {np.shape(s)}]')
print(s.round(2))
print()
S = np.zeros((4, 9)) # create a 4 x 9 zero matrix (the shape of the diagonal matrix)
S[:4, :4] = np.diag(s) # place the singular values on the diagonal
print(f'Diagonal matrix S : [shape: {np.shape(S)}]')
print(S.round(2))
Singular value vector s : [shape: (4,)]
[2.69 2.05 1.73 0.77]
Diagonal matrix S : [shape: (4, 9)]
[[2.69 0. 0. 0. 0. 0. 0. 0. 0. ]
[0. 2.05 0. 0. 0. 0. 0. 0. 0. ]
[0. 0. 1.73 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0.77 0. 0. 0. 0. 0. ]]
print(f'Orthogonal matrix VT : [shape: {np.shape(VT)}]')
print(VT.round(2))
Orthogonal matrix VT : [shape: (9, 9)]
[[-0. -0.31 -0.31 -0.28 -0.8 -0.09 -0.28 -0. -0. ]
[ 0. -0.24 -0.24 0.58 -0.26 0.37 0.58 -0. -0. ]
[ 0.58 -0. 0. 0. -0. 0. -0. 0.58 0.58]
[ 0. -0.35 -0.35 0.16 0.25 -0.8 0.16 -0. -0. ]
[-0. -0.78 -0.01 -0.2 0.4 0.4 -0.2 0. 0. ]
[-0.29 0.31 -0.78 -0.24 0.23 0.23 0.01 0.14 0.14]
[-0.29 -0.1 0.26 -0.59 -0.08 -0.08 0.66 0.14 0.14]
[-0.5 -0.06 0.15 0.24 -0.05 -0.05 -0.19 0.75 -0.25]
[-0.5 -0.06 0.15 0.24 -0.05 -0.05 -0.19 -0.25 0.75]]
# (4,4) X (4,9) X (9,9)
print(f'USV^T : [shape: {np.shape(np.dot(np.dot(U,S), VT))}]')
print(f'DTM: [shape: {np.shape(A)}]')
np.allclose(A, np.dot(np.dot(U,S), VT).round(2))
USV^T : [shape: (4, 9)]
DTM: [shape: (4, 9)]
True
S = S[:2,:2] # keep only the top 2 singular values
U = U[:,:2]
VT = VT[:2,:]
print(f'Matrix U : [shape: {np.shape(U)}]')
print(f'Diagonal matrix S : [shape: {np.shape(S)}]')
print(f'Orthogonal matrix VT : [shape: {np.shape(VT)}]')
Matrix U : [shape: (4, 2)]
Diagonal matrix S : [shape: (2, 2)]
Orthogonal matrix VT : [shape: (2, 9)]
# (4,2) X (2,2) X (2,9)
# U 4x2 : number of documents x number of topics t => the 9 word dimensions are dropped, the 4 documents are kept; each document is now represented by 2 values
# VT 2x9 : t x number of words => word vectors expressing the latent meaning
print(f'Reduced USV^T : [shape: {np.shape(np.dot(np.dot(U,S), VT))}]')
print(f'DTM: [shape: {np.shape(A)}]')
np.allclose(A, np.dot(np.dot(U,S), VT).round(2))
Reduced USV^T : [shape: (4, 9)]
DTM: [shape: (4, 9)]
False
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the NLTK stopwords corpus is not yet installed
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
### data load
dataset = fetch_20newsgroups(shuffle=True,
                             random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data
print('Number of samples:', len(documents))
print(dataset.target_names, len(dataset.target_names)) # topic categories
documents[1]
Number of samples: 11314
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'] 20
"\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism? No, you need a little leap of faith, Jimmy. Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim. And I'm sorry that you have these feelings of\ndenial about the faith you need to get by. Oh well, just pretend that it will\nall end happily ever after anyway. Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim. Don't forget your Flintstone's Chewables! :) \n--\nBake Timmons, III"
### data preprocessing
news_df = pd.DataFrame({'document':documents})
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z]", " ", regex=True) # remove non-alphabetic characters
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3])) # drop words of length 3 or less
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())
news_df['clean_doc'][1]
stop_words = stopwords.words('english') # stopwords
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
print(tokenized_doc[1])
# detokenize (undo the tokenization)
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)
news_df['clean_doc'] = detokenized_doc
news_df['clean_doc'][1]
['yeah', 'expect', 'people', 'read', 'actually', 'accept', 'hard', 'atheism', 'need', 'little', 'leap', 'faith', 'jimmy', 'logic', 'runs', 'steam', 'sorry', 'pity', 'sorry', 'feelings', 'denial', 'faith', 'need', 'well', 'pretend', 'happily', 'ever', 'anyway', 'maybe', 'start', 'newsgroup', 'atheist', 'hard', 'bummin', 'much', 'forget', 'flintstone', 'chewables', 'bake', 'timmons']
'yeah expect people read actually accept hard atheism need little leap faith jimmy logic runs steam sorry pity sorry feelings denial faith need well pretend happily ever anyway maybe start newsgroup atheist hard bummin much forget flintstone chewables bake timmons'
news_df.head(2)
| | document | clean_doc |
|---|---|---|
| 0 | Well i'm not sure about the story nad it did s... | well sure story seem biased disagree statement... |
| 1 | \n\n\n\n\n\n\nYeah, do you expect people to re... | yeah expect people read actually accept hard a... |
### TF-IDF
vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=1000, # keep the top 1,000 terms
                             max_df=0.5,
                             smooth_idf=True)
X = vectorizer.fit_transform(news_df['clean_doc'])
print('TF-IDF matrix shape :', X.shape) ## == DTM
TF-IDF matrix shape : (11314, 1000)
### Topic modeling
"""
TruncatedSVD(n_components=2, *, algorithm='randomized', n_iter=5, random_state=None,tol=0.0,)
- n_components : Desired dimensionality of output data.
- algorithm : SVD solver to use. {'arpack', 'randomized'}, default='randomized'
- n_iter : Number of iterations for randomized SVD solver
- tol : Tolerance for ARPACK. 0 means machine precision.
"""
svd_model = TruncatedSVD(n_components=20, # topic 수
algorithm='randomized',
n_iter=100,
random_state=122)
svd_model.fit(X)
print("topic 수 : ",len(svd_model.components_))
print(np.shape(svd_model.components_))
topic 수 : 20
(20, 1000)
terms = vectorizer.get_feature_names_out() # vocabulary; the 1,000 selected terms
def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(5)) for i in topic.argsort()[:-n - 1:-1]])
get_topics(svd_model.components_, terms)
Topic 1: [('like', 0.21386), ('know', 0.20046), ('people', 0.19293), ('think', 0.17805), ('good', 0.15128)]
Topic 2: [('thanks', 0.32888), ('windows', 0.29088), ('card', 0.18069), ('drive', 0.17455), ('mail', 0.15111)]
Topic 3: [('game', 0.37064), ('team', 0.32443), ('year', 0.28154), ('games', 0.2537), ('season', 0.18419)]
Topic 4: [('drive', 0.53324), ('scsi', 0.20165), ('hard', 0.15628), ('disk', 0.15578), ('card', 0.13994)]
Topic 5: [('windows', 0.40399), ('file', 0.25436), ('window', 0.18044), ('files', 0.16078), ('program', 0.13894)]
Topic 6: [('chip', 0.16114), ('government', 0.16009), ('mail', 0.15625), ('space', 0.1507), ('information', 0.13562)]
Topic 7: [('like', 0.67086), ('bike', 0.14236), ('chip', 0.11169), ('know', 0.11139), ('sounds', 0.10371)]
Topic 8: [('card', 0.46633), ('video', 0.22137), ('sale', 0.21266), ('monitor', 0.15463), ('offer', 0.14643)]
Topic 9: [('know', 0.46047), ('card', 0.33605), ('chip', 0.17558), ('government', 0.1522), ('video', 0.14356)]
Topic 10: [('good', 0.42756), ('know', 0.23039), ('time', 0.1882), ('bike', 0.11406), ('jesus', 0.09027)]
Topic 11: [('think', 0.78469), ('chip', 0.10899), ('good', 0.10635), ('thanks', 0.09123), ('clipper', 0.07946)]
Topic 12: [('thanks', 0.36824), ('good', 0.22729), ('right', 0.21559), ('bike', 0.21037), ('problem', 0.20894)]
Topic 13: [('good', 0.36212), ('people', 0.33985), ('windows', 0.28385), ('know', 0.26232), ('file', 0.18422)]
Topic 14: [('space', 0.39946), ('think', 0.23258), ('know', 0.18074), ('nasa', 0.15174), ('problem', 0.12957)]
Topic 15: [('space', 0.31613), ('good', 0.3094), ('card', 0.22603), ('people', 0.17476), ('time', 0.14496)]
Topic 16: [('people', 0.48156), ('problem', 0.19961), ('window', 0.15281), ('time', 0.14664), ('game', 0.12871)]
Topic 17: [('time', 0.34465), ('bike', 0.27303), ('right', 0.25557), ('windows', 0.1997), ('file', 0.19118)]
Topic 18: [('time', 0.5973), ('problem', 0.15504), ('file', 0.14956), ('think', 0.12847), ('israel', 0.10903)]
Topic 19: [('file', 0.44163), ('need', 0.26633), ('card', 0.18388), ('files', 0.17453), ('right', 0.15448)]
Topic 20: [('problem', 0.33006), ('file', 0.27651), ('thanks', 0.23578), ('used', 0.19206), ('space', 0.13185)]
tokenized_doc[:5]
0 [well, sure, story, seem, biased, disagree, st...
1 [yeah, expect, people, read, actually, accept,...
2 [although, realize, principle, strongest, poin...
3 [notwithstanding, legitimate, fuss, proposal, ...
4 [well, change, scoring, playoff, pool, unfortu...
Name: clean_doc, dtype: object
from sklearn.decomposition import LatentDirichletAllocation
# sklearn's LDA implementation (shown for reference; the gensim LdaModel is actually used below)
lda_model = LatentDirichletAllocation(n_components=10,
                                      learning_method='online',
                                      random_state=777,
                                      max_iter=1)
# 1) Integer encoding and building the dictionary
from gensim import corpora
dictionary = corpora.Dictionary(tokenized_doc)
corpus = [dictionary.doc2bow(text) for text in tokenized_doc]
print("Encoded corpus[1] example: \n", corpus[1]) # the second news article; the first document has index 0
print("dictionary[66] example:", dictionary[66])
print("Total vocabulary size:", len(dictionary))
Encoded corpus[1] example: 
 [(52, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 2), (67, 1), (68, 1), (69, 1), (70, 1), (71, 2), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 2), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 2), (86, 1), (87, 1), (88, 1), (89, 1)]
dictionary[66] example: faith
Total vocabulary size: 64281
# 2) Train the LDA model
import gensim
NUM_TOPICS = 20 # 20 topics, k=20
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)
(0, '0.046*"thanks" + 0.038*"please" + 0.038*"anyone" + 0.034*"would"')
(1, '0.020*"game" + 0.018*"team" + 0.014*"year" + 0.014*"games"')
(2, '0.013*"people" + 0.009*"said" + 0.007*"government" + 0.006*"armenian"')
(3, '0.011*"university" + 0.010*"health" + 0.007*"medical" + 0.007*"national"')
(4, '0.013*"jesus" + 0.008*"christian" + 0.007*"bible" + 0.007*"believe"')
(5, '0.017*"cover" + 0.014*"rider" + 0.011*"copies" + 0.010*"swap"')
(6, '0.019*"would" + 0.013*"think" + 0.012*"like" + 0.010*"know"')
(7, '0.011*"soon" + 0.010*"pitt" + 0.010*"banks" + 0.010*"radar"')
(8, '0.011*"cars" + 0.011*"engine" + 0.010*"ground" + 0.009*"water"')
(9, '0.020*"space" + 0.008*"nasa" + 0.005*"science" + 0.005*"earth"')
(10, '0.016*"available" + 0.014*"software" + 0.012*"version" + 0.010*"image"')
(11, '0.030*"turkish" + 0.023*"turkey" + 0.015*"germany" + 0.014*"german"')
(12, '0.012*"information" + 0.011*"encryption" + 0.010*"public" + 0.009*"security"')
(13, '0.027*"color" + 0.027*"monitor" + 0.014*"screen" + 0.014*"colors"')
(14, '0.038*"israel" + 0.023*"israeli" + 0.022*"jews" + 0.015*"arab"')
(15, '0.016*"card" + 0.013*"scsi" + 0.010*"memory" + 0.010*"chip"')
(16, '0.013*"bike" + 0.009*"left" + 0.007*"right" + 0.007*"ride"')
(17, '0.042*"drive" + 0.023*"disk" + 0.016*"hard" + 0.014*"sale"')
(18, '0.012*"clemens" + 0.008*"runner" + 0.007*"catcher" + 0.007*"invaded"')
(19, '0.023*"file" + 0.012*"program" + 0.011*"window" + 0.010*"output"')
# 3) LDA visualization - word distribution per topic
#! pip install pyLDAvis
"""
The distance between the circles shows how different the topics are from one another.
If two circles overlap, those two topics are similar.
"""
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis)
# 4) Topic distribution per document - first 5 documents
for i, topic_list in enumerate(ldamodel[corpus]):
    if i==5:
        break
    print(f'Document {i} topic distribution:', topic_list)
# (topic id, proportion of the document assigned to that topic)
Document 0 topic distribution: [(2, 0.42456737), (5, 0.017701633), (6, 0.30507037), (7, 0.11740385), (11, 0.025931243), (14, 0.09783709)]
Document 1 topic distribution: [(4, 0.19281802), (6, 0.56402504), (8, 0.034390826), (16, 0.18770018)]
Document 2 topic distribution: [(2, 0.08864143), (3, 0.037529983), (6, 0.5558445), (10, 0.020978697), (11, 0.063511066), (14, 0.22198538)]
Document 3 topic distribution: [(2, 0.12930709), (6, 0.44330305), (7, 0.06343465), (12, 0.28526807), (17, 0.06693932)]
Document 4 topic distribution: [(1, 0.36722812), (6, 0.53172415), (19, 0.06954694)]
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
print('Total number of documents :', len(docs))
Total number of documents : 18846
model = BERTopic(nr_topics = 10)
topics, probabilities = model.fit_transform(docs)
print('Length of the per-document topic list :', len(topics))
print('Topic of the second document :', topics[1])
Length of the per-document topic list : 18846
Topic of the second document : -1
model.get_topic_info()
| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 13065 | -1_the_to_of_and |
| 1 | 0 | 1852 | 0_the_to_in_and |
| 2 | 1 | 649 | 1_the_to_of_and |
| 3 | 2 | 512 | 2_the_of_to_in |
| 4 | 3 | 479 | 3_the_car_and_it |
| 5 | 4 | 460 | 4_drive_the_scsi_drives |
| 6 | 5 | 456 | 5_ites_cheek_why_yep |
| 7 | 6 | 441 | 6_for_and_the_to |
| 8 | 7 | 332 | 7_the_to_they_that |
| 9 | 8 | 323 | 8_bike_the_to_and |
| 10 | 9 | 277 | 9_for_and_the_to |
model.get_topic(5)
[('ites', 0.7993813137543541),
('cheek', 0.7594094963882347),
('why', 0.6515218145793125),
('yep', 0.6189834540650221),
('huh', 0.5778063889614978),
('ken', 0.5496622938962114),
('luck', 0.5035041768627133),
('forget', 0.49035217536106335),
('art', 0.4862345814829837),
('lets', 0.4355922187643489)]
new_doc = docs[0]
print(new_doc)
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
topics, probs = model.transform([new_doc])
print('Predicted topic number :', topics)
Predicted topic number : [0]
import numpy as np
import itertools
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
doc = """
Supervised learning is the machine learning task of
learning a function that maps an input to an output based
on example input-output pairs.[1] It infers a function
from labeled training data consisting of a set of
training examples.[2] In supervised learning, each
example is a pair consisting of an input object
(typically a vector) and a desired output value (also
called the supervisory signal). A supervised learning
algorithm analyzes the training data and produces an
inferred function, which can be used for mapping new
examples. An optimal scenario will allow for the algorithm
to correctly determine the class labels for unseen
instances. This requires the learning algorithm to
generalize from the training data to unseen situations
in a 'reasonable' way (see inductive bias).
"""
# encoding
# extract candidate phrases: trigrams (groups of 3 words)
n_gram_range = (3, 3)
stop_words = "english"
count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([doc])
candidates = count.get_feature_names_out()
print('Number of trigrams :', len(candidates))
print('First five trigrams :', candidates[:5])
Number of trigrams : 72
First five trigrams : ['algorithm analyzes training' 'algorithm correctly determine'
 'algorithm generalize training' 'allow algorithm correctly'
 'analyzes training data']
# embed the document and the candidate keywords
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)
print(doc_embedding.shape) # document embedding
print(candidate_embeddings.shape) # candidate embeddings
(1, 768)
(72, 768)
# extract the keywords most similar to the document
top_n = 5
distances = cosine_similarity(doc_embedding, candidate_embeddings)
keywords = [candidates[index] for index in distances.argsort()[0][-top_n:]]
print(keywords)
['algorithm analyzes training', 'learning algorithm generalize', 'learning machine learning', 'learning algorithm analyzes', 'algorithm generalize training']
def max_sum_sim(doc_embedding, candidate_embeddings, words, top_n, nr_candidates):
    # similarity between the document and each keyword
    distances = cosine_similarity(doc_embedding, candidate_embeddings)
    # similarity between the keywords themselves
    distances_candidates = cosine_similarity(candidate_embeddings,
                                             candidate_embeddings)
    # pick the nr_candidates keywords with the highest cosine similarity to the document
    words_idx = list(distances.argsort()[0][-nr_candidates:])
    words_vals = [words[index] for index in words_idx]
    distances_candidates = distances_candidates[np.ix_(words_idx, words_idx)]
    # among those, find the combination of top_n keywords that are least similar to each other
    min_sim = np.inf
    candidate = None
    for combination in itertools.combinations(range(len(words_idx)), top_n):
        sim = sum([distances_candidates[i][j] for i in combination for j in combination if i != j])
        if sim < min_sim:
            candidate = combination
            min_sim = sim
    return [words_vals[idx] for idx in candidate]
max_sum_sim(doc_embedding, candidate_embeddings, candidates, top_n=5, nr_candidates=10)
['requires learning algorithm',
'signal supervised learning',
'learning function maps',
'algorithm analyzes training',
'learning machine learning']
def mmr(doc_embedding, candidate_embeddings, words, top_n, diversity):
    # similarity between the document and each keyword
    word_doc_similarity = cosine_similarity(candidate_embeddings, doc_embedding)
    # similarity between the keywords themselves
    word_similarity = cosine_similarity(candidate_embeddings)
    # index of the keyword most similar to the document
    # e.g. if candidate 2 has the highest similarity, keywords_idx = [2]
    keywords_idx = [np.argmax(word_doc_similarity)]
    # indices of all remaining candidates, excluding the one just selected
    # e.g. if candidate 2 was selected,
    # ==> candidates_idx = [0, 1, 3, 4, 5, 6, 7, 8, 9, 10, ...]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]
    # the best keyword is already selected, so repeat top_n - 1 more times
    # e.g. if top_n = 5, the loop below runs 4 times
    for _ in range(top_n - 1):
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)
        # compute MMR
        mmr = (1 - diversity) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr)]
        # update keywords & candidates
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)
    return [words[idx] for idx in keywords_idx]
mmr(doc_embedding, candidate_embeddings, candidates, top_n=5, diversity=0.2)
['algorithm generalize training',
'supervised learning algorithm',
'learning machine learning',
'learning algorithm analyzes',
'learning algorithm generalize']