A Recommendation System Using Cosine Similarity

허허맨 · August 1, 2025

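1. What is cosine similarity?

Definition: a measure of how closely the directions of two vectors align
Value range: -1 to 1
- 1 → exactly the same direction (maximum similarity)
- 0 → 90° apart (orthogonal; the vectors have nothing in common)
- -1 → exactly opposite directions
For count-based vectors such as DTM or TF-IDF, every component is non-negative, so the value stays between 0 and 1
For document similarity: convert each document to a vector (DTM, TF-IDF, etc.) → compute the angle between the vectors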

Formula

\text{cos\_sim}(A, B) = \frac{A \cdot B}{||A|| \cdot ||B||}
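
As a worked instance of the formula, take two illustrative vectors A = (0, 1, 1, 1) and B = (1, 0, 1, 1), chosen here only as an example. Writing the dot product and the norms out per component gives:

```latex
\text{cos\_sim}(A, B)
  = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}
  = \frac{0\cdot 1 + 1\cdot 0 + 1\cdot 1 + 1\cdot 1}
         {\sqrt{0^2 + 1^2 + 1^2 + 1^2}\,\sqrt{1^2 + 0^2 + 1^2 + 1^2}}
  = \frac{2}{\sqrt{3}\,\sqrt{3}}
  = \frac{2}{3} \approx 0.67
```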

2. Simple example

import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
    return dot(A, B) / (norm(A) * norm(B))

doc1 = np.array([0,1,1,1])
doc2 = np.array([1,0,1,1])
doc3 = np.array([2,0,2,2])

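# Format to two decimals so the printed values match the output below
print(f"Document 1 vs Document 2: {cos_sim(doc1, doc2):.2f}")
print(f"Document 1 vs Document 3: {cos_sim(doc1, doc3):.2f}")
print(f"Document 2 vs Document 3: {cos_sim(doc2, doc3):.2f}")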

Output

Document 1 vs Document 2: 0.67
Document 1 vs Document 3: 0.67
Document 2 vs Document 3: 1.00

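💡 Key points

Document 2 and Document 3 have exactly the same word proportions (doc3 = 2 × doc2) → cosine similarity is 1
Even when document lengths differ, the similarity is 1 as long as the direction (pattern) is the same → the measure is unaffected by document length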
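
The length-invariance point can be checked numerically. The sketch below is a minimal illustration, not part of the original example: it scales one vector by several positive factors and also shows the equivalent view of normalizing both vectors to unit length and taking a plain dot product. The helper `unit` and the scale factors are assumptions chosen for illustration.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def unit(v):
    """Rescale v to length 1 (hypothetical helper for this sketch)."""
    return v / np.linalg.norm(v)

doc1 = np.array([0, 1, 1, 1])
doc2 = np.array([1, 0, 1, 1])

# Multiplying doc2 by a positive factor makes the "document" longer
# but does not change its direction, so the similarity stays the same.
sims = [cos_sim(doc1, k * doc2) for k in (1, 2, 10, 1000)]
print([round(s, 4) for s in sims])

# Equivalent view: normalize both vectors first, then take the dot product.
dot_of_units = float(np.dot(unit(doc1), unit(doc2)))
print(round(dot_of_units, 4))
```

Because only the direction matters, a long document and a short one with the same word proportions are treated as identical, which may or may not be what a given application wants.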

3. DTM & TF-IDF + cosine similarity example

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Document data
corpus = [
    'you know I want your love',
    'I like you',
    'what should I do'
]

# (1) DTM
vector = CountVectorizer()
dtm = vector.fit_transform(corpus).toarray()
terms = vector.get_feature_names_out()
df_dtm = pd.DataFrame(dtm, columns=terms)

print("📌 DTM\n", df_dtm)

# (2) TF-IDF
tfidfv = TfidfVectorizer()
tfidf = tfidfv.fit_transform(corpus).toarray()
terms_tfidf = tfidfv.get_feature_names_out()
df_tfidf = pd.DataFrame(tfidf, columns=terms_tfidf)

print("\n📌 TF-IDF\n", df_tfidf.round(3))

# (3) Compute cosine similarity
cos_sim_dtm = cosine_similarity(dtm, dtm)
cos_sim_tfidf = cosine_similarity(tfidf, tfidf)

print("\n📌 Cosine similarity (DTM-based)\n", cos_sim_dtm.round(3))
print("\n📌 Cosine similarity (TF-IDF-based)\n", cos_sim_tfidf.round(3))
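
One detail helps when reading the results: with default settings, `CountVectorizer` and `TfidfVectorizer` lowercase the text and keep only tokens of two or more word characters (the default `token_pattern` is `r"(?u)\b\w\w+\b"`), so the single-letter word "I" never becomes a column. The sketch below imitates that default tokenization with the standard `re` module. The helper name `default_tokens` is an assumption for illustration and is not part of scikit-learn.

```python
import re

# Same regular expression as scikit-learn's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do'
]

for doc in corpus:
    print(default_tokens(doc))

# The sorted set of all tokens gives the vocabulary (the DTM columns).
vocabulary = sorted({tok for doc in corpus for tok in default_tokens(doc)})
print(vocabulary)
```

Since "I" is the only word the third sentence shares with the other two, dropping it leaves that document with no terms in common with the rest of the corpus.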

4. Example output

📌 DTM
   do  know  like  love  should  want  what  you  your
0   0     1     0     1       0     1     0    1     1
1   0     0     1     0       0     0     0    1     0
2   1     0     0     0       1     0     1    0     0

📌 TF-IDF
     do   know   like   love  should   want   what    you   your
0  0.000  0.467  0.000  0.467  0.000  0.467  0.000  0.355  0.467
1  0.000  0.000  0.796  0.000  0.000  0.000  0.000  0.605  0.000
2  0.577  0.000  0.000  0.000  0.577  0.000  0.577  0.000  0.000

📌 Cosine similarity (DTM-based)
[[1.    0.316 0.   ]
 [0.316 1.    0.   ]
 [0.    0.    1.   ]]

📌 Cosine similarity (TF-IDF-based)
[[1.    0.215 0.   ]
 [0.215 1.    0.   ]
 [0.    0.    1.   ]]
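
To connect the similarity matrix to recommendation, one simple approach is to rank the other documents by their similarity to a query document. The sketch below is a minimal illustration under stated assumptions, not a full recommender: it hard-codes the DTM shown above, recomputes cosine similarity with NumPy only, and uses two hypothetical helpers, `cosine_matrix` and `recommend`.

```python
import numpy as np

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do'
]

# DTM rows copied from the output above
# (columns: do, know, like, love, should, want, what, you, your).
dtm = np.array([
    [0, 1, 0, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 1, 0, 0],
], dtype=float)

def cosine_matrix(x):
    """Pairwise cosine similarity between rows (assumes no all-zero row)."""
    unit_rows = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit_rows @ unit_rows.T

def recommend(sim, idx, top_n=1):
    """Return (index, score) pairs of the documents most similar to idx."""
    scores = sim[idx].copy()
    scores[idx] = -np.inf  # never recommend the query document itself
    order = np.argsort(scores)[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in order]

sim = cosine_matrix(dtm)
for i, text in enumerate(corpus):
    j, score = recommend(sim, i)[0]
    print(f"{text!r} -> {corpus[j]!r} (similarity {score:.3f})")
```

The first two documents recommend each other because they share the term "you". The third shares no terms with either, so its top score is 0; a practical system would likely apply a minimum-similarity threshold rather than return such a match.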