Content-based filtering uses item metadata: if a user likes a particular item, recommend other items similar to it.
Advantage: a brand-new item that no user has rated yet can still be recommended.
Disadvantages: recommendations stay close to what the user already likes (little serendipity), and quality depends heavily on the item metadata.
import pandas as pd
import numpy as np

df1 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')
df2 = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv')

# Rename the credits columns so the key becomes 'id'; 'tittle' (sic) avoids a
# name clash with the 'title' column already present in df2.
df1.columns = ['id', 'tittle', 'cast', 'crew']
df2 = df2.merge(df1, on='id')  # join the two tables on the shared movie id
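To make the merge step concrete, here is a toy sketch with hypothetical two-row stand-ins for the two CSVs (all values invented; the real files are not loaded here):

```python
import pandas as pd

# Hypothetical miniature stand-ins for the credits and movies tables.
credits = pd.DataFrame({'id': [1, 2],
                        'tittle': ['Movie A', 'Movie B'],
                        'cast': ['cast a', 'cast b'],
                        'crew': ['crew a', 'crew b']})
movies = pd.DataFrame({'id': [1, 2],
                       'title': ['Movie A', 'Movie B'],
                       'overview': ['a hero rises', 'a love story']})

# merge on='id' keeps one row per matching id and pulls in both tables' columns.
merged = movies.merge(credits, on='id')
print(merged.columns.tolist())
```

Because the credits title column was renamed to `'tittle'`, the merged frame keeps both it and the movies table's `'title'` without pandas adding `_x`/`_y` suffixes.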
The first dataset contains the following features:
- movie_id - A unique identifier for each movie.
- cast - The name of lead and supporting actors.
- crew - The name of Director, Editor, Composer, Writer etc.
The second dataset has the following features:
- budget - The budget with which the movie was made.
- genres - The genre(s) of the movie: Action, Comedy, Thriller, etc.
- homepage - A link to the homepage of the movie.
- id - This is in fact the same movie_id as in the first dataset.
- keywords - The keywords or tags related to the movie.
- original_language - The language in which the movie was made.
- original_title - The title of the movie before translation or adaptation.
- overview - A brief description of the movie.
- popularity - A numeric quantity specifying the movie popularity.
- production_companies - The production house of the movie.
- production_countries - The country in which it was produced.
- release_date - The date on which it was released.
- revenue - The worldwide revenue generated by the movie.
- runtime - The running time of the movie in minutes.
- status - "Released" or "Rumored".
- tagline - Movie's tagline.
- title - Title of the movie.
- vote_average - The average rating the movie received.
- vote_count - The number of votes received.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
df2['overview'] = df2['overview'].fillna('')  # replace NaN overviews with an empty string
tfidf_matrix = tfidf.fit_transform(df2['overview'])  # fit, then transform into the TF-IDF matrix
tfidf_matrix.shape  # (4803, 20978)
The matrix obtained from TF-IDF above can be used to compute similarity between movies.
Candidates include Euclidean, Pearson, and cosine similarity; no single one is universally best.
Cosine similarity is used here.
These similarity functions are also implemented in sklearn and can be used as below.
from sklearn.metrics.pairwise import linear_kernel

# Rows of tfidf_matrix are L2-normalized, so the plain dot product equals cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
indices = pd.Series(df2.index, index=df2['title']).drop_duplicates()  # maps movie title => row index in the matrix
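Why `linear_kernel` rather than `cosine_similarity`? `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the plain dot product already equals cosine similarity and skips the redundant normalization. This can be checked directly on an invented corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# Tiny hypothetical corpus just to verify the equivalence.
docs = ['a hero saves the city',
        'a love story in the city',
        'space hero adventure']
m = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Dot product of L2-normalized rows == cosine similarity.
assert np.allclose(linear_kernel(m, m), cosine_similarity(m))
print(linear_kernel(m, m).shape)
```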
def get_recommendations(title, cosine_sim=cosine_sim):
    idx = indices[title]  # row index of the given title
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]  # skip slot 0 (the movie itself), keep the top 10
    movie_indices = [i[0] for i in sim_scores]
    return df2['title'].iloc[movie_indices]
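The ranking logic inside `get_recommendations` can be traced on a hypothetical 4-movie similarity matrix (all values invented):

```python
import pandas as pd
import numpy as np

# Invented titles and similarity matrix for tracing the steps.
titles = pd.Series(['A', 'B', 'C', 'D'], name='title')
cosine_sim = np.array([[1.0, 0.8, 0.1, 0.5],
                       [0.8, 1.0, 0.2, 0.4],
                       [0.1, 0.2, 1.0, 0.3],
                       [0.5, 0.4, 0.3, 1.0]])
indices = pd.Series(titles.index, index=titles)  # title => row index

idx = indices['A']                                # row of movie 'A'
sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
top = [i for i, _ in sim_scores[1:3]]             # drop slot 0 (the movie itself)
print(titles.iloc[top].tolist())                  # ['B', 'D'] -- most similar first
```

Slot 0 is always the query movie itself (self-similarity 1.0), which is why the real function slices `[1:11]` for the top-10.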
Credits (cast, director), genres, keywords
The notebook also presents a method that embeds this metadata and recommends from it: the fields are concatenated into a single string ("soup"). This assumes 'keywords', 'cast', and 'genres' have already been parsed from their JSON-like strings into lists, and that a 'director' column has been extracted from 'crew'.

def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

df2['soup'] = df2.apply(create_soup, axis=1)
from sklearn.feature_extraction.text import CountVectorizer

# Raw counts rather than TF-IDF, so an actor/director appearing in many movies is not down-weighted.
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])
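A minimal end-to-end sketch of the soup approach on invented metadata, assuming the list-valued columns `create_soup` expects:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical rows already parsed into the shape create_soup expects:
# list-valued 'keywords', 'cast', 'genres' and a string 'director'.
df = pd.DataFrame({
    'keywords': [['space', 'rescue'], ['love', 'loss']],
    'cast': [['actor a'], ['actor b']],
    'director': ['dir x', 'dir y'],
    'genres': [['scifi'], ['drama']],
})

def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

df['soup'] = df.apply(create_soup, axis=1)

count_matrix = CountVectorizer(stop_words='english').fit_transform(df['soup'])
sim = cosine_similarity(count_matrix)  # movie-to-movie similarity over metadata terms
print(df['soup'].tolist())
```

From here the same `get_recommendations` function applies, just with `sim` passed in place of the overview-based matrix.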
Neighborhood-based collaborative filtering comes in two variants: user-based and item-based.
Disadvantages: it suffers from rating sparsity and the cold-start problem (a new user or item has no ratings to compare against), and computing neighbor similarities scales poorly as the number of users and items grows.
(Figure: example of computing each item's similarity with movie E, e.g. with the movie "Me Before You".)
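The notes only name neighborhood-based CF, so here is a minimal item-based sketch under simple assumptions: an invented user-item rating matrix, cosine similarity between item columns, and a similarity-weighted average to predict an unseen rating.

```python
import numpy as np

# Hypothetical rating matrix (rows: users, cols: items; 0 = unrated).
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [1.0, 0.0, 5.0, 4.0]])

# Item-based CF: cosine similarity between item columns.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Predict user 0's rating for item 2 as the similarity-weighted
# average of that user's observed ratings.
rated = R[0] > 0                     # items user 0 actually rated
pred = sim[2, rated] @ R[0, rated] / sim[2, rated].sum()
print(pred)
```

User 0's tastes align with users who disliked item 2, so the prediction comes out low; user-based CF is the mirror image, comparing rows (users) instead of columns (items).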