Content-Based Recommenders 2

TOLL TERRY · May 26, 2022

recommendation_study


		2. Metadata-based recommender

This time I studied metadata-based recommender systems.

The steps are:
1> Prepare the metadata
2> Clean the data
3> Use CountVectorizer
4> Implement the recommender based on the metadata.

CountVectorizer

CountVectorizer vs TF-IDF

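CountVectorizer cannot identify which words are more or less important for the analysis.
It simply treats the words that are most abundant in the corpus as the statistically most significant ones.
It does not identify relationships between words, such as linguistic similarity.

# Source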
https://www.linkedin.com/pulse/count-vectorizers-vs-tfidf-natural-language-processing-sheel-saket

TF-IDF

TF-IDF fails to provide linguistic information about words, such as their actual meaning or their similarity to other words.
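To make the difference concrete, here is a minimal plain-Python sketch that computes raw counts and a textbook TF-IDF weight on a tiny made-up corpus. The corpus and the `tfidf` helper are my own assumptions for illustration; scikit-learn's TF-IDF differs in smoothing and normalization, so the numbers will not match its output.

```python
import math
from collections import Counter

# Hypothetical toy corpus: every document mentions "action".
docs = [
    "action hero action",
    "action romance",
    "action comedy",
]
tokenized = [d.split() for d in docs]

# Count vectors: raw term frequency per document.
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def tfidf(term, doc_index):
    """Textbook weight tf * log(N / df); not scikit-learn's exact formula."""
    return counts[doc_index][term] * math.log(n_docs / doc_freq[term])

# By raw count "action" looks most important in the first document...
print(counts[0]["action"], counts[0]["hero"])
# ...but its TF-IDF weight is zero because it appears in every document.
print(tfidf("action", 0), tfidf("hero", 0))
```

Counts reward whatever is frequent, while the IDF factor pushes down terms that occur in many documents.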

START

1> Prepare the metadata.

#Print the head of the credit dataframe
cred_df.head()

Inspect the credits data before adding it to the original data.

#Print the head of the keywords dataframe
key_df.head()

Inspect the keywords data before adding it to the original data.

df[df['id']=='1997-08-20']

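Some id values in the current data contain characters such as "-" (for example the date string '1997-08-20'), so the column cannot be converted to int directly.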

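2> Clean the data

import numpy as np

def clean_ids(x):
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        #The value cannot be interpreted as an integer
        return np.nan

Every value that cannot be converted to int is replaced with NaN.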

df['id'] = df['id'].apply(clean_ids)

# Drop the rows whose id is NaN.
df = df[df['id'].notnull()]

Using apply, every id that cannot be converted to int becomes NaN,
and the rows holding NaN are then removed by keeping only the rows where notnull() is True.
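The same clean-then-filter idea can be sketched without pandas. The id values below are made up for illustration, and `math.nan` stands in for `np.nan`; this is only a sketch of the logic, not the code that runs on the dataframe.

```python
import math

def clean_ids(x):
    """Return x as an int, or NaN when it cannot be converted."""
    try:
        return int(x)
    except (ValueError, TypeError):
        return math.nan  # stand-in for np.nan in this sketch

# Hypothetical raw id column, including a date string and a missing value.
raw_ids = ["862", "8844", "1997-08-20", None, "15602"]

cleaned = [clean_ids(x) for x in raw_ids]

# Keep only real integers, mirroring df[df['id'].notnull()].
kept = [x for x in cleaned if not (isinstance(x, float) and math.isnan(x))]
print(kept)
```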

df['id'] = df['id'].astype('int')
key_df['id'] = key_df['id'].astype('int')
cred_df['id'] = cred_df['id'].astype('int')

df = df.merge(cred_df, on='id')
df = df.merge(key_df, on='id')

#Display the head of df
df.head()

The id columns of the keywords and credits data are converted to int, and both are then merged into df (our main dataframe) on id.
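One thing worth keeping in mind is that merge performs an inner join by default, so a movie whose id is missing from either table disappears from the result. The plain-Python sketch below imitates that behavior on lists of dicts; `inner_join` and the sample rows are hypothetical and only illustrate the idea.

```python
def inner_join(left, right, key):
    """Combine rows from two lists of dicts that share the same key value."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # Rows without a match on the right side are dropped.
        for match in right_by_key.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Hypothetical rows; id 2 has no credits entry.
movies = [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}, {"id": 3, "title": "C"}]
credits = [{"id": 1, "cast": "cast-1"}, {"id": 3, "cast": "cast-3"}]
keywords = [{"id": 1, "keywords": "kw-1"}, {"id": 2, "keywords": "kw-2"}, {"id": 3, "keywords": "kw-3"}]

merged = inner_join(inner_join(movies, credits, "id"), keywords, "id")
print([row["id"] for row in merged])
```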

from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(literal_eval)

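The 'cast', 'crew', 'keywords', and 'genres' columns of the dataframe are stored as strings.
Passing literal_eval to apply parses each of these strings (a stringified list of dictionaries) into an actual Python object,
which is the form we need for the rest of the cleaning.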
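A small sketch of what literal_eval does to a single cell. The string below is a made-up sample in the same shape as the stringified lists, not a value read from the dataset.

```python
from ast import literal_eval

# Hypothetical cell value: a list of dicts stored as plain text.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(raw)

# Before parsing it is a str; afterwards it is a real list of dicts.
print(type(raw).__name__, type(parsed).__name__)
print(parsed[0]["name"])
```

Unlike eval, literal_eval only accepts Python literals, so it is the safer choice for parsing data read from a file.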

def get_director(x):
    for crew_member in x:
        if crew_member['job'] == 'Director':
            return crew_member['name']
    return np.nan

For the metadata,
this function walks through the crew list and returns the 'name' of the first member whose 'job' is 'Director'; if there is no such member, it returns NaN.

#Define the new director feature
df['director'] = df['crew'].apply(get_director)

#Print the directors of the first five movies
df['director'].head()

Applying the function to the crew column creates a new director column that holds the director's name, or NaN when no director is listed.
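Here is a quick check of that behavior on two hand-written crew lists. The names are placeholders and `math.nan` stands in for `np.nan`, so treat this as a sketch rather than the exact code above.

```python
import math

def get_director(crew):
    """Return the first director's name, or NaN if none is listed."""
    for member in crew:
        if member["job"] == "Director":
            return member["name"]
    return math.nan  # stand-in for np.nan in this sketch

# Hypothetical crew lists.
crew_with_director = [
    {"job": "Producer", "name": "Jane Roe"},
    {"job": "Director", "name": "John Doe"},
]
crew_without_director = [{"job": "Editor", "name": "Jane Roe"}]

print(get_director(crew_with_director))
print(get_director(crew_without_director))
```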

def generate_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names
    return []

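This function collects the 'name' value of each element and, if there are more than three names, keeps only the first three. If the input is not a list (malformed or missing data), it returns an empty list [].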

df['cast'] = df['cast'].apply(generate_list)
df['keywords'] = df['keywords'].apply(generate_list)

Running the function above leaves at most three names per movie (fewer is fine), and an empty list where nothing could be extracted.
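A short sketch of the two cases, using made-up cast entries and a NaN value for the missing case.

```python
def generate_list(x):
    """Return up to the first three 'name' values, or [] for non-list input."""
    if isinstance(x, list):
        return [item["name"] for item in x][:3]
    return []

# Hypothetical cast list with four members.
cast = [
    {"name": "Actor One"},
    {"name": "Actor Two"},
    {"name": "Actor Three"},
    {"name": "Actor Four"},
]

print(generate_list(cast))          # only the first three names survive
print(generate_list(float("nan")))  # missing data becomes an empty list
```

Slicing with `[:3]` already handles lists shorter than three, so the explicit length check in the original is optional.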

df['genres'] = df['genres'].apply(lambda x: x[:3])

Genres are likewise limited to at most three per movie.

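#Normalize names so that each one becomes a single lowercase token
def sanitize(x):
    if isinstance(x, list):
        #Strip spaces and convert to lowercase
        return [i.replace(" ", "").lower() for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return x.replace(" ", "").lower()
        else:
            return ''

I did not fully understand this function at first, but as far as I can tell it works like this: for a list, it removes the spaces from each element and lowercases it; for a string (the director), it does the same; for anything else, such as the NaN of a missing director, it returns the empty string "". Removing the spaces turns a full name into one token, so two different people who share a first name are not treated as the same word later.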

for feature in ['cast', 'director', 'genres', 'keywords']:
    df[feature] = df[feature].apply(sanitize)

This cleaning step normalizes the names in all four features and turns a missing director into an empty string.
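To see why the spaces are removed, the sketch below runs the same logic on a few sample values. The inputs are illustrative and not taken from the dataframe.

```python
def sanitize(x):
    """Strip spaces and lowercase; non-string, non-list input becomes ''."""
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    return ""

print(sanitize(["Tom Hanks", "Tom Cruise"]))  # two distinct tokens, no shared "tom"
print(sanitize("John Lasseter"))
print(repr(sanitize(float("nan"))))           # missing director
```

Without this step a word-level tokenizer would split "Tom Hanks" and "Tom Cruise" into separate words, and the shared "tom" would count as a match between two unrelated actors.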

def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

Next, implement a function that joins the cleaned features into a single string.
The result becomes a new column of the dataframe that contains the metadata.

df['soup'] = df.apply(create_soup, axis=1)

This is the metadata "soup": one string per movie that gathers the four features (keywords, cast, director, and genres).
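Because create_soup only indexes the row by column name, it can be tried on a plain dict. The row below is an illustrative example of already sanitized values, not a row read from the dataset.

```python
def create_soup(x):
    """Join keywords, cast, director, and genres into one space-separated string."""
    return (' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' '
            + x['director'] + ' ' + ' '.join(x['genres']))

# Illustrative sanitized row (assumed values).
row = {
    "keywords": ["jealousy", "toy", "boy"],
    "cast": ["tomhanks", "timallen", "donrickles"],
    "director": "johnlasseter",
    "genres": ["animation", "comedy", "family"],
}

soup = create_soup(row)
print(soup)
```

When a feature is empty (for example a missing director), the join leaves an extra space in the string, which should be harmless because the tokenizer only looks at the words.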

3> Use CountVectorizer to score the metadata by term frequency for the recommendations.

# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#Define a new CountVectorizer object and create vectors for the soup
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df['soup'])

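I tried to work out for myself why CountVectorizer is used here. As noted above, TF-IDF does not provide linguistic information about words, such as their actual meaning or their similarity to other words.
Our soup is not free text: it holds the director plus at most three cast members, keywords, and genres, so every token is already a meaningful label.
My conclusion is that simple term counts are enough to score the movies in this case, and that the IDF weighting of TF-IDF would only down-weight a director or actor for appearing in many movies, which is not what we want when recommending.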

#Import cosine_similarity function
from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity score (count vectors are not normalized, so a plain dot product is not equivalent here)
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
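To see what the similarity score means, here is a hand-rolled sketch of count vectors and cosine similarity on three made-up soups. It splits on whitespace only, which is an assumption of this sketch; CountVectorizer applies its own tokenization and stop-word filtering.

```python
import math
from collections import Counter

def count_vector(soup):
    """Map each token to the number of times it appears."""
    return Counter(soup.split())

def dot(a, b):
    """Dot product over the tokens the two vectors share."""
    return sum(a[t] * b[t] for t in a if t in b)

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

# Hypothetical soups.
a = count_vector("toy boy tomhanks animation")
b = count_vector("toy tomhanks comedy animation")
c = count_vector("war drama")

print(dot(a, b))     # number of shared tokens: 3
print(cosine(a, b))  # 3 / (2 * 2) = 0.75
print(cosine(a, c))  # nothing in common: 0.0
```

The raw dot product grows with the number of tokens in a soup, while the cosine stays between 0 and 1 for count vectors, which is why the normalized score is used for ranking.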

4> Implement the metadata-based recommender.

# Reset index of your df and construct reverse mapping again
df = df.reset_index()
indices2 = pd.Series(df.index, index=df['title'])

# Defaults point at the objects built in this post (cosine_sim2, indices2)
def content_recommender(title, cosine_sim=cosine_sim2, df=df, indices=indices2):
    # Obtain the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    # And convert it into a list of (index, score) tuples
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies. Ignore the first movie.
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

content_recommender('The Lion King', cosine_sim2, df, indices2)
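The ranking step inside the recommender can be traced on a tiny example. The titles and the similarity matrix below are made up, and `recommend` is a simplified stand-in for content_recommender that works on plain lists.

```python
# Hypothetical titles and a symmetric similarity matrix.
titles = ["A", "B", "C", "D"]
cosine_sim = [
    [1.0, 0.2, 0.8, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.8, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]
# Reverse mapping from title to row index.
indices = {title: i for i, title in enumerate(titles)}

def recommend(title, top_n=2):
    idx = indices[title]
    # Pair every movie index with its similarity to the query movie.
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Highest similarity first.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0, which is expected to be the query movie itself.
    sim_scores = sim_scores[1:top_n + 1]
    return [titles[i] for i, _ in sim_scores]

print(recommend("A"))
```

Skipping the first entry assumes the query movie ranks first with a similarity of 1.0. If another movie had an identical soup, the tie could place that movie first instead, so this is a simplification.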
