🔥[KHUDA_RecSys] Project Preparation (3)🔥

nothingisme · November 18, 2022


🐸 To do 🐸

  • Label the timestamp data by day of the week / time of day (morning, afternoon, night)
  • Check how the data is distributed by day of the week / time of day and judge whether the split is meaningful
  • If it is meaningful, analyze further what characterizes each day of the week / time of day
  • Analyze how the time of day should be split to be more meaningful

🗓️ 1118

❇️ Writing up the project prep at Kanna again. I need to get ready before tomorrow's meeting. Today's task is to label the timestamps by day of the week and time of day.

✅ References

✅ Labeling the day of the week with pandas datetime

-📌 Add a column that labels each event's timestamp with its day of the week as 0~6 (Monday=0 through Sunday=6).

-📌 Group the events that share the same weekday label and analyze them.

-📌 Analysis of the Top 20 most frequent aids per weekday. A for loop would have done it, but I ground it out by hand. Seven plots were too many, so I only included two.

mon_aid_counts = df1[df1['weekday']==0].groupby('aid')[['aid']].count()
tue_aid_counts = df1[df1['weekday']==1].groupby('aid')[['aid']].count()
wed_aid_counts = df1[df1['weekday']==2].groupby('aid')[['aid']].count()
thu_aid_counts = df1[df1['weekday']==3].groupby('aid')[['aid']].count()
fri_aid_counts = df1[df1['weekday']==4].groupby('aid')[['aid']].count()
sat_aid_counts = df1[df1['weekday']==5].groupby('aid')[['aid']].count()
sun_aid_counts = df1[df1['weekday']==6].groupby('aid')[['aid']].count()

mon_aid_counts = mon_aid_counts.rename(columns={'aid': 'count'}).reset_index()
tue_aid_counts = tue_aid_counts.rename(columns={'aid': 'count'}).reset_index()
wed_aid_counts = wed_aid_counts.rename(columns={'aid': 'count'}).reset_index()
thu_aid_counts = thu_aid_counts.rename(columns={'aid': 'count'}).reset_index()
fri_aid_counts = fri_aid_counts.rename(columns={'aid': 'count'}).reset_index()
sat_aid_counts = sat_aid_counts.rename(columns={'aid': 'count'}).reset_index()
sun_aid_counts = sun_aid_counts.rename(columns={'aid': 'count'}).reset_index()

mon_aid_counts.sort_values(by='count', ascending=False, inplace=True)
tue_aid_counts.sort_values(by='count', ascending=False, inplace=True)
wed_aid_counts.sort_values(by='count', ascending=False, inplace=True)
thu_aid_counts.sort_values(by='count', ascending=False, inplace=True)
fri_aid_counts.sort_values(by='count', ascending=False, inplace=True)
sat_aid_counts.sort_values(by='count', ascending=False, inplace=True)
sun_aid_counts.sort_values(by='count', ascending=False, inplace=True)

mon_20_most_frequent_aids = mon_aid_counts.set_index('aid').head(20).to_dict()['count']
tue_20_most_frequent_aids = tue_aid_counts.set_index('aid').head(20).to_dict()['count']
wed_20_most_frequent_aids = wed_aid_counts.set_index('aid').head(20).to_dict()['count']
thu_20_most_frequent_aids = thu_aid_counts.set_index('aid').head(20).to_dict()['count']
fri_20_most_frequent_aids = fri_aid_counts.set_index('aid').head(20).to_dict()['count']
sat_20_most_frequent_aids = sat_aid_counts.set_index('aid').head(20).to_dict()['count']
sun_20_most_frequent_aids = sun_aid_counts.set_index('aid').head(20).to_dict()['count']

import matplotlib.pyplot as plt

def visualize_aid_frequencies(aid_frequencies, title):
    # Horizontal bar chart of {aid: count}; the most frequent aid is drawn at the top.
    fig, ax = plt.subplots(figsize=(6, 4), dpi=100)
    ax.barh(range(len(aid_frequencies)), list(aid_frequencies.values()), align='center')
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.set_yticks(range(len(aid_frequencies)))
    # Tick label format: "aid (count)"
    ax.set_yticklabels([f'{x} ({value_count:,})' for x, value_count in aid_frequencies.items()])
    ax.set_title(title, size=20, pad=15)
    ax.invert_yaxis()
    plt.show()

# One plot per weekday, Monday through Sunday
for most_frequent_aids in [mon_20_most_frequent_aids, tue_20_most_frequent_aids, wed_20_most_frequent_aids, thu_20_most_frequent_aids, fri_20_most_frequent_aids, sat_20_most_frequent_aids, sun_20_most_frequent_aids]:
    visualize_aid_frequencies(
        aid_frequencies=most_frequent_aids,
        title='Top 20 Most Frequent aids'
    )
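The per-weekday steps above (label, group, count, take the top 20) can probably be collapsed into one loop. The following is only a minimal sketch: the tiny `df1` is a made-up stand-in for the real data, the `ts` column is assumed to already be a datetime column, and `top_aids_by_weekday` is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df1: one row per event, with an item id ('aid')
# and a datetime column ('ts'). The real data is not shown in the post.
df1 = pd.DataFrame({
    "aid": [10, 10, 11, 12, 10, 12, 12, 13],
    "ts": pd.to_datetime([
        "2022-08-01 09:00", "2022-08-01 21:30", "2022-08-01 10:15",  # Monday
        "2022-08-02 08:00", "2022-08-02 13:00", "2022-08-02 23:45",  # Tuesday
        "2022-08-06 02:00", "2022-08-07 16:20",                      # Saturday, Sunday
    ]),
})

# pandas numbers weekdays Monday=0 ... Sunday=6
df1["weekday"] = df1["ts"].dt.weekday


def top_aids_by_weekday(df, n=20):
    """Return {weekday: {aid: count}} holding the n most frequent aids per weekday."""
    result = {}
    for day, group in df.groupby("weekday"):
        # value_counts() sorts by count in descending order
        result[day] = group["aid"].value_counts().head(n).to_dict()
    return result


top_by_day = top_aids_by_weekday(df1)
```

Each inner dict has the same `{aid: count}` shape as `mon_20_most_frequent_aids`, so it could be passed to `visualize_aid_frequencies` directly, with the weekday added to the title to tell the plots apart.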


+ It was too long, so I cropped it. Below is just the aid numbers printed out.

✅ Labeling with pandas by splitting into three time periods: morning / afternoon / night

📌 For today's prep I will pick the morning / afternoon / night cutoffs arbitrarily, and settle on meaningful cutoffs tomorrow when I talk with Yeji.

📌 The pandas between_time method extracts the observations that fall within a given time-of-day range, across all dates.

  • I kept getting a TypeError. The original DataFrame's index was not the 'ts' column, so between_time could not be applied. I therefore used set_index to make the timestamp column the index.
TypeError: Index must be DatetimeIndex
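A minimal sketch of that fix, on made-up data. It assumes the `ts` column is already a datetime column; the sample rows and the 06:00 to 12:00 window are illustrative, not the cutoffs used in the project.

```python
import pandas as pd

# Hypothetical sample: 'ts' is assumed to already be a datetime column.
df1 = pd.DataFrame({
    "aid": [1, 2, 3, 4],
    "ts": pd.to_datetime([
        "2022-08-01 07:30", "2022-08-01 14:00",
        "2022-08-02 22:10", "2022-08-03 09:45",
    ]),
})

# df1 still has the default integer index here, which is what triggers
# "TypeError: Index must be DatetimeIndex" when between_time is called.
by_time = df1.set_index("ts")

# Rows from any date whose time of day falls in the window.
# By default both endpoints are included.
morning = by_time.between_time("06:00", "12:00")
```

`between_time` only looks at the time of day, so rows from August 1 and August 3 both end up in `morning`.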

📌 I got as far as splitting the df by time period. Now I should be able to add a column that gives each row the label matching its condition.

  • Method 1: between_time returns a DataFrame. My idea was to store the event index values belonging to each of those DataFrames in a list (the original index that used to be 'Unnamed: 0'; right now the timestamp is set as the index so that between_time works), then run a for loop and, whenever an index is found in one of the lists, assign the matching 'morning', 'afternoon', or 'night' value in the 'time zone' column. It does work, but looping over the full length of df1 and checking on every iteration whether i is in a list takes far too long. I actually let it run for about 20 minutes, then stopped because it was too inefficient and there had to be another way. I ran code roughly like the one below, but it was too dumb, so I gave up on it.

  • Method 2: use datetime to pull out just the hour and label by filtering on that hour value. Extracting only the hour the same way I extracted the weekday gives the result in the picture below.
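The screenshot is not reproduced here. As a rough sketch of what Method 2's first step could look like, assuming `ts` is a datetime column (the sample values are made up):

```python
import pandas as pd

# Hypothetical sample; the real df1 is not shown in the post.
df1 = pd.DataFrame({
    "ts": pd.to_datetime(["2022-08-01 00:05", "2022-08-01 13:40", "2022-08-02 23:59"]),
})

# Same accessor family as the weekday extraction: .dt.weekday gives 0~6,
# .dt.hour gives the hour of day as an integer.
df1["hour"] = df1["ts"].dt.hour
```

If the timestamp is still set as the index (as it was for between_time), `df1.index.hour` should give the same values without going through `.dt`.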


This hour value should be an integer in the range 0~23, so it looks easier to handle than the original datetime. Can I label the rows with conditions on hour?

Doing it this way seems to work as I intended. I am not sure whether it is a good approach, though.
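One vectorized way to express hour-based labeling is `numpy.select`, which avoids a Python loop over the rows. This is only a sketch under assumed cutoffs: morning = 06 to 11, afternoon = 12 to 17, night = everything else. Those boundaries and the helper name `label_period` are placeholders, not the criteria actually chosen for the project; the 0/1/2 coding follows the post (0 = morning).

```python
import numpy as np
import pandas as pd

# Hypothetical hours covering every boundary of the assumed cutoffs.
df1 = pd.DataFrame({"hour": [0, 5, 6, 11, 12, 17, 18, 23]})


def label_period(hour):
    """Map an hour Series to 0 (morning), 1 (afternoon), 2 (night).

    Assumed cutoffs: 06-11 morning, 12-17 afternoon, the rest night.
    """
    conditions = [
        hour.between(6, 11),   # Series.between includes both endpoints
        hour.between(12, 17),
    ]
    # Rows matching neither condition fall through to the default (night).
    return np.select(conditions, [0, 1], default=2)


df1["period"] = label_period(df1["hour"])
```

Because night wraps around midnight (18 to 05), treating it as the default case is simpler than writing it as a single range.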

📌 I also ran the analysis above per time period. I plotted the Top 20 aids for all three periods, but only included the one for 0 (morning) here.

✅ Trying out pandas get_dummies and scikit-learn's LabelEncoder

📌 Now let's use scikit-learn instead of pandas. I first tried the label encoder that Hyojun told me about.

It does not seem to work properly at all. I probably need to do some preprocessing before fitting it.
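What went wrong in the original attempt is not shown, so the following is only a sketch of the basic `LabelEncoder` usage pattern: it is fitted on a one-dimensional sequence of category values and returns integer codes. The string labels here are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column; LabelEncoder expects 1-D input.
periods = pd.Series(["morning", "night", "afternoon", "morning"])

encoder = LabelEncoder()
codes = encoder.fit_transform(periods)

# classes_ holds the sorted unique values; a value's position is its code.
classes = list(encoder.classes_)

# The mapping can be reversed to recover the original labels.
restored = list(encoder.inverse_transform(codes))
```

Note that the codes follow sorted order of the labels, not any meaningful order of the day, and the result is a single integer column rather than a one-hot matrix.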

📌 Let's try get_dummies. I think I misunderstood what the two of them do. Once I have labeled the timestamp as 0~6 (day of the week) or 0~2 (morning / afternoon / night) with pandas as above, is the right next step to run one-hot encoding on those labels with get_dummies and label_encoder?
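For reference, `get_dummies` is a pandas function, and it is the one that produces one-hot columns; `LabelEncoder` only maps categories to integers, which the 0~6 and 0~2 labels already are. A minimal sketch on made-up rows, assuming the label columns are named `weekday` and `period` (the `period` name is a placeholder):

```python
import pandas as pd

# Hypothetical rows that already carry the integer labels from the steps above.
df = pd.DataFrame({
    "aid": [101, 102, 103],
    "weekday": [0, 1, 6],   # Monday, Tuesday, Sunday
    "period": [0, 2, 1],    # morning, night, afternoon
})

# One 0/1 column per observed value of each listed column.
# Columns not listed (here 'aid') are passed through unchanged.
one_hot = pd.get_dummies(df, columns=["weekday", "period"], dtype=int)
```

Only values that actually occur get a column, so a sample without any Wednesday events would have no `weekday_2` column. Converting the label column to a categorical with all seven categories first is one way to keep the layout fixed.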

🐧 To do tomorrow 🐧

  • If it is meaningful, analyze further what characterizes each day of the week / time of day
  • Analyze how the time of day should be split to be more meaningful
  • Try one-hot encoding with get_dummies / LabelEncoder
  • Review CF models
  • Prepare the report for tomorrow's meeting
