🔥[KHUDA_RecSys] Project Preparation (5)🔥

nothingisme · November 22, 2022


🗓️ 1122

❇️ I decided to study the co-visitation matrix alongside other methodologies and write up my notes. Happy Dormammu~ To prepare for tomorrow evening's meeting, I am staying home today.


✅ Reference 1

Co-occurrence grouping is a data mining technique that finds associations between objects. It usually takes the form of a rule: if item A occurs, item B is likely to occur as well.

A co-occurrence matrix records how many times each pair of items occurred together. The idea is that the more often two items co-occur, the more related or similar they are. The figure below shows how to compute the recommendation vector R from the co-occurrence matrix and the user's preference values.

The figure above is a co-occurrence matrix for 7 items, so it is a 7 x 7 matrix. Each row holds the co-occurrence counts between one item and every other item. Because the co-occurrence count of item A with item B equals that of item B with item A, the matrix is symmetric. Each entry of the recommendation vector R is the inner product of the corresponding row with the user's preference vector. For example, R for item 6 can be computed this way.

The user expressed a preference for items 1, 2, 5, and 7. Recommendations therefore have to come from the remaining items 3, 4, and 6, and the best recommendation turns out to be item 6. In this way, R becomes larger when an item co-occurs with other items the user has expressed a preference for, or when its co-occurrences overlap heavily with strongly preferred items.
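The computation described above can be sketched with NumPy. The matrix values here are made up for illustration (they are not the numbers from the figure), and the names `C`, `p`, and `R` are my own; only the idea of multiplying the co-occurrence matrix by the preference vector comes from the text.

```python
import numpy as np

# Hypothetical symmetric 7 x 7 co-occurrence counts (NOT the values from the figure).
C = np.array([
    [0, 3, 1, 0, 2, 4, 1],
    [3, 0, 0, 1, 1, 3, 2],
    [1, 0, 0, 2, 0, 1, 0],
    [0, 1, 2, 0, 1, 0, 1],
    [2, 1, 0, 1, 0, 2, 3],
    [4, 3, 1, 0, 2, 0, 2],
    [1, 2, 0, 1, 3, 2, 0],
])

# Preference vector: 1 for items 1, 2, 5, 7 (1-based), 0 otherwise.
p = np.array([1, 1, 0, 0, 1, 0, 1])

# Each entry of R is the inner product of one row of C with p.
R = C @ p

# Only items the user has not interacted with are candidates.
unseen = np.flatnonzero(p == 0)
best_item = int(unseen[np.argmax(R[unseen])]) + 1  # back to 1-based numbering
print(R[unseen], best_item)
```

With these made-up counts the unseen items 3, 4, and 6 score 1, 3, and 11, so item 6 is recommended.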


✅ Reference 2

  • Available on O'Reilly Media if you log in with a school account. So the data science textbook comes from here? A source I never imagined, whoa.
  • Co-occurrence grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based on transactions involving them.
  • An example co-occurrence question would be: What items are commonly purchased together?
  • While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
  • For example, analyzing purchase records from a supermarket may uncover that ground meat is purchased together with hot sauce much more frequently than we might expect.
  • Deciding how to act upon this discovery might require some creativity, but it could suggest a special promotion, product display, or combination offer.
  • Co-occurrence of products in purchases is a common type of grouping known as market-basket analysis. Some recommendation systems also perform a type of affinity grouping by finding, for example, pairs of books that are purchased frequently by the same people (“people who bought X also bought Y”).
  • The result of co-occurrence grouping is a description of items that occur together. These descriptions usually include statistics on the frequency of the co-occurrence and an estimate of how surprising it is.

✅ Concept summary

So in the end, Co-Visitation Matrix = Co-Occurrence Matrix = ARD (association rule discovery). The association analysis I already knew turned out to be co-visitation all along. 엉망진창호, just how many moves ahead was he looking... Next, I think I need to pick apart the code on Kaggle and work on understanding it.


✅ Code walkthrough 1

fraction_of_sessions_to_use = 1

if fraction_of_sessions_to_use != 1:
    lucky_sessions_train = df.drop_duplicates(['session']).sample(frac=fraction_of_sessions_to_use, random_state=42)['session']
    subset_of_train = df[df.session.isin(lucky_sessions_train)]
else:
    subset_of_train = df
    
subset_of_train.index = pd.MultiIndex.from_frame(subset_of_train[['session']])

chunk_size = 30_000
min_ts = df.ts.min()
max_ts = df.ts.max()

from collections import defaultdict, Counter
next_AIDs = defaultdict(Counter)

sessions = df.session.unique()

df=df.drop(['Unnamed: 0'],axis=1)

After running everything up to this point, df looks like the following.

Now it comes down to understanding the following code, which is the most important part.

import datetime as dt

# Loop from 0 up to (but not including) the number of sessions, stepping by chunk_size.
# With chunk_size = 30000 this gives i = 0, 30000, 60000, ...
for i in range(0, sessions.shape[0], chunk_size):
    # Current chunk = the rows from the i-th session through the last session of this chunk.
    # The min() handles the final chunk, when fewer than chunk_size sessions remain.
    # sessions.shape[0]-1 : index of the last session, used once the end of the range is reached.
    # i+chunk_size-1 : index of the last session in this chunk (the general case).
    current_chunk = df.loc[sessions[i]:sessions[min(sessions.shape[0]-1, i+chunk_size-1)]].reset_index(drop=True)
    # Group by session, and within each session's group:
    # nth : Take the nth row from each group if n is an int, otherwise a subset of rows.
    #     : with a list, take the rows at those positions => rows -30 through -1 (the last 30 rows).
    # This keeps the tail of each session -> the tail is likely to be more useful than the start.
    current_chunk = current_chunk.groupby('session', as_index=False).nth(list(range(-30,0))).reset_index(drop=True)
    # Inner join the chunk with itself on the session column.
    consecutive_AIDs = current_chunk.merge(current_chunk, on='session')
    # Drop the pairs where both sides are the same aid.
    consecutive_AIDs = consecutive_AIDs[consecutive_AIDs.aid_x != consecutive_AIDs.aid_y]
    # Convert the ts columns from str to datetime.
    consecutive_AIDs.ts_y=pd.to_datetime(consecutive_AIDs.ts_y)
    consecutive_AIDs.ts_x=pd.to_datetime(consecutive_AIDs.ts_x)
    # Compute the time difference between y and x.
    time_diff=(consecutive_AIDs.ts_y - consecutive_AIDs.ts_x)
    # Add the time difference to consecutive_AIDs as a column.
    consecutive_AIDs['days_elapsed'] = time_diff
    # Keep only the pairs where y happens no earlier than x and at most one day later.
    consecutive_AIDs = consecutive_AIDs[(consecutive_AIDs.days_elapsed >= dt.timedelta(days=0)) & (consecutive_AIDs.days_elapsed <= dt.timedelta(days=1))]
    
    for aid_x, aid_y in zip(consecutive_AIDs['aid_x'], consecutive_AIDs['aid_y']):
        next_AIDs[aid_x][aid_y] += 1
  • After running this, consecutive_AIDs looks like the following.

  • After running this, next_AIDs is of type "collections.defaultdict" and looks like the output below. For each aid_x (that is, for every aid_x), it appears to count how many times each different product aid_y was paired with it.
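To make the shape of next_AIDs concrete, here is a minimal sketch using only the standard library. The sessions in `toy_sessions` are invented, and the timestamp filter from the real loop is left out on purpose; this only shows how `defaultdict(Counter)` accumulates pair counts.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical sessions (session id -> aids in order); not taken from the dataset.
toy_sessions = {1: [10, 11, 12], 2: [10, 12], 3: [11, 12, 12]}

next_aids_toy = defaultdict(Counter)
for aids in toy_sessions.values():
    # All ordered pairs of events within one session, like the self-merge on 'session'.
    for aid_x, aid_y in permutations(aids, 2):
        if aid_x != aid_y:  # skip pairs of the same aid
            next_aids_toy[aid_x][aid_y] += 1

print(dict(next_aids_toy))
print(next_aids_toy[10].most_common(1))  # the aid most often paired with aid 10
```

Each key is an aid_x and each value is a Counter of the aid_y values it was paired with, which is why `most_common` can later be called on `next_AIDs[AID]` directly.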

📌 Chunk
Chunk : the number of rows processed between commits when working with data in batches
Chunk-oriented processing : reading data one item at a time, building up a batch called a chunk, and then handling transactions one chunk at a time
chunk_size : the number of items handled in a single transaction
In other words, processing goes through the following steps.

  • The Reader reads one item of data.
  • The Processor transforms the item that was read.
  • The processed items are collected in a separate buffer, and once a full chunk has accumulated they are handed to the Writer, which saves them in bulk.
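The three steps above can be sketched in plain Python. The function name `process_in_chunks` and its arguments are hypothetical and only illustrate the Reader, Processor, Writer flow; this is not code from the Kaggle notebook.

```python
def process_in_chunks(records, chunk_size, process, write):
    """Read items one by one, process each, and write them out chunk by chunk."""
    buffer = []
    for record in records:               # Reader: one item at a time
        buffer.append(process(record))   # Processor: transform the item
        if len(buffer) == chunk_size:    # a full chunk has accumulated
            write(buffer)                # Writer: save the whole chunk at once
            buffer = []                  # start a new chunk
    if buffer:                           # leftover items smaller than chunk_size
        write(buffer)


written = []
process_in_chunks(range(7), 3, lambda x: x * 2, written.append)
print(written)  # [[0, 2, 4], [6, 8, 10], [12]]
```

In the loop from the notebook the chunking is done over sessions instead, presumably to keep the self-merge from growing too large in memory.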

📌 range()
The third parameter of the range function is the step: how far to jump each time, starting from the start index. The figure below makes this easy to understand.

  • step: [optional] integer value, denoting the difference between any two numbers in the sequence.
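As a small illustration of how the step and the `min(...)` in the chunk loop interact, the sketch below lists the first and last session index of each chunk. The value of `n_sessions` is an arbitrary assumption.

```python
n_sessions = 70_000   # hypothetical number of sessions
chunk_size = 30_000

# (first index, last index) of every chunk; min() caps the last chunk at the final index.
bounds = [
    (i, min(n_sessions - 1, i + chunk_size - 1))
    for i in range(0, n_sessions, chunk_size)
]
print(bounds)  # [(0, 29999), (30000, 59999), (60000, 69999)]
```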

📌 pandas - merge
The merge method is used to combine two DataFrames based on key values that exist in both.

  • default : DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
  • right : the right-hand DataFrame
  • left_on : the key column on the left, when the key column names differ
  • right_on : the key column on the right, when the key column names differ
  • left_index / right_index : when merging on the index, setting this to True makes that object's index the merge key.
  • on : the key column (when both DataFrames use the same key column name)
  • how : join type {'left', 'right', 'inner', 'outer'}; the default is 'inner'
  • sort : whether to sort the join keys lexicographically in the merged result
  • suffixes : suffixes appended to column names that appear in both objects
  • copy : whether to create a copy
  • indicator : if True, adds one extra column to the merged result that reports where each row came from
  • validate : {'1:1' / '1:m' / 'm:1' / 'm:m'} checks that the merge is of the given kind. If the kind passed to validate differs from the actual merge, an error is raised
  • I built a simple dataframe similar to the dataset and tested it.
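A test along those lines might look like the sketch below. The frame `toy` and its values are invented; the column names `session`, `aid`, and `ts` follow the code above.

```python
import pandas as pd

# Hypothetical miniature of the dataset: two events in session 1, one in session 2.
toy = pd.DataFrame({
    "session": [1, 1, 2],
    "aid": [10, 11, 20],
    "ts": [100, 105, 200],
})

# Self-merge on 'session': every event is paired with every event of the same session.
pairs = toy.merge(toy, on="session")
n_all_pairs = len(pairs)  # 2*2 pairs for session 1 plus 1*1 for session 2

# Columns shared by both sides get the default suffixes '_x' and '_y'.
pairs = pairs[pairs.aid_x != pairs.aid_y]
print(pairs)
```

Only the two cross pairs from session 1 survive the filter, since session 2 has a single event that can only pair with itself.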

📌 nth
Suppose we have a DataFrame that looks roughly like this.

(1) nth with range(-1, 0) : -1

(2) nth with range(-2, 0) : -2 -1

(3) nth with range(-3, 0) : -3 -2 -1

  • Conclusion : within each group produced by grouping on session, nth takes the rows at the positions given in the list.
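The same experiment can be reproduced with a tiny frame. The data in `toy` is made up, and `as_index=False` is used as in the loop above; this is a sketch of the behaviour rather than a full description of `nth`.

```python
import pandas as pd

# Hypothetical events: session 1 has four rows, session 2 has two.
toy = pd.DataFrame({
    "session": [1, 1, 1, 1, 2, 2],
    "aid": [10, 11, 12, 13, 20, 21],
})

# list(range(-2, 0)) == [-2, -1], i.e. the last two rows of every session.
last_two = (
    toy.groupby("session", as_index=False)
    .nth(list(range(-2, 0)))
    .reset_index(drop=True)
)
print(last_two)
```

Session 1 keeps aids 12 and 13, while session 2 keeps both of its rows because it has no more than two.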

Next, I decided to apply this to df3, which I treat as the test set.

del df
test=pd.read_csv("/content/drive/MyDrive/khuda_recosys_kaggle/df_sample3.csv")
session_types = ['clicks', 'carts', 'orders']
test_session_AIDs = test.reset_index(drop=True).groupby('session')['aid'].apply(list)
test_session_types = test.reset_index(drop=True).groupby('session')['type'].apply(list)

labels=[]
no_data = 0
no_data_all_aids = 0
type_weight_multipliers = {0: 1, 1: 6, 2: 3}
for AIDs, types in zip(test_session_AIDs, test_session_types):
    # If this session in test_session_AIDs has 20 or more AIDs
    if len(AIDs) >= 20:
        # Take len(AIDs) exponents evenly spaced from 0.1 to 1, raise 2 to them, and subtract 1 to get the weight list.
        # Why does a larger index in the AIDs list get a larger weight?
        # What does the length mean?
        # How is the list ordered? = ?
        # (Presumably the events are in chronological order, so more recent aids get larger weights.)
        weights=np.logspace(0.1,1,len(AIDs),base=2, endpoint=True)-1
        # Create an empty defaultdict
        aids_temp=defaultdict(lambda: 0)
        # w : the position-based weight taken from the weights list
        # type_weight_multipliers[t] : weight by type (click, cart, order). hyperparam
        for aid,w,t in zip(AIDs,weights,types): 
            aids_temp[aid]+= w * type_weight_multipliers[t]
        
        # Sort aids_temp by score (descending) and keep only the aid numbers.
        sorted_aids=[k for k, v in sorted(aids_temp.items(), key=lambda item: -item[1])]
        # Keep only the top 20.
        labels.append(sorted_aids[:20])
        
    # If this session in test_session_AIDs has fewer than 20 AIDs
    else:
        # AIDs[::-1] walks the list in reverse (end -> start); dict.fromkeys then drops duplicates while keeping that order.
        AIDs = list(dict.fromkeys(AIDs[::-1]))
        # Remember how many AIDs there were before adding candidates.
        AIDs_len_start = len(AIDs)
        
        # Create an empty list to hold the candidates.
        candidates = []
        
        # Loop over the aids in the test session
        for AID in AIDs:
            # If the aid is in next_AIDs, add to the candidate list the 20 aids
            # that were counted most often in next_AIDs for that aid.
            if AID in next_AIDs: candidates += [aid for aid, count in next_AIDs[AID].most_common(20)]
        # Take the aid numbers of the Top 40 of the generated candidates,
        # excluding any aid number that is already in AIDs.
        AIDs += [AID for AID, cnt in Counter(candidates).most_common(40) if AID not in AIDs]
        
        # Take only the top 20 of the AIDs list and store them in labels.
        labels.append(AIDs[:20])
        # No data (this is the case where the candidate list is empty).
        if candidates == []: no_data += 1
        # No new aid was added (every candidate was already in AIDs).
        if AIDs_len_start == len(AIDs): no_data_all_aids += 1
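The branch for short sessions can be tried in isolation with the sketch below. The helper `fill_candidates`, its parameter `k`, and the `toy_next` counts are all hypothetical; the logic follows the else branch above with 20 replaced by a small `k` so the result is easy to check by hand.

```python
from collections import Counter


def fill_candidates(session_aids, next_aids, k=20):
    """Most recent aids first, then co-visited aids, capped at k entries."""
    # Reverse so the most recent aid comes first, then drop duplicates in that order.
    aids = list(dict.fromkeys(session_aids[::-1]))
    candidates = []
    for aid in aids:
        if aid in next_aids:
            # The k aids most often paired with this aid.
            candidates += [a for a, _ in next_aids[aid].most_common(k)]
    # Rank candidates by how often they were proposed, skipping aids already present.
    aids += [a for a, _ in Counter(candidates).most_common(2 * k) if a not in aids]
    return aids[:k]


# Hypothetical co-visitation counts.
toy_next = {10: Counter({12: 2, 11: 1}), 12: Counter({11: 3, 10: 2})}
result = fill_candidates([10, 12, 10], toy_next, k=3)
print(result)  # [10, 12, 11]
```

The session's own aids always come first, and the co-visitation counts are only used to fill the remaining slots.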

📌 numpy.logspace
numpy.logspace(start, stop, num, endpoint, base, dtype)
logspace returns values that are evenly spaced on a log scale over the given range, that is, num values from base**start to base**stop.

  • start - the starting exponent (the first value is base**start)
  • stop - the final exponent. It is included if endpoint is True and excluded if endpoint is False.
  • num - the number of elements in the array. default=50
  • endpoint - whether the stop value is included. True - included, False - excluded
  • base - the base of the log scale. default=10
  • dtype - the data type of the array
  • Example
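As a minimal sketch, the call from the prediction loop can be run on a short session. The length 5 is an arbitrary choice; the other arguments match the code above.

```python
import numpy as np

n = 5  # hypothetical session length
weights = np.logspace(0.1, 1, n, base=2, endpoint=True) - 1

# Equivalent formulation: 2 raised to n evenly spaced exponents in [0.1, 1], minus 1.
same = 2 ** np.linspace(0.1, 1, n) - 1

print(weights)
print(np.allclose(weights, same))
```

The weights increase with position, from about 0.07 for the first element up to 1 for the last one.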

📌 Counter( ) , most_common( )

  • collections.Counter(a) : counts the elements of a and returns them in a dictionary-like form, {element : count}
  • most_common() method - finding the most frequent values
    collections.Counter(a).most_common(n) : counts the elements of a and returns the n most frequent ones (as a list of tuples)
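A quick sketch of both calls, using an invented list of letters:

```python
from collections import Counter

counts = Counter(["a", "b", "a", "c", "a", "b"])
top2 = counts.most_common(2)

print(counts)  # Counter({'a': 3, 'b': 2, 'c': 1})
print(top2)    # [('a', 3), ('b', 2)]
```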
labels_as_strings = [' '.join([str(l) for l in lls]) for lls in labels]

predictions = pd.DataFrame(data={'session_type': test_session_AIDs.index, 'labels': labels_as_strings})

prediction_dfs = []

for st in session_types:
    modified_predictions = predictions.copy()
    modified_predictions.session_type = modified_predictions.session_type.astype('str') + f'_{st}'
    prediction_dfs.append(modified_predictions)

submission = pd.concat(prediction_dfs).reset_index(drop=True)

References
https://data-analysis-expertise.tistory.com/92
https://www.geeksforgeeks.org/python-range-function/
https://appia.tistory.com/154
