Grouping / Apply, Map

정한별 · June 23, 2024

📂 Grouping

Grouping means splitting the data into groups so you can look at it from a particular point of view.

🗒️ Compute the frequency of each host_name in the data, sort by host_name, and print the top 5.

df.groupby('host_name').size().sort_index()

Ans = df.host_name.value_counts().sort_index()
Ans.head()

  1. Ans = df.groupby('host_name').size().sort_index()
  2. Ans = df.host_name.value_counts().sort_index()

↪️
1. value_counts()

  1. Counts how many times each unique host_name value occurs.
  2. Sorts the result in descending order of count.
  3. Does not count null values by default.
    Passing dropna=False makes it count nulls as well.
  • size() - counts every row in each group, including rows with nulls in other columns (count(), by contrast, skips nulls). Note that groupby() itself still drops rows whose key is null unless dropna=False is passed.
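To make the null-handling difference concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data is an assumption for illustration, not the dataset used in this post):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one row has a missing host_name.
toy = pd.DataFrame({"host_name": ["Ann", "Bob", "Ann", np.nan]})

vc_default = toy["host_name"].value_counts()                  # NaN is excluded
vc_with_nan = toy["host_name"].value_counts(dropna=False)     # NaN is counted
gb_default = toy.groupby("host_name").size()                  # NaN key is dropped
gb_with_nan = toy.groupby("host_name", dropna=False).size()   # NaN key is kept

# Total number of rows each approach accounts for.
print(vc_default.sum(), vc_with_nan.sum(), gb_default.sum(), gb_with_nan.sum())
# 3 4 3 4
```

So when host_name itself contains nulls, both approaches skip those rows unless dropna=False is passed.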

📌 The keys you group by usually end up as the index of the result.

🗒️ Compute the frequency of each host_name in the data, sort by frequency, and print the top 5.

Ans = df.groupby('host_name').size().\
                to_frame().rename(columns={0:'counts'}).\
                sort_values('counts',ascending=False)
Ans.head(5)

↪️ When a statement gets too long, ending a line with '\' lets it continue on the next line.
to_frame() - converts the Series into a DataFrame.
rename(columns={0:'counts'}) - renames the column labelled 0 to 'counts'.
sort_values('counts',ascending=False) - sorts by the 'counts' column in descending order.
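As a side note, the same chain can probably be written a little more compactly. The sketch below is one alternative, not the method used above: it wraps the chain in parentheses instead of using '\', and names the count column directly with reset_index(name=...). The `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical) standing in for the real dataset.
toy = pd.DataFrame({"host_name": ["Ann", "Bob", "Ann", "Cy", "Ann", "Bob"]})

counts = (
    toy.groupby("host_name")
       .size()                                   # Series indexed by host_name
       .reset_index(name="counts")               # DataFrame: host_name, counts
       .sort_values("counts", ascending=False)   # largest count first
)
print(counts.head(5))
```

The difference from the to_frame() version is that host_name becomes a regular column here rather than the index.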

df.groupby('host_name').size()

df.host_name.value_counts().to_frame()

🗒️ Count the number of rows for each neighbourhood value within each neighbourhood_group.

type(Ans)

pandas.core.series.Series
↪️ The result is actually a Series, not a DataFrame.

Ans = df.groupby(['neighbourhood_group','neighbourhood'], as_index=False).size()
Ans.head()

↪️
1. groupby - you can also group by two keys at once.
2. as_index=False - the result would otherwise be a Series, so this option is used to get a DataFrame back.

↪️ The group keys become ordinary columns instead of the index, which makes the result easier to read at a glance.
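A small sketch of the difference, using a made-up three-row frame (the `toy` data is an assumption). This assumes a reasonably recent pandas, where size() with as_index=False returns a DataFrame with a column named 'size'.

```python
import pandas as pd

# Toy data (hypothetical).
toy = pd.DataFrame({
    "neighbourhood_group": ["A", "A", "B"],
    "neighbourhood": ["x", "x", "y"],
})
keys = ["neighbourhood_group", "neighbourhood"]

as_series = toy.groupby(keys).size()                 # keys live in a MultiIndex
as_frame = toy.groupby(keys, as_index=False).size()  # keys are ordinary columns

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(list(as_frame.columns))
```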

🗒️ Compute the mean, variance, maximum, and minimum of reviews_per_month for each neighbourhood_group.

Ans = df.groupby('neighbourhood_group')['reviews_per_month'].agg(['mean','var','max','min'])
Ans

↪️
agg() - declares which aggregation functions to apply; several aggregations can be computed in a single call.
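If you want to choose the output column names yourself, agg() also accepts keyword arguments (named aggregation). A minimal sketch on made-up numbers; the column names avg, variance, highest, and lowest are my own choices for illustration:

```python
import pandas as pd

# Toy data (hypothetical).
toy = pd.DataFrame({
    "neighbourhood_group": ["A", "A", "B", "B"],
    "reviews_per_month": [1.0, 3.0, 2.0, 2.0],
})

stats = toy.groupby("neighbourhood_group")["reviews_per_month"].agg(
    avg="mean",        # output column name = aggregation to apply
    variance="var",    # sample variance (ddof=1)
    highest="max",
    lowest="min",
)
print(stats)
```

Keep in mind that 'var' is the sample variance, so a group with a single row gives NaN.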

📌 Note

  1. "Compute it without hierarchical indexing" -> this asks you to flatten a result that shows up with a hierarchical (multi-level) index into a more intuitive layout.
  2. fillna(-999) - fills any missing values with -999.

🗒️ Count the room_type values within each neighbourhood_group, then compute each value's share of its neighbourhood_group total.

Ans = df[['neighbourhood_group','room_type']].groupby(['neighbourhood_group','room_type']).size().unstack()
Ans.loc[:,:] = (Ans.values /Ans.sum(axis=1).values.reshape(-1,1))
Ans

↪️
1. The columns [['neighbourhood_group','room_type']] are selected first, and groupby(['neighbourhood_group','room_type']) is applied to them.
2. size().unstack() - counts the rows in each group, then pivots the inner key (room_type) out into columns.
3. Ans.loc[:,:] - selects every cell, so the assignment overwrites all values in place.

Ans.values


↪️ Ans.values - returns the values as a two-dimensional NumPy array (a matrix).

Ans.sum(axis=1)


↪️ Ans.sum(axis=1) - sums across the columns, giving one total per row (that is, per neighbourhood_group).

Ans.sum(axis=1).values

**Ans.values /Ans.sum(axis=1).values.reshape(-1,1)**

↪️
1. Ans.values is divided by Ans.sum(axis=1).values.reshape(-1,1).
2. values.reshape(-1,1) turns the row totals into a column vector of shape (n, 1), so that NumPy broadcasting divides every row by its own total.
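The reshape step is easier to follow on a tiny example. The sketch below uses a made-up frame (`toy` is an assumption) and also shows DataFrame.div(..., axis=0), which should give the same ratios without dropping down to NumPy arrays:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): group B has no "Shared" rows.
toy = pd.DataFrame({
    "neighbourhood_group": ["A", "A", "A", "B"],
    "room_type": ["Private", "Private", "Shared", "Private"],
})

counts = toy.groupby(["neighbourhood_group", "room_type"]).size().unstack()
row_totals = counts.sum(axis=1)          # one total per neighbourhood_group

# Approach from the post: (n, k) array divided by an (n, 1) column vector.
manual = counts.values / row_totals.values.reshape(-1, 1)

# Alternative: let pandas align the totals with the row index.
ratios = counts.div(row_totals, axis=0)

print(ratios)
```

Combinations that never occur (B / Shared here) stay NaN in both versions; unstack(fill_value=0) is one way to turn them into zeros.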

📂 Apply, Map

🗒️ Using the map function, recode the categories of Income_Category as follows and store the result in a newIncome column.

dic = {
    'Unknown'        : 'N',
    'Less than $40K' : 'a',
    '$40K - $60K'    : 'b',
    '$60K - $80K'    : 'c',
    '$80K - $120K'   : 'd',
    '$120K +'        : 'e'
}

df['newIncome'] = df.Income_Category.map(lambda x: dic[x])

Ans = df[['newIncome', 'Income_Category']]
Ans.head()

↪️
1. dic - a dictionary, made up of key-value pairs.
2. Each value of Income_Category is passed in as x.
3. dic[x] - looks up the income category x in the dictionary above and returns its new label.
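One thing worth knowing: map also accepts the dictionary directly, and the two forms behave differently when a value is missing from the dictionary. A minimal sketch, where the "Other" category is a made-up value that is not in dic:

```python
import pandas as pd

dic = {
    'Unknown'        : 'N',
    'Less than $40K' : 'a',
    '$40K - $60K'    : 'b',
    '$60K - $80K'    : 'c',
    '$80K - $120K'   : 'd',
    '$120K +'        : 'e'
}

# "Other" is hypothetical: it does not appear as a key in dic.
s = pd.Series(["Unknown", "$40K - $60K", "$120K +", "Other"])

mapped = s.map(dic)          # unmapped values become NaN
print(mapped.tolist()[:3])   # ['N', 'b', 'e']

try:
    s.map(lambda x: dic[x])  # the lambda form fails on the unknown key
    raised = False
except KeyError:
    raised = True
print(raised)                # True
```

Which behaviour you want depends on whether an unexpected category should be an error or a missing value.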

📌
1. A lambda function is one of the key ideas in functional programming; it is also called an anonymous function.
2. A lambda is a function without a name, and it is especially handy when a function is used only once or has to be passed as an argument.
3. apply() is commonly used to apply a user-defined function to a DataFrame or a Series.
4. def - used when you want to define your own function.

🗒️ Using Customer_Age, define the age band as an AgeState column (0-9: 0, 10-19: 10, 20-29: 20, …) and print the frequency of each band.

df['AgeState']  = df.Customer_Age.map(lambda x: x//10 *10)

Ans = df['AgeState'].value_counts().sort_index()
Ans

↪️
1. Customer_Age - a continuous numeric variable.
2. map(lambda x: ...) - transforms each value with the given function.
3. x//10 - here // is floor division, which returns the quotient.

   49//10*10

40
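Since // and * also work on a whole Series at once, the lambda is optional here. A small sketch with made-up ages (the `ages` values are an assumption):

```python
import pandas as pd

# Toy ages (hypothetical).
ages = pd.Series([7, 26, 49, 50, 65])

by_map = ages.map(lambda x: x // 10 * 10)   # element by element, as in the post
vectorized = ages // 10 * 10                # same arithmetic on the whole Series

print(vectorized.tolist())                  # [0, 20, 40, 50, 60]
```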

🗒️ Define a newEduLevel column that is 1 when the Education_Level value contains the word Graduate and 0 otherwise, then print its frequencies.

df['newEduLevel'] = df.Education_Level.map(lambda x : 1 if 'Graduate' in x else 0)
Ans = df['newEduLevel'].value_counts()
Ans

↪️
1. 1 if 'Graduate' in x else 0 - a conditional expression, the one-line form of if/else.
2. map(lambda x : 1 if 'Graduate' in x else 0) - returns 1 if x contains 'Graduate' and 0 otherwise.
Replacing map with apply here gives the same result.

import numpy as np
df['newEduLevel'] = np.where( df.Education_Level.str.contains('Graduate'), 1, 0)
Ans = df['newEduLevel'].value_counts()
Ans

↪️
1. np.where( df.Education_Level.str.contains('Graduate'), 1, 0)

↪️ np.where returns 1 where the condition (df.Education_Level.str.contains('Graduate')) holds and 0 elsewhere.
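The two versions can diverge when the column contains missing values: 'Graduate' in x fails on a float NaN, while str.contains returns NaN for that row unless told otherwise. The sketch below assumes a made-up Series with one missing entry and uses na=False so missing values count as "no match":

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the last entry is missing.
edu = pd.Series(["Graduate", "Post-Graduate", "High School", np.nan])

# na=False treats missing values as not containing the word.
flag = np.where(edu.str.contains("Graduate", na=False), 1, 0)

print(flag.tolist())   # [1, 1, 0, 0]
```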

📌 As the two snippets just above show, the same result can be reached through different functions.

🗒️ Define a newState column that is 1 when Marital_Status is Married and Card_Category is Platinum, and 0 in every other case. Print the frequency of each newState value.

def check(x):
   if x.Marital_Status =='Married' and x.Card_Category =='Platinum':
       return 1
   else:
       return 0


df['newState'] = df.apply(check,axis=1)

Ans  = df['newState'].value_counts()
Ans

↪️
1. For logic like this, defining a function and passing it to apply is preferable to map.
2. axis=1 - apply needs to know whether to pass rows or columns to the function; axis=1 passes each row, so check runs once per row.
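For a simple two-column condition like this one, a boolean mask is a common alternative to a row-wise apply. This is a sketch on a made-up three-row frame (`toy` is an assumption), shown side by side with the apply version for comparison:

```python
import pandas as pd

# Toy data (hypothetical).
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Married", "Single"],
    "Card_Category": ["Platinum", "Blue", "Platinum"],
})

def check(x):
    # x is one row of the frame because apply is called with axis=1.
    if x.Marital_Status == 'Married' and x.Card_Category == 'Platinum':
        return 1
    else:
        return 0

by_apply = toy.apply(check, axis=1)

# Same condition evaluated on whole columns; True/False is cast to 1/0.
by_mask = (
    (toy["Marital_Status"] == "Married") & (toy["Card_Category"] == "Platinum")
).astype(int)

print(by_mask.tolist())   # [1, 0, 0]
```

The mask version tends to be faster on large frames, while apply stays readable when the logic has many branches.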

📌 Note
df['Gender'] = df.Gender.apply(changeGender)
↪️ This is applied to a single column, Gender (a Series), so no axis needs to be specified.
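The body of changeGender is not shown in this post, so the version below is purely a hypothetical stand-in (including the 'M' / 'F' codes and the output labels) to show what calling apply on a single column looks like:

```python
import pandas as pd

# Hypothetical stand-in: the real changeGender is not shown in the post.
def changeGender(x):
    return "male" if x == "M" else "female"

# Toy data (hypothetical).
toy = pd.DataFrame({"Gender": ["M", "F", "M"]})

# Series.apply passes one value at a time, so there is no axis argument.
toy["Gender"] = toy.Gender.apply(changeGender)

print(toy["Gender"].tolist())   # ['male', 'female', 'male']
```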
