๐Ÿงญ ๋ฐ์ดํ„ฐ ๊ฒฐํ•ฉ๊ณผ ์ „์ฒ˜๋ฆฌ (Merge ยท Join ยท Concat ยท Apply)

okorionยท2025๋…„ 10์›” 29์ผ
0

1. ๋ฐ์ดํ„ฐ ๊ฒฐํ•ฉ์˜ ๊ฐœ๋…

ํ˜„์‹ค์˜ ๋ฐ์ดํ„ฐ๋Š” ํ•œ ํŒŒ์ผ์— ๋ชจ๋‘ ๋“ค์–ด ์žˆ์ง€ ์•Š๋‹ค.
๊ณ ๊ฐ ์ •๋ณด, ๊ฑฐ๋ž˜ ๋‚ด์—ญ, ์„ค๋ฌธ ์‘๋‹ต ๋“ฑ์ด ์„œ๋กœ ๋‹ค๋ฅธ DataFrame์œผ๋กœ ์กด์žฌํ•œ๋‹ค.
์ด๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ํ•ฉ์น˜๋Š” ๋ฐฉ๋ฒ•์ด ๋ฐ”๋กœ ๊ฒฐํ•ฉ(Concatenate), ๋ณ‘ํ•ฉ(Merge), ์—ฐ๊ฒฐ(Join)์ด๋‹ค.

๊ตฌ๋ถ„๋ชฉ์ ๊ธฐ์ค€ ์—ด ํ•„์š” ์—ฌ๋ถ€๋Œ€ํ‘œ ํ•จ์ˆ˜
concat()์œ„์•„๋ž˜ ๋˜๋Š” ์ขŒ์šฐ๋กœ ๋‹จ์ˆœ ์—ฐ๊ฒฐํ•„์š” ์—†์Œpd.concat()
append()concat์˜ ๊ฐ„๊ฒฐํ•œ ํ˜•ํƒœ (deprecated ์˜ˆ์ •)ํ•„์š” ์—†์Œdf1.append(df2)
merge()SQL์˜ JOIN์ฒ˜๋Ÿผ ๊ณตํ†ต ์—ด ๊ธฐ์ค€ ๋ณ‘ํ•ฉํ•„์š” ์žˆ์Œpd.merge()
join()์ธ๋ฑ์Šค ๊ธฐ์ค€ ๋ณ‘ํ•ฉ๊ธฐ๋ณธ์ ์œผ๋กœ ์ธ๋ฑ์Šค ๊ธฐ์ค€df1.join(df2)

2. pd.concat() โ€” ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ด์–ด ๋ถ™์ด๊ธฐ

2.1 ๊ธฐ๋ณธ ๊ฐœ๋…

๋‘ DataFrame์„ ํ•˜๋‚˜๋กœ ํ•ฉ์ณ ํ–‰(row)์„ ๋Š˜๋ฆฌ๋Š” ๋ฐฉ์‹.
๋‹จ์ˆœํ•œ ์ˆ˜์ง ๊ฒฐํ•ฉ์œผ๋กœ, ๊ณตํ†ต ์—ด ๊ตฌ์กฐ๋งŒ ๋งž์œผ๋ฉด ๋œ๋‹ค.

import pandas as pd

bank1_df = pd.DataFrame({
    'Bank Client ID': [1, 2, 3, 4, 5],
    'First Name': ['Nancy', 'Alex', 'Shep', 'Max', 'Allen'],
    'Last Name': ['Rob', 'Ali', 'George', 'Mitch', 'Steve']
})

bank2_df = pd.DataFrame({
    'Bank Client ID': [6, 7, 8, 9, 10],
    'First Name': ['Bill', 'Dina', 'Sarah', 'Heather', 'Holly'],
    'Last Name': ['Christian', 'Moe', 'Steve', 'Rob', 'Bob']
})

bank_all_df = pd.concat([bank1_df, bank2_df])
  • ์ธ๋ฑ์Šค๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ ์œ ์ง€๋œ๋‹ค (์ค‘๋ณต๋จ).
  • ignore_index=True๋ฅผ ์ฃผ๋ฉด ์ž๋™์œผ๋กœ ์žฌ์ •๋ ฌ๋œ๋‹ค.
bank_all_df = pd.concat([bank1_df, bank2_df], ignore_index=True)

โžก ๊ฒฐ๊ณผ:
0~9๊นŒ์ง€ ์ƒˆ๋กœ์šด ์ธ๋ฑ์Šค๊ฐ€ ์ž๋™ ์ƒ์„ฑ๋œ๋‹ค.


2.2 append() vs concat()

๊ณผ๊ฑฐ์—” append()๋„ ์‚ฌ์šฉํ–ˆ๋‹ค:

bank_all_df = bank1_df.append(bank2_df, ignore_index=True)

ํ•˜์ง€๋งŒ Pandas 2.0 ์ดํ›„ ๋น„๊ถŒ์žฅ(deprecated) ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์—
ํ•ญ์ƒ pd.concat()์„ ์‚ฌ์šฉํ•  ๊ฒƒ์„ ๊ถŒ์žฅํ•œ๋‹ค.


2.3 MultiIndex๋กœ ๊ทธ๋ฃน ๊ฒฐํ•ฉ

keys ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ƒ์œ„ ์ธ๋ฑ์Šค๊ฐ€ ์ถ”๊ฐ€๋œ๋‹ค.

bank_all_df = pd.concat(
    [bank1_df, bank2_df],
    keys=['Customer Group 1', 'Customer Group 2']
)

๊ฒฐ๊ณผ๋Š” ๋ฉ€ํ‹ฐ ์ธ๋ฑ์Šค(MultiIndex) ํ˜•ํƒœ๊ฐ€ ๋œ๋‹ค.

(Customer Group, Index)...
(Customer Group 1, 0)...
(Customer Group 1, 1)...
(Customer Group 2, 0)...

๋ฉ€ํ‹ฐ ์ธ๋ฑ์Šค ๋ฐ์ดํ„ฐ๋Š” ๊ณ„์ธต์  ์ ‘๊ทผ์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

bank_all_df.loc['Customer Group 1']     # ํŠน์ • ๊ทธ๋ฃน ์ „์ฒด
bank_all_df.loc[('Customer Group 2', 3)] # ํŠน์ • ๊ทธ๋ฃน์˜ ํ–‰

โžก ํ™œ์šฉ ์˜ˆ์‹œ:

  • ์—ฌ๋Ÿฌ ์ง€์ /๊ตญ๊ฐ€/๊ธฐ๊ฐ„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ๋ถ„ํ•ด ์ €์žฅ
  • ๊ทธ๋ฃน ๋‹จ์œ„ ์—ฐ์‚ฐ ๋ฐ ํ•„ํ„ฐ๋ง

3. pd.merge() โ€” ๊ณตํ†ต ์—ด ๊ธฐ์ค€ ๋ณ‘ํ•ฉ

3.1 ๊ธฐ๋ณธ ๊ตฌ์กฐ

pd.merge(left, right, on='๊ณตํ†ต์—ด', how='inner')
  • on: ๊ธฐ์ค€ ์—ด ์ง€์ •
  • how: ๊ฒฐํ•ฉ ๋ฐฉ์‹ ์„ ํƒ
    • 'inner': ๊ณตํ†ต ํ‚ค๋งŒ (๊ต์ง‘ํ•ฉ)
    • 'outer': ๋ชจ๋“  ํ‚ค ํฌํ•จ (ํ•ฉ์ง‘ํ•ฉ)
    • 'left': ์™ผ์ชฝ ๊ธฐ์ค€ ์œ ์ง€
    • 'right': ์˜ค๋ฅธ์ชฝ ๊ธฐ์ค€ ์œ ์ง€

3.2 ์‹ค์Šต ์˜ˆ์‹œ

๊ธฐ์กด ๊ณ ๊ฐ ๋ฐ์ดํ„ฐ + ์—ฐ๋ด‰ ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ

bank_all_df = pd.concat([bank1_df, bank2_df], ignore_index=True)

bank_salary_df = pd.DataFrame({
    'Bank Client ID': [1,2,3,4,5,6,7,8,9,10],
    'Annual Salary [$]': [2983, 4176, 5369, 5726, 4900, 3200, 3300, 3400, 2744, 2625]
})

merged_df = pd.merge(bank_all_df, bank_salary_df, on='Bank Client ID')

๊ฒฐ๊ณผ:
ํ•˜๋‚˜์˜ DataFrame์— ๊ณ ๊ฐ ID, ์ด๋ฆ„, ์„ฑ, ์—ฐ๋ด‰์ด ๋ชจ๋‘ ํฌํ•จ๋œ๋‹ค.


3.3 ๋‹ค์–‘ํ•œ ๋ณ‘ํ•ฉ ์˜ˆ์ œ

์œ ํ˜•์ฝ”๋“œ์„ค๋ช…
Inner Joinpd.merge(df1, df2, on='ID')๊ณตํ†ต ID๋งŒ ์œ ์ง€
Outer Joinpd.merge(df1, df2, on='ID', how='outer')๋ชจ๋“  ID ํฌํ•จ, ๊ฒฐ์ธก๊ฐ’ NaN
Left Joinpd.merge(df1, df2, on='ID', how='left')df1 ๊ธฐ์ค€ ์œ ์ง€
Right Joinpd.merge(df1, df2, on='ID', how='right')df2 ๊ธฐ์ค€ ์œ ์ง€

3.4 ์‹ค์ œ ํ™•์žฅ ์˜ˆ์‹œ

์นด๋“œ๋นš(Credit Card Debt)๊ณผ ๋‚˜์ด(Age) ์ •๋ณด๋„ ํ•ฉ์น  ์ˆ˜ ์žˆ๋‹ค.

bank_credit_age_df = pd.DataFrame({
    'Bank Client ID': [1,2,3,4,5,6,7,8,9,10],
    'Credit Card Debt': [119, 12, 60, 72, 0, 20, 360, 127, 3000, 262],
    'Age': [44, 35, 67, 19, 22, 45, 48, 33, 34, 36]
})

bank_all_df = pd.merge(merged_df, bank_credit_age_df, on='Bank Client ID')

๊ฒฐ๊ณผ:
๋ชจ๋“  ๊ณ ๊ฐ ๋ฐ์ดํ„ฐ(์ด๋ฆ„ยท์—ฐ๋ด‰ยท์นด๋“œ๋นšยท๋‚˜์ด)๊ฐ€ ํ•œ ํ…Œ์ด๋ธ”์— ํ†ตํ•ฉ๋œ๋‹ค.


4. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ์™€ apply()

4.1 apply() ๊ธฐ๋ณธ ๊ฐœ๋…

๊ฐ ํ–‰(row) ๋˜๋Š” ์—ด(column)์— ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•œ๋‹ค.
๋ณต์žกํ•œ ์—ฐ์‚ฐ์„ ๋ฒกํ„ฐํ™” ์—ฐ์‚ฐ์ฒ˜๋Ÿผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์–ด ํšจ์œจ์ ์ด๋‹ค.

bank_all_df['Debt Ratio'] = bank_all_df.apply(
    lambda row: row['Credit Card Debt'] / row['Annual Salary [$]'],
    axis=1
)
  • axis=1 โ†’ ํ–‰ ๊ธฐ์ค€
  • lambda โ†’ ๊ฐ„๋‹จํ•œ ํ•œ ์ค„์งœ๋ฆฌ ํ•จ์ˆ˜

4.2 ์˜ˆ์‹œ: ์กฐ๊ฑด ๊ธฐ๋ฐ˜ ์ „์ฒ˜๋ฆฌ

bank_all_df['Risk Level'] = bank_all_df['Debt Ratio'].apply(
    lambda x: 'High' if x > 0.5 else 'Low'
)

โžก ๊ณ ๊ฐ๋ณ„ ๋ถ€์ฑ„ ๋น„์œจ๊ณผ ๋ฆฌ์Šคํฌ ๋“ฑ๊ธ‰์„ ์ž๋™ ๊ณ„์‚ฐ.


4.3 applymap()๊ณผ map() ๋น„๊ต

ํ•จ์ˆ˜์ ์šฉ ๋‹จ์œ„์ฃผ์š” ์‚ฌ์šฉ์ฒ˜
map()Series(1์ฐจ์›)ํ•œ ์—ด์— ๊ฐ„๋‹จํ•œ ๋ณ€ํ™˜ ์ ์šฉ
apply()ํ–‰ ๋˜๋Š” ์—ด ๋‹จ์œ„๋‹ค์ค‘ ์—ด ๊ธฐ๋ฐ˜ ๊ณ„์‚ฐ
applymap()์ „์ฒด DataFrame ์š”์†Œ๋ณ„์ˆซ์ž๋‚˜ ๋ฌธ์ž์—ด ํฌ๋งท ์ผ๊ด„ ๋ณ€๊ฒฝ

์˜ˆ์‹œ:

df.map(str.upper)        # Series
df.apply(np.log1p)       # DataFrame ์ˆ˜ํ•™ ๋ณ€ํ™˜
df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

5. ํ•ต์‹ฌ ์š”์•ฝ

์ฃผ์ œํ•ต์‹ฌ ๋ฌธ๋ฒ•์‹ค๋ฌด ํ™œ์šฉ ํฌ์ธํŠธ
concatpd.concat([df1, df2], ignore_index=True)๋‹จ์ˆœ ๊ฒฐํ•ฉ, ๋ฐ์ดํ„ฐ ์ถ•์ 
keyspd.concat(..., keys=['A','B'])๋ฉ€ํ‹ฐ ์ธ๋ฑ์Šค ๊ตฌ์„ฑ
mergepd.merge(df1, df2, on='๊ณตํ†ต์—ด')SQL์‹ ๋ณ‘ํ•ฉ
joindf1.join(df2)์ธ๋ฑ์Šค ๊ธฐ์ค€ ๋ณ‘ํ•ฉ
applydf.apply(lambda row: ...)์‚ฌ์šฉ์ž ์ •์˜ ์ „์ฒ˜๋ฆฌ
lambdalambda x: x*1000๋น ๋ฅธ ํ•จ์ˆ˜ ์ •์˜

6. ํ•™์Šต ์ •๋ฆฌ

  • concat โ†’ ๋ฐ์ดํ„ฐ ์Œ“๊ธฐ (ํ–‰ ๊ธฐ์ค€)
  • merge โ†’ ๋ฐ์ดํ„ฐ ๋งž๋ถ™์ด๊ธฐ (์—ด ๊ธฐ์ค€)
  • apply โ†’ ๋ฐ์ดํ„ฐ ๋‹ค๋“ฌ๊ธฐ (ํ•จ์ˆ˜ ์ ์šฉ)
  • ์ „์ฒ˜๋ฆฌ๋ž€ โ€œ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ธฐ ์ข‹์€ ํ˜•ํƒœ๋กœ ๋งŒ๋“œ๋Š” ๊ณผ์ •โ€์ด๋ฉฐ, Pandas๋Š” ์ด๋ฅผ ์œ„ํ•œ ๊ฑฐ์˜ ๋ชจ๋“  ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•œ๋‹ค.

๐Ÿงฉ ์‹ค์ „ ์ ์šฉ ์•„์ด๋””์–ด

  1. ๋งˆ์ผ€ํŒ… ๋ฐ์ดํ„ฐ ํ†ตํ•ฉ
    • ๊ณ ๊ฐ ํ…Œ์ด๋ธ” + ๊ตฌ๋งค ํ…Œ์ด๋ธ” + ์„ค๋ฌธ ํ…Œ์ด๋ธ”์„ merge
    • ๊ณตํ†ต ํ‚ค: ๊ณ ๊ฐ ID
  2. ์‹ ์šฉ ๋ฆฌ์Šคํฌ ๋ชจ๋ธ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ ๊ตฌ์„ฑ
    • apply๋กœ ๋ถ€์ฑ„๋น„์œจ, ์—ฐ๋ น๋Œ€ ์นดํ…Œ๊ณ ๋ฆฌํ™” ๋“ฑ ์ž๋™ ๊ณ„์‚ฐ
  3. ์ง€์ ๋ณ„ ์˜์—… ๋ฐ์ดํ„ฐ
    • keys๋กœ ๋ฉ€ํ‹ฐ ์ธ๋ฑ์Šค ๊ตฌ์„ฑ ํ›„ ๊ทธ๋ฃน ๋ถ„์„
profile
okorion's Tech Study Blog.

0๊ฐœ์˜ ๋Œ“๊ธ€