Pandas: Tutorial 2๏ธโƒฃ

m_ngyeongยท2025๋…„ 5์›” 28์ผ
0

Pandas

๋ชฉ๋ก ๋ณด๊ธฐ
4/4
post-thumbnail

๐Ÿง Pandas: Tutorial


๐Ÿ”ธ Filtering Data

๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

https://github.com/justmarkham/DAT8/blob/master/data/chipotle.tsv
import pandas as pd

data_url = 'https://raw.githubusercontent.com/Datamanim/pandas/main/chipo.csv'
df = pd.read_csv(data_url)

.loc[์กฐ๊ฑด]

# ์‹๋‹น๋ฐ์ดํ„ฐ : 
print('๐Ÿ• quantity ๊ฐ’์ด 3์ธ ๋ฐ์ดํ„ฐ ์ƒ์œ„ 5๊ฐœ ์ถœ๋ ฅ:')
print(df.loc[df['quantity']==3].head())
  • df['quantity']==3 : quantity๋ผ๋Š” ์—ด์˜ ๊ฐ’์ด 3๊ณผ ๊ฐ™์€ ๋น„๊ตํ•˜์—ฌ True/False ์ค‘์— ๋ฐ˜ํ™˜.
  • df.loc[์กฐ๊ฑด] : .loc[]๋Š” ์กฐ๊ฑด์— ๋งž๋Š” ํ–‰๋“ค์„ ์„ ํƒ, quantity == 3์ธ ํ–‰๋“ค๋งŒ ์„ ํƒ
  • .head() : ์ƒ์œ„ 5๊ฐœ ํ–‰๋งŒ ๋ณด์—ฌ์ฃผ๋Š” ํ•จ์ˆ˜.
    .head(3)์ด๋ฉด ์ƒ์œ„ 3๊ฐœ, .head(10)์ด๋ฉด 10๊ฐœ.
๐Ÿ• quantity ๊ฐ’์ด 3์ธ ๋ฐ์ดํ„ฐ ์ƒ์œ„ 5๊ฐœ ์ถœ๋ ฅ:
     order_id  quantity          item_name                                 choice_description item_price
409       178         3       Chicken Bowl  [[Fresh Tomato Salsa (Mild), Tomatillo-Green C...    $32.94 
445       193         3               Bowl  [Braised Carnitas, Pinto Beans, [Sour Cream, C...    $22.20 
689       284         3  Canned Soft Drink                                        [Diet Coke]     $3.75 
818       338         3      Bottled Water                                                NaN     $3.27 
850       350         3  Canned Soft Drink                                           [Sprite]     $3.75 

.reset_index(): ์ธ๋ฑ์Šค ์žฌ์ •๋ ฌ

print('๐Ÿ• quantity ๊ฐ’์ด 3์ธ ๋ฐ์ดํ„ฐ ์ƒ์œ„ 5๊ฐœ๋ฅผ ์ถœ๋ ฅํžˆ๋Š”๋ฐ ์ธ๋ฑ์Šค๋ฅผ 0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๊ฒŒ ์žฌ์ •์˜:')
print(df.loc[df['quantity']==3].head().reset_index(drop=True))
  • .reset_index() :
    • ๊ธฐ์กด์˜ ์ธ๋ฑ์Šค๋ฅผ ์ƒˆ๋กœ์šด ์—ด๋กœ ์ด๋™์‹œํ‚ด
    • ์ƒˆ๋กœ์šด 0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ์ธ๋ฑ์Šค๋ฅผ ์ƒ์„ฑ
  • .reset_index(drop=True):
    • ๊ธฐ์กด ์ธ๋ฑ์Šค๋ฅผ ์™„์ „ํžˆ ๋ฒ„๋ฆผ
    • 0๋ถ€ํ„ฐ ๋‹ค์‹œ ๋งค๊ธด ์ƒˆ๋กœ์šด ์ธ๋ฑ์Šค๋งŒ ์‚ฌ์šฉ
    • ๐Ÿ‘‰ ๊ธฐ์กด ์ธ๋ฑ์Šค๋ฅผ ์—ด๋กœ ๋ณด์กดํ•˜์ง€ ์•Š์Œ
๐Ÿ• quantity ๊ฐ’์ด 3์ธ ๋ฐ์ดํ„ฐ ์ƒ์œ„ 5๊ฐœ๋ฅผ ์ถœ๋ ฅํžˆ๋Š”๋ฐ ์ธ๋ฑ์Šค๋ฅผ 0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๊ฒŒ ์žฌ์ •์˜:
   order_id  quantity          item_name                                 choice_description item_price
0       178         3       Chicken Bowl  [[Fresh Tomato Salsa (Mild), Tomatillo-Green C...    $32.94 
1       193         3               Bowl  [Braised Carnitas, Pinto Beans, [Sour Cream, C...    $22.20 
2       284         3  Canned Soft Drink                                        [Diet Coke]     $3.75 
3       338         3      Bottled Water                                                NaN     $3.27 
4       350         3  Canned Soft Drink                                           [Sprite]     $3.75 

์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์ •์˜

print('๐Ÿ†• quantity, item_price์˜ ์ปฌ๋Ÿผ์œผ๋กœ ๊ตฌ์„ฑ๋œ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ์ •์˜:')
print(df[['quantity', 'item_price']])
๐Ÿ†• quantity, item_price์˜ ์ปฌ๋Ÿผ์œผ๋กœ ๊ตฌ์„ฑ๋œ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ์ •์˜:
      quantity item_price
0            1     $2.39 
1            1     $3.39 
2            1     $3.39 
3            1     $2.39 
4            2    $16.98 
...        ...        ...
4617         1    $11.75 
4618         1    $11.75 
4619         1    $11.25 
4620         1     $8.75 
4621         1     $8.75 

[4622 rows x 2 columns]

โ“ ์™œ df['quantity', 'item_price'] ์ด๋ ‡๊ฒŒ ์“ฐ๋ฉด ์•ˆ๋˜๋‚˜์š”?
์ด๊ฑด Pandas ์ž…์žฅ์—์„  ํŠœํ”Œ ํ•˜๋‚˜๋ฅผ ์ธ๋ฑ์‹ฑํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ํ•ด์„๋œ๋‹ค. ๋ ๋ผ์„œ df[['quantity', 'item_price']]์ฒ˜๋Ÿผ ๋Œ€๊ด„ํ˜ธ 2๊ฐœ๋ฅผ ์ž‘์„ฑํ•ด์•ผ๋งŒ ์ปฌ๋Ÿผ ๋‘ ๊ฐœ๋ฅผ ์„ ํƒํ•˜๋ผ๋Š” ์˜๋ฏธ๋กœ ํ•ด์„๋œ๋‹ค.

ํ‘œํ˜„์˜๋ฏธ๋™์ž‘ ์—ฌ๋ถ€
df['quantity']ํ•˜๋‚˜์˜ ์ปฌ๋Ÿผ (์‹œ๋ฆฌ์ฆˆ ๋ฐ˜ํ™˜)โœ…
df[['quantity']]ํ•˜๋‚˜์˜ ์ปฌ๋Ÿผ (๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ๋ฐ˜ํ™˜)โœ…
df[['quantity', 'item_price']]๋‘ ๊ฐœ ์ด์ƒ์˜ ์ปฌ๋Ÿผ ์„ ํƒโœ…
df['quantity', 'item_price']ํŠœํ”Œ ํ‚ค โ†’ ํ•ด๋‹น ์ด๋ฆ„ ๊ฐ€์ง„ ์—ด์„ ์ฐพ์Œ (์˜ค๋ฅ˜)โŒ ์˜ค๋ฅ˜

.astype() : type ๋ณ€๊ฒฝ

print('๐Ÿ’ต item_price์— $๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  float ํƒ€์ž…์œผ๋กœ ๋ณ€๊ฒฝํ•˜์—ฌ item_price_float์— ์ €์žฅ:')
df['item_price_float'] = df['item_price'].str.replace('$', '', regex=False).astype(float)
print(df['item_price_float'].head())
  • .str.replace('$', '', regex=False) โ†’ $ ๊ธฐํ˜ธ๋ฅผ ์—†์•ฐ
    • .str.replace(): Pandas๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ pat์„ ์ •๊ทœํ‘œํ˜„์‹(Regex)์œผ๋กœ ์ธ์‹ํ•˜์—ฌ, regex=True๊ฐ€ ๊ธฐ๋ณธ๊ฐ’
    • $๋Š” "๋ฌธ์žฅ์˜ ๋"์„ ์˜๋ฏธํ•˜๋Š” ๋ฉ”ํƒ€๋ฌธ์ž๋กœ regex=False๋ฅผ ์ž‘์„ฑํ•ด ๋ฌธ์ž์˜ ์˜๋ฏธ๋ฅผ ์—†์• ์คŒ
  • .astype(float) โ†’ ๋ฌธ์ž์—ด "3.50" โ†’ ์‹ค์ˆ˜ 3.50์œผ๋กœ ๋ณ€ํ™˜
๐Ÿ’ต item_price์— $๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  float ํƒ€์ž…์œผ๋กœ ๋ณ€๊ฒฝํ•˜์—ฌ item_price_float์— ์ €์žฅ:
0     2.39
1     3.39
2     3.39
3     2.39
4    16.98
Name: item_price_float, dtype: float64

๐Ÿง

print('๐Ÿ— item_name์ด "Chicken Salad Bowl"์ธ ๊ฒƒ๊ณผ item_price_float๊ฐ€ 9 ์ดํ•˜์ธ ๊ฒƒ์„ ์ฐพ์•„๋ผ:')
filtered_df = df.loc[(df.item_name == "Chicken Salad Bowl") & (df.item_price_float <= 9)]
print("โ˜‘๏ธ ์กฐ๊ฑด์— ๋งž๋Š” ์ƒ์œ„ 5๊ฐœ: ")
print(filtered_df.head())
print('โ˜‘๏ธ ์ „์ฒด ๊ฐœ์ˆ˜:', len(filtered_df))
๐Ÿ— item_name์ด "Chicken Salad Bowl"์ธ ๊ฒƒ๊ณผ item_price_float๊ฐ€ 9 ์ดํ•˜์ธ ๊ฒƒ์„ ์ฐพ์•„๋ผ:
โ˜‘๏ธ ์กฐ๊ฑด์— ๋งž๋Š” ์ƒ์œ„ 5๊ฐœ: 
     order_id  quantity           item_name                                 choice_description item_price  item_price_float
44         20         1  Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...     $8.75               8.75
256       111         1  Chicken Salad Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Rice,...     $8.75               8.75
526       220         1  Chicken Salad Bowl  [Roasted Chili Corn Salsa, [Black Beans, Sour ...     $8.75               8.75
528       221         1  Chicken Salad Bowl  [Tomatillo Green Chili Salsa, [Fajita Vegetabl...     $8.75               8.75
529       221         1  Chicken Salad Bowl  [Tomatillo Green Chili Salsa, [Fajita Vegetabl...     $8.75               8.75
โ˜‘๏ธ ์ „์ฒด ๊ฐœ์ˆ˜: 56

.sort_values(): ์˜ค๋ฆ„์ฐจ์ˆœ ์ •๋ ฌ

print('๐Ÿ’ต item_price_float๋ฅผ ์˜ค๋ฆ„์ฐจ์ˆœ์œผ๋กœ ์ •๋ฆฌํ•˜๊ณ  index๋ฅผ ์ดˆ๊ธฐํ™”:')
print(df.sort_values('item_price_float').reset_index(drop=True).head())
๐Ÿ’ต item_price_float๋ฅผ ์˜ค๋ฆ„์ฐจ์ˆœ์œผ๋กœ ์ •๋ฆฌํ•˜๊ณ  index๋ฅผ ์ดˆ๊ธฐํ™”:
   order_id  quantity      item_name choice_description item_price  item_price_float
0       471         1  Bottled Water                NaN     $1.09               1.09
1       338         1    Canned Soda        [Coca Cola]     $1.09               1.09
2      1575         1    Canned Soda       [Dr. Pepper]     $1.09               1.09
3        47         1    Canned Soda       [Dr. Pepper]     $1.09               1.09
4      1014         1    Canned Soda        [Coca Cola]     $1.09               1.09

ascending=False: ๋‚ด๋ฆผ์ฐจ์ˆœ

print('๐Ÿ’ต item_price_float๋ฅผ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ฆฌํ•˜๊ณ  index๋ฅผ ์ดˆ๊ธฐํ™”:')
print(df.sort_values('item_price_float', ascending=False).reset_index(drop=True).head())
๐Ÿ’ต item_price_float๋ฅผ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ฆฌํ•˜๊ณ  index๋ฅผ ์ดˆ๊ธฐํ™”:
   order_id  quantity                     item_name                                 choice_description item_price  item_price_float
0      1443        15  Chips and Fresh Tomato Salsa                                                NaN    $44.25              44.25
1      1398         3                 Carnitas Bowl  [Roasted Chili Corn Salsa, [Fajita Vegetables,...    $35.25              35.25
2       511         4               Chicken Burrito  [Fresh Tomato Salsa, [Fajita Vegetables, Rice,...    $35.00              35.00
3      1443         4               Chicken Burrito  [Fresh Tomato Salsa, [Rice, Black Beans, Chees...    $35.00              35.00
4      1443         3                Veggie Burrito  [Fresh Tomato Salsa, [Fajita Vegetables, Rice,...    $33.75              33.75

.str.contains(): ํ•ด๋‹น ๋ฌธ์ œ์—ด ํฌํ•จ ์—ฌ๋ถ€

print('๐Ÿช item_name์— "Chips"๊ฐ€ ํฌํ•จ๋˜๋Š” ์ƒ์œ„ 5๊ฐœ ๋ฐ์ดํ„ฐ ์ถ”์ถœ:')
print(df.loc[df.item_name.str.contains('Chips')].head())
  • .str.contains('Chips') โ†’ ๋ฌธ์ž์—ด "Chips"๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ True/False๋กœ ๋ฐ˜ํ™˜
  • str.contains('chips', case=False): ๋Œ€์†Œ๋ฌธ์ž ๋ฌด์‹œ
  • str.contains('Chips', regex=False): ์ •๊ทœํ‘œํ˜„์‹ ์‚ฌ์šฉ ์•ˆํ•จ
๐Ÿช item_name์— "Chips"๊ฐ€ ํฌํ•จ๋˜๋Š” ์ƒ์œ„ 5๊ฐœ ๋ฐ์ดํ„ฐ ์ถ”์ถœ:
    order_id  quantity                              item_name choice_description item_price  item_price_float
0          1         1           Chips and Fresh Tomato Salsa                NaN     $2.39               2.39
3          1         1  Chips and Tomatillo-Green Chili Salsa                NaN     $2.39               2.39
6          3         1                          Side of Chips                NaN     $1.69               1.69
10         5         1                    Chips and Guacamole                NaN     $4.45               4.45
14         7         1                    Chips and Guacamole                NaN     $4.45               4.45

๐Ÿฅ— ๋‘˜ ์ค‘ ํ•˜๋‚˜๋งŒ ํฌํ•จ

print('๐Ÿฅ— item_name์— "Steak Salad" ๋˜๋Š” "Bowl"๊ฐ€ ํฌํ•จ๋˜๋Š” ๋ฐ์ดํ„ฐ ์ถ”์ถœ:')
answer = df.loc[(df.item_name == 'Steak Salad') | (df.item_name == 'Bowl')]
print(answer)
      order_id  quantity  ... item_price item_price_float
445        193         3  ...    $22.20             22.20
664        276         1  ...     $8.99              8.99
673        279         1  ...     $7.40              7.40
752        311         1  ...     $8.99              8.99
893        369         1  ...     $8.99              8.99
3502      1406         1  ...     $8.69              8.69

[6 rows x 6 columns]

.drop_duplicates(''): ์ค‘๋ณตํ–‰ ์ œ๊ฑฐํ›„ ์ฒซ๋ฒˆ์งธ ์ผ€์ด์Šค ์ถœ๋ ฅ

print('๐Ÿฅ— item_name์— "Steak Salad" ๋˜๋Š” "Bowl"๊ฐ€ ํฌํ•จ๋˜๋Š” ๋ฐ์ดํ„ฐ ์ถ”์ถœ:')
print(answer.drop_duplicates('item_name'))
     order_id  quantity  ... item_price item_price_float
445       193         3  ...    $22.20             22.20
664       276         1  ...     $8.99              8.99

.drop_duplicates('', keep='last'): ์ค‘๋ณตํ–‰ ์ œ๊ฑฐํ›„ ๋งˆ์ง€๋ง‰ ์ผ€์ด์Šค ์ถœ๋ ฅ

print('๐Ÿฅ— item_name์— "Steak Salad" ๋˜๋Š” "Bowl"๊ฐ€ ํฌํ•จ๋˜๋Š” ๋ฐ์ดํ„ฐ ์ถ”์ถœ:')
print(answer.drop_duplicates('item_name', keep='last'))
      order_id  quantity  ... item_price item_price_float
673        279         1  ...     $7.40              7.40
3502      1406         1  ...     $8.69              8.69

df.iloc[:,::2] : ์ง์ˆ˜์—ด

print("๐Ÿ˜ˆ ์ง์ˆ˜๋ฒˆ์งธ ์ปฌ๋Ÿผ๋งŒ ์ถœ๋ ฅ:")
print(df.iloc[:,::2].head())
  • slicing[start : end : step]
    • ::2๋Š” ์ฒ˜์Œ๋ถ€ํ„ฐ ๋๊นŒ์ง€ 2์นธ์”ฉ ๊ฑด๋„ˆ๋›ฐ๊ธฐ
      (์ธ๋ฑ์Šค: 0, 2, 4, 6, ...)
  • [:,::2]:
    • :: ํ–‰ ์œ„์น˜, ๋ชจ๋“  ํ–‰ ์„ ํƒ
    • ::2: ์—ด ์œ„์น˜, ์—ด ์ธ๋ฑ์Šค๋ฅผ 2์นธ์”ฉ ๊ฑด๋„ˆ๋›ฐ๋ฉฐ ์„ ํƒ
  • df.iloc[::2, :]: ๋ชจ๋“  ์—ด์—์„œ ํ–‰ ์ธ๋ฑ์Šค๋ฅผ 2์นธ์”ฉ ๊ฑด๋„ˆ๋›ฐ๋ฉฐ ์„ ํƒ
๐Ÿ˜ˆ ์ง์ˆ˜๋ฒˆ์งธ ์ปฌ๋Ÿผ๋งŒ ์ถœ๋ ฅ:
   order_id                              item_name item_price
0         1           Chips and Fresh Tomato Salsa     $2.39 
1         1                                   Izze     $3.39 
2         1                       Nantucket Nectar     $3.39 
3         1  Chips and Tomatillo-Green Chili Salsa     $2.39 
4         2                           Chicken Bowl    $16.98 


์ฐธ๊ณ , https://www.datamanim.com/dataset/99_pandas/pandasMain.html#filtering-sorting

profile
สšศ‰ษž

0๊ฐœ์˜ ๋Œ“๊ธ€