Pandas Practice

Heejun Kim · May 9, 2022


(1) Environment setup

1) Import the libraries

import pandas as pd
import numpy as np

2) Load the data

Load the following three datasets and store each one in a variable.

sales = pd.read_csv('https://raw.githubusercontent.com/DA4BAM/dataset/master/sales.csv')
products = pd.read_csv('https://raw.githubusercontent.com/DA4BAM/dataset/master/products.csv')
customers = pd.read_csv('https://raw.githubusercontent.com/DA4BAM/dataset/master/customers2.csv')

Data source: https://github.com/DA4BAM/dataset
Preview the three DataFrames.

sales.head()

	OrderID	Seq	OrderDate	ProductID	Qty	Amt	CustomerID
0	107	2	2016-01-02	p1036481	2	2100	c150417
1	69	1	2016-01-02	p1152861	1	1091	c212716
2	69	7	2016-01-02	p1013161	1	2600	c212716
3	69	8	2016-01-02	p1005771	1	1650	c212716
4	69	11	2016-01-02	p1089531	1	2600	c212716

products.head()

	ProductID	ProductName	Category	SubCategory	CategoryOrd
0	p1052661	새우깡	간식	과자	3
1	p1054261	고구마스틱	간식	과자	3
2	p1097821	짱구	간식	과자	3
3	p1097831	감자칩	간식	과자	3
4	p1119071	뿌셔뿌셔	간식	과자	3

customers.head()

	CustomerID	RegisterDate	Gender	BirthYear
0	c328222	2014-09-25	F	1960
1	c281448	2013-06-18	F	1974
2	c038336	2003-10-10	F	1968
3	c084237	2007-03-09	F	1982
4	c162600	2010-06-14	F	1978
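The previews show date-like columns such as OrderDate and RegisterDate. By default read_csv loads such columns as plain strings. The sketch below assumes date parsing is wanted (the original post does not use it) and applies the parse_dates option to two rows copied from the sales preview; the in-memory CSV text and the name toy_sales are illustrative only.

```python
import io
import pandas as pd

# Two rows copied from the sales preview, held in memory instead of downloaded
csv_text = """OrderID,Seq,OrderDate,ProductID,Qty,Amt,CustomerID
107,2,2016-01-02,p1036481,2,2100,c150417
69,1,2016-01-02,p1152861,1,1091,c212716
"""

# parse_dates converts the listed columns to datetime64 while reading
toy_sales = pd.read_csv(io.StringIO(csv_text), parse_dates=['OrderDate'])
print(toy_sales.dtypes)
```

With a datetime column, accessors such as toy_sales['OrderDate'].dt.year become available for later grouping.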

(2) Aggregating, modifying, and querying data

1) Sum the quantity sold (Qty) and the sales amount (Amt) per ProductID in sales, store the result in tmp, and display it.

tmp = sales.groupby('ProductID', as_index = True)[['Qty', 'Amt']].sum()
tmp

	Qty	Amt
ProductID
p1001771	1055	3354827
p1002841	903	11011541
p1005621	906	2601703
p1005771	3963	7319963
p1005891	5194	10119037
...	...	...
p1246581	2180	2456346
p1255281	809	955633
p1256521	701	1241545
p1284851	2350	6437323
p1299491	530	1094319

62 rows × 2 columns
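The as_index argument decides whether the grouping key becomes the index of the result or stays an ordinary column. A minimal sketch on made-up data (toy_sales and its values are hypothetical, not taken from the real dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the real sales table
toy_sales = pd.DataFrame({
    'ProductID': ['p1', 'p1', 'p2', 'p2', 'p3'],
    'Qty': [2, 1, 1, 3, 5],
    'Amt': [2100, 1050, 2600, 7800, 500],
})

# as_index=True (the default): ProductID becomes the index of the result
by_index = toy_sales.groupby('ProductID', as_index=True)[['Qty', 'Amt']].sum()

# as_index=False: ProductID stays a regular column, which is handy before a merge
by_column = toy_sales.groupby('ProductID', as_index=False)[['Qty', 'Amt']].sum()

print(by_index)
print(by_column)
```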

2) Sort the result of step 1 by sales amount in descending order and display the top 5 products.

tmp.sort_values(by = 'Amt', ascending = False).head()

	Qty	Amt
ProductID
p1072601	4058	18129067
p1178011	1653	14078818
p1002841	903	11011541
p1005891	5194	10119037
p1194801	990	7517664
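Sorting and then taking head() is one way to get the top rows; DataFrame.nlargest is an alternative that expresses the same intent in a single call. A small sketch on made-up totals (toy_totals is hypothetical):

```python
import pandas as pd

# Hypothetical per-product totals, indexed by ProductID like tmp
toy_totals = pd.DataFrame(
    {'Qty': [10, 4, 7, 1, 9, 3], 'Amt': [500, 900, 300, 1200, 100, 700]},
    index=pd.Index(['p1', 'p2', 'p3', 'p4', 'p5', 'p6'], name='ProductID'),
)

# Approach used in this post: sort descending, then keep the first 5 rows
top5_sorted = toy_totals.sort_values(by='Amt', ascending=False).head(5)

# Alternative: nlargest returns the 5 rows with the largest Amt, already ordered
top5_nlargest = toy_totals.nlargest(5, 'Amt')

print(top5_nlargest)
```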

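3) Look up the number of customers by gender (Gender) in customers.

# Note: sum() adds up column values per group, so the result below is the total of BirthYear per gender,
# not the number of customers. Counting rows calls for count() or size() instead of sum().
customers.groupby(by = 'Gender', as_index = False).sum()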

	Gender	BirthYear
0	F	4060490
1	M	360704
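To get an actual head count per gender, the rows in each group need to be counted rather than summed. A minimal sketch on made-up data (toy_customers is hypothetical; the counts for the real dataset are not shown here):

```python
import pandas as pd

# Hypothetical toy data standing in for the real customers table
toy_customers = pd.DataFrame({
    'CustomerID': ['c001', 'c002', 'c003', 'c004'],
    'Gender': ['F', 'F', 'M', 'F'],
    'BirthYear': [1960, 1974, 1968, 1982],
})

# size() counts the rows in each group, regardless of column values
gender_counts = toy_customers.groupby('Gender').size()
print(gender_counts)

# value_counts() on the column gives the same counts, ordered by frequency
gender_value_counts = toy_customers['Gender'].value_counts()
print(gender_value_counts)
```

Applied to the real table, the equivalent call would be customers.groupby('Gender')['CustomerID'].count().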

4) Compute age (Age) from birth year (BirthYear) in customers and add it as a new column.

customers['Age'] = 2022 - customers['BirthYear']
customers.head()

	CustomerID	RegisterDate	Gender	BirthYear	Age
0	c328222	2014-09-25	F	1960	62
1	c281448	2013-06-18	F	1974	48
2	c038336	2003-10-10	F	1968	54
3	c084237	2007-03-09	F	1982	40
4	c162600	2010-06-14	F	1978	44
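The calculation above hardcodes 2022 as the reference year. One way to make that choice explicit is to pass the year as a parameter; the helper name add_age and the toy data below are hypothetical, not part of the original post.

```python
import pandas as pd

def add_age(df, reference_year):
    """Return a copy of df with Age = reference_year - BirthYear."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out['Age'] = reference_year - out['BirthYear']
    return out

# Hypothetical toy data using two birth years from the preview above
toy_customers = pd.DataFrame({
    'CustomerID': ['c328222', 'c281448'],
    'BirthYear': [1960, 1974],
})

aged = add_age(toy_customers, reference_year=2022)
print(aged)
```

To follow the current date instead, pd.Timestamp.today().year could be passed as reference_year, at the cost of results that change from year to year.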

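5) Convert age (Age) in customers into an age group (AgeGroup).

[ , 30) : under 30 (30미만)
[30, 40) : 30s (30대)
[40, 50) : 40s (40대)
[50, 60) : 50s (50대)
[60, 70) : 60s (60대)
[70, ) : 70 and over (70대이상)

[30, 40) means 30 <= Age < 40.

# pd.cut uses right-closed intervals by default, so these bins are (0, 29], (29, 39], ..., (69, inf].
# For positive integer ages this is equivalent to [ , 30), [30, 40), ..., [70, ).
bins = [0, 29, 39, 49, 59, 69, np.inf]
labels = ['30미만', '30대', '40대', '50대', '60대', '70대이상']
customers['AgeGroup'] = pd.cut(customers['Age'], bins = bins, labels = labels)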
customers.head()

	CustomerID	RegisterDate	Gender	BirthYear	Age	AgeGroup
0	c328222	2014-09-25	F	1960	62	60대
1	c281448	2013-06-18	F	1974	48	40대
2	c038336	2003-10-10	F	1968	54	50대
3	c084237	2007-03-09	F	1982	40	40대
4	c162600	2010-06-14	F	1978	44	40대
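The half-open intervals listed for this step, such as [30, 40), can also be written directly by passing right=False to pd.cut, so the bin edges match the specification literally. A sketch that checks the boundary ages against the bins used in this post (the ages Series is made up for illustration):

```python
import numpy as np
import pandas as pd

# Boundary ages chosen to exercise each edge
ages = pd.Series([29, 30, 39, 40, 69, 70])
labels = ['30미만', '30대', '40대', '50대', '60대', '70대이상']

# right=False makes every bin left-closed: [0, 30), [30, 40), ..., [70, inf)
half_open = pd.cut(ages, bins=[0, 30, 40, 50, 60, 70, np.inf],
                   labels=labels, right=False)

# Bins as used in this post (right-closed by default): (0, 29], (29, 39], ...
right_closed = pd.cut(ages, bins=[0, 29, 39, 49, 59, 69, np.inf], labels=labels)

print(half_open.tolist())
print(right_closed.tolist())
```

Both calls assign the same group to these integer ages; they would differ for non-integer values such as 29.5, which only the right=False version places in 30미만.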

6) Count the number of customers in each age group (AgeGroup) in customers.

customers.groupby('AgeGroup')['CustomerID'].count()

AgeGroup
30미만       6
30대      186
40대      925
50대      749
60대      267
70대이상    110
Name: CustomerID, dtype: int64
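Because pd.cut returns a categorical column, groupby has an observed argument that controls whether categories with no rows appear in the result. A minimal sketch on made-up data (toy and its three categories are hypothetical):

```python
import pandas as pd

labels = ['30미만', '30대', '40대']

# Hypothetical toy data: nobody falls in the '30미만' category
toy = pd.DataFrame({
    'CustomerID': ['c001', 'c002', 'c003'],
    'AgeGroup': pd.Categorical(['30대', '40대', '40대'],
                               categories=labels, ordered=True),
})

# observed=False keeps every category, reporting 0 for empty ones
counts_all = toy.groupby('AgeGroup', observed=False)['CustomerID'].count()

# observed=True keeps only categories that actually occur in the data
counts_seen = toy.groupby('AgeGroup', observed=True)['CustomerID'].count()

print(counts_all)
print(counts_seen)
```

Passing observed explicitly also avoids relying on the default, which differs between pandas versions.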

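7) Look up the names of the top 5 products by sales.

# Create a sales column
# Note: in the merged sample rows, Amt already scales with Qty for the same product
# (Qty 1 -> 1050, Qty 3 -> 3150), so Amt appears to be the line total and Amt * Qty
# counts quantity twice. The outputs that follow use the calculation as written.
sales['sale'] = sales['Amt'] * sales['Qty']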
sales.head()

	OrderID	Seq	OrderDate	ProductID	Qty	Amt	CustomerID	sale
0	107	2	2016-01-02	p1036481	2	2100	c150417	4200
1	69	1	2016-01-02	p1152861	1	1091	c212716	1091
2	69	7	2016-01-02	p1013161	1	2600	c212716	2600
3	69	8	2016-01-02	p1005771	1	1650	c212716	1650
4	69	11	2016-01-02	p1089531	1	2600	c212716	2600

# Inner join sales and products on ProductID

total = pd.merge(sales, products, on = 'ProductID', how = 'inner')
total.head()

	OrderID	Seq	OrderDate	ProductID	Qty	Amt	CustomerID	sale	ProductName	Category	SubCategory	CategoryOrd
0	107	2	2016-01-02	p1036481	2	2100	c150417	4200	순두부	반찬류	두부	1
1	137	4	2016-01-02	p1036481	2	2100	c280590	4200	순두부	반찬류	두부	1
2	63	16	2016-01-03	p1036481	1	1050	c037915	1050	순두부	반찬류	두부	1
3	135	3	2016-01-04	p1036481	3	3150	c100815	9450	순두부	반찬류	두부	1
4	63	13	2016-01-06	p1036481	10	10500	c048405	105000	순두부	반찬류	두부	1

# Aggregate total sales per product (grouped by ProductID)
total.groupby('ProductID')[['sale']].sum().head()

	sale
ProductID
p1001771	4289867
p1002841	13280561
p1005621	3350069
p1005771	10166705
p1005891	17797123

# Sort by total sales in descending order and display the top 5
top = total.groupby('ProductID')[['sale']].sum()
top.sort_values(by = 'sale', ascending = False, inplace = True)
top.head()

	sale
ProductID
p1072601	24874795
p1011291	19203973
p1005891	17797123
p1178011	15407898
p1002841	13280561
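The task asks for product names, while the result above is indexed by ProductID only. One way to surface the names is to group on both ProductID and ProductName after the merge. A minimal sketch on made-up data (the toy tables and English product names are hypothetical):

```python
import pandas as pd

# Hypothetical toy tables standing in for sales and products
toy_sales = pd.DataFrame({
    'ProductID': ['p1', 'p2', 'p1', 'p3'],
    'sale': [2100, 1300, 1050, 4600],
})
toy_products = pd.DataFrame({
    'ProductID': ['p1', 'p2', 'p3'],
    'ProductName': ['Tofu', 'Chips', 'Milk'],
})

merged = pd.merge(toy_sales, toy_products, on='ProductID', how='inner')

# Grouping on both keys keeps the name next to the ID in the result
top = (
    merged.groupby(['ProductID', 'ProductName'], as_index=False)['sale']
    .sum()
    .sort_values(by='sale', ascending=False)
    .head(2)  # top 2 here; head(5) for a top 5
)
print(top)
```

Grouping by ProductName alone would also work, but it would merge distinct products that happen to share a name.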

8) Look up sales by age group.

# Inner join sales and customers on CustomerID
total = pd.merge(sales, customers, on = 'CustomerID', how = 'inner')
total.head()

	OrderID	Seq	OrderDate	ProductID	Qty	Amt	CustomerID	sale	RegisterDate	Gender	BirthYear	Age	AgeGroup
0	107	2	2016-01-02	p1036481	2	2100	c150417	4200	2010-03-03	F	1974	48	40대
1	107	1	2016-01-02	p1175481	1	1300	c150417	1300	2010-03-03	F	1974	48	40대
2	185	1	2016-01-04	p1162631	1	4600	c150417	4600	2010-03-03	F	1974	48	40대
3	67	2	2016-01-11	p1012751	1	1350	c150417	1350	2010-03-03	F	1974	48	40대
4	201	3	2016-01-12	p1005891	1	1950	c150417	1950	2010-03-03	F	1974	48	40대

# Aggregate total sales per AgeGroup
total.groupby('AgeGroup', as_index = True)[['sale']].sum()

	sale
AgeGroup
30미만	198393
30대	19445744
40대	145359466
50대	129270847
60대	29372090
70대이상	14902634
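An inner join silently drops sales rows whose CustomerID has no match in customers. The sketch below, on made-up data, uses a left join with the indicator and validate options of pd.merge to make such rows visible and to confirm that each customer appears only once; the toy tables and the unmatched ID c999 are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables; 'c999' has no matching customer record
toy_sales = pd.DataFrame({
    'CustomerID': ['c001', 'c001', 'c002', 'c999'],
    'sale': [4200, 1300, 4600, 1000],
})
toy_customers = pd.DataFrame({
    'CustomerID': ['c001', 'c002'],
    'AgeGroup': ['40대', '50대'],
})

# validate raises MergeError if CustomerID is duplicated on the customers side;
# indicator adds a _merge column telling where each row came from
checked = pd.merge(toy_sales, toy_customers, on='CustomerID', how='left',
                   validate='many_to_one', indicator=True)

# Sales rows that found no customer
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched)

# Rows with a missing AgeGroup are excluded from the groups by default
by_age = checked.groupby('AgeGroup')['sale'].sum()
print(by_age)
```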

9) [Advanced] Look up each product category's share of sales (%) by age group.

# Merge the three DataFrames
tmp = pd.merge(sales, products, on = 'ProductID', how = 'inner')
total = pd.merge(tmp, customers, on = 'CustomerID', how = 'inner')
total.head()

	OrderID	Seq	OrderDate	ProductID	Qty	Amt	CustomerID	sale	ProductName	Category	SubCategory	CategoryOrd	RegisterDate	Gender	BirthYear	Age	AgeGroup
0	107	2	2016-01-02	p1036481	2	2100	c150417	4200	순두부	반찬류	두부	1	2010-03-03	F	1974	48	40대
1	197	5	2017-01-24	p1036481	1	1050	c150417	1050	순두부	반찬류	두부	1	2010-03-03	F	1974	48	40대
2	251	4	2016-07-15	p1152861	2	2182	c150417	4364	포토아이스크림	유제품	아이스크림	4	2010-03-03	F	1974	48	40대
3	71	10	2016-01-22	p1013161	1	2900	c150417	2900	느타리버섯	채소	버섯	5	2010-03-03	F	1974	48	40대
4	69	7	2016-09-13	p1013161	1	2950	c150417	2950	느타리버섯	채소	버섯	5	2010-03-03	F	1974	48	40대

# For each age group, sum sales per category and print each category's share of that group's total
# Grouping by a single column name (not a list) keeps the group key a plain label in recent pandas versions
age_group = total.groupby('AgeGroup')
print('=' * 50)
for age, group in age_group:
    age_categories = group.groupby('Category', as_index = False)[['sale']].sum()
    age_total = group['sale'].sum()
    print(f'{age}')
    print('=' * 50)

    for i in range(len(age_categories)):
        food_category = age_categories.loc[i, 'Category']
        food_percent = round((age_categories.loc[i, 'sale'] / age_total) * 100, 1)
        print(f'{food_category} {food_percent}%')
    print('=' * 50)

==================================================
30미만
==================================================
간식 8.8%
과일 7.5%
반찬류 33.5%
유제품 15.5%
채소 34.7%
==================================================
30대
==================================================
간식 12.2%
과일 17.3%
반찬류 10.5%
유제품 26.6%
채소 33.4%
==================================================
40대
==================================================
간식 25.0%
과일 18.6%
반찬류 11.8%
유제품 25.4%
채소 19.2%
==================================================
50대
==================================================
간식 19.9%
과일 18.4%
반찬류 14.9%
유제품 22.7%
채소 24.1%
==================================================
60대
==================================================
간식 4.2%
과일 18.4%
반찬류 27.8%
유제품 20.4%
채소 29.1%
==================================================
70대이상
==================================================
간식 17.7%
과일 20.4%
반찬류 25.8%
유제품 10.2%
채소 26.0%
==================================================
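The same shares can be computed without explicit loops by summing per (AgeGroup, Category) pair, pivoting categories into columns, and dividing each row by its total. A minimal sketch on made-up data (toy and its numbers are hypothetical, not the real results):

```python
import pandas as pd

# Hypothetical toy data standing in for the merged total table
toy = pd.DataFrame({
    'AgeGroup': ['30대', '30대', '30대', '40대', '40대'],
    'Category': ['간식', '채소', '간식', '간식', '과일'],
    'sale': [100, 300, 100, 250, 750],
})

# Sum per (AgeGroup, Category), then move Category into the columns
sums = toy.groupby(['AgeGroup', 'Category'])['sale'].sum().unstack(fill_value=0)

# Divide every row by its own total to get percentages per age group
share = sums.div(sums.sum(axis=1), axis=0).mul(100).round(1)
print(share)
```

The result is a table with one row per age group and one column per category, which is easier to sort, plot, or export than printed text.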
