19. DataFrame Applications - Combining DataFrames

김동웅 · September 19, 2021

Pandas with python

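When data is split across several places, you often need to combine it into one table or connect the pieces together.

pandas offers several tools for this, most notably concat(), merge(), and join().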

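1. Concatenating DataFrames

pandas.concat(list of DataFrames)

Passing concat() a list whose elements are DataFrames connects those DataFrames to one another. By default the rows are stacked top to bottom (axis=0), and columns missing from one DataFrame are filled with NaN.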

import pandas as pd


df1 = pd.DataFrame({'a':['a0','a1','a2','a3'],
                    'b' :['b0','b1','b2','b3'],
                    'c':['c0','c1','c2','c3']},
                    index=[0,1,2,3]
                    )

df2 = pd.DataFrame({'a':['a0','a1','a2','a3'],
                    'b' :['b0','b1','b2','b3'],
                    'c':['c0','c1','c2','c3'],
                    'd':['d0','d1','d2','d3']},
                    index=[2,3,4,5]
                    )

print(df1,'\n',df2)

result1= pd.concat([df1,df2])
print(result1)

    a   b   c
0  a0  b0  c0
1  a1  b1  c1
2  a2  b2  c2
3  a3  b3  c3 
     a   b   c   d
2  a0  b0  c0  d0
3  a1  b1  c1  d1
4  a2  b2  c2  d2
5  a3  b3  c3  d3
    a   b   c    d
0  a0  b0  c0  NaN
1  a1  b1  c1  NaN
2  a2  b2  c2  NaN
3  a3  b3  c3  NaN
2  a0  b0  c0   d0
3  a1  b1  c1   d1
4  a2  b2  c2   d2
5  a3  b3  c3   d3

Passing ignore_index=True to pd.concat() discards the existing row indexes and numbers the rows of the result from 0.

result2 = pd.concat([df1,df2],ignore_index=True)
print(result2)

    a   b   c    d
0  a0  b0  c0  NaN
1  a1  b1  c1  NaN
2  a2  b2  c2  NaN
3  a3  b3  c3  NaN
4  a0  b0  c0   d0
5  a1  b1  c1   d1
6  a2  b2  c2   d2
7  a3  b3  c3   d3
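One practical reason to use ignore_index=True: without it, the result above keeps the labels 2 and 3 twice, so a label lookup returns more than one row. A minimal sketch, using one-column versions of df1 and df2 (the names kept and renumbered are mine, chosen for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['a0', 'a1', 'a2', 'a3']}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({'a': ['a0', 'a1', 'a2', 'a3']}, index=[2, 3, 4, 5])

kept = pd.concat([df1, df2])                            # original labels survive
renumbered = pd.concat([df1, df2], ignore_index=True)   # fresh index starting at 0

print(kept.index.is_unique)        # False: labels 2 and 3 appear twice
print(len(kept.loc[[2]]))          # 2 rows share the label 2
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4, 5, 6, 7]
```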

With axis=1, pd.concat() connects the DataFrames side by side, in the column direction. Rows are matched by index label.

result3 = pd.concat([df1,df2],axis=1)
print(result3)

     a    b    c    a    b    c    d
0   a0   b0   c0  NaN  NaN  NaN  NaN
1   a1   b1   c1  NaN  NaN  NaN  NaN
2   a2   b2   c2   a0   b0   c0   d0
3   a3   b3   c3   a1   b1   c1   d1
4  NaN  NaN  NaN   a2   b2   c2   d2
5  NaN  NaN  NaN   a3   b3   c3   d3

With join='inner' and axis=1, pd.concat() keeps only the intersection of the row indexes.
With join='inner' and axis=0, pd.concat() keeps only the intersection of the column names.

result4 = pd.concat([df1,df2],axis=1,join='inner')
print(result4)

   a   b   c   a   b   c   d
2  a2  b2  c2  a0  b0  c0  d0
3  a3  b3  c3  a1  b1  c1  d1
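The axis=0 case mentioned above has no example in this post, so here is a small sketch of it. The two-row frames are shortened stand-ins for df1 and df2, and the name rows_inner is my own:

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['a0', 'a1'], 'b': ['b0', 'b1'], 'c': ['c0', 'c1']},
                   index=[0, 1])
df2 = pd.DataFrame({'a': ['a2', 'a3'], 'b': ['b2', 'b3'], 'c': ['c2', 'c3'],
                    'd': ['d2', 'd3']},
                   index=[1, 2])

# axis=0 stacks rows; join='inner' keeps only the columns both frames share
rows_inner = pd.concat([df1, df2], axis=0, join='inner')
print(rows_inner)

print(rows_inner.columns.tolist())   # ['a', 'b', 'c'] (column 'd' is dropped)
print(rows_inner.shape)              # (4, 3)
```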

A DataFrame and a Series can also be concatenated side by side, in the column direction.
-> In effect, it feels like adding a column to the DataFrame.

import pandas as pd

df1 = pd.DataFrame({'a':['a0','a1','a2','a3'],
                    'b' :['b0','b1','b2','b3'],
                    'c':['c0','c1','c2','c3']},
                    index=[0,1,2,3]
                    )

df2 = pd.DataFrame({'a':['a0','a1','a2','a3'],
                    'b' :['b0','b1','b2','b3'],
                    'c':['c0','c1','c2','c3'],
                    'd':['d0','d1','d2','d3']},
                    index=[2,3,4,5]
                    )

sr1= pd.Series(['e0','e1','e2','e3'],name='e')
sr2= pd.Series(['f0','f1','f2'],name='f',index=[3,4,5])
sr3= pd.Series(['g0','g1','g2','g3'],name='g')

result1 =pd.concat([df1,sr1],axis=1)
print(result1)


result2 =pd.concat([df2,sr2],axis=1,sort=True)
print(result2)

result1

    a   b   c   e
0  a0  b0  c0  e0
1  a1  b1  c1  e1
2  a2  b2  c2  e2
3  a3  b3  c3  e3

result2


    a   b   c   d    f
2  a0  b0  c0  d0  NaN
3  a1  b1  c1  d1   f0
4  a2  b2  c2  d2   f1
5  a3  b3  c3  d3   f2
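To check the "adding a column" intuition, the sketch below compares pd.concat() against DataFrame.assign() on a two-column version of df1. The names via_concat and via_assign are mine:

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['a0', 'a1', 'a2', 'a3'],
                    'b': ['b0', 'b1', 'b2', 'b3']},
                   index=[0, 1, 2, 3])
sr1 = pd.Series(['e0', 'e1', 'e2', 'e3'], name='e')

via_concat = pd.concat([df1, sr1], axis=1)   # the Series name becomes the column name
via_assign = df1.assign(e=sr1)               # column assignment, aligned on the index

print(via_concat.columns.tolist())           # ['a', 'b', 'e']
print(via_concat['e'].tolist() == via_assign['e'].tolist())   # True
```

The two are only interchangeable when the index labels match. With a Series like sr2, whose labels differ from the DataFrame's, concat() with its default outer join can add rows, while assign() keeps the DataFrame's own rows.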

2. Merging DataFrames

If concat() connects DataFrames by simply attaching them to one another,
merge() combines two DataFrames according to some key, much like the JOIN command in SQL.

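Merging DataFrames:
pandas.merge(df_left, df_right, how='inner', on=None)
on = 'column name': the column to use as the merge key

how = 'inner': intersection (only keys present in both DataFrames)

how = 'outer': union (keys present in either DataFrame)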

The defaults for pd.merge() are how='inner' and on=None.

on=None means the merge uses every column that the two DataFrames have in common as the key.
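The effect of on=None is easiest to see when the two DataFrames share more than one column. A minimal sketch with made-up toy data (left, right, and their values are hypothetical, not the stock files used in this post):

```python
import pandas as pd

# Hypothetical toy data, not the stock files used in this post
left = pd.DataFrame({'id': [1, 2, 3],
                     'name': ['A', 'B', 'C'],
                     'price': [100, 200, 300]})
right = pd.DataFrame({'id': [1, 2, 4],
                      'name': ['A', 'X', 'D'],
                      'eps': [10, 20, 40]})

# on=None: both shared columns ('id' and 'name') act as keys
default_keys = pd.merge(left, right)
# on='id': only 'id' is the key, so the clashing 'name' columns get suffixes
id_only = pd.merge(left, right, on='id')

print(default_keys)                 # one row: id 1 / name 'A'
print(id_only.columns.tolist())     # ['id', 'name_x', 'price', 'name_y', 'eps']
```

Here id 2 matches on 'id' but not on 'name', so it disappears from the default merge and survives only when on='id' is given explicitly.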

import pandas as pd

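# Change IPython display settings

pd.set_option('display.max_columns',10) # maximum number of columns to print
pd.set_option('display.max_colwidth',20) # maximum width of a printed column
pd.set_option('display.unicode.east_asian_width',True) # account for East Asian wide characters when aligning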

df1 = pd.read_excel(r'data_analysis\sample\part6\stock price.xlsx',engine='openpyxl')
df2 = pd.read_excel(r'data_analysis\sample\part6\stock valuation.xlsx',engine='openpyxl')

print(df1)
print('\n')
print(df2)

merge_inner = pd.merge(df1,df2)
print(merge_inner)

▼ df1

       id    stock_name          value   price
0  128940      한미약품   59385.666667  421000
1  130960        CJ E&M   58540.666667   98900
2  138250    엔에스쇼핑   14558.666667   13200
3  139480        이마트  239230.833333  254500
4  142280  녹십자엠에스     468.833333   10200
5  145990        삼양사   82750.000000   82000
6  185750        종근당   40293.666667  100500
7  192400    쿠쿠홀딩스  179204.666667  177500
8  199800          툴젠   -2514.333333  115400
9  204210  모두투어리츠    3093.333333    3475

▼ df2

       id              name           eps     bps        per       pbr
0  130960            CJ E&M   6301.333333   54068  15.695091  1.829178
1  136480              하림    274.166667    3551  11.489362  0.887074
2  138040    메리츠금융지주   2122.333333   14894   6.313806  0.899691
3  139480            이마트  18268.166667  295780  13.931338  0.860437
4  145990            삼양사   5741.000000  108090  14.283226  0.758627
5  161390        한국타이어   5648.500000   51341   7.453306  0.820007
6  181710   NHN엔터테인먼트   2110.166667   78434  30.755864  0.827447
7  185750            종근당   3990.333333   40684  25.185866  2.470259
8  204210      모두투어리츠     85.166667    5335  40.802348  0.651359
9  207940  삼성바이오로직스   4644.166667   60099  89.790059  6.938551

▼ df1+df2

       id    stock_name          value   price          name           eps  \
0  130960        CJ E&M   58540.666667   98900        CJ E&M   6301.333333
1  139480        이마트  239230.833333  254500        이마트  18268.166667
2  145990        삼양사   82750.000000   82000        삼양사   5741.000000
3  185750        종근당   40293.666667  100500        종근당   3990.333333
4  204210  모두투어리츠    3093.333333    3475  모두투어리츠     85.166667
      bps        per       pbr
0   54068  15.695091  1.829178
1  295780  13.931338  0.860437
2  108090  14.283226  0.758627
3   40684  25.185866  2.470259
4    5335  40.802348  0.651359

# Outer merge on the id column: every stock from both DataFrames is included
merge_outer = pd.merge(df1,df2,how='outer',on='id')
print(merge_outer)

        id    stock_name          value     price              name  \
0   128940      한미약품   59385.666667  421000.0               NaN
1   130960        CJ E&M   58540.666667   98900.0            CJ E&M
2   138250    엔에스쇼핑   14558.666667   13200.0               NaN
3   139480        이마트  239230.833333  254500.0            이마트
4   142280  녹십자엠에스     468.833333   10200.0               NaN
5   145990        삼양사   82750.000000   82000.0            삼양사
6   185750        종근당   40293.666667  100500.0            종근당
7   192400    쿠쿠홀딩스  179204.666667  177500.0               NaN
8   199800          툴젠   -2514.333333  115400.0               NaN
9   204210  모두투어리츠    3093.333333    3475.0      모두투어리츠
10  136480           NaN            NaN       NaN              하림
11  138040           NaN            NaN       NaN    메리츠금융지주
12  161390           NaN            NaN       NaN        한국타이어
13  181710           NaN            NaN       NaN   NHN엔터테인먼트
14  207940           NaN            NaN       NaN  삼성바이오로직스

             eps       bps        per       pbr
0            NaN       NaN        NaN       NaN
1    6301.333333   54068.0  15.695091  1.829178
2            NaN       NaN        NaN       NaN
3   18268.166667  295780.0  13.931338  0.860437
4            NaN       NaN        NaN       NaN
5    5741.000000  108090.0  14.283226  0.758627
6    3990.333333   40684.0  25.185866  2.470259
7            NaN       NaN        NaN       NaN
8            NaN       NaN        NaN       NaN
9      85.166667    5335.0  40.802348  0.651359
10    274.166667    3551.0  11.489362  0.887074
11   2122.333333   14894.0   6.313806  0.899691
12   5648.500000   51341.0   7.453306  0.820007
13   2110.166667   78434.0  30.755864  0.827447
14   4644.166667   60099.0  89.790059  6.938551

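Setting how='left' keeps every row of the left DataFrame and merges in the matching rows of the right one, based on the values in the left DataFrame's key column. how='right' does the same from the right DataFrame's side.
You can also use the left_on and right_on options to specify a different key column for each DataFrame.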


merge_left = pd.merge(df1,df2,how='left',left_on='stock_name',right_on='name')
print(merge_left)

     id_x    stock_name          value   price      id_y          name  \
0  128940      한미약품   59385.666667  421000       NaN           NaN
1  130960        CJ E&M   58540.666667   98900  130960.0        CJ E&M
2  138250    엔에스쇼핑   14558.666667   13200       NaN           NaN
3  139480        이마트  239230.833333  254500  139480.0        이마트
4  142280  녹십자엠에스     468.833333   10200       NaN           NaN
5  145990        삼양사   82750.000000   82000  145990.0        삼양사
6  185750        종근당   40293.666667  100500  185750.0        종근당
7  192400    쿠쿠홀딩스  179204.666667  177500       NaN           NaN
8  199800          툴젠   -2514.333333  115400       NaN           NaN
9  204210  모두투어리츠    3093.333333    3475  204210.0  모두투어리츠

            eps       bps        per       pbr
0           NaN       NaN        NaN       NaN
1   6301.333333   54068.0  15.695091  1.829178
2           NaN       NaN        NaN       NaN
3  18268.166667  295780.0  13.931338  0.860437
4           NaN       NaN        NaN       NaN
5   5741.000000  108090.0  14.283226  0.758627
6   3990.333333   40684.0  25.185866  2.470259
7           NaN       NaN        NaN       NaN
8           NaN       NaN        NaN       NaN
9     85.166667    5335.0  40.802348  0.651359

merge_right = pd.merge(df1,df2,how='right',left_on='stock_name',right_on='name')
print(merge_right)

       id_x    stock_name          value     price    id_y              name  \
0  130960.0        CJ E&M   58540.666667   98900.0  130960            CJ E&M
1       NaN           NaN            NaN       NaN  136480              하림
2       NaN           NaN            NaN       NaN  138040    메리츠금융지주
3  139480.0        이마트  239230.833333  254500.0  139480            이마트
4  145990.0        삼양사   82750.000000   82000.0  145990            삼양사
5       NaN           NaN            NaN       NaN  161390        한국타이어
6       NaN           NaN            NaN       NaN  181710   NHN엔터테인먼트
7  185750.0        종근당   40293.666667  100500.0  185750            종근당
8  204210.0  모두투어리츠    3093.333333    3475.0  204210      모두투어리츠
9       NaN           NaN            NaN       NaN  207940  삼성바이오로직스

            eps     bps        per       pbr
0   6301.333333   54068  15.695091  1.829178
1    274.166667    3551  11.489362  0.887074
2   2122.333333   14894   6.313806  0.899691
3  18268.166667  295780  13.931338  0.860437
4   5741.000000  108090  14.283226  0.758627
5   5648.500000   51341   7.453306  0.820007
6   2110.166667   78434  30.755864  0.827447
7   3990.333333   40684  25.185866  2.470259
8     85.166667    5335  40.802348  0.651359
9   4644.166667   60099  89.790059  6.938551
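The id_x and id_y columns in the outputs above appear because 'id' exists in both DataFrames but is not the merge key, so merge() appends its default suffixes. The suffixes parameter lets you pick clearer names. A sketch with hypothetical toy tables (price, valuation, and the suffix strings are my own choices):

```python
import pandas as pd

# Hypothetical toy data standing in for the two stock tables
price = pd.DataFrame({'id': [10, 20, 30],
                      'stock_name': ['A', 'B', 'C']})
valuation = pd.DataFrame({'id': [20, 30, 40],
                          'name': ['B', 'C', 'D']})

# The keys have different names, so 'id' is an ordinary overlapping column here
default = pd.merge(price, valuation, how='left',
                   left_on='stock_name', right_on='name')
labeled = pd.merge(price, valuation, how='left',
                   left_on='stock_name', right_on='name',
                   suffixes=('_price', '_valuation'))

print(default.columns.tolist())   # ['id_x', 'stock_name', 'id_y', 'name']
print(labeled.columns.tolist())   # ['id_price', 'stock_name', 'id_valuation', 'name']
```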

merge() can be combined with boolean indexing to filter the rows before merging.



price = df1[df1['price']<50000]

print(price.head())

value = pd.merge(price,df2)
print(value)

       id    stock_name         value  price
2  138250    엔에스쇼핑  14558.666667  13200
4  142280  녹십자엠에스    468.833333  10200
9  204210  모두투어리츠   3093.333333   3475

       id    stock_name        value  price          name        eps   bps  \
0  204210  모두투어리츠  3093.333333   3475  모두투어리츠  85.166667  5335
         per       pbr
0  40.802348  0.651359
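Since the Excel files are needed to run the code above, here is a self-contained sketch of the same filter-then-merge pattern. The numbers are made up, and only the column names mirror the post's tables:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the post's tables
df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'price': [421000, 13200, 10200, 3475]})
df2 = pd.DataFrame({'id': [1, 4, 5],
                    'per': [15.7, 40.8, 89.8]})

cheap = df1[df1['price'] < 50000]      # boolean indexing keeps ids 2, 3, 4
value = pd.merge(cheap, df2)           # inner merge on the shared 'id' column

# For an inner merge, filtering after the merge selects the same stocks
after = pd.merge(df1, df2).query('price < 50000')

print(value['id'].tolist())            # [4]
print(after['id'].tolist())            # [4]
```

Filtering first keeps the merge input smaller, which is the order this post uses.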

3. Joining DataFrames

The pandas join() method is built on top of merge(), so the two work in much the same way.
The difference is that join() combines two DataFrames based on their row indexes by default.

Joining on the row index: DataFrame1.join(DataFrame2, how='left')
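The relationship between the two can be checked directly: an index-based join should give the same result as merge() with left_index=True and right_index=True. A minimal sketch with hypothetical toy data indexed by 'id':

```python
import pandas as pd

# Hypothetical toy data indexed by 'id'
df1 = pd.DataFrame({'price': [100, 200, 300]},
                   index=pd.Index([1, 2, 3], name='id'))
df2 = pd.DataFrame({'eps': [20.0, 30.0, 40.0]},
                   index=pd.Index([2, 3, 4], name='id'))

joined = df1.join(df2)                                   # how='left' by default
merged = pd.merge(df1, df2, how='left',
                  left_index=True, right_index=True)     # the same join spelled with merge()

print(joined)
print(joined.equals(merged))   # True
```

Note that join() raises an error when the two DataFrames share a column name, unless lsuffix or rsuffix is given. The toy frames here avoid that by using distinct column names.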

import pandas as pd

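# Change IPython display settings

pd.set_option('display.max_columns',10) # maximum number of columns to print
pd.set_option('display.max_colwidth',20) # maximum width of a printed column
pd.set_option('display.unicode.east_asian_width',True) # account for East Asian wide characters when aligning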

df1 = pd.read_excel(r'data_analysis\sample\part6\stock price.xlsx',index_col='id',engine='openpyxl')
df2 = pd.read_excel(r'data_analysis\sample\part6\stock valuation.xlsx',index_col='id',engine='openpyxl')

print(df1.head())
print(df2.head())

df3 = df1.join(df2)
print(df3)

          stock_name          value   price          name           eps  \
id
128940      한미약품   59385.666667  421000           NaN           NaN
130960        CJ E&M   58540.666667   98900        CJ E&M   6301.333333
138250    엔에스쇼핑   14558.666667   13200           NaN           NaN
139480        이마트  239230.833333  254500        이마트  18268.166667
142280  녹십자엠에스     468.833333   10200           NaN           NaN
145990        삼양사   82750.000000   82000        삼양사   5741.000000
185750        종근당   40293.666667  100500        종근당   3990.333333
192400    쿠쿠홀딩스  179204.666667  177500           NaN           NaN
199800          툴젠   -2514.333333  115400           NaN           NaN
204210  모두투어리츠    3093.333333    3475  모두투어리츠     85.166667
            bps        per       pbr
id
128940       NaN        NaN       NaN
130960   54068.0  15.695091  1.829178
138250       NaN        NaN       NaN
139480  295780.0  13.931338  0.860437
142280       NaN        NaN       NaN
145990  108090.0  14.283226  0.758627
185750   40684.0  25.185866  2.470259
192400       NaN        NaN       NaN
199800       NaN        NaN       NaN
204210    5335.0  40.802348  0.651359

With how='inner', join() likewise keeps only the rows whose index appears in both DataFrames.

df4 = df1.join(df2,how='inner')
print(df4)

          stock_name          value   price          name           eps  \
id
130960        CJ E&M   58540.666667   98900        CJ E&M   6301.333333
139480        이마트  239230.833333  254500        이마트  18268.166667
145990        삼양사   82750.000000   82000        삼양사   5741.000000
185750        종근당   40293.666667  100500        종근당   3990.333333
204210  모두투어리츠    3093.333333    3475  모두투어리츠     85.166667
          bps        per       pbr
id
130960   54068  15.695091  1.829178
139480  295780  13.931338  0.860437
145990  108090  14.283226  0.758627
185750   40684  25.185866  2.470259
204210    5335  40.802348  0.651359