These notes summarize what I learned from the Udemy course 'Pandas 및 Python을 이용한 데이터 분석:마스터 클래스' (Data Analysis with Pandas and Python: Masterclass).
DATAFRAME CONCATENATION
import pandas as pd

# Creating a dataframe from a dictionary
raw_data = {'Bank Client ID': ['1', '2', '3', '4', '5'],
'First Name': ['Nancy', 'Alex', 'Shep', 'Max', 'Allen'],
'Last Name': ['Rob', 'Ali', 'George', 'Mitch', 'Steve'],
}
raw_data
bank1_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'First Name', 'Last Name'])
bank1_df
# Let's define another dataframe for a separate list of clients (IDs = 6, 7, 8, 9, 10)
raw_data = {'Bank Client ID': ['6', '7', '8', '9', '10'],
'First Name': ['Bill', 'Dina', 'Sarah', 'Heather', 'Holly'],
'Last Name': ['Christian', 'Mo', 'SXteve', 'Bob', 'Michelle'],
}
bank2_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'First Name', 'Last Name'])
bank2_df
ignore_index=False is the default, so each dataframe keeps its own index labels. With ignore_index=True, the combined rows are renumbered under a single new index.
# Note that by default ignore_index has been set to False meaning indexes from both dataframes are kept unchanged
bank_all_df = pd.concat([bank1_df, bank2_df])
bank_all_df
# Note that by setting ignore_index = True, the index is renumbered automatically and now ranges from 0 to 9
bank_all_df = pd.concat([bank1_df, bank2_df], ignore_index=True)
bank_all_df
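To make the effect of ignore_index concrete, here is a minimal sketch on two small frames. The names left, right, kept, and renumbered are illustrative and not from the course. It also shows one practical reason to renumber: with the default setting the index labels repeat, so a label lookup can return more than one row.

```python
import pandas as pd

# Two small hypothetical frames, each with the default 0..1 index
left = pd.DataFrame({'Bank Client ID': ['1', '2'], 'First Name': ['Nancy', 'Alex']})
right = pd.DataFrame({'Bank Client ID': ['6', '7'], 'First Name': ['Bill', 'Dina']})

kept = pd.concat([left, right])                           # original labels are kept
renumbered = pd.concat([left, right], ignore_index=True)  # fresh 0..n-1 labels

kept_labels = kept.index.tolist()
renumbered_labels = renumbered.index.tolist()
print(kept_labels)        # [0, 1, 0, 1]
print(renumbered_labels)  # [0, 1, 2, 3]

# Duplicate labels mean .loc[0] returns two rows instead of one
rows_at_zero = kept.loc[0]
print(len(rows_at_zero))  # 2
```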
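The course also shows a DataFrame.append approach, but append was removed in pandas 2.0, so it does not work in current versions. Use pd.concat instead.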
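For code that used append to add a row, one possible replacement is to wrap the new row in a one-row dataframe and pass both to pd.concat. This is a sketch, not the course's code; the clients and new_client names and the sample row are assumptions for illustration.

```python
import pandas as pd

clients = pd.DataFrame({'Bank Client ID': ['1', '2'],
                        'First Name': ['Nancy', 'Alex'],
                        'Last Name': ['Rob', 'Ali']})

# Hypothetical new client, wrapped in a one-row DataFrame
new_client = pd.DataFrame([{'Bank Client ID': '3',
                            'First Name': 'Shep',
                            'Last Name': 'George'}])

# Plays the role of the removed clients.append(new_client, ignore_index=True)
clients = pd.concat([clients, new_client], ignore_index=True)
print(clients)
```

When many rows arrive one at a time, collecting them in a list and calling pd.concat once at the end is usually cheaper than concatenating inside a loop.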
MINI CHALLENGE #1:
raw_data = {'Bank Client ID': ['11', '12', '13', '14', '15'],
'First Name': ['Cho', 'Kim', 'Jung', 'Hwang', 'Cheon'],
'Last Name': ['Changseo', 'Minhyung', 'SXteve', 'Bob', 'Michelle'],
}
bank3_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'First Name', 'Last Name'])
bank_all_df = pd.concat([bank1_df, bank2_df, bank3_df], ignore_index=True)
DATAFRAME CONCATENATION WITH MULTI-INDEXING
Adding an outer index level with keys (multi-indexing)
# We can perform concatenation and also use multi-indexing dataframe as follows:
bank_all_df = pd.concat([bank1_df, bank2_df], keys=['Customers Group 1', 'Customers Group 2'])
bank_all_df
# Select every row in the first group (outer index level)
bank_all_df.loc['Customers Group 1', :]
# Select a single row by its full (outer, inner) index pair
bank_all_df.loc[('Customers Group 1', 0), :]
# Select one column for every row in the second group
bank_all_df.loc['Customers Group 2', 'First Name']
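The lookups above can be checked on a smaller frame. The sketch below assumes two hypothetical two-row groups (g1, g2) and shows the two-level index that keys creates, a full-tuple row lookup, the xs cross-section method as an alternative to .loc, and the optional names argument of pd.concat for labelling the index levels. The level names 'group' and 'row' are illustrative.

```python
import pandas as pd

g1 = pd.DataFrame({'Bank Client ID': ['1', '2'], 'First Name': ['Nancy', 'Alex']})
g2 = pd.DataFrame({'Bank Client ID': ['6', '7'], 'First Name': ['Bill', 'Dina']})

grouped = pd.concat([g1, g2], keys=['Customers Group 1', 'Customers Group 2'])
print(grouped.index.nlevels)  # 2

# A full (outer, inner) tuple selects exactly one row, returned as a Series
one_row = grouped.loc[('Customers Group 2', 1)]
print(one_row['First Name'])  # Dina

# xs() takes the cross-section for one outer label and drops that level
group1 = grouped.xs('Customers Group 1', level=0)
print(group1.index.tolist())  # [0, 1]

# names= labels the index levels so they can be referred to by name
named = pd.concat([g1, g2],
                  keys=['Customers Group 1', 'Customers Group 2'],
                  names=['group', 'row'])
print(list(named.index.names))  # ['group', 'row']
```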
MINI CHALLENGE #2:
raw_data = {'Bank Client ID': ['11', '12', '13', '14', '15'],
'First Name': ['Cho', 'Kim', 'Jung', 'Hwang', 'Cheon'],
'Last Name': ['Changseo', 'Minhyung', 'SXteve', 'Bob', 'Michelle'],
}
bank3_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'First Name', 'Last Name'])
bank_all_df = pd.concat([bank1_df, bank2_df, bank3_df], keys=['Customers Group 1', 'Customers Group 2', 'Customers Group 3'])
bank_all_df
bank_all_df.loc[('Customers Group 3'), :]
DATA MERGING
# Let's assume we obtained additional information (Annual Salary) about our bank customers
# Note that data obtained is for all clients with IDs 1 to 10
raw_data = {'Bank Client ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
'Annual Salary [$/year]': [25000, 35000, 45000, 48000, 49000, 32000, 33000, 34000, 23000, 22000]
}
bank_salary_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'Annual Salary [$/year]'])
bank_salary_df
# Let's merge all data on 'Bank Client ID'
# (pd.merge defaults to an inner join, so only IDs present in both dataframes are kept)
bank_all_df = pd.merge(bank_all_df, bank_salary_df, on='Bank Client ID')
bank_all_df
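If the cells are run in order, bank_all_df still contains clients 11 to 15 from Mini Challenge #2, while the salary data only covers IDs 1 to 10, so the default inner join drops those rows. The sketch below, on small hypothetical frames (names, salaries), compares the default with how='left' and uses indicator=True to show where each row came from. Whether unmatched clients should be kept depends on the analysis.

```python
import pandas as pd

names = pd.DataFrame({'Bank Client ID': ['1', '2', '11'],
                      'First Name': ['Nancy', 'Alex', 'Cho']})
salaries = pd.DataFrame({'Bank Client ID': ['1', '2'],
                         'Annual Salary [$/year]': [25000, 35000]})

# Default how='inner': client '11' has no salary row, so it is dropped
inner = pd.merge(names, salaries, on='Bank Client ID')

# how='left' keeps every row of names; the missing salary becomes NaN
left = pd.merge(names, salaries, on='Bank Client ID', how='left', indicator=True)

print(len(inner))               # 2
print(len(left))                # 3
print(left['_merge'].tolist())  # ['both', 'both', 'left_only']
```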
MINI CHALLENGE #3:
raw_data = {'Bank Client ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
'Credit card debt': [100, 200, 150, 400, 1000, 250, 700, 900, 530, 1100],
'Age': [26, 25, 22, 21, 30, 33, 31, 33, 28, 29]}
new_df = pd.DataFrame(data=raw_data, columns=['Bank Client ID', 'Credit card debt', 'Age'])
new_df
bank_all_df = pd.merge(bank_all_df, new_df, on='Bank Client ID')
bank_all_df
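When several tables are merged on the same key, duplicate IDs in any of them silently multiply rows. As an optional safeguard (not part of the course material), pd.merge accepts a validate argument that raises an error when the key relationship is not what was expected. The clients and debts frames below are hypothetical, with a deliberately duplicated ID.

```python
import pandas as pd

clients = pd.DataFrame({'Bank Client ID': ['1', '2'],
                        'First Name': ['Nancy', 'Alex']})
# Client '2' appears twice on purpose to trigger the check
debts = pd.DataFrame({'Bank Client ID': ['1', '2', '2'],
                      'Credit card debt': [100, 200, 250]})

try:
    pd.merge(clients, debts, on='Bank Client ID', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    # Raised because the key is not unique in the right-hand frame
    duplicate_detected = True

print(duplicate_detected)  # True

# Without validate, the duplicate key produces an extra row for client '2'
unchecked = pd.merge(clients, debts, on='Bank Client ID')
print(len(unchecked))  # 3
```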