[Section4] Concatenation, Merging, and Appending

Jinyoung Cheon · February 5, 2025


These notes summarize what I learned from the Udemy course 'Pandas 및 Python을 이용한 데이터 분석:마스터 클래스' (roughly, 'Data Analysis with Pandas and Python: Master Class').

https://www.udemy.com/course/best-pandas-python

Section1

DATAFRAME CONCATENATION

import pandas as pd

# Create a DataFrame from a dictionary
raw_data = {'Bank Client ID': ['1', '2', '3', '4', '5'],
            'First Name': ['Nancy', 'Alex', 'Shep', 'Max', 'Allen'],
            'Last Name': ['Rob', 'Ali', 'George', 'Mitch', 'Steve'],
            }

raw_data

bank1_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'First Name', 'Last Name'])
bank1_df

# Let's define another dataframe for a separate list of clients (IDs = 6, 7, 8, 9, 10)

raw_data = {'Bank Client ID': ['6', '7', '8', '9', '10'],
            'First Name': ['Bill', 'Dina', 'Sarah', 'Heather', 'Holly'],
            'Last Name': ['Christian', 'Mo', 'Steve', 'Bob', 'Michelle'],
            }

bank2_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'First Name', 'Last Name'])
bank2_df

ignore_index=False is the default, so each DataFrame keeps its own index labels. With ignore_index=True, the rows are renumbered into a single continuous index.

# Note that by default ignore_index has been set to False meaning indexes from both dataframes are kept unchanged

bank_all_df = pd.concat([bank1_df, bank2_df])

bank_all_df

# Note that by setting ignore_index=True, the index is rebuilt as a fresh integer index that ranges from 0 to 9

bank_all_df = pd.concat([bank1_df, bank2_df], ignore_index=True)

bank_all_df
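To make the difference concrete, here is a minimal sketch on two trimmed-down frames (the names `left`, `right`, `kept`, and `reset` are my own, not from the course). It shows that keeping the original indexes produces duplicate labels, which affects label-based lookups.

```python
import pandas as pd

# Two small illustrative frames (a trimmed-down version of the bank data)
left = pd.DataFrame({'Bank Client ID': ['1', '2'], 'First Name': ['Nancy', 'Alex']})
right = pd.DataFrame({'Bank Client ID': ['6', '7'], 'First Name': ['Bill', 'Dina']})

kept = pd.concat([left, right])                      # original labels are kept
reset = pd.concat([left, right], ignore_index=True)  # labels are renumbered

print(kept.index.tolist())   # [0, 1, 0, 1]
print(reset.index.tolist())  # [0, 1, 2, 3]

# With duplicate labels, .loc[0] returns every row labelled 0 (here: two rows)
print(kept.loc[0].shape)     # (2, 2)
```

If the original row labels carry no meaning, ignore_index=True is usually the safer choice, since later `.loc` lookups then return exactly one row per label.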

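The course also demonstrates DataFrame.append, but that method was removed in pandas 2.0 and no longer works in current versions. Use pd.concat instead.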
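For reference, a rough sketch of how an old append call can be rewritten with pd.concat. The frame `bank_df` and the dict `new_client` are hypothetical names used only for this illustration.

```python
import pandas as pd

bank_df = pd.DataFrame({'Bank Client ID': ['1', '2'],
                        'First Name': ['Nancy', 'Alex'],
                        'Last Name': ['Rob', 'Ali']})

# Hypothetical new client, used only for illustration
new_client = {'Bank Client ID': '3', 'First Name': 'Shep', 'Last Name': 'George'}

# Old style (removed in pandas 2.0):
#   bank_df = bank_df.append(new_client, ignore_index=True)
# With concat, wrap the dict in a one-row DataFrame first
bank_df = pd.concat([bank_df, pd.DataFrame([new_client])], ignore_index=True)

print(bank_df)
```

When adding many rows, it is generally cheaper to collect them in a list and call pd.concat once than to concatenate inside a loop.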

MINI CHALLENGE #1:

Assume that you and your significant other become new clients at the bank and would like to add your first names, last names and unique client IDs. Define a new DataFrame and add it to the master list "bank_all_df".

raw_data = {'Bank Client ID': ['11', '12', '13', '14', '15'],
            'First Name': ['Cho', 'Kim', 'Jung', 'Hwang', 'Cheon'],
            'Last Name': ['Changseo', 'Minhyung', 'SXteve', 'Bob', 'Michelle'],
            }

bank3_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'First Name', 'Last Name'])

bank_all_df = pd.concat([bank1_df, bank2_df, bank3_df], ignore_index=True)

Section2

DATAFRAME CONCATENATION WITH MULTI-INDEXING

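Creating an index level and multi-indexing: passing keys to pd.concat adds an outer index level that labels which DataFrame each row came from.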

# We can perform concatenation and also use multi-indexing dataframe as follows:
bank_all_df = pd.concat([bank1_df, bank2_df], keys=['Customers Group 1', 'Customers Group 2'])
bank_all_df

# Select every row in 'Customers Group 1' (the outer index level)
bank_all_df.loc[('Customers Group 1'), :]

# Select a single row: inner label 0 within 'Customers Group 1'
bank_all_df.loc[('Customers Group 1', 0), :]

# Select the 'First Name' column for 'Customers Group 2'
bank_all_df.loc[('Customers Group 2'), 'First Name']
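As a complement to the `.loc` lookups above, the sketch below shows two other ways to slice a two-level index: a full tuple for one exact row, and `xs` for a cross-section on the inner level. The frames `g1`, `g2`, and `grouped` are illustrative names, not part of the course code.

```python
import pandas as pd

g1 = pd.DataFrame({'Bank Client ID': ['1', '2'], 'First Name': ['Nancy', 'Alex']})
g2 = pd.DataFrame({'Bank Client ID': ['6', '7'], 'First Name': ['Bill', 'Dina']})

grouped = pd.concat([g1, g2], keys=['Customers Group 1', 'Customers Group 2'])

# The row index now has two levels: (group key, original row label)
print(grouped.index.nlevels)  # 2

# A full (outer, inner) tuple picks exactly one row
row = grouped.loc[('Customers Group 2', 1), :]
print(row['First Name'])      # Dina

# xs with level=1 takes the row labelled 0 from every group
first_of_each = grouped.xs(0, level=1)
print(first_of_each['First Name'].tolist())  # ['Nancy', 'Bill']
```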

MINI CHALLENGE #2:

Assume that you and your significant other belong to Customers Group #3. Use multi-indexing to add both names to the master list. Write a line of code to access Group #3 only.

raw_data = {'Bank Client ID': ['11', '12', '13', '14', '15'],
            'First Name': ['Cho', 'Kim', 'Jung', 'Hwang', 'Cheon'],
            'Last Name': ['Changseo', 'Minhyung', 'SXteve', 'Bob', 'Michelle'],
            }

bank3_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'First Name', 'Last Name'])

bank_all_df = pd.concat([bank1_df, bank2_df, bank3_df], keys=['Customers Group 1', 'Customers Group 2', 'Customers Group 3'])

bank_all_df

bank_all_df.loc[('Customers Group 3'), :]

Section3

DATA MERGING

# Let's assume we obtained additional information (Annual Salary) about our bank customers
# Note that data obtained is for all clients with IDs 1 to 10

raw_data = {'Bank Client ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
            'Annual Salary [$/year]': [25000, 35000, 45000, 48000, 49000, 32000, 33000, 34000, 23000, 22000]
            }

bank_salary_df = pd.DataFrame(raw_data, columns=['Bank Client ID', 'Annual Salary [$/year]'])

bank_salary_df

# Let's merge all data on 'Bank Client ID'
# pd.merge defaults to an inner join, so only IDs present in both DataFrames are kept

bank_all_df = pd.merge(bank_all_df, bank_salary_df, on='Bank Client ID')
bank_all_df
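Because the default inner join silently drops clients without a matching ID, it can help to compare it with a left join. The following is a small sketch on made-up frames (`clients`, `salaries`, `inner`, and `left` are my own names), using the `how` and `indicator` parameters of pd.merge.

```python
import pandas as pd

clients = pd.DataFrame({'Bank Client ID': ['1', '2', '11'],
                        'First Name': ['Nancy', 'Alex', 'Cho']})
salaries = pd.DataFrame({'Bank Client ID': ['1', '2'],
                         'Annual Salary [$/year]': [25000, 35000]})

# Default how='inner': client '11' has no salary row, so it disappears
inner = pd.merge(clients, salaries, on='Bank Client ID')

# how='left' keeps every client; indicator=True adds a '_merge' column
left = pd.merge(clients, salaries, on='Bank Client ID', how='left', indicator=True)

print(len(inner))               # 2
print(len(left))                # 3
print(left['_merge'].tolist())  # ['both', 'both', 'left_only']
```

Rows marked `left_only` have NaN in the salary column, which makes missing information visible instead of dropping the client.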

MINI CHALLENGE #3:

Let's assume that you were able to obtain two new pieces of information about the bank clients such as: (1) credit card debt, (2) age
Define a new DataFrame that contains this new information
Merge this new information to the DataFrame "bank_all_df".

raw_data = {'Bank Client ID': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
            'Credit card debt': [100, 200, 150, 400, 1000, 250, 700, 900, 530, 1100],
            'Age': [26, 25, 22, 21, 30, 33, 31, 33, 28, 29]}

new_df = pd.DataFrame(data=raw_data, columns=['Bank Client ID', 'Credit card debt', 'Age'])

new_df

bank_all_df = pd.merge(bank_all_df, new_df, on='Bank Client ID')

bank_all_df
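When merging several tables on the same ID, a duplicated ID in one of them would multiply rows without warning. One possible safeguard, sketched here with hypothetical frames (`clients`, `extra`) and a deliberately duplicated ID, is the `validate` parameter of pd.merge.

```python
import pandas as pd

clients = pd.DataFrame({'Bank Client ID': ['1', '2'],
                        'First Name': ['Nancy', 'Alex']})

# Hypothetical extra data in which ID '2' appears twice by mistake
extra = pd.DataFrame({'Bank Client ID': ['1', '2', '2'],
                      'Age': [26, 25, 52]})

try:
    # validate='one_to_one' requires the key to be unique in both frames
    pd.merge(clients, extra, on='Bank Client ID', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(duplicate_detected)  # True
```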
