These notes were compiled after taking the Udemy course 'Data Analysis with Pandas and Python: Master Class'.
DEFINE A PANDAS DATAFRAME
MINI CHALLENGE #1:
portfolio_df = pd.DataFrame({'stock ticker symbol': ['AAPL', 'AMZN', 'T'],
                             'price per share [$]': [3500, 200, 40],
                             'Number of stocks': [3, 4, 9]})
portfolio_df
stocks_dollar_value = portfolio_df['price per share [$]'] * portfolio_df['Number of stocks']
stocks_dollar_value.sum()
print('Total Portfolio Value = {}'.format(stocks_dollar_value.sum()))
INPUTS (READ CSV AND HTML DATA)
MINI CHALLENGE #2:
retirement_df = pd.read_html('https://www.ssa.gov/oact/progdata/nra.html')
retirement_df[0]
OUTPUTS (WRITE DATAFRAME INTO CSV)
MINI CHALLENGE #3:
bank_df.to_csv('sample_output.csv', index=False)
bank_df.to_csv('sample_output.csv.gz', index=False, compression='gzip')
SETTING/RESETTING INDEX
# Note that a numeric index is being set by default
# You can assign a specific column to be the index as follows
bank_df.set_index('First Name', inplace=True)
# You can go back and use numeric index using reset_index
bank_df.reset_index(inplace=True)
bank_df
# Alternatively, you can set the index name when you read the csv file as follows
bank_df = pd.read_csv('bank_client_information.csv', index_col='First Name')
bank_df
MINI CHALLENGE #4:
bank_df = pd.read_csv('bank_client_information.csv')
bank_df
SELECTING COLUMNS FROM A DATAFRAME
# Note that the output will be a Pandas Series (you can tell by the lack of rich formatting)
sample = bank_df['Email']
# you need to define a list containing all column names that you would like to select
# Note that since we collected more than one column, the output is a DataFrame (Notice the rich formatting)
sample1 = bank_df[['First Name', 'Net Worth']]
bank_df.Email
Note that this dot-notation method cannot be used for column names that contain spaces.
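The note above can be illustrated with a small sketch (the tiny DataFrame below is a made-up example, not the course's bank_df):

```python
import pandas as pd

# hypothetical mini DataFrame to illustrate the point
df = pd.DataFrame({'Email': ['a@x.com', 'b@x.com'],
                   'Net Worth': [1000, 2000]})

# attribute (dot) access works only for names without spaces
emails = df.Email

# a column name containing a space requires bracket notation
net_worth = df['Net Worth']   # df.Net Worth would be a SyntaxError
```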
MINI CHALLENGE #5:
selected_columns = ['Net Worth', 'Years with Bank', 'Postal Code']
sample2 = bank_df[selected_columns]
sample2
ADDING/DELETING COLUMNS IN A DATAFRAME
# You can also insert a new column in a given position
bank_df.insert(0, column='Credit Score',
               value=[600, 700, 750, 699, 550, 600, 750, 500, 520, 510])
# Delete a column from a DataFrame
del bank_df['Email']
# Remove a column from a DataFrame and store it somewhere else using pop
Years_with_bank = bank_df.pop('Years with Bank')
MINI CHALLENGE #6:
bank_df = pd.read_csv('bank_client_information.csv')
# We can indicate whether the client has a mortgage or not as 0 or 1
bank_df.insert(5, column='Mortgage', value=[True, True, False, False, False, True, True, False, True, False])
bank_df.insert(6, column='value of Mortgage', value=[100, 120, 0, 0, 0, 200, 150, 0, 230, 0])
bank_df
LABEL-BASED ELEMENTS SELECTION FROM A DATAFRAME ".LOC()"
# loc is used to filter rows and columns
# loc is label-based meaning you need to give a name of the rows (or columns) that you are interested in selecting
# Note that iloc is "integer index-based" meaning you can filter rows/columns by their integer index.
# Note that we obtained a ***Series*** because last name "Small" existed only once in the dataframe
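The comments above describe label-based selection with .loc, but the snippet itself was not captured in these notes; a minimal sketch, assuming a small clients table indexed by last name (the names and values here are assumptions):

```python
import pandas as pd

# hypothetical clients table indexed by last name
bank_df = pd.DataFrame({'Net Worth': [9000, 5000, 7000]},
                       index=['Small', 'Steve', 'Steve'])

# loc is label-based: select the row whose index label is 'Small'
row = bank_df.loc['Small']    # a Series, since 'Small' appears only once

# selecting a label that appears more than once returns a DataFrame
rows = bank_df.loc['Steve']
```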
# You can also randomly select a fraction of the DataFrame
# Setting axis = 0 means rows, setting axis = 1 means columns
bank_df.sample(n=5, axis=0)
# return a percentage (Ex: 30%) of the rows
bank_df.sample(frac=0.3, axis=0)
MINI CHALLENGE #7:
bank_df.sample(n=2, axis=0)
bank_df.sample(n=2, axis=0)
INTEGER INDEX-BASED ELEMENTS SELECTION FROM A DATAFRAME "iLOC()"
# You can access rows with their numeric index using iloc
bank_df.iloc[9]
# You can access multiple rows with their numeric index using iloc and colon :
# Note that slicing with iloc is exclusive of the stop index, meaning the last element is not included (quite confusing I know!)
bank_df.iloc[2:5]
MINI CHALLENGE #8:
bank_df.iloc[8:10]
bank_df.iloc[[8,9]]
bank_df.iloc[-2:]
BROADCASTING OPERATIONS & SETTING NEW DATAFRAME VALUES
# Alternatively, you can add or subtract as follows
bank_df['Net Worth'] = bank_df['Net Worth'].add(1000)
# Let's convert from USD to CAD using the exchange rate 1 USD = 1.3 CAD
bank_df['Net Worth (CAD)'] = bank_df['Net Worth'].mul(1.3)
bank_df
MINI CHALLENGE #9:
bank_df['Net Worth'] = bank_df['Net Worth'].mul(1.12)
bank_df['Net Worth'].sum()
print('Total Net Worth of all clients = ${}'.format(bank_df['Net Worth'].sum()))
SORTING AND ORDERING
# You can sort the values in the dataframe according to number of years with bank
bank_df.sort_values(by='Years with Bank')
# Set inplace = True to ensure that change has taken place in memory
bank_df.sort_values(by='Years with Bank', inplace=True)
# You can sort the values in a descending order as follows
# descending order -> ascending=False
bank_df.sort_values(by='Years with Bank', ascending=False, inplace=True)
bank_df
MINI CHALLENGE #10:
bank_df.sort_values(by='Net Worth', inplace=True)
PANDAS WITH FUNCTIONS
# Define a function that increases all clients' net worth by a fixed 10% (for simplicity's sake)
def networth_update(balance):
    return balance * 1.1
# You can apply a function to the DataFrame
bank_df['Net Worth'] = bank_df['Net Worth'].apply(networth_update)
MINI CHALLENGE #11:
def update_networth2(value):
    return value * 2 + 100
bank_df['new Networth'] = bank_df['Net Worth'].apply(update_networth2)
bank_df['new Networth'].sum()
PANDAS OPERATIONS/FILTERING
# Pick certain rows that satisfy a certain criteria
df_loyal = bank_df[(bank_df['Years with Bank'] >= 10)]
# Pick certain rows that satisfy 2 or more criteria
mask_1 = bank_df['Years with Bank'] >= 10
mask_2 = bank_df['Net Worth'] >= 50000
df_loyal = bank_df[mask_1 & mask_2]
mask = bank_df['Last Name'].isin(['Steve', 'Mo'])
bank_df[mask]
# values that fall between a given range
bank_df[bank_df['Net Worth'].between(5000, 9000)]
# the tilde symbol ~ negates the boolean mask
# keep=False marks all occurrences of a duplicate, so this keeps only last names that appear exactly once
mask = ~bank_df['Last Name'].duplicated(keep=False)
bank_df[mask]
# Drop duplicates
bank_df.drop_duplicates(subset=['Last Name'], inplace=True)
# We can also filter the Dataframe using the where method as follows
# Note that all rows that don't satisfy this criteria are set to NaN
bank_df = pd.read_csv('bank_client_information.csv')
mask = bank_df['Net Worth'] >= 20000
bank_df.where(mask)
MINI CHALLENGE #12:
bank_df = pd.read_csv('bank_client_information.csv')
df_high_networth = bank_df[(bank_df['Net Worth'] >= 15000)]
df_high_networth['Net Worth'].sum()
FEATURE ENGINEERING AND DEALING WITH MISSING DATA
# first, let's locate rows that have Null values
employee_df.isnull()
# next, let's count the number of Null values in each column
employee_df.isnull().sum()
# We can also indicate which columns we want to drop NaN from
employee_df.dropna(how='any', inplace=True, subset=['MonthlyIncome', 'PercentSalaryHike'])
employee_df
# You can use Fillna to fill a given column with a certain value
employee_df['MonthlyIncome'].fillna(employee_df['MonthlyIncome'].mean(), inplace=True)
MINI CHALLENGE #13:
employee_df = pd.read_csv('Human_Resources.csv')
employee_df.isnull().sum()
employee_df['MonthlyIncome'].median()
employee_df['MonthlyIncome'].fillna(employee_df['MonthlyIncome'].median(), inplace=True)
CHANGE DATATYPES
# Let's convert the hourly rate from int64 to float64
employee_df['HourlyRate'] = employee_df['HourlyRate'].astype('float64')
employee_df.info()
# Since we have a limited number of classes, we can use the category datatype
# check the memory usage with the info method to ensure that the size has been reduced
employee_df['PerformanceRating'] = employee_df['PerformanceRating'].astype('category')
employee_df['RelationshipSatisfaction'] = employee_df['RelationshipSatisfaction'].astype('category')
MINI CHALLENGE #14:
employee_df['BusinessTravel'] = employee_df['BusinessTravel'].astype('category')
employee_df.info()