[Section3] Pandas DataFrames Fundamentals

Jinyoung Cheon · February 3, 2025

This post is a summary of what I learned from the Udemy course 'Data Analysis with Pandas and Python: Master Class'.

https://www.udemy.com/course/best-pandas-python

Section1

DEFINE A PANDAS DATAFRAME

MINI CHALLENGE #1:

  • A portfolio contains a collection of securities such as stocks, bonds, and ETFs. Define a DataFrame named 'portfolio_df' that holds 3 different stock ticker symbols, the number of shares, and the price per share (feel free to choose any stocks)
  • Calculate the total value of the portfolio including all stocks
import pandas as pd

portfolio_df = pd.DataFrame({'stock ticker symbol':['AAPL', 'AMZN', 'T'],
                             'price per share [$]':[3500, 200, 40],
                             'Number of stocks':[3, 4, 9]})

portfolio_df
stocks_dollar_value = portfolio_df['price per share [$]'] * portfolio_df['Number of stocks']

stocks_dollar_value.sum()
print('Total Portfolio Value = {}'.format(stocks_dollar_value.sum()))

Section2

INPUTS (READ CSV AND HTML DATA)

MINI CHALLENGE #2:

# read_html returns a list of every table found on the page
retirement_df = pd.read_html('https://www.ssa.gov/oact/progdata/nra.html')
# select the first table from the list
retirement_df[0]

Section3

OUTPUTS (WRITE DATAFRAME INTO CSV)

MINI CHALLENGE #3:

  • Set index=False and rerun the cell. Comment on the output CSV.
  • Try to leverage the compression parameter and rerun the cell
bank_df.to_csv('sample_output.csv', index=False)
bank_df.to_csv('sample_output.csv.gz', index=False, compression='gzip')
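To see that the compressed file round-trips cleanly, here is a quick sketch using a small stand-in DataFrame (the course's bank_df comes from a CSV not reproduced in these notes):

```python
import os
import pandas as pd

# Small stand-in for the course's bank_df
bank_df = pd.DataFrame({'First Name': ['Noah', 'Kate'],
                        'Net Worth': [3500, 29000]})

bank_df.to_csv('sample_output.csv.gz', index=False, compression='gzip')

# read_csv infers gzip compression from the .gz extension
restored_df = pd.read_csv('sample_output.csv.gz')
print(restored_df.equals(bank_df))  # True: values and dtypes survive the round trip
os.remove('sample_output.csv.gz')
```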

Section4

SETTING/RESETTING INDEX

# Note that a numeric index is being set by default

# You can assign a specific column to be the index as follows
bank_df.set_index('First Name', inplace=True)
# You can go back and use numeric index using reset_index
bank_df.reset_index(inplace=True)
bank_df
# Alternatively, you can set the index name when you read the csv file as follows
bank_df = pd.read_csv('bank_client_information.csv', index_col='First Name')
bank_df
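The set_index/reset_index round trip can be sketched on a toy frame (a stand-in, since the CSV is course material):

```python
import pandas as pd

# Stand-in for bank_df
bank_df = pd.DataFrame({'First Name': ['Noah', 'Kate'],
                        'Net Worth': [3500, 29000]})

# 'First Name' moves from the columns into the index
bank_df.set_index('First Name', inplace=True)
print(bank_df.index.tolist())    # ['Noah', 'Kate']

# reset_index moves it back and restores the default numeric index
bank_df.reset_index(inplace=True)
print(bank_df.columns.tolist())  # ['First Name', 'Net Worth']
```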

MINI CHALLENGE #4:

  • Load the bank_client_information.csv and then set the "Last Name" column as the index
bank_df = pd.read_csv('bank_client_information.csv', index_col='Last Name')
bank_df

Section5

SELECTING COLUMNS FROM A DATAFRAME

# Note that the output will be a Pandas Series (you can automatically tell by the lack of formatting)
sample = bank_df['Email']
# you need to define a list containing all column names that you would like to select
# Note that since we collected more than one column, the output is a DataFrame (Notice the rich formatting)
sample1 = bank_df[['First Name', 'Net Worth']]
bank_df.Email

Note that dot notation like this cannot be used for column names that contain spaces.
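The Series-vs-DataFrame distinction and the dot-notation limitation can be checked directly (using a small stand-in frame):

```python
import pandas as pd

bank_df = pd.DataFrame({'First Name': ['Noah', 'Kate'],
                        'Email': ['n@x.com', 'k@x.com'],
                        'Net Worth': [3500, 29000]})

# Single bracket -> Series; double bracket (list of names) -> DataFrame
print(type(bank_df['Email']).__name__)                      # Series
print(type(bank_df[['First Name', 'Net Worth']]).__name__)  # DataFrame

# Dot notation is equivalent, but only for names without spaces
print(bank_df.Email.equals(bank_df['Email']))  # True
# bank_df.Net Worth  -> SyntaxError: the column name contains a space
```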

MINI CHALLENGE #5:

  • Select the following columns from the DataFrame: Net Worth, Years with Bank, and Postal Code
selected_columns = ['Net Worth', 'Years with Bank', 'Postal Code']

sample2 = bank_df[selected_columns]
sample2

Section6

ADDING/DELETING COLUMNS TO DATAFRAME

# You can also insert a new column in a given position
bank_df.insert(0, column='Credit Score',
               value=[600, 700, 750, 699, 550, 600, 750, 500, 520, 510])
# Delete a column from a DataFrame
del bank_df['Email']
# Remove a column from a DataFrame and store it somewhere else using pop
Years_with_bank = bank_df.pop('Years with Bank')
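The difference between insert, del, and pop can be sketched on a toy frame (column names assumed for illustration):

```python
import pandas as pd

bank_df = pd.DataFrame({'First Name': ['Noah', 'Kate'],
                        'Email': ['n@x.com', 'k@x.com'],
                        'Years with Bank': [5, 12]})

# insert places the new column at a given position (0 = first)
bank_df.insert(0, column='Credit Score', value=[600, 700])
print(bank_df.columns[0])               # Credit Score

del bank_df['Email']                    # removed, nothing returned
years = bank_df.pop('Years with Bank')  # removed AND returned as a Series

print(years.tolist())                   # [5, 12]
print(bank_df.columns.tolist())         # ['Credit Score', 'First Name']
```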

MINI CHALLENGE #6:

  • Load the bank_client_information.csv file and perform the following: (assume any reasonable values)
    • Add a column indicating whether the client has a mortgage or not
    • Add a column indicating the value of the mortgage in dollars
bank_df = pd.read_csv('bank_client_information.csv')

# We can indicate whether the client has a mortgage or not as True or False
bank_df.insert(5, column='Mortgage', value=[True, True, False, False, False, True, True, False, True, False])

bank_df.insert(6, column='Value of Mortgage', value=[100, 120, 0, 0, 0, 200, 150, 0, 230, 0])

bank_df

Section7

LABEL-BASED ELEMENTS SELECTION FROM A DATAFRAME ".LOC()"

# loc is used to filter rows and columns
# loc is label-based meaning you need to give a name of the rows (or columns) that you are interested in selecting
# Note that iloc is "integer index-based" meaning you can filter rows/columns by their integer index.
# Note that we obtained a ***Series*** because last name "Small" existed only once in the dataframe
# You can also randomly select a fraction of the DataFrame
# Setting axis = 0 means rows, setting axis = 1 means columns
bank_df.sample(n=5, axis=0)
# return a percentage (Ex: 30%) of the rows
bank_df.sample(frac=0.3, axis=0)
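Because sample is random, rerunning it normally gives different rows; passing random_state (a parameter sample accepts) pins the selection. A sketch on a toy frame:

```python
import pandas as pd

bank_df = pd.DataFrame({'Net Worth': range(10)})

# Fixing the seed makes the "random" selection repeatable
s1 = bank_df.sample(n=2, axis=0, random_state=42)
s2 = bank_df.sample(n=2, axis=0, random_state=42)
print(s1.equals(s2))  # True

# frac selects a percentage of rows: 30% of 10 rows -> 3 rows
print(len(bank_df.sample(frac=0.3, axis=0)))  # 3
```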

MINI CHALLENGE #7:

  • Load the csv data and use the "first name" column as the index
  • Randomly select 2 rows from the DataFrame. Rerun the code and ensure that random rows are being selected
bank_df = pd.read_csv('bank_client_information.csv', index_col='First Name')
bank_df.sample(n=2, axis=0)
bank_df.sample(n=2, axis=0)

Section8

INTEGER INDEX-BASED ELEMENTS SELECTION FROM A DATAFRAME "iLOC()"

# You can access rows with their numeric index using iloc
bank_df.iloc[9]
# You can access multiple rows with their numeric index using iloc and colon :
# Note that using iloc is exclusive, meaning that we did not include the last element (quite confusing I know!)
bank_df.iloc[2:5]
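The exclusive behavior of iloc slicing is easiest to see next to loc, whose label slicing is inclusive. A sketch on a toy frame:

```python
import pandas as pd

bank_df = pd.DataFrame({'Net Worth': [100, 200, 300, 400, 500, 600]})

# iloc slicing EXCLUDES the stop position, like Python lists
print(bank_df.iloc[2:5].index.tolist())  # [2, 3, 4]

# loc slicing on labels INCLUDES the stop label
print(bank_df.loc[2:5].index.tolist())   # [2, 3, 4, 5]
```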

MINI CHALLENGE #8:

  • Write code that selects the last two rows in the DataFrame using two different methods
bank_df.iloc[8:10]
bank_df.iloc[[8,9]]
bank_df.iloc[-2:]

Section9

BROADCASTING OPERATIONS & SETTING NEW DATAFRAME VALUES

# Alternatively, you can add or subtract as follows
bank_df['Net Worth'] = bank_df['Net Worth'].add(1000)
# Let's convert from USD to CAD using the exchange rate 1 USD = 1.3 CAD
bank_df['Net Worth (CAD)'] = bank_df['Net Worth'].mul(1.3)
bank_df
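The method form (add, mul) and the operator form (+, *) broadcast identically; a toy example, with the 1.3 USD-to-CAD rate taken from the notes above:

```python
import pandas as pd

bank_df = pd.DataFrame({'Net Worth': [1000.0, 2000.0]})

# Operator form and method form produce the same broadcast result
via_operator = bank_df['Net Worth'] + 1000
via_method = bank_df['Net Worth'].add(1000)
print(via_operator.equals(via_method))  # True

# mul(1.3) is the same as * 1.3
print(bank_df['Net Worth'].mul(1.3).equals(bank_df['Net Worth'] * 1.3))  # True
```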

MINI CHALLENGE #9:

  • Let's assume that all clients in the bank have been investing their assets in a broad S&P 500 ETF. The market has been performing really well and clients' net worth has increased by 12% annually. Calculate the sum of all clients' net worth.
bank_df['Net Worth'] = bank_df['Net Worth'].mul(1.12)
bank_df['Net Worth'].sum()
print('Total Net Worth of all clients = ${}'.format(bank_df['Net Worth'].sum()))

Section10

SORTING AND ORDERING

# You can sort the values in the dataframe according to number of years with bank
bank_df.sort_values(by='Years with Bank')
# Set inplace = True to ensure that change has taken place in memory
bank_df.sort_values(by='Years with Bank', inplace=True)
# You can sort the values in a descending order as follows
# 내림차순 -> ascending=False
bank_df.sort_values(by='Years with Bank', ascending=False, inplace=True)
bank_df

MINI CHALLENGE #10:

  • Sort customers by net worth instead of years with bank. Make sure to update values in-memory.
bank_df.sort_values(by='Net Worth', inplace=True)

Section11

PANDAS WITH FUNCTIONS

# Define a function that increases every client's net worth by a fixed 10% (for simplicity's sake)
def networth_update(balance):
    return balance * 1.1
# You can apply a function to the DataFrame
bank_df['Net Worth'] = bank_df['Net Worth'].apply(networth_update)
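apply also accepts a lambda, which is handy for one-off transformations like this; a sketch on a toy frame:

```python
import pandas as pd

bank_df = pd.DataFrame({'Net Worth': [1000.0, 2000.0]})

def networth_update(balance):
    return balance * 1.1

# A named function and an equivalent lambda give identical results
named = bank_df['Net Worth'].apply(networth_update)
anon = bank_df['Net Worth'].apply(lambda balance: balance * 1.1)
print(named.equals(anon))  # True
```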

MINI CHALLENGE #11:

  • Define a function that doubles an argument and adds $100
  • Apply the function to the Net Worth Column in the DataFrame
  • Calculate the updated total networth of all clients combined
def update_networth2(value):
    return value * 2 + 100
bank_df['new Networth'] = bank_df['Net Worth'].apply(update_networth2)
bank_df['new Networth'].sum()

Section12

PANDAS OPERATIONS/FILTERING

# Pick certain rows that satisfy a certain criterion
df_loyal = bank_df[(bank_df['Years with Bank'] >= 10)]
# Pick certain rows that satisfy 2 or more criteria

mask_1 = bank_df['Years with Bank'] >= 10
mask_2 = bank_df['Net Worth'] >= 50000

df_loyal = bank_df[mask_1 & mask_2]
mask = bank_df['Last Name'].isin(['Steve', 'Mo'])
bank_df[mask]
# values that fall between a given range
bank_df[bank_df['Net Worth'].between(5000, 9000)]
# The tilde symbol ~ inverts the mask, keeping only last names that appear exactly once
mask = ~bank_df['Last Name'].duplicated(keep=False)
bank_df[mask]
# Drop duplicates
bank_df.drop_duplicates(subset=['Last Name'], inplace=True)
# We can also filter the DataFrame using the where method as follows
# Note that all rows that don't satisfy the criteria are set to NaN
bank_df = pd.read_csv('bank_client_information.csv')
mask = bank_df['Net Worth'] >= 20000

bank_df.where(mask)
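The difference between boolean indexing and where is what happens to the non-matching rows; a sketch on a toy frame:

```python
import pandas as pd

bank_df = pd.DataFrame({'Net Worth': [5000, 25000, 60000]})
mask = bank_df['Net Worth'] >= 20000

# Boolean indexing DROPS the non-matching rows...
print(len(bank_df[mask]))          # 2

# ...while where KEEPS them, replaced by NaN, so the shape is preserved
filtered = bank_df['Net Worth'].where(mask)
print(len(filtered))               # 3
print(int(filtered.isna().sum()))  # 1
```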

MINI CHALLENGE #12:

  • Using the "bank_client_df" DataFrame, leverage pandas operations to only select high-net-worth individuals with a minimum net worth of $15,000
  • What is the combined net worth of all customers with a net worth of $15,000 or more?
bank_df = pd.read_csv('bank_client_information.csv')

df_high_networth = bank_df[(bank_df['Net Worth'] >= 15000)]
df_high_networth['Net Worth'].sum()

Section13

FEATURE ENGINEERING AND DEALING WITH MISSING DATA

# first, let's locate rows that have Null values
employee_df.isnull()
# Count the number of Null values in each column
employee_df.isnull().sum()
# We can also indicate which columns we want to drop NaN from
employee_df.dropna(how='any', inplace=True, subset=['MonthlyIncome', 'PercentSalaryHike'])
employee_df
# You can use fillna to fill a given column with a certain value
# (assigning back to the column avoids the chained-inplace warning in recent pandas)
employee_df['MonthlyIncome'] = employee_df['MonthlyIncome'].fillna(employee_df['MonthlyIncome'].mean())
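The mean-imputation step above can be sketched on a tiny stand-in frame (the real Human_Resources.csv is course material):

```python
import numpy as np
import pandas as pd

employee_df = pd.DataFrame({'MonthlyIncome': [1000.0, np.nan, 3000.0]})

# Fill the gap with the column mean (2000.0 for this toy data)
mean_income = employee_df['MonthlyIncome'].mean()
employee_df['MonthlyIncome'] = employee_df['MonthlyIncome'].fillna(mean_income)

print(int(employee_df['MonthlyIncome'].isnull().sum()))  # 0: no missing values left
print(employee_df['MonthlyIncome'].tolist())             # [1000.0, 2000.0, 3000.0]
```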

MINI CHALLENGE #13:

  • Calculate the median monthly income. Use the calculated median value to fill in the missing data. Confirm that the process is successful
employee_df = pd.read_csv('Human_Resources.csv')
employee_df.isnull().sum()
employee_df['MonthlyIncome'].median()
employee_df['MonthlyIncome'] = employee_df['MonthlyIncome'].fillna(employee_df['MonthlyIncome'].median())

Section14

CHANGE DATATYPES

# Let's convert the hourly rate from int64 to float64
employee_df['HourlyRate'] = employee_df['HourlyRate'].astype('float64')
employee_df.info()
# Since we have a limited number of classes, we can use the category datatype
# check the memory usage with the info method to ensure that the size has been reduced
employee_df['PerformanceRating'] = employee_df['PerformanceRating'].astype('category')
employee_df['RelationshipSatisfaction'] = employee_df['RelationshipSatisfaction'].astype('category')

MINI CHALLENGE #14:

  • Convert the BusinessTravel column to category format.
  • How many KBytes in memory have been saved?
employee_df['BusinessTravel'] = employee_df['BusinessTravel'].astype('category')
employee_df.info()
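The memory saving can be measured directly with memory_usage(deep=True); a sketch with made-up BusinessTravel-style values:

```python
import pandas as pd

# A column with only a few distinct values repeated many times
travel = pd.Series(['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'] * 500)

as_object = travel.memory_usage(deep=True)
as_category = travel.astype('category').memory_usage(deep=True)

# category stores small integer codes plus one lookup table of the
# distinct labels, instead of 1500 separate string objects
print(as_category < as_object)  # True
print('bytes saved:', as_object - as_category)
```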
