These notes were compiled after taking the Udemy course 'Data Analysis with Pandas and Python: Master Class'.
DEFINE A PANDAS DATAFRAME
MINI CHALLENGE #1:
portfolio_df = pd.DataFrame({'stock ticker symbol': ['AAPL', 'AMZN', 'T'],
                             'price per share [$]': [3500, 200, 40],
                             'Number of stocks': [3, 4, 9]})
portfolio_df
stocks_dollar_value = portfolio_df['price per share [$]'] * portfolio_df['Number of stocks']
stocks_dollar_value.sum()
print('Total Portfolio Value = {}'.format(stocks_dollar_value.sum()))
INPUTS (READ CSV AND HTML DATA)
MINI CHALLENGE #2:
retirement_df = pd.read_html('https://www.ssa.gov/oact/progdata/nra.html')
retirement_df[0]
OUTPUTS (WRITE DATAFRAME INTO CSV)
MINI CHALLENGE #3:
bank_df.to_csv('sample_output.csv', index=False)
bank_df.to_csv('sample_output.csv.gz', index=False, compression='gzip')
SETTING/RESETTING INDEX
# Note that a numeric index is being set by default
# You can assign a specific column to be the index as follows
bank_df.set_index('First Name', inplace=True)
# You can go back and use numeric index using reset_index
bank_df.reset_index(inplace=True)
bank_df
# Alternatively, you can set the index name when you read the csv file as follows
bank_df = pd.read_csv('bank_client_information.csv', index_col='First Name')
bank_df
MINI CHALLENGE #4:
bank_df = pd.read_csv('bank_client_information.csv')
bank_df
SELECTING COLUMNS FROM A DATAFRAME
# Note that the output will be a Pandas Series (you can tell by the lack of rich formatting)
sample = bank_df['Email']
# you need to define a list containing all column names that you would like to select
# Note that since we collected more than one column, the output is a DataFrame (Notice the rich formatting)
sample1 = bank_df[['First Name', 'Net Worth']]
bank_df.Email
Note that this dot-notation method cannot be used for column names that contain spaces.
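The note above can be illustrated with a small sketch (the tiny DataFrame below is a made-up example, not the course's bank_df):

```python
import pandas as pd

# hypothetical mini DataFrame to illustrate the point
df = pd.DataFrame({'Email': ['a@x.com', 'b@x.com'],
                   'Net Worth': [1000, 2000]})

# attribute (dot) access works only for names without spaces
emails = df.Email

# a column name containing a space requires bracket notation
net_worth = df['Net Worth']   # df.Net Worth would be a SyntaxError
```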
MINI CHALLENGE #5:
selected_columns = ['Net Worth', 'Years with Bank', 'Postal Code']
sample2 = bank_df[selected_columns]
sample2
ADDING/DELETING COLUMNS IN A DATAFRAME
# You can also insert a new column in a given position
bank_df.insert(0, column='Credit Score',
               value=[600, 700, 750, 699, 550, 600, 750, 500, 520, 510])
# Delete a column from a DataFrame
del bank_df['Email']
# Remove a column from a DataFrame and store it somewhere else using pop
Years_with_bank = bank_df.pop('Years with Bank')
MINI CHALLENGE #6:
bank_df = pd.read_csv('bank_client_information.csv')
# We can indicate whether the client has a mortgage or not as 0 or 1
bank_df.insert(5, column='Mortgage', value=[True, True, False, False, False, True, True, False, True, False])
bank_df.insert(6, column='value of Mortgage', value=[100, 120, 0, 0, 0, 200, 150, 0, 230, 0])
bank_df
LABEL-BASED ELEMENTS SELECTION FROM A DATAFRAME ".LOC()"
# loc is used to filter rows and columns
# loc is label-based meaning you need to give a name of the rows (or columns) that you are interested in selecting
# Note that iloc is "integer index-based" meaning you can filter rows/columns by their integer index.
# Note that we obtained a ***Series*** because last name "Small" existed only once in the dataframe
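The comments above describe label-based selection with .loc, but the snippet itself was not captured in these notes; a minimal sketch, assuming a small clients table indexed by last name (the names and values here are assumptions):

```python
import pandas as pd

# hypothetical clients table indexed by last name
bank_df = pd.DataFrame({'Net Worth': [9000, 5000, 7000]},
                       index=['Small', 'Steve', 'Steve'])

# loc is label-based: select the row whose index label is 'Small'
row = bank_df.loc['Small']    # a Series, since 'Small' appears only once

# selecting a label that appears more than once returns a DataFrame
rows = bank_df.loc['Steve']
```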
# You can also randomly select a fraction of the DataFrame
# Setting axis = 0 means rows, setting axis = 1 means columns
bank_df.sample(n=5, axis=0)
# return a percentage (Ex: 30%) of the rows
bank_df.sample(frac=0.3, axis=0)
MINI CHALLENGE #7:
bank_df.sample(n=2, axis=0)
bank_df.sample(n=2, axis=0)
INTEGER INDEX-BASED ELEMENTS SELECTION FROM A DATAFRAME "iLOC()"
# You can access rows with their numeric index using iloc
bank_df.iloc[9]
# You can access multiple rows with their numeric index using iloc and colon :
# Note that slicing with iloc is exclusive of the stop index, meaning the last element is not included (quite confusing I know!)
bank_df.iloc[2:5]
MINI CHALLENGE #8:
bank_df.iloc[8:10]
bank_df.iloc[[8,9]]
bank_df.iloc[-2:]
BROADCASTING OPERATIONS & SETTING NEW DATAFRAME VALUES
# Alternatively, you can add or subtract as follows
bank_df['Net Worth'] = bank_df['Net Worth'].add(1000)
# Let's convert from USD to CAD using the exchange rate 1 USD = 1.3 CAD
bank_df['Net Worth (CAD)'] = bank_df['Net Worth'].mul(1.3)
bank_df
MINI CHALLENGE #9:
bank_df['Net Worth'] = bank_df['Net Worth'].mul(1.12)
bank_df['Net Worth'].sum()
print('Total Net Worth of all clients = ${}'.format(bank_df['Net Worth'].sum()))
SORTING AND ORDERING
# You can sort the values in the dataframe according to number of years with bank
bank_df.sort_values(by='Years with Bank')
# Set inplace = True to ensure that change has taken place in memory
bank_df.sort_values(by='Years with Bank', inplace=True)
# You can sort the values in a descending order as follows
# descending order -> ascending=False
bank_df.sort_values(by='Years with Bank', ascending=False, inplace=True)
bank_df
MINI CHALLENGE #10:
bank_df.sort_values(by='Net Worth', inplace=True)
PANDAS WITH FUNCTIONS
# Define a function that increases all clients' net worth by a fixed 10% (for simplicity's sake)
def networth_update(balance):
    return balance * 1.1
# You can apply a function to the DataFrame
bank_df['Net Worth'] = bank_df['Net Worth'].apply(networth_update)
MINI CHALLENGE #11:
def update_networth2(value):
    return value * 2 + 100
bank_df['new Networth'] = bank_df['Net Worth'].apply(update_networth2)
bank_df['new Networth'].sum()
PANDAS OPERATIONS/FILTERING
# Pick certain rows that satisfy a certain criteria
df_loyal = bank_df[(bank_df['Years with Bank'] >= 10)]
# Pick certain rows that satisfy 2 or more criteria
mask_1 = bank_df['Years with Bank'] >= 10
mask_2 = bank_df['Net Worth'] >= 50000
df_loyal = bank_df[mask_1 & mask_2]
mask = bank_df['Last Name'].isin(['Steve', 'Mo'])
bank_df[mask]
# values that fall between a given range
bank_df[bank_df['Net Worth'].between(5000, 9000)]
# the tilde symbol ~ negates the boolean mask
# keep=False marks all occurrences of a duplicate, so this keeps only last names that appear exactly once
mask = ~bank_df['Last Name'].duplicated(keep=False)
bank_df[mask]
# Drop duplicates
bank_df.drop_duplicates(subset=['Last Name'], inplace=True)
# We can also filter the Dataframe using the where method as follows
# Note that all rows that don't satisfy this criteria are set to NaN
bank_df = pd.read_csv('bank_client_information.csv')
mask = bank_df['Net Worth'] >= 20000
bank_df.where(mask)
MINI CHALLENGE #12:
bank_df = pd.read_csv('bank_client_information.csv')
df_high_networth = bank_df[(bank_df['Net Worth'] >= 15000)]
df_high_networth['Net Worth'].sum()
FEATURE ENGINEERING AND DEALING WITH MISSING DATA
# first, let's locate rows that have Null values
employee_df.isnull()
# next, let's count the number of Null values in each column
employee_df.isnull().sum()
# We can also indicate which columns we want to drop NaN from
employee_df.dropna(how='any', inplace=True, subset=['MonthlyIncome', 'PercentSalaryHike'])
employee_df
# You can use Fillna to fill a given column with a certain value
employee_df['MonthlyIncome'].fillna(employee_df['MonthlyIncome'].mean(), inplace=True)
MINI CHALLENGE #13:
employee_df = pd.read_csv('Human_Resources.csv')
employee_df.isnull().sum()
employee_df['MonthlyIncome'].median()
employee_df['MonthlyIncome'].fillna(employee_df['MonthlyIncome'].median(), inplace=True)
CHANGE DATATYPES
# Let's convert the hourly rate from int64 to float64
employee_df['HourlyRate'] = employee_df['HourlyRate'].astype('float64')
employee_df.info()
# Since we have a limited number of classes, we can use the category datatype
# check the memory usage with the info method to ensure that the size has been reduced
employee_df['PerformanceRating'] = employee_df['PerformanceRating'].astype('category')
employee_df['RelationshipSatisfaction'] = employee_df['RelationshipSatisfaction'].astype('category')
MINI CHALLENGE #14:
employee_df['BusinessTravel'] = employee_df['BusinessTravel'].astype('category')
employee_df.info()