Machine Learning & Deep Learning practice code.4

AI Engineering Course Log · July 12, 2023

road to AI Engineering

  • read the file and store it as a DataFrame (df)
import pandas as pd

df = pd.read_csv('data_v1.csv')
df
  • EDA (Exploratory Data Analysis)
df.head()
df.tail()
df.info()
df.index
df.columns
df.values
  • check whether null values exist
df.isnull().sum()
  • get summary statistics
df.describe()

<Data Preprocessing>

  • grasp the structure of the data
df.info()
  • drop the customerID column
df.drop('customerID', axis=1, inplace=True)
df.info()
  • change column types
df['TotalCharges']
  • try changing the column type to float
# df['TotalCharges'].astype(float)  # WRONG: fails while the column still contains blank strings
  • search for the blank values with boolean indexing
(df['TotalCharges'] == '') | (df['TotalCharges'] == ' ')
cond = (df['TotalCharges'] == '') | (df['TotalCharges'] == ' ')
df[cond]
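
The check above only matches an empty string or a single space. A minimal sketch of a slightly broader mask, assuming the column holds strings: `str.strip()` removes surrounding whitespace first, so blanks of any length are caught. The `toy` frame is made-up data, not the course file.

```python
import pandas as pd

# Hypothetical toy data standing in for df['TotalCharges'].
toy = pd.DataFrame({'TotalCharges': ['29.85', '', ' ', '   ', '108.15']})

# Strip whitespace, then compare with the empty string:
# this flags '', ' ' and longer runs of spaces alike.
blank = toy['TotalCharges'].str.strip() == ''

print(toy[blank])
```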
  • replace the blank values in df['TotalCharges'] with '0'
df['TotalCharges'] = df['TotalCharges'].replace([' '], ['0'])
  • convert the TotalCharges column to float and confirm that no blanks remain
df['TotalCharges'] = df['TotalCharges'].astype(float)
cond = (df['TotalCharges'] == '') | (df['TotalCharges'] == ' ')
df[cond]
  • check the result
df.info()
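
As an alternative to replace-then-astype, the same cleanup can probably be done in one pass with `pd.to_numeric`. This is only a sketch on made-up data (`toy` is hypothetical); filling with 0 follows the choice made in the notes above.

```python
import pandas as pd

# Hypothetical toy data standing in for the TotalCharges column.
toy = pd.DataFrame({'TotalCharges': ['29.85', ' ', '108.15', '']})

# errors='coerce' turns values that cannot be parsed into NaN
# instead of raising an error.
charges = pd.to_numeric(toy['TotalCharges'], errors='coerce')

# Count how many entries were blank, then fill them with 0.
n_blank = int(charges.isna().sum())
toy['TotalCharges'] = charges.fillna(0.0)

print(n_blank)
print(toy['TotalCharges'].dtype)
```

Counting the NaN values before filling them shows how many rows were affected, which the replace approach does not report.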
  • convert the values of the 'Churn' column to numbers
df['Churn'].value_counts()
  • change the 'Churn' values Yes, No to 1, 0
df['Churn'] = df['Churn'].replace(['Yes', 'No'], [1, 0])
  • check the column's distribution
df['Churn'].value_counts()
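
The Yes/No encoding can also be written with `Series.map` and an explicit dictionary. A minimal sketch on hypothetical data; the `mapping` name is just for illustration. Unlike `replace`, `map` turns any label missing from the dictionary into NaN, so unexpected values show up immediately.

```python
import pandas as pd

# Hypothetical toy data standing in for df['Churn'].
toy = pd.DataFrame({'Churn': ['Yes', 'No', 'No', 'Yes', 'No']})

mapping = {'Yes': 1, 'No': 0}
encoded = toy['Churn'].map(mapping)

# Labels that are not in the mapping become NaN, so check before casting.
assert encoded.notna().all(), 'unexpected label in Churn'

toy['Churn'] = encoded.astype(int)
counts = toy['Churn'].value_counts()
print(counts)
```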

  • check whether null data exists
df.isnull().sum()
  • drop columns that have many null values, then drop the remaining rows that contain nulls
df.drop('DeviceProtection', axis=1, inplace=True)
df.dropna(inplace=True)
  • check whether any nulls remain
df.isnull().sum()
df.info()
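
Instead of picking the column to drop by eye, the null ratio per column can be computed and compared with a cutoff. This is a sketch under assumptions: the `toy` data and the 0.5 `threshold` are made up, not values from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; DeviceProtection is mostly missing.
toy = pd.DataFrame({
    'tenure': [1, 34, 2, 45],
    'DeviceProtection': [np.nan, np.nan, np.nan, 'Yes'],
    'MonthlyCharges': [29.85, np.nan, 53.85, 42.30],
})

# Fraction of null values per column (the mean of a boolean mask).
null_ratio = toy.isnull().mean()

# Assumed cutoff: drop columns where more than half the values are null.
threshold = 0.5
sparse_cols = null_ratio[null_ratio > threshold].index.tolist()

# Drop the sparse columns first, then the rows that still contain nulls.
cleaned = toy.drop(columns=sparse_cols).dropna()

print(sparse_cols)
print(cleaned.shape)
```

Dropping the sparse columns before calling `dropna()` matters: in the reverse order, almost every row would be removed because of the sparse column alone.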

<Visualization>

import matplotlib.pyplot as plt
import seaborn as sns  # needed for the sns.* plots further down
%matplotlib inline
df['gender'].value_counts()
df['gender'].value_counts().plot(kind='bar')
  • distribution of the 'Partner' column as a bar chart
df['Partner'].value_counts().plot(kind='bar')
  • draw bar charts for all 'object' columns at once using the select_dtypes() function
df.select_dtypes('O').head(3)
  • select only the object column names
df.select_dtypes('O').columns.values
  • draw a bar chart for each object column, one by one
    Dependents, PhoneService -> imbalanced -> need to be deleted
object_list = df.select_dtypes('object').columns.values

for col in object_list:
    df[col].value_counts().plot(kind='bar')
    plt.title(col)
    plt.show()
  • delete the imbalanced columns
df.drop('PhoneService', axis=1, inplace=True)
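
The imbalance seen in the bar charts can also be checked numerically. A minimal sketch, assuming "imbalanced" means that one category holds most of the rows; the 0.85 `threshold` and the `toy` data are hypothetical, not from the course.

```python
import pandas as pd

# Hypothetical toy data: PhoneService is heavily skewed, Partner is balanced.
toy = pd.DataFrame({
    'PhoneService': ['Yes'] * 9 + ['No'],
    'Partner': ['Yes', 'No'] * 5,
})

# Share of the most frequent category in each column.
top_share = {
    col: toy[col].value_counts(normalize=True).max()
    for col in toy.columns
}

# Assumed cutoff: flag columns where one value covers more than 85% of rows.
threshold = 0.85
imbalanced = [col for col, share in top_share.items() if share > threshold]

print(top_share)
print(imbalanced)
```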
  • visualize the numeric columns (int, float)
df.select_dtypes('number').head(3)
  • check the 'Churn' column
df['Churn'].value_counts()
  • bar chart of the 'Churn' column
df['Churn'].value_counts().plot(kind='bar')
  • same process for the 'SeniorCitizen' column
df['SeniorCitizen'].value_counts()
df['SeniorCitizen'].value_counts().plot(kind='bar')
df.drop('SeniorCitizen', axis=1, inplace=True)
df.info()
  • Histogram
sns.histplot(data=df, x='tenure')
sns.histplot(data=df, x='tenure', hue='Churn')
  • draw it as a density curve (KDE plot)
sns.kdeplot(data=df, x='tenure', hue='Churn')
sns.histplot(data=df, x='TotalCharges')
sns.kdeplot(data=df, x='TotalCharges', hue='Churn')
sns.countplot(data=df, x='MultipleLines', hue='Churn')
  • Heatmap
  • correlation between columns
df[['tenure', 'MonthlyCharges', 'TotalCharges']].corr()
sns.heatmap(df[['tenure', 'MonthlyCharges', 'TotalCharges']].corr(), annot=True)
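
Besides reading the heatmap, the strongly correlated pairs can be listed directly from the correlation matrix. A sketch on made-up numbers; the 0.8 cutoff is an assumption. `TotalCharges` is built here as tenure × MonthlyCharges only so that the toy data contains one strong pair.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'tenure': [1, 12, 24, 48, 72],
    'MonthlyCharges': [70.0, 20.0, 90.0, 50.0, 60.0],
})
toy['TotalCharges'] = toy['tenure'] * toy['MonthlyCharges']

# Pearson correlation matrix, the same table that the heatmap colors.
corr = toy.corr()

# Assumed cutoff: report each pair once when |r| is above 0.8.
threshold = 0.8
strong_pairs = [
    (a, b)
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print(strong_pairs)
```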

  • Boxplot

sns.boxplot(data=df, x='Churn', y='TotalCharges')
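
The numbers behind a boxplot can be reproduced with a groupby. A minimal sketch on hypothetical data: it computes the quartiles per Churn group and the upper whisker limit, assuming the common 1.5 × IQR rule, which is the default `whis` setting in matplotlib and seaborn.

```python
import pandas as pd

# Hypothetical toy data; the value 1000.0 is a deliberate outlier.
toy = pd.DataFrame({
    'Churn': [0, 0, 0, 0, 1, 1, 1, 1],
    'TotalCharges': [100.0, 200.0, 300.0, 400.0, 10.0, 20.0, 30.0, 1000.0],
})

# Quartiles per group: the box edges and the median line.
summary = (
    toy.groupby('Churn')['TotalCharges']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ['q1', 'median', 'q3']

# Interquartile range and the upper limit for the whisker.
summary['iqr'] = summary['q3'] - summary['q1']
summary['upper_fence'] = summary['q3'] + 1.5 * summary['iqr']

# Points above the fence of their own group are drawn as outliers.
fence = toy['Churn'].map(summary['upper_fence'])
outliers = toy[toy['TotalCharges'] > fence]

print(summary)
print(outliers)
```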
  • save the result as a CSV file
df.to_csv('data_v1_save.csv', index=False)
pd.read_csv('data_v1_save.csv').head()
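
A quick way to see why `index=False` is used when saving: without it, the index is written as an extra column and comes back as 'Unnamed: 0'. The sketch below uses an in-memory buffer and made-up data instead of a real file.

```python
import io

import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'tenure': [1, 34],
    'TotalCharges': [29.85, 1889.5],
    'Churn': [0, 1],
})

# Round trip with index=False: the columns come back unchanged.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
pd.testing.assert_frame_equal(restored, toy)

# Round trip with the default index=True: an extra column appears.
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(with_index.columns))
```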
