This part of the blog covers how tabular data can be handled for preprocessing, training, and evaluation.
It is most relevant to the example questions of the AICE Associate course on the official website.
It walks through example code for solving those questions, along with variants drawn from my personal experience.
import pandas as pd
DATA_PATH_json = './data/data.json'
DATA_PATH_csv = './data/data.csv'
df = pd.read_json("data.json")
df = pd.read_json(DATA_PATH_json)
# or
df = pd.read_csv("data.csv")
df = pd.read_csv(DATA_PATH_csv)
import json
def is_nested(json_data):
    """Return True if the JSON object contains nested dicts or lists."""
    if isinstance(json_data, dict):
        return any(isinstance(value, (dict, list)) for value in json_data.values())
    if isinstance(json_data, list):
        return any(isinstance(item, (dict, list)) for item in json_data)
    return False
with open("data.json", "r") as file:
data = json.load(file)
print(is_nested(data))
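If the JSON turns out to be nested, pd.json_normalize() can usually flatten it into a tabular DataFrame. A minimal sketch, assuming a hypothetical list of records with a nested "address" field (adjust the keys to your actual data.json):
import pandas as pd
# Hypothetical nested records, for illustration only
records = [
    {"id": 1, "address": {"city": "Seoul", "zip": "04524"}},
    {"id": 2, "address": {"city": "Busan", "zip": "48058"}},
]
df = pd.json_normalize(records)            # nested keys become "address.city", "address.zip"
df = pd.json_normalize(records, sep="_")   # or "address_city", "address_zip"
print(df.columns.tolist())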
### How to observe the first few rows
print(df.head())
df = df[df["Address1"] != "-"]
# or
df = df.drop(df[df["Address1"] == "-"].index)
# or
df = df.query("Address1 != '-'")
df = df[~df["Address1"].isin(["-", "Unknown", "N/A"])]
df = df[~((df["Address1"] == "-") & (df["City"] == "Seoul"))]
# drop rows where...
# certain column
df = df.dropna(subset=["Column1"])
# certain multiple columns
df = df.dropna(subset=["Column1", "Column2"])
# any columns have null value
df = df.dropna()
# All columns of the row have null values
df = df.dropna(how="all")
# at least threshold% of the row's columns are null
threshold = 0.4  # 40% missing values
df = df[df.isnull().mean(axis=1) < threshold]
# drop columns that all values are missing
df = df.dropna(axis=1, how="all")
# drop columns that have at least one missing value
df = df.dropna(axis=1)
df = df[~df.apply(lambda row: '-' in row.values, axis=1)]
# or
values_to_remove = ["-", "Unknown", "N/A"]
df = df[~df.apply(lambda row: any(val in row.values for val in values_to_remove), axis=1)]
# or
# Drop Rows Where Any Column Contains Certain Values (Using isin())
values_to_remove = ["-", "Unknown", "N/A"]
df = df[~df.isin(values_to_remove).any(axis=1)]
# or
# Drop Rows Where Specific Columns Contain Certain Values
columns_to_check = ["Column1", "Column2"]
values_to_remove = ["-", "Unknown", "N/A"]
df = df[~df[columns_to_check].isin(values_to_remove).any(axis=1)]
# or
# Drop Rows Where ALL Columns Contain Certain Values
values_to_remove = ["-", "Unknown", "N/A"]
df = df[~df.isin(values_to_remove).all(axis=1)]
# or
# Drop Rows Where a Column Has More Than X% of Unwanted Values
threshold = 0.5 # 50% or more columns containing unwanted values
values_to_remove = ["-", "Unknown", "N/A"]
df = df[(df.isin(values_to_remove).sum(axis=1) / df.shape[1]) < threshold]
count = (df["Address1"] == "-").sum()
print("Number of rows with '-' as Address1:", count)
count = df["Address1"].isin(["-", "Unknown", "N/A"]).sum()
print("Number of unwanted Address1 values:", count)
# merge two DataFrames on a key column
df_merged = pd.merge(df1, df2, on="RID", how="inner")
# merge three DataFrames by chaining the merges
df_merged = pd.merge(df1, df2, on="RID", how="inner")
df_merged = pd.merge(df_merged, df3, on="RID", how="inner")
# stack DataFrames vertically (rows appended, index reset)
df_merged = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
# concatenate side by side; rows are aligned by their index
df_merged = pd.concat([df1, df2, df3], axis=1)
# merge on multiple keys
df_merged = pd.merge(df1, df2, on=["RID", "Date"], how="inner")
# merge when the key columns have different names
df_merged = pd.merge(df1, df2, left_on="ID", right_on="RID", how="inner")
# outer join keeps rows from both DataFrames, filling missing values with NaN
df_merged = pd.merge(df1, df2, on="RID", how="outer")
What is One-Hot encoding?
One-Hot encoding converts a categorical variable (a column in a DataFrame) into a set of 0/1 indicator columns, one per category.
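For example, a minimal sketch with a hypothetical "Color" column (recent pandas returns boolean dummies, so cast to int for a strict 0/1 view):
import pandas as pd
df_example = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})
print(pd.get_dummies(df_example, columns=["Color"]).astype(int))
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0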
In Pandas, the object datatype is a flexible type that typically stores strings, but it can also hold mixed types (numbers, lists, or even dictionaries).
import pandas as pd
df = pd.DataFrame({
"A": ["apple", "banana", "cherry"], # Strings
"B": [1, 2, 3], # Integers
"C": [3.14, 2.71, 1.61], # Floats
"D": ["1", "2", "3"] # Numbers stored as strings
})
object_columns = df.select_dtypes(include=["object"]).columns
print("Object columns:", object_columns)
# Object columns: Index(['A', 'D'], dtype='object')
#####################################################################
df_encoded = pd.get_dummies(df, columns=df.select_dtypes(include=["object"]).columns)
#####################################################################
Even though Pandas stores text data as object, you can explicitly convert it to str:
df["A"] = df["A"].astype(str)
Sometimes, object columns contain numeric values as strings:
df["D"] = df["D"].astype(int) # Converts "1", "2", "3" to integers
df_encoded = pd.get_dummies(df, columns=["Column1", "Column2"])
# Encodes only Column1 and Column2, leaving other columns unchanged.
df_encoded = df.copy()
df_encoded = pd.get_dummies(df_encoded, columns=["Category"], prefix="Category", drop_first=False)
# Works on a copy, so the original df (and its "Category" column) stays unchanged;
# in df_encoded the "Category" column is replaced by the one-hot columns.
df_encoded = pd.get_dummies(df, columns=["Category"], drop_first=True)
# Drops the first category in each one-hot encoding. Useful for regression models to prevent multicollinearity (dummy variable trap).
df_encoded = pd.get_dummies(df, columns=["Category"], prefix="Cat")
# Encodes Category but uses "Cat_" as a prefix for the new columns.
Cat_A Cat_B Cat_C
0 1 0 0
1 0 1 0
2 0 0 1
The Dummy Variable Trap is a problem that occurs when using one-hot encoding (pd.get_dummies(), OneHotEncoder()) where one categorical variable is perfectly predicted by the others, causing multicollinearity in regression models.
import pandas as pd
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red", "Blue"]})
df_encoded = pd.get_dummies(df, columns=["Color"])
print(df_encoded)
Color_Blue Color_Green Color_Red
0 0 0 1
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
The Color_Red column is redundant because if Color_Blue = 0 and Color_Green = 0, then Color_Red must be 1.
This creates perfect multicollinearity, meaning one feature can be predicted from the others, leading to unstable regression models.
df_encoded = pd.get_dummies(df, columns=["Color"], drop_first=True)
print(df_encoded)
# or
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded = encoder.fit_transform(df[["Color"]])  # returns a NumPy array
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["Color"]))
Color_Blue Color_Green
0 0 0
1 0 1
2 1 0
3 0 0
4 1 0
🔹 Now, "Color_Red" is removed, preventing redundancy and multicollinearity.
Linear Models (Linear Regression, Logistic Regression, SVM, etc.)
🚨 Yes, it affects them.
Tree-Based Models (Decision Trees, Random Forests, XGBoost, etc.)
✅ No, they don't suffer from multicollinearity.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red", "Blue"]})
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df)
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["Color"]))
print(df_encoded)
Color_Blue Color_Green Color_Red
0 0.0 0.0 1.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded = encoder.fit_transform(df)
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(["Color"]))
print(df_encoded)
Color_Blue Color_Green
0 0.0 0.0
1 0.0 1.0
2 1.0 0.0
3 0.0 0.0
4 1.0 0.0
🔹 Removes Color_Red, preventing the Dummy Variable Trap.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded = encoder.fit_transform(df)
# Simulate new test data
df_test = pd.DataFrame({"Color": ["Red", "Purple"]})
encoded_test = encoder.transform(df_test)
df_test_encoded = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out(["Color"]))
print(df_test_encoded)
Color_Blue Color_Green Color_Red
0 0.0 0.0 1.0
1 0.0 0.0 0.0
🔹 handle_unknown="ignore" ensures "Purple" doesn't break the model—it gets all zeros.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
df = pd.DataFrame({
"Color": ["Red", "Green", "Blue", "Red"],
"Size": ["Small", "Medium", "Large", "Small"]
})
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df)
df_encoded = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(df.columns))
print(df_encoded)
Color_Blue Color_Green Color_Red Size_Large Size_Medium Size_Small
0 0.0 0.0 1.0 0.0 0.0 1.0
1 0.0 1.0 0.0 0.0 1.0 0.0
2 1.0 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 1.0 0.0 0.0 1.0
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red", "Blue"]})
encoder = LabelEncoder() # Initialize encoder
df["Color_encoded"] = encoder.fit_transform(df["Color"]) # Fit and transform
print(df)
Color Color_encoded
0 Red 2
1 Green 1
2 Blue 0
3 Red 2
4 Blue 0
df = pd.DataFrame({
"Color": ["Red", "Green", "Blue", "Red"],
"Size": ["Small", "Medium", "Large", "Small"]
})
encoder = LabelEncoder()
for col in ["Color", "Size"]:
df[col + "_encoded"] = encoder.fit_transform(df[col])
print(df)
Color Size Color_encoded Size_encoded
0 Red Small 2 2
1 Green Medium 1 1
2 Blue Large 0 0
3 Red Small 2 2
🔹 Applies LabelEncoder() to multiple columns using a loop.
df["Color_original"] = encoder.inverse_transform(df["Color_encoded"])
print(df)
Color Color_encoded Color_original
0 Red 2 Red
1 Green 1 Green
2 Blue 0 Blue
3 Red 2 Red
🔹 Recovers the original labels from their encoded values.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Sample dataset: Fruit classification (multi-class problem)
df = pd.DataFrame({
"Weight": [150, 120, 130, 180, 160, 140, 170, 200],
"Size": [7, 6, 6, 8, 7, 6, 7, 9],
"Fruit": ["Apple", "Banana", "Apple", "Cherry", "Banana", "Apple", "Cherry", "Cherry"]
})
print(df)
Weight Size Fruit
0 150 7 Apple
1 120 6 Banana
2 130 6 Apple
3 180 8 Cherry
4 160 7 Banana
5 140 6 Apple
6 170 7 Cherry
7 200 9 Cherry
encoder = LabelEncoder()
# Transform categorical target labels into numerical labels
df["Fruit_encoded"] = encoder.fit_transform(df["Fruit"])
print(df)
Weight Size Fruit Fruit_encoded
0 150 7 Apple 0
1 120 6 Banana 1
2 130 6 Apple 0
3 180 8 Cherry 2
4 160 7 Banana 1
5 140 6 Apple 0
6 170 7 Cherry 2
7 200 9 Cherry 2
"Apple" → 0
"Banana" → 1
"Cherry" → 2
Now, we use Fruit_encoded as the target variable (y) in classification.
# Define features (X) and target (y)
X = df[["Weight", "Size"]]
y = df["Fruit_encoded"]
# Split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# Predict on test data
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Accuracy: 1.00
Once the model makes predictions (y_pred), we can convert the numerical labels back to categorical names.
y_pred_labels = encoder.inverse_transform(y_pred)
print(y_pred_labels)
['Cherry' 'Apple']
from sklearn.model_selection import train_test_split
X = df.drop(columns=["Time_Driving"]) # Features
y = df["Time_Driving"] # Target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
✅ Evaluating on a held-out test set ensures the model is scored on unseen data, which is how overfitting is detected.
Instead of just train-test split, it's common to have a validation set for model tuning.
# First, split into Train (80%) and Test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Further split Train into Train (80%) and Validation (20%)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
- Train: 64%
- Validation: 16%
- Test: 20%
🔹 Explanation
Step 1: Split X and y into X_train (80%) and X_test (20%).
Step 2: Further split X_train into X_train (64%) and X_val (16%) for validation.
🔹 Why use a validation set?
Helps tune hyperparameters before testing on unseen data.
Prevents overfitting to the test set.
Used for early stopping in deep learning models.
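As a sketch of the early-stopping point, this is how a validation set is typically wired into Keras (assuming a compiled Keras model and the X_val/y_val split created above):
from tensorflow import keras
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),  # validation loss decides when to stop
    epochs=100,
    callbacks=[early_stop],
)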
For classification tasks with imbalanced classes, ensure that both training and test sets maintain class proportions.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
🔹 Explanation
stratify=y ensures that the class distribution is preserved in both sets.
Useful when dealing with imbalanced data (e.g., fraud detection, medical diagnosis).
✅ Example Use Case: If y contains 90% class A and 10% class B, stratify=y ensures that the split retains this ratio.
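A quick sanity check of the class proportions before and after the split (sketch):
print(y.value_counts(normalize=True))        # e.g. A: 0.90, B: 0.10
print(y_train.value_counts(normalize=True))  # should stay close to the original ratio
print(y_test.value_counts(normalize=True))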
Instead of splitting once, cross-validation repeatedly trains on different splits of data.
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
🔹 Explanation
n_splits=5: Splits the data into 5 different train-test sets.
shuffle=True: Ensures randomization of data.
The model trains and tests on different subsets, improving generalization.
✅ Best for: Small datasets where a single train-test split may not be representative.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# 🔹 Step 1: Create a Sample Dataset
df = pd.DataFrame({
"Feature1": [2, 3, 5, 7, 9, 10, 11, 13, 15, 18],
"Feature2": [1, 4, 6, 8, 10, 12, 14, 16, 17, 20],
"Target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1] # Binary classification
})
X = df.drop(columns=["Target"]) # Features
y = df["Target"] # Target variable
# 🔹 Step 2: Initialize K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Store results
accuracies = []
# 🔹 Step 3: Perform K-Fold Cross-Validation
for train_index, test_index in kf.split(X):
# Split into training and testing sets
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
# Initialize the RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
accuracies.append(accuracy)
print(f"Fold Accuracy: {accuracy:.4f}")
# 🔹 Step 4: Compute Average Accuracy Across Folds
average_accuracy = np.mean(accuracies)
print(f"\nAverage K-Fold Accuracy: {average_accuracy:.4f}")
Fold Accuracy: 1.0000
Fold Accuracy: 1.0000
Fold Accuracy: 0.5000
Fold Accuracy: 1.0000
Fold Accuracy: 1.0000
Average K-Fold Accuracy: 0.9000
🔹 Explanation
Create Dataset (df) → Includes 2 features and binary target (0 or 1).
Split Data (KFold(n_splits=5)) → Divides the dataset into 5 folds.
Train Model (RandomForestClassifier()) → Trains a new Random Forest model in each fold.
Evaluate Accuracy (accuracy_score()) → Measures model performance in each fold.
Compute Average Accuracy → Gets the overall model performance.
🚀 Use K-Fold CV to ensure your model generalizes well across different subsets of data!
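For a more compact version of the same idea, scikit-learn's cross_val_score runs the folds for you; a sketch using the same X, y, and classifier as above:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                         X, y, cv=5, scoring="accuracy")
print(scores, scores.mean())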
If data is normally distributed → Use StandardScaler.
If data has many outliers → Use RobustScaler.
If data needs to be bounded (0-1) → Use MinMaxScaler.
import numpy as np
import pandas as pd
# Sample dataset with outliers
data = {
"Feature1": [10, 200, 30, 400, 50, 600, 20, 700, 40, 1000], # Outliers present
"Feature2": [5, 500, 15, 700, 25, 800, 35, 900, 45, 1000], # Outliers present
}
df = pd.DataFrame(data)
# Display original statistics
print("Before Scaling:\n", df.describe())
Shows mean, standard deviation, min, max, and quartiles.
Helps identify outliers that may influence scaling.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Initialize scalers
scalers = {
"StandardScaler": StandardScaler(),
"MinMaxScaler": MinMaxScaler(),
"RobustScaler": RobustScaler()
}
# Apply each scaler and store results
scaled_dfs = {}
for scaler_name, scaler in scalers.items():
scaled_dfs[scaler_name] = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(f"\nAfter {scaler_name}:\n", scaled_dfs[scaler_name].describe())
Compare min, max, and quartiles to see how scaling affects values.
Check if outliers still dominate (e.g., StandardScaler is affected by extreme values).
To prevent data leakage, fit RobustScaler only on the training data and apply it to the test data.
from sklearn.model_selection import train_test_split
# Split data
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)
# Fit only on training data
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same scaler to test data
X_test_scaled = scaler.transform(X_test)
✅ Why?
If you fit the scaler on the full dataset, it may leak information from the test set.
Always fit on X_train and transform both X_train & X_test.
✅ Good for normally distributed data, but sensitive to outliers.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
✅ Scales data between 0 and 1 but is affected by outliers.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df)
✅ Best for sparse data: each feature is divided by its maximum absolute value, so zero entries and the sign of the data are preserved.
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
df_scaled = scaler.fit_transform(df)

To use Logistic Regression, it first needs to be imported.
C: regularization strength (in scikit-learn, C is the inverse: smaller C means stronger regularization)
max_iter: maximum number of iterations
# import
from sklearn.linear_model import LogisticRegression
# create the model
lg = LogisticRegression(C=1.0, max_iter=1000)
# train the model
lg.fit(X_train, y_train)
# evaluate the model
lg.score(X_test, y_test)
DecisionTreeClassifier / DecisionTreeRegressor
To use a Decision Tree, it first needs to be imported.
max_depth: maximum depth of the tree
random_state: random seed
# import
from sklearn.tree import DecisionTreeClassifier  # classification
from sklearn.tree import DecisionTreeRegressor   # regression
# create the model
dt = DecisionTreeClassifier(max_depth=5, random_state=2023)
# train the model
dt.fit(X_train, y_train)
# evaluate the model
dt.score(X_test, y_test)
RandomForestClassifier / RandomForestRegressor
To use a Random Forest, it first needs to be imported.
n_estimators: number of trees
random_state: random seed
# import
from sklearn.ensemble import RandomForestClassifier  # classification
from sklearn.ensemble import RandomForestRegressor   # regression
# create the model
rf = RandomForestClassifier(n_estimators=100, random_state=2023)
# train the model
rf.fit(X_train, y_train)
# evaluate the model
rf.score(X_test, y_test)
XGBClassifier / XGBRegressor
XGBoost needs to be installed and imported before use.
n_estimators: number of trees
# install
!pip install xgboost
# import
from xgboost import XGBClassifier  # classification
from xgboost import XGBRegressor   # regression
# create the model
xgb = XGBClassifier(n_estimators=100)
# train the model
xgb.fit(X_train, y_train)
# evaluate the model
xgb.score(X_test, y_test)
LGBMClassifier / LGBMRegressor
LightGBM needs to be installed and imported before use.
n_estimators: number of trees
# install
!pip install lightgbm
# import
from lightgbm import LGBMClassifier  # classification
from lightgbm import LGBMRegressor   # regression
# create the model
lgbm = LGBMClassifier(n_estimators=100)
# train the model
lgbm.fit(X_train, y_train)
# evaluate the model
lgbm.score(X_test, y_test)
predict()
# predict with a trained model
model.predict(X_test)
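Most of the classifiers above also expose predict_proba() for class probabilities, which is useful for custom thresholds or ROC curves. A sketch, assuming a fitted classifier named model:
# predicted probability per class (columns follow model.classes_)
proba = model.predict_proba(X_test)
print(proba[:5])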
Regression models: MSE, RMSE, MAE, ...
Classification models: Accuracy, Precision, Recall, F1 Score, ...
# import
from sklearn.metrics import *  # loads everything at once (explicit imports are preferred)
# classification metrics
from sklearn.metrics import accuracy_score       # accuracy
from sklearn.metrics import precision_score      # precision
from sklearn.metrics import recall_score         # recall
from sklearn.metrics import f1_score             # F1 score
from sklearn.metrics import confusion_matrix     # confusion matrix
from sklearn.metrics import classification_report  # precision, recall, and F1 in one report
# regression metrics
from sklearn.metrics import r2_score             # R² (coefficient of determination)
from sklearn.metrics import mean_squared_error   # MSE
from sklearn.metrics import mean_absolute_error  # MAE
# Example usage for classification
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro') # Use 'macro' for multi-class
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
cf_matrix = confusion_matrix(y_test, y_pred)
# Print results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"Confusion Matrix:\n{cf_matrix}")
✅ Explanation:
average='macro' → Used for multi-class classification.
confusion_matrix() → Returns a matrix of actual vs. predicted values.
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(cf_matrix, annot=True, fmt='d', cmap='Blues') # Visualize confusion matrix
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()
✅ Why?
Helps identify misclassified examples in classification tasks.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
# Example usage for regression
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
# Print results
print(f"R² Score: {r2:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
✅ Explanation:
r2_score() → Measures how well the model explains variance in data.
mean_squared_error() → Penalizes large errors more heavily (sensitive to outliers).
mean_absolute_error() → Measures the average absolute error (easier to interpret).
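RMSE (listed earlier) has no separate import; it is simply the square root of MSE. A sketch, reusing mse from above:
import numpy as np
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # Root Mean Squared Error
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")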
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
dt = DecisionTreeRegressor(max_depth=5, min_samples_split=3, random_state=120)
dt.fit(X_train, y_train)
rf = RandomForestRegressor(max_depth=5, min_samples_split=3, random_state=120, n_estimators=100)
rf.fit(X_train, y_train)
from sklearn.metrics import mean_absolute_error
y_pred_dt = dt.predict(X_test)
y_pred_rf = rf.predict(X_test)
mae_dt = mean_absolute_error(y_test, y_pred_dt)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
print("Decision Tree MAE:", mae_dt)
print("Random Forest MAE:", mae_rf)
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Convert Data to Tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)  # .values assumes X_train/X_test are DataFrames
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)  # float targets for BCELoss
X_valid_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_valid_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1)
# Create DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
valid_dataset = TensorDataset(X_valid_tensor, y_valid_tensor)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=16, shuffle=False)
# Define Binary Classification Neural Network
class BinaryClassifier(nn.Module):
def __init__(self, input_size):
super(BinaryClassifier, self).__init__()
self.model = nn.Sequential(
nn.Linear(input_size, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(32, 1), # Single output neuron for binary classification
nn.Sigmoid() # Sigmoid activation to output probabilities (0 to 1)
)
def forward(self, x):
return self.model(x)
# Initialize Model, Loss Function, and Optimizer
input_size = X_train.shape[1]
model = BinaryClassifier(input_size)
criterion = nn.BCELoss() # Binary Cross-Entropy Loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training Loop
epochs = 30
history = {"train_loss": [], "valid_loss": []}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
for epoch in range(epochs):
model.train()
train_loss = 0
for X_batch, y_batch in train_loader:
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
y_pred = model(X_batch)
loss = criterion(y_pred, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_loss += loss.item()
train_loss /= len(train_loader)
history["train_loss"].append(train_loss)
model.eval()
valid_loss = 0
with torch.no_grad():
for X_batch, y_batch in valid_loader:
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
y_pred = model(X_batch)
loss = criterion(y_pred, y_batch)
valid_loss += loss.item()
valid_loss /= len(valid_loader)
history["valid_loss"].append(valid_loss)
print(f"Epoch {epoch+1}/{epochs} - Train Loss: {train_loss:.4f}, Validation Loss: {valid_loss:.4f}")
print("Training Complete!")
# how to use the model for evaluation
model.eval()
with torch.no_grad():
y_pred = model(X_valid_tensor.to(device))
# Convert probabilities to binary labels (0 or 1)
y_pred_binary = (y_pred.cpu().numpy() > 0.5).astype(int)
print(y_pred_binary)
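From these binary predictions the usual classification metrics follow directly; a sketch, assuming y_test holds the true 0/1 labels from the split above:
from sklearn.metrics import accuracy_score
print("Validation accuracy:", accuracy_score(y_test, y_pred_binary.ravel()))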
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
import numpy as np
# Split Data (Make sure y contains binary labels: 0 or 1)
X_train_np, X_valid_np, y_train_np, y_valid_np = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert Data to TensorFlow Tensors
X_train_tf = tf.convert_to_tensor(X_train_np, dtype=tf.float32)
y_train_tf = tf.convert_to_tensor(y_train_np.values.reshape(-1, 1), dtype=tf.float32)
X_valid_tf = tf.convert_to_tensor(X_valid_np, dtype=tf.float32)
y_valid_tf = tf.convert_to_tensor(y_valid_np.values.reshape(-1, 1), dtype=tf.float32)
# Create TensorFlow Dataset for Batching
batch_size = 16
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_tf, y_train_tf)).shuffle(1024).batch(batch_size)
valid_dataset = tf.data.Dataset.from_tensor_slices((X_valid_tf, y_valid_tf)).batch(batch_size)
# Define Binary Classification Neural Network
model = keras.Sequential([
layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
layers.Dense(32, activation="relu"),
layers.Dropout(0.2),
layers.Dense(1, activation="sigmoid") # Sigmoid activation for binary classification
])
# Compile Model with Binary Cross-Entropy Loss
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss="binary_crossentropy", # Correct loss function for binary classification
metrics=["accuracy"] # Track accuracy during training
)
# Train the Model
epochs = 30
history = model.fit(
train_dataset,
validation_data=valid_dataset,
epochs=epochs,
verbose=1
)
print("Training Complete!")
#✅ Making Predictions and Converting to Binary Labels
# Predict on validation set
y_pred_probs = model.predict(X_valid_tf)
# Convert probabilities to binary labels (0 or 1)
y_pred_binary = (y_pred_probs > 0.5).astype(int)
print(y_pred_binary[:10]) # Show first 10 predictions
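Keras can also report validation loss and accuracy in one call on the batched dataset built above (sketch):
val_loss, val_acc = model.evaluate(valid_dataset, verbose=0)
print(f"Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_acc:.4f}")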
import matplotlib.pyplot as plt
# Note: this assumes the PyTorch-style `history` dict from the training loops above.
# For a Keras History object (returned by model.fit), use history.history["loss"] and history.history["val_loss"].
epochs = range(1, len(history["train_loss"]) + 1)
plt.plot(epochs, history["train_loss"], label="Train Loss", marker="o")
plt.plot(epochs, history["valid_loss"], label="Validation Loss", marker="s")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss")
plt.legend()
plt.grid(True)
plt.show()
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score  # used to report validation accuracy below
# Sample dataset with categorical labels
X_train_np, X_valid_np, y_train_np, y_valid_np = train_test_split(X, y, test_size=0.2, random_state=42)
# Encode labels (if they are categorical)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_np)
y_valid_encoded = label_encoder.transform(y_valid_np)
# Convert Data to Tensors
X_train_tensor = torch.tensor(X_train_np.values, dtype=torch.float32)  # .values assumes a DataFrame
y_train_tensor = torch.tensor(y_train_encoded, dtype=torch.long)  # long type for classification targets
X_valid_tensor = torch.tensor(X_valid_np.values, dtype=torch.float32)
y_valid_tensor = torch.tensor(y_valid_encoded, dtype=torch.long)
# Create DataLoaders
batch_size = 16
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
valid_dataset = TensorDataset(X_valid_tensor, y_valid_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)
# Define Multi-Class Classification Model
class NeuralNet(nn.Module):
def __init__(self, input_size, num_classes):
super(NeuralNet, self).__init__()
self.model = nn.Sequential(
nn.Linear(input_size, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(32, num_classes) # No softmax; handled by CrossEntropyLoss
)
def forward(self, x):
return self.model(x) # Logits for CrossEntropyLoss
# Initialize Model, Loss Function, and Optimizer
num_classes = len(label_encoder.classes_)
input_size = X_train.shape[1]
model = NeuralNet(input_size, num_classes)
criterion = nn.CrossEntropyLoss() # Multi-class classification loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training Loop
epochs = 30
history = {"train_loss": [], "valid_loss": []}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
for epoch in range(epochs):
model.train()
train_loss = 0
for X_batch, y_batch in train_loader:
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
y_pred = model(X_batch)
loss = criterion(y_pred, y_batch) # No need for softmax
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_loss += loss.item()
train_loss /= len(train_loader)
history["train_loss"].append(train_loss)
# Validation
model.eval()
valid_loss = 0
all_preds = []
all_labels = []
with torch.no_grad():
for X_batch, y_batch in valid_loader:
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
y_pred = model(X_batch)
loss = criterion(y_pred, y_batch)
valid_loss += loss.item()
# Convert logits to class predictions
predictions = torch.argmax(y_pred, dim=1).cpu().numpy()
all_preds.extend(predictions)
all_labels.extend(y_batch.cpu().numpy())
valid_loss /= len(valid_loader)
history["valid_loss"].append(valid_loss)
# Compute accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f"Epoch {epoch+1}/{epochs} - Train Loss: {train_loss:.4f}, Validation Loss: {valid_loss:.4f}, Accuracy: {accuracy:.4f}")
print("Training Complete!")
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
# Split Data (Ensure y contains categorical labels)
X_train_np, X_valid_np, y_train_np, y_valid_np = train_test_split(X, y, test_size=0.2, random_state=42)
# Encode labels (if they are categorical)
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_np)
y_valid_encoded = label_encoder.transform(y_valid_np)
# Convert Data to Tensors
X_train_tf = tf.convert_to_tensor(X_train_np, dtype=tf.float32)
y_train_tf = tf.convert_to_tensor(y_train_encoded, dtype=tf.int32)
X_valid_tf = tf.convert_to_tensor(X_valid_np, dtype=tf.float32)
y_valid_tf = tf.convert_to_tensor(y_valid_encoded, dtype=tf.int32)
# Create TensorFlow Dataset for Batching
batch_size = 16
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_tf, y_train_tf)).shuffle(1024).batch(batch_size)
valid_dataset = tf.data.Dataset.from_tensor_slices((X_valid_tf, y_valid_tf)).batch(batch_size)
# Define Multi-Class Classification Model
num_classes = len(label_encoder.classes_)
model = keras.Sequential([
layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
layers.Dense(32, activation="relu"),
layers.Dropout(0.2),
layers.Dense(num_classes, activation="softmax") # Softmax for multi-class classification
])
# Compile Model
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss="sparse_categorical_crossentropy", # Correct loss function for multi-class classification
metrics=["accuracy"] # Track accuracy during training
)
# Train the Model
epochs = 30
history = model.fit(
train_dataset,
validation_data=valid_dataset,
epochs=epochs,
verbose=1
)
print("Training Complete!")
# ✅ Making Predictions on New Data
new_data = np.array([[0.5, 1.2, -0.8, 3.4]]) # Example input
# Get predicted class
y_pred_probs = model.predict(new_data)
y_pred_class = np.argmax(y_pred_probs, axis=1) # Convert probabilities to class labels
print(f"Predicted Class: {label_encoder.inverse_transform(y_pred_class)}") # Decode back to original label
import matplotlib.pyplot as plt
# Note: this assumes the PyTorch-style `history` dict from the training loops above.
# For the Keras History object returned by model.fit, use history.history["loss"] and history.history["val_loss"].
epochs = range(1, len(history["train_loss"]) + 1)
plt.plot(epochs, history["train_loss"], label="Train Loss", marker="o")
plt.plot(epochs, history["valid_loss"], label="Validation Loss", marker="s")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss")
plt.legend()
plt.grid(True)
plt.show()
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import r2_score
# Generate Sample Data (or load your dataset)
X_train_np, X_valid_np, y_train_np, y_valid_np = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert Data to Tensors
X_train_tensor = torch.tensor(X_train_np.values, dtype=torch.float32)  # .values assumes a DataFrame
y_train_tensor = torch.tensor(y_train_np.values.reshape(-1, 1), dtype=torch.float32)
X_valid_tensor = torch.tensor(X_valid_np.values, dtype=torch.float32)
y_valid_tensor = torch.tensor(y_valid_np.values.reshape(-1, 1), dtype=torch.float32)
# Create DataLoaders
batch_size = 16
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
valid_dataset = TensorDataset(X_valid_tensor, y_valid_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)
# Define Regression Model
class RegressionModel(nn.Module):
def __init__(self, input_size):
super(RegressionModel, self).__init__()
self.model = nn.Sequential(
nn.Linear(input_size, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(32, 1) # Output layer for regression
)
def forward(self, x):
return self.model(x)
# Initialize Model, Loss Function, and Optimizer
input_size = X_train.shape[1]
model = RegressionModel(input_size)
criterion = nn.MSELoss() # Mean Squared Error for regression
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training Loop
epochs = 30
history = {"train_loss": [], "valid_loss": []}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
for epoch in range(epochs):
model.train()
train_loss = 0
for X_batch, y_batch in train_loader:
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
y_pred = model(X_batch)
loss = criterion(y_pred, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
train_loss += loss.item()
train_loss /= len(train_loader)
history["train_loss"].append(train_loss)
# Validation loop with R² Score
model.eval()
valid_loss = 0
all_preds = []
all_labels = []
with torch.no_grad():
for X_batch, y_batch in valid_loader:
X_batch, y_batch = X_batch.to(device), y_batch.to(device)
y_pred = model(X_batch)
loss = criterion(y_pred, y_batch)
valid_loss += loss.item()
all_preds.extend(y_pred.cpu().numpy().flatten())
all_labels.extend(y_batch.cpu().numpy().flatten())
valid_loss /= len(valid_loader)
r2 = r2_score(all_labels, all_preds) # Calculate R² score
print(f"Validation Loss: {valid_loss:.4f}, R² Score: {r2:.4f}")
print("Training Complete!")
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
import numpy as np
# Generate Sample Data (or load your dataset)
X_train_np, X_valid_np, y_train_np, y_valid_np = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert Data to Tensors
X_train_tf = tf.convert_to_tensor(X_train_np, dtype=tf.float32)
y_train_tf = tf.convert_to_tensor(y_train_np.values.reshape(-1, 1), dtype=tf.float32)
X_valid_tf = tf.convert_to_tensor(X_valid_np, dtype=tf.float32)
y_valid_tf = tf.convert_to_tensor(y_valid_np.values.reshape(-1, 1), dtype=tf.float32)
# Create TensorFlow Dataset for Batching
batch_size = 16
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_tf, y_train_tf)).shuffle(1024).batch(batch_size)
valid_dataset = tf.data.Dataset.from_tensor_slices((X_valid_tf, y_valid_tf)).batch(batch_size)
# Define Regression Model
model = keras.Sequential([
layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
layers.Dense(32, activation="relu"),
layers.Dropout(0.2),
layers.Dense(1) # Single neuron output for regression
])
# Compile Model
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss="mse", # Mean Squared Error for regression
metrics=["mae"] # Mean Absolute Error
)
# Train the Model
epochs = 30
history = model.fit(
train_dataset,
validation_data=valid_dataset,
epochs=epochs,
verbose=1
)
print("Training Complete!")


🔹 How to Save a Dictionary as a CSV File
A dictionary can be converted into a Pandas DataFrame and then saved as a CSV file.
import pandas as pd
# Example dictionary
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["New York", "San Francisco", "Los Angeles"]
}
# Convert dictionary to DataFrame
df = pd.DataFrame(data)
# Save DataFrame as CSV
df.to_csv("output.csv", index=False) # index=False prevents saving row numbers
🔹 Single Line Code to Save a Matplotlib Plot as PNG
plt.savefig("plot.png", dpi=300) # Save current plot as 'plot.png' with high resolution
Here are one-liners to save and load models for both PyTorch and TensorFlow.
🔹 PyTorch
# Save model
torch.save(model.state_dict(), "model.pth")
# Load model
model.load_state_dict(torch.load("model.pth"))
model.eval() # Set to evaluation mode
🔹 TensorFlow (Keras)
# Save model
model.save("model.h5")
# Load model
model = keras.models.load_model("model.h5")