Regression
A technique for modeling the relationship between one or more independent variables (X) and a single dependent variable (y)
RSS and Gradient Descent
- RSS (Residual Sum of Squares) is obtained by squaring the error of each data point ($Error_i$) and summing the results
$$RSS = \sum_{i=1}^{n} (Error_i)^2$$
- RSS can be expressed as a function of the variables $w_0$ and $w_1$, and the key objective of machine learning-based regression is to find, through training, the regression coefficients $w_0$ and $w_1$ that minimize this RSS (the form below is averaged over the N data points)
$$RSS(w_0, w_1) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - (w_0 + w_1 x_i)\right)^2$$
- In regression, the Residual Sum of Squares (RSS) is referred to as the cost, and the RSS written as a function of the coefficients w is called the Cost Function (also known as the Loss Function)
- The ultimate goal is to find the coefficients at which the cost function reaches its minimum, that is, the point where the error it returns no longer decreases
Gradient Descent
- Gradient Descent is a method that searches for the W parameters minimizing the error value by updating them "gradually" through repeated calculations, each step moving in the direction opposite to the gradient of the cost.

- If the error value no longer decreases, the error value is determined as the minimum cost and the W parameter at that time is returned as the optimal parameter
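
The loop described above can be sketched for the two-coefficient cost RSS(w0, w1). This is a minimal illustration rather than a reference implementation: the learning rate `lr`, the stopping tolerance `tol`, and the toy data are all assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(x, y, lr=0.05, n_iters=2000, tol=1e-10):
    """Fit y ~ w0 + w1 * x by gradient descent on the averaged squared error."""
    w0, w1 = 0.0, 0.0
    prev_cost = float("inf")
    for _ in range(n_iters):
        error = y - (w0 + w1 * x)       # error of each data point
        cost = np.mean(error ** 2)      # RSS(w0, w1) with the 1/N factor
        if prev_cost - cost < tol:      # cost no longer decreases: stop
            break
        prev_cost = cost
        # Partial derivatives of the cost with respect to w0 and w1
        grad_w0 = -2.0 * np.mean(error)
        grad_w1 = -2.0 * np.mean(error * x)
        # Step against the gradient
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1, cost

# Hypothetical toy data generated from y = 4 + 3x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 4 + 3 * x + rng.normal(0, 0.1, size=100)

w0, w1, cost = gradient_descent(x, y)
print(f"w0={w0:.2f}, w1={w1:.2f}, cost={cost:.4f}")
```

If `lr` is too large the updates can overshoot and the cost grows instead of shrinking, so in practice the learning rate is itself something to tune.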
Linear Regression
A method for modeling the linear relationship between the independent variable X and the dependent variable Y
Simple Linear Regression
- Explains the relationship between a single independent variable X and the dependent variable Y
$$Y = \beta_0 + \beta_1 X + \epsilon$$
- Here, $\beta_0$ represents the Intercept, $\beta_1$ represents the Slope, and $\epsilon$ represents the Error
Multiple Linear Regression
- Explains the relationship between multiple independent variables $X_1, X_2, \ldots, X_p$ and the dependent variable Y
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$
- Here, $\beta_0$ represents the Intercept, $\beta_1, \beta_2, \ldots, \beta_p$ represent the slopes for each independent variable, and $\epsilon$ represents the Error
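
As a small illustration of the multiple regression equation, the coefficients can be estimated by ordinary least squares. The sketch below uses NumPy's `np.linalg.lstsq` on synthetic data; the true coefficients (1.5, 2.0, -0.5) and the noise level are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 200

# Two independent variables X1 and X2 (synthetic)
X = rng.normal(size=(n_samples, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, size=n_samples)

# Prepend a column of ones so the first coefficient plays the role of beta_0
X_design = np.column_stack([np.ones(n_samples), X])

# Least-squares solution: the beta that minimizes the sum of squared errors
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept (beta_0):", beta[0])
print("slopes (beta_1, beta_2):", beta[1:])
```

The recovered values should land close to the coefficients used to generate the data, with the gap shrinking as the noise decreases or the sample grows.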
Coefficient of Determination ($R^2$)
- One of the key metrics for evaluating the performance of a linear regression model
- Represents the proportion of the variance in the dependent variable that is explained by the independent variables
$$R^2 = 1 - \frac{RSS}{TSS}$$
- Here, RSS is the Residual Sum of Squares, and TSS is the Total Sum of Squares
Bias-Variance Tradeoff
Balancing Bias and Variance for optimal performance
Bias
- The error introduced by the model's simplifying assumptions, i.e. how far its predictions are, on average, from the patterns in the actual data
- A model with high bias fails to capture the complex patterns in the data and provides only a simple approximation -> Underfitting
Variance
- Indicates how sensitively the model reacts to the training data
- A model with high variance is heavily influenced by small changes in the data, making it overly complex and even learning the noise in the data -> Overfitting
Bias-Variance Tradeoff
- In general, when bias is high, variance tends to be low, and when bias is low, variance is often high
- On the left side of the graph (where model complexity is low), the model does not capture the training data well -> High bias and low variance, also known as underfitting
- In the middle of the graph, the model finds the optimal balance between bias and variance
- On the right side of the graph, the model over-interprets the training data -> Bias is low, but variance is very high, indicating overfitting

- Manage this tradeoff through Cross-Validation, Regularization, and Hyperparameter Tuning
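
One way to see the tradeoff in numbers is to fit polynomials of increasing degree to the same noisy data and compare training and test error. The sketch below is only an illustration: the sine-shaped target, the noise level, and the degrees 1, 4, and 9 are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples of a sine curve (hypothetical ground truth)."""
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 4, 9):
    coefs = np.polyfit(x_train, y_train, deg=degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The straight line (degree 1) cannot follow the curve, so both errors stay high, which is the underfitting case. Higher degrees keep lowering the training error, while the test error typically stops improving and can start to rise again once the model begins fitting noise.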
Regularized Linear Regression
A linear regression technique that introduces regularization to prevent overfitting of the model
Ridge Regression
- A linear regression model with L2 regularization -> Adds a penalty term to the cost function that is proportional to the sum of the squared regression coefficients
- Cost Function
$$J(\beta) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
- Here, $\lambda$ is a hyperparameter that controls the strength of regularization. The larger $\lambda$ gets, the greater the penalty, making the model simpler. $\beta_j$ represents the regression coefficient of the j-th independent variable
- Characteristics
  - Brings all regression coefficients closer to 0 but doesn't make them exactly 0
  - Useful for solving multicollinearity problems -> When there is a strong correlation between independent variables, Ridge Regression improves model stability
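
The shrinking effect of the L2 penalty can be checked with the closed-form ridge solution $(X^T X + \lambda I)^{-1} X^T y$. The sketch below assumes centered data so that the intercept can be left out of the penalty; the helper name `ridge_fit` and the synthetic, nearly collinear features are made up for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients for centered data (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 1] + rng.normal(0, 0.01, size=100)   # near-duplicate feature: multicollinearity
y = 2 * X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(0, 0.1, size=100)

# Center features and target so the intercept drops out
Xc = X - X.mean(axis=0)
yc = y - y.mean()

coef_norms = []
for lam in (0.0, 1.0, 100.0):
    beta = ridge_fit(Xc, yc, lam)
    coef_norms.append(np.linalg.norm(beta))
    print(f"lambda={lam}: beta={np.round(beta, 3)}, norm={coef_norms[-1]:.3f}")
```

With `lam=0.0` the two nearly identical features can receive unstable, offsetting coefficients. As `lam` grows the coefficients shrink toward 0 and become more even, but none of them becomes exactly 0.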
Lasso Regression
- A linear regression model with L1 regularization -> Adds a penalty based on the sum of the absolute values of the regression coefficients
- Cost Function
$$J(\beta) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
- Here, $\lambda$ is a hyperparameter that controls the strength of regularization, and $|\beta_j|$ represents the absolute value of each regression coefficient
- Characteristics
  - Lasso Regression allows the coefficients to fully shrink to 0 through the $\sum_{j=1}^{p} |\beta_j|$ term
  - Feature Selection: Automatically eliminates unnecessary variables, enabling the creation of a more streamlined model
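
The feature-selection behavior can be observed with scikit-learn's `Lasso`. This is a sketch on synthetic data in which only the first two of five features influence the target; the data and `alpha=0.5` are assumptions for the example. Note that scikit-learn's `alpha` plays the role of $\lambda$, but its objective scales the squared-error term by 1/(2n), so the numeric values are not directly interchangeable with the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 1 matter; features 2 to 4 are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.5)   # alpha is the regularization strength
lasso.fit(X, y)

print("coefficients:", np.round(lasso.coef_, 3))
print("selected features:", np.flatnonzero(lasso.coef_))
```

The coefficients of the irrelevant features come out as exactly 0, while the two informative ones survive in shrunken form.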
Elastic Net Regression
- A method that combines Ridge Regression and Lasso Regression
- Cost Function
$$J(\beta) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2$$
- Here, $\lambda_1$ is the hyperparameter controlling the strength of L1 regularization, and $\lambda_2$ is the hyperparameter controlling the strength of L2 regularization
- Characteristics
  - Useful when the data has high-dimensional features or when multicollinearity is significant
  - Enables feature selection through L1 regularization while ensuring model stability with L2 regularization
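
In scikit-learn, the two penalties are not set as separate $\lambda_1$ and $\lambda_2$ values. `ElasticNet` takes an overall strength `alpha` and a mixing ratio `l1_ratio`, so that the L1 weight is `alpha * l1_ratio` and the L2 weight is `0.5 * alpha * (1 - l1_ratio)`, with the squared-error term scaled by 1/(2n). The sketch below uses synthetic data with two nearly identical informative features; the data and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + rng.normal(0, 0.05, size=200)   # feature 1 is almost a copy of feature 0
y = 2 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

# alpha: overall penalty strength, l1_ratio: share of the penalty given to L1
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)

print("coefficients:", np.round(enet.coef_, 3))
```

The L1 part drives the coefficients of the four irrelevant features to 0, and the L2 part tends to spread the weight across the two correlated features instead of picking one of them arbitrarily.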
The role of regularization strength $\lambda$
- $\lambda$ is a hyperparameter that controls the strength of regularization, with larger values applying stronger regularization
- $\lambda = 0$: No regularization is applied, making it equivalent to standard linear regression
- If $\lambda$ is too large, the model can become overly simplified, leading to underfitting
- Finding the right $\lambda$ value helps optimize the bias-variance tradeoff of the model
Logistic Regression
A regression technique used to solve Binary Classification problems. It is typically used when the dependent variable is categorical, mainly to predict the probability that a specific event occurs
Logistic Function
- If data is predicted using a simple straight line as in linear regression, the predicted values can fall below 0 or exceed 1, which doesn't fit a categorical target (0 or 1)
- Therefore, logistic regression uses a logistic function that converts predicted values into values between 0 and 1
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
- Here, $h_\theta(x)$ represents the output probability for input x (between 0 and 1), and $\theta^T x$ is the linear combination of the independent variables and regression coefficients (the predicted value in linear regression). e is Euler's number (approximately 2.718)
- As $\theta^T x$ becomes more negative the output approaches 0, as it becomes more positive the output approaches 1, and at $\theta^T x = 0$ the output is exactly 0.5 -> The logistic function converts any real number into a probability between 0 and 1
- In logistic regression, the output value can be interpreted as the probability of an event occurring
- In other words, $h_\theta(x)$ represents the probability that the dependent variable y=1, while $1 - h_\theta(x)$ represents the probability that y=0
$$P(y=1 \mid x; \theta) = h_\theta(x)$$
$$P(y=0 \mid x; \theta) = 1 - h_\theta(x)$$
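
A minimal sketch of the logistic function and the two class probabilities, using only the standard library. The coefficient vector `theta` and the inputs are hypothetical, and the first input component is fixed to 1 so that `theta[0]` acts as the intercept.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """Return (P(y=0 | x), P(y=1 | x)) for coefficients theta and input x."""
    z = sum(t * xi for t, xi in zip(theta, x))   # linear combination theta^T x
    p1 = sigmoid(z)
    return 1.0 - p1, p1

theta = [-1.0, 2.0]   # hypothetical coefficients: intercept and one slope
for x1 in (-3.0, 0.5, 3.0):
    p0, p1 = predict_proba(theta, [1.0, x1])
    print(f"x1={x1}: P(y=0)={p0:.3f}, P(y=1)={p1:.3f}")
```

Whatever the value of the linear combination, the two probabilities stay within 0 and 1 and always sum to 1.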
Decision Boundary
- In Logistic Regression, if the predicted probability is 0.5 or greater, the sample is classified as y=1, and if it is less than 0.5, it is classified as y=0
$$\hat{y} = \begin{cases} 1 & \text{if } h_\theta(x) \ge 0.5 \\ 0 & \text{if } h_\theta(x) < 0.5 \end{cases}$$
- Here, $\hat{y}$ represents the predicted class (0 or 1) and $h_\theta(x)$ is the predicted probability value for the independent variable x
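
Because the logistic function equals 0.5 exactly when its input is 0, the rule $h_\theta(x) \ge 0.5$ is the same as $\theta^T x \ge 0$, which is why the decision boundary is linear in x. A small sketch with hypothetical coefficients, where the boundary falls at x1 = 0.5:

```python
import math

def predict_class(theta, x, threshold=0.5):
    """Classify x as 1 if the predicted probability reaches the threshold, else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))   # theta^T x
    prob = 1.0 / (1.0 + math.exp(-z))            # h_theta(x)
    return 1 if prob >= threshold else 0

theta = [-1.0, 2.0]   # hypothetical: theta^T x = -1 + 2 * x1, zero at x1 = 0.5
for x1 in (0.0, 0.4, 0.5, 1.0):
    print(f"x1={x1}: predicted class = {predict_class(theta, [1.0, x1])}")
```

The `threshold` argument is an addition for illustration. Raising or lowering it moves the boundary, which can be useful when the two kinds of misclassification have different costs.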
Logistic Regression in scikit-learn
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data (Iris dataset; it has three classes, which LogisticRegression also handles)
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the model (max_iter raised above the default of 100 to give the solver room to converge)
logreg = LogisticRegression(max_iter=200)

# Train the model
logreg.fit(X_train, y_train)

# Make predictions
y_pred = logreg.predict(X_test)

# Evaluate performance (Accuracy)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
- Main Hyperparameters
  - penalty : Regularization method selection
    - Default: l2, Options: l1, l2, elasticnet, none (recent scikit-learn versions spell this as None)
  - C : Regularization strength (inverse of the regularization coefficient)
    - Default: 1.0. Larger values mean weaker regularization, and smaller values mean stronger regularization
  - solver : Optimization algorithm selection
    - Default: lbfgs, Options: liblinear, lbfgs, saga, newton-cg (each solver supports only some penalties; for example, lbfgs does not support l1)
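
The effect of `C` can be checked by fitting the same data with several values and comparing the size of the learned coefficients. The sketch below relies on the default L2 penalty and uses a synthetic binary dataset from `make_classification`; the dataset settings and the three `C` values are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (hypothetical)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Default penalty is L2; smaller C means stronger regularization
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

The coefficient norm grows as `C` increases, which matches the description of `C` as the inverse of the regularization strength.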
Tree-based Regression
A regression method that uses Decision Trees to predict continuous values. It is a nonlinear regression model that repeatedly splits the data on the input variables and predicts a value for each resulting region
Decision Tree Regression
- Procedure: Data splitting -> Split criterion selection -> Prediction output
  - Split criterion selection: Each split is made in the direction that minimizes variance, meaning the data within each region is split to be as uniform as possible
  - Prediction output: The predicted value is returned as the average or median of the data in the node
- In Decision Tree Regression, the loss function is defined as the Residual Sum of Squares
$$RSS = \sum_{i \in R_m} (y_i - \hat{y}_{R_m})^2$$
- Here, $y_i$ is the actual dependent variable value, and $\hat{y}_{R_m}$ is the predicted value (mean) for region $R_m$ -> The model chooses the splits that minimize RSS
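
The split search can be sketched for a single feature: try every midpoint between neighboring values, compute the RSS of the two resulting regions around their own means, and keep the threshold with the smallest total. The function names and the toy data are hypothetical, and a real tree would repeat this over all features and recurse into each region.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return the threshold on x that minimizes RSS(left) + RSS(right)."""
    pairs = sorted(zip(x, y))
    best_threshold, best_rss = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue   # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold, best_rss = threshold, total
    return best_threshold, best_rss

# Toy data with two obvious groups
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]

threshold, total_rss = best_split(x, y)
print("best threshold:", threshold)
print("RSS after split:", total_rss)
```

For this data the chosen threshold lies between the two groups (at 6.5), since that split leaves each region close to its own mean.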
Random Forest Regression
- An ensemble learning method based on decision trees
- Key concepts
  - Ensemble: Trains multiple decision trees and combines their predictions (for regression, by averaging them) to generate the final prediction value
  - Bagging: Trains each tree on its own bootstrap sample, i.e. rows drawn at random with replacement from the training data
  - Randomness introduction: When training each tree, only a random subset of features is considered as split candidates -> Reduces correlation between trees, resulting in better performance
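
These three ideas map onto parameters of scikit-learn's `RandomForestRegressor`. The sketch below fits a small forest on synthetic data and checks that the forest's prediction equals the average of its individual trees; the dataset and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=50,    # number of trees in the ensemble
    bootstrap=True,     # each tree is trained on a bootstrap sample (bagging)
    max_features=0.5,   # fraction of features considered at each split
    random_state=0,
)
forest.fit(X, y)

# The forest's prediction is the average of the individual trees' predictions
tree_preds = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
print("forest prediction:   ", np.round(forest.predict(X[:5]), 2))
print("mean of tree outputs:", np.round(tree_preds.mean(axis=0), 2))
```

Lowering `max_features` makes the trees less alike, which usually reduces the variance of the averaged prediction at the cost of weaker individual trees.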
Gradient Boosting Regression
- An ensemble learning method based on decision trees
- Key concepts
  - Boosting: Sequentially trains models, where each subsequent model corrects the errors of the previous model
  - Residual Learning: Each tree learns the residuals (errors) not captured by the previous tree, gradually improving the overall model performance
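
Residual learning can be written out by hand with shallow trees. The loop below is a simplified sketch for squared error, not the full algorithm implemented by libraries such as scikit-learn's `GradientBoostingRegressor`; the learning rate, tree depth, number of rounds, and data are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())          # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction                   # what the current model still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                       # the next tree learns the residuals
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: start={mse_history[0]:.4f}, end={mse_history[-1]:.4f}")
```

Each round lowers the training error a little. Because the model keeps chasing its own residuals, the number of rounds and the learning rate need tuning against held-out data to avoid overfitting.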
Comparison of Tree-based Regression Models
| Model | Overfitting Risk | Training Speed | Prediction Performance | Interpretability | Computational Cost |
|---|---|---|---|---|---|
| Decision Tree Regression | High | Fast | Medium | Very High | Low |
| Random Forest Regression | Low | Medium | High | Low | Medium |
| Gradient Boosting | Low (requires tuning) | Slow | Very High | Low | High |
Regression Metrics
Used to evaluate the performance of a regression model
MSE (Mean Squared Error)
- The average of the squared differences between the predicted values and the actual values
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- Here, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and n is the number of data points
- Characteristics
  - The closer the value is to 0, the closer the model's predictions are to the actual values
  - Because the errors are squared, it is sensitive to large errors, and the presence of outliers can cause MSE to increase significantly
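
The sensitivity to outliers is easy to check by hand. In the sketch below the two prediction lists are identical except for one badly wrong value; the numbers are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred_close = [2.5, 5.5, 7.0, 9.5]      # every prediction within 0.5
y_pred_outlier = [2.5, 5.5, 7.0, 19.0]   # same, except the last one is off by 10

mse_close = mean_squared_error(y_true, y_pred_close)
mse_outlier = mean_squared_error(y_true, y_pred_outlier)
print("MSE without the outlier:", mse_close)
print("MSE with the outlier:   ", mse_outlier)
```

A single error of 10 contributes 100 to the sum, so it alone accounts for almost the entire MSE in the second case.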
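MAE (Mean Absolute Error)
- The average of the absolute differences between the predicted values and the actual values, less sensitive to outliers than MSE
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$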
$R^2$ (Coefficient of Determination)
- Evaluates how well the model's predictions explain the variability in the actual data
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
- Here, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the actual values, and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ represents TSS, which indicates the total variability of the data
- Characteristics
  - $R^2 = 1$ -> The model perfectly predicts all the data
  - $R^2 = 0$ -> The model does not explain any of the variability in the data
  - Negative values can occur, indicating that the model performs worse than simply predicting the mean
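
The three cases listed above can be reproduced with a few lines of plain Python. The function name `r2_score_manual` and the sample values are hypothetical.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - RSS / TSS."""
    mean_true = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    tss = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    return 1.0 - rss / tss

y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2_score_manual(y_true, [1.0, 2.0, 3.0, 4.0])   # exact predictions
r2_mean = r2_score_manual(y_true, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
r2_worse = r2_score_manual(y_true, [4.0, 3.0, 2.0, 1.0])     # predictions in reverse order

print(r2_perfect, r2_mean, r2_worse)
```

Predicting the mean everywhere makes RSS equal to TSS, which gives 0. The reversed predictions have an RSS four times larger than TSS, so the score drops to -3.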
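RMSE (Root Mean Squared Error)
- The square root of MSE, which expresses the error in the same units as the dependent variable
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
MSLE (Mean Squared Logarithmic Error)
- The average of the squared differences between the logarithms of the actual and predicted values (each shifted by 1), which penalizes relative rather than absolute error
$$MSLE = \frac{1}{n} \sum_{i=1}^{n} \left(\log(1 + y_i) - \log(1 + \hat{y}_i)\right)^2$$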
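
A minimal sketch of MAE, RMSE, and MSLE, assuming their standard definitions (mean absolute difference, square root of MSE, and squared error on log(1 + value)). The function names and sample values are hypothetical.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def msle(y_true, y_pred):
    """Mean Squared Logarithmic Error; inputs must be greater than -1."""
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

print("MAE: ", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("MSLE:", msle(y_true, y_pred))
```

RMSE is in the same units as the target, which makes it easier to read than MSE. MSLE compares values on a log scale, so it suits targets that span several orders of magnitude.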