AIVLE TIL ('23.03.08) 2차 미니프로젝트 3일차

cjkangme·2023년 3월 8일

AIVLE

에이블스쿨

목록 보기

29/81

조별 Kaggle Competition

악성 사이트을 탐지하는 모델을 개발해 성능 점수를 경쟁하는 조별 competition을 진행하였다.

문제 정의

악성 사이트들의 URL과 HTML 스크립트에는 공통 Feature가 존재하여 이를 통해 악성 사이트 여부를 알 수 있다.
Test 데이터에 대해 micro-f1_score가 가장 높은 모델을 만드는 것이 목표이다.

데이터 분석

학습 데이터

ydata-profiling 라이브러리를 이용하여 데이터 단변량, 이변량 분석을 빠르고 효율적으로 진행
의미가 없을 것 같은 칼럼을 추리고, 칼럼을 조합하여 새로운 feature를 고민
상관분석을 진행하여 상관 관계가 높은 변수들 확인

새로 알게 된 점 (타 조의 발표 기반)

데이터의 분포가 치우쳐 있을 때, 정규분포 형태로 변환하는 다양한 방법을 시도해 볼 수 있다.

Gausinan Rank Transform(GRT)
로컬 이웃과의 순위를 이용하여 데이터 값을 평균 0의 정규분포로 변환하는 방법

로그 변환
로그 변환을 통해 데이터의 범위를 줄이고, 왜도 및 첨도를 줄일 수 있다.

데이터 전처리

학습 데이터

데이터 분석 단계에서 추려낸 컬럼 삭제 (의미 없는 변수 및 상관관계 높은 변수)
변수를 조합 및 가공하여 새로운 특징 생성
VIF Factor를 이용하여 공선성 제거

테스트 데이터

특이하게 테스트 데이터에 약 20%는 결측치였다.
결측치를 처리하기 위해 0으로 처리하는 방법, 평균으로 처리하는 방법, 주변 변수와의 관계를 이용해 결측치를 채우는 작업을 진행했다.

새로 알게 된 점 (타 조의 발표 기반)

KNN Imputer

KNN 알고리즘을 이용하여 결측치를 채우는 방법이다.

이번 competition의 경우 KNN Imputer를 이용해 테스트 데이터의 결측치를 채우는 것이 성능 점수의 핵심이 되었다.

IterativeImputer (다변량 대치)

다른 모든 feature로 부터 추출하여 결측치 값을 예측하는 알고리즘이다.

라운드 로빈(Round Robin) 방식을 이용한다.
그룹 내에 있는 모든 요소를 합리적인 순서에 입각하여 뽑는 방식, 대개 리스트에서 제일 위부터 아래를 차례로 뽑고, 모든 리스트를 순회하면 다시 맨 위로 돌아가는 방식이다.

모델링

XGBoost, Random Forest, SVC, ANN, Stacking 등 최대한 다양한 모델에 대해 하이퍼파라미터 없이 학습하여, 검증 데이터에 대한 score를 확인했다.
성능이 일정 수준 이상이었던 모델들에 대해 조원 별로 한 두개의 모델을 맡아 하이퍼파라미터 튜닝을 진행하였다.

새로 알게 된 점 (타 조의 발표 기반)

Cat Boasting Model

기존 부스팅 모델은 이전 훈련데이터의 잔차 전체를 사용하여 학습을 진행해 과적합이 발생할 우려가 있다는 문제점

Cat Boasting Model은 이전 훈련데이터의 일부만을 사용하여 잔차를 구한 뒤 학습을 진행하는 모델이다.

현재 SOAT 모델 중 하나이다.

자세한 것은 XGB, LGBM과 함께 따로 공부해서 정리할 것이다.

Auto ML 프레임 워크

현재 오픈소스를 통해 자동으로 최적 모델 및 하이퍼파라미터 값을 구해주는 프레임워크가 많이 존재한다.

이를 이용하면 정말 단시간 내에 최적의 모델을 구할 수 있다.

mljar-supervised

하이퍼파라미터 튜닝

GridSearchCV를 이용하여 최적의 파라미터를 찾는 작업을 진행했다.
모델에 대한 지식이 부족하여 Chat GPT에게 질문하였고, 상세한 답변을 얻을 수 있었다.

Q . How to tune hyperparameters in the xgboost classifier?

A.
XGBoost is a popular implementation of gradient boosting for classification and regression tasks. Hyperparameters are settings that can be adjusted to improve the performance of an XGBoost model. Here are some steps for tuning hyperparameters in the XGBoost classifier:

Start with default hyperparameters: The first step is to use the default hyperparameters provided by the XGBoost library. This will give you a baseline performance of your model, and you can then tune hyperparameters to improve upon this baseline.

Choose hyperparameters to tune: There are several hyperparameters in XGBoost that you can tune to improve model performance. Some of the most important ones include the learning rate, maximum depth of trees, subsample ratio, and regularization parameters.

Define a search space: For each hyperparameter that you want to tune, you need to define a search space. This is a range of values that you want to test to find the best hyperparameter value. You can define the search space manually or use a library like scikit-learn to automate this process.

Choose a tuning strategy: There are several tuning strategies you can use, such as grid search, random search, or Bayesian optimization. Grid search is the simplest and most straightforward, but it can be computationally expensive. Random search is faster, but it may not find the optimal hyperparameter values. Bayesian optimization is a more advanced technique that uses a probabilistic model to search the hyperparameter space more efficiently.

Evaluate performance: For each combination of hyperparameters that you test, you need to evaluate the model's performance using a validation set. This can be done using metrics like accuracy, precision, recall, or F1 score.

Select the best hyperparameters: After testing all combinations of hyperparameters, you need to select the best set of hyperparameters that give the best performance on the validation set. This set of hyperparameters can then be used to train the final model.

Test final model: Finally, you need to test the final model on a test set to ensure that it generalizes well to new data.

In summary, tuning hyperparameters in XGBoost involves defining a search space for each hyperparameter, selecting a tuning strategy, evaluating performance, and selecting the best hyperparameters to use in the final model.

Q. Which hyperparameters should I tune first?

A.
The choice of which hyperparameters to tune first depends on the specific problem and dataset you are working with. However, here are some general recommendations for hyperparameters to tune first in XGBoost:

Learning rate (eta): This is a crucial hyperparameter that controls the step size during gradient descent. A smaller learning rate can lead to better convergence but requires more iterations, while a larger learning rate can result in faster convergence but may miss the optimal solution. A good starting point is to try a small range of learning rates (e.g., 0.01 to 0.1) and see how the performance changes.

Maximum depth of trees (max_depth): This hyperparameter controls the maximum depth of the decision trees in the ensemble. A larger max_depth can lead to overfitting, while a smaller max_depth can lead to underfitting. A good starting point is to try a range of max_depth values (e.g., 3 to 10) and see how the performance changes.

Subsample ratio (subsample): This hyperparameter controls the fraction of observations to be randomly sampled for each tree. A lower subsample ratio can lead to overfitting, while a higher subsample ratio can lead to underfitting. A good starting point is to try a range of subsample values (e.g., 0.5 to 0.9) and see how the performance changes.

Regularization parameters (lambda and alpha): These hyperparameters control the strength of L1 and L2 regularization, respectively. Regularization can help prevent overfitting by penalizing large weights. A good starting point is to try a small range of regularization values (e.g., 0 to 10) and see how the performance changes.

Once you have found good values for these hyperparameters, you can then move on to tuning other hyperparameters such as the number of trees (n_estimators), the minimum sum of instance weight needed in a child (min_child_weight), the gamma value for minimum loss reduction required to make a further partition on a leaf node, and more.

Chat GPT의 위력을 새삼 실감할 수 있었다.

소감

배운 점

결측치 처리에 대해 완전히 새로운 방법을 배우게 되었다. 또 결측치 처리가 얼마나 성능에 중요한 영향을 끼치는지도 깨달았다.
CatBoost와 같은 최신 알고리즘을 알고 활용하는 것이 중요하다는 생각을 했다. 배움을 게을리 하지 말자.
결측치 처리와 연장되는 이야기인데, 탐색과 전처리의 중요성을 더욱 실감하였다.

좋았던 점

정말 주어진 시간에 갖고 있는 지식을 활용해서 최선을 다한 시간이었다.
팀원간 분업이 잘 이루어져 처음 목표했던 일을 전부 완수할 수 있었다.

cjkangme

이전 포스트

AIVLE TIL ('23.03.07) 2차 미니프로젝트 2일차

다음 포스트

AIVLE TIL ('23.03.08) 2차 미니프로젝트 3일차

에이블스쿨

조별 Kaggle Competition

문제 정의

데이터 분석

학습 데이터

새로 알게 된 점 (타 조의 발표 기반)

데이터의 분포가 치우쳐 있을 때, 정규분포 형태로 변환하는 다양한 방법을 시도해 볼 수 있다.

데이터 전처리

학습 데이터

테스트 데이터

새로 알게 된 점 (타 조의 발표 기반)

KNN Imputer

IterativeImputer (다변량 대치)

모델링

새로 알게 된 점 (타 조의 발표 기반)

Cat Boasting Model

Auto ML 프레임 워크

하이퍼파라미터 튜닝

소감

배운 점

좋았던 점

AIVLE TIL ('23.03.07) 2차 미니프로젝트 2일차

AIVLE TIL ('23.03.09) AI 모델 해석/평가

0개의 댓글

관련 채용 정보