Kaggle Notebook Transcription - ICR (Identifying Age-Related Conditions)

Sooin Yoon · March 21, 2025

Google Colab link (transcribed notebook): link text

Description

Goal of the Competition

The goal of this competition is to predict if a person has any of three medical conditions. You are being asked to predict if the person has one or more of any of the three medical conditions (Class 1), or none of the three medical conditions (Class 0). You will create a model trained on measurements of health characteristics.

To determine if someone has these medical conditions requires a long and intrusive process to collect information from patients. With predictive models, we can shorten this process and keep patient details private by collecting key characteristics relative to the conditions, then encoding these characteristics.
Your work will help researchers discover the relationship between measurements of certain characteristics and potential patient conditions.

Context

They say age is just a number but a whole host of health issues come with aging.
From heart disease and dementia to hearing loss and arthritis, aging is a risk factor for numerous diseases and complications.
The growing field of bioinformatics includes research into interventions that can help slow and reverse biological aging and prevent major age-related ailments.

Data science could have a role to play in developing new methods to solve problems with diverse data, even if the number of samples is small.

Currently, models like XGBoost and random forest are used to predict medical conditions yet the models' performance is not good enough. Dealing with critical problems where lives are on the line, models need to make correct predictions reliably and consistently between different cases.

Founded in 2015, competition host InVitro Cell Research, LLC (ICR) is a privately funded company focused on regenerative and preventive personalized medicine. Their offices and labs in the greater New York City area offer state-of-the-art research space. InVitro Cell Research's Scientists are what set them apart, helping guide and defining their mission of researching how to repair aging people fast.
In this competition, you’ll work with measurements of health characteristic data to solve critical problems in bioinformatics.

Based on minimal training, you’ll create a model to predict if a person has any of three medical conditions, with an aim to improve on existing methods.

You could help advance the growing field of bioinformatics and explore new methods to solve complex problems with diverse data.

Evaluation

Submissions are evaluated using a balanced logarithmic loss, a log loss that accounts for class imbalance. The overall effect is such that each class is roughly equally important for the final score.

Each observation is either of class 0 or of class 1. For each observation, you must submit a probability for each class.
The formula is then:

$$
\text{Log Loss} = \frac{-\frac{1}{N_0} \sum_{i=1}^{N_0} y_{0i} \log p_{0i} - \frac{1}{N_1} \sum_{i=1}^{N_1} y_{1i} \log p_{1i}}{2}
$$

where $N_c$ is the number of observations of class $c$, $\log$ is the natural logarithm (base $e \approx 2.71828$, so $\ln e = 1$ and $\ln 1 = 0$), $y_{ci}$ is 1 if observation $i$ belongs to class $c$ and 0 otherwise, and $p_{ci}$ is the predicted probability that observation $i$ belongs to class $c$.
The submitted probabilities for a given row are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum).
In order to avoid the extremes of the log function, each predicted probability p is replaced with

$$
\max(\min(p, 1 - 10^{-15}), 10^{-15})
$$
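
The scoring steps described above (row rescaling, clipping, per-class averaging) can be sketched in NumPy. This is a minimal illustration, not the official scorer: the function name `balanced_log_loss`, the order of rescaling before clipping, and the final division by 2 to average the two class terms are assumptions. Dropping the division would only scale the score by a constant.

```python
import numpy as np


def balanced_log_loss(y_true, proba, eps=1e-15):
    """Balanced log loss for binary labels (illustrative sketch).

    y_true: 0/1 labels, shape (n,)
    proba:  submitted [p_class0, p_class1] rows, shape (n, 2)
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)

    # Rescale each row by its row sum, then clip away from 0 and 1.
    proba = proba / proba.sum(axis=1, keepdims=True)
    proba = np.clip(proba, eps, 1 - eps)

    # Mean negative log-probability of the true class, computed per class.
    loss_0 = -np.mean(np.log(proba[y_true == 0, 0]))
    loss_1 = -np.mean(np.log(proba[y_true == 1, 1]))

    # Average the two per-class losses so each class counts equally.
    return (loss_0 + loss_1) / 2


# Hypothetical labels: three class-0 rows and one class-1 row.
y = np.array([0, 0, 0, 1])
uniform_score = balanced_log_loss(y, np.full((4, 2), 0.5))
perfect_score = balanced_log_loss(y, np.array([[1, 0], [1, 0], [1, 0], [0, 1]]))
```

Predicting 0.5 for every row scores ln 2 regardless of how imbalanced the labels are, which is what "each class is roughly equally important" means in practice.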

Dataset

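train.csv
- Id: unique identifier for each observation
- 56 anonymized health characteristics (AB-GL), all numeric except EJ, which is categorical
- Class (target): 1 indicates the subject has been diagnosed with one of the three conditions, 0 indicates they have not
test.csv: test data for the model's predictions (the task is to predict the probability that each subject belongs to each of the two classes)
greeks.csv
- Supplemental metadata available only for the training data (Id + Alpha, Beta, Gamma, Delta, Epsilon columns)
- alpha (array(['B', 'A', 'D', 'G'], dtype=object))
  - A: no age-related condition (class 0); B, D, G: the three age-related conditions (class 1)
- beta (array(['C', 'B', 'A'], dtype=object))
- gamma (['A', 'B', 'E', 'F', 'G', 'H', 'M', 'N'])
- delta (array(['D', 'B', 'C', 'A'], dtype=object))
- epsilon: the date the data for this subject was collected; all test-set data was collected after the training set was collected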
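
As a rough sketch of how these files might be prepared, the snippet below encodes the categorical EJ column, fills a missing numeric value, parses the Epsilon dates, and joins the metadata on Id. The tiny DataFrames are made-up stand-ins for train.csv and greeks.csv. The EJ values, the date format, and the "Unknown" placeholder are assumptions for illustration, and median imputation is used here as a simple stand-in rather than the model-based imputation mentioned in the notes.

```python
import pandas as pd

# Made-up stand-ins for train.csv and greeks.csv (values are illustrative only).
train = pd.DataFrame({
    "Id": ["a1", "b2", "c3"],
    "AB": [0.21, None, 0.48],
    "EJ": ["A", "B", "A"],
    "Class": [0, 1, 0],
})
greeks = pd.DataFrame({
    "Id": ["a1", "b2", "c3"],
    "Alpha": ["A", "B", "A"],
    "Epsilon": ["3/19/2019", "Unknown", "9/10/2020"],
})

# Encode the single categorical feature as integers.
train["EJ"] = train["EJ"].map({"A": 0, "B": 1})

# Fill missing numeric values with the column median (one simple choice).
train["AB"] = train["AB"].fillna(train["AB"].median())

# Parse collection dates; entries that do not match the format become NaT.
greeks["Epsilon"] = pd.to_datetime(
    greeks["Epsilon"], format="%m/%d/%Y", errors="coerce"
)

# Attach the metadata to the training rows by Id.
merged = train.merge(greeks, on="Id", how="left")
```

Because greeks.csv exists only for the training set, anything derived from it can be used for analysis or validation design but is not available as a feature at test time.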

KFold

A cross-validation technique
Split the data into K pieces, then repeat K times so that each piece is used exactly once for validation

cv = KFold(n_splits=3, shuffle=True, random_state=42)
# Shuffle all data indices -> split into 3 parts -> use 1 part as test and the rest as train

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({'feature': [10, 20, 30, 40, 50]})
print("Original index:", df.index.tolist())
# Output: Original index: [0, 1, 2, 3, 4]

# 3 folds; shuffle the row indices with a fixed seed for reproducibility
cv = KFold(n_splits=3, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(cv.split(df)):
    print(f"\nFold {fold + 1}")
    print("Train Index:", train_idx)
    print("Test Index :", test_idx)

Initial index list: [0,1,2,3,4]
After shuffle (random_state=42): [1,4,2,0,3]
Split the shuffled order into 3 parts
Fold 1
[1,4] -> test
[2,0] -> train
[3] -> train
Fold 2
[2,0] -> test
[1,4] -> train
[3] -> train
Fold 3
[3] -> test
[1,4] -> train
[2,0] -> train
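
The fold walkthrough above can be checked with a short script: every row should land in a test fold exactly once, and train and test should never overlap within a fold. This is a small sanity-check sketch; the variable names `test_counts` and `fold_sizes` are just for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 5
cv = KFold(n_splits=3, shuffle=True, random_state=42)

test_counts = np.zeros(n_samples, dtype=int)  # how often each row is held out
fold_sizes = []                               # size of each test fold

for train_idx, test_idx in cv.split(np.arange(n_samples)):
    # Train and test never overlap within a fold.
    assert set(train_idx).isdisjoint(test_idx)
    test_counts[test_idx] += 1
    fold_sizes.append(len(test_idx))

print("test fold sizes:", fold_sizes)
print("times each row was in a test fold:", test_counts.tolist())
```

With 5 rows and 3 splits the test folds have sizes 2, 2, and 1. Note that KFold returns each index array in ascending order, so a fold written as [2,0] in the walkthrough would print as [0 2].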

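Lessons Learned

Tried filling missing values with a model (CatBoost Regressor)
Anonymized columns also need encoding (if they are categorical)
Preprocessing skills with pd.to_datetime, map(), and fillna()
Cross-validation with KFold
Multi-class classification with CatBoost, with probabilities output via predict_proba()
Imbalance between Class 0 and 1 → handled with np.bincount and weight settings
How to set weight on CatBoost's Pool object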
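
One common way to turn np.bincount into weights for an imbalanced target is inverse-frequency weighting, sketched below. The labels are hypothetical and the exact weighting formula used in the transcribed notebook is not shown in these notes, so treat this as one reasonable choice rather than the notebook's method.

```python
import numpy as np

# Hypothetical imbalanced labels: four class-0 rows, one class-1 row.
y = np.array([0, 0, 0, 0, 1])

# Count how many rows fall in each class.
class_counts = np.bincount(y)

# Inverse-frequency weights: n_samples / (n_classes * count_per_class).
class_weights = len(y) / (len(class_counts) * class_counts)

# Look up a weight for every row from its class label.
sample_weight = class_weights[y]

print("class counts :", class_counts.tolist())
print("class weights:", class_weights.tolist())
```

With these weights each class contributes the same total weight, mirroring the balanced metric. A per-row array like `sample_weight` is the kind of value that can be passed as the `weight` argument when building a CatBoost `Pool`.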
