[Coursera] How to win a data science competition - Week 1, Lecture 1

환공지능 · July 5, 2021

Data Science kaggle


Competition's Concept

(1) Data
(2) Model
(3) Submission
(4) Evaluation
(5) Leaderboard

1. Data

Data is what the organizers give us as training material.
It comes in the form of CSV or text files, archives of pictures, database dumps, or code.
A description of the data is usually provided, and reading it gives a deeper understanding of the features in the data.

2. Model

What we will build during the competition.
Something that transforms data into answers.
- (1) It should produce the best possible predictions
- (2) It should be reproducible
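
A minimal sketch of the reproducibility point, using only the Python standard library. The function `train_and_predict` is a hypothetical stand-in for a real training run (it is not from the course): it samples half of the data with a seeded generator and "predicts" the sample mean. The idea is that fixing the seed makes two runs give identical results.

```python
import random


def train_and_predict(seed, values):
    """Toy stand-in for a training run: sample half the data, predict its mean."""
    rng = random.Random(seed)  # local seeded generator, no hidden global state
    sample = rng.sample(values, k=len(values) // 2)
    return sum(sample) / len(sample)


values = [float(v) for v in range(100)]

# Two runs with the same seed should give exactly the same prediction.
run_a = train_and_predict(42, values)
run_b = train_and_predict(42, values)
print(run_a == run_b)
```

In a real pipeline the same idea would extend to every source of randomness (data splits, model initialization, sampling), which is an assumption beyond what the lecture notes state.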

3. Submission

Specify the ID of each object, then specify our prediction for it.
In most cases we only need to submit the predictions. The source code or model used to produce them is not required.
Code can also be shared publicly, for example to show off one's work.
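
The "ID, then prediction" format described above might be written out as follows. This is a sketch under assumptions: the column names `id` and `prediction` and the helper `write_submission` are hypothetical, and each competition defines its own exact file format.

```python
import csv
import io


def write_submission(ids, predictions, out):
    """Write one row per object: its ID, then our prediction for it."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "prediction"])  # hypothetical column names
    for object_id, prediction in zip(ids, predictions):
        writer.writerow([object_id, prediction])


# An in-memory buffer is used here; a real submission would go to a file
# opened with open(path, "w", newline="").
buffer = io.StringIO()
write_submission([101, 102, 103], [0.9, 0.1, 0.5], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```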

4. Evaluation

You need to know how good your model is.
The quality of the model is defined by an evaluation function: (predictions, right answers) -> score
Examples: Accuracy, Logistic Loss, AUC, RMSE, MAE
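
To make the (predictions, right answers) -> score idea concrete, here is a plain-Python sketch of four of the metrics listed above. AUC is omitted for brevity. The clipping constant `eps` in `log_loss` is an assumption added to avoid taking log(0), and in practice a library implementation would normally be used instead.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the right answers."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logistic loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log() is always defined
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(rmse([3.0, 5.0], [1.0, 5.0]))
print(mae([3.0, 5.0], [1.0, 5.0]))
print(log_loss([1, 0], [0.5, 0.5]))
```

Note that accuracy is a "higher is better" score, while RMSE, MAE, and logistic loss are "lower is better".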

5. Leaderboard

Public test set: used during the competition
Private test set: used for the final ranking
(The split is hidden from users)
Only the public leaderboard is visible to users during the competition. The private leaderboard is revealed only after the competition deadline.
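
The public/private mechanism can be illustrated with a small simulation. Everything here is assumed for illustration: the 30% public fraction, the seed, the toy labels, and the helper names are hypothetical, and real platforms choose their own split.

```python
import random


def split_public_private(n_rows, public_fraction, seed):
    """Organizer side: split test row indices into public and private parts."""
    rng = random.Random(seed)  # the seed (and thus the split) is hidden from users
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_public = int(n_rows * public_fraction)
    return sorted(indices[:n_public]), sorted(indices[n_public:])


def accuracy_on(indices, y_true, y_pred):
    """Score a submission on a subset of the test rows."""
    return sum(y_true[i] == y_pred[i] for i in indices) / len(indices)


y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # hidden right answers (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # one participant's submission

public_idx, private_idx = split_public_private(len(y_true), 0.3, seed=0)
public_score = accuracy_on(public_idx, y_true, y_pred)    # shown during the competition
private_score = accuracy_on(private_idx, y_true, y_pred)  # decides the final ranking
print(public_score, private_score)
```

Because the two scores are computed on different rows, a submission tuned to the public score may rank differently on the private leaderboard.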

Real World Application vs Competitions

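A real-world application involves the following stages:
(1) Understanding the problem from a business point of view
(2) Formalizing the task: what exactly needs to be predicted
(3) Collecting data: what data is available for use
(4) Processing data: preprocessing the data into a usable form
(5) Choosing a model
(6) Deciding how to deploy the model
(7) Monitoring model performance and retraining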

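Real-world problems are more complex, but competitions are a good way to learn the process.
Philosophy of competitions
- The algorithm alone does not solve everything.
- ML is not always necessary. Heuristics and manual data analysis are also valid approaches.
- Do not shy away from complex methods, advanced feature engineering, and heavy computation.
- Be creative, building on existing algorithms and other people's code.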

