[Coursera] How to win a data science competition - Week 2, Lecture 1

환공지능 · July 6, 2021

Coursera kaggle machine learning

1. Building Intuition about data

(1) Getting domain knowledge

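We usually start without domain knowledge about the topic we are given.
Deep expertise is not required, but we do need to understand our goal, the data, and how people usually approach the problem and build a baseline.
Search Wikipedia and Google for background.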

(2) Checking if the data is intuitive

Check the data against your intuition.
e.g. if an age column contains 336, use intuition to judge whether it is a typo and whether 33 or 36 was meant.
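A minimal sketch of such a sanity check, assuming a hypothetical `age` column and a plausible range of 0 to 120 (both are illustrative assumptions, not part of the lecture):

```python
import pandas as pd

# Hypothetical toy data; the column name "age" and the 0-120 range are assumptions.
df = pd.DataFrame({"age": [25, 33, 336, 41, -1, 36]})

# Flag values that fall outside the range domain intuition says is plausible.
suspicious = df[(df["age"] < 0) | (df["age"] > 120)]
print(suspicious)
```

The flagged rows still need a human decision: fix the typo, treat the value as missing, or keep it if the organizers confirm it is valid.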

(3) Understanding how the data was generated

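What algorithm was used to sample objects from the database?
Did the host take objects at random, or oversample a particular class?
To set up a reliable validation scheme, we need to know how the data was actually generated. For example, if the train set and the test set were generated differently, a random part of the train set cannot serve as a representative validation set.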
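One rough way to notice that train and test were generated differently is to put their summary statistics side by side. The sketch below uses made-up frames and a hypothetical feature `x`; it only hints at a difference and does not replace reading the competition's data description.

```python
import pandas as pd

# Hypothetical toy frames; in practice these would be the competition's train and test sets.
train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "target": [0, 0, 1, 1]})
test = pd.DataFrame({"x": [10.0, 11.0, 12.0, 13.0]})

# Compare the same feature in both sets; a large gap suggests a different generation process.
summary = pd.DataFrame({
    "train": train["x"].describe(),
    "test": test["x"].describe(),
})
print(summary)
```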

2. Exploring anonymized data

(1) What is anonymized data?

Data that has been encoded for security reasons.
The type of each feature is hidden.
Explore the features to guess the meaning and type of each column.
Use relations between features to find groups.
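A first pass at guessing column types can be as simple as tabulating dtype, number of unique values, and missing counts per column. The frame and the column names `f0`, `f1`, `f2` below are hypothetical placeholders for anonymized features.

```python
import pandas as pd

# Hypothetical anonymized frame; real column names would be equally uninformative.
df = pd.DataFrame({
    "f0": [0.12, -1.30, 0.57, 2.04, -0.44, 0.91],
    "f1": [1, 0, 1, 1, 0, 0],
    "f2": ["a", "b", "a", "c", "b", "a"],
})

# Few unique values hints at a binary or categorical feature; many hints at a numeric one.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
    "n_missing": df.isnull().sum(),
})
print(overview)
```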

(2) What can we do with it?

Try to decode or de-anonymize the data, within what is legal.
Recover the true meaning of the features.
Find out how the features relate to each other.
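As one illustration of recovering a feature's meaning, the sketch below tries to undo a standardization by looking at the gaps between sorted unique values. It assumes the hidden feature was integer-valued and that the smallest gap corresponds to a step of 1; both are assumptions that may not hold for real data.

```python
import numpy as np

# Hypothetical anonymized column: integer values that were standardized before release.
original = np.array([20, 21, 23, 30, 21, 25])
x = (original - original.mean()) / original.std()

# The smallest gap between sorted unique values is a candidate for "one unit".
uniq = np.sort(np.unique(x))
step = np.diff(uniq).min()

# Dividing by the step restores unit spacing; the offset (here 20) remains unknown.
rescaled = x / step
shifted = rescaled - rescaled.min()
print(np.round(shifted, 6))
```

The offset still has to be guessed from context, for example by checking which shift makes the values look like plausible ages or years.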

3. Visualizations

(1) Explore individual features

Histograms : plt.hist(x)
Plots : plt.plot(x,'.') , plt.scatter(range(len(x)), x, c=y)

Statistics : df.describe(), x.mean(), x.var()
Other tools
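The calls listed above can be combined into one small sketch. The feature `x` and the binary label `y` are synthetic and only serve to make the example runnable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic feature and label, used only for illustration.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200), name="feature")
y = (x > 0).astype(int)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=30)                  # distribution of values
axes[1].plot(x.to_numpy(), ".")           # row index vs value
axes[2].scatter(range(len(x)), x, c=y)    # same plot, colored by label

# Summary statistics: count, mean, std, min, quartiles, max.
stats = x.describe()
print(stats)
```

Horizontal bands in the index-vs-value plot point to repeated values, and vertical structure suggests the rows are not shuffled.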

(2) Explore feature relations

Scatter plots : pd.plotting.scatter_matrix(df) (formerly pd.scatter_matrix(df), which has been removed from pandas)
Correlation plots : df.corr(), plt.matshow()
Plot (index vs feature statistics) : df.mean().plot()
And more
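A sketch of the three relation plots on a synthetic frame, where column `b` is built to correlate with `a` and `c` is independent noise (all names and data are made up for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic frame: "b" depends on "a", "c" is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=100),
    "c": rng.normal(size=100),
})

# Pairwise scatter plots with histograms on the diagonal.
scatter_axes = pd.plotting.scatter_matrix(df, figsize=(6, 6))

# Correlation matrix shown as an image.
corr = df.corr()
fig, ax = plt.subplots()
im = ax.matshow(corr)
fig.colorbar(im)

# Per-column mean plotted against column position, on its own figure.
fig2, ax2 = plt.subplots()
df.mean().plot(ax=ax2, style=".")
```

Sorting the statistics first, for example with `df.mean().sort_values()`, can make groups of similar features easier to spot.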

4. Dataset Cleaning and Other things to check

(1) Duplicated Column : duplicated columns are best removed to save memory

trainset.T.drop_duplicates().T  # transpose back so features are columns again
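A runnable sketch of dropping duplicated columns on a toy frame (the data and the names `f0`, `f1`, `f2` are hypothetical). The second variant builds a mask instead of transposing twice, which keeps the original dtypes when columns have mixed types.

```python
import pandas as pd

# Toy frame in which "f1" is an exact copy of "f0".
trainset = pd.DataFrame({"f0": [1, 2, 3], "f1": [1, 2, 3], "f2": [4, 5, 6]})

# Variant 1: transpose, drop duplicated rows, transpose back.
deduped = trainset.T.drop_duplicates().T

# Variant 2: mark duplicated columns and select the rest.
dup_mask = trainset.T.duplicated().to_numpy()
deduped_masked = trainset.loc[:, ~dup_mask]

print(list(deduped.columns))
```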

(2) Duplicated Row : check whether duplicated rows have the same label before removing them

Understand why the rows are duplicated.
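A sketch of that check, assuming hypothetical feature columns `f0`, `f1` and a `target` column: it lists rows whose features repeat and then singles out the groups whose labels disagree.

```python
import pandas as pd

# Toy frame: rows 0-1 repeat with the same label, rows 2-3 repeat with different labels.
train = pd.DataFrame({
    "f0": [1, 1, 2, 2, 3],
    "f1": [5, 5, 6, 6, 7],
    "target": [0, 0, 1, 0, 1],
})
feature_cols = ["f0", "f1"]

# All rows whose feature values occur more than once.
dup_rows = train[train.duplicated(subset=feature_cols, keep=False)]

# Number of distinct labels inside each group of identical feature rows.
labels_per_group = dup_rows.groupby(feature_cols)["target"].nunique()
conflicting = labels_per_group[labels_per_group > 1]
print(conflicting)
```

Groups with conflicting labels deserve a closer look before any rows are dropped, since they may reflect label noise or a missing feature.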

(3) Check if dataset is shuffled

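Compare the rolling mean of the target over the row index with the overall mean; if the data is not shuffled, the resulting pattern may reveal a data leakage.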
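A minimal sketch of the comparison, using `Series.rolling` (the old `pd.rolling_mean` function no longer exists in pandas). The target series and the window size of 10 are illustrative assumptions.

```python
import pandas as pd

# Toy target stored in an unshuffled order: all zeros first, then all ones.
target = pd.Series([0] * 50 + [1] * 50)
window = 10  # assumed window size; tune to the dataset

overall = target.mean()
rolling = target.rolling(window).mean()

# Largest gap between the local mean and the global mean.
max_dev = (rolling - overall).abs().max()

# The same measure after shuffling the rows, for comparison.
shuffled = target.sample(frac=1.0, random_state=0).reset_index(drop=True)
max_dev_shuffled = (shuffled.rolling(window).mean() - overall).abs().max()

print(max_dev, max_dev_shuffled)
```

In shuffled data the rolling mean should oscillate around the overall mean; long stretches far from it suggest that the row order carries information about the target.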
