[Pandas] drop_duplicates

olxtar · November 3, 2022

Comment :

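Suppose there are two DataFrames whose values in a particular column (partially) overlap. What if you want to stack them row-wise (rows + rows) while collapsing each duplicated value in that column to a single row, keeping the row from the first DataFrame?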

pd.DataFrame.drop_duplicates

pd.DataFrame.drop_duplicates(self,
                             subset=None,
                             keep='first',
                             inplace=False,
                             ignore_index=False)

[!] Can be called directly on a DataFrame as df.drop_duplicates()
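A minimal sketch of the scenario in the Comment, assuming two small made-up DataFrames (`first`, `second`) that share a hypothetical `id` column. It stacks them with pd.concat and then keeps the row from the first DataFrame wherever `id` repeats.

```python
import pandas as pd

# Hypothetical sample data: "id" is the column whose values partly overlap.
first = pd.DataFrame({"id": [1, 2, 3], "source": ["first", "first", "first"]})
second = pd.DataFrame({"id": [2, 3, 4], "source": ["second", "second", "second"]})

# Stack the rows of `first` on top of the rows of `second`.
stacked = pd.concat([first, second])

# For each repeated "id", keep the earliest row, i.e. the one from `first`.
combined = stacked.drop_duplicates(subset="id", keep="first", ignore_index=True)
print(combined)
```

Because `first` comes first in the concat, keep='first' is what makes its rows win; swapping the concat order would have the same effect as keep='last'.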

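subset : the column name (or list of column names) checked for duplicates. If omitted, all columns are compared.
ex) Say we have Liverpool's lineup for Premier League round 1 (columns: match date, player name, goals, assists; one row per player) and the lineup for round 2. We want to combine the two DataFrames, but for players who appeared in both rounds, keep the round 2 row.
In that case, set subset to the player name column. The match date would not work here: it differs between rounds, so a player's two rows would not be seen as duplicates, while every row within a round shares the same date.
keep : duplicates pile up as multiple rows, right? Use 'first' to keep the first of them, 'last' to keep the last, or False to drop all of them. For the example above, stack round 1 first and use 'last'.
inplace : boolean; if True, df itself is overwritten and the method returns None.
ignore_index : boolean; if True, the result's index is relabeled 0, 1, ..., n-1 instead of keeping the original row labels.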
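The parameters above can be sketched with the lineup example. The player names, dates, and numbers below are made up for illustration, and the column names (`match_date`, `player`, `goals`, `assists`) are assumptions, not anything from a real dataset.

```python
import pandas as pd

# Hypothetical lineups; all values are placeholders.
round1 = pd.DataFrame({
    "match_date": ["2022-08-06"] * 3,
    "player": ["Player A", "Player B", "Player C"],
    "goals": [1, 0, 0],
    "assists": [0, 1, 0],
})
round2 = pd.DataFrame({
    "match_date": ["2022-08-15"] * 3,
    "player": ["Player B", "Player C", "Player D"],
    "goals": [0, 2, 0],
    "assists": [0, 0, 1],
})

# Round 1 rows first, round 2 rows after them.
stacked = pd.concat([round1, round2])

# keep="last": for players in both rounds, the round 2 row survives.
latest = stacked.drop_duplicates(subset="player", keep="last", ignore_index=True)

# keep=False: drop every player who appears more than once.
single_round_only = stacked.drop_duplicates(subset="player", keep=False)

# inplace=True modifies `stacked` itself and returns None.
result = stacked.drop_duplicates(subset="player", keep="last", inplace=True)
```

Note that without ignore_index=True the surviving rows keep their original labels, so the index after a concat can contain repeated values.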

[!][+] Be careful when calling .drop_duplicates on a large DataFrame: it uses additional memory, since by default it returns a new DataFrame alongside the original.
