[Pandas] Text Data Preprocessing

차보경 · May 20, 2022

TIL


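Working on ML lately, I have come to appreciate how much data preprocessing matters.
People use much the same tools to process data, but results vary enormously depending on what values you feed into the model. (feat. why won't my loss go down...?😬😬)

Today I'll write up the text preprocessing methods I used while building a news summarization bot (text data, RNN), a project I've been working on recently.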

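1. 🔎 Removing duplicate samples and samples with NULL values

1) Check for duplicate samples => compare the total row count with the number of unique values

# First, check the total number of rows
print('Total samples:', len(data))
print('Unique samples in column A (duplicates excluded):', data['A'].nunique())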

nunique(): returns the number of unique values (NaN is excluded by default)
Ex) df['A'].nunique()
Related functions
- unique() : returns all unique values (useful when you want to see which values occur)
  Ex) df['A'].unique() : all unique values in column A of df
- value_counts() : returns the number of rows for each value
  Ex) df['A'].value_counts() : count per unique value in column A of df
  Ex2) df['A'].value_counts(ascending=True) : the same counts, sorted in ascending order
- data.duplicated(subset=['column']) returns a True/False value per row marking duplicates
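
The functions listed above can be tried on a small example. This is only a sketch: the DataFrame and its values are made up for illustration, and column 'A' stands in for whichever column you are checking.

```python
import pandas as pd

# Hypothetical toy data; 'A' stands in for the real column being checked.
df = pd.DataFrame({"A": ["x", "y", "x", "z", "x"]})

print("total rows:", len(df))
print("unique values in A:", df["A"].nunique())

# All distinct values, in order of first appearance
print(df["A"].unique())

# Number of rows per distinct value
counts = df["A"].value_counts()
print(counts)

# Boolean mask: True for every row that repeats an earlier value of A
dup_mask = df.duplicated(subset=["A"])
n_duplicates = int(dup_mask.sum())
print("duplicated rows:", n_duplicates)
```

If len(df) and nunique() differ, the column contains duplicates, and dup_mask.sum() tells you how many rows drop_duplicates() would remove with its default settings.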

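2) Remove duplicates

Use DataFrame.drop_duplicates() to remove duplicate rows

DataFrame.drop_duplicates(subset=['column'], inplace=True)

The full signature is DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
- subset : column name(s) to consider when identifying duplicates (if omitted, all columns are used)
- keep : one of {'first', 'last', False}; the default is 'first'.
  - first : keep the first occurrence and drop the rest
  - last : keep the last occurrence and drop the rest
  - False : drop every row that has a duplicate
- inplace : if False, the original DataFrame is left unchanged and a deduplicated copy is returned. If True, the original is modified and None is returned
- ignore_index : if True, the result is relabeled 0, 1, ..., n-1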
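
To see how the keep options differ, here is a minimal sketch on made-up data (the column names A and B are placeholders, not from the project):

```python
import pandas as pd

# Hypothetical toy data: 'x' appears twice in column A.
df = pd.DataFrame({"A": ["x", "y", "x", "z"], "B": [1, 2, 3, 4]})

# keep='first' (default): the first 'x' row survives
first = df.drop_duplicates(subset=["A"], keep="first")

# keep='last': the last 'x' row survives
last = df.drop_duplicates(subset=["A"], keep="last")

# keep=False: every row whose A value is duplicated is dropped
neither = df.drop_duplicates(subset=["A"], keep=False)

# ignore_index=True: the result gets a fresh 0..n-1 index
reindexed = df.drop_duplicates(subset=["A"], ignore_index=True)

print(first["B"].tolist(), last["B"].tolist(), neither["B"].tolist())
print(reindexed.index.tolist())
```

Because inplace defaults to False, df itself still has all four rows after these calls.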

2. 🔎 Re-check for Null values after removal

Use .isnull().sum() to check whether any Nulls remain, then remove them with dropna()

print(data.isnull().sum())
data.dropna(axis=0, inplace=True)
print('Total samples:', len(data))

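df.isnull() : True/False for each cell, marking nulls
df.isnull().sum() : number of nulls per column (handy for a quick check)
df[df.column.isnull()] : select the rows where that column is null
df.dropna() or df.dropna(axis=0) : drop rows containing nulls
df.dropna(axis=1) : drop columns containing nulls
df.fillna(value) : fill nulls with a given value
- df.fillna(method='ffill') : fill with the value from the previous row (the method argument is deprecated since pandas 2.1; df.ffill() does the same)
- df.fillna(method='bfill') : fill with the value from the next row (likewise df.bfill())
- df.fillna(df.mean()) : fill with each column's mean (if the DataFrame has non-numeric columns, use df.mean(numeric_only=True))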
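
A short sketch of the null-handling calls above, on made-up data. The column names text and score are hypothetical, and ffill() is used in place of fillna(method='ffill') on the assumption that you are on a recent pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({"text": ["a", None, "c"], "score": [1.0, np.nan, 3.0]})

# Nulls per column
null_counts = df.isnull().sum()
print(null_counts)

# Rows where the 'text' column is null
rows_with_null_text = df[df["text"].isnull()]

# Drop every row that contains at least one null
dropped = df.dropna(axis=0)

# Forward fill: copy the previous row's value into the gap
filled_forward = df.ffill()

# Fill a numeric column with its mean
filled_mean = df["score"].fillna(df["score"].mean())
print(filled_mean.tolist())
```

For text columns, dropping rows is usually the safer choice, since a forward-filled sentence is a duplicate of its neighbor rather than real data.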

3. 🧹 Text normalization and stopword removal

Expand contractions such as it'll and I'd (see the text normalization dictionary)

import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

# You can also run a quick test to check that it works
test = "Hey I'm Chabbo, how're you and how's it going ? That's interesting: I'd love to hear more about it."
print(decontracted(test))

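The approach above works, but I also put the mappings in a separate dictionary so that everything can go into a single preprocessing function later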

contractions = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",
                           "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                           "you're": "you are", "you've": "you have"}

print("Size of the normalization dictionary: ", len(contractions))
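
A dictionary like this can be applied token by token. The following is a minimal sketch, not the final preprocessing function: sample_contractions is a small hypothetical subset of the dictionary above, and expand_contractions is a helper name made up for this example.

```python
# Small hypothetical subset of the dictionary above, so the example is self-contained.
sample_contractions = {
    "i'm": "i am",
    "how's": "how is",
    "that's": "that is",
    "i'd": "i would",
}

def expand_contractions(sentence, mapping):
    """Lowercase the sentence and replace each whitespace-separated
    token that appears as a key in `mapping`."""
    return " ".join(mapping.get(token, token) for token in sentence.lower().split())

expanded = expand_contractions(
    "Hey I'm Chabbo, how's it going? That's interesting", sample_contractions
)
print(expanded)
```

Note that the lookup is an exact match on each token, so a contraction with punctuation attached (for example "that's," with a trailing comma) is left unchanged. The regex-based decontracted() function does not have this limitation.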

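Removing stopwords with NLTK

# Import nltk and download the stopword list
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Check which words are treated as stopwords :)
print('Number of stopwords:', len(stopwords.words('english')))
print(stopwords.words('english'))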
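
The filtering step itself is just a membership test per word. Here is a sketch that runs without downloading anything: the tiny stop_words set is a hand-written stand-in for stopwords.words('english'), and remove_stopwords is a helper name made up for this example.

```python
# Hand-written stand-in for set(stopwords.words('english')); the real list is much longer.
stop_words = {"the", "is", "a", "of", "and"}

def remove_stopwords(sentence, stop_words):
    """Keep only the words that are not in the stopword set."""
    return " ".join(word for word in sentence.split() if word not in stop_words)

filtered = remove_stopwords("the summary of a news article is short", stop_words)
print(filtered)
```

Using a set rather than a list keeps each lookup fast, which matters when the same check runs for every word in a large dataset.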

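A preprocessing function for lowercasing, HTML tag removal, and so on
(I don't plan to remove stopwords from the short texts, so the function takes a separate True/False flag for stopword removal 😉😉)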

# Data preprocessing function
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Build the stopword set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def preprocess_sentence(sentence, remove_stopwords=True):
    sentence = sentence.lower() # lowercase the text
    sentence = BeautifulSoup(sentence, "lxml").text # strip HTML tags such as <br />, <a href = ...>
    sentence = re.sub(r'\([^)]*\)', '', sentence) # remove parenthesized text (...). Ex) my husband (and myself!) for => my husband for
    sentence = re.sub('"','', sentence) # remove double quotes "
    sentence = ' '.join([contractions[t] if t in contractions else t for t in sentence.split(" ")]) # expand contractions
    sentence = re.sub(r"'s\b","", sentence) # remove possessives. Ex) roland's -> roland
    sentence = re.sub("[^a-zA-Z]", " ", sentence) # replace non-English characters (digits, symbols, etc.) with spaces
    sentence = re.sub('[m]{2,}', 'mm', sentence) # collapse runs of 3 or more m's to 2. Ex) ummmmmmm yeah -> umm yeah

    # Remove stopwords (Text)
    if remove_stopwords:
        tokens = ' '.join(word for word in sentence.split() if word not in stop_words and len(word) > 1)
    # Keep stopwords (Summary)
    else:
        tokens = ' '.join(word for word in sentence.split() if len(word) > 1)
    return tokens
print('=3')

4. 🔎 Preprocess the data and check the result

Run the preprocess_sentence() function above and check the output

# Create a list to hold the preprocessed text
clean_text = []

# Preprocess the entire Text column
for s in data['Text']:
    clean_text.append(preprocess_sentence(s))

# Print the result
print("Text after preprocessing: ", clean_text[:5])
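
As an alternative to the explicit loop, pandas can apply a function to every value of a column with Series.apply. This is a sketch under assumptions: simple_clean is a hypothetical stand-in for preprocess_sentence (so the example runs without bs4 or nltk), and the data and the clean_text column name are made up.

```python
import pandas as pd

def simple_clean(sentence):
    """Hypothetical stand-in for preprocess_sentence: lowercase and collapse whitespace."""
    return " ".join(sentence.lower().split())

# Hypothetical toy data in place of the real dataset.
data = pd.DataFrame({"Text": ["Hello   World", "  Second ROW "]})

# Apply the cleaning function to each row and store the result in a new column
data["clean_text"] = data["Text"].apply(simple_clean)
print(data["clean_text"].tolist())
```

Storing the result as a column keeps the cleaned text aligned with the original rows, which makes later checks on the DataFrame straightforward.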

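That completes the preprocessing of the TEXT data!
I wrote this as notes for myself, but I hope it helps others too 🥰🥰

Even though the preprocessing was done with a function, it's a good idea to check for null values once more after cleaning!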

data.isnull().sum()
data.dropna(axis=0, inplace=True)
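
One caveat: cleaning can leave a row as an empty string, and isnull() does not count empty strings as null. A minimal sketch of one way to handle this, assuming the cleaned text has been written back into data['Text'] (the toy data here is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned data: the second row became empty after preprocessing.
data = pd.DataFrame({"Text": ["good product", "", "tasty"]})

# Empty strings are not null, so convert them to NaN first
data["Text"] = data["Text"].replace("", np.nan)

n_null = int(data["Text"].isnull().sum())
print("null rows:", n_null)

# Now dropna() removes the emptied rows as well
data = data.dropna(axis=0)
print("total samples:", len(data))
```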

Intermediate checks can't be emphasized enough...!!!
