EDA TEST | Cosmetic Ingredient Data Analysis

소리 · November 18, 2023

Zerobase data analysis study


1-1) Building the ingredient dictionary DataFrame

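# 1-1
import pandas as pd

# ingredients_list is assumed to be the list of DataFrames supplied with the problem
ingredients_df = pd.concat(ingredients_list, ignore_index=False)
# ignore_index=False (the default) keeps each original DataFrame's index as-is when concatenating.
# With ignore_index=True, the index is reset and renumbered from 0.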

✏️ Know when to use concat and when to use merge. concat stacks DataFrames along an axis (rows by default), while merge joins DataFrames on key columns.
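
To make the difference concrete, here is a minimal sketch. The frames `part_a`, `part_b`, and `usage` and all their values are made up for illustration and are not part of the test data.

```python
import pandas as pd

# Two small frames with made-up rows, only for illustration
part_a = pd.DataFrame({"code": [1, 2], "name": ["Water", "Glycerin"]})
part_b = pd.DataFrame({"code": [3, 4], "name": ["Niacinamide", "Squalane"]})

# concat stacks the rows; with ignore_index=False each frame keeps its own index
stacked_keep = pd.concat([part_a, part_b], ignore_index=False)

# with ignore_index=True the index is renumbered from 0
stacked_reset = pd.concat([part_a, part_b], ignore_index=True)

# merge joins two frames on a shared key column (a left join here)
usage = pd.DataFrame({"code": [1, 3], "product_count": [10, 4]})
joined = pd.merge(stacked_reset, usage, on="code", how="left")

print(stacked_keep.index.tolist())   # [0, 1, 0, 1]
print(stacked_reset.index.tolist())  # [0, 1, 2, 3]
print(joined)
```

Codes 2 and 4 have no row in `usage`, so their `product_count` is NaN after the left join.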

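1-2) Cleaning the data in the ingredient dictionary DataFrame

hint1: When replacing '\r', use no space for the Korean columns ('표준 성분명' and '구명칭', the standard and old Korean names) and a space for the English columns ('표준 영문명' and '구영문명', the standard and old English names).
hint2: Every English name starts with an uppercase letter.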

# 1-2
ingredients_df['표준 성분명'] = ingredients_df['표준 성분명'].replace('\r', "", regex = True)
ingredients_df['구명칭'] = ingredients_df['구명칭'].replace('\r', "", regex = True)

ingredients_df['표준 영문명'] = ingredients_df['표준 영문명'].replace('\r', " ", regex = True)
ingredients_df['구영문명'] = ingredients_df['구영문명'].replace('\r', " ", regex = True)

ingredients_df.head(20)

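def delete_r(row):

    for idx, cell in enumerate(row): # cell is the value stored in each column of this row (성분코드 and so on), not the column name
        if type(cell) is str and '\r' in cell:
            # cell[0].isupper() checks whether the value starts with an uppercase letter, i.e. whether it is an English name
            cell = cell.replace('\r', ' ') if cell[0].isupper() else cell.replace('\r', '')
        row.iloc[idx] = cell # write back by position; row[idx] would rely on the deprecated integer fallback
    return row

ingredients_df = ingredients_df.apply(delete_r, axis=1) # apply the function one row at a time
check_01_02(ingredients_df)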

✏️ When the same treatment has to be applied to several columns, write a function for it.
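
One way to act on this note is a small helper that takes the rule for each column as data. This is only a sketch: the sample rows are made up, and `clean_carriage_returns` and `replacement_by_column` are hypothetical names, not part of the test.

```python
import pandas as pd

# Made-up sample rows; only the column names follow the post
sample = pd.DataFrame({
    "표준 성분명": ["정제\r수", "글리세린"],
    "구명칭": ["정\r제수", "글리세린"],
    "표준 영문명": ["Sodium\rHyaluronate", "Glycerin"],
    "구영문명": ["Hyaluronic\rAcid", "Glycerin"],
})

# What '\r' becomes in each column: nothing for Korean names, a space for English names
replacement_by_column = {
    "표준 성분명": "",
    "구명칭": "",
    "표준 영문명": " ",
    "구영문명": " ",
}

def clean_carriage_returns(df, replacement_by_column):
    """Return a copy of df with '\r' replaced column by column."""
    cleaned = df.copy()
    for column, replacement in replacement_by_column.items():
        # regex=False treats '\r' as a literal character
        cleaned[column] = cleaned[column].str.replace("\r", replacement, regex=False)
    return cleaned

cleaned = clean_carriage_returns(sample, replacement_by_column)
print(cleaned)
```

Unlike `delete_r`, this version does not infer the language from the first letter; the rule for each column is stated explicitly in the dictionary.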

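1-3) Cleaning the data in the ingredient dictionary DataFrame

Some data went missing when the PDF was converted to a DataFrame.
The replace_dict in the cell below consists of current value (key): new value (value) pairs. Use this replace_dict to change the values in the '표준 영문명' column of the ingredient dictionary DataFrame.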

ingredients_df['표준 영문명'] = ingredients_df['표준 영문명'].replace(replace_dict)

def replace_ingredients_dict(ingredients_str: str):
    replace_dict = {
        # key: value pairs given in the problem statement (omitted here)
    }
    if type(ingredients_str) is str:
        
        # Replace
        for key, value in replace_dict.items():
            ingredients_str = ingredients_str.replace(key, value)

    return ingredients_str

ingredients_df['표준 영문명'] = ingredients_df['표준 영문명'].apply(replace_ingredients_dict)
check_01_03(ingredients_df)

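✏️ Does the :str next to the parameter in the second function definition mean that only strings are allowed?
My solution is much shorter and the checker accepts it, but it is so short that I doubt whether it is really correct.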
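
Both points can be checked with a small experiment. In the sketch below, `shout`, `replace_substrings`, the mapping, and the data are hypothetical. It shows that a `: str` annotation is only a type hint that Python does not enforce at call time, and that `Series.replace` with a dict swaps whole cell values while `str.replace` swaps substrings.

```python
import pandas as pd

def shout(text: str):
    # The ': str' annotation is only a hint; Python does not check it when the function is called
    return text * 2

print(shout("ab"))  # abab
print(shout(3))     # 6, an int is accepted as well

# Hypothetical mapping and data, only for illustration
replace_dict = {"Aqua": "Water"}
names = pd.Series(["Aqua", "Aqua Extract"])

# Series.replace with a dict swaps cells whose whole value equals a key
whole_cell = names.replace(replace_dict)

# str.replace swaps every occurrence of the key inside each string
def replace_substrings(value):
    if isinstance(value, str):
        for key, new in replace_dict.items():
            value = value.replace(key, new)
    return value

substring = names.apply(replace_substrings)

print(whole_cell.tolist())  # ['Water', 'Aqua Extract']
print(substring.tolist())   # ['Water', 'Water Extract']
```

The two approaches give the same result only when every key matches an entire cell. That may be why both solutions pass the checker, but it has not been verified here against the actual replace_dict.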

2-1) Cleaning the data in the Ingredients column of the Target DataFrame

Condition 1: If the string ends with a period ('.'), remove only that final period.
ex) 'Algae (Seaweed) Extract. Sea Salt.' -> 'Algae (Seaweed) Extract. Sea Salt'
Condition 2: If the string contains '. May Contain', remove '. May Contain' and everything after it.
ex) 'Algae (Seaweed) Extract. May Contain: Sea Salt, Fragrance' -> 'Algae (Seaweed) Extract'
Condition 3: The replace_str_dict below consists of current value (key): new value (value) pairs. Use this replace_str_dict to change the data.

def fixData(text):
    if pd.isna(text):  # return NaN values unchanged
        return text

    # Condition 1: if the string ends with a period, remove that final period
    if text.endswith("."):
        text = text[:-1]
    
    # Condition 2: if the string contains '. May Contain', drop it and everything after it
    if '. May Contain' in text:
        text = text.split('. May Contain')[0]
        
    # Condition 3: apply the replacements in replace_str_dict
    for old, new in replace_str_dict.items():
        text = text.replace(old, new)
        
    return text

df_target['Ingredients'] = df_target['Ingredients'].apply(fixData)

df_target

def replace_ingredients_str(ingredients_str: str):

    # Remove the trailing period (endswith also handles an empty string safely)
    ingredients_str = ingredients_str[:-1] if ingredients_str.endswith('.') else ingredients_str

    # delete
    del_list = ['. May Contain']
    for del_str in del_list:
        if del_str in ingredients_str:
            ingredients_str = ingredients_str[:ingredients_str.find(del_str)]

    replace_str_dict = {
        'Algae (Seaweed) Extract': 'Algae Extract',
        'Citrus Aurantifolia (Lime) Extract': 'Citrus Aurantifolia (Lime) Fruit Extract',
        'Eucalyptus Globulus (Eucalyptus) Leaf Oil': 'Eucalyptus Globulus Leaf Oil',
        'Galactomyces Ferment Filtrate (Pitera)': 'Galactomyces Ferment Filtrate',
        'Bacillus/Soybean/ Folic Acid Ferment Extract': 'Bacillus/Folic Acid/Soybean Ferment Extract',
        'Butyrospermum Parkii (Shea Butter)': 'Butyrospermum Parkii (Shea) Butter',
        'Sea Salt/Maris Sal/Sel Marin': 'Sea Salt',
        'Parfum/Fragrance': 'Fragrance|Perfume|Parfum',
        ', Fragrance': ', Fragrance|Perfume|Parfum',
        }


    # Replace
    for key, value in replace_str_dict.items():
        ingredients_str = ingredients_str.replace(key, value)

    return ingredients_str

df_target['Ingredients'] = df_target['Ingredients'].apply(replace_ingredients_str)

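✏️ Keeping the cut-off markers in a list such as del_list makes the code easier to change later: handling another marker only requires adding it to the list.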
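
The benefit shows up in a stripped-down version of conditions 1 and 2. This is a sketch, not the reference solution: `trim_ingredients` is a hypothetical name, and '. Contains' is an invented extra marker used only to show how the list extends.

```python
def trim_ingredients(text, del_list=(". May Contain",)):
    """Apply condition 1 (final period) and condition 2 (cut at a marker)."""
    # Condition 1: drop only the final period
    if text.endswith("."):
        text = text[:-1]
    # Condition 2: cut the text at each marker that is present
    for marker in del_list:
        position = text.find(marker)
        if position != -1:
            text = text[:position]
    return text

# The two examples from the problem statement
print(trim_ingredients("Algae (Seaweed) Extract. Sea Salt."))
print(trim_ingredients("Algae (Seaweed) Extract. May Contain: Sea Salt, Fragrance"))

# Supporting another marker means extending the list, not rewriting the logic
extended = (". May Contain", ". Contains")  # '. Contains' is hypothetical
print(trim_ingredients("Water, Glycerin. Contains: Nuts", del_list=extended))
```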

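2-2) Converting the data in the 'Ingredients' column of the Target DataFrame

Condition 1: Split each value in the 'Ingredients' column on ', ' (comma plus space) and convert it to a list.
Condition 2: If any element of the list from condition 1 has leading or trailing whitespace, remove it.
Condition 3: Create a new 'Ingredients List' column and store the list from conditions 1 and 2 in the matching row.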

# 2-2
# Condition 1
df_target['Ingredients List'] = df_target['Ingredients'].str.split(", ")

# Condition 2
df_target['Ingredients List'] = df_target['Ingredients List'].apply(lambda x: [ingredient.strip() for ingredient in x])

# Condition 3: already satisfied, since the column was created in condition 1

check_02_02(df_target)

df_target['Ingredients List'] = df_target['Ingredients'].apply(lambda each_ingredients_str: list(map(lambda x: x.strip(), each_ingredients_str.split(', '))))
check_02_02(df_target)

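✏️ strip(): removes leading and trailing whitespace from a string; given an argument, it removes those characters from both ends instead.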
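
A short check of how split and strip work together, using a made-up ingredient string:

```python
raw = "Water,  Glycerin , Sea Salt"

# Splitting on ', ' can leave stray spaces around an element
parts = raw.split(", ")
print(parts)  # ['Water', ' Glycerin ', 'Sea Salt']

# strip() removes whitespace at both ends only; the space inside 'Sea Salt' stays
cleaned = [part.strip() for part in parts]
print(cleaned)  # ['Water', 'Glycerin', 'Sea Salt']

# With an argument, strip() removes those characters from both ends instead
dotted = "..Sea Salt..".strip(".")
print(dotted)  # Sea Salt
```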

3-1) Mapping the 'Ingredients List' column of the Target DataFrame to a 'Code List' column

Using the ingredient dictionary (Ingredients Dictionary), map each ingredient in the Target DataFrame's 'Ingredients List' to its code, then create a 'Code List' column and store each Code List in the matching row.

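def ingredient_to_code(ingredient_list: list) -> list:

    code_list = []
    for ingredient in ingredient_list:
        try:
            # Look the name up among the standard English names first (case-insensitive)
            code = ingredients_df[ingredients_df['표준 영문명'].str.lower() == ingredient.lower()]['성분코드'].values[0]
        except IndexError:
            # No match: fall back to the old English names
            code = ingredients_df[ingredients_df['구영문명'].str.lower() == ingredient.lower()]['성분코드'].values[0]
        # Append outside of finally so that a failed lookup never appends a stale code
        code_list.append(code)
    return code_list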
    
df_target['Code List'] = df_target['Ingredients List'].apply(ingredient_to_code)
check_03_01(df_target)
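
As an alternative, the lookup can be built once as a plain dictionary instead of filtering the DataFrame for every ingredient. This is a sketch under stated assumptions: the rows and codes are made up, and `build_lookup` is a hypothetical helper. It also differs from the function above in two ways: it returns None for an unknown name instead of raising, and when a name occurs in several rows it keeps the last row rather than the first.

```python
import pandas as pd

# Made-up dictionary rows; the codes and names are placeholders
ingredients_df = pd.DataFrame({
    "성분코드": [101, 102],
    "표준 영문명": ["Water", "Butylene Glycol"],
    "구영문명": ["Aqua", None],
})

def build_lookup(df):
    """Map lower-cased English names to ingredient codes."""
    lookup = {}
    # Old names go in first so that standard names win if the same name appears in both
    for column in ["구영문명", "표준 영문명"]:
        for name, code in zip(df[column], df["성분코드"]):
            if isinstance(name, str):  # skip missing names
                lookup[name.lower()] = code
    return lookup

lookup = build_lookup(ingredients_df)

def ingredient_to_code(ingredient_list):
    # dict.get returns None for names that are not in the dictionary
    return [lookup.get(name.lower()) for name in ingredient_list]

codes = ingredient_to_code(["water", "AQUA", "Butylene Glycol", "Unknown"])
print(codes)
```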

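3-2) Find the codes that satisfy the following condition and get the DataFrame for those codes

Combine all the Code Lists of the Target DataFrame, without duplicates within each row, sort the codes that appear exactly twice in ascending order, take the first through the fifth, and use the ingredient dictionary (Ingredients Dictionary) to get the DataFrame for those codes.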

import itertools

code_list = list(itertools.chain(*df_target['Code List'].to_list())) # flatten the per-row lists into one list
sort_list = sorted(set((code, code_list.count(code)) for code in code_list), key=lambda x: (-x[1], x[0]))
result_code = [code for code, cnt in sort_list if cnt == 2][:5]

result_df = ingredients_df[ingredients_df['성분코드'].isin(result_code)]

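🔎 itertools: a module of functions that create iterators for efficient looping
itertools.chain(): yields the elements of each iterable in turn, one after another # chain('ABC', 'DEF') --> A B C D E F

🔎 Series.to_list(): converts a single column (a Series such as df['col']) to a list
To convert several columns, df[['col1', 'col2']].values.tolist() works (double brackets select a DataFrame)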
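
The counting step can also be written with collections.Counter, which tallies every code in a single pass instead of calling list.count for each element. The sketch below uses made-up Code Lists and removes duplicates inside each row with set(), as the problem statement asks. It is an alternative formulation, not the solution shown above.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Made-up Code Lists, only for illustration
df_target = pd.DataFrame({"Code List": [[3, 1, 1], [2, 3], [1, 2, 5], [5, 4], [4, 3]]})

# Series.to_list() turns the column into a plain list of lists
rows = df_target["Code List"].to_list()

# set(row) drops duplicates inside one row; chain.from_iterable flattens the rows
flat_codes = list(chain.from_iterable(set(row) for row in rows))

# Count how many rows each code appears in
counts = Counter(flat_codes)

# Codes that appear exactly twice, in ascending order, at most five of them
result_code = sorted(code for code, cnt in counts.items() if cnt == 2)[:5]
print(result_code)  # [1, 2, 4, 5]
```

Code 1 occurs twice in the first row. Without the set() step it would be counted three times and drop out of the result.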

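✏️ These problems taught me more about how to use dictionaries. When I first solved them, I could see that my code was not making good use of what dictionaries offer, and reviewing the solutions helped me get more comfortable with them.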
