pandas 기초! (value_counts()!, set_option(), columns, info(), describe(), values!!, drop()!, index)

김경민·2022년 12월 30일

read_csv()

read_csv()를 이용하여 csv파일을 편리하게 DataFrame으로 로딩합니다.
read_csv()의 sep인자를 콤마(,)가 아닌 다른 분리자로 변경하여 다른 유형의 파일도 로드가 가능합니다.

import pandas as pd
titanic_df = pd.read_csv('titanic_train.csv')

head() tail()
맨 앞 또는 맨 뒤부터 데이터를 가져올 수 있다. 기본 값은 5

titanic_df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

set_option()

pd.set_option('display.max_rows', 1000) # 행 개수
pd.set_option('display.max_colwidth', 100) # 컬럼 글자 너비
pd.set_option('display.max_columns', 100) # 컬럼 개수

DataFrame의 생성

dic1 = {'Name': ['Chulmin', 'Eunkyung','Jinwoong','Soobeom'],
        'Year': [2011, 2016, 2015, 2015],
        'Gender': ['Male', 'Female', 'Male', 'Male']
       }
# 딕셔너리를 DataFrame으로 변환
data_df = pd.DataFrame(dic1)
print(data_df)
print("#"*30)

# 새로운 컬럼명을 추가
data_df = pd.DataFrame(dic1, columns=["Name", "Year", "Gender", "Age"])
print(data_df)
print("#"*30)

# 인덱스를 새로운 값으로 할당. 
data_df = pd.DataFrame(dic1, index=['one','two','three','four'])
print(data_df)
print("#"*30)

       Name  Year  Gender
0   Chulmin  2011    Male
1  Eunkyung  2016  Female
2  Jinwoong  2015    Male
3   Soobeom  2015    Male
##############################
       Name  Year  Gender  Age
0   Chulmin  2011    Male  NaN
1  Eunkyung  2016  Female  NaN
2  Jinwoong  2015    Male  NaN
3   Soobeom  2015    Male  NaN
##############################
           Name  Year  Gender
one     Chulmin  2011    Male
two    Eunkyung  2016  Female
three  Jinwoong  2015    Male
four    Soobeom  2015    Male
##############################

columns, index, values

print("columns:",titanic_df.columns)
print("index:",titanic_df.index)
print("index value:", titanic_df.head().index.values)

columns: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
index: RangeIndex(start=0, stop=891, step=1)
index value: [0 1 2 3 4]

info()

DataFrame내의 컬럼명, 데이터 타입, Null건수, 데이터 건수 정보를 제공합니다.

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

describe()

데이터값들의 평균,표준편차,4분위 분포도를 제공합니다. 숫자형 컬럼들에 대해서 해당 정보를 제공합니다.

titanic_df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

value_counts()

동일한 개별 데이터 값이 몇건이 있는지 정보를 제공합니다. 즉 개별 데이터값의 분포도를 제공합니다.

value_counts() 메소드를 사용할 때는 Null 값을 무시하고 결과값을 내놓기 쉽습니다. value_counts()는 Null값을 포함하여 개별 데이터 값의 건수를 계산할지 여부를 dropna 인자로 판단합니다.

value_counts = titanic_df['Pclass'].value_counts()
print(value_counts)
print(titanic_df['Embarked'].value_counts(dropna=False)) # Null 값 포함한 분포도

3    491
1    216
2    184
Name: Pclass, dtype: int64
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

# DataFrame에서도 value_counts() 적용 가능. 
titanic_df[['Pclass', 'Embarked']].value_counts()

Pclass  Embarked
3       S           353
2       S           164
1       S           127
        C            85
3       Q            72
        C            66
2       C            17
        Q             3
1       Q             2
dtype: int64

DataFrame과 리스트, 딕셔너리, 넘파이 ndarray 상호 변환

넘파이 ndarray, 리스트, 딕셔너리를 DataFrame으로 변환하기

import numpy as np

col_name1=['컬럼']
list1 = [1, 2, 3]
array1 = np.array(list1)

print('리스트로 만든 DataFrame \n', pd.DataFrame(list1, columns=col_name1))
print('ndarray로 만든 DataFrame \n', pd.DataFrame(array1, columns=col_name1))

리스트로 만든 DataFrame 
    컬럼
0   1
1   2
2   3
ndarray로 만든 DataFrame 
    컬럼
0   1
1   2
2   3

col_name = ['이름', '나이', '성별']
list1 = [
    ['김경민', '25', '남'],
    ['김윤민', '27', '남'],
    ['이효리', '22', '여'],
]
array1 = np.array(list1)
print('2차원 리스트로 만든 DataFrame \n', pd.DataFrame(list1, columns=col_name))
print('2차원 ndarray로 만든 DataFrame \n', pd.DataFrame(array1, columns=col_name))

2차원 리스트로 만든 DataFrame 
     이름  나이 성별
0  김경민  25  남
1  김윤민  27  남
2  이효리  22  여
2차원 리스트로 만든 DataFrame 
     이름  나이 성별
0  김경민  25  남
1  김윤민  27  남
2  이효리  22  여

dic1 = {
    '이름' : ['김경민', '김윤민', '이효리'],
    '나이' : [25, 27, 22],
    '성별' : ['남', '남', '여'],
}
df_user = pd.DataFrame(dic1)
print('딕셔너리로 만든 DataFrame \n', df_user)

딕셔너리로 만든 DataFrame 
     이름  나이 성별
0  김경민  25  남
1  김윤민  27  남
2  이효리  22  여

DataFrame을 ndarray, 리스트, 딕셔너리로 변환하기

col_name = ['col1', 'col2', 'col3']
list1 = [
    [1,2,3],
    [2,3,4]
] 
df_list = pd.DataFrame(list1, columns = col_name)
print('DataFrame : \n',df_list)
print('DataFrame\'s values \n', df_list.values, 'type?', type(df_list.values))
print('DataFrame to list \n',df_list.values.tolist(), 'type?', type(df_list.values.tolist()))

DataFrame : 
    col1  col2  col3
0     1     2     3
1     2     3     4
DataFrame's values 
 [[1 2 3]
 [2 3 4]] type? <class 'numpy.ndarray'>
DataFrame to list 
 [[1, 2, 3], [2, 3, 4]] type? <class 'list'>

print('DataFrame to dict \n', df_list.to_dict('list'), 'type?', type(df_list.to_dict()))

DataFrame to dict 
 {'col1': [1, 2], 'col2': [2, 3], 'col3': [3, 4]} type? <class 'dict'>

DataFrame의 칼럼 데이터 세트 생성과 수정, 삭제

df_user['address'] = 'Seoul'
df_user

	이름	나이	성별	address
0	김경민	25	남	Seoul
1	김윤민	27	남	Seoul
2	이효리	22	여	Seoul

df_user['10년후'] = df_user['나이'] + 10
df_user

	이름	나이	성별	address	10년후
0	김경민	25	남	Seoul	35
1	김윤민	27	남	Seoul	37
2	이효리	22	여	Seoul	32

df_user['나이/10후나이'] = df_user['나이']/df_user['10년후']
df_user

	이름	나이	성별	address	10년후	나이/10후나이
0	김경민	25	남	Seoul	35	0.714286
1	김윤민	27	남	Seoul	37	0.729730
2	이효리	22	여	Seoul	32	0.687500

df_user = df_user.drop('address', axis=1) # 열을 지울 때는 label과 axis 1
df_user = df_user.drop(index = 0, axis=0) # 행을 지울 때는 index과 axis 0

	이름	나이	성별	10년후	나이/10후나이
1	김윤민	27	남	37	0.729730
2	이효리	22	여	32	0.687500

index 객체

print(titanic_df.head().index)
col_name = ['이름', '나이', '성별']
list1 = [
    ['김경민', '25', '남'],
    ['김윤민', '27', '남'],
    ['이효리', '22', '여'],
]
df_user=pd.DataFrame(list1,index=['1짱','2짱','3짱'], columns=col_name)
print(df_user.index)

RangeIndex(start=0, stop=5, step=1)
Index(['1짱', '2짱', '3짱'], dtype='object')

indexes = df_user.index
print(indexes,type(indexes.values))
print(indexes[0])

Index(['1짱', '2짱', '3짱'], dtype='object') <class 'numpy.ndarray'>
1짱

indexes[0] = '5짱' # 변경불가

Cell In [83], line 1
----> 1 indexes[0] = '5짱'

TypeError: Index does not support mutable operations

김경민

안녕하세요

이전 포스트

카카오 REST API 소셜 로그인 feat. Django

다음 포스트

pandas 기초! (value_counts()!, set_option(), columns, info(), describe(), values!!, drop()!, index)

DataFrame의 생성

DataFrame과 리스트, 딕셔너리, 넘파이 ndarray 상호 변환

DataFrame의 칼럼 데이터 세트 생성과 수정, 삭제

index 객체

카카오 REST API 소셜 로그인 feat. Django

TIL 2022.11.14 ~ 11.20

0개의 댓글

관련 채용 정보