[pandas] Overview

kkiyou · June 20, 2021

pandas python

Data Science

Reference material

pandas

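pandas is a Python library for data analysis, built on top of NumPy. Unlike a NumPy array, which holds a single data type, a pandas DataFrame can hold a different data type in each column. (Reference: pandas Package overview)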

pandas is conventionally imported under the alias pd. Sticking to widely used conventions like this one keeps code readable and easy to share.

import pandas as pd

pandas provides the following data structures.

Name	Dimensions	Description
Series	1	1D labeled homogeneously-typed array
DataFrame	2	General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed column

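A Series is a one-dimensional labeled array whose values share a single dtype. Values of mixed Python types can still be stored, but they all end up under the generic object dtype.
A DataFrame is a two-dimensional labeled table whose columns may each have a different data type. It is comparable to an MS Excel sheet.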
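
The difference between the two structures can be checked through their dtypes. The snippet below is a minimal sketch; the variable names and the sample values are made up for illustration.

```python
import pandas as pd

# A Series stores all of its values under one dtype.
s = pd.Series([1, 2, 3])
print(s.dtype)

# Mixing Python types in one Series falls back to the generic object dtype.
mixed = pd.Series([1, "a", 2.5])
print(mixed.dtype)  # object

# Each DataFrame column is its own Series, so dtypes can differ per column.
df = pd.DataFrame({"count": [1, 2, 3],
                   "price": [0.5, 1.5, 2.5],
                   "in_stock": [True, False, True]})
print(df.dtypes)
```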

Because pandas is built on NumPy, checking type(df.values) for a DataFrame df returns numpy.ndarray.
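
A short sketch of that check follows. The DataFrame here is an arbitrary example; to_numpy() is included because the pandas documentation recommends it over .values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

arr = df.values
print(type(arr))  # <class 'numpy.ndarray'>

# The array has one dtype for all columns; int and float columns
# are combined into float64.
print(arr.dtype)  # float64

# to_numpy() returns the same data.
same = df.to_numpy()
print(np.array_equal(arr, same))  # True
```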

1. Series

class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

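data
An array-like, Iterable, dict, or scalar value.
When a dict is passed as data, its keys become the index labels and its values become the Series values, and the dict's order is preserved. Note that if index is passed together with a dict, each index label is looked up among the dict keys, so every label that is not a key gets NaN (Not a Number). In the second example below, none of the labels 1, 2, 3 is a key, so every value is NaN.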

>>> dic = {'a': "apple",
             'b': "banana",
             'c': "cherry",}
>>> print(pd.Series(dic))
a     apple
b    banana
c    cherry
dtype: object

>>> pd.Series(data=dic, index=[1, 2, 3])
1    NaN
2    NaN
3    NaN
dtype: object
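
To make the lookup behaviour concrete, here is a minimal sketch in which only some index labels match the dict keys. The label "z" is a made-up example of a label with no matching key.

```python
import pandas as pd

dic = {"a": "apple", "b": "banana", "c": "cherry"}

# "a" and "c" are dict keys and keep their values;
# "z" is not a key, so its value is NaN.
partial = pd.Series(data=dic, index=["a", "c", "z"])
print(partial)

# isna() marks the labels that had no matching key.
print(partial.isna().tolist())  # [False, False, True]
```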

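index
For list-like data, its length must equal the number of elements in data.
If index is omitted, integer labels from 0 to n-1 are assigned automatically. If index is given, its values are used as the labels.
Duplicate labels are allowed in an index, but they are not recommended.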

>>> data = ['a', 'b', 'c', 'd']
>>> index = [1, 2, 3, 3]
>>> inde = [1, 2, 3]
>>> indexx = [1, 2, 3, 4, 5]

>>> pd.Series(data)
0    a
1    b
2    c
3    d
dtype: object

>>> pd.Series(data=data, index=inde)
ValueError: Length of passed values is 4, index implies 3.
>>> pd.Series(data=data, index=indexx)
ValueError: Length of passed values is 4, index implies 5.

>>> ser = pd.Series(data=data, index=index)
>>> ser
1    a
2    b
3    c
3    d
dtype: object
>>> ser.index
Int64Index([1, 2, 3, 3], dtype='int64')
>>> ser.values
array(['a', 'b', 'c', 'd'], dtype=object)

>>> ser[3]
3    c
3    d
dtype: object
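
One reason duplicate labels are discouraged is that the type of a lookup result depends on the label. The sketch below rebuilds the same Series and uses .loc together with Index.is_unique and Index.duplicated(), which are not used in the original post.

```python
import pandas as pd

ser = pd.Series(data=["a", "b", "c", "d"], index=[1, 2, 3, 3])

# is_unique reports whether any label appears more than once.
print(ser.index.is_unique)  # False

# A unique label returns a scalar, a duplicated label returns a Series.
unique_hit = ser.loc[1]
dup_hit = ser.loc[3]
print(type(unique_hit).__name__, type(dup_hit).__name__)  # str Series

# duplicated() flags every repeat after the first occurrence.
print(ser.index.duplicated().tolist())  # [False, False, False, True]
```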

pandas.Series

2. DataFrame

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

A two-dimensional table made up of rows and columns.

pandas.DataFrame

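Characteristics
1. If index and columns are None and data carries no labels of its own, both are numbered automatically from 0 to n-1.
2. The number of rows in data must equal the length of index, and the number of columns must equal the length of columns.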
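
The second rule can be sketched as follows. The labels r1, r2, r3 and a, b, c are hypothetical, and the exact wording of the error message varies between pandas versions, so only the exception type is relied on here.

```python
import pandas as pd

data = [[1, 2, 3],
        [10, 20, 30]]  # 2 rows, 3 columns

# Matching lengths: 2 index labels and 3 column labels.
ok = pd.DataFrame(data, index=["r1", "r2"], columns=["a", "b", "c"])
print(ok.shape)  # (2, 3)

# Three index labels for two rows of data raises a ValueError.
error_message = None
try:
    pd.DataFrame(data, index=["r1", "r2", "r3"], columns=["a", "b", "c"])
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)
```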

>>> data = [[1, 2, 3],
            [10, 20, 30],
            [100, 200, 300]]
>>> pd.DataFrame(data)

	0	1	2
0	1	2	3
1	10	20	30
2	100	200	300

>>> dic = {"Alphabet": ['a', 'b', 'c'],
           "Number": [1, 2, 3],
           "Special Character": ['@', '#', '$'],}
>>> pd.DataFrame(data=dic)

	Alphabet	Number	Special Character
0	a	1	@
1	b	2	#
2	c	3	$

# Each inner list is one row, so the values are listed row by row.
>>> data = [['a', 1, '@'],
            ['b', 2, '#'],
            ['c', 3, '$']]
>>> pd.DataFrame(data=data, columns=["Alphabet", "Number", "Special Character"])

	Alphabet	Number	Special Character
0	a	1	@
1	b	2	#
2	c	3	$

3. pandas arrays

Kind of Data	pandas Data Type	Scalar	Array
TZ-aware datetime	DatetimeTZDtype	Timestamp	Datetime data
Timedeltas	(none)	Timedelta	Timedelta data
Period (time spans)	PeriodDtype	Period	Timespan data
Intervals	IntervalDtype	Interval	Interval data
Nullable Integer	Int64Dtype, …	(none)	Nullable integer
Categorical	CategoricalDtype	(none)	Categorical data
Sparse	SparseDtype	(none)	Sparse data
Strings	StringDtype	str	Text data
Boolean (with NA)	BooleanDtype	bool	Boolean data with missing values
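
A few of the dtypes in the table can be requested by name when building a Series. This is a minimal sketch with made-up values, assuming a pandas version (1.0 or later) in which the nullable Int64, boolean, and string dtypes are available.

```python
import pandas as pd

# Nullable integer: a missing value does not force a float dtype.
ints = pd.Series([1, None, 3], dtype="Int64")

# Boolean with NA.
flags = pd.Series([True, None, False], dtype="boolean")

# Dedicated string dtype instead of object.
text = pd.Series(["a", None, "c"], dtype="string")

# TZ-aware datetime.
stamps = pd.Series(pd.to_datetime(["2021-06-20"]).tz_localize("UTC"))

for s in (ints, flags, text, stamps):
    print(s.dtype)
```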

pandas arrays

3.1. Category

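The category dtype suits variables that take only a few distinct values, such as sex, blood type, or grade. Converting such a column to category can reduce the size of the data.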

>>> data = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

>>> data.sex
0      Female
1        Male
2        Male
3        Male
4      Female
        ...  
239      Male
240    Female
241      Male
242      Male
243    Female
Name: sex, Length: 244, dtype: object

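# Copy first: a plain assignment would make data2 another name for the same DataFrame.
>>> data2 = data.copy()
>>> data2["sex"] = data2["sex"].astype("category")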
>>> data2.sex
0      Female
1        Male
2        Male
3        Male
4      Female
        ...  
239      Male
240    Female
241      Male
242      Male
243    Female
Name: sex, Length: 244, dtype: category
Categories (2, object): ['Female', 'Male']

# The size of the DataFrame decreased.
>>> import sys
>>> sys.getsizeof(data)
65553
>>> sys.getsizeof(data2)
50971
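
The same effect can be reproduced without downloading the CSV. The sketch below uses a synthetic column of 10,000 values (an assumption for illustration, not the tips dataset) and memory_usage(deep=True), which also counts the Python string objects behind an object column.

```python
import pandas as pd

# Synthetic column with only two distinct values.
sex = pd.Series(["Female", "Male"] * 5000, dtype=object)
as_category = sex.astype("category")

# deep=True includes the memory held by the string objects themselves.
object_bytes = sex.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# A categorical stores each distinct value once plus small integer codes.
print(list(as_category.cat.categories))  # ['Female', 'Male']
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.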

Categorical data
