Python - Data Libraries

hanss · April 7, 2022

What I Learned

Numpy

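NumPy is the most fundamental of these libraries. Its arrays are better suited to numerical work than Python's built-in lists, which is why so many people use it.
It can also represent multidimensional data.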

Creating an ndarray

import numpy as np	# refer to numpy as np from here on
arr = np.array([1,2,3,4])	# create an array named arr
type(arr)	# when in doubt, check with type()

# Result
numpy.ndarray # confirms that a NumPy array was created

np.zeros((3,3)) # create a 3x3 array initialized to 0
Output)
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
       
np.ones((3,3)) # create a 3x3 array initialized to 1
Output)
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
       
np.empty((4,4)) # only allocates space without initializing it; the fastest option, but the contents are arbitrary
Output)
array([[6.23042070e-307, 4.67296746e-307, 1.69121096e-306,
        1.33511018e-306],
       [8.34441742e-308, 1.78022342e-306, 6.23058028e-307,
        9.79107872e-307],
       [6.89807188e-307, 7.56594375e-307, 6.23060065e-307,
        1.78021527e-306],
       [8.34454050e-308, 1.11261027e-306, 1.15706896e-306,
        1.33512173e-306]])
        
np.arange(10) # integers from 0 up to, but not including, 10
Output)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

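Using float

# astype returns a converted copy; the values in arr itself are unchanged
arr = np.array([[1,2,3],[4,5,6]])	# a 2x3 integer array, matching the output below
arr_float = arr.astype(np.float64)
print(arr_float)
Output)
[[1. 2. 3.]
 [4. 5. 6.]]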
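
To make the "returns a copy" point concrete, here is a minimal sketch. The variable names are just for illustration; it only assumes NumPy is installed.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
arr_float = arr.astype(np.float64)  # a new array; arr is left alone

arr_float[0, 0] = 99.5              # change only the converted copy

print(arr_float.dtype)  # float64
print(arr[0, 0])        # 1 (the original still holds the integer)
print(arr_float[0, 0])  # 99.5
```

Because astype hands back a new array, the result has to be assigned to a name, as above, if you want to keep it.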

Operations

Exercise)
arr1 = np.array([[1,2],[3,4]])
arr2 = np.array([[5,6],[7,8]])
arr3 = arr1 + arr2
print(arr3)
Output)
[[ 6  8]
 [10 12]]

Elements at the same position were added together.

arr1 = np.array([[1,2],[3,4]])
arr2 = np.array([[5,6],[7,8]])

arr3 = arr1 + arr2
print(arr3)
arr3 = np.add(arr1, arr2)
print(arr3)
arr3 = arr1 * arr2
print(arr3)
arr3 = np.multiply(arr1, arr2)
print(arr3)

Result)
[[ 6  8]
 [10 12]]
[[ 6  8]
 [10 12]]
[[ 5 12]
 [21 32]]
[[ 5 12]
 [21 32]]
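
For comparison, here is a small sketch of how the same operators behave on plain Python lists versus NumPy arrays. This is only an illustration with made-up names, not part of the class material.

```python
import numpy as np

# With lists, + concatenates instead of adding element by element.
list_sum = [1, 2] + [3, 4]

# With arrays, + and * work element by element.
arr_sum = np.array([1, 2]) + np.array([3, 4])
arr_prod = np.array([1, 2]) * np.array([3, 4])

# A single number is applied to every element.
scaled = np.array([1, 2]) * 10

print(list_sum)  # [1, 2, 3, 4]
print(arr_sum)   # [4 6]
print(arr_prod)  # [3 8]
print(scaled)    # [10 20]
```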

Slicing arrays

arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr_1 = arr[:2,1:3]
print(arr_1)

Output)
[[2 3]
 [5 6]]

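Explanation
In arr[:2,1:3]:
:2 selects rows.
With nothing before the colon (:), the slice starts at the beginning, so this means rows 0 and 1 (up to, but not including, index 2).
1:3 selects columns,
from index 1 up to, but not including, index 3.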
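
The row part and the column part of the slice can be looked at separately. A short sketch, with variable names chosen just for this example:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

rows = arr[:2]        # rows 0 and 1, all columns
cols = arr[:, 1:3]    # all rows, columns 1 and 2
both = arr[:2, 1:3]   # rows 0 and 1, columns 1 and 2

# A third number in a slice is the step.
stepped = np.arange(10)[:8:2]

print(rows)     # [[1 2 3] [4 5 6]]
print(cols)     # [[2 3] [5 6] [8 9]]
print(both)     # [[2 3] [5 6]]
print(stepped)  # [0 2 4 6]
```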

arr_int = np.arange(10)
print(arr_int)
Output)
[0 1 2 3 4 5 6 7 8 9]

print(arr_int)
print(arr_int[:5])
print(arr_int[5:])
print(arr_int[:8])
print(arr_int[:8:2])	# every second element of the first eight
Output)
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4]
[5 6 7 8 9]
[0 1 2 3 4 5 6 7]
[0 2 4 6]

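More NumPy features

arr = np.array([[1,2,3],[4,5,6]])	# must be an ndarray; a plain list does not support arr > 3
idx = arr > 3 
print(idx)
Output)
[[False False False]
 [ True  True  True]]
# idx stores, for each element, whether it is greater than 3

print(arr[idx])
Output)
[4 5 6]
# only the elements where idx is True are printed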
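
A boolean array like idx can be used for more than picking values out. The sketch below counts the matches and replaces them; np.where is a standard NumPy function, but this particular use is my own example rather than something from the class.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
mask = arr > 3

count = int(mask.sum())          # True counts as 1, so this counts the matches
capped = np.where(mask, 3, arr)  # where mask is True use 3, otherwise keep arr

print(count)   # 3
print(capped)  # [[1 2 3] [3 3 3]]
```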

Loading and using a CSV file

Uses the sample data 'winequality-red.csv'

redwine = np.loadtxt(fname='samples/winequality-red.csv',delimiter=';',skiprows=1)

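# check the shape of the array
redwine.shape	# shape is an attribute, not a method, so no parentheses
Result)
(1599, 12)
# the data has 1599 rows and 12 columns

Make a habit of checking this whenever you load new data.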
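
If the sample file is not at hand, the same loading steps can be tried on a tiny in-memory CSV. The text below is made up for the example; np.loadtxt accepts a file-like object as well as a file name.

```python
import io
import numpy as np

# Hypothetical miniature of a ';'-separated file with one header row.
csv_text = "a;b;c\n1.0;2.0;3.0\n4.0;5.0;6.0\n"

data = np.loadtxt(io.StringIO(csv_text), delimiter=";", skiprows=1)

print(data.shape)  # (2, 3): 2 rows, 3 columns
print(data.dtype)  # float64
```

skiprows=1 skips the header line, which loadtxt could not convert to numbers.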

Basic statistical analysis with NumPy

# sum
print(redwine.sum()) 
Output)
152084.78194

# mean
print(redwine.mean())
Output)
7.926036165311652

Getting the mean or sum of specific columns

# average within each column
# axis selects the direction: axis=0 gives one value per column, axis=1 gives one value per row
print(redwine.mean(axis=0))
Result)
[ 8.31963727  0.52782051  0.27097561  2.5388055   0.08746654 15.87492183
 46.46779237  0.99674668  3.3111132   0.65814884 10.42298311  5.63602251]

# use slicing to take the mean of only the data you want
print(redwine[:,0].mean())
Result)
8.31963727329581
# took only the first column of the whole dataset and averaged it
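
The axis argument is easier to see on a small array. A minimal sketch with made-up numbers:

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

col_means = data.mean(axis=0)  # one mean per column
row_means = data.mean(axis=1)  # one mean per row
first_col = data[:, 0].mean()  # mean of the first column only

print(col_means)  # [2.5 3.5 4.5]
print(row_means)  # [2. 5.]
print(first_col)  # 2.5
```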

Pandas

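Pandas provides the features needed to carry out real data analysis.

Two things matter most in pandas:
the Series object and the DataFrame object.
A table is a DataFrame object,
and each of its columns is a Series object.
A DataFrame is a collection of Series objects, and plays a role similar to a sheet in Excel.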
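
The relationship between the two objects can be checked directly. This is a small sketch with illustrative data:

```python
import pandas as pd

frame = pd.DataFrame({"fruitPrice": [2500, 3800], "num": [10, 5]},
                     index=["apple", "banana"])

column = frame["fruitPrice"]  # pulling out one column gives a Series

print(type(frame).__name__)              # DataFrame
print(type(column).__name__)             # Series
print(column.index.equals(frame.index))  # True: the column shares the table's index
```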

Series

import pandas as pd
from pandas import Series, DataFrame
# the values that go into a single column
fruit = Series([2500,3800,1200,6000],
              index=['apple','banana','pear','cherry'])	# attach an index
fruit

Result)
apple     2500
banana    3800
pear      1200
cherry    6000
dtype: int64

print(fruit.values) # values only
Result)
[2500 3800 1200 6000]
print(fruit.index) # index only
Result)
Index(['apple', 'banana', 'pear', 'cherry'], dtype='object')

Creating a Series from a dictionary

fruitdata = {'apple':2500, 'banana':3800, 'pear':1200, 'cherry':6000}
fruit = Series(fruitdata)	# the name must match fruitdata exactly (Python is case-sensitive)
print(type(fruitdata))
print(type(fruit))
Output)
<class 'dict'> # a dictionary
<class 'pandas.core.series.Series'> # a pandas Series object

print(fruit)
Result)
apple     2500
banana    3800
pear      1200
cherry    6000
dtype: int64

Setting a name

fruit.name = 'fruitPrice'
fruit
Output)
apple     2500
banana    3800
pear      1200
cherry    6000
Name: fruitPrice, dtype: int64

Setting the index name

fruit.index.name = 'fruitName'
fruit
Output)
fruitName		# the index name has been set
apple     2500
banana    3800
pear      1200
cherry    6000
Name: fruitPrice, dtype: int64

DataFrame

fruitData = {'fruitName':['apple','banana','cherry','pear'],
            'fruitPrice':[2500, 3800,6000,1200],
             'num':[10,5,3,8]
            }
# create fruitFrame with 3 columns
fruitFrame = DataFrame(fruitData)
print(fruitFrame)
Output)
  fruitName  fruitPrice  num
0     apple        2500   10
1    banana        3800    5
2    cherry        6000    3
3      pear        1200    8

Specifying the column order

fruitFrame = DataFrame(fruitData, columns=['fruitPrice', 'num', 'fruitName'])
fruitFrame
Output)
   fruitPrice  num fruitName
0        2500   10     apple
1        3800    5    banana
2        6000    3    cherry
3        1200    8      pear
# the columns are displayed in the new order

Selecting a single column

fruitFrame['fruitName']
Result)
0     apple
1    banana
2    cherry
3      pear
Name: fruitName, dtype: object

# does the same thing
fruitFrame.fruitName
Output)
0     apple
1    banana
2    cherry
3      pear
Name: fruitName, dtype: object

Adding a new column

fruitFrame['Year'] = '2022'
fruitFrame
Output)
   fruitPrice  num fruitName  Year
0        2500   10     apple  2022
1        3800    5    banana  2022
2        6000    3    cherry  2022
3        1200    8      pear  2022
# the Year column has been added

Changing every value in Year

fruitFrame['Year'] = '2016'
fruitFrame
Output)
   fruitPrice  num fruitName  Year
0        2500   10     apple  2016
1        3800    5    banana  2016
2        6000    3    cherry  2016
3        1200    8      pear  2016
# same syntax as adding a column

Changing only some values, using a Series object

variable = Series([4,2,1],index=[0,2,3])
print(variable)
Result)
0    4
2    2
3    1
dtype: int64


fruitFrame['stock'] = variable
print(fruitFrame)
Result)
   fruitPrice  num fruitName  Year  stock
0        2500   10     apple  2016    4.0
1        3800    5    banana  2016    NaN
2        6000    3    cherry  2016    2.0
3        1200    8      pear  2016    1.0
# values were placed only at index 0, 2 and 3 of the stock column; index 1 has no match, so it is NaN
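
The NaN appears because pandas matches the Series to the DataFrame by index label, not by position. A minimal sketch of that behavior follows; filling the gap with 0 is my own assumption for illustration, not something the class prescribed.

```python
import pandas as pd

frame = pd.DataFrame({"num": [10, 5, 3, 8]})      # index 0, 1, 2, 3
partial = pd.Series([4, 2, 1], index=[0, 2, 3])   # no value for index 1

frame["stock"] = partial  # aligned by label; row 1 gets NaN

# NaN forces the column to float; fill the gap and convert back if needed.
filled = frame["stock"].fillna(0).astype(int)

print(frame["stock"].isna().tolist())  # [False, True, False, False]
print(filled.tolist())                 # [4, 0, 2, 1]
```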

Working with data in Series and DataFrame

Deleting a row

fruit = Series([2500,3800,1200,6000],index=['apple','banana','pear','cherry'])
print(fruit)
Result)
apple     2500
banana    3800
pear      1200
cherry    6000
dtype: int64

new_fruit = fruit.drop('banana')
print(new_fruit)
Result)
apple     2500
pear      1200
cherry    6000
dtype: int64

Using text labels as the index

# first, take just fruitName to use as the index
fruitName = fruitData['fruitName']
print(fruitName)
Result)
['apple', 'banana', 'cherry', 'pear']

fruitFrame = DataFrame(fruitData,
                      index=fruitName,	# use fruitName as the index
                      columns=['fruitPrice','num'])
fruitFrame
Result)
        fruitPrice  num
apple         2500   10
banana        3800    5
cherry        6000    3
pear          1200    8

Deleting specific rows

fruitFrame2 = fruitFrame.drop(['apple','cherry'])
fruitFrame2
Output)
        fruitPrice  num
banana        3800    5
pear          1200    8
# the apple and cherry rows are gone

Deleting a column

# in a DataFrame, rows are axis 0 and columns are axis 1; drop works on rows by default, so pass axis=1 to drop a column
fruitFrame3 = fruitFrame.drop('num', axis =1)
fruitFrame3
Output)
        fruitPrice
apple         2500
banana        3800
cherry        6000
pear          1200
# the num column is gone
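
drop also accepts index= and columns= keywords, which some people find easier to read than axis numbers. A short sketch with illustrative data, which also shows that drop returns a new object and leaves the original alone:

```python
import pandas as pd

frame = pd.DataFrame({"fruitPrice": [2500, 3800], "num": [10, 5]},
                     index=["apple", "banana"])

no_num = frame.drop(columns="num")     # same effect as frame.drop("num", axis=1)
no_apple = frame.drop(index="apple")   # same effect as frame.drop("apple")

print(list(no_num.columns))   # ['fruitPrice']
print(list(no_apple.index))   # ['banana']
print(list(frame.columns))    # ['fruitPrice', 'num'] (original unchanged)
```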

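Slicing a Series

new_fruit['apple':'pear'] # from 'apple' through 'pear' (label slices include the end); new_fruit is the Series without 'banana'
Output)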
apple    2500
pear     1200
dtype: int64

Slicing a DataFrame

- A DataFrame can be sliced by row label in the same way
fruitFrame['apple':'banana']
Output)
        fruitPrice  num
apple         2500   10
banana        3800    5

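Operations

Series operations

# pass the index as a list; a set ({...}) has no defined order, so the labels may not pair up with the values as written
fruit1 = Series([5,9,10,3], index=['apple','banana','cherry','pear'])
fruit2 = Series([3,2,9,5,10], index=['apple','orange','banana','cherry','mango'])
fruit1 + fruit2
Result)
apple      8.0
banana    18.0
cherry    15.0
mango      NaN
orange     NaN
pear       NaN
dtype: float64
# labels present in both Series get a computed result
# labels present in only one of them cannot be computed and become NaN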
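
If NaN is not what you want for labels that exist on only one side, Series.add takes a fill_value that stands in for the missing side. This is a sketch on similar data; treating a missing fruit as 0 is an assumption made for the example.

```python
import pandas as pd

fruit1 = pd.Series([5, 9, 10, 3],
                   index=["apple", "banana", "cherry", "pear"])
fruit2 = pd.Series([3, 2, 9, 5, 10],
                   index=["apple", "orange", "banana", "cherry", "mango"])

plain = fruit1 + fruit2                   # NaN where a label is missing on one side
total = fruit1.add(fruit2, fill_value=0)  # missing side counted as 0

print(int(plain.isna().sum()))  # 3
print(total["banana"])          # 18.0
print(total["pear"])            # 3.0
```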

DataFrame operations

fruitData1 = {'Ohio' : [4,8,3,5],'Texas' : [0,1,2,3]}
fruitFrame1 = DataFrame(fruitData1,columns=['Ohio','Texas'],index = ['apple','banana','cherry','pear'])
fruitData2 = {'Ohio' : [3,0,2,1,7],'Colorado':[5,4,3,6,0]}
fruitFrame2 = DataFrame(fruitData2,columns =['Ohio','Colorado'],index = ['apple','orange','banana','cherry','mango'])

fruitFrame1 + fruitFrame2
Output)
        Colorado  Ohio  Texas
apple        NaN   7.0    NaN
banana       NaN  10.0    NaN
cherry       NaN   4.0    NaN
mango        NaN   NaN    NaN
orange       NaN   NaN    NaN
pear         NaN   NaN    NaN
# only cells whose row label and column exist in both frames are computed; everything else is NaN

Sorting

fruit = new_fruit	# continue with the Series from which 'banana' was dropped
fruit
Result)
apple     2500
pear      1200
cherry    6000
dtype: int64

fruit.sort_values()
Result) # sorted by price
pear      1200
apple     2500
cherry    6000
dtype: int64

# pass an option to get the reverse order
# ascending defaults to True
fruit.sort_values(ascending=False)
Result) 
cherry    6000
apple     2500
pear      1200
dtype: int64

# sort by index; ascending works here too
fruit.sort_index()
Result) 
apple     2500
cherry    6000
pear      1200
dtype: int64

# sorting a DataFrame
fruitFrame
Result)
        fruitPrice  num
apple         2500   10
banana        3800    5
cherry        6000    3
pear          1200    8

fruitFrame.sort_index()
Result)
        fruitPrice  num
apple         2500   10
banana        3800    5
cherry        6000    3
pear          1200    8
# the index was already in alphabetical order, so nothing changes

# sort by a column
fruitFrame.sort_values(by=['fruitPrice'])
Result)
        fruitPrice  num
pear          1200    8
apple         2500   10
banana        3800    5
cherry        6000    3
# more than one sort key can be given
fruitFrame.sort_values(by=['fruitPrice','num'])
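
A second key only matters when the first key has ties, so the sketch below uses made-up data with two fruits at the same price. Passing a list to ascending sets the direction per key.

```python
import pandas as pd

# Hypothetical data with a tie on fruitPrice.
frame = pd.DataFrame({"fruitPrice": [2500, 2500, 1200], "num": [10, 5, 8]},
                     index=["apple", "banana", "pear"])

# Price ascending; among equal prices, num descending.
ordered = frame.sort_values(by=["fruitPrice", "num"], ascending=[True, False])

print(list(ordered.index))  # ['pear', 'apple', 'banana']
```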

Analyzing CSV data

[Practice file] german_credit.csv

german = pd.read_csv('http://freakonometrics.free.fr/german_credit.csv')
# get the column names (returned as a NumPy array)
german.columns.values
Result)
array(['Creditability', 'Account Balance', 'Duration of Credit (month)',
       'Payment Status of Previous Credit', 'Purpose', 'Credit Amount',
       'Value Savings/Stocks', 'Length of current employment',
       'Instalment per cent', 'Sex & Marital Status', 'Guarantors',
       'Duration in Current address', 'Most valuable available asset',
       'Age (years)', 'Concurrent Credits', 'Type of apartment',
       'No of Credits at this Bank', 'Occupation', 'No of dependents',
       'Telephone', 'Foreign Worker'], dtype=object)
       
# it can also be viewed as a list
list(german.columns.values)
Result)
['Creditability',
 'Account Balance',
 'Duration of Credit (month)',
 'Payment Status of Previous Credit',
 'Purpose',
 'Credit Amount',
 'Value Savings/Stocks',
 'Length of current employment',
 'Instalment per cent',
 'Sex & Marital Status',
 'Guarantors',
 'Duration in Current address',
 'Most valuable available asset',
 'Age (years)',
 'Concurrent Credits',
 'Type of apartment',
 'No of Credits at this Bank',
 'Occupation',
 'No of dependents',
 'Telephone',
 'Foreign Worker']

Selecting only the columns you want

german_sample = german[['Creditability','Duration of Credit (month)','Purpose', 'Credit Amount']]
german_sample
Result)
	Creditability	Duration of Credit (month)	Purpose	Credit Amount
0	1	18	2	1049
1	1	9	0	2799
2	1	12	9	841
3	1	12	0	2122
4	1	12	0	2171
...	...	...	...	...
995	0	24	3	1987
996	0	24	0	2303
997	0	21	0	12680
998	0	12	3	6468
999	0	30	2	6350

# minimum of each column
german_sample.min()
Result)
Creditability                   0
Duration of Credit (month)      4
Purpose                         0
Credit Amount                 250
dtype: int64

# maximum of each column
german_sample.max()
Result)
Creditability                     1
Duration of Credit (month)       72
Purpose                          10
Credit Amount                 18424
dtype: int64

# mean
german_sample.mean()
Result)
Creditability                    0.700
Duration of Credit (month)      20.903
Purpose                          2.828
Credit Amount                 3271.248
dtype: float64

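Showing summary information for the whole dataset

# describe is a method, so it needs parentheses: german_sample.describe()
# without them, as below, the notebook shows the bound method itself
# (with the DataFrame's contents embedded) instead of the summary statistics
german_sample.describe
Result)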
<bound method NDFrame.describe of      Creditability  Duration of Credit (month)  Purpose  Credit Amount
0                1                          18        2           1049
1                1                           9        0           2799
2                1                          12        9            841
3                1                          12        0           2122
4                1                          12        0           2171
..             ...                         ...      ...            ...
995              0                          24        3           1987
996              0                          24        0           2303
997              0                          21        0          12680
998              0                          12        3           6468
999              0                          30        2           6350

[1000 rows x 4 columns]>
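
For reference, here is a sketch of what calling describe() with parentheses returns, using a tiny made-up frame rather than the credit data:

```python
import pandas as pd

sample = pd.DataFrame({"amount": [100, 200, 300, 400]})

summary = sample.describe()  # note the parentheses

print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary.loc["mean", "amount"])  # 250.0
print(summary.loc["max", "amount"])   # 400.0
```

The result is itself a DataFrame, so individual statistics can be picked out with loc as shown.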

Checking correlation coefficients

german_sample = german[['Duration of Credit (month)','Credit Amount','Age (years)']]
german_sample.corr()
Result)
                            Duration of Credit (month)  Credit Amount  Age (years)
Duration of Credit (month)                    1.000000       0.624988    -0.037550
Credit Amount                                 0.624988       1.000000     0.032273
Age (years)                                  -0.037550       0.032273     1.000000
# how strongly each pair of columns is related: values range from -1 to 1, and 1 is the strongest positive relationship
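
To get a feel for the two ends of the scale, here is a sketch with made-up columns where one pair rises together exactly and another moves in exactly opposite directions:

```python
import pandas as pd

frame = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # always 2 * x
    "z": [4, 3, 2, 1],   # falls as x rises
})

corr = frame.corr()  # Pearson correlation by default

print(round(corr.loc["x", "y"], 6))  # 1.0
print(round(corr.loc["x", "z"], 6))  # -1.0
```

Values near 0, like the Age (years) entries above, mean there is little linear relationship.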

# select the credit amount and the type of apartment
german_sample = german[['Credit Amount','Type of apartment']]
german_sample
Result)
	Credit Amount	Type of apartment
0		1049		1
1		2799		1
2		841			1
3		2122		1
4		2171		2
...		...			...
995		1987		1
996		2303		2
997		12680		3
998		6468		2
999		6350		2
1000 rows × 2 columns

Credit Amount statistics by type of apartment

# group
german_grouped = german_sample['Credit Amount'].groupby(german_sample['Type of apartment'])
german_grouped.mean()

Result)
Type of apartment
1    3122.553073
2    3067.257703
3    4881.205607
Name: Credit Amount, dtype: float64
# the mean is computed for each group
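
The same grouping can also be written by calling groupby on the DataFrame with a column name. The sketch below uses small made-up data with hypothetical column names (kind, amount) so the group means are easy to verify by hand.

```python
import pandas as pd

frame = pd.DataFrame({
    "kind":   [1, 1, 2, 2, 2],
    "amount": [100, 300, 10, 20, 30],
})

means = frame.groupby("kind")["amount"].mean()
stats = frame.groupby("kind")["amount"].agg(["mean", "count"])  # several statistics at once

print(means.loc[1])            # 200.0
print(means.loc[2])            # 20.0
print(stats.loc[2, "count"])   # 3
```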

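Grouping by two keys

# german_sample needs the Purpose column as well, so select it again
german_sample = german[['Credit Amount','Purpose','Type of apartment']]
german_grouped = german_sample['Credit Amount'].groupby(
                                                    [german_sample['Purpose'],
                                                     german_sample['Type of apartment']])
german_grouped.mean()
Result)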
Purpose  	Type of apartment
0        	1                    	2597.225000
         	2                    	2811.024242
         	3                    	5138.689655
1        	1                    	5037.086957
         	2                    	4915.222222
         	3                    	6609.923077
... (rest omitted) ...
# shows the mean Credit Amount for each type of apartment within each Purpose
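
A two-key result has a two-level index, and unstack turns the inner level into columns so it reads like a table. A sketch on made-up data with hypothetical column names:

```python
import pandas as pd

frame = pd.DataFrame({
    "purpose": [0, 0, 0, 1],
    "kind":    [1, 1, 2, 1],
    "amount":  [100, 300, 50, 70],
})

means = frame.groupby(["purpose", "kind"])["amount"].mean()
table = means.unstack()  # rows: purpose, columns: kind

print(means.loc[(0, 1)])         # 200.0
print(means.loc[(0, 2)])         # 50.0
print(pd.isna(table.loc[1, 2]))  # True: no rows with purpose 1 and kind 2
```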

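Reflections

I tried out NumPy and Pandas as libraries for working with data. It was convenient that np.zeros and np.ones create and initialize a NumPy array in one step. This kind of ease of use seems to be one of Python's strengths. The more I learn, the better I understand why so many people use Python for data analysis.
At the end of class we learned how to fetch public data and analyze it, and looking at the analyzed results was fun too.
To get more comfortable with the libraries, I plan to analyze various things with public data in Python.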