[AI][Data Analysis and Machine Learning]Pandas-DataFrame_1

수 · February 21, 2025

AI[Data Analysis and Machine Learning]

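If a Series is one-dimensional, a DataFrame is its two-dimensional extension.
It is easiest to think of it as an Excel spreadsheet.
Because it is two-dimensional, it is indexed by both rows and columns.
- Each row is an individual record; each column is an attribute of those records.
DataFrames are used heavily in data analysis and machine learning for transforming data.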

import numpy as np
import pandas as pd
import os

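base_path = r'드라이브경로'  # placeholder ("drive path"): replace with the folder that holds titanic.csv
filepath = os.path.join(base_path, 'titanic.csv')
pd.read_csv(filepath)

def load_titanic():
	# reload the CSV so each example can start from a fresh copy
	return pd.read_csv(filepath)

df = load_titanic()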

Inspecting a DataFrame

shape attribute: (rows, columns)
describe(): computes summary statistics for numeric columns
info(): prints each column's data type, non-null count, and so on
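
The three inspection tools above can be tried on a tiny frame. This is only a sketch: `toy` and its values are made-up stand-ins for titanic.csv, not data from the post.

```python
import pandas as pd

# Hypothetical toy data (not the real Titanic file)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Name": ["Braund", "Cumings", "Moran"],
})

print(toy.shape)        # (number of rows, number of columns)

stats = toy.describe()  # by default, only numeric columns are summarized
print(stats)

toy.info()              # prints dtypes and non-null counts per column
```

Note that `describe()` skips the missing Age value when counting, so its count row can be smaller than the number of rows.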

Index

index attribute
Stores a (normally unique) label that identifies each row
For complex data, a multi-level index can be used

df.index

>RangeIndex(start=0, stop=891, step=1)

df.index.values
>array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82....

Columns

columns attribute
Each column represents one attribute (feature)
For complex data, multi-level columns can be used

df.columns
>Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
      
df.columns.values
>array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

Creating a DataFrame

df = pd.DataFrame([1,2,3])
df

	0
0	1
1	2
2	3

df = pd.DataFrame([
        [1,2,3],  # first row
        [4,5,6],  # second row
        ])
df

	0	1	2
0	1	2	3
1	4	5	6

shape,ndim,size,len()

len(df)
>2

df.shape
>(2,3)

df.size
>6

df.ndim
>2

df.dtypes

	0
0	int64
1	int64
2	int64
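
As a quick check of how these attributes relate, here is a minimal sketch using the same 2x3 frame as above. The variable names `n_rows` and `n_cols` are introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame([
    [1, 2, 3],  # first row
    [4, 5, 6],  # second row
])

n_rows, n_cols = df.shape       # shape is a (rows, columns) tuple

print(len(df) == n_rows)            # len() counts rows
print(df.size == n_rows * n_cols)   # size counts every cell
print(df.ndim)                      # a DataFrame always has 2 dimensions
```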

Changing columns and index

df2 = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
],columns=['a','b','c','d'])

df2

	a	b	c	d
0	1	2	3	4
1	5	6	7	8
2	9	10	11	12

# replace all column names (국어=Korean, 영어=English, 수학=Math, 과학=Science)
df2.columns=['국어','영어','수학','과학']
df2

	국어	영어	수학	과학
0	1	2	3	4
1	5	6	7	8
2	9	10	11	12

df2.index
>RangeIndex(start=0, stop=3, step=1)

df2.index=['고길동','둘리','마이클']
df2

	국어	영어	수학	과학
고길동	1	2	3	4
둘리	5	6	7	8
마이클	9	10	11	12

Creating a DataFrame from a dict

data={'a':100,'b':200,'c':300}
pd.DataFrame(data,index=['x'])

	a	b	c
x	100	200	300

pd.DataFrame(data,index=['x','y','z'])

	a	b	c
x	100	200	300
y	100	200	300
z	100	200	300

data={'a':[100],'b':[200],'c':[300]}
pd.DataFrame(data)
# when the values are 1-D data (e.g. lists), the frame can be created without index=

	a	b	c
0	100	200	300
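
The reason `index=` was needed in the first two calls can be seen by leaving it out. The sketch below assumes only the dicts shown above; `scalar_data`, `list_data`, and `raised` are illustrative names.

```python
import pandas as pd

scalar_data = {'a': 100, 'b': 200, 'c': 300}

# With only scalar values pandas cannot tell how many rows to build,
# so omitting index= raises ValueError.
try:
    pd.DataFrame(scalar_data)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Wrapping each value in a list supplies the row count (here, one row).
list_data = {'a': [100], 'b': [200], 'c': [300]}
ok = pd.DataFrame(list_data)
print(ok)
```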

Creating a DataFrame from Series

Each Series's index becomes the columns
In the end, a DataFrame is Series stacked on Series

a = pd.Series([100, 200, 300], ['a', 'b', 'c'])
b = pd.Series([101, 202, 303], ['a', 'b', 'c'])
c = pd.Series([110, 220, 330], ['a', 'b', 'c'])

pd.DataFrame([a,b,c])

	a	b	c
0	100	200	300
1	101	202	303
2	110	220	330

d = pd.Series([400,430,430],['a','b','d'])

pd.DataFrame([a,b,c,d])

	a	b	c	d
0	100.0	200.0	300.0	NaN
1	101.0	202.0	303.0	NaN
2	110.0	220.0	330.0	NaN
3	400.0	430.0	NaN	430.0
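
When the Series do not share the same index, pandas takes the union of the labels and fills the gaps with NaN. Since NaN is a floating-point value, the integers are upcast, which is likely why the output above shows 100.0 instead of 100. A minimal sketch with two of the Series from above (`stacked` is an illustrative name):

```python
import pandas as pd

a = pd.Series([100, 200, 300], ['a', 'b', 'c'])
d = pd.Series([400, 430, 430], ['a', 'b', 'd'])

stacked = pd.DataFrame([a, d])

print(stacked)
print(stacked.isna())   # True where a Series had no value for that label
print(stacked.dtypes)   # columns containing NaN cannot stay int64
```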

Renaming specific columns: rename()

df2

	국어	영어	수학	과학
고길동	1	2	3	4
둘리	5	6	7	8
마이클	9	10	11	12

df2.rename(columns={'국어':'kor','과학':'Sci'})

	kor	영어	수학	Sci
고길동	1	2	3	4
둘리	5	6	7	8
마이클	9	10	11	12
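
Note that `rename()` returns a new DataFrame and leaves the original alone unless the result is assigned back. A small sketch, using a shortened version of `df2` (the variable `renamed` is illustrative):

```python
import pandas as pd

df2 = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=['국어', '영어', '수학', '과학'],
)

# Only the columns named in the mapping are changed.
renamed = df2.rename(columns={'국어': 'kor', '과학': 'Sci'})

print(list(renamed.columns))  # new names
print(list(df2.columns))      # original is untouched
```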

set_index(), reset_index()

df2.reset_index()
# the existing index is moved up into a column, and a new default index is attached

	index	국어	영어	수학	과학
0	고길동	1	2	3	4
1	둘리	5	6	7	8
2	마이클	9	10	11	12

df2.reset_index(drop=True)
# -> the existing index is dropped

	국어	영어	수학	과학
0	1	2	3	4
1	5	6	7	8
2	9	10	11	12

df2.set_index('국어')
# the 국어 column moves down into the index and replaces the existing index

	영어	수학	과학
국어
1	2	3	4
5	6	7	8
9	10	11	12

(As with other methods, passing inplace=True modifies the original.)
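
The two methods are roughly inverse operations. The sketch below uses a two-column version of `df2` and reassignment instead of `inplace=True`; the names `by_kor` and `restored` are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame(
    {'국어': [1, 5, 9], '영어': [2, 6, 10]},
    index=['고길동', '둘리', '마이클'],
)

by_kor = df2.set_index('국어')   # 국어 becomes the index; the old name labels are discarded
restored = by_kor.reset_index()  # 국어 returns as a column; a fresh 0..n-1 index is attached

print(by_kor)
print(restored)
```

The original name labels do not come back after `reset_index()`, because `set_index()` replaced them rather than storing them.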

Multi-level index, Multi-level column

pd.DataFrame({'k':[10]})

	k
0	10

pd.DataFrame({('k','k1'):[10]})

	k
	k1
0	10

pd.DataFrame({('k0','k1'):{'a':10,'b':40}})

	k0
	k1
a	10
b	40


pd.DataFrame({
    ('k', 'k1') : [10, 20, 30, 31],
    ('k', 'k2') : [40, 50, 60, 61],
    ('j', 'j1') : [70, 80, 90, 91],
    ('j', 'j2') : [100, 110, 120, 121],
}, index=[['서울', '서울', '경기', '경기'], ['평일', '휴일', '평일', '휴일']])

		k		j
		k1	k2	j1	j2
서울	평일	10	40	70	100
	휴일	20	50	80	110
경기	평일	30	60	90	120
	휴일	31	61	91	121
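
One way to see what the tuple keys and nested index lists produce is to inspect the resulting objects. This sketch goes slightly beyond the post by also selecting from the frame; it uses a reduced version of the data above, and `multi`, `sub`, and `val` are illustrative names.

```python
import pandas as pd

multi = pd.DataFrame({
    ('k', 'k1'): [10, 20],
    ('k', 'k2'): [40, 50],
    ('j', 'j1'): [70, 80],
}, index=[['서울', '서울'], ['평일', '휴일']])

print(type(multi.columns).__name__)  # tuple keys become a MultiIndex
print(multi.index.nlevels)           # two index levels: city, day type

sub = multi['k']                                # all sub-columns under top-level 'k'
val = multi.loc[('서울', '휴일'), ('k', 'k2')]  # a single cell, addressed by tuples
print(sub)
print(val)
```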

Selecting columns

By default, [] extracts columns
A list of column labels can also be passed
- Passing a list returns a DataFrame
- Passing a single column name returns a Series

df = load_titanic()

Selecting a single column

df['Survived']

	Survived
0	0
1	1
2	1
3	1
4	0
...	...
886	0
887	1
888	0
889	1
890	0

891 rows × 1 columns

dtype: int64

Selecting multiple columns

df['Survived'] -> the result is a Series

df[['Survived']] -> the result is a DataFrame

df[['Survived','Age','Name']]

	Survived	Age	Name
0	0	22.0	Braund, Mr. Owen Harris
1	1	38.0	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	1	26.0	Heikkinen, Miss. Laina
3	1	35.0	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	0	35.0	Allen, Mr. William Henry
...	...	...	...
886	0	27.0	Montvila, Rev. Juozas
887	1	19.0	Graham, Miss. Margaret Edith
888	0	NaN	Johnston, Miss. Catherine Helen "Carrie"
889	1	26.0	Behr, Mr. Karl Howell
890	0	32.0	Dooley, Mr. Patrick

891 rows × 3 columns
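
The Series versus DataFrame distinction can be confirmed by checking types and shapes. This is a sketch on made-up data; `toy`, `single`, and `as_frame` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data with two of the Titanic column names
toy = pd.DataFrame({'Survived': [0, 1, 1], 'Age': [22.0, 38.0, 26.0]})

single = toy['Survived']      # one label     -> Series (1-D)
as_frame = toy[['Survived']]  # list of labels -> DataFrame (2-D), even with one column

print(type(single).__name__, single.shape)
print(type(as_frame).__name__, as_frame.shape)
```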

Selecting rows

DataFrame slicing

For a DataFrame, the [] operator is used for column selection by default
Slicing, however, works at the row level

# select the first 10 rows
df[:10]

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN
9	10	1

df[7:10]

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN

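Selecting with loc and iloc

For a Series, [] can select rows, but for a DataFrame [] is designed to select columns by default.
Rows can be selected with .loc[] and .iloc[]
- loc: uses the index labels themselves
- iloc: uses 0-based integer positions
- Both also accept a column selector after a comma

# change the index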
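
The difference between labels and positions only becomes visible once the index no longer starts at 0. A minimal sketch on made-up data (`toy`, `by_label`, and `by_pos` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Age': [22, 38, 26]})
toy.index = toy.index + 100   # labels are now 100, 101, 102

by_label = toy.loc[101]   # row whose index label is 101
by_pos = toy.iloc[1]      # row at position 1 (the second row)

print(by_label.equals(by_pos))                 # both point at the same row
print(toy.loc[101, 'Name'], toy.iloc[1, 0])    # row and column in one call
print(0 in toy.index)                          # label 0 no longer exists, so toy.loc[0] would fail
```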
df.index += 100

df

100	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
101	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
102	3

…

df.loc[986] # the result is a Series

	986
PassengerId	887
Survived	0
Pclass	2
Name	Montvila, Rev. Juozas
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked

dtype: object

df.loc[[986]]
# the result is a DataFrame

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
986	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0	NaN

df.loc[[986,100,110,990]]

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
986	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.00	NaN
100	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.25	NaN
110	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.70	G6
990	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN

df.loc[np.arange(100,104)]

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked
100	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN
101	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85
102	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN
103	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	1138

# difference between slicing and loc[]: loc used index labels, whereas this slice means the rows at positions 100 to 104
df[100:105]

PassengerId	Survived	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
200	101	3	Petranec, Miss. Matilda	female	28.0	0	0	349245	7.8958	NaN
201	102	3	Petroff, Mr. Pastcho ("Pentcho")	male	NaN	0	0	349215	7.8958	NaN
202	103	1	White, Mr. Richard Frasar	male	21.0	0	1	35281	77.2875	D26
203	104	3	Johansson, Mr. Gustaf Joel	male	33.0	0	0	7540	8.6542	NaN
204	105	3	Gustafsson, Mr. Anders Vilhelm	male	37.0	2	0	3101276	7.9250	NaN

iloc

# not index label 0, but the row at position 0
df.iloc[0]
# the result is a Series

	100
PassengerId	1
Survived	0
Pclass	3
Name	Braund, Mr. Owen Harris
Sex	male
Age	22.0
SibSp	1
Parch	0
Ticket	A/5 21171
Fare	7.25
Cabin	NaN
Embarked	S

dtype: object

df.iloc[[0]]
# the result is a DataFrame

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
100	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.25	NaN

df.iloc[[0,100,200,2]]

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
100	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN
200	101	0	3	Petranec, Miss. Matilda	female	28.0	0	0	349245	7.8958	NaN
300	201	0	3	Vande Walle, Mr. Nestor Cyriel	male	28.0	0	0	345770	9.5000	NaN
102	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN

df.head(3)
df[:3]
df.iloc[:3]

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked
100	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN
101	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85
102	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN

Selecting rows and columns together

With loc[] and iloc[], a comma lets you specify rows and columns in one call


#loc[row,column] <- loc[axis0,axis1]

df.loc[986,'Survived']
>0

df.iloc[-5,-3]
>13.0

df.loc[102,'Name']
>'Heikkinen, Miss. Laina'

df.iloc[2,3]
>'Heikkinen, Miss. Laina'

df.loc['Name',102]
# raises KeyError: the order is [row, column], and 'Name' is not a row label

df.loc[[986,100,110,990],['Pclass','Name','Sex','Age']]
df.loc[[986,100,110,990]][['Pclass','Name','Sex','Age']]
df[['Pclass','Name','Sex','Age']].loc[[986,100,110,990]]

	Pclass	Name	Sex	Age
986	2	Montvila, Rev. Juozas	male	27.0
100	3	Braund, Mr. Owen Harris	male	22.0
110	3	Sandstrom, Miss. Marguerite Rut	female	4.0
990	3	Dooley, Mr. Patrick	male	32.0
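
The three spellings above give the same result for reading. A small sketch on made-up data shows this; `toy` and the three result names are illustrative.

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame(
    {'Pclass': [3, 1, 3], 'Name': ['A', 'B', 'C'], 'Age': [22.0, 38.0, 26.0]},
    index=[100, 101, 102],
)

one_step = toy.loc[[102, 100], ['Name', 'Age']]          # rows and columns together
rows_then_cols = toy.loc[[102, 100]][['Name', 'Age']]    # rows first, then columns
cols_then_rows = toy[['Name', 'Age']].loc[[102, 100]]    # columns first, then rows

print(one_step)
print(one_step.equals(rows_then_cols), one_step.equals(cols_then_rows))
```

For reading, any of the three works. For assignment, the single `loc[rows, cols]` call is the safer choice, because the chained forms may operate on an intermediate copy.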

Selecting rows with boolean selection

Select only the rows that satisfy a condition, the same way as in NumPy

df.Pclass.unique()
>array([3,1,2])
df.Pclass.value_counts()

	count
Pclass
3	491
1	216
2	184

dtype: int64

pclass_mask = df['Pclass'] ==1
pclass_mask

	Pclass
100	False
101	True
102	False
103	True
104	False
...	...
986	False
987	True
988	False
989	True
990	False

891 rows × 1 columns

dtype: bool

#boolean selection
df[pclass_mask]

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
101	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85
103	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123
106	7	1	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46
111	12	1	1	Bonnell, Miss. Eliza

# boolean mask for passengers in their 30s
age_mask = (df.Age >= 30) & (df.Age<40)
# passengers who are in 1st class and in their 30s
df[age_mask & pclass_mask]
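
The same two masks can be tried on a frame small enough to verify by eye. This is a sketch on made-up data; only the mask names are taken from the post.

```python
import pandas as pd

# Hypothetical toy data; the last Age is missing on purpose
toy = pd.DataFrame({'Pclass': [1, 3, 1, 2], 'Age': [35.0, 22.0, 54.0, None]})

pclass_mask = toy['Pclass'] == 1
# Parentheses are required: & binds more tightly than >= and <
age_mask = (toy['Age'] >= 30) & (toy['Age'] < 40)

selected = toy[age_mask & pclass_mask]   # keep rows where both masks are True
print(age_mask.tolist())
print(selected)
```

Comparisons against a missing Age evaluate to False, so rows with NaN never pass the age mask.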

Adding and removing rows and columns

Adding columns

Add a column with []
Add a column at a chosen position with insert()

# add the result of Age*2 to df as a new column
df['Age_Double']=df['Age']*2

df[['Age','Age_Double']].head()

	Age	Age_Double
100	22.0	44.0
101	38.0	76.0
102	26.0	52.0
103	35.0	70.0
104	35.0	70.0

Derived variables

A new attribute (column) created from existing attributes (variables, columns).

df['Age_triple']=df['Age_Double']+df['Age']
df[['Age','Age_Double','Age_triple']].head()

	Age	Age_Double	Age_triple
100	22.0	44.0	66.0
101	38.0	76.0	114.0
102	26.0	52.0	78.0
103	35.0	70.0	105.0
104	35.0	70.0	105.0

insert()

Add a column at a chosen position

df['Fare']/10

# insert a new column named 'Fare10' at column position 3
df.insert(3,'Fare10',df['Fare']/10)
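
A minimal sketch of where the inserted column ends up, on made-up data (`toy` is an illustrative name, and position 1 is used because the toy frame has only two columns):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Name': ['A', 'B'], 'Fare': [7.25, 71.2833]})

# insert() modifies the frame in place and returns None,
# so the result should not be assigned back to toy.
toy.insert(1, 'Fare10', toy['Fare'] / 10)

print(list(toy.columns))
print(toy)
```

Running the same `insert()` a second time raises ValueError, because a column with that name already exists.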
