[Python] Pandas: 3. Querying a DataFrame

Jae Gyeong Lee · April 18, 2024

import pandas as pd
import numpy as np
import seaborn as sns

df = sns.load_dataset("iris")  # load the iris practice dataset

>>>
	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

1. Inspecting the data

1.1. Per-column information

df.info()  # row count, non-null count, and dtype of each column

>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

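Dtype = object usually means the column holds strings. More precisely, object is the generic dtype for arbitrary Python objects, and strings are the most common case.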
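To illustrate that object is a catch-all rather than a dedicated string type, here is a minimal sketch. The `mixed` frame and its values are made up for this example and are not part of the iris data.

```python
import pandas as pd

# Hypothetical column mixing a string, an int, and a float (not from iris).
mixed = pd.DataFrame({"value": ["setosa", 5, 3.5]})

# pandas falls back to the generic object dtype for mixed Python values.
print(mixed["value"].dtype)

# Each cell keeps its own Python type.
cell_types = [type(v) for v in mixed["value"]]
print(cell_types)
```

So an object column is not guaranteed to contain only strings; checking the actual cell types can be worthwhile before applying string methods.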

1.2. Summary statistics per column

df.describe()  # summary statistics per column (numeric columns only by default)

>>>
	sepal_length	sepal_width	petal_length	petal_width
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

df.describe(include='object')  # summary statistics for object-dtype columns

>>>
	species
count	150
unique	3
top 	setosa
freq	50
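As a small sketch of how the `include` argument changes the result, the snippet below uses a few iris-like rows typed in by hand (the `sample` frame is an assumption for illustration, so it runs without seaborn). `include="all"` is a standard option of `describe()` that reports numeric and non-numeric columns together.

```python
import pandas as pd

# A few iris-like rows typed in by hand so the snippet runs without seaborn.
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

numeric_only = sample.describe()               # numeric columns only (default)
everything = sample.describe(include="all")    # numeric and non-numeric together

print(list(numeric_only.columns))
print(list(everything.columns))
```

In the combined table, statistics that do not apply to a column (for example mean for species) show up as NaN.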

1.3. Data size

df.shape

>>>
(150, 5)  # (rows, columns)

1.4. Column names

df.columns

>>>
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

2. Viewing the first/last rows

head() and tail() return 5 rows by default; pass n (e.g. df.head(10)) to change the count.

2.1. First n rows

df.head()

>>>
	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

2.2. Last n rows

df.tail()

>>>
	sepal_length	sepal_width	petal_length	petal_width	species
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica
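The calls above rely on the default of 5 rows. A minimal sketch of passing n explicitly, using a made-up ten-row frame (`numbers` is hypothetical, not the iris data):

```python
import pandas as pd

# Hypothetical ten-row frame used only to show the n argument.
numbers = pd.DataFrame({"x": range(10)})

first_three = numbers.head(3)   # rows at positions 0, 1, 2
last_two = numbers.tail(2)      # rows at positions 8, 9

print(first_three.index.tolist())  # [0, 1, 2]
print(last_two.index.tolist())     # [8, 9]
```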

3. Maximum/minimum values

3.1. Maximum/minimum of each column, returned as a Series

df.max()
df.min()

3.2. Maximum/minimum of a specific column (sepal_length)

df['sepal_length'].max()
df['sepal_length'].min()
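A short sketch of the difference between the two forms, again on hand-typed iris-like rows (the `sample` frame is an assumption). `numeric_only=True` is a standard argument of `max()`/`min()` that skips non-numeric columns such as species.

```python
import pandas as pd

# Iris-like rows typed in by hand (illustrative only).
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# Restrict to numeric columns so the string column is skipped.
numeric_max = sample.max(numeric_only=True)
numeric_min = sample.min(numeric_only=True)

# A single column returns a scalar rather than a Series.
longest = sample["sepal_length"].max()
shortest = sample["sepal_length"].min()

print(numeric_max)
print(longest, shortest)  # 7.0 5.1
```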

https://jimmy-ai.tistory.com/254

4. Sorting data

Default: ascending order (ascending=True)

4.1. Sorting by index (sort_index())

# ascending
df.sort_index()
df.sort_index(ascending=True)

# descending
df.sort_index(ascending=False)

4.2. Sorting by the values of a column (sort_values(by=' '))

String columns can be sorted as well (alphabetical order).

# ascending
df.sort_values(by='petal_width')
df.sort_values(by='petal_width', ascending=True)

# descending
df.sort_values(by='petal_width', ascending=False)
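`sort_values()` also accepts a list of columns, with one sort direction per column. A minimal sketch on made-up rows (the `sample` frame is hypothetical): species ascending, then petal_width descending within each species.

```python
import pandas as pd

# Hypothetical rows; only the ordering behaviour matters here.
sample = pd.DataFrame({
    "species": ["virginica", "setosa", "virginica", "setosa"],
    "petal_width": [2.5, 0.2, 1.9, 0.4],
})

# First key: species (A to Z). Second key: petal_width (largest first).
ordered = sample.sort_values(
    by=["species", "petal_width"],
    ascending=[True, False],
)

print(ordered)
print(ordered.index.tolist())  # [3, 1, 0, 2]
```

The original row labels are kept after sorting; chain `.reset_index(drop=True)` if a fresh 0..n-1 index is wanted.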

5. Selecting only the data that meets a condition

5.1. Selecting specific columns

col = ['sepal_length', 'petal_length','species']
df[col].head()

>>>
	sepal_length	petal_length	species
0	5.1	1.4	setosa
1	4.9	1.4	setosa
2	4.7	1.3	setosa
3	4.6	1.5	setosa
4	5.0	1.4	setosa

5.2. Rows that satisfy a single condition

condition = df['species'] == 'virginica'

df[condition].head(3)

>>>
	sepal_length	sepal_width	petal_length	petal_width	species
100	6.3	3.3	6.0	2.5	virginica
101	5.8	2.7	5.1	1.9	virginica
102	7.1	3.0	5.9	2.1	virginica

5.3. Rows that satisfy two or more conditions

condition1 = df['species'] == 'virginica'
condition2 = df['sepal_width'] > 3

df[condition1 & condition2].head(3)

>>>
	sepal_length	sepal_width	petal_length	petal_width	species
100	6.3	3.3	6.0	2.5	virginica
109	7.2	3.6	6.1	2.5	virginica
110	6.5	3.2	5.1	2.0	virginica
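Besides `&` (and), boolean masks can be combined with `|` (or) and negated with `~` (not). The sketch below uses four hand-typed rows (the `sample` frame is an assumption). When conditions are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than `==` and `>`.

```python
import pandas as pd

# Hypothetical rows for illustration.
sample = pd.DataFrame({
    "sepal_width": [3.5, 2.8, 3.3, 2.5],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

is_virginica = sample["species"] == "virginica"
is_wide = sample["sepal_width"] > 3

either = sample[is_virginica | is_wide]   # OR: at least one condition holds
not_virginica = sample[~is_virginica]     # NOT: invert the mask

# Inline form: wrap every condition in parentheses.
both = sample[(sample["species"] == "virginica") & (sample["sepal_width"] > 3)]

print(either.index.tolist())         # [0, 2, 3]
print(not_virginica.index.tolist())  # [0, 1]
print(both.index.tolist())           # [2]
```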

5.4. Rows whose value contains a given string

Use .str.contains(), which returns True where the given string is found.

Matching a single substring (ver)

contain_ver = df['species'].str.contains("ver")
df[contain_ver].head(3)

>>>
	sepal_length	sepal_width	petal_length	petal_width	species
50	7.0	3.2	4.7	1.4	versicolor
51	6.4	3.2	4.5	1.5	versicolor
52	6.9	3.1	4.9	1.5	versicolor

Matching either of two substrings (ver|vir)

contain_ver_vir = df['species'].str.contains("ver|vir")
df[contain_ver_vir]

# Shuffle the rows (otherwise head() might show only ver or only vir)
df_shuffled = df[contain_ver_vir].sample(frac=1).reset_index(drop=True)
df_shuffled.head()

>>>
	sepal_length	sepal_width	petal_length	petal_width	species
0	5.8	2.7	5.1	1.9	virginica
1	5.6	2.9	3.6	1.3	versicolor
2	5.7	2.9	4.2	1.3	versicolor
3	6.2	2.2	4.5	1.5	versicolor
4	7.2	3.2	6.0	1.8	virginica
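The pattern passed to `.str.contains()` is treated as a regular expression by default, which is why "ver|vir" means "ver or vir". A rough sketch of the `case`, `regex`, and `na` arguments on a made-up Series (the `species` values below are assumptions, including one missing entry):

```python
import pandas as pd

# Hypothetical values, including a capitalised entry and a missing one.
species = pd.Series(["setosa", "versicolor", "Versicolor", "virginica", None])

# na=False makes missing values count as "no match" instead of NaN.
case_sensitive = species.str.contains("ver", na=False)
case_insensitive = species.str.contains("ver", case=False, na=False)

# Default: the pattern is a regex, so | means "or".
as_regex = species.str.contains("ver|vir", na=False)

# regex=False: the pattern is a literal string, so | is just a character.
as_literal = species.str.contains("ver|vir", regex=False, na=False)

print(case_sensitive.tolist())    # [False, True, False, False, False]
print(case_insensitive.tolist())  # [False, True, True, False, False]
print(as_regex.tolist())          # [False, True, False, True, False]
print(as_literal.tolist())        # [False, False, False, False, False]
```

Passing `na=False` matters when the mask is used for filtering, since a mask containing NaN cannot be used directly as a boolean indexer.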

6. Selecting specific values (row/column): loc/iloc

loc and iloc use indexing and slicing to select only the rows/columns you want.

6.1. loc (location)

Access a group of rows and columns by (1) label(s) or (2) a boolean array.
Data is accessed through human-readable labels.

6.1.1) by label(s)

df.loc[row label, column label]

# row with label 0, all columns
df.loc[0, ]

>>>
sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
Name: 0, dtype: object

# value at row label 0 and column label 'species'
df.loc[0,'species']

>>>
'setosa'

# rows labeled 2 through 4, columns from 'sepal_length' through 'petal_length'
# (with loc, both ends of a slice are included)
df.loc[2:4, 'sepal_length':'petal_length']

>>>
	sepal_length	sepal_width	petal_length
2	4.7	3.2	1.3
3	4.6	3.1	1.5
4	5.0	3.6	1.4

6.1.2) by boolean array

A Boolean is a data type whose value is either True or False.

# show only the rows where 'species' equals 'virginica' (mask is True)

df.loc[df['species'] == 'virginica'].head(3)

>>>
	sepal_length	sepal_width	petal_length	petal_width	species
100	6.3	3.3	6.0	2.5	virginica
101	5.8	2.7	5.1	1.9	virginica
102	7.1	3.0	5.9	2.1	virginica

# show only the rows where 'species' equals 'virginica' and
# 'sepal_length' is less than 5.0 (both conditions True)

df.loc[(df['species'] == 'virginica') & (df['sepal_length'] < 5.0)].head(3)

>>>
	sepal_length	sepal_width	petal_length	petal_width	species
106	4.9	2.5	4.5	1.7	virginica
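A boolean array in the row position of `loc` can be combined with column labels in the column position, which filters rows and picks columns in one step. A minimal sketch on hand-typed iris-like rows (the `sample` frame and the `mask` name are assumptions):

```python
import pandas as pd

# Iris-like rows typed in by hand (illustrative only).
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "petal_width": [0.2, 1.4, 2.5, 1.7],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

# Boolean mask: virginica rows with sepal_length below 5.0.
mask = (sample["species"] == "virginica") & (sample["sepal_length"] < 5.0)

# Rows come from the mask, columns from the label list.
selected = sample.loc[mask, ["sepal_length", "species"]]

print(selected)
```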

6.2. iloc (integer location)

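Purely integer-location based indexing for selection by position.
Data is accessed through integer positions rather than labels,
- that is, through where the data sits in the DataFrame.
iloc accepts only position-based input: an integer, a list of integers, an integer slice, or a boolean array.

df.iloc[row position, column position]

# row at position 0, all columns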

df.iloc[0, ]

>>>
sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
Name: 0, dtype: object

# value at row position 0 and column position 4 ('species')

df.iloc[0,4]

>>>
'setosa'

# rows at positions 2-4 and columns at positions 0-2
# (the end of an iloc slice is excluded: 2:5 stops at 4, 0:3 stops at 2)

df.iloc[2:5, 0:3]

>>>
	sepal_length	sepal_width	petal_length
2	4.7	3.2	1.3
3	4.6	3.1	1.5
4	5.0	3.6	1.4
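With the default 0..n-1 index, labels and positions coincide, so loc and iloc can look interchangeable. The sketch below uses a made-up frame whose index labels are 10, 20, 30, 40 (an assumption for illustration) to make the difference visible.

```python
import pandas as pd

# Hypothetical frame whose labels (10, 20, ...) differ from positions (0, 1, ...).
sample = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7, 4.6]},
    index=[10, 20, 30, 40],
)

by_label = sample.loc[10:30]     # labels 10 through 30, end label included
by_position = sample.iloc[0:3]   # positions 0, 1, 2, end position excluded
first_two = sample.iloc[0:2]     # positions 0 and 1

print(by_label.index.tolist())     # [10, 20, 30]
print(by_position.index.tolist())  # [10, 20, 30]
print(first_two.index.tolist())    # [10, 20]
print(0 in sample.index)           # False: there is no row labeled 0
```

Here `sample.loc[0]` would raise a KeyError because 0 is a position, not a label, while `sample.iloc[0]` returns the row labeled 10.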

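7. Selecting specific values, part 2 (row/column): at/iat

.at/.iat

Like loc/iloc, these accessors reach a specific value inside a DataFrame.
Unlike loc/iloc, they can access only a single value.
They are faster than loc/iloc, so they are the efficient choice when only one value is needed.

7.1. .at

Label-based, like .loc

df.at[row label, column label]

# value at row label 0 and column label 'species'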

df.at[0,'species']

>>>
'setosa'

# at cannot access multiple values
df.at[2:4, 'sepal_length':'petal_length']

>>>
TypeError: unhashable type: 'slice'

7.2. .iat

Position-based, like .iloc

df.iat[row position, column position]

# value at row position 0 and column position 4 ('species')

df.iat[0,4]

>>>
'setosa'

# iat cannot access multiple values either
df.iat[2:5, 0:3]

>>>
ValueError: iAt based indexing can only have integer indexers
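For a single cell, at/iat return the same value as loc/iloc. A minimal sketch on two hand-typed rows (the `sample` frame is an assumption for illustration):

```python
import pandas as pd

# Two iris-like rows typed in by hand (illustrative only).
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "species": ["setosa", "setosa"],
})

# Label-based scalar access: at mirrors loc.
by_label = sample.at[0, "sepal_length"]
same_by_loc = sample.loc[0, "sepal_length"]

# Position-based scalar access: iat mirrors iloc.
by_position = sample.iat[1, 1]
same_by_iloc = sample.iloc[1, 1]

print(by_label, same_by_loc)       # 5.1 5.1
print(by_position, same_by_iloc)   # setosa setosa
```

The speed advantage tends to matter mostly inside loops that read many individual cells; for one-off lookups either accessor works.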

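8. Locating specific values: .isin()

.isin()

.isin() returns a boolean result that marks the cells holding any of the values passed as its argument.

data = {
         'A': ['     ', 'b', np.nan, 'd', 'e'],
         'B': ['o', 6, 'g', 'o', 'i'],
         'C': [7, 'o', np.nan, None, 'm'],
         'D': ['n', None, 'o', 9, 'p'],
         'E': [np.nan, 'r', 10, None, ''],
       }

df = pd.DataFrame(data)  # build the example DataFrame from the dict
df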

>>>
	A	B	C	D	E
0		o	7	n	NaN
1	b	6	o	None	r
2	NaN	g	NaN	o	10
3	d	o	None	9	None
4	e	i	m	p

8.1. Finding the locations

search = 'o'
find_location = df.isin([search])
find_location

>>>
	A	B	C	D	E
0	False	True	False	False	False
1	False	False	True	False	False
2	False	False	False	True	False
3	False	True	False	False	False
4	False	False	False	False	False

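# Several values can be searched at once. The result is stored under a separate
# name so that find_location (the 'o'-only result) stays intact for the next steps.
search_many = ['m', 'o']
find_location_many = df.isin(search_many)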

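8.2. Finding the columns that contain the value

# .any() is True if at least one cell in the column holds 'o';
# .all() would return True only if every cell in the column holds 'o'
find_column = find_location.any()

# convert the matching columns to a list
col_list = find_column[find_column].index.to_list()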

8.3. Using the column information to find the rows

dic = {}

for col in col_list:
    # boolean Series: True on the rows of this column that hold the searched value
    condition = find_location[col]

    # keep only the True entries and collect their row labels
    dic[col] = condition[condition].index.to_list()

dic

>>>
{'B': [0, 3], 'C': [1], 'D': [2]}

8.4. Collecting the column and row information

results = []

for key, values in dic.items():
    for value in values :
        result = (key, value)
        results.append(result)

results

>>>
[('B', 0), ('B', 3), ('C', 1), ('D', 2)]
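The same (column, row) pairs can also be obtained without explicit loops. The sketch below is one possible alternative, not the approach used above: it rebuilds the example frame and applies `numpy.where` to the transposed boolean mask (the names `mask`, `col_pos`, `row_pos`, and `pairs` are introduced here for illustration).

```python
import numpy as np
import pandas as pd

data = {
    "A": ["     ", "b", np.nan, "d", "e"],
    "B": ["o", 6, "g", "o", "i"],
    "C": [7, "o", np.nan, None, "m"],
    "D": ["n", None, "o", 9, "p"],
    "E": [np.nan, "r", 10, None, ""],
}
df = pd.DataFrame(data)
mask = df.isin(["o"])

# np.where scans row by row, so transpose the mask first: the scan then runs
# column by column, giving the same ordering as the loop-based result.
col_pos, row_pos = np.where(mask.to_numpy().T)

# Translate positions back into column names and row labels
# (row labels are plain integers in this example).
pairs = [(mask.columns[c], int(mask.index[r])) for c, r in zip(col_pos, row_pos)]

print(pairs)  # [('B', 0), ('B', 3), ('C', 1), ('D', 2)]
```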
