Reading the Official Docs Every Day 004 - Pandas 1.5.0

김영하 · October 13, 2022


Selection

Selecting data

Note
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, DataFrame.at(), DataFrame.iat(), DataFrame.loc() and DataFrame.iloc().

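Note
Selecting data with plain Python and NumPy expressions is intuitive and handy for interactive work, but for real data analysis code the recommendation is to use the optimized pandas data access methods: DataFrame.at(), DataFrame.iat(), DataFrame.loc() and DataFrame.iloc().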

Addendum) The original text uses the word "setting". In pandas, as the examples later in this post show, it means adding a column to a DataFrame or changing specific values.

See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.

For more on indexing, refer to Indexing and Selecting Data and MultiIndex / Advanced Indexing.

Getting

Getting data

Selecting a single column, which yields a Series, equivalent to df.A:

Selecting a single column with df["A"] returns a Series and gives the same result as df.A:

In [23]: df["A"]
Out[23]: 
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64

Selecting via [] (__getitem__), which slices the rows:

To select specific rows, slice with [] (__getitem__):

In [24]: df[0:3]
Out[24]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [25]: df["20130102":"20130104"]
Out[25]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
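
The difference between a column key and a row slice inside [] can be checked with a small sketch. The df built here is a hypothetical stand-in (sequential numbers instead of the random values used in the post), so only the shapes and types matter.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the df in this post: 6 days x 4 columns.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

col = df["A"]                        # a column name selects one column -> Series
rows = df[0:3]                       # an integer slice selects rows by position
by_date = df["20130102":"20130104"]  # a label slice selects rows, end label included

print(type(col).__name__, rows.shape, by_date.shape)
```

Here a string key is treated as a column name, while a slice is treated as a row selection.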

Selection by label

Selecting data by label

See more in Selection by Label using DataFrame.loc() or DataFrame.at().

The Selection by Label topic shows how to use DataFrame.loc() and DataFrame.at().

For getting a cross section using a label:

You can use a label to select the data of a specific row:

In [26]: df.loc[dates[0]]
Out[26]: 
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64

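Addendum) In the example above, dates[0] is the label 2013-01-01 00:00:00, so df.loc[dates[0]] fetches all the data in that row. The row comes back as a Series, which is why the column names are printed as its index.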

Selecting on a multi-axis by label:

You can also select along both axes at once, here all rows and a list of column labels:

In [27]: df.loc[:, ["A", "B"]]
Out[27]: 
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648

Showing label slicing, both endpoints are included:

When slicing by label, both endpoints are included:

In [28]: df.loc["20130102":"20130104", ["A", "B"]]
Out[28]: 
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771

Addendum) In ordinary Python slicing, start_id:end_id includes start_id on the left but excludes end_id on the right. Label slicing with .loc differs in that it includes both.
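
A minimal sketch of that difference, using a hypothetical df with the same dates and column names as the post but sequential values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; only the index and column labels match the post.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["20130102":"20130104", ["A", "B"]]  # end label is included
by_position = df.iloc[1:3, 0:2]                        # end position is excluded

print(len(by_label), len(by_position))  # → 3 2
```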

Reduction in the dimensions of the returned object:

The returned object can have fewer dimensions than the original:

In [29]: df.loc["20130102", ["A", "B"]]
Out[29]: 
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64

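Addendum) The original df is two-dimensional, with several rows and columns. Selecting a single row label together with a list of columns, as above, reduces the result to a one-dimensional Series with two elements.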
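
The reduction can be made visible by checking ndim at each step. This is a sketch on hypothetical sequential data, not the random df from the post:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the same labels as the post.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

frame = df.loc[:, ["A", "B"]]          # slice of rows + list of columns -> DataFrame
row = df.loc["20130102", ["A", "B"]]   # one row label + list of columns -> Series
scalar = df.loc[dates[1], "A"]         # one row label + one column label -> scalar

print(frame.ndim, row.ndim, np.ndim(scalar))  # → 2 1 0
```

Each axis that is addressed with a single label rather than a list or slice is dropped from the result.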

For getting a scalar value:

You can get a single value, that is, a scalar:

In [30]: df.loc[dates[0], "A"]
Out[30]: 0.4691122999071863

For getting fast access to a scalar (equivalent to the prior method):

There is also a faster way that gives the same result as the previous method:

In [31]: df.at[dates[0], "A"]
Out[31]: 0.4691122999071863
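
As a quick check that the two accessors agree, here is a sketch on hypothetical sequential data. It only verifies equality and makes no claim about how large the speed difference is.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the same labels as the post.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

via_loc = df.loc[dates[0], "A"]  # general label-based indexer
via_at = df.at[dates[0], "A"]    # scalar-only label-based accessor

print(via_loc == via_at)
```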

Selection by position

Selecting data by position

See more in Selection by Position using DataFrame.iloc() or DataFrame.iat().

The Selection by Position topic covers DataFrame.iloc() and DataFrame.iat() in detail.

Select via the position of the passed integers:

You can specify a position by passing integers for the rows and columns to select:

In [32]: df.iloc[3]
Out[32]: 
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similar to NumPy/Python:

Integer slices also work, just as in NumPy and Python:

In [33]: df.iloc[3:5, 0:2]
Out[33]: 
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020

By lists of integer position locations, similar to the NumPy/Python style:

As in NumPy and Python, you can also pass lists of the row and column positions to select:

In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]: 
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

For slicing rows explicitly:

To select only specific rows, slice the rows and keep every column:

In [35]: df.iloc[1:3, :]
Out[35]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

For slicing columns explicitly:

To select only specific columns, slice the columns and keep every row:

In [36]: df.iloc[:, 1:3]
Out[36]: 
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427

For getting a value explicitly:

You can also get a single value by giving a specific row and column position:

In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858

For getting fast access to a scalar (equivalent to the prior method):

This gives the same result as the example above, but is a faster way to get a scalar value:

In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858
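
Positional and label-based selection address the same cells, just through different coordinates. The sketch below illustrates this on hypothetical sequential data, where row 1, column 1 holds 5.0:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the same labels as the post.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

by_pos = df.iloc[[1, 2, 4], [0, 2]]               # integer positions
by_label = df.loc[dates[[1, 2, 4]], ["A", "C"]]   # the matching labels

print(by_pos.equals(by_label))         # → True
print(df.iloc[1, 1], df.iat[1, 1])     # → 5.0 5.0
```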

Boolean indexing

Selecting data by condition

Using a single column’s values to select data:

You can select rows using a condition on the values of a single column:

In [39]: df[df["A"] > 0]
Out[39]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

Selecting values from a DataFrame where a boolean condition is met:

You can also apply a condition to the whole DataFrame. Values that do not meet it are shown as NaN:

In [40]: df[df > 0]
Out[40]: 
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
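
The two forms behave differently: a Boolean Series drops whole rows, while a Boolean DataFrame keeps the shape and blanks out individual cells. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

# Small hypothetical frame, chosen so the result is easy to verify by eye.
df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, 5.0, -6.0]})

row_mask = df["A"] > 0      # Boolean Series: one True/False per row
kept_rows = df[row_mask]    # keeps rows 0 and 2 in full

cell_mask = df > 0          # Boolean DataFrame: one True/False per cell
masked = df[cell_mask]      # same shape as df, NaN where the mask is False

print(kept_rows.index.tolist())       # → [0, 2]
print(int(masked.isna().sum().sum())) # → 3
```

The cell-wise form gives the same result as df.where(df > 0).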

Using the isin() method for filtering:

To keep only rows with specific values (filtering), use the isin() method:

In [41]: df2 = df.copy()

In [42]: df2["E"] = ["one", "one", "two", "three", "four", "three"]

In [43]: df2
Out[43]: 
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2["E"].isin(["two", "four"])]
Out[44]: 
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
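
isin() builds the same Boolean mask as a chain of equality tests joined with |. The sketch below compares the two on a hypothetical frame; the column x is made up for illustration.

```python
import pandas as pd

# Hypothetical frame: E matches the post, x is an illustrative column.
df2 = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"],
                    "x": range(6)})

mask = df2["E"].isin(["two", "four"])
mask_or = (df2["E"] == "two") | (df2["E"] == "four")  # equivalent, but longer

selected = df2[mask]
print(selected["x"].tolist())  # → [2, 4]
```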

Setting

Setting data

Setting a new column automatically aligns the data by the indexes:

When you create a new column, its values are aligned to the existing data by index:

In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

In [46]: s1
Out[46]: 
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [47]: df["F"] = s1
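
What alignment means here can be seen in a short sketch. s1 starts one day later than df, so the first row gets no value and the extra last label of s1 is dropped. The df below is a hypothetical one-column stand-in.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in df covering 2013-01-01 .. 2013-01-06.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": np.arange(6.0)}, index=dates)

# s1 covers 2013-01-02 .. 2013-01-07, shifted by one day.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

df["F"] = s1  # matched on index labels, not on position

print(df["F"].tolist())  # → [nan, 1.0, 2.0, 3.0, 4.0, 5.0]
```

Because the missing entry is stored as NaN, the integer values of s1 end up as floats in column F.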

Setting values by label:

You can change a value by specifying its labels:

In [48]: df.at[dates[0], "A"] = 0

Setting values by position:

You can change a value by specifying its row and column positions:

In [49]: df.iat[0, 1] = 0

Setting by assigning with a NumPy array:

You can also change values by assigning a NumPy array:

In [50]: df.loc[:, "D"] = np.array([5] * len(df))

The result of the prior setting operations:

After the four setting operations above, the DataFrame looks like this:

In [51]: df
Out[51]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0
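
The three value-setting operations can be replayed on a hypothetical frame with known values. This sketch checks only the values; the dtype of column D after the array assignment may depend on the pandas version, so it is not asserted here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (1.0 .. 24.0) with the same labels as the post.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(1.0, 25.0).reshape(6, 4), index=dates, columns=list("ABCD"))

df.at[dates[0], "A"] = 0                   # set one cell by label
df.iat[0, 1] = 0                           # set one cell by position (row 0, column B)
df.loc[:, "D"] = np.array([5] * len(df))   # overwrite a whole column with an array

print(df.iloc[0].tolist())
```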

A where operation with setting:

You can also change values with a where-style operation, assigning only where a condition holds:

In [52]: df2 = df.copy()

In [53]: df2[df2 > 0] = -df2

In [54]: df2
Out[54]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059 -5  NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
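
The assignment df2[df2 > 0] = -df2 replaces only the cells where the mask is True, taking each replacement from the same position in -df2. A small sketch on a hypothetical frame:

```python
import pandas as pd

# Small hypothetical frame with positive, negative and zero values.
df = pd.DataFrame({"A": [0.0, 1.5, -2.0], "B": [3.0, -4.0, 0.5]})

df2 = df.copy()      # work on a copy so df itself stays unchanged
df2[df2 > 0] = -df2  # flip the sign of positive cells only

print(df2["A"].tolist(), df2["B"].tolist())
```

After this, no cell in df2 is positive, while zeros and negative values are left as they were.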
