15. Data Preprocessing - Time Series Data

김동웅 · August 28, 2021

Pandas with python


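Two commonly used ways of representing time in pandas

1. Timestamp

2. Period

Timestamp vs. Period

pd.Timestamp() and pd.Period() are different. A Timestamp denotes a single point in time, while a Period covers a span from a start point to an end point, for example a whole day from its start to its end.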
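The difference shows up when the two objects are compared directly. The following is a minimal sketch, assuming a daily frequency (freq='D') for the Period; the date itself is chosen only for illustration.

```python
import pandas as pd

# A Timestamp is a single instant (midnight when no time is given)
ts = pd.Timestamp('2021-08-28')

# A Period with freq='D' covers the whole day
pr = pd.Period('2021-08-28', freq='D')

print(ts)             # the instant itself
print(pr.start_time)  # first instant covered by the period
print(pr.end_time)    # last instant covered by the period
```

start_time and end_time return Timestamp objects, so the span covered by a Period can be inspected with ordinary Timestamp comparisons.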

1. Converting other data types to time series objects

1-1. Converting strings to Timestamp

Use the pandas built-in function to_datetime().

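import pandas as pd

# Load the sample stock data
df = pd.read_csv(r'C:\Users\kjt63\OneDrive\바탕 화면\python\data_analysis\sample\part5\stock-data.csv')

print(df.head())
print('\n')
df.info()  # info() prints its summary itself and returns None

print('\n')

# Convert the string column 'Date' to Timestamp and store it in a new column
df['new_Date'] = pd.to_datetime(df['Date'])

print(df.head())
print('\n')
df.info()

print(type(df['new_Date'][0]))

# Use the new column as the row index and drop the original string column
df.set_index('new_Date', inplace=True)
df.drop('Date', axis=1, inplace=True)

print(df.head())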
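Since stock-data.csv is not included here, the same conversion can be tried on a small inline DataFrame. This is only a sketch: the column names follow the post, while the three rows are illustrative values standing in for the file.

```python
import pandas as pd

# Illustrative stand-in for the CSV: dates stored as plain strings
df = pd.DataFrame({'Date': ['2018-07-02', '2018-06-29', '2018-06-28'],
                   'Close': [10100, 10700, 10400]})

# Convert the string column to Timestamp values
df['new_Date'] = pd.to_datetime(df['Date'])
first = df['new_Date'].iloc[0]

print(df.dtypes)    # 'new_Date' now has a datetime64 dtype
print(type(first))  # each element is a pandas Timestamp

# Use the converted column as the index and drop the string column
df = df.set_index('new_Date').drop(columns='Date')
print(df.index)
```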

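1-2. Converting Timestamp to Period

The pandas to_period() function converts a Timestamp object into a Period object that represents a fixed span of time.
When calling to_period(), set the freq option to the span to use.
e.g. freq='D' -> 1 day, 'M' -> 1 month, 'A' -> 1 year (pandas 2.2 and later deprecate 'A' in favor of 'Y')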

import pandas as pd

dates = ['2019-04-02','2020-10-01','2021-08-30']

ts_dates = pd.to_datetime(dates)

print(ts_dates)
print('\n')

pr_day = ts_dates.to_period(freq='D')

print(pr_day)

pr_month = ts_dates.to_period(freq='M')

print(pr_month)

pr_year = ts_dates.to_period(freq='A')

print(pr_year)
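To see what the converted periods actually cover, the start and end of a single Period can be inspected. A minimal sketch reusing the same dates; only the 'D' and 'M' aliases are used here.

```python
import pandas as pd

dates = ['2019-04-02', '2020-10-01', '2021-08-30']
ts_dates = pd.to_datetime(dates)

# Each timestamp is mapped to the month that contains it
pr_month = ts_dates.to_period(freq='M')
print(pr_month)

# The first period covers all of April 2019
first = pr_month[0]
print(first.start_time)
print(first.end_time)
```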

2. Creating time series data

Timestamp arrays
- The pandas date_range() function creates array-like time series data containing multiple dates (timestamps).

import pandas as pd

ts_ms = pd.date_range(start='2021-01-01', # start of the date range
                      end=None, # end of the date range
                      periods=6, # number of timestamps to generate
                      freq='MS', # interval (MS = month start)
                      tz='Asia/Seoul') # time zone

print(ts_ms)


ts_ms2 = pd.date_range(start='2021-01-01', # start of the date range
                       end=None, # end of the date range
                       periods=6, # number of timestamps to generate
                       freq='3MS', # interval (3MS = every 3 months, at month start)
                       tz='Asia/Seoul') # time zone

print(ts_ms2)

# Setting freq to 'M' generates the last day of each month (pandas 2.2 and later spell this alias 'ME').
ts_ms3 = pd.date_range(start='2021-08-01',end=None, periods=3, freq='M')

print(ts_ms3)

ts_ms4 = pd.date_range(start='2021-01-01',end=None, periods=5, freq='3M')

print(ts_ms4)
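The effect of the multiplier in front of a frequency alias can be checked by looking at the months that come out. A small sketch using the same start date and the 'MS' alias:

```python
import pandas as pd

# Six month-start timestamps, one month apart
ts_ms = pd.date_range(start='2021-01-01', periods=6, freq='MS', tz='Asia/Seoul')

# Six month-start timestamps, three months apart
ts_3ms = pd.date_range(start='2021-01-01', periods=6, freq='3MS', tz='Asia/Seoul')

print(ts_ms.month.tolist())   # months of the monthly series
print(ts_3ms.month.tolist())  # months of the quarterly-spaced series
```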

Period arrays
- The period_range() function creates time series data containing multiple periods.

import pandas as pd

pr_m = pd.period_range(start='2019-01-01',end=None,periods=3,freq='M')

print(pr_m)

# freq='H' -> 1-hour interval
pr_h = pd.period_range(start='2019-01-01',end=None,periods=3,freq='H')

print(pr_h)

# freq='2H' -> 2-hour interval
pr_2h = pd.period_range(start='2019-01-01',end=None,periods=3,freq='2H')

print(pr_2h)
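Unlike the timestamps from date_range(), each element of a period_range() result is a span. The sketch below, limited to the monthly case, prints the first and last instant of each period to make that visible.

```python
import pandas as pd

# Three consecutive monthly periods starting from January 2019
pr_m = pd.period_range(start='2019-01-01', periods=3, freq='M')

for p in pr_m:
    # Every Period knows the first and last instant it covers
    print(p, p.start_time, p.end_time)

labels = [str(p) for p in pr_m]
```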

3. Working with time series data

Splitting date data

The dt accessor lets you extract individual parts from year-month-day date data.

import pandas as pd

df = pd.read_csv(r'C:\Users\kjt63\OneDrive\바탕 화면\python\data_analysis\sample\part5\stock-data.csv')

df['new Date'] = pd.to_datetime(df['Date'])

df.drop('Date',axis=1,inplace=True)

# Use the dt accessor to split the dates in the 'new Date' column into year, month, and day
df['Year'] = df['new Date'].dt.year
df['Month'] = df['new Date'].dt.month
df['Day'] = df['new Date'].dt.day
print(df)

# Convert the timestamps to periods to change how the date is displayed (year, year-month, year-month-day)
df['Date_yr'] = df['new Date'].dt.to_period(freq='A')
df['Date_m'] = df['new Date'].dt.to_period(freq='M')
df['Date_d'] = df['new Date'].dt.to_period(freq='D')
print(df)

     Close  Start   High    Low  Volume   new Date  Year  Month  Day
0   10100  10850  10900  10000  137977 2018-07-02  2018      7    2
1   10700  10550  10900   9990  170253 2018-06-29  2018      6   29
2   10400  10900  10950  10150  155769 2018-06-28  2018      6   28
3   10900  10800  11050  10500  133548 2018-06-27  2018      6   27
4   10800  10900  11000  10700   63039 2018-06-26  2018      6   26
5   11150  11400  11450  11000   55519 2018-06-25  2018      6   25

    Close  Start   High    Low  Volume   new Date  Year  Month  Day Date_yr   Date_m      Date_d
0   10100  10850  10900  10000  137977 2018-07-02  2018      7    2    2018  2018-07  2018-07-02
1   10700  10550  10900   9990  170253 2018-06-29  2018      6   29    2018  2018-06  2018-06-29
2   10400  10900  10950  10150  155769 2018-06-28  2018      6   28    2018  2018-06  2018-06-28
3   10900  10800  11050  10500  133548 2018-06-27  2018      6   27    2018  2018-06  2018-06-27
4   10800  10900  11000  10700   63039 2018-06-26  2018      6   26    2018  2018-06  2018-06-26
5   11150  11400  11450  11000   55519 2018-06-25  2018      6   25    2018  2018-06  2018-06-25
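The same dt operations can be tried without the CSV file. A minimal sketch on a two-row DataFrame; the rows are taken from the output above purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'new Date': pd.to_datetime(['2018-07-02', '2018-06-29']),
                   'Close': [10100, 10700]})

# Integer components extracted through the dt accessor
df['Year'] = df['new Date'].dt.year
df['Month'] = df['new Date'].dt.month
df['Day'] = df['new Date'].dt.day

# The same accessor also converts each timestamp to a Period
df['Date_m'] = df['new Date'].dt.to_period(freq='M')

print(df)
```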

Using a date index

When a column of timestamps is set as the row index, the index becomes a DatetimeIndex.
Likewise, when a column of periods is set as the row index, the index becomes a PeriodIndex.

import pandas as pd

df = pd.read_csv(r'C:\Users\kjt63\OneDrive\바탕 화면\python\data_analysis\sample\part5\stock-data.csv')

df['new Date'] = pd.to_datetime(df['Date'])

df.set_index('new Date',inplace=True)

df.drop('Date',axis=1,inplace=True)

print(df.head())

print(df.index)

            Close  Start   High    Low  Volume
new Date
2018-07-02  10100  10850  10900  10000  137977
2018-06-29  10700  10550  10900   9990  170253
2018-06-28  10400  10900  10950  10150  155769
2018-06-27  10900  10800  11050  10500  133548
2018-06-26  10800  10900  11000  10700   63039

DatetimeIndex(['2018-07-02', '2018-06-29', '2018-06-28', '2018-06-27',
               '2018-06-26', '2018-06-25', '2018-06-22', '2018-06-21',
               '2018-06-20', '2018-06-19', '2018-06-18', '2018-06-15',
               '2018-06-14', '2018-06-12', '2018-06-11', '2018-06-08',
               '2018-06-07', '2018-06-05', '2018-06-04', '2018-06-01'],
              dtype='datetime64[ns]', name='new Date', freq=None)
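The PeriodIndex case is not shown above, so here is a short sketch of both index types side by side. The column name 'Date_m' and the two rows are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2018-07-02', '2018-06-29'],
                   'Close': [10100, 10700]})
df['new Date'] = pd.to_datetime(df['Date'])
df['Date_m'] = df['new Date'].dt.to_period(freq='M')

# A timestamp column used as the index gives a DatetimeIndex
df_ts = df.set_index('new Date')
print(type(df_ts.index))

# A period column used as the index gives a PeriodIndex
df_pr = df.set_index('Date_m')
print(type(df_pr.index))
```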

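With a date index, you can index selectively at whichever level you need: year, month, or day.

For example, passing a date string as in df.loc['2018'] selects every row that falls within that date.

Note: row selection normally goes through loc or iloc. Older pandas versions also accepted df['2018'] on a DatetimeIndex as an exception, but that shortcut was deprecated and then removed in pandas 2.0, so loc is used here.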


df_y = df.loc['2018']
print(df_y.head())

print('\n')

df_m = df.loc['2018-07']
print(df_m.head())
print('\n')

df_m_cols = df.loc['2018-07','Start':'High']

print('\n')
print(df_m_cols)

print('\n')

df_md = df.loc['2018-07-02']
print(df_md)

print('\n')
df_d_range = df.loc['2018-06-19':'2018-06-25']
print(df_d_range)

            Close  Start   High    Low  Volume
new Date
2018-07-02  10100  10850  10900  10000  137977
2018-06-29  10700  10550  10900   9990  170253
2018-06-28  10400  10900  10950  10150  155769
2018-06-27  10900  10800  11050  10500  133548
2018-06-26  10800  10900  11000  10700   63039

            Close  Start   High    Low  Volume
new Date
2018-07-02  10100  10850  10900  10000  137977

            Start   High
new Date
2018-07-02  10850  10900

            Close  Start   High    Low  Volume
new Date
2018-07-02  10100  10850  10900  10000  137977

            Close  Start   High    Low  Volume
new Date
2018-06-25  11150  11400  11450  11000   55519
2018-06-22  11300  11250  11450  10750  134805
2018-06-21  11200  11350  11750  11200  133002
2018-06-20  11550  11200  11600  10900  308596
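Partial string indexing is easiest to reason about on an index sorted in ascending order, where label slices are inclusive on both ends. The following sketch assumes a small inline DataFrame (Close values borrowed from the output above) that is sorted before selection.

```python
import pandas as pd

df = pd.DataFrame(
    {'Close': [10100, 10700, 10400, 10900, 10800]},
    index=pd.to_datetime(['2018-07-02', '2018-06-29', '2018-06-28',
                          '2018-06-27', '2018-06-26']),
)
df = df.sort_index()  # ascending dates make label slicing unambiguous

df_y = df.loc['2018']                        # every row in 2018
df_m = df.loc['2018-07']                     # every row in July 2018
df_range = df.loc['2018-06-27':'2018-06-28'] # both end dates are included

print(df_y)
print(df_m)
print(df_range)
```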

You can also compute the time interval between two dates represented as Timestamp objects.


today = pd.to_datetime('2018-12-25')
df['time delta'] = today - df.index

df.set_index('time delta',inplace=True)
print(df)

df_180 = df.loc['180 days':'189 days']

print(df_180)

            Close  Start   High    Low  Volume
time delta
180 days    10400  10900  10950  10150  155769
181 days    10900  10800  11050  10500  133548
182 days    10800  10900  11000  10700   63039
183 days    11150  11400  11450  11000   55519
186 days    11300  11250  11450  10750  134805
187 days    11200  11350  11750  11200  133002
188 days    11550  11200  11600  10900  308596
189 days    11300  11850  11950  11300  180656
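The subtraction itself can be checked in isolation. A minimal sketch using the same reference date and three of the dates from the data; subtracting a DatetimeIndex from a Timestamp yields a TimedeltaIndex.

```python
import pandas as pd

dates = pd.to_datetime(['2018-06-28', '2018-06-27', '2018-06-26'])
today = pd.Timestamp('2018-12-25')

# Element-wise difference between the reference date and each date
deltas = today - dates
print(deltas)

# The whole-day component of each interval
days = deltas.days.tolist()
print(days)
```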
