22.07.21-22.08.21
Energy AI Competition hosted by KETI
Renewable-energy generation forecasting program: generation is predicted a day in advance, and a settlement payment is made when the day-of error stays within a set tolerance.
Data:
10-minute-interval solar/wind generation provided by 한국서부발전(주) (Korea Western Power)
Weather observation/forecast .csv files (temperature, wind speed, wind direction, humidity)
Objective:
Develop a model that predicts next-day hourly generation using only data available before 17:00
Evaluation:
Nominal Mean Absolute Error >> MAE computed only over entries at or above the nominal value
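As a worked sketch of the metric (the `capacity` value and 10% cutoff match the `NMAE` function defined later in this write-up; the sample numbers here are made up for illustration):

```python
import numpy as np

# Sketch of the competition metric: MAE over entries whose true value is
# at least 10% of plant capacity, normalized by capacity (in %).
def nmae(true, pred, capacity):
    err = np.abs(true - pred) / capacity
    mask = true >= capacity * 0.1          # entries below the nominal value are ignored
    return 100 * err[mask].mean()

true = np.array([0.0, 50.0, 500.0, 1000.0])   # 0 and 50 fall below 10% of 1100
pred = np.array([10.0, 60.0, 450.0, 1100.0])
score = nmae(true, pred, capacity=1100)        # mean(|50|, |100|) / 1100 * 100
```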
Data format
solar_weather.csv
지점 (station id) | 지점명 (station name) | Date | 온도 (temperature) | 풍속 (wind speed) | 풍향 (wind direction) | 습도 (humidity) |
---|---|---|---|---|---|---|
solar_forecast.csv
Forecast time | forecast | temperature | windspeed | winddirection | humidity |
---|---|---|---|---|---|
2022-08-25 14:00:00 | 7 | --- | --- | --- | --- |
2022-08-25 14:00:00 | 10 | --- | --- | --- | --- |
2022-08-25 14:00:00 | 13 | --- | --- | --- | --- |
solar_power.csv
datetime | target |
---|---|
2022-08-25 00:00:10 | --- |
2022-08-25 00:00:20 | --- |
Under the data confidentiality agreement, visualizations of the data cannot be shared.
Analyzing solar generation as hourly averages produced the following insights.
## solar_forecast > weather forecast issued at a given time for n hours ahead
## solar_weather > actually observed weather
## solar_power > solar generation at 10-minute intervals
import datetime
import numpy as np
import pandas as pd
solar_forecast = pd.read_csv('/content/solar_forecast_weather.csv')
solar_weather = pd.read_csv('/content/weather_solar_actual.csv')
solar_power = pd.read_csv('/content/solar_power_2204.csv')
Because solar generation has a strong daily cycle, any date judged to contain missing values is dropped entirely before training.
Solar generation reads 0 before sunrise and after sunset.
A reading of 0 while the sun is up is treated as a missing value.
Dates with missing data are filtered out by counting the number of zero-generation entries on each date.
## 발전량 전처리
solar_power = solar_power.dropna()
idx_outlier = solar_power[solar_power['target'] > 20000].index
solar_power = solar_power.drop(idx_outlier)
solar_power['datetime'] = pd.to_datetime(solar_power['datetime'])
solar_power['date'] = solar_power['datetime'].dt.date
# Filter outlier dates by the number of zero-generation entries observed on each date
# A day at 10-minute resolution has 144 entries; normal dates show 51-83 non-zero entries
non_zero_group = solar_power.groupby('date').apply(lambda x: sum(x.target != 0))
# non_zero_group.quantile(0.05) == 42
normal_date = non_zero_group.apply(lambda x: 48 < x < 83)
unique_date = solar_power['date'].unique()
normal_list = []
for idx, d in enumerate(normal_date):
    if d:
        normal_list.append(unique_date[idx])
# Drop an additional outlier date found by manual inspection
normal_list.remove(datetime.date(2021,7,12))
normal_idx = solar_power['date'].isin(normal_list)
solar_power = solar_power.loc[normal_idx]
## Generation: 10-minute intervals > hourly mean
solar_power.set_index(pd.to_datetime(solar_power['datetime']), inplace=True)
solar_power_hourmean = solar_power.resample(rule='1H').mean()
solar_power_hourmean = solar_power_hourmean.reset_index()
solar_power_hourmean = solar_power_hourmean.dropna()
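The resampling step above can be sketched on toy data (a minimal example, not the actual generation data, which cannot be shared):

```python
import pandas as pd

# Six 10-minute readings collapse into one hourly mean, mirroring
# solar_power.resample(rule='1H').mean().
idx = pd.date_range('2022-08-25 00:00', periods=6, freq='10T')
df = pd.DataFrame({'target': [0, 10, 20, 30, 40, 50]}, index=idx)
hourly = df.resample('1H').mean()
```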
## Weather-observation preprocessing
solar_weather = solar_weather.drop(columns = ['지점'],axis=1)
solar_weather = solar_weather.drop(columns =['지점명'], axis=1)
solar_weather.columns =['Date', 'temperature','windspeed','winddirection','humidity']
solar_weather['datetime'] = pd.to_datetime(solar_weather['Date'])
solar_weather['date'] = solar_weather['datetime'].dt.date
# Keep only the dates retained in the generation data
normal_idx = solar_weather['date'].isin(normal_list)
solar_weather = solar_weather.loc[normal_idx]
## Feature addition/removal (month, season, con_hour, con_date)
train_data = solar_weather.copy()
# Drop the wind-direction feature
train_data = train_data.drop(columns=['winddirection', 'Date'])
hour = train_data['datetime'].dt.hour
month = train_data['datetime'].dt.month
month_featured = train_data['datetime'].dt.month - 1
day = train_data['datetime'].dt.day /31
date = month_featured + day
# Season category feature
seasons = ['1','2','2','2','3','3','3','2','2','2','1','1']
month_to_season = dict(zip(range(1,13), seasons))
# Month feature
train_data['month'] = month
train_data['season'] = train_data['month'].map(month_to_season)
# con_hour: continuous-hour feature > encodes daily periodicity
# con_date: continuous-date feature > encodes monthly periodicity
train_data['con_hour'] = -24*np.cos(2*np.pi*(hour/24))
train_data['con_date'] = -np.cos(2*np.pi*(date/12))
train_data.reset_index(drop=True, inplace=True)
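Why the cosine encoding helps, in a minimal sketch: a raw hour feature jumps from 23 to 0 at midnight, while the `con_hour` transform above stays continuous across the boundary.

```python
import numpy as np

# Same transform as train_data['con_hour'] above.
def con_hour(h):
    return -24 * np.cos(2 * np.pi * (h / 24))

midnight_jump = abs(con_hour(23) - con_hour(0))   # small, unlike |23 - 0|
```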
## Weather-forecast preprocessing
# Use the 14:00 issue, the latest forecast available before 17:00
# From the 14:00 issue, keep only the lead times covering 00:00-24:00 of the next day
cond1 = solar_forecast['Forecast time'].str.endswith('14:00:00')
cond2 = solar_forecast['forecast'] >=10
cond3 = solar_forecast['forecast'] <= 33
solar_forecast = solar_forecast.loc[cond1&cond2&cond3]
solar_forecast = solar_forecast.reset_index(drop=True)
solar_forecast['Forecast time'] = pd.to_datetime(solar_forecast['Forecast time'])
solar_forecast['forecasted_datetime'] = solar_forecast['Forecast time'] + solar_forecast['forecast'].map(lambda x: pd.DateOffset(hours=x))
solar_forecast.drop(columns =['Forecast time'], inplace=True)
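The lead-time filter `10 <= forecast <= 33` follows from simple arithmetic on the 14:00 issue time, sketched here:

```python
import pandas as pd

# From the 14:00 issue, lead times 10..33 h span exactly 00:00-23:00 of
# the next day, which is why `forecast` is filtered to [10, 33].
issue = pd.Timestamp('2022-08-25 14:00:00')
first = issue + pd.DateOffset(hours=10)   # next day 00:00
last = issue + pd.DateOffset(hours=33)    # next day 23:00
```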
# Interpolate the 3-hour-interval forecast to 1-hour intervals
datetime_from = solar_forecast['forecasted_datetime'].values[0]
datetime_to = solar_forecast['forecasted_datetime'].values[-1]
val_data = pd.DataFrame(pd.date_range(start=datetime_from, end=datetime_to, freq='H'), columns = ['forecasted_datetime'])
val_data = pd.merge(solar_forecast, val_data, on='forecasted_datetime', how='right')
val_data['temperature'] = val_data['temperature'].interpolate()
val_data['humidity'] = val_data['humidity'].interpolate()
val_data['windspeed'] = val_data['windspeed'].interpolate()
val_data['date'] = val_data['forecasted_datetime'].dt.date
# Feature creation (month, season, con_hour, con_date)
val_data['month'] = val_data['forecasted_datetime'].dt.month
val_data['season'] = val_data['month'].map(month_to_season)
val_data['con_hour'] = -24*np.cos(2*np.pi*(val_data['forecasted_datetime'].dt.hour/24))
val_data['con_date'] = -np.cos(2*np.pi*((val_data['forecasted_datetime'].dt.month-1) + (val_data['forecasted_datetime'].dt.day /31))/12)
# Build test_data
test_data = val_data.copy()
normal_idx = val_data['date'].isin(normal_list)
val_data = val_data.loc[normal_idx]
val_data.drop(columns = ['winddirection', 'forecast'],inplace=True)
val_data = val_data[['forecasted_datetime','temperature','windspeed','humidity','month','season','con_hour','con_date']]
val_data = val_data.rename(columns={'forecasted_datetime': 'datetime'})
val_data.reset_index(drop=True, inplace=True)
test_data.drop(columns = ['winddirection', 'forecast'],inplace=True)
test_data = test_data[['forecasted_datetime','temperature','windspeed','humidity','month','season','con_hour','con_date']]
test_data = test_data.rename(columns={'forecasted_datetime': 'datetime'})
test_data.reset_index(drop=True, inplace=True)
## Prediction targets: 22.05-22.06 (the window starts 2022-04-28 to supply the 72-step history)
test1 = test_data['datetime'] >'2022-04-27 23:00:00'
test2 = test_data['datetime'] < '2022-07-01 00:00:00'
test_data = test_data.loc[test1 & test2]
test_data.reset_index(drop=True, inplace=True)
## Feature addition (dew_point)
# Dew point via the Magnus formula
c = 243.12
b = 17.62
def dew_point(t, rh):
    gamma = (b * t) / (c + t) + np.log(rh / 100)
    return (c * gamma) / (b - gamma)
train_data['dew_point'] = dew_point(train_data['temperature'], train_data['humidity'])
val_data['dew_point'] = dew_point(val_data['temperature'], val_data['humidity'])
test_data['dew_point'] = dew_point(test_data['temperature'], test_data['humidity'])
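A quick sanity check of the Magnus dew-point formula used above: at 100% relative humidity the dew point must equal the air temperature, and at lower humidity it must fall below it.

```python
import numpy as np

# Magnus formula with the same constants as in the write-up.
b, c = 17.62, 243.12
def magnus_dew_point(t, rh):
    gamma = (b * t) / (c + t) + np.log(rh / 100)
    return (c * gamma) / (b - gamma)

dp_saturated = magnus_dew_point(20.0, 100.0)   # equals the air temperature
dp_dry = magnus_dew_point(20.0, 50.0)          # strictly below 20 °C
```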
Adding weather-forecast data (sky state, precipitation probability)
Uses the ultra-short-term forecast open data provided by 기상청 (Korea Meteorological Administration)
Sky state: {clear: 1, mostly cloudy: 3, overcast: 4}
Precipitation probability: 0-100 %
Raw data format
day | hour | forecast | value |
---|---|---|---|
day | hour | 3-hour interval (before 21.06) / 1-hour interval (after 21.06) | value |
When the day rolls over, blocks are delimited by a `Start: Date` row.
The forecast interval differs before and after June 2021 (3-hour vs 1-hour).
## KMA raw-data preprocessing
def forecast_preprocess(fcst_df, st_month, st_year):
    # Rows with NaN 'hour' are the 'Start: Date' delimiters between monthly blocks
    month_rows = []
    month_rows.extend(fcst_df[fcst_df['hour'].isna()].index)
    month_rows.append(fcst_df.shape[0] + 1)
    month_data = []
    for i in range(len(month_rows) - 1):
        month_data.append(fcst_df.loc[month_rows[i] + 1:month_rows[i + 1] - 1])
    month = st_month
    year = st_year
    out = pd.DataFrame()
    for i, df in enumerate(month_data):
        month += 1
        if month > 12:
            month = month % 12
            year += 1
        date = f'{year}-{month}-' + df[' format: day'] + ' ' + (df['hour'].astype(int) // 100).astype(str) + ':00'
        date = pd.to_datetime(date) + pd.DateOffset(hours=9)
        df['datetime'] = date
        month_sky = df[['datetime', 'forecast', 'value']]
        out = pd.concat([out, month_sky])
    return out
## Interpolate the pre-21.06 3-hour-interval forecasts
def fcst_3step_interpolate(df, method):
    df = df[df['forecast'] == 4]
    df['forecasted_datetime'] = pd.to_datetime(df['datetime']) + df['forecast'].map(lambda x: pd.DateOffset(hours=x))
    df = df[['forecasted_datetime', 'value']]
    datetime_from = df['forecasted_datetime'].values[0]
    datetime_to = df['forecasted_datetime'].values[-1]
    interpolate = pd.DataFrame(pd.date_range(start=datetime_from, end=datetime_to, freq='H'), columns=['forecasted_datetime'])
    interpolate = pd.merge(df, interpolate, on='forecasted_datetime', how='right')
    interpolate['value'] = interpolate['value'].interpolate(method)
    interpolate.reset_index(drop=True, inplace=True)
    return interpolate
## Post-21.06 forecast values
def fcst_1step(df):
    df = df[df['forecast'].isin([6, 7, 8])]
    df['forecasted_datetime'] = pd.to_datetime(df['datetime']) + df['forecast'].map(lambda x: pd.DateOffset(hours=x))
    df = df[['forecasted_datetime', 'value']]
    df.reset_index(drop=True, inplace=True)
    return df
## 22.05/22.06 forecast values
def test_set(df):
    df['datetime'] = df['datetime'].astype(str)
    cond1 = df['datetime'].str.endswith('14:00:00')
    cond2 = df['forecast'] >= 10
    cond3 = df['forecast'] <= 33
    test_df = df.loc[cond1 & cond2 & cond3]
    test_df['datetime'] = pd.to_datetime(test_df['datetime'])
    test_df['forecasted_datetime'] = test_df['datetime'] + test_df['forecast'].map(lambda x: pd.DateOffset(hours=x))
    test_df.drop(columns=['datetime'], inplace=True)
    # The 72-step CNN model needs history, so include data from 3 days earlier
    cond4 = test_df['forecasted_datetime'] >= '2022-04-28 00:00:00'
    cond5 = test_df['forecasted_datetime'] < '2022-07-01 00:00:00'
    test_df = test_df.loc[cond4 & cond5]
    test_df = test_df[['forecasted_datetime', 'value']]
    test_df.reset_index(drop=True, inplace=True)
    return test_df
## sky_state: nearest interpolation / rain_prob: linear interpolation
sky_state1 = fcst_3step_interpolate(sky_state, 'nearest')
rain_prob1 = fcst_3step_interpolate(rain_prob,'linear')
sky_state2 = fcst_1step(sky_state2)
rain_prob2 = fcst_1step(rain_prob2)
sky_state1 = sky_state1[sky_state1['forecasted_datetime'] < '2021-06-30 08:00:00']
sky_state2 = sky_state2[sky_state2['forecasted_datetime'] < '2022-05-01 00:00:00']
sky_concat = pd.concat([sky_state1, sky_state2])
sky_concat.reset_index(drop=True, inplace=True)
rain_prob1 = rain_prob1[rain_prob1['forecasted_datetime'] < '2021-07-01 17:00:00']
rain_prob2 = rain_prob2[rain_prob2['forecasted_datetime'] < '2022-05-01 00:00:00']
rain_concat = pd.concat([rain_prob1, rain_prob2])
rain_concat.reset_index(drop=True, inplace=True)
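The choice of two interpolation methods can be illustrated on toy values (a sketch with made-up numbers): sky state is categorical (1/3/4), so a nearest-neighbour fill keeps valid categories, while precipitation probability is continuous and interpolates linearly.

```python
import numpy as np

known_t = np.array([0.0, 3.0])   # 3-hour-spaced forecast times
sky = np.array([1.0, 4.0])       # sky-state categories at the known times
rain = np.array([0.0, 60.0])     # precipitation probability (%)

t = 1.0                                               # hour to fill in
rain_linear = np.interp(t, known_t, rain)             # 20.0, a sensible %
sky_nearest = sky[np.argmin(np.abs(known_t - t))]     # 1.0, a valid category
sky_linear = np.interp(t, known_t, sky)               # 2.0, NOT a valid category
```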
solar_ss_test = test_set(sky_state2)
solar_rp_test = test_set(rain_prob2)
## Feature addition (sky_state, rain_probability)
# Keep only the dates in use
solar_ss['date'] = solar_ss['forecasted_datetime'].dt.date
index = solar_ss['date'].isin(normal_list)
solar_ss_st = solar_ss.loc[index]
solar_ss_st.drop(columns='date',inplace=True)
solar_ss_st.reset_index(drop=True, inplace=True)
solar_rp['date'] = solar_rp['forecasted_datetime'].dt.date
index = solar_rp['date'].isin(normal_list)
solar_rp_st = solar_rp.loc[index]
solar_rp_st.drop(columns='date',inplace=True)
solar_rp_st.reset_index(drop=True, inplace=True)
train_data['ss']= solar_ss_st['value']
train_data['rp']= solar_rp_st['value']
val_data['ss']= solar_ss_st['value']
val_data['rp'] = solar_rp_st['value']
test_data['ss'] = test_solar_ss['value']
test_data['rp'] = test_solar_rp['value']
Adding solar data (altitude, azimuth)
Uses the solar-position API provided by 한국천문연구원 (Korea Astronomy and Space Science Institute)
## Loading the solar data via the API
import matplotlib.pyplot as plt
import pandas as pd
import urllib
import urllib.request
import json
import xmltodict
from tqdm import tqdm
def return_solar_info(date):
    key = 'personal access key issued when requesting data access'
    url = 'http://apis.data.go.kr/B090041/openapi/service/SrAltudeInfoService/getLCSrAltudeInfo'
    queryParams = '?' + urllib.parse.urlencode(
        {
            urllib.parse.quote_plus('ServiceKey'): key,
            urllib.parse.quote_plus('latitude'): '37.535911',
            urllib.parse.quote_plus('longitude'): '126.602342',
            urllib.parse.quote_plus('locdate'): date,
            urllib.parse.quote_plus('dnYn'): 'Y',
        }
    )
    response = urllib.request.urlopen(url + queryParams).read()
    dict_type = xmltodict.parse(response)
    return dict_type['response']['body']['items']

def angle_to_float(value):
    # Parse strings like "35˚30´"; negative altitudes are clipped to 0
    value = value.split('˚')
    if value[0][0] == '-':
        value = 0
    else:
        value = int(value[0]) + int(value[1].split('´')[0]) * 0.01
    return value
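A self-contained demo of the parsing convention above (the function is restated here so the snippet runs on its own): arcminutes are kept as hundredths rather than converted to fractions of a degree, and negative altitudes become 0.

```python
# Same logic as angle_to_float in the write-up.
def angle_to_float(value):
    value = value.split('˚')
    if value[0][0] == '-':
        return 0
    return int(value[0]) + int(value[1].split('´')[0]) * 0.01

parsed = angle_to_float('35˚30´')    # 35 + 30*0.01 = 35.3
clipped = angle_to_float('-05˚12´')  # negative altitude -> 0
```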
# Ex) date: 2022-08-25 00:00:00 > 20220825
dates = date.apply(lambda x: x[:10].replace('-', '')).unique()
solar_infos = []
for d in tqdm(dates):
    solar_infos.append(return_solar_info(d)['item'])
# Meridian altitude, plus altitude/azimuth at 09:00, 12:00, 15:00, 18:00
angle_columns = ['altitudeMeridian','altitude_09','altitude_12','altitude_15','altitude_18','azimuth_09','azimuth_12','azimuth_15','azimuth_18']
date_solars = {}
for i, d in enumerate(solar_infos):
    tmp_solars = []
    for col in angle_columns:
        tmp_solars.append(angle_to_float(d[col]))
    date_solars[d['locdate']] = tmp_solars
# Save to solar_info.csv
date_solars_df = pd.DataFrame(date_solars.values(), columns=angle_columns)
date_solars_df['date'] = list(date_solars.keys())
date_solars_df.to_csv('solar_info.csv', index=False)
train_data['altitude'] = 0
val_data['altitude'] = 0
test_data['altitude'] = 0
## Interpolate the 3-hour-interval data to hourly values
solar_info['date'] = pd.to_datetime(solar_info['date'], format='%Y%m%d')
altitude_09 = solar_info['altitude_09']
altitude_12 = solar_info['altitude_12']
altitude_15 = solar_info['altitude_15']
altitude_18 = solar_info['altitude_18']
altitude_10 = altitude_09 + (altitude_12-altitude_09)*1/3
altitude_11 = altitude_09 + (altitude_12-altitude_09)*2/3
altitude_13 = altitude_12 + (altitude_15-altitude_12)*1/3
altitude_14 = altitude_12 + (altitude_15-altitude_12)*2/3
altitude_16 = altitude_15 + (altitude_18-altitude_15)*1/3
altitude_17 = altitude_15 + (altitude_18-altitude_15)*2/3
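The manual thirds above are plain linear interpolation; with made-up sample values, `altitude_10 = altitude_09 + (altitude_12 - altitude_09)/3` sits exactly one third of the way between the two anchors:

```python
# Toy anchor values (illustrative only, not real altitudes).
altitude_09, altitude_12 = 30.0, 60.0
altitude_10 = altitude_09 + (altitude_12 - altitude_09) * 1 / 3   # 40.0
altitude_11 = altitude_09 + (altitude_12 - altitude_09) * 2 / 3   # 50.0
```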
# Add the altitude feature
# Fill each row's altitude from the matching hourly series; a per-hour
# counter walks through each series date by date
hour_to_altitude = {9: altitude_09, 10: altitude_10, 11: altitude_11,
                    12: altitude_12, 13: altitude_13, 14: altitude_14,
                    15: altitude_15, 16: altitude_16, 17: altitude_17,
                    18: altitude_18}
counters = {h: 0 for h in hour_to_altitude}
for i, ts in enumerate(train_data['datetime']):
    h = pd.to_datetime(ts).hour
    if h in hour_to_altitude:
        train_data.loc[i, 'altitude'] = hour_to_altitude[h].iloc[counters[h]]
        counters[h] += 1
## Add a daily azimuth-deviation feature
train_data['date'] = pd.to_datetime(train_data['date'])
date = solar_info['date']
date_azimuth ={}
for i, dt in enumerate(date):
    date_azimuth[dt] = solar_info['azimuth_diff'].iloc[i]
train_data['azimuth'] = train_data['date'].map(date_azimuth)
# These are astronomical observations, not forecasts, so apply the same values to val_data
val_data['altitude'] = train_data['altitude']
val_data['azimuth'] = train_data['azimuth']
## (Optional) one-hot encode the category feature
train_data = pd.get_dummies(train_data, columns = ['ss'])
val_data = pd.get_dummies(val_data, columns = ['ss'])
# Using TensorFlow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import optimizers
from tensorflow.keras.layers import Dense, Flatten, Conv1D, BatchNormalization, Activation, Dropout, LSTM
from tensorflow.keras import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
## Build each sample from the preceding 72 steps
def seq_gen(df, target, seq_length):
    if target is None:
        X = []
        for i in range(0, len(df) - seq_length):
            x = df[i:i + seq_length + 1]
            X.append(x)
        return np.array(X)
    else:
        X = []
        Y = []
        for i in range(0, len(df) - seq_length):
            x = df[i:i + seq_length + 1]
            y = target[i + seq_length]
            X.append(x)
            Y.append(y)
        return np.array(X), np.array(Y)
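A shape sketch for the windowing above, on dummy arrays: with `seq_length=72` each sample holds 73 consecutive rows (the target hour plus the 72 before it), which matches the model's `Input(shape=(73, 13))` later on.

```python
import numpy as np

# Same windowing logic as seq_gen above, restated so the snippet runs alone.
def seq_gen(data, target, seq_length):
    X, Y = [], []
    for i in range(0, len(data) - seq_length):
        X.append(data[i:i + seq_length + 1])
        Y.append(target[i + seq_length])
    return np.array(X), np.array(Y)

data = np.zeros((100, 13))
target = np.arange(100)
X, Y = seq_gen(data, target, 72)   # X: (28, 73, 13), Y: (28,)
```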
# After several trials, the season feature was dropped
train_data.drop(columns=['season'],inplace=True)
val_data.drop(columns=['season'],inplace=True)
test_data.drop(columns=['season'],inplace=True)
train_y = solar_power_hourmean['target'].to_numpy()
# Validate on 21.05-21.08 using the forecast weather values
cond1 = val_data['datetime'] > '2021-04-27 23:00:00'
cond2 = val_data['datetime'] < '2021-09-01 00:00:00'
cond3 = solar_power_hourmean['datetime'] >'2021-04-27 23:00:00'
cond4 = solar_power_hourmean['datetime'] < '2021-09-01 00:00:00'
val_data = val_data.loc[cond1 & cond2].drop(columns=['datetime'])
val_data.reset_index(drop=True, inplace=True)
val_y = solar_power_hourmean['target'].loc[cond3 & cond4].to_numpy()
train_x, train_y = seq_gen(train_data, train_y,72)
val_x, val_y = seq_gen(val_data, val_y, 72)
test_x = seq_gen(test_data, None, 72)
## Evaluation metric
def NMAE(true, pred):
    absolute_error = np.abs(true - pred)
    absolute_error /= capacity
    target_idx = np.where(true >= capacity * 0.1)
    nmae = 100 * absolute_error[target_idx].mean()
    return nmae

def custom_metric(true, pred):
    score = tf.py_function(func=NMAE, inp=[true, pred], Tout=tf.float32, name='nmae')
    return score
# Causal padding ensures only data before the target step feeds each output
# A shallow model limits overfitting to the weather-observation data
def regression_dilated_cnn(Model_input):
    x = Conv1D(8, 3, padding='causal')(Model_input)
    x = BatchNormalization()(x)
    x = Activation(activation='elu')(x)
    x = Dropout(0.5)(x)
    x = Conv1D(16, 3, padding='causal')(x)
    x = BatchNormalization()(x)
    x = Activation(activation='elu')(x)
    x = Dropout(0.5)(x)
    x = Conv1D(32, 3, padding='causal', dilation_rate=2)(x)
    x = BatchNormalization()(x)
    x = Activation(activation='elu')(x)
    x = Dropout(0.5)(x)
    x = Conv1D(64, 3, padding='causal', dilation_rate=4)(x)
    x = BatchNormalization()(x)
    x = Activation(activation='elu')(x)
    x = Dropout(0.5)(x)
    x = Flatten()(x)
    x = Dense(32)(x)
    x = Activation(activation='elu')(x)
    x = Dense(1)(x)
    output = Model(Model_input, x, name='regression_dilated_cnn')
    return output
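The dilation schedule above can be checked with a small receptive-field calculation (a sketch using the standard formula for stacked causal convolutions; each layer adds `(kernel-1)*dilation` past steps):

```python
# Kernel size 3 with dilation rates 1, 1, 2, 4, as in the model above.
kernel = 3
dilations = [1, 1, 2, 4]
receptive_field = 1 + sum((kernel - 1) * d for d in dilations)   # 17 time steps
```

So each output sees roughly the previous 17 hours of weather, well within the 73-step input window.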
capacity = 1100
# If the monitored metric fails to improve for 7 epochs, multiply the LR by factor
reLR = ReduceLROnPlateau(monitor ='val_custom_metric', patience=7, factor = 0.5)
es = EarlyStopping(monitor='val_custom_metric', patience = 30, restore_best_weights=True)
model_inputs = keras.Input(shape=(73,13))
model = regression_dilated_cnn(model_inputs)
model.compile(loss = 'mae', optimizer = 'RMSprop', metrics=[custom_metric])
history = model.fit(train_x, train_y, epochs=200, validation_data=(val_x, val_y),
                    batch_size=64, callbacks=[reLR, es])
An error of 9-10% was obtained.
The preprocessing pipeline is nearly identical to the solar-power one.
wind_forecast = pd.read_csv('/content/wind_forecast_weather.csv')
wind_weather = pd.read_csv('/content/weather_wind_actual.csv')
wind_power = pd.read_csv('/content/wind_2204.csv')
Visualizing wind_power showed that factors other than hourly periodicity dominate.
Unlike solar_power, missing values are removed from both generation and wind speed.
## Generation and weather-observation preprocessing
wind_power = wind_power.dropna()
wind_power['datetime'] = pd.to_datetime(wind_power['datetime'])
wind_power['target'] = wind_power['target'].apply(lambda x: 0.0 if x < 0.0 else x)
idx_outlier = wind_power[wind_power['target'] > 20000].index
wind_power = wind_power.drop(idx_outlier)
wind_power.reset_index(drop=True, inplace=True)
wind_weather = wind_weather.drop(columns = ['지점'],axis=1)
wind_weather = wind_weather.drop(columns =['지점명'], axis=1)
wind_weather.columns =['datetime', 'temperature','windspeed','winddirection','humidity']
wind_weather.set_index(pd.to_datetime(wind_weather['datetime']), inplace=True)
wind_weather = wind_weather.resample(rule='1H').mean()
wind_weather = wind_weather.fillna(0)
wind_weather.reset_index(inplace=True)
power_zero = wind_power[wind_power['target'] == 0].index
wind_zero = wind_weather[wind_weather['windspeed']==0].index
unique_datetime = pd.to_datetime(wind_power['datetime'].unique())
power_error_datetime = np.array(wind_power.iloc[power_zero]['datetime'])  # 10-minute resolution
wind_error_datetime = np.array(wind_weather.iloc[wind_zero]['datetime'])  # 1-hour resolution
target_datetime = np.setdiff1d(unique_datetime,power_error_datetime)
target_idx = wind_power['datetime'].isin(target_datetime)
wind_power = wind_power.loc[target_idx]
wind_power.reset_index(drop=True, inplace=True)
wind_power.set_index(pd.to_datetime(wind_power['datetime']), inplace=True)
wind_power_hourmean = wind_power.resample(rule='1H').mean()
wind_power_hourmean = wind_power_hourmean.dropna()
wind_power_hourmean = wind_power_hourmean.reset_index()
target_idx = ~wind_power_hourmean['datetime'].isin(wind_error_datetime)
wind_power_hourmean = wind_power_hourmean.loc[target_idx]
wind_weather['datetime'] = pd.to_datetime(wind_weather['datetime'])
target_idx = wind_weather['datetime'].isin(wind_power_hourmean['datetime'])
wind_weather = wind_weather.loc[target_idx]
wind_weather.reset_index(drop=True, inplace=True)
## Feature-transform functions
def angle_to_dir(x):
    # Bin the wind direction into 8 compass sectors of 45 degrees
    if x >= 22.5 and x < 67.5:
        return '1'
    elif x >= 67.5 and x < 112.5:
        return '2'
    elif x >= 112.5 and x < 157.5:
        return '3'
    elif x >= 157.5 and x < 202.5:
        return '4'
    elif x >= 202.5 and x < 247.5:
        return '5'
    elif x >= 247.5 and x < 292.5:
        return '6'
    elif x >= 292.5 and x < 337.5:
        return '7'
    elif x >= 337.5 or x < 22.5:
        return '0'
def angle_to_cos(x):
    return np.cos(np.pi / 180 * (x + 90))

def angle_to_sin(x):
    return np.sin(np.pi / 180 * (x - 90))
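A sketch of what the decomposition above produces (the functions are restated so the snippet runs alone): meteorological wind direction is where the wind comes *from*, so a north wind (0°) yields a vector pointing south and an east wind (90°) a vector pointing west.

```python
import numpy as np

# Same transforms as angle_to_cos / angle_to_sin above.
def angle_to_cos(x):
    return np.cos(np.pi / 180 * (x + 90))

def angle_to_sin(x):
    return np.sin(np.pi / 180 * (x - 90))

speed = 5.0
north_x, north_y = angle_to_cos(0) * speed, angle_to_sin(0) * speed    # ~(0, -5)
east_x, east_y = angle_to_cos(90) * speed, angle_to_sin(90) * speed    # ~(-5, 0)
```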
## Feature creation
train_data = wind_weather.copy()
hour = train_data['datetime'].dt.hour
month = train_data['datetime'].dt.month
month_featured = month - 1
day = train_data['datetime'].dt.day /31
date = month_featured + day
seasons = ['4','4','4','1','1','1','2','2','2','3','3','3']
month_to_season = dict(zip(range(1,13),seasons))
# Wind-direction category feature
train_data['cat_winddirection'] = train_data['winddirection'].apply(lambda x: angle_to_dir(x))
# Wind-vector features
train_data['windxvec'] = train_data['winddirection'].apply(lambda x: angle_to_cos(x)) * train_data['windspeed']
train_data['windyvec'] = train_data['winddirection'].apply(lambda x: angle_to_sin(x)) * train_data['windspeed']
# Month feature
train_data['month'] = month
# Season feature
train_data['season'] = train_data['month'].map(month_to_season)
# Continuous-hour feature
train_data['con_hour'] = np.cos(2*np.pi*(hour/24))
# Continuous-date feature
train_data['con_date'] = np.cos(2*np.pi*(date/12))
train_data = train_data.drop(columns =['datetime'])
# Squared wind speed, to weight wind speed more heavily
train_data['windspeed^2'] = train_data['windspeed'] * train_data['windspeed']
train_data.reset_index(drop=True, inplace=True)
## Build validation and test data
cond1 = wind_forecast['Forecast time'].str.endswith('14:00:00')
cond2 = wind_forecast['forecast'] >=10
cond3 = wind_forecast['forecast'] <= 33
wind_forecast = wind_forecast.loc[cond1&cond2&cond3]
wind_forecast = wind_forecast.reset_index(drop=True)
wind_forecast['Forecast time'] = pd.to_datetime(wind_forecast['Forecast time'])
wind_forecast['forecasted_datetime'] = wind_forecast['Forecast time'] + wind_forecast['forecast'].map(lambda x: pd.DateOffset(hours=x))
wind_forecast.drop(columns =['Forecast time'], inplace=True)
datetime_from = wind_forecast['forecasted_datetime'].values[0]
datetime_to = wind_forecast['forecasted_datetime'].values[-1]
val_data = pd.DataFrame(pd.date_range(start=datetime_from, end=datetime_to, freq='H'), columns = ['forecasted_datetime'])
val_data = pd.merge(wind_forecast, val_data, on='forecasted_datetime', how='right')
val_data['temperature'] = val_data['temperature'].interpolate()
val_data['humidity'] = val_data['humidity'].interpolate()
val_data['windspeed'] = val_data['windspeed'].interpolate()
val_data['winddirection'] = val_data['winddirection'].interpolate('nearest')
val_data['cat_winddirection'] = val_data['winddirection'].apply(lambda x: angle_to_dir(x))
val_data['windxvec'] = val_data['winddirection'].apply(lambda x: angle_to_cos(x)) * val_data['windspeed']
val_data['windyvec'] = val_data['winddirection'].apply(lambda x: angle_to_sin(x)) * val_data['windspeed']
val_data['month'] = val_data['forecasted_datetime'].dt.month
val_data['season'] = val_data['month'].map(month_to_season)
val_data['con_hour'] = np.cos(2*np.pi*(val_data['forecasted_datetime'].dt.hour/24))
val_data['con_date'] = np.cos(2*np.pi*((val_data['forecasted_datetime'].dt.month-1) + (val_data['forecasted_datetime'].dt.day /31))/12)
val_data['windspeed^2'] = val_data['windspeed'] * val_data['windspeed']
test_data = val_data.copy()
normal_idx = val_data['forecasted_datetime'].isin(wind_power_hourmean['datetime'])
val_data = val_data.loc[normal_idx]
val_data.drop(columns = ['forecast'],inplace=True)
val_data = val_data[['forecasted_datetime','temperature','windspeed','humidity','winddirection','cat_winddirection','windxvec','windyvec','month','season','con_hour','con_date','windspeed^2']]
val_data = val_data.rename(columns={'forecasted_datetime': 'datetime'})
val_data.reset_index(drop=True, inplace=True)
## Select the target period
test_data = test_data.rename(columns={'forecasted_datetime': 'datetime'})  # rename first so the filters below work
test1 = test_data['datetime'] > '2022-04-30 23:00:00'
test2 = test_data['datetime'] < '2022-07-01 00:00:00'
test_data = test_data.loc[test1 & test2]
test_data.reset_index(drop=True, inplace=True)
index = wind_ss['forecasted_datetime'].isin(wind_power_hourmean['datetime'])
wind_ss_st = wind_ss.loc[index]
wind_ss_st.reset_index(drop=True, inplace=True)
index = wind_rp['forecasted_datetime'].isin(wind_power_hourmean['datetime'])
wind_rp_st = wind_rp.loc[index]
wind_rp_st.reset_index(drop=True, inplace=True)
train_data['ss'] = wind_ss_st['value']
train_data['rp'] = wind_rp_st['value']
val_data['ss'] = wind_ss_st['value']
val_data['rp'] = wind_rp_st['value']
import lightgbm as lgb
train_data.drop(columns=['datetime'], errors='ignore', inplace=True)  # already dropped above; guard against a repeat
test_data.drop(columns=['datetime'], inplace=True)
train_x = train_data.values
train_y = wind_power_hourmean['target'].to_numpy()
cond1 = val_data['datetime'] > '2021-04-30 23:00:00'
cond2 = val_data['datetime'] < '2021-09-01 00:00:00'
cond3 = wind_power_hourmean['datetime'] >'2021-04-30 23:00:00'
cond4 = wind_power_hourmean['datetime'] < '2021-09-01 00:00:00'
val_data = val_data.loc[cond1 & cond2].drop(columns=['datetime'])
val_data.reset_index(drop=True, inplace=True)
val_x = val_data.values
val_y = wind_power_hourmean['target'].loc[cond3 & cond4].to_numpy()
train_dataset = lgb.Dataset(train_x, train_y)
val_dataset = lgb.Dataset(val_x, val_y)
import optuna

def objective(trial):
    param = {
        'objective': 'regression',
        'metric': 'mae',
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-6, 1e-2),
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'num_leaves': trial.suggest_int('num_leaves', 30, 100),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 100, 500),
        'max_bin': trial.suggest_int('max_bin', 100, 255),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.5, 0.9),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.5, 0.9),
        'seed': 42
    }
    model_lgb = lgb.train(param, train_set=train_dataset, num_boost_round=10000,
                          valid_sets=[train_dataset, val_dataset], valid_names=['train', 'valid'],
                          verbose_eval=500, early_stopping_rounds=100)
    y_pred = model_lgb.predict(val_x)
    return NMAE(val_y, y_pred)
## optuna
# Train with hyperparameters sampled from the specified ranges
# Repeat for n_trials runs while optimizing
capacity = 16000
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=30)
study.best_trial.params
An error of 9-10% was obtained.