02_Split Data

Kyungtaek Oh · January 5, 2022
Machine Learning

Target attribute

  • The goal is to classify each record as DoH or Non-DoH.

The target attribute is 'DoH'.
Set Y to the 'DoH' column.
Set X to all columns except 'DoH'.

X and Y need to be set for all five categories.

# Y is the target column; X is every other column.
Y_chrome = DF_chrome['DoH']
X_chrome = DF_chrome.drop('DoH', axis=1)
Y_firefox = DF_firefox['DoH']
X_firefox = DF_firefox.drop('DoH', axis=1)
Y_dns2tcp = DF_dns2tcp['DoH']
X_dns2tcp = DF_dns2tcp.drop('DoH', axis=1)
Y_dnscat2 = DF_dnscat2['DoH']
X_dnscat2 = DF_dnscat2.drop('DoH', axis=1)
Y_iodine = DF_iodine['DoH']
X_iodine = DF_iodine.drop('DoH', axis=1)
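
The ten assignments above all follow one pattern, so they can also be written as a loop. The following is a minimal sketch, assuming the DataFrames are collected in a dictionary. The names `frames` and `split_features_target`, the toy column `Duration`, and the toy values are hypothetical and only serve as an illustration.

```python
import pandas as pd


def split_features_target(df, target="DoH"):
    """Return (X, Y): every column except the target, and the target column."""
    return df.drop(columns=target), df[target]


# Toy stand-ins for DF_chrome, DF_iodine, ... (hypothetical columns and values).
frames = {
    "chrome": pd.DataFrame({"Duration": [0.1, 0.4, 0.2], "DoH": [True, True, False]}),
    "iodine": pd.DataFrame({"Duration": [1.5, 2.0, 0.9], "DoH": [True, True, True]}),
}

X, Y = {}, {}
for name, df in frames.items():
    X[name], Y[name] = split_features_target(df)

print(X["chrome"].columns.tolist())
print(Y["iodine"].tolist())
```

Keeping the results in dictionaries keyed by category name means a sixth category could be added without writing two more assignments.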

Split into two parts

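Randomly select 70% of each category's data for training and keep the remaining 30% for testing. Splitting each category separately keeps all five categories represented in both sets, which should give better classifiers and results.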

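from sklearn.model_selection import train_test_split

# test_size=0.3 keeps 70% for training; stratify preserves the DoH ratio in both parts.
X_chrome_training, X_chrome_testing, Y_chrome_training, Y_chrome_testing = train_test_split(X_chrome, Y_chrome, test_size=0.3, stratify=Y_chrome, random_state=1, shuffle=True)

Repeat the same split for firefox, dns2tcp, dnscat2, and iodine.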
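
The remaining four splits take the same arguments as the chrome split, so one way to write them is a loop over the categories. This is only a sketch on synthetic data: `make_toy_frame`, `frames`, `splits`, and the `Duration` column are hypothetical names, and the real DataFrames would take the place of the toy ones.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)


def make_toy_frame(n_rows):
    """Build a hypothetical frame with one feature and a boolean 'DoH' target."""
    return pd.DataFrame({
        "Duration": rng.random(n_rows),
        "DoH": np.arange(n_rows) % 2 == 0,  # half True, half False
    })


categories = ["chrome", "firefox", "dns2tcp", "dnscat2", "iodine"]
frames = {name: make_toy_frame(20) for name in categories}

splits = {}
for name, df in frames.items():
    X, Y = df.drop("DoH", axis=1), df["DoH"]
    # Same arguments as the chrome split: 30% held out for testing, stratified on Y.
    splits[name] = train_test_split(
        X, Y, test_size=0.3, stratify=Y, random_state=1, shuffle=True
    )

# Each entry holds (X_training, X_testing, Y_training, Y_testing).
X_train, X_test, Y_train, Y_test = splits["firefox"]
print(len(X_train), len(X_test))
```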

Combine the training and testing sets of all five categories

import pandas as pd

# DataFrame.append and Series.append were removed in pandas 2.0; pd.concat stacks the parts the same way.
DF_X_training = pd.concat([X_chrome_training, X_firefox_training, X_dns2tcp_training, X_dnscat2_training, X_iodine_training])
DF_X_testing = pd.concat([X_chrome_testing, X_firefox_testing, X_dns2tcp_testing, X_dnscat2_testing, X_iodine_testing])
DF_Y_training = pd.concat([Y_chrome_training, Y_firefox_training, Y_dns2tcp_training, Y_dnscat2_training, Y_iodine_training])
DF_Y_testing = pd.concat([Y_chrome_testing, Y_firefox_testing, Y_dns2tcp_testing, Y_dnscat2_testing, Y_iodine_testing])

Check the sizes of the training and testing sets

print(DF_X_training.shape[0])
print(DF_X_testing.shape[0])
print(DF_Y_training.shape[0])
print(DF_Y_testing.shape[0])
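
Beyond raw row counts, it can help to confirm that X and Y line up and to see how large the training share actually is. The helper below is a rough sketch, not part of the original workflow: `summarize_split` and the toy data are hypothetical, and it assumes the 'DoH' column holds booleans (or 0/1), so that its mean is the share of DoH rows.

```python
import pandas as pd


def summarize_split(X_train, X_test, Y_train, Y_test):
    """Return row counts, the training share, and the DoH share of each set."""
    # Features and labels of the same set must have the same number of rows.
    assert X_train.shape[0] == Y_train.shape[0]
    assert X_test.shape[0] == Y_test.shape[0]
    total = X_train.shape[0] + X_test.shape[0]
    return {
        "training_rows": X_train.shape[0],
        "testing_rows": X_test.shape[0],
        "training_share": X_train.shape[0] / total,
        "doh_share_training": float(Y_train.mean()),
        "doh_share_testing": float(Y_test.mean()),
    }


# Hypothetical toy data: 10 rows with alternating labels, first 7 rows used for training.
toy = pd.DataFrame({"Duration": range(10), "DoH": [True, False] * 5})
X, Y = toy.drop("DoH", axis=1), toy["DoH"]
summary = summarize_split(X.iloc[:7], X.iloc[7:], Y.iloc[:7], Y.iloc[7:])
print(summary)
```

With the real data, the same function could be called on `DF_X_training`, `DF_X_testing`, `DF_Y_training`, and `DF_Y_testing`.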
