
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
https://www.kaggle.com/datasets/vipullrathod/fish-market (as of 2024.03)
base_path = r'~~'
file_path = os.path.join(base_path,'fish.csv')
fish_df = pd.read_csv(file_path)
| | Species | Weight | Length | Diagonal | Height | Width |
|---|---|---|---|---|---|---|
| 0 | Bream | 242.0 | 25.4 | 30.0 | 11.5200 | 4.0200 |
| 1 | Bream | 290.0 | 26.3 | 31.2 | 12.4800 | 4.3056 |
| 2 | Bream | 340.0 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | Bream | 363.0 | 29.0 | 33.5 | 12.7300 | 4.4555 |
| 4 | Bream | 430.0 | 29.0 | 34.0 | 12.4440 | 5.1340 |
fish_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 6 columns):
Column Non-Null Count Dtype
0 Species 159 non-null object
1 Weight 159 non-null float64
2 Length 159 non-null float64
3 Diagonal 159 non-null float64
4 Height 159 non-null float64
5 Width 159 non-null float64
dtypes: float64(5), object(1)
memory usage: 7.6+ KB
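Column information
Species: species of the fish (categorical)
Weight: weight of the fish (numeric)
Length: length of the fish (numeric)
Diagonal: diagonal length of the fish (numeric)
Height: height of the fish (numeric)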
+..
- Perch (Korean: 농어)
- Bream (Korean: 도미)
- Roach (Korean: 로치; common roach, a freshwater fish of the carp family)
- Pike (Korean: 강꼬치고기; northern pike)
- Smelt (Korean: 빙어)
- Parkki (Korean: 청돔)
- Whitefish (Korean: 송어)
fish_df.Species.unique()
array(['Bream', 'Roach', 'Whitefish', 'Parkki', 'Perch', 'Pike', 'Smelt'],
dtype=object)
fish_df.Species.value_counts()
| Species | count |
|---|---|
| Perch | 56 |
| Bream | 35 |
| Roach | 20 |
| Pike | 17 |
| Smelt | 14 |
| Parkki | 11 |
| Whitefish | 6 |
# Bream: 35 fish
# Length
bream_length=fish_df[fish_df.Species == 'Bream']['Length'].to_list()
bream_length
# Weight
bream_weight=fish_df[fish_df.Species == 'Bream']['Weight'].to_list()
bream_weight
>[25.4, 26.3, 26.5, 29.0, 29.0, 29.7, 29.7, 30.0, 30.0, 30.7, 31.0, 31.0, 31.5, 32.0, 32.0, 32.0, 33.0, 33.0, 33.5, 33.5, 34.0, 34.0, 34.5, 35.0, 35.0, 35.0, 35.0, 36.0, 36.0, 37.0, 38.5, 38.5, 39.5, 41.0, 41.0]
[242.0, 290.0, 340.0, 363.0, 430.0, 450.0, 500.0, 390.0, 450.0, 500.0, 475.0, 500.0, 500.0, 340.0, 600.0, 600.0, 700.0, 700.0, 610.0, 650.0, 575.0, 685.0, 620.0, 680.0, 700.0, 725.0, 720.0, 714.0, 850.0, 1000.0, 920.0, 955.0, 925.0, 975.0, 950.0]
# Smelt: 14 fish
# Length
smelt_length=fish_df[fish_df.Species == 'Smelt']['Length'].to_list()
smelt_length
# Weight
smelt_weight=fish_df[fish_df.Species == 'Smelt']['Weight'].to_list()
smelt_weight
>[9.8, 10.5, 10.6, 11.0, 11.2, 11.3, 11.8, 11.8, 12.0, 12.2, 12.4, 13.0, 14.3, 15.0]
[6.7, 7.5, 7.0, 9.7, 9.8, 8.7, 10.0, 9.9, 9.8, 12.2, 13.4, 12.2, 19.7, 19.9]
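The per-species extraction above relies on a boolean mask over the `Species` column. A minimal sketch of that pattern follows, assuming a tiny hand-built frame (`toy_df`, four rows copied from the lists above) instead of the full CSV. The helper name `column_for_species` is hypothetical and only serves the illustration.

```python
import pandas as pd

# Toy frame: four rows copied from the lists above (not the full dataset).
toy_df = pd.DataFrame({
    "Species": ["Bream", "Bream", "Smelt", "Smelt"],
    "Length": [25.4, 26.3, 9.8, 10.5],
    "Weight": [242.0, 290.0, 6.7, 7.5],
})

def column_for_species(df, species, column):
    """Return one column, as a list, for the rows whose Species equals `species`."""
    mask = df["Species"] == species        # boolean Series, True on matching rows
    return df.loc[mask, column].to_list()  # select rows and column in one step

toy_bream_length = column_for_species(toy_df, "Bream", "Length")
toy_smelt_weight = column_for_species(toy_df, "Smelt", "Weight")
print(toy_bream_length)
print(toy_smelt_weight)
```

Using `df.loc[mask, column]` selects rows and column together, which avoids the chained indexing of `df[mask][column]`. For read-only access like this, both forms return the same values.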
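Features (characteristics) of the data
A feature is a property that describes the data.
Visualize the distributions of the two species in a single plot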
plt.scatter(bream_length,bream_weight)
plt.scatter(smelt_length,smelt_weight)
plt.xlabel('length')
plt.ylabel('weight')
plt.show()

Use the k-Nearest Neighbors (KNN) algorithm to distinguish bream from smelt
length=bream_length+smelt_length
weight=bream_weight+smelt_weight
scikit-learn expects the data as a 2D array: one row per sample, one column per feature.
    length  weight
[     ↓       ↓
  [25.4, 242.0],
  [26.3, 290.0],
  [26.5, 340.0],
  ...
  [15.0, 19.9]
]
Each row is called an input, also known as a feature vector.
fish_data=[[l,w] for l,w in zip(length,weight)]
>[[25.4, 242.0], [26.3, 290.0], [26.5, 340.0], [29.0, 363.0], [29.0, 430.0], [29.7, 450.0], [29.7, 500.0], [30.0, 390.0], [
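The list comprehension above is one way to build the 2D layout. As a hedged alternative, `np.column_stack` produces the same rows as a NumPy array whose shape is easy to check. The four values below are copied from the lists above, and the names `toy_length`, `toy_weight`, `toy_rows` and `toy_array` are illustrative.

```python
import numpy as np

# Four samples copied from the lists above (two bream, two smelt).
toy_length = [25.4, 26.3, 9.8, 10.5]
toy_weight = [242.0, 290.0, 6.7, 7.5]

# Same construction as in the text: one [length, weight] pair per fish.
toy_rows = [[l, w] for l, w in zip(toy_length, toy_weight)]

# NumPy alternative: stack the two 1D lists as the columns of a 2D array.
toy_array = np.column_stack((toy_length, toy_weight))

print(toy_array.shape)  # (n_samples, n_features)
print(toy_array)
```

Both forms are accepted by scikit-learn estimators. The array version makes the `(n_samples, n_features)` shape explicit.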
Prepare the target labels: 1 for bream, 0 for smelt
fish_target=[1]* 35 +[0]*14
print(fish_target)
>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0..
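The counts 35 and 14 are hard-coded above. A small sketch of a variant that derives the label counts from the data lists themselves, so that labels and samples cannot drift apart, might look like this. The shortened `toy_` lists are assumptions for illustration, with values taken from the lists above.

```python
# Shortened lists for illustration; values are taken from the lists above.
toy_bream_length = [25.4, 26.3, 26.5]
toy_smelt_length = [9.8, 10.5]

# One label per sample, in the same order as the concatenated data:
# bream first (label 1), then smelt (label 0).
toy_target = [1] * len(toy_bream_length) + [0] * len(toy_smelt_length)

print(toy_target)
```

The order of the labels has to match the order in which the samples were concatenated (`bream_length + smelt_length`).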
from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
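To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of k-nearest-neighbors voting. It is not scikit-learn's implementation. It assumes plain Euclidean distance and a simple majority vote among the `k` closest training samples. The function name `knn_predict` and the ten-sample training set (five bream and five smelt copied from the lists above) are illustrative choices.

```python
import math
from collections import Counter

def knn_predict(train_data, train_target, query, k=5):
    """Label `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample.
    distances = [
        (math.dist(sample, query), label)
        for sample, label in zip(train_data, train_target)
    ]
    # Keep the k closest samples and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Five bream (label 1) and five smelt (label 0) copied from the lists above.
toy_data = [
    [25.4, 242.0], [26.3, 290.0], [26.5, 340.0], [29.0, 363.0], [29.0, 430.0],
    [9.8, 6.7], [10.5, 7.5], [10.6, 7.0], [11.0, 9.7], [11.2, 9.8],
]
toy_target = [1] * 5 + [0] * 5

print(knn_predict(toy_data, toy_target, [30, 600]))  # a large, heavy fish
print(knn_predict(toy_data, toy_target, [16, 30]))   # a small, light fish
```

With two classes and an odd `k`, the vote cannot tie. With an even `k`, this sketch leaves the tie-breaking to the order in which `Counter` encountered the labels.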
**KNeighborsClassifier**
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
```python
class sklearn.neighbors.KNeighborsClassifier(
    n_neighbors=5,
    *,
    weights='uniform',
    algorithm='auto',
    leaf_size=30,
    p=2,
    metric='minkowski',
    metric_params=None,
    n_jobs=None)
```
plt.scatter(bream_length, bream_weight)
plt.scatter(smelt_length, smelt_weight)
plt.scatter(30,600,marker='^')
plt.xlabel('length')
plt.ylabel('weight')
plt.show()
kn.fit(fish_data,fish_target)
kn.score(fish_data,fish_target)
>1.0
predict
kn.predict([[30,600]])
>array([1])
kn.predict([
[30,600],
[16,30],
])
>array([1,0])
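For a classifier, `score` returns the mean accuracy, i.e. the fraction of samples whose predicted label matches the given label. The sketch below checks that by hand on a reduced training set. It assumes ten samples copied from the lists above rather than all 49, and the names `toy_kn`, `toy_data` and `toy_target` are illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

# Five bream (label 1) and five smelt (label 0) copied from the lists above.
toy_data = [
    [25.4, 242.0], [26.3, 290.0], [26.5, 340.0], [29.0, 363.0], [29.0, 430.0],
    [9.8, 6.7], [10.5, 7.5], [10.6, 7.0], [11.0, 9.7], [11.2, 9.8],
]
toy_target = [1] * 5 + [0] * 5

toy_kn = KNeighborsClassifier()  # default n_neighbors=5
toy_kn.fit(toy_data, toy_target)

# Accuracy by hand: fraction of samples whose prediction matches the label.
predictions = toy_kn.predict(toy_data)
manual_accuracy = sum(int(p == t) for p, t in zip(predictions, toy_target)) / len(toy_target)

print(manual_accuracy, toy_kn.score(toy_data, toy_target))
print(toy_kn.predict([[30, 600], [16, 30]]))
```

Scoring on the same data used for fitting, as done here and in the text, only shows that the model reproduces its training labels. It says nothing about unseen fish.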