👽 Let's Analyze UFO Sightings with Python!

Journey to Data Analyst · November 21, 2022

1. Topic Selection

One of our team members was interested in UFO appearances in the United States and suggested building a data analysis around them, so we switched from our original topic, an analysis of yellow journalism.

In the end we settled on this topic: a data analysis of UFO hotspots and sighting patterns for a UFO sightseeing tour of the United States.

2. Data Collection

We found the data at the National UFO Reporting Center (NUFORC) and collected it with Instant Data Scraper, a tool that automatically scrapes data laid out in tables.

National UFO Reporting Center

The site holds reports of UFO sightings dating from the 1400s to the present. Our team collected the data for 2000 to 2022, the period in which mobile devices became widespread, from the "UFO Report Index by Month" in NUFORC's Data Bank.

A sample of the data in that section

3. Data EDA

But I suppose preprocessing was never going to be avoidable…

First, we import the libraries we need.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

(1) Cleaning up the column names

The moment we loaded the data, the column names ran wild.

total = pd.read_csv('/Users/sung/Desktop/Python/total_ufo.csv')

total.head()

So we inspected the column names and renamed them.

total.columns
>>>
Index(['Unnamed: 0.1', 'Unnamed: 0', 'tablescraper-selected-row',
       'tablescraper-selected-row href', 'tablescraper-selected-row 2',
       'tablescraper-selected-row 3', 'tablescraper-selected-row 4',
       'tablescraper-selected-row 5', 'tablescraper-selected-row 6',
       'tablescraper-selected-row 7', 'tablescraper-selected-row 8',
       'tablescraper-selected-row 9'],
      dtype='object')

total.columns = ['drop', 'drop', 'date_time','url','city','statecode','country','shape','duration','summary','posted','image']

total.head()

As shown above, the column names are now clean!

Two of the columns are named drop. They hold duplicated index values that we were going to delete anyway, so we labeled them drop.

We also decided to delete the columns that were not suitable for the analysis.

total = total.drop(['drop','drop','url','posted','image'], axis = 1, inplace = False)

total.head()

The columns are now tidy!
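As a small illustration of the renaming step, the sketch below uses a name mapping with `rename` instead of assigning a full list, so each new name is tied to its source column explicitly. The toy frame and its values are made up for the example; only the scraper-style column names come from the output shown earlier.

```python
import pandas as pd

# Toy frame standing in for the scraped CSV (values are invented for illustration).
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "tablescraper-selected-row": ["1/31/00 22:21", "1/31/00 21:00"],
    "tablescraper-selected-row href": ["a.html", "b.html"],
    "tablescraper-selected-row 2": ["Austin", "Miami"],
})

# Map each scraper-generated name to a readable one.
rename_map = {
    "tablescraper-selected-row": "date_time",
    "tablescraper-selected-row href": "url",
    "tablescraper-selected-row 2": "city",
}
clean = raw.rename(columns=rename_map)

# Drop the leftover index columns ("Unnamed: ...") and the columns not needed for analysis.
unnamed = [c for c in clean.columns if c.startswith("Unnamed")]
clean = clean.drop(columns=unnamed + ["url"])

print(clean.columns.tolist())
```

With a mapping, a change in column order in the CSV does not silently shift the names, which can happen when a positional list is assigned to `columns`.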

Column descriptions

- date_time: when the UFO was sighted
- city: the city where the UFO was sighted
- statecode: the code identifying the US state
- country: the country where the UFO was sighted
- shape: the shape of the sighted UFO
- duration: how long the UFO was visible before it disappeared
- summary: a summary of the witnesses' statements

(2) Handling null values

After tidying the columns, we checked whether there were any null values.

total.isnull().sum()
>>>
date_time       0
city            3
statecode      16
country         0
shape         113
duration      263
summary      3312
dtype: int64

There were more null values than expected!

So we had to either drop the nulls or fill them in. Because most of our columns hold text, we used fillna to fill the string columns with the placeholder 'null' and the numeric column with 0.

total['city'] = total['city'].fillna('null')
total['statecode'] = total['statecode'].fillna('null')
total['shape'] = total['shape'].fillna('null')
total['duration'] = total['duration'].fillna(0)
total['summary'] = total['summary'].fillna('null')
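The same fills can also be written as a single `fillna` call that takes a column-to-value dictionary. This is only a sketch on a made-up three-column frame, not the code used in the project.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are invented).
toy = pd.DataFrame({
    "city": ["Austin", np.nan],
    "shape": [np.nan, "Light"],
    "duration": ["5 minutes", np.nan],
})

# One placeholder per column: 'null' for text columns, 0 for duration.
fill_values = {"city": "null", "shape": "null", "duration": 0}
toy = toy.fillna(value=fill_values)

print(toy.isnull().sum().sum())
```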

# Check that the nulls have been handled
total.isnull().sum()
>>>
date_time    0
city         0
statecode    0
country      0
shape        0
duration     0
summary      0
dtype: int64

# Also check that the 'null' text was filled in correctly
total[total['city'] == 'null']

This confirmed that the null handling had worked.

(3) Converting date_time to datetime and splitting it for closer inspection

Next, to look at time in more detail, we decided to split the date_time column into year, month, and hour.

But when we checked the dtype of the date_time column, it turned out to be text rather than datetime!

We converted it with pd.to_datetime, then extracted the year, month, and hour into new columns.

total['date_time'] = pd.to_datetime(total['date_time'], errors = 'coerce')

# Check that the conversion worked
total['date_time']
>>>
0        2000-01-31 22:21:00
1        2000-01-31 21:00:00
2        2000-01-30 23:15:00
3        2000-01-30 10:30:00
4        2000-01-29 13:00:00
                 ...        
119622   2022-06-01 03:36:00
119623   2022-06-01 03:00:00
119624   2022-06-01 01:19:00
119625   2022-06-01 00:01:00
119626   2022-06-01 00:00:00
Name: date_time, Length: 119627, dtype: datetime64[ns]

total['year'] = total['date_time'].dt.year
total['month'] = total['date_time'].dt.month
total['hour'] = total['date_time'].dt.hour

It looked as if everything had been converted and added beautifully, but…?!

The float dtype of the year, month, and hour columns bothered me.

I tried to cast them to int, but the cast failed. Googling the error message told me that it appears when the column contains nulls, and when I looked again, there they were! (These are the rows whose date_time could not be parsed: errors = 'coerce' turns such values into NaT.)

total['year'].isnull().sum()
>>>
2

It reported 2, so I filled the nulls with 0 and tried again, but that did not work either!

total['year'].fillna(0)
>>>
0         2000.0
1         2000.0
2         2000.0
3         2000.0
4         2000.0
           ...  
119622    2022.0
119623    2022.0
119624    2022.0
119625    2022.0
119626    2022.0
Name: year, Length: 119627, dtype: float64

total[total['year'].isnull()]

Leaving that mystery unsolved, I decided to drop the 2 rows instead, and that finally worked!

total['year'].dropna()
>>>
0         2000.0
1         2000.0
2         2000.0
3         2000.0
4         2000.0
           ...  
119622    2022.0
119623    2022.0
119624    2022.0
119625    2022.0
119626    2022.0
Name: year, Length: 119625, dtype: float64  # so the total number of rows is now 119,625!

# Make the change permanent with inplace = True
total.dropna(inplace = True)

# Retry the cast to int
total['year'] = total['year'].astype('int')

# Success!
total['year']
>>>
0         2000
1         2000
2         2000
3         2000
4         2000
          ... 
119622    2022
119623    2022
119624    2022
119625    2022
119626    2022
Name: year, Length: 119625, dtype: int64

# Apply the same cast to the month and hour columns
total['month'] = total['month'].astype('int')
total['hour'] = total['hour'].astype('int')

total.head()
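In hindsight, the fillna "mystery" in this section has a likely explanation: `Series.fillna` returns a new Series, and in the call shown the result was never assigned back to `total['year']`, so the column kept its nulls. The sketch below reproduces the behavior on a made-up three-value Series. It also shows pandas' nullable `Int64` dtype as one alternative that keeps the missing rows. Neither is what the project did.

```python
import pandas as pd

# Made-up input: one value cannot be parsed as a date.
raw = pd.Series(["2000-01-31 22:21", "not a date", "2022-06-01 00:00"])
parsed = pd.to_datetime(raw, errors="coerce")   # unparseable text becomes NaT

year = parsed.dt.year
print(year.dtype)            # a float dtype: NaT turns into NaN, which a plain int column cannot hold

year.fillna(0)               # returns a new Series; `year` itself is unchanged
print(year.isnull().sum())   # the null is still there

# Option 1: assign the filled result, then cast.
year_filled = year.fillna(0).astype(int)

# Option 2: keep the missing value and use the nullable integer dtype.
year_nullable = year.astype("Int64")

print(year_filled.tolist())
print(year_nullable.dtype)
```

Dropping the rows, as the post does, is also reasonable when only 2 of 119,627 rows are affected. Note that `total.dropna(inplace = True)` drops every row that has a null in any column, not only in year.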

In addition, we used pandas' dt.day_name to add the day of the week for each date.

usa['day_name'] = usa['date_time'].dt.day_name()
usa['day_name']
day_name = usa.groupby('day_name').count()
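`dt.day_name()` returns plain strings, so counts and plots come out in alphabetical or frequency order rather than calendar order. A minimal sketch of counting by weekday in Monday-to-Sunday order follows. The four dates are invented for the example.

```python
import pandas as pd

# Made-up dates: two Saturdays, one Sunday, one Monday.
dates = pd.to_datetime(pd.Series(["2022-06-04", "2022-06-05", "2022-06-06", "2022-06-11"]))
names = dates.dt.day_name()

order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Count per weekday, then put the rows in calendar order (days with no rows get 0).
counts = names.value_counts().reindex(order, fill_value=0)
print(counts)
```

The same `order` list can be passed to the `order` argument of `sns.countplot`, which the plots later in this post already use for other columns.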

(4) The dreaded duration column

The next problem was duration, which would play an important part in the analysis.

Anyone can see that the duration values are all over the place,

so here is roughly the process we went through.

  1. Use str.contains to extract the rows containing 'sec', 'min', or 'hour'.

  2. Combine the rows extracted with str.contains into a single DataFrame
    with pd.concat.

  3. Use str.replace to convert the words we wanted to change.

  4. Use Excel's Replace All feature for a first pass at removing strings not needed for the analysis
    (e.g., each, approx, about; 35 in total).

  5. Drop the 16 strings and special symbols that could not be handled, since there were few of them
    (~, ?, +/, etc.).

  6. Tidy up cells that had shifted while saving and re-reading the file between steps.

    After these steps, we multiplied the values by 60 across the board to express them in seconds.
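The steps above were done partly in pandas and partly in Excel. As a rough alternative, the sketch below pulls the first number and unit out of a duration string with a regular expression and converts it with a per-unit factor. The function name, the pattern, and the sample strings are all hypothetical. Using a separate factor for hours is an assumption that differs from the across-the-board multiplication by 60 described above.

```python
import re

# Seconds per unit; keys match the unit prefixes searched for in the text.
UNIT_SECONDS = {"sec": 1, "min": 60, "hour": 3600}

# A number (optionally with decimals) followed by a unit prefix.
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(sec|min|hour)", re.IGNORECASE)


def duration_to_seconds(text):
    """Return the duration in whole seconds, or None if no number+unit is found."""
    match = PATTERN.search(str(text))
    if match is None:
        return None
    value, unit = match.groups()
    return int(float(value) * UNIT_SECONDS[unit.lower()])


for sample in ["5 minutes", "about 30 sec", "1.5 hours", "~?"]:
    print(sample, "->", duration_to_seconds(sample))
```

Strings that yield `None` correspond to the unprocessable entries that were dropped in step 5. With pandas, the function could be applied with `total['duration'].map(duration_to_seconds)`.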

(5) Result after preprocessing

After the ordeal of preprocessing, this is the final result.

usa.info()
>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85506 entries, 0 to 85505
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date_time     85506 non-null  datetime64[ns]
 1   city          85506 non-null  object        
 2   statecode     85506 non-null  object        
 3   country       85506 non-null  object        
 4   shape         85506 non-null  object        
 5   summary       85506 non-null  object        
 6   year          85506 non-null  int64         
 7   month         85506 non-null  int64         
 8   hour          85506 non-null  int64         
 9   duration_sec  85506 non-null  int64         
 10  day           85506 non-null  int64         
 11  day_name      85506 non-null  object        
dtypes: datetime64[ns](1), int64(5), object(6)
memory usage: 7.8+ MB

Now let's get into the analysis proper.

4. Data Analysis & Visualization

(1) UFO sighting counts by year

usa['year'].value_counts()
>>>
2014    7072
2012    6230
2013    6201
2015    5551
2016    5001
2020    4998
2019    4479
2011    4224
2017    4069
2008    3803
2009    3466
2010    3400
2007    3319
2004    3087
2005    3074
2003    2761
2006    2748
2018    2583
2002    2263
2001    2247
2000    2123
2021    1998
2022     809
Name: year, dtype: int64

# UFO sighting counts by year, using a countplot
plt.figure(figsize = (16, 10))
sns.countplot(data = usa, x = 'year', palette = 'pastel')

**Insight**    

- Sightings peaked in 2014. Among complete years, 2021 had the fewest.

- 2022 (this year) is not over yet, so its count is bound to be the lowest!
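Because the last year in the data is incomplete, it helps to set it aside before looking for the minimum. The sketch below does this on a made-up five-row frame. Treating the most recent year as partial is an assumption that matches this dataset, which ends in June 2022.

```python
import pandas as pd

# Made-up sightings; the final year stands in for an incomplete year.
toy = pd.DataFrame({"date_time": pd.to_datetime(
    ["2020-03-01", "2020-11-15", "2021-07-04", "2022-01-10", "2022-06-01"])})
toy["year"] = toy["date_time"].dt.year

# Counts in chronological order instead of frequency order.
per_year = toy["year"].value_counts().sort_index()

# Assumption: the most recent year is partial, so exclude it from the comparison.
last_year = toy["date_time"].max().year
complete = per_year[per_year.index < last_year]

print(per_year)
print("Fewest sightings among complete years:", complete.idxmin())
```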

(2) UFO sighting counts by month

usa['month'].value_counts()
>>>
7     9877
8     8569
10    8512
9     8162
11    7403
6     6979
12    6394
5     6193
1     6175
4     6143
3     5883
2     5216
Name: month, dtype: int64

# UFO sighting counts by month, using a countplot
plt.figure(figsize = (16, 10))
sns.countplot(data = usa, x = 'month', palette = 'Set1')

**Insight**    

- July had the most sightings and February the fewest.

- Why? → Probably because people go outside often in summer and stay indoors in winter.

(3) Correlation analysis between columns

Data source: Korea Meteorological Administration open data portal

Weather Data Open Portal [Data: Weather Observation: Global Meteorological Messages (GTS): Surface (SYNOP): Data]

Weather data for the top 5 US states by number of UFO sightings

From 2000 through June 2022

Daily weather data (temperature, precipitation, wind speed, humidity, pressure) aggregated into monthly averages

The correlation coefficient ranges from -1 to 1, and we treat an absolute value of 0.3 or more as indicating a correlation.

From these, we picked one representative state each from the East and the West for analysis.

# East: New York State
sns.heatmap(corr_NE, 
            cmap = 'RdYlBu_r', 
            annot = True,   # show the actual values
            linewidths=.5,  # separate the cells with solid lines
            cbar_kws={"shrink": .5},# shrink the colorbar to half size
            vmin = -1,vmax = 1   # colorbar range: -1 to 1
           )
**Insight**    

- In New York, the correlation coefficient between the monthly average temperature (temperature) and the monthly number of UFO sightings (count) is 
  0.37, a weak positive correlation.

# West: California
sns.heatmap(corr_CA, 
            cmap = 'RdYlBu_r', 
            annot = True,   # show the actual values
            linewidths=.5,  # separate the cells with solid lines
            cbar_kws={"shrink": .5},# shrink the colorbar to half size
            vmin = -1,vmax = 1   # colorbar range: -1 to 1
           )
**Insight**    

- In California, the correlation coefficient between the monthly average temperature (temperature) and the monthly number of UFO sightings (count) 
  is 0.39, a weak positive correlation.

- The correlation coefficient between monthly wind speed and the monthly number of UFO sightings is -0.38, a weak negative correlation.
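The post does not show how `corr_NE` and `corr_CA` were built. One plausible way is to put the monthly sighting count and the monthly weather averages in one frame and call `DataFrame.corr()`. The sketch below does that. The numbers are invented purely to make it runnable and are not the project's data, and the column names are assumptions.

```python
import pandas as pd

# Invented monthly values for a single state (not the project's data).
monthly = pd.DataFrame({
    "count":       [10, 12, 15, 20, 26, 30],
    "temperature": [1.0, 3.0, 8.0, 14.0, 19.0, 24.0],
    "wind_speed":  [5.0, 4.8, 4.9, 4.1, 3.9, 3.5],
})

# Pairwise Pearson correlation between all numeric columns.
corr = monthly.corr()

# Apply the 0.3 rule of thumb from the text to the sighting count.
with_count = corr["count"].drop("count")
flagged = with_count[with_count.abs() >= 0.3]

print(corr.round(2))
print(flagged.index.tolist())
```

A matrix like `corr` is what `sns.heatmap` expects as its first argument in the calls above.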

(4) UFO sighting counts by day of the month

usa['day'].value_counts()
>>>
4     3834
15    3797
1     3781
20    2873
5     2839
7     2818
10    2805
12    2795
19    2774
24    2768
14    2745
3     2745
18    2722
25    2722
8     2718
22    2716
16    2687
11    2684
13    2682
17    2677
6     2651
23    2643
28    2633
21    2555
27    2524
2     2522
9     2495
26    2479
30    2467
29    2411
31    1944
Name: day, dtype: int64

# UFO sighting counts by day of the month, using a countplot
plt.figure(figsize = (16, 10))
sns.countplot(data = usa, x = 'day', palette = 'muted')

**Insight**    

- The 4th, 15th, and 1st have the most sightings, in that order.

(5) UFO sighting counts by day of the week

usa['day_name'].value_counts()
>>>
Saturday     15945
Friday       12459
Sunday       12327
Thursday     11831
Wednesday    11536
Tuesday      10909
Monday       10499
Name: day_name, dtype: int64

# UFO sighting counts by day of the week, using a countplot
plt.figure(figsize = (16, 10))
sns.countplot(data = usa, x = 'day_name', palette = 'Paired')

**Insight**    

- Saturday has the most sightings and Monday the fewest.

(6) UFO sighting counts by hour

usa['hour'].value_counts()
>>>
21    12970
22    11538
20     9877
23     7817
19     6722
18     4524
0      4511
1      2992
17     2747
5      2096
2      2085
3      1995
4      1693
16     1671
6      1661
15     1360
12     1270
10     1263
14     1244
11     1231
13     1158
9      1097
7      1055
8       929
Name: hour, dtype: int64

usa[(usa['hour'] > 6) & (usa['hour'] < 21)]
**Insight**    

- Of the 85,506 sightings in total, 36,148 fall between 07:00 and 20:59. The remaining 49,358, about 58%, occurred between 21:00 and 06:59.
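The share above can be checked directly from the hourly counts printed earlier. The sketch below copies those counts into a dictionary and applies the same filter as the code, `hour > 6` and `hour < 21`, which keeps hours 7 through 20.

```python
# Hourly counts copied from the value_counts() output above.
hour_counts = {
    21: 12970, 22: 11538, 20: 9877, 23: 7817, 19: 6722, 18: 4524,
    0: 4511, 1: 2992, 17: 2747, 5: 2096, 2: 2085, 3: 1995,
    4: 1693, 16: 1671, 6: 1661, 15: 1360, 12: 1270, 10: 1263,
    14: 1244, 11: 1231, 13: 1158, 9: 1097, 7: 1055, 8: 929,
}

all_reports = sum(hour_counts.values())

# Same condition as the DataFrame filter: hours 7 through 20.
daytime = sum(n for hour, n in hour_counts.items() if 6 < hour < 21)

# Everything else: hours 21-23 and 0-6.
night = all_reports - daytime
night_share = round(night / all_reports * 100, 1)

print(all_reports, daytime, night, night_share)
# prints: 85506 36148 49358 57.7
```

Because the filter excludes hour 6 from the daytime group, the night group runs through 06:59 rather than stopping at 05:00.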

# UFO sighting counts by hour, using a countplot
plt.figure(figsize = (16, 10))
sns.countplot(data = usa, x = 'hour')

**Insight**    

- Sightings are concentrated mainly between 21:00 and 23:00.

(7) Top 5 states by count

usa['statecode'].value_counts().head()
>>>
CA    10745
FL     5581
WA     4499
TX     3716
NY     3672
Name: statecode, dtype: int64

# Sighting counts by state, using a countplot
plt.figure(figsize = (16, 10))
sns.countplot(data = usa, x = 'statecode', palette = 'colorblind', order = usa['statecode'].value_counts().index)

**Insight**    

- California, Florida, Washington, Texas, and New York have the most sightings, in that order.

(8) Top 5 shapes by count

usa['shape'].value_counts().head()
>>>
Light       19163
Circle       9658
Triangle     8089
Fireball     6846
Sphere       6046
Name: shape, dtype: int64

# UFO sighting counts by shape, using a countplot
plt.figure(figsize = (18, 12))
sns.countplot(data = usa, x = 'shape', palette = 'muted', order = usa['shape'].value_counts().index)

**Insight**    

- Light, circle, triangle, fireball, and sphere are the most frequently reported shapes, in that order.

(9) Analysis with Duration

When we plotted the full dataset, the shape and duration_sec values varied so widely

that it was hard to show the distribution with a boxplot or a violinplot.

So we decided that the range of duration_sec needed to be restricted.

As a first, arbitrary cut, we limited it to under 120 seconds.
under_2min = usa[usa['duration_sec'] < 120]
After redefining the subset, we drew the boxplot again, but the distribution was still hard to read.

It did show, however, that many of the values fall under 60 seconds.

We also drew a violinplot, which makes the distribution easier to see.
plt.figure(figsize = (18, 12))
sns.violinplot(data = under_2min, x = 'shape', y = 'duration_sec')

**Insight**    

- Durations under 30 seconds dominate, but 60 seconds is also surprisingly common.

- Why? → Suspected UFOs pass by in an instant, so many witnesses seem to have rounded the approximate duration to one minute.



Based on this, we limited duration_sec to under 70 seconds and drew the violinplot again.
under_70sec = usa[usa['duration_sec'] < 70]
plt.figure(figsize = (18, 18))
sns.violinplot(data = under_70sec, x = 'shape', y = 'duration_sec')

→ The violinplot is now easier to read.

So, for durations under 70 seconds, how many sightings were reported at each value?
plt.figure(figsize = (18, 12))
sns.countplot(data = under_70sec, y = 'duration_sec', palette = 'Set2', 
              order = under_70sec['duration_sec'].value_counts().index)

**Insight**    

- This also shows that people often answered with conventional round durations such as 10, 15, 20, 30, or 60 seconds.

- Unsurprisingly, we took this to mean that almost no one actually timed the sighting when they saw a UFO.

We then drew a violinplot to see how the sightings are distributed across the duration values.
plt.figure(figsize = (8, 4))
sns.violinplot(data = under_70sec, x = 'duration_sec')

**Insight**    

- About 5 seconds is the most common duration, with another concentration around 60 seconds.
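The rounding pattern described in this section can also be expressed as a number: the share of reported durations that are multiples of 5 seconds. The sketch below computes it on an invented list of durations. The project's actual share is not reported in the post.

```python
import pandas as pd

# Invented durations in seconds (not the project's data).
durations = pd.Series([3, 5, 5, 10, 12, 15, 30, 30, 45, 53, 60, 60])

# Same cutoff as in the text: keep durations under 70 seconds.
under_70 = durations[durations < 70]

# Share of values that land exactly on a multiple of 5 seconds.
round_share = (under_70 % 5 == 0).mean()

# The most frequently reported values.
top_values = under_70.value_counts().head(3)

print(round(round_share, 2))
print(top_values)
```

A high share supports the reading that witnesses estimate durations rather than measure them.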

(10) Wordcloud

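# Word cloud of the summaries for the Light shape, drawn in the shape of a UFO
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

# Assumption: the original does not show how `stopwords` was defined,
# so the wordcloud package's built-in English stop word list is used here.
stopwords = set(STOPWORDS)

lig_sum = usa[usa['shape'] == 'Light']['summary']
mask = np.array(Image.open('/Users/sung/Downloads/ufo3.jpeg'))

wc = WordCloud(
        background_color = 'white',
        stopwords = stopwords,
        mask = mask)
# Join all summaries into one string; str(Series) would pass only the truncated printout
wc.generate(' '.join(lig_sum.astype(str)))
plt.figure(figsize = (8, 8))
plt.imshow(wc)
plt.axis('off')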

5. Conclusion

  • Year: between 2000 and 2022, sightings were most frequent in 2014 and least frequent in 2022 and 2021.
  • Month: across January to December, July had the most sightings and February the fewest.
  • Day: the 4th, 15th, and 1st had the most sightings, in that order.
  • Day of the week: Saturday had the most sightings and Monday the fewest.
  • Hour: across the 24 hours, 21:00 had the most sightings and 08:00 the fewest.
  • Region: among the 52 "states" in the data (a count that includes non-state entries such as Washington DC), California had the most sightings and Washington DC the fewest.
  • Shape: light was the most frequently reported UFO shape, and cross and cone were the least.
  • duration: across the full range of 1 to 54,600 seconds (about 15 hours), the 5-second duration was the most common
    and the 53-second duration the least common.
  • Region + shape: the analysis found no relationship between US state and UFO shape.
  • shape + duration:
    Sphere, diamond, and cross shapes, among others, made up the largest share of the 60-second duration bin, while flash, chevron (V-shaped), and egg shapes made up the smallest.

6. Building a UFO tour course of the western United States in Google Earth
