Seoul CCTV Status Data Analysis

InSung-Na · January 26, 2023


Part 04. EDA&Web Crawling Project


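This post was written with reference to Zerobase Data School study materials.

📌Seoul CCTV Status Data Analysis

📌Project goal

Analyze the number of CCTVs relative to population for each district (gu) of Seoul

Data source: Seoul Open Data Plaza

📌Project procedure

Obtain CCTV data for each Seoul district

Obtain population data

Merge the CCTV data with the population data

Clean and sort the data

Visualize the data (graphs)

Identify the overall trend

Highlight outliers

Analyze the data

📌1. Obtaining CCTV data for each Seoul district

📝Input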

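import pandas as pd

# Load the CCTV counts per district
CCTV_Seoul = pd.read_csv("../data/01. Seoul_CCTV.csv")
# Rename the first column to "구별" (district) so it matches the population table
CCTV_Seoul.rename(columns={CCTV_Seoul.columns[0]: "구별"}, inplace=True)
CCTV_Seoul.head(2)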

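🧾Output

📌2. Obtaining population data

Inspecting the raw file shows that

only the rows from the third row onward are needed, and only the district name plus the total, Korean, foreign, and elderly population columns are needed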
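To make the effect of the header and usecols arguments concrete, here is a minimal sketch on a made-up table. It uses pd.read_csv on an in-memory string instead of pd.read_excel so that it runs without an Excel file. The column names and values are hypothetical, and read_csv takes column positions in usecols where read_excel also accepts Excel letters such as "B, D".

```python
import io
import pandas as pd

# Hypothetical miniature of a sheet with two title rows above the real header.
raw = io.StringIO(
    "Report title,,,\n"
    "Period: 2017,,,\n"
    "district,total,korean,foreign\n"
    "A-gu,1000,950,50\n"
    "B-gu,2000,1800,200\n"
)

# header=2 -> the third line (0-based index 2) supplies the column names,
# and the lines above it are skipped.
# usecols keeps only the listed column positions (0, 1 and 3 here).
demo = pd.read_csv(raw, header=2, usecols=[0, 1, 3])
print(demo)
```

The same two arguments do the work in the read_excel call used in this post: one skips the title rows, the other discards the unneeded columns at load time.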

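📝Input

# header=2: use the third row as the header; usecols: Excel column letters to keep
pop_Seoul = pd.read_excel("../data/01. Seoul_Population.xls", header=2, usecols="B, D, G, J, N")
pop_Seoul.rename(
    columns={
        pop_Seoul.columns[0]: "구별",    # district
        pop_Seoul.columns[1]: "인구수",  # total population
        pop_Seoul.columns[2]: "한국인",  # Korean residents
        pop_Seoul.columns[3]: "외국인",  # foreign residents
        pop_Seoul.columns[4]: "고령자",  # elderly residents
    },
    inplace=True,
)
pop_Seoul.head()

🧾Output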

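📌3. Merging the CCTV data with the population data

3.0. Pandas basics

A module that gives Python data-handling capabilities comparable to R's

Often described as a programmable Excel

Series

Components: index and value

Holds a single data type

DataFrame

Components: index, value, and column

A collection of Series that share an index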
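The difference between the two structures is easier to see in code. The following is a small sketch with made-up values. The names s and df are only illustrative.

```python
import pandas as pd

# A Series: one index, one sequence of values, a single dtype.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s.index.tolist())   # the index labels
print(s.values.tolist())  # the values
print(s.dtype)            # one dtype for the whole Series

# A DataFrame: a shared index plus named columns; each column is a Series.
df = pd.DataFrame(
    {"count": [1, 2, 3], "name": ["x", "y", "z"]},
    index=["a", "b", "c"],
)
print(type(df["count"]).__name__)  # selecting one column yields a Series
print(df.dtypes)                   # each column keeps its own dtype
```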

3.1. CCTV data overview

3.2. Population data overview

3.3. Merging the data

3.3.0. Pandas merge(left, right, how='', on='')
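As a rough illustration of how the how and on arguments behave, the sketch below merges two small made-up tables on a shared key column. The frames left and right and their values are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": [10, 20, 30]})

# inner: keep only keys present in both tables
inner = pd.merge(left, right, how="inner", on="key")
# outer: keep every key from either table, filling gaps with NaN
outer = pd.merge(left, right, how="outer", on="key")
# left: keep every key from the left table
left_join = pd.merge(left, right, how="left", on="key")

print(inner)
print(outer)
print(left_join)
```

When how is omitted, pd.merge performs an inner join, which is what the merge on the district column later in this post relies on.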

3.3.1. Merging CCTV and population

📌4. Data cleaning and sorting

📝Input

# Recent growth rate (%): CCTVs added in 2014-2016 relative to those installed up to 2013
CCTV_Seoul["최근증가율"] = (
    (CCTV_Seoul["2014년"] + CCTV_Seoul["2015년"] + CCTV_Seoul["2016년"]) / CCTV_Seoul["2013년도 이전"] * 100
)
# Check the data
# CCTV_Seoul.sort_values(by="최근증가율", ascending=False).head()

# Drop row 0 (presumably the citywide total row), then compute the foreigner and elderly ratios (%)
pop_Seoul.drop([0], axis=0, inplace=True)
pop_Seoul["외국인비율"] = pop_Seoul['외국인'] / pop_Seoul['인구수'] * 100
pop_Seoul["고령자비율"] = pop_Seoul['고령자'] / pop_Seoul['인구수'] * 100
# Check the data
# pop_Seoul.sort_values(by="고령자비율", ascending=False).head()

# Merge on the district column, drop the per-year columns, and index by district
data_result = pd.merge(CCTV_Seoul, pop_Seoul, on='구별')
del data_result['2013년도 이전']
del data_result['2014년']
data_result.drop(['2015년', '2016년'], axis=1, inplace=True)
data_result.set_index('구별', inplace=True)
# CCTV ratio (%): total CCTV count ("소계") per 100 residents
data_result['CCTV비율'] = data_result['소계'] / data_result['인구수'] * 100
data_result.sort_values(by='CCTV비율', ascending=False).head()

🧾Output: recent growth rate (최근증가율)

🧾Output: elderly ratio (고령자비율)

🧾Output: CCTV ratio (CCTV비율)

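📌5. Data visualization

5.0. matplotlib basics

Python's standard visualization library

Its pyplot interface is modeled on MATLAB's plotting functions

Basic matplotlib functions

plt.figure(figsize=(10,6)) : create a figure; here the size is 10 x 6 inches

plt.plot(x, y) : plot y against x; scatter, bar, and others can be used instead of plot for other chart types

plt.grid() : grid lines

plt.legend() : legend

plt.title() : title

plt.xlabel(), plt.ylabel() : axis labels

plt.xlim([x1, x2]), plt.ylim([y1, y2]) : axis limits

plt.show() : display the figure

plot attributes

label : name of the curve

color : line color

linestyle : line style

marker : marker shape

markerfacecolor : marker fill color

markersize : marker size

legend attribute : loc="LOCATION"

%matplotlib inline : a Jupyter magic command that renders figures directly in the notebook, below the cell that produced them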
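The functions and attributes listed above can be combined as in the following sketch. The sine and cosine data are made up for illustration, and the Agg backend is an assumption so that the script also runs outside a notebook without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

plt.figure(figsize=(10, 6))
# label feeds the legend; the remaining keywords style the line and its markers
plt.plot(x, np.sin(x), label="sin", color="tab:blue", linestyle="dashed",
         marker="o", markerfacecolor="white", markersize=5)
plt.plot(x, np.cos(x), label="cos", color="tab:red")
plt.grid()
plt.legend(loc="upper right")
plt.title("sin and cos")
plt.xlabel("x")
plt.ylabel("y")
plt.xlim([0, 2 * np.pi])
plt.ylim([-1.2, 1.2])
ax = plt.gca()  # keep a handle on the axes for later inspection
# In an interactive session, plt.show() would display the figure here.
```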

Pandas plot
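A Series or DataFrame can also draw itself through its plot method, which wraps matplotlib. Below is a minimal sketch with made-up values that mirrors the sorted horizontal bar chart style used later in this post. The Agg backend is again an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed inside Jupyter
import pandas as pd

df = pd.DataFrame({"value": [3, 1, 2]}, index=["a", "b", "c"])

# Sort first, then draw horizontal bars; plot() returns the matplotlib Axes.
ax = df["value"].sort_values().plot(
    kind="barh",
    grid=True,
    title="sorted values",
    figsize=(5, 3),
)
```

Because the return value is a regular matplotlib Axes, any further adjustment (labels, limits, annotations) can be made with the usual matplotlib calls.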

5.1. Drawing the graph

📝Input

import matplotlib.pyplot as plt
from matplotlib import rc

plt.rcParams["axes.unicode_minus"] = False  # keep the minus sign from rendering as a broken glyph with a Korean font
rc("font", family="Malgun Gothic")  # Windows: Malgun Gothic, Mac: Arial Unicode MS
%matplotlib inline

def drawGraph():
    data_result["CCTV비율"].sort_values().plot(
        kind="barh",
        grid=True,
        title="구역별 CCTV비율",
        figsize=(7, 7),
    )

drawGraph()

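🧾Output

📌6. Identifying the overall trend

Why look at the trend: a fitted trend line shows the overall relationship, which is easier to interpret than simply listing the districts with the highest counts or the highest ratios
Linear regression: a regression method that models a linear relationship from existing data and uses it for prediction

Fitting a first-degree line with numpy

np.polyfit() : returns the coefficients of the fitted line

np.poly1d() : builds the line (a callable polynomial) from those coefficients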
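A quick way to see what these two functions return is to fit points that lie exactly on a known line. The data below are made up (y = 2x + 1), so the fit should recover a slope of about 2 and an intercept of about 1, up to floating-point error.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0  # points that lie exactly on y = 2x + 1

# Degree 1 -> two coefficients, ordered highest power first: [slope, intercept]
coeffs = np.polyfit(x, y, 1)

# poly1d turns the coefficients into a callable polynomial
line = np.poly1d(coeffs)

print(coeffs)
print(line(4.0))  # evaluate the fitted line at x = 4
```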

📝Input

import numpy as np
fp1 = np.polyfit(data_result['인구수'], data_result['소계'], 1)
f1 = np.poly1d(fp1)
fx = np.linspace(100000, 700000, 100)

def drawGraph():
    plt.figure(figsize=(7,5))
    plt.scatter(data_result['인구수'], data_result['소계'], s=25)
    plt.plot(fx, f1(fx), ls='dashed', lw=3, color='g')
    plt.xlabel('인구수')
    plt.ylabel('CCTV')
    plt.grid()
    plt.show()
    
drawGraph()

🧾Output

📌7. Highlighting outliers (data that deviates from the trend)
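One way to quantify "deviating from the trend" is the residual: the observed value minus the value predicted by the fitted line. The sketch below applies the idea to a small made-up table. The column names population and cctv and all numbers are hypothetical, not the Seoul data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"population": [100, 200, 300, 400, 500], "cctv": [12, 18, 45, 41, 49]},
    index=["A", "B", "C", "D", "E"],
)

# Fit a straight line cctv ~ slope * population + intercept
slope, intercept = np.polyfit(toy["population"], toy["cctv"], 1)

# Residual = observed - predicted; positive means above the trend line
toy["residual"] = toy["cctv"] - (slope * toy["population"] + intercept)

# Sorting by residual puts the strongest positive outliers first
ranked = toy.sort_values(by="residual", ascending=False)
print(ranked)
```

Rows at the top of ranked sit farthest above the line and rows at the bottom farthest below it, which is the ordering used to pick the districts to label on the scatter plot.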

📝Input

# Save the data
data_result.to_csv("../data/0.1 CCTV_result.csv", sep=",", encoding='utf-8')

# Highlight the outliers
fp1 = np.polyfit(data_result['인구수'], data_result['소계'], 1)
f1 = np.poly1d(fp1)
fx = np.linspace(100000, 700000, 100)

# Residual: actual CCTV count minus the count predicted by the fitted line
data_result['오차'] = data_result['소계'] - f1(data_result['인구수'])

df_sort_f = data_result.sort_values(by='오차', ascending=False)  # largest positive residuals first
df_sort_t = data_result.sort_values(by='오차', ascending=True)   # largest negative residuals first

from matplotlib.colors import ListedColormap
color_step = ['#e74c3c', '#2ecc71', '#95a9a6', '#2ecc71', '#3498db', '#3489db'] # hex color codes
my_cmap = ListedColormap(color_step)

def drawGraph():
    plt.figure(figsize=(10,8))
    plt.scatter(data_result['인구수'], data_result['소계'],
                s=25, c=data_result['오차'], cmap=my_cmap)
    plt.plot(fx, f1(fx), ls='dashed', lw=3, color='g')
    for n in range(5):
        # Top 5 (farthest above the line); .iloc selects by position, since the index holds district names
        plt.text(
            df_sort_f['인구수'].iloc[n] * 1.02,
            df_sort_f['소계'].iloc[n] * 0.98,
            df_sort_f.index[n],
            fontsize=10,
        )
        # Bottom 5 (farthest below the line)
        plt.text(
            df_sort_t['인구수'].iloc[n] * 1.02,
            df_sort_t['소계'].iloc[n] * 0.98,
            df_sort_t.index[n],
            fontsize=10,
        )
    plt.xlabel('인구수')
    plt.ylabel('CCTV')
    plt.colorbar()
    plt.grid()
    plt.show()

drawGraph()

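🧾Output

🧾0.1 CCTV_result.csv

📌8. Summary and analysis

Common-sense expectation: the larger the population, the more CCTVs
In Seoul, population and CCTV count are positively correlated
Given the main purposes and the function of CCTV, it is not advisable to analyze CCTV counts by population alone

CCTV purposes: crime prevention, surveillance, prevention of man-made disasters, etc.

CCTV function: a camera can only observe the space in its field of view
(flat areas with few obstacles -> a few cameras suffice)
(alleys with many obstacles -> many cameras are needed)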
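The strength of a positive relationship like the one described above can be expressed as a Pearson correlation coefficient. The sketch below computes it on a small made-up table. The column names and values are hypothetical. For the real data, the analogous call would presumably be data_result['인구수'].corr(data_result['소계']), which is not run here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"population": [100, 200, 300, 400, 500], "cctv": [12, 18, 45, 41, 49]}
)

# Pearson correlation via pandas (Series.corr defaults to method="pearson")
r = toy["population"].corr(toy["cctv"])

# The same quantity via numpy: off-diagonal entry of the 2x2 correlation matrix
r_np = np.corrcoef(toy["population"], toy["cctv"])[0, 1]

print(r, r_np)
```

A value close to +1 indicates a strong positive linear relationship, but it says nothing about other factors such as terrain or street layout, which is the caveat raised in the summary.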
