Project 2 - Seoul Crime Data (1)

Jungmin · October 13, 2022

02. Analysis Seoul Crime

1. Project Overview

2. Data Overview

import numpy as np
import pandas as pd

# Read the data
crime_raw_data = pd.read_csv("../data/02. crime_in_Seoul.csv", thousands=",", encoding = "euc-kr")
crime_raw_data.head()

	구분	죄종	발생검거	건수
0	중부	살인	발생	2.0
1	중부	살인	검거	2.0
2	중부	강도	발생	3.0
3	중부	강도	검거	3.0
4	중부	강간	발생	141.0

crime_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65534 entries, 0 to 65533
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   구분      310 non-null    object 
 1   죄종      310 non-null    object 
 2   발생검거    310 non-null    object 
 3   건수      310 non-null    float64
dtypes: float64(1), object(3)
memory usage: 2.0+ MB

info(): check an overview of the data.
The RangeIndex has 65,534 entries (0 to 65533), but only 310 rows actually hold data.
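The gap between the index length and the non-null count can be checked directly. A minimal sketch on a toy frame (the `raw` frame below is made-up data, not the actual CSV) that counts the rows in which every column is empty:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw CSV: two real rows followed by two empty rows.
raw = pd.DataFrame({
    "구분": ["중부", "중부", np.nan, np.nan],
    "건수": [2.0, 3.0, np.nan, np.nan],
})

total_rows = len(raw)                                # length of the RangeIndex
empty_rows = int(raw.isnull().all(axis=1).sum())     # rows where every column is NaN
filled_rows = total_rows - empty_rows                # rows that hold data

print(total_rows, empty_rows, filled_rows)
```

On the real data the same three numbers would be expected to come out as 65534, 65224, and 310, assuming the extra rows are empty in every column.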

crime_raw_data["죄종"].unique()

array(['살인', '강도', '강간', '절도', '폭력', nan], dtype=object)

unique() lists the distinct values in a given column.
The result includes NaN.
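To see how unique() treats missing values, here is a small sketch on a made-up Series (`crime_types` is illustrative data only). It compares the result with and without NaN and counts the missing entries:

```python
import numpy as np
import pandas as pd

# Toy column with repeated values and two missing entries.
crime_types = pd.Series(["살인", "강도", "살인", np.nan, np.nan], name="죄종")

with_nan = crime_types.unique()               # NaN is reported as one distinct value
without_nan = crime_types.dropna().unique()   # drop missing entries first
missing_count = int(crime_types.isnull().sum())

print(with_nan, without_nan, missing_count)
```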

crime_raw_data["죄종"].isnull()

0        False
1        False
2        False
3        False
4        False
         ...  
65529     True
65530     True
65531     True
65532     True
65533     True
Name: 죄종, Length: 65534, dtype: bool

crime_raw_data[crime_raw_data["죄종"].isnull()].head()

	구분	죄종	발생검거	건수
310	NaN	NaN	NaN	NaN
311	NaN	NaN	NaN	NaN
312	NaN	NaN	NaN	NaN
313	NaN	NaN	NaN	NaN
314	NaN	NaN	NaN	NaN

crime_raw_data = crime_raw_data[crime_raw_data["죄종"].notnull()]

crime_raw_data.tail()

	구분	죄종	발생검거	건수
305	수서	강간	검거	144.0
306	수서	절도	발생	1149.0
307	수서	절도	검거	789.0
308	수서	폭력	발생	1666.0
309	수서	폭력	검거	1431.0

crime_raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 310 entries, 0 to 309
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   구분      310 non-null    object 
 1   죄종      310 non-null    object 
 2   발생검거    310 non-null    object 
 3   건수      310 non-null    float64
dtypes: float64(1), object(3)
memory usage: 12.1+ KB
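As an alternative to the notnull() mask, dropna(how="all") removes only the rows that are empty in every column. A hedged sketch on toy data (`raw` and `cleaned` are illustrative names); the cast to int assumes the counts are whole numbers, which holds for the rows shown above but is not verified for the full file:

```python
import numpy as np
import pandas as pd

# Toy stand-in: two real rows and one fully empty row.
raw = pd.DataFrame({
    "죄종": ["살인", "강도", np.nan],
    "건수": [2.0, 3.0, np.nan],
})

# Drop rows where every column is NaN; copy to avoid modifying a view.
cleaned = raw.dropna(how="all").copy()

# With no NaN left, the float column can be cast to integers
# (assumption: all counts are whole numbers).
cleaned["건수"] = cleaned["건수"].astype(int)

print(cleaned)
```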

pandas pivot table

- Components: index, columns, values, aggfunc (aggregation function)
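A minimal sketch of how the four components fit together, using a made-up `sales` frame rather than the Excel file used in the examples: index picks the row labels, columns picks the column labels, values picks what is aggregated, and aggfunc picks how.

```python
import pandas as pd

# Toy sales data (illustrative only).
sales = pd.DataFrame({
    "Rep": ["A", "A", "B", "B", "B"],
    "Product": ["CPU", "Software", "CPU", "CPU", "Monitor"],
    "Price": [100, 20, 150, 50, 30],
})

table = sales.pivot_table(
    index="Rep",         # one row per Rep
    columns="Product",   # one column per Product
    values="Price",      # aggregate the Price column
    aggfunc="sum",       # sum prices within each (Rep, Product) cell
)

print(table)
```

Combinations that never occur in the data (Rep A never sold a Monitor here) come out as NaN.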

pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
     ------------------------------------- 242.1/242.1 kB 14.5 MB/s eta 0:00:00
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10
Note: you may need to restart the kernel to use updated packages.

df = pd.read_excel("../data/02. sales-funnel.xlsx")
df.head()

	Account	Name	Rep	Manager	Product	Quantity	Price	Status
0	714466	Trantow-Barrows	Craig Booker	Debra Henley	CPU	1	30000	presented
1	714466	Trantow-Barrows	Craig Booker	Debra Henley	Software	1	10000	presented
2	714466	Trantow-Barrows	Craig Booker	Debra Henley	Maintenance	2	5000	pending
3	737550	Fritsch, Russel and Anderson	Craig Booker	Debra Henley	CPU	1	35000	declined
4	146832	Kiehn-Spinka	Daniel Hilton	Debra Henley	CPU	2	65000	won

Setting index

# Use the Name column as the index

df.pivot_table(index="Name")
# pd.pivot_table(df, index="Name") is equivalent

	Account	Price	Quantity
Name
Barton LLC	740150	35000	1.000000
Fritsch, Russel and Anderson	737550	35000	1.000000
Herman LLC	141962	65000	2.000000
Jerde-Hilpert	412290	5000	2.000000
Kassulke, Ondricka and Metz	307599	7000	3.000000
Keeling LLC	688981	100000	5.000000
Kiehn-Spinka	146832	65000	2.000000
Koepp Ltd	729833	35000	2.000000
Kulas Inc	218895	25000	1.500000
Purdy-Kunde	163416	30000	1.000000
Stokes LLC	239344	7500	1.000000
Trantow-Barrows	714466	15000	1.333333

df.pivot_table(index=["Name", "Rep","Manager"])

			Account	Price	Quantity
Name	Rep	Manager
Barton LLC	John Smith	Debra Henley	740150	35000	1.000000
Fritsch, Russel and Anderson	Craig Booker	Debra Henley	737550	35000	1.000000
Herman LLC	Cedric Moss	Fred Anderson	141962	65000	2.000000
Jerde-Hilpert	John Smith	Debra Henley	412290	5000	2.000000
Kassulke, Ondricka and Metz	Wendy Yule	Fred Anderson	307599	7000	3.000000
Keeling LLC	Wendy Yule	Fred Anderson	688981	100000	5.000000
Kiehn-Spinka	Daniel Hilton	Debra Henley	146832	65000	2.000000
Koepp Ltd	Wendy Yule	Fred Anderson	729833	35000	2.000000
Kulas Inc	Daniel Hilton	Debra Henley	218895	25000	1.500000
Purdy-Kunde	Cedric Moss	Fred Anderson	163416	30000	1.000000
Stokes LLC	Cedric Moss	Fred Anderson	239344	7500	1.000000
Trantow-Barrows	Craig Booker	Debra Henley	714466	15000	1.333333

Setting values

df.pivot_table(index=["Manager", "Rep"], values="Price")

		Price
Manager	Rep
Debra Henley	Craig Booker	20000.000000
Daniel Hilton	38333.333333
John Smith	20000.000000
Fred Anderson	Cedric Moss	27500.000000
Wendy Yule	44250.000000

# Apply sum to the Price column
df.pivot_table(index=["Manager", "Rep"], values="Price", aggfunc=np.sum)

		Price
Manager	Rep
Debra Henley	Craig Booker	80000
Daniel Hilton	115000
John Smith	40000
Fred Anderson	Cedric Moss	110000
Wendy Yule	177000

# Apply multiple aggregation functions (wrap two or more in a list)
df.pivot_table(index=["Manager", "Rep"], values="Price", aggfunc=[np.sum, len])

		sum	len
		Price	Price
Manager	Rep
Debra Henley	Craig Booker	80000	4
Daniel Hilton	115000	3
John Smith	40000	2
Fred Anderson	Cedric Moss	110000	4
Wendy Yule	177000	4

Setting columns

# Use Product as the columns
df.pivot_table(index=["Manager", "Rep"], values="Price", columns="Product", aggfunc=np.sum)

	Product	CPU	Maintenance	Monitor	Software
Manager	Rep
Debra Henley	Craig Booker	65000.0	5000.0	NaN	10000.0
Daniel Hilton	105000.0	NaN	NaN	10000.0
John Smith	35000.0	5000.0	NaN	NaN
Fred Anderson	Cedric Moss	95000.0	5000.0	NaN	10000.0
Wendy Yule	165000.0	7000.0	5000.0	NaN

# Replace NaN values with fill_value
df.pivot_table(index=["Manager", "Rep"], values="Price", columns="Product", aggfunc=np.sum, fill_value = 0)

	Product	CPU	Maintenance	Monitor	Software
Manager	Rep
Debra Henley	Craig Booker	65000	5000	0	10000
Daniel Hilton	105000	0	0	10000
John Smith	35000	5000	0	0
Fred Anderson	Cedric Moss	95000	5000	0	10000
Wendy Yule	165000	7000	5000	0

# Use two or more entries for index and values
df.pivot_table(index=["Manager", "Rep", "Product"], values=["Price", "Quantity"], aggfunc=np.sum, fill_value=0)

			Price	Quantity
Manager	Rep	Product
Debra Henley	Craig Booker	CPU	65000	2
Maintenance	5000	2
Software	10000	1
Daniel Hilton	CPU	105000	4
Software	10000	1
John Smith	CPU	35000	1
Maintenance	5000	2
Fred Anderson	Cedric Moss	CPU	95000	3
Maintenance	5000	1
Software	10000	1
Wendy Yule	CPU	165000	7
Maintenance	7000	3
Monitor	5000	2

# Use two or more aggregation functions
df.pivot_table(index=["Manager", "Rep", "Product"],
               values=["Price", "Quantity"],
               aggfunc=[np.sum, np.mean],
               fill_value=0,
               margins=True)  # add the grand total (All) row

			sum	mean
			Price	Quantity	Price	Quantity
Manager	Rep	Product
Debra Henley	Craig Booker	CPU	65000	2	32500.000000	1.000000
Maintenance	5000	2	5000.000000	2.000000
Software	10000	1	10000.000000	1.000000
Daniel Hilton	CPU	105000	4	52500.000000	2.000000
Software	10000	1	10000.000000	1.000000
John Smith	CPU	35000	1	35000.000000	1.000000
Maintenance	5000	2	5000.000000	2.000000
Fred Anderson	Cedric Moss	CPU	95000	3	47500.000000	1.500000
Maintenance	5000	1	5000.000000	1.000000
Software	10000	1	10000.000000	1.000000
Wendy Yule	CPU	165000	7	82500.000000	3.500000
Maintenance	7000	3	7000.000000	3.000000
Monitor	5000	2	5000.000000	2.000000
All			522000	30	30705.882353	1.764706
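Note that the All row under mean above is 522000 / 17, that is, the mean over all underlying rows rather than the mean of the group means. A small sketch on toy data (`sales` is made up) that shows the same behavior:

```python
import pandas as pd

# Toy data: Rep A has two rows, Rep B has one.
sales = pd.DataFrame({
    "Rep": ["A", "A", "B"],
    "Price": [100, 20, 150],
})

sum_table = sales.pivot_table(index="Rep", values="Price",
                              aggfunc="sum", margins=True)
mean_table = sales.pivot_table(index="Rep", values="Price",
                               aggfunc="mean", margins=True)

# Group means are 60 (A) and 150 (B); the All row is the mean of
# all three prices, not the mean of 60 and 150.
print(sum_table)
print(mean_table)
```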

3. Organizing the Seoul Crime Data

crime_raw_data.head()

	구분	죄종	발생검거	건수
0	중부	살인	발생	2.0
1	중부	살인	검거	2.0
2	중부	강도	발생	3.0
3	중부	강도	검거	3.0
4	중부	강간	발생	141.0

# Build crime_station with a pivot table, then display it
crime_station = pd.pivot_table(
    crime_raw_data,
    index="구분",
    columns=["죄종", "발생검거"],
    aggfunc=[np.sum])
crime_station.head()

	sum
	건수
죄종	강간	강도	살인	절도	폭력
발생검거	검거	발생	검거	발생	검거	발생	검거	발생	검거	발생
구분
강남	269.0	339.0	26.0	24.0	3.0	3.0	1129.0	2438.0	2096.0	2336.0
강동	152.0	160.0	13.0	14.0	5.0	4.0	902.0	1754.0	2201.0	2530.0
강북	159.0	217.0	4.0	5.0	6.0	7.0	672.0	1222.0	2482.0	2778.0
강서	239.0	275.0	10.0	10.0	10.0	9.0	1070.0	1952.0	2768.0	3204.0
관악	264.0	322.0	10.0	12.0	7.0	6.0	937.0	2103.0	2707.0	3235.0

crime_station.columns   #multiindex

MultiIndex([('sum', '건수', '강간', '검거'),
            ('sum', '건수', '강간', '발생'),
            ('sum', '건수', '강도', '검거'),
            ('sum', '건수', '강도', '발생'),
            ('sum', '건수', '살인', '검거'),
            ('sum', '건수', '살인', '발생'),
            ('sum', '건수', '절도', '검거'),
            ('sum', '건수', '절도', '발생'),
            ('sum', '건수', '폭력', '검거'),
            ('sum', '건수', '폭력', '발생')],
           names=[None, None, '죄종', '발생검거'])

# Access a single column in the MultiIndex (showing only the first 5 rows)
crime_station["sum", "건수", "강도", "검거"][:5]

구분
강남    26.0
강동    13.0
강북     4.0
강서    10.0
관악    10.0
Name: (sum, 건수, 강도, 검거), dtype: float64
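Besides spelling out the full key, a multi-level column index can be sliced by its top level or by an inner level. A sketch on a two-level toy frame (`stations` is illustrative and reuses a few values from the table above), assuming the levels are named 죄종 and 발생검거:

```python
import pandas as pd

# Two-level columns: crime type x (arrest, occurrence).
columns = pd.MultiIndex.from_product(
    [["강도", "살인"], ["검거", "발생"]], names=["죄종", "발생검거"]
)
stations = pd.DataFrame(
    [[26.0, 24.0, 3.0, 3.0], [13.0, 14.0, 5.0, 4.0]],
    index=pd.Index(["강남", "강동"], name="구분"),
    columns=columns,
)

one_column = stations[("강도", "검거")]                  # full key -> Series
robbery = stations["강도"]                               # top level -> DataFrame
arrests = stations.xs("검거", axis=1, level="발생검거")  # inner level -> DataFrame

print(one_column, robbery, arrests, sep="\n\n")
```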

# Drop the sum and 건수 column levels
crime_station.columns = crime_station.columns.droplevel([0,1])  # remove specific levels from the multi-level columns
crime_station.columns

MultiIndex([('강간', '검거'),
            ('강간', '발생'),
            ('강도', '검거'),
            ('강도', '발생'),
            ('살인', '검거'),
            ('살인', '발생'),
            ('절도', '검거'),
            ('절도', '발생'),
            ('폭력', '검거'),
            ('폭력', '발생')],
           names=['죄종', '발생검거'])

# Confirm that the sum and 건수 levels are gone
crime_station.head()

죄종	강간	강도	살인	절도	폭력
발생검거	검거	발생	검거	발생	검거	발생	검거	발생	검거	발생
구분
강남	269.0	339.0	26.0	24.0	3.0	3.0	1129.0	2438.0	2096.0	2336.0
강동	152.0	160.0	13.0	14.0	5.0	4.0	902.0	1754.0	2201.0	2530.0
강북	159.0	217.0	4.0	5.0	6.0	7.0	672.0	1222.0	2482.0	2778.0
강서	239.0	275.0	10.0	10.0	10.0	9.0	1070.0	1952.0	2768.0	3204.0
관악	264.0	322.0	10.0	12.0	7.0	6.0	937.0	2103.0	2707.0	3235.0
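The same result can also be obtained without assigning to `.columns`, by using DataFrame.droplevel with axis=1, which returns a new frame. A sketch on a one-row toy frame (`wide` and `flat` are illustrative names):

```python
import pandas as pd

# Four-level columns shaped like the pivot table output.
columns = pd.MultiIndex.from_tuples(
    [("sum", "건수", "강도", "검거"), ("sum", "건수", "강도", "발생")],
    names=[None, None, "죄종", "발생검거"],
)
wide = pd.DataFrame([[26.0, 24.0]], index=["강남"], columns=columns)

# Drop the first two column levels; the original frame is left unchanged.
flat = wide.droplevel([0, 1], axis=1)

print(flat.columns)
```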

crime_station.index

Index(['강남', '강동', '강북', '강서', '관악', '광진', '구로', '금천', '남대문', '노원', '도봉',
       '동대문', '동작', '마포', '방배', '서대문', '서부', '서초', '성동', '성북', '송파', '수서',
       '양천', '영등포', '용산', '은평', '종로', '종암', '중랑', '중부', '혜화'],
      dtype='object', name='구분')

The index currently holds police station names.
We need to find the district (gu) each station belongs to from its name.
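One possible first step is to turn each short index label into a full station name that a geocoder can search for. The sketch below assumes the pattern "서울" + name + "경찰서" applies to every station; `station_query` is a hypothetical helper, not part of any library:

```python
def station_query(station, prefix="서울", suffix="경찰서"):
    """Build a search string such as '서울영등포경찰서' from a short label.

    Assumption: every station follows the prefix + name + suffix pattern.
    """
    return f"{prefix}{station}{suffix}"


# A few labels taken from crime_station.index.
queries = [station_query(name) for name in ["강남", "영등포", "중부"]]
print(queries)
```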

4. Installing Python Modules

pip commands

pip: the package installer for Python
pip list : show the packages installed in the current environment
pip install module_name : install the module
pip uninstall module_name : uninstall the module

!pip list  # equivalent to get_ipython().system("pip list")
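Whether a module is available can also be checked from Python itself, without shelling out to pip. A minimal sketch using only the standard library; `is_installed` and `installed_version` are hypothetical helper names:

```python
import importlib.util
from importlib import metadata


def is_installed(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print(is_installed("json"))             # standard-library module
print(installed_version("openpyxl"))    # None if openpyxl is not installed
```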

conda commands

conda list : show the installed packages
conda install module_name : install the module
conda uninstall module_name : uninstall the module
conda install -c channel_name module_name : install the module from the specified distribution channel

5. Installing the Google Maps API

conda install -c conda-forge googlemaps

import googlemaps

gmaps_key = "YOUR_API_KEY"  # replace with your own Google Maps API key; never publish a real key
gmaps = googlemaps.Client(key=gmaps_key)
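A key written directly into a notebook is easy to leak when the notebook is shared. One common alternative is to read it from an environment variable. In this sketch, `GOOGLE_MAPS_API_KEY` is an assumed variable name and `load_gmaps_key` is a hypothetical helper:

```python
import os


def load_gmaps_key(env_var="GOOGLE_MAPS_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Demonstration only: a dummy value set in-process so the sketch runs.
os.environ["GOOGLE_MAPS_API_KEY"] = "dummy-key-for-demo"
gmaps_key = load_gmaps_key()
print(gmaps_key)
```

The loaded value can then be passed to googlemaps.Client(key=gmaps_key) as in the code above.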

gmaps.geocode("서울영등포경찰서", language="ko")

[{'address_components': [{'long_name': '６０８',
    'short_name': '６０８',
    'types': ['premise']},
   {'long_name': '국회대로',
    'short_name': '국회대로',
    'types': ['political', 'sublocality', 'sublocality_level_4']},
   {'long_name': '영등포구',
    'short_name': '영등포구',
    'types': ['political', 'sublocality', 'sublocality_level_1']},
   {'long_name': '서울특별시',
    'short_name': '서울특별시',
    'types': ['administrative_area_level_1', 'political']},
   {'long_name': '대한민국',
    'short_name': 'KR',
    'types': ['country', 'political']},
   {'long_name': '150-043',
    'short_name': '150-043',
    'types': ['postal_code']}],
  'formatted_address': '대한민국 서울특별시 영등포구 국회대로 608',
  'geometry': {'location': {'lat': 37.5260441, 'lng': 126.9008091},
   'location_type': 'ROOFTOP',
   'viewport': {'northeast': {'lat': 37.5273930802915,
     'lng': 126.9021580802915},
    'southwest': {'lat': 37.5246951197085, 'lng': 126.8994601197085}}},
  'partial_match': True,
  'place_id': 'ChIJ1TimJLaffDURptXOs0Tj6sY',
  'plus_code': {'compound_code': 'GWG2+C8 대한민국 서울특별시',
   'global_code': '8Q98GWG2+C8'},
  'types': ['establishment', 'point_of_interest', 'police']}]
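The district name and the coordinates can be pulled out of a response shaped like the one above. A hedged sketch that works on a trimmed, hardcoded copy of that output (so it makes no network call); `extract_district` and `extract_location` are hypothetical helpers, and treating the `sublocality_level_1` component as the district is an assumption based on this single response:

```python
# Trimmed copy of the geocode output shown above.
sample_result = [{
    "address_components": [
        {"long_name": "국회대로", "short_name": "국회대로",
         "types": ["political", "sublocality", "sublocality_level_4"]},
        {"long_name": "영등포구", "short_name": "영등포구",
         "types": ["political", "sublocality", "sublocality_level_1"]},
        {"long_name": "서울특별시", "short_name": "서울특별시",
         "types": ["administrative_area_level_1", "political"]},
    ],
    "formatted_address": "대한민국 서울특별시 영등포구 국회대로 608",
    "geometry": {"location": {"lat": 37.5260441, "lng": 126.9008091}},
}]


def extract_district(results):
    """Return the long_name of the first sublocality_level_1 component, or None."""
    if not results:
        return None
    for component in results[0].get("address_components", []):
        if "sublocality_level_1" in component.get("types", []):
            return component["long_name"]
    return None


def extract_location(results):
    """Return (lat, lng) from the first result, or None if there is none."""
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


print(extract_district(sample_result), extract_location(sample_result))
```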