SQL을 활용한 프로젝트 전에 연습용으로 Google Analytics Sample Dataset에서 인기 있는 노트북 몇개를 필사해보기로 했다. Top2 2개를 참고해서 한글로 번역하고, 시각화를 내가 편한 방식으로 일부 바꿔봤다.
Kaggle notebok 공유는 처음인데 벌써 감사히도 누군가 vote를 눌러줬다. 초심자를 위한 고수들의 따뜻한 칭찬 같은 느낌.
# BigQuery
from google.cloud import bigquery
# viz libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots
# viz settings
%matplotlib inline
sns.set()
init_notebook_mode(connected=True)
kaggle kernel에서 빅쿼리 데이터셋을 연결할 수 있다.
# Client 객체 생성
client = bigquery.Client()
# 데이터셋 참조경로(reference) 설정
# Kaggle커널에서는 bq_helper를 대신 사용할 수도 있다.
dataset_ref = client.dataset('google_analytics_sample', project='bigquery-public-data')
# 해당 경로로부터 데이터셋 추출
dataset = client.get_dataset(dataset_ref)
데이터셋은 여러 개의 테이블로 구성되어 있다.
# 데이터셋을 테이블 단위로 보기
tables = list(client.list_tables(dataset))
table_names = sorted([t.table_id for t in tables])
# 테이블 단위로 간단한 정보 확인
print(f"""table 개수 : {len(tables)}
tables : {", ".join(table_names[:3])}, ...
date 범위 : {table_names[0][-8:]} ~ {table_names[-1][-8:]}""")
데이터셋에서 하나의 테이블을 추출하고 일부를 출력해서 컬럼 정보와 구성을 확인할 수 있다.
# 테이블 경로 생성
table_ref_temp = dataset_ref.table(table_names[0])
# 테이블 가져오기
table_temp = client.get_table(table_ref_temp)
# 컬럼 확인
client.list_rows(table_temp, max_results=5).to_dataframe()
| visitorId | visitNumber | visitId | visitStartTime | date | totals | trafficSource | device | geoNetwork | customDimensions | hits | fullVisitorId | userId | channelGrouping | socialEngagementType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NaN | 1 | 1470046245 | 1470046245 | 20160801 | {'visits': 1, 'hits': 24, 'pageviews': 17, ... | {'referralPath': None, 'campaign': '(not set)'... | {'browser': 'Firefox', 'browserVersion': 'not ... | {'continent': 'Europe', 'subContinent': 'Weste... | [{'index': 4, 'value': 'EMEA'}] | [{'hitNumber': 1, 'time': 0, 'hour': 3, 'minut... | 895954260133011192 | None | Organic Search | Not Socially Engaged |
| NaN | 1 | 1470084717 | 1470084717 | 20160801 | {'visits': 1, 'hits': 24, 'pageviews': 18, ... | {'referralPath': None, 'campaign': '(not set)'... | {'browser': 'Internet Explorer', 'browserVersi... | {'continent': 'Americas', 'subContinent': 'Nor... | [{'index': 4, 'value': 'North America'}] | [{'hitNumber': 1, 'time': 0, 'hour': 13, 'minu... | 0288478011259077136 | None | Direct | Not Socially Engaged |
| NaN | 3 | 1470078988 | 1470078988 | 20160801 | {'visits': 1, 'hits': 27, 'pageviews': 17, ... | {'referralPath': None, 'campaign': '(not set)'... | {'browser': 'Safari', 'browserVersion': 'not a... | {'continent': 'Americas', 'subContinent': 'Nor... | [{'index': 4, 'value': 'North America'}] | [{'hitNumber': 1, 'time': 0, 'hour': 12, 'minu... | 6440789996634275026 | None | Organic Search | Not Socially Engaged |
| NaN | 4 | 1470075581 | 1470075581 | 20160801 | {'visits': 1, 'hits': 27, 'pageviews': 19, ... | {'referralPath': '/', 'campaign': '(not set)',... | {'browser': 'Chrome', 'browserVersion': 'not a... | {'continent': 'Americas', 'subContinent': 'Nor... | [{'index': 4, 'value': 'North America'}] | [{'hitNumber': 1, 'time': 0, 'hour': 11, 'minu... | 8520115029387302083 | None | Referral | Not Socially Engaged |
| NaN | 30 | 1470099026 | 1470099026 | 20160801 | {'visits': 1, 'hits': 27, 'pageviews': 17, ... | {'referralPath': None, 'campaign': '(not set)'... | {'browser': 'Chrome', 'browserVersion': 'not a... | {'continent': 'Americas', 'subContinent': 'Nor... | [{'index': 4, 'value': 'North America'}] | [{'hitNumber': 1, 'time': 0, 'hour': 17, 'minu... | 6792260745822342947 | None | Organic Search | Not Socially Engaged |
결과 : Google Analytics 데이터로 totals, trafficSource, device, geoNetwork, customDimensions, hits 컬럼은 nested data(하나의 스키마 안에 여러 개의 데이터가 중첩된 구조)인 것을 확인할 수 있습니다.
아래는 위의 nested data 컬럼 내에 어떤 데이터가 포함되었는지 확인하는 과정입니다.
def format_schema_field(schema_field, indent=0):
"""
빅쿼리 스키마의 (중첩된 구조 내부까지) 필드 이름과 데이터 타입을 출력하는 함수
"""
indent_str = " " * indent
field_info = f"{indent_str}{schema_field.name} ({schema_field.field_type})"
if schema_field.mode != "NULLABLE":
field_info += f" - {schema_field.mode}"
if schema_field.description:
field_info += f" - {schema_field.description}"
nested_indent = indent + 2
if schema_field.field_type == "RECORD":
for sub_field in schema_field.fields:
field_info += "\n" + format_schema_field(sub_field, nested_indent)
return field_info
# Display schemas
print("SCHEMA field for the 'totals' column:\n")
print(format_schema_field(table_temp.schema[5]))
print()
print("\nSCHEMA field for the 'trafficSource' column:\n")
print(format_schema_field(table_temp.schema[6]))
print()
print("\nSCHEMA field for the 'device' column:\n")
print(format_schema_field(table_temp.schema[7]))
print()
print("\nSCHEMA field for the 'geoNetwork' column:\n")
print(format_schema_field(table_temp.schema[8]))
print()
print("\nSCHEMA field for the 'customDimensions' column:\n")
print(format_schema_field(table_temp.schema[9]))
print()
print("\nSCHEMA field for the 'hits' column:\n")
print(format_schema_field(table_temp.schema[10]))
데이터셋의 구조와 대략적인 내용을 파악했습니다. 이제 활용 가능한 데이터의 종류를 알고 있기 때문에 어떤 데이터를 분석해서 어떤 정보를 얻을 것인지 구체적인 목표 설정이 가능합니다.
# UNNEST(hits) : hits를 컬럼별로 분리시키기
# _TABLE_SUFIX : table 이름이 ~로 끝나는 경우
# hits.hitNumber=1 : 랜딩페이지로 한정
query = f"""
SELECT
hits.page.pagePath AS landing_page,
COUNT(*) AS views,
SUM(totals.bounces)/COUNT(*) AS bounce_rate
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801' AND
hits.type = 'PAGE' AND
hits.hitNumber=1
GROUP BY landing_page
ORDER BY 2 DESC
LIMIT 10
"""
result_bounce = client.query(query).result().to_dataframe()
result_bounce['label'] = result_bounce['landing_page'].str.split('/').str[-1]
result_bounce
| index | landing_page | views | bounce_rate | label |
|---|---|---|---|---|
| 0 | /home | 612140 | 0.495475 | home |
| 1 | /google+redesign/shop+by+brand/youtube | 81512 | 0.562347 | youtube |
| 2 | /google+redesign/apparel/men++s/men++s+t+shirts | 20685 | 0.441141 | men++s+t+shirts |
| 3 | /signin.html | 16296 | 0.361622 | signin.html |
| 4 | /google+redesign/apparel/mens/mens+t+shirts | 12691 | 0.475061 | mens+t+shirts |
| 5 | /basket.html | 9431 | 0.454565 | basket.html |
| 6 | /google+redesign/drinkware | 8833 | 0.533794 | drinkware |
| 7 | /google+redesign/bags | 8608 | 0.451208 | bags |
| 8 | /google+redesign/apparel/men++s/men++s+outerwear | 6345 | 0.432782 | men++s+outerwear |
| 9 | /google+redesign/apparel | 6326 | 0.448150 | apparel |
query = f"""
SELECT
hits.page.pagePath as page,
COUNT(*) AS views,
SUM(totals.bounces)/COUNT(*) AS exit_rate
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801' AND
hits.type='PAGE'
GROUP BY 1
ORDER BY 2 desc
LIMIT 10
"""
result_exit = client.query(query).result().to_dataframe()
result_exit['label'] = result_exit['page'].str.split('/').str[-1]
result_exit
| index | page | views | exit_rate | label |
|---|---|---|---|---|
| 0 | /home | 981285 | 0.309920 | home |
| 1 | /basket.html | 209360 | 0.020524 | basket.html |
| 2 | /google+redesign/shop+by+brand/youtube | 145026 | 0.316198 | youtube |
| 3 | /signin.html | 101299 | 0.058322 | signin.html |
| 4 | /store.html | 93551 | 0.025505 | store.html |
| 5 | /google+redesign/apparel/men++s/men++s+t+shirts | 67471 | 0.135392 | men++s+t+shirts |
| 6 | /asearch.html | 62380 | 0.031677 | asearch.html |
| 7 | /google+redesign/electronics | 56839 | 0.047116 | electronics |
| 8 | /google+redesign/apparel | 56552 | 0.050272 | apparel |
| 9 | /google+redesign/bags | 53686 | 0.072458 | bags |
위에서 이탈률(bounce_rate)을 확인했던 페이지의 종료율(exit_rate)를 확인했을 떄
# 이탈률 시각화
px.bar(result_bounce.sort_values(by='bounce_rate'), x='bounce_rate', y='label',
width=800, height=400, title='Top10 랜딩페이지의 이탈률(Bounce Rate)')

# 종료율 시각화
px.bar(result_exit.sort_values(by='exit_rate'), x='exit_rate', y='label',
width=800, height=400, title='Top10 방문페이지의 종료율(Exit Rate)')

Insight :
세션별 접속한 브라우저와 기기 정보를 확인하면 호환성 문제를 확인할 수 있다.
# 쿼리문
query = f"""
SELECT
device.Browser AS browser,
COUNT(*) AS sessions,
SUM(totals.bounces)/COUNT(*) AS exit_rate
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801'
GROUP BY 1
ORDER BY 2 desc
LIMIT 10
"""
result_browser = client.query(query).result().to_dataframe()
display(result_browser)
# 발견한 사실
print(f"""브라우저 : 접속 브라우저 별 세션수(session)와 종료율(exit_rate)를 확인했을 때
- Chrome 이 다른 브라우저에 비해 6% 이상 exit_rate이 낮다.""")
# 시각화
fig = go.Figure(
data = [
go.Bar(x=result_browser['browser'], y=result_browser['sessions'],
yaxis='y', offsetgroup=1, name='sessions'),
go.Bar(x=result_browser['browser'], y=result_browser['exit_rate'],
yaxis='y2', offsetgroup=2, name='exit_rate')
],
layout = {
'yaxis' : {'title' : 'sessions'},
'yaxis2' : {'title' : 'exit_rate', 'overlaying': 'y', 'side': 'right'}
},
)
fig.update_layout(
barmode='group', width=800, height=400, title="접속 브라우저 별 세션수와 종료율"
)
fig.show()
| index | browser | sessions | exit_rate |
|---|---|---|---|
| 0 | Chrome | 3197849 | 0.091878 |
| 1 | Safari | 629906 | 0.157622 |
| 2 | Firefox | 133880 | 0.168195 |
| 3 | Internet Explorer | 62405 | 0.190369 |
| 4 | Edge | 38063 | 0.150934 |
| 5 | Android Webview | 25979 | 0.174872 |
| 6 | Safari (in-app) | 19037 | 0.213532 |
| 7 | Opera | 15439 | 0.209988 |
| 8 | Opera Mini | 12767 | 0.314639 |
| 9 | UC Browser | 5807 | 0.283968 |

# 쿼리문
query = f"""
SELECT
device.deviceCategory AS device,
COUNT(*) AS sessions,
SUM(totals.bounces)/COUNT(*) AS exit_rate
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801'
GROUP BY 1
ORDER BY 2 desc
LIMIT 10
"""
result_device = client.query(query).result().to_dataframe()
display(result_device)
# 발견한 사실
print(f"""기기 : 접속 기기 별 세션수(session)와 종료율(exit_rate)을 확인했을 때
- desktop을 사용한 방문이 가장 많다.
- mobile의 종료율이 3% 정도 더 높다.""")
# 시각화
fig = go.Figure(
data = [
go.Bar(x=result_device['device'], y=result_device['sessions'],
yaxis='y', offsetgroup=1, name='sessions'),
go.Bar(x=result_device['device'], y=result_device['exit_rate'],
yaxis='y2', offsetgroup=2, name='exit_rate')
],
layout = {
'yaxis' : {'title' : 'sessions'},
'yaxis2' : {'title' : 'exit_rate', 'overlaying': 'y', 'side': 'right'}
},
)
fig.update_layout(
barmode='group', width=800, height=400, title="접속 기기 별 세션수와 종료율"
)
fig.show()

# 쿼리문
query = f"""
SELECT
device.deviceCategory AS device,
device.Browser AS browser,
COUNT(*) AS sessions,
SUM(totals.bounces)/COUNT(*) AS exit_rate
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801'
GROUP BY 1, 2
ORDER BY 3 desc
LIMIT 20
"""
result_browser_device = client.query(query).result().to_dataframe()
result_browser_device['browser_device'] = (
result_browser_device["browser"].astype(str) + "-" + result_browser_device["device"].astype(str)
)
result_browser_device = result_browser_device.sort_values(by=['browser', 'device'])
# 발견한 사실
print(f"""앞에서 확인한
- Chrome 브라우저(exit_rate이 상대적으로 낮았음) 중에서도 mobile기기의 경우 desktop보다 exit_rate이 4% 더 높았다.
- desktop safari를 사용한 경우 exit_rate이 다른 기기-브라우저에 비해 높다.""")
# 시각화
fig = go.Figure(
data = [
go.Bar(x=result_browser_device['browser_device'], y=result_browser_device['sessions'],
yaxis='y', offsetgroup=1, name='sessions'),
go.Bar(x=result_browser_device['browser_device'], y=result_browser_device['exit_rate'],
yaxis='y2', offsetgroup=2, name='exit_rate')
],
layout = {
'yaxis' : {'title' : 'sessions'},
'yaxis2' : {'title' : 'exit_rate', 'overlaying': 'y', 'side': 'right'}
},
)
fig.update_layout(
barmode='group', width=800, height=400, title="접속 브라우저/기기 별 세션수와 종료율"
)
fig.show()

Insight :
Google Analytics에서 사용되는 각 트래픽 소스
# 쿼리문
query = """
SELECT
trafficSource.medium AS medium,
count(*) AS sessions,
SUM(totals.bounces)/COUNT(*) AS exit_rate,
SUM(totals.transactions) AS transactions,
SUM(totals.totalTransactionRevenue)/1000000 AS total_revenue,
SUM(totals.transactions)/COUNT(*) AS conversion_rate
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801'
GROUP BY 1
ORDER BY 2 desc
"""
result_traffic = client.query(query).result().to_dataframe()
display(result_traffic.head(10))
# 시각화
fig = go.Figure(
data = [
go.Bar(x=result_traffic['medium'], y=result_traffic['total_revenue'],
yaxis='y', offsetgroup=1, name='total_revenue'),
go.Bar(x=result_traffic['medium'], y=result_traffic['conversion_rate'],
yaxis='y2', offsetgroup=2, name='conversion_rate')
],
layout = {
'yaxis' : {'title' : 'sessions'},
'yaxis2' : {'title' : 'exit_rate', 'overlaying': 'y', 'side': 'right'}
},
)
fig.update_layout(
barmode='group', width=800, height=400, title="트래픽 소스별 매출액과 전환율"
)
fig.show()

Insight :
(Conversion path)
(Bottlenecks)
전환과정중 고객들이 이탈하는 곳 파악하기
hits.eCommerceAction.action_type
query = f"""
SELECT
CASE
WHEN hits.eCommerceAction.action_type = '1' THEN 'Click through of procut lists'
WHEN hits.eCommerceAction.action_type = '2' THEN 'Product detail views'
WHEN hits.eCommerceAction.action_type = '5' THEN 'Check out'
WHEN hits.eCommerceAction.action_type = '6' THEN 'Completed purchase'
END AS action,
COUNT(fullVisitorID) AS users
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801' AND
(
hits.eCommerceAction.action_type != '0' AND
hits.eCommerceAction.action_type != '3' AND
hits.eCommerceAction.action_type != '4'
)
GROUP BY 1
ORDER BY 2 desc
"""
result = client.query(query).result().to_dataframe()
display(result.head(10))
| index | action | users |
|---|---|---|
| 0 | Click through of product lists | 445760 |
| 1 | Product detail views | 362607 |
| 2 | Check out | 248936 |
| 3 | Completed purchase | 74457 |
# 퍼널 분석 시각화
funnel_graph = go.Figure(
go.Funnel(
x = result['users'],
y = result['action'],
textposition = 'inside',
textinfo = 'value+percent initial'
),
layout=go.Layout(height=400, width=800)
)
funnel_graph.update_layout(title_text = 'Google Merchandise Store 퍼널 분석')
funnel_graph.show()

Insight :
query = f"""
SELECT
product.v2ProductCategory AS category,
SUM(totals.transactions) AS transactions,
SUM(totals.totalTransactionRevenue)/1000000 AS total_revenue
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801'
GROUP BY 1
ORDER BY 3 DESC
LIMIT 10
"""
cat_result = client.query(query).result().to_dataframe()
px.bar(cat_result.sort_values(by='total_revenue'),
x='total_revenue', y='category',
width=800, height=400, title="상품 카테고리별 매출액")

query = f"""
WITH
daily_mens_tshirs_transactions AS
(
SELECT
date,
SUM(totals.transactions) AS transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801' AND
product.v2ProductCategory = "Home/Apparel/Men's/Men's-T-Shirts/"
GROUP BY 1
ORDER BY 1
)
SELECT
date,
AVG(transactions) OVER (
ORDER BY date
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
) AS avg_transactions
FROM
daily_mens_tshirs_transactions
"""
result = client.query(query).result().to_dataframe()
result['date'] = pd.to_datetime(result['date'])
px.area(result, 'date', 'avg_transactions',
title='<Men\'s T-Shirts> Conversions 7-Day Moving Average',
height=400, width=800)

query = f"""
WITH
daily_drinkware_transactions AS
(
SELECT
date,
SUM(totals.transactions) AS transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801' AND
product.v2ProductCategory = "Home/Drinkware/Water Bottles and Tumblers/"
GROUP BY 1
ORDER BY 1
)
SELECT
date,
AVG(transactions) OVER (
ORDER BY date
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
) AS avg_transactions
FROM
daily_drinkware_transactions
"""
result = client.query(query).result().to_dataframe()
result['date'] = pd.to_datetime(result['date'])
px.area(result, 'date', 'avg_transactions',
title='<drinkware> Conversions 7-Day Moving Average',
height=400, width=800)

query = f"""
WITH
daily_electronics_transactions AS
(
SELECT
date,
SUM(totals.transactions) AS transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801' AND
product.v2ProductCategory = "Home/Electronics/"
GROUP BY 1
ORDER BY 1
)
SELECT
date,
AVG(transactions) OVER (
ORDER BY date
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
) AS avg_transactions
FROM
daily_electronics_transactions
"""
result = client.query(query).result().to_dataframe()
result['date'] = pd.to_datetime(result['date'])
px.area(result, 'date', 'avg_transactions',
title='<Electronics> Conversions 7-Day Moving Average',
height=400, width=800)

query = f"""
WITH
daily_office_transactions AS
(
SELECT
date,
SUM(totals.transactions) AS transactions
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`,
UNNEST(hits) AS hits,
UNNEST(hits.product) AS product
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170801' AND
product.v2ProductCategory = "Home/Office/"
GROUP BY 1
ORDER BY 1
)
SELECT
date,
AVG(transactions) OVER (
ORDER BY date
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
) AS avg_transactions
FROM
daily_office_transactions
"""
result = client.query(query).result().to_dataframe()
result['date'] = pd.to_datetime(result['date'])
px.area(result, 'date', 'avg_transactions',
title='<office> Conversions 7-Day Moving Average',
height=400, width=800)

Insight :
상품 카테고리에 따라 매출이 일시적으로 증가한 것이 시계열(계절)적 특성인지 다른 해의 매출과 비교할 필요가 있다.
혹은 해당 기간에 광고, 세일 등이 이뤄졌는지 확인할 필요가 있다.
위의 EDA를통해 몇 가지 눈여겨 볼 사실(Insight)을 발견했다. 이 인사이트를 기반으로 "왜 이런 현상이 발생했는지"에 대해서 더 구체적인 목표를 수립하고 세분화된 분석을 진행할 수 있다.