Build and Optimize Data Warehouses with BigQuery: Challenge Lab

thyoondev·2021년 5월 6일
1
post-thumbnail

You are expected to have the skills and knowledge for these tasks, so don't expect step-by-step guides to be provided.

Build and Optimize Data Warehouses with BigQuery의 마지막 실습에는 가이드라인이 없는 도전과제가 기다리고 있다.

2021년 5월 6일 기준으로는 BigQuery를 사용해 GCP에서 제공하는 covid관련 테이블에 엑세스하고 테이블을 새로 만들어 필요한 데이터로 채우는 과정이다.

Challenge scenario

You are part of an international public health organization which is tasked with developing a machine learning model to predict the daily case count for countries during the Covid-19 pandemic. As a junior member of the Data Science team you've been assigned to use your data warehousing skills to develop a table containing the features for the machine learning model.

Begin: BigQuery에서 Project ID 클릭 후 Dataset을 생성

아래는 과정을 진행하면서 사용한 sql문이다.

Task 1: Create a table partitioned by date

Create a new dataset and create a table in that dataset partitioned by date, with an expiry of 90 days. The table should initially use the schema defined for the oxford_policy_tracker table in the COVID 19 Government Response public dataset .
You must also populate the table with the data from the source table for all countries except the United Kingdom (GBR), Brazil (BRA), Canada (CAN) and the United States (USA).

Query

create table <DATASET_NAME>.<TABLE_NAME>
PARTITION BY date
OPTIONS(
   partition_expiration_days=90
 )
as
SELECT * FROM `bigquery-public-data.covid19_govt_response.oxford_policy_tracker`
where 1=1
  and alpha_3_code not in ('GBR','BRA','CAN','USA');

Task 2: Add new columns to your table

Update your table to add new columns to your table with the appropriate data types to ensure alignment with the specification provided to you.

New Column Name          SQL Data Type
population               INTEGER
country_area             FLOAT
mobility                 RECORD
mobility.avg_retail      FLOAT
mobility.avg_grocery     FLOAT
mobility.avg_parks       FLOAT
mobility.avg_transit     FLOAT
mobility.avg_workplace   FLOAT
mobility.avg_residential FLOAT

Query

ALTER TABLE <DATASET_NAME>.<TABLE_NAME>
ADD COLUMN IF NOT EXISTS population INT64,
ADD COLUMN IF NOT EXISTS country_area FLOAT64,
ADD COLUMN mobility STRUCT<
  avg_retail      FLOAT64,
  avg_grocery     FLOAT64,
  avg_parks       FLOAT64,
  avg_transit     FLOAT64,
  avg_workplace   FLOAT64,
  avg_residential FLOAT64
  >

Task 3: Add country population data to the population column

Add the country population data to the population column in your table with covid_19_geographic_distribution_worldwide table data from the European Center for Disease Control COVID 19 public dataset table.

Query

UPDATE
    `<PROJECT_ID>.<DATASET_NAME>.<TABLE_NAME>` t0
SET
    population = t2.pop_data_2019
FROM 
    `<PROJECT_ID>.<DATASET_NAME>.<TABLE_NAME>` t1
    LEFT JOIN 
    (SELECT DISTINCT country_territory_code, pop_data_2019 FROM `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`) AS t2
     ON t1.alpha_3_code = t2.country_territory_code
WHERE CONCAT(t0.country_name, t0.date) = CONCAT(t1.country_name, t1.date)

아래가 먼저 시도한 sql문인데 해결되지 않아 채팅지원으로 도움을 받았다.

Query

update `<PROJECT_ID>.<DATASET_NAME>.<TABLE_NAME>` a
set a.population = b.pop_data_2019
FROM `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide` b
where a.date = b.date and a.alpha_3_code = b.country_territory_code

Task 4: Add country area data to the country_area column

Add the country area data to the country_area column in your table with country_names_area table data from the Census Bureau International public dataset.

Query

update `<PROJECT_ID>.<DATASET_NAME>.<TABLE_NAME>` a
set a.country_area = b.country_area
FROM `bigquery-public-data.census_bureau_international.country_names_area` b
where a.country_area = b.country_area

Task 5: Populate the mobility record data

Populate the mobility record in your table with data from the Google COVID 19 Mobility public dataset .

Query

UPDATE `<DATASET_NAME>.<TABLE_NAME>` t0
SET t0.mobility = STRUCT<avg_retail FLOAT64, avg_grocery FLOAT64, avg_parks FLOAT64, avg_transit FLOAT64, avg_workplace FLOAT64, avg_residential FLOAT64>
(t2.avg_retail, t2.avg_grocery, t2.avg_parks, t2.avg_transit, t2.avg_workplace, t2.avg_residential)
FROM (SELECT country_region, date,
      AVG(retail_and_recreation_percent_change_from_baseline) as avg_retail,
      AVG(grocery_and_pharmacy_percent_change_from_baseline)  as avg_grocery,
      AVG(parks_percent_change_from_baseline) as avg_parks,
      AVG(transit_stations_percent_change_from_baseline) as avg_transit,
      AVG( workplaces_percent_change_from_baseline ) as avg_workplace,
      AVG( residential_percent_change_from_baseline)  as avg_residential
      FROM `bigquery-public-data.covid19_google_mobility.mobility_report`
      GROUP BY country_region, date) AS t2
WHERE t0.country_name = t2.country_region
and t0.date = t2.date;

Task 6: Query missing data in population & country_area columns

Run a query to find the missing countries in the population and country_area data. The query should list countries that do not have any population data and countries that do not have country area information, ordered by country name. If a country has neither population or country area it must appear twice.

Query

SELECT distinct country_name 
FROM <DATASET_NAME>.<TABLE_NAME>
where population is null 
UNION ALL 
SELECT distinct country_name 
from <DATASET_NAME>.<TABLE_NAME> where country_area is null
order by country_name asc

이로써 BigQuery 뱃지도 얻고 Cloud SQL 퀘스트에 이어서 Qwiklabs 퀘스트 2개를 완료했다!
최소 수료 조건을 채웠으니 Google Cloud Study Jam 완주 기념품이 기대가 된다.

profile
Practice and repeat

0개의 댓글