What is DP-203?
https://learn.microsoft.com/ko-kr/credentials/certifications/resources/study-guides/dp-203#skills-at-a-glance
Data Engineering on Microsoft Azure
Overview
to provide accessible, clean data in a usable format
- Data Movement
- Data Ingestion: Azure Data Factory
- Data Storage: Azure Data Lake Storage Gen2
- Data Transformation: Azure Databricks, Azure Synapse Analytics
Basic Azure Services
- Azure Blob Storage: primary object storage service; also the foundation for data lakes (Data Lake Storage Gen2)
- Azure Data Factory: Azure's service for building data pipelines
- Azure Synapse Analytics: a platform optimized for working with structured data
- Azure Stream Analytics: streaming capability and light transformation
- Azure Databricks: provides ETL, analytics, and machine learning at a massive scale
1. Introduction to Data Lakes
- Structured vs. Unstructured Data
  - Structured: Relational, Fixed Schema, Complex Queries, Vertical Scaling (more RAM, CPU power)
  - Unstructured: Non-Relational, Dynamic Schema, Not for Complex Queries, Horizontal Scaling
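As a rough illustration of the fixed-schema versus dynamic distinction, the sketch below uses Python's built-in `sqlite3` module for the relational side and plain JSON documents for the non-relational side. The `customers` table and all sample records are made up for this example; they are not part of the DP-203 material.

```python
import json
import sqlite3

# Structured: a relational table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Kim", "Seoul"), (2, "Lee", "Busan"), (3, "Park", "Seoul")],
)

# The fixed schema supports complex queries (here, a grouped aggregate).
per_city = dict(
    conn.execute("SELECT city, COUNT(*) FROM customers GROUP BY city").fetchall()
)

# The same schema rejects data that does not fit it.
try:
    conn.execute("INSERT INTO customers (id, name, nickname) VALUES (4, 'Choi', 'cc')")
    schema_enforced = False
except sqlite3.OperationalError:
    schema_enforced = True  # the table has no "nickname" column

# Non-relational: each document can carry its own set of fields.
documents = [
    json.dumps({"id": 1, "name": "Kim"}),
    json.dumps({"id": 2, "name": "Lee", "tags": ["vip"], "notes": "prefers email"}),
]
field_sets = [set(json.loads(doc)) for doc in documents]
```

The relational side pays for its query power with rigidity, while the document side accepts any shape but leaves it to the reader of the data to cope with missing or extra fields.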
- Azure Blob Storage: general-purpose object store
- Data Lake with Blob Storage
- Data Lake Architecture
  - Data Source >>> Ingestion >>> Data Lake (Raw - Processed - Curated)
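The Raw - Processed - Curated flow can be sketched with ordinary folders standing in for data lake zones. This is only a local toy, assuming a common reading of the zones (raw keeps the data as ingested, processed holds cleaned data, curated holds consumer-ready summaries). The function `run_zones`, the file names, and the sample sales rows are all hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def run_zones(lake: Path) -> dict:
    """Move sample data through raw -> processed -> curated folders."""
    raw, processed, curated = (lake / zone for zone in ("raw", "processed", "curated"))
    for zone in (raw, processed, curated):
        zone.mkdir(parents=True, exist_ok=True)

    # Ingestion: land the source data in the raw zone unchanged.
    (raw / "sales.csv").write_text(
        "region,amount\nseoul,100\nSeoul ,50\nbusan,\nbusan,70\n", encoding="utf-8"
    )

    # Raw -> processed: normalise region names and drop rows with no amount.
    with open(raw / "sales.csv", newline="", encoding="utf-8") as src, \
         open(processed / "sales_clean.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["region", "amount"])
        writer.writeheader()
        for row in csv.DictReader(src):
            if row["amount"].strip():
                writer.writerow(
                    {"region": row["region"].strip().lower(), "amount": int(row["amount"])}
                )

    # Processed -> curated: aggregate into a summary table per region.
    totals = defaultdict(int)
    with open(processed / "sales_clean.csv", newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src):
            totals[row["region"]] += int(row["amount"])
    with open(curated / "sales_by_region.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["region", "total"])
        writer.writerows(sorted(totals.items()))
    return dict(totals)


with tempfile.TemporaryDirectory() as tmp:
    totals = run_zones(Path(tmp))
```

Keeping the raw file untouched means the later zones can always be rebuilt if the cleaning or aggregation rules change.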
2. Introduction to Azure Data Factory
- Azure Data Factory
  - cloud-based data integration service
  - creates data-driven workflows in the cloud that orchestrate and automate data movement and transformation
  - data pipeline orchestration
- Pipeline: logical grouping of activities
  - each activity performs a task
- Activity: produces and consumes datasets; runs on a linked service
  - a processing step in a pipeline
  - 3 types of activities: movement, transformation, control
- Datasets: represent data items stored in a linked service
  - the structure of the data within a data store
  - where the input/output data lives
- Linked Services
  - the connection information (much like a connection string) needed to connect to a data source
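To see how pipelines, activities, datasets, and linked services relate, here is a toy model in plain Python. It is not the Azure Data Factory SDK or its JSON format; every class, field, and sample name below is hypothetical, and an in-memory dict stands in for a real data store.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LinkedService:
    """Connection information for a data store (stand-in: an in-memory dict)."""
    name: str
    store: Dict[str, list] = field(default_factory=dict)


@dataclass
class Dataset:
    """Points at a named piece of data inside a linked service."""
    name: str
    linked_service: LinkedService
    path: str

    def read(self) -> list:
        return list(self.linked_service.store.get(self.path, []))

    def write(self, rows: list) -> None:
        self.linked_service.store[self.path] = list(rows)


@dataclass
class Activity:
    """One processing step: consumes a source dataset, produces a sink dataset."""
    name: str
    source: Dataset
    sink: Dataset
    transform: Callable[[list], list] = lambda rows: rows  # default: plain copy

    def run(self) -> None:
        self.sink.write(self.transform(self.source.read()))


@dataclass
class Pipeline:
    """Logical grouping of activities, run here in list order."""
    name: str
    activities: List[Activity]

    def run(self) -> None:
        for activity in self.activities:
            activity.run()


blob = LinkedService("blob_storage", {"raw/orders": [3, 1, 2]})
sql = LinkedService("sql_db")

raw_orders = Dataset("raw_orders", blob, "raw/orders")
staged_orders = Dataset("staged_orders", blob, "staged/orders")
orders_table = Dataset("orders_table", sql, "dbo.orders")

pipeline = Pipeline("load_orders", [
    Activity("sort_orders", raw_orders, staged_orders, transform=sorted),  # transformation
    Activity("copy_to_sql", staged_orders, orders_table),                  # movement
])
pipeline.run()
```

After the run, `sql.store["dbo.orders"]` holds the sorted rows while the raw dataset is unchanged. Control activities (branching, looping) are left out of this sketch to keep it short.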
3. Introduction to Azure Synapse Analytics
- Azure Synapse Analytics: SQL at its core, but more than SQL
- Ingest: Data Factory
- Store: Data Lake, Blob Storage, SQL Database
- Prep & Train: Databricks, Azure Machine Learning
- Model & Serve: SQL Data Warehouse
4. Introduction to Azure Stream Analytics
5. Introduction to Azure Databricks
- Azure Databricks: Prep & Train
- Clusters: Group of compute resources
- Workspace (e.g., a filing cabinet)
- Notebooks (e.g., a folder)
- Cells: individual pieces of code
- Libraries: packages or modules
- Tables: storage for structured data
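The workspace, notebook, and cell relationship can be mimicked in a few lines of plain Python. This is only a mental model, not the Databricks API: the `Workspace` and `Notebook` classes are hypothetical, and the local interpreter stands in for the cluster that would actually execute the cells.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Notebook:
    """An ordered list of cells; each cell is an individual piece of code."""
    name: str
    cells: List[str]

    def run_all(self) -> dict:
        namespace: dict = {}         # state shared by every cell in the notebook
        for source in self.cells:
            exec(source, namespace)  # run cells in order, top to bottom
        return namespace


@dataclass
class Workspace:
    """Container that organises notebooks by name."""
    notebooks: Dict[str, Notebook] = field(default_factory=dict)

    def add(self, notebook: Notebook) -> None:
        self.notebooks[notebook.name] = notebook


workspace = Workspace()
workspace.add(Notebook("etl_demo", [
    "import statistics",                    # a library made available to later cells
    "readings = [10, 20, 60]",
    "average = statistics.mean(readings)",  # uses names defined by earlier cells
]))
result = workspace.notebooks["etl_demo"].run_all()
```

The point of the sketch is that cells are separate pieces of code but share state within their notebook, which is why the third cell can use `statistics` and `readings` from the first two.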