[DP-203] Data Engineering on Microsoft Azure

Becoming a Data Engineer · December 17, 2023

Azure


What is DP-203?

https://learn.microsoft.com/ko-kr/credentials/certifications/resources/study-guides/dp-203#skills-at-a-glance

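  • Design and implement data storage (15–20%)

    • Partition strategy
    • Data exploration layer
  • Develop data processing (40–45%)

    • Data ingestion and transformation
    • Batch processing solutions
    • Stream processing solutions
    • Batch and pipeline management
  • Secure, monitor, and optimize data storage and data processing (30–35%)

    • Data security
    • Monitoring data storage and data processing
    • Optimizing and troubleshooting data storage and data processing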

Data Engineering on Microsoft Azure

Overview

Goal: to provide accessible, clean data in a usable format

  • Data Movement
  • Data Ingestion: Azure Data Factory
  • Data Storage: Azure Data Lake Storage Gen2
  • Data Transformation: Azure Databricks, Azure Synapse Analytics

Basic Azure Services

  • Azure Blob Storage: primary storage service, and the foundation for data lakes
  • Azure Data Factory: data pipeline (integration and orchestration) service on Azure
  • Azure Synapse Analytics: platform optimized for working with structured data
  • Azure Stream Analytics: streaming capability and light transformation
  • Azure Databricks: provides ETL, analytics, and machine learning at a massive scale

1. Introduction to Data Lakes

  • Structured vs. Unstructured Data

    • Structured: relational, fixed schema, supports complex queries, vertical scaling (more RAM, CPU power)
    • Unstructured: non-relational, dynamic schema, not suited to complex queries, horizontal scaling
  • Azure Blob Storage: general-purpose object store

  • Data Lake with Blob storage

    • Hierarchical namespace
      • When enabled: the storage account can be used as a data lake
  • Data Lake Architecture
    Data Source >>> Ingestion >>> Data Lake (Raw - Processed - Curated)
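
The Raw - Processed - Curated flow above can be pictured as a folder convention inside the lake. The following is only a minimal sketch under assumed names: the zone folder names, the `source/dataset/year=/month=/day=` layout, and the helpers `zone_path` and `next_zone` are hypothetical illustrations, not an Azure API or a layout prescribed by the exam.

```python
from datetime import date
from pathlib import PurePosixPath
from typing import Optional

# Zone names follow the notes: Raw -> Processed -> Curated.
ZONES = ("raw", "processed", "curated")


def zone_path(zone: str, source: str, dataset: str, day: date) -> str:
    """Build a date-partitioned folder path inside one zone (assumed layout)."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return str(
        PurePosixPath(
            zone,
            source,
            dataset,
            f"year={day.year}",
            f"month={day.month:02d}",
            f"day={day.day:02d}",
        )
    )


def next_zone(zone: str) -> Optional[str]:
    """Return the zone data is promoted to next, or None after the last zone."""
    i = ZONES.index(zone)
    return ZONES[i + 1] if i + 1 < len(ZONES) else None


if __name__ == "__main__":
    # Hypothetical source system and dataset names.
    print(zone_path("raw", "sales_db", "orders", date(2023, 12, 17)))
    print(next_zone("raw"))
```

Keeping the same relative path in every zone makes it easy to trace one day's data from ingestion through to the curated layer.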

2. Introduction to Azure Data Factory

  • Azure Data Factory

    • cloud-based data integration service
      • create data-driven workflows in the cloud
        • that orchestrate and automate data movement and transformation
    • data pipeline orchestration
  • Pipeline: logical grouping of activities

    • activities perform a task
  • Activity: produces and consumes datasets; runs on a linked service

    • a processing step in a pipeline
    • 3 types of activities: movement, transformation, control
  • Datasets: represent data items stored in a linked service

    • the structure of the data within a data store
    • where the input/output data lives
  • Linked Services

    • connection information (such as a connection string) needed to connect to a data store
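
How the four concepts above relate can be sketched with plain Python dataclasses. This is a toy model for study purposes, not the Azure Data Factory SDK or its JSON schema: every class, field, and example name below is a hypothetical stand-in, and the connection strings are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ACTIVITY_KINDS = ("movement", "transformation", "control")


@dataclass(frozen=True)
class LinkedService:
    """Connection information for one data store."""
    name: str
    connection_string: str  # placeholder only, never a real secret


@dataclass(frozen=True)
class Dataset:
    """A named view of data that lives inside a linked service."""
    name: str
    linked_service: LinkedService
    location: str  # e.g. a folder path or table name in the store


@dataclass(frozen=True)
class Activity:
    """One processing step: consumes input datasets, produces output datasets."""
    name: str
    kind: str
    inputs: Tuple[Dataset, ...] = ()
    outputs: Tuple[Dataset, ...] = ()

    def __post_init__(self) -> None:
        if self.kind not in ACTIVITY_KINDS:
            raise ValueError(f"unknown activity kind: {self.kind!r}")


@dataclass
class Pipeline:
    """A logical grouping of activities."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def linked_service_names(self) -> List[str]:
        """Names of every linked service the pipeline's datasets touch."""
        names = {
            ds.linked_service.name
            for act in self.activities
            for ds in act.inputs + act.outputs
        }
        return sorted(names)


if __name__ == "__main__":
    blob = LinkedService("blob_store", "<connection-string>")
    lake = LinkedService("data_lake", "<connection-string>")
    src = Dataset("orders_csv", blob, "landing/orders.csv")
    dst = Dataset("orders_raw", lake, "raw/orders/")
    copy = Activity("copy_orders", "movement", inputs=(src,), outputs=(dst,))
    pipe = Pipeline("ingest_orders", [copy])
    print(pipe.linked_service_names())
```

The point of the model is the direction of the references: a pipeline groups activities, activities point at datasets, and datasets point at the linked service that holds the connection details.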

3. Introduction to Azure Synapse Analytics

  • Azure Synapse Analytics: SQL at its core, but more than SQL
    • Ingest: Data Factory
    • Store: Data Lake, Blob Storage, SQL Database
    • Prep & Train: Databricks, Azure Machine Learning
    • Model & Serve: SQL Data Warehouse

4. Introduction to Azure Stream Analytics

  • Azure Stream Analytics

    • Input: Event Hubs, IoT Hub, Blob Storage
    • Query: transformation
    • Output: store and save results
  • Windowing: sliding, tumbling, hopping
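
The three window types listed above differ in how events are grouped over time. The sketch below illustrates the idea in plain Python rather than in the Stream Analytics query language. It makes simplifying assumptions: timestamps are integer seconds, tumbling and hopping windows are treated as start-inclusive and end-exclusive, and the function names are hypothetical. The exact boundary rules of the service may differ.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (start, end) in seconds; start inclusive, end exclusive


def tumbling_window(ts: int, size: int) -> Window:
    """Fixed-size, non-overlapping windows: each event falls in exactly one."""
    start = (ts // size) * size
    return (start, start + size)


def hopping_windows(ts: int, size: int, hop: int) -> List[Window]:
    """Windows of length `size` that start every `hop` seconds.

    When hop < size the windows overlap, so one event can fall in several.
    """
    windows = []
    start = (ts // hop) * hop  # latest window start at or before ts
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


def sliding_counts(timestamps: List[int], size: int) -> List[Tuple[int, int]]:
    """For each event, count the events in the `size` seconds ending at it."""
    ordered = sorted(timestamps)
    return [
        (t, sum(1 for u in ordered if t - size < u <= t))
        for t in ordered
    ]


if __name__ == "__main__":
    print(tumbling_window(12, size=10))
    print(hopping_windows(12, size=10, hop=5))
    print(sliding_counts([1, 4, 12], size=10))
```

With `hop` equal to `size`, the hopping windows never overlap and the result matches the tumbling case, which is a handy way to remember how the two relate.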

5. Introduction to Azure Databricks

  • Azure Databricks: Prep & Train
    • Clusters: Group of compute resources
      • Workspace (e.g., a filing cabinet)
      • Notebooks (e.g., a folder)
      • Cells: individual pieces of code
      • Libraries: packages or modules
      • Tables: storage for structured data