About Data Engineering INTRO

All We Need is Data, itself! · February 28, 2022

Data Engineering


INTRO

What is data engineering?

  1. Data Collection and Storage
  2. Data preparation
  3. Exploration and Visualization
  4. Experimentation and Prediction

Data engineers deliver

  1. the correct data
  2. in the right form
  3. to the right people
  4. as efficiently as possible

A data engineer's responsibilities

  • ingest data from different sources
  • optimize databases for analysis
  • remove corrupted data
  • develop, construct, test and maintain data architectures

The five Vs

  • Volume
  • Variety
  • Velocity
  • Veracity
  • Value

These are the characteristics of big data that data engineers have to consider.



The data pipeline

  • ingest
  • process
  • store
  • pipelines automate moving data through these steps
  • so DS can use up-to-date, accurate data

ETL / Data pipelines

ETL

Extract data
Transform extracted data
Load transformed data to another database

Data pipelines

Move data from one system to another
May follow ETL, but data may not be transformed
Data may be loaded directly into applications
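The ETL flow described above can be sketched in Python. The data and function bodies here are hypothetical, just to show each stage:

```python
def extract():
    # Extract: pull raw records from a source (a hard-coded list stands in
    # for a real database, API, or file)
    return [{"name": " Alice ", "age": "30"}, {"name": "Bob ", "age": "25"}]

def transform(rows):
    # Transform: clean strings and cast types so the data fits the target schema
    return [{"name": r["name"].strip(), "age": int(r["age"])} for r in rows]

def load(rows, target):
    # Load: write the transformed rows into the target store (a plain list here)
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
# warehouse now holds clean, typed records
```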



Data structures

Structured data

  • Easy to search and organize
  • Consistent model, rows and columns
  • Defined types
  • can be grouped to form relations
  • stored in relational databases
  • about 20 percent of all data is structured
  • created and queried using SQL

Semi-structured data

  • relatively easy to search and organize
  • consistent model, less-rigid implementation
  • different types, sizes
  • can be grouped, but needs more work
  • NoSQL databases: JSON, XML, YAML

Unstructured data

  • hard to search and organize
  • doesn't follow a model, can't be contained in rows and columns
  • usually stored in data lakes, can appear in data warehouses or databases
  • extremely valuable


SQL

  • Structured Query Language
  • Industry standard for Relational Database Management Systems (RDBMS)
  • allows you to access many records at once and group, filter, or aggregate them
CREATE TABLE students (  -- a table name (here "students", as an example) is required
	stu_id INT,
    stu_name VARCHAR(255)
);

SELECT *
FROM table
WHERE condition
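The statements above can be tried end-to-end with Python's built-in sqlite3 module (the table and sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE students (stu_id INT, stu_name VARCHAR(255))")
conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Kim"), (2, "Lee")])

# SELECT ... WHERE: access many records at once, then filter them
result = conn.execute("SELECT * FROM students WHERE stu_id = 1").fetchall()
```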


Data warehouses and data lakes

Data lake:

  • stores all the raw data
  • unprocessed, messy
  • can be petabytes (1 petabyte = 1 million GB)
  • stores all data structures
  • difficult to analyze
  • requires an up-to-date data catalog
  • used by DS
  • big data, real-time analysis

Data warehouse:

  • specific data for specific use
  • relatively small
  • stores mainly structured data
  • more costly to update
  • optimized for data analysis
  • used by DS and BA
  • Ad-hoc, read-only queries
    • Ad hoc: for a special, one-off purpose

Data catalog for data lakes

A data catalog records, for each dataset:

  • What is the source of this data?

  • Where is this data used?

  • Who is the owner?

  • How often is this updated?

Why it matters:

  • Good practice in terms of data governance

  • Ensures reproducibility

  • No catalog -> data swamp

  • Good practice for any data storage solution

    • reliability
    • autonomy
    • scalability
    • speed
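In practice a catalog entry is just structured metadata answering those questions. A hypothetical sketch (all names are made up):

```python
# Hypothetical catalog entry for one dataset in a data lake
catalog_entry = {
    "dataset": "raw_clickstream",
    "source": "web app event logs",           # what is the source of this data?
    "used_in": ["churn model", "dashboard"],  # where is this data used?
    "owner": "data-platform team",            # who is the owner?
    "update_frequency": "hourly",             # how often is this updated?
}
```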

DB vs. DW

  • database:

    • general term
    • loosely defined
  • DW

    • type of DB


Data processing value

  • remove unwanted data
    • to save money
  • convert data from one type to another
  • organize data
    • to fit into a schema/structure
  • increase productivity

How DE process data

  • data manipulation, cleaning, and tidying tasks

    • that can be automated
    • that will always need to be done
  • store data in a sanely structured DB

  • create views on top of the DB tables

  • deciding what happens with missing metadata

  • optimizing the performance of the DB



Scheduling

  • can apply to any task listed in data processing

  • glue of the data engineering system

  • runs tasks in a specific order and resolves all dependencies

Manual, Time, Condition

  • Manual: run when a person decides it's needed ("because ...")

  • Time: run at set intervals ("every ...")

  • Condition: run when a condition is met ("when / if ...")

Batches and streams

  • Batches
    • group records at intervals
    • often cheaper
  • Streams
    • send individual records right away

e.g., Apache Airflow, Luigi
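The scheduler's core job (running tasks in a specific order while resolving dependencies) can be sketched with Python's stdlib graphlib; tools like Airflow and Luigi add time/condition triggers, retries, and monitoring on top. The task names are hypothetical:

```python
import graphlib

# Hypothetical task graph: each task maps to the set of tasks it depends on
deps = {
    "process": {"ingest"},   # process runs after ingest
    "store": {"process"},    # store runs after process
}

# Resolve the dependencies into a valid execution order
order = list(graphlib.TopologicalSorter(deps).static_order())
```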



Parallel computing


  • Basis of modern data processing tools
  • Necessary:
    • mainly because of memory
    • Also for processing power
  • How it works:
    • split tasks up into several smaller subtasks

pros and cons

  Pros                        Cons
  extra processing power      moving data incurs a cost
  reduced memory footprint    communication time

refs: https://ko.wikipedia.org/wiki/%EB%B3%91%EB%A0%AC_%EC%BB%B4%ED%93%A8%ED%8C%85
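A minimal sketch of the "split into smaller subtasks" idea, using a thread pool (for CPU-bound work in Python you would typically use a process pool instead, to sidestep the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # each worker handles one smaller piece of the overall task
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]  # split the task up

# run subtasks concurrently, then combine the partial results
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(subtask, chunks))
total = sum(partial_sums)
```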



Cloud Computing

Differences from on-premises

  • No need for physical server space
  • Electricity and maintenance costs are reduced (resources are rented)
  • DB reliability: data replication
  • But there are risks with sensitive data

refs: Data Engineering for Everyone (DataCamp)
