About Data Engineering INTRO

All We Need is Data, itself! · February 28, 2022

Data Engineering


INTRO

What is data engineering?

  1. Data Collection and Storage
  2. Data preparation
  3. Exploration and Visualization
  4. Experimentation and Prediction

Data engineers deliver

  1. the correct data
  2. in the right form
  3. to the right people
  4. as efficiently as possible

A data engineer's responsibilities

  • ingest data from different sources
  • optimize databases for analysis
  • remove corrupted data
  • develop, construct, test and maintain data architectures

The five Vs

  • Volume
  • Variety
  • Velocity
  • Veracity
  • Value

These are the characteristics of big data that data engineers have to consider.



The data pipeline

  • ingest
  • process
  • store
  • pipelines automate moving data through these steps
  • so DS can use up-to-date, accurate data

ETL / Data pipelines

ETL

Extract data
Transform extracted data
Load transformed data to another database

Data pipelines

Move data from one system to another
May follow ETL, but data may not be transformed
Data may be loaded directly into applications
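The ETL flow described above can be sketched in Python. The data and function bodies here are hypothetical, just to show each stage:

```python
def extract():
    # Extract: pull raw records from a source (a hard-coded list stands in
    # for a real database, API, or file)
    return [{"name": " Alice ", "age": "30"}, {"name": "Bob ", "age": "25"}]

def transform(rows):
    # Transform: clean strings and cast types so the data fits the target schema
    return [{"name": r["name"].strip(), "age": int(r["age"])} for r in rows]

def load(rows, target):
    # Load: write the transformed rows into the target store (a plain list here)
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
# warehouse now holds clean, typed records
```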



Data structures

Structured data

  • Easy to search and organize
  • Consistent model, rows and columns
  • Defined types
  • can be grouped to form relations
  • stored in relational databases
  • about 20 percent of all data is structured
  • created and queried using SQL

Semi-structured data

  • relatively easy to search and organize
  • consistent model, less-rigid implementation
  • different types, sizes
  • can be grouped, but needs more work
  • NoSQL databases: JSON, XML, YAML

Unstructured data

  • hard to search and organize
  • doesn't follow a model, can't be contained in rows and columns
  • usually stored in data lakes, can appear in data warehouses or databases
  • extremely valuable


SQL

  • Structured Query Language
  • Industry standard for Relational Database Management Systems (RDBMS)
  • allows you to access many records at once and group, filter, or aggregate them
CREATE TABLE students (  -- a table name (here "students", as an example) is required
	stu_id INT,
    stu_name VARCHAR(255)
);

SELECT *
FROM table
WHERE condition
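The statements above can be tried end-to-end with Python's built-in sqlite3 module (the table and sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE students (stu_id INT, stu_name VARCHAR(255))")
conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Kim"), (2, "Lee")])

# SELECT ... WHERE: access many records at once, then filter them
result = conn.execute("SELECT * FROM students WHERE stu_id = 1").fetchall()
```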


Data warehouses and data lakes

Data lake:

  • stores all the raw data
  • unprocessed, messy
  • can be petabytes (1 petabyte = 1 million GB)
  • stores all data structures
  • difficult to analyze
  • requires an up-to-date data catalog
  • used by DS
  • big data, real-time analysis

Data warehouse:

  • specific data for specific use
  • relatively small
  • stores mainly structured data
  • more costly to update
  • optimized for data analysis
  • used by DS and BA
  • Ad-hoc, read-only queries
    • Ad hoc: for a special, one-off purpose

Data catalog for data lakes

A data catalog records, for each dataset:

  • What is the source of this data?

  • Where is this data used?

  • Who is the owner?

  • How often is this updated?

Why it matters:

  • Good practice in terms of data governance

  • Ensures reproducibility

  • No catalog -> data swamp

  • Good practice for any data storage solution

    • reliability
    • autonomy
    • scalability
    • speed
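In practice a catalog entry is just structured metadata answering those questions. A hypothetical sketch (all names are made up):

```python
# Hypothetical catalog entry for one dataset in a data lake
catalog_entry = {
    "dataset": "raw_clickstream",
    "source": "web app event logs",           # what is the source of this data?
    "used_in": ["churn model", "dashboard"],  # where is this data used?
    "owner": "data-platform team",            # who is the owner?
    "update_frequency": "hourly",             # how often is this updated?
}
```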

DB vs. DW

  • database:

    • general term
    • loosely defined
  • DW

    • type of DB


Data processing value

  • remove unwanted data
    • to save money
  • convert data from one type to another
  • organize data
    • to fit into a schema/structure
  • increase productivity

How DE process data

  • data manipulation, cleaning, and tidying tasks

    • that can be automated
    • that will always need to be done
  • store data in a sanely structured DB

  • create views on top of the DB tables

  • deciding what happens with missing metadata

  • optimizing the performance of the DB



Scheduling

  • can apply to any task listed in data processing

  • glue of the data engineering system

  • runs tasks in a specific order and resolves all dependencies

Manual, Time, Condition

  • Manual: run when a person decides it's needed ("because ...")

  • Time: run at set intervals ("every ...")

  • Condition: run when a condition is met ("when / if ...")

Batches and streams

  • Batches
    • group records at intervals
    • often cheaper
  • Streams
    • send individual records right away

e.g., Apache Airflow, Luigi
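The scheduler's core job (running tasks in a specific order while resolving dependencies) can be sketched with Python's stdlib graphlib; tools like Airflow and Luigi add time/condition triggers, retries, and monitoring on top. The task names are hypothetical:

```python
import graphlib

# Hypothetical task graph: each task maps to the set of tasks it depends on
deps = {
    "process": {"ingest"},   # process runs after ingest
    "store": {"process"},    # store runs after process
}

# Resolve the dependencies into a valid execution order
order = list(graphlib.TopologicalSorter(deps).static_order())
```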



Parallel computing


  • Basis of modern data processing tools
  • Necessary:
    • mainly because of memory
    • Also for processing power
  • How it works:
    • split tasks up into several smaller subtasks

pros and cons

  Pros                        Cons
  extra processing power      moving data incurs a cost
  reduced memory footprint    communication time

refs: https://ko.wikipedia.org/wiki/%EB%B3%91%EB%A0%AC_%EC%BB%B4%ED%93%A8%ED%8C%85
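A minimal sketch of the "split into smaller subtasks" idea, using a thread pool (for CPU-bound work in Python you would typically use a process pool instead, to sidestep the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # each worker handles one smaller piece of the overall task
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]  # split the task up

# run subtasks concurrently, then combine the partial results
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(subtask, chunks))
total = sum(partial_sums)
```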



Cloud Computing

Differences from on-premises

  • No need for physical server space
  • Electricity and maintenance costs are reduced (resources are rented)
  • DB reliability: data replication
  • But there are risks with sensitive data

refs: Data Engineering for Everyone (DataCamp)
