[BigData] Ch.1

Y_Y · September 19, 2022

These are my personal notes on the lectures of Professor Daejin Choi (최대진) at Incheon National University.

Data Mining

Problem Definition

Define the problem in a scientific form
Requires domain knowledge as well as scientific problem-solving capability

Data Collection

Collecting data via

Data Representation

Transforming high-dimensional raw data into problem-relevant data
Quantifying data with various techniques

MapReduce

Large-scale computing for Data Mining
-> Using machine clusters is essential

Mining big data requires large-scale computing, whose key component is parallel programming
MapReduce is a framework for such parallel programming
ex) Apache Hadoop MapReduce, Amazon Elastic MapReduce

ex) Word Counting
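
Word counting can be sketched on a single machine to show the three phases (map, group by key, reduce). This is only an illustration in plain Python, not Hadoop code; the function names `map_fn`, `shuffle`, and `reduce_fn` are hypothetical, and a real framework would run the map and reduce calls in parallel across machines.

```python
from collections import defaultdict


def map_fn(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group by key: collect all values emitted for the same word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    # In a real cluster each document chunk would be mapped on a different machine.
    pairs = (pair for doc in documents for pair in map_fn(doc))
    return dict(reduce_fn(word, counts) for word, counts in shuffle(pairs).items())


docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(docs)
print(counts)
```

The programmer only writes the map and reduce functions; the grouping step in the middle is what the framework performs on their behalf.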

MapReduce Environment

Partitioning the input data
Scheduling the program's execution across a set of machines
Performing the group by key step
Handling machine failures
Managing required inter-machine communication

Using MapReduce Framework

Data Flow

Input and final output are stored on a distributed file system (DFS)

Stream Data

Big Data - Volume, Velocity, Variety
High Velocity -> Stream Data

Characteristics of Stream Data

Size - Infinite, Bursty (uneven, unpredictable arrival rate), Non-stationary
Only instantly accessible: an element can be read only at the moment it arrives
Stream management is important when the input rate is controlled externally

Applications

Mining query streams
ex) Google wants to know what queries are more frequent today than yesterday
Mining click streams
ex) Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour
Mining social network news feeds
ex) Look for trending topics on Twitter, Facebook
Sensor Networks
Many sensors feeding into a central controller
IP packets monitored at a switch
Gather information for optimal routing
Detect denial-of-service attacks

The Stream Model

Input elements enter at a rapid rate, at one or more input ports
The system cannot store the entire stream

SIDE NOTE : Online Learning
Online Learning enables a machine learning model to continuously learn from the recent data stream

Example : Stochastic Gradient Descent (SGD)

Idea : Do slow updates to the model
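
A minimal sketch of the "slow updates" idea, assuming a linear model trained with squared error: each arriving example moves the weights by a small step and is then discarded. The synthetic `stream` generator, the target weights (2, -3), and the learning rate are assumptions made only for this illustration.

```python
import random


def sgd_step(weights, x, y, lr=0.01):
    """One online update on a single (x, y) example for squared error."""
    prediction = sum(w * xi for w, xi in zip(weights, x))
    error = prediction - y
    # A small learning rate means a "slow" update: one example only nudges the model.
    return [w - lr * error * xi for w, xi in zip(weights, x)]


def stream(n, seed=0):
    """Hypothetical data stream: y = 2*x1 - 3*x2 plus a little noise."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 2 * x[0] - 3 * x[1] + rng.gauss(0, 0.01)


weights = [0.0, 0.0]
for x, y in stream(20000):
    # The example is used once and never stored, as the stream model requires.
    weights = sgd_step(weights, x, y)

print(weights)
```

Because no example is kept after its update, memory use stays constant no matter how long the stream runs.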

Operations on Data Streams

In conclusion, because the system cannot store the entire stream, we have to work with a subset of the input stream

Sampling data from a stream,
- Construct a random sample.
Queries over sliding windows,
- Number of ~~
Filtering a data stream,
Counting distinct elements,
Estimating moments,
Counting itemsets.

Sampling data

fixed-size tuples

Why? Don't know length of stream in advance

Suppose at time n we have seen n items

Reservoir Sampling

Algorithm

Store all of the first s elements of the stream in S
Suppose we have seen n-1 elements, and now the n^th element arrives (n>s)
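
The notes stop at the arrival of the n-th element. In the standard form of reservoir sampling, that element is kept with probability s/n, and if it is kept it replaces one of the s stored elements chosen uniformly at random. The sketch below assumes that completion; the function name and the `rng` parameter are illustrative.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Keep a uniform random sample of s items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            # Store all of the first s elements.
            sample.append(item)
        elif rng.random() < s / n:
            # Keep the n-th element with probability s/n,
            # evicting one stored element chosen uniformly at random.
            sample[rng.randrange(s)] = item
    return sample


sample = reservoir_sample(range(1000), s=10, rng=random.Random(42))
print(sample)
```

After n elements, every element seen so far is in the sample with probability s/n, which is why the length of the stream never needs to be known in advance.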

Sliding Window

A useful model of stream processing is that queries are about a window of length N, the N most recent elements received.

Example Problem - Counting Bits

Given a stream of 0s and 1s
Be prepared to answer queries of the form "How many 1s are in the last k bits?" where k <= N
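
The obvious solution is to store the N most recent bits and count exactly. A minimal sketch, assuming the window fits in memory; the class name `WindowBitCounter` is hypothetical.

```python
from collections import deque


class WindowBitCounter:
    """Stores the last N bits exactly, so it needs O(N) memory."""

    def __init__(self, n):
        self.n = n
        self.window = deque(maxlen=n)

    def add(self, bit):
        # A deque with maxlen drops the oldest bit automatically once it is full.
        self.window.append(bit)

    def count_ones(self, k):
        """Exact number of 1s among the last k bits, for 0 <= k <= N."""
        if not 0 <= k <= self.n:
            raise ValueError("k must be between 0 and N")
        if k == 0:
            return 0
        return sum(list(self.window)[-k:])


counter = WindowBitCounter(6)
for bit in [1, 0, 1, 1, 0, 1, 1, 0]:
    counter.add(bit)

print(counter.count_ones(6), counter.count_ones(3))
```

This answers every query exactly, but the memory cost grows linearly with N.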

Real Problem :

What if we cannot afford to store N bits?

DGIM Method does not assume uniformity
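
A minimal sketch of the DGIM idea, assuming its standard formulation: the window is summarized by buckets whose sizes are powers of two, at most two buckets of each size are kept, and a query counts every bucket in range in full except the oldest one, of which only half is counted. The class and method names are hypothetical.

```python
import random


class DGIM:
    """Approximate count of 1s in the last N bits using a few buckets."""

    def __init__(self, window_size):
        self.n = window_size
        self.time = 0
        # Each bucket is [timestamp of its most recent 1, size]; newest bucket first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its most recent 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.n:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            merged = [self.buckets[i + 1][0], self.buckets[i + 1][1] * 2]
            self.buckets[i + 1:i + 3] = [merged]
            i += 1  # the merge may have created a third bucket of the next size

    def count_ones(self, k=None):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        k = self.n if k is None else k
        cutoff = self.time - k
        total, oldest = 0, 0
        for timestamp, size in self.buckets:
            if timestamp <= cutoff:
                break
            total += size
            oldest = size
        # Count only half of the oldest bucket (all of it when its size is 1).
        return total - oldest // 2


rng = random.Random(1)
bits = [rng.randint(0, 1) for _ in range(500)]
dgim = DGIM(window_size=64)
for bit in bits:
    dgim.add(bit)

exact = sum(bits[-64:])
estimate = dgim.count_ones()
print(exact, estimate)
```

Only the oldest bucket is uncertain, so the estimate is off by at most 50% of the true count, and the number of buckets grows logarithmically with N rather than linearly.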

Exponential Windows

Sampling a fixed proportion of stream
Sample size grows as the stream grows
Sampling a fixed-size sample
Reservoir Sampling
Counting the number of 1s in the last N elements
Exponentially

Filtering a Data Stream : Bloom Filter

Filtering Data Streams

Each element of data stream is a tuple
Given a list of keys S
Determine which tuples of stream are in S
+ NOTE : It's different from user-based sampling

Example Application: Email Spam Filtering

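Among a huge number of emails, mail already judged to be legitimate should not be treated as spam; we want to deliver such known-good mail right away, without a lookup.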
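
A Bloom filter for this setting can be sketched as a bit array plus a few hash functions: every key in S sets a handful of bits, and a stream element passes only if all of its bits are set. This is a minimal sketch; deriving the hash functions from SHA-256, the chosen sizes, and the example addresses are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a small false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely not in S"; True means "probably in S".
        return all(self.bits[pos] for pos in self._positions(key))


good_senders = ["alice@example.com", "bob@example.com"]
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for sender in good_senders:
    bloom.add(sender)

print(bloom.might_contain("alice@example.com"))
```

A known-good address always passes, so legitimate mail is never dropped by the filter; an unknown address occasionally passes by accident, which is the false-positive cost paid for not storing S itself.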