[BigData] Data Collection & Exploration

Y_Y · October 4, 2022

Data Collection

Considerations on Data Collection

Does the dataset exist (or can we build it)?
- If NOT, the data mining (DM) problem you defined cannot be carried out
If it exists, specify the target data source, considering
- Availability
- Access type (API, download) -> check for "limits or warnings"

Spend plenty of time searching for previously collected or publicly available datasets

Search for APIs and existing datasets (someone may already have crawled the data)

Considerations on Data Collection (Cont'd)

Decide the scale of the dataset & what to store
- Reasonable amount to solve your problem
- Storage capacity
Decide the collection methodology
- Crawling web pages or using APIs
- Multiprocessing vs. single processing
TIP : Collect as much data as you can store
- Re-collection is too expensive
- Nobody (not even you) knows in advance what will be used
- During analysis, you may need data that you initially ignored

NOTE : READ THE API ACCESS RULES
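
One way to stay within an API's access rules is to throttle requests on the client side. The sketch below is a minimal sliding-window limiter, not any particular API's official client. The `RateLimiter` class and the example limit of 2 calls per second are hypothetical; take the real numbers from the API's own documentation.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls calls per `period` seconds (sliding window)."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()  # timestamps of the calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Window is full: wait until the oldest call expires.
            oldest = self.calls.popleft()
            self.sleep(max(0.0, self.period - (now - oldest)))
            now = self.clock()
        self.calls.append(now)


# Hypothetical limit: 2 requests per second.
limiter = RateLimiter(max_calls=2, period=1.0)
```

Calling `limiter.wait()` right before each request keeps the collector under the configured limit, whichever HTTP library does the actual fetching.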

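When collecting from Twitter, the results can differ by region, e.g. Korea vs. the US (West Coast, East Coast).
Cloud proxies can be used to collect the data.
Store the raw data as far as possible -> so that the analysis can take a different direction later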
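
A simple way to keep raw data is to append every response, untouched, to a JSON Lines file together with a little collection metadata. This is only a sketch under assumptions: the function names, the `source` label (for example the region the request was sent from), and the file layout are all hypothetical.

```python
import json
import time
from pathlib import Path


def append_raw(path, payload, source):
    """Append one untouched API/crawl response to a JSON Lines file."""
    record = {
        "collected_at": time.time(),  # when the response was fetched
        "source": source,             # hypothetical label, e.g. collection region
        "raw": payload,               # the response exactly as received
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_raw(path):
    """Yield the stored records back, one per line."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

Because nothing is filtered at collection time, attributes that looked useless at first are still there when the analysis changes direction.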

Data Preprocessing

Data Preprocessing : Overview

The first step to ensure "data quality"
- Accuracy : Is the data correct or wrong, accurate or not?
- Completeness : Is anything not recorded or unavailable?
- Consistency : Are some values modified but others not? Are there dangling references?
- Timeliness : Is the data updated in a timely manner?
- Believability : How much can the data be trusted to be correct?
- Interpretability : How easily can the data be understood?
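
Some of these dimensions can be measured directly. As a rough illustration of completeness, the sketch below reports the fraction of records with a value for each attribute. The `completeness` function and the sample sales records are made up for this example, and treating `None` and `""` as "missing" is an assumption.

```python
def completeness(rows, fields):
    """Return the fraction of rows that have a non-missing value for each field."""
    total = len(rows)
    report = {}
    for field in fields:
        # A value counts as missing if the key is absent, None, or an empty string.
        present = sum(1 for row in rows if row.get(field) not in (None, ""))
        report[field] = present / total if total else 0.0
    return report


# Hypothetical sales records.
sales = [
    {"cust_id": 1, "income": 52000, "city": "Seoul"},
    {"cust_id": 2, "income": None, "city": "Busan"},
    {"cust_id": 3, "income": 61000, "city": ""},
    {"cust_id": 4, "city": "Seoul"},
]
print(completeness(sales, ["cust_id", "income", "city"]))
# {'cust_id': 1.0, 'income': 0.5, 'city': 0.75}
```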

-> Preprocessing takes place at the stage where choices and assumptions are made.

Manipulating data for your intended use!
- Note : preprocessing should be conducted REASONABLY

Major Tasks in Data Preprocessing

Data cleaning
- Fill in missing values
Data integration
Data transformation and data discretization
Data reduction

Data Cleaning

Data in the real world is dirty: lots of potentially incorrect data, e.g. faulty instruments, human or computer error, transmission error
- Incomplete
- Noisy
- Inconsistent
- Intentional

Incomplete (Missing) Data

Data is not always available
- e.g. many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
-

How to Handle Missing Data?

Ignore the tuple : usually done when the class label is missing (in classification)
Fill in the missing value manually
Fill it in automatically with
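
The first and last strategies can be sketched in a few lines. The notes do not say which value to fill in automatically, so using the attribute mean here is an assumption (a global constant or a per-class mean are other common choices). The function names and the sample records are hypothetical.

```python
from statistics import mean


def drop_unlabeled(rows, label):
    """Ignore tuples whose class label is missing."""
    return [row for row in rows if row.get(label) is not None]


def fill_with_mean(rows, field):
    """Return copies of rows with missing `field` values replaced by the mean."""
    observed = [row[field] for row in rows if row.get(field) is not None]
    fill = mean(observed)  # raises StatisticsError if no value was observed
    return [
        {**row, field: fill if row.get(field) is None else row[field]}
        for row in rows
    ]


# Hypothetical customer records for a classification task on "buys".
customers = [
    {"income": 40000, "buys": "yes"},
    {"income": None, "buys": "no"},
    {"income": 60000, "buys": None},  # class label missing -> tuple is ignored
    {"income": 50000, "buys": "yes"},
]
labeled = drop_unlabeled(customers, "buys")
filled = fill_with_mean(labeled, "income")
```

Note that the order matters: the mean is computed only over the tuples that survive the first step.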

Dealing with Noisy Data

Noise : Random error or variance in a measured variable
Incorrect attribute values

Dealing with Noisy Data (cont'd)

Binning (
Regression
- smooth by fitting the data to regression functions
Clustering
- detect and remove outliers
Combined computer and human inspection
- detect suspicious values and have a human check them (e.g. to deal with possible outliers)
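
As an illustration of binning, the sketch below uses equal-frequency bins and smooths each value to its bin mean. The description of binning in these notes is cut off, so this particular variant is an assumption; smoothing by bin medians or bin boundaries works along the same lines. The `prices` values are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace each value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```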

Data Integration

Gathering data together using a single identifier (key)

Data integration : combines data from multiple sources into a coherent store
Schema integration : e.g. cust-id in one source vs. cust-# in another
Entity identification problem
Detecting and resolving data value conflicts
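
The points above can be sketched as a small merge over two sources whose key attribute has different names. Everything here is hypothetical: the `integrate` function, the sample `crm` and `billing` records, and the policy of keeping the first source's value when the two disagree.

```python
def integrate(source_a, source_b, key_a="cust-id", key_b="cust-#"):
    """Merge two record lists whose key attribute has different names.

    Returns (merged, conflicts): merged maps customer id -> combined record,
    conflicts lists (id, attribute, value_a, value_b) where the sources disagree.
    """
    merged = {}
    conflicts = []
    for row in source_a:
        merged[row[key_a]] = {k: v for k, v in row.items() if k != key_a}
    for row in source_b:
        cust = row[key_b]
        record = merged.setdefault(cust, {})
        for attr, value in row.items():
            if attr == key_b:
                continue
            if attr in record and record[attr] != value:
                # Data value conflict: report it and keep source A's value.
                conflicts.append((cust, attr, record[attr], value))
            else:
                record[attr] = value
    return merged, conflicts


crm = [{"cust-id": 1, "name": "Kim", "city": "Seoul"}]
billing = [
    {"cust-#": 1, "city": "Busan", "plan": "pro"},
    {"cust-#": 2, "plan": "free"},
]
merged, conflicts = integrate(crm, billing)
```

Here customer 1 ends up with attributes from both sources, and the disagreement over `city` is recorded in `conflicts` instead of being silently overwritten.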
