Data Collection
Considerations on Data Collection
- Does the dataset exist (or can we build it)?
- If NOT, the DM problem you defined cannot be carried out
- If it exists, specify the target data source, considering
- Availability
- Access type (API, download) -> Need to find "limits or warnings"
Spend plenty of time searching for previously collected or publicly available datasets
- Search for APIs and existing datasets (someone may have already crawled the data)
Considerations on Data Collection (Cont'd)
- Decide the scale of the dataset & what to store
- Reasonable amount to solve your problem
- Storage capacity
- Decide the collection methodology
- Crawling web pages or using APIs
- Multiprocessing vs. single processing
- TIP : Collect as much data as you can store
- Re-collection is too expensive
- Nobody (not even you) knows in advance what will be used
- During analysis, you may need data that you initially ignored
NOTE : READ API ACCESS RULES
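Access rules usually include a request limit, so a collector should pace itself. The following is a minimal sketch of a sliding-window throttle using only the standard library. The class name `Throttle` and its parameters are hypothetical, and the actual limit values must come from the rules of the API you are using.

```python
import time


class Throttle:
    """Allow at most `max_calls` calls per `period` seconds (sliding window)."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the logic can be tested without waiting
        self.sleep = sleep
        self.calls = []       # timestamps of calls still inside the window

    def wait(self):
        """Block until one more call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls = [t for t in self.calls if now - t < self.period]
        self.calls.append(now)
```

Call `wait()` right before each request, for example `Throttle(max_calls=2, period=10.0)` if the rules allowed two requests per ten seconds (an illustrative number, not a real limit).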
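When collecting from Twitter, the results may differ depending on the location (Korea, US West, US East).
Cloud proxies can be used to collect the data.
Store the data as raw as possible -> so that you can take a different direction later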
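One way to keep the data raw is to append every response, untouched, to a compressed JSON Lines file together with a little collection metadata. This is only a sketch: the function names `append_raw` and `read_raw` and the fields `collected_at` and `source` are assumptions made for illustration.

```python
import gzip
import json
import time


def append_raw(path, payload, source):
    """Append one raw record, unmodified, as a single JSON line."""
    record = {
        "collected_at": time.time(),  # when we collected it (hypothetical field)
        "source": source,             # where it came from (hypothetical field)
        "raw": payload,               # the response exactly as received
    }
    # "at" appends text; each call adds a new gzip member to the same file.
    with gzip.open(path, "at", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_raw(path):
    """Yield the stored records back, one dict per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

Because nothing is dropped at collection time, fields that were ignored at first can still be extracted later without re-collecting.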
Data Preprocessing
Data Preprocessing : Overview
- The first step to ensure "data quality"
- Accuracy : Correct or wrong, accurate or not
- Completeness : Not recorded, unavailable
- Consistency : Some modified but some not, dangling
- Timeliness : Timely update?
- Believability : How much can the data be trusted to be correct?
- Interpretability : How easily the data can be understood?
-> Preprocessing proceeds through a series of choices and assumptions.
- Manipulating data for your intended use!
- Note : preprocessing should be conducted REASONABLY
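Some of the quality dimensions above can be measured directly. As a rough illustration, completeness can be reported as the fraction of rows with a recorded value per attribute. The function name `completeness` and the choice to treat `None` and the empty string as "missing" are assumptions of this sketch.

```python
def completeness(rows, attributes):
    """Return {attribute: fraction of rows with a recorded value}."""
    total = len(rows)
    report = {}
    for attr in attributes:
        # A value counts as missing if the key is absent, None, or "".
        present = sum(1 for r in rows if r.get(attr) not in (None, ""))
        report[attr] = present / total if total else 0.0
    return report


rows = [
    {"age": 30, "income": None},
    {"age": None, "income": 100},
    {"age": 25, "income": ""},
    {"age": 41},
]
print(completeness(rows, ["age", "income"]))  # {'age': 0.75, 'income': 0.25}
```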
Major Tasks in Data Preprocessing
- Data cleaning
- Fill in missing values
- Data integration
- Data transformation and data discretization
- Data reduction
Data Cleaning
- Data in the real world is dirty: lots of potentially incorrect data, ex) faulty instruments, human or computer error, transmission error
- Incomplete
- Noisy
- Inconsistent
- Intentional
Incomplete (Missing) Data
- Data is not always available
- ex) Many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
-
How to Handle Missing Data?
- Ignore the tuple : usually done when class label is missing (when doing classification)
- Fill in the missing value manually
- Fill it in automatically with
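Two of the strategies above can be sketched in a few lines: ignoring tuples whose class label is missing, and filling a numeric attribute automatically. The notes do not say which fill value is used, so the attribute mean here is an assumption (one common choice), and the function name `handle_missing` is hypothetical.

```python
from statistics import mean


def handle_missing(rows, label, numeric_attr):
    """Drop rows without a class label, then mean-fill one numeric attribute."""
    # 1) Ignore the tuple when the class label is missing (copy to avoid mutation).
    kept = [dict(r) for r in rows if r.get(label) is not None]
    # 2) Fill automatically: here with the mean of the observed values (assumption).
    observed = [r[numeric_attr] for r in kept if r.get(numeric_attr) is not None]
    fill = mean(observed) if observed else None
    for r in kept:
        if r.get(numeric_attr) is None:
            r[numeric_attr] = fill
    return kept


sales = [
    {"income": 100, "label": "a"},
    {"income": None, "label": "b"},
    {"income": 300, "label": None},   # no class label: ignored
    {"income": 200, "label": "a"},
]
cleaned = handle_missing(sales, "label", "income")
print([r["income"] for r in cleaned])  # [100, 150, 200]
```

Note that the mean is computed after dropping unlabeled tuples, so the fill value depends on the order of the two steps. This is one of the choices that should be made deliberately.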
Dealing with Noisy Data
- Noise : Random error or variance in a measured variable
- Incorrect attribute values
Dealing with Noisy Data (cont'd)
- Binning (
- Regression
- smooth by fitting the data to regression functions
- Clustering
- detect and remove outliers
- Combined computer and human inspection
- detect suspicious values and have a human check them (deal with possible outliers)
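Two of the smoothing techniques listed above can be sketched with plain Python. The notes do not specify a binning variant, so this assumes equal-frequency bins smoothed by bin means, and the regression part assumes a simple one-variable least-squares line. Both function names are hypothetical.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort, split into equal-frequency bins, replace each value by its bin mean."""
    if not values:
        return []
    ordered = sorted(values)
    size = -(-len(ordered) // n_bins)  # ceiling division: values per bin
    smoothed = []
    for i in range(0, len(ordered), size):
        bin_values = ordered[i:i + size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def smooth_by_regression(xs, ys):
    """Fit y = intercept + slope * x by least squares and return the fitted ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Binning smooths using only neighboring values in sorted order, while regression smooths against another variable, so the two are suited to different kinds of noise.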
Data Integration
Gathering data together under a single identifier (key)
- Data integration : combines data from multiple sources into a coherent store
- Schema integration : ex) cust-id, cust-#
- Entity identification problem
- Detecting and resolving data value conflicts
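The three points above can be illustrated together: mapping differently named key fields (`cust-id`, `cust-#`) to one key, merging records that share that key, and reporting value conflicts. This is a sketch under assumptions: the unified name `cust_id`, the function names, and the "keep the first value seen" policy are all hypothetical choices, and it assumes equal key values really denote the same entity.

```python
def normalize(record, aliases, unified_key="cust_id"):
    """Schema integration: rename any alias of the key field to one unified name."""
    return {
        (unified_key if field in aliases else field): value
        for field, value in record.items()
    }


def integrate(sources, aliases=("cust-id", "cust-#"), unified_key="cust_id"):
    """Merge records from several sources by key and collect value conflicts."""
    merged = {}
    conflicts = []
    for records in sources:
        for rec in records:
            rec = normalize(rec, aliases, unified_key)
            key = rec[unified_key]
            entity = merged.setdefault(key, {})
            for field, value in rec.items():
                if field in entity and entity[field] != value:
                    # Data value conflict: keep the first value, report the clash.
                    conflicts.append((key, field, entity[field], value))
                else:
                    entity[field] = value
    return merged, conflicts


sales = [{"cust-id": 1, "name": "Kim", "city": "Seoul"}]
support = [{"cust-#": 1, "name": "Kim", "city": "Busan"}, {"cust-#": 2, "name": "Lee"}]
merged, conflicts = integrate([sales, support])
print(conflicts)  # [(1, 'city', 'Seoul', 'Busan')]
```

The conflicts list is deliberately returned rather than silently resolved, so a person can decide which source to trust.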