[Data Mining] 1. Introduction

Jungyu Jin · March 7, 2022


What is Data Mining?

Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from large amounts of data. It is also referred to as Knowledge Discovery from Data (KDD), knowledge extraction, or data/pattern analysis.

Data Mining Tasks

Classification
Regression
Association Analysis
Clustering
Anomaly Detection
Time Series Analysis
Text Mining

Category of Data Mining Tasks

Predictive tasks

To predict the unknown or future value of a particular attribute based on the values of other attributes.

Target Variable vs Explanatory Variables
Cost Prediction, Risk Prediction
ex) Classification, Regression, Anomaly Detection

Predictive Tasks : Building Representative Models

Model : a representation of the relationship between variables in a dataset.
Data mining is the process of building a representative model that fits the observed data, with two purposes:

The model predicts the output based on the input variables
The model can be used to understand the relationship between the output variable and all the input variables

Descriptive tasks

To find human-interpretable Patterns that summarize the underlying relationships in the data

Correlations, Trends, Clusters, Anomalies
ex) Clustering, Association Analysis

Descriptive Task: Summarizing Past Events

Data mining is the process of summarizing the observed data, with two purposes:

To provide new, non-trivial information about what happened
To find human-interpretable patterns
ex) Identifying web pages that are accessed together, identifying and describing groups of customers with common buying behavior

Learning Models for DM Tasks

Supervised learning

To infer a function or relationship based on labeled training data
To predict the value of output variable based on input variables. ex) Classification, Regression

Unsupervised learning

To uncover hidden patterns in unlabeled data. There is no output variable to predict
To find patterns based on the relationship between instances. ex) Clustering, Association analysis

Classification - Applications

Direct Marketing

Goal : Reduce the cost of mailing by targeting a set of consumers likely to buy a product
Approach
- Use the data for a similar product introduced before
- We know which customers decided to buy and which decided otherwise. This buy or don't buy decision forms the class attribute (target)
- Collect various demographic, lifestyle, and company-interaction related information about all such customers
- Use this information as input attributes to learn a classifier model
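The approach above can be sketched with a very small classifier. This is a minimal illustration, not the method used in any real campaign: the customer attributes (age, yearly store visits), the data values, and the choice of k-nearest neighbors are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical past campaign data (invented for illustration):
# (age, yearly store visits) -> class attribute: 1 = bought, 0 = did not buy.
train = [
    ((25, 2), 0),
    ((30, 3), 0),
    ((45, 12), 1),
    ((50, 15), 1),
    ((35, 10), 1),
    ((22, 1), 0),
]

def knn_predict(x, data, k=3):
    """Predict the class of x by majority vote among its k nearest neighbors."""
    nearest = sorted(data, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Score two new (hypothetical) customers before deciding whom to mail.
likely_buyer = knn_predict((48, 13), train)
unlikely_buyer = knn_predict((24, 2), train)
```

In practice the attributes would usually be scaled first, since Euclidean distance is dominated by whichever attribute has the largest range.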

Fraud Detection

Goal: Predict fraudulent cases in credit card transactions
Approach
- Use credit card transactions and the information on the account holder as attributes
- For example: when a customer buys, what they buy, how often they pay on time, etc.
- Label past transactions as fraud or fair transactions. This forms the class attribute
- Learn a model for the class of the transactions
- Use this model to detect fraud by observing credit card transactions on an account.

Other applications: customer attrition/churn prediction, sky survey cataloging, galaxy classification, etc.

Regression

Definition

To predict the value of a given continuous-valued attribute based on the values of other attributes, assuming a linear or nonlinear model of dependency
Extensively studied in statistics and in the neural network field.
Examples
1. Predicting the sales of a new product based on advertising expenditure
2. Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
3. Time series prediction of stock market indices.
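As a concrete case of the linear model of dependency, the first example can be sketched with ordinary least squares on a single input. The numbers below are invented and lie exactly on a line so the result is easy to check; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data: advertising expenditure vs. sales, constructed as y = 2x + 1.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6.0 + intercept  # predicted sales for an expenditure of 6.0
```

The fitted slope also serves the second purpose of a model mentioned earlier: it describes how much the output changes per unit of the input.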

Clustering

Definition

Given a set of data objects, each having a set of attributes, and a similarity measure among them, find clusters such that:
1. data objects in one cluster are more similar to one another
2. data objects in separate clusters are less similar to one another
Similarity measures:

Euclidean distance if attributes are continuous
Other problem-specific measures.
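One common way to realize this definition with Euclidean distance is k-means. The notes do not name a specific algorithm, so k-means is a choice made here for illustration; the points and the starting centroids are invented.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = per-axis mean of the cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two visually separate groups of invented 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

A fixed number of iterations keeps the sketch short; a fuller version would stop once the assignments no longer change.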

Applications

Market Segmentation

Goal: To subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach
- Collect different attributes of customers based on their geographical and lifestyle related information.
- Find clusters of similar customers.
- Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Document Clustering

Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach:
- Identify frequently occurring terms in each document.
- Form a similarity measure based on the frequencies of different terms.
- Use it to cluster.
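The first two steps can be sketched as follows. Cosine similarity over raw term counts is one possible similarity measure, chosen here as an assumption; the three documents are invented, and whitespace splitting stands in for real tokenization.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data",
    "d3": "the weather is sunny today",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])  # documents sharing several terms
sim_13 = cosine_similarity(tf["d1"], tf["d3"])  # documents sharing no terms
```

A clustering algorithm can then group documents whose pairwise similarity is high, here d1 with d2 rather than with d3.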

Other applications

Customer profiling for targeted marketing
Group related documents for browsing
Group genes and proteins that have similar functionality
Group stocks with similar price fluctuations
Reduce the size of large data sets

Association Analysis

Definition

Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
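How strongly a rule such as {diapers} -> {beer} holds is usually quantified with support and confidence. These two measures are standard in association analysis but are not named in the notes above, and the transactions below are invented, so treat this as a sketch.

```python
# Invented market-basket records; each record is the set of items in one purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= record for record in records) / len(records)

def confidence(antecedent, consequent, records):
    """Of the records containing the antecedent, the fraction that also contain the consequent."""
    both = support(set(antecedent) | set(consequent), records)
    return both / support(antecedent, records)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
```

Here three of the five records contain both items, and three of the four records with diapers also contain beer. The `confidence` helper assumes the antecedent appears in at least one record; otherwise it would divide by zero.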

Applications

Market-basket analysis : Rules are used for sales promotion, shelf management, and inventory management
Telecommunication alarm diagnosis : Rules are used to find combinations of alarms that frequently occur together in the same time period
Medical Informatics : Rules are used to find combinations of patient symptoms and test results associated with certain diseases

Anomaly Detection

Detect significant deviations from normal behavior
Examples : Credit Card Fraud Detection, Network Intrusion Detection
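A simple way to make "significant deviation" concrete is a z-score rule: flag values that lie more than some number of standard deviations from the mean. The threshold of 2.0 and the transaction amounts are assumptions chosen for illustration, and real fraud or intrusion detection uses far richer models.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Invented card transaction amounts with one unusually large purchase.
amounts = [20, 22, 19, 21, 23, 20, 18, 500]
outliers = zscore_outliers(amounts)
```

One caveat of this rule is that an extreme value inflates the mean and standard deviation it is judged against, so robust alternatives (for example, median-based ones) are often preferred.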

Time Series Forecasting

A time series adds an explicit order dependence between observations
Forecasting uses historical data to predict future observations.
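As a minimal sketch of forecasting from historical data, a moving-average forecast predicts the next observation as the mean of the most recent ones. The window size and the sales figures are invented, and this baseline is only one of many forecasting methods.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    return sum(series[-window:]) / window

# Invented monthly sales, ordered oldest to newest; the order matters.
monthly_sales = [100, 110, 120, 130, 140, 150]
next_month = moving_average_forecast(monthly_sales, window=3)
```

Because it averages past values, this baseline lags behind a steady upward trend like the one in the example data.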

Text Mining

A type of data mining where the input data is text
Text can be in the form of documents, messages, emails, etc.
Text files are converted into document vectors (structured data); standard DM tasks can then be applied
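The conversion into document vectors can be sketched with a bag-of-words representation: every document becomes a fixed-length vector of term counts over a shared vocabulary. The two documents and the whitespace tokenization are simplifying assumptions.

```python
def build_vocabulary(documents):
    """Sorted list of all distinct lowercase terms across the documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_vector(document, vocabulary):
    """Term-count vector for one document, with one position per vocabulary term."""
    words = document.lower().split()
    return [words.count(term) for term in vocabulary]

documents = ["Data mining finds patterns", "Text mining mines text"]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]
```

Once every document has the same vector length, the vectors can be fed to the classification or clustering methods described earlier.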

Challenges of Data Mining

Scalability : very large data sets
High Dimensionality : many features
Heterogeneous and Complex Data : many types of data
Data Ownership and Distribution
Non-traditional Analysis
