What is Data Mining?
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from large amounts of data. It is also referred to as Knowledge Discovery from Data (KDD), knowledge extraction, or data/pattern analysis.
Data Mining Tasks
- Classification
- Regression
- Association Analysis
- Clustering
- Anomaly Detection
- Time Series Analysis
- Text Mining
Category of Data Mining Tasks
Predictive tasks
- To predict the unknown or future value of a particular attribute based on the values of other attributes.
Target Variable vs Explanatory Variables
Cost Prediction, Risk Prediction
ex) Classification, Regression, Anomaly Detection
Predictive Tasks : Building Representative Models
- Data Mining is the process of building a representative model that fits the observed data, with two purposes
- Model : the representation of a relationship between variables in a dataset.
- The model predicts the output based on the input variables
- The model can be used to understand the relationship between the output variable and all the input variables
Descriptive tasks
- To find human-interpretable Patterns that summarize the underlying relationships in the data
Correlations, Trends, Clusters, Anomalies
ex) Clustering, Association Analysis
Descriptive Task: Summarizing Past Events
- Data Mining is the process of summarizing the observational data with two purposes
- To provide new, non-trivial information about what happened
- To find human-interpretable patterns
ex) Identifying web pages that are accessed together, Identifying and describing groups of customers with common buying behavior
Learning Models for DM Tasks
Supervised learning
- To infer a function or relationship based on labeled training data
- To predict the value of output variable based on input variables. ex) Classification, Regression
Unsupervised learning
- To uncover hidden patterns in unlabeled data. There is no output variable to predict
- To find patterns based on the relationship between instances. ex) Clustering, Association analysis
Classification - Applications
- Direct Marketing
- Goal : Reduce the cost of mailing by targeting a set of consumers likely to buy a product
- Approach
- Use the data for a similar product introduced before
- We know which customers decided to buy and which decided otherwise. This buy or don't buy decision forms the class attribute (target)
- Collect various demographic, lifestyle, and company-interaction related information about all such customers
- Use this information as input attributes to learn a classifier model
- Fraud Detection
- Goal: Predict fraudulent cases in credit card transactions
- Approach
- Use credit card transactions and information about the account holder as attributes
- When the customer buys, what they buy, how often they pay on time, etc.
- Label past transactions as fraud or fair transactions. This forms the class attribute
- Learn a model for the class of the transactions
- Use this model to detect fraud by observing credit card transactions on an account.
Other applications: Customer Attrition/Churn, Sky Survey Cataloging, Classifying galaxies, etc.
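The direct marketing approach above (past buy / don't buy decisions as the class attribute, customer information as input attributes) can be sketched with a small classifier. This is only an illustration: the notes do not name a specific algorithm, so k-nearest neighbors is used here as one simple choice, and the customer attributes, values, and the function name `knn_predict` are all hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training rows."""
    # Pair each training row's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical customers: (age, number of past purchases).
customers = [(25, 1), (30, 0), (45, 8), (50, 10), (23, 0), (48, 7)]
# Class attribute: did the customer buy the earlier, similar product?
bought = ["no", "no", "yes", "yes", "no", "yes"]

# Predict for a new customer before deciding whether to mail them.
prediction = knn_predict(customers, bought, (47, 9))
print(prediction)  # prints "yes"
```

In practice the attributes would usually be scaled first, since a raw Euclidean distance lets the attribute with the largest range dominate.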
Regression
Definition
- To predict the value of a given continuous-valued attribute based on the values of other attributes, assuming a linear or nonlinear model of dependency
- Extensively studied in statistics and in the neural network field.
- Examples
- Predicting sales amounts of a new product based on advertising expenditure
- Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
- Time series prediction of stock market indices.
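As a minimal sketch of the linear case, the following fits a straight line by ordinary least squares and uses it to predict sales from advertising expenditure. The data values and the function name `fit_line` are made up for illustration; real data would not lie exactly on a line.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure (input) and sales (target).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for an unseen expenditure
print(slope, intercept, predicted_sales)
```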
Clustering
Definition
- Given a set of data objects, each having a set of attributes, and a similarity measure among them, find clusters such that (1) data objects in one cluster are more similar to one another, and (2) data objects in separate clusters are less similar to one another
- Similarity measures:
- Euclidean distance if attributes are continuous
- Other problem-specific measures.
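The definition above does not prescribe an algorithm. As one possible illustration, here is a small k-means sketch that uses Euclidean distance as the (dis)similarity measure. The points, the starting centroids, and the function name `kmeans` are assumptions chosen to keep the example deterministic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Assign points to the nearest centroid, then move each centroid to its cluster mean."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid with the smallest Euclidean distance to p.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)
```

Objects that end up in the same cluster are closer to each other than to objects in the other cluster, which is the property the definition asks for.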
Applications
- Market Segmentation
- Goal: To subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach
- Collect different attributes of customers based on their geographical and lifestyle related information.
- Find clusters of similar customers.
- Measure the clustering quality by comparing the buying patterns of customers in the same cluster with those of customers in different clusters.
- Document Clustering
- Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach:
- To identify frequently occurring terms in each document.
- Form a similarity measure based on the frequencies of different terms.
- Use it to cluster.
Other applications
- Customer profiling for targeted marketing
- Group related documents for browsing
- Group genes and proteins that have similar functionality
- Group stocks with similar price fluctuations
- Reduce the size of large data sets
Association Analysis
Definition
- Given a set of records, each of which contains some number of items from a given collection.
- To produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
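How strong a dependency rule is can be illustrated with two commonly used measures, support and confidence. These measures are not named in the notes above, so treat them as an assumption about how rules are typically scored. The baskets and the function names below are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # rule never applies in this data
    both = support(transactions, set(antecedent) | set(consequent))
    return both / antecedent_support

# Hypothetical market-basket records.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# Score the rule {diapers} -> {beer}.
rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

Here the rule holds in 3 of 5 baskets, and in 3 of the 4 baskets that contain diapers.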
Applications
- Market-basket analysis : Rules are used for sales promotion, shelf management, and inventory management
- Telecommunication alarm diagnosis : Rules are used to find combinations of alarms that occur together frequently in the same time period
- Medical Informatics : Rules are used to find combinations of patient symptoms and test results associated with certain diseases
Anomaly Detection
- Detect significant deviations from normal behavior
- Examples : Credit Card Fraud Detection, Network Intrusion Detection
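One simple way to make "significant deviation from normal behavior" concrete is a z-score test: flag values that lie more than some number of standard deviations from the mean. This is only one of many possible techniques and is not taken from the notes. The transaction amounts, the threshold of 2.0, and the function name `find_anomalies` are assumptions for illustration.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical credit card transaction amounts for one account.
amounts = [20, 22, 19, 21, 23, 20, 500]
flagged = find_anomalies(amounts)
print(flagged)  # prints [500]
```

With very few observations a single extreme value inflates the standard deviation, so its z-score stays small. That is why a low threshold is used in this tiny example.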
Time Series Forecasting
- Time series adds an explicit order dependence between observations
- Forecasting is to use historical data to predict future observations.
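A very simple baseline that respects the order of the observations is a moving-average forecast: predict the next value as the mean of the most recent ones. This is a sketch only. The series, the window size, and the function name `moving_average_forecast` are hypothetical, and practical forecasting usually relies on richer models.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next observation as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    recent = series[-window:]  # order matters: only the latest values are used
    return sum(recent) / window

# Hypothetical monthly sales, oldest first.
monthly_sales = [100, 110, 120, 130, 140, 150]
forecast = moving_average_forecast(monthly_sales, window=3)
print(forecast)  # prints 140.0
```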
Text Mining
- A type of data mining where the input data is text
- Text can be in the form of documents, messages, emails, etc.
- Text files are converted into document vectors (structured data), and then standard DM tasks can be applied
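The conversion from text to document vectors can be sketched with plain term counts (a bag-of-words representation). This is an assumed, minimal scheme: the notes do not specify how the vectors are built, and the sample documents and function names are hypothetical. Cosine similarity is added as one example of a measure that a standard task such as clustering could then use.

```python
import math
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of all distinct lowercase words across the documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_vector(document, vocabulary):
    """Term-count vector of `document`, one entry per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "data mining finds patterns",
    "text mining finds patterns in text",
    "stock prices fell today",
]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]

# The two mining-related documents share terms; the third shares none with the first.
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```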
Challenges of Data Mining
- Scalability : big data
- High Dimensionality : many features
- Heterogeneous and Complex Data : many types of data
- Data Ownership and Distribution
- Non-traditional Analysis