[Data Mining] 1. Introduction

Jungyu Jin · March 7, 2022


What is Data Mining?

Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from large amounts of data. It is also referred to as Knowledge Discovery from Data (KDD), knowledge extraction, or data/pattern analysis.

Data Mining Tasks

  • Classification
  • Regression
  • Association Analysis
  • Clustering
  • Anomaly Detection
  • Time Series Analysis
  • Text Mining

Category of Data Mining Tasks

Predictive tasks

  • To predict the unknown or future value of a particular attribute based on the values of other attributes.

    Target Variable vs Explanatory Variables
    Cost Prediction, Risk Prediction
    ex) Classification, Regression, Anomaly Detection

Predictive Tasks: Building Representative Models
  • Model: the representation of a relationship between variables in a dataset.
  • Data mining is the process of building a representative model that fits the observed data, with two purposes:
  1. The model predicts the output based on the input variables
  2. The model can be used to understand the relationship between the output variable and all the input variables

Descriptive tasks

  • To find human-interpretable Patterns that summarize the underlying relationships in the data

    Correlations, Trends, Clusters, Anomalies
    ex) Clustering, Association Analysis

Descriptive Task: Summarizing Past Events
  • Data Mining is the process of summarizing the observational data with two purposes
  1. To provide new, non-trivial information about what happened
  2. To find human-interpretable patterns
    ex) Identifying web pages that are accessed together, identifying and describing groups of customers with common buying behavior

Learning Models for DM Tasks

Supervised learning

  • To infer a function or relationship based on labeled training data
  • To predict the value of the output variable based on the input variables. ex) Classification, Regression

Unsupervised learning

  • To uncover hidden patterns in unlabeled data. There is no output variable to predict
  • To find patterns based on the relationship between instances. ex) Clustering, Association analysis

Classification - Applications

  1. Direct Marketing
  • Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a product
  • Approach
    • Use the data for a similar product introduced before
    • We know which customers decided to buy and which decided otherwise. This buy or don't buy decision forms the class attribute (target)
    • Collect various demographic, lifestyle, and company-interaction related information about all such customers
    • Use Information as input attributes to learn a classifier model
  2. Fraud Detection
  • Goal: Predict fraudulent cases in credit card transactions
  • Approach
    • Use credit card transactions and information about the account holder as attributes
    • When does a customer buy, what do they buy, how often do they pay on time, etc.
    • Label past transactions as fraud or fair transactions. This forms the class attribute
    • Learn a model for the class of the transactions
    • Use this model to detect fraud by observing credit card transactions on an account.

Other applications: customer attrition/churn prediction, sky survey cataloging, galaxy classification, etc.
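
The direct-marketing approach above can be sketched with a very small classifier. This is only an illustration: the customer attributes (age, number of past purchases), the data values, and the choice of a k-nearest-neighbor classifier are all assumptions made for this example, not something prescribed by the notes.

```python
import math
from collections import Counter

# Hypothetical labeled data from an earlier, similar product:
# (age, past_purchases) -> did the customer buy? (the class attribute)
train = [
    ((25, 1), "no"),
    ((30, 0), "no"),
    ((35, 2), "no"),
    ((45, 6), "yes"),
    ((50, 8), "yes"),
    ((55, 7), "yes"),
]

def knn_predict(train, x, k=3):
    """Predict the class of x by majority vote among its k nearest neighbors."""
    # Sort the training examples by Euclidean distance to x and keep the k closest.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict(train, (48, 5)))  # yes
print(knn_predict(train, (28, 1)))  # no
```

Customers predicted as "yes" would be the ones targeted by the mailing. A real application would use many more attributes and would scale them so that no single attribute dominates the distance.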

Regression

Definition

  • To predict the value of a given continuous-valued attribute based on the values of other attributes, assuming a linear or nonlinear model of dependency
  • Extensively studied in the statistics and neural network fields.
  • Examples
    1. Predicting sales amounts of a new product based on advertising expenditure
    2. Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
    3. Time series prediction of stock market indices.
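
As a minimal sketch of the linear case, the snippet below fits a straight line by ordinary least squares to made-up advertising and sales figures. The data and the function name `fit_line` are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales (exactly y = 2x + 1).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ad_spend, sales)
print(slope, intercept)          # 2.0 1.0
print(slope * 6.0 + intercept)   # predicted sales for a spend of 6.0: 13.0
```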

Clustering

Definition

  • Given a set of data objects, each having a set of attributes, and a similarity measure among them, find clusters such that
    1. data objects in one cluster are more similar to one another
    2. data objects in separate clusters are less similar to one another

  • Similarity measures:
    • Euclidean distance if attributes are continuous
    • Other problem-specific measures.
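
One common way to find such clusters with Euclidean distance is k-means. The notes do not name a specific algorithm, so treat the following as one possible sketch. The points and the initial centroids are invented for the example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid closest to p in Euclidean distance.
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean of its cluster (unchanged if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters[0])  # the three points near the origin
print(clusters[1])  # the three points near (8.5, 8.8)
```

The result depends on the initial centroids. Here they are picked by hand to keep the example deterministic.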

Applications

  1. Market Segmentation
  • Goal: To subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
  • Approach
    • Collect different attributes of customers based on their geographical and lifestyle related information.
    • Find clusters of similar customers.
    • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
  2. Document Clustering
  • Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  • Approach:
    • Identify frequently occurring terms in each document.
    • Form a similarity measure based on the frequencies of different terms.
    • Use it to cluster.
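
A similarity measure based on term frequencies could look like the sketch below. Cosine similarity is one common choice, assumed here for illustration. The example documents are made up, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "data mining finds patterns in data",
    "mining data reveals hidden patterns",
    "the stock market fell today",
]
tfs = [term_frequencies(d) for d in docs]

print(cosine_similarity(tfs[0], tfs[1]))  # relatively high: shared terms
print(cosine_similarity(tfs[0], tfs[2]))  # 0.0: no shared terms
```

Pairwise similarities like these can then be fed to a clustering algorithm to group related documents.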

Other applications:

  • Customer profiling for targeted marketing
  • Group related documents for browsing
  • Group genes and proteins that have similar functionality
  • Group stocks with similar price fluctuations
  • Reduce the size of large data sets

Association Analysis

Definition

  • Given a set of records, each of which contains some number of items from a given collection.
  • To produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
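
Such rules are usually evaluated with support and confidence. These two measures are standard in association analysis, although the notes above do not define them. The snippet below is a minimal sketch on invented market-basket records.

```python
# Hypothetical records (transactions), each a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction also containing the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

# Rule: {diapers} -> {beer}
print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # 0.75
```

Read the rule as: in this toy data, 75% of the records that contain diapers also contain beer.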

Applications

  • Market-basket analysis : Rules are used for sales promotion, shelf management, and inventory management
  • Telecommunication alarm diagnosis : Rules are used to find combinations of alarms that occur together frequently in the same time period
  • Medical Informatics : Rules are used to find combinations of patient symptoms and test results associated with certain diseases

Anomaly Detection

  • Detect significant deviations from normal behavior
  • Examples : Credit Card Fraud Detection, Network Intrusion Detection
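
One simple way to quantify "significant deviation" is a z-score threshold. This is an assumed technique for illustration, and the transaction amounts and the threshold of 2.0 are made up.

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts with one unusually large purchase.
amounts = [20, 22, 19, 21, 23, 20, 250]
anomalies = zscore_anomalies(amounts)
print(anomalies)  # [250]
```

An extreme value also inflates the mean and standard deviation it is compared against, so robust alternatives such as median-based scores are often preferred in practice.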

Time Series Forecasting

  • Time series adds an explicit order dependence between observations
  • Forecasting is to use historical data to predict future observations.
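
As a minimal sketch, a moving-average forecast predicts the next observation from the most recent ones, which makes the order dependence explicit. The series, the window size, and the function name are assumptions for illustration.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    return sum(series[-window:]) / window

# Hypothetical historical observations, oldest first.
history = [10, 12, 13, 15, 14, 16]
forecast = moving_average_forecast(history, window=3)
print(forecast)  # 15.0, the mean of 15, 14, 16
```

Shuffling `history` would change the forecast, unlike in the earlier tasks where the order of records does not matter.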

Text Mining

  • A type of data mining where the input data is text
  • Text can be in the form of documents, messages, emails, etc.
  • Text files are converted into document vectors (structured data), after which standard DM tasks can be applied
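
The conversion into document vectors can be sketched as a simple bag-of-words count over a shared vocabulary. The documents are invented, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

docs = [
    "data mining finds patterns",
    "text mining mines text",
]

# Shared vocabulary: every distinct term across all documents, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

def to_vector(doc, vocabulary):
    """Represent a document as term counts aligned with the vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['data', 'finds', 'mines', 'mining', 'patterns', 'text']
print(vectors[0])  # [1, 1, 0, 1, 1, 0]
print(vectors[1])  # [0, 0, 1, 1, 0, 2]
```

Each document is now a fixed-length numeric row, so it can be passed to a classifier or a clustering algorithm like any other structured data.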

Challenges of Data Mining

  • Scalability : very large data sets
  • High Dimensionality : many features
  • Heterogeneous and Complex Data : many types of data
  • Data Ownership and Distribution
  • Non-traditional Analysis
