1. Pattern Mining
1.1 Frequent Patterns
Definition
- frequent patterns
- they reveal hidden regularities and relationship in data
Why studying frequent pattern mining?
- association, correlation, casuality
- sequential patterns
- partial periodicity, cyclic/temporal associations
foundation for several essential data mining tasks
Type
- frequent items
- items that occur together
- e.g., {milk, bread}
- sequential patterns
- items that occur in a specific order
- e.g., smartphone → TV → IoT device
- structed patterns
- patterns in complex data structures
- e.g., subgraphs, subtrees, fractal
2. Association Rule Mining Fundamentals
2.1 Example: Market Basket Analysis
- used to understand customer purchasing behavior by analyzing transaction data
- identify which items are frequently purchased together during a single shopping
Business Application
- 매장 배치: 자주 함께 구매되는 상품을 가까이 배치하여 매출 증대 유도 또는 멀리 배치하여 고객이 더 많은 상품을 훑어보게 함
- 프로모션: 관련 상품을 묶어서 할인 판매하는 교차 판매 전략 수립
2.2 Key Metrics for Association Rules
1. Support s
- how frequently both items appear together in all transactions
- support(A⇒B)=P(A∪B)
합집합 기호이지만, 의미상으로는 co-occurrence를 뜻한다.
2. Confidence c
- indicates how likely a customer who buys a computer will also buy antivirus software
- confidence(A⇒B)=P(B∣A)=support(A)support(A∪B)=support_count(A)support_count(A∪B)
2.3 Frequent Itemset
Notations
- I={I1,I2,...,Im}: itemset
- D: task-relevant data
- T: nonempty itemset such that T⊆I
- each transactions is associated with an identifier, called TID
Association Rule
- A⇒B
- an implication where A⊆I,B⊆I,A⊆∅,B⊆∅ and A∩B=∅
두 집합이 완전히 다른 아이템들로만 구성돼야 "A를 샀을 때 B를 살 확률이 높다"는 의미 있는 인사이트가 나온다.
Two Key Measures
- support
- proportion of transactions in D that contain A∩B
- support(A⇒B)=P(A∪B)
- = relative support(ratio)
- occurrence frequency = absolute support(count)
- if the relative support ≥ minimum support threshold, then I is a frequent itemset
- the set of frequent 𝑘-itemsets is commonly denoted by Lk
- confidence
- proportion of transactions containing A that also contain B
- confidence(A⇒B)=P(B∣A)
- = conditional probability of B given A
K-itemset
- an interest that contain exactly k items
- e.g., {computer, antivirus} is a 2-itemset
3. Association Rule Mining
3.1 Two-Step Process
1. Find all frequent itemsets
- each of these itemsets will occur at least min_sup
2. Generate strong association rules
- these rules must satisfy min_sup and min_conf
- why union is association rules?
- A∩B= 공집합이므로 A와 B가 같이 등장하는 트랜잭션을 찾으려면 intersection(∩)이 아니라 union(∪)을 써야 한다.
3.2 Challenges
Combinatorial Explosion
if an itemset is frequent, all of the subsets are also frequent
- a long itemset will contain a combinatorial number of shorter frequent subitemsets
- e.g., a frequent itemset of length 100
- 2100−1≈1.27×1030
- to overcome this difficulty,
- closed frequent itemset and maximal frequent itemset
폭발적인 조합을 줄이기 위해 줄일 수 있는 건 쳐내는 prunning을 해야 한다.
4. Apriori Algorithm
4.1 Strategy
Key Idea
- the name "Aprior" comes from the fact the algorithm uses prior knowledge of frequent properties
Iterative Level-wise Search
- where k-itemsets are used to generate (k+1)-itemsets
- first, the database is scanned to count the occurrences of each item,
- and those satisfying the min_sup are collected to form the set of frequent L1
- then, L1 is used to generate L2, which is then used to generate L3, and so on, until no more frequent items can be found
- each iteration requires a full scan of the database
4.2 Apriori Property: Antimonotonicity
Definition
- all nonempty subsets of a frequent itemset must also be frequent
- this property based on the following observations
Observation
- if an itemset I does not satisfy min_sup, then I is not frequent
- P(\mathcal{I})<min_sup
- if an item A is added to I, then the resulting itemset cannot occur more frequently than I
아이템이 늘어날수록 조건이 더 까다로워지니까 등장 횟수는 줄어들거나 같을수밖에 없다. I가 이미 not frequent하다면, 거기에 아이템을 추가해도 not frequent하다.
Antimonotonicity
- if a set cannot pass a test, all of its supersets will fail the same test as well
- e.g., {milk, bread} → support = 0.1 (min_sup = 0.3) → not frequent
4.3 Core Steps of Apriori
새로운 후보군을 만들고(join), 불필요한 것을 쳐내는(prune) 과정이다.
1. Join Step ⋈
- generate candidate k-itemsets Ck by self-joining the frequent (k-1)-itemsets Lk−1
- their first (k-2) items are identical
두 개의 셋을 합칠 때, 앞부분은 똑같고 마지막 한 아이템만 서로 달라야 한다.
- key point
- itemsets are sorted in lexicographic order
- only valid combinations are generated
- duplicate candidates are avoided
2. Prune Step
- candidate set Ck may be large
- many candidates are not frequent
- for each candidate k-itemset (using antimonotonicity)
- check all (k-1)-subsets
- if any (k-1) subset is not frequent → remove the candidate
if a subset is not frequent → the superset cannot be frequent
