6. Pattern Mining

Eunji·2026년 5월 21일

Data Mining

목록 보기
10/12

1. Pattern Mining

1.1 Frequent Patterns

Definition

  • frequent patterns
  • they reveal hidden regularities and relationship in data

Why studying frequent pattern mining?

  • association, correlation, casuality
  • sequential patterns
  • partial periodicity, cyclic/temporal associations

foundation for several essential data mining tasks

Type

  • frequent items
    • items that occur together
    • e.g., {milk, bread}
  • sequential patterns
    • items that occur in a specific order
    • e.g., smartphone \rightarrow TV \rightarrow IoT device
  • structed patterns
    • patterns in complex data structures
    • e.g., subgraphs, subtrees, fractal

2. Association Rule Mining Fundamentals

2.1 Example: Market Basket Analysis

  • used to understand customer purchasing behavior by analyzing transaction data
  • identify which items are frequently purchased together during a single shopping

Business Application

  • 매장 배치: 자주 함께 구매되는 상품을 가까이 배치하여 매출 증대 유도 또는 멀리 배치하여 고객이 더 많은 상품을 훑어보게 함
  • 프로모션: 관련 상품을 묶어서 할인 판매하는 교차 판매 전략 수립

2.2 Key Metrics for Association Rules

1. Support ss

  • how frequently both items appear together in all transactions
  • support(AB)=P(AB)support(A \Rightarrow B) = P(A \cup B)

합집합 기호이지만, 의미상으로는 co-occurrence를 뜻한다.

2. Confidence cc

  • indicates how likely a customer who buys a computer will also buy antivirus software
  • confidence(AB)=P(BA)=support(AB)support(A)=support_count(AB)support_count(A)confidence(A \Rightarrow B) = P(B|A) = \displaystyle\frac{support(A \cup B)}{support(A)} = \frac{support\_count(A \cup B)}{support\_count(A)}

2.3 Frequent Itemset

Notations

  • I={I1,I2,...,Im}\mathcal{I} = \{I_1, I_2, ..., I_m\}: itemset
  • DD: task-relevant data
  • TT: nonempty itemset such that TIT\subseteq \mathcal{I}
    • each transactions is associated with an identifier, called TID

Association Rule

  • ABA \Rightarrow B
  • an implication where AI,BI,A,BA\subseteq \mathcal{I}, B\subseteq \mathcal{I}, A\subseteq \empty, B\subseteq \empty and AB=A \cap B = \empty

두 집합이 완전히 다른 아이템들로만 구성돼야 "A를 샀을 때 B를 살 확률이 높다"는 의미 있는 인사이트가 나온다.

Two Key Measures

  1. support
    • proportion of transactions in DD that contain ABA \cap B
    • support(AB)=P(AB)support(A \Rightarrow B) = P(A \cup B)
    • = relative support(ratio)
    • occurrence frequency = absolute support(count)
  • if the relative support \ge minimum support threshold, then I\mathcal{I} is a frequent itemset
    • the set of frequent 𝑘-itemsets is commonly denoted by 𝐿𝑘𝐿_𝑘
  1. confidence
    • proportion of transactions containing AA that also contain BB
    • confidence(AB)=P(BA)confidence(A \Rightarrow B) = P(B|A)
    • = conditional probability of BB given AA

K-itemset

  • an interest that contain exactly k items
  • e.g., {computer, antivirus} is a 2-itemset

3. Association Rule Mining

3.1 Two-Step Process

1. Find all frequent itemsets

  • each of these itemsets will occur at least min_supmin\_sup

2. Generate strong association rules

  • these rules must satisfy min_supmin\_sup and min_confmin\_conf
  • why union is association rules?
    • AB=A \cap B = {} 공집합이므로 A와 B가 같이 등장하는 트랜잭션을 찾으려면 intersection()(\cap)이 아니라 union()(\cup)을 써야 한다.

3.2 Challenges

Combinatorial Explosion

if an itemset is frequent, all of the subsets are also frequent

  • a long itemset will contain a combinatorial number of shorter frequent subitemsets
    • e.g., a frequent itemset of length 100
    • 210011.27×10302^{100} - 1 \approx 1.27 \times 10^{30}
  • to overcome this difficulty,
  • closed frequent itemset and maximal frequent itemset

폭발적인 조합을 줄이기 위해 줄일 수 있는 건 쳐내는 prunning을 해야 한다.


4. Apriori Algorithm

4.1 Strategy

Key Idea

  • the name "Aprior" comes from the fact the algorithm uses prior knowledge of frequent properties

Iterative Level-wise Search

  • where k-itemsets are used to generate (k+1)-itemsets
  • first, the database is scanned to count the occurrences of each item,
    • and those satisfying the min_supmin\_sup are collected to form the set of frequent L1L_1
    • then, L1L_1 is used to generate L2L_2, which is then used to generate L3L_3, and so on, until no more frequent items can be found
    • each iteration requires a full scan of the database

4.2 Apriori Property: Antimonotonicity

Definition

  • all nonempty subsets of a frequent itemset must also be frequent
  • this property based on the following observations

Observation

  1. if an itemset I\mathcal{I} does not satisfy min_supmin\_sup, then I\mathcal{I} is not frequent
    • P(P(\mathcal{I})<min_sup) < min\_sup
  2. if an item AA is added to I\mathcal{I}, then the resulting itemset cannot occur more frequently than I\mathcal{I}

아이템이 늘어날수록 조건이 더 까다로워지니까 등장 횟수는 줄어들거나 같을수밖에 없다. I\mathcal{I}가 이미 not frequent하다면, 거기에 아이템을 추가해도 not frequent하다.

Antimonotonicity

  • if a set cannot pass a test, all of its supersets will fail the same test as well
  • e.g., {milk, bread} \rightarrow support = 0.1 (min_supmin\_sup = 0.3) \rightarrow not frequent

4.3 Core Steps of Apriori

새로운 후보군을 만들고(join), 불필요한 것을 쳐내는(prune) 과정이다.

1. Join Step \bowtie

  • generate candidate k-itemsets CkC_k by self-joining the frequent (k-1)-itemsets Lk1L_{k-1}
    • their first (k-2) items are identical

두 개의 셋을 합칠 때, 앞부분은 똑같고 마지막 한 아이템만 서로 달라야 한다.

  • key point
    • itemsets are sorted in lexicographic order
    • only valid combinations are generated
    • duplicate candidates are avoided

2. Prune Step

  • candidate set CkC_k may be large
    • many candidates are not frequent
  • for each candidate k-itemset (using antimonotonicity)
    • check all (k-1)-subsets
    • if any (k-1) subset is not frequent \rightarrow remove the candidate

if a subset is not frequent \rightarrow the superset cannot be frequent

0개의 댓글