Concept

Use Case

  • Group news articles by topic

What you will probably do

  • compile documents
  • featurize documents
  • compare the features

Simple Process

  • Bag of words: each document is represented as a vector of word counts over a shared vocabulary
    - “Blue House” -> (red,blue,house) -> (0,1,1)
    - “Red House” -> (red,blue,house) -> (1,0,1)
  • We can improve on Bag of Words by adjusting word counts based on how frequently each word appears in the corpus (the group of all the documents)
    - We can use TF-IDF (Term Frequency - Inverse Document Frequency)
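The Bag of Words example above can be sketched in a few lines of Python. This is only an illustration: the function name bag_of_words and the lowercase, whitespace-based tokenization are assumptions, not something these notes prescribe.

```python
# Fixed vocabulary order, taken from the "Blue House" / "Red House" example.
vocabulary = ["red", "blue", "house"]


def bag_of_words(document, vocabulary):
    """Return one count per vocabulary word (hypothetical helper).

    Assumption: words are separated by whitespace and case is ignored.
    """
    tokens = document.lower().split()
    return [tokens.count(word) for word in vocabulary]


blue_house = bag_of_words("Blue House", vocabulary)
red_house = bag_of_words("Red House", vocabulary)

print(blue_house)  # [0, 1, 1]
print(red_house)   # [1, 0, 1]
```

Both documents share the "house" position, which is why raw counts alone make them look similar.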

TF-IDF

  • Term Frequency : Importance of the term within that document
    - TF(d,t) = Number of occurrences of term t in document d
  • Inverse Document Frequency - Importance of the term in the corpus
    - IDF(t) = log(D/n_t) where
    - D = total number of documents
    - n_t = number of documents containing the term t
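A minimal sketch of the two definitions above, using only the Python standard library. Several things here are assumptions rather than part of these notes: the function names, the whitespace tokenization, the natural logarithm (the notes do not state a base), and combining the two parts as the product TF x IDF, which is the usual way to form the final score.

```python
import math


def term_frequency(term, document_tokens):
    """TF(d,t): number of occurrences of the term in one document."""
    return document_tokens.count(term)


def inverse_document_frequency(term, corpus_tokens):
    """IDF(t) = log(D / number of documents containing the term)."""
    total_documents = len(corpus_tokens)
    documents_with_term = sum(1 for tokens in corpus_tokens if term in tokens)
    if documents_with_term == 0:
        # Assumption: the formula is undefined for terms absent from the corpus.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(total_documents / documents_with_term)


def tf_idf(term, document_tokens, corpus_tokens):
    """Assumed combination: TF multiplied by IDF."""
    return term_frequency(term, document_tokens) * inverse_document_frequency(
        term, corpus_tokens
    )


# Toy corpus reusing the two documents from the Bag of Words example.
corpus = ["Blue House", "Red House"]
corpus_tokens = [document.lower().split() for document in corpus]

# "house" appears in every document, so its IDF is log(2/2) = 0.
print(tf_idf("house", corpus_tokens[0], corpus_tokens))
# "blue" appears in only one of two documents, so its IDF is log(2/1).
print(tf_idf("blue", corpus_tokens[0], corpus_tokens))
```

In this toy corpus the shared word "house" scores zero while "blue" keeps a positive weight, which is the adjustment to raw counts that TF-IDF is meant to provide.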
