Natural Language Processing
Concept
Use Case
- Group news articles by topic
What you will probably do
- compile documents
- featurize documents
- compare the features
Simple Process
- Bag of words : Each document is represented as a vector of word counts over a fixed vocabulary
- “Blue House” -> (red,blue,house) -> (0,1,1)
- “Red House” -> (red,blue,house) -> (1,0,1)
- We can improve on Bag of Words by adjusting word counts based on their frequency in the corpus (the group of all the documents)
- We can use TF-IDF (Term Frequency - Inverse Document Frequency) to do this
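The Bag of Words example above can be sketched in a few lines of Python. This is a minimal illustration, not a production featurizer: the function name bag_of_words is hypothetical, and it assumes lowercasing plus whitespace splitting is good enough as tokenization.

```python
def bag_of_words(document, vocabulary):
    """Return a word-count vector for `document` over a fixed `vocabulary`.

    Assumption: tokens are produced by lowercasing and splitting on whitespace.
    """
    tokens = document.lower().split()
    # One count per vocabulary word, in vocabulary order
    return [tokens.count(word) for word in vocabulary]


vocabulary = ["red", "blue", "house"]
blue_house = bag_of_words("Blue House", vocabulary)
red_house = bag_of_words("Red House", vocabulary)

print(blue_house)  # [0, 1, 1]
print(red_house)   # [1, 0, 1]
```

The two vectors differ only in the "red" and "blue" positions, which is what a later comparison step would pick up on.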
TF-IDF
- Term Frequency : Importance of the term within that document
- TF(d,t) = Number of occurrences of term t in document d
- Inverse Document Frequency : Importance of the term in the corpus
- IDF(t) = log(D/t)
where
- D = total number of documents
- t (in the denominator) = number of documents with the term
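The two definitions above can be sketched as plain Python. Several things here are assumptions rather than part of the notes: the function names are hypothetical, the logarithm is taken as the natural log (the notes do not give a base), tokenization is lowercasing plus whitespace splitting, and the final score is the product TF x IDF, which is the usual convention but is not spelled out above.

```python
import math


def term_frequency(document, term):
    """TF(d,t): number of occurrences of `term` in `document`."""
    return document.lower().split().count(term)


def inverse_document_frequency(corpus, term):
    """IDF(t) = log(D / number of documents containing the term).

    Assumption: natural logarithm; the notes do not specify a base.
    """
    total_documents = len(corpus)
    documents_with_term = sum(
        1 for document in corpus if term in document.lower().split()
    )
    if documents_with_term == 0:
        # IDF as defined here is undefined for a term that never appears
        raise ValueError(f"term {term!r} does not appear in the corpus")
    return math.log(total_documents / documents_with_term)


def tf_idf(document, corpus, term):
    """Assumed combination: TF(d,t) multiplied by IDF(t)."""
    return term_frequency(document, term) * inverse_document_frequency(corpus, term)


corpus = ["Blue House", "Red House"]

# "house" is in every document, so IDF = log(2/2) = 0 and it carries no weight
house_score = tf_idf("Blue House", corpus, "house")

# "blue" is in one of two documents, so IDF = log(2/1)
blue_score = tf_idf("Blue House", corpus, "blue")
```

With this tiny corpus, "house" scores 0 because it appears everywhere, while "blue" scores log(2), so the term that actually distinguishes the documents gets the larger weight.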