Text Mining
Text data mining
- Sentiment analysis
- Document summarization
- News recommendation
- Text analytics in financial services
- Text analytics in healthcare
How to perform text mining?
- As computer scientists, we view it as
- Text Mining = Data Mining + Text Data
Text mining vs. NLP, IR, DM, ...
- How does it relate to data mining in general?
- How does it relate to computational linguistics?
- How does it relate to information retrieval?
![](https://velog.velcdn.com/images/u_u/post/e62c80c9-7699-4e5d-b771-d74e819551e3/image.png)
Text mining in general
![](https://velog.velcdn.com/images/u_u/post/b92cb39b-aba9-40e3-a90c-d0e7ddb6d06d/image.png)
- Using machine learning, unstructured text data is transformed into organized knowledge
Challenges in text mining
- Data collection is "free text"
- Data is not well-organized
- Semi-structured or unstructured
- Natural language text contains ambiguities on many levels
- Lexical, syntactic, semantic, and pragmatic
- Learning techniques for processing text typically need annotated training examples
- Expensive to acquire at scale
- What to mine?
Challenges in text mining (cont'd)
- Huge in size
- 80% of data is unstructured (IBM, 2010)
Scalability is crucial
- Large scale text processing techniques
State-of-the-art solutions
- Apache Spark (spark.apache.org)
- In-memory MapReduce
- Specialized for machine learning algorithms
- Speed
- 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- General
- Combine SQL, streaming, and complex analytics
Document Representation
How to represent a document
- Represent by a string?
- Represent by a list of sentences?
- A sentence is just like a short document (recursive definition)
- Neither works: these representations give no way to compute with documents, e.g., to measure their similarity
Vector Space (VS) model
- Represent documents by concept vectors
- Each concept defines one dimension
- k concepts define a high-dimensional space
- Element of vector corresponds to concept weight
- Distance between the vectors in this concept space
- Relationship among documents
We can calculate distances between these vectors
An illustration of VS model
- All documents are projected into this concept space
![](https://velog.velcdn.com/images/u_u/post/5218b1a2-1ead-4aa4-a597-ef1abaea827d/image.png)
What the VS model doesn't say
- How to define/select the "basic concept"
- Concepts are assumed to be orthogonal
- How to assign weights
- Weights indicate how well the concept characterizes the document
- How to define the distance metric
The concept axes must be orthogonal: no similarity should exist between concepts, so the axes stay fixed.
What is a good "Basic Concept"?
Bag-of-Words representation
Each individual word is treated as one concept.
- Term as the basis for vector space
![](https://velog.velcdn.com/images/u_u/post/8ba26847-4b1d-4d12-8427-fbcaff7c45a1/image.png)
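The Bag-of-Words idea above can be sketched in a few lines; `bag_of_words` and the toy vocabulary are illustrative names, not part of the lecture:

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    # Count term occurrences, then project onto the fixed vocabulary order
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocab = ["text", "mining", "information"]
doc = ["text", "mining", "is", "mining", "information"]
vec = bag_of_words(doc, vocab)  # [1, 2, 1]
```

Note that any token outside the vocabulary ("is" here) is simply dropped.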
Tokenization
![](https://velog.velcdn.com/images/u_u/post/ec5d55ed-8ef4-44ff-85d8-c7e7cc7b54cb/image.png)
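A minimal tokenizer, assuming the simple lowercase-and-split-on-non-alphanumerics scheme shown in the slide (real tokenizers handle punctuation, hyphens, and Unicode more carefully):

```python
import re

def tokenize(text):
    # Lowercase, then extract runs of letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())

tokens = tokenize("Text Mining = Data Mining + Text Data")
# ['text', 'mining', 'data', 'mining', 'text', 'data']
```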
Bag-of-Words with N-grams
N-grams are introduced to improve on plain Bag-of-Words.
- N-grams: a contiguous sequence of N tokens from a given piece of text
- Pros: capture local dependency and order
- Cons: a purely statistical view; increases the vocabulary size to O(V^N)
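Extracting N-grams is a sliding window over the token sequence; a minimal sketch (the function name is illustrative):

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["text", "mining", "is", "fun"], 2)
# [('text', 'mining'), ('mining', 'is'), ('is', 'fun')]
```

The vocabulary growth is visible immediately: with V distinct tokens there are up to V^N distinct N-grams.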
Automatic document representation
- Represent a document with all occurring words
- Pros
- Preserve all information in the text
- Fully automatic
- Cons
- Vocabulary gap: cars versus car, talk versus talking
- Large storage: N-grams need O(V^N)
- Solution
- Construct controlled vocabulary
A statistical property of language
- Zipf's law
- Frequency of any word is inversely proportional to its rank in the frequency table
![](https://velog.velcdn.com/images/u_u/post/1be5203a-ebc6-41a5-b369-2257fe4bc3f0/image.png)
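Zipf's law can be checked by building a rank-frequency table; a sketch on a toy corpus (the law only emerges clearly on large corpora, so the products here are only indicative):

```python
from collections import Counter

text = "the cat sat on the mat and the cat ran"  # toy corpus
counts = Counter(text.split())
ranked = counts.most_common()
# Under Zipf's law, rank * frequency stays roughly constant across ranks
products = [rank * freq for rank, (_, freq) in enumerate(ranked, start=1)]
```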
Zipf's law tells us
Among tens of millions of word occurrences, do the rarely occurring words and the semantically empty words (e.g., determiners) actually carry meaning?
- Head words take large portion of occurrences, but they are semantically meaningless
- Ex) the, a, an, we, do, to
- Tail words take major portion of vocabulary, but they rarely occur in documents
- The rest is most representative
- To be included in the controlled vocabulary
![](https://velog.velcdn.com/images/u_u/post/df5aec79-b7e9-4d2b-9285-0bf881b8ae31/image.png)
Stopwords
- Useless words for document analysis
![](https://velog.velcdn.com/images/u_u/post/8c76a677-bca0-446c-9f71-8c849bc45220/image.png)
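Stopword removal is a simple set-membership filter; a sketch with a tiny illustrative list (real systems use larger curated lists):

```python
# Tiny illustrative stopword list, matching the head-word examples above
STOPWORDS = {"the", "a", "an", "we", "do", "to", "is", "of"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

filtered = remove_stopwords(["the", "cat", "is", "on", "a", "mat"])
# ['cat', 'on', 'mat']
```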
Normalization
Stemming
- Reduce inflected or derived words to their root form
![](https://velog.velcdn.com/images/u_u/post/32232e38-bcc7-4457-bc25-8984e17d59b9/image.png)
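A toy suffix-stripping stemmer illustrates the idea; this is a crude sketch of my own, not the Porter algorithm that production systems typically use:

```python
def crude_stem(word):
    # Strip a few common suffixes, keeping a minimum stem length
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ["cars", "talking", "talked", "car"]]
# ['car', 'talk', 'talk', 'car'] -- closes the "cars vs car" vocabulary gap
```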
Constructing a VSM representation
![](https://velog.velcdn.com/images/u_u/post/8ede9edd-9b7d-42f8-94e5-b8b661090e67/image.png)
Steps 3 and 4 must not be swapped: if stopwords are removed before the N-grams are formed, the token pairs that get grouped together will change.
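The ordering point can be sketched as a small pipeline; the function names and the tiny stopword list are illustrative assumptions:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "on"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def pipeline(text):
    tokens = tokenize(text)
    # Build N-grams BEFORE removing stopwords, so original adjacency is preserved
    grams = bigrams(tokens)
    return [g for g in grams if not all(t in STOPWORDS for t in g)]

result = pipeline("The cat is on the mat")
# [('the', 'cat'), ('cat', 'is'), ('the', 'mat')]
```

Removing stopwords first would instead pair "cat" directly with "mat", producing bigrams that never occurred in the text.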
How to assign weights?
- Important!
- Why?
- Corpus-wise : some terms carry more information about the document content
- Document-wise : not all terms are equally important
- How?
- Two basic heuristics
- TF (Term Frequency) = Within-doc-frequency
- IDF (Inverse Document Frequency)
Term frequency
- Idea : a term is more important if it occurs more frequently in a document
- TF Formulas
![](https://velog.velcdn.com/images/u_u/post/d0157354-ecb7-4fe2-9743-bbe3a8d97b78/image.png)
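Two common TF variants can be sketched as follows; raw count and the sublinear `1 + log(c)` form are standard, though the exact formulas in the slide may differ:

```python
import math

def tf_raw(term, tokens):
    # Raw within-document frequency
    return tokens.count(term)

def tf_log(term, tokens):
    # Sublinear scaling: repeated occurrences add diminishing weight
    c = tokens.count(term)
    return 1 + math.log(c) if c > 0 else 0.0

doc = ["text", "mining", "text", "text"]
a = tf_raw("text", doc)  # 3
b = tf_log("text", doc)  # 1 + ln(3)
```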
TF normalization
Above a certain frequency, additional occurrences add little meaning no matter how many there are, hence a logarithmic (sublinear) curve.
Document frequency
- Idea: a term is more discriminative if it occurs in fewer documents
![](https://velog.velcdn.com/images/u_u/post/6f46a4b8-3e78-4e30-bec4-7d43e9654378/image.png)
So take the inverse:
Inverse document frequency (IDF)
- Solution
![](https://velog.velcdn.com/images/u_u/post/20628b9d-ab0a-4a57-b6bc-b65c56dc4bf6/image.png)
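One common IDF variant, `IDF(t) = 1 + log(N / df(t))`, can be sketched as follows (the slide's exact formula may differ in the constant or log base):

```python
import math

def idf(term, corpus):
    # corpus: list of token lists; df = number of documents containing the term
    n = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(n / df) if df else 0.0

corpus = [["text", "mining"], ["data", "mining"], ["text", "retrieval"]]
# "retrieval" appears in 1 of 3 docs, "mining" in 2 of 3,
# so "retrieval" is the more discriminative term
```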
TF-IDF weighting
![](https://velog.velcdn.com/images/u_u/post/c6756fbe-d98d-4c43-b2e5-6f80d183be5c/image.png)
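Combining the two heuristics gives the TF-IDF weight; a sketch assuming raw-count TF and the `1 + log(N/df)` IDF variant above:

```python
import math

def tf_idf(term, doc, corpus):
    # w(t, d) = TF(t, d) * IDF(t)
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * (1 + math.log(len(corpus) / df)) if df else 0.0

docs = [["text", "mining", "text"], ["data", "mining"], ["web", "search"]]
w = tf_idf("text", docs[0], docs)  # 2 * (1 + ln(3/1))
```

A term scores highly only when it is both frequent within the document and rare across the corpus.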
How to define a good similarity metric?
- Euclidean distance?
![](https://velog.velcdn.com/images/u_u/post/70d2c6c9-47fd-4ea5-8cef-c635632d40bf/image.png)
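A quick sketch of why Euclidean distance is problematic here: two documents with identical term proportions but different lengths still end up far apart.

```python
import math

def euclidean(u, v):
    # Straight-line distance; sensitive to vector length, not just direction
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Same term proportions (1:2), one document ten times longer
d = euclidean([1, 2], [10, 20])  # large despite identical content mix
```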
From distance to angle
- Angle: how much two vectors overlap in direction
![](https://velog.velcdn.com/images/u_u/post/770872cf-fe59-4eef-804a-b5d319c8613e/image.png)
Cosine similarity
- Angle between two vectors
![](https://velog.velcdn.com/images/u_u/post/8758a651-9fd9-4930-82a7-0e8e0a8529cd/image.png)
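Cosine similarity is the dot product of the two vectors divided by the product of their norms; a minimal sketch:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

sim = cosine_similarity([1, 2, 0], [2, 4, 0])  # 1.0: same direction, length ignored
```

Unlike Euclidean distance, this depends only on the angle between the vectors, so document length drops out.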