Text Mining & Data Visualization
VS model in practice
- Documen & query are represented by term vectors
- Terms are not necessarily orthogonal to each other
- Synonymy : car vs automobile
- Polysemy : fly (action vs insect)
How to build such a space
- Automatic temr expansion
- Construction of thesaurus
- Clustering of words
- Word sense disambiguation
- Dictionary-based
- Relation between a pair of words should be similar as in text and dictionary's description
- Explore word usage context
- Latent Semantic Analysis (LSA)
- Assumption : there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with recpect to text generation
- It means : the observed term-document association data is contaminated by random noise
![](https://velog.velcdn.com/images/u_u/post/cb0937bd-2b41-4022-831c-65dd46c29d8d/image.png)
Basic concepts in linear algebra
![](https://velog.velcdn.com/images/u_u/post/516db966-884a-4296-8190-d22118e76372/image.png)
![](https://velog.velcdn.com/images/u_u/post/1c7c541d-3687-469f-baa3-a7522dd38aba/image.png)
LSA
- Solve LSA by SVD
![](https://velog.velcdn.com/images/u_u/post/beeb4bbd-bc15-4756-b27d-5e68d28700b1/image.png)
What are those dimensions in LSA
- Principle component analysis (PCA)
![](https://velog.velcdn.com/images/u_u/post/3b114384-6403-410a-b7e8-d108b37ee017/image.png)
v1, v2와 같은핵심 축들을 확인할 수 있다.
LSA beyond text
- Collaborative filtering
- User item matrix stores for each user the rating for the item
![](https://velog.velcdn.com/images/u_u/post/8bc885f5-aca2-4561-93ba-ba56a159adea/image.png)
uk의 유저가 i2, i4에 대해서 구매를 했는지의 여부를 모르기 때문에 i1, i3, i5등 과 같은 알 수 있는 정보를 다른 u4와 같은 유사하게 구매한 인원의 값을 보고 유추할 수 있다.
Data Visualization
Overview
- Definiiton : To represent the data (or result) for better understanding
- How can I represent what my dadat (or result) show?
- Why visualize?
- giving better, clear understanding to audiences (including you as well)
- Can get perceptual motivations!
Principles on Visualization
- Be true to your research
- Design your display to illustrate a particular point
- Maximize information, minimize ink
- Use the simplest possible representation for the bits you want to convey
- Organize hierarchically
- What should a viewer see first? what if they look deeper?
인위적인 조작 이외에 결과에 대해서 검증을 해라. 그렇지 않다면 의도하지 않는 조작이 될 수 있다.
청중에게 보여줄 가장 핵심적인 것 몇개만 나열하라.
가장 우선적인 것을 보여주고 서브를 포함시켜라
Data Types & Aesthetics
- Data Types
- Nominal-values ex) color, profession
- Ordinal-values ex) military or academic rank
- Numeric-real numbers ex) integer or real numbers
- Aesthetics
- Shape
- Color
- Size
- Position (Axis)
![](https://velog.velcdn.com/images/u_u/post/43fd50ae-aa11-48fe-972b-3a8339f38789/image.png)
How to Choose Colors?
- Qualitative, Sequential, Diverging colors
- 가장 두드러지게 색이 구분, 같은 색과 명도의 차이, 양 끝 단의 차이
![](https://velog.velcdn.com/images/u_u/post/759a1c0d-1878-42e5-8cf6-439a60c61a8c/image.png)
![](https://velog.velcdn.com/images/u_u/post/d4799435-b842-451f-b331-04eec08edebd/image.png)
![](https://velog.velcdn.com/images/u_u/post/32b5eb2e-5b8b-4deb-bfeb-6bf6c42100a0/image.png)
Popular Plots for Visualization
![](https://velog.velcdn.com/images/u_u/post/94d5d08a-ee4a-45fb-a92b-b1fa0aab74c4/image.png)
Histograms
- One dimentional
- Shows shape by binning a continuous distribution
- Grouped histogram can indicate multiple discrete categories -> Good for comparison
![](https://velog.velcdn.com/images/u_u/post/524f1606-5b1e-407a-b42a-ada8a0914bf4/image.png)
하나의 데이터 종류에 대한 분포를 보기 위해 주로 사용한다.
Pie Chart
- A whole split into parts
- Emphasizes that all parts sum to a constant
- Single dimension with discrete categories
![](https://velog.velcdn.com/images/u_u/post/ee5e0f7b-d138-481a-b5f2-3bff7d68ed14/image.png)
전체 대비 제일 많은 것, 적은 것 상대적인 비율을 보기 위해 주로 사용한다.
Stacked Bar Plot
- Whole split into parts
- Easy to compare
- Often beter than pie chart
- Can have multiple discrete dimensions
![](https://velog.velcdn.com/images/u_u/post/b289a32f-3a5a-4888-86ce-eb56941923c8/image.png)
비율이 시간이 지남에 따라 어떻게 변하는지 보기 위해 주로 사용한다.
Venn diagram
- Shows overlpa between discrete groups
- Sometimes the only way to display overlapping sets
![](https://velog.velcdn.com/images/u_u/post/8c5a102f-ea22-4393-8940-da4a376f0ebb/image.png)
Scatter Plots
- Relationship between observations on two continous dimentions
![](https://velog.velcdn.com/images/u_u/post/c84e3f2a-473e-479a-a256-155741bbbbc6/image.png)
Line Plot
- Good for showing one variable (ex, time) as continuous even though you have discrete measures
- Can compare several discrete groups
![](https://velog.velcdn.com/images/u_u/post/2e520201-e564-4634-9ae1-8385b17e2f24/image.png)
Bar Plots
- Can be used for lots of discrete grouping factors
- Natural semantics of grouping
- Conceals data distribution (Less information)
![](https://velog.velcdn.com/images/u_u/post/2215174f-6ceb-4892-acf0-fb3564483652/image.png)
(CF) Issues on bar chart
- Even same data sets, It's different to recognize information as shape.
ex) 첫번째, 나이대에 따라 인컴이 더 잘 들어오는구나. 두번째, Asian이 돈을 많이 버는구나.
![](https://velog.velcdn.com/images/u_u/post/fa3024cf-a910-4433-8349-07fa9ef1260a/image.png)
가장 전달하고 싶은 메세지를 선정하여 잘 보일 수 있게 해야한다.
Heatmap
- Works very well when there are natural semantics or comparing intensity of data
- Color mapping can be problematic
- Usually, sequential or diverging colors are good
![](https://velog.velcdn.com/images/u_u/post/d004a136-c521-4b5f-b1c0-b8995c693a0f/image.png)
Bubble Plot
- Extension of scatter plot (+ one dimension)
- Size is not perfectly quantitative
![](https://velog.velcdn.com/images/u_u/post/a526c73e-8b69-4c59-82b9-3fb16a5850b2/image.png)
Visualizing Distributions
- The first step to observe "how data is shaped" without any manipulation
- How?
- Using error bars
- Histogram after binning -> kernel density estimation
- Bos, violin, strip, sina, ridgeline
- CDF & QQ-plots
Using Error Bars for Distribution
- Mean + standard deviation
- Can be used for lots of discrete grouping factors
- Conceals data distribution (Less information)
![](https://velog.velcdn.com/images/u_u/post/d03bb0a4-0695-4508-8144-d0ce16fae28c/image.png)
Histogram for Distribution
- Binning is essential process
- Good to see distribution of one variable
- Different binning may result in different messages
![](https://velog.velcdn.com/images/u_u/post/26ee9a88-888f-4bd4-be06-fbb3623804b5/image.png)
Histogram to Kernel Density Estimation
- Discrete variable -> Continuous trends!
- 함수로 그래프를 그리기 때문에, 실제 범위가 아닌 부분을 cut을 해주어야 한다. (ex, 나이는 0부터)
![](https://velog.velcdn.com/images/u_u/post/eaeb2076-fa62-4c59-b009-6c97d9803ba6/image.png)
수식으로 변환이 된다는 것은 다른 데이터에 적용할 수도 있다는 뜻. 수학적 모델 제시
Comparing Distributions
![](https://velog.velcdn.com/images/u_u/post/cc1b1b50-0841-41b4-9902-4aef699a437c/image.png)
How about 3 or more distributions?
![](https://velog.velcdn.com/images/u_u/post/31da6226-dd92-4d87-80d2-f01c30773bd1/image.png)
범위에서 한개의 컬럼을 가지고 min, quater, max의 값의 위치를 박스 형태로 구성
Violin plot is a super set of boxplot
![](https://velog.velcdn.com/images/u_u/post/3e67f490-891c-445a-a025-12db145a8cbd/image.png)
데이터가 연결되어야하기 때문에 값이 없는 부분이 있더라도 연결된다.
Strip or Sina plot May Be Better
- Strip : Using jitter to represent density
- Sina : Violin + Strip
![](https://velog.velcdn.com/images/u_u/post/c6750d9b-e9ea-4156-9a17-44f4ac4f0948/image.png)
If You Have More Distributions...
- Ridgeline Plots can be a solution
-WARNING : you should be enough number of data
![](https://velog.velcdn.com/images/u_u/post/7b872040-8fb1-474f-bf82-ef838528b344/image.png)
많은 Distribution을 보려고 할 때는 값이 중요한게 아닌 추세, 트렌드를 보려고 하는 것이다.
For raw (empirical) data, CDF is usually best
![](https://velog.velcdn.com/images/u_u/post/bce9b7ca-3db8-4fa0-a970-0b5a1088c898/image.png)
ex) 80점 70점 사이의 학생들의 비율을 알고 싶으면 80점 값에 70점 값을 빼면 비율이 나온다. (중첩된 값이기 때문에)
Empirical data with Theoretical fits
- Quantile-Quantile (QQ) plots
![](https://velog.velcdn.com/images/u_u/post/46cd7c68-7af1-4a37-a3b1-fecc639b9607/image.png)
Summary and conclusions
- Mapping data to a visual representation
- What dimensions matter
- What Aesthetics elements (color, shape, position, size...) will map to those dimensions
- Three principles
- Be true to your research
- Maximize information, minimize link
- Organize hierarchically