: to prepare raw data in a suitable format for further analysis
1) Data Collection & Profiling - import data
2) Data check - decide which data should be considered in order to form a hypothesis
3) Data cleansing - removing unnecessary fields, filling missing fields (a short pandas sketch follows this list)
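A minimal pandas sketch of steps 1) and 3), assuming a Kaggle-style Titanic file named train.csv with columns such as 'Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Ticket', and 'Cabin' (the file name and exact column list are assumptions):

```python
import pandas as pd

# Assumption: a Kaggle-style Titanic CSV named 'train.csv' with columns such as
# 'Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Ticket', 'Cabin'.

# 1) Data collection & profiling - import the data and inspect it
df = pd.read_csv('train.csv')
df.info()               # column types and non-null counts
print(df.describe())    # basic statistics for the numeric columns

# 3) Data cleansing - remove unnecessary fields, fill missing fields
df = df.drop(columns=['Ticket', 'Cabin'])          # fields not used in this analysis
df['Age'] = df['Age'].fillna(df['Age'].median())   # fill missing ages with the median
```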
hypothesis:
The rich passengers had a higher survival rate on the RMS Titanic.
To validate the hypothesis using sample data:
=> Visualize the data -> create a table & chart showing the correlation between fields
=> check the 'Pclass' and 'Fare' fields to validate the hypothesis (sketched below)
(If the hypothesis is not correct, we should consider what other factors affected survival on the RMS Titanic.)
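A sketch of one such table and chart with pandas and matplotlib, under the same train.csv assumption:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('train.csv')   # hypothetical file name

# Table: mean survival rate per passenger class
survival_by_class = df.groupby('Pclass')['Survived'].mean()
print(survival_by_class)

# Chart: bar plot of the survival rate by class
survival_by_class.plot(kind='bar')
plt.xlabel('Pclass (1 = first class)')
plt.ylabel('Survival rate')
plt.title('Survival rate by passenger class')
plt.show()
```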
@Note: keep only the reference field and remove the other fields to find the most influential factor. The reference field here is the survival rate ('Survived').
@Note: When assessing correlation, the higher the absolute value, the stronger the effect on the result (see the sketch after this note).
Positive value - as one factor increases, the other also increases
Negative value - as one factor increases, the other decreases
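One way to compute this with pandas, keeping 'Survived' as the reference field; mapping 'Sex' to numbers is an assumed step so that it can appear in the correlation:

```python
import pandas as pd

df = pd.read_csv('train.csv')   # hypothetical file name

# 'Sex' is a text field; map it to numbers so it can be correlated (assumed encoding)
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Correlation of every numeric field against the reference field 'Survived'
corr = df.corr(numeric_only=True)['Survived'].drop('Survived')

# Sort by absolute value: the larger it is, the stronger the relationship
print(corr.reindex(corr.abs().sort_values(ascending=False).index))
```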
When analyzing the data, it becomes evident that 'Sex' is the most influential factor.
Based on the provided data, women had a clearly higher survival rate than men. It also shows that the higher the fare paid for a seat, the greater the chance of survival; however, these factors are related to the survival rate less strongly than gender is.
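A quick groupby check of those two observations (same assumed file and column names):

```python
import pandas as pd

df = pd.read_csv('train.csv')   # hypothetical file name

# Mean survival rate per gender: women survived at a much higher rate than men
print(df.groupby('Sex')['Survived'].mean())

# Mean fare of survivors vs. non-survivors: survivors paid more on average
print(df.groupby('Survived')['Fare'].mean())
```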
! Hypothesis:
The rich passengers had a higher survival rate on the RMS Titanic.
(the higher the fare, the lower the Pclass number: Pclass 1 is first class - a quick check follows below)
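The price/class relationship can be checked directly (a sketch, same train.csv assumption):

```python
import pandas as pd

df = pd.read_csv('train.csv')   # hypothetical file name

# Average fare per class: Pclass 1 (first class) should show the highest fares
print(df.groupby('Pclass')['Fare'].mean())
```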
-can quickly create and manage data structures
-enables analysis of complicated data sets
-used for data analysis and data visualization
-some Python libraries for this: pandas, matplotlib
@Note:
Pandas - a library for creating and handling tabular data (DataFrames): loading, cleaning, and analyzing it
Matplotlib - a library for visualizing data as charts and plots (a minimal usage sketch follows)
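A minimal example of the two libraries together, using made-up numbers purely for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Pandas: quickly create and manage a tabular data structure (a DataFrame)
df = pd.DataFrame({
    'Sex': ['female', 'male'],
    'SurvivalRate': [0.7, 0.2],   # illustrative values only, not real results
})

# Matplotlib: visualize the data as a simple bar chart
plt.bar(df['Sex'], df['SurvivalRate'])
plt.ylabel('Survival rate')
plt.title('Illustrative survival rate by gender')
plt.show()
```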
Today I learned:
-The concept of data preprocessing
-The process of representing data visually
-The basics of Python for data analysis