Feature = Column or a Dimension of a DataFrame
Feature Engineering = Combining/Restructuring the existing datasets to create a new feature
Important Steps:
- dtypes/df.info()
  - Checks each column's data type and other basic information (e.g. non-null counts)
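A minimal sketch of this first check, assuming a small made-up DataFrame (the column names and values are illustrative only, not from a real dataset):

```python
import pandas as pd

# Hypothetical sample data; 'income' is stored as text on purpose
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [34, 51, 29],
    "income": ["25,970", "82,524", "41,300"],
})

print(df.dtypes)  # data type of each column; 'income' shows a text dtype, not a number
df.info()         # dtypes plus non-null counts and memory usage
```

Spotting a text dtype on a column that should be numeric is usually the first hint that a conversion step is needed.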
- Missing Values
  - NaN/None, Null, NA etc.
  - Ways to Deal with Missing Values:
    - isnull() - returns True/False Booleans marking missing values
    - notnull() - the opposite of isnull()
    - dropna() - drops rows (or columns) that contain missing values
    - fillna() - replaces missing values with another value (ex. 0, mean, mode, max etc.)
    - isnull().sum() - sum() chained after isnull() counts the missing values in each column
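The missing-value methods above can be sketched as follows, assuming a made-up DataFrame (the `score` and `group` columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing values in 'score', one in 'group'
df = pd.DataFrame({
    "score": [90.0, np.nan, 70.0, np.nan],
    "group": ["a", "b", None, "b"],
})

missing_mask = df.isnull()           # True where a value is missing
missing_counts = df.isnull().sum()   # number of missing values per column
dropped = df.dropna()                # keeps only rows with no missing values
filled = df["score"].fillna(df["score"].mean())  # replaces NaN with the column mean

print(missing_counts)
print(filled)
```

Whether to drop or fill depends on the data; filling with the mean is only one of the options listed above.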
- Strings -> Numerics:
  - 25,970 + 82,524 should equal 108,494, but Python will read it as 25,97082,524 because both values are strings, so + concatenates them
  - Ways to Convert into Numerical Data
    1. string replace - string_variable.replace("delete", "") (replaces the matched text with an empty string, i.e. removes it)
    2. type-casting -- uses built-in functions:
      - int() = returns integers
      - str() = returns strings
      - float() = returns floats
    3. As Functions
      - Make your own function to convert data type
      - Then Apply
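def toInt(string):
    # Remove the thousands separators, then cast to int
    return int(string.replace(',', ''))

# apply() returns a new Series, so assign it back to keep the result
df['income'] = df['income'].apply(toInt)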
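To see the string pitfall and the fix side by side without a DataFrame, here is a small sketch in plain Python. The helper name `to_int` is an assumption made for this example; it does the same job as the function above.

```python
a = "25,970"
b = "82,524"

concatenated = a + b  # both are strings, so + joins them end to end

def to_int(text):
    """Strip thousands separators, then cast to int."""
    return int(text.replace(",", ""))

total = to_int(a) + to_int(b)  # both are ints now, so + adds them

print(concatenated)  # 25,97082,524
print(total)         # 108494
```

Calling int() directly on "25,970" raises a ValueError because of the comma, which is why the replace step comes first.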