NaN is a special value that is part of the IEEE floating-point specification.
NaN is not available for all data types.
Pandas marks missing data with NaN or None, depending on the type of the data.
Recent Pandas versions also provide the pd.NA value.
None is a Python object, which means that any array containing None must have dtype=object.
# In[1]
import numpy as np
import pandas as pd

vals1 = np.array([1, None, 2, 3])
vals1
# Out[1]
array([1, None, 2, 3], dtype=object)
dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
Because Python does not support arithmetic with None, aggregations like sum or min will generally lead to an error.
For this reason, Pandas does not use None as a sentinel in its numerical arrays.
NaN is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
# In[2]
vals2 = np.array([1, np.nan, 3, 4])
vals2
# Out[2]
array([ 1., nan, 3., 4.])
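To make the contrast between the two sentinels concrete, here is a minimal sketch. The names obj_arr and nan_arr are illustrative, not from the original notes. It shows that summing an object array containing None fails, while an array containing NaN keeps a native floating-point dtype.

```python
import numpy as np

# Illustrative arrays: one with None (object dtype), one with NaN (float dtype).
obj_arr = np.array([1, None, 2, 3])
nan_arr = np.array([1, np.nan, 3, 4])

# Python cannot add an int and None, so the aggregation raises TypeError.
try:
    obj_arr.sum()
    sum_failed = False
except TypeError as exc:
    sum_failed = True
    print("sum on object array failed:", exc)

print(obj_arr.dtype)  # object
print(nan_arr.dtype)  # float64
```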
NaN is a bit like a data virus; it infects any other object it touches.
Regardless of the operation, the result of arithmetic with NaN will be another NaN.
# In[3]
print(1 + np.nan)
print(0 * np.nan)
# Out[3]
nan
nan
# In[4]
vals2.sum(), vals2.min(), vals2.max()
# Out[4]
(nan, nan, nan)
NumPy provides NaN-aware versions of aggregations that will ignore these missing values.
# In[5]
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
# Out[5]
(8.0, 1.0, 4.0)
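The NaN-aware aggregations can be thought of as masking out the missing entries first. The sketch below does that by hand with np.isnan, using illustrative variable names. It also shows why an equality test cannot be used to find NaN: NaN never compares equal to itself.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# NaN is not equal to anything, including itself, so `==` cannot detect it.
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)  # False

# np.isnan builds a Boolean mask of the missing entries.
mask = np.isnan(vals)
print(mask)

# Summing only the non-missing entries matches np.nansum.
manual_sum = vals[~mask].sum()
print(manual_sum, np.nansum(vals))
```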
# In[6]
pd.Series([1, np.nan, 2, None])
# Out[6]
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
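As the output above suggests, Pandas treats None and NaN interchangeably as missing values in a numeric Series. A small sketch, assuming only the public pd.isna function, checks each sentinel directly:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None])

# None was converted to NaN, so the Series has a float dtype.
print(s.dtype)

# pd.isna recognizes all three missing-value markers as scalars.
none_is_na = pd.isna(None)
nan_is_na = pd.isna(np.nan)
pdna_is_na = pd.isna(pd.NA)
print(none_is_na, nan_is_na, pdna_is_na)

# Element-wise mask over the Series.
print(s.isna().tolist())
```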
If we set a value in an integer array to np.nan, it will automatically be upcast to a floating-point type to accommodate the NA.
# In[7]
x = pd.Series(range(2), dtype=int)
x
# Out[7]
0 0
1 1
dtype: int64
# In[8]
x[0] = None
x
# Out[8]
0 NaN
1 1.0
dtype: float64
Pandas handling of NAs by type
| Typeclass | Conversion when storing NAs | NA sentinel value |
|---|---|---|
| floating | No change | np.nan |
| object | No change | None or np.nan |
| integer | Cast to float64 | np.nan |
| boolean | Cast to object | None or np.nan |
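The rows of the table can be checked roughly by building small Series that contain a missing value. This is a sketch under one assumption: dtype inference at construction time is used as a stand-in for the conversion the table describes when an NA is stored later. The variable names are illustrative.

```python
import pandas as pd

# Each Series mixes one typeclass with a missing value.
float_with_na = pd.Series([1.5, None])
int_with_na = pd.Series([1, None])
bool_with_na = pd.Series([True, None])

for label, series in [("floating", float_with_na),
                      ("integer", int_with_na),
                      ("boolean", bool_with_na)]:
    print(label, "->", series.dtype)
```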
# In[9]
pd.Series([1, np.nan, 2, None, pd.NA], dtype='Int32')
# Out[9]
0 1
1 <NA>
2 2
3 <NA>
4 <NA>
dtype: Int32
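To see what the nullable dtype buys, the sketch below builds the same values twice, once with the default inference and once with the nullable 'Int64' dtype. 'Int64' is chosen here only for illustration; the cell above uses 'Int32'. The nullable version keeps integer values and still skips missing entries in aggregations.

```python
import numpy as np
import pandas as pd

values = [1, np.nan, 2, None]

plain = pd.Series(values)                    # upcast to float64
nullable = pd.Series(values, dtype="Int64")  # stays integer, uses <NA>

print(plain.dtype, nullable.dtype)
print(nullable.isna().tolist())

# Aggregations skip missing values by default.
total = int(nullable.sum())
print(total)
```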
isnull: Generates a Boolean mask indicating missing values.
notnull: Opposite of isnull.
dropna: Returns a filtered version of the data.
fillna: Returns a copy of the data with missing values filled or imputed.
# In[10]
data = pd.Series([1, np.nan, 'hello', None])
# In[11]
data.isnull()
# Out[11]
0 False
1 True
2 False
3 True
dtype: bool
# In[12]
data[data.notnull()]
# Out[12]
0 1
2 hello
dtype: object
The isnull and notnull methods produce similar Boolean results for DataFrame objects.
# In[13]
data.dropna()
# Out[13]
0 1
2 hello
dtype: object
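Since isnull also works on DataFrame objects, a common use is to count missing values per column by summing the Boolean mask. A minimal sketch, with an illustrative frame that is not part of the original cells:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, np.nan, 2],
                      [2, 3, 5],
                      [np.nan, 4, 6]])

# Boolean mask with the same shape as the DataFrame.
mask = frame.isnull()
print(mask)

# True counts as 1, so summing gives the number of missing values per column.
missing_per_column = mask.sum()
print(missing_per_column)

total_missing = int(missing_per_column.sum())
print(total_missing)
```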
# In[14]
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df
# Out[14]
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
By default, dropna will drop all rows in which any null value is present.
# In[15]
df.dropna()
# Out[15]
0 1 2
1 2.0 3.0 5
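Alternatively, you can drop NA values along a different axis: axis=1 or axis='columns' drops all columns containing a null value.
# In[16]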
df.dropna(axis=1)
# Out[16]
2
0 2
1 5
2 6
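The two spellings of the axis argument are interchangeable, and dropna returns a new object rather than modifying the original. The following sketch, using an illustrative frame, checks both points:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, np.nan, 2],
                      [2, 3, 5],
                      [np.nan, 4, 6]])

by_number = frame.dropna(axis=1)
by_name = frame.dropna(axis="columns")

# Both calls keep only column 2, the single column without nulls.
print(by_name)
print(by_number.equals(by_name))

# The original frame still has all three columns.
print(frame.shape)
```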
Dropping every row or column that contains a null value also discards good data. You can control this behavior through the how or thresh parameters.
The default is how='any', such that any row or column containing a null value will be dropped.
You can also specify how='all', which will only drop rows/columns in which every value is null.
# In[17]
df[3] = np.nan
df
# Out[17]
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
# In[18]
df.dropna(axis=1, how='all')
# Out[18]
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
The thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept.
# In[19]
df.dropna(axis=0, thresh=3)
# Out[19]
0 1 2 3
1 2.0 3.0 5 NaN
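The same threshold logic applies along columns. As a sketch with an illustrative frame matching the one above, thresh=2 with axis=1 keeps every column that has at least two non-null values, which removes only the all-null column:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, np.nan, 2, np.nan],
                      [2, 3, 5, np.nan],
                      [np.nan, 4, 6, np.nan]])

# Rows need at least 3 non-null values to survive: only row 1 qualifies.
rows_kept = frame.dropna(axis=0, thresh=3)
print(rows_kept)

# Columns need at least 2 non-null values: column 3 (all null) is dropped.
cols_kept = frame.dropna(axis=1, thresh=2)
print(cols_kept)
```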
# In[20]
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'), dtype='Int32')
data
# Out[20]
a 1
b <NA>
c 2
d <NA>
e 3
dtype: Int32
# In[21]
data.fillna(0)
# Out[21]
a 1
b 0
c 2
d 0
e 3
dtype: Int32
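Filling with a constant such as 0 is the simplest option. Another common choice is to impute with a statistic of the observed values. The sketch below uses the mean; this is an illustrative choice rather than something the notes prescribe, and it uses a plain float Series so the mean can be stored without conversion.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))

# mean() skips missing values by default: (1 + 2 + 3) / 3 = 2.0
mean_value = series.mean()

# fillna returns a new Series; the original is left unchanged.
filled = series.fillna(mean_value)
print(filled)
print(series.isna().sum())
```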
# In[22]
data.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent Pandas
# Out[22]
a 1
b 1
c 2
d 2
e 3
dtype: Int32
# In[23]
data.bfill()  # backward fill; fillna(method='bfill') is deprecated in recent Pandas
# Out[23]
a 1
b 2
c 2
d 3
e 3
dtype: Int32
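Forward and backward fills propagate a value across any number of consecutive gaps. When that is too aggressive, the limit parameter of ffill and bfill caps how many consecutive missing values are filled. A small sketch with an illustrative Series:

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1, np.nan, np.nan, 4], index=list("abcd"))

# Fill at most one consecutive missing value in each direction.
forward = gaps.ffill(limit=1)   # 'b' is filled from 'a'; 'c' stays missing
backward = gaps.bfill(limit=1)  # 'c' is filled from 'd'; 'b' stays missing

print(forward)
print(backward)
```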
For DataFrames, the options are similar, but we can also specify an axis along which the fills should take place.
# In[24]
df
# Out[24]
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
# In[25]
df.ffill(axis=1)  # fillna(method='ffill', axis=1) is deprecated in recent Pandas
# Out[25]
0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0
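Note that a forward fill leaves a missing value in place when there is no earlier value to copy, as in row 2, column 0 above. One possible way to handle this, sketched here with an illustrative all-float frame, is to follow the forward fill with a backward fill along the same axis:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, np.nan, 2, np.nan],
                      [2, 3, 5, np.nan],
                      [np.nan, 4, 6, np.nan]], dtype=float)

# Forward fill along each row, then backward fill what is still missing.
forward = frame.ffill(axis=1)
both = forward.bfill(axis=1)

print(forward)
print(both)
```

Whether copying a later value backward is appropriate depends on the data, so this is a design choice rather than a default.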