pandas 101

River·2021년 4월 5일

Data Structures of Pandas

DataFrame operations

Getting Basic Attributes

Column Selection, Addition, Deletion

Arithmetic Operations

Working with Missing/Duplicate Data

Visualization

Data Structures of Pandas

Series : 1-D labeled array with index (0 ~ len-1 in default)
DataFrame : 2-D labeled data structure with index(rows)/columns

DataFrame operations

Only includes basic operation methods. Refer to Pandas User Guide for more complex methods and explanation.

import numpy as np
import pandas as pd

Getting Basic Attributes

df.head(n)

-> shows first n rows (default = 5)

df.tail(n)

-> shows last n rows (default = 5)

df.shape()

-> gives the axis dimensions

df.info()

df.describe()

-> summarizes descriptive stats

df.columns
df.index
-> returns array of column/label names

df.to_numpy()

-> converts into np ndarray. This works only when all the columns have same data type

df.copy()

df.empty

-> checks if DF is empty

Column Selection, Addition, Deletion

Selection

df["col_name"]

df.loc["label_name"]

-> selecting on multi-axis by label is also possible
ex) df.loc["20130102":"20130104", ["A", "B"]]

df.iloc[loc]

-> selects with integer position locations

df[5:10]

df.at[dates[0], "A"]
-> gets fast access to scalar value

df[bool_vec]

-> selects where value of bool_vec is true

df1.loc[lambda df: df['A'] > 0, :]
-> selects rows where value at column 'A' is greater than 0

df1.loc[:, lambda df: ['A', 'B']]
-> selects columns 'A' and 'B'

df.sample(n=n,axis=1)

-> random selection of column (selection of row if axis = 0)

df.where(lambda x: x>4)

df.mask(df >= 0)

-> inverse boolean operation of where selection

df.query('(a < b) & (b < c)')

Addition

df["new_col_name"] = values

df.insert(col_loc,"column",values)

df.assign(col_name=df["col1"] + df["col2"])
-> assign() method creates a new column derived from existing columns

Deletion

del df["col_name"]

col = df.pop("col_name")

Arithmetic Operations

df + df2

df - df.iloc[1]

df * 5 + 2

df & df2
-> bit operation

df.T

-> transpose

np.exp(df)

df.add(df2, fill_value=0)
-> fill_value option treats NaN as 0 when doing operation

Working with Missing/Duplicate Data

df.duplicated('col', keep='first'/'last'/False)
-> returns bool vec for rows for duplicates except for the (keep) occurrence

df.drop_duplicates('col', keep='last')
-> drops duplicate rows

df.duplicated('col').sum()
-> checks if there is any duplicate

df.isnull()

-> checks for NaN values returning bool vec

df.isnull().sum()
-> counts NaN values of each column

df.isnull().sum(1)
-> counts NaN values of each row

df.isnotnull()

-> checks real values returning bool vec

df['col'].fillna(value=0)
-> fills NaN values with 0

df.dropna(how="any")
-> drops any rows that have missing data

df.apply(function)

-> applies function to the data
ex) df.apply(lambda x: x.max() - x.min())

s.value_counts()

-> counts the number of discrete values when s is Series

Visualization

import matplotlib.pyplot as plt
plt.figure(num=fig_num, figsize=(x,y))

df.plot(x='A',y='B')

df.plot.bar()
df.plot.barh(stacked=True)

df.plot.hist(stacked=True, bins=20, orientation='horizontal', cumulative=True)

df.plot.box(color=color, vert=False, positions=[1,3,5])
df.boxplot(by='x')

df.plot.area()

df.plot.scatter()
scatter_matrix(df, alpha=0.2, figsize=(6,6), diagonal='kde')

df.plot.hexbin()

df.plot.pie()

df.plot.kde()
-> density plot

River

다음 포스트

Probability Theory 101

0개의 댓글

관련 채용 정보

에고이즘

[미뇽맨션] 인형 브랜드 앱 개발자 (신입 3년이하)

귀여운 인형들이 가득한 미뇽맨션에서 앱 개발자로 새로운 브랜드 경험을 함께 만들어보세요! 다양한 기술적 도전에 참가하며 기획 초기부터 성장을 도모할 수 있는 기회를 제공합니다.

피클플러스

프론트엔드 주니어 개발자

피클플러스는 OTT 시장에서 빠르게 성장하며 '글로벌 OTT 슈퍼앱' 비전을 추구하는 스타트업입니다. TypeScript, React, Next.js를 활용한 프론트엔드 개발을 통해 뛰어난 동료들과 함께할 기회를 제공합니다.

펫프렌즈

백엔드 개발자 (Post-Order 파트)

펫프렌즈는 반려동물 생애 전반에 걸친 서비스와 심쿵배송으로 국내 1위 펫커머스 기업입니다. 백엔드 개발자로 API 설계 및 결제 시스템 운영에 참여하며, 동료들과 자율적으로 성장할 수 있는 환경을 제공합니다.