How to Use Matplotlib (Descriptive Statistics)

김신영 · June 2, 2024

Box Plot bar chart histogram matplotlib pie chart plot scatter plot

Big Data Analysis Engineer (빅데이터분석기사) practical exam


Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

Matplotlib makes easy things easy and hard things possible.

Matplotlib — Visualization with Python

Install

pip install matplotlib

Korean font and style settings for matplotlib

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['font.family'] = 'AppleGothic'  # on Windows, use 'Malgun Gothic'
mpl.rcParams['axes.unicode_minus'] = False
plt.style.use('_mpl-gallery')

Plot types

https://matplotlib.org/stable/plot_types/index

Plot

https://matplotlib.org/stable/plot_types/basic/plot.html#sphx-glr-plot-types-basic-plot-py


plt.plot(x, y, fmt, …)

From the docs:

Signature:
plt.plot(
    *args: 'float | ArrayLike | str',
    scalex: 'bool' = True,
    scaley: 'bool' = True,
    data=None,
    **kwargs,
) -> 'list[Line2D]'
Docstring:
Plot y versus x as lines and/or markers.

Plots y against x.

fmt options

Markers

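Character	Description
'.'	point marker
','	pixel marker
'o'	circle marker
'v'	triangle-down marker
'^'	triangle-up marker
'<'	triangle-left marker
'>'	triangle-right marker
'1'	tri-down marker (three-pronged, pointing down)
'2'	tri-up marker (three-pronged, pointing up)
'3'	tri-left marker (three-pronged, pointing left)
'4'	tri-right marker (three-pronged, pointing right)
'8'	octagon marker
's'	square marker
'p'	pentagon marker
'P'	plus (filled) marker
'*'	star marker
'h'	hexagon1 marker
'H'	hexagon2 marker
'+'	plus marker
'x'	x marker
'X'	x (filled) marker
'D'	diamond marker
'd'	thin diamond marker
'|'	vertical line marker
'_'	horizontal line marker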

Line Styles

Character	Description
'-'	solid line style
'--'	dashed line style
'-.'	dash-dot line style
':'	dotted line style

Colors

Character	Color
'b'	blue
'g'	green
'r'	red
'c'	cyan
'm'	magenta
'y'	yellow
'k'	black
'w'	white

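Code example

import numpy as np
import matplotlib.pyplot as plt

# make data
x = np.linspace(-5, 5, 20)
y1 = x ** 3
y2 = 5 * x + 30
y3 = 4 * (x ** 2) - 20
y4 = -25 * x + 20

# plot: each fmt string combines a line style and a color
plt.plot(x, y1, '--g')  # dashed, green
plt.plot(x, y2, ':b')   # dotted, blue
plt.plot(x, y3, '-.r')  # dash-dot, red
plt.plot(x, y4)         # default style

# optional axis limits and ticks
# plt.xlim(-5, 5)
# plt.ylim(-10, 100)
# plt.xticks(np.arange(-5, 6))
# plt.yticks(np.arange(-100, 121, 20))
plt.grid(True)  # plt.grid() with no arguments toggles the grid instead
plt.show()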

plt.xlim(), plt.ylim()

Sets the minimum and maximum values shown on the x and y axes.
Called with no arguments, it acts as a getter.
Called with arguments, it acts as a setter.

left, right = xlim()  # return the current xlim
xlim((left, right))   # set the xlim to left, right
xlim(left, right)     # set the xlim to left, right

xlim(right=3)  # adjust the right leaving left unchanged
xlim(left=1)  # adjust the left leaving right unchanged
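The getter/setter behavior described above can be sketched as follows. The data values are made up for illustration, and the Agg backend is selected only so that the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])  # illustrative data

plt.xlim(-1, 5)            # setter: both limits at once
left, right = plt.xlim()   # getter: returns the current limits

plt.ylim(top=20)           # keyword form: change only the upper limit
bottom, top = plt.ylim()

print(left, right, top)
```

The keyword form is handy when only one side of the axis needs to be pinned.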

plt.xticks(), plt.yticks()

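Sets the tick positions (and optionally the tick labels) on the x and y axes.
Called with no arguments, it acts as a getter.
Called with arguments, it acts as a setter.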

locs, labels = xticks()  # Get the current locations and labels.

xticks(np.arange(0, 1, step=0.2))  # Set label locations.
xticks(np.arange(3), ['Tom', 'Dick', 'Sue'])  # Set text labels.
xticks([0, 1, 2], ['January', 'February', 'March'], rotation=20)  # Set text labels and properties.
xticks([])  # Disable xticks.
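As a rough illustration of setting and then reading back ticks, here is a small sketch. The month labels and data are invented for the example, and the Agg backend is an assumption so that it runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(3), [10, 20, 15])  # illustrative data

# setter: tick positions plus text labels, rotated by 20 degrees
plt.xticks(np.arange(3), ["Jan", "Feb", "Mar"], rotation=20)

# getter: returns the tick locations and the Text objects used as labels
locs, labels = plt.xticks()
tick_texts = [label.get_text() for label in labels]
print(list(locs), tick_texts)
```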

plt.xlabel(xlabel), plt.ylabel(ylabel)

Sets the labels of the x and y axes.

plt.title(title)

Sets the title of the plot.

plt.legend(loc=None)

loc options

Location String	Location Code
'best' (Axes only)	0
'upper right'	1
'upper left'	2
'lower left'	3
'lower right'	4
'right'	5
'center left'	6
'center right'	7
'lower center'	8
'upper center'	9
'center'	10
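Axis labels, a title, and a legend usually appear together, so here is a minimal sketch combining them. The curves and label strings are illustrative only, and the Agg backend is an assumption for headless runs. The legend shows whatever was passed as label= to each plot call.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-5, 5, 20)

fig, ax = plt.subplots()
plt.plot(x, x ** 2, "-.r", label="y = x^2")  # label= is the legend entry
plt.plot(x, 5 * x, ":b", label="y = 5x")

plt.xlabel("x")
plt.ylabel("y")
plt.title("Two curves")
legend = plt.legend(loc="upper left")  # same as loc=2

legend_texts = [t.get_text() for t in legend.get_texts()]
print(legend_texts)
```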

plt.figure(num = None)

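Creates a new figure (plot window) and makes it the current figure.
If a figure with the given num already exists, that figure is made current instead of creating a new one.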

From the docs:

Signature:
plt.figure(
    num: 'int | str | Figure | SubFigure | None' = None,
    figsize: 'tuple[float, float] | None' = None,
    dpi: 'float | None' = None,
    *,
    facecolor: 'ColorType | None' = None,
    edgecolor: 'ColorType | None' = None,
    frameon: 'bool' = True,
    FigureClass: 'type[Figure]' = <class 'matplotlib.figure.Figure'>,
    clear: 'bool' = False,
    **kwargs,
) -> 'Figure'
Docstring:
Create a new figure, or activate an existing figure.

plt.clf()

Clears the contents of the current figure. The figure itself remains, but empty.

Code example

import numpy as np
import matplotlib.pyplot as plt

# make data
x = np.linspace(-5, 5, 20)
y1 = x ** 3
y2 = 5 * x + 30
y3 = 4 * (x ** 2) - 20
y4 = -25 * x + 20

plt.figure("y=x^3")
plt.plot(x, y1, '--g')
plt.grid(True)

plt.figure("y=5x+30")
plt.plot(x, y2, ':b')
plt.grid(True)

plt.clf()  # the (x, y2) plot is cleared; its figure stays open but empty

plt.figure("y=4x^2-20")
plt.plot(x, y3, '-.r')
plt.grid(True)

plt.figure("y=-25x+20")
plt.plot(x, y4)
plt.grid(True)

plt.clf()  # the (x, y4) plot is cleared; its figure stays open but empty

plt.show()
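To make the "create or activate" and "clear" behavior concrete, the following sketch inspects the figures directly instead of showing them. The figure names "A" and "B" are arbitrary, and the Agg backend is an assumption for headless runs.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt

fig_a = plt.figure("A")
plt.plot([0, 1], [0, 1])

fig_b = plt.figure("B")
plt.plot([0, 1], [1, 0])

plt.figure("A")                       # "A" already exists, so it becomes current
current_is_a = plt.gcf() is fig_a

plt.clf()                             # clears the current figure, which is "A"
axes_in_a = len(fig_a.axes)           # "A" has no Axes left
axes_in_b = len(fig_b.axes)           # "B" is untouched
labels = plt.get_figlabels()          # both figures still exist

print(current_is_a, axes_in_a, axes_in_b, labels)
plt.close("all")
```

Note that plt.clf() leaves the empty figure open; plt.close() is what actually removes a figure.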

plt.subplots(nrows=1, ncols=1)

Splits a single figure into a grid of subplot areas (Axes).

Code example

# x, y1, y2 are assumed to be defined as in the earlier examples

# using the variable ax for a single Axes
fig, ax = plt.subplots()

# using the variable axs for multiple Axes
fig, axs = plt.subplots(2, 2)

axs[0, 0].plot(x, y1)
axs[0][1].plot(x, y2)  # axs[0][1] and axs[0, 1] refer to the same Axes

# using tuple unpacking for multiple Axes
fig, (ax1, ax2) = plt.subplots(1, 2)
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

import numpy as np
import matplotlib.pyplot as plt

# make data
x = np.linspace(0, 10, 100)
y1 = 4 + 1 * np.sin(2 * x)
x2 = np.linspace(0, 10, 25)
y2 = 4 + 1 * np.sin(2 * x2)
y3 = 4 + 1 * np.cos(2 * x)
y4 = 4 + 1 * np.cos(2 * x2)

# plot
fig, axs = plt.subplots(2, 2)

axs[0, 0].plot(x, y1, 'x', markeredgewidth=2)
axs[0][1].plot(x2, y2, linewidth=2.0)
axs[1, 0].plot(x, y3, 'o-', linewidth=2)
axs[1][1].plot(x2, y4)

# apply the same limits and ticks to every Axes in the 2x2 grid
for ax in axs.flat:
    ax.set(xlim=(0, 8), xticks=np.arange(1, 8), ylim=(0, 8), yticks=np.arange(1, 8))

plt.show()

plt.subplot(nrows, ncols, index)

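index: the position of the subplot in the grid, starting at 1.
Adds a single subplot directly.
plt.subplots(nrows, ncols) creates all the subplots at once, whereas
- this function creates one subplot at a time and plots into it.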

Code example

import numpy as np
import matplotlib.pyplot as plt

# make data
x = np.linspace(-5, 5, 20)
y1 = x ** 3
y2 = 5 * x + 30
y3 = 4 * (x ** 2) - 20
y4 = -25 * x + 20 

# subplot
nrows, ncols = (2, 2)

plt.subplot(nrows, ncols, 1)
plt.plot(x, y1, '--g')
plt.subplot(nrows, ncols, 2)
plt.plot(x, y2, ':b')
plt.subplot(nrows, ncols, 3)
plt.plot(x, y3, '-.r')
plt.subplot(nrows, ncols, 4)
plt.plot(x, y4)

plt.show()

plt.text(x, y, text)

Adds text to the Axes at position (x, y).
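A minimal sketch of annotating a point with plt.text follows. The data and the annotation string are illustrative, and the Agg backend is an assumption for headless runs. By default, x and y are given in data coordinates.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], "o-")  # illustrative data

# place the text at data coordinates (1, 1), anchored at its lower-left corner
note = plt.text(1, 1, "point (1, 1)", fontsize=9, ha="left", va="bottom")

print(note.get_position(), note.get_text())
```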

Bar Chart

https://matplotlib.org/stable/plot_types/basic/bar.html#sphx-glr-plot-types-basic-bar-py

Code example

import numpy as np
import matplotlib.pyplot as plt

# make data
ids = np.arange(4) + 1
member_ids = [f"m_{i:02d}" for i in ids]

before_ex = [27, 35, 40, 33]
after_ex = [30, 38, 42, 37]

# plot
bar_width = 0.3
line_width = 1

plt.bar(ids, before_ex, label='before', width=bar_width, color='m', edgecolor="white", linewidth=line_width)
# plt.barh(ids, before_ex, label='before', height=bar_width, color='m', edgecolor="white", linewidth=line_width)
plt.bar(ids + bar_width, after_ex, label='after', width=bar_width, color='c', edgecolor="white", linewidth=line_width)
# plt.barh(ids + bar_width, after_ex, label='after', height=bar_width, color='c', edgecolor="white", linewidth=line_width)

# center each tick between the "before" and "after" bars
plt.xticks(ids + bar_width / 2, member_ids)
plt.xlabel('Member ID')
plt.ylabel('Number of sit-ups')
plt.title('Muscular endurance before and after starting exercise')
plt.legend()

plt.show()

Pie Chart

https://matplotlib.org/stable/plot_types/stats/pie.html#sphx-glr-plot-types-stats-pie-py

Code example

import matplotlib.pyplot as plt

fruit = ['apple', 'banana', 'strawberry', 'orange', 'grape']
result = [7, 6, 3, 2, 2]

# colors = plt.get_cmap('Blues')(np.linspace(0.2, 0.7, len(fruit)))

# fraction of the radius by which each wedge is offset from the center
explode_value = (0.3, 0.1, 0.1, 0.1, 0.1)
plt.figure(figsize=(3, 3))
plt.pie(result, labels=fruit, colors=['m', 'y', 'r', 'c', 'g'], autopct='%.0f%%', startangle=90, counterclock=False, explode=explode_value, shadow=True)

plt.title("Share of fruit sales")
plt.legend(loc=4)  # 4 is 'lower right'
plt.show()

plt.get_cmap(colorName)

Returns a colormap, which is a callable that maps values to colors.
- Given a value in [0, 1], it returns the corresponding color as an RGBA tuple.
plt.get_cmap('Reds')
plt.get_cmap('Greens')
plt.get_cmap('Blues')

colors = plt.get_cmap('Blues')(np.linspace(0.2, 0.7, 5))
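To see what the colormap call actually returns, here is a short sketch (the Agg backend is an assumption for headless runs). A single number yields one RGBA tuple, while an array yields one RGBA row per value, which is the form that can be passed as colors= to plt.pie or color= to plt.bar.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run
import matplotlib.pyplot as plt
import numpy as np

cmap = plt.get_cmap("Blues")

single = cmap(0.5)                       # one (r, g, b, a) tuple
colors = cmap(np.linspace(0.2, 0.7, 5))  # array with one RGBA row per input value

print(len(single), colors.shape)
```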

Histogram

https://matplotlib.org/stable/plot_types/stats/hist_plot.html#sphx-glr-plot-types-stats-hist-plot-py

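Code example

import numpy as np
import matplotlib.pyplot as plt

def round_score(x):
    """Round a score and clamp it to the range [0, 100]."""
    if x > 100:
        return 100
    elif x < 0:
        return 0
    else:
        return round(x)

# 100 simulated scores with mean 70 and standard deviation 10
math_scores = [round_score(s) for s in 10 * np.random.randn(100) + 70]

fig, ax = plt.subplots()
plt.hist(math_scores, bins=8, linewidth=0.5, edgecolor="white")
plt.xlabel('Math score')
plt.ylabel('frequency')
plt.title('Histogram of math scores')
fig.set_size_inches(3, 3)
plt.show()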

Box Plot

https://matplotlib.org/stable/plot_types/stats/boxplot_plot.html#sphx-glr-plot-types-stats-boxplot-plot-py

https://seaborn.pydata.org/generated/seaborn.boxplot.html

plt.boxplot code example

import matplotlib.pyplot as plt
import numpy as np

plt.style.use('_mpl-gallery')

# make data:
np.random.seed(10)
D = np.random.normal((3, 5, 4), (1.25, 1.00, 1.25), (100, 3))

# plot
fig, ax = plt.subplots()
VP = ax.boxplot(D, positions=[2, 4, 6], widths=1.5, patch_artist=True,
                showmeans=False, showfliers=False,
                medianprops={"color": "white", "linewidth": 0.5},
                boxprops={"facecolor": "C0", "edgecolor": "white",
                          "linewidth": 0.5},
                whiskerprops={"color": "C0", "linewidth": 1.5},
                capprops={"color": "C0", "linewidth": 1.5})

ax.set(xlim=(0, 8), xticks=np.arange(1, 8),
       ylim=(0, 8), yticks=np.arange(1, 8))

plt.show()

sns.boxplot code example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# simulated scores before exercise: mean 30, standard deviation 10, rounded
before_ex = [round(v) for v in np.random.randn(100) * 10 + 30]

# after exercise: each score improves by a random integer in [0, 20]
after_ex = [v + np.random.randint(0, 21) for v in before_ex]

df = pd.DataFrame({'before_ex': before_ex, 'after_ex': after_ex})

sns.boxplot(data=df, orient="h", palette="Set2")
plt.show()

Scatter Plot

https://matplotlib.org/stable/plot_types/basic/scatter_plot.html#sphx-glr-plot-types-basic-scatter-plot-py

Code example

import numpy as np
import matplotlib.pyplot as plt

sample_size = 1000

base_normal_sample = np.random.randn(sample_size)

def transform_normal_sample(loc, scale):
    """Shift and scale the shared base sample, then round to integers."""
    return [round(v) for v in loc + scale * base_normal_sample]

# both variables come from the same base sample, so they are strongly correlated
height = transform_normal_sample(170, 10)
weight = transform_normal_sample(70, 5)

# add random noise to the weights
weight = [w + np.random.randint(0, 21) for w in weight]

fig, ax = plt.subplots()
fig.set_size_inches(3, 3)
plt.scatter(height, weight, s=2)
ax.set(xlabel="Height", ylabel="Weight", title="Scatter plot of weight vs. height")

plt.show()

Summarizing data with descriptive measures

Mean, Median, Mode

Mean: the arithmetic average
Median: the middle value of the sorted data
Mode: the most frequent value

Mean

np.mean(arr)
np.average(arr)
ser.mean()
df.mean()

Median

np.median(arr)
ser.median()
df.median()

Mode

statistics.mode(arr)
ser.mode()
df.mode()
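The three measures of central tendency listed above can be compared on one small dataset. This is only a sketch: the values in arr are made up, and ser stands for a pandas Series built from them.

```python
import statistics

import numpy as np
import pandas as pd

arr = np.array([1, 2, 2, 3, 4, 6])  # illustrative data
ser = pd.Series(arr)

mean_np = np.mean(arr)                   # (1+2+2+3+4+6) / 6 = 3.0
median_np = np.median(arr)               # average of the two middle values = 2.5
mode_py = statistics.mode(arr.tolist())  # most frequent value = 2
mode_pd = ser.mode().tolist()            # a list, because ties are possible

print(mean_np, median_np, mode_py, mode_pd)
```

Note that ser.mode() and df.mode() return a Series or DataFrame rather than a scalar, since more than one value can share the highest frequency.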

Variability (dispersion)

Variance

np.var(arr, ddof=1)

Standard Deviation

np.std(arr, ddof=1)
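The ddof=1 argument matters: NumPy divides by n by default (population variance), while ddof=1 divides by n - 1 (sample variance). A small sketch with made-up data:

```python
import numpy as np

arr = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

# mean is 5, and the squared deviations sum to 32
pop_var = np.var(arr)              # ddof=0: 32 / 8 = 4.0
sample_var = np.var(arr, ddof=1)   # ddof=1: 32 / 7
sample_std = np.std(arr, ddof=1)   # square root of the sample variance

print(pop_var, sample_var, sample_std)
```

pandas takes the opposite default: ser.var() and ser.std() use ddof=1 unless told otherwise, so NumPy and pandas results differ unless ddof is set explicitly.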

Range (max, min)

np.max(arr)
np.min(arr)
ser.max()
ser.min()
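df.max().iloc[0]  # max of the first column, selected by position
df.min().iloc[0]  # min of the first column, selected by position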

IQR (Quartile, Quantile, Percentile)

np.quantile(arr, [0.25, 0.5, 0.75])
np.percentile(arr, [25, 50, 75])
ser.quantile([0.25, 0.5, 0.75]).to_numpy()
df.quantile([0.25, 0.5, 0.75]).to_numpy()
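The IQR itself is the difference between the third and first quartiles. A minimal sketch with made-up data, using NumPy's default linear interpolation:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # illustrative data

q1, q2, q3 = np.quantile(arr, [0.25, 0.5, 0.75])  # quartiles: 3.0, 5.0, 7.0
iqr = q3 - q1                                     # 4.0

# np.percentile is the same computation on a 0-100 scale
pcts = np.percentile(arr, [25, 50, 75])

print(q1, q2, q3, iqr)
```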

Shape

Skewness

scipy.stats.skew(arr)
If skewness is negative (long left tail), then typically
- mean < median < mode
If skewness is positive (long right tail), then typically
- mode < median < mean
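The sign convention can be checked on a tiny dataset with a long right tail and its mirror image. The numbers are made up for illustration:

```python
import statistics

import numpy as np
from scipy import stats

right_skewed = np.array([1, 1, 1, 2, 2, 3, 10])  # long tail to the right
left_skewed = -right_skewed                      # mirror image: long tail to the left

skew_right = stats.skew(right_skewed)  # positive
skew_left = stats.skew(left_skewed)    # negative, same magnitude

mode_r = statistics.mode(right_skewed.tolist())  # 1
median_r = np.median(right_skewed)               # 2.0
mean_r = np.mean(right_skewed)                   # 20 / 7

print(skew_right, skew_left, mode_r, median_r, mean_r)
```

For this sample the ordering mode < median < mean holds, although the ordering is a rule of thumb rather than a guarantee for every dataset.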

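Kurtosis

scipy.stats.kurtosis(arr)

Note that scipy.stats.kurtosis returns excess kurtosis by default (fisher=True), for which a normal distribution gives 0. Pass fisher=False to get values on the scale used below, where a normal distribution gives 3.

Mesokurtic (Kurtosis = 3): the distribution has a kurtosis similar to that of a normal distribution, meaning its extreme values behave like those of a normal distribution. A normal distribution has a kurtosis of 3.
Leptokurtic (Kurtosis > 3): the distribution is tall with fatter tails. The peak is higher and sharper than a mesokurtic one, which means the data are heavy-tailed or contain many outliers. Outliers stretch the horizontal axis of a histogram so that most of the data falls within a narrow band, which gives a leptokurtic distribution its "skinny" look.
Platykurtic (Kurtosis < 3): the distribution is short, with thinner tails than a normal distribution. The peak is lower and broader than a mesokurtic one, which means the data are light-tailed or have few outliers. This is because the extreme values are smaller than those of a normal distribution.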
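Because SciPy offers both conventions, it helps to see them side by side. The sketch below draws a large normal sample (the seed and sample size are arbitrary choices) and computes kurtosis both ways:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # arbitrary seed for reproducibility
normal_sample = rng.normal(size=100_000)

excess = stats.kurtosis(normal_sample)                 # Fisher (default): close to 0
pearson = stats.kurtosis(normal_sample, fisher=False)  # Pearson: close to 3

print(excess, pearson)
```

The two values always differ by exactly 3, so the thresholds "greater than 3" and "less than 3" correspond to "greater than 0" and "less than 0" on SciPy's default scale.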
