Simple Scatter Plots

노정훈 · July 31, 2023

Matplotlib

# In[1]
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in Matplotlib 3.6
plt.style.use('seaborn-v0_8-whitegrid')

Scatter Plots with plt.plot

In the previous post, we used plt.plot and ax.plot to produce line plots.
It turns out that these same functions can produce scatter plots as well.

# In[2]
x=np.linspace(0,10,30)
y=np.sin(x)

plt.plot(x,y,'o',color='black');

The third argument in the function call is a character that represents the type of symbol used for the plotting.
Just as you can specify options such as '-' or '--' to control the line style, the marker style has its own set of short string codes.

# In[3]
rng=np.random.default_rng(0)
for marker in ['o','.',',','x','+','v','^','<','>','s','d']:
    plt.plot(rng.random(2),rng.random(2),marker,color='black',
             label="marker='{0}'".format(marker))
plt.legend(numpoints=1,fontsize=13)
plt.xlim(0,1.8);

For even more possibilities, these character codes can be used together with line and color codes to plot points along with a line connecting them.

# In[4]
plt.plot(x,y,'-ok');
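
To make the shorthand concrete, here is a minimal sketch that draws the same style twice: once with the '-ok' format string and once with the equivalent keyword arguments. The variable names short and explicit are only for illustration, and the second curve is shifted down so both are visible.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 30)
y = np.sin(x)

fig, ax = plt.subplots()
# Shorthand: solid line ('-'), circle marker ('o'), black ('k')
(short,) = ax.plot(x, y, '-ok')
# The same style spelled out with keywords, shifted down for visibility
(explicit,) = ax.plot(x, y - 0.5, linestyle='-', marker='o', color='k')
```

The format string is convenient for quick plots, while the keyword form tends to be easier to read once more than two or three properties are set.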

Additional keyword arguments to plt.plot specify a wide range of properties of the lines and markers.

# In[5]
plt.plot(x,y,'-p',color='gray',
         markersize=15,linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)
plt.ylim(-1.2,1.2);

These kinds of options make plt.plot the primary workhorse for two-dimensional plots in Matplotlib.

For more information about plt.plot, see the plt.plot documentation.

Scatter Plots with plt.scatter

A second, more powerful method of creating scatter plots is the plt.scatter function, which can be used very similarly to the plt.plot function.

# In[6]
plt.scatter(x,y,marker='o');

The primary difference between plt.scatter and plt.plot is that plt.scatter can create scatter plots where the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data.
To make overlapping points easier to see, we'll also use the alpha keyword to adjust the transparency level.

# In[7]
rng=np.random.default_rng(0)
x=rng.normal(size=100)
y=rng.normal(size=100)
colors=rng.random(100)
sizes=1000*rng.random(100)

plt.scatter(x,y,c=colors,s=sizes,alpha=0.3)
plt.colorbar(); # show color scale

Notice that the color argument is automatically mapped to a color scale (shown here by the colorbar command), and that the size argument is given as a marker area in points squared, not in pixels.
In this way, the color and size of points can be used to convey information in the visualization, in order to visualize multidimensional data.
For example, we might use the Iris dataset from Scikit-Learn, where each sample is one of three types of flowers that has had the size of its petals and sepals carefully measured.

# In[8]
from sklearn.datasets import load_iris
iris=load_iris()
features=iris.data.T

plt.scatter(features[0],features[1],alpha=0.4,
            s=100*features[3],c=iris.target,cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);

We can see that this scatter plot has given us the ability to simultaneously explore four different dimensions of the data: the $(x,y)$ location of each point corresponds to the sepal length and width, the size of the point is related to the petal width, and the color is related to the particular species of flower.
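
One detail worth keeping in mind when mapping data to marker size: Matplotlib documents the s argument of scatter as an area in points squared. The sketch below, with made-up diameters chosen only for illustration, shows one way to get markers of a desired diameter by squaring it before passing it as s.

```python
import numpy as np
import matplotlib.pyplot as plt

# Desired marker diameters in points (illustrative values, not from the post)
diameters = np.array([5.0, 10.0, 20.0])

fig, ax = plt.subplots()
# s is an area-like quantity (points squared), so square the diameter
points = ax.scatter([1, 2, 3], [1, 1, 1], s=diameters**2)
ax.set_xlim(0, 4)
```

Because s scales with area, doubling a data value doubles the marker's area rather than its width, which is usually what you want when size encodes a quantity.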

plot Versus scatter: A Note on Efficiency

When datasets get larger than a few thousand points, plt.plot can be noticeably more efficient than plt.scatter.
The reason is that plt.scatter has the capability to render a different size and/or color for each point, so the renderer must do the extra work of constructing each point individually.
With plt.plot, the markers for each point are guaranteed to be identical, so the work of determining the appearance of the points is done only once for the entire set of data.
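
If you want to check this on your own machine, a rough timing sketch along the following lines may help. The helper time_draw and the point count are assumptions made for illustration, and the measured numbers will vary with the backend, the Matplotlib version, and the hardware.

```python
import time
import numpy as np
import matplotlib.pyplot as plt

def time_draw(plot_func, n_points=20_000):
    """Return the seconds needed to plot and render n_points markers.

    Hypothetical helper for illustration; not part of Matplotlib.
    """
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=(2, n_points))
    fig, ax = plt.subplots()
    start = time.perf_counter()
    plot_func(ax, x, y)
    fig.canvas.draw()  # force rendering so the drawing cost is included
    elapsed = time.perf_counter() - start
    plt.close(fig)
    return elapsed

t_plot = time_draw(lambda ax, x, y: ax.plot(x, y, 'o', markersize=3))
t_scatter = time_draw(lambda ax, x, y: ax.scatter(x, y, s=9))
print(f"plot: {t_plot:.3f} s, scatter: {t_scatter:.3f} s")
```

A single run like this is noisy, so repeating the measurement a few times gives a fairer comparison.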

Visualizing Uncertainties

Basic Errorbars

One standard way to visualize uncertainties is using an errorbar.
A basic errorbar can be created with a single Matplotlib function call.

# In[9]
x=np.linspace(0,10,50)
dy=0.8
y=np.sin(x)+dy*np.random.randn(50)

plt.errorbar(x,y,yerr=dy,fmt='.k');

Here fmt is a format code controlling the appearance of lines and points, and it has the same syntax as the shorthand used in plt.plot.
In addition to these basic options, the errorbar function has many options to fine-tune the output.

# In[10]
plt.errorbar(x,y,yerr=dy,fmt='o',color='black',
             ecolor='lightgray',elinewidth=3,capsize=0);
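
As one more example of these options, yerr does not have to be a single number: it also accepts an array of shape (2, N) holding separate lower and upper errors for each point. The sketch below uses made-up error values (0.2 below, 0.6 above) purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 10)
y = np.sin(x)
lower = np.full_like(x, 0.2)  # distance from each point down to the bar's end
upper = np.full_like(x, 0.6)  # distance from each point up to the bar's end

fig, ax = plt.subplots()
# A (2, N) array gives separate lower and upper errors for every point
container = ax.errorbar(x, y, yerr=np.vstack([lower, upper]),
                        fmt='o', color='black', ecolor='lightgray',
                        elinewidth=3, capsize=0)
```

The same pattern works for horizontal errorbars through the xerr argument.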

For more information about plt.errorbar, see the plt.errorbar documentation.

Continuous Errors

In some situations it is desirable to show errorbars on continuous quantities.
Although Matplotlib does not have a built-in convenience routine for this type of application, it's relatively easy to combine primitives like plt.plot and plt.fill_between for a useful result.
Here we'll perform a simple Gaussian process regression, using the Scikit-Learn API. This is a method of fitting a very flexible nonparametric function to data with a continuous measure of the uncertainty.

# In[11]
from sklearn.gaussian_process import GaussianProcessRegressor

# define the model and draw some data
model=lambda x: x*np.sin(x)
xdata=np.array([1,3,5,6,8])
ydata=model(xdata)

# compute the gaussian process fit
gp=GaussianProcessRegressor()
gp.fit(xdata[:,np.newaxis],ydata)

xfit=np.linspace(0,10,1000)
yfit,dyfit=gp.predict(xfit[:,np.newaxis],return_std=True)

We now have xfit, yfit, and dyfit, which sample the continuous fit to our data.
We could pass these to the plt.errorbar function as in the previous section, but we don't really want to plot 1,000 points with 1,000 errorbars.
Instead, we can use the plt.fill_between function with a light color to visualize this continuous error.

# In[12]
plt.plot(xdata,ydata,'or')
plt.plot(xfit,yfit,'-',color='gray')
plt.fill_between(xfit,yfit-dyfit,yfit+dyfit,color='gray',alpha=0.2)
plt.xlim(0,10);

Look at the fill_between call signature: we pass the x values, then the lower y bound, then the upper y bound, and the area between these two bounds is filled.
In regions near a measured data point, the model is strongly constrained, and this is reflected in the small model uncertainties.
In regions far from a measured data point, the model is not strongly constrained, and the model uncertainties increase.
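
The same pattern works without a fitted model. The following minimal sketch uses a synthetic half-width that grows with distance from x = 5 as a stand-in for a real uncertainty estimate; it is an assumption for illustration, not Gaussian process output.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
center = np.sin(x)
# Synthetic stand-in for an uncertainty estimate (illustrative assumption):
# zero at x = 5 and growing linearly with distance from it
half_width = 0.1 * np.abs(x - 5)

fig, ax = plt.subplots()
ax.plot(x, center, '-', color='gray')
# Positional order is x, then the lower bound, then the upper bound
band = ax.fill_between(x, center - half_width, center + half_width,
                       color='gray', alpha=0.2)
```

The band pinches to nothing at x = 5 and widens toward the edges, mimicking how the shaded region narrows near measured points.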

For more information about plt.fill_between, see the plt.fill_between documentation, as well as the documentation for the closely related plt.fill function.
