Introduction to NLP (Wk.5)

Ch. 7 Machine Learning

7-3) Linear Regression

Introduction to Linear Regression

Linear regression models the linear relationship between one or more independent variables x and a dependent variable y.
If there is only one independent variable x, it is called simple linear regression.

Simple Linear Regression Analysis

$$ y = wx + b $$

Above is the simple linear regression formula.
w is the 'weight' (the slope of the line).
b is the 'bias' (the intercept).
If we find the right values of w and b, we have successfully modeled the relationship between x and y.

Multiple Linear Regression Analysis

There can also be multiple independent variables.
We call that multiple linear regression.

$$ y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b $$

Above is the multiple linear regression formula.
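As a rough illustration of the two formulas above, here is a minimal sketch in plain Python. The function name `predict` and all weights and inputs are made up for this example; they do not come from the lecture.

```python
def predict(xs, ws, b):
    """Multiple linear regression: y = w_1*x_1 + ... + w_n*x_n + b."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    return sum(w * x for w, x in zip(ws, xs)) + b

# With a single input this reduces to simple linear regression, y = w*x + b.
simple = predict([3.0], [2.0], 1.0)                          # 2*3 + 1
# Three inputs, three weights (hypothetical values).
multiple = predict([1.0, 2.0, 3.0], [0.5, 1.0, 2.0], 1.0)    # 0.5 + 2 + 6 + 1

print(simple, multiple)  # 7.0 9.5
```

The only difference between the two cases is the number of (weight, input) pairs that are summed before the bias is added.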


To model the relationship between x and y, we set up a hypothesis.
Below is the hypothesis for linear regression.

$$ H(x) = wx + b $$

Cost Function

Mean Squared Error (MSE)

The goal is to find the w and b that best express the rule relating x and y.
In machine learning, we define a function that measures the error between the predicted values and the actual values.
That function goes by any of the following names:

  1. Objective Function
  2. Cost Function
  3. Loss Function

Then, we find the values of w and b that give the smallest error according to this function.
Therefore, the function should not only represent the size of the error, but also be well suited to being minimized.

For regression, we usually use the Mean Squared Error (MSE).
Below is the formula, shown with a worked example of n = 4 data points whose squared errors sum to 210.

$$ \frac{1}{n} \sum_{i=1}^{n} \left[y^{(i)} - H(x^{(i)})\right]^2 = \frac{210}{4} = 52.5 $$

Below, the MSE is rewritten as a cost function of w and b.

$$ cost(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left[y^{(i)} - H(x^{(i)})\right]^2 $$

The smaller the errors, the lower the MSE.
If we find the w and b that minimize cost(w, b), we get the line that best describes the relationship between x and y.

$$ w, b \rightarrow \text{minimize } cost(w, b) $$
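To make the cost function concrete, here is a minimal sketch in plain Python that evaluates cost(w, b) for two candidate lines. The data set is invented for this example (it follows y = 2x + 1 exactly) and is not the one behind the 52.5 figure above.

```python
def hypothesis(x, w, b):
    """H(x) = w*x + b."""
    return w * x + b

def cost(w, b, xs, ys):
    """Mean squared error between the targets ys and the predictions H(x)."""
    n = len(xs)
    return sum((y - hypothesis(x, w, b)) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

perfect = cost(2.0, 1.0, xs, ys)  # every error is 0
worse = cost(1.0, 1.0, xs, ys)    # errors 1, 2, 3, 4 -> (1 + 4 + 9 + 16) / 4

print(perfect, worse)  # 0.0 7.5
```

The line with the lower cost fits the data better, which is why training amounts to searching for the w and b with the smallest cost.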

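Optimizer: Gradient Descent

Machine learning and deep learning models (including linear regression) are trained by finding the w and b that minimize the cost function.
The algorithm used for this is called an 'optimizer' or 'optimization algorithm'.
Gradient descent, the most basic optimizer, repeatedly moves w and b a small step in the direction that lowers the cost.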
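The notes name gradient descent without writing out the update, so here is a minimal sketch in plain Python. It assumes the standard rule of subtracting the learning rate times the partial derivative of the MSE cost from each parameter. The function name `gradient_descent`, the data, the learning rate, and the step count are all chosen for illustration.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(steps):
        # prediction minus target for every data point
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # partial derivatives of the MSE with respect to w and b
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # step against the gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w_fit, b_fit = gradient_descent(xs, ys)
print(round(w_fit, 4), round(b_fit, 4))
```

With a learning rate that is too large the updates overshoot and the cost grows instead of shrinking, so the value usually has to be tuned for the data at hand.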

7-4) Auto-Gradient and Linear Regression


Here is a sample formula.

$$ z = 2w^2 + 5 $$

The following code computes its gradient with respect to w using automatic differentiation.

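import tensorflow as tf

# trainable variable, initialized to 2
w = tf.Variable(2.)

def f(w):
  y = w**2
  z = 2*y + 5
  return z

# record the operations applied to w so the gradient can be computed
with tf.GradientTape() as tape:
  z = f(w)

# dz/dw = 4w, which is 8 at w = 2
gradients = tape.gradient(z, [w])
print(gradients)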
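One way to sanity-check an automatic gradient is to compare it with a numerical estimate. The sketch below uses plain Python and a central finite difference on the same function f. The helper name `numerical_gradient` and the step size h are assumptions for this example, not part of TensorFlow.

```python
def f(w):
    y = w ** 2
    return 2 * y + 5

def numerical_gradient(func, w, h=1e-5):
    """Central finite-difference estimate of d func / d w."""
    return (func(w + h) - func(w - h)) / (2 * h)

analytic = 4 * 2.0                    # by hand: dz/dw = 4w, evaluated at w = 2
numeric = numerical_gradient(f, 2.0)  # should agree up to rounding error

print(analytic, round(numeric, 6))
```

Both values should match the gradient TensorFlow reports for w = 2.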

Linear Regression

# variables that will be learned
# initialized with arbitrary values (4 and 1)
w = tf.Variable(4.0)
b = tf.Variable(1.0)

def hypothesis(x):
  return w*x + b

def mse_loss(y_pred, y):
  # mean of the squared differences between predictions and targets
  return tf.reduce_mean(tf.square(y_pred - y))

x = [1, 2, 3, 4, 5, 6, 7, 8, 9] # study time
y = [11, 22, 33, 44, 53, 66, 77, 87, 95] # grade

# uses the gradient descent algorithm
# learning rate = 0.01
optimizer = tf.optimizers.SGD(0.01)

for i in range(301):
  with tf.GradientTape() as tape:
    y_pred = hypothesis(x)
    cost = mse_loss(y_pred, y)
  # gradients of the cost with respect to w and b
  gradients = tape.gradient(cost, [w, b])
  # update w and b using those gradients
  optimizer.apply_gradients(zip(gradients, [w, b]))

  if i % 10 == 0:
    print("epoch : {:3} | w : {:5.4f} | b : {:5.4f} | cost : {:5.6f}".format(i, w.numpy(), b.numpy(), cost.numpy()))

# check if it works: predict grades for unseen study times
x_test = [3.5, 5, 5.5, 6]
print(hypothesis(x_test).numpy())

7-5) Logistic Regression

Introduction to Binary Classification

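Binary classification is for cases where there are two possible options.
A linear function is not suitable for binary classification, because its output is not limited to the range 0 to 1.
Instead, we need a function whose output ranges from 0 to 1 and whose graph has an 'S' shape.
This is called the 'sigmoid function'.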

Sigmoid Function

Below is the formula for the sigmoid function.

$$ H(x) = \frac{1}{1 + e^{-(wx + b)}} = sigmoid(wx + b) = \sigma(wx + b) $$

e is Euler's number (approximately 2.718).
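The sigmoid formula can be tried out directly. Below is a minimal sketch using only Python's `math` module. The two-branch form is an implementation choice made here to avoid overflow for very negative inputs, and the w and b values are hypothetical.

```python
import math

def sigmoid(z):
    """Sigmoid of z, written so that math.exp never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def hypothesis(x, w, b):
    """H(x) = sigmoid(w*x + b)."""
    return sigmoid(w * x + b)

# Hypothetical parameters, for illustration only.
w, b = 1.0, 0.0
for x in [-5.0, 0.0, 5.0]:
    print(x, round(hypothesis(x, w, b), 4))
```

The output is exactly 0.5 when wx + b = 0 and approaches 0 or 1 as wx + b becomes very negative or very positive, which gives the 'S' shape.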

7-6) Logistic Regression with TensorFlow

Logistic Regression with Keras

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers

x = np.array([-50, -40, -30, -20, -10, -5, 0, 5, 10, 20, 30, 40, 50])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]) # label is 1 for x >= 10, otherwise 0

model = Sequential()
model.add(Dense(1, input_dim=1, activation='sigmoid')) # one input x, one output y, with the sigmoid function as the activation

sgd = optimizers.SGD(learning_rate=0.01)
# uses stochastic gradient descent as the optimizer and binary cross-entropy as the loss function
model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=['binary_accuracy'])
model.fit(x, y, epochs=200) # trains for 200 epochs

plt.plot(x, model.predict(x), 'b', x, y, 'k.') # blue line: predicted probability, black dots: training data
plt.show()

7-7) Multi-Input

Multi Linear Regression

In deep learning, there are usually two or more independent variables. From the perspective of coding the model, this means the input vector has two or more dimensions.

If x1 is the midterm score, x2 is the final score, x3 is the bonus points, and y is the overall score, the hypothesis is as follows.

$$ H(X) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b $$

We call the vector of features [x1, x2, x3] X.

Multi Logistic Regression

If x1 is the sepal length (cm), x2 is the petal length (cm), and y is the species, the hypothesis is as follows.

$$ H(X) = sigmoid(w_1 x_1 + w_2 x_2 + b) $$

We call the vector of features [x1, x2] X.

7-8) Calculation of Vector and Matrix

Why Do We Have to Know This?

There may be cases with more than one independent variable.
Keras is easy to use, but when we build machine learning models at a lower level with NumPy or TensorFlow, we must understand how the computation on the variables maps onto vector and matrix operations.
In other words, we should be able to work out the size of each matrix (or tensor) from the data and the number of variables.
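As a sketch of how matrix sizes follow from the data, the example below computes the hypothesis for several samples at once as a matrix product. It uses plain nested lists instead of NumPy so that it runs without extra dependencies. The helper `matmul`, the two sample rows, and the weights are all hypothetical.

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x k) matrix, both given as nested lists."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# 2 samples, 3 features each (midterm, final, bonus points): shape (2, 3)
X = [[70.0, 80.0, 5.0],
     [90.0, 60.0, 0.0]]
# one weight per feature: shape (3, 1)
W = [[0.5], [0.5], [1.0]]
bias = 1.0

# (2, 3) times (3, 1) gives (2, 1): one prediction per sample
H = [[row[0] + bias] for row in matmul(X, W)]
print(H)  # [[81.0], [76.0]]
```

The number of rows in W must equal the number of features (columns of X), and the number of rows in the result equals the number of samples.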

Introduction to Vector, Matrix, and Tensor

A vector is 'a quantity with magnitude and direction'.
In Python, it is represented as a 1-dimensional array or list.

A matrix is 'a 2-dimensional structure with rows and columns'.
In Python, it is represented as a 2-dimensional array.

An array with 3 or more dimensions is called a tensor.
In Python, it is represented as an n-dimensional array (n >= 3).
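The three structures can be compared by their shapes. The sketch below uses nested Python lists and a small hypothetical helper `shape`. With NumPy, the `shape` attribute of an array provides the same information.

```python
def shape(a):
    """Return the shape of a regular (non-ragged) nested list as a tuple."""
    dims = []
    while isinstance(a, list):
        dims.append(len(a))
        if not a:  # stop at an empty list
            break
        a = a[0]
    return tuple(dims)

vector = [1, 2, 3]                       # 1 dimension
matrix = [[1, 2, 3],
          [4, 5, 6]]                     # 2 dimensions: 2 rows, 3 columns
tensor = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]],
          [[9, 10], [11, 12]]]           # 3 dimensions

print(shape(vector), shape(matrix), shape(tensor))  # (3,) (2, 3) (3, 2, 2)
```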

7-9) Softmax Regression

Introduction to Multi-Class Classification

Multi-class classification is for cases where there are three or more options.
The basic idea is to assign each option a probability so that the probabilities sum to 1.

Softmax Function

If the number of options (classes) is k, the softmax function takes a k-dimensional vector as input and returns the probability of each class.

$$ z_i = \text{the i-th element of the k-dimensional input vector} $$
$$ p_i = \text{the probability that the i-th class is the answer} $$
$$ p_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} \quad \text{for } i = 1, 2, \dots, k $$

Below is the output of the softmax function when k = 3 and the input is z = [z1 z2 z3].

$$ softmax(z) = \left[ \frac{e^{z_1}}{\sum_{j=1}^{3} e^{z_j}} \ \ \frac{e^{z_2}}{\sum_{j=1}^{3} e^{z_j}} \ \ \frac{e^{z_3}}{\sum_{j=1}^{3} e^{z_j}} \right] = [p_1, p_2, p_3] = \hat{y} = \text{prediction} $$
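The softmax formula can be written in a few lines of plain Python. In this minimal sketch, subtracting the largest element before exponentiating is an added numerical-stability step that does not change the result, and the input vector is a made-up example.

```python
import math

def softmax(z):
    """Turn a list of k scores into k probabilities that sum to 1."""
    m = max(z)                            # shift so the largest exponent is 0
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical input with k = 3.
p = softmax([1.0, 2.0, 3.0])
print([round(v, 4) for v in p], round(sum(p), 4))
```

The class with the largest input z_i always receives the largest probability, so the predicted class is the index of the largest element of the output.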
