Qualitative variables take values in an unordered set $\mathcal{C}$. Given a feature vector $X$ and a qualitative response $Y$ taking values in the set $\mathcal{C}$, the classification task is to build a function $C(X)$ that takes as input the feature vector $X$ and predicts a value for $Y$; that is, $C(X) \in \mathcal{C}$. Often we are more interested in estimating the probabilities that $X$ belongs to each category in $\mathcal{C}$.
Suppose for the Default classification task that we code $Y = 0$ if No and $Y = 1$ if Yes. Can we simply perform a linear regression of $Y$ on $X$ and classify as Yes if $\hat{Y} > 0.5$?
For a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis, which we discuss later. Since $E(Y \mid X = x) = \Pr(Y = 1 \mid X = x)$, we might think that regression is perfect for this task. However, linear regression might produce probabilities less than zero or bigger than one. Logistic regression is more appropriate.
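A minimal sketch contrasting the two on simulated binary data (the data and coefficients are assumptions, and scikit-learn is just one convenient tool):

```python
# Linear vs. logistic regression on a simulated binary outcome.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))   # true P(Y=1 | X)
y = rng.binomial(1, p)

lin = LinearRegression().fit(x, y)
log = LogisticRegression().fit(x, y)

grid = np.linspace(-4, 4, 9).reshape(-1, 1)
print(lin.predict(grid))                # can fall outside [0, 1]
print(log.predict_proba(grid)[:, 1])    # always a valid probability
```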
Now suppose the response takes one of several unordered values, coded as $Y = 1, 2, 3, \dots$. This coding imposes an ordering and an equal spacing on the outcomes that need not exist. Linear regression is not appropriate here; multiclass logistic regression or discriminant analysis are more appropriate.
Logistic regression uses the form
$$p(X) = \Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}},$$
which always lies between 0 and 1. A monotone transformation, the log odds or logit, is linear in $X$:
$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X.$$
The parameters are estimated by maximum likelihood: the likelihood gives the probability of the observed zeros and ones in the data,
$$\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i : y_i = 0} \bigl(1 - p(x_i)\bigr),$$
and we choose $\beta_0$ and $\beta_1$ to maximize the likelihood of the observed data. Multiclass logistic regression is also referred to as multinomial regression.
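A minimal sketch of the maximum-likelihood fit using statsmodels on simulated data (the true coefficients -1.0 and 1.5 are assumptions of the simulation):

```python
# Fit the logistic model by maximum likelihood and recover the linear logit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 1.5 * x))))

X = sm.add_constant(x)
fit = sm.Logit(y, X).fit(disp=0)       # maximum-likelihood estimates of beta0, beta1
print(fit.params)                      # should be near (-1.0, 1.5)

p_hat = fit.predict(X)
logit = np.log(p_hat / (1 - p_hat))    # equals beta0_hat + beta1_hat * x
```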
An alternative to modeling $\Pr(Y \mid X)$ directly is to use Bayes theorem to flip things around: model the distribution of $X$ in each class, then obtain $\Pr(Y = k \mid X = x)$. Using Normal (Gaussian) distributions for each class leads to linear or quadratic discriminant analysis. One writes Bayes theorem slightly differently for discriminant analysis:
$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},$$
where $f_k(x)$ is the density for $X$ in class $k$, and $\pi_k$ is the marginal or prior probability for class $k$. When the priors are different, we take them into account as well, and compare $\pi_k f_k(x)$ across classes: favoring one class shifts the decision boundary toward the other.
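A minimal sketch of this weighting of densities by priors, in one dimension with made-up means and priors:

```python
# Posterior probabilities via Bayes theorem: posterior_k(x) ~ pi_k * f_k(x).
import numpy as np
from scipy.stats import norm

priors = np.array([0.3, 0.7])               # pi_k (made-up values)
means, sigma = np.array([-1.0, 1.5]), 1.0   # class means, shared sd

def posterior(x):
    weighted = priors * norm.pdf(x, loc=means, scale=sigma)  # pi_k * f_k(x)
    return weighted / weighted.sum()

print(posterior(0.0))   # posterior probabilities over the two classes
```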
When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable; linear discriminant analysis does not suffer from this problem. If $n$ is small and the distribution of the predictors is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model. Linear discriminant analysis is also popular when we have more than two response classes, because it provides low-dimensional views of the data.
In one dimension, the Gaussian density has the form
$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2},$$
where $\mu_k$ is the mean and $\sigma_k^2$ the variance in class $k$. Linear discriminant analysis assumes a common variance $\sigma_k = \sigma$ in all classes. Plugging this into the Bayes formula gives a rather complex expression for $\Pr(Y = k \mid X = x)$, but classifying to the largest posterior is equivalent to classifying to the largest discriminant score
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k),$$
which is a linear function of $x$ (the first term is linear in $x$; the others are constants). With two classes and equal priors, the decision boundary is at $x = (\mu_1 + \mu_2)/2$. If instead the variance $\sigma_k^2$ differs in the $k$th class, the score becomes quadratic in $x$.
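A small sketch computing these scores for two 1-D classes with made-up parameters:

```python
# Linear discriminant scores delta_k(x) for two 1-D Gaussian classes
# with shared variance; all numbers are illustrative assumptions.
import numpy as np

pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma2 = 1.0

def delta(x):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)

print(delta(0.2))   # classify to the larger score
# With equal priors the boundary sits at (mu_1 + mu_2) / 2 = 0.
```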
Because LDA essentially classifies to the closest centroid (adjusted for priors), and the $K$ class centroids span a $K-1$ dimensional plane, linear discriminant analysis can be viewed exactly in a $K-1$ dimensional plot. Even when $K - 1 > 2$, we can project onto the best 2-dimensional plane for visualizing the discriminant rule, chosen so that the separation between the centroids is largest.
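A minimal sketch using scikit-learn's LDA on the built-in iris data (an illustrative choice, not from the text), showing the $K - 1 = 2$ dimensional view:

```python
# With K = 3 classes the centroids span a 2-D plane, so the data can be
# plotted exactly in 2 discriminant coordinates.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)           # 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
Z = lda.transform(X)                        # n x 2 discriminant coordinates
print(Z.shape)                              # (150, 2)
```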
- Misclassification rate: the overall fraction of examples classified incorrectly.
- Precision: the fraction of predicted positives that are truly positive.
- Recall: the fraction of positive examples that are classified as positive.
- False positive rate: the fraction of negative examples that are classified as positive.
- False negative rate: the fraction of positive examples that are classified as negative.
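A minimal sketch computing these rates from a confusion matrix with scikit-learn (the labels are made up):

```python
# Error rates from a confusion matrix
# (sklearn convention: rows = true class, columns = predicted class).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("misclassification rate:", (fp + fn) / len(y_true))
print("precision:", tp / (tp + fp))
print("recall:", tp / (tp + fn))
print("false positive rate:", fp / (fp + tn))
print("false negative rate:", fn / (fn + tp))
```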
- Lower threshold: higher false positive rate and lower false negative rate.
- Higher threshold: lower false positive rate and higher false negative rate.
By varying the threshold we trace out the ROC curve. If the classifier performs well, the AUC approaches 1.
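A minimal sketch of tracing the ROC curve and computing AUC with scikit-learn on simulated scores (the data here are illustrative assumptions):

```python
# Sweep the threshold to trace an ROC curve and compute AUC.
# In practice the scores would be predicted probabilities.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.4, size=500)
scores = y + rng.normal(scale=0.8, size=500)   # noisy but informative scores

fpr, tpr, thresholds = roc_curve(y, scores)    # one point per threshold
print("AUC:", roc_auc_score(y, scores))        # near 1 for a good classifier
```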
With Gaussian densities but a different covariance $\Sigma_k$ in each class, we get quadratic discriminant analysis; with Gaussians and the same $\Sigma$ in each class, we get linear discriminant analysis. Assuming the features are independent in each class gives naive Bayes; for Gaussians this means the $\Sigma_k$ are diagonal. When the number of features $p$ is large, QDA and even LDA break down, while Gaussian naive Bayes, which assumes each $\Sigma_k$ is diagonal, remains usable. Naive Bayes is also useful for mixed feature vectors (qualitative and quantitative): if $X_j$ is qualitative, replace its density with a probability mass function (histogram) over the discrete categories.
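These three models differ only in their covariance assumptions; a minimal sketch fitting all three with scikit-learn on the built-in iris data (an illustrative choice):

```python
# Three Gaussian generative classifiers, differing in what they assume
# about the class covariances (shared, per-class, per-class diagonal).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
for model in (LinearDiscriminantAnalysis(),      # same Sigma in each class
              QuadraticDiscriminantAnalysis(),   # different Sigma_k
              GaussianNB()):                     # diagonal Sigma_k
    print(type(model).__name__, model.fit(X, y).score(X, y))
```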
How does LDA compare with logistic regression?
- Logistic regression uses the conditional likelihood based on $\Pr(Y \mid X)$ (discriminative learning).
- LDA uses the full likelihood based on $\Pr(X, Y)$ (generative learning).
- Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
Logistic regression models $\Pr(Y = 1 \mid X = x)$ directly, via the logistic function.
Similarly, multinomial logistic regression uses the softmax function to form probabilities,
$$\Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1} x}},$$
and is fit by maximizing the multinomial log likelihood (cross-entropy), a generalization of the binomial. These all model the conditional distribution of $Y$ given $X$.
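A minimal sketch of the softmax map itself, in plain NumPy:

```python
# Softmax: linear scores -> class probabilities, as in multinomial
# logistic regression.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # non-negative, sums to 1
```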
By contrast, generative models start with the conditional distribution of $X$ given $Y$, and then use Bayes formula to turn things around:
$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},$$
where
- $f_k(x)$ is the density of $X$ given $Y = k$;
- $\pi_k$ is the marginal probability that $Y$ is in class $k$.
Linear and quadratic discriminant analysis derive from generative models where the $f_k(x)$ are Gaussian. Generative models are often useful if some classes are well separated, a situation where logistic regression is unstable.
Naive Bayes assumes that the densities $f_k(x)$ in each class factor:
$$f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j).$$
Equivalently, this assumes that the features are independent within each class. Then, using Bayes formula,
$$\Pr(Y = k \mid X = x) = \frac{\pi_k \prod_{j=1}^{p} f_{kj}(x_j)}{\sum_{l=1}^{K} \pi_l \prod_{j=1}^{p} f_{lj}(x_j)},$$
where each $f_{kj}(x_j)$ gives the probability (density) of feature value $x_j$ under class $k$'s distribution for feature $j$.
This avoids having to estimate high-dimensional densities; it is much easier to specify one-dimensional densities. It also handles mixed features naturally. If $X_j$ is quantitative, we can model it as a univariate Gaussian: estimate $\mu_{kj}$ and $\sigma_{kj}^2$ from the data, then plug into the Gaussian density formula. Alternatively, we can use a histogram estimate of the density and directly estimate $f_{kj}(x_j)$ by the proportion of observations in the bin into which $x_j$ falls. If $X_j$ is qualitative, we can simply model the proportion in each category.
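A minimal sketch of naive Bayes with one quantitative and one qualitative feature; the class summaries are made-up numbers, not estimates from any real data:

```python
# Naive Bayes with mixed features: a univariate Gaussian per class for the
# quantitative feature, per-class category proportions for the qualitative one.
import numpy as np
from scipy.stats import norm

# per-class summaries: (prior, mean, sd, category proportions)
classes = {
    "k1": (0.6, 0.0, 1.0, {"a": 0.7, "b": 0.3}),
    "k2": (0.4, 2.0, 1.0, {"a": 0.2, "b": 0.8}),
}

def posterior(x_num, x_cat):
    # pi_k * f_k1(x_num) * f_k2(x_cat), normalized over classes
    scores = {k: pi * norm.pdf(x_num, m, s) * cat[x_cat]
              for k, (pi, m, s, cat) in classes.items()}
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

print(posterior(1.0, "b"))
```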
The naive Bayes model takes the form of a generalized additive model, discussed in a later chapter. Generalized linear models provide a unified framework for dealing with many different response types (non-negative responses, skewed distributions, counts, and more).
For count responses, the variance mostly increases with the mean, linear regression predictions are on the wrong scale, and some counts are zero, so a Gaussian model fits poorly. The Poisson distribution is useful for modeling counts:
$$\Pr(Y = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \dots$$
Poisson regression models $\log \lambda(x) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$, so predictions are non-negative. The most widely used GLMs are the Gaussian, Binomial, and Poisson families. Each has a characteristic link function, the transformation of the mean that is represented by a linear model: the identity link for linear regression, the logit link for logistic regression, and the log link for Poisson regression. Each family also has its own variance function relating the variance to the mean, and all are fit by maximum likelihood.
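A minimal sketch of Poisson regression via statsmodels' GLM on simulated counts (the coefficients 0.3 and 0.8 are assumptions of the simulation):

```python
# Poisson regression with the log link, fit by maximum likelihood.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=500)
lam = np.exp(0.3 + 0.8 * x)               # log link: log(lambda) linear in x
y = rng.poisson(lam)

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)                          # near (0.3, 0.8)
print(fit.predict(X)[:5])                  # non-negative fitted counts
```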