Linear regression is a simple approach to supervised learning. It assumes that the dependence of $Y$ on $X$ is linear.
True regression functions are never linear!
Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.
We assume a model
$$Y = \beta_0 + \beta_1 X + \varepsilon,$$
where $\beta_0$ and $\beta_1$ are two unknown constants that represent the *intercept* and *slope*, also known as *coefficients* or *parameters*, and $\varepsilon$ is the *error term*.
Given some estimates $\hat\beta_0$ and $\hat\beta_1$ for the model coefficients, we can predict future values using
$$\hat y = \hat\beta_0 + \hat\beta_1 x,$$
where $\hat y$ indicates a prediction of $Y$ on the basis of $X = x$. The hat symbol denotes an estimated value.
Let $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat y_i$ represents the $i$th *residual*.
We define the *residual sum of squares* (RSS) as
$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2,$$
or equivalently
$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right)^2.$$
The *least squares* approach chooses $\hat\beta_0$ and $\hat\beta_1$ to minimize the RSS. The minimizing values can be shown to be
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x,$$
where $\bar y \equiv \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar x \equiv \frac{1}{n}\sum_{i=1}^{n} x_i$ are the sample means.
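These closed-form estimates are easy to verify directly. A minimal sketch in Python with NumPy (the simulated data and variable names are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a known line: y = 2 + 3x + noise
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=100)

# Closed-form least squares estimates of the slope and intercept
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)  # should be close to 2 and 3
```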
The *standard error* of an estimator reflects how it varies under repeated sampling. We have
$$\mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \mathrm{SE}(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n} (x_i - \bar x)^2} \right],$$
where $\sigma^2 = \mathrm{Var}(\varepsilon)$.
These standard errors can be used to compute *confidence intervals*. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form
$$\hat\beta_1 \pm 2 \cdot \mathrm{SE}(\hat\beta_1).$$
That is, there is approximately a 95% chance that the interval
$$\left[ \hat\beta_1 - 2 \cdot \mathrm{SE}(\hat\beta_1),\; \hat\beta_1 + 2 \cdot \mathrm{SE}(\hat\beta_1) \right]$$
will contain the true value of $\beta_1$. For example, for the advertising data, the 95% confidence interval for $\beta_1$ is $[0.042, 0.053]$.
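Continuing the sketch above, the standard error and interval can be estimated by plugging the residual variance estimate in for $\sigma^2$ (again, the variable names are illustrative):

```python
# Estimate sigma^2 from the residuals, then form the approximate
# 95% confidence interval for beta1.
n = len(x)
residuals = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)  # RSS / (n - 2)

se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))
ci_lower = beta1_hat - 2 * se_beta1
ci_upper = beta1_hat + 2 * se_beta1
print(f"95% CI for beta1: [{ci_lower:.3f}, {ci_upper:.3f}]")
```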
The most common hypothesis test involves testing the *null hypothesis*

$H_0$: There is no relationship between $X$ and $Y$

versus the *alternative hypothesis*

$H_A$: There is some relationship between $X$ and $Y$.

Mathematically, this corresponds to testing
$$H_0: \beta_1 = 0$$
versus
$$H_A: \beta_1 \neq 0,$$
since if $\beta_1 = 0$ then the model reduces to $Y = \beta_0 + \varepsilon$, and $X$ is not associated with $Y$.
To test the null hypothesis, we compute a *t-statistic*, given by
$$t = \frac{\hat\beta_1 - 0}{\mathrm{SE}(\hat\beta_1)}.$$
This will have a $t$-distribution with $n - 2$ degrees of freedom, assuming $\beta_1 = 0$.
The *p-value* is the probability of observing any value equal to $|t|$ or larger.
Results for the advertising data (regression of sales onto TV):

|           | Coefficient | Std. Error | t-statistic | p-value |
|-----------|-------------|------------|-------------|---------|
| Intercept | 7.0325      | 0.4578     | 15.36       | <0.0001 |
| TV        | 0.0475      | 0.0027     | 17.67       | <0.0001 |
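A quick continuation of the earlier sketches, computing the t-statistic and its two-sided p-value (SciPy's t-distribution is used for the tail probability):

```python
from scipy import stats

# t-statistic for H0: beta1 = 0, using beta1_hat and se_beta1 from above
t_stat = (beta1_hat - 0) / se_beta1

# Two-sided p-value from a t-distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4g}")
```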
We compute the *Residual Standard Error*
$$\mathrm{RSE} = \sqrt{\frac{1}{n-2}\,\mathrm{RSS}},$$
where the *residual sum-of-squares* is $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$.
The *R-squared*, or fraction of variance explained, is
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},$$
where $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar y)^2$ is the *total sum of squares*.
It can be shown that in this simple linear regression setting $R^2 = r^2$, where $r$ is the correlation between $X$ and $Y$:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar y)^2}}.$$
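Both fit statistics follow directly from the residuals; a short continuation of the earlier sketch:

```python
# Residual standard error and R^2, using residuals, y_bar, n from above
rss = np.sum(residuals ** 2)
tss = np.sum((y - y_bar) ** 2)

rse = np.sqrt(rss / (n - 2))
r_squared = 1 - rss / tss
print(f"RSE = {rse:.3f}, R^2 = {r_squared:.3f}")
```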
For multiple linear regression, our model is
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon.$$
We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed.
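A minimal multiple-regression sketch using NumPy's least squares solver (the simulated three-predictor data set is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated design matrix with p = 3 predictors plus an intercept column
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, 0.5, -2.0, 3.0])  # intercept, beta1..beta3
X_design = np.column_stack([np.ones(n), X])
y = X_design @ true_beta + rng.normal(scale=0.5, size=n)

# Least squares fit: minimizes the residual sum of squares
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)  # should be close to true_beta
```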
The ideal scenario is when the predictors are uncorrelated (a *balanced design*): each coefficient can be estimated and tested separately. Correlations amongst predictors cause problems: the variance of all coefficients tends to increase, sometimes dramatically, and interpretation becomes hazardous, since when $X_j$ changes, everything else changes. Claims of causality should be avoided for observational data.

Given estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$, we can make predictions using the formula
$$\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p,$$
and we estimate the coefficients as the values that minimize the sum of squared residuals.

Some important questions:
- Is at least one of the predictors useful in predicting the response?
- Is only a subset of the predictors useful?
- Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

The first question is answered with the *F-statistic*,
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}.$$

To decide which predictors are useful, the most direct approach is *all subsets* or *best subsets* regression: compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size. When examining all subsets is infeasible, stepwise procedures are used instead (a forward-selection sketch follows below):
- *Forward selection*: begin with the *null model* (intercept only), add to the null model the variable that results in the lowest RSS, and continue until some stopping rule is satisfied.
- *Backward selection*: start with all variables in the model, remove the variable with the largest p-value, refit, and keep removing until a stopping rule is reached.
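Here is a minimal sketch of greedy forward selection under the RSS criterion, as referenced above; `rss_of_fit` and `forward_selection` are illustrative helper names, not from the text:

```python
import numpy as np

def rss_of_fit(X_design, y):
    """Residual sum of squares of a least squares fit of y on X_design."""
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return np.sum((y - X_design @ beta) ** 2)

def forward_selection(X, y, max_vars):
    """Greedily add the predictor whose inclusion yields the lowest RSS,
    starting from the null model (intercept only)."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining and len(selected) < max_vars:
        scores = []
        for j in remaining:
            cols = selected + [j]
            design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
            scores.append((rss_of_fit(design, y), j))
        _, best_j = min(scores)  # candidate giving the lowest RSS
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Example: on the simulated data from the previous sketch, this should
# recover the informative predictors first.
# print(forward_selection(X, y, max_vars=2))
```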
Some predictors are not quantitative but *qualitative*, taking a discrete set of values; these are also called *categorical predictors* or *factor variables*. We handle qualitative predictors like this by creating *dummy variables*, and with more than two levels, we create additional dummy variables. There will always be one fewer dummy variable than the number of levels. The level left without a dummy variable is the *baseline*; choosing the baseline is important, since it determines what the intercept represents.
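A small sketch of this encoding by hand (the three-level `region` factor and its levels are invented for illustration):

```python
import numpy as np

# A hypothetical three-level factor: region of each observation
region = np.array(["South", "West", "East", "South", "West"])

# Choose "East" as the baseline; create one dummy per remaining level
levels = ["South", "West"]  # one fewer dummy than the number of levels
dummies = np.column_stack([(region == lvl).astype(float) for lvl in levels])

print(dummies)
# Rows for "East" are all zeros: the baseline is absorbed into the intercept.
```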
We can extend the linear model by removing the additive assumption, allowing for *interactions* and *nonlinearity*. In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media. A departure from this is known in marketing as a *synergy effect*, and in statistics it is referred to as an *interaction effect*. Sometimes an interaction term has a very small p-value, but the associated main effects do not. In such cases we follow the *hierarchy principle*:
- If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant
The rationale is that interactions are hard to interpret in a model without main effects: the interaction terms also contain main effects, if the model has no main effect terms. Interactions also arise between qualitative and quantitative variables. Without an interaction term, the model takes the form of two lines with a common slope (one per level of the qualitative variable); with interactions, it takes the form of two lines with different slopes:
- No interaction: same slope (parallel lines)
- With an interaction: different slopes
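A brief sketch of fitting a model with an interaction term; the simulated `income`/`student` variables are illustrative stand-ins, not data from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical quantitative predictor (income) and dummy-coded factor (student)
income = rng.uniform(20, 100, size=300)
student = rng.integers(0, 2, size=300).astype(float)

# Simulate a different slope per group: an interaction effect
y = (200 + 6 * income + 380 * student
     - 2 * income * student
     + rng.normal(scale=30, size=300))

# Design matrix with both main effects AND the interaction (hierarchy principle)
X = np.column_stack([np.ones_like(income), income, student, income * student])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates for: intercept, income, student, income:student
```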
A non-linear structure might sometimes be more effective. Other potential problems include:
- Outliers
- Non-constant variance of error terms
- High leverage points
- Collinearity

Generalizations of the linear model expand its scope:
- Classification problems: logistic regression, support vector machines
- Non-linearity: kernel smoothing, splines and generalized additive models; nearest neighbor methods
- Interactions: tree-based methods, bagging, random forests and boosting (these also capture non-linearities)
- Regularized fitting: ridge regression and the lasso