# Machine Learning Notes

- Supervised vs Unsupervised Learning
- Classification vs Regression Problems
- Univariate Linear Regression
  - Hypothesis
  - Cost Function
  - Gradient Descent
- Multivariate Linear Regression
  - Gradient Descent
  - Feature Scaling
  - Mean Normalization
  - Polynomial Regression
  - Normal Equation
- Linear Regression Vectorization
  - Hypothesis
  - Cost Function
  - Gradient Descent
- Logistic Regression
  - Hypothesis
  - Decision Boundary
  - Cost Function
  - Gradient Descent
  - Vectorization
  - Multiclass Classification
  - Advanced Optimization
- Overfitting & Underfitting
- Regularization
  - Regularized Linear Regression
  - Regularized Gradient Descent
  - Regularized Normal Equation
  - Regularized Logistic Regression
  - Regularized Advanced Optimization
$m$ = number of training examples; $x$'s = "input" variables / features; $y$'s = "output" / "target" variables
$(x, y)$ = one training example; $(x^{(i)}, y^{(i)})$ = the $i$-th training example
Example
Input ($x$) | Output ($y$) |
---|---|
0 | 10 |
2 | 19 |
5 | 38 |
7 | 45 |
The number of training examples is 4, so $m = 4$. $x^{(2)}$ is 2. $(x^{(3)}, y^{(3)}) = (5, 38)$ is the third training example, so $x^{(3)} = 5$ and $y^{(3)} = 38$.
The hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ maps $x$'s to $y$'s.
The cost function $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$ is used to choose $\theta_0$ and $\theta_1$ by minimizing the difference between the hypothesized output $h_\theta(x^{(i)})$ and $y^{(i)}$.
We need to minimize $J(\theta_0, \theta_1)$ by varying $\theta_0$ and $\theta_1$, where $h_\theta(x)$ is the hypothesis.
Repeat the following operation until the cost function is minimized:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{(for } j = 0 \text{ and } j = 1\text{)}$
Example
$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x^{(i)}$

Where $h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$ and $\alpha$ is the learning rate; the two statements are computed simultaneously. Repeat until gradient descent reaches a local minimum.
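For concreteness, here is a minimal Octave sketch of these simultaneous updates, using the small data table above; the starting values, learning rate, and iteration count are arbitrary choices for illustration:

```matlab
% Training data from the table above (x = input, y = output)
x = [0; 2; 5; 7];
y = [10; 19; 38; 45];
m = length(y);

theta0 = 0; theta1 = 0;    % arbitrary initial parameters
alpha = 0.01;              % arbitrary learning rate

for iter = 1:1500
  h = theta0 + theta1 * x;                              % hypothesis for every example
  temp0 = theta0 - alpha * (1/m) * sum(h - y);          % update rule for theta_0
  temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x);   % update rule for theta_1
  theta0 = temp0;  theta1 = temp1;                      % simultaneous update
end
```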
$n$ = number of features. The vector $\theta$ contains all the theta parameters, including $\theta_0$.
Repeat the following simultaneously until the minimum is reached:

$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)} \quad \text{(for } j = 0, 1, \dots, n\text{)}$
Here $x_j^{(i)}$ can be read as elements of the feature vector $x^{(i)}$ (with $x_0^{(i)} = 1$). Likewise for $\theta_j$ and the vector $\theta$.
Feature scaling makes gradient descent faster
Example
$x_1$ is between 0 and 100,000
$x_2$ is between 0 and 5
$x_3$ is between 0 and 0.0001
Running gradient descent without scaling is very slow. Appropriate scaling in this case: $x_1 := \frac{x_1}{100{,}000}$, $x_2 := \frac{x_2}{5}$, $x_3 := \frac{x_3}{0.0001}$, so that each feature ends up roughly in the range $[0, 1]$.
Used to normalize the values to be centered around 0
$x_i := \dfrac{x_i - \mu_i}{x_{\max} - x_{\min}}$

Where $x_i$ is the original non-normalized value, $\mu_i$ is the average of the feature set, $x_{\max}$ is the maximum value of the feature set, and $x_{\min}$ is the minimum value of the feature set.
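A short Octave sketch of mean normalization applied to every feature column at once; the names `X`, `mu`, `ranges`, and `X_norm` are mine, and `X` is assumed to be an m-by-n matrix of raw feature values (implicit broadcasting needs Octave or MATLAB R2016b+):

```matlab
mu     = mean(X);             % row vector: average of each feature column
ranges = max(X) - min(X);     % row vector: max - min of each feature column
X_norm = (X - mu) ./ ranges;  % each feature now centered around 0
```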
Basically multivariate linear regression, except the features $x_2, x_3, \dots$ are the square, cube, ... of $x_1$. This represents a graph that is not linear, but instead polynomial (duh).
Example
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$, where $x_1 = x$, $x_2 = x^2$, and $x_3 = x^3$ for a single original feature $x$.
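As a sketch, the polynomial feature columns can be built in Octave from a single original feature column `x` (the name `X_poly` is mine):

```matlab
% Build the features x, x^2, x^3 from one original feature column x
X_poly = [x, x.^2, x.^3];   % feature scaling matters here, since the ranges differ greatly
```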
The normal equation is an alternative method to gradient descent; it is much faster unless there are a lot of features (e.g. $n \geq 10{,}000$).
Example
$x_1$ | $x_2$ | $x_3$ | $y$ |
---|---|---|---|
64 | 25 | 59 | 1600 |
48 | 11 | 100 | 800 |
5 | 88 | 76 | 450 |
All feature values are stored in a design matrix $X$, with an added column of 1's in front to account for $x_0$, which is always 1. All output values are stored in the vector $y$:

$X = \begin{bmatrix} 1 & 64 & 25 & 59 \\ 1 & 48 & 11 & 100 \\ 1 & 5 & 88 & 76 \end{bmatrix}, \qquad y = \begin{bmatrix} 1600 \\ 800 \\ 450 \end{bmatrix}$

We then plug these into the formula $\theta = (X^T X)^{-1} X^T y$, which returns $\theta$ in the form of a vector.
Note: the rows of the design matrix are the training examples transposed, $(x^{(i)})^T$. In this case, $X$ is a $3 \times 4$ matrix and $y$ is a $3 \times 1$ vector.
To take the inverse of matrices, use the pinv() function, as certain matrices inside the calculation may not be invertible (pinv computes the pseudoinverse, which still works in that case).
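In Octave the whole normal equation is a single line; this sketch assumes `X` is the design matrix (with the leading column of ones) and `y` is the output vector:

```matlab
theta = pinv(X' * X) * X' * y;   % pinv instead of inv, in case X'*X is not invertible
```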
$h_\theta(x) = \theta^T x$

Where the hypothesis equals the transposed vector $\theta$ multiplied by the vector $x$. Alternatively:
$h = X\theta$

Where the hypothesis vector equals the matrix multiplication of $X$, the matrix containing all features and sample sets, and the vector $\theta$.
The vectorized form of the cost function for the multivariate case can be written as:

$J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y)$
Where $X$ is the matrix containing the feature sets, $\theta$ is a vector containing the parameters, and $y$ is a vector containing all the output values.
Alternatively, this is the same thing as:

$J(\theta) = \frac{1}{2m}\,\mathrm{sum}\big((X\theta - y)^{2}\big)$

Where the squaring part is applied element-wise.
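A sketch of both vectorized forms of the cost in Octave; they produce the same number:

```matlab
m   = length(y);                % number of training examples
err = X * theta - y;            % vector of h_theta(x^(i)) - y^(i)
J   = (err' * err) / (2*m);     % matrix-product form
J2  = sum(err .^ 2) / (2*m);    % element-wise-square form, same result
```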
$\theta := \theta - \alpha\,\delta, \qquad \delta = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x^{(i)}$

Where $\theta$ is a vector changed during gradient descent, and $\delta$ is a vector calculated from the sum above, where $x^{(i)}$ is a vector containing one set of feature values. The hypothesis may be vectorized (reference above).
Alternatively:

$\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$
This is basically the same thing as above, except we changed the vector $x^{(i)}$ to the matrix $X$ that contains all the feature values, and changed $y^{(i)}$ to the vector $y$ that holds all the output values.
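A minimal Octave sketch of the vectorized update; `alpha` and `num_iters` are hypothetical settings:

```matlab
for iter = 1:num_iters
  theta = theta - (alpha/m) * X' * (X*theta - y);   % every theta_j updated at once
end
```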
Logistic regression is used for classification problems and, despite the name, is not a regression method. Using linear regression for classification problems isn't a good idea.
Binary Classification:
Where the output $y$ has only two possible outcomes (e.g. 0 or 1, true or false)
Multiclass Classification: where the output can take more than two values
Sigmoid / Logistic function:

$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$
$h_\theta(x)$ outputs the probability that $y = 1$ on input $x$, a value between 0 and 1, which is what we want for binary classification.
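A sketch of the sigmoid hypothesis in Octave; written element-wise so it works on a whole design matrix at once:

```matlab
g = @(z) 1 ./ (1 + exp(-z));   % sigmoid / logistic function
h = g(X * theta);              % h_theta(x) for every training example
```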
Example
$h_\theta(x) = P(y = 1 \mid x; \theta) = 0.3$ means that the probability of $y = 1$, given features $x$ and parameterized by $\theta$, is 30%. This also means that the probability (in binary classification) of $y = 0$ is 70%.
If we predict $y = 1$ when $h_\theta(x) \geq 0.5$, then we predict $y = 1$ exactly when $\theta^T x \geq 0$, since $g(z) \geq 0.5$ when $z \geq 0$.
The decision boundary is a property of the hypothesis $h_\theta(x)$ and its parameters $\theta$, where it separates the different outcomes of the classification.
Example
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ and $\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$

So plugging in and doing the inequality to predict when $y = 1$ or $y = 0$, we get $-3 + x_1 + x_2 \geq 0$, or $x_1 + x_2 \geq 3$. This means that we predict $y = 1$ when $x_1 + x_2 \geq 3$.

Also, $x_1 + x_2 = 3$ is a line that acts as the decision boundary between the two outcomes. The points on the line correspond directly to where $h_\theta(x) = 0.5$.
The decision boundary is not limited to lines. Polynomial features, for example, allow for more complex boundaries, including ellipses and other curves.
Example
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$ and $\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$

Predicting $y = 1$ when $-1 + x_1^2 + x_2^2 \geq 0$ makes the boundary $x_1^2 + x_2^2 = 1$, which is a circle about the origin with radius 1.
Recall
The cost function for linear regression, with the $\frac{1}{2}$ moved inside the summation:

$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$
We replace $\frac{1}{2}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$ with $\operatorname{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big)$, so we have:

$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \operatorname{Cost}\big(h_\theta(x^{(i)}), y^{(i)}\big)$
Running this linear regression cost function with the logistic regression hypothesis ($h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$) will result in a non-convex function that has many local optima for $J(\theta)$, so running gradient descent is not guaranteed to reach the global minimum. Instead we use this new cost function:

$\operatorname{Cost}\big(h_\theta(x), y\big) = \begin{cases} -\log\big(h_\theta(x)\big) & \text{if } y = 1 \\ -\log\big(1 - h_\theta(x)\big) & \text{if } y = 0 \end{cases}$
Or simply:

$\operatorname{Cost}\big(h_\theta(x), y\big) = -y\log\big(h_\theta(x)\big) - (1 - y)\log\big(1 - h_\theta(x)\big)$
We can then plug this into the overall logistic regression cost function

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big]$

fit the parameters $\theta$ by minimizing $J(\theta)$, and make predictions $h_\theta(x)$ given new $x$.
Recall
Gradient descent from linear regression: repeat $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$ while simultaneously updating all $\theta_j$. Doing the partial derivatives, we get:

$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}$
This is identical in form to the gradient descent algorithm for linear regression, except that here $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ instead of $h_\theta(x) = \theta^T x$.
The gradient descent can be vectorized by using the matrix $X$ and the vectors $\theta$ and $y$:

$\theta := \theta - \frac{\alpha}{m} X^T\big(g(X\theta) - y\big)$
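A sketch of the vectorized logistic regression cost and one descent step in Octave; `alpha` is a hypothetical learning rate:

```matlab
m = length(y);
h = 1 ./ (1 + exp(-(X * theta)));                      % sigmoid hypothesis for all examples
J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));    % logistic regression cost
theta = theta - (alpha/m) * X' * (h - y);              % one vectorized gradient descent step
```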
$y$ can take more than two outcomes (e.g. $y \in \{1, 2, 3, \dots\}$)
The one-vs-all method trains a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$. For predictions with a new input $x$, choose the class $i$ that maximizes $h_\theta^{(i)}(x)$.
Where the superscript $(i)$ indexes which class it is.
For a multiclass classification problem with $k$ classes, $k$ different logistic regression classifiers will be trained ($k$ decision boundaries, one for each class).
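A sketch of the one-vs-all prediction step in Octave; `all_theta` is a hypothetical k-by-(n+1) matrix whose i-th row holds the parameters of classifier i:

```matlab
probs = 1 ./ (1 + exp(-(X * all_theta')));   % m-by-k matrix: h^(i)(x) for every example and class
[~, predictions] = max(probs, [], 2);        % choose the class with the highest probability
```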
Gradient descent is not always the most efficient algorithm for optimizing / minimizing $J(\theta)$. Others include Conjugate Gradient, BFGS, and L-BFGS. They are more complex than gradient descent, but often run much faster.
Example
To minimize $J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$, we supply both the cost and its partial derivatives $\frac{\partial}{\partial\theta_1}J(\theta) = 2(\theta_1 - 5)$ and $\frac{\partial}{\partial\theta_2}J(\theta) = 2(\theta_2 - 5)$.
The following code shows how to implement this in MATLAB/Octave:
```matlab
% costFunction.m
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;   % J(theta)
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);             % partial derivative w.r.t. theta_1
  gradient(2) = 2*(theta(2)-5);             % partial derivative w.r.t. theta_2
end

% then, from a script or the Octave prompt:
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
We implement the `costFunction` function, which returns two things: `jVal` is the cost and `gradient` is a vector that corresponds to the two partial derivatives. The advanced optimization function `fminunc` is then called after `costFunction` is set up. It takes a pointer to the cost function (`@costFunction`) and some other options. `fminunc` should return the optimal $\theta$.
Underfitting, or high bias, is when the hypothesis doesn't fit the training set very well. It is usually caused by too few features.
Overfitting, or high variance, fits the training examples very well but fails to generalize enough to predict future samples. It is usually caused by too many features.
To address overfitting:
- Reduce the number of features (reducing features discards information about the problem)
- Regularization
Regularization reduces the magnitude of the parameters $\theta_j$, which allows for a "simpler" hypothesis that is less prone to overfitting. So we modify the cost function:

$J(\theta) = \frac{1}{2m}\bigg[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\bigg]$
A term is added at the end to regularize all the magnitudes, where $\lambda$ is the regularization parameter that keeps the $\theta_j$ small. If $\lambda$ is too large, the algorithm will result in underfitting.
Note: $j$ starts from 1 because, by convention, we don't regularize $\theta_0$.
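A sketch of the regularized linear regression cost in Octave; note that `theta(1)` corresponds to $\theta_0$ and is left out of the penalty:

```matlab
err = X * theta - y;
reg = (lambda / (2*m)) * sum(theta(2:end) .^ 2);   % penalty skips theta(1), i.e. theta_0
J   = (err' * err) / (2*m) + reg;
```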
Modified gradient descent for linear regression:
Repeat {

$\quad \theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$

$\quad \theta_j := \theta_j - \alpha\bigg[\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\theta_j\bigg] \quad (j = 1, 2, \dots, n)$

}
Since we don't need to regularize $\theta_0$, we can separate its update. For every other $\theta_j$, we take the partial derivative of the new regularized $J(\theta)$.
Grouping the $\theta_j$ terms together, we can write:

$\theta_j := \theta_j\Big(1 - \alpha\frac{\lambda}{m}\Big) - \alpha\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$
The factor $\big(1 - \alpha\frac{\lambda}{m}\big)$ is slightly less than 1; it is multiplied by $\theta_j$ and makes it smaller. The second part of the equation is identical to the unregularized gradient descent update.
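A sketch of one regularized gradient descent step in Octave, keeping $\theta_0$ unregularized:

```matlab
grad = (1/m) * X' * (X*theta - y);                        % unregularized gradient
grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end);    % add the regularization term, skipping theta_0
theta = theta - alpha * grad;
```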
Recall
The normal equation:

$\theta = (X^T X)^{-1} X^T y$
The modified normal equation adds an extra term $\lambda$ multiplied by an $(n+1)\times(n+1)$ identity matrix, except with a 0 for the element in the first row and first column:

$\theta = \Bigg(X^T X + \lambda\begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}\Bigg)^{-1} X^T y$
If $m \leq n$, then $X^T X$ is non-invertible; regularization fixes this as long as $\lambda > 0$. So $X^T X + \lambda L$, where $L$ is our "modified identity matrix", is invertible.
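A sketch of the regularized normal equation in Octave; `size(X, 2)` is $n + 1$, the number of columns of the design matrix including the bias column:

```matlab
L = eye(size(X, 2));   % (n+1) x (n+1) identity matrix
L(1, 1) = 0;           % do not regularize theta_0
theta = pinv(X' * X + lambda * L) * X' * y;
```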
Recall
For the cost function for logistic regression, we add the term $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ that applies the regularization:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log\big(h_\theta(x^{(i)})\big) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
Recall
Gradient descent for logistic regression: as with regularized linear regression gradient descent, we separate the updates for $\theta_0$ and the other $\theta_j$. The gradient descent update is the same as for regularized linear regression, except that $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ is the logistic regression hypothesis.
Recall
the advanced optimization setup:
```matlab
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute the regularized J(theta)];
  gradient(1) = [partial derivative of J(theta) w.r.t. theta_0];  % not regularized
  gradient(2) = [partial derivative of J(theta) w.r.t. theta_1];
  gradient(3) = [partial derivative of J(theta) w.r.t. theta_2];
  % ...
  gradient(n+1) = [partial derivative of J(theta) w.r.t. theta_n];
end
```
The code is very similar to the unregularized version, except that the code to compute the cost function and the derivatives of the cost function has changed.
This `costFunction` can be used in the `fminunc` function to get the regularized hypothesis.
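As a sketch of what a filled-in version might look like for regularized logistic regression (the name `costFunctionReg` and the idea of passing `X`, `y`, and `lambda` through an anonymous function are my own choices, not from the notes):

```matlab
function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                         % sigmoid hypothesis
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
         + (lambda/(2*m)) * sum(theta(2:end) .^ 2);         % regularized cost, theta_0 excluded
  gradient = (1/m) * X' * (h - y);                          % unregularized gradient
  gradient(2:end) = gradient(2:end) + (lambda/m) * theta(2:end);
end

% usage with fminunc:
% options = optimset('GradObj', 'on', 'MaxIter', 100);
% initialTheta = zeros(size(X, 2), 1);
% [optTheta, Jmin] = fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options);
```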