This is an overview in point form of the content in the [[ML class]].

= [[ML_class#INTRODUCTION|INTRODUCTION]] =

== [[ML_class#Welcome|Examples of machine learning]] ==

* Database mining (Large datasets from growth of automation/web)
** clickstream data
** medical records
** biology
** engineering
* Applications that can't be programmed by hand
** autonomous helicopter
** handwriting recognition
** most of Natural Language Processing (NLP)
** Computer Vision
* Self-customising programs
** Amazon
** Netflix product recommendations
* Understanding human learning (brain, real AI)
 
== [[ML_class#What_is_Machine_Learning.3F|What is Machine Learning?]] ==

* Definitions of Machine Learning
** Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
** Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
* There are several different types of ML algorithms. The two main types are:
** Supervised learning
*** teach the computer how to do something
** Unsupervised learning
*** the computer learns by itself
* Other types of algorithms are:
** Reinforcement learning
** Recommender systems

== [[ML_class#Supervised_Learning|Supervised Learning]] ==

* Supervised Learning: learning in which the "right answers" are given
** '''Regression''': predict continuous valued output (e.g. price)
** '''Classification''': discrete valued output (e.g. 0 or 1)

== [[ML_class#Unsupervised_Learning|Unsupervised Learning]] ==

* Unsupervised Learning: learning in which the categories are unknown
** '''Clustering''': cluster patterns (categories) are found in the data
** '''Cocktail party problem''': overlapping audio tracks are separated out
*** <code>[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');</code>
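
A rough Octave sketch (not from the lecture) of how that one-liner might be called; the toy sources, mixing matrix and variable names here are invented for illustration:

<pre>
% Toy "cocktail party" setup: two made-up sources mixed into two microphone signals.
t   = 0:0.001:1;
src = [sin(2*pi*3*t); sign(sin(2*pi*5*t))];   % two invented source signals
A   = [0.8 0.3; 0.2 0.7];                     % invented mixing matrix
x   = A * src;                                % x is (microphones x samples)

% The one-liner quoted in the notes above:
[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
</pre>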
 
= [[ML_class#LINEAR_REGRESSION_WITH_ONE_VARIABLE|LINEAR REGRESSION WITH ONE VARIABLE]] =
 
== [[ML_class#Model_Representation|Model Representation]] ==
 
* e.g. housing prices, price per square-foot
** Supervised Learning
** Regression
* Dataset called '''training set'''
* Notation:
{|class="wikitable"
| '''m''' || number of training examples
|-
| '''x''''s || "input" variable / features
|-
| '''y''''s || "output" variable / "target" variable
|-
| '''(x,y)''' || one training example
|-
| '''(x<sup>(i)</sup>,y<sup>(i)</sup>)''' || i<sup>th</sup> training example
|}
* Training Set -> Learning Algorithm -> <em>h</em> (hypothesis)
* Size of house (x) -> <em>h</em> -> Estimated price (y)
** <em>h</em> maps from '''x''''s to '''y''''s
* How do we represent <em>h</em>?
** <em>h</em><sub>Θ</sub>(x) = <em>h</em>(x) = Θ<sub>0</sub> + Θ<sub>1</sub>x
* Linear regression with one variable (x)
** Univariate linear regression
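
A rough Octave sketch of this setup (the house sizes, prices and parameter values reuse numbers that appear later in these notes; the variable names are my own):

<pre>
% Toy training set (m = 4 examples): size of house (x) -> price (y).
x = [2104; 1416; 1534; 852];   % "input" variable / feature
y = [460; 232; 315; 178];      % "output" / "target" variable
m = length(y);                 % number of training examples

% One candidate hypothesis h_theta(x) = theta0 + theta1*x:
theta0 = -40;  theta1 = 0.25;
h = @(xi) theta0 + theta1 * xi;   % maps x's to estimated y's
h(x(1))                           % estimated price for the first training example
</pre>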

== [[ML_class#Cost_Function|Cost Function]] ==

* Helps us figure out how to fit the best possible straight line to our data
* <em>h<sub>Θ</sub></em>(x) = <em>Θ</em><sub>0</sub> + <em>Θ</em><sub>1</sub>x
* <em>Θ<sub>i</sub></em>'s: Parameters
* How to choose parameters (<em>Θ<sub>i</sub></em>'s)?
** Choose <em>Θ</em><sub>0</sub>, <em>Θ</em><sub>1</sub> so that <em>h<sub>Θ</sub></em>(x) is close to <em>y</em> for our training examples (<em>x</em>,<em>y</em>)
** Minimise <math>\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2</math> over <em>Θ</em><sub>0</sub>, <em>Θ</em><sub>1</sub>
*** <em>h<sub>Θ</sub></em>(x<sup>(i)</sup>) = <em>Θ</em><sub>0</sub> + <em>Θ</em><sub>1</sub>x<sup>(i)</sup>
** J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) = <math>\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2</math>
** J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) is the Cost Function, also known in this case as the [http://en.wikipedia.org/wiki/Mean_squared_error Squared Error Function]
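
A short Octave sketch of evaluating J for one choice of the parameters (same toy data as above; not from the lecture):

<pre>
% J(theta0, theta1) = 1/(2m) * sum over i of (h_theta(x^(i)) - y^(i))^2
x = [2104; 1416; 1534; 852];  y = [460; 232; 315; 178];  m = length(y);
theta0 = -40;  theta1 = 0.25;
h = theta0 + theta1 * x;          % predictions for every training example
J = sum((h - y).^2) / (2 * m)     % squared error cost for this (theta0, theta1)
</pre>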

== [[ML_class#Cost_Function_-_Intuition_I|Cost Function - Intuition I]] ==

* Summary:
** '''Hypothesis''': <em>h<sub>Θ</sub></em>(x) = <em>Θ</em><sub>0</sub> + <em>Θ</em><sub>1</sub>x
** '''Parameters''': <em>Θ</em><sub>0</sub>, <em>Θ</em><sub>1</sub>
** '''Cost Function''': J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) = <math>\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2</math>
** '''Goal''': minimise J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) over <em>Θ</em><sub>0</sub>, <em>Θ</em><sub>1</sub>
* Simplified:
** <em>h<sub>Θ</sub></em>(x) = <em>Θ</em><sub>1</sub>x
** minimise J(<em>Θ</em><sub>1</sub>) over <em>Θ</em><sub>1</sub>
* Can plot J(<em>Θ</em><sub>1</sub>) for the simplified model in 2D
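
One way that 2D plot might be produced in Octave (a sketch with an invented training set where Θ<sub>1</sub> = 1 is optimal):

<pre>
% Sweep theta1 and plot J(theta1) for the simplified hypothesis h(x) = theta1*x.
x = [1; 2; 3];  y = [1; 2; 3];  m = length(y);
theta1_vals = -0.5:0.05:2.5;
J_vals = arrayfun(@(t) sum((t*x - y).^2) / (2*m), theta1_vals);
plot(theta1_vals, J_vals);
xlabel('theta_1');  ylabel('J(theta_1)');   % bowl-shaped curve, minimum at theta_1 = 1
</pre>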

== [[ML_class#Cost_Function_-_Intuition_II|Cost Function - Intuition II]] ==

* Can plot J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) in 3D
* Can plot with Contour Map (Contour Plot)
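
A similar sketch for the 3D surface and the contour plot of J(Θ<sub>0</sub>,Θ<sub>1</sub>), again with invented data:

<pre>
% Surface and contour plots of J(theta0, theta1) over a grid (toy data: y = 1 + x).
x = [1; 2; 3];  y = [2; 3; 4];  m = length(y);
t0 = linspace(-2, 4, 100);  t1 = linspace(-1, 3, 100);
J = zeros(length(t0), length(t1));
for i = 1:length(t0)
  for j = 1:length(t1)
    J(i,j) = sum((t0(i) + t1(j)*x - y).^2) / (2*m);
  end
end
surf(t0, t1, J');               % 3D bowl
figure; contour(t0, t1, J');    % contour map; minimum near (theta0, theta1) = (1, 1)
</pre>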
 
== [[ML_class#Gradient_Descent|Gradient Descent]] ==

* repeat until convergence { <math>\theta_j \colon= \theta_j - \alpha\frac{\part}{\part\theta_j}J(\theta_0,\theta_1)</math> }
* α = learning rate
 
== [[ML_class#Gradient_Descent_Intuition|Gradient Descent Intuition]] ==
 
* <math>\min_{\theta_1} J(\theta_1)</math>
** For Θ<sub>1</sub> &gt; local minimum: <math>\frac{d}{d\theta_1}J(\theta_1)</math> positive, moves toward local minimum
** For Θ<sub>1</sub> &lt; local minimum: <math>\frac{d}{d\theta_1}J(\theta_1)</math> negative, moves toward local minimum
* If the learning rate α is too small, the algorithm takes a long time to converge
* If the learning rate α is too large, the algorithm may overshoot the minimum and fail to converge, or may even diverge
* When the partial derivative is zero, Θ<sub>1</sub> stops changing: it has converged to a local minimum
* As we approach a local minimum, gradient descent automatically takes smaller steps
** So no need to decrease α over time
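
A small Octave experiment (invented, not from the lecture) that illustrates these points on the toy cost J(Θ<sub>1</sub>) = (Θ<sub>1</sub> - 5)<sup>2</sup>:

<pre>
% Gradient descent on J(theta1) = (theta1 - 5)^2, so dJ/dtheta1 = 2*(theta1 - 5).
theta1 = 0;
alpha  = 0.1;                     % try alpha = 1.1 here to watch it diverge
for iter = 1:10
  grad   = 2 * (theta1 - 5);      % positive to the right of the minimum, negative to the left
  step   = alpha * grad;
  theta1 = theta1 - step;
  printf('iter %2d: theta1 = %.4f, step = %.4f\n', iter, theta1, step);
end
% With alpha fixed, the steps shrink automatically as theta1 approaches 5.
</pre>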
 
== [[ML_class#Gradient_Descent_for_Linear_Regression|Gradient Descent for Linear Regression]] ==
 
* Gradient descent algorithm:
** repeat until convergence {
*** <math>\theta_j \colon= \theta_j - \alpha\frac{\part}{\part\theta_j}J(\theta_0,\theta_1)</math>
*** for j=0 and j=1
** }
* Linear Regression Model:
** <em>h<sub>Θ</sub></em>(x) = <em>Θ</em><sub>0</sub> + <em>Θ</em><sub>1</sub>x
** J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) = <math>\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2</math>
** <math>\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)</math>
* Partial derivatives:
** j=0: <math>\frac{\part}{\part\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})</math>
** j=1: <math>\frac{\part}{\part\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) x^{(i)}</math>
* Gradient descent algorithm:
** repeat until convergence {
*** <math>\theta_0 := \theta_0 -\alpha \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})</math>
*** <math>\theta_1 := \theta_1 -\alpha \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) x^{(i)}</math>
** }
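
Putting the two update rules together, a sketch of batch gradient descent for univariate linear regression in Octave (variable names and the tiny learning rate are my own choices; the data reuses the toy housing numbers from these notes):

<pre>
% Batch gradient descent for h_theta(x) = theta0 + theta1*x.
x = [2104; 1416; 1534; 852];
y = [460; 232; 315; 178];
m = length(y);

theta0 = 0;  theta1 = 0;
alpha  = 1e-7;           % small because x is not feature-scaled; many iterations needed in practice
for iter = 1:1000
  h     = theta0 + theta1 * x;
  grad0 = sum(h - y) / m;          % partial derivative wrt theta0
  grad1 = sum((h - y) .* x) / m;   % partial derivative wrt theta1
  % Simultaneous update: both gradients are computed before either parameter changes.
  theta0 = theta0 - alpha * grad0;
  theta1 = theta1 - alpha * grad1;
end
</pre>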
 
== [[ML_class#What.27s_Next|What's Next]] ==
 
* Two extensions:
** Solve <code>min J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>)</code> for <em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub> exactly, without needing an iterative algorithm (gradient descent)
** Learn with larger number of features
* Linear Algebra topics:
** What are matrices and vectors
** Addition, subtraction and multiplication with matrices and vectors
** Matrix inverse, transpose
* <math>\begin{bmatrix}1 & 2 \\0 & 1 \end{bmatrix}^T\begin{bmatrix}1 & 2 \\0 & 1 \end{bmatrix}</math>
* <math>\begin{bmatrix}2 & 0 \\0 & 1 \end{bmatrix}^{-1}\begin{bmatrix}2 \\1 \end{bmatrix}-3\begin{bmatrix}1 \\5 \end{bmatrix}</math>
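
Both expressions can be checked directly in Octave, as a quick sketch:

<pre>
% Transpose, then matrix multiplication:
[1 2; 0 1]' * [1 2; 0 1]              % ans = [1 2; 2 5]

% Inverse, matrix-vector product and scalar multiplication:
inv([2 0; 0 1]) * [2; 1] - 3*[1; 5]   % ans = [-2; -14]
</pre>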
 
= [[ML_class#LINEAR_ALGEBRA_REVIEW|LINEAR ALGEBRA REVIEW]] =
 
== [[ML_class#Matrices_and_Vectors|Matrices and Vectors]] ==
 
* '''Matrix''': a rectangular array of numbers
* '''Dimension of Matrix''': number of rows by number of columns
** <math>\mathbb R^{4\times2}</math> = 4 rows and 2 columns
** <math>\mathbb R^{2\times3}</math> = 2 rows and 3 columns
* Matrix elements:
** <em>A<sub>ij</sub></em> = "<em>i,j</em> entry" in the <em>i<sup>th</sup></em> row, <em>j<sup>th</sup></em> column
** <math>A = \begin{bmatrix}1402 & 191 \\1371 & 821 \\949 & 1437 \\147 & 1448 \end{bmatrix}</math>
*** <em>A<sub>11</sub></em> = 1402
*** <em>A<sub>12</sub></em> = 191
*** <em>A<sub>32</sub></em> = 1437
*** <em>A<sub>41</sub></em> = 147
*** <em>A<sub>43</sub></em> = undefined (error)
* '''Vector''': an n x 1 matrix
** <math>y = \begin{bmatrix}460 \\232 \\315 \\178 \end{bmatrix}</math>
*** <math>\mathbb R^4</math>: n = 4; 4-dimensional vector
** <em>y<sub>i</sub></em> = <em>i<sup>th</sup></em> element
** 1-indexed vs 0-indexed
* Notation (generally):
** A,B,C,X = capital = matrix
** a,b,x,y = lower case = vector or scalar
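
The same matrix and vector in Octave, as a quick sketch of the indexing:

<pre>
A = [1402 191; 1371 821; 949 1437; 147 1448];   % a 4x2 matrix
size(A)        % ans = 4 2
A(1,2)         % ans = 191   (A_12)
A(3,2)         % ans = 1437  (A_32)
% A(4,3) is an error: the matrix has no third column.

y = [460; 232; 315; 178];   % an n x 1 matrix, i.e. a 4-dimensional vector
y(1)           % ans = 460  (Octave vectors are 1-indexed)
</pre>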
 
== [[ML_class#Addition_and_Scalar_Multiplication|Addition and Scalar Multiplication]] ==
 
* Matrix addition:
** <math>\begin{bmatrix}1 & 0 \\2 & 5 \\3 & 1\end{bmatrix} + \begin{bmatrix}4 & 0.5 \\2 & 5 \\0 & 1\end{bmatrix} = \begin{bmatrix}5 & 0.5 \\4 & 10 \\3 & 2\end{bmatrix}</math>
*** 3x2 matrix + 3x2 matrix = 3x2 matrix
** <math>\begin{bmatrix}1 & 0 \\2 & 5 \\3 & 1\end{bmatrix} + \begin{bmatrix}4 & 0.5 \\2 & 5\end{bmatrix} = error</math>
*** 3x2 matrix + 2x2 matrix = error
* Scalar multiplication:
** <math>3 \times \begin{bmatrix}1 & 0 \\2 & 5 \\3 & 1\end{bmatrix} = \begin{bmatrix}3 & 0 \\6 & 15 \\9 & 3\end{bmatrix} = \begin{bmatrix}1 & 0 \\2 & 5 \\3 & 1\end{bmatrix} \times 3</math>
*** scalar x 3x2 matrix = 3x2 matrix = 3x2 matrix x scalar
** <math>\begin{bmatrix}4 & 0 \\6 & 3\end{bmatrix} / 4 = \frac{1}{4}\begin{bmatrix}4 & 0 \\6 & 3\end{bmatrix} = \begin{bmatrix}1 & 0 \\\frac{3}{2} & \frac{3}{4}\end{bmatrix}</math>
* Combination of operands:
** <math>\begin{align}
3 \times \begin{bmatrix}1 \\4 \\2 \end{bmatrix} + \begin{bmatrix}0 \\0 \\5 \end{bmatrix} - \begin{bmatrix}3 \\0 \\2 \end{bmatrix} / 3
&= \begin{bmatrix}3 \\12 \\6 \end{bmatrix} + \begin{bmatrix}0 \\0 \\5 \end{bmatrix} - \begin{bmatrix}1 \\0 \\\frac{2}{3} \end{bmatrix} \\
&= \begin{bmatrix}2 \\12 \\10\frac{1}{3} \end{bmatrix}
\end{align}</math>
*** 3x1 matrix = 3-dimensional vector
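
The same operations in Octave, as a quick check:

<pre>
% Matrix addition (dimensions must match):
[1 0; 2 5; 3 1] + [4 0.5; 2 5; 0 1]     % ans = [5 0.5; 4 10; 3 2]
% [1 0; 2 5; 3 1] + [4 0.5; 2 5] is a dimension-mismatch error (3x2 + 2x2)

% Scalar multiplication and division:
3 * [1 0; 2 5; 3 1]                     % ans = [3 0; 6 15; 9 3]
[4 0; 6 3] / 4                          % ans = [1 0; 1.5 0.75]

% Combination of operations:
3*[1; 4; 2] + [0; 0; 5] - [3; 0; 2]/3   % ans = [2; 12; 10.333]
</pre>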
 
== [[ML_class#Matrix_Vector_Multiplication|Matrix Vector Multiplication]] ==
 
* <math>\begin{bmatrix}1 & 3 \\4 & 0 \\2 & 1\end{bmatrix} \begin{bmatrix}1 \\5 \end{bmatrix} = \begin{bmatrix}16 \\4 \\7 \end{bmatrix}</math>
** 1x1 + 3x5 = 16
** 4x1 + 0x5 = 4
** 2x1 + 1x5 = 7
** 3x2 matrix * 2x1 matrix = 3x1 matrix
* prediction = DataMatrix x parameters
** <em>h</em><sub>Θ</sub>(x) = -40 + 0.25x, i.e. parameters <math>\begin{bmatrix}-40 \\0.25 \end{bmatrix}</math>
** <math>\begin{bmatrix}1 & 2104 \\1 & 1416 \\1 & 1534 \\1 & 852 \end{bmatrix} \times \begin{bmatrix}-40 \\0.25 \end{bmatrix} = \begin{bmatrix}-40\times1 + 0.25 \times 2104 \\-40\times1 + 0.25 \times 1416 \\-40\times1 + 0.25 \times 1534 \\-40\times1 + 0.25 \times 852 \\ \end{bmatrix}</math>
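
The prediction example in Octave, as a quick check (the variable names X and theta are mine):

<pre>
% Matrix-vector multiplication: (3x2) * (2x1) = (3x1)
[1 3; 4 0; 2 1] * [1; 5]                   % ans = [16; 4; 7]

% prediction = DataMatrix * parameters, for h_theta(x) = -40 + 0.25*x:
X     = [1 2104; 1 1416; 1 1534; 1 852];   % leading column of ones multiplies the intercept -40
theta = [-40; 0.25];
X * theta                                  % ans = [486; 314; 343.5; 173]
</pre>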
