ML class overview

This is an overview in point form of the content in the ML class.

INTRODUCTION

Examples of machine learning

      • Database mining (Large datasets from growth of automation/web)
        • clickstream data
        • medical records
        • biology
        • engineering
      • Applications that can't be programmed by hand
        • autonomous helicopter
        • handwriting recognition
        • most of Natural Language Processing (NLP)
        • Computer Vision
      • Self-customising programs
        • Amazon
        • Netflix product recommendations
      • Understanding human learning (brain, real AI)

What is Machine Learning?

      • Definitions of Machine Learning
        • Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
        • Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
      • There are several different types of ML algorithms. The two main types are:
        • Supervised learning
          • teach computer how to do something
        • Unsupervised learning
          • computer learns by itself
      • Other types of algorithms are:
        • Reinforcement learning
        • Recommender systems

Supervised Learning

      • Supervised Learning in which the "right answers" are given
        • Regression: predict continuous valued output (e.g. price)
        • Classification: discrete valued output (e.g. 0 or 1)
    • Unsupervised Learning
      • Unsupervised Learning in which the categories are unknown
        • Clustering: cluster patterns (categories) are found in the data
        • Cocktail party problem: overlapping audio tracks are separated out
          • [W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');

LINEAR REGRESSION WITH ONE VARIABLE

Model Representation

      • e.g. housing prices, price per square-foot
        • Supervised Learning
        • Regression
      • Dataset called training set
      • Notation:
        • m = number of training examples
        • x's = "input" variable / features
        • y's = "output" variable / "target" variable
        • (x, y) = one training example
        • (x(i), y(i)) = the ith training example
      • Training Set -> Learning Algorithm -> h (hypothesis)
      • Size of house (x) -> h -> Estimated price (y)
        • h maps from x's to y's
      • How do we represent h?
        • hΘ(x) = h(x) = Θ0 + Θ1x
      • Linear regression with one variable (x)
        • Univariate linear regression
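
A minimal Octave sketch of the notation and hypothesis above. The numbers and variable names are illustrative only, not the class data set:

  % toy training set: size of house (x) -> price (y)
  x = [1000; 1500; 2000];       % "input" variable / feature, one row per training example
  y = [200; 280; 370];          % "output" / "target" variable
  m = length(y);                % m = 3 training examples
  theta0 = 10;  theta1 = 0.18;  % one choice of parameters
  h = theta0 + theta1 * x;      % hypothesis h_theta(x) = theta0 + theta1*x, maps x's to estimated y's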

Cost Function

      • Helps us figure out how to fit the best possible straight line to our data
      • hΘ(x) = Θ0 + Θ1x
      • Θi's: Parameters
      • How to choose parameters (Θi's)?
        • Choose Θ0, Θ1 so that hΘ(x) is close to y for our training examples (x,y)
        • Minimise for Θ0, Θ1
          • hΘ(x(i)) = Θ0 + Θ1x(i)
        • J(Θ0,Θ1) = \frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
        • J(Θ0,Θ1) is the Cost Function, also known in this case as the Squared Error Function (http://en.wikipedia.org/wiki/Mean_squared_error)
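
A minimal Octave sketch of the cost function above, using toy data (values and names are illustrative):

  x = [1; 2; 3];                % toy inputs
  y = [1; 2.5; 3.5];            % toy targets
  m = length(y);
  theta0 = 0;  theta1 = 1;      % candidate straight line h(x) = theta0 + theta1*x
  h = theta0 + theta1 * x;      % predictions for all m training examples
  J = (1 / (2*m)) * sum((h - y).^2);   % squared error cost J(theta0, theta1)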

Cost Function - Intuition I

      • Summary:
        • Hypothesis: hΘ(x) = Θ0 + Θ1x
        • Parameters: Θ0, Θ1
        • Cost Function: J(Θ0,Θ1) = \frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
        • Goal: minimise J(Θ0,Θ1) over Θ0, Θ1
      • Simplified:
        • hΘ(x) = Θ1x
        • minimise J(Θ1) over Θ1
      • Can plot simplified model in 2D
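
A minimal Octave sketch of the simplified 2D plot mentioned above: sweep Θ1 and plot J(Θ1), with Θ0 fixed at 0 (toy data, illustrative):

  x = [1; 2; 3];  y = [1; 2; 3];           % toy data lying exactly on y = x
  m = length(y);
  theta1_vals = -0.5:0.1:2.5;              % range of theta1 values to try
  J_vals = zeros(size(theta1_vals));
  for i = 1:length(theta1_vals)
    h = theta1_vals(i) * x;                % simplified hypothesis h(x) = theta1*x
    J_vals(i) = (1/(2*m)) * sum((h - y).^2);
  end
  plot(theta1_vals, J_vals);               % bowl-shaped curve, minimum at theta1 = 1
  xlabel('theta_1');  ylabel('J(theta_1)');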

Cost Function - Intuition II

      • Can plot J(Θ0,Θ1) in 3D
      • Can plot with Contour Map (Contour Plot) (see the plotting sketch after this list)
    • Gradient Descent
      • repeat until convergence { \theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1) }
      • α = learning rate
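
A minimal Octave sketch of the 3D surface and contour plots of J(Θ0,Θ1) mentioned above (toy data, illustrative):

  x = [1; 2; 3];  y = [2; 3; 4];           % toy data lying on y = 1 + x
  m = length(y);
  t0 = linspace(-2, 4, 60);                % grid of theta0 values
  t1 = linspace(-2, 4, 60);                % grid of theta1 values
  J = zeros(length(t0), length(t1));
  for i = 1:length(t0)
    for j = 1:length(t1)
      h = t0(i) + t1(j) * x;
      J(i, j) = (1/(2*m)) * sum((h - y).^2);
    end
  end
  surf(t1, t0, J);                         % 3D bowl
  figure;
  contour(t1, t0, J, 30);                  % contour map, minimum near (theta0, theta1) = (1, 1)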

Gradient Descent Intuition

      • min over Θ1 of J(Θ1):
        • For Θ1 > the local minimum: \frac{d}{d\theta_1}J(\theta_1) is positive, so the update moves Θ1 toward the local minimum
        • For Θ1 < the local minimum: \frac{d}{d\theta_1}J(\theta_1) is negative, so the update moves Θ1 toward the local minimum
      • If the learning rate α is too small, the algorithm takes a long time to run
      • If the learning rate α is too large, the algorithm may not converge, or may even diverge
      • When the derivative is zero (at a local minimum), the update leaves Θ1 unchanged, so Θ1 has converged
      • As we approach a local minimum, gradient descent automatically takes smaller steps
        • So no need to decrease α over time

Gradient Descent for Linear Regression

      • Gradient descent algorithm:
        • repeat until convergence {
          • \theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta_0,\theta_1)   for j=0 and j=1 (updated simultaneously)
        • }
      • Linear Regression Model:
        • hΘ(x) = Θ0 + Θ1x
        • J(Θ0,Θ1) = \frac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
        • min over Θ0, Θ1 of J(Θ0,Θ1)
      • Partial derivatives:
        • j=0: \frac{\partial}{\partial\theta_0} J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})
        • j=1: \frac{\partial}{\partial\theta_1} J(\theta_0,\theta_1) = \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}
      • Gradient descent algorithm, with the partial derivatives substituted in (see the Octave sketch after this list):
        • repeat until convergence {
          • \theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})
          • \theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}
        • }
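
A minimal Octave sketch of the algorithm above, using a fixed iteration count rather than a convergence test (toy data, illustrative):

  x = [1; 2; 3];  y = [1; 2.5; 3.5];       % toy training set
  m = length(y);
  theta0 = 0;  theta1 = 0;                 % initial parameters
  alpha = 0.1;                             % learning rate
  for iter = 1:1500
    h = theta0 + theta1 * x;               % current predictions
    temp0 = theta0 - alpha * (1/m) * sum(h - y);          % j = 0 update
    temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x);   % j = 1 update
    theta0 = temp0;  theta1 = temp1;       % simultaneous update
  end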

What's Next

      • Two extensions:
        • In min J(Θ0,Θ1), solve for Θ0, Θ1 exactly without needing an iterative algorithm (gradient descent)
        • Learn with a larger number of features
      • Linear Algebra topics:
        • What are matrices and vectors
        • Addition, subtraction and multiplication with matrices and vectors
        • Matrix inverse, transpose
      • \begin{bmatrix}1 & 2 \\ 0 & 1\end{bmatrix}^T \begin{bmatrix}1 & 2 \\ 0 & 1\end{bmatrix}
      • \begin{bmatrix}2 & 0 \\ 0 & 1\end{bmatrix}^{-1} \begin{bmatrix}2 \\ 1\end{bmatrix} - 3\begin{bmatrix}1 \\ 5\end{bmatrix}
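
A minimal Octave check of the two expressions above (illustrative):

  A = [1 2; 0 1];
  A' * A                                   % transpose of A times A: ans = [1 2; 2 5]
  B = [2 0; 0 1];
  inv(B) * [2; 1] - 3 * [1; 5]             % ans = [-2; -14]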

LINEAR ALGEBRA REVIEW

Matrices and Vectors

      • Matrix: a rectangular array of numbers
      • Dimension of Matrix: number of rows by number of columns
        • e.g. a matrix A with 4 rows and 2 columns is a 4 x 2 matrix (the element examples below refer to this matrix)
        • e.g. a matrix with 2 rows and 3 columns is a 2 x 3 matrix
      • Matrix elements:
        • Aij = "i,j entry" in the ith row, jth column
          • A11 = 1402
          • A12 = 191
          • A32 = 1437
          • A41 = 147
          • A43 = undefined (error)
      • Vector: an n x 1 matrix
        • e.g. a 4 x 1 vector y: n = 4, so a 4-dimensional vector
        • yi = ith element
        • 1-indexed vs 0-indexed
      • Notation (generally):
        • A,B,C,X = capital = matrix
        • a,b,x,y = lower case = vector or scalar
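
A minimal Octave sketch of dimensions and element indexing (toy values, not the lecture's example matrix):

  A = [1 2; 3 4; 5 6; 7 8];                % a 4 x 2 matrix
  size(A)                                  % ans = 4 2  (rows by columns)
  A(3, 2)                                  % A_{3,2}: 3rd row, 2nd column -> 6
  % A(4, 3) would be an error: a 4 x 2 matrix has no 3rd column
  y = [1; 2; 3; 4];                        % a 4 x 1 matrix, i.e. a 4-dimensional vector
  y(1)                                     % Octave/MATLAB vectors are 1-indexed, so this is the first element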

Addition and Scalar Multiplication

      • \begin{bmatrix}1 & 0 \\ 2 & 5 \\ 3 & 1\end{bmatrix} + \begin{bmatrix}4 & 0.5 \\ 2 & 5 \\ 0 & 1\end{bmatrix} = \begin{bmatrix}5 & 0.5 \\ 4 & 10 \\ 3 & 2\end{bmatrix}
        • 3x2 matrix + 3x2 matrix = 3x2 matrix
      • \begin{bmatrix}1 & 0 \\ 2 & 5 \\ 3 & 1\end{bmatrix} + \begin{bmatrix}4 & 0.5 \\ 2 & 5\end{bmatrix} = error
        • 3x2 matrix + 2x2 matrix = error
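
A minimal Octave sketch of the addition examples above, plus scalar multiplication (illustrative):

  A = [1 0; 2 5; 3 1];
  B = [4 0.5; 2 5; 0 1];
  A + B                                    % ans = [5 0.5; 4 10; 3 2]  (element-wise, both 3 x 2)
  3 * A                                    % scalar multiplication: every entry multiplied by 3
  C = [4 0.5; 2 5];
  % A + C is an error: a 3 x 2 matrix and a 2 x 2 matrix cannot be added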