ML class overview
This is an overview in point form of the content in the ML class.
INTRODUCTION
Examples of machine learning
- Database mining (Large datasets from growth of automation/web)
  - clickstream data
  - medical records
  - biology
  - engineering
- Applications that can't be programmed by hand
  - autonomous helicopter
  - handwriting recognition
  - most of Natural Language Processing (NLP)
  - Computer Vision
- Self-customising programs
  - Amazon
  - Netflix product recommendations
- Understanding human learning (brain, real AI)
What is Machine Learning?
- Definitions of Machine Learning
  - Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
  - Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
- There are several different types of ML algorithms. The two main types are:
  - Supervised learning
    - teach the computer how to do something
  - Unsupervised learning
    - the computer learns by itself
- Other types of algorithms are:
  - Reinforcement learning
  - Recommender systems
Supervised Learning
- Supervised Learning, in which the "right answers" are given
  - Regression: predict continuous valued output (e.g. price)
  - Classification: discrete valued output (e.g. 0 or 1)
- Unsupervised Learning
  - Unsupervised Learning, in which the categories are unknown
    - Clustering: cluster patterns (categories) are found in the data
    - Cocktail party problem: overlapping audio tracks are separated out
      - Octave one-liner: [W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
LINEAR REGRESSION WITH ONE VARIABLE
Model Representation
- e.g. housing prices, price per square-foot
  - Supervised Learning
  - Regression
- Dataset is called the training set
- Notation:
  - m = number of training examples
  - x's = "input" variable / features
  - y's = "output" variable / "target" variable
  - (x, y) = one training example
  - (x^(i), y^(i)) = ith training example
- Training Set -> Learning Algorithm -> h (hypothesis)
  - Size of house (x) -> h -> Estimated price (y)
  - h maps from x's to y's
- How do we represent h? (see the sketch below)
  - hΘ(x) = h(x) = Θ0 + Θ1x
- Linear regression with one variable (x)
  - Univariate linear regression
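A minimal Python/NumPy sketch of this notation and hypothesis (the house sizes, prices and parameter values below are made-up illustrations, not data from the class):

    import numpy as np

    # Hypothetical training set: x = size of house, y = price (in $1000s).
    x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # "input" variable / feature
    y = np.array([460.0, 232.0, 315.0, 178.0])      # "output" / "target" variable
    m = len(x)                                       # m = number of training examples

    def h(theta0, theta1, x):
        # Hypothesis h_theta(x) = theta0 + theta1 * x (univariate linear regression).
        return theta0 + theta1 * x

    # h maps a house size x to an estimated price y.
    print(h(50.0, 0.2, x[0]))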
Cost Function
- Helps us figure out how to fit the best possible straight line to our data
- hΘ(x) = Θ0 + Θ1x
  - Θi's: Parameters
- How to choose parameters (Θi's)?
  - Choose Θ0, Θ1 so that hΘ(x) is close to y for our training examples (x, y)
- Minimise over Θ0, Θ1:
  - hΘ(x^(i)) = Θ0 + Θ1x^(i)
  - J(Θ0,Θ1) = (1/(2m)) Σ_{i=1..m} (hΘ(x^(i)) - y^(i))^2
  - J(Θ0,Θ1) is the Cost Function, also known in this case as the Squared Error Function (see the sketch below)
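A short sketch of the squared error cost function above (the compute_cost name and the tiny training set are illustrative assumptions):

    import numpy as np

    def compute_cost(theta0, theta1, x, y):
        # J(theta0, theta1) = (1/(2m)) * sum over i of (h_theta(x^(i)) - y^(i))^2
        m = len(x)
        predictions = theta0 + theta1 * x      # h_theta(x^(i)) for every example
        return np.sum((predictions - y) ** 2) / (2 * m)

    # Lower cost means the straight line fits the training examples better.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])
    print(compute_cost(0.0, 1.0, x, y))   # 0.0 -> perfect fit (y = x)
    print(compute_cost(0.0, 0.5, x, y))   # > 0 -> worse fit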
Cost Function - Intuition I
- Summary:
  - Hypothesis: hΘ(x) = Θ0 + Θ1x
  - Parameters: Θ0, Θ1
  - Cost Function: J(Θ0,Θ1) = (1/(2m)) Σ_{i=1..m} (hΘ(x^(i)) - y^(i))^2
  - Goal: minimise J(Θ0,Θ1) over Θ0, Θ1
- Simplified:
  - hΘ(x) = Θ1x
  - minimise J(Θ1) over Θ1
- Can plot the simplified model in 2D (J(Θ1) against Θ1; see the sketch below)
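A small sketch of that 2D picture for the simplified model hΘ(x) = Θ1x: it sweeps a range of Θ1 values and prints J(Θ1) (the data and the sweep range are illustrative; plotting these pairs gives the bowl-shaped curve):

    import numpy as np

    # Illustrative training set where the true relationship is y = x, so theta1 = 1 is optimal.
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])
    m = len(x)

    for theta1 in np.linspace(0.0, 2.0, 5):
        cost = np.sum((theta1 * x - y) ** 2) / (2 * m)   # J(theta1) for h(x) = theta1 * x
        print(f"theta1 = {theta1:4.1f}   J(theta1) = {cost:.3f}")
    # The cost is smallest at theta1 = 1.0, the bottom of the bowl.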
Cost Function - Intuition II
- Can plot J(Θ0,Θ1) in 3D
- Can plot with a Contour Map (Contour Plot)
- Gradient Descent
  - repeat until convergence { Θj := Θj - α (∂/∂Θj) J(Θ0,Θ1) }
  - α = learning rate
Gradient Descent Intuition
- min over Θ1 of J(Θ1)
  - For Θ1 > local minimum: (d/dΘ1) J(Θ1) is positive, so the update moves Θ1 toward the local minimum
  - For Θ1 < local minimum: (d/dΘ1) J(Θ1) is negative, so the update moves Θ1 toward the local minimum
- If the learning rate α is too small, the algorithm takes a long time to run
- If the learning rate α is too large, the algorithm may not converge or may diverge (see the sketch after this list)
- When the partial derivative is zero, Θ1 has converged
- As we approach a local minimum, gradient descent automatically takes smaller steps
  - So no need to decrease α over time
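A toy sketch of the learning rate intuition, using the simple one-parameter cost J(Θ1) = Θ1^2 (an assumed stand-in function, not the course's cost) so the effect of α is easy to see:

    def gradient_descent_1d(alpha, theta1=5.0, steps=10):
        # Repeat theta1 := theta1 - alpha * dJ/dtheta1, with J(theta1) = theta1**2 so dJ/dtheta1 = 2*theta1.
        for _ in range(steps):
            theta1 = theta1 - alpha * 2 * theta1
        return theta1

    print(gradient_descent_1d(alpha=0.01))   # ~4.1: too small, barely moves toward the minimum at 0
    print(gradient_descent_1d(alpha=0.1))    # ~0.5: reasonable, approaching the minimum
    print(gradient_descent_1d(alpha=1.5))    # ~5120: too large, each step overshoots and it diverges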
Gradient Descent for Linear Regression
- Gradient descent algorithm:
  - repeat until convergence {
    - Θj := Θj - α (∂/∂Θj) J(Θ0,Θ1)   (for j=0 and j=1)
  - }
- Linear Regression Model:
  - hΘ(x) = Θ0 + Θ1x
  - J(Θ0,Θ1) = (1/(2m)) Σ_{i=1..m} (hΘ(x^(i)) - y^(i))^2
  - minimise J(Θ0,Θ1) over Θ0, Θ1
- Partial derivatives:
  - j=0: (∂/∂Θ0) J(Θ0,Θ1) = (1/m) Σ_{i=1..m} (hΘ(x^(i)) - y^(i))
  - j=1: (∂/∂Θ1) J(Θ0,Θ1) = (1/m) Σ_{i=1..m} (hΘ(x^(i)) - y^(i)) x^(i)
- Gradient descent algorithm (derivatives plugged in; update Θ0 and Θ1 simultaneously; see the sketch below):
  - repeat until convergence {
    - Θ0 := Θ0 - α (1/m) Σ_{i=1..m} (hΘ(x^(i)) - y^(i))
    - Θ1 := Θ1 - α (1/m) Σ_{i=1..m} (hΘ(x^(i)) - y^(i)) x^(i)
  - }
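A minimal Python/NumPy sketch of these batch gradient descent updates with a simultaneous update of Θ0 and Θ1 (the training data, α and iteration count are illustrative choices, not values from the class):

    import numpy as np

    def gradient_descent(x, y, alpha=0.1, iterations=1000):
        # Batch gradient descent for h_theta(x) = theta0 + theta1 * x.
        m = len(x)
        theta0, theta1 = 0.0, 0.0
        for _ in range(iterations):
            error = (theta0 + theta1 * x) - y        # h_theta(x^(i)) - y^(i)
            grad0 = np.sum(error) / m                # dJ/dtheta0
            grad1 = np.sum(error * x) / m            # dJ/dtheta1
            # Simultaneous update: both gradients were computed from the old theta0, theta1.
            theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        return theta0, theta1

    # Illustrative data generated from y = 2 + 3x; gradient descent should recover roughly (2, 3).
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = 2.0 + 3.0 * x
    theta0, theta1 = gradient_descent(x, y)
    print(theta0, theta1)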
What's Next
- Two extensions:
  - In min J(Θ0,Θ1), solve for Θ0, Θ1 exactly without needing an iterative algorithm (gradient descent)
  - Learn with a larger number of features
- Linear Algebra topics:
  - What are matrices and vectors
  - Addition, subtraction and multiplication with matrices and vectors
  - Matrix inverse, transpose
  - Example expressions:
    - [1 2; 0 1]^T [1 2; 0 1]
    - [2 0; 0 1]^(-1) [2; 1] - 3 [1; 5]
LINEAR ALGEBRA REVIEW
Matrices and Vectors
- Matrix: a rectangular array of numbers
- Dimension of Matrix: number of rows by number of columns
  - e.g. a 4x2 matrix = 4 rows and 2 columns
  - e.g. a 2x3 matrix = 2 rows and 3 columns
- Matrix elements:
  - Aij = "i,j entry" in the ith row, jth column
  - For an example matrix A:
    - A11 = 1402
    - A12 = 191
    - A32 = 1437
    - A41 = 147
    - A43 = undefined (error)
- Vector: an n x 1 matrix
  - e.g. a column vector y with n = 4 is a 4-dimensional vector
  - yi = ith element
  - 1-indexed vs 0-indexed
- Notation (generally):
  - A,B,C,X = capital = matrix
  - a,b,x,y = lower case = vector or scalar
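A short NumPy sketch of these dimension and element-access conventions (the matrix and vector values are illustrative); note that the Aij notation above is 1-indexed while NumPy arrays are 0-indexed:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0],
                  [7.0, 8.0]])    # a 4x2 matrix: 4 rows, 2 columns
    print(A.shape)                # (4, 2) = dimension of the matrix

    # A_ij in 1-indexed math notation is A[i-1, j-1] in 0-indexed NumPy.
    print(A[0, 1])                # A12
    # A[3, 2] would raise IndexError: a 2-column matrix has no third column (A43 = undefined).

    y = np.array([[460.0], [232.0], [315.0], [178.0]])   # a vector: an n x 1 matrix, here n = 4
    print(y.shape)                # (4, 1)
    print(y[1, 0])                # y2 in 1-indexed notation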
Addition and Scalar Multiplication
- [1 0; 2 5; 3 1] + [4 0.5; 2 5; 0 1] = [5 0.5; 4 10; 3 2]
  - 3x2 matrix + 3x2 matrix = 3x2 matrix
- [1 0; 2 5; 3 1] + [4 0.5; 2 5] = error
  - 3x2 matrix + 2x2 matrix = error
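The same addition rules in a NumPy sketch (the mismatched-dimension case raises an exception rather than returning a value), plus a scalar multiplication example since this section covers both:

    import numpy as np

    A = np.array([[1.0, 0.0], [2.0, 5.0], [3.0, 1.0]])   # 3x2
    B = np.array([[4.0, 0.5], [2.0, 5.0], [0.0, 1.0]])   # 3x2
    print(A + B)             # element-wise sum: [[5, 0.5], [4, 10], [3, 2]]

    C = np.array([[4.0, 0.5], [2.0, 5.0]])               # 2x2
    try:
        A + C                # 3x2 + 2x2 is not defined
    except ValueError as err:
        print("error:", err)

    print(3 * A)             # scalar multiplication: every entry is multiplied by 3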