ML class overview
This is an overview in point form of the content in the [[ML class]].
= [[ML_class#INTRODUCTION|INTRODUCTION]] =
== [[ML_class#Welcome|Examples of machine learning]] ==
* Database mining (large datasets from the growth of automation/web)
** clickstream data
** medical records
** biology
** engineering
* Applications that can't be programmed by hand
** autonomous helicopter
** handwriting recognition
** most of Natural Language Processing (NLP)
** Computer Vision
* Self-customising programs
** Amazon
** Netflix product recommendations
* Understanding human learning (brain, real AI)
== [[ML_class#What_is_Machine_Learning.3F|What is Machine Learning?]] ==
* Definitions of Machine Learning
** Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
** Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
* There are several different types of ML algorithms. The two main types are:
** Supervised learning
*** teach the computer how to do something
** Unsupervised learning
*** the computer learns by itself
* Other types of algorithms are:
** Reinforcement learning
** Recommender systems
== [[ML_class#Supervised_Learning|Supervised Learning]] ==
* Supervised Learning: learning in which the "right answers" are given
** '''Regression''': predict a continuous-valued output (e.g. price)
** '''Classification''': predict a discrete-valued output (e.g. 0 or 1)
== [[ML_class#Unsupervised_Learning|Unsupervised Learning]] ==
* Unsupervised Learning: learning in which the categories are unknown
** '''Clustering''': cluster patterns (categories) are found in the data
** '''Cocktail party problem''': overlapping audio tracks are separated out, in one line of Octave (annotated version below):
*** <code>[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');</code>
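The one-liner above is the actual Octave line shown in lecture; below is an annotated version (the comments and the assumed shape of <code>x</code> are mine, not from the class):
<pre>
% Cocktail-party source separation in one line (annotations are assumptions).
% Assumed input: x holds the mixed recordings, one row per microphone.
[W, s, v] = svd((repmat(sum(x .* x, 1), size(x, 1), 1) .* x) * x');
% svd() factors the weighted covariance matrix built from the recordings;
% W is then used to unmix the overlapping audio tracks.
</pre>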
= [[ML_class#LINEAR_REGRESSION_WITH_ONE_VARIABLE|LINEAR REGRESSION WITH ONE VARIABLE]] =
== [[ML_class#Model_Representation|Model Representation]] ==
* e.g. housing prices, price per square-foot
** Supervised Learning
** Regression
* Dataset called '''training set'''
* Notation:
{| class="wikitable"
| '''m''' || number of training examples
|-
| '''x''''s || "input" variable / features
|-
| '''y''''s || "output" variable / "target" variable
|-
| '''(x,y)''' || one training example
|-
| '''(x<sup>(i)</sup>,y<sup>(i)</sup>)''' || i<sup>th</sup> training example
|}
* Training Set -> Learning Algorithm -> <em>h</em> (hypothesis)
* Size of house (x) -> <em>h</em> -> Estimated price (y)
** <em>h</em> maps from '''x''''s to '''y''''s
* How do we represent <em>h</em>?
** <em>h</em><sub>Θ</sub>(x) = <em>h</em>(x) = Θ<sub>0</sub> + Θ<sub>1</sub>x (evaluated in the Octave sketch below)
* Linear regression with one variable (x)
** Univariate linear regression
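As a minimal sketch (mine, not from the lecture), the hypothesis can be evaluated for a whole training set at once; the parameter values and house sizes are the ones used in the later Matrix Vector Multiplication example:
<pre>
% Evaluate h_theta(x) = theta0 + theta1*x for every training example.
theta0 = -40;                   % example intercept
theta1 = 0.25;                  % example slope (price per square foot)
x = [2104; 1416; 1534; 852];    % sizes of houses
h = theta0 + theta1 .* x        % estimated prices: [486; 314; 343.5; 173]
</pre>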
== [[ML_class#Cost_Function|Cost Function]] ==
* Helps us figure out how to fit the best possible straight line to our data
* <em>h<sub>Θ</sub></em>(x) = <em>Θ</em><sub>0</sub> + <em>Θ</em><sub>1</sub>x
* <em>Θ<sub>i</sub></em>'s: Parameters
* How to choose the parameters (<em>Θ<sub>i</sub></em>'s)?
** Choose <em>Θ</em><sub>0</sub>, <em>Θ</em><sub>1</sub> so that <em>h<sub>Θ</sub></em>(x) is close to <em>y</em> for our training examples (<em>x</em>,<em>y</em>)
** Minimise <math>\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2</math> over <em>Θ</em><sub>0</sub>, <em>Θ</em><sub>1</sub>
*** <em>h<sub>Θ</sub></em>(x<sup>(i)</sup>) = <em>Θ</em><sub>0</sub> + <em>Θ</em><sub>1</sub>x<sup>(i)</sup>
** J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) = <math>\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2</math>
** J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) is the Cost Function, also known in this case as the [http://en.wikipedia.org/wiki/Mean_squared_error Squared Error Function]
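A short Octave sketch of the squared error function (the function and variable names are my own illustration):
<pre>
% J(theta0, theta1) = (1/(2m)) * sum over i of (h(x_i) - y_i)^2
function J = compute_cost(x, y, theta0, theta1)
  m = length(y);              % number of training examples
  h = theta0 + theta1 .* x;   % hypothesis evaluated on all examples
  J = sum((h - y) .^ 2) / (2 * m);
end
</pre>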
== [[ML_class#Cost_Function_-_Intuition_I|Cost Function - Intuition I]] ==
* Summary:
** '''Hypothesis''': <em>h<sub>Θ</sub></em>(x) = <em>Θ</em><sub>0</sub> + <em>Θ</em><sub>1</sub>x
** '''Parameters''': <em>Θ</em><sub>0</sub>, <em>Θ</em><sub>1</sub>
** '''Cost Function''': J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) = <math>\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2</math>
** '''Goal''': <math>\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)</math>
* Simplified:
** <em>h<sub>Θ</sub></em>(x) = <em>Θ</em><sub>1</sub>x
** <math>\min_{\theta_1} J(\theta_1)</math>
* Can plot the simplified model in 2D (see the sketch below)
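A sketch of that 2D plot in Octave (my own toy data, chosen so the minimum sits at Θ<sub>1</sub> = 1):
<pre>
% Plot J(theta1) for the simplified hypothesis h(x) = theta1 * x.
x = [1; 2; 3];  y = [1; 2; 3];           % toy data fit exactly by theta1 = 1
theta1_vals = linspace(-0.5, 2.5, 101);
J_vals = arrayfun(@(t) sum((t .* x - y) .^ 2) / (2 * length(y)), theta1_vals);
plot(theta1_vals, J_vals);               % bowl-shaped curve
xlabel('theta_1');  ylabel('J(theta_1)');
</pre>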
== [[ML_class#Cost_Function_-_Intuition_II|Cost Function - Intuition II]] ==
* Can plot J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) in 3D
* Can plot with a Contour Map (Contour Plot)
== [[ML_class#Gradient_Descent|Gradient Descent]] ==
* repeat until convergence { <math>\theta_j := \theta_j - \alpha\frac{\part}{\part\theta_j}J(\theta_0,\theta_1)</math> }
* α = learning rate
== [[ML_class#Gradient_Descent_Intuition|Gradient Descent Intuition]] ==
* <math>\min_{\theta_1} J(\theta_1)</math>
** For Θ<sub>1</sub> > local minimum: <math>\frac{d}{d\theta_1}J(\theta_1)</math> is positive, so Θ<sub>1</sub> moves toward the local minimum
** For Θ<sub>1</sub> < local minimum: <math>\frac{d}{d\theta_1}J(\theta_1)</math> is negative, so Θ<sub>1</sub> moves toward the local minimum
* If the learning rate α is too small, the algorithm takes a long time to run
* If the learning rate α is too large, the algorithm may fail to converge, or may even diverge
* At a local minimum the derivative is zero, so Θ<sub>1</sub> stops changing: it has converged
* As we approach a local minimum, gradient descent automatically takes smaller steps (see the numeric illustration below)
** So there is no need to decrease α over time
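A tiny numeric illustration of those last two points (mine, not from the lecture), using J(Θ<sub>1</sub>) = (Θ<sub>1</sub> - 3)<sup>2</sup> with a fixed α:
<pre>
% Gradient descent on J(t) = (t - 3)^2, whose derivative is 2*(t - 3).
alpha = 0.1;
t = 0;
for iter = 1:25
  step = alpha * 2 * (t - 3);   % shrinks as the slope shrinks near t = 3
  t = t - step;
end
disp(t)                         % close to the minimum at t = 3
</pre>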
== [[ML_class#Gradient_Descent_for_Linear_Regression|Gradient Descent for Linear Regression]] ==
* Gradient descent algorithm:
** repeat until convergence {
*** <math>\theta_j := \theta_j - \alpha\frac{\part}{\part\theta_j}J(\theta_0,\theta_1)</math>
*** for j=0 and j=1, updating Θ<sub>0</sub> and Θ<sub>1</sub> simultaneously
** }
* Linear Regression Model:
** <em>h<sub>Θ</sub></em>(x) = <em>Θ</em><sub>0</sub> + <em>Θ</em><sub>1</sub>x
** J(<em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub>) = <math>\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2</math>
** <math>\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)</math>
* Partial derivatives:
** j=0: <math>\frac{\part}{\part\theta_0}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})</math>
** j=1: <math>\frac{\part}{\part\theta_1}J(\theta_0,\theta_1)=\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) x^{(i)}</math>
* Gradient descent algorithm (derivatives substituted in):
** repeat until convergence {
*** <math>\theta_0 := \theta_0 -\alpha \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})</math>
*** <math>\theta_1 := \theta_1 -\alpha \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) x^{(i)}</math>
** }
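The two update rules translate almost line for line into Octave; this is a sketch under assumed toy data and settings, with the simultaneous update made explicit:
<pre>
% Batch gradient descent for univariate linear regression.
x = [1; 2; 3; 4];  y = [3; 5; 7; 9];   % toy data generated by y = 1 + 2x
alpha = 0.05;  num_iters = 2000;       % assumed learning rate and iterations
theta0 = 0;  theta1 = 0;
m = length(y);
for iter = 1:num_iters
  h = theta0 + theta1 .* x;                        % current predictions
  temp0 = theta0 - alpha * sum(h - y) / m;         % j = 0 update
  temp1 = theta1 - alpha * sum((h - y) .* x) / m;  % j = 1 update
  theta0 = temp0;  theta1 = temp1;                 % simultaneous update
end
printf('theta0 = %g, theta1 = %g\n', theta0, theta1)   % approaches 1 and 2
</pre>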
== [[ML_class#What.27s_Next|What's Next]] ==
* Two extensions:
** In <math>\min_{\theta_0,\theta_1} J(\theta_0,\theta_1)</math>, solve for <em>Θ</em><sub>0</sub>,<em>Θ</em><sub>1</sub> exactly, without needing an iterative algorithm (gradient descent)
** Learn with a larger number of features
* Linear Algebra topics:
** What matrices and vectors are
** Addition, subtraction and multiplication with matrices and vectors
** Matrix inverse and transpose
* Practice expressions:
** <math>\begin{bmatrix}1 & 2 \\0 & 1 \end{bmatrix}^T\begin{bmatrix}1 & 2 \\0 & 1 \end{bmatrix}</math>
** <math>\begin{bmatrix}2 & 0 \\0 & 1 \end{bmatrix}^{-1}\begin{bmatrix}2 \\1 \end{bmatrix}-3\begin{bmatrix}1 \\5 \end{bmatrix}</math>
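* Worked out (my arithmetic, for reference):
** <math>\begin{bmatrix}1 & 2 \\0 & 1 \end{bmatrix}^T\begin{bmatrix}1 & 2 \\0 & 1 \end{bmatrix} = \begin{bmatrix}1 & 0 \\2 & 1 \end{bmatrix}\begin{bmatrix}1 & 2 \\0 & 1 \end{bmatrix} = \begin{bmatrix}1 & 2 \\2 & 5 \end{bmatrix}</math>
** <math>\begin{bmatrix}2 & 0 \\0 & 1 \end{bmatrix}^{-1}\begin{bmatrix}2 \\1 \end{bmatrix}-3\begin{bmatrix}1 \\5 \end{bmatrix} = \begin{bmatrix}\frac{1}{2} & 0 \\0 & 1 \end{bmatrix}\begin{bmatrix}2 \\1 \end{bmatrix}-\begin{bmatrix}3 \\15 \end{bmatrix} = \begin{bmatrix}1 \\1 \end{bmatrix}-\begin{bmatrix}3 \\15 \end{bmatrix} = \begin{bmatrix}-2 \\-14 \end{bmatrix}</math>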
= [[ML_class#LINEAR_ALGEBRA_REVIEW|LINEAR ALGEBRA REVIEW]] =
== [[ML_class#Matrices_and_Vectors|Matrices and Vectors]] ==
* '''Matrix''': a rectangular array of numbers
* '''Dimension of Matrix''': number of rows by number of columns
** <math>\mathbb R^{4\times2}</math> = 4 rows and 2 columns
** <math>\mathbb R^{2\times3}</math> = 2 rows and 3 columns
* Matrix elements:
** <em>A<sub>ij</sub></em> = "<em>i,j</em> entry" in the <em>i<sup>th</sup></em> row, <em>j<sup>th</sup></em> column
** <math>A = \begin{bmatrix}1402 & 191 \\1371 & 821 \\949 & 1437 \\147 & 1448 \end{bmatrix}</math>
*** <em>A<sub>11</sub></em> = 1402
*** <em>A<sub>12</sub></em> = 191
*** <em>A<sub>32</sub></em> = 1437
*** <em>A<sub>41</sub></em> = 147
*** <em>A<sub>43</sub></em> = undefined (error: A has only 2 columns)
* '''Vector''': an n x 1 matrix
** <math>y = \begin{bmatrix}460 \\232 \\315 \\178 \end{bmatrix}</math>
*** <math>\mathbb R^4</math>: n = 4; a 4-dimensional vector
** <em>y<sub>i</sub></em> = <em>i<sup>th</sup></em> element
** 1-indexed vs 0-indexed (1-indexed is assumed in this class)
* Notation (generally):
** A,B,C,X = capital = matrix
** a,b,x,y = lower case = vector or scalar
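The same conventions carry over to Octave, which is also 1-indexed (a sketch using the matrix A above):
<pre>
% Define the 4x2 matrix A from above and index its elements.
A = [1402 191; 1371 821; 949 1437; 147 1448];
size(A)      % ans = [4 2]: 4 rows, 2 columns
A(3, 2)      % ans = 1437
A(4, 1)      % ans = 147
% A(4, 3)    % would be an "out of bound" error: A has only 2 columns
</pre>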
== [[ML_class#Addition_and_Scalar_Multiplication|Addition and Scalar Multiplication]] ==
* Matrix addition (element-wise; dimensions must match):
** <math>\begin{bmatrix}1 & 0 \\2 & 5 \\3 & 1\end{bmatrix} + \begin{bmatrix}4 & 0.5 \\2 & 5 \\0 & 1\end{bmatrix} = \begin{bmatrix}5 & 0.5 \\4 & 10 \\3 & 2\end{bmatrix}</math>
*** 3x2 matrix + 3x2 matrix = 3x2 matrix
** <math>\begin{bmatrix}1 & 0 \\2 & 5 \\3 & 1\end{bmatrix} + \begin{bmatrix}4 & 0.5 \\2 & 5\end{bmatrix} = error</math>
*** 3x2 matrix + 2x2 matrix = error
* Scalar multiplication:
** <math>3 \times \begin{bmatrix}1 & 0 \\2 & 5 \\3 & 1\end{bmatrix} = \begin{bmatrix}3 & 0 \\6 & 15 \\9 & 3\end{bmatrix} = \begin{bmatrix}1 & 0 \\2 & 5 \\3 & 1\end{bmatrix} \times 3</math>
*** scalar x 3x2 matrix = 3x2 matrix = 3x2 matrix x scalar
** <math>\begin{bmatrix}4 & 0 \\6 & 3\end{bmatrix} / 4 = \frac{1}{4}\begin{bmatrix}4 & 0 \\6 & 3\end{bmatrix} = \begin{bmatrix}1 & 0 \\\frac{3}{2} & \frac{3}{4}\end{bmatrix}</math>
* Combination of operands (see the Octave sketch below):
** <math>\begin{align}
3 \times \begin{bmatrix}1 \\4 \\2 \end{bmatrix} + \begin{bmatrix}0 \\0 \\5 \end{bmatrix} - \begin{bmatrix}3 \\0 \\2 \end{bmatrix} / 3
&= \begin{bmatrix}3 \\12 \\6 \end{bmatrix} + \begin{bmatrix}0 \\0 \\5 \end{bmatrix} - \begin{bmatrix}1 \\0 \\\frac{2}{3} \end{bmatrix} \\
&= \begin{bmatrix}2 \\12 \\10\frac{1}{3} \end{bmatrix}
\end{align}</math>
*** 3x1 matrix = 3-dimensional vector
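The same operations in Octave (a sketch reproducing the examples above):
<pre>
% Element-wise matrix addition, scalar multiplication, and a combination.
A = [1 0; 2 5; 3 1];
B = [4 0.5; 2 5; 0 1];
A + B                                           % 3x2 result, element-wise sums
3 * A                                           % same as A * 3
v = 3 * [1; 4; 2] + [0; 0; 5] - [3; 0; 2] / 3   % = [2; 12; 10.3333]
</pre>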
== [[ML_class#Matrix_Vector_Multiplication|Matrix Vector Multiplication]] ==
* <math>\begin{bmatrix}1 & 3 \\4 & 0 \\2 & 1\end{bmatrix} \begin{bmatrix}1 \\5 \end{bmatrix} = \begin{bmatrix}16 \\4 \\7 \end{bmatrix}</math>
** 1&times;1 + 3&times;5 = 16
** 4&times;1 + 0&times;5 = 4
** 2&times;1 + 1&times;5 = 7
** 3x2 matrix * 2x1 matrix = 3x1 matrix
* prediction = DataMatrix * parameters
** <em>h</em><sub>Θ</sub>(x) = -40 + 0.25x, i.e. parameter vector <math>\begin{bmatrix}-40 \\0.25 \end{bmatrix}</math>
** <math>\begin{bmatrix}1 & 2104 \\1 & 1416 \\1 & 1534 \\1 & 852 \end{bmatrix} \times \begin{bmatrix}-40 \\0.25 \end{bmatrix} = \begin{bmatrix}-40\times1 + 0.25 \times 2104 \\-40\times1 + 0.25 \times 1416 \\-40\times1 + 0.25 \times 1534 \\-40\times1 + 0.25 \times 852 \end{bmatrix}</math>
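The prediction example above, computed in one matrix-vector product in Octave (a sketch; the column of ones pairs with Θ<sub>0</sub>):
<pre>
% prediction = DataMatrix * parameters, for h(x) = -40 + 0.25*x.
X = [1 2104; 1 1416; 1 1534; 1 852];   % one row per house, ones for theta0
theta = [-40; 0.25];
predictions = X * theta                % = [486; 314; 343.5; 173]
</pre>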