ML class overview
- ML class
- INTRODUCTION
- Examples of machine learning
- Database mining (Large datasets from growth of automation/web)
- clickstream data
- medical records
- biology
- engineering
- Applications that can't be programmed by hand
- autonomous helicopter
- handwriting recognition
- most of Natural Language Processing (NLP)
- Computer Vision
- Self-customising programs
- Amazon
- Netflix product recommendations
- Understanding human learning (brain, real AI)
- What is Machine Learning?
- Definitions of Machine Learning
- Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
- e.g. spam filtering: T = classifying emails as spam or not spam, P = fraction of emails classified correctly, E = watching the user label emails
- There are several different types of ML algorithms. The two main types are:
- Supervised learning
- teach computer how to do something
- Unsupervised learning
- computer learns by itself
- Other types of algorithms are:
- Reinforcement learning
- Recommender systems
- Supervised Learning
- Supervised Learning in which the "right answers" are given
- Regression: predict continuous valued output (e.g. price)
- Classification: discrete valued output (e.g. 0 or 1)
- Unsupervised Learning
- Unsupervised Learning in which the categories are unknown
- Clustering: groupings (categories) are found automatically in the data
- Cocktail party problem: overlapping audio tracks are separated out
- [W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x'); % the one-line Octave solution from the lecture; x holds the mixed audio signals
- LINEAR REGRESSION WITH ONE VARIABLE
- Model Representation
- e.g. housing prices, price per square-foot
- Supervised Learning
- Regression
- Dataset called training set
- Notation:
m              | number of training examples
x's            | "input" variable / features
y's            | "output" variable / "target" variable
(x, y)         | one training example
(x^(i), y^(i)) | the i-th training example
- Training Set -> Learning Algorithm -> h (hypothesis)
- Size of house (x) -> h -> Estimated price (y)
- h maps from x's to y's
- How do we represent h?
- h_Θ(x) = h(x) = Θ₀ + Θ₁x
- Linear regression with one variable (x)
- Univariate linear regression
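- Below, a minimal Octave sketch of evaluating the hypothesis h_Θ(x) = Θ₀ + Θ₁x on a few examples (the parameter values and house sizes are made up for illustration):

      theta = [50; 0.12];        % [theta0; theta1], arbitrary illustrative values
      x = [1000; 1500; 2000];    % house sizes in square feet
      X = [ones(size(x)) x];     % prepend a column of ones so X*theta includes theta0
      h = X * theta;             % h_theta(x^(i)) = theta0 + theta1*x^(i) for each example
      disp(h)                    % estimated prices: 170, 230, 290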
- Cost Function
- Helps us figure out how to fit the best possible straight line to our data
- h_Θ(x) = Θ₀ + Θ₁x
- Θᵢ's: parameters
- How to choose the parameters (Θᵢ's)?
- Choose Θ₀, Θ₁ so that h_Θ(x) is close to y for our training examples (x, y)
- Minimise over Θ₀, Θ₁
- h_Θ(x^(i)) = Θ₀ + Θ₁x^(i)
- J(Θ₀, Θ₁) = (1/2m) · Σ_{i=1}^{m} (h_Θ(x^(i)) − y^(i))²
- J(Θ₀, Θ₁) is the Cost Function, also known in this case as the Squared Error Function
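- A minimal Octave sketch of evaluating J on a toy training set (all data values are invented):

      X = [ones(3,1) [1; 2; 3]];   % m = 3 examples, with a bias column of ones
      y = [1; 2; 3];
      theta = [0; 1];              % candidate parameters [theta0; theta1]
      m = length(y);
      J = sum((X * theta - y) .^ 2) / (2 * m)   % squared-error cost; here J = 0, a perfect fit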
- Cost Function - Intuition I
- Summary:
- Hypothesis: h_Θ(x) = Θ₀ + Θ₁x
- Parameters: Θ₀, Θ₁
- Cost Function: J(Θ₀, Θ₁) = (1/2m) · Σ_{i=1}^{m} (h_Θ(x^(i)) − y^(i))²
- Goal: minimise J(Θ₀, Θ₁) over Θ₀ and Θ₁
- Simplified:
- h_Θ(x) = Θ₁x
- minimise J(Θ₁) over Θ₁
- Can plot simplified model in 2D
- Cost Function - Intuition II
- Can plot J(Θ₀, Θ₁) in 3D
- Can plot with Contour Map (Contour Plot)
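- As a sketch, both plots can be produced in Octave by evaluating J over a grid of parameter values (the toy data and grid ranges are arbitrary):

      X = [ones(3,1) [1; 2; 3]]; y = [1; 2; 3];
      t0 = linspace(-10, 10, 100); t1 = linspace(-1, 4, 100);
      J_vals = zeros(length(t0), length(t1));
      for i = 1:length(t0)
        for j = 1:length(t1)
          J_vals(i, j) = sum((X * [t0(i); t1(j)] - y) .^ 2) / (2 * length(y));
        end
      end
      surf(t0, t1, J_vals');                                 % bowl-shaped 3-D surface
      figure; contour(t0, t1, J_vals', logspace(-2, 3, 20)); % contour map of the same J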
- Gradient Descent
- Repeat until convergence { Θⱼ := Θⱼ − α · (∂/∂Θⱼ) J(Θ₀, Θ₁), updating j = 0 and j = 1 simultaneously }
- α = learning rate
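- A compact Octave sketch of this loop (the data, α, and iteration count are illustrative choices, not from the notes):

      X = [ones(3,1) [1; 2; 3]]; y = [2; 4; 6];   % toy data generated by y = 2x
      theta = zeros(2, 1);
      alpha = 0.1; num_iters = 1500; m = length(y);
      for iter = 1:num_iters
        grad = (X' * (X * theta - y)) / m;        % [dJ/dTheta0; dJ/dTheta1]
        theta = theta - alpha * grad;             % simultaneous update of both parameters
      end
      theta                                       % converges toward [0; 2]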
- Gradient Descent Intuition
- min over Θ₁ of J(Θ₁)
- For Θ₁ > the local minimum: (d/dΘ₁) J(Θ₁) is positive, so the update moves Θ₁ toward the local minimum
- For Θ₁ < the local minimum: (d/dΘ₁) J(Θ₁) is negative, so the update moves Θ₁ toward the local minimum
- If the learning rate α is too small, gradient descent takes many small steps and is slow to converge
- If the learning rate α is too large, gradient descent can overshoot the minimum, fail to converge, or even diverge (a numerical illustration follows at the end of this section)
- When the derivative (d/dΘ₁) J(Θ₁) is zero, the update leaves Θ₁ unchanged: Θ₁ has converged to a local minimum
- As we approach a local minimum, gradient descent automatically takes smaller steps, because the derivative term shrinks toward zero
- So there is no need to decrease α over time
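- A tiny numerical illustration of the learning-rate trade-off, using the one-parameter cost J(Θ₁) = Θ₁² (so dJ/dΘ₁ = 2Θ₁ and the minimum is at Θ₁ = 0; the α values are arbitrary):

      for alpha = [0.01 0.5 1.1]        % too small / about right / too large
        t = 1.0;                        % starting value of Theta1
        for iter = 1:20
          t = t - alpha * 2 * t;        % one gradient descent step
        end
        printf('alpha = %.2f -> Theta1 after 20 steps = %g\n', alpha, t);
      end

- With α = 0.01, Θ₁ creeps slowly toward 0; with α = 0.5 it lands exactly on the minimum; with α = 1.1 every step overshoots and |Θ₁| grows without bound, i.e. it diverges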