Very Short Intro To Deep Learning

University of Turin

Collegio Carlo Alberto

JPE Data Editor

2026-04-14

Purpose

What are Machine Learning (ML), AI and Deep Learning?

  • ML is a branch of AI concerned with algorithm development.
  • We want models that generalise to unseen data.
  • AI aims for algorithms that do not need to be explicitly programmed - they learn how to improve themselves.
  • Deep Learning uses Artificial Neural Networks with many layers.

What Are (Deep) Neural Networks?

  • Take a \(p\) dimensional input vector \(X\) and build a nonlinear function \(f(X)\) to predict output \(Y\).
  • \(f\) obeys a certain structure, which - importantly - allows automatic differentiation during optimization and parameter search.

A Taxonomy of Learning

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning (RL)

πŸ‘‰ See the statlearning textbook for more details!

A Taxonomy of Learning


         Supervised                    Unsupervised              Reinforcement
Data     Labeled \((x_i, y_i)\) pairs  Unlabeled \(x_i\) only    Payoff/Utility
Goal     Learn \(f: x \mapsto y\)      Find structure in \(x\)   Learn policy \(\pi(a \mid s)\)
Output   Prediction / class            Cluster / embedding       Action sequence
Feedback Prediction error              None (self-organized)     Lifetime Utility

Taxonomy examples

Supervised

  • Regression: house price from features
  • Classification: spam vs. not spam; digit recognition
  • Sequence: machine translation, speech-to-text
  • Vision: image classification (ImageNet), object detection

Unsupervised

  • Clustering: customer segmentation (k-means)
  • Dimensionality reduction: PCA, autoencoders, t-SNE
  • Density estimation: GANs, VAEs, normalizing flows
  • Self-supervised: language model pre-training (GPT, BERT)

Reinforcement

  • Games: AlphaGo, AlphaZero, Atari DQN
  • Robotics: locomotion, manipulation
  • Economics: dynamic programming, optimal stopping
  • LLMs: RLHF (fine-tuning with human feedback)

Deep learning - Artificial Neural Networks (ANN)

Components of an ANN:

  1. Artificial Neurons: each neuron gets an input signal, processes it and outputs another signal (think: one number in, one number out)
  2. Edges: weighted connections between neurons. Signals flow forward along them; during training, error signals flow backward through them.

Real Neurons

Real neurons in a human brain compute in far richer ways; calling these units β€œneurons” is a stretch of the imagination. Marketing.

Example: A Single Layer ANN

  • input \(x\) (often a vector) which gets transformed to an
  • output \(y\) (can also be a vector). We apply an activation function \(\phi\) to a linear transformation of the input:
\[ x = \left( \begin{array}{c} x_{1} \\ x_{2} \\ x_{3} \end{array} \right) \] \[\begin{align} z &= w'x + b \\ y &= \phi(z) \end{align} \]

Note

\(w\) is a vector of weights, \(b\) is an intercept or bias term. \(\phi\) is (in general) a nonlinear function. This example has a 3-dim input and a 1-dim output, and a single layer (i.e. just the output layer).

Bias? Why Bias?

\[\begin{align} z &= w'x + b \\ y &= \phi(z) \end{align} \]

  • Well we know that \(b\) is just the intercept. \(z\) is nothing but a linear transformation.
  • We economists call \(z\) β€œa regression”. The intercept shifts the line/hyperplane up and down, else it passes through the origin.
  • So, this just takes a linear transform of \(x\) and sticks it into a nonlinear function \(\phi\).
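This computation can be sketched directly in Julia; the weights, bias, and input below are arbitrary illustrative values, and \(\phi\) here is a sigmoid, one common nonlinear choice:

```julia
Ο•(z) = 1 / (1 + exp(-z))   # sigmoid activation

w = [0.2, -0.1, 0.4]       # weights (illustrative values)
b = 0.5                    # bias / intercept
x = [1.0, 2.0, 3.0]        # 3-dim input

z = w' * x + b             # linear part: "a regression"
y = Ο•(z)                   # nonlinear activation maps z into (0,1)
```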

What are those \(\phi\) functions then?

  • It’s key that they are nonlinear.
  • Typical choices are
    • sigmoid: \(\phi(x) = \frac{1}{1 + \exp(-x)}\)
    • ReLU (rectified linear unit): \(\phi(x) = \begin{cases} 0 & \text{if } x<0\\x & \text{else.} \end{cases}\)
  • many others (softmax, tanh, Leaky ReLU, GELU,…)
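As a quick sketch, both typical choices are one-liners in Julia:

```julia
sigmoid(x) = 1 / (1 + exp(-x))   # squashes ℝ into (0,1)
relu(x)    = max(zero(x), x)     # cuts off negative inputs

sigmoid(0.0)   # 0.5
relu(-2.0)     # 0.0
relu(3.0)      # 3.0
```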

Deep Neural Networks (DNNs)

Single input layer, multiple hidden layers, single output layer

DNNs

  • Notice: each neuron towards the right depends on the entire network to its left.
  • \(y\) depends on \(h^{(2)}\), which depends on \(h_1^{(1)}\) and \(h_2^{(1)}\).

DNNs

\[\begin{align} h_1^{(1)} &= \phi(w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + b_1^{(1)}) \\ h_2^{(1)} &= \phi(w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + b_2^{(1)})\\ h^{(2)} &= \phi(w_1^{(2)} h_1^{(1)} + w_2^{(2)} h_2^{(1)} + b^{(2)}) \\ y &= \phi(w^{(3)} h^{(2)} + b^{(3)}) \end{align}\]

  • \(w_{12}^{(1)}\) strength of \(x_2 \rightarrow h_1^{(1)}\)
  • \(\phi\) typically constant in layer
  • \(\phi\) can change across layers
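The four equations above can be coded directly; the weights below are arbitrary illustrative values, with \(\phi = \tanh\) in every layer:

```julia
Ο•(z) = tanh(z)                     # same nonlinearity in every layer here

x1, x2 = 1.0, 0.5                  # 2-dim input
# first hidden layer: two neurons
h1 = Ο•(0.1 * x1 + 0.2 * x2 + 0.0)  # weights w₁₁⁽¹⁾, w₁₂⁽¹⁾ and bias b₁⁽¹⁾
h2 = Ο•(0.3 * x1 - 0.1 * x2 + 0.1)  # weights w₂₁⁽¹⁾, w₂₂⁽¹⁾ and bias b₂⁽¹⁾
# second hidden layer: one neuron, fed by h1 and h2
h  = Ο•(0.5 * h1 + 0.5 * h2 + 0.0)
# output layer
y  = Ο•(1.0 * h + 0.0)
```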

Example: Hand Written Digit Recognition

Simplest Problem: Detect / vs \


  • Suppose we have 4 pixels of data, arranged as a square
  • Each pixel can be white (on, 1) or black (off, 0).
  • Imagine this is a very low quality digital photograph of somebody’s handwritten /
  • like, 4 pixels of information only.
  • You can see why we chose / for this exercise: any actual digit is too complex at this resolution. More on that later.

Simplest Problem: Detect / vs \


  • Suppose we have 4 pixels of data, arranged as a square
  • Each pixel can be white (on, 1) or black (off, 0).
  • We want to recognize from those a white forward slash
  • or white backward slash
  • Those represent handwriting in this example.
βŸ‹
⟍

Simplest Problem: Detect / vs \

How does supervised learning work with NNs?

  • We provide the algo with pairs of values: input, output \((X,Y)\)
  • We train it on those pairs.
  • Then we give it a new \(X\) and want a new \(Y\) back.
  • For regression tasks that’s ok. Vector \(x\) in, number \(y\) out.
  • But classification (is this picture / or \?) needs a tweak.

We need to encode classes (/ or \) into numbers somehow.

Simplest Problem: Detect / vs \

Input Encoding

Let’s go down column-wise in each box of squares and record 0 for black and 1 for white. Each square is \(x_1,\dots,x_4 \in \{0,1\}\)

(0,1,1,0)
(1,0,0,1)
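A minimal sketch of this encoding in Julia; conveniently, `vec` reads a matrix column by column, matching the column-wise convention above (the 2Γ—2 grids are the two patterns from the text):

```julia
# 2Γ—2 pixel grids: 1 = white (on), 0 = black (off)
grid_slash     = [0 1;   # `/`: white top-right
                  1 0]   #      and bottom-left
grid_backslash = [1 0;   # `\`: white on the main diagonal
                  0 1]

# Julia stores matrices column-major, so `vec` goes down column-wise
vec(grid_slash)       # [0, 1, 1, 0]
vec(grid_backslash)   # [1, 0, 0, 1]
```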

Simplest Problem: Detect / vs \

Output encoding

Same for output \(Y\). We use a one-hot encoding with a 2-dimensional vector \((y_{1}, y_{2})\).


One hot encoding vs dummy variables

  • Dummy vars typically have a reference category: for K levels you need K-1 columns (e.g. red/green/blue is K=3, so 2 columns).
  • One-hot encodes all categories.
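A sketch of the difference for the red/green/blue example (the helper names `onehot` and `dummy` are made up for illustration):

```julia
labels = ["red", "green", "blue"]           # K = 3 levels

onehot(β„“) = [β„“ == c for c in labels]        # K columns: one per category
dummy(β„“)  = [β„“ == c for c in labels[2:end]] # K-1 columns: "red" is the reference

onehot("green")   # Bool[0, 1, 0]
dummy("red")      # Bool[0, 0] -- the reference category is all zeros
```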

Simplest Problem: Detect / vs \

Output encoding

Same for output \(Y\). We use a one-hot encoding with a 2-dimensional vector \((y_{1}, y_{2})\).

Dummy Variables

green blue
  0     0
  1     0
  0     1
  0     0
  0     1

One-hot Encoding

red green blue
 1    0     0
 0    1     0
 0    0     1
 1    0     0
 0    0     1

Simplest Problem: Detect / vs \

Output encoding: One-hot

Let’s go with the following. You can see that this is arbitrary (we could easily have inverted this without consequences.)

βŸ‹ (1,0)
⟍ (0,1)

Simplest Problem: Detect / vs \

\[ x = \left( \begin{array}{c} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end{array} \right) \]

\[\left( \begin{array}{c} y_{1} \\ y_{2} \end{array} \right) = y\]

Computing parameters

We need \(y = \phi(z)\) and \(z=Wx + b\).

  1. \(W\) is a \((2,4)\) matrix of weights
  2. \(b\) is a \((2,1)\) vector of biases

Simplest Problem: Detect / vs \

Linear Activation Function (for Teaching only)

  • Let’s assume that \(\phi(x)=x\) so we can compute the coefficients.

\[y = \phi(Wx + b) = Wx + b\]

  • Which, written out is

\[\left( \begin{array}{c} y_{1} \\ y_{2} \end{array} \right) = \left( \begin{array}{cccc} w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} \\ w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} \end{array} \right) \left( \begin{array}{c} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right)\]

There are 2 relevant cases, and we need to find values for \(W\) and \(b\) such that a certain tuple of x values results in a certain y output.

\[\left( \begin{array}{cccc} w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} \\ w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} \end{array} \right) \left( \begin{array}{c} 0 \\ 1\\ 1\\ 0 \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right) = \left( \begin{array}{c} 1 \\ 0 \end{array} \right)\]

\[\left( \begin{array}{cccc} w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} \\ w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} \end{array} \right) \left( \begin{array}{c} 1 \\ 0\\ 0\\ 1 \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right) = \left( \begin{array}{c} 0 \\ 1 \end{array} \right)\]

There are 2 relevant cases, and we need to find values for \(W\) and \(b\) such that a certain tuple of x values results in a certain y output.

\[\left( \begin{array}{c} w_{1,2} + w_{1,3} \\ w_{2,2} + w_{2,3} \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right) = \left( \begin{array}{c} 1 \\ 0 \end{array} \right)\]

\[\left( \begin{array}{c} w_{1,1} + w_{1,4} \\ w_{2,1} + w_{2,4} \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right) = \left( \begin{array}{c} 0 \\ 1 \end{array} \right)\]

There are 2 relevant cases, and we need to find values for \(W\) and \(b\) such that a certain tuple of x values results in a certain y output.

\[\begin{aligned} w_{1,2} + w_{1,3} + b_{1} &= 1 \\ w_{2,2} + w_{2,3} + b_{2} &= 0 \\ w_{1,1} + w_{1,4} + b_{1} &= 0 \\ w_{2,1} + w_{2,4} + b_{2} &= 1 \end{aligned}\]

If we can find values \(W\) and \(b\) such that this holds, our NN will perfectly recognize / and \.

Note

Solving 4 equations with 10 unknowns: we have 6 degrees of freedom too many. We can fix six of the parameters arbitrarily and solve the system for the remaining four.

Computed weights and biases

I set \(b=0\) and \(w_{11},w_{12},w_{21},w_{22}=0.5\), then solve for the remaining four weights. This gives

\[\left( \begin{array}{cccc} 0.5 & 0.5 & 0.5 & -0.5 \\ 0.5 & 0.5 & -0.5 & 0.5 \end{array} \right) \left( \begin{array}{c} 0 \\ 1\\ 1\\ 0 \end{array} \right) + \left( \begin{array}{c} 0 \\ 0 \end{array} \right) = \left( \begin{array}{c} 1 \\ 0 \end{array} \right)\]

\[\left( \begin{array}{cccc} 0.5 & 0.5 & 0.5 & -0.5 \\ 0.5 & 0.5 & -0.5 & 0.5 \end{array} \right) \left( \begin{array}{c} 1 \\ 0\\ 0\\ 1 \end{array} \right) + \left( \begin{array}{c} 0 \\ 0 \end{array} \right) = \left( \begin{array}{c} 0 \\ 1 \end{array} \right)\]
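A quick check of these two cases in Julia:

```julia
W = [0.5  0.5   0.5  -0.5;
     0.5  0.5  -0.5   0.5]
b = [0.0, 0.0]

W * [0, 1, 1, 0] + b   # [1.0, 0.0]: the one-hot code for `/`
W * [1, 0, 0, 1] + b   # [0.0, 1.0]: the one-hot code for `\`
```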

Simplest Problem: Detect / vs \


  • A very simple NN with a linear activation function was perfectly able to recognize those characters from colored (black or white) pixels
  • 4 input neurons
  • 2 output neurons
  • Increasing the number of pixels increases the resolution of the image,
  • but also means more variables to deal with; with nonlinear activations we can no longer solve by hand.

Demo Setup

function demo(x=nothing)
    println("Simple Neural Network Demo")
    println("="^40)
    if x === nothing
        x = rand(4)
    elseif !(isa(x, Vector{<:Real}) && length(x) == 4)
        error("Input 'x' must be a vector of 4 real numbers")
    end
    
    # Input
    # Weights and biases
    W = [0.5  0.5  0.5 -0.5;
         0.5  0.5 -0.5  0.5]
    b = [0, 0]
    
    # Output    
    y = W * x + b
    
    println("Input:")
    println("β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”")
    println("β”‚ $(round(x[1], digits=1)) β”‚ $(round(x[2], digits=1)) β”‚")
    println("β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€")
    println("β”‚ $(round(x[3], digits=1)) β”‚ $(round(x[4], digits=1)) β”‚")
    println("β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜")
    
    # simple rule: (1,0) means `/`
    # so let's say any (q,r) means `/` as long as q > r
    println("Output:")
    if y[1] >= y[2]
        println("β”Œβ”€β”€β”€β”")
        println("β”‚ βŸ‹ β”‚")
        println("β””β”€β”€β”€β”˜")
    end
    if y[1] β‰ˆ y[2]
        println("or")
    end
    if y[2] >= y[1]
        println("β”Œβ”€β”€β”€β”")
        println("β”‚ ⟍ β”‚")
        println("β””β”€β”€β”€β”˜")
    end
    
    return y
end
demo (generic function with 2 methods)

Run Demo


demo([0,1,1,0])  # should give `/`
Simple Neural Network Demo
========================================
Input:
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ 0.0 β”‚ 1.0 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 1.0 β”‚ 0.0 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
Output:
β”Œβ”€β”€β”€β”
β”‚ βŸ‹ β”‚
β””β”€β”€β”€β”˜
2-element Vector{Float64}:
 1.0
 0.0

Run Demo


demo([1,0,0,1])  # should give `\`
Simple Neural Network Demo
========================================
Input:
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ 1.0 β”‚ 0.0 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 0.0 β”‚ 1.0 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
Output:
β”Œβ”€β”€β”€β”
β”‚ ⟍ β”‚
β””β”€β”€β”€β”˜
2-element Vector{Float64}:
 0.0
 1.0

Run Demo on unseen data

How does this NN generalize to test data?


demo([0.5,0.1,0.2,0.4])  # square with shades of grey
Simple Neural Network Demo
========================================
Input:
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ 0.5 β”‚ 0.1 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 0.2 β”‚ 0.4 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
Output:
β”Œβ”€β”€β”€β”
β”‚ ⟍ β”‚
β””β”€β”€β”€β”˜
2-element Vector{Float64}:
 0.2
 0.4

5Γ—5 Pixel Digit Recognition: More Difficult!

Scaling Up: From 4 to 25 Pixels

What we had:

  • 4-pixel grid, 2 classes (/ vs \)
  • 10 unknowns, 4 equations β†’ solved analytically

What we want now:

  • 25 pixels (5Γ—5 grid), 10 classes (digits 0–9)
  • 260 parameters, 300 equations β†’ overdetermined
  • Cannot solve exactly β€” must minimize error

Solution: Gradient descent β€” find \(W\), \(b\) that minimize average squared error.

5Γ—5 Pixel Digits

Three hand-crafted variants (a, b, c) for each digit:

Input Encoding

Each 5Γ—5 grid is read row by row into \(x = (x_1, \ldots, x_{25}) \in \{0,1\}^{25}\):

\[\begin{align} ( & 0,0,1,0,0, \\ & 0,1,1,0,0, \\ & 0,0,1,0,0, \\ & 0,0,1,0,0, \\ & 0,0,1,0,0 ) \end{align}\]

PIXELS[2,:]'  # transpose so that you can see it
1Γ—25 adjoint(::Vector{Float64}) with eltype Float64:
 0.0  0.0  1.0  0.0  0.0  0.0  1.0  1.0  …  0.0  0.0  0.0  0.0  1.0  0.0  0.0

Input encoding

Each 5Γ—5 grid is read row by row into \(x = (x_1, \ldots, x_{25}) \in \{0,1\}^{25}\):

5Γ—5 adjoint(::Matrix{Float64}) with eltype Float64:
 1.0  1.0  1.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0
 0.0  1.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  0.0

5Γ—5 adjoint(::Matrix{Float64}) with eltype Float64:
 1.0  1.0  1.0  1.0  1.0
 1.0  0.0  0.0  0.0  1.0
 1.0  1.0  1.0  1.0  1.0
 1.0  0.0  0.0  0.0  1.0
 1.0  1.0  1.0  1.0  1.0

Output Encoding

10 classes. One-hot encoding for \(y = (y_0,y_1,\dots,y_9) \in \{0,1\}^{10}\)

Digit 1: \((0,1,0,0,0,0,0,0,0,0)\)
Digit 3: \((0,0,0,1,0,0,0,0,0,0)\)

Network Parameters

\[y = Wx + b\]

       Size               Count
\(W\)  \(10 \times 25\)   250 weights
\(b\)  \(10 \times 1\)    10 biases
Total                     260 parameters

With 30 training samples: \(30 \times 10 = 300\) equations for 260 unknowns

β†’ overdetermined: no exact solution in general β†’ minimize error

Loss Function

For one training pair \((x, y)\) the sample loss is:

\[\mathbf{L}_{(x,y)}(W,b) = \sum_{i=0}^{9} \big(\hat{y}_i - y_i\big)^2 = \sum_{i=0}^{9} \bigg(\sum_{j=1}^{25} w_{ij}\, x_j + b_i - y_i\bigg)^2\]

The average loss over the full training set \(S\) is:

\[\mathbf{L}(W,b) = \frac{1}{|S|} \sum_{(x,y) \in S} \mathbf{L}_{(x,y)}(W,b)\]

Note

We want to find \(W\), \(b\) that minimize \(\mathbf{L}(W,b)\).

Gradients

The gradient tells us the direction of steepest ascent of the loss. Moving opposite to it reduces the loss.

For weights \(w_{ij}\):

\[\frac{\partial \mathbf{L}_{(x,y)}}{\partial w_{ij}} = \frac{\partial \mathbf{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_{ij}} = 2(\hat{y}_i - y_i) \cdot x_j\]

For biases \(b_i\):

\[\frac{\partial \mathbf{L}_{(x,y)}}{\partial b_i} = \frac{\partial \mathbf{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial b_i} = 2(\hat{y}_i - y_i) \cdot 1\]

Gradients

The gradient tells us the direction of steepest ascent of the loss. Moving opposite to it reduces the loss.


Tip

In matrix form we have \(\delta = 2(\hat{y} - y)\) - a vector of length(y) = 10 here.

  • \(\nabla_W \mathbf{L} = \delta\, x^\top\) : 10 by 25
  • \(\nabla_b \mathbf{L} = \delta\) : 10
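A standard way to sanity-check these formulas is a finite-difference comparison; this sketch (not part of the lecture code) perturbs a single weight and compares the numerical slope with the analytic gradient:

```julia
# random sample and parameters, shapes as in the text
x = rand(25)
y = zeros(10); y[3] = 1.0          # one-hot target
W = 0.1 .* randn(10, 25)
b = zeros(10)

# analytic gradient: Ξ΄ = 2(Ε· - y), βˆ‡W = Ξ΄ x'
Ξ΄  = 2 .* (W * x .+ b .- y)
βˆ‡W = Ξ΄ * x'                        # 10Γ—25 outer product

# numerical slope of L = Ξ£α΅’ (Ε·α΅’ - yα΅’)Β² w.r.t. w₁₁
loss(Wβ€²) = sum((Wβ€² * x .+ b .- y) .^ 2)
Ξ΅ = 1e-6
E = zeros(10, 25); E[1, 1] = Ξ΅
slope = (loss(W .+ E) - loss(W)) / Ξ΅   # should match βˆ‡W[1, 1]
```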

Julia: Training Data Setup

  • We need to reshape the data so that it’s useful to our algorithm.
  • We encode classes with one-hot as before
function one_hot(digit::Int)
    y = zeros(Float64, 10)
    y[digit + 1] = 1.0   # digit 0 β†’ index 1, ..., digit 9 β†’ index 10
    return y
end

X_train = [PIXELS[i, :] for i in 1:30] ;       # 30 input vectors, each length 25
Y_train = [one_hot(LABELS[i]) for i in 1:30];  # 30 one-hot output vectors, each length 10
@show Y_train[2,:] # show row 2 of output
@show X_train[2,:]; # show row 2 of input
Y_train[2, :] = [[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
X_train[2, :] = [[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]]

Julia: Predict and Loss

  • We need functions to predict the output
  • And to measure the deviation from the true output.
predict(W, b, x) = W * x + b             # forward pass: 10-dim output

sample_loss(Ε·, y) = sum((Ε· .- y).^2)    # sum of squared errors for one sample

# Example: zero-weight model on digit-0 input
W0 = zeros(10, 25);  b0 = zeros(10)
Ε·0 = predict(W0, b0, X_train[1])         # X_train[1] is digit 0 β†’ Ε·0 = [0,…,0]
println("loss (zero weights): ", sample_loss(Ε·0,  Y_train[1]))  # (0-1)Β²+(0-0)Β²Γ—9 = 1
println("loss (perfect pred): ", sample_loss(Y_train[1], Y_train[1]))  # = 0
loss (zero weights): 1.0
loss (perfect pred): 0.0

Julia: Gradients

function sample_gradients(W, b, x, y)
    Ε· = predict(W, b, x)
    Ξ΄ = 2.0 .* (Ε· .- y)    # βˆ‚L/βˆ‚Ε·  (10-vector)
    βˆ‡W = Ξ΄ * x'             # βˆ‚L/βˆ‚W  (10Γ—25 outer product)
    βˆ‡b = Ξ΄                  # βˆ‚L/βˆ‚b  (10-vector)
    return βˆ‡W, βˆ‡b
end

# Example: same zero-weight model on digit-0 input
# Ε·=[0,…,0], y=[1,0,…,0]  β†’  Ξ΄ = 2*(Ε·-y) = [-2,0,…,0]
βˆ‡W0, βˆ‡b0 = sample_gradients(W0, b0, X_train[1], Y_train[1])
println("βˆ‡W size : ", size(βˆ‡W0))           # (10, 25)
println("βˆ‡W[1,:]   : ", βˆ‡W0[1,:])          # gradient row for the digit-0 output neuron
println("βˆ‡W[2,:]   : ", βˆ‡W0[2,:])          # ... and for the digit-1 neuron (all zero here)
println("βˆ‡b[1]   : ", βˆ‡b0[1])             # -2  β†’ W and b for digit 0 must increase
println("βˆ‡b[2]   : ", βˆ‡b0[2])             #  0  β†’ digit-1 neuron unaffected
βˆ‡W size : (10, 25)
βˆ‡W[1,:]   : [-2.0, -2.0, -2.0, -2.0, -2.0, -2.0, -0.0, -0.0, -0.0, -2.0  …  -2.0, -0.0, -0.0, -0.0, -2.0, -2.0, -2.0, -2.0, -2.0, -2.0]
βˆ‡W[2,:]   : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
βˆ‡b[1]   : -2.0
βˆ‡b[2]   : 0.0

Julia: Gradient Descent

using LinearAlgebra  # for `norm`

function gradient_descent(X, Y; Ξ·=0.1, tol=5e-3, maxiter=10_000)
    W = randn(10, 25) .* 0.1  # random starting Weights
    b = zeros(Float64, 10)    # zero bias
    losses = Float64[]
    for iter in 1:maxiter
        # initiate gradients and loss at zero
        βˆ‡W = zeros(10, 25); βˆ‡b = zeros(10); L = 0.0
        # average sample gradient is average of each point's gradient
        for (x, y) in zip(X, Y)
            gW, gb = sample_gradients(W, b, x, y)
            βˆ‡W .+= gW;  βˆ‡b .+= gb # accumulate sum
            L  += sample_loss(predict(W, b, x), y)
        end
        # divide by n to get average
        βˆ‡W ./= length(X);  βˆ‡b ./= length(X);  L /= length(X)
        push!(losses, L)
        gnorm = norm(vcat(vec(βˆ‡W), βˆ‡b))
        if iter % 2000 == 0
            println("iter $iter: loss=$(round(L,digits=4))  β€–βˆ‡β€–=$(round(gnorm,digits=4))")
        end
        if gnorm < tol
            println("Converged at iteration $iter  (loss=$(round(L,digits=5)))")
            break
        end
        # update by the neg of gradient
        # with stepsize Ξ·
        W .-= Ξ· .* βˆ‡W;  b .-= Ξ· .* βˆ‡b
    end
    return W, b, losses
end
gradient_descent (generic function with 1 method)

Training

using Random
Random.seed!(42)
W_opt, b_opt, loss_history = gradient_descent(X_train, Y_train);
iter 2000: loss=0.051  β€–βˆ‡β€–=0.0141
iter 4000: loss=0.0285  β€–βˆ‡β€–=0.008
iter 6000: loss=0.0194  β€–βˆ‡β€–=0.0057
Converged at iteration 7105  (loss=0.01625)

Loss over Iterations

Training Accuracy

function classify(W, b, x)
    scores = predict(W, b, x)
    return argmax(scores) - 1   # convert 1-indexed back to digit label
end

correct = sum(classify(W_opt, b_opt, x) == LABELS[i] for (i, x) in enumerate(X_train))
println("Training accuracy: $correct / $(length(X_train)) = $(round(100correct/length(X_train), digits=1))%")
Training accuracy: 30 / 30 = 100.0%

Prediction Scores

Predicted Scores

  • Great!
  • Notice though:
sorted_scores[1:5,1:5]
5Γ—5 Matrix{Float64}:
  1.0258       0.0427252  -0.0887751   0.168204   -0.0178768
  0.987873    -0.0138577   0.0179698  -0.0767012  -0.00623931
  0.97401     -0.0349464   0.102859   -0.119221    0.0321606
 -0.00977294   0.9823      0.0516094  -0.0539683   0.0185191
 -0.00151022   0.992443    0.0146322  -0.0287553   0.00313316
  • We have numbers outside \([0,1]\) here!
  • This is like the linear probability model!
  • In a real application, we’d want an activation function \(\phi\) which maps into \([0,1]\)!
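The usual choice for multi-class outputs is softmax, which maps the raw scores into numbers that are positive and sum to one; a sketch (the scores below just mimic a row of the table above):

```julia
# subtract the max first: standard trick for numerical stability
function softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

scores = [1.03, 0.04, -0.09, 0.17, -0.02]   # raw scores, some outside [0,1]
p = softmax(scores)                          # all entries in (0,1), summing to 1
```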

Learned Weights

Each row of \(W\) is a 25-dim vector: red = positive weight (pixel ON β†’ higher score), blue = negative.

END