Very Short Intro To Deep Learning

University of Turin

Collegio Carlo Alberto

JPE Data Editor

2026-04-14

Purpose

What are Machine Learning (ML), AI and Deep Learning?

  • ML is a branch of AI concerned with algorithm development.
  • We want models that generalise to unseen data.
  • AI aims for algorithms that do not need to be explicitly programmed - they learn how to improve themselves.
  • Deep Learning uses Artificial Neural Networks with many layers.

What Are (Deep) Neural Networks?

  • Take a \(p\) dimensional input vector \(X\) and build a nonlinear function \(f(X)\) to predict output \(Y\).
  • \(f\) obeys a certain structure, which - importantly - allows automatic differentiation during optimization and parameter search.

A Taxonomy of Learning

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning (RL)

πŸ‘‰ See the statlearning textbook for more details!

A Taxonomy of Learning


         Supervised                    Unsupervised              Reinforcement
Data     Labeled \((x_i, y_i)\) pairs  Unlabeled \(x_i\) only    Payoff/Utility
Goal     Learn \(f: x \mapsto y\)      Find structure in \(x\)   Learn policy \(\pi(a \mid s)\)
Output   Prediction / class            Cluster / embedding       Action sequence
Feedback Prediction error              None (self-organized)     Lifetime Utility

Taxonomy examples

Supervised

  • Regression: house price from features
  • Classification: spam vs. not spam; digit recognition
  • Sequence: machine translation, speech-to-text
  • Vision: image classification (ImageNet), object detection

Unsupervised

  • Clustering: customer segmentation (k-means)
  • Dimensionality reduction: PCA, autoencoders, t-SNE
  • Density estimation: GANs, VAEs, normalizing flows
  • Self-supervised: language model pre-training (GPT, BERT)

Reinforcement

  • Games: AlphaGo, AlphaZero, Atari DQN
  • Robotics: locomotion, manipulation
  • Economics: dynamic programming, optimal stopping
  • LLMs: RLHF (fine-tuning with human feedback)

Deep learning - Artificial Neural Networks (ANN)

Components of an ANN:

  1. Artificial Neurons: each neuron gets an input signal, processes it and outputs another signal (think: one number in, one number out)
  2. Edges: weighted connections between neurons. Signals flow forward along them; during training, error signals flow backward through them.

Real Neurons

Real neurons in a human brain compute in far richer ways; calling these units β€œneurons” is a stretch of the imagination. Marketing.

Example: A Single Layer ANN

  • input \(x\) (often a vector) which gets transformed to an
  • output \(y\) (can also be a vector). We apply an activation function \(\phi\) to a linear transformation of the input:
\[ x = \left( \begin{array}{c} x_{1} \\ x_{2} \\ x_{3} \end{array} \right) \] \[\begin{align} z &= w'x + b \\ y &= \phi(z) \end{align} \]

Note

\(w\) is a vector of weights, \(b\) is an intercept or bias term. \(\phi\) is (in general) a nonlinear function. This example has a 3-dim input and a 1-dim output, and a single layer (i.e. just the output layer).

Bias? Why Bias?

\[\begin{align} z &= w'x + b \\ y &= \phi(z) \end{align} \]

  • Well we know that \(b\) is just the intercept. \(z\) is nothing but a linear transformation.
  • We economists call \(z\) β€œa regression”. The intercept shifts the line/hyperplane up and down, else it passes through the origin.
  • So, this just takes a linear transform of \(x\) and sticks it into a nonlinear function \(\phi\).
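This computation can be sketched directly in Julia; the weights, bias, and input below are arbitrary illustrative values, and \(\phi\) here is a sigmoid, one common nonlinear choice:

```julia
Ο•(z) = 1 / (1 + exp(-z))   # sigmoid activation

w = [0.2, -0.1, 0.4]       # weights (illustrative values)
b = 0.5                    # bias / intercept
x = [1.0, 2.0, 3.0]        # 3-dim input

z = w' * x + b             # linear part: "a regression"
y = Ο•(z)                   # nonlinear activation maps z into (0,1)
```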

What are those \(\phi\) functions then?

  • It’s key that they are nonlinear.
  • Typical choices are
    • sigmoid: \(\phi(x) = \frac{1}{1 + \exp(-x)}\)
    • ReLU (rectified linear unit): \(\phi(x) = \begin{cases} 0 & \text{if } x<0\\x & \text{else.} \end{cases}\)
  • many others (softmax, tanh, Leaky ReLU, GELU,…)
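As a quick sketch, both typical choices are one-liners in Julia:

```julia
sigmoid(x) = 1 / (1 + exp(-x))   # squashes ℝ into (0,1)
relu(x)    = max(zero(x), x)     # cuts off negative inputs

sigmoid(0.0)   # 0.5
relu(-2.0)     # 0.0
relu(3.0)      # 3.0
```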

Deep Neural Networks (DNNs)

Single input layer, multiple hidden layers, single output layer

DNNs

  • Notice: each neuron towards the right depends on the entire network to its left.
  • \(y\) depends on \(h^{(2)}\), which depends on \(h_1^{(1)}\) and \(h_2^{(1)}\).

DNNs

\[\begin{align} h_1^{(1)} &= \phi(w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + b_1^{(1)}) \\ h_2^{(1)} &= \phi(w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + b_2^{(1)})\\ h^{(2)} &= \phi(w_1^{(2)} h_1^{(1)} + w_2^{(2)} h_2^{(1)} + b^{(2)}) \\ y &= \phi(w^{(3)} h^{(2)} + b^{(3)}) \end{align}\]

  • \(w_{12}^{(1)}\) strength of \(x_2 \rightarrow h_1^{(1)}\)
  • \(\phi\) typically constant in layer
  • \(\phi\) can change across layers
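The four equations above can be coded directly; the weights below are arbitrary illustrative values, with \(\phi = \tanh\) in every layer:

```julia
Ο•(z) = tanh(z)                     # same nonlinearity in every layer here

x1, x2 = 1.0, 0.5                  # 2-dim input
# first hidden layer: two neurons
h1 = Ο•(0.1 * x1 + 0.2 * x2 + 0.0)  # weights w₁₁⁽¹⁾, w₁₂⁽¹⁾ and bias b₁⁽¹⁾
h2 = Ο•(0.3 * x1 - 0.1 * x2 + 0.1)  # weights w₂₁⁽¹⁾, w₂₂⁽¹⁾ and bias b₂⁽¹⁾
# second hidden layer: one neuron, fed by h1 and h2
h  = Ο•(0.5 * h1 + 0.5 * h2 + 0.0)
# output layer
y  = Ο•(1.0 * h + 0.0)
```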

Example: Hand Written Digit Recognition

Simplest Problem: Detect / vs \


  • Suppose we have 4 pixels of data, arranged as a square
  • Each pixel can be white (on, 1) or black (off, 0).
  • Imagine this is a very low quality digital photograph of somebody’s handwritten /
  • like, 4 pixels of information only.
  • You can see why we chose / for this exercise: any actual digit is too complex at this resolution. More on that later.

Simplest Problem: Detect / vs \


  • Suppose we have 4 pixels of data, arranged as a square
  • Each pixel can be white (on, 1) or black (off, 0).
  • We want to recognize from those a white forward slash
  • or white backward slash
  • Those represent handwriting in this example.
βŸ‹
⟍

Simplest Problem: Detect / vs \

How does supervised learning work with NNs?

  • We provide the algo with pairs of values: input, output \((X,Y)\)
  • We train it on those pairs.
  • Then we give it a new \(X\) and want a new \(Y\) back.
  • For regression tasks that’s ok. Vector \(x\) in, number \(y\) out.
  • But classification (is this picture / or \?) needs a tweak.

We need to encode classes (/ or \) into numbers somehow.

Simplest Problem: Detect / vs \

Input Encoding

Let’s go down column-wise in each box of squares and record 0 for black and 1 for white. Each square is \(x_1,\dots,x_4 \in \{0,1\}\)

(0,1,1,0)
(1,0,0,1)
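A minimal sketch of this encoding in Julia; conveniently, `vec` reads a matrix column by column, matching the column-wise convention above (the 2Γ—2 grids are the two patterns from the text):

```julia
# 2Γ—2 pixel grids: 1 = white (on), 0 = black (off)
grid_slash     = [0 1;   # `/`: white top-right
                  1 0]   #      and bottom-left
grid_backslash = [1 0;   # `\`: white on the main diagonal
                  0 1]

# Julia stores matrices column-major, so `vec` goes down column-wise
vec(grid_slash)       # [0, 1, 1, 0]
vec(grid_backslash)   # [1, 0, 0, 1]
```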

Simplest Problem: Detect / vs \

Output encoding

Same for output \(Y\). We use a one-hot encoding with a 2-dimensional vector \((y_{1}, y_{2})\).


One hot encoding vs dummy variables

  • Dummy vars typically have a reference category: for K levels you need K-1 columns (e.g. red/green/blue is K=3, so 2 columns).
  • One-hot encodes all categories.
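A sketch of the difference for the red/green/blue example (the helper names `onehot` and `dummy` are made up for illustration):

```julia
labels = ["red", "green", "blue"]           # K = 3 levels

onehot(β„“) = [β„“ == c for c in labels]        # K columns: one per category
dummy(β„“)  = [β„“ == c for c in labels[2:end]] # K-1 columns: "red" is the reference

onehot("green")   # Bool[0, 1, 0]
dummy("red")      # Bool[0, 0] -- the reference category is all zeros
```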

Simplest Problem: Detect / vs \

Output encoding

Same for output \(Y\). We use a one-hot encoding with a 2-dimensional vector \((y_{1}, y_{2})\).

Dummy Variables

green blue
  0     0
  1     0
  0     1
  0     0
  0     1

One-hot Encoding

red green blue
 1    0     0
 0    1     0
 0    0     1
 1    0     0
 0    0     1

Simplest Problem: Detect / vs \

Output encoding: One-hot

Let’s go with the following. You can see that this is arbitrary (we could easily have inverted this without consequences.)

βŸ‹ (1,0)
⟍ (0,1)

Simplest Problem: Detect / vs \

\[ x = \left( \begin{array}{c} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end{array} \right) \]

\[\left( \begin{array}{c} y_{1} \\ y_{2} \end{array} \right) = y\]

Computing parameters

We need \(y = \phi(z)\) and \(z=Wx + b\).

  1. \(W\) is a \((2,4)\) matrix of weights
  2. \(b\) is a \((2,1)\) vector of biases

Simplest Problem: Detect / vs \

Linear Activation Function (for Teaching only)

  • Let’s assume that \(\phi(x)=x\) so we can compute the coefficients.

\[y = \phi(Wx + b) = Wx + b\]

  • Which, written out is

\[\left( \begin{array}{c} y_{1} \\ y_{2} \end{array} \right) = \left( \begin{array}{cccc} w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} \\ w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} \end{array} \right) \left( \begin{array}{c} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right)\]

There are 2 relevant cases, and we need to find values for \(W\) and \(b\) such that a certain tuple of x values results in a certain y output.

\[\left( \begin{array}{cccc} w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} \\ w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} \end{array} \right) \left( \begin{array}{c} 0 \\ 1\\ 1\\ 0 \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right) = \left( \begin{array}{c} 1 \\ 0 \end{array} \right)\]

\[\left( \begin{array}{cccc} w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} \\ w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} \end{array} \right) \left( \begin{array}{c} 1 \\ 0\\ 0\\ 1 \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right) = \left( \begin{array}{c} 0 \\ 1 \end{array} \right)\]

There are 2 relevant cases, and we need to find values for \(W\) and \(b\) such that a certain tuple of x values results in a certain y output.

\[\left( \begin{array}{c} w_{1,2} + w_{1,3} \\ w_{2,2} + w_{2,3} \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right) = \left( \begin{array}{c} 1 \\ 0 \end{array} \right)\]

\[\left( \begin{array}{c} w_{1,1} + w_{1,4} \\ w_{2,1} + w_{2,4} \end{array} \right) + \left( \begin{array}{c} b_{1} \\ b_{2} \end{array} \right) = \left( \begin{array}{c} 0 \\ 1 \end{array} \right)\]

There are 2 relevant cases, and we need to find values for \(W\) and \(b\) such that a certain tuple of x values results in a certain y output.

\[\begin{aligned} w_{1,2} + w_{1,3} + b_{1} &= 1 \\ w_{2,2} + w_{2,3} + b_{2} &= 0 \\ w_{1,1} + w_{1,4} + b_{1} &= 0 \\ w_{2,1} + w_{2,4} + b_{2} &= 1 \end{aligned}\]

If we can find values \(W\) and \(b\) such that this holds, our NN will perfectly recognize / and \.

Note

Solving 4 equations with 10 unknowns: we have 6 degrees of freedom too many. We can fix six of the parameters arbitrarily and solve the system for the remaining four.

Computed weights and biases

I set \(b=0\) and \(w_{11},w_{12},w_{21},w_{22}=0.5\), then solve for the remaining four weights. This gives

\[\left( \begin{array}{cccc} 0.5 & 0.5 & 0.5 & -0.5 \\ 0.5 & 0.5 & -0.5 & 0.5 \end{array} \right) \left( \begin{array}{c} 0 \\ 1\\ 1\\ 0 \end{array} \right) + \left( \begin{array}{c} 0 \\ 0 \end{array} \right) = \left( \begin{array}{c} 1 \\ 0 \end{array} \right)\]

\[\left( \begin{array}{cccc} 0.5 & 0.5 & 0.5 & -0.5 \\ 0.5 & 0.5 & -0.5 & 0.5 \end{array} \right) \left( \begin{array}{c} 1 \\ 0\\ 0\\ 1 \end{array} \right) + \left( \begin{array}{c} 0 \\ 0 \end{array} \right) = \left( \begin{array}{c} 0 \\ 1 \end{array} \right)\]
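A quick check of these two cases in Julia:

```julia
W = [0.5  0.5   0.5  -0.5;
     0.5  0.5  -0.5   0.5]
b = [0.0, 0.0]

W * [0, 1, 1, 0] + b   # [1.0, 0.0]: the one-hot code for `/`
W * [1, 0, 0, 1] + b   # [0.0, 1.0]: the one-hot code for `\`
```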

Simplest Problem: Detect / vs \


  • A very simple NN with a linear activation function was perfectly able to recognize those characters from colored (black or white) pixels
  • 4 input neurons
  • 2 output neurons
  • Increasing the number of pixels increases the resolution of the image,
  • but also means more variables to deal with; with nonlinear activations we can no longer solve by hand.

Demo Setup

function demo(x=nothing)
    println("Simple Neural Network Demo")
    println("="^40)
    if x === nothing
        x = rand(4)
    elseif !(isa(x, Vector{<:Real}) && length(x) == 4)
        error("Input 'x' must be a vector of 4 real numbers")
    end
    
    # Input
    # Weights and biases
    W = [0.5  0.5  0.5 -0.5;
         0.5  0.5 -0.5  0.5]
    b = [0, 0]
    
    # Output    
    y = W * x + b
    
    println("Input:")
    println("β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”")
    println("β”‚ $(round(x[1], digits=1)) β”‚ $(round(x[2], digits=1)) β”‚")
    println("β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€")
    println("β”‚ $(round(x[3], digits=1)) β”‚ $(round(x[4], digits=1)) β”‚")
    println("β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜")
    
    # simple rule: (1,0) means `/`
    # so let's say any (q,r) means `/` as long as q > r
    println("Output:")
    if y[1] >= y[2]
        println("β”Œβ”€β”€β”€β”")
        println("β”‚ βŸ‹ β”‚")
        println("β””β”€β”€β”€β”˜")
    end
    if y[1] β‰ˆ y[2]
        println("or")
    end
    if y[2] >= y[1]
        println("β”Œβ”€β”€β”€β”")
        println("β”‚ ⟍ β”‚")
        println("β””β”€β”€β”€β”˜")
    end
    
    return y
end
demo (generic function with 2 methods)

Run Demo


demo([0,1,1,0])  # should give `/`
Simple Neural Network Demo
========================================
Input:
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ 0.0 β”‚ 1.0 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 1.0 β”‚ 0.0 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
Output:
β”Œβ”€β”€β”€β”
β”‚ βŸ‹ β”‚
β””β”€β”€β”€β”˜
2-element Vector{Float64}:
 1.0
 0.0

Run Demo


demo([1,0,0,1])  # should give `\`
Simple Neural Network Demo
========================================
Input:
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ 1.0 β”‚ 0.0 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 0.0 β”‚ 1.0 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
Output:
β”Œβ”€β”€β”€β”
β”‚ ⟍ β”‚
β””β”€β”€β”€β”˜
2-element Vector{Float64}:
 0.0
 1.0

Run Demo on unseen data

How does this NN generalize to test data?


demo([0.5,0.1,0.2,0.4])  # square with shades of grey
Simple Neural Network Demo
========================================
Input:
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ 0.5 β”‚ 0.1 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚ 0.2 β”‚ 0.4 β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
Output:
β”Œβ”€β”€β”€β”
β”‚ ⟍ β”‚
β””β”€β”€β”€β”˜
2-element Vector{Float64}:
 0.2
 0.4

5Γ—5 Pixel Digit Recognition: More Difficult!

Scaling Up: From 4 to 25 Pixels

What we had:

  • 4-pixel grid, 2 classes (/ vs \)
  • 10 unknowns, 4 equations β†’ solved analytically

What we want now:

  • 25 pixels (5Γ—5 grid), 10 classes (digits 0–9)
  • 260 parameters, 300 equations β†’ overdetermined
  • Cannot solve exactly β€” must minimize error

Solution: Gradient descent β€” find \(W\), \(b\) that minimize average squared error.

5Γ—5 Pixel Digits

Three hand-crafted variants (a, b, c) for each digit:

Input Encoding

Each 5Γ—5 grid is read row by row into \(x = (x_1, \ldots, x_{25}) \in \{0,1\}^{25}\):

\[\begin{align} ( & 0,0,1,0,0, \\ & 0,1,1,0,0, \\ & 0,0,1,0,0, \\ & 0,0,1,0,0, \\ & 0,0,1,0,0 ) \end{align}\]

PIXELS[2,:]'  # transpose so that you can see it
1Γ—25 adjoint(::Vector{Float64}) with eltype Float64:
 0.0  0.0  1.0  0.0  0.0  0.0  1.0  1.0  …  0.0  0.0  0.0  0.0  1.0  0.0  0.0

Input encoding

Each 5Γ—5 grid is read row by row into \(x = (x_1, \ldots, x_{25}) \in \{0,1\}^{25}\):

5Γ—5 adjoint(::Matrix{Float64}) with eltype Float64:
 1.0  1.0  1.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0
 0.0  1.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  0.0

5Γ—5 adjoint(::Matrix{Float64}) with eltype Float64:
 1.0  1.0  1.0  1.0  1.0
 1.0  0.0  0.0  0.0  1.0
 1.0  1.0  1.0  1.0  1.0
 1.0  0.0  0.0  0.0  1.0
 1.0  1.0  1.0  1.0  1.0

Output Encoding

10 classes. One-hot encoding for \(y = (y_0,y_1,\dots,y_9) \in \{0,1\}^{10}\)

Digit 1: \((0,1,0,0,0,0,0,0,0,0)\)
Digit 3: \((0,0,0,1,0,0,0,0,0,0)\)

Network Parameters

\[y = Wx + b\]

       Size               Count
\(W\)  \(10 \times 25\)   250 weights
\(b\)  \(10 \times 1\)    10 biases
Total                     260 parameters

With 30 training samples: \(30 \times 10 = 300\) equations for 260 unknowns

β†’ overdetermined: no exact solution in general β†’ minimize error

Loss Function

For one training pair \((x, y)\) the sample loss is:

\[\mathbf{L}_{(x,y)}(W,b) = \sum_{i=0}^{9} \big(\hat{y}_i - y_i\big)^2 = \sum_{i=0}^{9} \bigg(\sum_{j=1}^{25} w_{ij}\, x_j + b_i - y_i\bigg)^2\]

The average loss over the full training set \(S\) is:

\[\mathbf{L}(W,b) = \frac{1}{|S|} \sum_{(x,y) \in S} \mathbf{L}_{(x,y)}(W,b)\]

Note

We want to find \(W\), \(b\) that minimize \(\mathbf{L}(W,b)\).

Gradients

The gradient tells us the direction of steepest ascent of the loss. Moving opposite to it reduces the loss.

For weights \(w_{ij}\):

\[\frac{\partial \mathbf{L}_{(x,y)}}{\partial w_{ij}} = \frac{\partial \mathbf{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial w_{ij}} = 2(\hat{y}_i - y_i) \cdot x_j\]

For biases \(b_i\):

\[\frac{\partial \mathbf{L}_{(x,y)}}{\partial b_i} = \frac{\partial \mathbf{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial b_i} = 2(\hat{y}_i - y_i) \cdot 1\]

Gradients

The gradient tells us the direction of steepest ascent of the loss. Moving opposite to it reduces the loss.


Tip

In matrix form we have \(\delta = 2(\hat{y} - y)\) - a vector of length(y) = 10 here.

  • \(\nabla_W \mathbf{L} = \delta\, x^\top\) : 10 by 25
  • \(\nabla_b \mathbf{L} = \delta\) : 10
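A standard way to sanity-check these formulas is a finite-difference comparison; this sketch (not part of the lecture code) perturbs a single weight and compares the numerical slope with the analytic gradient:

```julia
# random sample and parameters, shapes as in the text
x = rand(25)
y = zeros(10); y[3] = 1.0          # one-hot target
W = 0.1 .* randn(10, 25)
b = zeros(10)

# analytic gradient: Ξ΄ = 2(Ε· - y), βˆ‡W = Ξ΄ x'
Ξ΄  = 2 .* (W * x .+ b .- y)
βˆ‡W = Ξ΄ * x'                        # 10Γ—25 outer product

# numerical slope of L = Ξ£α΅’ (Ε·α΅’ - yα΅’)Β² w.r.t. w₁₁
loss(Wβ€²) = sum((Wβ€² * x .+ b .- y) .^ 2)
Ξ΅ = 1e-6
E = zeros(10, 25); E[1, 1] = Ξ΅
slope = (loss(W .+ E) - loss(W)) / Ξ΅   # should match βˆ‡W[1, 1]
```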

Julia: Training Data Setup

  • We need to reshape the data so that it’s useful to our algorithm.
  • We encode classes with one-hot as before
function one_hot(digit::Int)
    y = zeros(Float64, 10)
    y[digit + 1] = 1.0   # digit 0 β†’ index 1, ..., digit 9 β†’ index 10
    return y
end

X_train = [PIXELS[i, :] for i in 1:30] ;       # 30 input vectors, each length 25
Y_train = [one_hot(LABELS[i]) for i in 1:30];  # 30 one-hot output vectors, each length 10
@show Y_train[2,:] # show row 2 of output
@show X_train[2,:]; # show row 2 of input
Y_train[2, :] = [[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
X_train[2, :] = [[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]]

Julia: Predict and Loss

  • We need functions to predict the output
  • And to measure the deviation from the true output.
predict(W, b, x) = W * x + b             # forward pass: 10-dim output

sample_loss(Ε·, y) = sum((Ε· .- y).^2)    # sum of squared errors for one sample

# Example: zero-weight model on digit-0 input
W0 = zeros(10, 25);  b0 = zeros(10)
Ε·0 = predict(W0, b0, X_train[1])         # X_train[1] is digit 0 β†’ Ε·0 = [0,…,0]
println("loss (zero weights): ", sample_loss(Ε·0,  Y_train[1]))  # (0-1)Β²+(0-0)Β²Γ—9 = 1
println("loss (perfect pred): ", sample_loss(Y_train[1], Y_train[1]))  # = 0
loss (zero weights): 1.0
loss (perfect pred): 0.0

Julia: Gradients

function sample_gradients(W, b, x, y)
    Ε· = predict(W, b, x)
    Ξ΄ = 2.0 .* (Ε· .- y)    # βˆ‚L/βˆ‚Ε·  (10-vector)
    βˆ‡W = Ξ΄ * x'             # βˆ‚L/βˆ‚W  (10Γ—25 outer product)
    βˆ‡b = Ξ΄                  # βˆ‚L/βˆ‚b  (10-vector)
    return βˆ‡W, βˆ‡b
end

# Example: same zero-weight model on digit-0 input
# Ε·=[0,…,0], y=[1,0,…,0]  β†’  Ξ΄ = 2*(Ε·-y) = [-2,0,…,0]
βˆ‡W0, βˆ‡b0 = sample_gradients(W0, b0, X_train[1], Y_train[1])
println("βˆ‡W size : ", size(βˆ‡W0))           # (10, 25)
println("βˆ‡W[1,:]   : ", βˆ‡W0[1,:])          # gradient row for the digit-0 output neuron
println("βˆ‡W[2,:]   : ", βˆ‡W0[2,:])          # ... and for the digit-1 neuron (all zero here)
println("βˆ‡b[1]   : ", βˆ‡b0[1])             # -2  β†’ W and b for digit 0 must increase
println("βˆ‡b[2]   : ", βˆ‡b0[2])             #  0  β†’ digit-1 neuron unaffected
βˆ‡W size : (10, 25)
βˆ‡W[1,:]   : [-2.0, -2.0, -2.0, -2.0, -2.0, -2.0, -0.0, -0.0, -0.0, -2.0  …  -2.0, -0.0, -0.0, -0.0, -2.0, -2.0, -2.0, -2.0, -2.0, -2.0]
βˆ‡W[2,:]   : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
βˆ‡b[1]   : -2.0
βˆ‡b[2]   : 0.0

Julia: Gradient Descent

using LinearAlgebra  # for `norm`

function gradient_descent(X, Y; Ξ·=0.1, tol=5e-3, maxiter=10_000)
    W = randn(10, 25) .* 0.1  # random starting Weights
    b = zeros(Float64, 10)    # zero bias
    losses = Float64[]
    for iter in 1:maxiter
        # initiate gradients and loss at zero
        βˆ‡W = zeros(10, 25); βˆ‡b = zeros(10); L = 0.0
        # average sample gradient is average of each point's gradient
        for (x, y) in zip(X, Y)
            gW, gb = sample_gradients(W, b, x, y)
            βˆ‡W .+= gW;  βˆ‡b .+= gb # accumulate sum
            L  += sample_loss(predict(W, b, x), y)
        end
        # divide by n to get average
        βˆ‡W ./= length(X);  βˆ‡b ./= length(X);  L /= length(X)
        push!(losses, L)
        gnorm = norm(vcat(vec(βˆ‡W), βˆ‡b))
        if iter % 2000 == 0
            println("iter $iter: loss=$(round(L,digits=4))  β€–βˆ‡β€–=$(round(gnorm,digits=4))")
        end
        if gnorm < tol
            println("Converged at iteration $iter  (loss=$(round(L,digits=5)))")
            break
        end
        # update by the neg of gradient
        # with stepsize Ξ·
        W .-= Ξ· .* βˆ‡W;  b .-= Ξ· .* βˆ‡b
    end
    return W, b, losses
end
gradient_descent (generic function with 1 method)

Training

using Random
Random.seed!(42)
W_opt, b_opt, loss_history = gradient_descent(X_train, Y_train);
iter 2000: loss=0.051  β€–βˆ‡β€–=0.0141
iter 4000: loss=0.0285  β€–βˆ‡β€–=0.008
iter 6000: loss=0.0194  β€–βˆ‡β€–=0.0057
Converged at iteration 7105  (loss=0.01625)

Loss over Iterations

Training Accuracy

function classify(W, b, x)
    scores = predict(W, b, x)
    return argmax(scores) - 1   # convert 1-indexed back to digit label
end

correct = sum(classify(W_opt, b_opt, x) == LABELS[i] for (i, x) in enumerate(X_train))
println("Training accuracy: $correct / $(length(X_train)) = $(round(100correct/length(X_train), digits=1))%")
Training accuracy: 30 / 30 = 100.0%

Prediction Scores

Predicted Scores

  • Great!
  • Notice though:
sorted_scores[1:5,1:5]
5Γ—5 Matrix{Float64}:
  1.0258       0.0427252  -0.0887751   0.168204   -0.0178768
  0.987873    -0.0138577   0.0179698  -0.0767012  -0.00623931
  0.97401     -0.0349464   0.102859   -0.119221    0.0321606
 -0.00977294   0.9823      0.0516094  -0.0539683   0.0185191
 -0.00151022   0.992443    0.0146322  -0.0287553   0.00313316
  • We have numbers outside \([0,1]\) here!
  • This is like the linear probability model!
  • In a real application, we’d want an activation function \(\phi\) which maps into \([0,1]\)!
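The usual choice for multi-class outputs is softmax, which maps the raw scores into numbers that are positive and sum to one; a sketch (the scores below just mimic a row of the table above):

```julia
# subtract the max first: standard trick for numerical stability
function softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

scores = [1.03, 0.04, -0.09, 0.17, -0.02]   # raw scores, some outside [0,1]
p = softmax(scores)                          # all entries in (0,1), summing to 1
```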

Learned Weights

Each row of \(W\) is a 25-dim vector: red = positive weight (pixel ON β†’ higher score), blue = negative.

END