\[\begin{align}
z &= w'x + b \\
y &= \phi(z)
\end{align}
\]
Note
\(w\) is a vector of weights, \(b\) is an intercept or bias term, and \(\phi\) is (in general) a nonlinear function. This example has a 3-dim input, a 1-dim output, and a single layer (i.e. just the output layer).
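As a concrete sketch of this forward pass (the numeric values of `w`, `b`, and `x`, and the choice of \(\phi = \tanh\), are illustrative, not from the slides):

```julia
# single-layer forward pass: 3-dim input, 1-dim output
w = [0.2, -0.1, 0.4]   # weight vector (illustrative values)
b = 0.5                # bias / intercept
ϕ = tanh               # one common choice of nonlinear activation

x = [1.0, 2.0, 3.0]    # input
z = w' * x + b         # linear transformation ("a regression")
y = ϕ(z)               # nonlinear output
```

With these numbers, \(z = 0.2 - 0.2 + 1.2 + 0.5 = 1.7\) and \(y = \tanh(1.7)\).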
Bias? Why Bias?
\[\begin{align}
z &= w'x + b \\
y &= \phi(z)
\end{align}
\]
Well, we know that \(b\) is just the intercept, and \(z\) is nothing but a linear transformation.
We economists call \(z\) "a regression". The intercept shifts the line/hyperplane up and down; without it, the line passes through the origin.
So, this just takes a linear transform of \(x\) and sticks it into a nonlinear function \(\phi\).
If we can find values \(W\) and \(b\) such that this holds, our NN will perfectly recognize `/` and `\`.
Note
Solving 4 equations with 10 variables leaves us 6 degrees of freedom too many. We can just choose arbitrary numbers for six of them and solve the system for the remaining four.
Computed weights and biases
I set \(b=0\) and \(w_{11},w_{12},w_{21},w_{22}=0.5\). This gives
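We can check the solved weights directly. The pixel encoding assumed here (`/` as `[0,1,1,0]` and `\` as `[1,0,0,1]`, reading the 2×2 grid row-wise) is an assumption consistent with the demo below, not stated explicitly on the slides:

```julia
W = [0.5 0.5  0.5 -0.5;
     0.5 0.5 -0.5  0.5]
b = [0.0, 0.0]

x_slash = [0.0, 1.0, 1.0, 0.0]  # pixels of `/` (assumed encoding)
x_back  = [1.0, 0.0, 0.0, 1.0]  # pixels of `\` (assumed encoding)

W * x_slash + b   # → [1.0, 0.0]: output pattern for `/`
W * x_back + b    # → [0.0, 1.0]: output pattern for `\`
```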
A very simple NN with a linear activation function was perfectly able to recognize those characters from colored (black or white) pixels, using:
4 input neurons
2 output neurons
Increasing the number of pixels increases the resolution of the image:
more variables to deal with, and with nonlinear activations, we cannot solve the system by hand.
Demo Setup
```julia
function demo(x = nothing)
    println("Simple Neural Network Demo")
    println("=" ^ 40)
    # Input
    if x === nothing
        x = rand(4)
    elseif !(isa(x, Vector{<:Real}) && length(x) == 4)
        error("Input 'x' must be a vector of 4 real numbers")
    end
    # Weights and biases
    W = [0.5  0.5  0.5 -0.5;
         0.5  0.5 -0.5  0.5]
    b = [0, 0]
    # Output
    y = W * x + b
    println("Input:")
    println("┌─────┬─────┐")
    println("│ $(round(x[1], digits=1)) │ $(round(x[2], digits=1)) │")
    println("├─────┼─────┤")
    println("│ $(round(x[3], digits=1)) │ $(round(x[4], digits=1)) │")
    println("└─────┴─────┘")
    # simple rule: (1,0) means `/`
    # so let's say any (q,r) means `/` as long as q > r
    println("Output:")
    if y[1] >= y[2]
        println("┌───┐")
        println("│ / │")
        println("└───┘")
    end
    if y[1] ≈ y[2]
        println("or")
    end
    if y[2] >= y[1]
        println("┌───┐")
        println("│ \\ │")
        println("└───┘")
    end
    return y
end
```
The gradient tells us the direction of steepest ascent of the loss. Moving opposite to it reduces the loss.
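A one-dimensional sketch of this idea (the toy loss \(L(w) = (w-3)^2\) and its gradient \(2(w-3)\) are illustrative, not from the slides):

```julia
# repeatedly step opposite the gradient of L(w) = (w - 3)^2,
# whose gradient is 2(w - 3); the minimizer is w = 3
function descend(; w = 0.0, η = 0.1, steps = 100)
    for _ in 1:steps
        w -= η * 2 * (w - 3)   # move against the gradient
    end
    return w
end

descend()   # converges toward 3.0
```

Each step shrinks the distance to the minimizer by the factor \(1 - 2\eta = 0.8\), so after 100 steps `w` is essentially at 3.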
Tip
In matrix form we have \(\delta = 2(\hat{y} - y)\): a vector of length 10 here (= length(y)).
\(\nabla_W \mathbf{L} = \delta\, x^\top\) : a 10 by 25 matrix (\(x\) has length 25)
\(\nabla_b \mathbf{L} = \delta\) : a vector of length 10
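The helpers used in the training loop below (`predict`, `sample_loss`, `sample_gradients`) are not shown on this slide; a minimal sketch consistent with the formulas above, assuming a linear activation and squared-error loss, is:

```julia
predict(W, b, x) = W * x + b             # linear model: ŷ = Wx + b
sample_loss(ŷ, y) = sum((ŷ .- y) .^ 2)   # squared-error loss

function sample_gradients(W, b, x, y)
    δ = 2 .* (predict(W, b, x) .- y)     # δ = 2(ŷ - y)
    return δ * x', δ                     # ∇_W L = δ xᵀ,  ∇_b L = δ
end
```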
Julia: Training Data Setup
We need to reshape the data so that it's useful to our algorithm.
We encode classes with one-hot as before
```julia
function one_hot(digit::Int)
    y = zeros(Float64, 10)
    y[digit + 1] = 1.0   # digit 0 → index 1, ..., digit 9 → index 10
    return y
end

X_train = [PIXELS[i, :] for i in 1:30];        # 30 input vectors, each length 25
Y_train = [one_hot(LABELS[i]) for i in 1:30];  # 30 one-hot output vectors, each length 10

@show Y_train[2]   # show output vector 2
@show X_train[2];  # show input vector 2
```
```julia
function gradient_descent(X, Y; η = 0.1, tol = 5e-3, maxiter = 10_000)
    W = randn(10, 25) .* 0.1   # random starting weights
    b = zeros(Float64, 10)     # zero bias
    losses = Float64[]
    for iter in 1:maxiter
        # initialize gradients and loss at zero
        ∇W = zeros(10, 25); ∇b = zeros(10); L = 0.0
        # the full-sample gradient is the average of each point's gradient
        for (x, y) in zip(X, Y)
            gW, gb = sample_gradients(W, b, x, y)
            ∇W .+= gW; ∇b .+= gb   # accumulate sum
            L += sample_loss(predict(W, b, x), y)
        end
        # divide by n to get the average
        ∇W ./= length(X); ∇b ./= length(X); L /= length(X)
        push!(losses, L)
        gnorm = norm(vcat(vec(∇W), ∇b))
        if iter % 2000 == 0
            println("iter $iter: loss=$(round(L, digits=4)) ‖∇‖=$(round(gnorm, digits=4))")
        end
        if gnorm < tol
            println("Converged at iteration $iter (loss=$(round(L, digits=5)))")
            break
        end
        # update by the negative of the gradient, with stepsize η
        W .-= η .* ∇W; b .-= η .* ∇b
    end
    return W, b, losses
end
```
```
iter 2000: loss=0.051 ‖∇‖=0.0141
iter 4000: loss=0.0285 ‖∇‖=0.008
iter 6000: loss=0.0194 ‖∇‖=0.0057
Converged at iteration 7105 (loss=0.01625)
```
Loss over Iterations
Training Accuracy
```julia
function classify(W, b, x)
    scores = predict(W, b, x)
    return argmax(scores) - 1   # convert 1-based index back to digit label
end

correct = sum(classify(W_opt, b_opt, x) == LABELS[i] for (i, x) in enumerate(X_train))
println("Training accuracy: $correct / $(length(X_train)) = $(round(100correct / length(X_train), digits=1))%")
```