Today we’re diving into how learning algorithms work in detail.

As we know, numpy is good for efficient vector maths. [Tutorial]

Image classification: A core task in computer vision

Your system receives some input image. The model also has a fixed set of discrete labels:

$$\{cat, dog, plane, car\}$$

The task is to assign a class to the image.

To the computer the image isn’t a holistic image of a cat - it’s just a huge array of numbers between 0 and 255. The difference between the semantic idea of “cat” and this huge array is known as the semantic gap.

You can change the image in very subtle ways (e.g. move the camera), and then every single pixel in the array would be different, even though it’s still the same cat. Other challenges include:

• Illumination
• Object deformation
• Occlusion (we might only see parts of the cat)
• Background clutter (the object looks the same as the background)
• Intraclass variation (cats come in different shapes and sizes)

Our brain has evolved to do all of this, but it’s very hard to do automatically. Amazingly, some of this tech works at human-level performance on some limited problems.

Unlike, say, sorting an array, there’s no obvious way to hard-code an algorithm for recognizing the cat. A hand-crafted attempt might be:

1. Edge detection
2. Categorize the different corners etc., write down a set of rules to recognize a cat

This is:

• Brittle
• Not scalable

A data-driven approach is much better:

1. Collect a dataset of images and labels
2. Use machine learning to train a classifier (ingest all the data and summarise it in some way)
3. Evaluate the classifier on new images

Our API has changed: rather than one function that takes an image and returns a label, we have two:

• train(), which takes images and labels and produces a model
• predict(), which takes a model and an image and makes a prediction
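To make the shape of this two-function API concrete, here is a deliberately trivial classifier that follows it (the class and its “always predict the majority label” behaviour are illustrative, not anything from the lecture):

```python
import numpy as np

class MajorityClassifier:
    # Toy example of the train()/predict() API: it "summarises" the training
    # data as nothing more than the most common label.
    def train(self, X, y):
        # X is N x D (one example per row), y holds the N labels
        labels, counts = np.unique(y, return_counts=True)
        self.majority = labels[np.argmax(counts)]

    def predict(self, X):
        # Predict the same label for every test example
        return np.full(X.shape[0], self.majority)
```

Real classifiers differ only in how much structure they extract in train() and how they use it in predict().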

Of course, the concept of a data-driven approach is much more general.

First classifier: Nearest Neighbour

• train(): memorize all data and labels
• predict(): predict the label based on the most similar source image

Let’s do this on CIFAR10:

• 10 classes
• 50,000 training images
• 10,000 testing images

Given a pair of images, how do we actually compare them?

L1 distance: $$d_1(I_1, I_2) = \displaystyle\sum_{p}|I_1^p-I_2^p|$$

This just compares the two images pixel by pixel, subtracts them, and finally adds up the absolute differences to produce a scalar. It’s kind of a stupid way to compare images, but it does reasonable things sometimes.
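On a pair of tiny made-up “images” (four pixel values each, chosen just for this sketch), the L1 distance works out like this:

```python
import numpy as np

# Two tiny flattened "images" (four pixel values each, made up for illustration)
I1 = np.array([56, 32, 10, 18])
I2 = np.array([12, 10, 20, 16])

# L1 distance: sum of absolute pixel-wise differences
d1 = np.sum(np.abs(I1 - I2))
print(d1)  # |44| + |22| + |-10| + |2| = 78
```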

Python code for a nearest neighbour classifier:

import numpy as np

class NearestNeighbour:
    def __init__(self):
        pass

    def train(self, X, y):
        # X is N x D, where each row is an example. y is a one-dimensional
        # vector of size N, where N is of course the number of examples.
        # Training is just memorising all the training data and labels.
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        # X is M x D, where each row is an example we want to predict a label for
        num_test = X.shape[0]
        # Let's make sure that the output type matches the input type
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)

        for i in range(num_test):
            # L1 distance: a vector with the distance to each training image
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            min_index = np.argmin(distances)
            Ypred[i] = self.ytr[min_index]
        return Ypred
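As a quick sanity check, here is the classifier applied to toy data standing in for CIFAR10 (the class is repeated in condensed form so this snippet runs on its own; the data is invented):

```python
import numpy as np

class NearestNeighbour:
    # Condensed version of the classifier above, repeated so the snippet is standalone
    def train(self, X, y):
        self.Xtr, self.ytr = X, y

    def predict(self, X):
        Ypred = np.zeros(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            Ypred[i] = self.ytr[np.argmin(distances)]
        return Ypred

# Toy data: two "dark" and two "bright" 3-pixel images, two classes
Xtr = np.array([[0, 0, 0], [10, 10, 10], [250, 250, 250], [240, 240, 240]])
ytr = np.array([0, 0, 1, 1])

nn = NearestNeighbour()
nn.train(Xtr, ytr)
print(nn.predict(np.array([[5, 5, 5], [245, 245, 245]])))  # [0 1]
```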

Computation: train is O(1), predict is O(N).

This is bad: it’s the wrong way around. We can tolerate classifiers that are slow at training time, but we want them to be fast at test time. Neural networks are the reverse of nearest neighbour: expensive to train, cheap to evaluate.

We can draw the decision boundaries of the nearest neighbour classifier. By looking at the picture we start to see some of the problems:

• a little yellow island in the middle of the green region
• a noisy blue region

This motivates k-nearest neighbours. Instead of just finding the nearest neighbour, we find the k nearest neighbours and take a majority vote as to which class a datapoint should belong to. This leads to smoother decision boundaries. You almost always want k > 1.
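The “k nearest, then majority vote” step can be sketched in a few lines of numpy (the helper name and the toy data are mine, and real code would vectorise over many test points):

```python
import numpy as np

def knn_predict(Xtr, ytr, x, k):
    # L1 distances from test point x to every training row
    distances = np.sum(np.abs(Xtr - x), axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(distances)[:k]
    # Majority vote over their (integer) labels
    return np.bincount(ytr[nearest]).argmax()

# Toy data: one cluster near the origin (class 0 outvoted by design checks below)
Xtr = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10]])
ytr = np.array([0, 0, 1, 1, 1])
print(knn_predict(Xtr, ytr, np.array([0.2, 0.2]), k=3))  # 0
```

With k=3 the single nearby class-1 point is outvoted by the two class-0 points, which is exactly the smoothing effect described above.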

When thinking about computer vision, it’s good to flip back and forth between different viewpoints:

• points in a high-dimensional space
• actual images

L2 (Euclidean) distance: $$d_2(I_1,I_2) = \sqrt{\displaystyle\sum_{p}(I_1^p-I_2^p)^2}$$

Different distance metrics make different assumptions about the geometry of the underlying space: the set of points at a fixed distance from the origin is a square under L1 but a circle under L2. L1 distances change if you rotate the coordinate system; L2 distances don’t. By using a generalised distance metric, we can apply k-nearest-neighbours to basically any kind of data.
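The coordinate-system dependence is easy to check numerically: rotating the frame leaves the L2 distance alone but changes the L1 distance (the vectors here are arbitrary):

```python
import numpy as np

v1 = np.array([3.0, 4.0])
v2 = np.array([0.0, 0.0])

l1 = np.sum(np.abs(v1 - v2))          # 7.0
l2 = np.sqrt(np.sum((v1 - v2) ** 2))  # 5.0

# Rotate the coordinate frame by 45 degrees
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
r1, r2 = R @ v1, R @ v2

print(np.sum(np.abs(r1 - r2)))        # no longer 7.0 (about 5.66)
print(np.sqrt(np.sum((r1 - r2) ** 2)))  # still 5.0
```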

Decision boundaries in L1 tend to follow the coordinate axes, whereas L2 boundaries are more natural.

Online demo

When you use this algorithm in practice, you need to set hyperparameters (k, the distance measure). These aren’t learned from the data, but are choices you make ahead of time. How do you choose them?

• Idea 1: Choose hyperparameters that work best on the training data. Bad: k = 1 always works perfectly on the training data, i.e. this leads to overfitting.
• Idea 2: Split the data into train and test sets, and choose hyperparameters that work best on the test data. Also bad: we’d have no understanding of how the algorithm will perform on new data.
• Idea 3: Split the data into train, validation and test sets; choose hyperparameters on the validation set and evaluate on the test set. You only ever run the algorithm once on the test data.
• Idea 4: Cross-validation. Take the data (saving the test set) and split it into lots of different folds. Try each fold as validation and average the results. This is kind of the gold standard, but hard to use in deep learning because training is so expensive.
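Idea 4 can be sketched as a loop over folds; the helper names and the 1-NN evaluator are mine, and the interleaved toy data is constructed so every fold contains both classes:

```python
import numpy as np

def cross_val_accuracy(X, y, num_folds, evaluate):
    # Try each fold as the validation set, train on the rest, average the results
    folds = np.array_split(np.arange(len(y)), num_folds)
    accuracies = []
    for i in range(num_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(num_folds) if j != i])
        accuracies.append(evaluate(X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return np.mean(accuracies)

def eval_1nn(Xtr, ytr, Xval, yval):
    # 1-nearest-neighbour accuracy under L1 distance
    preds = [ytr[np.argmin(np.sum(np.abs(Xtr - x), axis=1))] for x in Xval]
    return np.mean(np.array(preds) == yval)

# Two well-separated 1-D clusters, interleaved so every fold sees both classes
X = np.array([[0.0], [10.0], [1.0], [11.0], [2.0], [12.0], [3.0], [13.0], [4.0], [14.0]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
print(cross_val_accuracy(X, y, num_folds=5, evaluate=eval_1nn))  # 1.0 on this easy data
```

In practice you would run this once per candidate hyperparameter setting and pick the one with the best averaged validation accuracy.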

When you do cross-validation, you can draw a graph of performance for different values of the hyperparameters (e.g. $$k$$).

kNN is basically never used on images because:

• It’s slow at test time
• L2 isn’t very good at finding perceptual differences between images
• The curse of dimensionality: for kNN to work well, we need the space densely covered (otherwise the nearest neighbours could be far away), and that requires a number of training instances exponential in the dimension.
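A back-of-the-envelope illustration of that exponential growth (the grid spacing of 0.1 is an arbitrary choice for this sketch): covering the unit cube $$[0,1]^d$$ with roughly 10 points per axis takes about $$10^d$$ points.

```python
# Roughly 10 grid points per dimension -> 10**d points to cover [0,1]^d
for d in [1, 2, 3, 10]:
    print(d, 10 ** d)  # 10, 100, 1000, 10000000000
```

Even at a modest 10 dimensions the count is already ten billion; flattened 3072-dimensional CIFAR10 images are hopeless to cover densely.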

Linear Classification

An analogy for neural networks is lego blocks - you stick together all these modules to make a model. The linear classifier is one of the most basic of these building blocks.

Recall CIFAR10:

• 50,000 train images
• 10,000 test images
• Each image is 32x32x3
• 10 classes

Parametric approach

$$\text{Image} \rightarrow f(x,W) \rightarrow 10 \text{ numbers giving class scores}$$

The image $$x$$ is a 32x32x3 array, flattened into a vector. Larger class scores mean that the class is more likely. In the kNN setup we just keep the training data around. Here, we summarise the training data into $$W$$, so we can throw away the training data at test time.

There are all kinds of ways to combine data and weights; the simplest is to multiply them.

For a 32x32x3 image:

1. We turn the image into a $$3072 \times 1$$ vector
2. The weights are $$10 \times 3072$$
• The bias term is $$10 \times 1$$
• We end up with a $$10 \times 1$$ vector giving us a score for each class

$$f(x,W) = Wx + b$$
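The shapes in $$f(x,W) = Wx + b$$ can be checked directly (the random values here are placeholders for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(3072)        # flattened 32x32x3 image
W = rng.random((10, 3072))  # one row per class, acting as a template
b = rng.random(10)          # data-independent per-class bias

scores = W @ x + b          # one score per class
print(scores.shape)         # (10,)
```

Each of the 10 scores is a dot product between the image and one row of $$W$$, which is why the rows can be read as templates.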

The bias term is giving us a data-independent preference for one class over another. If your dataset is unbalanced, the bias term can even things out.

Linear classification is a template-matching approach, which means you can go backwards and visualise the rows of the weight matrix as templates! (This is also useful for the CCA workshop.)

The problem is also apparent: the linear classifier only learns one template per class. If there’s intra-class variation, it’s going to have problems. Neural networks don’t have this restriction - they can learn more than one template.

In high-dimensional space, each image is a point. The linear classifier draws hyperplanes through this space to separate the data.

Linear classification struggles when the classes aren’t linearly separable, e.g. if a class forms islands in the plane.

Next: How can we determine whether $$W$$ is good or bad?

• Loss function
• Optimisation
• Convolutional Nets