Why do we study visual data?

The amount of visual data has exploded, driven by the growing number of sensors in the world. Roughly 80% of all traffic on the internet is video (Cisco 2016). Yet visual data is hard to understand: it has been called the dark matter of the internet.

Computer vision touches on all kinds of fields:

CS224 on Deep Learning and Natural Language Processing

A history of computer vision

The number of species went from a few hundred to hundreds of thousands within 10 million years. Parker (2010) suggests this was driven by the development of vision. Vision is now the most important sensory system in most intelligent animals.

None of these approaches really managed to solve object recognition. Maybe we should do object segmentation first?

Going into the 2000s, we start getting much better data (thanks to the internet). In the early 2000s we start to have benchmark datasets that allow us to measure progress in object recognition. The PASCAL Visual Object Challenge is one of these: 20 categories with roughly 10,000 images each, letting different research groups compare progress on a common footing. ImageNet (Deng, Dong, Socher, Li, Li, Fei-Fei) is the next iteration of this, with 14M images in 22,000 categories. In 2009 the ImageNet team launches the Large Scale Visual Recognition Challenge: 1,000 object classes and 1.4M images (Russakovsky et al. 2014). By 2015 image recognition algorithms are on par with human performance (as estimated by a single Stanford PhD student doing the challenge for weeks).

In 2012 the error rate drops by almost 10 percentage points; the winning entry is a convolutional neural network (which is what this course is about).

An overview of the course

We’re focused on image classification. This relatively simple tool is useful by itself, but we’ll also talk about object detection (drawing bounding boxes around objects in images) and image captioning (given an image, produce a natural-language sentence describing it).

Convolutional NNs had this breakthrough in 2012; since then we’ve basically been fine-tuning, going from 8 to 200 layers. The general idea has been around since the 90s, but thanks to faster compute (GPUs) and much larger labeled datasets, they’re much more advanced now.

Human vision does much more than draw bounding boxes: it includes forming 3D models of the world and activity recognition (given a video, working out what’s happening).

Johnson et al. (2015): Image Retrieval using Scene Graphs

In some sense the holy grail of computer vision is to understand the story of an image in a rich and nuanced way.

[Barack Obama pressing his foot on a scale]

We understand this as funny because we have all this incredible background knowledge about the image: how scales work, who Obama is, how people feel about their weight, etc.

Goodfellow, Bengio and Courville: Deep Learning