Hinton, Turing Award 2018: Final Nail in the Coffin
https://www.youtube.com/watch?v=-cv0ddgcImk&list=PLF9MB5-iSvcvFpxC8MbJFB2p8r4YjuiOx&index=1

I'd first like to thank all the people at ACM who devote their time to making all of this run smoothly.

There have been two paradigms for AI since the 1950s. There's been the logic-inspired approach, where the essence of intelligence is seen as symbolic expressions operated on by symbolic rules, and the main problem has been reasoning: how do we get a computer to do reasoning like people do? And there's been the biologically inspired approach, which is very different. It sees the essence of intelligence as learning the connection strengths of a neural network, and the main things to focus on, at least to begin with, are learning and perception. So they're very different paradigms with very different initial goals.

They have very different views of the internal representations that should be used. The symbolic paradigm thinks you should use symbolic expressions, and you can give these to the computer if you invent a good language to express them in, and you can of course get new expressions within the computer by applying rules. The biological paradigm thinks the internal representations are nothing at all like language. They're just big vectors of neural activity, these big vectors have causal effects on other big vectors, and these vectors are going to be learned from data, so all the structure in them is going to be learned from data. I'm obviously giving caricatures of the two positions to emphasize how different they are.

They lead to two very different ways of trying to get a computer to do what you want. One method, which I slightly naughtily call intelligent design, is what you would call programming: you figure out how to solve the problem and then you tell the computer exactly what to do. The other method is that you just show the computer a lot of examples of inputs and the outputs you'd like it to produce, and you let the computer figure it out. Of course you have to program the computer there too, but it's programmed once with some general-purpose learning algorithm. That again is a simplification.

An example of the kind of thing that people spent 50 years trying to do with symbolic AI is to take an image and describe what's in it. Think about taking the millions of pixels in the image on the left and converting them to a string of words. It's not obvious how you'd write that program. People tried for a long time and they couldn't write that program. People doing neural nets also tried for a long time, and in the end they managed to get a system that worked quite well, and it was based on the pure learning approach.

The central question for neural nets was always this: we know that big neural nets with lots of layers and nonlinear processing elements can compute complicated things, at least we believe they can, but can they learn to do it? Can you learn a task like object recognition or machine translation by taking a big net, starting from random weights, and somehow training it so that it changes the weights and so changes what it computes? There's an obvious learning algorithm for such systems, which was proposed by Turing and by Selfridge and, in variations, by many other people. The idea is that you start with random weights. This is how Turing believed human intelligence works: you start with random weights, and rewards and punishments cause you to change the connection strengths, so you eventually learn stuff.
This is extremely inefficient. It will work, but it's extremely inefficient. In the 1960s Rosenblatt introduced a fairly simple and efficient learning procedure, much more efficient than random trial and error, that could learn the weights on features: you extract features from the image and then you combine the features using weights to make a decision. He managed to show you could do some moderately impressive things that way, but in perceptrons you don't learn the features. That again is a simplification: Rosenblatt had all sorts of ideas about how you might learn the features, but he didn't invent backpropagation.

In 1969 Minsky and Papert showed that the kinds of perceptrons Rosenblatt had got to work were very limited in what they could do. There were some fairly simple things they were unable to do, and Minsky and Papert strongly implied that making them deeper wouldn't help and better learning algorithms wouldn't help; there was a basic limitation to this way of doing things. That led to the first neural net winter.

In the 1970s and 1980s many different groups invented the backpropagation algorithm, or variations of it, and backpropagation allows a neural network to learn the feature detectors and to have multiple layers of learned feature detectors. That created a lot of excitement. It allowed neural networks, for example, to convert words into vectors that represented the meanings of the words, and they could do that just by trying to predict the next word. It looked as if it might be able to solve tough problems like speech recognition and shape recognition, and indeed it did do moderately well at speech recognition, and for some forms of shape recognition it did very well, like Yann LeCun's networks that read handwriting.

What I'm going to do now is explain very briefly how neural networks work. I know most of you will know this, but I just want to go over it just in case. We make a gross idealization of a neuron, and the aim of this idealization is to get something that can learn, so that we can study how you put these things together in big networks to learn something complicated. The unit has some incoming weights that the learning algorithm will vary, and it gives an output that's just equal to its total input, provided that input is over a certain amount. That's a rectified linear neuron, which we actually didn't start using until later, but these are the kinds of neurons that work very well. Then you hook them up into a network. Each neuron has weights on its incoming connections, and as you change those incoming weights you change which feature that neuron responds to; so by learning the weights you're learning the features. You put in a few hidden layers, and then you'd like to train it so that the output neurons do what you want. For example, we might show images of dogs and cats and want the left neuron to turn on for a dog and the right one for a cat. The question is how we are going to train it.
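As a concrete version of the idealized neuron just described, here is a minimal sketch in Python/NumPy (my own illustration, not code from the talk) of a rectified linear unit and a tiny network with one hidden layer of feature detectors; the layer sizes and random weights are arbitrary assumptions.

```python
import numpy as np

def relu(z):
    # Rectified linear unit: output equals the input when it is positive, zero otherwise.
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    # One hidden layer of ReLU "feature detectors", then a linear output layer.
    h = relu(W1 @ x + b1)   # changing W1 changes which input pattern each hidden unit responds to
    return W2 @ h + b2      # e.g. two outputs, one for "dog" and one for "cat"

# Hypothetical sizes: 4 inputs, 3 hidden units, 2 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)) * 0.1, np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)) * 0.1, np.zeros(2)
print(forward(rng.standard_normal(4), W1, b1, W2, b2))
```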
There are mainly two kinds of learning algorithms. Well, there are actually three, but the third one doesn't work very well; it's called reinforcement learning. There's a wonderful reductio ad absurdum of reinforcement learning called DeepMind. That was a joke. There's supervised training, where you show the network what the output ought to be and you adjust the weights until it produces the output you want, and for that you need to know what the output ought to be. And there's unsupervised learning, where you take some data and try to represent it in the hidden layers in such a way that you can reconstruct the data, or perhaps reconstruct parts of it: if I blank out small parts of the data, can I reconstruct them from the hidden units? That's the way unsupervised learning typically works in neural nets.

Here's a really inefficient way to do supervised learning, using a mutation or reinforcement kind of method. You take your neural net, you give it a typical set of examples, and you see how well it does. You then take one weight, change it slightly, and see if the net does better or worse. If it does better you keep that change; if it does worse you throw it away, or perhaps change the weight in the opposite direction, and that's already a factor of two improvement. But this is an incredibly slow learning algorithm. It will work, but what it achieves can be achieved many, many times faster by backpropagation; you can think of backpropagation as just an efficient version of this algorithm. In backpropagation, instead of changing a weight and measuring what effect that has on the performance of the network, you use the fact that all of the weights of the network are inside the computer to compute what the effect of a weight change would be on the performance, and you do that for all of the weights in parallel. If you have a million weights, you can compute for all of them in parallel what the effect of a small change in each one would be on the performance, and then you can update them all in parallel. That has its own problems, but it will go a million times faster than the previous algorithm. Many people in the press describe that as an exponential speed-up; actually it's a linear speed-up. The term "exponentially" is used quadratically too often.

So we get to backpropagation: you do a forward pass through the net, you look at what the outputs are, and then, using the difference between what you got and what you wanted, you do a backward pass which has much the same flavor as the forward pass. It's just high-school calculus, or maybe first-year university calculus, and you can now compute in parallel which direction you should change each weight in. Then, very surprisingly, you don't have to do that for the whole training set. You just take a small batch of examples, and on that batch you compute how to change the connection strengths. You might have got it wrong because of the quirks of that batch, but you change them anyway, and then you take another batch. This is called stochastic gradient descent, and I guess the major discovery of the neural net community is that stochastic gradient descent, even though it has no real right to work, actually works really well. And it works really well at scale: if you give it lots of data and big nets, it really shows its colors.
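To make the contrast concrete, here is a minimal sketch (my own illustration, not code from the talk) of both update rules on a simple linear model: the slow perturb-one-weight-and-measure method, and mini-batch stochastic gradient descent using the analytic gradient that backpropagation provides. The toy data, step sizes, and batch size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))          # toy data: 200 examples, 4 inputs
y = X @ rng.standard_normal((4, 1))        # targets generated by a hidden linear rule

def loss(W, Xb, yb):
    return float(np.mean((Xb @ W - yb) ** 2))

# 1) Weight perturbation: nudge one weight, keep the change only if performance improves.
W_pert = np.zeros((4, 1))
for _ in range(2000):
    trial = W_pert.copy()
    trial[rng.integers(4), 0] += rng.choice([-0.01, 0.01])
    if loss(trial, X, y) < loss(W_pert, X, y):   # one measurement per single-weight change: very slow
        W_pert = trial

# 2) Stochastic gradient descent: the analytic gradient (what backprop gives you)
#    is computed for every weight at once, on a small mini-batch.
W_sgd = np.zeros((4, 1))
for _ in range(200):
    idx = rng.integers(0, len(X), size=20)       # a small, possibly unrepresentative batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ W_sgd - yb) / len(idx)
    W_sgd -= 0.1 * grad                          # update all weights in parallel

print("perturbation:", loss(W_pert, X, y), " SGD:", loss(W_sgd, X, y))
```

On this toy problem both reach a low error, but the perturbation method needs a full evaluation for every single-weight change, whereas each gradient step adjusts every weight at once.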
However, in the 1980s we were very, very pleased with backpropagation. It seemed to have solved the problem, and we were convinced it was going to solve everything. It did actually do quite well at speech recognition and some forms of object recognition, but it was basically a disappointment: it didn't work nearly as well as we thought, and the real issue was why. At the time people had all sorts of analyses of why it didn't work, most of which were wrong. They said it's getting trapped in local optima; we now know that wasn't the problem.

When other learning algorithms worked better than backpropagation on modest data sets, most people in the machine learning community adopted the view that what you guys are trying to do, learning these deep multi-layer networks from random weights just using stochastic gradient descent, is crazy. It's never going to work; you're asking for too much. There's no way you're going to get systems like this to work unless you put in quite a lot of hand engineering and somehow wire in some prior knowledge. Linguists, for example, have been indoctrinated to believe that a lot of language is innate and that you'd never learn language without prior knowledge. In fact they had mathematical theorems that proved you couldn't learn language without prior knowledge. My response to that is: beware of mathematicians bearing theorems.

I just want to give you some really silly theories; I'm a Monty Python fan, so here are some really silly theories. The continents used to be connected and then drifted apart, and you can imagine how silly geologists thought that theory was. Great big neural nets that start with random weights and no prior knowledge can learn to do machine translation; that seemed like a very, very silly theory to many people. Just to add one more: if you take a natural remedy and keep diluting it, the more you dilute it the more potent it gets, and some people believe that too. The quote at the top was actually taken from the continental drift literature. Wegener, who suggested continental drift in 1912, was kind of laughed out of town, even though he actually had very good arguments; he didn't have a good mechanism, and the geological community said, you know, we've got to keep this stuff out of the textbooks and out of the journals, it's just going to confuse people.

We had our own little experience of that in the second neural net winter. NIPS, of all conferences, declined to take a paper of mine. You don't forget those things. Like many other disappointed authors, I had a word with a friend on the program committee, and my friend told me, well, you see, they couldn't accept it because they had two papers on deep learning and they had to decide which one to accept, and they had actually accepted the other one, so they couldn't reasonably be expected to have two papers on the same thing in the same conference. I suggest you go to NIPS now and see.

Yoshua Bengio submitted a paper to ICML in about 2009, I'm not certain of the year but it's around then, and one of the reviewers said that neural network papers had no place in a machine learning conference. I suggest you go to ICML now.

CVPR, the leading computer vision conference, was the most outrageous of all, I think. Yann and his co-workers submitted a paper doing semantic segmentation that beat the state of the art, it beat what the mainstream computer vision people could do, and it got rejected, and one of the reviewers said this paper tells us nothing about computer vision because everything is learned. The reviewer, like the field of computer vision at the time, was stuck in the frame of mind that the way you do computer vision is this: you think about the nature of the task of vision, you preferably write down some equations, you think about how to do the computations required to do vision, then you get some implementation of it, and then you see whether it works. The idea that you just learn everything was outside the realm of things worth considering, and so the reviewer basically missed the point, which was that everything was learned.
He completely failed to see how that completely changed computer vision. Now, I shouldn't be too hard on those guys, because a little later on they were very reasonable: with a bit more evidence they suddenly flipped.

Between 2005 and 2009, researchers, some of them in Canada (we make Yann an honorary Canadian because he's French), made several technical advances that allowed backpropagation to work better in feed-forward nets. They involved using unsupervised pre-training to initialize the weights before you turn on backpropagation, things like dropping out units at random to make the whole thing much more robust, and introducing rectified linear units, which turned out to be easier to train. For us the details of those advances are our bread and butter and we're very interested in them, but the main message is that with a few technical advances backpropagation works amazingly well, and the main reason is that we now have lots of labeled data and a lot of convenient compute power. Inconvenient compute power isn't much use, but things like GPUs and more recently TPUs allow you to apply a lot of computation, and they've made a huge difference. So really the deciding factor, I think, was the increase in compute power, and I think a lot of the credit for deep learning really goes to the people who collected the big databases, like Fei-Fei Li, and the people who made the computers go fast, like David Patterson and lots of others.

The killer app, from my point of view, came in 2009, when in my lab we got a bunch of GPUs and two graduate students made them learn to do acoustic modeling. Acoustic modeling means you take something like a spectrogram and you try to figure out, for the middle frame of the spectrogram, which piece of which phoneme the speaker is trying to express. In the relatively small database we used there are 183 labels for which piece of which phoneme it might be. You pre-train a net with many layers of 2000 hidden units; you can't pre-train the last layer because you don't know the labels yet, and you're training each layer just to be able to reproduce what's in the layer below. Then you turn on learning in all the layers, and it does slightly better than the state of the art, which had taken 30 years to develop.
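The pre-training he describes trains each hidden layer just to reproduce the activity of the layer below, before backpropagation is turned on through the whole stack. Below is a minimal sketch of that greedy layer-wise idea using tied-weight autoencoders in NumPy. This is my own illustration: the layer sizes, learning rate, and the use of simple autoencoders (rather than the actual 2009 acoustic-modeling setup) are assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(H_below, n_hidden, lr=0.1, steps=500):
    # Train one layer to reconstruct the layer below it (a tied-weight autoencoder).
    n_visible = H_below.shape[1]
    W = rng.standard_normal((n_visible, n_hidden)) * 0.01
    for _ in range(steps):
        batch = H_below[rng.integers(0, len(H_below), size=32)]
        hidden = sigmoid(batch @ W)            # code for the layer below
        recon = sigmoid(hidden @ W.T)          # reconstruction of the layer below
        err = recon - batch
        delta_out = err * recon * (1 - recon)
        delta_hid = (delta_out @ W) * hidden * (1 - hidden)
        dW = batch.T @ delta_hid + delta_out.T @ hidden   # tied weights: both paths contribute
        W -= lr * dW / len(batch)
    return W

# Stack of layers: each new layer is pre-trained on the codes produced by the one below.
X = rng.random((1000, 100))                    # stand-in for normalized spectrogram frames
layer_sizes = [64, 64]                         # illustrative, not the 2000-unit layers of the talk
weights, H = [], X
for n in layer_sizes:
    W = pretrain_layer(H, n)
    weights.append(W)
    H = sigmoid(H @ W)                         # output of this layer feeds the next
# After pre-training, an output layer over the 183 phoneme-piece labels would be added
# and backpropagation turned on through the whole stack (fine-tuning).
```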
When the smart people in speech saw that, they realized that with more development this stuff was going to be amazing. My graduate students went off to various groups like MSR and IBM and Google. In particular, Navdeep Jaitly went to Google and ported the acoustic modeling system developed in Toronto fairly literally, and it came out in Android in 2012. There was a lot of good engineering to make it run in real time, and it gave a big decrease in error rates. At more or less the same time all the other groups started changing the way they did speech recognition, and now all the good speech recognizers use neural nets. They're not like the neural nets we introduced initially; neural nets have gradually eroded more and more parts of the system. Putting a neural net in your system is a bit like getting gangrene: it gradually eats the whole system.

Then in 2012 two other of my graduate students applied neural nets of the kind developed over many years by Yann LeCun to object recognition, on a big database that Fei-Fei Li put together with a thousand different classes of object. It was finally a big enough database of real images to show what neural nets could do, and they could do a lot. If you looked at the results, all the standard computer vision systems had asymptoted at about 25 percent error; our system, developed by two graduate students, got 16 percent error, and with further work on neural nets like that, by 2015 it was down to five percent, and now it's considerably below that. Then what happened was exactly what ought to happen in science: the leaders of the computer vision community looked at this result and said, oh, they really do work, we were wrong, okay, we're going to switch, and within a year they all switched. So science finally worked like it was meant to.

The last thing I want to talk about is a radically new way to do machine translation, which was introduced in 2014 by people at Google and also in Montreal by people in Yoshua Bengio's lab. The idea in 2014 was that for each language we would have a neural network. It would be a recurrent network that encodes the string of words in that language, which it receives one at a time, into a big vector. I call that big vector a thought vector; the idea is that the big vector captures the meaning of that string of words. Then you take that big vector and give it to a decoder network, and the decoder network turns the big vector into a string of words in another language. It sort of worked, and with a bit of development it worked very well.

Since 2014 one of the major pieces of development has been that when you're decoding, you look back at the sentence you were encoding; that's called soft attention. Each time you produce a new word you decide where to look in the sentence you're translating, and that helps a lot. You also now pre-train the word embeddings, and that helps a lot. The way the pre-training works is that you take a bunch of words and try to reproduce those words in a deep net, but you've left out some of the words, so you have to reproduce the same words but fill in the blanks. Essentially they use things called transformers, where, as each word goes through the net, it looks at nearby words to disambiguate what it might mean. So if you have a word like "may", when it goes in you'll get an initial vector that's ambiguous between the modal and the month, but if it sees "the 13th" next to it, it knows pretty well it's the month, and so in the next layer it can disambiguate, and the meaning of that "may" will be the month. Those transformer nets now work really well for getting word embeddings. It also turns out they learn a whole lot of grammar. All the stuff that linguists thought had to be put in innately, these neural nets are now getting: they're getting lots of syntactic understanding, but it's all being learned from data. If you look in the early layers of transformer nets, they know what parts of speech things are; if you look in later parts of the nets, they know how to disambiguate pronoun references. Basically they're learning grammar the way a little kid learns grammar, just from looking at sentences.
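The disambiguation he describes, where the vector for "may" consults nearby words such as "13th", is the attention operation at the heart of transformers. Below is a minimal sketch of scaled dot-product self-attention in NumPy; the toy vocabulary, vector size, and random projection matrices are stand-ins for a trained model, not a real one.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                             # toy embedding size

# Toy sentence with an initially ambiguous embedding for "may".
tokens = ["may", "the", "13th"]
E = {t: rng.standard_normal(d) for t in tokens}   # stand-in word embeddings
X = np.stack([E[t] for t in tokens])              # (3, d): one row per word

# Random projections standing in for learned query/key/value matrices.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Scaled dot-product attention: each word mixes in information from the others.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax per row
contextual = weights @ V                          # new, context-dependent word vectors

# In a trained transformer the row for "may" would now lean toward the "month" sense,
# because its attention weights pick up the neighbouring "13th".
print("attention weights for 'may':", np.round(weights[0], 2))
```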
So I think machine translation was really the final nail in the coffin of symbolic AI, because machine translation is the ideal task for symbolic AI: it's symbols in and symbols out. But it turns out that if you want to do it well, what you need inside is big vectors.

Okay, I have said everything I wanted to say about the history of neural nets up to 2014 or so. I've emphasized the ideology, that there were these two camps, and that the good guys won. It's not over yet, because of course what we need now is for neural nets to begin to be able to explain reasoning. We can't do that yet; we're working on it. But reasoning is the last thing that people do, not the first thing. Reasoning is built on top of all this other stuff, and my view has always been that you're never going to understand reasoning until you understand all this other stuff. Now we are beginning to understand all this other stuff, and we're more or less ready to begin to understand reasoning. But reasoning with bare symbols, using rules that are expressed as other symbols, seems to me just hopeless: you're missing all the content, there's no meaning there.

Okay, I want to talk a little bit about the future of computer vision. Convolutional neural nets have been very effective. What convolutional neural nets do is wire in the idea that if a feature is useful in one place it's also going to be useful in another place, and that allows you to combine evidence from different locations to learn a shared feature detector, that is, to learn replicated feature detectors that are the same in all these places. That's a huge win: it makes them much more data efficient. Those are the things Yann got working in the 1990s; they were one of the few things that worked really well in the 1990s, and they work even better now. But I don't think they're the way people do vision. One aspect of them, the replicated apparatus, is clearly true of the brain, but they don't recognize objects the same way we do, and that leads to adversarial examples. If I give you a big database, a convolutional neural net will do very well, it may do better than a person, but it doesn't recognize things the same way a person does, so I can change things in a way that will cause the convolutional neural net to change its mind, and a person can't even see the changes I've made. They're using things much more like texture and color; they're not using the geometrical relationships between objects and their parts. I'm convinced that the main way people recognize objects, while they obviously use texture and color, relies on being very well aware of the geometrical relationships between an object and its parts, and those geometrical relationships are completely independent of viewpoint. That gives you something very robust, something you should be able to train with much less data.

I actually can't resist doing a little demonstration to convince you that when you understand objects, it's not just when you're being a scientist that you use coordinate frames; even when you're just naively thinking about objects you impose coordinate frames on them. So I'm going to do a little demonstration, and you have to participate, otherwise it's no fun. I want you to imagine that sitting on the table top in front of you there's a cube. Here's the top, here's the bottom, here's the cube; it's a wireframe cube, like this, made of matte black wires. What I'm going to do with this cube is this: from your point of view there's a front bottom right-hand corner here and a top back left-hand corner here, and I'm going to rotate the cube so that the top back left-hand corner is vertically above the front bottom right-hand corner. So here we are. Now I want you to hold your fingertip in space, probably your left fingertip, where the top vertex of the cube is. Okay, and now nobody's doing it, come on. Now, with your other fingertip, I just want you to point to where the other corners of the cube are, the ones that aren't resting on the table.
So there's one corner on the table and one vertically above it here; where are the other corners? You have to do it, you have to point them out. Now, I can't see what you're doing, but I know that a large number of you will have pointed out four other corners, because I've done this before. Now imagine a cube in the normal orientation and ask how many corners it has. It's got eight corners, right? So where are the other six? What most people do is say: here, here, here and here. What's the problem? Well, the problem is that's not a cube. What you've done is preserve the four-fold rotational symmetry that a cube has, and point out a completely different shape, a shape that has the same number of faces as a cube has corners and the same number of corners as a cube has faces. It's the dual of a cube, with corners substituted for faces. Because you like symmetry so much, you're prepared to really mangle things to preserve the symmetries. Actually, a cube has three edges coming down like that and three edges coming up like that, and my six fingertips are where the corners are, and people just can't see that unless they're crystallographers or very clever.

The main point of this demo is that by doing this rotation I forced you to use an axis for the cube, as the main axis defining its orientation, that was not one of the axes of the coordinate frame you usually use for a cube, and by forcing you to use an unfamiliar coordinate frame I destroyed all your knowledge about where the parts of a cube are. You understand things relative to coordinate frames, and if I get you to impose a different coordinate frame, it's just a different object as far as you're concerned. Convolutional nets don't do that, and because they don't do that I don't think they're the way people perceive shapes. We've recently managed to make neural nets do that by doing some self-supervised training, and there's an arXiv reference there which, if you're very quick, you can get, or I'll send out a tweet about it later.

The last thing I want to say is not about shape recognition in particular but about the future of neural networks. There's something very funny and very unbiological that we've been doing for the last 50 years, which is that we've only been using two time scales: you have neural activities, which change rapidly, and you have weights, which change slowly, and that's it. But we know that in biology synapses change at all sorts of time scales, and the question is what happens if you introduce more time scales. In particular, let's introduce just one more, and say that in addition to the weights changing slowly, which is what's going on in long-term learning, the very same weights, the very same synapses, have an extra component that can change more rapidly and that decays quite rapidly. If you ask where your memory is of the fact that a minute ago I put my finger on this corner here, is it in a bunch of neurons that are sitting there being active so that you can remember it? That seems unlikely. It's much more likely that your memory for this is in fast modifications to the weights of the neural network that allow you to reconstruct it very rapidly, and that decay with time. So you've got a memory that's in the weights, and it's a short-term memory.
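To make the two-time-scale idea concrete, here is a minimal sketch (my own illustration, not the specific model he refers to) of an associative memory whose weights have a slow component plus a fast, Hebbian, rapidly decaying component; the update rule, decay rate, and vector sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                        # size of the activity vectors

W_slow = rng.standard_normal((d, d)) * 0.01   # slow weights: long-term learning (held fixed here)
W_fast = np.zeros((d, d))                     # fast weights: quickly written, quickly decaying
decay, write_rate = 0.9, 0.5

def store(W_fast, pattern):
    # Temporarily store a pattern as a Hebbian change to the fast component.
    return decay * W_fast + write_rate * np.outer(pattern, pattern)

def recall(W_fast, cue, steps=5):
    # Settle using slow plus fast weights to reconstruct the recently stored pattern.
    x = cue.copy()
    for _ in range(steps):
        x = np.tanh((W_slow + W_fast) @ x)
    return x

# Store a recent "state" (e.g. what the net was doing before a recursive call),
# then recover it from a noisy cue while the fast component is still fresh.
state = np.sign(rng.standard_normal(d))
W_fast = store(W_fast, state)
noisy_cue = state + 0.5 * rng.standard_normal(d)
print("correlation with stored state:", np.corrcoef(recall(W_fast, noisy_cue), state)[0, 1])
```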
As soon as you do that, all sorts of good things happen. You can use it to get a better optimization method, and you can use it to do something that may very well be relevant to reasoning: you can use it to allow neural networks to do true recursion, not very deep, but true recursion. What I mean by true recursion is that when you do the recursive call, like processing a relative clause in a sentence, the neural net can use all the same neurons and all the same weights that it was using for the whole sentence to process the relative clause. Of course, to do that it somehow has to remember what was going on when it decided to process the relative clause; it has to store that somewhere, and I don't think it stores it in other neurons, I think it stores it in temporary changes to synapse strengths. When it's finished processing the relative clause, it packages it up and basically says, now what was I doing when I started this processing? And it can get that information back from the associative memory in the fast weights.

I wanted to finish with that because the very first talk I gave, in 1973, was about exactly that. I had a system that worked on a computer with 64K of memory. I haven't got around to publishing it yet, but I think it's becoming fashionable again. And that's the end of my talk, and I'm out of time.