Yann LeCun Turing Award Lecture 2019 https://www.youtube.com/watch?v=psl1wY2V-L0

Okay, I'll talk about the sequel, but I'll start with a little bit of history and go through some of the things that Geoff just mentioned. Geoff talked about supervised learning, and supervised learning works amazingly well if you have lots of data — we all know this. We can do speech recognition, image recognition, face recognition, we can generate captions for images, we can do translation; that works really well. And if you give your neural net a particular structure, something like a convolutional net, as Geoff mentioned, in the late '80s and early '90s we could train systems to recognize handwriting. That was quite successful: by the end of the '90s, a system of this type that I built at Bell Labs was reading something like 10 to 20 percent of all the checks in the US. So it was a big success, even a commercial success. But by that time the entire community had basically abandoned neural nets — partly because of the lack of large datasets on which they could work, partly because the software you had to write at the time was fairly complicated and it was a big investment to do this, and partly because computers were not fast enough for all kinds of other applications.

Convolutional nets really are inspired by biology. They're not copying biology, but there is a lot of inspiration from biology, from the architecture of the visual cortex, and from ideas that come naturally when you study signal processing: the idea that filtering is a good way to process signals, whether they're audio signals or image signals, and convolution is a very natural way to do filtering — so the fact that you find this in the brain is really not that surprising. Those ideas were proposed by Hubel and Wiesel in their classic work in neuroscience back in the '60s, and picked up by Fukushima, a Japanese researcher who tried to build computer models of the Hubel and Wiesel model, if you want. I found that inspiring and tried to reproduce it using neural nets that could be trained with backpropagation — that's basically what a convolutional net is.

The idea behind a convolutional net is that the perceptual world is compositional: in the visual world, objects are formed by parts, parts are formed by motifs, motifs are formed by textures or elementary combinations of edges, and edges are formed by arrangements of pixels. So if you have a system that can hierarchically detect useful combinations of pixels into edges, edges into motifs, and motifs into parts of objects, then you have a recognition system. This idea of hierarchy actually goes back a long time, and that's really the principle of convolutional nets. It turns out that hierarchical representations are good not just for vision but also for speech, for text, and for all kinds of other natural signals that are comprehensible because they are compositional. There is this saying, attributed to Einstein I believe, that what is most mysterious about the world is that it is understandable — and that is probably because of the compositional nature of natural signals.
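To make that layered hierarchy concrete, here is a minimal sketch of a convolutional net in PyTorch — a toy illustration of the stacked convolution-and-pooling idea, not the architecture of any of the systems mentioned in this talk; the layer sizes and class count are made up.

```python
# Minimal sketch of the hierarchical idea behind a convolutional net:
# stacked convolution + pooling stages build edges -> motifs -> parts -> objects.
# Illustrative toy model only; sizes are arbitrary.
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),   # local filters: edges
            nn.ReLU(),
            nn.MaxPool2d(2),                               # spatial pooling
            nn.Conv2d(16, 32, kernel_size=5, padding=2),   # combinations of edges: motifs
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),   # combinations of motifs: parts
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)       # parts -> object category

    def forward(self, x):
        h = self.features(x)                   # (B, 64, 1, 1)
        return self.classifier(h.flatten(1))   # class scores

scores = TinyConvNet()(torch.randn(4, 1, 28, 28))  # e.g. a small batch of handwritten digits
```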
So in the early '90s we were able to build recognition systems like this one. This is a younger version of myself at Bell Labs — that's my phone number at Bell Labs in Holmdel, by the way, no longer in operation. I'm hitting a key here, and the system captures an image with a video camera. This runs on a PC with a special DSP card in it, and it could run those convolutional nets at several hundred characters per second, which at the time was amazing — we could run 20 megaflops, which was just incredible. So that worked really well, and pretty soon we realized we could use this on natural images as well, to do things like detecting faces and eventually detecting pedestrians. That took a few years, but as Geoff mentioned, there was a sort of neural net winter between the mid-'90s and the late 2000s, if you want, when almost nobody was working on neural nets except a few crazy people like us.

That didn't stop us. We worked on face detection, pedestrian detection, and even on using machine learning and convolutional nets for robotics, where we would use a convolutional net to label an entire image in such a way that every pixel would be labeled as traversable or not traversable by a robot. The nice thing about this is that you can collect data automatically — you don't need to label it manually — because using stereo vision and 3D reconstruction you can figure out whether a pixel sticks out of the ground or not. Unfortunately, that only works at short range, so if you want a system that can plan long-range trajectories, you can train a convolutional net to make the traversability predictions using those automatically generated labels and then let the robot drive itself around. This particular robot uses a combination of different features extracted by the convolutional net, plus a fast stereo vision system that allows it to avoid obstacles such as pesky graduate students — Pierre Sermanet and Raia Hadsell, by the way, who are pretty sure the robot is not going to run them over, because they actually wrote the code.

And then a couple of years later we used a very similar system to do semantic segmentation. This is actually the work Geoff was talking about that was rejected from CVPR 2011. This is the system that could, in real time using an FPGA implementation, segment an image — basically give a category for every pixel — at about 30 frames per second at decent resolution. It was far from perfect, but it could label with reasonable accuracy: detect pedestrians, detect the road, the trees, and so on. But the results were basically not immediately believed by the computer vision community.

Now, to measure the progress that has happened since then — in the last 10 years, essentially — this is an example of a result from a really recent system put together by a team at Facebook, which they called the Panoptic Feature Pyramid Network. It's basically a large convolutional net that has one path extracting features at multiple layers and another path generating an output image. The output image identifies and generates a mask for every instance of every object in the image and tells you what category it is. Here the names of the categories are on the display; it can recognize something like a few hundred categories — people, vehicles of various kinds — and not just object categories but also background regions and textures, things like grass, sand, and trees.
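Here is a minimal sketch of the encoder/decoder idea behind these per-pixel segmentation networks — a toy illustration, not the actual Panoptic FPN; the layer sizes and number of classes are made up.

```python
# Rough sketch of the encoder/decoder idea behind segmentation networks:
# one path extracts features at reduced resolution, another path upsamples
# them back to a per-pixel category map. Toy model only.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # 1/2 resolution
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 1/4 resolution
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))   # (B, num_classes, H, W): a score per pixel

logits = TinySegNet()(torch.randn(1, 3, 128, 128))
labels = logits.argmax(dim=1)   # predicted category for every pixel
```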
You would imagine that a system like this would be very useful for things like self-driving cars: if you have a complete segmentation and identification of all the pixels in an image, it makes it easier to build self-driving cars. And not just self-driving cars, but also medical image analysis systems. This is a relatively similar architecture — people sometimes call it a U-Net because of the obvious U shape of the convolutional net. Again, it has an encoder part that extracts features and then a part that constructs the output image, in which the parts of the medical image are segmented. This is the kind of result it produces; this is work by some of my colleagues at NYU — I was not involved in it. A different subgroup of colleagues, with some common co-authors, has also worked on detecting breast cancer from imaging — from X-rays, from mammograms. In fact, one of the hottest topics in radiology these days is using deep learning for medical image analysis; it's probably going to affect, if not revolutionize, radiology in the next few years — it already has to some extent.

Some more work along those directions: this is a collaboration between the NYU medical school and Facebook AI Research on accelerating data collection for MRI. When you go in for an MRI you have to sit in the machine for about an hour, or 20 minutes, depending on the kind of exam you're going through, and this technique, using those kinds of reconstruction networks, allows you to reduce the data collection time and get images that are essentially of the same quality. So it will not put radiologists out of jobs, but it will probably make the job more interesting.

Geoff mentioned work on translation with neural nets. This is, I think, a very surprising and interesting development — the fact that you can use neural nets to do translation — and there is a lot of innovation in the kinds of architectures used for this. Geoff talked about the attention mechanism and the transformer architecture; this is a new one called dynamic convolutions, which recycles a bit of those ideas, and things work really well there. Those networks are very large — they have a few hundred million parameters — so one of the challenges is actually running them on GPUs and having enough memory to do so; we're basically limited by GPU memory there.

Those ideas of image segmentation have been used by people working on self-driving cars, particularly people at Mobileye, which is now part of Intel, going back several years. The first convolutional nets deployed for self-driving cars, or for driving assistance, were, I think, in the 2015 Tesla Model S. NVIDIA has also devoted a large effort to self-driving cars, and there's a lot of interesting work going on there, but progress — I wouldn't say it's slow, but completely autonomous driving is a hard problem; it's not as easy as people thought initially.

Okay, so Geoff kind of brushed away reinforcement learning, but reinforcement learning is something a lot of people are really excited about, particularly people at DeepMind. There is a problem with the current crop of reinforcement learning, though, which is that it's extremely data-inefficient. If you want to train a system to do anything using reinforcement learning, it will have to do lots and lots of trial and error. For example, to get a machine to play classic Atari games to the level that any human can reach in about 15 minutes of training, the machine will have to
play the equivalent of 80 hours of real-time play. To play Go at a superhuman level, it will have to play something like 20 million games. To play StarCraft — this is recent DeepMind work, a blog post, not a paper — the AlphaStar system took the equivalent of 200 years of real-time play to reach human level on a single map, for a single type of player. By the way, all those systems use convnets and various other things, but that's an interesting point. So the problem with reinforcement learning is that those models have to try something to know whether it works, and that's really not practical in the real world if you want to train a robot to grasp things or train a car to drive itself. To train a system to drive a car so that it doesn't run off cliffs, it would actually have to run off a cliff multiple times — first to figure out that it's a bad idea, and second to figure out how not to do it — because it doesn't have a model of the world; it can't imagine what's going to happen before it happens; it has to try things in order to correct itself. That's why it's so inefficient.

So that begs the question: how is it that humans and animals can learn so efficiently, so quickly? Most of us can learn to drive a car in about 20 hours of training with hardly any accidents. How does that happen? We don't run off cliffs, because we have a pretty good intuitive physics model that tells us: if I'm driving next to a cliff and I turn the wheel to the right, the car is going to run off the cliff, it's going to fall, and nothing good is going to come of it. So we have this internal model, and the question is how we learn this internal model — and then the next question is how we get machines to learn internal models like that, basically just by observation.

There is a gentleman called Emmanuel Dupoux in Paris, a developmental psychologist. He works on how children learn language and speech and things like that, but also other concepts, and he made this chart of the age in months at which babies learn basic concepts — things like distinguishing animate objects from inanimate objects, which happens really quickly, around three months old. Then there's the fact that some objects are stable and some will fall — you can measure whether babies are surprised by the behavior of certain objects — and it takes about nine months for babies to figure out that objects that are not supported will fall: basically, gravity. If you show a six-month-old baby the scenario on the top left, where there's a little car on a platform and you push the little car off the platform and the car doesn't fall — it's a trick — six-month-old babies don't even pay attention. That's just another thing the world throws at them that they have to learn; it's fine. A nine-month-old baby will react like the little girl at the bottom left and be very, very surprised, because in the meantime they've learned the concept of gravity. Nobody has really told them what gravity is: they've just observed the world and figured out that objects that are not supported fall, and so when that doesn't happen they get surprised. How does that happen?

It's not just humans — animals have those models too: cats, dogs, rats, orangutans. Here is a baby orangutan being shown a magic trick: an object is put
in a cup, the object is removed without him seeing it, and then the cup is shown to be empty — and he's on the floor laughing. His model of the world was violated: he has a pretty good model of the world that includes object permanence, a very basic concept — objects are not supposed to disappear like that. And when your model of the world is violated, you pay attention, because you're going to learn something about the world you didn't know. If it violates a very basic thing about the world it's funny, but it might also be dangerous — it's something that could kill you, because you just didn't predict what happened.

Okay, so what's the solution, really? How do we get machines to learn this kind of stuff — to learn the huge amount of background knowledge we acquire about the world just by observing, in the first few months of life? Animals do this too. For example, if I train myself to predict what the world is going to look like when I move my head slightly to the left, then because of parallax motion, objects that are nearby and objects that are far away won't move the same way relative to my viewpoint, and so the best way to predict how the world is going to look when I move my head is to internally represent the notion of depth. Conversely, if I train a system to predict what the world is going to look like when it moves its camera, maybe it will learn the notion of depth automatically. Once you have depth, you have objects, because you have objects in front of others and you have occlusion edges; once you have objects, you have things you can influence, things that can move independently of others, and so on. Concepts can build on top of each other like this, through prediction.

That's the idea of self-supervision: it's prediction and reconstruction. I give the machine a piece of data — say, a video clip — I mask a piece of that video clip, and I ask the system to predict the missing part from the part it can observe. That would be video prediction: just predict the future. But the more general form of self-supervised learning is that I don't specify in advance which part I'm going to mask; I just tell the system, I'm going to mask a piece of it, and whatever is masked, I'm asking you to reconstruct it. In fact, I may not even mask it at all; I might just virtually mask it and ask the system to reconstruct the input under certain constraints. The advantage of this self-supervision is that it's not task-dependent: you get the machine to learn about the world without training it for a particular task, and you can learn just by observation without having to interact with the world, which is much more efficient. More importantly, you're asking the system to predict a lot of stuff — not just a value function as in reinforcement learning, where basically the only thing you give the machine to predict is a scalar value once in a while, and not as in supervised learning, where you ask the system to predict a label, which is a few bits. In the case of self-supervised learning, you're asking the machine to predict a lot of stuff.

That led me to this slightly obnoxious analogy — at least for people who work on reinforcement learning — which is the idea that if intelligence, or learning, is a cake, then the bulk of the cake, the génoise as we say in French, is really self-supervised learning. Most of what we learn, most of the
knowledge we accumulate about the world, is learned through self-supervised learning. There's a little bit of icing on the cake, which is supervised learning: we're shown a picture book and told the names of objects, and with just a few examples we know what the objects are; we're taught the meanings of some words, and young children can learn many new words per day. And then the cherry on the cake is reinforcement learning: it's a very small amount of information you're asking the machine to predict, so there's no way the machine can learn purely from that form of learning. It has to be a combination of probably all three forms of learning, but principally self-supervised learning.

This idea is not new — a lot of people have argued for the idea of prediction as the basis of learning, the idea of learning predictive models — and one such person is Geoff, as a matter of fact. This is a quote from him; it's from a few years ago, but he's been saying this for about 40 years, at least for longer than I've known him, and it goes like this: the brain has about 10^14 synapses and we only live for about 10^9 seconds, so we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning, since the perceptual input, including proprioception, is the only place where we can get 10^5 dimensions of constraint per second. If you're asked to predict everything that comes into your senses every fraction of a second, that's a lot of information you have to learn, and that might be enough to constrain all the synapses we have in our brain to learn things that are meaningful.

So the sequel of deep learning, in my opinion, is self-supervised learning. In fact, historically, as Geoff mentioned, the deep learning conspiracy that Yoshua, Geoff, and I started in the early 2000s was focused on unsupervised learning and unsupervised pre-training. It was partly successful, but we kind of put it on the back burner for a while, and it's coming back to the fore now. It's going to create a new revolution — at least that's my prediction — and the next revolution will not be supervised. I have to thank Alyosha Efros for this slogan; he invented it, inspired of course by Gil Scott-Heron's "the revolution will not be televised" — you can even get a t-shirt with it now.

So what is self-supervised learning, really? It's basically filling in the blanks, and it works really well for natural language processing. A method that has become standard in natural language processing over the last year, in models like BERT and others, is that you take a long sequence of words extracted from a corpus of text, you blank out some proportion of the words, and you train a very large neural net — based on those transformer architectures, or various other architectures — to predict the missing words. In fact it cannot exactly predict the missing word, so you're asking it to predict a distribution over the entire vocabulary: the probability that each word may occur at those locations. That's a special case of what we call a masked autoencoder: give the system an input and ask it to reconstruct the parts of the input that are not present.
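As a rough illustration, here is a toy sketch of that masked-word objective in PyTorch — the spirit of BERT-style pretraining, not the actual BERT code; the vocabulary size, masking rate, and network dimensions are all made up.

```python
# Toy sketch of the masked-word objective: blank out a fraction of the tokens and
# train the network to output a distribution over the vocabulary at masked positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, mask_id = 1000, 128, 0   # hypothetical sizes; id 0 reserved for [MASK]
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
to_vocab = nn.Linear(d_model, vocab_size)

tokens = torch.randint(1, vocab_size, (8, 32))       # batch of 8 sequences, 32 words each
mask = torch.rand(tokens.shape) < 0.15               # blank out ~15% of the words
corrupted = tokens.masked_fill(mask, mask_id)

logits = to_vocab(encoder(embed(corrupted)))          # (8, 32, vocab_size)
loss = F.cross_entropy(logits[mask], tokens[mask])    # predict only the missing words
loss.backward()
```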
People have been trying to do this in the context of image recognition as well. There have been various attempts — this is work from Pathak et al. from a few years ago, where you blank out some pieces of an image and then ask the system to fill them in — and it's only partially successful, not nearly as successful as in the context of natural language processing. Natural language processing has seen a revolution over the last year from using those pre-trained systems for natural language understanding, translation, all kinds of stuff, and the performance is amazing. They're very, very big models, but they really work very well. There were early indications of this in work that Yoshua Bengio did a long time ago, in the '90s, and that Ronan Collobert and Jason Weston did around 2010 using neural nets for NLP, and then in more recent work — word2vec, fastText, and so on — which used this idea of predicting words from their context. But now this whole idea has completely taken off.

So why does it work for natural language processing, and why does it not work so well in the context of images and vision? I think it's because of how we represent uncertainty — or rather, how we do not represent it. Let's say we want to do video prediction. We have short video clips with a few frames — in this case a little girl approaching a birthday cake — and we ask the machine to predict the next few frames of the video. If you train a large neural net to predict the next few frames using least-squares error, what you get are blurry predictions. Why? Because the system cannot exactly predict what's going to happen, and so the best it can do is predict the average of all the possible futures. To be more concrete, let's say all the videos consist of someone putting a pen on the table and letting it go, and every time you repeat the experiment the pen falls in a different direction and you can't really predict which one. Then if you predict the average of all the outcomes, you get a transparent pen superimposed on itself in all possible orientations. That's not a good prediction.

So if you want a system that can represent multiple predictions, it has to have what's called a latent variable. You have a function implemented by a neural net; it takes the past — say, a few frames from a video — and it wants to predict the next few frames. It has to have an extra variable, here called z, such that when you vary this variable, the output varies over a particular set of plausible predictions. That's called a latent variable model. The problem with training those things is that there are basically only two ways — two families of ways — that we know of to train them. One is a very cool idea from Ian Goodfellow and his collaborators in Montreal from a few years ago, called adversarial training, or generative adversarial networks. The idea of GANs, generative adversarial networks, is to train a second neural net to tell the first neural net whether its prediction is on this manifold — this set of plausible futures — or not, and you train the two networks simultaneously. The other technique consists of inferring what the ideal value of the latent variable would be to make a good prediction; but if you do this, there is the danger that the latent variable will capture all the information there is to capture about the prediction, and no information will actually be used from the past to make that prediction — so you have to regularize this latent variable.
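Here is a minimal sketch of a latent-variable predictor trained adversarially — the GAN idea in spirit, not any specific published model; the dimensions, the stand-in data, and the fully connected networks are all made up for illustration.

```python
# Minimal adversarial sketch: a generator maps (past, z) to a predicted future,
# and a second network is trained to tell real futures from generated ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

past_dim, future_dim, z_dim = 64, 64, 8   # hypothetical toy dimensions
G = nn.Sequential(nn.Linear(past_dim + z_dim, 128), nn.ReLU(), nn.Linear(128, future_dim))
D = nn.Sequential(nn.Linear(past_dim + future_dim, 128), nn.ReLU(), nn.Linear(128, 1))

past, future = torch.randn(16, past_dim), torch.randn(16, future_dim)  # stand-in data
z = torch.randn(16, z_dim)                      # varying z varies the predicted future
fake = G(torch.cat([past, z], dim=1))

# Discriminator objective: real futures -> 1, generated futures -> 0.
d_loss = (F.binary_cross_entropy_with_logits(D(torch.cat([past, future], 1)), torch.ones(16, 1))
          + F.binary_cross_entropy_with_logits(D(torch.cat([past, fake.detach()], 1)), torch.zeros(16, 1)))
# Generator objective: fool the discriminator into calling its prediction plausible.
g_loss = F.binary_cross_entropy_with_logits(D(torch.cat([past, fake], 1)), torch.ones(16, 1))
# In a real training loop you would alternately update D with d_loss and G with g_loss.
```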
Okay, so those ideas — things like adversarial training — work really well. What you see here at the bottom is a video prediction for a small, short clip where the system has been trained with this adversarial training, and there are various ways of doing those predictions, not just in pixel space but also in the space of objects that have already been segmented. Those generative adversarial networks can also generate images that are used as a kind of assistance for artistic production. These are non-existent faces: you have a system here that has been trained to produce an image that looks like a celebrity, and after the system is trained you feed it a few hundred random numbers and out comes a face that doesn't exist — and they look pretty good. This is work by NVIDIA that was presented this year. You can use this to produce all kinds of different things — clothing, for example, by training on a collection of clothes from a famous designer.

So I think we need new ways of formulating this problem of unsupervised learning, so that our systems can deal with this uncertainty in the prediction in the context of continuous, high-dimensional spaces. We don't have the problem in the context of natural language processing, because it's easy to represent a distribution over words: it's just a discrete distribution, a long vector of numbers between zero and one that sum to one. But it's very hard in continuous, high-dimensional spaces, and so we need new techniques for this. One technique I'm proposing is something called energy-based self-supervised learning. Imagine that your world is two-dimensional — you have only two input variables, two sensors — and your entire training set is composed of those dots here in this two-dimensional space. What you'd like is to train a contrast function, let's call it an energy, that gives low energy to points that are on the manifold of data and higher energy outside. There is a lot of research to do to find the best method for this; my favorite one is what I call regularized latent variable models, and we had some success about 10 years ago using techniques of this type to learn features completely unsupervised. What you see on the left here is an animation of a system that learns oriented filters just by being trained on natural image patches to reconstruct them under sparsity constraints, and what you see on the right are filters of a convolutional net learned with the same algorithm with different numbers of filters. Those things kind of work; they don't beat supervised learning if you have tons of labeled data, but the hope is that they will reduce the amount of labeled data necessary.
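As a very small sketch of that regularized-reconstruction idea — in the spirit of sparse feature learning, not the exact algorithm behind those animations; the patch size, number of filters, and sparsity weight are made up — you can reconstruct image patches through a code that is pushed to be sparse, and the decoder's columns tend to become oriented, filter-like basis functions.

```python
# Tiny sketch of sparse feature learning: reconstruct patches through a sparse code.
import torch
import torch.nn as nn

patch_dim, n_filters, sparsity = 12 * 12, 64, 0.5    # hypothetical sizes
encoder = nn.Linear(patch_dim, n_filters)
decoder = nn.Linear(n_filters, patch_dim, bias=False)  # columns play the role of filters
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(1000):
    patches = torch.randn(128, patch_dim)      # stand-in for real natural image patches
    code = torch.relu(encoder(patches))        # non-negative code
    recon = decoder(code)
    # reconstruction error + sparsity penalty on the code
    loss = ((recon - patches) ** 2).mean() + sparsity * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# decoder.weight has shape (patch_dim, n_filters); each column is a learned basis function.
```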
I'm going to end with an example of how to combine all of this to get a machine to learn something useful, like a task — a motor task. What I'm talking about here is: can we train a machine to learn to drive just by observing other people driving, and by training a model of what goes on in the world? You are in your car, you can see all the cars around you, and if you can predict what the cars around you are going to do ahead of time, then you can drive defensively. You can decide to stay away from a car because you see it swerving, or decide to slow down because the car in front of you is likely to slow down, because there's another car in front of it that is slowing down. So you have all those predictive models that basically keep you safe, and you've learned to integrate them over time — you don't even have to think about it; it's just in your reflexes of driving; you can talk at the same time and it still works.

The way to train a system like this is that you first have to train a forward model. A forward model says: here is the state of the world at time t; give me a prediction of the state of the world at time t+1. The problem with this, of course, is that the world is not deterministic — there are a lot of things that could happen; it's the same problem I was talking about with the pen. But if you had such a forward model, you could run it for multiple time steps, and then, if you had an objective function — how far you are from the other cars, whether you are in lane, things like this — you could backpropagate gradients through this entire system to train a neural net to predict a course of action that will be safe over the long run. And this can be done completely in your head: if you have a forward model in your head, you don't actually have to drive to train yourself to drive; you can just imagine all of those things.

So here's a specific example. You put a camera looking down on a highway; it follows every car and extracts a little rectangle around each car — that's what you see at the bottom. What you do now is train a convolutional net to take a few frames centered on a particular car and predict the next state of the world. If you do this, you get the second column: the column on the left is what happens in the real world, and the second column is what happens if you just train a convolutional net with least squares to predict what's going to happen — it can only predict the average of all the possible futures, and so you get blurry predictions. If you now transform the model so that it has a latent variable that allows it to take into account the uncertainty about the world — I'm not going to explain exactly how that works — then you get the prediction you just saw on the right, where for every drawing of this latent variable you get a different prediction, but they're crisp.

So now, to do the training I was telling you about earlier, you sample this latent variable — so you get different possible scenarios of what's going to happen in the future — and then, through backpropagation, you train your policy network to get the system to drive. If you do this, it doesn't work. It doesn't work because the system goes into regions of the state space where the forward model is very inaccurate and very uncertain. What you have to do is add another term to the objective function that prevents the system from going into parts of the space where its predictions are bad — it's like an inverse curiosity constraint, if you want. And if you do this, it works.
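Here is a schematic of that training loop — a rough sketch with made-up toy costs and dimensions, not the actual model from this work: roll a learned forward model out for several steps, sample the latent variable at each step, and backpropagate a driving cost plus an uncertainty penalty through the rollout into the policy network.

```python
# Schematic of backpropagating through a learned forward model to train a policy.
# The forward model is assumed to have been trained beforehand from observed driving
# data and is held fixed here; only the policy is updated. All sizes are illustrative.
import torch
import torch.nn as nn

state_dim, action_dim, z_dim, horizon = 32, 2, 4, 10   # hypothetical sizes
forward_model = nn.Sequential(nn.Linear(state_dim + action_dim + z_dim, 64), nn.Tanh(),
                              nn.Linear(64, state_dim))
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def proximity_cost(state):      # placeholder for "how close am I to other cars / out of lane"
    return (state ** 2).mean()

def uncertainty_cost(state):    # placeholder penalty keeping rollouts where the model is reliable
    return 0.01 * state.abs().mean()

state = torch.randn(16, state_dim)          # batch of observed initial states
total_cost = 0.0
for t in range(horizon):                    # imagine the future entirely "in your head"
    action = policy(state)
    z = torch.randn(16, z_dim)              # sample the latent variable: one plausible future
    state = forward_model(torch.cat([state, action, z], dim=1))
    total_cost = total_cost + proximity_cost(state) + uncertainty_cost(state)

opt.zero_grad()
total_cost.backward()                       # gradients flow through the whole rollout
opt.step()
```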
These are examples where the blue car is driving itself. The little white dot indicates whether it accelerates, brakes, or turns, and it keeps itself safely away from the other cars — the other cars can't see it; the blue car is invisible here. Let me show you another example: here the yellow car is the actual car in the video, and the blue car is what the trained agent is doing. It's being squeezed between two cars, so it has to escape, because the other cars don't see it — so it squeezes out. It works reasonably well, and basically that system has never interacted with the real world: it has just watched other people drive, and then it used that to train its action plans — basically its policy.

Okay, now I'm going to go a little philosophical, if you want. Throughout the history of technology and science there has been this phenomenon — it's not universal, but it's pretty frequent — where people invent an artifact and then derive a science out of that artifact, to explain how it works or to figure out its limitations. A good example is the invention of the telescope in the 1600s: optics was not developed until at least 50 years later, but people had a good intuition of how to build telescopes before that. The steam engine was invented in the late 1600s and early 1700s, and thermodynamics came more than 100 years later, basically designed to explain the limitations of thermal engines — and thermodynamics is now the foundation of one of the most fundamental intellectual constructions in all of science. So it was purposely devised to explain a particular artifact; that's very interesting. The same thing happened with electromagnetism and electrodynamics; with the invention of sailboats and airplanes and then aerodynamics; with the invention of compounds and then chemistry; and so on. Computers and computer science came after the invention of computers; information theory came after the invention of the first digital communication through radio and teletype and things like that. So it's quite possible that over the next few decades we'll have empirical systems that are built by trial and error — perhaps by systematic optimization on powerful machines, perhaps by intuition and empirical work, perhaps with a little bit of theory, perhaps a lot of theory, hopefully — and the question is whether this will lead to a whole theory of intelligence. The fact that we can build an artifact that is intelligent might lead to a general theory of information processing and intelligence. That's kind of a big hope; I'm not sure it will be realized over the next few decades, but it's a good program.

A word of caution about biological inspiration. Neural nets are biologically inspired, convolutional nets are biologically inspired — but they're just inspired, not copied. Let me tell you the story of a gentleman called Clément Ader. Are there any French people in the room? Can you raise your hands, French people? A couple. Have you heard of Clément Ader? You have, okay. Is there anyone who is not French who has heard of Clément Ader? One person, two people — basically nobody; you have no idea who he is. So this guy built, in the late 1800s, a bat-shaped, steam-powered airplane — he was a steam engine designer — and his airplane actually took off under its own power, 13 years before the Wright brothers flew, for about 50 meters at about 50 centimeters of altitude, and then crash-landed. It was basically uncontrollable. The guy had basically just copied bats and assumed that because it had the shape of a bat it would just fly. That might seem a little naive — it was not naive at all — but he stuck a little too close to biology, got sort of hypnotized by it, and didn't do things like build a model or
a glider or a kite or a wind tunnel, like the Wright brothers did. So he stuck a little too close to biology. On the other hand, he had a big legacy, which is that his second airplane was called the Avion — and that's actually the word for airplane in French, Spanish, and Portuguese. So he had some legacy. But he was kind of a secretive guy — this was before the open-source days — and that's why you've never heard of him. Thank you very much.