I should start off this post by disclaiming: I am genuinely out of my depth in a lot of this, and what I say has a higher percentage chance than usual of just being straight up Wrong.
A group of researchers (Szegedy et al.) made a splash a few months ago with their paper “Intriguing Properties of Neural Networks.” Link is to a PDF. The short version of the story is:
- Neural Networks are a long-standing AI concept where basically you set up a kind of generic learning algorithm, and instead of explicitly programming it to, say, recognize images of cats, you let it “learn” by showing it several thousand pictures of cats and several thousand pictures of not-cats and then it hopefully figures it out.
- They’ve been the topic of lots of research since at least the 1980s, and the basic idea goes back to the 1950s with the perceptron. For most of that time, they didn’t work very well. They had some intriguing properties, but nothing commercializable.
- But in the 21st century, a variation called Deep Learning has had a lot of success, and it is behind many of the recent gains in image and speech recognition. (It often gets lumped in with systems like IBM’s Watson, the one that won Jeopardy forever, though Watson’s win mostly relied on other techniques.) And now you can, for example, have a program that will reasonably identify pictures of cats.
- A problem that neural networks have always had is “overtraining” (more commonly called “overfitting”). That is, you show one a thousand pictures of cats, and it gets focused on little details that happen, by chance, to be shared by a large percentage of your thousand “training” pictures but are not genuinely features of cats. Like, this pixel in the lower left is red in 70% of your 1000 pictures, just by chance, so the network decides that this is a feature of “photos of cats.”
- It’s hard to tell when something is overtrained, because neural networks are this organic, not-very-understandable code. There isn’t one place in it where it’s like, “Oh, and here’s where I’m looking for whiskers.”
- That said, Deep Learning algorithms have been pretty resistant to the problems of overtraining (at least, as far as we can tell)…
- The paper we’re talking about here found that you can examine a neural net that’s supposed to be identifying pictures and build an “adversarial sample”: a tiny, carefully computed perturbation, spread across the pixels the algorithm weights most heavily, that you apply like a filter to, say, an image of a dump truck. The resulting image looks, to a human, basically identical to the original, but the deep learning algorithm will very confidently say, “Oh, no, that’s a picture of a cat.”
- And then the internet was filled with WTFs.
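To make the mechanism concrete, here’s a toy sketch of the idea. All the names and numbers here are mine, not the paper’s, and I’m standing in a single linear scorer for a real deep network; the perturbation rule (“nudge every input a tiny step against the weights”) is a deliberately simplified relative of what the authors actually do:

```python
import numpy as np

# Toy stand-in for a trained network: a linear scorer over 1000 "pixels".
# score > 0 means "cat", score < 0 means "not-cat".
rng = np.random.default_rng(0)
n = 1000
w = rng.normal(size=n)     # the (fixed, known) learned weights
x = rng.normal(size=n)     # some innocent input image, flattened

def score(img):
    return float(w @ img)

# Adversarial perturbation: nudge every pixel by at most eps, each in the
# direction that pushes the score toward the opposite class.
eps = 0.2
x_adv = x - eps * np.sign(w) * np.sign(score(x))

# No pixel changed by more than eps, so to a human the two images are
# near-identical...
assert np.max(np.abs(x_adv - x)) <= eps + 1e-12

# ...but the score moves by eps * sum(|w|), which is huge in high dimensions.
print(score(x), score(x_adv))
```

The punchline is dimensionality: a change of at most eps per pixel adds up to a swing of eps times the sum of all the weight magnitudes, so with thousands of inputs an imperceptible filter can push the scorer to a confident wrong answer.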
So a lot of discussion has been spent on how serious a flaw this is, in practice, for deep learning neural networks, which is not a topic I intend to address. I’m more interested in what this result actually means.
The initial thing everyone says is, “well, this is overtraining.” (To be clear, an “adversarial sample” is one specifically designed to fool this particular algorithm, with knowledge of how the algorithm works.) From some perspectives, it does look a lot like overtraining: you’d expect an overtrained algorithm to heavily weight a particular pixel and give it more freight than it can really carry. On the other hand, these algorithms do do a pretty good job with non-adversarial samples, which traditionally overtrained networks have not.
So, I think that the actual deal is that it’s overtraining on kind of a micro level. That is, when a deep learning algorithm learns a photoset, I think what’s happening is that it learns about a ton of different features of images, with lots of flaws in how it learns about them. Like, it looks for several hundred or several thousand features of a photo, and some number of those features are just total bullshit, and lots of other ones are semi-bullshit. But in a normal sample, the errors typically cancel each other out, and you’re sort of left with a wisdom of crowds situation in which a lot of people who are individually very wrong are collectively pretty right. An adversarial sample basically targets the flaws and lets the wrong voices crowd out the right.
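The “wisdom of crowds vs. targeted flaws” arithmetic can be simulated directly. This is entirely made-up numbers, just to illustrate the point: many detectors share a weak true signal plus their own fixed bogus quirk. On a normal input the quirks fire in random directions and cancel; an adversary who knows the quirks can make every one of them fire the same wrong way:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
quirk = rng.normal(size=n)   # each detector's private, bullshit "feature"
true_signal = 0.5            # the real evidence all detectors share (positive = "cat")

def ensemble_output(context):
    # context[i] is +/-1: which direction detector i's quirk happens to fire
    return (true_signal + quirk * context).mean()

# Normal input: quirks fire in random directions and mostly cancel out.
normal = ensemble_output(rng.choice([-1.0, 1.0], size=n))

# Adversarial input: crafted so every quirk pushes against the true signal.
adversarial = ensemble_output(-np.sign(quirk))

print(normal)        # stays near +0.5: the crowd is collectively right
print(adversarial)   # goes negative: the wrong voices crowd out the right
```

Each individual detector is quite noisy either way; what the adversary changes is only the *correlation* of the errors, which is exactly the sense in which this is overtraining at a micro level rather than the classic macro kind.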
Now, how does that relate to the ultimate goal of making better AI? I think it highlights the great undiscovered country of AI: when humans look at a bunch of pictures of cats, they probably do on some level notice “oh, these are similarities in these photos,” but we discard many or all of the “wrong” similarities by applying an understanding of the world that deep learning neural networks do not have. We can say, “it’s hard to tell where the cat ends and the background begins in this photo, but I know not to imagine that cats just blur into the background, because even if I’ve never seen a cat before, I know that almost all objects do not just blur into their environment, and I can categorize the cat as a kind of solid object rather than, say, a gas.” And we do that kind of error correction based not just on visual input, but on assigned meaning. And nobody really knows how that assigned-meaning stuff happens. Maybe it’s just another level of neural network with a much larger training set.
Anyhow, using that meaning stuff, we then weed out the adversarial samples from our own visual processing. But more to the point, we also don’t just end up looking at “features” of cats like “there are some pixels here,” but rather we string the similarities together into concepts like “whiskers” that themselves have meaning. And then that is what allows us to, say, look at a bunch of photos of cats and then see a stylized line drawing of a cat and understand that both refer to the same concept.
Or, again, I want to be clear: Maybe I’m just massively wrong.