神经网络、流形、拓扑

原文(Original article): “Neural Networks, Manifolds, and Topology”, http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

topology, neural networks, deep learning, manifold hypothesis

关键词:拓扑、神经网络、深度学习、流形假设

Recently, there’s been a great deal of excitement and interest in deep neural networks because they’ve achieved breakthrough results in areas such as computer vision.1

近年来,由于深度神经网络在计算机视觉等领域取得了突破性进展,深度学习已然成为一个令人兴奋、备受关注的领域1。

However, there remain a number of concerns about them. One is that it can be quite challenging to understand what a neural network is really doing. If one trains it well, it achieves high quality results, but it is challenging to understand how it is doing so. If the network fails, it is hard to understand what went wrong.

然而,关于它们仍有一些令人担忧的问题。其中之一是,很难理解神经网络到底在做什么:如果训练得好,就能得到高质量的结果,但很难理解它是如何做到的;如果网络失败了,也很难理解到底哪里出了问题。

While it is challenging to understand the behavior of deep neural networks in general, it turns out to be much easier to explore low-dimensional deep neural networks – networks that only have a few neurons in each layer. In fact, we can create visualizations to completely understand the behavior and training of such networks. This perspective will allow us to gain deeper intuition about the behavior of neural networks and observe a connection linking neural networks to an area of mathematics called topology.

虽然一般来说很难理解深度神经网络的行为,但研究低维的深度神经网络(每层只有几个神经元的网络)要容易得多。事实上,我们可以通过可视化来完全理解这种网络的行为和训练过程。这种视角能让我们对神经网络的行为获得更深的直觉,并观察到神经网络与数学中一个叫做拓扑学的领域之间的联系。

A number of interesting things follow from this, including fundamental lower-bounds on the complexity of a neural network capable of classifying certain datasets.

由此可以得出许多有趣的结论,其中包括:能够对某些数据集进行分类的神经网络,其复杂度存在基本的下界。

A Simple Example

一个简单的例子

Let’s begin with a very simple dataset, two curves on a plane. The network will learn to classify points as belonging to one or the other.

让我们先研究一组简单的数据:平面内的两条曲线。神经网络会学习将数据点归类为这两条曲线中的一条。

The obvious way to visualize the behavior of a neural network – or any classification algorithm, for that matter – is to simply look at how it classifies every possible data point.

可视化神经网络(或任何分类算法)行为的最直接方法,就是看它如何对每一个可能的数据点进行分类。

We’ll start with the simplest possible class of neural network, one with only an input layer and an output layer. Such a network simply tries to separate the two classes of data by dividing them with a line.

我们从最简单的神经网络(只有一个输入层和一个输出层)开始研究。这样一个神经网络会尝试将数据用一条直线进行分割归类。

That sort of network isn’t very interesting. Modern neural networks generally have multiple layers between their input and output, called “hidden” layers. At the very least, they have one.

这样的神经网络不是很有意思。当代的神经网络,在输入层(Input)与输出层(Output)之间,一般还有很多(至少有一层)隐藏层(Hidden)。

Diagram of a simple network from Wikipedia
上图为一个简单的神经网络图(取自Wikipedia)

As before, we can visualize the behavior of this network by looking at what it does to different points in its domain. It separates the data with a more complicated curve than a line.

跟之前一样,我们可以通过显示这个网络具体对每个数据点进行了怎样的操作来可视化这个网络的行为。我们发现它用一根更复杂的曲线来分割数据点。

With each layer, the network transforms the data, creating a new representation.2 We can look at the data in each of these representations and how the network classifies them. When we get to the final representation, the network will just draw a line through the data (or, in higher dimensions, a hyperplane).

每经过一层,神经网络都会对数据进行变换,创造出一个新的表象2。我们可以查看数据在每一个表象中的样子,以及网络如何对它们进行分类。当我们到达最终的表象时,网络只需画一条直线(在更高维的情况下,则是一个超平面)来分割数据。

In the previous visualization, we looked at the data in its “raw” representation. You can think of that as us looking at the input layer. Now we will look at it after it is transformed by the first layer. You can think of this as us looking at the hidden layer.

在前面的可视化里,我们是在初始的表象里看数据的状况(这就好比我们在输入层里看数据)。现在,我们看看数据经过第一层的变换后的样子(这就好比我们在隐藏层里看数据)。

Each dimension corresponds to the firing of a neuron in the layer.

每一个维度对应于该层中一个神经元的激活。

The hidden layer learns a representation so that the data is linearly separable
隐藏层学会了一种表象,使得数据变得线性可分。

Continuous Visualization of Layers

对层的连续可视化

In the approach outlined in the previous section, we learn to understand networks by looking at the representation corresponding to each layer. This gives us a discrete list of representations.

在上一节介绍的方法中,我们通过查看每一层对应的表象来理解网络。这给了我们一个离散的表象序列。

The tricky part is in understanding how we go from one to another. Thankfully, neural network layers have nice properties that make this very easy.

棘手的部分在于理解我们如何从一个表象变换到另一个。所幸,神经网络层具有很好的特性,使这一点变得非常容易。

There are a variety of different kinds of layers used in neural networks. We will talk about tanh layers for a concrete example. A tanh layer tanh(Wx+b) consists of:

  1. A linear transformation by the “weight” matrix W
  2. A translation by the vector b
  3. Point-wise application of tanh.

神经网络里有各种各样的层。作为一个具体的例子,我们来看 tanh 层(下文附有一个示意性的代码草图)。一个 tanh 层(\tanh(Wx+b))由如下部分组成:

  1. 由矩阵W进行的线性变换,
  2. 由矢量b进行的平移变换,
  3. 对每一个元素取tanh。
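
To make this concrete, here is a minimal numpy sketch of such a tanh layer (illustrative only, not from the original article); the weight matrix W and bias b are arbitrary example values.

为了更具体一些,下面用numpy给出这样一个tanh层的最小示意代码(仅作说明,并非原文内容);其中的权重矩阵W和偏置b只是任意的示例值。

```python
import numpy as np

def tanh_layer(x, W, b):
    """A tanh layer: a linear map by W, a translation by b, then pointwise tanh."""
    return np.tanh(W @ x + b)

# Arbitrary example parameters for a 2-D -> 2-D layer (illustrative only).
W = np.array([[1.5, -0.5],
              [0.3,  1.0]])
b = np.array([0.1, -0.2])

x = np.array([0.4, -0.7])     # one 2-D input point
print(tanh_layer(x, W, b))    # its representation after the layer
```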

We can visualize this as a continuous transformation, as follows:

我们可以把它可视化为一个连续的变换过程,如下图所示:

Gradually applying a neural network layer
逐渐地应用一个神经网络层

The story is much the same for other standard layers, consisting of an affine transformation followed by pointwise application of a monotone activation function.

对于其他标准的层,情况也大同小异:它们都由一个仿射变换和随后逐点应用的单调激活函数组成。

We can apply this technique to understand more complicated networks. For example, the following network classifies two spirals that are slightly entangled, using four hidden layers. Over time, we can see it shift from the “raw” representation to higher level ones it has learned in order to classify the data. While the spirals are originally entangled, by the end they are linearly separable.

我们可以应用这种技术来了解更复杂的网络。例如,以下的神经网络使用四个隐藏层对两个稍微纠缠的螺旋进行分类。随着时间的推移,我们可以看到它从“原始”表象转变为更高级别的表象,以便对数据进行分类。螺旋线最初被缠结在一起,而最终它们被线性分离。
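
If you want to play with this yourself, here is a small numpy sketch (my own illustration, not the article's code) that generates two interleaved spirals; increasing the hypothetical `turns` parameter makes them more entangled, as in the failing example below.

如果你想自己试一试,下面是一个用numpy生成两条相互缠绕的螺旋的小示意代码(这是我自己的草图,并非原文代码);增大示意性的`turns`参数会让它们缠绕得更厉害,就像下文失败的例子那样。

```python
import numpy as np

def two_spirals(n=200, turns=1.0, noise=0.02, seed=0):
    """Two interleaved spirals in the plane; `turns` controls how entangled they are."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.25, 1.0, n) * turns * 2 * np.pi
    spiral_a = np.stack([t * np.cos(t), t * np.sin(t)], axis=1) / t.max()
    spiral_b = -spiral_a                      # the same spiral, rotated by 180 degrees
    X = np.concatenate([spiral_a, spiral_b]) + rng.normal(scale=noise, size=(2 * n, 2))
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

X, y = two_spirals(turns=1.0)   # the slightly entangled case
```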

On the other hand, the following network, also using multiple layers, fails to classify two spirals that are more entangled.

与之对比,下面这个同样使用多个层的网络,却未能对两个纠缠得更厉害的螺旋进行分类。

It is worth explicitly noting here that these tasks are only somewhat challenging because we are using low-dimensional neural networks. If we were using wider networks, all this would be quite easy.

此处值得明确指出:这些任务之所以有些挑战性,是因为我们使用的是低维神经网络。如果我们使用更宽的网络,这一切都会非常容易。

(Andrej Karpathy has made a nice demo based on ConvnetJS that allows you to interactively explore networks with this sort of visualization of training!)

(Andrej Karpathy提供了基于ConvnetJS的一个很好的演示,让您可以通过这种训练可视化来交互式地探索神经网络!)

Topology of tanh Layers

tanh层的拓扑结构

Each layer stretches and squishes space, but it never cuts, breaks, or folds it. Intuitively, we can see that it preserves topological properties. For example, a set will be connected afterwards if it was before (and vice versa).

每一层都会拉伸和挤压空间,但它从不剪断、割裂或折叠空间。直观上,我们可以看出它保留了拓扑性质。例如,一个集合如果在变换前是连通的,那么变换后仍然是连通的(反之亦然)。

Transformations like this, which don’t affect topology, are called homeomorphisms. Formally, they are bijections that are continuous functions both ways.

这种不影响拓扑的变换称为同胚(homeomorphism)。严格地说,它们是双向都连续的双射。

Theorem: Layers with N inputs and N outputs are homeomorphisms, if the weight matrix, W, is non-singular. (Though one needs to be careful about domain and range.)

定理:如果权重矩阵W是非奇异的,那么具有N个输入和N个输出的层是同胚。(尽管需要注意定义域和值域。)

Proof: Let’s consider this step by step:

  1. Let’s assume W has a non-zero determinant. Then it is a bijective linear function with a linear inverse. Linear functions are continuous. So, multiplying by W is a homeomorphism.
  2. Translations are homeomorphisms
  3. tanh (and sigmoid and softplus but not ReLU) are continuous functions with continuous inverses. They are bijections if we are careful about the domain and range we consider. Applying them pointwise is a homeomorphism

证明:让我们逐步考虑一下:

  1. 假设W的行列式非零。那么它是一个双射的线性函数,并且其逆也是线性的。线性函数是连续的;所以,“乘以W”是同胚。
  2. 平移变换是同胚。
  3. tanh(以及sigmoid和softplus,但不包括ReLU)是具有连续逆的连续函数。只要我们谨慎处理所考虑的定义域和值域,它们就是双射。逐点应用它们是同胚。

Thus, if W has a non-zero determinant, our layer is a homeomorphism. ∎

因此,如果W具有非零行列式,则我们的层是同胚。 ∎
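
As a quick numerical sanity check of this claim (an illustrative sketch with arbitrary W and b): when det(W) ≠ 0 the layer can be inverted explicitly by undoing each step in reverse order, x = W⁻¹(arctanh(y) − b).

作为对这一结论的一个简单数值验证(示意性草图,W和b为任意示例值):当det(W) ≠ 0时,可以按相反顺序撤销每一步,显式地求出该层的逆:x = W⁻¹(arctanh(y) − b)。

```python
import numpy as np

def layer(x, W, b):
    return np.tanh(W @ x + b)

def layer_inverse(y, W, b):
    # Undo each step in reverse order: arctanh, subtract b, solve against W.
    return np.linalg.solve(W, np.arctanh(y) - b)

W = np.array([[2.0, 1.0],
              [0.5, 1.5]])   # det(W) != 0, so the layer is a homeomorphism
b = np.array([0.3, -0.1])

x = np.array([0.2, -0.4])
y = layer(x, W, b)
print(np.allclose(layer_inverse(y, W, b), x))   # True: the map is invertible
```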

This result continues to hold if we compose arbitrarily many of these layers together.

如果我们把任意多个这样的层复合在一起,这个结果仍然成立。

Topology and Classification

拓扑与分类

A is red, B is blue
上图中,红色代表A,蓝色代表B

Consider a two dimensional dataset with two classes A, B \subset \mathbb{R}^2:

A = \{x | d(x,0) < 1/3\}
B = \{x | 2/3 < d(x,0) < 1\}
让我们考虑一组具有两个类别的二维数据 A, B \subset \mathbb{R}^2:
A = \{x | d(x,0) < 1/3\}
B = \{x | 2/3 < d(x,0) < 1\}
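
In case you want to reproduce this experiment, here is a minimal sampling sketch of A and B (my own illustration, not from the article); the sample sizes are arbitrary.

如果你想复现这个实验,下面是一个对A和B进行采样的最小示意代码(我自己的示例,并非原文内容);样本数量是任意选取的。

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ring(n, r_min, r_max):
    """Sample n points with uniform angle and radius drawn from [r_min, r_max)."""
    r = rng.uniform(r_min, r_max, n)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

A = sample_ring(500, 0.0, 1.0 / 3.0)    # the inner disk:  d(x, 0) < 1/3
B = sample_ring(500, 2.0 / 3.0, 1.0)    # the outer ring:  2/3 < d(x, 0) < 1
X = np.concatenate([A, B])
y = np.concatenate([np.zeros(len(A)), np.ones(len(B))])
```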

Claim: It is impossible for a neural network to classify this dataset without having a layer that has 3 or more hidden units, regardless of depth.

声明:如果没有至少一个具有3个或更多隐藏单元的层,无论深度如何,神经网络都不可能对该数据集进行分类。

As mentioned previously, classification with a sigmoid unit or a softmax layer is equivalent to trying to find a hyperplane (or in this case a line) that separates A and B in the final representation. With only two hidden units, a network is topologically incapable of separating the data in this way, and doomed to failure on this dataset.

如前所述,用sigmoid单元或softmax层进行分类,等同于试图在最终表象中找到一个分离A和B的超平面(在这个例子中是一条直线)。如果每层只有两个隐藏单元,网络在拓扑上就无法以这种方式分离数据,注定在这个数据集上失败。

In the following visualization, we observe a hidden representation while a network trains, along with the classification line. As we watch, it struggles and flounders trying to learn a way to do this.

在下面的可视化中,我们观察一个网络训练时的隐藏表象,以及分类线。可以看到,它挣扎着、笨拙地试图学会做到这一点。

For this network, hard work isn’t enough.
对于这个神经网络,单凭“努力”是不够的。

In the end it gets pulled into a rather unproductive local minimum. Although, it’s actually able to achieve ~80% classification accuracy.

最终,它陷入了一个相当低效的局部极小值。不过,它实际上仍能达到约80%的分类精度。

This example only had one hidden layer, but it would fail regardless.

虽然这个例子只有一个隐藏层,但无论如何它都会失败。

Proof: Either each layer is a homeomorphism, or the layer’s weight matrix has determinant 0. If it is a homeomorphism, A is still surrounded by B, and a line can’t separate them. But suppose it has a determinant of 0: then the dataset gets collapsed on some axis. Since we’re dealing with something homeomorphic to the original dataset, A is surrounded by B, and collapsing on any axis means we will have some points of A and B mix and become impossible to distinguish between. ∎

证明:每一层要么是同胚,要么其权重矩阵的行列式为0。如果它是同胚,那么A仍然被B包围,一条直线无法把它们分开。假设行列式为0:那么数据集会沿某个轴被压扁。由于我们处理的是与原始数据集同胚的东西,A被B包围,沿任何轴压扁都意味着A和B的一些点会混在一起,变得无法区分。 ∎

If we add a third hidden unit, the problem becomes trivial. The neural network learns the following representation:

如果我们添加第三个隐藏单元,问题变得微不足道。神经网络学会了如下的表象:

With this representation, we can separate the datasets with a hyperplane.

有了这个表象,我们可以用超平面分割数据集。

To get a better sense of what’s going on, let’s consider an even simpler dataset that’s 1-dimensional:

为了更好地了解发生了什么,让我们考虑一个更简单的一维数据集:

A = [-\frac{1}{3}, \frac{1}{3}]

B = [-1, -\frac{2}{3}] \cup [\frac{2}{3}, 1]

Without using a layer of two or more hidden units, we can’t classify this dataset. But if we use one with two units, we learn to represent the data as a nice curve that allows us to separate the classes with a line:

如果不使用具有两个或更多隐藏单元的层,我们就无法对这个数据集进行分类。但如果我们使用一个具有两个单元的层,网络就会学到把数据表示成一条漂亮的曲线,使我们能够用一条直线把两类分开:

What’s happening? One hidden unit learns to fire when x > -1/2 and one learns to fire when x > 1/2. When the first one fires, but not the second, we know that we are in A.

发生了什么?一个隐藏单元学会在 x > -1/2 时触发,另一个学会在 x > 1/2 时触发。当第一个触发而第二个没有触发时,我们就知道该点属于A。
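
To see this concretely, here is a tiny sketch with hand-picked (not learned) weights; it uses sigmoid units instead of tanh for readability, and the steepness k is an arbitrary choice of mine.

为了更具体地说明这一点,下面是一个使用手工设定(而非学习得到)权重的小示意代码;为便于阅读,它使用sigmoid单元而不是tanh,陡峭程度k是我任意选取的。

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

k = 20.0                                 # steepness of the units (arbitrary)

def hidden(x):
    h1 = sigmoid(k * (x + 0.5))          # ~1 when x > -1/2
    h2 = sigmoid(k * (x - 0.5))          # ~1 when x >  1/2
    return h1, h2

def score(x):
    h1, h2 = hidden(x)
    return h1 - h2                       # ~1 inside A, ~0 inside B

A = np.array([-1/3, 0.0, 1/3])
B = np.array([-1.0, -0.8, 0.8, 1.0])
print(score(A) > 0.5)                    # [ True  True  True ]
print(score(B) > 0.5)                    # [False False False False]
```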

The Manifold Hypothesis

流形假设

Is this relevant to real world data sets, like image data? If you take the manifold hypothesis really seriously, I think it bears consideration.

这与图像数据这样的现实世界数据集有关吗?如果你真的认真对待流形假设,我认为这值得考虑。

The manifold hypothesis is that natural data forms lower-dimensional manifolds in its embedding space. There are both theoretical3 and experimental4 reasons to believe this to be true. If you believe this, then the task of a classification algorithm is fundamentally to separate a bunch of tangled manifolds.

流形假设认为,自然数据在其嵌入空间中形成较低维度的流形。我们有理论上的3和实验上的4理由相信这是真的。如果你相信这一点,那么分类算法的任务从根本上说就是分离一堆纠缠在一起的流形。

In the previous examples, one class completely surrounded another. However, it doesn’t seem very likely that the dog image manifold is completely surrounded by the cat image manifold. But there are other, more plausible topological situations that could still pose an issue, as we will see in the next section.

在前面的例子中,一个类完全包围另一个类。然而,“狗”的图像流形似乎不太可能被“猫”的图像流形完全包围。不过,还有其他更可信的拓扑情形仍可能带来问题,我们将在下一节中看到。

Links and Homotopy

链接和同伦

Another interesting dataset to consider is two linked tori, A and B:

另一个值得考虑的有趣数据集是两个相互套扣的环面(linked tori)A和B:


Much like the previous datasets we considered, this dataset can’t be separated without using n+1 dimensions, namely a 4th dimension.

就像我们之前考虑的数据集一样,如果不使用n+1个维度(也就是第四维),这个数据集就无法被分开。

Links are studied in knot theory, an area of topology. Sometimes when we see a link, it isn’t immediately obvious whether it’s an unlink (a bunch of things that are tangled together, but can be separated by continuous deformation) or not.

链接(link)是纽结理论(拓扑学的一个分支)的研究对象。有时,当我们看到一个链接时,并不能立即看出它是不是一个非链接(unlink,即一堆纠缠在一起、但可以通过连续变形分开的东西)。

A relatively simple unlink.
一个相对简单的非链接例子。

If a neural network using layers with only 3 units can classify it, then it is an unlink. (Question: Can all unlinks be classified by a network with only 3 units, theoretically?)

如果一个每层只有3个单元的神经网络能够对其进行分类,那么它就是一个非链接。(问题:理论上,是否所有非链接都能被每层只有3个单元的网络分类?)

From this knot perspective, our continuous visualization of the representations produced by a neural network isn’t just a nice animation, it’s a procedure for untangling links. In topology, we would call it an ambient isotopy between the original link and the separated ones.

从纽结理论的角度来看,我们对神经网络所产生表象的连续可视化,不仅仅是一个漂亮的动画,更是一个解开链接的过程。在拓扑学中,我们称之为原始链接与分离后的链接之间的背景同痕(ambient isotopy)。

Formally, an ambient isotopy between manifolds A and B is a continuous function F: [0,1] \times X \to Y such that each F_t is a homeomorphism from X to its range, F_0 is the identity function, and F_1 maps A to B. That is, F_t continuously transitions from mapping A to itself to mapping A to B.

正式地说,流形A和B之间的背景同痕是一个连续函数F: [0,1] \times X \to Y,使得每个F_t都是从X到其值域的同胚,F_0是恒等函数,F_1将A映射到B。也就是说,F_t从“把A映射到自身”连续地过渡到“把A映射到B”。

Theorem: There is an ambient isotopy between the input and a network layer’s representation if: a) W isn’t singular, b) we are willing to permute the neurons in the hidden layer, and c) there is more than 1 hidden unit.

定理:如果满足以下条件,则输入与网络层表象之间存在背景同痕:a)W不是奇异的,b)我们愿意置换隐藏层中神经元的次序,c)隐藏单元多于1个。

Proof: Again, we consider each stage of the network individually:

  1. The hardest part is the linear transformation. In order for this to be possible, we need W to have a positive determinant. Our premise is that it isn’t zero, and we can flip the sign if it is negative by switching two of the hidden neurons, and so we can guarantee the determinant is positive. The space of positive determinant matrices is path-connected, so there exists p: [0,1] \to GL_n(\mathbb{R})5 such that p(0) = Id and p(1) = W. We can continually transition from the identity function to the W transformation with the function x \to p(t)x, multiplying x at each point in time t by the continuously transitioning matrix p(t).
  2. We can continually transition from the identity function to the b translation with the function x \to x + tb.
  3. We can continually transition from the identity function to the pointwise use of σ with the function: x \to (1-t)x + t\sigma(x). ∎

证明:我们再次考虑网络的每个阶段:

  1. 最难的部分是线性变换。为了使之可行,我们需要W有一个正的行列式。我们的前提是它不为零;如果它是负的,我们可以通过交换两个隐藏神经元来翻转符号,因此可以保证行列式为正。正行列式矩阵的空间是路径连通的,所以存在`p: [0,1] \to GL_n(\mathbb{R})`5,使得`p(0) = Id`和`p(1) = W`。我们可以通过函数`x \to p(t)x`从恒等函数连续地过渡到W变换,即在每个时刻t把x乘以连续变化的矩阵`p(t)`。
  2. 我们可以通过函数`x \to x + tb`从恒等函数连续地过渡到平移b。
  3. 我们可以通过函数`x \to (1-t)x + t\sigma(x)`,从恒等函数连续地过渡到逐点应用`\sigma`。 ∎
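
As an illustrative sketch of the three transitions (my own simplification, not the proof's exact construction): step 1 here just interpolates linearly between the identity matrix and W, which happens to stay invertible for the particular W chosen below, whereas the proof instead relies on the path-connectedness of the positive-determinant matrices.

下面是这三个过渡的示意代码(是我自己的简化,并非证明中的精确构造):这里第1步只是对单位矩阵和W做线性插值,对于下面特意选取的W它恰好始终可逆,而证明本身依赖的是正行列式矩阵空间的路径连通性。

```python
import numpy as np

W = np.array([[1.2, 0.4],
              [-0.3, 0.9]])       # positive determinant (illustrative choice)
b = np.array([0.5, -0.2])

def stage1(x, t):                 # identity  ->  multiplication by W
    p_t = (1 - t) * np.eye(2) + t * W     # naive path; stays invertible for this W
    return p_t @ x

def stage2(x, t):                 # identity  ->  translation by b
    return x + t * b

def stage3(x, t):                 # identity  ->  pointwise tanh
    return (1 - t) * x + t * np.tanh(x)

x = np.array([0.7, -0.4])
for t in np.linspace(0.0, 1.0, 5):
    print(round(float(t), 2), stage3(stage2(stage1(x, t), t), t))
# At t = 0 this is the identity; at t = 1 it is exactly tanh(W x + b).
```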

I imagine there is probably interest in programs automatically discovering such ambient isotopies and automatically proving the equivalence of certain links, or that certain links are separable. It would be interesting to know if neural networks can beat whatever the state of the art is there.

我猜想,人们可能会对这样的程序感兴趣:自动发现这类背景同痕,并自动证明某些链接的等价性,或证明某些链接是可分离的。如果能知道神经网络能否超越该领域现有的最先进方法,那会很有意思。

(Apparently determining if knots are trivial is NP. This doesn’t bode well for neural networks.)

(显然,判定一个纽结是否平凡是NP问题。这对神经网络来说不是个好兆头。)

The sort of links we’ve talked about so far don’t seem likely to turn up in real world data, but there are higher dimensional generalizations. It seems plausible such things could exist in real world data.

到目前为止,我们讨论过的这类链接似乎不太可能出现在真实世界的数据中,但它们有更高维度的推广。这类结构存在于真实世界的数据中,似乎是说得通的。

Links and knots are 1-dimensional manifolds, but we need 4 dimensions to be able to untangle all of them. Similarly, one can need yet higher dimensional space to be able to unknot n-dimensional manifolds. All n-dimensional manifolds can be untangled in 2n+2 dimensions.6

链接和纽结是一维流形,但我们需要4个维度才能解开所有这些结构。类似地,要解开n维流形,可能需要更高维的空间。所有n维流形都可以在2n + 2维中解开6。

(I know very little about knot theory and really need to learn more about what’s known regarding dimensionality and links. If we know a manifold can be embedded in n-dimensional space, instead of the dimensionality of the manifold, what limit do we have?)

(我对纽结理论了解很少,真的需要更多地了解关于维度和链接的已知结论。如果我们知道的是一个流形可以嵌入到n维空间中,而不是这个流形本身的维数,我们能得到什么样的界?)

The Easy Way Out

简单的解决方案

The natural thing for a neural net to do, the very easy route, is to try and pull the manifolds apart naively and stretch the parts that are tangled as thin as possible. While this won’t be anywhere close to a genuine solution, it can achieve relatively high classification accuracy and be a tempting local minimum.

对神经网络来说,最自然、最省事的做法是试图简单粗暴地把流形拉开,并把纠缠的部分拉伸得尽可能薄。虽然这远远算不上真正的解决方案,但它可以达到相对较高的分类准确度,是一个诱人的局部极小值。

It would present itself as very high derivatives on the regions it is trying to stretch, and sharp near-discontinuities. We know these things happen.7 Contractive penalties, penalizing the derivatives of the layers at data points, are the natural way to fight this.8

这种做法会表现为:在它试图拉伸的区域上出现非常大的导数,以及尖锐的、接近不连续的地方。我们知道这些情况确实会发生7。收缩惩罚(contractive penalty),即惩罚各层在数据点处的导数,是对抗这种现象的自然方法8。
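
For a tanh layer h = tanh(Wx + b), the Jacobian at a data point x is diag(1 − h²)·W, so a contractive penalty can be computed as its squared Frobenius norm; here is a minimal sketch (illustrative only, arbitrary weights).

对于一个tanh层h = tanh(Wx + b),它在数据点x处的雅可比矩阵是diag(1 − h²)·W,因此收缩惩罚可以用其Frobenius范数的平方来计算;下面是一个最小示意代码(仅作说明,权重为任意示例值)。

```python
import numpy as np

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian of tanh(Wx + b) at the point x."""
    h = np.tanh(W @ x + b)
    J = (1.0 - h ** 2)[:, None] * W      # equals diag(1 - h^2) @ W
    return np.sum(J ** 2)

W = np.array([[3.0, -2.0],
              [1.0,  4.0]])              # arbitrary example weights
b = np.array([0.0, 0.5])

print(contractive_penalty(np.array([0.1, -0.2]), W, b))
# Regions the layer stretches violently (large derivatives) produce a large penalty.
```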

Since these sort of local minima are absolutely useless from the perspective of trying to solve topological problems, topological problems may provide a nice motivation to explore fighting these issues.

由于从解决拓扑问题的角度来看,这类局部极小值毫无用处,拓扑问题或许能为探索如何对抗这些问题提供一个很好的动机。

On the other hand, if we only care about achieving good classification results, it seems like we might not care. If a tiny bit of the data manifold is snagged on another manifold, is that a problem for us? It seems like we should be able to get arbitrarily good classification results despite this issue.

另一方面,如果我们只关心取得良好的分类结果,似乎就不必在意这些。如果某个数据流形的一小部分被勾挂在另一个流形上,这对我们来说是个问题吗?尽管存在这个问题,我们似乎仍应该能够获得任意好的分类结果。

(My intuition is that trying to cheat the problem like this is a bad idea: it’s hard to imagine that it won’t be a dead end. In particular, in an optimization problem where local minima are a big problem, picking an architecture that can’t genuinely solve the problem seems like a recipe for bad performance.)

(我的直觉是,试图这样蒙混过关是个坏主意:很难想象这不会是条死胡同。特别是,在一个局部极小值是大问题的优化问题中,选择一个无法真正解决问题的架构,似乎注定会有糟糕的表现。)

Better Layers for Manipulating Manifolds?

操纵流形的更好的层?

The more I think about standard neural network layers – that is, with an affine transformation followed by a point-wise activation function – the more disenchanted I feel. It’s hard to imagine that these are really very good for manipulating manifolds.

我越是思考标准的神经网络层(即一个仿射变换,后接逐点应用的激活函数),就越感到失望。很难想象它们真的很擅长操纵流形。

Perhaps it might make sense to have a very different kind of layer that we can use in composition with more traditional ones?

或许,将一个十分不一样的层与更传统的层结合起来使用会比较好?

The thing that feels natural to me is to learn a vector field with the direction we want to shift the manifold:

对我来说,比较自然的做法是学习一个向量场,其方向就是我们想要移动流形的方向:

And then deform space based on it:

然后根据它扭曲空间:

One could learn the vector field at fixed points (just take some fixed points from the training set to use as anchors) and interpolate in some manner. The vector field above is of the form:

F(x) = \frac{v_0f_0(x) + v_1f_1(x)}{1+f_0(x)+f_1(x)}

Where v_0 and v_1 are vectors and f_0(x) and f_1(x) are n-dimensional gaussians. This is inspired a bit by radial basis functions.

我们可以在一些固定点上学习这个向量场(只需从训练集中取一些点作为锚点),并以某种方式进行插值。上述向量场的形式如下:

F(x) = \frac{v_0f_0(x) + v_1f_1(x)}{1+f_0(x)+f_1(x)}

其中v_0和v_1是向量,f_0(x)和f_1(x)是n维高斯函数。这在一定程度上受到了径向基函数(radial basis functions)的启发。
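
Here is a minimal numpy sketch of such a vector field with two Gaussian bumps and the deformation x → x + F(x); the anchor points, shift vectors, and width are arbitrary example values of mine.

下面是这样一个带有两个高斯鼓包的向量场及变形x → x + F(x)的最小numpy示意代码;锚点、平移向量和宽度都是我任意选取的示例值。

```python
import numpy as np

def gaussian(x, center, width=0.3):
    """Unnormalized Gaussian bump around `center`, evaluated at the points x."""
    return np.exp(-np.sum((x - center) ** 2, axis=-1) / (2 * width ** 2))

def F(x, v0, v1, c0, c1):
    f0, f1 = gaussian(x, c0), gaussian(x, c1)
    num = v0 * f0[..., None] + v1 * f1[..., None]
    return num / (1 + f0 + f1)[..., None]

# Arbitrary anchors and shift directions (illustrative only).
c0, v0 = np.array([-0.5, 0.0]), np.array([0.0,  0.5])
c1, v1 = np.array([ 0.5, 0.0]), np.array([0.0, -0.5])

points = np.random.default_rng(0).uniform(-1, 1, size=(5, 2))
deformed = points + F(points, v0, v1, c0, c1)     # deform space along the field
print(deformed)
```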

K-Nearest Neighbor Layers

K-最近邻层

I’ve also begun to think that linear separability may be a huge, and possibly unreasonable, amount to demand of a neural network. In some ways, it feels like the natural thing to do would be to use k-nearest neighbors (k-NN). However, k-NN’s success is greatly dependent on the representation it classifies data from, so one needs a good representation before k-NN can work well.

我也开始认为,要求线性可分对神经网络来说可能是一个过高的、甚至不合理的要求。在某些方面,更自然的做法似乎是使用k-最近邻(k-NN)。然而,k-NN的成功很大程度上取决于它据以分类的表象,因此需要先有一个好的表象,k-NN才能表现良好。

As a first experiment, I trained some MNIST networks (two-layer convolutional nets, no dropout) that achieved 1% test error. I then dropped the final softmax layer and used the k-NN algorithm. I was able to consistently achieve a reduction in test error of 0.1-0.2%.

作为第一个实验,我训练了一些测试误差约为1%的MNIST网络(两层卷积网络,没有使用dropout)。然后我去掉了最后的softmax层,改用k-NN算法。我能够稳定地将测试误差再降低0.1-0.2%。
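
The experiment described here could be reproduced roughly as in the sketch below, where `features` is a hypothetical stand-in for the trained network with its softmax layer removed, and scikit-learn's KNeighborsClassifier with weights="distance" supplies the k-NN step.

这里描述的实验大致可以像下面的示意代码那样复现,其中`features`是一个假想的占位函数,代表去掉softmax层后的已训练网络,而scikit-learn的KNeighborsClassifier(weights="distance")提供k-NN这一步。

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def features(x):
    # Placeholder: in the real experiment this would be the trained network with
    # its final softmax layer dropped, returning the last hidden representation.
    return x

def knn_on_representation(x_train, y_train, x_test, k=5):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")  # 1/distance weights
    knn.fit(features(x_train), y_train)
    return knn.predict(features(x_test))

# Tiny synthetic demo (a stand-in for MNIST features).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 10))
y_train = (x_train[:, 0] > 0).astype(int)
x_test = rng.normal(size=(5, 10))
print(knn_on_representation(x_train, y_train, x_test))
```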

Still, this doesn’t quite feel like the right thing. The network is still trying to do linear classification, but since we use k-NN at test time, it’s able to recover a bit from mistakes it made.

不过,这感觉还不完全是正确的做法。网络仍然在尝试做线性分类,只是由于我们在测试时使用k-NN,它能够从自己犯下的错误中稍微恢复一些。

k-NN is differentiable with respect to the representation it’s acting on, because of the 1/distance weighting. As such, we can train a network directly for k-NN classification. This can be thought of as a kind of “nearest neighbor” layer that acts as an alternative to softmax.

由于采用1/距离的加权方式,k-NN相对于它所作用的表象是可微的。因此,我们可以直接为k-NN分类训练一个网络。这可以看作一种“最近邻”层,作为softmax的替代。

We don’t want to feedforward our entire training set for each mini-batch because that would be very computationally expensive. I think a nice approach is to classify each element of the mini-batch based on the classes of other elements of the mini-batch, giving each one a weight of 1/(distance from classification target).9

我们不想在每个小批量(mini-batch)上都前向传播整个训练集,因为那样计算代价会非常高。我认为一个不错的方法是:根据小批量中其他元素的类别来对每个元素进行分类,并给每个元素赋予 1/(到分类目标的距离) 的权重9。
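
A numpy sketch of this in-batch scheme (not the author's Theano implementation): each element's class scores are accumulated from the other elements of the batch, weighted by 1/distance; the small `eps` is a hypothetical guard against division by zero.

下面是这一批内方案的numpy示意代码(并非作者的Theano实现):每个元素的类别得分由同一小批量中其他元素按1/距离加权累积得到;很小的`eps`是一个假想的防止除零的保护项。

```python
import numpy as np

def batch_nn_probs(reps, labels, n_classes, eps=1e-8):
    """Class probabilities for each batch element, voted on by the *other*
    elements of the same batch, each weighted by 1 / distance."""
    diffs = reps[:, None, :] - reps[None, :, :]
    dist = np.sqrt(np.sum(diffs ** 2, axis=-1)) + eps
    w = 1.0 / dist
    np.fill_diagonal(w, 0.0)                     # a point does not vote for itself
    onehot = np.eye(n_classes)[labels]           # (batch, n_classes)
    scores = w @ onehot                          # accumulate weighted class votes
    return scores / scores.sum(axis=1, keepdims=True)

reps = np.random.default_rng(0).normal(size=(6, 3))   # toy hidden representations
labels = np.array([0, 0, 1, 1, 2, 2])
print(batch_nn_probs(reps, labels, n_classes=3).round(2))
```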

Sadly, even with sophisticated architecture, using k-NN only gets down to 4-5% test error – and using simpler architectures gets worse results. However, I’ve put very little effort into playing with hyper-parameters.

遗憾的是,即使使用复杂的架构,使用k-NN也只能把测试误差降到4-5%,而使用更简单的架构,结果会更糟。不过,我在调节超参数(hyper-parameters)上花的精力很少。

Still, I really aesthetically like this approach, because it seems like what we’re “asking” the network to do is much more reasonable. We want points of the same manifold to be closer than points of others, as opposed to the manifolds being separable by a hyperplane. This should correspond to inflating the space between manifolds for different categories and contracting the individual manifolds. It feels like simplification.

不过,从审美角度来说,我真的很喜欢这种方法,因为这样一来,我们“要求”网络去做的事情显得合理得多。我们希望同一流形上的点彼此之间比与其他流形上的点更接近,而不是要求各个流形能被超平面分开。这相当于把不同类别流形之间的空间撑开,同时收缩各个流形本身。这感觉像是一种简化。

Conclusion

结论

Topological properties of data, such as links, may make it impossible to linearly separate classes using low-dimensional networks, regardless of depth. Even in cases where it is technically possible, such as spirals, it can be very challenging to do so.

数据的拓扑性质(例如链接)可能使得无论深度如何,都不可能用低维网络对类别进行线性分离。即使在技术上可行的情形(如螺旋),做到这一点也可能非常困难。

To accurately classify data with neural networks, wide layers are sometimes necessary. Further, traditional neural network layers do not seem to be very good at representing important manipulations of manifolds; even if we were to cleverly set weights by hand, it would be challenging to compactly represent the transformations we want. New layers, specifically motivated by the manifold perspective of machine learning, may be useful supplements.

为了用神经网络精确地对数据进行分类,有时需要较宽的层。此外,传统的神经网络层似乎不太擅长表示对流形的重要操作;即使我们手工巧妙地设置权重,想要紧凑地表示我们想要的变换也会很困难。新的层,特别是受机器学习的流形视角启发而来的层,可能会是有用的补充。

(This is a developing research project. It’s posted as an experiment in doing research openly. I would be delighted to have your feedback on these ideas: you can comment inline or at the end. For typos, technical errors, or clarifications you would like to see added, you are encouraged to make a pull request on github.)

(这是一个进行中的研究项目,作为公开做研究的一次实验发布在这里。如果您能对这些想法提出反馈,我会很高兴:您可以在文中或文末发表评论。若发现拼写错误、技术错误,或希望补充说明,欢迎在github上提交pull request。)

Acknowledgments

感谢

Thank you to Yoshua Bengio, Michael Nielsen, Dario Amodei, Eliana Lorch, Jacob Steinhardt, and Tamsyn Waterhouse for their comments and encouragement.

感谢Yoshua Bengio,Michael Nielsen,Dario Amodei,Eliana Lorch,Jacob Steinhardt和Tamsyn Waterhouse的意见和鼓励。


  1. This seems to have really kicked off with Krizhevsky et al., (2012), who put together a lot of different pieces to achieve outstanding results. Since then there’s been a lot of other exciting work.
  2. These representations, hopefully, make the data “nicer” for the network to classify. There has been a lot of work exploring representations recently. Perhaps the most fascinating has been in Natural Language Processing: the representations we learn of words, called word embeddings, have interesting properties. See Mikolov et al. (2013), Turian et al. (2010), and, Richard Socher’s work. To give you a quick flavor, there is a very nice visualization associated with the Turian paper.
  3. A lot of the natural transformations you might want to perform on an image, like translating or scaling an object in it, or changing the lighting, would form continuous curves in image space if you performed them continuously.
  4. Carlsson et al. found that local patches of images form a klein bottle.
  5. GLn(ℝ) is the set of invertible n×n matrices on the reals, formally called the general linear group of degree n.
  6. This result is mentioned in Wikipedia’s subsection on Isotopy versions.
  7. See Szegedy et al., where they are able to modify data samples and find slight modifications that cause some of the best image classification neural networks to misclassify the data. It’s quite troubling.
  8. Contractive penalties were introduced in contractive autoencoders. See Rifai et al. (2011).
  9. I used a slightly less elegant, but roughly equivalent algorithm because it was more practical to implement in Theano: feedforward two different batches at the same time, and classify them based on each other.

  1. 这似乎是由 Krizhevsky et al., (2012) 真正开启的,他们把许多不同的成果结合在一起,取得了出色的结果。从那以后,陆续出现了许多其他令人兴奋的工作。
  2. 我们希望这些表象能让数据“更好”地被网络分类。近来有很多探索表象的工作。也许最引人入胜的是自然语言处理领域:我们学到的词的表示,称为词嵌入(word embeddings),具有有趣的性质。参见 Mikolov et al. (2013)、Turian et al. (2010),以及 Richard Socher 的工作。想快速了解的话,有一个与Turian那篇文章相关的很棒的可视化例子。
  3. 您可能希望对图像执行的许多自然变换,例如平移或缩放其中的物体,或者改变光照,如果连续地执行,就会在图像空间中形成连续的曲线。
  4. Carlsson et al. 发现图像的局部图块(patch)构成一个克莱因瓶。
  5. GLn(ℝ)是实数域上可逆的n×n矩阵的集合,正式名称为n阶一般线性群。
  6. 这一结果见Wikipedia中关于Isotopy versions的小节。
  7. 参见 Szegedy et al.,他们能够对数据样本做出细微的修改,使一些最好的图像分类神经网络对数据产生误分类。这相当令人不安。
  8. 收缩惩罚(contractive penalties)是在收缩自编码器中引入的,参见 Rifai et al. (2011)。
  9. 我使用了一个稍欠优雅但大致等价的算法,因为它在Theano中更容易实现:同时前向传播两个不同的批次,并让它们基于彼此进行分类。
