Deep Learning, NLP, and Representations

Posted on July 7, 2014

Keywords: neural networks, deep learning, representations, NLP, recursive neural networks

Introduction

In the last few years, deep neural networks have dominated pattern recognition. They blew the previous state of the art out of the water for many computer vision tasks. Voice recognition is also moving that way.

But despite the results, we have to wonder… why do they work so well?

This post reviews some extremely remarkable results in applying deep neural networks to natural language processing (NLP). In doing so, I hope to make accessible one promising answer as to why deep neural networks work. I think it’s a very elegant perspective.

One Hidden Layer Neural Networks

A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem.

It’s true, essentially, because the hidden layer can be used as a lookup table.

For simplicity, let’s consider a perceptron network. A perceptron is a very simple neuron that fires if it exceeds a certain threshold and doesn’t fire if it doesn’t reach that threshold. A perceptron network gets binary (0 and 1) inputs and gives binary outputs.

Note that there are only a finite number of possible inputs. For each possible input, we can construct a neuron in the hidden layer that fires for that input,1 and only on that specific input. Then we can use the connections between that neuron and the output neurons to control the output in that specific case. 2

And so, it’s true that one hidden layer neural networks are universal. But there isn’t anything particularly impressive or exciting about that. Saying that your model can do the same thing as a lookup table isn’t a very strong argument for it. It just means it isn’t impossible for your model to do the task.

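The lookup-table construction can be sketched in a few lines of Python (a toy illustration with made-up function names, not anything from the literature): one hidden perceptron per possible input pattern, each wired so that it fires on that pattern alone.

```python
def perceptron(weights, bias, inputs):
    # Fires (outputs 1) iff the weighted sum of inputs exceeds the threshold.
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def lookup_table_network(truth_table, inputs):
    """One hidden neuron per possible binary input pattern; each neuron fires
    only on its own pattern, and its outgoing weight is the desired output."""
    total = 0
    for pattern, desired in truth_table.items():
        # Weight +1 for bits that should be 1, -1 for bits that should be 0;
        # the bias is set so the neuron fires only on an exact match.
        weights = [1 if b == 1 else -1 for b in pattern]
        bias = 0.5 - sum(pattern)
        total += perceptron(weights, bias, inputs) * desired
    return 1 if total > 0 else 0

# Memorizing XOR: impossible for a single perceptron, trivial with a hidden layer.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

With n input neurons this needs on the order of 2^n hidden neurons, which is exactly why universality-by-lookup-table is unimpressive.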
Universality means that a network can fit to any training data you give it. It doesn’t mean that it will interpolate to new data points in a reasonable way.

No, universality isn’t an explanation for why neural networks work so well. The real reason seems to be something much more subtle… And, to understand it, we’ll first need to understand some concrete results.

Word Embeddings

I’d like to start by tracing a particularly interesting strand of deep learning research: word embeddings. In my personal opinion, word embeddings are one of the most exciting areas of research in deep learning at the moment, although they were originally introduced by Bengio, et al. more than a decade ago.3 Beyond that, I think they are one of the best places to gain intuition about why deep learning is so effective.

A word embedding W: words → ℝⁿ is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions). For example, we might find:

W(cat) = (0.2, -0.4, 0.7, ...)

W(mat) = (0.0, 0.6, -0.1, ...)

(Typically, the function is a lookup table, parameterized by a matrix, θ, with a row for each word: W_θ(w_n) = θ_n.) W is initialized to have random vectors for each word. It learns to have meaningful vectors in order to perform some task.

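As a sketch of that parameterization (plain Python; the names are illustrative), W is nothing but a table of random rows, one per word, which training then adjusts:

```python
import random

rng = random.Random(0)

def init_embedding(vocab, dim):
    # theta: one randomly initialized row per word; training will adjust it.
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}

W = init_embedding(["cat", "mat", "sat", "on", "the"], dim=4)
cat_vector = W["cat"]   # W(cat) is simply the row of theta for "cat"
```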
For example, one task we might train a network for is predicting whether a 5-gram (sequence of five words) is ‘valid.’ We can easily get lots of 5-grams from Wikipedia (eg. “cat sat on the mat”) and then ‘break’ half of them by switching a word with a random word (eg. “cat sat song the mat”), since that will almost certainly make our 5-gram nonsensical.

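That corruption step is easy to sketch (plain Python; `break_fivegram` is my own illustrative name, not from any paper): replace one word of a valid 5-gram with a random vocabulary word.

```python
import random

def break_fivegram(fivegram, vocabulary, rng):
    """'Break' a valid 5-gram by swapping one word for a random word,
    almost certainly making the result nonsensical."""
    words = fivegram.split()
    position = rng.randrange(len(words))
    words[position] = rng.choice(vocabulary)
    return " ".join(words)

rng = random.Random(0)
vocab = ["song", "ceiling", "queen", "blue"]
valid = "cat sat on the mat"
broken = break_fivegram(valid, vocab, rng)
```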
Modular Network to determine if a 5-gram is ‘valid’ (From Bottou (2011))
The model we train will run each word in the 5-gram through W to get a vector representing it and feed those into another ‘module’ called R, which tries to predict if the 5-gram is ‘valid’ or ‘broken.’ Then, we’d like:
R(W(cat), W(sat), W(on), W(the), W(mat))=1
R(W(cat), W(sat), W(song), W(the), W(mat))=0

In order to predict these values accurately, the network needs to learn good parameters for both W and R.

Now, this task isn’t terribly interesting. Maybe it could be helpful in detecting grammatical errors in text or something. But what is extremely interesting is W.

(In fact, to us, the entire point of the task is to learn W. We could have done several other tasks – another common one is predicting the next word in the sentence. But we don’t really care. In the remainder of this section we will talk about many word embedding results and won’t distinguish between different approaches.)

One thing we can do to get a feel for the word embedding space is to visualize them with t-SNE, a sophisticated technique for visualizing high-dimensional data.

t-SNE visualizations of word embeddings. Left: Number Region; Right: Jobs Region. From Turian et al. (2010), see complete image.

This kind of ‘map’ of words makes a lot of intuitive sense to us. Similar words are close together. Another way to get at this is to look at which words are closest in the embedding to a given word. Again, the words tend to be quite similar.

What words have embeddings closest to a given word? From Collobert et al. (2011)
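A minimal sketch of that nearest-word lookup (plain Python over a tiny hand-built embedding; real embeddings are learned, and these toy vectors are chosen only so that related words land near each other):

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(word, embedding, k=1):
    """The k words whose vectors are closest (by cosine similarity) to a given word."""
    others = [(cosine(embedding[word], vec), w)
              for w, vec in embedding.items() if w != word]
    return [w for _, w in sorted(others, reverse=True)[:k]]

# Toy embedding: "few" and "couple" deliberately close, likewise "red" and "blue".
E = {"few":    [0.90, 0.10],
     "couple": [0.85, 0.15],
     "red":    [0.10, 0.90],
     "blue":   [0.12, 0.88]}
```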

It seems natural for a network to make words with similar meanings have similar vectors. If you switch a word for a synonym (eg. “a few people sing well” → “a couple people sing well”), the validity of the sentence doesn’t change. While, from a naive perspective, the input sentence has changed a lot, if W maps synonyms (like “few” and “couple”) close together, from R’s perspective little changes.

This is very powerful. The number of possible 5-grams is massive and we have a comparatively small number of data points to try to learn from. Similar words being close together allows us to generalize from one sentence to a class of similar sentences. This doesn’t just mean switching a word for a synonym, but also switching a word for a word in a similar class (eg. “the wall is blue” → “the wall is red”). Further, we can change multiple words (eg. “the wall is blue” → “the ceiling is red”). The impact of this is exponential with respect to the number of words.4

So, clearly this is a very useful thing for W to do. But how does it learn to do this? It seems quite likely that there are lots of situations where it has seen a sentence like “the wall is blue” and knows that it is valid before it sees a sentence like “the wall is red”. As such, shifting “red” a bit closer to “blue” makes the network perform better.

We still need to see examples of every word being used, but the analogies allow us to generalize to new combinations of words. You’ve seen all the words that you understand before, but you haven’t seen all the sentences that you understand before. So too with neural networks.

From Mikolov et al. (2013)

Word embeddings exhibit an even more remarkable property: analogies between words seem to be encoded in the difference vectors between words. For example, there seems to be a constant male-female difference vector:

W(woman) - W(man) ≈ W(aunt) - W(uncle)
W(woman) - W(man) ≈ W(queen) - W(king)

This may not seem too surprising. After all, gender pronouns mean that switching a word can make a sentence grammatically incorrect. You write, “she is the aunt” but “he is the uncle.” Similarly, “he is the King” but “she is the Queen.” If one sees “she is the uncle,” the most likely explanation is a grammatical error. If words are being randomly switched half the time, it seems pretty likely that happened here.

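The difference-vector arithmetic can be sketched directly (plain Python over a toy embedding with a hand-built gender direction; learned embeddings only approximate this structure):

```python
def analogy(embedding, a, b, c):
    """Solve a : b :: c : ?  via the vector  W(b) - W(a) + W(c)."""
    target = [vb - va + vc for va, vb, vc
              in zip(embedding[a], embedding[b], embedding[c])]
    def dist2(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    # Return the word closest to the target, excluding the query words.
    candidates = [w for w in embedding if w not in (a, b, c)]
    return min(candidates, key=lambda w: dist2(embedding[w], target))

# Toy embedding with a constant male -> female difference vector of (0, 1).
E = {"man":   [1.0, 0.0], "woman": [1.0, 1.0],
     "king":  [3.0, 0.0], "queen": [3.0, 1.0],
     "uncle": [5.0, 0.0], "aunt":  [5.0, 1.0]}
```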
“Of course!” We say with hindsight, “the word embedding will learn to encode gender in a consistent way. In fact, there’s probably a gender dimension. Same thing for singular vs plural. It’s easy to find these trivial relationships!”

It turns out, though, that much more sophisticated relationships are also encoded in this way. It seems almost miraculous!

Relationship pairs in a word embedding. From Mikolov et al. (2013b).

It’s important to appreciate that all of these properties of W are side effects. We didn’t try to have similar words be close together. We didn’t try to have analogies encoded with difference vectors. All we tried to do was perform a simple task, like predicting whether a sentence was valid. These properties more or less popped out of the optimization process.

This seems to be a great strength of neural networks: they learn better ways to represent data, automatically. Representing data well, in turn, seems to be essential to success at many machine learning problems. Word embeddings are just a particularly striking example of learning a representation.

Shared Representations

The properties of word embeddings are certainly interesting, but can we do anything useful with them? Besides predicting silly things, like whether a 5-gram is ‘valid’?

W and F learn to perform task A. Later, G can learn to perform B based on W.

We learned the word embedding in order to do well on a simple task, but based on the nice properties we’ve observed in word embeddings, you may suspect that they could be generally useful in NLP tasks. In fact, word representations like these are extremely important:

The use of word representations… has become a key “secret sauce” for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling. (Luong et al. (2013))

This general tactic – learning a good representation on a task A and then using it on a task B – is one of the major tricks in the Deep Learning toolbox. It goes by different names depending on the details: pretraining, transfer learning, and multi-task learning. One of the great strengths of this approach is that it allows the representation to learn from more than one kind of data.

There’s a counterpart to this trick. Instead of learning a way to represent one kind of data and using it to perform multiple kinds of tasks, we can learn a way to map multiple kinds of data into a single representation!

One nice example of this is a bilingual word-embedding, produced in Socher et al. (2013a). We can learn to embed words from two different languages in a single, shared space. In this case, we learn to embed English and Mandarin Chinese words in the same space.

We train two word embeddings, W_en and W_zh, in a manner similar to how we did above. However, we know that certain English words and Chinese words have similar meanings. So, we optimize for an additional property: words that we know are close translations should be close together.

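A sketch of that extra objective (plain Python; `alignment_loss` and the romanized Chinese key are my own illustrations): alongside each language’s usual training loss, penalize the distance between the vectors of known translation pairs.

```python
def alignment_loss(W_en, W_zh, known_pairs):
    """Extra term in the training objective: known translations
    should have nearby vectors (squared Euclidean distance here)."""
    total = 0.0
    for en_word, zh_word in known_pairs:
        total += sum((a - b) ** 2
                     for a, b in zip(W_en[en_word], W_zh[zh_word]))
    return total

W_en = {"cat": [0.2, 0.7]}
W_zh = {"mao": [0.1, 0.6]}   # hypothetical entry for the Chinese word for "cat"
loss = alignment_loss(W_en, W_zh, [("cat", "mao")])   # 0.1**2 + 0.1**2 ≈ 0.02
```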
Of course, we observe that the words we knew had similar meanings end up close together. Since we optimized for that, it’s not surprising. More interesting is that words we didn’t know were translations end up close together.

In light of our previous experiences with word embeddings, this may not seem too surprising. Word embeddings pull similar words together, so if an English and Chinese word we know to mean similar things are near each other, their synonyms will also end up near each other. We also know that things like gender differences tend to end up being represented with a constant difference vector. It seems like forcing enough points to line up should force these difference vectors to be the same in both the English and Chinese embeddings. A result of this would be that if we know that two male versions of words translate to each other, we should also get the female words to translate to each other.

Intuitively, it feels a bit like the two languages have a similar ‘shape’ and that by forcing them to line up at different points, they overlap and other points get pulled into the right positions.

t-SNE visualization of the bilingual word embedding. Green is Chinese, Yellow is English. (Socher et al. (2013a))

In bilingual word embeddings, we learn a shared representation for two very similar kinds of data. But we can also learn to embed very different kinds of data in the same space.

Recently, deep learning has begun exploring models that embed images and words in a single representation.5

The basic idea is that one classifies images by outputting a vector in a word embedding. Images of dogs are mapped near the “dog” word vector. Images of horses are mapped near the “horse” vector. Images of automobiles near the “automobile” vector. And so on.

The interesting part is what happens when you test the model on new classes of images. For example, if the model wasn’t trained to classify cats – that is, to map them near the “cat” vector – what happens when we try to classify images of cats?

It turns out that the network is able to handle these new classes of images quite reasonably. Images of cats aren’t mapped to random points in the word embedding space. Instead, they tend to be mapped to the general vicinity of the “dog” vector, and, in fact, close to the “cat” vector. Similarly, the truck images end up relatively close to the “truck” vector, which is near the related “automobile” vector.

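The zero-shot behavior can be sketched as nearest-word classification (plain Python with toy vectors; the real systems learn a deep map from image features into the embedding space):

```python
def classify_image(image_vector, word_embedding):
    """Label an image with the word whose embedding vector is nearest
    to the image's vector in the shared space."""
    def dist2(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return min(word_embedding, key=lambda w: dist2(word_embedding[w], image_vector))

# Toy shared space; "cat" was never a training class.
words = {"dog": [1.0, 0.0], "horse": [0.0, 1.0], "automobile": [-1.0, 0.0]}
cat_image = [0.9, 0.2]            # an image of a cat, as mapped by the network
label = classify_image(cat_image, words)   # lands nearest "dog"
```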
This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.

The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizhevsky et al. (2012)), but embed images into the word embedding space in different ways.

The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.

Even though I’ve never seen an Aesculapian snake or an armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.

(These results all exploit a sort of “these words are similar” reasoning. But it seems like much stronger results should be possible based on relationships between words. In our word embedding space, there is a consistent difference vector between male and female version of words. Similarly, in image space, there are consistent features distinguishing between male and female. Beards, mustaches, and baldness are all strong, highly visible indicators of being male. Breasts and, less reliably, long hair, makeup and jewelery, are obvious indicators of being female.6 Even if you’ve never seen a king before, if the queen, determined to be such by the presence of a crown, suddenly has a beard, it’s pretty reasonable to give the male version.)

Shared embeddings are an extremely exciting area of research, and they drive at why the representation-focused perspective of deep learning is so compelling.

Recursive Neural Networks

We began our discussion of word embeddings with the following network:

Modular Network that learns word embeddings (From Bottou (2011))

The above diagram represents a modular network, R(W(w1), W(w2), W(w3), W(w4), W(w5)).

It is built from two modules, W and R. This approach of building neural networks from smaller neural network “modules” that can be composed together is not very widespread. It has, however, been very successful in NLP.

Models like the above are powerful, but they have an unfortunate limitation: they can only have a fixed number of inputs.

We can overcome this by adding an association module, A, which will take two word or phrase representations and merge them.

By merging sequences of words, A takes us from representing words to representing phrases or even representing whole sentences! And because we can merge together different numbers of words, we don’t have to have a fixed number of inputs.

It doesn’t necessarily make sense to merge together words in a sentence linearly. If one considers the phrase “the cat sat on the mat”, it can naturally be bracketed into segments: “((the cat) (sat (on (the mat))))”. We can apply A based on this bracketing:

(From Bottou (2011))
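Applying A over that bracketing can be sketched recursively (plain Python; an elementwise average stands in for the learned neural module):

```python
def A(left, right):
    # Association module: merge two representations into one.
    # (A toy elementwise average; the real A is a learned neural network.)
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def embed(tree, W):
    """Recursively apply A over a bracketed parse tree whose
    leaves are words and whose internal nodes are (left, right) pairs."""
    if isinstance(tree, str):
        return W[tree]
    left, right = tree
    return A(embed(left, W), embed(right, W))

W = {"the": [1.0], "cat": [3.0], "sat": [5.0], "on": [7.0], "mat": [9.0]}
# ((the cat) (sat (on (the mat))))
sentence = (("the", "cat"), ("sat", ("on", ("the", "mat"))))
vector = embed(sentence, W)   # a single fixed-size vector for the whole sentence
```

Because A is applied wherever the bracketing dictates, sentences of any length collapse to one fixed-size representation.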

These models are often called “recursive neural networks” because one often has the output of a module go into a module of the same type. They are also sometimes called “tree-structured neural networks.”

Recursive neural networks have had significant successes in a number of NLP tasks. For example, Socher et al. (2013c) uses a recursive neural network to predict sentence sentiment:

One major goal has been to create a reversible sentence representation, a representation that one can reconstruct an actual sentence from, with roughly the same meaning. For example, we can try to introduce a disassociation module, D, that tries to undo A:

(From Bottou (2011))

If we could accomplish such a thing, it would be an extremely powerful tool. For example, we could try to make a bilingual sentence representation and use it for translation.

Unfortunately, this turns out to be very difficult. Very very difficult. And given the tremendous promise, there are lots of people working on it.

Recently, Cho et al. (2014) have made some progress on representing phrases, with a model that can encode English phrases and decode them in French. Look at the phrase representations it learns!

Small section of the t-SNE of the phrase representation (From Cho et al. (2014))

Criticisms

I’ve heard some of the results reviewed above criticized by researchers in other fields, in particular, in NLP and linguistics. The concerns are not with the results themselves, but the conclusions drawn from them, and how they compare to other techniques.

I don’t feel qualified to articulate these concerns. I’d encourage someone who feels this way to describe the concerns in the comments.

Conclusion

The representation perspective of deep learning is a powerful view that seems to answer why deep neural networks are so effective. Beyond that, I think there’s something extremely beautiful about it: why are neural networks effective? Because better ways of representing data can pop out of optimizing layered models.

Deep learning is a very young field, where theories aren’t strongly established and views quickly change. That said, it is my impression that the representation-focused perspective of neural networks is presently very popular.

This post reviews a lot of research results I find very exciting, but my main motivation is to set the stage for a future post exploring connections between deep learning, type theory and functional programming. If you’re interested, you can subscribe to my rss feed so that you’ll see it when it is published.

(I would be delighted to hear your comments and thoughts: you can comment inline or at the end. For typos, technical errors, or clarifications you would like to see added, you are encouraged to make a pull request on github)

Acknowledgments

I’m grateful to Eliana Lorch, Yoshua Bengio, Michael Nielsen, Laura Ball, Rob Gilson, and Jacob Steinhardt for their comments and support.


  1. Constructing a case for every possible input requires 2^n hidden neurons, when you have n input neurons. In reality, the situation isn’t usually that bad. You can have cases that encompass multiple inputs. And you can have overlapping cases that add together to achieve the right input on their intersection.
  2. (It isn’t only perceptron networks that have universality. Networks of sigmoid neurons (and other activation functions) are also universal: given enough hidden neurons, they can approximate any continuous function arbitrarily well. Seeing this is significantly trickier because you can’t just isolate inputs.)
  3. Word embeddings were originally developed in (Bengio et al, 2001; Bengio et al, 2003), a few years before the 2006 deep learning renewal, at a time when neural networks were out of fashion. The idea of distributed representations for symbols is even older, e.g. (Hinton 1986).
  4. The seminal paper, A Neural Probabilistic Language Model (Bengio, et al. 2003) has a great deal of insight about why word embeddings are powerful.
  5. Previous work has been done modeling the joint distributions of tags and images, but it took a very different perspective.
  6. I’m very conscious that physical indicators of gender can be misleading. I don’t mean to imply, for example, that everyone who is bald is male or everyone who has breasts is female. Just that these often indicate such, and greatly adjust our prior.