This post is going to be a small repository of information about Yoon Kim’s paper on using Convolutional Neural Networks for Sentence Classification.

In the paper, Kim describes using Convolutional Neural Nets as a means for sentence classification - essentially, each sentence is represented by taking its word vectors and concatenating them together, and this representation is passed through a layer of convolutional filters to create what the paper calls a feature map, which is simply the set of filter outputs.
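
As a rough illustration of that setup (the sentence length and embedding size below are made up for the example, not taken from the paper):

```python
import torch

# A toy 7-word sentence with 300-dimensional word vectors
# (both numbers are arbitrary, chosen just for illustration).
n_words, embed_dim = 7, 300
word_vectors = [torch.randn(embed_dim) for _ in range(n_words)]

# Concatenating / stacking the word vectors gives the n x k sentence matrix
# that the convolutional filters later slide over.
sentence_matrix = torch.stack(word_vectors)
print(sentence_matrix.shape)  # torch.Size([7, 300])
```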

Each filter output is a non-linear function of a weight vector times a window of words, plus a bias term. Overlapping windows of h words are slid across the sentence to produce its feature map.
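
In symbols (reconstructing the notation from the description above, so treat it as a sketch): a filter with weights w and bias b, applied to the window of h word vectors starting at position i, produces one entry of the feature map.

```latex
% one feature from one filter over a window of h words
c_i = f\left(\mathbf{w} \cdot \mathbf{x}_{i:i+h-1} + b\right)

% sliding the window over a sentence of n words gives the feature map
\mathbf{c} = [c_1, c_2, \dots, c_{n-h+1}]
```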

Max pooling is then applied to the feature map - specifically, max-over-time pooling, which takes the maximum value of the feature map as the feature corresponding to that particular filter. Each filter thus contributes exactly one feature per sentence, which naturally deals with variable sentence lengths.
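
A tiny example of why this handles variable lengths: two feature maps from the same filter, computed over sentences of different lengths, both collapse to a single number (the values below are made up).

```python
import torch

# Feature maps from the same filter for two sentences of different lengths.
feature_map_short = torch.tensor([0.1, 0.9, 0.3])           # 3 windows
feature_map_long = torch.tensor([0.2, 0.4, 0.8, 0.1, 0.5])  # 5 windows

# Max-over-time pooling keeps only the largest activation, so each sentence
# contributes exactly one feature per filter, regardless of its length.
print(feature_map_short.max())  # tensor(0.9000)
print(feature_map_long.max())   # tensor(0.8000)
```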

By varying the window sizes and using multiple filters, the model produces many such features, which together form the penultimate layer of the network. This penultimate layer is connected to a fully connected softmax layer, whose output is a probability distribution over the labels.
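
A minimal PyTorch sketch of this architecture - the class and argument names are mine, and the defaults follow the hyperparameters discussed below rather than any official implementation:

```python
import torch
import torch.nn as nn

class SentenceCNN(nn.Module):
    """Rough sketch of the CNN-for-sentence-classification architecture."""

    def __init__(self, vocab_size=10000, embed_dim=300, num_classes=2,
                 window_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per window size; each produces num_filters
        # feature maps over the sentence.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in window_sizes
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, sentence_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, len)
        # Convolve, apply the non-linearity, then max-over-time pool each
        # feature map down to a single value per filter.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)            # penultimate layer
        return self.fc(self.dropout(features))         # raw logits
```

The forward pass here returns raw logits; in PyTorch the softmax would typically be folded into the loss (e.g. nn.CrossEntropyLoss) during training.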

For regularization, the model applies dropout using a masking vector - the layer of features is multiplied element-wise by a vector of Bernoulli random variables (each 0 or 1 with some probability), so gradients are only backpropagated through the non-zero (unmasked) units. In addition, if after any gradient descent step the l2 norm of a weight vector is greater than s, it gets rescaled back to s.
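
A sketch of those two regularization steps - the function names are mine, the keep-probability and s come from the hyperparameters below, and exactly which weight vectors the norm constraint is applied to is a detail I'm glossing over:

```python
import torch

def dropout_mask(features, keep_prob=0.5):
    """Multiply the feature layer element-wise by a Bernoulli masking vector."""
    r = torch.bernoulli(torch.full_like(features, keep_prob))
    return features * r

def rescale_weights(weight, s=3.0):
    """If the l2 norm of the weight tensor exceeds s, scale it back down to s."""
    with torch.no_grad():
        norm = weight.norm(p=2)
        if norm > s:
            weight.mul_(s / norm)
    return weight
```

In practice nn.Dropout handles the masking for you; the explicit version above is just to make the Bernoulli mask visible.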

The paper mentions the following hyperparameters: ReLU as the non-linearity for the filters, filter window sizes of 3, 4 and 5 with 100 feature maps each, a dropout rate (Bernoulli probability) of 0.5, an l2 constraint of s = 3, and a minibatch size of 50. The model is trained using stochastic gradient descent over shuffled minibatches, with the Adadelta update rule (for faster training).
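
In PyTorch terms, reusing the SentenceCNN and rescale_weights sketches from above, the training setup might look roughly like this (data loading and tokenization omitted):

```python
import torch
import torch.nn as nn

model = SentenceCNN(window_sizes=(3, 4, 5), num_filters=100, dropout=0.5)
optimizer = torch.optim.Adadelta(model.parameters())
criterion = nn.CrossEntropyLoss()
BATCH_SIZE = 50  # shuffled minibatches of 50

def train_step(token_ids, labels, s=3.0):
    """One gradient descent step on a minibatch, followed by l2 rescaling."""
    optimizer.zero_grad()
    loss = criterion(model(token_ids), labels)
    loss.backward()
    optimizer.step()
    rescale_weights(model.fc.weight, s=s)  # norm constraint after the update
    return loss.item()
```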

The Adadelta update rule basically means that the effective learning rate for each subsequent step of gradient descent depends on a decaying average of past (squared) gradients - the running average at any step is a combination of the previous average and the current gradient. This is explained much better here.
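
Concretely, the running average of squared gradients looks like this (ρ is the decay rate; the full rule in Zeiler's Adadelta paper also keeps a second running average over the squared parameter updates, which appears in the numerator):

```latex
% decaying average of squared gradients, with decay rate rho
E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2

% the step for each parameter is scaled by the ratio of the two RMS terms
\Delta\theta_t = -\frac{\mathrm{RMS}[\Delta\theta]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t
```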

An interesting result of this paper is that keeping the pretrained word2vec word vectors static (not tuning them at all) barely affected the prediction accuracy of the classifier - in fact, they were significantly better than randomly initialized vectors that were trained with the network. Initializing with word2vec vectors but letting them train along with the network did slightly better, but for all practical purposes I believe the ability to use simple inference with off-the-shelf word2vec vectors, as opposed to tuning the vectors themselves, far outweighs the small gain in accuracy.

This is perhaps an indicator that word2vec vectors are very decent feature extractors for a variety of classification tasks across datasets. It’s also perhaps a first step towards more exciting papers like the recent OpenAI attempt to use unsupervised learning to create universal feature extractors for text and NLP tasks.