Post written by Jorn Peters & Taco Cohen
When we humans see an object we’ve never seen before, we are almost immediately able to recognize the same object in many different situations. For example, when a child learns about its new teddy bear, it will still recognize the teddy if you turn it upside down. In contrast, while current-generation Deep Neural Networks (DNNs) can learn to recognize the teddy bear eventually, they will need to see many examples of rotated teddy bears, each one labelled “teddy”.
A rotated teddy bear is still a teddy bear
This hunger for data, or "statistical inefficiency", is perhaps the most significant practical limitation of current deep learning technology. Many of our clients at Scyfer have problems that could be solved by deep learning, but don't have large annotated datasets.
At Scyfer we have developed two innovative technologies to deal with this problem:
- Scyfer Active Learning Platform: once integrated, our system will passively observe the work of a domain expert (whether that’s a medical doctor diagnosing patients or a factory worker identifying defective products). As the system is starting to learn how to imitate the expert, it will identify its own weaknesses and ask for guidance from the expert, thereby greatly accelerating its learning without requiring so many examples.
- Data-efficient deep networks: by building in prior knowledge, like “a rotated teddy bear is still a teddy bear”, we can drastically reduce the number of examples required to learn a new concept.
The active learning platform is the subject of a future blog post. In this post we will describe an exciting new method to exploit prior knowledge about symmetries to drastically reduce the data requirements of deep networks. This method is called group equivariant convolutional neural networks (G-CNNs). We have previously published a research paper on G-CNNs (showing state-of-the-art results on the very competitive CIFAR dataset). In this post we hope to make the basic ideas accessible to a wider audience.
Symmetry in Data
Symmetry and asymmetry. A hexagon has rotation and reflection symmetry; the blob does not.
When discussing symmetry, the first thing most people will think of is reflection symmetry (shown above). We can reflect a hexagon in all three axes, and the shape will stay exactly the same. Notice, however, that the color pattern on the hexagon does change. In general, when we talk about symmetry, we are talking about transformations (like reflection) that leave some property (e.g. shape) the same, while potentially changing others (like color).
In machine learning, we are interested in symmetries of the labels that we are trying to predict. For example, if we have a picture of a dog and we rotate this picture by ninety degrees, we still have a picture of a dog. So the label "dog" is invariant under rotation, and we say that the label has a rotational symmetry. As obvious as this observation is to us, a neural network initially knows nothing about which kinds of transformations would change the label and which ones wouldn't!
We can rotate, mirror horizontally, mirror vertically, or translate a picture without intrinsically changing it (photo of dog by Garyt70).
Cayley diagrams of the groups C4, C6 and D4, consisting of rotations (by multiples of 90 or 60 degrees) and reflections (D4 only).
The mathematics of symmetry is called group theory. A group is simply the set of all symmetries of an object. As the Cayley diagrams above show, groups can have very rich and sometimes complicated structure, but we will not go into the details in this post.
So how does this relate to deep learning? Perhaps the most successful and widely used deep network architecture is the convolutional neural network or CNN. The reason for this success is that convolutional networks exploit translation symmetry. Convolution layers use the same set of filters in each position of the image, thereby saving a huge number of parameters. Re-using filters makes sense, because due to the symmetry, patterns that can appear in one position will also tend to appear in other positions.
Importantly, convolutional networks use not one but many layers of convolution. For the n-th layer to be able to exploit translation symmetry, the previous layers must “preserve” this symmetry. If we used a bunch of fully connected layers and then stacked a convolution layer on top, we wouldn’t expect it to work very well, because the fully connected layers don’t necessarily preserve the symmetry.
“Preserving the symmetry” sounds a bit vague. A more precise way to say this is that convolution layers are equivariant to translations. This means that if we translate an image and then feed it through a convolution layer, we get the same thing as if we had fed the original image through the convolution layer and then translated the resulting feature maps. This is shown in the figure below.
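This equivariance property is easy to check numerically. The sketch below (our own illustration, not code from the paper) uses SciPy's `correlate2d` with circular ("wrap") boundary conditions, so that shifting an image and then convolving gives exactly the shifted version of convolving first:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
filt = rng.standard_normal((3, 3))

# Circular boundary conditions make translation equivariance hold exactly.
conv = lambda img: correlate2d(img, filt, mode="same", boundary="wrap")

shift_then_conv = conv(np.roll(image, shift=(2, 3), axis=(0, 1)))
conv_then_shift = np.roll(conv(image), shift=(2, 3), axis=(0, 1))
print(np.allclose(shift_then_conv, conv_then_shift))  # True
```

With zero-padded borders the identity only holds approximately near the edges, but the idea is the same.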
Can we extend this idea beyond translations to other kinds of transformations, thereby further increasing the amount of weight sharing? Unfortunately, convolution is not equivariant to transformations like rotation. As shown in the figure below, the feature map corresponding to the vertical bar contains different information about the image than the feature map that corresponds to the horizontal bar, even though they represent essentially the same object (just rotated). Since the rotation symmetry that we have in the input space is not preserved by convolution layers, it is not clear how we could exploit it through weight tying anywhere beyond the first layer.
Group Convolutions
In this section we will present group convolutions, which are equivariant to larger groups of symmetries than just translations, allowing us to drastically reduce the number of parameters per feature map. Group convolutions were recently introduced to deep learning by Scyfer researchers Cohen & Welling (2016), who showed that they substantially improve results on image classification tasks where relatively little data is available.
To understand the G-Conv, let’s first look at the normal translation-conv. The figure below shows an animation of this.
Visualization of classic 2D convolution
When we look at the movement of the filter, we see that it only performs translations, i.e., each output value (red pixel) is associated with a specific translation of the filter. There is of course no reason to restrict the transformations of the filter to translations. Instead, a larger group can be used – for example, the group p4, which contains translations and rotations by multiples of ninety degrees (a combination of C4 and the translation group), or p4m, which additionally contains mirror reflections.
Visualization of the G-Conv for the roto-translation group p4
The figure above shows the group convolution (G-Conv) for the group p4. Besides translating, the filter now also rotates. For each of the four orientations of the filter, we perform a classic convolution and store the result in an output feature map (this is all done in a single call to a classic conv routine). Conceptually, the output feature maps are ordered in a circle that mirrors the Cayley diagram of C4 above.
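In code, this first-layer p4 G-Conv can be sketched in a few lines of NumPy/SciPy for a single-channel input and a single filter (a toy illustration under our own naming, not the optimized implementation in GrouPy): rotate the filter into each of its four orientations and run a classic 2D correlation for each one.

```python
import numpy as np
from scipy.signal import correlate2d

def p4_lifting_conv(image, filt):
    """First-layer p4 G-Conv: one classic convolution per filter orientation.

    image: 2D array of shape (H, W); filt: square 2D array of shape (k, k).
    Returns 4 output feature maps, one per 90-degree rotation of the filter.
    """
    return np.stack([correlate2d(image, np.rot90(filt, r), mode="valid")
                     for r in range(4)])

rng = np.random.default_rng(0)
maps = p4_lifting_conv(rng.standard_normal((9, 9)), rng.standard_normal((3, 3)))
print(maps.shape)  # (4, 7, 7): four orientation channels
```

All four orientation channels share the weights of a single filter, which is where the parameter savings come from.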
So what happens when we rotate the input image and then do a G-Conv? Is it really equivariant? As explained in the previous section, we want the output of the G-conv to be the same when we transform the input, up to some transformation of the output. The animation below shows that this is indeed the case. When we rotate the input, two things happen: first, the individual output feature maps rotate, and secondly, the output feature maps are permuted (following the arrows). This may sound a bit complicated, but if you inspect the figure below you will see that the output changes in a very predictable manner depending on the change in the input.
p4-convolution is equivariant under rotation
So we see that a rotation in the input space corresponds to a “rotation plus channel permutation” in the output space. We still have the same group of symmetries (rotations and translations), but the group acts differently on the feature space than it does on the input space. This means that for the next layer, we must rotate the filter using this new action: instead of rotating each channel of the filter independently as we would do in the input space, we rotate the channels and then cyclically permute them. In this way, we again create four “rotated” copies of the filter bank that share parameters. As shown in the paper (Cohen & Welling 2016), the output will again have the same transformation law as shown above, so we can iterate this procedure indefinitely, creating a network of arbitrary depth with a very high degree of weight-sharing in every layer.
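To make the "rotate the channels, then cyclically permute them" recipe concrete, here is a sketch of a p4-to-p4 layer for a single filter (our own toy NumPy code with hypothetical names; GrouPy folds all of this into a single standard convolution call for efficiency):

```python
import numpy as np
from scipy.signal import correlate2d

def p4_conv(f, psi):
    """p4 -> p4 group convolution for one filter.

    f:   input feature maps with 4 orientation channels, shape (4, H, W)
    psi: filter with 4 orientation channels, shape (4, k, k)
    """
    out = []
    for r in range(4):
        # Transformed filter for output channel r: rotate every channel
        # of psi by r quarter-turns AND cyclically permute the channels.
        out.append(sum(correlate2d(f[s], np.rot90(psi[(s - r) % 4], r), mode="valid")
                       for s in range(4)))
    return np.stack(out)
```

If the input feature maps are transformed the way a p4 feature space transforms (each channel rotated, channels cyclically permuted), the output transforms in exactly the same way, which is why layers like this can be stacked indefinitely.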
G-Convs can be defined for any group G. If we use a group that includes translations, we can leverage fast 2D convolution routines to do the heavy lifting. Besides the roto-translation group p4, another group to consider is p4m, the group of translations, rotations by multiples of 90 degrees, and reflections (see the Cayley diagram for D4 shown above). In 3D we have even more symmetry: a cubic filter has 48 symmetries.
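The count of 48 is easy to verify: the symmetries of a cube (rotations and reflections) correspond exactly to the 3x3 signed permutation matrices, which the following sketch (our own illustration, not part of GrouPy) enumerates.

```python
import itertools
import numpy as np

# A cube symmetry permutes the three axes (6 ways) and flips
# any subset of them (8 sign patterns), giving 6 * 8 matrices.
symmetries = []
for perm in itertools.permutations(range(3)):
    for signs in itertools.product((1, -1), repeat=3):
        m = np.zeros((3, 3), dtype=int)
        for row, (col, sign) in enumerate(zip(perm, signs)):
            m[row, col] = sign
        symmetries.append(m)

print(len(symmetries))  # 48
```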
We have seen how the G-Conv exploits symmetries beyond translations by sharing parameters between transformed versions of the same filter bank. As with any idea in deep learning, the most important question to ask is: does it actually work?
To test the effectiveness of G-CNNs, we performed experiments on the well-known CIFAR10 image classification dataset, with data augmentation (CIFAR10+) or without. We compared G-CNNs (for G = p4 and G = p4m) to standard CNNs for two baseline architectures: the All-CNN-C network by Springenberg et al. (2015) and a 44-layer Residual Network by He et al. (2015). The results are shown below.
We see that G-CNNs consistently outperform regular CNNs, with the larger group giving better results than the smaller group. This is true even if we use data augmentation with crops and horizontal flips (right-most column). The p4m-CNNs obtained state-of-the-art results in both augmented and non-augmented settings. (In our follow-up paper, we further improved the numbers to 3.65% error on CIFAR10+, and 18.82% on CIFAR100+.)
Interestingly, we found that while both CNNs and G-CNNs improve with standard data augmentation (shifts and horizontal flips), performance for both actually degrades if we augment the dataset with vertical flips and 90-degree rotations, because it makes the learning problem much harder (there are no upside-down cars in the test set!). Despite this fact, the p4m-conv (which exploits vertical flips and 90-degree rotation symmetry) works very well.
Besides more effective use of parameters, another potential reason for the success of G-CNNs is improved optimization. Whereas a filter in a standard CNN gets a gradient signal from a single channel, a G-Conv filter gets gradient signals from several channels. This leads to a less noisy gradient, and indeed we have noticed that G-CNNs converge much faster than regular CNNs.
If you have never been exposed to group theory before, the mathematics behind G-Convs can seem a bit daunting. Hopefully, this blog post helps in understanding G-Convs at the intuitive level and seeing why they make sense. If you want to learn more, check out Chris Olah's post on groups and group convolutions.
If you’re interested in trying out G-Convs for yourself, have a look at our github repositories GrouPy and gconv_experiments. GrouPy implements the G-Conv layer, and gconv_experiments shows how you can use them (they’re basically drop-in replacements for the standard conv layers). We have implementations in Chainer and TensorFlow, but porting to other frameworks should not be difficult.
T.S. Cohen and M. Welling, Group Equivariant Convolutional Networks. Proceedings of the International Conference on Machine Learning (ICML), 2016.
K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition. arXiv:1512.03385, 2015.
J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, Striving for Simplicity: The All Convolutional Net. Proceedings of the International Conference on Learning Representations (ICLR), 2015.