Contrastive learning models proposed in recent years (Figure 1 from 'A Simple Framework for Contrastive Learning of Visual Representations')
Supervised learning has been widely used in machine learning. It requires the input data (for example, images) to be annotated with labels (for example, classes). With supervised learning we can build models that capture the relationship between inputs and labels, and then use those models to predict labels for future inputs. This is how traditional supervised methods work, such as decision trees, logistic regression, SVMs, and neural networks.
Creating annotations is costly, so in most circumstances annotations are not available. We need methods that can make use of the huge amount of unannotated real-world data; such methods are called unsupervised learning. Conventional unsupervised learning includes clustering (for example, k-means) and dimensionality reduction (for example, principal component analysis). Contrastive learning was developed recently and has become one of the most important branches of unsupervised learning.
I will discuss two classical papers on contrastive learning. One introduces the idea of contrastive learning, and the other presents a more advanced framework that is widely used today.
Paper 1: InstDisc (2018)
Wu, Z., Xiong, Y., Yu, S.X. and Lin, D., 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3733-3742).
This is one of the papers that proposed the concept of contrastive learning (CPC (Contrastive Predictive Coding) also proposed contrastive learning in the same year, using generative methods; since we focus only on instance-based methods, only InstDisc is discussed here). The authors are motivated by some supervised learning results, as shown in Figure 1. They observe: "For an image from class leopard, the classes that get highest responses from a trained neural net classifier are all visually correlated, e.g., jaguar and cheetah." It seems the apparent similarity in the images themselves can bring some classes closer than others. This motivates them to treat each image as a distinct class of its own, so that each image comes with its natural annotation. Then, for a given image, they take that image as the positive sample and the rest as negative samples. These samples can be trained with a contrastive objective, and they call this method instance discrimination.
As shown in Figure 2, they use a deep convolutional neural network to encode each image as a 128-dimensional feature vector. A memory bank stores the features of all instances.
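To make the idea concrete, here is a minimal numpy sketch of the non-parametric softmax over instances that InstDisc builds on: the probability that a feature vector belongs to instance i is a temperature-scaled softmax of its similarity to every feature stored in the memory bank. The toy dimensions, the name `instance_prob`, and the tiny 5-entry bank are illustrative, not from the paper.

```python
import numpy as np

def instance_prob(v, memory_bank, tau=0.07):
    """P(i|v): probability that feature v belongs to instance i,
    computed as a softmax over all memory-bank features."""
    logits = memory_bank @ v / tau          # similarity to every stored instance
    logits -= logits.max()                  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
# Toy memory bank: 5 instances, 128-dim L2-normalized features.
bank = rng.normal(size=(5, 128))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)

# Querying with instance 2's own feature gives instance 2 the highest probability.
p = instance_prob(bank[2], bank)
print(p.argmax())  # 2
```

The low temperature (the paper uses τ = 0.07) sharpens the distribution, so the matching instance dominates the softmax.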
Figure 2: The pipeline of the proposed unsupervised feature learning approach (Figure 2 in 'Unsupervised Feature Learning via Non-Parametric Instance Discrimination').
Figure 3: Loss function (Equations 6 and 7 in 'Unsupervised Feature Learning via Non-Parametric Instance Discrimination').
As shown in Figure 4, their method achieves 54% top-1 accuracy, much higher than the baseline models (from Random to Exemplar).
Figure 4: Top-1 classification accuracy on ImageNet (Table 2 in 'Unsupervised Feature Learning via Non-Parametric Instance Discrimination').
Paper 2: SimCLR (2020)
Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020, November. A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). PMLR.
This is a more recent work with a more advanced framework for contrastive learning. SimCLR no longer uses a memory bank holding all the negative samples, which takes up too much space on large datasets. Instead, the other augmented examples in the same (large) training batch serve as negatives, so no extra storage structure is needed. The contrastive learning framework is composed of four parts: a data augmentation module, a neural network encoder, a projection head, and a contrastive loss function.
Data augmentation transforms an image into two correlated views of the same image, called a positive pair. One view is the anchor i and the other is the positive sample j. For a dataset of N images, we obtain 2N samples after augmentation. A particular image therefore has the anchor i, the positive sample j, and 2(N-1) negative samples. The augmentation techniques are: random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur.
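The N-images-to-2N-views bookkeeping can be sketched in numpy. The crop-and-resize below uses nearest-neighbor resampling on plain arrays as a stand-in for the paper's augmentation pipeline (color distortion and Gaussian blur are omitted); the function name and the 32x32/16x16 sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img, crop=16):
    """Randomly crop a square patch, then resize it back to the original
    size with nearest-neighbor sampling (a stand-in for SimCLR's
    crop-and-resize augmentation)."""
    h, w = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    ys = np.arange(h) * crop // h           # nearest-neighbor row indices
    xs = np.arange(w) * crop // w           # nearest-neighbor column indices
    return patch[np.ix_(ys, xs)]

# N images -> 2N correlated views: each image yields one positive pair.
images = [rng.normal(size=(32, 32)) for _ in range(4)]
views = [random_crop_resize(img) for img in images for _ in range(2)]
print(len(views))  # 2N = 8
```

Views 2k and 2k+1 come from the same source image, which is the pairing the loss function relies on.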
An encoder is a neural network that extracts representation vectors from augmented data examples. ResNet is used as the encoder in this paper.
A projection head maps representations to the space where the contrastive loss is applied. Specifically, they use an MLP with one hidden layer as the projection head. Their experiments show that the projection head boosts model performance significantly.
The contrastive loss function is defined as shown below in Figure 5. In a positive pair of examples (i, j), i is the anchor sample and j is the positive sample; the index k in the denominator ranges over all 2N samples in the augmented batch.
Figure 5: Loss function (Equation 1 in 'A Simple Framework for Contrastive Learning of Visual Representations').
The anchor and the positive sample are naturally the most similar. We want a representation under which, in the embedding space, the positive sample is closest to the anchor while the negative samples are pushed far away. Such a representation helps us capture the inherent similarities among images.
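This objective (NT-Xent, the loss in Figure 5) can be sketched in a few lines of numpy: cosine similarities scaled by a temperature, the k = i term excluded, and a cross-entropy pulling each view toward its positive. The pairing convention (rows 2k and 2k+1 form a pair) and the toy 2-D vectors are illustrative.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent loss over 2N augmented views, where rows (2k, 2k+1) are a
    positive pair and all other views in the batch act as negatives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude the k = i term
    pos = np.arange(len(z)) ^ 1                       # partner: 0<->1, 2<->3, ...
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(len(z)), pos].mean()

# Perfectly aligned positive pairs score a lower loss than mismatched pairs.
aligned    = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
mismatched = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
print(nt_xent(aligned) < nt_xent(mismatched))  # True
```

Minimizing this loss is exactly the behavior described above: it raises the positive pair's similarity relative to every negative in the batch.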
Figure 6: A simple framework for contrastive learning of visual representations (Figure 2 in 'A Simple Framework for Contrastive Learning of Visual Representations').
As shown in Figure 7, they compared the proposed method SimCLR with baseline methods and previous contrastive methods. SimCLR achieves 85.8% top-5 accuracy when fine-tuned with only 1% of the labels, the best among all those methods.
Figure 7: ImageNet accuracy (Table 7 in 'A Simple Framework for Contrastive Learning of Visual Representations').
The two papers discussed above are both classics of contrastive learning. InstDisc introduced the idea of contrastive learning and a preliminary method for it. The authors observed that when a classifier labels an image of a leopard, related classes such as jaguar and cheetah appear among the top responses far more often than unrelated classes such as lifeboat, suggesting apparent similarities in the images themselves. They proposed a pipeline based on instance discrimination to look for such similarities: it represents each image as a 128-dimensional feature vector and distinguishes between positive and negative samples to obtain the best representation.
- Xin Wei