Contrastive learning models proposed in recent years (Figure 1 from 'A Simple Framework for Contrastive Learning of Visual Representations')
Supervised learning has been widely used in machine learning. It requires the input data (for example, images) to be annotated with labels (for example, classes). With supervised learning we can build models that capture the relationship between inputs and labels, and then use those models to predict labels for future inputs. This is how traditional supervised methods work, such as decision trees, logistic regression, SVMs, and neural networks.
Creating annotations is costly, so in most circumstances annotations are not available. We need methods that can make use of the huge amount of unannotated real-world data; such methods are called unsupervised learning. Conventional unsupervised learning includes clustering (for example, k-means) and dimensionality reduction (for example, principal component analysis). Contrastive learning was developed recently and has become one of the most important branches of unsupervised learning.
I will discuss two classical papers on contrastive learning. One introduces the idea of contrastive learning, and the other presents a more advanced framework that is widely used today.
Paper 1: InstDisc (2018)
Wu, Z., Xiong, Y., Yu, S.X. and Lin, D., 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3733-3742).
This is one of the papers that proposed the concept of contrastive learning (CPC (Contrastive Predictive Coding) also proposed contrastive learning in the same year, using generative methods; since we focus only on instance-based methods, only InstDisc is discussed here). The authors are motivated by some supervised learning results, as shown in Figure 1. They observe: "For an image from class leopard, the classes that get highest responses from a trained neural net classifier are all visually correlated, e.g., jaguar and cheetah." It seems the apparent similarity in the images themselves can bring some classes closer than others. This motivates them to treat each image as a distinct class of its own, so that each image comes with its natural annotation. Then, for a given image, they take that image as the positive sample and the rest as negative samples. These samples can be trained with a contrastive objective, and they call this method instance discrimination.
As shown in Figure 2, they use a deep convolutional neural network to encode each image as a 128-dimensional feature vector. A memory bank stores the features of all instances.
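To make the idea concrete, here is a minimal numpy sketch of the non-parametric softmax over instances that InstDisc builds on: the probability that a feature vector belongs to instance i is a temperature-scaled softmax of its similarity to every feature stored in the memory bank. The toy dimensions, the name `instance_prob`, and the tiny 5-entry bank are illustrative, not from the paper.

```python
import numpy as np

def instance_prob(v, memory_bank, tau=0.07):
    """P(i|v): probability that feature v belongs to instance i,
    computed as a softmax over all memory-bank features."""
    logits = memory_bank @ v / tau          # similarity to every stored instance
    logits -= logits.max()                  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
# Toy memory bank: 5 instances, 128-dim L2-normalized features.
bank = rng.normal(size=(5, 128))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)

# Querying with instance 2's own feature gives instance 2 the highest probability.
p = instance_prob(bank[2], bank)
print(p.argmax())  # 2
```

The low temperature (the paper uses τ = 0.07) sharpens the distribution, so the matching instance dominates the softmax.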
Figure 2: The pipeline of the proposed unsupervised feature learning approach (Figure 2 in 'Unsupervised Feature Learning via Non-Parametric Instance Discrimination').
Figure 3: Loss function (Equations 6 and 7 in 'Unsupervised Feature Learning via Non-Parametric Instance Discrimination').
As shown in Figure 4, their method achieves 54% top-1 accuracy, much higher than the baseline models (from Random to Exemplar).
Figure 4: Top-1 classification accuracy on ImageNet (Table 2 in 'Unsupervised Feature Learning via Non-Parametric Instance Discrimination').
Paper 2: SimCLR (2020)
Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020, November. A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597-1607). PMLR.
This is a more recent work with a more advanced framework for contrastive learning. SimCLR no longer uses a memory bank holding all the negative samples, which takes up too much space on large datasets. Instead, the other augmented examples in the same (large) training batch serve as negatives, so no extra storage structure is needed. The contrastive learning framework is composed of four parts: a data augmentation module, a neural network encoder, a projection head, and a contrastive loss function.
Data augmentation transforms an image into two correlated views of the same image, called a positive pair. One view is the anchor i and the other is the positive sample j. For a dataset of N images, we obtain 2N samples after augmentation. A particular image therefore has the anchor i, the positive sample j, and 2(N-1) negative samples. The augmentation techniques are: random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur.
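The N-images-to-2N-views bookkeeping can be sketched in numpy. The crop-and-resize below uses nearest-neighbor resampling on plain arrays as a stand-in for the paper's augmentation pipeline (color distortion and Gaussian blur are omitted); the function name and the 32x32/16x16 sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img, crop=16):
    """Randomly crop a square patch, then resize it back to the original
    size with nearest-neighbor sampling (a stand-in for SimCLR's
    crop-and-resize augmentation)."""
    h, w = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    ys = np.arange(h) * crop // h           # nearest-neighbor row indices
    xs = np.arange(w) * crop // w           # nearest-neighbor column indices
    return patch[np.ix_(ys, xs)]

# N images -> 2N correlated views: each image yields one positive pair.
images = [rng.normal(size=(32, 32)) for _ in range(4)]
views = [random_crop_resize(img) for img in images for _ in range(2)]
print(len(views))  # 2N = 8
```

Views 2k and 2k+1 come from the same source image, which is the pairing the loss function relies on.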
An encoder is a neural network that extracts representation vectors from augmented data examples. ResNet is used as the encoder in this paper.
A projection head maps representations to the space where the contrastive loss is applied. Specifically, they use an MLP with one hidden layer as the projection head. Their experiments show that the projection head boosts model performance significantly.
The contrastive loss function is defined as shown below in Figure 5. In a positive pair of examples (i, j), i is the anchor sample and j is the positive sample; the index k in the denominator ranges over all 2N samples in the augmented batch.
Figure 5: Loss function (Equation 1 in 'A Simple Framework for Contrastive Learning of Visual Representations').
The anchor and the positive sample are naturally the most similar. We want a representation under which, in the embedding space, the positive sample is closest to the anchor while the negative samples are pushed far away. Such a representation helps us capture the inherent similarities among images.
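This objective (NT-Xent, the loss in Figure 5) can be sketched in a few lines of numpy: cosine similarities scaled by a temperature, the k = i term excluded, and a cross-entropy pulling each view toward its positive. The pairing convention (rows 2k and 2k+1 form a pair) and the toy 2-D vectors are illustrative.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent loss over 2N augmented views, where rows (2k, 2k+1) are a
    positive pair and all other views in the batch act as negatives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude the k = i term
    pos = np.arange(len(z)) ^ 1                       # partner: 0<->1, 2<->3, ...
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(len(z)), pos].mean()

# Perfectly aligned positive pairs score a lower loss than mismatched pairs.
aligned    = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
mismatched = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
print(nt_xent(aligned) < nt_xent(mismatched))  # True
```

Minimizing this loss is exactly the behavior described above: it raises the positive pair's similarity relative to every negative in the batch.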
Figure 6: A simple framework for contrastive learning of visual representations (Figure 2 in 'A Simple Framework for Contrastive Learning of Visual Representations').
As shown in Figure 7, they compared the proposed method SimCLR with baseline methods and previous contrastive methods. SimCLR achieves 85.8% top-5 accuracy when fine-tuned with only 1% of the labels, the best among all those methods.
Figure 7: ImageNet accuracy (Table 7 in 'A Simple Framework for Contrastive Learning of Visual Representations').
The two papers discussed above are both classics of contrastive learning. InstDisc introduced the idea of contrastive learning and a preliminary method for it. The authors observed that when a classifier labels an image of a leopard, related classes such as jaguar and cheetah appear among the top responses far more often than unrelated classes such as lifeboat, suggesting apparent similarities in the images themselves. They proposed a pipeline based on instance discrimination to look for such similarities: it represents each image as a 128-dimensional feature vector and distinguishes between positive and negative samples to obtain the best representation.
- Xin Wei