
2024-01-10: Paper Summary: ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning


Figure 1: An example of claim sentences (Figure 1 in ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning)


In the summer of 2023, we published a paper titled "ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning" at the Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2023) and the 3rd AI + Informetrics (AII 2023), co-located with JCDL 2023.

In this paper, we explored ways to automatically locate key claims in the abstracts of academic papers. One challenge of this task is the limited amount of publicly available training data. We compared transfer learning and contrastive learning and found that contrastive learning outperformed transfer learning at a lower cost in training data. We proposed a contrastive-learning-based model, ClaimDistiller, which achieved the highest performance among all methods, with an F1 score as high as 87.45%.


Claims

We define a scientific claim as a sentence that states a core finding of a scientific paper. One example is given in Figure 1. Specifically, there are three types of claims in academic papers, as pointed out by Achakulvisut et al.:

  • Type 1: A statement that declares something is better;
  • Type 2: A statement that proposes something new;
  • Type 3: A statement that describes a new finding or a new cause-and-effect relationship.

In Figure 1, the claim sentence states a new finding of the paper, so it is annotated as a claim. The non-claim sentence concerns the publication information of the whole paper and does not belong to any of the claim types above.


Datasets

We employed three datasets: SciCE (Scientific Claim Extraction), SciARK, and PubMed-RCT.

SciCE is the largest dataset so far for scientific claim extraction. It contains 1,500 scientific abstracts in the biomedical domain. Each sentence is labeled by domain experts as one of two categories: claim or non-claim. This dataset is used in both transfer learning and contrastive learning.

The second dataset is PubMed-RCT, which provides discourse labels for the sentences in abstracts of PubMed papers. PubMed-RCT consists of about 200,000 abstracts, comprising 2.3 million sentences selected from the MEDLINE/PubMed Baseline Database. In our paper, it is used as the source dataset for transfer learning. Because the SciCE dataset contains only 1,500 abstracts, its limited size constrains performance, so we incorporate the much larger PubMed-RCT dataset to build the pre-trained model.

The third dataset, SciARK, serves a similar role to SciCE: it is used across the different deep learning frameworks to compare performance. It is composed of 9,055 sentences extracted from the abstracts of 689 academic papers. Each sentence is annotated as Claim, Evidence, or Nonetype. To be consistent with SciCE, we merge "Evidence" and "Nonetype" into "non-claim" and treat it as a binary dataset (claim vs. non-claim), as sketched below.
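As a rough illustration of this relabeling step (the example sentences below are invented; only the label names come from the datasets), the merge can be done in a few lines of Python:

```python
# Illustrative sketch: collapse SciARK's three labels into the binary
# claim / non-claim scheme used by SciCE. Example sentences are made up.
def to_binary_label(label: str) -> str:
    return "claim" if label == "Claim" else "non-claim"

sciark_sentences = [
    ("Our method improves accuracy over the baseline.", "Claim"),
    ("Table 3 reports the evaluation metrics.", "Evidence"),
    ("The corpus was collected from PubMed.", "Nonetype"),
]

binary_dataset = [(text, to_binary_label(label)) for text, label in sciark_sentences]
# -> [("...", "claim"), ("...", "non-claim"), ("...", "non-claim")]
```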

 

Deep Learning Frameworks

Figure 2 shows a comparison of the three training frameworks. 'Network' in this figure can be any of the base models, which are discussed in the next section. The training frameworks are as follows:

  • Trained from Scratch. In this setting, the neural classifier is trained directly on the SciCE corpus using only the base models (CNN-1D, USE-Dense, and WC-BiLSTM).


  • Transfer Learning. In this setting, the neural classifier is first pre-trained on the PubMed-RCT corpus and then fine-tuned on the SciCE corpus. During the fine-tuning stage, we freeze the weights of all layers except the fully-connected classification layer, which is replaced with a new layer sized for the classes of the target dataset (a minimal sketch follows this list).

  • Supervised Contrastive Learning. In supervised contrastive learning, the neural network is first pre-trained with augmented training data derived from the SciCE corpus and then fine-tuned on the original SciCE data. Note that in this setting only SciCE is used, unlike transfer learning, where the PubMed-RCT dataset is used to capture additional information.
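As an illustration of the fine-tuning step in the transfer learning setting, the sketch below freezes a pre-trained network and swaps in a new classification head. It is a minimal PyTorch example, not the paper's actual code; the `classifier` attribute name is an assumption.

```python
import torch.nn as nn

def prepare_for_finetuning(pretrained: nn.Module,
                           feature_dim: int,
                           num_target_classes: int) -> nn.Module:
    # Freeze all pre-trained weights.
    for param in pretrained.parameters():
        param.requires_grad = False
    # Replace the fully-connected classification head (assumed here to live
    # in an attribute called `classifier`) with a fresh layer sized for the
    # target dataset; only this new layer is updated during fine-tuning.
    pretrained.classifier = nn.Linear(feature_dim, num_target_classes)
    return pretrained
```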

Figure 2: Comparing training from scratch, transfer learning, and supervised contrastive learning.

Both transfer learning and supervised contrastive learning can be used to overcome limited training data. The difference is that transfer learning draws on external source data, while contrastive learning generates 'augmented data' similar to the original data to enlarge the training set.

Base Models

The base models used within the training frameworks described above are as follows:

  • CNN-1D. Similar to the regular CNNs used for feature extraction from 2-dimensional images, a 1-dimensional CNN can extract features from word sequences. It works by sliding a window of fixed width over the sequence and convolving the features of the tokens covered by the window. Average pooling is then used to aggregate features from individual tokens (a rough sketch follows this list).

  • USE-dense. We adopted the pre-trained Universal Sentence Encoder (USE) to encode claim text into dense 512-dimensional vectors. The initial embeddings produced by USE were fine-tuned on the SciCE corpus, after which the sentences were encoded to dense feature vectors used by the fully-connected layer for classification.

  • WC-BiLSTM (Word and Character embedding Bidirectional Long Short-Term Memory). We combine pre-trained Word2Vec embeddings with character embeddings to encode unseen words. The combined embedding is fed to bidirectional long short-term memory (BiLSTM) layers to extract patterns from claim sentences. Finally, the representations are passed to a fully-connected layer for classification.
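For concreteness, here is a rough PyTorch sketch of a CNN-1D sentence classifier in the spirit described above; the layer sizes are illustrative and are not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class CNN1DClassifier(nn.Module):
    """1-D CNN over token embeddings with average pooling (illustrative only)."""
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 num_filters: int = 100, window: int = 3, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Slide a fixed-width window over the sequence and convolve token features.
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=window)
        self.pool = nn.AdaptiveAvgPool1d(1)  # average pooling over positions
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)        # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)            # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))     # (batch, num_filters, seq_len - window + 1)
        x = self.pool(x).squeeze(-1)     # (batch, num_filters)
        return self.fc(x)                # claim / non-claim logits
```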


Data Augmentation

Data augmentation is an essential part of contrastive learning: it creates the dataset used for pre-training by generating sentences with similar semantics. There are mainly five methods for data augmentation (a toy sketch of one of the EDA operations follows the list):

  • Round Trip Translation (RTT). This method first translates the sentence from English to French and then translates it back to English. Translation is performed with Google Translate as well as Amazon Translate.

  • WordNet Synonym Replacement. This method replaces words in the sentence with their synonyms. Replaceable words, such as verbs and nouns, are selected from the sentence using a part-of-speech tagger. Then a number of these words are selected following a geometric distribution and replaced by synonyms drawn from WordNet.

  • EDA (Easy Data Augmentation) Synonym Replacement. Randomly pick a word (not a stop word) from the sentence and replace it with one of its synonyms chosen at random.

  • EDA Random Deletion. Randomly remove each word in the sentence with a specified probability; we use the default value of 0.2.

  • EDA Random Insertion. Find a random synonym of a random word (not a stop word) in the sentence and insert that synonym at a random position in the sentence.
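As a toy example of one of these operations, the snippet below implements EDA random deletion with the 0.2 probability mentioned above; it is a sketch, not the exact augmentation code used in the paper.

```python
import random

def random_deletion(sentence: str, p: float = 0.2) -> str:
    """Drop each word with probability p (EDA random deletion, illustrative)."""
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if random.random() > p]
    # Keep at least one word so the augmented sentence is never empty.
    return " ".join(kept) if kept else random.choice(words)

print(random_deletion("supervised contrastive learning improves claim extraction"))
```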

Proposed Framework: ClaimDistiller

Our proposed framework is based on supervised contrastive learning (SCL). The architecture of the framework is shown in Figure 3. SCL is implemented in two stages. In the first stage, we augment each labeled sentence into two sentences with similar semantics. This augmented dataset is fed into the encoder for Stage 1 training. The encoder, together with a projection head composed of several dense layers, minimizes the supervised contrastive loss to obtain embeddings that pull positive samples together and push negative samples apart. In Stage 2, we retain the encoder, freeze the weights of its dense layers, and add two more dense layers for classification. The classifier is trained to minimize the cross-entropy loss.
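To make the Stage 1 objective concrete, here is a hedged PyTorch sketch of a supervised contrastive loss in the spirit of Khosla et al.'s SupCon formulation, which treats samples sharing a label (claim or non-claim) as positives; this is an illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull same-label embeddings together, push different-label ones apart."""
    z = F.normalize(embeddings, dim=1)                    # (N, d) unit vectors
    sim = z @ z.T / temperature                           # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))       # ignore self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: other samples with the same label as the anchor.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return loss.mean()
```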


Figure 3: Architecture of our proposed framework: ClaimDistiller (Figure 4 in ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning).


 Results

The results of the experiments are shown in Table 1. Contrastive learning consistently outperforms transfer learning across all models. On SciCE, the contrastive-learning-based ClaimDistiller has the best performance across all metrics, achieving F1=87.45%, precision=87.08%, and recall=87.83%. On SciARK, ClaimDistiller also has the best performance, with F1=88.93%, precision=90.02%, and recall=89.47%.


Table 1: A comparison of models on the scientific claim extraction task (Table 2 in ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning).

The training time needed for each model varies. In general, transfer learning needs significantly more training time than supervised contrastive learning. Contrastive learning also has an advantage in data size: it uses fewer than 6,000 sentences for pre-training, while transfer learning uses roughly 2 million sentences.


Prediction Analysis

Two examples of typical prediction results are presented below as case studies. The ground-truth claims are highlighted in blue. Green labels mark sentences predicted as non-claims and red labels mark sentences predicted as claims; labels with red frames indicate wrong predictions.

In the first example, the model identifies all the claims but mistakenly recognizes two additional sentences as claims: the first describes the article's main idea and the second states a known fact, so both are non-claims.


Figure 4: Example 1 of the prediction results in the test set of SciCE (Figure 10 in ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning).



In the second example, the ground truth contains two claims but the model predicted only one. The false-negative sentence is marked with a red box; it describes a new finding, so it should have been labeled a claim.


Figure 5: Example 2 of the prediction results in the test set of SciCE (Figure 10 in ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning).

Conclusion

The challenge for claim extraction is how to efficiently use the limited existing data. We propose the ClaimDistiller framework, which applies supervised contrastive learning on top of existing text encoders to boost the performance of scientific claim extraction. We showcased the efficacy of this mechanism on two benchmark datasets. Our result establishes a new state of the art, outperforming the existing method, which used transfer learning on a BiLSTM-CRF architecture, by 7%. We also demonstrated that SCL achieves comparable or higher F1 scores than transfer learning with significantly less training data and time.



Wei, Xin, Md Reshad Ul Hoque, Jian Wu, and Jiang Li. "ClaimDistiller: Scientific Claim Extraction with Supervised Contrastive Learning." In Proceedings of the Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE 2023) and the 3rd AI + Informetrics (AII 2023), pages 65–77, 2023.


-- Xin

