ParsCit is a citation parser developed jointly by Pennsylvania State University and the National University of Singapore. Over the past ten years, it has been compared with many other citation parsing tools and is still widely used. Although Neural ParsCit has been developed, its implementation is still not as easy to use as ParsCit. Notably, PDFMEF wraps ParsCit as its default citation parser.
However, many people have found that installing ParsCit is not straightforward. This is partly because it is written in Perl and partly because the instructions on the ParsCit website are not fully accurate. In this blog post, I describe how to install ParsCit on an Ubuntu 16.04.6 LTS desktop. Installation on CentOS should be similar. The instructions do not cover Windows.
The following steps assume we install ParsCit under /home/username/github.
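Before starting, it can help to see which prerequisites are already present. The sketch below checks for the tools and Perl modules that the steps in this post install. The helper names have_tool and have_perl_module are illustrative, not part of ParsCit, and the check only tells you whether something is found, not whether its version is suitable.

```shell
#!/bin/sh
# Preflight sketch: report which prerequisites are already present.
# Tool and module names are taken from the steps in this post;
# the helper function names are illustrative.

have_tool() {
    # True if the command is found on PATH.
    command -v "$1" >/dev/null 2>&1
}

have_perl_module() {
    # True if Perl can load the module.
    perl -M"$1" -e 1 >/dev/null 2>&1
}

for tool in g++ ruby perl; do
    if have_tool "$tool"; then
        echo "ok       $tool"
    else
        echo "missing  $tool"
    fi
done

for mod in Class::Struct Getopt::Long Getopt::Std File::Basename File::Spec \
           FindBin HTML::Entities IO::File POSIX XML::Parser XML::Twig \
           XML::Writer XML::Writer::String; do
    if have_perl_module "$mod"; then
        echo "ok       $mod"
    else
        echo "missing  $mod"
    fi
done
```

Anything reported as missing is covered by one of the steps that follow.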
- Download the source code from https://github.com/knmnyn/ParsCit and unzip it.
$ unzip ParsCit-master.zip
- Install c++ compiler
$ sudo apt install g++
To test it, write a simple program hello.cc and run
$ g++ -o hello hello.cc
$ ./hello
- Install ruby
$ sudo apt install ruby-full
To test it, run
$ ruby --version
- Perl usually comes with the default Ubuntu installation. To test it, run
$ perl --version
- Install the Perl modules. First start CPAN
$ perl -MCPAN -e shell
Accept the default settings until the CPAN prompt appears:
cpan[1]>
Then install packages one by one
cpan[1]> install Class::Struct
cpan[2]> install Getopt::Long
cpan[3]> install Getopt::Std
cpan[4]> install File::Basename
cpan[5]> install File::Spec
cpan[6]> install FindBin
cpan[7]> install HTML::Entities
cpan[8]> install IO::File
cpan[9]> install POSIX
cpan[10]> install XML::Parser
cpan[11]> install XML::Twig
Accept the default settings when prompted.
cpan[12]> install XML::Writer
cpan[13]> install XML::Writer::String
- Install crfpp (version 0.51) from source.
- Go into the crfpp directory
$ cd crfpp/
- Extract the tar file
$ tar xvf crf++-0.51.tar.gz
- Go into the CRF++ directory
$ cd CRF++-0.51/
- Configure
$ ./configure
- Compile
$ make
This will fail with an error like the one below:
path.h:26:52: error: 'size_t' has not been declared
void calcExpectation(double *expected, double, size_t) const;
^
Makefile:375: recipe for target 'node.lo' failed
make[1]: *** [node.lo] Error 1
make[1]: Leaving directory '/home/jwu/github/ParsCit-master/crfpp/CRF++-0.51'
Makefile:240: recipe for target 'all' failed
make: *** [all] Error 2
This is likely because node.cpp and path.cpp are missing two include lines, #include "stdlib.h" and #include <iostream>. Add them before the other include statements, so that the beginning of each file looks like
#include "stdlib.h"
#include <iostream>
#include <cmath>
#include "path.h"
#include "common.h"
Then run ./configure and make again.
- Install crf++
$ make clean
$ make
This should rebuild crf_test and crf_learn.
- Move the executables to where ParsCit expects to find them.
$ cp crf_learn crf_test ..
$ cd .libs
$ cp -Rf * ../../.libs
- Test ParsCit. Under the bin/ directory, run
$ ./citeExtract.pl -m extract_all ../demodata/sample2.txt
$ ./citeExtract.pl -i xml -m extract_all ../demodata/E06-1050.xml
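If you repeat this installation on several machines, the manual edit of node.cpp and path.cpp from the compile step can be scripted. The following is a minimal sketch, assuming the two missing lines are #include "stdlib.h" and #include <iostream> as in the listing above. add_includes is a hypothetical helper, not something shipped with ParsCit or CRF++.

```shell
#!/bin/sh
# Sketch: prepend the two missing include lines to a source file, once.
# add_includes is a hypothetical helper, not part of ParsCit or CRF++.

add_includes() {
    file=$1
    # Skip files that already start with the added include.
    if head -n 2 "$file" | grep -q '#include "stdlib.h"'; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    {
        printf '#include "stdlib.h"\n'
        printf '#include <iostream>\n'
        cat "$file"
    } > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Intended to be run from inside crfpp/CRF++-0.51/
for f in node.cpp path.cpp; do
    if [ -f "$f" ]; then
        add_includes "$f"
    else
        echo "skipped $f (not found in $(pwd))" >&2
    fi
done
```

The helper leaves an already patched file unchanged, so running it twice is harmless. After patching, run ./configure and make as described in the compile step.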