Citation analysis is used to rate authors (problematic) and to find interesting papers (a good idea). Citations of papers on the well-known arXiv.org preprint server are analysed by CiteBase, which is very useful. Unfortunately it is buggy and does not always work. I really wonder why the full text of a paper is parsed instead of using the BibTeX source.

The citation parser ParaCite was developed in the Open Citation Project. Since then it seems to have been more or less abandoned. But it is open source, so you can test your papers before uploading, and one could take the suitable parts to build a better citation parser. I found out that you can extract the citations from a document in $file (for instance a PDF) with Perl like this (the required modules are available on CPAN):
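    use Data::Dumper;
    use Biblio::Citation::Parser::Citebase;
    use Biblio::Document::Parser::Brody;
    use Biblio::Document::Parser::Utils;

    # Extract the reference strings from the full text
    my $content    = Biblio::Document::Parser::Utils::get_content($file);
    my $doc_parser = Biblio::Document::Parser::Brody->new;
    my @references = $doc_parser->parse($content);

    # Parse each reference string into metadata
    my $parser = Biblio::Citation::Parser::Citebase->new;
    for my $i ( 0 .. $#references ) {
        my $metadata = $parser->parse( $references[$i] );
        print '[' . ( $i + 1 ) . '] ' . Dumper($metadata) . "\n";
    }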
In the documents that I tested there were almost always parsing errors, but it is better than nothing. I wonder what CiteSeer uses to extract citations? There is more activity around citation parsing in the Zotero project, which even has an IDE called Scaffold for creating new “translators” that extract bibliographic data from webpages. Another playground is Wikipedia, which contains a growing number of references. And of course there are the commercial citation indexes like SCI. I thought about using citation data for additional catalog enrichment (in addition to ISBN2Wikipedia), but the quality of the data seems to be too low and identifiers are missing.
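To get a feeling for how many references carry an identifier at all, one could scan the extracted reference strings with a few regular expressions. The following is only a minimal sketch: find_identifiers is a hypothetical helper that is not part of ParaCite, and the patterns are loose heuristics of my own that will miss or mangle some identifiers.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of ParaCite): look for common identifiers
# in a raw reference string. The patterns are deliberately loose heuristics.
sub find_identifiers {
    my ($reference) = @_;
    my %ids;

    # DOI: "10.", a 4-9 digit registrant code, "/", then a suffix without spaces
    if ( $reference =~ m{\b(10\.\d{4,9}/[^\s"<>]+)} ) {
        ( $ids{doi} = $1 ) =~ s/[.,;)]+$//;    # drop trailing punctuation
    }

    # New-style arXiv identifier, e.g. arXiv:0704.0001 (version suffix ignored)
    if ( $reference =~ m{\barXiv:(\d{4}\.\d{4,5})(?:v\d+)?}i ) {
        $ids{arxiv} = $1;
    }

    # PubMed identifier, e.g. "PMID: 12345678"
    if ( $reference =~ m{\bPMID:?\s*(\d+)}i ) {
        $ids{pmid} = $1;
    }

    return \%ids;
}

# Made-up reference string, for illustration only
my $ids = find_identifiers(
    'A. Author, Some title, J. Example 1 (2007), doi:10.1000/xyz123. arXiv:0704.0001'
);
print "$_ => $ids->{$_}\n" for sort keys %$ids;
# prints:
# arxiv => 0704.0001
# doi => 10.1000/xyz123
```

Applied to each element of @references from the snippet above, this would show which references can be linked through an identifier and which ones have to be matched by their metadata alone.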
P.S: Right after writing this, I found Alf Eaton’s experiment with collecting the conversations around a paper from various academic, news, blog and other discussion channels. As soon as you have identifiers (ISBN, URL, DOI, PMID…), the world gets connected.
P.P.S: ParsCit seems to be a good new reference string parsing package (open source, written in Perl).
P.P.P.S: Konstantin Baierer maintains a bibliography on citation parsing for his parser Citation::Multi::Parser.