Thursday, July 31, 2008

CiteSeer's Dataset

I am exploring the citation and co-authorship graphs of the documents (and contexts) indexed by CiteSeer. However, parsing their index has proved tricky. The good news is that CiteSeer provides an OAI-PMH compliant dump of their index. I downloaded and unzipped the index as follows:

$ wget http://cs1.ist.psu.edu/public/oai/oai_citeseer.tar.gz
$ tar -zxf oai_citeseer.tar.gz

The dump is based on the Dublin Core standard, with additional metadata fields including citation relationships (References and IsReferencedBy), author affiliations, and author addresses. The index is split into many 'dump' files with no root XML tag, so I concatenated them inside a single root element:

$ (echo "<records>"; cat oai_citeseer/*; echo "</records>") > cs.xml

The file is quite big: approximately 1.9GB, with over 36 million lines. The bad news is that it does not parse:

$ xmllint --stream cs.xml

cs.xml:92025: parser error : attributes construct error
<oai_citeseer:author name="L. "j. Svensson">

The XML is not well-formed. I tried some quick repairs with sed:

$ sed -e 's/L\. "j\. Svensson/L. J. Svensson/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml

cs.xml:168403: parser error : internal error
<dc:title>Imagining CLP(^,= alpheta )</dc:title>
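
When xmllint reports an "internal error" on a line that looks harmless, it can help to print just that line with non-printing bytes made visible. A minimal sketch, assuming a sed and cat that behave like the usual GNU or BSD ones; show_line is a hypothetical helper name and the sample file is made up:

```shell
# show_line FILE N: print line N of FILE with control bytes made visible.
# (show_line is a hypothetical helper, not part of any existing tool.)
show_line() {
    # "N{p;q;}" prints line N and quits right away, so sed does not
    # read the rest of a multi-gigabyte file.
    sed -n "${2}{p;q;}" "$1" | cat -v
}

# Demo on a made-up two-line sample with a stray 0x01 byte in line 2.
sample=$(mktemp)
printf 'ok line\n<dc:title>bad\001title</dc:title>\n' > "$sample"
show_line "$sample" 2   # cat -v renders the 0x01 byte as ^A
```

On the real file the call would be show_line cs.xml 168403.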

There also appear to be unprintable characters in the file. A post from the Xalan mailing list provides a solution:

$ java XMLFix cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml

cs.xml:418791: parser error : attributes construct error
<oai_citeseer:author name="Nitin "nick Sawhney">

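A recurring problem concerns authors who put part of their name in quotation marks, e.g. Nitin "Nick" Sawhney: the inner quotes are not escaped, so they break the name attribute of the oai_citeseer:author tag. The first command below strips a single stray quote from the attribute, the second a pair of them: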

$ sed -e 's/\(name="[^"]*\)"\([^"]*">\)/\1\2/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ sed -e 's/\(name="[^"]*\)"\([^"]*\)"\([^"]*">\)/\1\2\3/g' cs.xml > csX.xml; mv csX.xml cs.xml
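
Before running substitutions like these over a 1.9GB file, it is worth trying them on a few sample lines. A small sketch that chains both substitutions in one sed call; fix_name_quotes is a hypothetical helper name and the sample lines are made up to mirror the errors above:

```shell
# fix_name_quotes: read XML lines on stdin and drop one or two stray double
# quotes from inside a name="..." attribute. Hypothetical helper name.
# Inside a single-quoted sed script, ", = and > need no backslash.
fix_name_quotes() {
    sed -e 's/\(name="[^"]*\)"\([^"]*">\)/\1\2/g' \
        -e 's/\(name="[^"]*\)"\([^"]*\)"\([^"]*">\)/\1\2\3/g'
}

# Made-up samples: one stray quote, a quoted nickname, and a clean name.
printf '%s\n' \
    '<oai_citeseer:author name="Nitin "nick Sawhney">' \
    '<oai_citeseer:author name="Nitin "Nick" Sawhney">' \
    '<oai_citeseer:author name="L. J. Svensson">' | fix_name_quotes
```

The first two lines come out with a clean name attribute and the third is left untouched. A name with three or more stray quotes matches neither pattern and is also left unchanged. With both substitutions applied to cs.xml, the next xmllint run gets considerably further: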
$ xmllint --stream cs.xml

cs.xml:25857443: parser error : attributes construct error
<oai_citeseer:author name="Kai Voy"""zy Massachusettsiassachu">
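
Rather than finding such lines one xmllint run at a time, grep can list every author tag that still has a quote embedded in its name. A minimal sketch, assuming each author tag sits on one line and that name is its only attribute, as in the error lines above; list_bad_names is a hypothetical helper name and the sample file is made up:

```shell
# list_bad_names FILE: print, with line numbers, every author tag whose
# name attribute contains a double quote not directly followed by ">".
# Hypothetical helper; assumes name is the tag's only attribute.
list_bad_names() {
    grep -n '<oai_citeseer:author name="[^"]*"[^>]' "$1"
}

# Demo on a made-up sample: one clean name, one with stray quotes.
sample=$(mktemp)
printf '%s\n' \
    '<oai_citeseer:author name="L. J. Svensson">' \
    '<oai_citeseer:author name="Kai Voy"""zy Massachusettsiassachu">' > "$sample"
list_bad_names "$sample"   # reports only line 2
```

If this prints only a handful of lines for cs.xml, fixing them by hand is reasonable.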

I edited line 25857443 by hand in vim, and with that I was done: cs.xml is now a well-formed XML file.
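
One caveat for anyone repeating these steps: the XMLFix class is not reproduced in this post. If the unprintable characters are control bytes that XML 1.0 forbids (everything below 0x20 except tab, line feed and carriage return), a rough stand-in is to delete them with tr. This is a sketch under that assumption, not a description of what XMLFix actually does, and strip_xml_ctrl is a hypothetical helper name:

```shell
# strip_xml_ctrl: copy stdin to stdout, deleting the control bytes that
# XML 1.0 does not allow: 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F.
# Tab (0x09), LF (0x0A) and CR (0x0D) are kept. Hypothetical helper name.
strip_xml_ctrl() {
    # LC_ALL=C makes tr work on raw bytes regardless of the locale.
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Demo on a made-up line containing a stray 0x01 byte.
printf '<dc:title>bad\001title</dc:title>\n' | strip_xml_ctrl
```

On the real file this would be run as strip_xml_ctrl < cs.xml > csX.xml; mv csX.xml cs.xml. It only removes single-byte control characters; other encoding problems would need a different fix.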