Thursday, July 31, 2008

CiteSeer's Dataset

I am exploring the citation and co-authorship graphs of the documents (and contexts) indexed by CiteSeer. However, parsing their index has proved tricky. The good news is that CiteSeer provides an OAI-PMH compliant dump of their index. I downloaded and unzipped the index as follows:

$ wget http://cs1.ist.psu.edu/public/oai/oai_citeseer.tar.gz
$ tar -zxf oai_citeseer.tar.gz

The dump is based on the Dublin Core standard with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses. The index is split into many 'dump' files, none of which has a root XML tag, so I concatenated them inside a single root element:

$ (echo "<records>"; cat oai_citeseer/*; echo "</records>") > cs.xml
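
The same wrapping step can be sketched in Python. This is a minimal sketch, not the method used in this post: the function name wrap_fragments and its root parameter are made up for illustration, and it assumes that none of the dump files starts with its own XML declaration.

```python
import shutil

def wrap_fragments(paths, out_path, root="records"):
    """Concatenate XML fragment files under a single root element."""
    with open(out_path, "wb") as out:
        out.write(f"<{root}>\n".encode("ascii"))
        for path in paths:
            with open(path, "rb") as src:
                # copyfileobj copies in fixed-size chunks, so a multi-gigabyte
                # dump never has to fit in memory and line breaks are kept.
                shutil.copyfileobj(src, out)
        out.write(f"</{root}>\n".encode("ascii"))
```

For the dump above, the call would be wrap_fragments(sorted(glob.glob("oai_citeseer/*")), "cs.xml"), after importing glob.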

The file is quite big: approximately 1.9GB with over 36 million lines. The bad news is that:

$ xmllint --stream cs.xml

cs.xml:92025: parser error : attributes construct error
<oai_citeseer:author name="L. "j. Svensson">
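
A similar check can be sketched with Python's built-in expat bindings. The helper name first_xml_error is hypothetical. Like xmllint, expat stops at the first fatal error, so each repair has to be followed by another run.

```python
import xml.parsers.expat

def first_xml_error(stream):
    """Return (line, column, message) for the first well-formedness error, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # ParseFile reads the binary stream in chunks, so large files are fine.
        parser.ParseFile(stream)
    except xml.parsers.expat.ExpatError as err:
        return err.lineno, err.offset, xml.parsers.expat.ErrorString(err.code)
    return None

# Usage (file name as in this post):
# with open("cs.xml", "rb") as f:
#     print(first_xml_error(f))
```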

The XML is not well-formed. I tried some quick repairs with sed:

$ sed -e 's/L\. "j\. Svensson/L. J. Svensson/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml

cs.xml:168403: parser error : internal error
<dc:title>Imagining CLP(^,= alpheta )</dc:title>

There also appear to be unprintable characters in the file. A post from the Xalan mailing list provides a solution:

$ java XMLFix cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml

cs.xml:418791: parser error : attributes construct error
<oai_citeseer:author name="Nitin "nick Sawhney">
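
The source of XMLFix is not shown here, so the following is only a guess at what such a filter does, not a reimplementation. It is a minimal sketch that deletes the ASCII control bytes XML 1.0 forbids (everything below 0x20 except tab, line feed and carriage return). The names FORBIDDEN, strip_control_bytes and clean_file are made up for illustration. It works on raw bytes, which is safe for ASCII, Latin-1 and UTF-8 input, but it does not repair invalid multi-byte sequences.

```python
# C0 control bytes that XML 1.0 does not allow (all except tab, LF, CR).
FORBIDDEN = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))

def strip_control_bytes(data: bytes) -> bytes:
    """Delete forbidden control bytes and leave everything else untouched."""
    return data.translate(None, FORBIDDEN)

def clean_file(src_path, dst_path, chunk_size=1 << 20):
    """Filter a large file chunk by chunk instead of loading it whole."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            dst.write(strip_control_bytes(chunk))
```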

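A recurring problem concerns authors who put part of their name in quotation marks, e.g. Nitin "Nick" Sawhney. The unescaped quotes end the attribute value early. To fix these errors in the name attribute of the oai_citeseer:author tag, the first command below removes a single stray quote and the second removes a pair: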

$ sed -e 's/\(name="[^"]*\)"\([^"]*">\)/\1\2/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ sed -e 's/\(name="[^"]*\)"\([^"]*\)"\([^"]*">\)/\1\2\3/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml

cs.xml:25857443: parser error : attributes construct error
<oai_citeseer:author name="Kai Voy"""zy Massachusettsiassachu">

I manually edited this line with vim, and I was done! I had a well-formed XML file.
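
The quote-stripping sed commands above handle one or two stray quotes. A minimal Python sketch that handles any number in one pass might look like this. The names AUTHOR_NAME and fix_author_names are made up for illustration. The sketch assumes, based only on the error messages shown in this post, that name is the sole attribute of the oai_citeseer:author tag and that the tag closes with "> on the same line. If the tag carried further attributes, this pattern would damage them.

```python
import re

# Assumption: the tag looks exactly like <oai_citeseer:author name="...">,
# with no other attributes, as in the parser errors quoted in this post.
AUTHOR_NAME = re.compile(r'(<oai_citeseer:author name=")(.*?)(">)')

def fix_author_names(line: str) -> str:
    """Drop every double quote found inside an author name attribute value."""
    return AUTHOR_NAME.sub(
        lambda m: m.group(1) + m.group(2).replace('"', "") + m.group(3),
        line,
    )
```

Like the sed commands, this deletes the quotes rather than escaping them as &quot;, so a nickname loses its quotation marks but the attribute becomes well-formed.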

6 comments:

Jacky said...

Could you give me the CiteSeer dataset that you parsed?

Martin Harrigan said...

Hi Jacky,

The dataset I have is quite old now (July '08). Did you look at
http://citeseerx.ist.psu.edu/about/metadata? You can use OAI-PMH (see
http://www.oaforum.org/tutorial) to retrieve the metadata. There is
some sample code on the CiteSeerX website. The problems above are now fixed.

Martin.

Jacky said...

Dear Martin:
Thanks for your reply. I have got the latest CiteSeerX metadata. It seems that the current metadata does not include paper category information, which I need. Do you know how to get paper category information?

Martin Harrigan said...

Hi Jacky.

Sorry, I'm not sure where that metadata is located.

Martin.

SD said...

Hi, I have access to the CiteSeerX dataset and I wanted to know whether the "citations" table in the database has attributes of the cited paper or the citing paper.

Martin Harrigan said...

Hi SD, I'm not too sure. I made the post above about six years ago and it was about the original "CiteSeer" database (no X).