$ wget http://cs1.ist.psu.edu/public/oai/oai_citeseer.tar.gz
$ tar -zxf oai_citeseer.tar.gz
The file is based on the Dublin Core standard with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses. The index is split into many 'dump' files with no root XML tag, so I wrap them all in a single root element:
$ (echo "<records>"; cat oai_citeseer/*; echo "</records>") > cs.xml
The file is quite big: approximately 1.9GB with over 36 million lines. The bad news is that:
$ xmllint --stream cs.xml
cs.xml:92025: parser error : attributes construct error
<oai_citeseer:author name="L. "j. Svensson">
The XML is not well-formed. I tried some quick repairs with sed:
$ sed -e 's/L\.\ \"j\.\ Svensson/L\.\ J\.\ Svensson/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml
cs.xml:168403: parser error : internal error
<dc:title>Imagining CLP(^,= alpheta )</dc:title>
There also appear to be unprintable characters in the file. A post from the Xalan mailing list provides a solution:
$ java XMLFix cs.xml > csX.xml; mv csX.xml cs.xml
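The source of XMLFix is not shown here, so the following is only a rough stand-in and not that program: a minimal sketch that assumes the offending bytes are ASCII control characters, which XML 1.0 forbids apart from tab, line feed, and carriage return. The function name `strip_xml_control_chars` is hypothetical.

```shell
# Hypothetical helper, not the XMLFix program from the mailing list.
# Deletes the control characters that XML 1.0 does not allow:
# 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F (tab, LF and CR are kept).
strip_xml_control_chars() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Usage (same temp-file pattern as the other steps):
#   strip_xml_control_chars < cs.xml > csX.xml; mv csX.xml cs.xml
```

This works byte by byte, so it will not repair invalid multi-byte sequences; those would still need a separate pass.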
$ xmllint --stream cs.xml
cs.xml:418791: parser error : attributes construct error
<oai_citeseer:author name="Nitin "nick Sawhney">
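A recurring problem concerns authors who put part of their name in double quotes, e.g. Nitin "Nick" Sawhney. The unescaped quotes end up inside the name attribute of the oai_citeseer:author tag and break it. The first command below removes one stray quote per name attribute, the second removes a pair: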
$ sed -e 's/\(name="[^"]*\)"\([^"]*">\)/\1\2/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ sed -e 's/\(name="[^"]*\)"\([^"]*\)"\([^"]*">\)/\1\2\3/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml
cs.xml:25857443: parser error : attributes construct error
<oai_citeseer:author name="Kai Voy"""zy Massachusettsiassachu">
I fix this last line by hand in vim, and I'm done! I have a well-formed XML file.
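The quoting fixes above each handle a fixed number of stray quotes. As a sketch only, a sed loop can strip any number of them in one pass. It assumes that name is the only attribute of oai_citeseer:author and that the tag ends with `">`, as in the examples shown; the function name `strip_inner_quotes` is hypothetical.

```shell
# Hypothetical helper: remove every double quote inside the name attribute
# of an oai_citeseer:author tag except the closing one.
# Assumption: the tag has no other attributes and ends with `">`, so any
# quote that is not immediately followed by `>` is a stray quote.
strip_inner_quotes() {
    sed -e ':a' \
        -e 's/\(<oai_citeseer:author name="[^"]*\)"\([^>]\)/\1\2/' \
        -e 'ta'
}

# Usage:
#   strip_inner_quotes < cs.xml > csX.xml; mv csX.xml cs.xml
```

Each iteration removes the first stray quote on the line and branches back to the label until no substitution is made, so the line with three consecutive quotes is handled as well. A tag with further attributes would be damaged by this pattern, so it is worth checking a sample of the changed lines before replacing cs.xml.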