$ wget http://cs1.ist.psu.edu/public/oai/oai_citeseer.tar.gz
$ tar -zxf oai_citeseer.tar.gz
The file is based on the Dublin Core standard with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses. The index is split into many 'dump' files with no root XML tag. So:
$ { echo "<records>"; cat oai_citeseer/*; echo "</records>"; } > cs.xml
The file is quite big: approximately 1.9GB with over 36 million lines. The bad news is that:
$ xmllint --stream cs.xml
cs.xml:92025: parser error : attributes construct error
<oai_citeseer:author name="L. "j. Svensson">
The XML is not well-formed. I tried some quick repairs with sed:
$ sed -e 's/L\.\ \"j\.\ Svensson/L\.\ J\.\ Svensson/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml
cs.xml:168403: parser error : internal error
<dc:title>Imagining CLP(^,= alpheta )</dc:title>
There also appear to be unprintable characters in the file. A post from the Xalan mailing list provides a solution:
$ java XMLFix cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml
cs.xml:418791: parser error : attributes construct error
<oai_citeseer:author name="Nitin "nick Sawhney">
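The XMLFix program itself is not reproduced in this post. As a rough stand-in, and assuming the offending bytes are plain ASCII control characters, a filter like the one below deletes the ones that XML 1.0 forbids (everything below 0x20 except tab, line feed, and carriage return). The function name strip_xml_control_chars is made up for this sketch, and it does not repair invalid UTF-8 sequences, which XMLFix may or may not have handled.

```shell
# Hypothetical helper, not the XMLFix tool from the mailing list.
# Deletes ASCII control characters that XML 1.0 does not allow:
# 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F.
# Tab (0x09), LF (0x0A) and CR (0x0D) are kept.
strip_xml_control_chars() {
  LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Example usage, with the same temp-file pattern as the rest of the post:
#   strip_xml_control_chars < cs.xml > csX.xml && mv csX.xml cs.xml
```

Because tr works on a stream, this never needs to hold the 1.9GB file in memory.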
A recurring problem concerns authors whose names include a quoted nickname, e.g. Nitin "Nick" Sawhney: the unescaped inner double quote ends the attribute value early. To fix these errors in the name attribute of the oai_citeseer:author tag:
$ sed -e 's/\(name="[^"]*\)"\([^"]*">\)/\1\2/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ sed -e 's/\(name="[^"]*\)"\([^"]*\)"\([^"]*">\)/\1\2\3/g' cs.xml > csX.xml; mv csX.xml cs.xml
$ xmllint --stream cs.xml
cs.xml:25857443: parser error : attributes construct error
<oai_citeseer:author name="Kai Voy"""zy Massachusettsiassachu">
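The two sed commands above each handle a fixed number of stray quotes, which is why a name with three of them slips through. A looped substitution can strip any number of them, one per pass. This is only a sketch: it assumes name is the only attribute on each oai_citeseer:author tag and that names never contain '>', and the function name fix_name_quotes is made up here.

```shell
# Hypothetical helper: remove every stray double quote inside the name
# attribute of oai_citeseer:author tags.
# Assumes name is the tag's only attribute and its value contains no '>'.
# Each pass deletes the last stray quote before the closing "> and the
# 't' command loops back until no substitution is made.
fix_name_quotes() {
  sed -e ':again' \
      -e 's/\(<oai_citeseer:author name="[^>]*\)"\([^">]*">\)/\1\2/' \
      -e 't again'
}

# Example usage:
#   fix_name_quotes < cs.xml > csX.xml && mv csX.xml cs.xml
```

Tags that are already well-formed pass through unchanged, since the pattern only matches when an extra quote sits between the opening quote and the closing one.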
I manually edited this line with vim, and I was done! I now have a well-formed XML file.
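The check, read the error, fix, recheck cycle used throughout this post can be made a little less manual. The sketch below pulls the line number out of xmllint's first error message and prints just that line, so the offending tag can be inspected without opening a 1.9GB file in an editor. Both function names are made up for illustration, and the parsing assumes the "file:line: parser error" message format shown in the output above.

```shell
# Hypothetical helper: read xmllint output on stdin and print the line
# number of the first parser error (nothing is printed if there is none).
first_error_line() {
  sed -n 's/^[^:]*:\([0-9][0-9]*\): parser error.*$/\1/p' | head -n 1
}

# Hypothetical helper: print line $2 of file $1, then stop reading.
show_line() {
  sed -n "${2}{p;q;}" "$1"
}

# Example usage (xmllint reports errors on stderr, hence 2>&1):
#   n=$(xmllint --stream cs.xml 2>&1 | first_error_line)
#   [ -n "$n" ] && show_line cs.xml "$n"
```

Quitting right after the target line matters on a file this size, since sed would otherwise keep scanning the remaining millions of lines.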
6 comments:
Could you give me the CiteSeer dataset that you parsed?
Hi Jacky,
The dataset I have is quite old now (July '08). Did you look at
http://citeseerx.ist.psu.edu/about/metadata? You can use OAI-PMH (see
http://www.oaforum.org/tutorial) to retrieve the metadata. There is
some sample code on the CiteSeerX website. The problems above are now fixed.
Martin.
Dear Martin:
Thanks for your reply. I have got the latest CiteSeerX metadata. It seems that the current metadata does not include paper category information, which I need. Do you know how to get it?
Hi Jacky.
Sorry, I'm not sure where that metadata is located.
Martin.
Hi, I have access to the CiteSeerX dataset and I wanted to know whether the "citations" table in the database has attributes of the cited paper or the citing paper?
Hi SD, I'm not too sure. I made the post above about six years ago and it was about the original "CiteSeer" database (no X).