About GROTOAP2
==============

GROTOAP2 (GROund Truth for Open Access Publications) is a dataset for training and evaluating document content analysis tools on tasks such as document zone classification. GROTOAP2 was built automatically from the PubMed Central Open Access Subset. It contains 13,210 ground truth files, which store the geometrical and logical structure of the articles' content. The corresponding PDF files can be downloaded from http://europepmc.org/.

The GROTOAP2 dataset is available under the CC-BY license.

The GROTOAP2 dataset can be downloaded from: http://cermine.ceon.pl/grotoap2/.

How to cite
===========

Please cite the following paper:

D. Tkaczyk, P. Szostek and L. Bolikowski, "GROTOAP2 - The methodology of creating a large ground truth dataset of scientific articles," in Proceedings of the 3rd International Workshop on Mining Scientific Publications, D-Lib Magazine, 2014.

Authors
=======

Dominika Tkaczyk
Pawel Szostek

The content of GROTOAP2
=======================

GROTOAP2 consists of:

* 13,210 ground truth files in TrueViz XML format (files GROTOAP2-[1-4].zip)
* a sample of 132 ground truth files (GROTOAP2-sample.zip)
* the pdflinks.txt file, which contains the URLs of the corresponding PDF files
* the bash script download_pdfs.sh for downloading the corresponding PDF files

To download the PDF files, type:

    unzip 'GROTOAP2-*.zip'
    cd grotoap2
    ./download_pdfs.sh

The quotes around the pattern are needed: they let unzip expand the wildcard itself, so that every matching archive is extracted.

Warsaw, 11 July 2014
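
After unpacking the archives, it can be useful to check that the expected number of ground truth files is present before starting the PDF download. The following is a minimal sketch, not part of the GROTOAP2 distribution: the function names `count_ground_truth` and `check_ground_truth` are hypothetical, and the file extension of the ground truth files is an assumption that should be adjusted to whatever the archives actually contain.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical and not shipped with GROTOAP2.
# Assumption: all ground truth files share one file extension, passed in
# as an argument because this README does not state it.

# Print the number of regular files under directory $1 with extension $2.
count_ground_truth() {
    find "$1" -type f -name "*.$2" | wc -l | tr -d '[:space:]'
}

# Compare the number of files found with the expected total $3.
# Returns 0 when the counts match and 1 otherwise.
check_ground_truth() {
    local dir="$1" ext="$2" expected="$3"
    local found
    found="$(count_ground_truth "$dir" "$ext")"
    if [ "$found" -eq "$expected" ]; then
        echo "OK: found $found .$ext files in $dir"
    else
        echo "MISMATCH: found $found .$ext files in $dir, expected $expected" >&2
        return 1
    fi
}

# Example call (the extension "xml" is an assumption):
#   check_ground_truth grotoap2 xml 13210
```

The expected total of 13,210 comes from the dataset description above. If the sample archive has been extracted into the same directory, the count may differ from that total.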