Published November 7, 2018 | Version 1.0.0
Dataset Open

IUST-PDFCorpus

  • 1. Ph.D. Student, Iran University of Science and Technology (IUST)

Contributors

Supervisor:

  • 1. Associate professor at Iran University of Science and Technology

Description

About

IUST-PDFCorpus is a large set of PDF files intended for building and manipulating new PDF files in order to test, debug, and improve the quality of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus also includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in a textual format while streams are in a binary format, and together they make up PDF files. In addition, we attached the code coverage of each PDF file when it is used as test data for MuPDF. The coverage information is available in both binary and XML formats.

PDF data objects are organized into three categories. The first category contains all objects in the corpus. Each file in this category holds all PDF objects extracted from one PDF file without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing. This dataset is split into train, test, and validation sets, which makes it suitable for machine learning tasks. The third category is the same as the second but smaller, for use during the development stage of different algorithms.

IUST-PDFCorpus is collected from various sources, including the Mozilla PDF.js open test corpus, some PDFs that are used in AFL as initial seeds, and PDFs gathered from existing e-books, software documents, and the public web in different languages.

We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. We are currently gathering other file formats to automate the testing of related applications.
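The description mentions that the merged object dataset is split into train, test, and validation sets, but does not state the ratios or the method. The following is only a minimal sketch of one common way to make such a split; the function name `split_objects`, the 70/15/15 fractions, and the fixed seed are assumptions for illustration and are not taken from the corpus or the paper.

```python
import random


def split_objects(objects, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a sequence of PDF data objects and split it three ways.

    The fractions are illustrative assumptions; the corpus description
    does not say which ratios were actually used.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")

    shuffled = list(objects)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test
```

For example, `train, validation, test = split_objects(list_of_object_strings)` returns three disjoint lists that together contain every input object exactly once.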

Citing IUST-PDFCorpus

If IUST-PDFCorpus is used in your work in any form, please cite the relevant paper: https://arxiv.org/abs/1812.09961v2

Files

iust_pdf_data_objects_507299_objs_6141_files.zip

Files (1.1 GB)

MD5 checksum | Size
md5:ba2ae37b3e12844262db97f4f8944bae | 13.5 MB
md5:5085ce71cf67e7e7d6a02f28a84052a7 | 28.6 MB
md5:00316994dcb65f4f75748435b3d938b1 | 5.8 MB
md5:66b4eeaa1febde23c9da6bcdb4bc3829 | 477.5 MB
md5:3d2f1d45e377219af16e6781a068c375 | 497.8 MB
md5:7bb6b256689a921634ff542bcff4b45b | 31.1 MB
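Each file in the listing comes with an MD5 checksum, which can be used to confirm that a download is complete and uncorrupted. Below is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and which checksum belongs to which archive has to be taken from the record page, since file names are not shown in the listing here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a listed value such as 'md5:ba2ae37b...'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listed_md5("downloaded_archive.zip", "md5:ba2ae37b3e12844262db97f4f8944bae")` returns `True` only if the local file's digest equals the listed one; the archive path in this call is a placeholder.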