IUST-PDFCorpus
Creators
- 1. Ph.D. Student, Iran University of Science and Technology (IUST)
Contributors
Supervisor:
- 1. Associate professor at Iran University of Science and Technology
Description
About
IUST-PDFCorpus is a large collection of varied PDF files, intended for building and manipulating new PDF files in order to test, debug, and improve the quality of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus also includes 507,299 PDF data objects and 151,132 PDF streams extracted from these complete files. Data objects are in a textual format while streams are in a binary format; together they make up PDF files. In addition, we attach the code coverage obtained for each PDF file when it is used as test data for MuPDF. The coverage information is available in both binary and XML formats.

PDF data objects are organized into three categories. The first category contains all objects in the corpus. Each file in this category holds all PDF objects extracted from one PDF file, without any preprocessing. The second category is a dataset made by merging all files in the first category, with some preprocessing. This dataset is split into training, test, and validation sets, which makes it suitable for machine learning tasks. The third category is the same as the second but smaller, for use while algorithms are still under development.

IUST-PDFCorpus was collected from various sources, including the Mozilla PDF.js open test corpus, PDFs used as initial seeds in AFL, and PDFs gathered from existing e-books, software documentation, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. We are currently gathering files in other formats to automate the testing of related applications.
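The distinction drawn above between textual data objects and binary streams can be illustrated with a short sketch. This is not the extraction tool used to build the corpus. It is a minimal regular-expression approach that assumes classic `N G obj ... endobj` syntax and ignores compressed object streams and any `endobj` or `endstream` bytes occurring inside stream data. The function name `split_objects` and the sample bytes are hypothetical.

```python
import re

# Assumption: objects follow the classic "<num> <gen> obj ... endobj" layout.
# This simplification does not handle object streams or binary data that
# happens to contain the keywords themselves.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def split_objects(pdf_bytes):
    """Return a list of (object_number, textual_part, stream_bytes_or_None)."""
    results = []
    for match in OBJ_RE.finditer(pdf_bytes):
        number = int(match.group(1))
        body = match.group(3)
        stream = STREAM_RE.search(body)
        if stream:
            # Everything before the "stream" keyword is the textual dictionary.
            text = body[:stream.start()].strip()
            results.append((number, text, stream.group(1)))
        else:
            results.append((number, body.strip(), None))
    return results


# Hypothetical two-object sample: one plain data object, one with a stream.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
)
objects = split_objects(sample)
```

For real files, a dedicated PDF parser is likely to be more robust than this pattern matching, since real streams are usually compressed and may contain arbitrary bytes.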
Citing IUST-PDFCorpus
If you use IUST-PDFCorpus in your work in any form, please cite the relevant paper: https://arxiv.org/abs/1812.09961v2
Files
iust_pdf_data_objects_507299_objs_6141_files.zip
Files (1.1 GB)
Checksum | Size |
---|---|
md5:ba2ae37b3e12844262db97f4f8944bae | 13.5 MB |
md5:5085ce71cf67e7e7d6a02f28a84052a7 | 28.6 MB |
md5:00316994dcb65f4f75748435b3d938b1 | 5.8 MB |
md5:66b4eeaa1febde23c9da6bcdb4bc3829 | 477.5 MB |
md5:3d2f1d45e377219af16e6781a068c375 | 497.8 MB |
md5:7bb6b256689a921634ff542bcff4b45b | 31.1 MB |
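The MD5 checksums listed above can be used to confirm that a downloaded archive is intact. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listing` are hypothetical, and the listing does not say which checksum belongs to which file name, so that pairing has to be taken from the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:ba2a...'."""
    # Drop an optional "md5:" prefix and normalise case before comparing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing("downloaded_archive.zip", "md5:ba2ae37b3e12844262db97f4f8944bae")` returns `True` only when the file on disk hashes to the listed value; the path here is a placeholder.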