NASA collates the world’s largest archive of PDFs for malware research

16 June 2023
  • NASA’s Jet Propulsion Laboratory creates the world’s largest PDF archive
  • It contains over eight million files downloaded from the web
  • Researchers can use them to improve PDF security

NASA’s Jet Propulsion Laboratory (JPL) has created the world’s largest open-source archive of PDFs, containing eight million individual files. These will be used by research groups developing tools to detect malware embedded in the file type’s code.

The corpus is one of several tools JPL has created with the PDF Association, a nonprofit seeking to establish open standards for PDF technology, to help improve online security. Scientists across the world can use the resource to identify privacy vulnerabilities in software or anticipate future cyber threats.

Tim Allison, a data scientist at JPL, said: “PDFs are used everywhere and are important for contracts, legal documents, 3D engineering designs, and many other purposes. Unfortunately, they are complex and can be compromised to hide malicious code or render different information for different users in a malicious way.

“To confront these and other challenges from PDFs, a large sample of real-world PDFs needs to be collected from the internet to create a shared, freely available resource for software experts.”

Screenshot of the Digital Corpora website where the PDFs are stored.

PDF – which stands for portable document format – is a commonly used file type for sharing electronic documents. PDFs can contain text, images, videos, 3D models, and more, so they are used for various purposes, such as legal contracts and engineering documents.

They were first created by Adobe Systems in 1993, with the aim of preserving the formatting and integrity of documents across different operating systems and software. Their compact file size also makes them easy to store, share, and download. The popularity of PDFs increased quickly as a result, and, as of 2020, over 2.5 trillion PDFs were estimated to exist worldwide.

But while Adobe has built various security capabilities into PDFs, like password protection and encryption, the surprisingly complex format is far from bulletproof. Bad actors can still compromise the files, embed malicious code, and exploit vulnerabilities in PDF viewers to gain control or access confidential data.

This reality has inspired the Safe Documents – or SafeDocs – program from the Defense Advanced Research Projects Agency (DARPA). It aims to develop secure and reliable software tools for handling electronic data formats, like PDFs, to achieve better data protection. As part of this, JPL – which is normally busy developing satellites, probes, and Mars rovers – was tasked with creating a vast archive of PDFs. These could then be studied during the development of such tools.

Stock image showing Adobe Reader.

The work began on July 23, 2021, with a two-week search for files that could become part of this corpus. Data scientists scraped an open-source repository of web data from around the world called ‘Common Crawl’ to identify and download the PDFs they wanted.

During this initial search, the main criterion was that the files had to be freely available, i.e., not behind a firewall or in a private network. It yielded over eight million PDFs, but two million of these were truncated, as Common Crawl truncates any file over 1 megabyte in size. The JPL team therefore had to go back and acquire the full versions of those files so that the archive was complete and could be used to conduct meaningful research.
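As a rough illustration of how truncated files like these might be spotted, the Python sketch below flags a PDF as probably cut off if it is at or above the size cap, or if it lacks the `%%EOF` marker that normally ends a complete PDF. The function name, the exact byte limit (taken here as 1 MiB), and the 1,024-byte tail window are assumptions made for this example, not details of the JPL team's process.

```python
# Assumed cutoff of 1 MiB; the article only says "1 megabyte".
ASSUMED_TRUNCATION_LIMIT = 1_048_576


def looks_truncated(data: bytes, limit: int = ASSUMED_TRUNCATION_LIMIT) -> bool:
    """Heuristic check: return True if the bytes look like a cut-off PDF."""
    # A file that reaches the cap was very likely clipped there.
    if len(data) >= limit:
        return True
    # A complete PDF normally carries an %%EOF marker close to its end.
    return b"%%EOF" not in data[-1024:]


complete_pdf = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
clipped_pdf = b"%PDF-1.4\n1 0 obj\n<< /Length 5000 >>\nstream\n"

print(looks_truncated(complete_pdf))
print(looks_truncated(clipped_pdf))
```

A check like this is only a first pass: a file can contain `%%EOF` and still be damaged, so anything it flags would still need to be re-fetched and verified.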

As well as the PDFs themselves, which contained a wide variety of subject matter, the JPL team extracted and stored parts of their metadata. This included the software that was used to create them, as well as the server location of the source website. Altogether, the archive totals about eight terabytes, making it the world’s most extensive publicly available collection of PDFs.

The PDFs are available to be downloaded as zip files that can be used in the ongoing war against cyber threats. They are being hosted on ‘Digital Corpora’, a website home to other vast online archives, including disk images that can be used to test computer forensic tools and photos of cell phones.
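For readers who want to work with the zip files, a minimal Python sketch of a first step might look like the following: open an archive and list the members that actually begin with the PDF header bytes `%PDF-`. The helper name and the file names in the tiny in-memory archive are made up for the example; the real archives' layout is not described here.

```python
import io
import zipfile


def list_pdf_members(zip_source):
    """Return names of zip members whose first bytes are the PDF header."""
    names = []
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Read only the first five bytes instead of the whole member.
            with archive.open(info) as member:
                if member.read(5) == b"%PDF-":
                    names.append(info.filename)
    return names


# Small in-memory archive standing in for a downloaded zip (hypothetical names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("0000001.pdf", b"%PDF-1.7\n%%EOF\n")
    archive.writestr("notes.txt", b"not a pdf")
buffer.seek(0)

print(list_pdf_members(buffer))
```

Checking the header rather than trusting the file extension is a deliberate choice, since files gathered from the open web are not always what their names suggest.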

NASA's Jet Propulsion Laboratory mission control room.

Experts can search them for malware and use what they find to improve the security of PDF technology and anticipate future attacks. Developers could also use them to identify bugs in their code or to check if their software is compatible with different versions of PDFs.
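One simple form such a search could take is a keyword triage: counting PDF name tokens that are often associated with active or embedded content. The Python sketch below is an assumption-laden illustration, not a description of any tool built on the archive. The token list is a small hand-picked sample, and a raw byte search will miss tokens hidden inside compressed streams or written with escaped characters.

```python
# Hand-picked PDF name tokens often linked to active or embedded content.
# The list is illustrative, not exhaustive.
SUSPICIOUS_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/Launch",
    b"/EmbeddedFile",
)


def count_suspicious_names(data: bytes) -> dict:
    """Count raw occurrences of each token in the file's bytes."""
    return {name.decode("ascii"): data.count(name) for name in SUSPICIOUS_NAMES}


sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"%%EOF\n"
)

counts = count_suspicious_names(sample)
flagged = [name for name, hits in counts.items() if hits > 0]
print(flagged)
```

A non-zero count does not mean a file is malicious, since many legitimate PDFs use these features. It only marks the file as worth a closer look with a proper parser.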

Dr Simson Garfinkel, a computer scientist who created a similar large corpus of a million documents in 2008, said: “This is open and repeatable science. Researchers need to have a common data set to work with so that they can compare results of different analysis techniques and experiments.

“PDF is one of the most important file types on the internet today, and this contribution of roughly eight terabytes of data provides faculty, students, and corporations with up-to-date reference data that will power research for years to come.”