Tool Release: Detecting Malicious Documents with Mass File Inspector

July 15, 2020


This blog post gives a quick overview of our newly released tool, Mass File Inspector (mf_inspector), an open-source static analysis tool for documents, written in Python. It was created to aid the mass inspection and identification of malicious files in document stores. The need arose directly from an incident response exercise in which a customer suspected there might be malicious documents in a Microsoft SharePoint system, so we decided to write a tool to help us find them.

We’ll cover how the tool works later. First, let’s go through what can make a document malicious.

Malicious Documents

With email being the number one delivery method for malware, it’s no surprise that malware embedded in document attachments is highly pervasive. The sections below give a brief overview of document types, how attackers can embed malicious payloads, and some advice on preventative measures.

Office Document Types

Office/Word documents can be largely categorised into three formats:

  • OLE – A proprietary Microsoft file format first released in 1990
  • OpenXML – A Zip compressed, XML file format developed by Microsoft
  • ODF – A Zip compressed, XML file format developed by OASIS
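
As a rough illustration of how these three formats can be told apart, the sketch below checks the leading bytes of a file and, for Zip containers, the names of the archive entries. It uses only the Python standard library and is not taken from mf_inspector; the function name classify_office_format is hypothetical, and the check is a heuristic rather than a full format validation.

```python
import io
import zipfile

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file signature
ZIP_MAGIC = b"PK\x03\x04"                       # Zip local file header signature


def classify_office_format(data: bytes) -> str:
    """Return 'OLE', 'OpenXML', 'ODF' or 'unknown' for raw file bytes."""
    if data.startswith(OLE_MAGIC):
        return "OLE"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return "unknown"
        if "[Content_Types].xml" in names:  # part found in OpenXML packages
            return "OpenXML"
        if "mimetype" in names:             # entry found in ODF packages
            return "ODF"
    return "unknown"
```

A Zip archive that contains neither marker entry is reported as unknown, since plenty of non-document files are Zip containers too.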

Malicious Word Documents


OLE and OpenXML formats use Microsoft’s dreaded VBA (Visual Basic for Applications) as their macro programming language. Since VBA can call the Windows system API, it’s possible to write macros which inject shellcode directly into process memory using the VirtualAlloc, RtlMoveMemory and CreateThread API calls. In fact, it’s possible to use VBA to write complex malware; however, most VBA macros are droppers/downloaders which download and launch additional malware.

Below is a non-exhaustive list of VBA macro capabilities:

  • shellcode injection
  • auto startup
  • launching programs
  • downloading files/programs
  • calling ActiveX objects
  • creating/destroying files

Luckily, macros are disabled by default in the Microsoft Office suite. They can, however, be enabled via the settings menu or by clicking Enable Content when the following popup appears: “SECURITY WARNING Macros have been disabled.”

Prevention measures include making sure macros are disabled and using software which can identify suspicious macros and embedded content. Antivirus may also help prevent malicious activity by identifying malicious payloads on disk or by detecting and stopping malicious payloads executing in memory.

mf_inspector uses oletools to detect and notify the user of suspicious macros and embedded content. mf_inspector can also scan documents with ClamAV anti-virus, provided the service is running.
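
To give a feel for what macro detection involves, here is a minimal standard-library sketch. It is not how oletools or mf_inspector work internally; it only relies on the fact that macro-enabled OpenXML documents store their VBA project in a part named vbaProject.bin (for example word/vbaProject.bin). The function name is hypothetical, and OLE documents are not covered by this check.

```python
import io
import zipfile


def openxml_has_vba_project(data: bytes) -> bool:
    """Return True if an OpenXML package contains a vbaProject.bin part."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return any(
                name.lower().endswith("vbaproject.bin")
                for name in archive.namelist()
            )
    except zipfile.BadZipFile:
        # Not a Zip container, so it cannot be an OpenXML package.
        return False
```

Finding the part only shows that macros are present; deciding whether they are suspicious requires parsing the VBA source, which is the job of a dedicated library such as oletools.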

Document Parsing Vulnerabilities

As with any software that parses data, bugs may exist in document editors and viewers that enable the execution of arbitrary code when parsing malformed data. Software that checks the validity of document formats can help prevent this type of exploit, as can keeping the document reading software up to date with the latest security patches. Malicious payloads embedded inside documents can also be detected with antivirus software.

Quick PDF Overview

Before looking into malicious PDF types and their payloads it helps to have a general idea of how PDFs are structured. The structure of a PDF can be broken down into four sections:

  • Header
  • Body
  • XREF Table
  • Trailer

PDF Header

The PDF header usually starts with the ASCII %PDF- followed by the version. For example:

00000000 25 50 44 46 2d 31 2e 37 |%PDF-1.7|

Unlike typical headers, information about how the file is laid out is not contained in this section. Instead it is contained at the end of the PDF document, in the xref table and trailer sections.
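
A header check like the one described above can be sketched in a few lines of Python. The function name pdf_version is hypothetical, and the sketch assumes the header sits at the very start of the data and uses a single-digit major and minor version.

```python
import re

# "%PDF-" followed by a version such as 1.7
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes):
    """Return (major, minor) if data starts with a PDF header, else None."""
    match = PDF_HEADER.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For the eight bytes shown in the hex dump above, this returns (1, 7).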

PDF Body

Following the header is a body which consists of a series of objects, with each object describing a part of the document.

Each object starts with an object index number, followed by a generation number and the obj keyword. The start of an object could for example be 5 0 obj. The end of each object is signified by the endobj keyword. The object data is contained between these keywords. Below is an example object entry:

5 0 obj  % page content
<< /Length 44 >>
stream
BT
70 50 TD
/F1 12 Tf
(Hello, world!) Tj
ET
endstream
endobj
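
The object layout described above (index, generation, obj keyword, data, endobj keyword) can be walked with a simple regular expression. This is a naive sketch with a hypothetical function name: it will be confused by binary stream data that happens to contain the endobj keyword, and it does not look inside compressed object streams.

```python
import re

# <index> <generation> obj ... endobj
OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def iter_objects(data: bytes):
    """Yield (index, generation, body) for each object found in data."""
    for match in OBJECT.finditer(data):
        index = int(match.group(1))
        generation = int(match.group(2))
        yield index, generation, match.group(3).strip()
```

Each yielded body is the raw bytes between the obj and endobj keywords, which is where dictionaries, streams and embedded scripts live.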


PDF XREF Table

The xref table contains a series of entries which allow PDF software to locate every object in the document. This is achieved by storing the start offset of each object in the first ten bytes of its entry.

0 6
0000000000 65535 f 
0000000010 00000 n 
0000000079 00000 n 
0000000173 00000 n 
0000000301 00000 n 
0000000380 00000 n

Entry structure:

  • The first ten digits in each entry indicate the start offset of the object, e.g. 0000000380
  • The next five digits represent the generation number, e.g. 65535
  • Finally, the letter f indicates the object is free (it’s been removed) and the letter n indicates an object in use.
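
The entry structure above maps directly onto a small parser. The names XrefEntry and parse_xref_entry are hypothetical. One detail worth noting from the PDF specification: for free entries, the first field holds the object number of the next free object rather than a byte offset, so the offset is only meaningful when in_use is true.

```python
import re
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int       # byte offset of the object (for in-use entries)
    generation: int   # generation number
    in_use: bool      # True for 'n', False for 'f'


# ten digits, a space, five digits, a space, then 'f' or 'n'
ENTRY = re.compile(r"^(\d{10}) (\d{5}) ([fn])\s*$")


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse one xref table entry, raising ValueError on malformed input."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError(f"not an xref entry: {line!r}")
    return XrefEntry(int(match.group(1)), int(match.group(2)), match.group(3) == "n")
```

Applied to the last line of the table above, this gives an offset of 380, generation 0, and in_use set to True.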

PDF Trailer

The PDF trailer contains various information about the structure of a PDF, including:

  • Total number of objects, e.g. /Size 6
  • The start offset of the xref table, e.g. startxref 492
  • EOF string, e.g. %%EOF

Below is an example:

  /Size 6
  /Root 1 0 R
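
The trailer fields listed above can be pulled out with a couple of regular expressions. The sketch below is an illustration with hypothetical function names. It takes the last occurrence of each value, on the assumption that a file with several trailers should be read from its final one, and returns None for anything it cannot find.

```python
import re


def _last_int(pattern: bytes, data: bytes):
    """Return the last integer captured by pattern in data, or None."""
    found = re.findall(pattern, data)
    return int(found[-1]) if found else None


def parse_trailer(data: bytes):
    """Return (size, startxref); either value is None when it is missing."""
    size = _last_int(rb"/Size\s+(\d+)", data)
    # startxref must be followed by the offset and then the %%EOF marker
    startxref = _last_int(rb"startxref\s+(\d+)\s+%%EOF", data)
    return size, startxref
```

For a trailer containing /Size 6 followed by startxref 492 and %%EOF, this returns (6, 492). A missing startxref or %%EOF marker is itself a useful signal that a file is truncated or malformed.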

Malicious PDF Types and Prevention

Malicious PDFs can largely be categorised in two ways:

Type 1: Exploitation Of Parsing Engines

Vulnerabilities in PDF reader parsing engines are typically triggered by maliciously crafted documents, whether valid or invalid, and can often be leveraged to execute arbitrary code.

For example, a bug (CVE-2015-7622) in the image parsing engine of Adobe Reader and Adobe Acrobat resulted in memory corruption and arbitrary code execution. This attack is carried out by embedding a specially crafted image into one of the object sections.

This type of attack can be prevented by applying the software patches that fix these vulnerabilities. Software that can scan and detect malformed data contained in PDFs can also help. Additionally, antivirus software may be able to detect the malicious payload on disk or in memory.

Type 2: Using Inbuilt PDF Functions To Run Malicious Code

Inbuilt PDF functions can be used to carry out malicious actions. For example, JavaScript can be embedded in an object and executed on start-up with “/OpenAction”, as in our implementation of CVE-2018-8414 detailed in a previous blog. Shellcode can also be embedded and techniques such as Heap Spraying can be used to gain arbitrary code execution.

From a defensive perspective, the advice is again to use software which can scan for and detect JavaScript, OpenAction and other potentially dangerous functions contained in PDFs.

mf_inspector will notify the user if it finds any of these in a PDF document. Disabling JavaScript in the PDF reader can also help prevent JavaScript from executing.
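
A very simple version of this kind of check is a keyword scan over the raw bytes of the file. The sketch below is not mf_inspector’s implementation; the function name and the exact list of names are assumptions (the /JS key is added here alongside the names mentioned in this post). It will miss names hidden inside compressed streams or written with hex escapes.

```python
import re

# A PDF name of interest, not followed by further name characters
SUSPICIOUS = re.compile(rb"/(JavaScript|JS|OpenAction|Launch)(?![A-Za-z0-9])")


def find_suspicious_names(data: bytes):
    """Return a sorted list of suspicious PDF names present in data."""
    return sorted({m.group(1).decode("ascii") for m in SUSPICIOUS.finditer(data)})
```

An empty list does not prove a file is safe; it only means none of these names appear in plain form.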


Mass File Inspector

Feature Overview

When run with its default configuration, mf_inspector inspects all files contained in a folder and its subfolders. Each file’s type is determined by its header magic, and the type of inspection depends on the detected type. For example, if the file is an Office document, the code will check for macros and extract document metadata. If the file is a PDF, the code will look for potentially dangerous PDF objects such as embedded JavaScript content.
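
The walk-and-identify step can be sketched with the standard library alone. This is an illustration rather than mf_inspector’s code: the function names, the labels and the short list of magic values are all assumptions chosen for the example.

```python
import os

# (magic bytes, label) pairs checked against the start of each file
MAGIC_TYPES = (
    (b"%PDF-", "pdf"),
    (bytes.fromhex("D0CF11E0A1B11AE1"), "ole"),
    (b"PK\x03\x04", "zip"),     # OpenXML and ODF documents are Zip containers
    (b"MZ", "executable"),      # DOS/PE executable
)


def sniff_type(path):
    """Return a label for the file at path based on its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, label in MAGIC_TYPES:
        if head.startswith(magic):
            return label
    return "other"


def walk_and_sniff(root):
    """Map every file under root (including subfolders) to its label."""
    results = {}
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(folder, filename)
            results[path] = sniff_type(path)
    return results
```

Relying on content rather than the file extension means a renamed executable or document is still routed to the right kind of inspection.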

Noteworthy findings are shown to the user through the Python logging module’s WARNING level messages. More detailed results are also output to a CSV formatted file.

A more complete list of features is shown below:

  • Inspection of Office documents; highlighting suspicious content and macros using the oletools module
  • Inspection of PDF documents; highlighting suspicious PDF objects (JavaScript, OpenAction, Launch etc…)
  • Highlights any executable files
  • Optional lookup of file hashes with the Virus Total API
  • Optional scanning of files with ClamAV
  • Optional anonymous mode (prevents disclosure of sensitive information)
  • Basic multiprocessing support allows for speedup on multi-core systems
  • Malicious score system
  • Analysis results saved to CSV formatted file


mf_inspector requires Python 3.6+ to run.

First, clone the repository:

git clone

Change directory to the root of the repository:

cd mf_inspector

OPTIONAL: Create and activate a venv:

virtualenv venv
source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Running the Tool

To run mf_inspector with default settings, run the command:

python -d /home/naka/work/mf_inspector/test_folder/

The -d flag is required to specify the root directory containing files for inspection.

Console/Log Output

mf_inspector uses the Python 3 logging module for printing to stdout and creating log files. Using the logging module allows debug logs from third party dependencies to be viewed by setting the logging level to DEBUG.

The log level can be specified by passing it to the -l flag. Below is a quick description of each log level along with the type of information produced by each level:

  • INFO – Indicates informational messages, e.g. how many files were found in total
  • WARNING – Indicates that mf_inspector has found something interesting, e.g. a Word document contains suspicious macros
  • ERROR – Indicates that an exception or error has occurred, e.g. could not open file for reading
  • DEBUG – Used for debugging purposes
  • DEBUG – Used for debugging purposes
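
For readers less familiar with the logging module, the sketch below shows one way a -l flag could be wired to a log level. It is not mf_inspector’s code: the accepted level names, the default, and the function and logger names are assumptions. The format string is chosen to resemble the console output shown in this post.

```python
import argparse
import logging


def build_parser():
    """Build a parser with the -d and -l flags described in this post."""
    parser = argparse.ArgumentParser(description="logging sketch")
    parser.add_argument("-d", dest="directory", required=True,
                        help="root directory containing files for inspection")
    # Assumption: the level is passed by name and defaults to INFO
    parser.add_argument("-l", dest="log_level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser


def configure_logging(level_name):
    """Configure root logging and return a logger for the sketch."""
    logging.basicConfig(
        level=getattr(logging, level_name),
        format="%(asctime)s %(levelname)s - %(message)s",
    )
    return logging.getLogger("mf_inspector_sketch")
```

Because the level is set on the root logger, DEBUG output from third party modules that use the logging module becomes visible as well.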

Console output:

2020-06-11 09:32:03,705 INFO - ##### LOG START - mf_inspector #####
2020-06-11 09:32:03,705 INFO - Walking filesystem found in "/home/naka/work/mf_inspector/test_folder/"
2020-06-11 09:32:03,705 INFO - Found 18 files
2020-06-11 09:32:03,727 WARNING - VBA macros in "/home/naka/work/mf_inspector/test_folder/maldoc"
2020-06-11 09:32:03,740 INFO - Opening OLE file /home/naka/work/mf_inspector/test_folder/maldoc
2020-06-11 09:32:03,741 INFO - Check whether OLE file is PPT
2020-06-11 09:32:03,745 ERROR - Failed to extract pdf metadata for: /home/naka/work/mf_inspector/test_folder/pdf/bottle.pdf - EOF marker not found
2020-06-11 09:32:03,745 WARNING - Invalid file format for "/home/naka/work/mf_inspector/test_folder/pdf/bottle.pdf" for mime type "application/pdf"
2020-06-11 09:32:03,755 WARNING - Active "JavaScript" content found in PDF "/home/naka/work/mf_inspector/test_folder/pdf/bad.pdf"
2020-06-11 09:32:03,755 WARNING - Active "OpenAction" content found in PDF "/home/naka/work/mf_inspector/test_folder/pdf/bad.pdf"
2020-06-11 09:32:03,809 ERROR - Failed to parse macros for: /home/naka/work/mf_inspector/test_folder/doc/plant.docx - Error -3 while decompressing data: invalid distance too far back
2020-06-11 09:32:03,810 ERROR - Failed to extract document metadata for: /home/naka/work/mf_inspector/test_folder/doc/plant.docx
2020-06-11 09:32:03,810 WARNING - Invalid file format for "/home/naka/work/mf_inspector/test_folder/doc/plant.docx" for mime type "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
2020-06-11 09:32:07,426 ERROR - Failed to extract pdf metadata for: /home/naka/work/mf_inspector/test_folder/pdf/encrypted.pdf - file has not been decrypted
2020-06-11 09:32:07,426 WARNING - Invalid file format for "/home/naka/work/mf_inspector/test_folder/pdf/encrypted.pdf" for mime type "application/pdf"
2020-06-11 09:32:07,965 WARNING - Found 18 suspicious items in "/home/naka/work/mf_inspector/test_folder/maldoc"
2020-06-11 09:32:07,965 WARNING - Found 4 IOCs: ['lewd.exe', 'MSVBVM60.DLL', 'user32.dll', 'VBA6.DLL'], in "/home/naka/work/mf_inspector/test_folder/maldoc"

Malicious Score Table

mf_inspector gives each file a malicious score based on its findings. The more suspicious and/or malicious content it finds in a file, the higher the score. This score is shown both in an ASCII table at the end of the console/log output and in the results CSV.

Below is an example of the malicious score table when the -c (ClamAV) and -v (Virus Total) flags are used:

| File                                                    | ClamAV Detected   | VTotal Matches   |   Malicious Score |
| mf_inspector/test_folder/maldoc                         | True              | 43/60            |               249 |
| mf_inspector/test_folder/pdf/bad.pdf                    | True              | 0                |                60 |
| mf_inspector/test_folder/pdf/2Fs11235-017-0334-z.pdf    | False             | 0                |                 6 |
| /mf_inspector/test_folder/pdf/encrypted.pdf             | False             | 0                |                 3 |
| /mf_inspector/test_folder/info.sys                      | False             | 0                |                 1 |
| /mf_inspector/test_folder/pdf/bottle.pdf                | False             | 0                |                 1 |
| /mf_inspector/test_folder/bin/Vysor-win32-ia32.exe.file | False             | 0                |                 1 |
| /mf_inspector/test_folder/doc/plant.docx                | False             | 0                |                 1 |
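
The idea of an additive score can be illustrated with a few lines of Python. The finding names and weights below are entirely hypothetical and are not mf_inspector’s actual values; the sketch only shows how per-finding weights can be summed and how files can be ranked so that the most suspicious appear first, as in the table above.

```python
# Hypothetical weights, chosen only for illustration; not mf_inspector's values.
WEIGHTS = {
    "suspicious_item": 1,
    "pdf_active_content": 5,
    "vt_match": 2,
    "clamav_detected": 50,
}


def malicious_score(findings: dict) -> int:
    """Sum weight * count over the findings recorded for one file."""
    return sum(WEIGHTS[name] * int(count) for name, count in findings.items())


def rank_files(findings_by_file: dict):
    """Return (path, score) pairs sorted from highest to lowest score."""
    scored = [(path, malicious_score(f)) for path, f in findings_by_file.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Boolean findings such as a ClamAV detection count as 1 when true and 0 when false.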

CSV Output

mf_inspector also collects and outputs a summary of information in a CSV file, including:

  • SHA-256 of the file
  • File Mime Type
  • File Metadata
  • Findings (Macros, PDF Objects, Virus Scan Results, etc.)
  • Malicious Score


Anonymous Mode

An anonymous mode is available by specifying the -a flag. Anonymous mode prevents disclosure of sensitive information contained in file metadata and file names.

In anonymous mode, all filenames and file paths are replaced with the hash of their respective file paths in stdout, the log files and the output CSV files. Additionally, metadata is not extracted from the files.

2020-06-10 15:18:01,702 INFO - ##### LOG START - mf_inspector #####
2020-06-10 15:18:01,702 INFO - Walking filesystem found in "/home/naka/work/mf_inspector/test_folder/"
2020-06-10 15:18:01,702 INFO - Found 18 files
2020-06-10 15:18:01,703 INFO - ClamAV - Ping service success
2020-06-10 15:18:01,752 WARNING - Active "JavaScript" content found in PDF "1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054"
2020-06-10 15:18:01,753 WARNING - Active "OpenAction" content found in PDF "1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054"
2020-06-10 15:18:01,758 WARNING - ClamAV: ('FOUND', 'Pdf.Downloader.DeepLink-6622195-0') for 1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054
2020-06-10 15:18:01,758 INFO - VTotal - Attempting virus total lookup of hash for: 1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054
2020-06-10 15:18:02,084 INFO - VTotal - No match found on hash for: 1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054
2020-06-10 15:18:06,121 WARNING - ClamAV: ('FOUND', 'Doc.Dropper.Fareit-572') for 345b804a9416595840516674caaa65e65be57591d300beab2b6190298a9eac78
2020-06-10 15:18:06,121 INFO - VTotal - Attempting virus total lookup of hash for: 345b804a9416595840516674caaa65e65be57591d300beab2b6190298a9eac78
2020-06-10 15:18:06,415 WARNING - VTotal - 43/60 (71 percent) of vendors identified the file as malicious
2020-06-10 15:18:08,343 WARNING - Active "OpenAction" content found in PDF "4774a4ca47f89bb28cf5c19cf94c8b7868137a1d2cac27802ff385a25e566b24"

A separate file mapping each file to its hash is created when this mode is enabled. This allows the anonymised results to be passed to an analyst without disclosing sensitive information about the files and their environment.
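
The anonymisation step can be sketched as follows. This assumes the hash is a SHA-256 of the UTF-8 encoded file path, which matches the 64 character identifiers in the log above but is not confirmed here; the function names are hypothetical.

```python
import hashlib


def anonymise_path(path: str) -> str:
    """Return a hex digest that stands in for the file path in output."""
    # Assumption: SHA-256 over the UTF-8 encoded path
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


def build_mapping(paths):
    """Map each anonymised identifier back to its original path."""
    return {anonymise_path(path): path for path in paths}
```

The mapping stays with the owner of the files, while logs and CSV output that contain only the identifiers can be shared with an analyst.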



ClamAV And Virus Total

mf_inspector is also able to request that files be scanned by ClamAV if the -c flag is supplied as a CLI parameter. This requires the clamd service to be installed and running.

If the -v flag is provided along with the Virus Total API key, mf_inspector will send the SHA-256 hashes of files deemed suspicious to Virus Total and report the percentage of matches, if any.
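
The two local pieces of that workflow, hashing a file and turning a match count into a percentage, can be sketched as below. The function names are hypothetical and the API request itself is left out. The percentage is truncated rather than rounded, an assumption made so that 43 of 60 gives the 71 percent seen in the log output earlier.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def detection_percent(matches: int, vendors: int) -> int:
    """Return the truncated percentage of vendors that flagged the file."""
    return matches * 100 // vendors if vendors else 0
```

Sending only the hash means the file contents never leave the machine, which matters when the document store holds sensitive material.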

Future Development

This is version 1.0 of the tool. Some of the planned features for version 2.0 are listed below:

  • Support for scanning files on cloud drives (Google, OneDrive, etc…)
  • Better support for executable files (YARA support, Header checking etc…)


See the project page on GitHub, and follow our Cyber Lab on Twitter for more news on our research.

Orson Mosley
Cyber Security Researcher