This post gives a quick overview of our newly released tool, Mass File Inspector (mf_inspector), an open-source static analysis tool for documents, written in Python. It was created to aid the mass inspection and identification of malicious files in document stores. The need arose directly from an incident response exercise in which a customer suspected that malicious documents might be present in their Microsoft SharePoint system, so we wrote a tool to help us find them.
We’ll cover how the tool works later. First, let’s go through what can make a document malicious.
With email being the number one delivery method for malware, it’s no surprise that malware embedded in document attachments is highly pervasive. The sections below give a brief overview of document types, how attackers can embed malicious payloads, and some advice on preventative measures.
Office/Word documents can be largely categorised into three formats:
OLE and OpenXML formats use Microsoft’s dreaded VBA (Visual Basic for Applications) as their macro programming language. Since VBA can call the Windows system API, it’s possible to write macros which inject shellcode directly into process memory using the VirtualAlloc, RtlMoveMemory and CreateThread API calls. In fact, it’s possible to write complex malware in VBA; however, most malicious VBA macros are droppers or downloaders which fetch and launch additional malware.
Below is a non-exhaustive list of VBA macro capabilities:
Luckily, macros are disabled by default in the Microsoft Office suite. They can, however, be enabled via the settings menu or by clicking Enable Content when the popup “SECURITY WARNING Macros have been disabled.” appears.
Prevention measures include making sure macros are disabled and using software which can identify suspicious macros and embedded content. Antivirus may also help prevent malicious activity by identifying malicious payloads on disk or by detecting and stopping malicious payloads executing in memory.
mf_inspector uses oletools to detect and notify the user of suspicious macros and embedded content. mf_inspector can also scan documents with ClamAV anti-virus, provided the service is running.
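oletools does the real macro analysis. As a much cruder, standard-library-only illustration of the idea, the sketch below checks whether an OpenXML document (a ZIP container) carries a vbaProject.bin part, which is where Office stores VBA code in that format. The function names are hypothetical and this is not how mf_inspector or oletools is implemented; it also does nothing for OLE files.

```python
import io
import zipfile


def has_vba_project(data: bytes) -> bool:
    """Return True if an OpenXML (ZIP) document contains a vbaProject.bin part."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return any(
                name.lower().endswith("vbaproject.bin")
                for name in archive.namelist()
            )
    except zipfile.BadZipFile:
        # Not a ZIP container, so not an OpenXML document.
        return False


def make_zip(names):
    """Demo helper (hypothetical): build an in-memory ZIP with empty members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    return buffer.getvalue()
```

A hit only says that macros are present, not that they are malicious; judging the macro code itself is the part oletools handles.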
As with any software that parses data, bugs may exist in document editors and viewers that enable the execution of arbitrary code when parsing malformed data. Software that checks the validity of document formats can help prevent this type of exploit. Keeping the document reading software up to date with the latest security patches can also help. Additionally, malicious payloads embedded inside documents can be detected with antivirus software.
Before looking into malicious PDF types and their payloads, it helps to have a general idea of how PDFs are structured. The structure of a PDF can be broken down into four sections: the header, the body, the xref table and the trailer.
The PDF header usually starts with the ASCII %PDF- followed by the version. For example:
00000000 25 50 44 46 2d 31 2e 37 |%PDF-1.7|
Unlike typical headers, information about how the file is laid out is not contained in this section. Instead it is contained at the end of the PDF document, in the xref table and trailer sections.
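A minimal sketch of reading the header described above. The function name is hypothetical, and searching the first 1024 bytes rather than only offset 0 is an assumption based on the leniency of common readers, which tolerate some leading junk.

```python
import re
from typing import Optional

# Matches "%PDF-" followed by a major.minor version such as 1.7
_PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_version(data: bytes) -> Optional[str]:
    """Return the version from a '%PDF-x.y' header, or None if none is found."""
    match = _PDF_HEADER.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


# The hexdump shown above, fed back in as bytes
print(pdf_version(bytes.fromhex("255044462d312e37")))
```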
Following the header is a body which consists of a series of objects, with each object describing a part of the document.
Each object starts with an object index number, followed by a generation number and the obj keyword. The start of an object could for example be 5 0 obj. The end of each object is signified by the endobj keyword. The object data is contained between these keywords. Below is an example object entry:
5 0 obj % page content
<< /Length 44 >>
stream
BT
70 50 TD
/F1 12 Tf
(Hello, world!) Tj
ET
endstream
endobj
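The obj/endobj framing described above can be located with a simple scan. This is only a sketch with a hypothetical function name: a regular expression will be fooled by binary stream data that happens to contain "endobj", and it cannot see objects packed inside compressed object streams, so a real parser is needed for anything beyond illustration.

```python
import re

# index number, generation number, "obj", object data, "endobj"
_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def iter_objects(data: bytes):
    """Yield (index, generation, raw_data) for each object found in data."""
    for match in _OBJECT.finditer(data):
        yield int(match.group(1)), int(match.group(2)), match.group(3).strip()
```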
The xref table contains a series of entries which allow PDF software to reference every object in the document. This is achieved by providing the start offset to every object in the first ten bytes of each entry.
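xref
0 6
0000000000 65535 f
0000000010 00000 n
0000000079 00000 n
0000000173 00000 n
0000000301 00000 n
0000000380 00000 n
Entry structure: each entry consists of a 10-digit byte offset, a 5-digit generation number, and a flag marking the object as in use (n) or free (f).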
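Each entry line in the xref example above can be split into its three fields. The sketch below uses hypothetical names (XrefEntry, parse_xref_entry) and only handles the classic plain-text xref table, not cross-reference streams.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int      # byte offset of the object (next free object number for free entries)
    generation: int  # generation number
    in_use: bool     # True for "n" entries, False for "f" (free) entries


def parse_xref_entry(line: str) -> XrefEntry:
    """Parse one entry such as '0000000079 00000 n'."""
    offset, generation, flag = line.split()
    if len(offset) != 10 or len(generation) != 5 or flag not in ("n", "f"):
        raise ValueError(f"malformed xref entry: {line!r}")
    return XrefEntry(int(offset), int(generation), flag == "n")
```

Rejecting entries with the wrong field widths is a cheap validity check, since a well-formed table always uses fixed-width fields.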
The PDF trailer contains various information about the structure of a PDF, including:
Below is an example:
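trailer
<< /Size 6 /Root 1 0 R >>
startxref
492
%%EOF
Malicious PDFs can largely be categorised in two ways: those that exploit vulnerabilities in the PDF reader’s parsing engine, and those that abuse inbuilt PDF functions.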
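Because the layout information sits at the end of the file, a reader starts from the tail. A minimal sketch, with a hypothetical function name and an assumed 1024-byte tail window, of recovering the startxref offset:

```python
import re
from typing import Optional

# "startxref", the byte offset of the xref table, then the end-of-file marker
_STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def find_startxref(data: bytes) -> Optional[int]:
    """Return the xref table offset from the file tail, or None if it is missing."""
    matches = _STARTXREF.findall(data[-1024:])
    # Incrementally updated PDFs can hold several trailers; the last one wins.
    return int(matches[-1]) if matches else None
```

A None result here corresponds to a file with no usable end-of-file marker, which is one simple signal that a PDF is truncated or malformed.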
Vulnerabilities in PDF reader parsing engines typically involve maliciously crafting valid or invalid documents, and can often be exploited and leveraged to execute arbitrary code.
For example, a bug (CVE-2015-7622) in the image parsing engine of Adobe Reader and Adobe Acrobat resulted in memory corruption and arbitrary code execution. This attack is carried out by embedding a specially crafted image into one of the object sections.
Prevention of this type of vulnerability can be carried out by applying software patches to fix these vulnerabilities. Software that can scan and detect malformed data contained in PDFs can also help prevent this type of attack. Additionally, antivirus software may be able to detect the malicious payload on disk or in memory.
Inbuilt PDF functions can be used to carry out malicious actions. For example, JavaScript can be embedded in an object and executed on start-up with “/OpenAction”, as in our implementation of CVE-2018-8414 detailed in a previous blog. Shellcode can also be embedded and techniques such as Heap Spraying can be used to gain arbitrary code execution.
From a defensive perspective, use software which can scan for and detect JavaScript, OpenAction and other potentially dangerous functions contained in PDFs.
mf_inspector will notify the user if it finds any of these in a PDF document. Disabling JavaScript in the PDF reader can also help prevent JavaScript from executing.
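A keyword scan of this kind can be sketched in a few lines. The list of names and the function name below are assumptions for illustration, not mf_inspector's actual checks, and the scan is deliberately naive: it misses names obfuscated with #xx hex escapes and anything hidden inside compressed streams.

```python
import re

# PDF dictionary keys commonly associated with active content (illustrative list)
SUSPICIOUS_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch")


def find_active_content(data: bytes):
    """Return the suspicious PDF names that appear in the raw bytes."""
    found = []
    for name in SUSPICIOUS_NAMES:
        # A name ends at whitespace or a delimiter, so "/JS" must not match "/JSfoo".
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        if re.search(pattern, data):
            found.append(name.decode("ascii"))
    return found
```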
When run with its default configuration, mf_inspector inspects all files contained in a folder and its subfolders. Each file’s type is determined by its header magic, and the type of inspection depends on the detected type. For example, if the file is an Office document, the code will check for macros and extract document metadata. If the file is a PDF, the code will look for potentially dangerous PDF objects such as embedded JavaScript content.
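The walk-and-dispatch flow described above can be sketched as follows. The signature table and function names are hypothetical, and mf_inspector's own type detection may work differently; the magic values themselves are the standard ones for PDF, OLE compound files and ZIP containers (which OpenXML documents are).

```python
import os

# (magic bytes, label) pairs checked against the start of each file
MAGIC_SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole"),
    (b"PK\x03\x04", "zip/openxml"),
)


def detect_type(header: bytes) -> str:
    """Classify a file from its first bytes."""
    for magic, label in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def walk_and_classify(root: str):
    """Yield (path, type) for every file under root, including subfolders."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                # 8 bytes is enough for the longest signature above
                yield path, detect_type(handle.read(8))
```

Classifying by content rather than by extension matters here, since a malicious document can carry any file name.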
Noteworthy findings are shown to the user through the Python logging module’s WARNING level messages. More detailed results are also output to a CSV formatted file.
A more complete list of features is shown below:
mf_inspector requires Python 3.6+ to run.
First, clone the repository:
git clone https://github.com/6point6/mf_inspector
Change directory to the root of the repository:
cd mf_inspector
OPTIONAL: Create and activate a venv:
virtualenv venv
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
To run mf_inspector with default settings, run the command:
python mf_inspector.py -d /home/naka/work/mf_inspector/test_folder/
The -d flag is required to specify the root directory containing files for inspection.
mf_inspector uses the Python 3 logging module for printing to stdout and creating log files. Using the logging module allows debug logs from third-party dependencies to be viewed by setting the logging level to DEBUG.
The log level can be specified by passing it to the -l flag. Below is a quick description of each log level along with the type of information produced by each level:
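As a rough illustration of how a -l style flag can drive the logging module, the sketch below wires an argparse option to the root logger. The accepted level names, the INFO default and the format string are assumptions (the format is chosen to resemble the console output in this post), not mf_inspector's actual code.

```python
import argparse
import logging

# Assumed format, e.g. "2020-06-11 09:32:03,705 INFO - message"
LOG_FORMAT = "%(asctime)s %(levelname)s - %(message)s"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="log level demo")
    parser.add_argument(
        "-l",
        dest="log_level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
    )
    return parser


def configure_logging(level_name: str) -> int:
    """Apply the chosen level to the root logger and return its numeric value."""
    level = getattr(logging, level_name)
    logging.basicConfig(format=LOG_FORMAT)
    # Setting the root level also exposes DEBUG output from third-party libraries.
    logging.getLogger().setLevel(level)
    return level
```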
Console output:
2020-06-11 09:32:03,705 INFO - ##### LOG START - mf_inspector #####
2020-06-11 09:32:03,705 INFO - Walking filesystem found in "/home/naka/work/mf_inspector/test_folder/"
2020-06-11 09:32:03,705 INFO - Found 18 files
2020-06-11 09:32:03,727 WARNING - VBA macros in "/home/naka/work/mf_inspector/test_folder/maldoc"
2020-06-11 09:32:03,740 INFO - Opening OLE file /home/naka/work/mf_inspector/test_folder/maldoc
2020-06-11 09:32:03,741 INFO - Check whether OLE file is PPT
2020-06-11 09:32:03,745 ERROR - Failed to extract pdf metadata for: /home/naka/work/mf_inspector/test_folder/pdf/bottle.pdf - EOF marker not found
2020-06-11 09:32:03,745 WARNING - Invalid file format for "/home/naka/work/mf_inspector/test_folder/pdf/bottle.pdf" for mime type "application/pdf"
2020-06-11 09:32:03,755 WARNING - Active "JavaScript" content found in PDF "/home/naka/work/mf_inspector/test_folder/pdf/bad.pdf"
2020-06-11 09:32:03,755 WARNING - Active "OpenAction" content found in PDF "/home/naka/work/mf_inspector/test_folder/pdf/bad.pdf"
2020-06-11 09:32:03,809 ERROR - Failed to parse macros for: /home/naka/work/mf_inspector/test_folder/doc/plant.docx - Error -3 while decompressing data: invalid distance too far back
2020-06-11 09:32:03,810 ERROR - Failed to extract document metadata for: /home/naka/work/mf_inspector/test_folder/doc/plant.docx
2020-06-11 09:32:03,810 WARNING - Invalid file format for "/home/naka/work/mf_inspector/test_folder/doc/plant.docx" for mime type "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
2020-06-11 09:32:07,426 ERROR - Failed to extract pdf metadata for: /home/naka/work/mf_inspector/test_folder/pdf/encrypted.pdf - file has not been decrypted
2020-06-11 09:32:07,426 WARNING - Invalid file format for "/home/naka/work/mf_inspector/test_folder/pdf/encrypted.pdf" for mime type "application/pdf"
2020-06-11 09:32:07,965 WARNING - Found 18 suspicious items in "/home/naka/work/mf_inspector/test_folder/maldoc"
2020-06-11 09:32:07,965 WARNING - Found 4 IOCs: ['lewd.exe', 'MSVBVM60.DLL', 'user32.dll', 'VBA6.DLL'], in "/home/naka/work/mf_inspector/test_folder/maldoc"
mf_inspector gives each file a malicious score based on its findings. The more suspicious and/or malicious content it finds in a file, the higher the score. This score is printed both in an ASCII table at the end of the console/log output and in the results CSV.
Below is an example of the malicious score table when the -c (ClamAV) and -v (VirusTotal) flags are used:
+------------------------------------------------------------------------+-------------------+------------------+-------------------+
| File                                                                   | ClamAV Detected   | VTotal Matches   |   Malicious Score |
+========================================================================+===================+==================+===================+
| mf_inspector/test_folder/maldoc                                        | True              | 43/60            |               249 |
+------------------------------------------------------------------------+-------------------+------------------+-------------------+
| mf_inspector/test_folder/pdf/bad.pdf                                   | True              | 0                |                60 |
+------------------------------------------------------------------------+-------------------+------------------+-------------------+
| mf_inspector/test_folder/pdf/2Fs11235-017-0334-z.pdf                   | False             | 0                |                 6 |
+------------------------------------------------------------------------+-------------------+------------------+-------------------+
| /mf_inspector/test_folder/pdf/encrypted.pdf                            | False             | 0                |                 3 |
+------------------------------------------------------------------------+-------------------+------------------+-------------------+
| /mf_inspector/test_folder/info.sys                                     | False             | 0                |                 1 |
+------------------------------------------------------------------------+-------------------+------------------+-------------------+
| /mf_inspector/test_folder/pdf/bottle.pdf                               | False             | 0                |                 1 |
+------------------------------------------------------------------------+-------------------+------------------+-------------------+
| /mf_inspector/test_folder/bin/Vysor-win32-ia32.exe.file                | False             | 0                |                 1 |
+------------------------------------------------------------------------+-------------------+------------------+-------------------+
| /mf_inspector/test_folder/doc/plant.docx                               | False             | 0                |                 1 |
+------------------------------------------------------------------------+-------------------+------------------+-------------------+
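The idea of an additive score can be sketched as a weighted sum over findings. Every finding name and weight below is hypothetical: this post does not state how mf_inspector actually weights its findings, so the numbers will not reproduce the scores in the table above.

```python
# Hypothetical weights, for illustration only (not mf_inspector's real values)
WEIGHTS = {
    "vba_macro": 10,
    "suspicious_keyword": 5,
    "ioc": 5,
    "pdf_active_content": 20,
    "clamav_hit": 50,
}


def malicious_score(findings: dict) -> int:
    """Sum weight * count over findings; unknown finding types contribute 0."""
    return sum(WEIGHTS.get(kind, 0) * count for kind, count in findings.items())


def rank(scores: dict) -> list:
    """Return file names ordered from highest to lowest score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)
```

Sorting by score, as the table does, lets an analyst start with the files most likely to need attention.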
mf_inspector also collects and outputs a summary of information in a CSV file, including:
Here’s some example output:
An anonymous mode is available by specifying the -a flag. Anonymous mode prevents disclosure of sensitive information contained in file metadata and file names.
In anonymous mode, all filenames and file paths are replaced with the hash of their respective filepaths when printed to stdout, logging and the output CSV files. Additionally, metadata is not extracted from the files.
2020-06-10 15:18:01,702 INFO - ##### LOG START - mf_inspector #####
2020-06-10 15:18:01,702 INFO - Walking filesystem found in "/home/naka/work/mf_inspector/test_folder/"
2020-06-10 15:18:01,702 INFO - Found 18 files
2020-06-10 15:18:01,703 INFO - ClamAV - Ping service success
2020-06-10 15:18:01,752 WARNING - Active "JavaScript" content found in PDF "1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054"
2020-06-10 15:18:01,753 WARNING - Active "OpenAction" content found in PDF "1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054"
2020-06-10 15:18:01,758 WARNING - ClamAV: ('FOUND', 'Pdf.Downloader.DeepLink-6622195-0') for 1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054
2020-06-10 15:18:01,758 INFO - VTotal - Attempting virus total lookup of hash for: 1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054
2020-06-10 15:18:02,084 INFO - VTotal - No match found on hash for: 1aa5f45734e6200f21fa96dddd2df55f353d22e42c3b1d6653c0ddbfd5a76054
2020-06-10 15:18:06,121 WARNING - ClamAV: ('FOUND', 'Doc.Dropper.Fareit-572') for 345b804a9416595840516674caaa65e65be57591d300beab2b6190298a9eac78
2020-06-10 15:18:06,121 INFO - VTotal - Attempting virus total lookup of hash for: 345b804a9416595840516674caaa65e65be57591d300beab2b6190298a9eac78
2020-06-10 15:18:06,415 WARNING - VTotal - 43/60 (71 percent) of vendors identified the file as malicious
2020-06-10 15:18:08,343 WARNING - Active "OpenAction" content found in PDF "4774a4ca47f89bb28cf5c19cf94c8b7868137a1d2cac27802ff385a25e566b24"
When this mode is enabled, a separate file mapping each file path to its hash is created. This allows the anonymised results to be passed to an analyst without disclosing sensitive information about the files and their environment.
Below is an example of a file mapping produced by anonymous mode:
path,SHA-256
/home/naka/work/mf_inspector/test_folder/maldoc,345b804a9416595840516674caaa65e65be57591d300beab2b6190298a9eac78
/home/naka/work/mf_inspector/test_folder/cav-linux_x64.deb,325b819b041a7b27026ba85f66ea808d0d11ad39d94bc13ae6d95802413495b6
/home/naka/work/mf_inspector/test_folder/info.sys,e13d65c0f1c5a37d1f5d854795ccdfec18c0b8de18a4b33a5df42a5197863071
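A mapping file in this shape can be produced with the standard library. The sketch below follows the description above and hashes the path string; the function names, the UTF-8 encoding and the exact input that mf_inspector feeds to SHA-256 are assumptions.

```python
import csv
import hashlib


def anonymise_path(path: str) -> str:
    """Return the SHA-256 hex digest used in place of a file path."""
    return hashlib.sha256(path.encode("utf-8")).hexdigest()


def write_mapping(paths, handle) -> None:
    """Write a 'path,SHA-256' CSV so the file owner can reverse the anonymisation."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["path", "SHA-256"])
    for path in paths:
        writer.writerow([path, anonymise_path(path)])
```

The mapping stays with the file owner, while only the hashed output goes to the analyst.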
mf_inspector can also request that files be scanned by ClamAV if the -c flag is supplied as a CLI parameter. This requires the clamd service to be installed and running.
If the -v flag is provided along with a VirusTotal API key, mf_inspector will send the SHA-256 hashes of files deemed suspicious to VirusTotal and report the percentage of matches, if any.
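Only a hash leaves the machine in this flow, so the local half is computing the digest and summarising the response. The sketch below covers those two steps without any network call; the function names are hypothetical, and truncating the percentage is an assumption chosen because it is consistent with the "43/60 (71 percent)" line in the log output above.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file's contents in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def detection_summary(positives: int, total: int) -> str:
    """Format a detection ratio, truncating the percentage to a whole number."""
    percent = 100 * positives // total if total else 0
    return f"{positives}/{total} ({percent} percent)"


print(detection_summary(43, 60))
```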
This is version 1.0 of the tool. Some of the planned features for version 2.0 are listed below:
See the project page on GitHub, and follow our Cyber Lab on Twitter for more news on our research.