Programmers Manual for Developing Bulk Extractor Scanner Plug-ins

bulk extractor 1.4

USER MANUAL

Quickstart Guide Included

March 23, 2015

Authored by:

Jessica R. Bradley

Simson L. Garﬁnkel

One Page Quickstart for Linux & Mac OS X Users

This page provides a very brief introduction to downloading, installing and running

bulk_extractor.

1. If you do not already have one, obtain a disk image on which to run bulk_extractor.

Sample images can be downloaded from http://digitalcorpora.org/corpora/

disk-images. Suggestions include nps-2009-domexusers and

nps-2009-ubnist1.gen3.E01.

2. Download the latest version of bulk_extractor. It can be obtained from http://

digitalcorpora.org/downloads/bulk_extractor/. The ﬁle is called bulk_extractor-x.y.z.tar.gz

where x.y.z is the latest version.

3. Un-tar and un-zip the ﬁle. In the newly created bulk_extractor-x.y directory, run

the following commands:

 ./configure

 make

 sudo make install

[Refer to Subsubsection 3.1.1 Installing on Linux or Mac OS X. Note, for

full functionality, some users may need to ﬁrst download and install dependent

library ﬁles. Instructions are outlined in the referenced section.]

4. To run bulk_extractor from the command line, type the following command:

 bulk_extractor -o output mydisk.raw

In the above command, output is the directory that will be created to store

bulk_extractor results. It must not already exist. The input mydisk.raw is the

disk image to be processed. [See Subsection 3.2 Run bulk_extractor from

the Command Line]

5. To run bulk_extractor from the Bulk Extractor Viewer, navigate to the directory

called /java_gui in the bulk_extractor folder and run the following command:

 ./BEViewer

In the Bulk Extractor Viewer, click on the Gear/down arrow icon as depicted

below.

A window will pop up and the ﬁrst two input boxes allow you to select an Image

File and specify an Output Feature Directory to create. Enter both of those and

then select the button at the bottom of the window titled "Start bulk_extractor"

to run bulk_extractor. [See Subsection 3.3 Run bulk_extractor from Bulk

Extractor Viewer]

6. Whether bulk_extractor was run from the command line or the Bulk Extractor

Viewer tool, after the run the resulting output ﬁles will be contained in the

speciﬁed output directory. Open that directory and verify ﬁles have been created.

There should be 15-25 ﬁles. Some will be empty and others will be populated with

data.

7. Users can join the Google email users group for more information and help with any

issues encountered. Email bulk_extractor-users+subscribe@googlegroups.com

with a blank message to join.

One Page Quickstart for Windows Users

This page provides a very brief introduction to downloading, installing and running

bulk_extractor.

1. If you do not already have one, obtain a disk image on which to run bulk_extractor.

Sample images can be downloaded from http://digitalcorpora.org/corpora/

disk-images. Suggestions include nps-2009-domexusers and

nps-2009-ubnist1.gen3.E01.

2. Download the latest version of the bulk_extractor Windows installer. It can be

obtained from http://digitalcorpora.org/downloads/bulk_extractor. The

ﬁle to download is called bulk_extractor-x.y.z-windowsinstaller.exe where

x.y.z is the latest version number. Run the installer ﬁle. This will automatically

install bulk_extractor on your machine. The automatic installation includes the

complete bulk_extractor system as well as the Bulk Extractor Viewer tool. [See

Subsubsection 3.1.2 Installing on Windows]

3. To run bulk_extractor from the command line, type the following command:

 bulk_extractor -o output mydisk.raw

In the above command, output is the directory that will be created to store

bulk_extractor results. It must not already exist. The input mydisk.raw is the

disk image to be processed. [See Subsection 3.2 Run bulk_extractor from

the Command Line]

4. To run bulk_extractor from the Bulk Extractor Viewer, run the program Bulk

Extractor X.Y from the Start Menu.

In the Bulk Extractor Viewer, click on the Gear/down arrow icon as depicted

below.

A window will pop up and the ﬁrst two input boxes allow you to select an Image

File and specify an Output Feature Directory to create. Enter both of those and

then select the button at the bottom of the window titled "Start bulk_extractor"

to run bulk_extractor. [See Subsection 3.3 Run bulk_extractor from Bulk

Extractor Viewer]

5. Whether bulk_extractor was run from the command line or the Bulk Extractor

Viewer tool, after the run the resulting output ﬁles will be contained in the

speciﬁed output directory. Open that directory and verify ﬁles have been created.

There should be 15-25 ﬁles. Some will be empty and others will be populated with

data.

6. Users can join the Google email users group for more information and help with any

issues encountered. Email bulk_extractor-users+subscribe@googlegroups.com

with a blank message to join.

Contents

1 Introduction
1.1 Overview of bulk_extractor
1.1.1 A bulk_extractor Success Story
1.2 Purpose of this Manual
1.3 Conventions Used in this Manual
2 How bulk_extractor Works
3 Running bulk_extractor
3.1 Installation Guide
3.1.1 Installing on Linux or Mac OS X
3.1.2 Installing on Windows
3.2 Run bulk_extractor from the Command Line
3.3 Run bulk_extractor from Bulk Extractor Viewer
4 Processing Data
4.1 Types of Input Data
4.2 Scanners
4.3 Carving
4.4 Suppressing False Positives
4.5 Using an Alert List
4.6 The Importance of Compressed Data Processing
5 Use Cases for bulk_extractor
5.1 Malware Investigations
5.2 Cyber Investigations
5.3 Identity Investigations
5.4 Password Cracking
5.5 Analyzing Imagery Information
5.6 Using bulk_extractor in a Highly Specialized Environment
6 Tuning bulk_extractor
7 Post Processing Capabilities
7.1 bulk_diff.py: Difference Between Runs
7.2 identify_filenames.py: Identify File Origin of Features
8 Worked Examples
8.1 Encoding
9 2009-M57 Patents Scenario
9.1 Run bulk_extractor with the Data
9.2 Digital Media Triage
9.3 Analyzing Imagery
9.4 Password Cracking
9.5 Post Processing
10 NPS DOMEX Users Image
10.1 Malware Investigations
10.2 Cyber Investigations
11 Troubleshooting
12 Related Reading
Appendices
A Output of bulk_extractor Help Command

1 Introduction

1.1 Overview of bulk_extractor

bulk_extractor is a program that extracts features such as email addresses, credit card

numbers, URLs, and other types of information from digital evidence media. It is a

useful forensic investigation tool for many tasks such as malware and intrusion inves-

tigations, identity investigations and cyber investigations, as well as analyzing imagery

and password cracking. The program provides several unusual capabilities including:

• It ﬁnds email addresses, URLs and credit card numbers that other tools miss

because it can process compressed data (like ZIP, PDF and GZIP ﬁles) and in-

complete or partially corrupted data. It can carve JPEGs, oﬃce documents and

other kinds of ﬁles out of fragments of compressed data. It will detect and carve

encrypted RAR ﬁles.

• It builds word lists based on all of the words found within the data, even those in

compressed ﬁles that are in unallocated space. Those word lists can be useful for

password cracking.

• It is multi-threaded; running bulk_extractor on a computer with twice the number

of cores typically makes it complete a run in half the time.

• It creates histograms showing the most common email addresses, URLs, domains,

search terms and other kinds of information on the drive.

bulk_extractor operates on disk images, ﬁles or a directory of ﬁles and extracts use-

ful information without parsing the ﬁle system or ﬁle system structures. The input is

split into pages and processed by one or more scanners. The results are stored in fea-

ture ﬁles that can be easily inspected, parsed, or processed with other automated tools.

bulk_extractor also creates histograms of features that it ﬁnds. This is useful because

features such as email addresses and internet search terms that are more common tend

to be important.

In addition to the capabilities described above, bulk_extractor also includes:

• A graphical user interface, Bulk Extractor Viewer, for browsing features stored

in feature ﬁles and for launching bulk_extractor scans

• A small number of python programs for performing additional analysis on feature

ﬁles

bulk_extractor 1.5 detects and optimistically decompresses data in ZIP, GZIP, RAR,

and Microsoft’s Hibernation ﬁles. This has proven useful, for example, in recovering

email addresses from fragments of compressed ﬁles found in unallocated space.

bulk_extractor contains a simple but eﬀective mechanism for protecting against decom-

pression bombs. It also has capabilities speciﬁcally designed for Windows and malware

analysis including decoders for the Windows PE, Linux ELF, VCARD, Base16, Base64

and Windows directory formats.

bulk_extractor gets its speed through the use of compiled search expressions and multi-

threading. The search expressions are written as pre-compiled regular expressions, es-

sentially allowing bulk_extractor to perform searches on disparate terms in parallel.

Threading is accomplished through the use of an analysis thread pool. After the fea-

tures have been extracted, bulk_extractor builds a histogram of email addresses, Google

search terms, and other extracted features. Stop lists can also be used to remove features

not relevant to a case.

bulk_extractor is distinguished from other forensic tools by its speed and thoroughness.

Because it ignores ﬁle system structure, bulk_extractor can process diﬀerent parts of the

disk in parallel. This means that an 8-core machine will process a disk image roughly

8 times faster than a 1-core machine. bulk_extractor is also thorough. It automatically

detects, decompresses, and recursively re-processes data that has been compressed with

a variety of algorithms. Our testing has shown there is a signiﬁcant amount of com-

pressed data in the unallocated regions of ﬁle systems missed by most forensics tools

that are commonly in use today[?]. Another advantage of ignoring ﬁle systems is that

bulk_extractor can be used to process any kind of digital media. The program has been

used to process hard drives, SSDs, optical media, camera cards, cell phones, network

packet dumps, and other kinds of digital information.

Between 2005 and 2008, the bulk_extractor team interviewed law enforcement regarding

their use of forensic tools. Law enforcement oﬃcers wanted a highly automated tool for

ﬁnding email addresses and credit card numbers (including track 2 information), phone

numbers, GPS coordinates and EXIF information from JPEGs, search terms (extracted

from URLs), and all words that were present on the disk (for password cracking). The

tool needed to run on Windows, Linux and Mac OS X systems with no user interaction.

It also had to operate on raw disk images, split-raw volumes and E01 ﬁles. The tool

needed to run at the maximum I/O speed of the physical drive and never crash. Through

these interviews, the initial requirements for the bulk_extractor system were developed.

Over the past ﬁve years, we have worked to create the tool that those oﬃcers desired.

1.1.1 A bulk_extractor Success Story

One early bulk_extractor success story comes from the City of San Luis Obispo Police

Department in the Spring of 2010. The District Attorney ﬁled charges against two in-

dividuals for credit card fraud and possession of materials to commit credit card fraud.

The defendants were arrested with a computer. Defense attorneys were expected to

argue that the defendants were unsophisticated and lacked knowledge to commit the

crime. The examiner was given a 250 GB drive the day before the preliminary hearing;

typically it would take several days to conduct a proper forensic investigation of that

much data.

bulk_extractor found actionable evidence in only two and a half hours including the

following information:

• There were over 10,000 credit card numbers on the hard drive (illegal materials).

Over 1000 of the credit card numbers were unique.

• The most common email address belonged to the primary defendant (evidence of

possession).

• The most commonly occurring internet search engine queries concerned credit card

fraud and bank identiﬁcation numbers (evidence of intent).

• The most commonly visited websites were in a foreign country whose primary

language is spoken by the defendant (evidence of ﬂight risk).

Armed with this data, the defendants were held without bail.

As bulk_extractor has been deployed and used in diﬀerent applications, it has evolved

to meet additional requirements. This manual describes use cases for the bulk_extractor

system and demonstrates how users can take full advantage of all of its capabilities.

1.2 Purpose of this Manual

This User Manual is intended to be useful to new, intermediate and experienced users of

bulk_extractor. It provides an in-depth review of the functionality included in bulk_extractor

and shows how to access and utilize features through both command line operation and

the Bulk Extractor Viewer. This manual includes working examples with links to

the input data (disk images) used, giving users the opportunity to work through the

examples and utilize all aspects of the system.

1.3 Conventions Used in this Manual

This manual uses standard formatting conventions to highlight ﬁle names, directory

names and example commands. The conventions for those speciﬁc types are described

in this section.

Names of programs including the post-processing tools native to bulk_extractor and

third-party tools are shown in bold, as in tcpﬂow.

File names are displayed in a ﬁxed width font. They will appear as filename.txt within

the text throughout the manual.

Directory names are displayed in italics. They appear as directoryname/ within the text.

The only exception is for directory names that are part of an example command. Di-

rectory names referenced in example commands appear in the example command format.

Scanner names are denoted with bold, italicized text. They are always speciﬁed in

lower-case, because that is how they are referred to in the options and usage information

for bulk_extractor. Names will appear as scannername.

This manual contains example commands that should be typed in by the user. A com-

mand entered at the terminal is shown like this:

 command

The ﬁrst character on the line is the terminal prompt, and should not be typed. The

black square is used as the standard prompt in this manual, although the prompt shown

on a user's screen will vary according to the system they are using.

2 How bulk_extractor Works

bulk_extractor finds email addresses, URLs, and CCNs that other tools miss. This is

due in part to the fact that bulk_extractor optimistically decompresses and re-analyzes

all data (e.g. zip fragments, gzip browser cache runs). The decompression operates on

incomplete and corrupted data until decompression fails. bulk_extractor can also build

word lists for password cracking.

[Figure 1: Three Phases of bulk_extractor Operation. Disk image files (.E01, .aff, .dd, .000, .001, ...) pass through three phases (extract features, histogram creation, post processing) to produce output such as report.xml (log file), telephone.txt (list of phone numbers with context), telephone_histogram.txt (histogram of phone numbers) and vcard/ (directory of VCARDs).]
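The idea of optimistic decompression can be illustrated with a short Python sketch. This is not bulk_extractor's own code; it only assumes Python's standard zlib module, and the function name optimistic_decompress and the chunk size are illustrative. The sketch feeds a possibly truncated or corrupted stream to the decompressor in small chunks and keeps whatever output was produced before the data ran out or the first error occurred.

```python
import zlib


def optimistic_decompress(data: bytes, chunk_size: int = 64) -> bytes:
    """Decompress as much of a zlib stream as possible.

    Illustrative only: keeps the output produced before the stream
    ends or the first decompression error occurs.
    """
    decompressor = zlib.decompressobj()
    recovered = bytearray()
    for start in range(0, len(data), chunk_size):
        try:
            recovered += decompressor.decompress(data[start:start + chunk_size])
        except zlib.error:
            # Corrupted input: stop and keep what was already recovered.
            break
    return bytes(recovered)


# Example: compress some text, then throw away the second half of the stream.
original = "".join(f"user{i}@example.com\n" for i in range(500)).encode("utf-8")
compressed = zlib.compress(original)
fragment = compressed[: len(compressed) // 2]
partial = optimistic_decompress(fragment)
```

In this example, partial holds the leading part of the original text, which could then be searched for features even though the compressed stream is incomplete.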

There are three phases of operation in bulk_extractor: feature extraction, histogram

creation, and post processing, as shown in Figure 1. The output feature files contain extracted

data designed for easy processing by third party programs or use in spreadsheet tools.

The bulk_extractor histogram system automatically summarizes features.

Feature files are written using the feature recording system. As features are discovered,

they are sent to the feature recorder and recorded in the appropriate ﬁle. Multiple scan-

ners might write to the same feature ﬁle. For example, the exif scanner searches the ﬁle

formats used by digital cameras and ﬁnds GPS coordinates in images. Those ﬁndings

are written to the output ﬁle gps.txt by the gps feature recorder. A separate scanner,

the gps scanner, searches Garmin Trackpoint data and also ﬁnds GPS coordinates and

writes them to gps.txt. It is worth noting that some scanners also ﬁnd more than one

type of feature and write to several feature files. For example, the email scanner looks

for email addresses, domains, URLs, RFC822 headers and Ethernet addresses and writes them to

email.txt, domain.txt, url.txt, rfc822.txt and ether.txt respectively.

A feature file contains rows of features. Each row typically consists of an offset, a

feature, and the feature in evidence context although scanners are free to store whatever

information they wish. A few lines of an email feature ﬁle might look like the following:

OFFSET      FEATURE                  FEATURE IN EVIDENCE CONTEXT

48198832    domexuser2@gmail.com     __<name>domexuser2@gmail.com/Home

48200361    domexuser2@live.com      __<name>domexuser2@live.com</name

48413823    siege@preoccupied.net    'Brien <siege@preoccupied.net>_l

The types of features displayed in the feature ﬁle will vary depending on what type of

feature is being stored. However, all feature ﬁles use the same format with each row cor-

responding to one found instance of a feature and three columns describing the related

data (oﬀset, feature, and feature in evidence context).
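Because every feature file shares this three-column layout, one small reader can handle all of them. The following Python sketch is an illustration rather than part of bulk_extractor: the names Feature and parse_feature_lines are hypothetical, the columns are assumed to be separated by tabs, and lines beginning with '#' are assumed to be comments.

```python
from typing import Iterator, NamedTuple


class Feature(NamedTuple):
    offset: str   # offset (or forensic path) where the feature was found
    feature: str  # the extracted feature, e.g. an email address
    context: str  # the feature in its evidence context (may be empty)


def parse_feature_lines(lines) -> Iterator[Feature]:
    """Yield Feature rows from the lines of a feature file."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: blank lines and lines starting with '#' carry no features.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t", 2)
        # Scanners may store fewer columns; pad so unpacking always works.
        fields += [""] * (3 - len(fields))
        yield Feature(*fields)


sample = [
    "# example comment line",
    "48198832\tdomexuser2@gmail.com\t__<name>domexuser2@gmail.com/Home",
    "48200361\tdomexuser2@live.com\t__<name>domexuser2@live.com</name",
]
features = list(parse_feature_lines(sample))
```

To read a real feature file, the same function could be given an open file object, for example one opened with encoding="utf-8" and errors="replace". The offset is kept as a string because it is not always a plain number.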

Histograms are a powerful tool for understanding certain kinds of evidence. A histogram

of emails allows us to rapidly determine the drive’s primary user, the user’s organiza-

tion, primary correspondents and other email addresses. The feature recording system

automatically makes histograms as data are processed. When the scanner writes to the

feature recording system, the relevant histograms are automatically updated.
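The effect of the histogram step can be approximated in a few lines of Python. This sketch only illustrates the counting idea and makes no claim about how bulk_extractor implements it; the function name build_histogram is hypothetical.

```python
from collections import Counter


def build_histogram(features):
    """Count how often each feature occurs, most frequent first."""
    counts = Counter(features)
    return counts.most_common()


emails = ["pat@m57.biz", "charlie@m57.biz", "pat@m57.biz", "pat@m57.biz"]
histogram = build_histogram(emails)
```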

A histogram ﬁle will, in general, look like the following ﬁle excerpt:

n=875    mozilla@kewis.ch    (utf16=3)

n=651    charlie@m57.biz    (utf16=120)

n=605    ajbanck@planet.nl

...

n=288    mattwillis@gmail.com

n=281    garths@oeone.com

n=226    michael.buettner@sun.com    (utf16=2)

n=225    bugzilla@babylonsounds.com

n=218    berend.cornelius@sun.com

n=210    ips@mail.ips.es

n=201    mschroeder@mozilla.x-home.org

n=186    pat@m57.biz    (utf16=1)

Each line shows a feature and the number of times that feature was found by bulk_extractor.

Where present, the utf16 value in parentheses indicates how many of those occurrences were

encoded as UTF-16. Features are stored in the file in order of frequency, with the most

frequent features appearing at the top of the file and the least frequent at the bottom.
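A histogram line can be split into its parts with a regular expression. The sketch below is a hypothetical helper, not a bulk_extractor tool. It assumes each line has the form n=<count>, whitespace, the feature, and an optional (utf16=<count>) suffix, as in the excerpt above.

```python
import re

# Assumed line layout: "n=<count>", the feature, and an optional "(utf16=<count>)".
HISTOGRAM_LINE = re.compile(r"^n=(\d+)\s+(.*?)(?:\s+\(utf16=(\d+)\))?$")


def parse_histogram_line(line: str):
    """Return (count, feature, utf16_count) or None if the line does not match."""
    match = HISTOGRAM_LINE.match(line.strip())
    if match is None:
        return None
    count, feature, utf16 = match.groups()
    # A missing utf16 suffix is reported as 0 UTF-16 occurrences.
    return int(count), feature, int(utf16) if utf16 else 0
```

Lines that do not follow this layout, such as comment lines, yield None and can be skipped by the caller.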

bulk_extractor has multiple scanners that extract features. The scanners run in an

arbitrary order. Scanners can be enabled or disabled, which can be useful for debugging

and speed optimization. Some scanners are recursive and actually expand the data

they are exploring, thereby creating more data that bulk_extractor can analyze. The

blocks of data passed to scanners are called sbufs. The "s" stands for the word safe. All access to data in the sbuf

is bounds-checked, so buﬀer overﬂow events are very unlikely. The sbuf data structure

is one of the reasons that bulk_extractor is so crash resistant. Recursion is used for,

among other things, decompressing ZLIB and Windows HIBERFILE, extracting text

from PDFs and handling compressed browser cache data.
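The idea behind a bounds-checked buffer can be shown with a small Python sketch. It is not the sbuf implementation: the class name SafeBuffer and its methods are hypothetical, chosen only to illustrate that every read is validated and that a child buffer remembers where it started in its parent.

```python
class SafeBuffer:
    """Read-only view of a block of bytes with explicit bounds checks."""

    def __init__(self, data: bytes, start_offset: int = 0):
        self._data = data
        self.start_offset = start_offset  # where this block begins in the input

    def __len__(self) -> int:
        return len(self._data)

    def byte_at(self, index: int) -> int:
        # Reject any access outside the block, including negative indexes.
        if not 0 <= index < len(self._data):
            raise IndexError(f"index {index} outside buffer of {len(self._data)} bytes")
        return self._data[index]

    def slice(self, index: int, length: int) -> "SafeBuffer":
        """Return a child buffer; the requested range must lie inside this one."""
        if index < 0 or length < 0 or index + length > len(self._data):
            raise IndexError("requested range outside buffer")
        return SafeBuffer(self._data[index:index + length], self.start_offset + index)
```

A scanner written against such an interface cannot read past the end of its block; an out-of-range request raises an error instead of returning unrelated data.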

The recursion process requires a new way to describe oﬀsets. To do this, bulk_extractor

introduces the concept of the “forensic path.” The forensic path is a description of the

origination of a piece of data. It might come from, for example, a ﬂat ﬁle, a data stream,

or a decompression of some type of data. Consider an HTTP stream that contains a

GZIP-compressed email as shown in Figure 2. A series of scanners will ﬁrst ﬁnd the ZLIB

compressed regions in the HTTP stream that contain the email, decompress them, and

then ﬁnd the features in that email which may include email addresses, names and phone

numbers. Using this method, bulk_extractor can ﬁnd email addresses in compressed

data. The forensic path for the email addresses found indicates that they originated in an

email that was GZIP compressed and found in an HTTP stream. The forensic path of

the email address features found might be represented as follows:

11052168704-GZIP-3437    live.com    eMn='domexuser@live.com';var srf_sDispM

11052168704-GZIP-3475    live.com    pMn='domexuser@live.com';var srf_sDreCk

11052168704-GZIP-3512    live.com    eCk='domexuser@live.com';var srf_sFT='<

Figure 2: Forensic path of features found in email lead back to HTTP Stream
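A forensic path such as 11052168704-GZIP-3437 can be read as an offset in the original input, followed by a decoder name and an offset within the decoded data. The Python sketch below splits a path on that assumption. The function name parse_forensic_path is hypothetical, and the strict alternation of offsets and decoder names is an assumption drawn from the examples above.

```python
def parse_forensic_path(path: str):
    """Split a forensic path such as '11052168704-GZIP-3437' into steps.

    Returns (offset_in_image, [(decoder_name, offset_in_decoded_data), ...]).
    Assumes the components alternate offset, decoder, offset, ...
    """
    parts = path.split("-")
    if len(parts) % 2 == 0:
        raise ValueError(f"unexpected forensic path: {path!r}")
    base_offset = int(parts[0])
    steps = [(parts[i], int(parts[i + 1])) for i in range(1, len(parts), 2)]
    return base_offset, steps
```

For the first line above, the result is the image offset 11052168704 and a single step: a GZIP region in which the feature was found 3437 bytes into the decompressed data.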

The full functionality of bulk_extractor is provided both through command line opera-

tion and the GUI tool, Bulk Extractor Viewer. Both modes of operation work for

Linux, Mac and Windows. The following section describes how to download, install and

run bulk_extractor using either the command line or the Bulk Extractor Viewer.

3 Running bulk_extractor

bulk_extractor is a command line tool with an accompanying graphical user interface

tool, Bulk Extractor Viewer. All of the command line functionality of bulk_extractor

is also available in the Bulk Extractor Viewer. Users can access the functionality in

whichever way they prefer. In this manual we review the bulk_extractor user options in

both formats.

bulk_extractor can be run on a Linux, Mac OS X or Windows system. The fastest

way to run bulk_extractor is on a Linux system. Running bulk_extractor on Windows

provides the same results, but the run will typically take 40 percent longer on the same

hardware. The software can actually run faster on a Linux virtual machine running on

Windows with VMware workstation than on the native Windows OS.

3.1 Installation Guide

Installation instructions differ for Linux, Mac OS X and Windows users. The

following sections explain how to install bulk_extractor.

3.1.1 Installing on Linux or Mac OS X

Before compiling bulk_extractor for your platform, you may need to install other pack-

ages on your system which bulk_extractor requires to compile cleanly and with a full

set of capabilities.

Dependencies for Fedora

The following commands should add the appropriate packages:

 sudo yum update

 sudo yum groupinstall development-tools

 sudo yum install flex

Dependencies for Debian (wheezy) or Ubuntu (13.0)

The following command should add the appropriate libraries:

 sudo apt-get -y install gcc g++ flex libewf-dev

Dependencies for Mac OS X

Mac OS X users must ﬁrst install Apple’s Xcode application (available in the OS X App

store), and then install the command line tools. To install the command line tools in

Mavericks and Yosemite, enter this command in the terminal:

 xcode-select --install

Other components can be downloaded using the MacPorts system. To install MacPorts,

get the latest ports for your version of OS X here: http://macports.com. After the latest

ports are installed, you still need to make sure some optional packages are added using

these commands:

 sudo port install flex autoconf automake

 sudo port install libewf-devel

Mac OS X users should note that libewf-devel may not be available in ports, and at

present, libewf isn’t new enough. If the required version isn’t available as a port, then

download and un-tar the libewf source (for example, in /tmp), cd into the source direc-

tory and run:

 ./configure

 make

 sudo make install

Download and Install bulk_extractor

Next, download the latest version of bulk_extractor. The software can be downloaded

from http://digitalcorpora.org/downloads/bulk_extractor/. The ﬁle to down-

load will be called bulk_extractor-x.y.z.tar.gz where x.y.z is the latest version. As

of publication of this manual, the latest version of bulk_extractor is 1.5.

After downloading the tar.gz ﬁle, decompress and un-tar it. Then, cd into the newly cre-

ated bulk_extractor-x.y.z directory, and run the following commands to install bulk_extractor

in /usr/local/bin (by default):

 ./configure

 make

 sudo make install

With these instructions, the following directory will not be installed:

• plugins/ - This is for C/C++ developers only. You can develop your own bulk_extractor

plugins which will then be run at run-time with the bulk_extractor executable. Re-

fer to the bulk_extractor Programmers Manual for Developing Scanner

Plug-ins [?] for more information.

Instructions on running bulk_extractor from the command line can be found in Sub-

section 3.2.

The Bulk Extractor Viewer tool is installed as part of the above installation process.

Speciﬁc instructions on running it can be found in Subsection 3.3.

3.1.2 Installing on Windows

Windows users should download the Windows Installer for bulk_extractor. The ﬁle

to download is located at http://digitalcorpora.org/downloads/bulk_extractor/

and is called bulk_extractor-x.y.z-windowsinstaller.exe where x.y.z is the latest

version number (1.5.0 as of publication of this manual).

Next, run the bulk_extractor-x.y.z-windowsinstaller.exe ﬁle. This will automat-

ically install bulk_extractor on your machine. Because this ﬁle is not used by many

Windows users, some anti-virus systems will try to delete it on download or

block the download as shown in Figure 3. Be aware that you may have to work around

your anti-virus system. Additionally, some Windows versions will try to prevent you

from running it. Figure 4 shows the message Windows 8 displays when trying to run

the installer. To run anyway, click on “More info” and then select “Run Anyway.”

When the installer ﬁle is executed, the installation will begin and show a dialog like the

one shown in Figure 5. Users should select the default conﬁguration, which will be the

64-bit conﬁguration for 64-bit Windows systems, or the 32-bit conﬁguration for 32-bit

Windows systems. Click on “Install” and the installer will install bulk_extractor on your

system and then notify you when it is complete.

The automatic installation includes the Bulk Extractor Viewer tool as well as the

complete bulk_extractor system that can be run from the command line. Java 6 or above

must be installed on the machine for the Bulk Extractor Viewer to run. Instructions

on running bulk_extractor from the command line can be found in Subsection 3.2.

Instructions on running it from the Bulk Extractor Viewer are located in Subsec-

tion 3.3.

3.2 Run bulk_extractor from the Command Line

The two main parameters required to run bulk_extractor are an output directory and an

input. The output directory must be a directory that does not already exist. The

input can be either a file, such as a disk image, or a directory of individual files.

Note that bulk_extractor cannot process a directory of disk images.

In the following instructions, output is the name of the directory that will be created

to store the bulk_extractor output. The file mydisk.raw is the name of the disk image

that will be processed by bulk_extractor.

Figure 3: Anti-virus software, such as Symantec, often tries to block download of the

installer file

Figure 4: Windows 8 warning when trying to run the installer

Figure 5: Dialog appears when the user executes the Windows Installer

To run bulk_extractor from the command line on any machine, type the following com-

mand:

 bulk_extractor -o output mydisk.raw

The above command on any of the supported operating systems assumes that the disk

image mydisk.raw is located in the directory where the command is being executed.

However, you can point bulk_extractor to a disk image found elsewhere on your ma-

chine by explicitly entering the path to that image.
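When bulk_extractor is launched from a script, the same command can be assembled programmatically. The following Python sketch is a hypothetical wrapper, not something shipped with bulk_extractor. It uses only the -o option shown above, assumes the bulk_extractor executable is on the PATH, and checks up front that the output directory does not already exist.

```python
import os
import subprocess


def build_command(image_path: str, output_dir: str) -> list:
    """Build the bulk_extractor command line shown above."""
    # bulk_extractor creates the output directory itself, so it must not exist yet.
    if os.path.exists(output_dir):
        raise FileExistsError(f"output directory already exists: {output_dir}")
    return ["bulk_extractor", "-o", output_dir, image_path]


def run_bulk_extractor(image_path: str, output_dir: str) -> int:
    """Run bulk_extractor and return its exit status."""
    completed = subprocess.run(build_command(image_path, output_dir))
    return completed.returncode
```

Calling run_bulk_extractor("mydisk.raw", "output") corresponds to the command above. Passing the arguments as a list avoids shell quoting problems when the image path contains spaces.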

The following text shows the output that is produced when bulk_extractor is run on

the ﬁle nps-2010-emails.E01. The information printed indicates the version number,

input ﬁle, output directory and disk size. The screen is updated as bulk_extractor runs

with status information. bulk_extractor then prints performance information and the

number of features found when the run is complete.

C:\>bulk_extractor -o bulk_extractor\Output\nps-2010-emails bulk_extractor\InputData\nps-2010-emails.E01

bulk_extractor version: 1.5.0

Input file: bulk_extractor\InputData\nps-2010-emails.E01

Output directory: bulk_extractor\Output\nps-2010-emails

Disk Size: 10485760

Threads: 4

All data are read; waiting for threads to finish...

Time elapsed waiting for 1 thread to finish:

(timeout in 60 min .)

Time elapsed waiting for 1 thread to finish:

6 sec (timeout in 59 min 54 sec.)

Thread 0: Processing 0

All Threads Finished!

Producer time spent waiting: 0 sec.

Average consumer time spent waiting: 8.32332 sec.

Phase 2. Shutting down scanners

Phase 3. Creating Histograms

ccn histogram... ccn_track2 histogram... domain histogram...

email histogram... ether histogram... find histogram...

ip histogram... lightgrep histogram... tcp histogram...

telephone histogram... url histogram... url microsoft-live...

url services... url facebook-address... url facebook-id...

url searches...Elapsed time: 11.1603 sec.

Overall performance: 0.939557 MBytes/sec

Total email features found: 67

Note that bulk_extractor automatically chose to use 4 threads because the program was

run on a computer with 4 cores. In general, bulk_extractor automatically determines

the number of cores to use. Therefore, it is not necessary to set the number of threads

unless you want to limit the number to use.

After running bulk_extractor, examine the output directory specified by name in the run

command. There should now be a number of generated output ﬁles in that directory.

There are several categories of output created for each bulk_extractor run. First, there are feature files grouped by category, which contain the features found and include the path, feature and context. Second, there are histogram files that allow users to quickly see the features grouped by the frequency with which they occur. Certain kinds of files, such as JPEGs and KML files, may be carved into directories. Finally, bulk_extractor creates a file report.xml, in DFXML format, that captures the provenance of the run. After bulk_extractor has been run, all of these files will be found in the output directory specified by the user.

The text below shows the results of running the command ls -s within the output

directory from the bulk_extractor run on the disk image nps-2010-emails.E01. The

numbers next to the ﬁle names indicate the ﬁle size and show that several of the ﬁles,

including email.txt and domain.txt, were populated with features during the run.

C:\bulk_extractor\Output\nps-2010-emails>ls -s

total 303

0 aes_keys.txt 0 kml.txt

0 alerts.txt 0 lightgrep.txt

0 ccn.txt 0 lightgrep_histogram.txt

0 ccn_histogram.txt 0 rar.txt

0 ccn_track2.txt 8 report.xml

0 ccn_track2_histogram.txt 0 rfc822.txt

64 domain.txt 0 tcp.txt

1 domain_histogram.txt 0 tcp_histogram.txt

0 elf.txt 0 telephone.txt

16 email.txt 0 telephone_histogram.txt

4 email_histogram.txt 96 url.txt

0 ether.txt 0 url_facebook-address.txt

0 ether_histogram.txt 0 url_facebook-id.txt

1 exif.txt 4 url_histogram.txt

0 find.txt 0 url_microsoft-live.txt

0 find_histogram.txt 0 url_searches.txt

0 gps.txt 1 url_services.txt

0 hex.txt 0 vcard.txt

0 ip.txt 12 windirs.txt

0 ip_histogram.txt 0 winpe.txt

0 jpeg 0 winprefetch.txt

8 jpeg_carved.txt 88 zip.txt

0 json.txt

There are numerous feature files produced by bulk_extractor for each run. A feature file is a tab-delimited file that shows one feature on each row. Each row includes a path, a feature and the context. The files are in UTF-8 format.
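The row layout described above (path, feature, context, separated by tabs, UTF-8 encoded) can be read with a few lines of Python. The following is a minimal sketch, not part of bulk_extractor. The function names are hypothetical, the sample rows are made up, and skipping lines that start with '#' is an assumption about header comments.

```python
import io


def parse_feature_lines(lines):
    """Yield (path, feature, context) tuples from the lines of a feature file."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: blank lines and lines starting with '#' are not features.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a path/feature row; skip rather than guess
        path, feature = fields[0], fields[1]
        context = fields[2] if len(fields) > 2 else ""
        yield path, feature, context


def read_feature_file(filename):
    """Read a UTF-8 feature file such as email.txt into a list of tuples."""
    with open(filename, encoding="utf-8", errors="replace") as handle:
        return list(parse_feature_lines(handle))


# Hypothetical rows, made up for illustration only.
sample = io.StringIO(
    "# example header comment\n"
    "1024\tuser@example.com\tmail user@example.com now\n"
    "2048\tadmin@example.org\tfrom admin@example.org to\n"
)
rows = list(parse_feature_lines(sample))
for path, feature, context in rows:
    print(path, feature, context, sep=" | ")
```

Calling read_feature_file on email.txt in an output directory would return the same kind of tuples for a real run.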

Any of the feature files created by bulk_extractor may have an accompanying *_stopped.txt file in the output directory. This file shows all stopped entries of that type that were found, so that users can examine them to make sure nothing critical has been hidden. A stopped feature is a feature that appears in a stop list. The stop list is a list of features that are not of concern for a particular investigation. For example, users may give bulk_extractor a stop list file that contains numerous email addresses that should be ignored and not reported as found features. Rather than throwing those results away when they are found, bulk_extractor will create a file named email_stopped.txt that shows all email addresses from the stop list that were found during the run. The stopped email addresses will not show up in the email.txt file. More information on creating and using stop lists can be found in Subsection 4.4.

While the above commands are all that is required for basic operation, there are numerous usage options that allow the user to affect input and output, tuning, path processing mode, debugging, and control of scanners. All of those options are described when bulk_extractor is run with the -h (help) option. Users are often tempted to use many of these options; however, that is not generally recommended. Most of the time, the best way to run bulk_extractor is with no options specified other than -o to specify the output directory. For best performance and results, users should generally avoid adding other options. Only advanced users in specific cases should use them.

Running bulk_extractor with only the -h option specified produces the output shown in Appendix A. Any optional usage options should be inserted before the output and input are specified. Specifically, the order should look like the following:

 bulk_extractor [Usage Options] -o output mydisk.raw

The specific order in which multiple usage options are specified matters. Some of the options are discussed within the following sections for specific use cases; other options are for programmer or experimental use. In general, avoid using the options unless indicated for a specific purpose.

3.3 Run bulk_extractor from Bulk Extractor Viewer

On a Linux or Mac OS X system, go to the directory where the Bulk Extractor Viewer

is installed or specify the full path name to the jar ﬁle. It will be in the location where

the bulk_extractor code was installed and in the sub-directory labeled java_gui. From

that directory, run the following command to start the Bulk Extractor Viewer:

 ./BEViewer

3.4 Run bulk_extractor from Bulk Extractor Viewer

Windows users should go to the Start menu and choose Programs->Bulk_Extractor

x.y.z->BE Viewer with Bulk_extractor x.y.z (64-bit). If the 64-bit version can not be

run on your machine, you can choose the 32-bit version. The Troubleshooting section describes some limits users of the 32-bit version might encounter.

Figure 6: What Bulk Extractor Viewer looks like when it is started

When the Bulk Extractor Viewer starts up, it will look like Figure 6. The look and

feel may vary slightly according to the speciﬁc operating system but all options should

appear similar. To run bulk_extractor from the viewer, click on the icon that looks like

a gear with a down arrow. It is next to the Print icon below the Tools menu. Clicking

on this icon will bring up the “Run bulk_extractor” Window as shown in Figure 7.

Next, in the “Run bulk_extractor” window select the Image File and Output Feature

Directory to run bulk_extractor. Figure 8 shows an example where the user has selected

the ﬁle nps-2010-emails.E01 as input and is going to create a directory called nps-2010-

charlie-output in the parent directory C:\bulk_extractor\Output. Note that ﬁgures may

vary slightly in future versions of bulk_extractor but the major functionality will remain

the same.

After selecting the input and output directories, click on the button at the bottom of the

“Run bulk_extractor” window labeled “Start bulk_extractor.” This will bring up the

window shown in Figure 9 that updates as bulk_extractor is running, providing status

information during the run and after the run is complete.

When the run is complete, a dialog will pop up indicating the results are ready to be viewed. Figure 10 shows this dialog. Click the “Ok” button, which will return you to the main Bulk Extractor Viewer window to view the results of the run. The “Reports” window on the left will now show the newly created report. In this example, the report is called “nps-2010-emails-output.” Clicking once on this report name will expand the report and show all of the files that have been created as shown in Figure 11.

Figure 7: Clicking on the gear icon brings up this “Run bulk_extractor” Window

Figure 8: After selecting an Image File for input, the user must select an output directory to create

Figure 9: Status window that shows what happens as bulk_extractor runs and indicates when bulk_extractor is complete

Figure 10: Dialog indicating the run of bulk_extractor is complete and results are ready to be viewed

Clicking on one of the ﬁles will bring that ﬁle up in the “Feature File” window in the

middle of the screen. In the example, the user clicked on email.txt to view the email

feature ﬁle. Clicking on one of the features, in this case [email protected], shows

the feature in context within the feature ﬁle on the right-hand side of the window as

shown in Figure 12.

The user can also view histogram ﬁles in the Bulk Extractor Viewer. Clicking on

the ﬁle, email_histogram.txt in the Reports window on the left hand side will bring

up the contents of the histogram ﬁle in the middle window. It will also display the

referenced feature ﬁle in the window below the histogram ﬁle. In this case, the refer-

enced feature ﬁle is email.txt. Clicking on a feature in the histogram, in this example

[email protected], will display the feature in context as found within the feature

ﬁle on the right-hand side of the screen as shown in Figure 13.

4 Processing Data

4.1 Types of Input Data

The bulk_extractor system can handle multiple image formats including E01, raw, split

raw and individual disk ﬁles as well as raw devices or ﬁles. It can also operate on mem-

ory and packet captures, although packet captures will be more completely extracted if

you pre-process them with tcpﬂow.

The scanners all serve diﬀerent functions and look for diﬀerent types of information.

Often, a feature will be stored in a format not easily accessible and will require multiple

scanners to extract the feature data. For example, some PDF ﬁles contain text data

but the PDF format is not directly searchable by the scanner that ﬁnds email addresses

or the scanner that looks for keywords. bulk_extractor resolves this by having the two

scanners work together. The pdf scanner will ﬁrst extract all of the text from the PDF

and then the other scanners will look at the extracted text for features. This is important

to remember when turning scanners oﬀ and on, as scanners work together to retrieve the

features from the disk image. The types of information examined, extracted or carved by the existing bulk_extractor scanners are described in Table 1, along with the scanners that process them and the specific sections where they are referenced in this manual.

Figure 11: Reports window shows the newly created report and all of the files created in that report

Figure 12: While viewing the feature file, the user can select a feature to view with its full context in the feature file as shown in the right-hand side of the window

Figure 13: User can view histograms of features, referenced feature files and specific features in context

Table 1: Input Data Processed by the Scanners

Scanner Name   Data Type                                            Section Discussed in Manual

accts          Numeric accounts, such as phone numbers and CCNs
aes            In-memory AES keys from their key schedules          Subsection 5.2
base16         Base 16 (hex) encoded data (includes MD5 codes       Subsection 5.2
               embedded in the data)
base64         Base 64 code                                         Subsection 4.6 and Subsection 5.2
elf            Executable and Linkable Format (ELF)                 Subsection 5.1
exif           EXIF structures from JPEGs (and carving of JPEG      Subsection 5.5
               files)
facebook       Facebook HTML
gps            XML from Garmin GPS devices (processed)              Subsection 5.3
gzip           GZIP files and ZLIB-compressed GZIP streams          Subsection 4.6 and Subsection 5.2
hashdb         NPS Hash Database support
hiber          Windows Hibernation File Fragments (decompressed     Subsection 4.6
               and processed, not carved)
httplogs       HTTP log files
jpeg           JPEG carving. Default is only encoded JPEGs are      Subsection 4.3 and Subsection 5.5
               carved. JPEGs without EXIFs are also carved
json           JavaScript Object Notation files and objects         Subsection 5.1
               downloaded from web servers, as well as JSON-like
               objects found in source code
kml            KML files (carved)                                   Subsection 5.3
outlook        Outlook Compressible Encryption
pdf            Text from PDF files (extracted for processing,       Subsection 4.6
               not carved)
rar            RAR components in unencrypted archives are           Subsection 4.3
               decompressed and processed. Encrypted RAR files
               are carved.
sqlite         SQLite3 database file detection and carving
vcard          vCard files (carved)                                 Subsection 5.3
windirs        Windows FAT32 and NTFS directory entries             Subsection 5.2
winlnk         Windows LNK file carving and decoding
winpe          Windows Portable Executable (PE) files (.exe and     Subsection 5.1
               .dll files, annotated with the MD5 hash of the
               first 4k)
winprefetch    Windows Prefetch files, file fragments (processed)   Subsection 5.1
zip            ZIP files and zlib streams (processed, and           Subsection 4.3 and Subsection 4.6
               optionally carved)

4.2 Scanners

There are multiple scanners deployed with the bulk_extractor system. For a detailed list

of the scanners installed with your version of bulk_extractor, run the following command:

 bulk_extractor -H

This command will show all of the scanners installed with additional information in-

cluded about each scanner. Speciﬁcally, there is a description for each scanner, a list of

the features it ﬁnds and any relevant ﬂags. A sample of the output is below:

Scanner Name: accts

flags: NONE

Scanner Interface version: 3

Author: Simson L. Garfinkel

Description: scans for CCNs, track 2, and phone #s

Scanner Version: 1.0

Feature Names: alerts ccn ccn_track2 telephone

Scanner Name: base16

flags: SCANNER_RECURSE

Scanner Interface version: 3

Author: Simson L. Garfinkel

Description: Base16 (hex) scanner

Scanner Version: 1.0

Feature Names: hex

...

Scanner Name: wordlist

flags: SCANNER_DISABLED

Scanner Interface version: 3

Author:

Description:

Scanner Version:

Feature Names: wordlist

This output shows that the accts scanner looks for credit card numbers, credit card track

2 information and phone numbers and ﬁnds the feature names alerts, ccn, ccn_track2

and telephone. This means it writes to the feature ﬁles alerts.txt, ccn.txt, ccn_track2.txt,

and telephone.txt.

The output also shows that the base16 scanner is a recursive scanner (indicated by

the ﬂag SCANNER_RECURSE) meaning it expands data or ﬁnds new data for other

scanners to process. It also writes to the ﬁle hex.txt.

Finally, the output shows that the wordlist scanner is disabled by default (indicated

by the ﬂag SCANNER_DISABLED). This means that if the user would like to use the

wordlist scanner, it will have to be speciﬁcally enabled. The wordlist scanner is useful

for password cracking and is discussed in Subsection 5.4.
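The listing format shown in the sample can be turned into structured records, for instance to see which feature files a scanner writes. The sketch below is illustrative only. It assumes the "Key: value" layout of the sample above, and the function names are hypothetical rather than part of bulk_extractor.

```python
def parse_scanner_listing(text):
    """Parse 'Key: value' lines into one dict per 'Scanner Name' block."""
    scanners = []
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a 'Key: value' line (for example '...')
        key, value = key.strip(), value.strip()
        if key == "Scanner Name":
            current = {"name": value, "flags": [], "features": []}
            scanners.append(current)
        elif current is None:
            continue  # ignore anything before the first scanner block
        elif key == "flags":
            current["flags"] = [] if value == "NONE" else value.split()
        elif key == "Feature Names":
            current["features"] = value.split()
        elif key == "Description":
            current["description"] = value
    return scanners


def feature_files(scanner):
    """Each feature name corresponds to a <name>.txt feature file."""
    return [name + ".txt" for name in scanner["features"]]


# Abbreviated copy of the sample output shown above.
SAMPLE = """\
Scanner Name: accts
flags: NONE
Description: scans for CCNs, track 2, and phone #s
Feature Names: alerts ccn ccn_track2 telephone
Scanner Name: base16
flags: SCANNER_RECURSE
Description: Base16 (hex) scanner
Feature Names: hex
Scanner Name: wordlist
flags: SCANNER_DISABLED
Description:
Feature Names: wordlist
"""

listing = parse_scanner_listing(SAMPLE)
for scanner in listing:
    state = "disabled" if "SCANNER_DISABLED" in scanner["flags"] else "enabled"
    print(scanner["name"], state, feature_files(scanner))
```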

In general, most users will not need to enable or disable scanners. The default settings

installed with the bulk_extractor system work best for the majority of users. However,

individual scanners can be enabled or disabled for diﬀerent purposes. To enable the

wordlist scanner, which is disabled by default, use the following command:

 bulk_extractor -e wordlist -o output diskimage.raw

Additionally, users can disable a scanner that is enabled by default. Most of the scanners

are enabled by default. To disable the accts scanner, which is very CPU intensive, run

the following command:

 bulk_extractor -x accts -o output diskimage.raw

The option -E disables all scanners, then enables the one that follows it. For example, to disable all scanners except the aes scanner, use the following command:

 bulk_extractor -E aes -o output diskimage.raw

The options -E, -e and -x are all processed in order. So, the following command will also

disable all scanners and then enable the aes scanner:

 bulk_extractor -x all -e aes -o output diskimage.raw
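The in-order processing of -E, -e and -x can be modeled as a small state machine over the set of enabled scanners. This is a sketch of the behavior described above, not bulk_extractor's implementation. The scanner names and the default set used here are a hypothetical subset chosen for illustration.

```python
def apply_scanner_options(defaults, all_scanners, options):
    """Return the set of enabled scanners after applying options in order.

    options is a list of (flag, name) pairs such as [("-x", "all"), ("-e", "aes")].
    """
    enabled = set(defaults)
    for flag, name in options:
        if flag == "-E":
            # Disable everything, then enable only the named scanner.
            enabled = {name}
        elif flag == "-e":
            enabled.add(name)
        elif flag == "-x":
            if name == "all":
                enabled.clear()
            else:
                enabled.discard(name)
        else:
            raise ValueError("unsupported option: " + flag)
    return enabled & set(all_scanners)


# Hypothetical subset of scanners; wordlist is disabled by default.
ALL = {"accts", "aes", "base16", "wordlist"}
DEFAULTS = ALL - {"wordlist"}

print(apply_scanner_options(DEFAULTS, ALL, [("-E", "aes")]))
print(apply_scanner_options(DEFAULTS, ALL, [("-x", "all"), ("-e", "aes")]))
```

Both calls yield only the aes scanner, which mirrors the equivalence of the two commands. Reversing the second option list would leave nothing enabled, which is why the order matters.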

Some of the scanners installed with bulk_extractor have parameters that can be set and

utilized by advanced users for diﬀerent purposes. Those parameters are also described

in the -H output described above (as well as the -h output) and include the following:

Settable Options (and their defaults):

-S work_start_work_end=YES Record work start and end of each scanner in report.xml file ()

-S enable_histograms=YES Disable generation of histograms ()

-S debug_histogram_malloc_fail_frequency=0 Set >0 to make histogram maker fail with memory allocations ()

-S hash_alg=md5 Specifies hash algorithm to be used for all hash calculations ()

-S dup_data_alerts=NO Notify when duplicate data is not processed ()

-S write_feature_files=YES Write features to flat files ()

-S write_feature_sqlite3=NO Write feature files to report.sqlite3 ()

-S report_read_errors=YES Report read errors ()

-S ssn_mode=0 0=Normal; 1=No ‘SSN’ required; 2=No dashes required (accts)

-S min_phone_digits=6 Min. digits required in a phone (accts)

-S carve_net_memory=NO Carve network memory structures (net)

-S word_min=6 Minimum word size (wordlist)

-S word_max=14 Maximum word size (wordlist)

-S max_word_outfile_size=100000000 Maximum size of the words output file (wordlist)

-S wordlist_use_flatfiles=NO Override SQL settings and use flatfiles for wordlist (wordlist)

-S hashdb_mode=none Operational mode [none|import|scan]

none - The scanner is active but performs no action.

import - Import block hashes.

scan - Scan for matching block hashes. (hashdb)

-S hashdb_block_size=4096 Hash block size, in bytes, used to generte hashes (hashdb)

-S hashdb_ignore_empty_blocks=YES Selects to ignore empty blocks. (hashdb)

-S hashdb_scan_path_or_socket=your_hashdb_directory File path to a hash database or

socket to a hashdb server to scan against. Valid only in scan mode. (hashdb)

-S hashdb_scan_sector_size=512 Selects the scan sector size. Scans along

sector boundaries. Valid only in scan mode. (hashdb)

-S hashdb_import_sector_size=4096 Selects the import sector size. Imports along

sector boundaries. Valid only in import mode. (hashdb)

-S hashdb_import_repository_name=default_repository Sets the repository name to

attribute the import to. Valid only in import mode. (hashdb)

-S hashdb_import_max_duplicates=0 The maximum number of duplicates to import

for a given hash value, or 0 for no limit. Valid only in import mode. (hashdb)

-S exif_debug=0 debug exif decoder (exif)

-S jpeg_carve_mode=1 0=carve none; 1=carve encoded; 2=carve all (exif)

-S min_jpeg_size=1000 Smallest JPEG stream that will be carved (exif)

-S zip_min_uncompr_size=6 Minimum size of a ZIP uncompressed object (zip)

-S zip_max_uncompr_size=268435456 Maximum size of a ZIP uncompressed object (zip)

-S zip_name_len_max=1024 Maximum name of a ZIP component filename (zip)

-S unzip_carve_mode=1 0=carve none; 1=carve encoded; 2=carve all (zip)

-S rar_find_components=YES Search for RAR components (rar)

-S raw_find_volumes=YES Search for RAR volumes (rar)

-S unrar_carve_mode=1 0=carve none; 1=carve encoded; 2=carve all (rar)

-S gzip_max_uncompr_size=268435456 maximum size for decompressing GZIP objects (gzip)

-S pdf_dump=NO Dump the contents of PDF buffers (pdf)

-S opt_weird_file_size=157286400 Weird file size (windirs)

-S opt_weird_file_size2=536870912 Weird file size2 (windirs)

-S opt_max_cluster=67108864 Ignore clusters larger than this (windirs)

-S opt_max_cluster2=268435456 Ignore clusters larger than this (windirs)

-S opt_max_bits_in_attrib=3 Ignore FAT32 entries with more attributes set than this (windirs)

-S opt_max_weird_count=2 Ignore FAT32 entries with more things weird than this (windirs)

-S opt_last_year=2019 Ignore FAT32 entries with a later year than this (windirs)

-S xor_mask=255 XOR mask string, in decimal (xor)

-S sqlite_carve_mode=2 0=carve none; 1=carve encoded; 2=carve all (sqlite)

To use any of these options, the user should specify the -S with the name=value pair

when running bulk_extractor as in the following example:

 bulk_extractor -S name=value -o output diskimage.raw

As with the other scanner and bulk_extractor usage options, most users will not have

to use any of these options.
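When bulk_extractor is launched from a script, the -S name=value pairs can be assembled programmatically so that they always precede -o and the image. The helper below is a hypothetical sketch that only builds the argument list in the order shown in this manual. It does not validate option names against the list above.

```python
def build_command(image, output_dir, settings=None):
    """Build an argument list: bulk_extractor [-S name=value ...] -o output image."""
    argv = ["bulk_extractor"]
    for name, value in (settings or {}).items():
        argv += ["-S", "{}={}".format(name, value)]
    # Options come first; the output directory and the image come last.
    argv += ["-o", output_dir, image]
    return argv


command = build_command(
    "diskimage.raw",
    "output",
    {"jpeg_carve_mode": 2, "min_phone_digits": 6},
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the Python standard library. The output directory must not already exist when bulk_extractor starts.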

4.3 Carving

File carving is a special kind of carving in which ﬁles are recovered. File carving is use-

ful for both data recovery and forensic investigations because it can recover ﬁles when

sectors containing ﬁle system metadata are either overwritten or damaged [?]. Cur-

rently, bulk_extractor provides carving of contiguous JPEG, ZIP and RAR ﬁles. To

carve fragmented ﬁles we recommend PhotoRec (free) or Adroit Photo Recovery

(commercial). Additionally, Forensic Toolkit and EnCase Forensic provide some

carving capability on fragmented ﬁles.

Carved results are stored in two different places. First, a list of all the files that are carved is written to a corresponding .txt file: JPEG files to jpeg_carved.txt, ZIP files to unzip.txt and RAR files to unrar.txt. Second, the carved JPEG, ZIP and RAR files are placed in binned directories that are named /jpeg, /unzip and /unrar respectively. For example, all carved JPEGs will go in the directory /jpeg. The output files are further binned with 1000 files in each directory. The directory names are 3 decimal digits. If there are more than 999,000 carved files of one type, then the next set of directories is named with 4 digits. File names for JPEGs are the forensicpath.jpg. File names for the ZIP carver are the forensicpath_filename. If the ZIP file name has slashes in it (denoting directories), they are turned into ’_’ (underbars). For example, the file mydocs/output/specialfile will be named mydocs_output_specialfile.
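The naming rules above can be expressed as three small functions. This is a sketch under stated assumptions, not the carver's actual code: the function names are hypothetical, the forensic path value is made up, and numbering bins from 000 by integer division is an assumption about how the 1000-file bins are counted.

```python
def carved_jpeg_name(forensic_path):
    """JPEG carve name: the forensic path with a .jpg extension."""
    return "{}.jpg".format(forensic_path)


def carved_zip_name(forensic_path, member_name):
    """ZIP carve name: forensicpath_filename, with slashes turned into underbars."""
    return "{}_{}".format(forensic_path, member_name.replace("/", "_"))


def bin_directory(file_index, files_per_dir=1000):
    """Directory name for the Nth carved file, zero-padded to at least 3 digits.

    Assumption: bins are numbered from 000 upward, 1000 files per bin.
    """
    return "{:03d}".format(file_index // files_per_dir)


# Hypothetical forensic path value, for illustration only.
print(carved_zip_name("1000", "mydocs/output/specialfile"))
print(carved_jpeg_name("1000"))
print(bin_directory(0), bin_directory(1999), bin_directory(1000000))
```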

Table 2: There are three carving modes in bulk_extractor that are specified separately for each file type, JPEG, ZIP and RAR.

Mode   Mode Description

0      Do not carve files of the specified type.
1      Only carve encoded files of the specified type.
2      Carve everything of the specified type.

As the above table describes, there are three carving modes in bulk_extractor that can

be speciﬁed separately for each ﬁle type, JPEG, ZIP or RAR. The ﬁrst mode, mode 0,

explicitly tells bulk_extractor not to carve ﬁles of that type. The second mode, mode

1, is on by default and tells bulk_extractor to carve only encoded ﬁles of that type. If

the user is running the ZIP carver in mode 1 and there is a simple ZIP ﬁle, it will not

get carved. However, if there is an encoded attachment of that ﬁle (like Base64) it will

get carved. The ﬁnal mode, mode 2, will carve everything of that type. There is no way

to specify which types of ﬁles (particular extensions) will get carved and which will not

in mode 2. For example, bulk_extractor will carve both JPEGs and doc ﬁles. It carves

whatever is encountered.
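The three modes reduce to a simple decision on whether a discovered object was found inside encoded data (such as a Base64 attachment). The function below is a hypothetical illustration of that decision, not code from bulk_extractor.

```python
def should_carve(mode, is_encoded):
    """Decide whether an object is carved under carve mode 0, 1 or 2."""
    if mode == 0:
        return False        # carving disabled for this file type
    if mode == 1:
        return is_encoded   # default: carve only objects found in encoded data
    if mode == 2:
        return True         # carve everything of this type
    raise ValueError("carve mode must be 0, 1 or 2")


# A plain ZIP file on disk versus a ZIP found inside a Base64 attachment.
for mode in (0, 1, 2):
    print(mode, should_carve(mode, is_encoded=False), should_carve(mode, is_encoded=True))
```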

To specify the carving modes for bulk_extractor, use command line arguments. To modify the JPEG carving mode, type the following, where mode 1 (carve encoded) is the default and does not need to be specified, 0 means no carving and 2 means carve everything:

 bulk_extractor -S jpeg_carve_mode=1 -o output diskimage.raw

To modify the ZIP carving modes, type the following where carve mode 1=default value

that does not need to be speciﬁed (carve encoded), 0=no carving or 2=carve everything:

 bulk_extractor -S unzip_carve_mode=1 -o output diskimage.raw

To modify the RAR carving modes, type the following where carve mode 1=default value

that does not need to be speciﬁed (carve encoded), 0=no carving or 2=carve everything:

 bulk_extractor -S unrar_carve_mode=1 -o output diskimage.raw

Any combination of the carving mode options can be specified for a given run. For example, the JPEG carver can be run in mode 2 while RAR carving is turned off with mode 0 and the ZIP carver carves only encoded files in mode 1.

Because bulk_extractor can carve files and preserve original file extensions, there is a real possibility that bulk_extractor might be carving out malware. There is no protection in bulk_extractor against putting malware in a file on your hard drive. Users running bulk_extractor to look for malware should turn off all anti-virus software, because the anti-virus program will think bulk_extractor is creating malware and stop it. The user should then carefully scan the results for malware before re-enabling the anti-virus software.

4.4 Suppressing False Positives

Modern operating systems are ﬁlled with email addresses. They come from Windows

binaries, SSL certiﬁcates and sample documents. Most of these email addresses, par-

ticularly those that occur the most frequently, such as [email protected], are not

relevant to the case. It is important to be able to suppress those email addresses not

relevant to the case. To address this problem, bulk_extractor provides two approaches.

First, bulk_extractor allows users to build a stop list or use an existing one available for download. These stop lists are used to recognize and dismiss the email addresses that are native to the operating system. This approach works well for email addresses that are clearly invalid, such as [email protected]. For most email addresses, however, you will want to stop them in some circumstances but not others. For example, there are over 20,000 Linux developers; you want to stop their email addresses in program binaries, not in email messages.

Second, to address this problem, bulk_extractor uses context-sensitive stop lists. Instead of a stop list of features, this approach uses the feature+context. The following example is an excerpt from a context-sensitive stop list file.

ubuntu-users@lists.ubuntu.com Maint\x0A935261357\x09ubuntu-users@lists.ubuntu.com\x0
ubuntu-motu@lists.ubuntu.com untu_\x0A923867047\x09ubuntu-motu@lists.ubuntu.com\x09
pschiffe@redhat.com Peter Schiffer - 0.8-1.1 N\x94/\xC0-
phpdevel@echospace.com : Vlad Krupin <phpdevel@echospace.com>\x0AMAINTENANCE:
anholt@freebsd.org 34-GZIP-1021192\x09anholt@freebsd.org\x09r: EricAnholt
ubuntu-motu@lists.ubuntu.com http\x0A938966489\x09ubuntu-motu@lists.ubuntu.com\x09

The context for the feature is the 8 characters on either side of the feature. Each “stop

list” entry is the feature+context. This ignores Linux developer email addresses in Linux

binaries. The email address will be ignored if found in that context but reported if it

appears in a diﬀerent context.
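The difference between a plain stop list and a context-sensitive one comes down to what is compared: the feature alone, or the feature together with its context. The sketch below illustrates that matching logic only. The tab separator between feature and context, the function names and the sample entries are all assumptions made for illustration, not bulk_extractor's file format or code.

```python
def load_stop_list(lines):
    """Split stop-list lines into plain features and (feature, context) pairs."""
    plain, contextual = set(), set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Assumption: a tab separates the feature from its context.
        feature, sep, context = line.partition("\t")
        if sep:
            contextual.add((feature, context))
        else:
            plain.add(feature)
    return plain, contextual


def is_stopped(feature, context, plain, contextual):
    """A feature is stopped if listed alone, or listed with this exact context."""
    return feature in plain or (feature, context) in contextual


# Hypothetical entries, made up for illustration only.
plain, contextual = load_stop_list([
    "www.example.com\n",
    "dev@example.org\tMaintainer: dev@example.org (pkg)\n",
])

print(is_stopped("dev@example.org", "Maintainer: dev@example.org (pkg)", plain, contextual))
print(is_stopped("dev@example.org", "From: dev@example.org Subject", plain, contextual))
```

The same address is suppressed when it appears in the listed context and reported when it appears in a different one.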

There is a context-sensitive stop list for Microsoft Windows XP, 2000, 2003, Vista and several Linux systems. The total stop list is 70 MB and includes 628,792 features in a 9 MB zip file. The context-sensitive stop list prunes many of the OS-supplied features. By applying it to the domexusers HD image (the image can be downloaded at http://digitalcorpora.org/corp/nps/drives/nps-2009-domexusers/), the number of emails found went from 9,143 down to 4,459. This significantly reduces the amount of work to be done by the investigator. Figure 14 shows how the histogram of email addresses differs when bulk_extractor is run with and without the context-sensitive stop list. The context-sensitive stop list built for the various operating systems described above can be downloaded from http://digitalcorpora.org/downloads/bulk_extractor. The file will have the word “stoplist” somewhere in its name. The current version as of publication of this manual is called bulk_extractor-3-stoplist.zip.

Figure 14: Email Histogram Results With and Without the Context-Sensitive Stop

List. Results from the Domexusers HD image.

It should be noted that bulk_extractor does allow the users to create stop lists that

are not context sensitive. A stop list can simply be a list of words that the user wants

bulk_extractor to ignore. For example, the following three lines would constitute a valid

stop list ﬁle:

[email protected]

www.google.com

However for the reasons stated above, it is recommended that users rely on context-

sensitive stop lists when available to reduce the time required to analyze the results of

a bulk_extractor run.

Stopped results are not completely hidden from users. If stopped features are discovered, they will be written to the appropriate category feature file with the suffix _stopped.txt. For example, stopped domain names that are found in the disk image will be written to domain_stopped.txt in the output directory. The stopped files allow users to verify that bulk_extractor is functioning properly and that the lists they have written are being processed correctly.

4.5 Using an Alert List

Specific words or features in a given context might be important to a user’s investigation. The alert list can contain a list of words and/or feature file names, and when a match is found, bulk_extractor will alert the user. Feature file alerts work similarly to context-sensitive stop lists: bulk_extractor will only alert on a specified feature when it is found in the specified context.

A sample alert list ﬁle might look like the following:

[email protected]

SilentFury2012

www.maliciousintent.com

While this list does not appear to help in any particular investigation, it demonstrates that users can specify distinct words that are important to their analysis. Results containing the alert list information are found in the file alerts.txt in the bulk_extractor output directory.

4.6 The Importance of Compressed Data Processing

Many forensic tools frequently miss case-critical data because they do not examine certain classes of compressed data. For example, a recent study of 1400 drives found thousands of email addresses that were compressed (and happened to be in unallocated space) [?]. Without looking at all the data on each drive and optimistically decompressing it, critical features might be missed. Compressed email addresses, such as those in a GZIP file, do not look like email addresses to a scanner; they must first be decompressed to be identified. Although some of these features are from software distributions, many are not. Table 3 shows the kinds of encodings that can be decoded by bulk_extractor [?].

Table 3: The kinds of encodings that can be decoded by bulk_extractor and the amount of context required for the decoding

Encoding   Can be decoded when bulk_extractor finds

GZIP       The beginning of a zlib-compressed stream
BASE64     The beginning of a BASE64-encoded stream
HIBER      Any fragment of a hibernation file can generally be decompressed,
           as each Windows 4k page is separately compressed and the beginning
           of each compressed page in the hibernation file is indicated by a
           well-known sequence
PDF        Any PDF stream compressed with ZLIB bracketed by stream and
           endstream
ZIP        The local file header of a ZIP-file component

Users must be aware of this because they tend to enable and disable scanners for specific uses, and doing so can unintentionally compromise the results. For example, if a user only wants to find email addresses, they may try to turn off all scanners except the email scanner. This will find some email addresses. However, it will miss the email addresses on the media that are only present in compressed data, because scanners such as zip, rar and gzip will not be running. Those scanners each work on a different type of compressed data. For example, the gzip scanner will find GZIP compressed data, decompress it and then pass it to other scanners to search for features. In that way, GZIP compressed emails can be processed by bulk_extractor.
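The effect described above can be reproduced on a small buffer: a pattern search over the raw bytes does not see an address stored inside a GZIP stream, while a search that first decompresses any stream it can find does. The following is a simplified sketch using the Python standard library. The regular expression, the function names and the sample data are illustrative assumptions and do not reflect how the bulk_extractor scanners are implemented.

```python
import gzip
import re
import zlib

EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
GZIP_MAGIC = b"\x1f\x8b\x08"  # GZIP header bytes for a deflate-compressed stream


def find_emails(buf):
    """Search a raw byte buffer for strings that look like email addresses."""
    return set(EMAIL_RE.findall(buf))


def find_emails_with_gzip(buf):
    """Also decompress every GZIP stream found in buf and search the result."""
    found = find_emails(buf)
    pos = buf.find(GZIP_MAGIC)
    while pos != -1:
        try:
            inflater = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
            found |= find_emails(inflater.decompress(buf[pos:]))
        except zlib.error:
            pass  # the magic bytes were not the start of a valid stream
        pos = buf.find(GZIP_MAGIC, pos + 1)
    return found


# Made-up "disk" contents: padding around one compressed block of text.
text = b"Please contact alice@example.com today. " * 10
disk = b"\x00" * 64 + gzip.compress(text) + b"\x00" * 64

print(b"alice@example.com" in find_emails(disk))
print(b"alice@example.com" in find_emails_with_gzip(disk))
```

Disabling the decompression step in this sketch corresponds to running with only the email scanner enabled: the search still runs, but the compressed address is no longer found.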

The pdf scanner is another type of scanner that finds text that otherwise wouldn’t be found. While PDF files are human readable, they are not readable by many software tools and scanners because of their formatting. The pdf scanner extracts some kinds of text found within PDFs and then passes that text on to other scanners for further processing. Many typical disk images include PDF files, so most users will want to have this scanner enabled (as it is by default).

Finally, the hiber scanner decompresses Windows hibernation ﬁles. If the disk image

being analyzed is from a Windows system, bulk_extractor users will want that turned

on (as it is by default). The scanner is very fast, however, so it will not signiﬁcantly

decrease performance on non-Windows drives.

5 Use Cases for bulk_extractor

There are many digital forensic use cases for bulk_extractor— more than we can enu-

merate within this manual. In this section we highlight some of the most common uses

of the system. Each case discusses which output ﬁles, including feature ﬁles and his-

tograms, are most relevant to these types of investigations. In Section 8, Worked

Examples, we provide more detailed walk-throughs and refer back to these use cases

with more detailed output ﬁle information.

5.1 Malware Investigations

Malware is a programmatic intrusion. When performing a malware investigation, users

will want to look at executables, information that has been downloaded from web-

based applications and windows directory entries (for Windows-speciﬁc investigations).

bulk_extractor enables this in several ways.

First, bulk_extractor ﬁnds evidence of virtually all executables on the hard drive includ-

ing those by themselves, those contained in ZIP ﬁles, and those that are compressed.

It does not give you the hash value of the full file; rather, it gives the hash of just the

ﬁrst 4KB of the ﬁle. Our research has shown that the ﬁrst 4KB is predictive because

most executables have a distinct hash value for the ﬁrst 4KB of the ﬁle [?]. Additionally,

many of these ﬁles may be fragmented and looking at the ﬁrst 4KB will still provide

information relevant to an investigation because fragmentation is unlikely to happen

before the first 4KB. The full hash of a fragmented file is not available in bulk_extractor.
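The idea of hashing only the first 4KB of a file can be sketched in a few lines of Python. This is an illustration rather than bulk_extractor's own code: the function name first_4k_hash and the choice of MD5 are assumptions made for the example, and the algorithm should match whatever reference hash set is being compared against.

```python
import hashlib

def first_4k_hash(path, algorithm="md5"):
    """Hash only the first 4 KiB (4096 bytes) of a file.

    The default algorithm is an assumption for illustration only.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        digest.update(f.read(4096))  # files shorter than 4 KiB are hashed in full
    return digest.hexdigest()
```

Because only the leading block is read, the result is the same whether or not the rest of the file is fragmented or missing.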

Several output feature ﬁles produced by bulk_extractor contain relevant and important

information about executables. These ﬁles include:

• elf.txt — This ﬁle (produced by the elf scanner) contains information about

ELF executables that can be used to target Linux and Mac OS X systems.

• winprefetch.txt — This file (produced by the winprefetch scanner) lists the

current and deleted ﬁles found in the Windows prefetch directory.

The XML in these feature ﬁles is too complicated to review without using other ap-

plications. The recommended way to analyze the executable output is to use a third

party tool that analyzes executables or pull the results into a spreadsheet. In a spread-

sheet, one column could contain the hash values and those values can be compared

against a database of executable hashes. There is also a python tool that comes with

bulk_extractor called identify_ﬁlenames.py that can be used to get the full ﬁlename

of the ﬁle. The python tool is discussed in more detail in Section 7.

For Windows-specific malware investigations, the files winpe.txt and winprefetch.txt

are very useful. They are produced by the winpe and winprefetch scanners respec-

tively. Windows Prefetch shows ﬁles that have been prefetched in the Windows prefetch

directory and shows the deleted ﬁles that were found in unallocated space. The Windows

PE feature ﬁle shows entries related to the Windows executable ﬁles.

JSON, the JavaScript Object Notation, is a lightweight data-interchange format. Web-

sites tend to download a lot of information using JSON. The output ﬁle json.txt,

produced by the json scanner, can be useful for malware investigations and analysis of

web-based applications. If a website has downloaded information in JSON format, the

JSON scanner may ﬁnd that information in the browser cache.

5.2 Cyber Investigations

Cyber investigations may scan a wide variety of information types. A few unifying

features of these investigations are the need to ﬁnd encryption keys, hash values and

information about ethernet packets. bulk_extractor provides several scanners that pro-

duce feature ﬁles containing this information.

For encryption information, the following feature ﬁles may be useful:

• aes.txt — AES is an encryption system. Many implementations leave keys in

memory that can be found using an algorithm invented at Princeton University.

bulk_extractor provides an improved version of that algorithm to ﬁnd AES keys

in the aes scanner. When it scans memory, such as swap ﬁles or decompressed

hibernation files, it will identify the AES keys. The keys can be used with software

that will decrypt AES encrypted material.

• hex.txt — The base16 scanner decodes information that is stored in Base16,

breaking it into the corresponding hexadecimal values. This is useful if you are

looking for AES keys or SHA1 hashes. This scanner only writes blocks that are

128 or 256 bits in size because those are the sizes used for encryption keys. The feature

ﬁle is helpful if the investigator is looking for people who have emailed encryption

keys or hash values in a cyber investigation.
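As a rough illustration of what the entries in hex.txt represent, the following Python sketch finds runs of hexadecimal digits and keeps only those that decode to 128 or 256 bits. It is not the base16 scanner's implementation; the function name find_key_sized_hex and the regular expression are assumptions made for the example.

```python
import binascii
import re

# Runs of hex-digit pairs. 16 bytes = 128 bits, 32 bytes = 256 bits.
HEX_RUN = re.compile(rb"(?:[0-9A-Fa-f]{2})+")
KEY_SIZES = {16, 32}

def find_key_sized_hex(data):
    """Yield (offset, decoded_bytes) for hex runs that decode to 128 or 256 bits."""
    for match in HEX_RUN.finditer(data):
        decoded = binascii.unhexlify(match.group())
        if len(decoded) in KEY_SIZES:
            yield match.start(), decoded
```

A run such as 00112233445566778899aabbccddeeff (32 hex digits) decodes to 16 bytes and would be reported, while a short run such as abcd would be ignored.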

Additionally, the base64 scanner is important for cyber investigations because it looks

mostly at email attachments that are coded in Base64. The information found in these

attachments will be analyzed by other scanners looking for speciﬁc features.

The windirs scanner ﬁnds Windows FAT32 and NTFS directory entries and will also be

useful for cyber investigations involving Windows machines, as they may be indicators

of times that activity took place.

Finally, the ﬁles ether.txt, ip.txt, tcp.txt and domain.txt are all produced by the

net scanner. It searches for ethernet packets and memory structures associated with

network data structures in memory. It is important to note that the TCP connections it

reports include many false positives, so much of the information found by this scanner will be false.

Investigators should be careful with the interpretation of these feature ﬁles for that

reason.

5.3 Identity Investigations

Identity investigations may be looking for a wide variety of information including email

addresses, credit card information, telephone numbers, geographical information and

keywords. For example, if the investigator is trying to find out who a person is and

who their associates are, they will be looking at phone numbers, search terms to see

what they are doing and emails to see who they are communicating with.

The accts scanner is very useful for identity investigations. It produces several feature

ﬁles with identity information including:

• ccn.txt — credit card numbers

• ccn_track2.txt - credit card track two information - relevant information if some-

one is trying to make physical fake credit cards

• pii.txt - personally identifiable information including birth dates and Social Security

numbers

• telephone.txt - telephone numbers

The kml and gps scanners both produce GPS information that can place a person in

a certain area or link to what they have been doing in that area. Both

of these scanners write to gps.txt. KML is a format used by Google Earth and Google

Maps. The kml scanner searches that formatted data for GPS coordinates. The gps

scanner looks at Garmin Trackpoint formatted information and ﬁnds GPS coordinates

in that data.

The email scanner looks for email addresses in all data and writes that to email.txt.

The vcard scanner looks at vCard data, an electronic business card format, and ﬁnds

names, email addresses and phone numbers to write to the respective feature ﬁle.

There are multiple URL files including url.txt, url_facebook-address.txt, url_facebook-id.txt,

url_microsoft-live.txt, url_searches.txt and url_services.txt that are pro-

duced by the email scanner. They are useful for looking at what websites a person

has visited as well as the people they are associating with.

An important aspect of identity investigations (as well as other types) is the ability to

search the data for a list of keywords. bulk_extractor provides the capability to do that

through two different means. First, the find scanner is a simple regular expression

matcher. It looks through the data for anything

listed in the global ﬁnd list. The format of the ﬁnd list should be rows of regular

expressions while any line beginning with a # is considered a comment. The following

is an excerpt from a sample ﬁnd list ﬁle:

 # This is a comment line

 \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

 # another comment line

 /^[a-z0-9_-]{3,16}$/

The first regular expression from the above example, beginning with \b, looks for the

following in order: a word boundary, a digit repeated 1-3 times, a '.', a digit

repeated 1-3 times, a '.', a digit repeated 1-3 times, a '.', a digit repeated 1-3 times

and a closing word boundary. That regular expression would find, for example, the

sequence 192.168.1.10 (the form of an IPv4 address) separated from other text by word

boundaries.

The second regular expression from the above example, beginning with /, looks for the

following in order: a '/', the beginning of a line, any character in lowercase

a-z, 0-9, '_' or '-' repeated 3 to 16 times, the end of the line and a closing '/'. The

character class in that expression would match, for example, the following sequence:

284284284284

Regular expressions can be used to represent character and number sequences (or ranges

of values) that might be of particular importance to an investigation.
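A find list can be tried out on sample text before a long run. The sketch below uses Python's re module, which is not the regular expression engine used by bulk_extractor, so it is only a way to sanity-check an expression. The helper name load_find_list and the sample text are assumptions made for the example.

```python
import re

def load_find_list(lines):
    """Compile each non-blank, non-comment line of a find list as a regular expression."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        patterns.append(re.compile(line))
    return patterns

find_list = [
    "# This is a comment line",
    r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",
]
sample = "connection from 192.168.1.10 at noon"
matches = [m.group() for p in load_find_list(find_list) for m in p.finditer(sample)]
print(matches)  # prints ['192.168.1.10']
```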

The ﬁnd list is sent in as input to bulk_extractor using the “-F findlist” option. To

run bulk_extractor with a ﬁnd list, the following basic parameters are required (where

findlist.txt is the name of the ﬁnd list):

 bulk_extractor -F findlist.txt -o output mydisk.raw

Another scanner, the lightgrep scanner, covers the functionality of the find

scanner but is much faster and supports more features. It is also a regular

expression scanner that searches the buffers for matches against the global find list. A

syntax sheet of regular expressions that might be helpful to users in creating a ﬁnd list

to be used by the Lightgrep Scanner is shown in Figure 15.

The lightgrep scanner uses the Lightgrep library from Lightbox Technologies. An

open source version of that library can be downloaded from https://github.com/

LightboxTech/liblightgrep. Installation instructions are also available at the down-

load site. The lightgrep scanner is preferable because it looks for all regular expressions

at once, on the ﬁrst pass through the data. The ﬁnd scanner actually looks for each

expression in the ﬁnd list one at a time. For example, if the ﬁnd list is a list of medical

terms and diagnoses and bulk_extractor is searching medical records, the ﬁnd scanner

looks for each term in each piece of data on one pass through, one at a time. A list of

35 expressions would require 35 passes through the data. The lightgrep scanner will

search a given buﬀer for all of the medical terms at once, in one pass through.
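The difference between the two strategies can be sketched in Python. This is only an analogy: Python's re module is a backtracking engine, not the Lightgrep library, and a combined alternation reports non-overlapping matches only. The function names and the sample terms are assumptions made for the example.

```python
import re

terms = ["diabetes", "asthma", "fracture"]
buffer = "patient history: asthma; later diagnosed with diabetes"

def search_one_at_a_time(terms, text):
    """One pass over the text per expression, as described for the find scanner."""
    hits = []
    for term in terms:  # len(terms) passes over the same text
        hits += [(m.start(), m.group()) for m in re.finditer(term, text)]
    return sorted(hits)

def search_all_at_once(terms, text):
    """All expressions combined into one alternation and scanned in a single pass."""
    combined = re.compile("|".join(f"(?:{t})" for t in terms))
    return [(m.start(), m.group()) for m in combined.finditer(text)]
```

For the sample buffer both functions report the same two hits, but the first walks the text three times and the second only once.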

If the Lightgrep library is installed and the ﬁnd list is provided to bulk_extractor, it

will run the lightgrep scanner. If not, it will use the ﬁnd scanner. Neither scanner

needs to be enabled by the user specifically; calling bulk_extractor with the find list will

automatically enable the appropriate scanner. However, we do not recommend using

the ﬁnd list without the Lightgrep library — it will make bulk_extractor run very slowly

because each find search will be executed sequentially. The run time grows with the

number of expressions in the find list.

Investigators looking for identity information may rely heavily on the ﬁnd list to search

for speciﬁc names, numbers or keywords relevant to the investigation. The features

found by the ﬁnd or lightgrep scanner will be written to the ﬁles find.txt and

lightgrep.txt respectively.

Lightgrep Cheat Sheet

Single Characters

 c              the character c (except U+0000 (NUL) and metacharacters)
 \a             U+0007 (BEL) bell
 \e             U+001B (ESC) escape
 \f             U+000C (FF) form feed
 \n             U+000A (NL) newline
 \r             U+000D (CR) carriage return
 \t             U+0009 (TAB) horizontal tab
 \ooo           U+ooo, 1–3 octal digits o, ≤ 0377
 \xhh           U+00hh, 2 hexadecimal digits h
 \x{hhhhhh}     U+hhhhhh, 1–6 hex digits h
 \zhh           the byte 0xhh (not the character!) (Lightgrep extension; not part of PCRE)
 \N{name}       the character called name
 \N{U+hhhhhh}   same as \x{hhhhhh}
 \c             the character c (except any of: adefnprstwDPSW1234567890)

Named Character Classes

 .              any character
 \d             [0-9] (= ASCII digits)
 \D             [^0-9]
 \s             [\t\n\f\r ] (= ASCII whitespace)
 \S             [^\t\n\f\r ]
 \w             [0-9A-Za-z_] (= ASCII words)
 \W             [^0-9A-Za-z_]
 \p{property}   any character having property
 \P{property}   any character lacking property

Character Classes

 [stu]          any character in stu
 [^stu]         any character not in stu

 where stu is:

 c              a character
 a-b            a character range, inclusive
 \zhh           a byte
 \zhh-\zhh      a byte range, inclusive
 [S]            a character class
 ST             S ∪ T (union)
 S&&T           S ∩ T (intersection)
 S--T           S − T (difference)
 S~~T           S △ T (symmetric difference, XOR)

Grouping

 (S)            makes any pattern S atomic

Concatenation & Alternation

 ST             matches S, then matches T
 S|T            matches S or T, preferring S

Repetition

 Repeats S...

 Greedy
 S*             0 or more times (= S{0,})
 S+             1 or more times (= S{1,})
 S?             0 or 1 time (= S{0,1})
 S{n,}          n or more times
 S{n,m}         n–m times, inclusive

 Reluctant
 S*?            0 or more times (= S{0,})
 S+?            1 or more times (= S{1,})
 S??            0 or 1 time (= S{0,1})
 S{n,}?         n or more times
 S{n,m}?        n–m times, inclusive

Selected Unicode Properties

 Any                          Assigned
 Alphabetic                   White_Space
 Uppercase                    Lowercase
 ASCII                        Noncharacter_Code_Point
 Name=name                    Default_Ignorable_Code_Point

 General_Category=category

 L, Letter                    P, Punctuation
 Lu, Uppercase Letter         Pc, Connector Punctuation
 Ll, Lowercase Letter         Pd, Dash Punctuation
 Lt, Titlecase Letter         Ps, Open Punctuation
 Lm, Modifier Letter          Pe, Close Punctuation
 Lo, Other Letter             Pi, Initial Punctuation
 M, Mark                      Pf, Final Punctuation
 Mn, Non-Spacing Mark         Po, Other Punctuation
 Me, Enclosing Mark           Z, Separator
 N, Number                    Zs, Space Separator
 Nd, Decimal Digit Number     Zl, Line Separator
 Nl, Letter Number            Zp, Paragraph Separator
 No, Other Number             C, Other
 S, Symbol                    Cc, Control
 Sm, Math Symbol              Cf, Format
 Sc, Currency Symbol          Cs, Surrogate
 Sk, Modifier Symbol          Co, Private Use
 So, Other Symbol             Cn, Not Assigned

 Script=script

 Common Latin Greek Cyrillic Armenian Hebrew Arabic Syriac Thaana Devanagari
 Bengali Gurmukhi Gujarati Oriya Tamil Telugu Kannada Malayalam Sinhala Thai
 Lao Tibetan Myanmar Georgian Hangul Ethiopic Cherokee Ogham Runic Khmer
 Mongolian Hiragana Katakana Bopomofo Han Yi Old_Italic Gothic Inherited
 Tagalog Hanunoo Buhid Tagbanwa Limbu Tai_Le Linear_B Ugaritic Shavian
 Osmanya Cypriot Buginese Coptic New_Tai_Lue Glagolitic Tifinagh Syloti_Nagri
 Old_Persian Kharoshthi Balinese Cuneiform Phoenician Phags_Pa Nko Sundanese
 Lepcha ... See Unicode Standard for more.

EnCase GREP Syntax

 c              the character c (except metacharacters)
 \xhh           U+00hh, 2 hexadecimal digits h
 \whhhh         U+hhhh, 4 hexadecimal digits h
 \c             the character c
 .              any character
 #              [0-9] (= ASCII digits)
 [a-b]          any character in the range a–b
 [S]            any character in S
 [^S]           any character not in S
 (S)            grouping
 S*             repeat S 0 or more times (max 255)
 S+             repeat S 1 or more times (max 255)
 S?             repeat S 0 or 1 time
 S{n,m}         repeat S n–m times (max 255)
 ST             matches S, then matches T
 S|T            matches S or T

Importing from EnCase into Lightgrep

 \whhhh   ->   \xhhhh
 #        ->   \d
 S*       ->   S{0,255}
 S+       ->   S{1,255}

 S* and S+ are limited to 255 repetitions by EnCase; Lightgrep preserves this in
 imported patterns.
 \w is limited to BMP characters (< U+10000) only.

Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.
(JWZ in alt.religion.emacs, 12 August 1997)

Source: Lightgrep Search for EnCase, Fast Search for Forensics, www.lightgrep.com

Notes & Examples

Characters:

 .*?\x00                              (= null-terminated string)
 \z50\z4B\z03\z04                     (= ZIP signature)
 \N{EURO SIGN}, \N{NO-BREAK SPACE}
 \x{042F}                             (= CYRILLIC CAPITAL LETTER YA)
 \+12\.5%                             (= escaping metacharacters)

Grouping: Operators bind tightly. Use (aa)+ not aa+, to match pairs of a's.

Ordered alternation: a|ab matches a twice in aab. Left alternatives preferred to right.

Repetition: Greedy operators match as much as possible. Reluctant operators match as
little as possible. a+a matches all of aaaa; a+?a matches the first aa, then the second aa.
A greedy any-character repetition such as .* will (uselessly) match the entire input.
Prefer reluctant operators when possible.

Character classes:

 [abc]                  = a, b, or c
 [^a]                   = anything but a
 [A-Z]                  = A to Z
 [A\-Z]                 = A, Z, or hyphen (!)
 [A-Zaeiou]             = capitals or lowercase vowels
 [.+*?\]]               = ., +, *, ?, or ]
 [Q\z00-\z7F]           = Q or 7-bit bytes
 [[abcd][bce]]          = a, b, c, d, or e
 [[abcd]&&[bce]]        = b or c
 [[abcd]--[bce]]        = a or d
 [[abcd]~~[bce]]        = a, d, or e
 [\p{Greek}\d]          = Greek or digits
 [^\p{Greek}7]          = neither Greek nor 7
 [\p{Greek}&&\p{Ll}]    = lowercase Greek

 Operators need not be escaped inside character classes.

Example patterns:

 Email addresses: [a-z\d!#$%&'*+/=?^_`{|}~-][a-z\d!#$%&'*+/=?^_`{|}~.-]{0,63}@[a-z\d.-]{1,253}\.[a-z\d-]{2,22}
 Hostnames: ([a-z\d]([a-z\d_-]{0,61}[a-z\d])?\.){2,5}[a-z\d][a-z\d-]{1,22}
 N. American phone numbers: \(?\d{3}[ ).-]{0,2}\d{3}[ .-]?\d{4}\D
 Visa, MasterCard: \d{4}([ -]?\d{4}){3}
 American Express: 3[47]\d{2}[ -]?\d{6}[ -]?\d{5}
 Diners Club: 3[08]\d{2}[ -]?\d{6}[ -]?\d{4}
 EMF header: \z01\z00\z00\z00.{36}\z20EMF
 JPEG: \zFF\zD8\zFF[\zC4\zDB\zE0-\zEF\zFE]    Footer: \zFF\zD9
 GIF: GIF8[79]    Footer: \z00\z3B
 BMP: BM.{4}\z00\z00\z00\z00.{4}\z28
 PNG: \z89\z50\z4E\z47    Footer: \z49\z45\z4E\z44\zAE\z42\z60\z82
 ZIP: \z50\z4B\z03\z04    Footer: \z50\z4B\z05\z06
 RAR: \z52\z61\z72\z21\z1a\z07\z00...[\z00-\z7F]    Footer: \z88\zC4\z3D\z7B\z00\z40\z07\z00
 GZIP: \z1F\z8B\z08
 MS Office 97–03: \zD0\zCF\z11\zE0\zA1\zB1\z1A\zE1
 LNK: \z4c\z00\z00\z00\z01\z14\z02\z00
 PDF: \z25\z50\z44\z46\z2D\z31    Footer: \z25\z45\z4F\z46

Figure 15: Guide to Syntax Used by Lightgrep Scanner

5.4 Password Cracking

If an investigator is looking to crack a password, the wordlist scanner can be useful. It

generates a list of all the words found on the disk that are between 6 and 14 characters.

Users can change the minimum and maximum size of words by specifying options at

run-time but we have found this size range to be optimal for most applications. Because

the wordlist scanner is disabled by default, users must speciﬁcally enable it at run-time

when needed. To do that, run the following command:

 bulk_extractor -e wordlist -o output mydisk.raw

This will produce two files useful for password cracking, wordlist_histogram.txt and

wordlist.txt. These ﬁles will contain large words that can be used to recommend

passwords.
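The kind of list this produces can be approximated with a short Python sketch. It is not the wordlist scanner itself: treating a word as a run of ASCII letters and digits is an assumption made for the example, as is the function name extract_words. Only the default 6 to 14 character range comes from the description above.

```python
import re
from collections import Counter

def extract_words(data, min_len=6, max_len=14):
    """Count words in raw bytes whose length is between min_len and max_len."""
    words = re.findall(rb"[A-Za-z0-9]+", data)
    return Counter(w.decode("ascii") for w in words if min_len <= len(w) <= max_len)
```

The resulting counter plays the role of the histogram: words that occur most often on the media sort to the top and can be tried first.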

5.5 Analyzing Imagery Information

If an investigator needs to specifically analyze imagery, for something such as a child

pornography investigation, the exif scanner would be useful. It ﬁnds JPEGs on the

disk image and then carves the encoded ones that might be in, for example, ZIP ﬁles or

hibernation ﬁles. It writes the output of this carving to jpeg_carved.txt.

5.6 Using bulk_extractor in a Highly Specialized Environment

If using bulk_extractor in a specialized environment, two speciﬁc features might be

useful. The ﬁrst is the option to include a banner on each output ﬁle created by

bulk_extractor. The banner file, specified in the example command below as banner.txt,

could include a security classiﬁcation of the output data. When bulk_extractor is run

with the command speciﬁed below, the data in the banner ﬁle will be printed at the top

of each output ﬁle produced.

 bulk_extractor -b banner.txt -o output mydisk.raw

The second feature that might be useful to users in a specialized environment is the ability

to develop plug-ins. Plug-ins in bulk_extractor are external scanners that an individual

or organization can run in addition to the open source capabilities provided with the

bulk_extractor system. The plug-in system gives the full power of bulk_extractor to ex-

ternal developers, as all of bulk_extractor’s native scanners are written with the plug-in

system. This power gives third party developers the ability to utilize proprietary or secu-

rity protected algorithms and information in bulk_extractor scanners. It is worth noting

that all scanners installed with bulk_extractor use the plug-in system; bulk_extractor is

really just a framework for running plug-ins. The separate publication Programmers

Manual for Developing Scanner Plug-ins [?] provides speciﬁc details on how to

develop and use plug-ins with bulk_extractor.

6 Tuning bulk_extractor

All data that bulk_extractor processes is divided into buﬀers called sbufs. Buﬀers cre-

ated from disk images are created with a pre-determined size (bufsize). The buﬀer

includes a page and an overlap area. As shown in Figure 16, the pages overlap with

each other in the red region. The red overlap region is called the margin. bulk_extractor

scans the pages one-by-one looking for features. Pages overlap with each other so that

bulk_extractor won't miss any features that cross from one page into another across

boundaries.

[Figure 16 diagram labels: Disk Image, pagesize, bufsize]

Figure 16: Image Processor divides the disk image into buffers. Each buffer is the size

of a page (pagesize) with a buffer overlap in an area called the margin (marginsize is

equal to bufsize-pagesize). The buffers overlap with each other to ensure all information

is processed.

Users may be looking for potentially large features that are bigger than the buﬀer size

or that overlap into the margin. In that case, they may want to adjust the margin size

or buﬀer size. For example, if the input data includes a 30 MB ZIP ﬁle (possibly a soft-

ware program), bulk_extractor won’t ﬁnd features in the program because it overlaps

the margins. To ﬁnd features of that size, the margin size must be increased.

To adjust the page size, the following usage options need to be included where NN should

be set to the size (default page size is 16777216):

 bulk_extractor -G NN -o output mydisk.raw

To adjust the margin size, the following usage options need to be included where NN

should be set to the size (default margin size is 4194304):

 bulk_extractor -g NN -o output mydisk.raw
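The relationship between page size, margin size and buffer size can be worked through with a small Python sketch. It follows the description in this section rather than the bulk_extractor source code, and the function names are assumptions made for the example. The default sizes are the ones stated above.

```python
PAGE_SIZE = 16777216    # default page size stated above (16 MiB)
MARGIN_SIZE = 4194304   # default margin size stated above (4 MiB)

def buffer_ranges(image_size, page_size=PAGE_SIZE, margin_size=MARGIN_SIZE):
    """Yield (page_start, page_end, buffer_end) byte offsets for each buffer.

    A buffer covers one page plus the margin that overlaps the next page.
    Buffers near the end of the image are clipped to the image size.
    """
    for start in range(0, image_size, page_size):
        page_end = min(start + page_size, image_size)
        buffer_end = min(page_end + margin_size, image_size)
        yield start, page_end, buffer_end

def fits_in_one_buffer(feature_start, feature_len,
                       page_size=PAGE_SIZE, margin_size=MARGIN_SIZE):
    """True if a feature lies entirely inside the buffer of the page it starts in."""
    page_start = (feature_start // page_size) * page_size
    return feature_start + feature_len <= page_start + page_size + margin_size
```

With the defaults, each buffer is 20971520 bytes (page plus margin), so a 30 MB object that starts at the beginning of a page does not fit in a single buffer, which is why the margin must be increased in that case.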

bulk_extractor provides many other tuning capabilities that are primarily recommended

for users doing advanced research. Many of those options relate to specifying ﬁle sizes

for input or output, specifying block sizes, dumping the contents of a buﬀer or ignoring

certain entries. Those options are all shown in the output of the -h option to bulk_extractor

and listed in Appendix A.

7 Post Processing Capabilities

There are two Python programs useful for post-processing the bulk_extractor output.

Those programs are bulk_diﬀ.py and identify_ﬁlenames.py. To run either of these

programs, you must have Python version 2.7 or higher installed on your system. On

Linux and Mac systems, the bulk_extractor python programs are located in the direc-

tory ./python under the main bulk_extractor installation.

7.1 bulk_diﬀ.py: Diﬀerence Between Runs

The program bulk_diﬀ.py takes the results of two bulk_extractor runs and shows the

diﬀerences between the two runs. This program essentially tells the diﬀerence between

two disk images. It will note the diﬀerent features that are found by bulk_extractor

between one image and the next. It can be used, for example, to easily tell whether or

not a computer user has been visiting websites they are not supposed to by comparing

a disk image from their computer from one week to the next. To run the program, users

should type the following, where pre and post are the locations of two bulk_extractor

output directories:

 bulk_diff.py <pre> <post>

Note, Linux and Mac OS X users may have to type python2.7, python3, or python3.3

before the command, indicating the version of Python installed on your machine. An

example use of the bulk_diﬀ.py program can be found in Section 8.
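The comparison that bulk_diff.py performs can be pictured with a simplified Python sketch. This is not the bulk_diff.py implementation. It assumes that each feature file line holds an offset, a feature and a context separated by tabs, and that lines beginning with # are comments; the function names are hypothetical.

```python
def read_features(lines):
    """Collect the feature column from feature-file lines."""
    features = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/banner lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            features.add(fields[1])
    return features

def diff_features(pre_lines, post_lines):
    """Report features that appeared or disappeared between two runs."""
    pre, post = read_features(pre_lines), read_features(post_lines)
    return {"added": sorted(post - pre), "removed": sorted(pre - post)}
```

Applied to the email.txt files of two runs, the "added" list would show addresses present only in the later image.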

7.2 identify_ﬁlenames.py: Identify File Origin of Features

The program identify_filenames.py operates on the results of a bulk_extractor run and

identifies the filenames (where possible) of the features that were found on the disk

image. The user can run this program on one or all of the feature files produced by a given

run. It can be used, for example, to ﬁnd the full content of an email when references to

its contents are found in one of the feature ﬁles. Often email features are relevant to an

investigation and an investigator would like to be able to view the full email.

To run this program, users will need the program ﬁwalk installed on their machine or

have a DFXML ﬁle generated by ﬁwalk that corresponds to the disk image. ﬁwalk

is part of The Sleuth Kit and is installed along with it. The Sleuth Kit is available at

http://www.sleuthkit.org/.

The identify_ﬁlenames.py program provides various usage options but to run the

program on all feature ﬁles produced by a bulk_extractor run, the user should type

the following (where “bulkoutputdirectory” is the directory containing the output of

a bulk_extractor run and “idoutput” will contain the annotated feature ﬁles after the

program runs):

 identify_filenames.py --all bulkoutputdirectory idoutput

Note, Linux and Mac OS X users may have to type python2.7, python3, or python3.3

before the command, indicating the version of Python installed on your machine. An

example use of the identify_filenames.py program can be found in Section 8.

8 Worked Examples

The worked examples provided are intended to further illustrate how to use bulk_extractor

to answer specific questions and conduct investigations. Each example uses a different,

publicly available dataset and can be replicated by readers of this manual.

8.1 Encoding

We describe the encoding system here in order to prepare users to view the feature ﬁles

produced by bulk_extractor. Unicode is the international standard used by all modern

computer systems to deﬁne a mapping between information stored inside a computer

and the letters, digits, and symbols that are displayed on the screens or printed on

paper. UTF-8 is a variable width encoding that can represent every character in the

Unicode character set. It was designed for backward compatibility with ASCII and to

avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.

Feature ﬁles in bulk_extractor are all coded in UTF-8 format. This means that the odd

looking symbols, such as accented characters (è), funny symbols (∴) and the

occasional Chinese character that may show up in the files are legitimate. Glyphs from

other languages, for example Cyrillic or Arabic, may show up in feature files, as

all languages can be coded in UTF-8 format. It is perfectly appropriate and typical

to open up a feature ﬁle and see characters that the user may not recognize.

9 2009-M57 Patents Scenario

The 2009-M57-Patents scenario tracks the ﬁrst four weeks of corporate history of the

(ﬁctional) M57 Patents company. The company started operation on Friday, November

13th, 2009, and ceased operation on Saturday, December 12, 2009. This speciﬁc scenario

was built to be used as a teaching tool both as a disk forensics exercise and as a network

forensics exercise. The scenario data is also useful for computer forensics research

because the hard drive of each computer and each computer's memory were imaged every

day. In this example, we are not particularly interested in the exercises related to illegal

activity, exﬁltration and eavesdropping; they do however provide interesting components

for us to examine in the example data[?].

9.1 Run bulk_extractor with the Data

For this example, we downloaded and utilized one of the disk images from the 2009-

M57-Patents Scenario. Those images are available at http://digitalcorpora.org/

corp/nps/scenarios/2009-m57-patents/drives-redacted/. The ﬁle used through-

out this example is called charlie-2009-12-11.E01. Running bulk_extractor on the

command line produces the following output (text input by the user is bold):

C:\bulk_extractor>bulk_extractor -o ../Output/charlie-2009-12-11 charlie-2009-12-11.E01

bulk_extractor version: 1.4.0

Input file: charlie-2009-12-11.E01

Output directory: ../Output/charlie-2009-12-11

Disk Size: 10239860736

Threads: 4

8:02:08 Offset 67MB (0.66%) Done in 1:21:23 at 09:23:31

8:02:34 Offset 150MB (1.47%) Done in 1:05:18 at 09:07:52

8:03:03 Offset 234MB (2.29%) Done in 1:01:39 at 09:04:42

8:03:49 Offset 318MB (3.11%) Done in 1:09:19 at 09:13:08

...

9:06:23 Offset 10049MB (98.14%) Done in 0:01:13 at 09:07:36

9:06:59 Offset 10133MB (98.96%) Done in 0:00:41 at 09:07:40

9:07:29 Offset 10217MB (99.78%) Done in 0:00:08 at 09:07:37

All data are read; waiting for threads to finish...

Time elapsed waiting for 4 threads to finish:

(timeout in 60 min .)

Time elapsed waiting for 3 threads to finish:

7 sec (timeout in 59 min 53 sec.)

Thread 0: Processing 10200547328

Thread 2: Processing 10217324544

Thread 3: Processing 10234101760

Time elapsed waiting for 2 threads to finish:

13 sec (timeout in 59 min 47 sec.)

Thread 0: Processing 10200547328

Thread 2: Processing 10217324544

All Threads Finished!

Producer time spent waiting: 3645.8 sec.

Average consumer time spent waiting: 3.67321 sec.

*******************************************

bulk_extractor is probably CPU bound.

Run on a computer with more cores

to get better performance.

*******************************************

Phase 2. Shutting down scanners

Phase 3. Creating Histograms

ccn histogram... ccn_track2 histogram... domain histogram...

email histogram... ether histogram... find histogram...

ip histogram... lightgrep histogram... tcp histogram...

telephone histogram... url histogram... url microsoft-live...

url services... url facebook-address... url facebook-id...

url searches...Elapsed time: 3991.77 sec.

Overall performance: 2.56524 MBytes/sec

Total email features found: 15277

All of the results from the bulk_extractor run are stored in the output directory charlie-

2009-12-11. The contents of that directory after the run include the feature ﬁles, his-

togram ﬁles and carved output. Figure 17 is a screenshot of the Windows output

directory. Additionally, the following output shows a list of the ﬁles, directories and

their sizes under Linux:

C:\bulk_extractor\charlie-2009-12-11>ls -s -F

     1 aes_keys.txt                   0 kml.txt
     0 alerts.txt                     0 lightgrep.txt
     4 ccn.txt                        0 lightgrep_histogram.txt
     1 ccn_histogram.txt            196 packets.pcap
     0 ccn_track2.txt                 1 rar.txt
     0 ccn_track2_histogram.txt     108 report.xml
 23028 domain.txt                  3728 rfc822.txt
   192 domain_histogram.txt          20 tcp.txt
     0 elf.txt                        4 tcp_histogram.txt
  1696 email.txt                     60 telephone.txt
    36 email_histogram.txt            8 telephone_histogram.txt
    24 ether.txt                  70108 url.txt
     1 ether_histogram.txt            1 url_facebook-address.txt
   508 exif.txt                       0 url_facebook-id.txt
     0 find.txt                    6684 url_histogram.txt
     0 find_histogram.txt             0 url_microsoft-live.txt
     0 gps.txt                       12 url_searches.txt
     0 hex.txt                      156 url_services.txt
    32 ip.txt                         0 vcard.txt
     4 ip_histogram.txt           16432 windirs.txt
    12 jpeg/                      20800 winpe.txt
   504 jpeg.txt                    1864 winprefetch.txt
  1896 json.txt                   29624 zip.txt

Many of the feature ﬁles and histograms are populated with data. Additionally, there

were some JPEG ﬁles carved and placed in the jpeg directory. In the following sections,

we demonstrate how to look at these results to discover more information about the disk

user and the ﬁles contained on the disk image.

Figure 17: Screenshot from Windows Explorer of the Output Directory Created by

the bulk_extractor run

9.2 Digital Media Triage

Digital media triage is the process of using the results of a rapid and automated analysis

of the media, performed when the media is ﬁrst encountered to determine if the media

is likely to have information of intelligence value and, therefore, should be prioritized

for immediate analysis. bulk_extractor performs bulk data analysis to help investiga-

tors quickly decide which piece of digital media is the most relevant and useful to an

investigation. Thus, bulk_extractor can be used to aid in investigations (through the

identiﬁcation of new leads and social networks) rather than just aiding in conviction-

support (through the identiﬁcation of illegal materials)[?].

In this example, we look at the charlie-2009-12-11.E01 image to quickly assess what

kinds of information useful to an investigation might be present on the disk. For the

purposes of this example, we will assume we are investigating corporate fraud and trying

to discover the answers to the following questions:

• Who are the users of the drive?

• Who is this person communicating with?

• What kinds of websites have they been visiting most often?

• What search terms are used?

To answer many of these questions, we look at the identity information on the drive

including email addresses, credit card information, search terms, Facebook IDs, domain

names and vCard data. The output ﬁles created by bulk_extractor contain all of this

type of information that was found on the disk image.

The scenario setup leads us to believe that Charlie is the user of this drive (based on

the name of the disk image). First, we look at email.txt to ﬁnd information about the

email addresses contained on the disk. The ﬁrst two lines of the email features found are

the following (each block of text represents one long line of oﬀset, feature and context):

50395384 n\x00o\x00m\x00b\x00r\x00e\x00_\x001\x002\x003\x00@\x00h\x00o\x00t
\x00m\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00 e\x00m\x00p\x00l\x00o\x00 \x00\x0A\x00
\x09\x00n\x00o\x00m\x00b\x00r\x00e\x00_\x001\x002\x003\x00@\x00h\x00o\x00t\x00m
\x00a\x00i\x00l\x00.\x00c\x00o\x00m\x00\x0A\x00\x09\x00m\x00i\x00n\x00o\x00m\x00b\x00

50395432 m\x00i\x00n\x00o\x00m\x00b\x00r\x00e\x00@\x00m\x00s\x00n\x00.\x00c
\x00o\x00m\x00 i\x00l\x00.\x00c\x00o\x00m\x00\x0A\x00\x09\x00m\x00i\x00n\x00o\x00m
\x00b\x00r\x00e\x00@\x00m\x00s\x00n\x00.\x00c\x00o\x00m\x00\x0A\x00\x09\x00e\x00j
\x00e\x00m\x00p\x00l\x00

It is important to note that UTF-16 formatted text is escaped with \x00. This means

that "\x00t \x00e \x00x \x00t" translates to "text." The ﬁrst two features found

are "nombre_123@hotmail.com" and "minombre@msn.com." Both of the offset values,

50395384 and 50395432, are early on the disk. At this point, there is no way to know

if either of these email addresses is of any significance unless they happen to belong

to a suspect or person related to the investigation. The ﬁrst set of email features found

appear on the disk printed in UTF-16 formatted text, like the lines above.
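The escaping described above can be reversed mechanically. The following is a minimal sketch, assuming the feature field uses \xNN escapes exactly as shown in the excerpt and that the underlying text is little-endian UTF-16; the function names are illustrative and are not part of bulk_extractor.

```python
import re

# Matches the \xNN escapes that appear in feature files.
_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{2})")

def unescape_feature(field: str) -> bytes:
    """Turn an escaped feature field back into the raw bytes it represents."""
    raw = field.encode("utf-8")
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)

def decode_utf16_feature(field: str) -> str:
    """Decode an escaped UTF-16 (little-endian, assumed) feature into text."""
    return unescape_feature(field).decode("utf-16-le")

print(decode_utf16_feature(r"t\x00e\x00x\x00t\x00"))  # prints: text
```

Applied to the first feature above, the same call would yield the plain email address, which is easier to search for and compare.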

Further down in the feature ﬁle, we ﬁnd the following:

9263459 charlie@m57.biz 21)(88=Charlie <charlie@m57.biz>)(89\x0D\x0A=Pat

9263497 pat@m57.biz =Pat McGoo)(8B=WELCOME TO

Finding Charlie’s email address on the computer begins to further conﬁrm the assump-

tion that this is his computer. The email_histogram.txt ﬁle provides important infor-

mation. It shows the most frequently occurring email addresses found on the disk. The

following is an excerpt from the top of that file:

n=875 mozilla@kewis.ch (utf16=3)
n=651 charlie@m57.biz (utf16=120)
n=605 ajbanck@planet.nl
n=411 mikep@oeone.com
n=395 belhaire@ief.u-psud.fr
n=379 premium-server@thawte.com (utf16=11)
n=356 lilmatt@mozilla.com
n=312 cedric.corazza@wanadoo.fr

This histogram output shows us that Charlie's email address is the second most

frequently occurring address on the disk. It would likely be the first but, as described in

the scenario description, this company has only been in business for three weeks and

its employees are new users of the computers. Looking at this histogram ﬁle also gives

us some insight into who the user of this disk is communicating with. Those email

addresses occurring most frequently that are not part of the software installed on the

machine (such as ajbanck@planet.nl) might indicate addresses of people with whom the

drive user is corresponding or they may result from other software or web pages that

were downloaded. (In this case, the email is from a Firefox extension.)
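A histogram file can also be filtered programmatically, for example to list only addresses in the organization's own domain. The sketch below assumes each histogram line has the tab-separated form n=COUNT, feature, and an optional (utf16=N) field, and that comment lines begin with "#"; the helper names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class HistogramEntry(NamedTuple):
    count: int        # total occurrences of the feature
    feature: str      # e.g. an email address
    utf16_count: int  # occurrences seen as UTF-16 (0 if not reported)

# n=<count> TAB <feature> [TAB (utf16=<count>)]
_LINE = re.compile(r"^n=(\d+)\t(.*?)(?:\t\(utf16=(\d+)\))?$")

def parse_histogram_line(line: str) -> Optional[HistogramEntry]:
    """Parse one histogram line; return None for comments or other lines."""
    m = _LINE.match(line.rstrip("\n"))
    if m is None:
        return None
    return HistogramEntry(int(m.group(1)), m.group(2), int(m.group(3) or 0))

def entries_for_domain(lines, domain):
    """Keep only the entries whose feature ends in @<domain>."""
    suffix = "@" + domain.lower()
    parsed = (parse_histogram_line(line) for line in lines)
    return [e for e in parsed if e and e.feature.lower().endswith(suffix)]
```

Calling entries_for_domain on the lines of email_histogram.txt with "m57.biz" would, under these assumptions, return the company addresses together with their counts.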

The ﬁle domain.txt provides a list of all the "domains" and host names that were

found. The sources include URLs, email and dotted quads. Much of the beginning of

the feature ﬁle is populated with microsoft.com domains. This is largely due to the fact

that the disk image is from a Windows machine. Further down in the ﬁle we ﬁnd the

following:

53878576 www.uspto.gov <a href="http://www.uspto.gov/patft/index.htm
53879083 www.uspto.gov <A HREF="http://www.uspto.gov/patft/help/help
53880076 ebiz1.uspto.gov <A HREF="http://ebiz1.uspto.gov/vision-service/
53880536 ebiz1.uspto.gov <A HREF="http://ebiz1.uspto.gov/vision-service/

The domains that were found make sense given that the disk image was obtained from a

startup company that deals with patents. Many of the domains found in the ﬁle are also

in UTF-16 format (with "escaped" characters). It is also worth noting as users browse

the domain output ﬁle that domains are common in compressed data.

The domain_histogram.txt ﬁle provides a histogram of the domains found on the

disk image. It tends to give us better information for digital media triage than the

domain.txt ﬁle as it provides information about which domains most frequently appear

on the disk image and not just the order in which they were found. The beginning of

the histogram ﬁle looks like the following:

n=10749 www.w3.org
n=6670 chroniclingamerica.loc.gov
n=6384 openoffice.org
n=5998 www.uspto.gov
n=5733 www.mozilla.org
n=5212 www.osti.gov
n=4952 www.microsoft.com
n=4470 patft.uspto.gov

Many of these domains are part of the operating system, such as openoﬃce.org, but

some are not, such as www.uspto.gov. The histogram file provides insight into the user's

activity on the machine and which sites they were most frequently visiting.

The ﬁle rfc822.txt primarily provides email headers and HTTP headers both of which

are in a format speciﬁed by RFC822, the Internet Message Standard. It can be useful

to see the subject of emails that have been sent and information from HTTP requests.

The following is an excerpt from the text ﬁle:

114074196 SUBJECT:softabs ll|micro)\x5CW?cap\x00SUBJECT:softabs\x00SUBJECT:Caili
114074212 SUBJECT:Cailis SUBJECT:softabs\x00SUBJECT:Cailis\x00\x00SUBJECT:st0ck
114074228 SUBJECT:st0ck SUBJECT:Cailis\x00\x00SUBJECT:st0ck\x00\x00\x00SUBJECT:Your
114074244 SUBJECT:Your Personal Quarantine Folder
SUBJECT:st0ck\x00\x00\x00SUBJECT:Your Personal Quarantine Folder\x00SUBJECT:rolex\x00
114074284 SUBJECT:rolex arantine Folder\x00SUBJECT:rolex\x00\x00\x00SUBJECT:(bro

Most of the entries in the excerpt above come from spam messages.

Telephone numbers found on the disk image are stored in telephone.txt. The following

numbers found in the file are clearly for technical support (found within installed

software):

88850883 (800) 563-9048 rmation centre: (800) 563-9048\x0D\x0ATech
88850995 (905) 568-4494 indows&nbsp;95: (905) 568-4494\x0D\x0AMicrosoft
88851056 (905) 568-2294 ice components: (905) 568-2294\x0D\x0AOther sta
88851111 (905) 568-3503 hnical support: (905) 568-3503\x0D\x0APriority
88851162 (800) 668-7975 rt information: (800) 668-7975\x0D\x0AText Tele

The next set of "telephone" numbers are clearly bogus numbers:

3649684174 008-017-0108 WA,98366,1,4031-008-017-0108,City of Port Or
3649684741 000-031-0009 98337,0.13,3768-000-031-0009,Kitsap County C
3649818237 000-001-0005 8312,2.25,"3768-000-001-0005, 3768-000-003-0
3649818274 000-004-0002 0-003-003, 3768-000-004-0002, 3768-000-005-0

Finally, many of the numbers found are legitimate ones. These numbers were all found

in GZIP compressed data:

3772517888-GZIP-28322 (831) 373-5555 onterey - (831) 373-5555 <a cl
3772517888-GZIP-29518 (831) 899-8300 Seaside - (831) 899-8300 <a cl
3772517888-GZIP-31176 (831) 899-8300 Seaside - (831) 899-8300 <a cl
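The first field of these lines is a forensic path rather than a plain offset: a byte offset on the disk, the name of the decoder that was applied, and an offset within the decoded data. A minimal sketch of splitting such a line is shown below; it assumes the three fields of a feature line are separated by tabs, and the function names are illustrative only.

```python
def parse_forensic_path(path: str):
    """Split '3772517888-GZIP-28322' into [3772517888, 'GZIP', 28322]."""
    return [int(part) if part.isdigit() else part for part in path.split("-")]

def parse_feature_line(line: str):
    """Return (forensic_path, feature, context) for one feature-file line."""
    # Pad with empty strings so lines without a context field still unpack.
    fields = (line.rstrip("\n").split("\t", 2) + ["", ""])[:3]
    offset, feature, context = fields
    return parse_forensic_path(offset), feature, context

def was_decoded(path_parts) -> bool:
    """True when the feature was found inside decoded data (e.g. GZIP, ZIP)."""
    return any(isinstance(part, str) for part in path_parts)
```

A check such as was_decoded makes it easy to separate features that are visible in the raw disk image from those that only appear after decompression.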

Typically, the file telephone_histogram.txt is the best place to look for phone

numbers. In this file, non-digit characters such as parentheses, spaces and hyphens are stripped from the phone numbers. The following

is an excerpt from the beginning of that ﬁle:

n=42 +14159618830
n=35 8477180400
n=24 +27112570000
n=24 2225552222
n=18 8005043248
n=15 2225551111
n=13 8662347350
n=12 8772768437
n=11 2522277013

Investigators looking for speciﬁc information about the user of a disk image or who

they have been communicating with can look quickly at this ﬁle and see how frequently

numbers appear. It also consolidates the numbers in a way that makes it easy for inves-

tigators looking for a speciﬁc number or set of numbers to see them quickly.
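The consolidation described above can be approximated in a few lines. This is only a sketch of the idea, not the logic bulk_extractor itself uses: it assumes that stripping every non-digit character (while keeping a leading "+") is enough to make differently formatted copies of the same number compare equal.

```python
import re
from collections import Counter

def normalize_phone(feature: str) -> str:
    """Reduce '(831) 899-8300' to '8318998300', keeping a leading '+'."""
    plus = "+" if feature.strip().startswith("+") else ""
    return plus + re.sub(r"\D", "", feature)

def phone_histogram(features):
    """Count normalized numbers, most frequent first."""
    return Counter(normalize_phone(f) for f in features).most_common()

sample = ["(831) 899-8300", "831-899-8300", "(831) 373-5555"]
for number, count in phone_histogram(sample):
    print(f"n={count}\t{number}")
```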

Finally, in performing digital media triage on the disk image, investigators would like

to know what speciﬁc URLs have been visited and what search terms the user has been

using. The set of URL ﬁles provided as output provide insight into this information.

First, url.txt contains the URLs found on the disk. The following is an excerpt from

that ﬁle (note that the UTF-16 formatted information is escaped):

175165385 http://www.unicode.org/reports/tr25/#_TocDelimiters E and U+23DF:\x0A#
http://www.unicode.org/reports/tr25/#_TocDelimiters\x0A\x5Cu23DE = \x5CuE13B

159045397 h\x00t\x00t\x00p\x00:\x00/\x00/\x00w\x00w\x00w\x00.\x00d\x00o\x00w
\x00n\x00l\x00o\x00a\x00d\x00.\x00w\x00i\x00n\x00d\x00o\x00w\x00s\x00u\x00p
\x00d\x00a\x00t\x00e\x00.\x00c\x00o\x00m\x00/\x00m\x00s\x00d\x00o\x00w\x00n\x00l\x00o
\x00a\x00d\x00/\x00u\x00p\x00d\x00a\x00t\x00e\x00/\x00s\x00o\x00f\x00t\x00w\x00a\x00r
\x00e\x00/\x00s\x00e\x00c\x00u\x00/\x002\x000\x000\x008\x00/\x000\x006\x00/\x00w\x00i
\x00n\x00d\x00o\x00w\x00s\x00x\x00p\x00-\x00k\x00b\x009\x005\x001\x003\x007\x006\x00-
\x00v\x002\x00-\x00x\x008\x006\x00-\x00e\x00n\x00u\x00_\x00e\x009\x00b\x006\x008\x00c
\x005\x00e\x006\x003\x00a\x00c\x00b\x005\x007\x008\x006\x00a\x000\x005\x00b\x005\x003
\x00b\x004\x00 \xB4\xF4\x82\x94C\xE3\xB6C\xB1p\x9Ae\xBC\x82,wh\x00t\x00t\x00p\x00:
\x00/\x00/\x00w\x00w\x00w\x00.\x00d\x00o\x00w\x00n\x00l\x00o\x00a\x00d\x00.\x00w
\x00i\x00n\x00d\x00o\x00w\x00s\x00u\x00p\x00d\x00a\x00t\x00e\x00.\x00c\x00o
\x00m\x00/\x00m\x00s\x00d\x00o\x00w\x00n\x00l\x00o\x00a\x00d\x00/\x00u\x00p\x00d
\x00a\x00t\x00e\x00/\x00s\x00o\x00f\x00t\x00w\x00a\x00r\x00e\x00/\x00s\x00e\x00c\x00u
\x00/\x002\x000\x000\x008\x00/\x000\x006\x00/\x00w\x00i\x00n\x00d\x00o\x00w\x00s\x00x
\x00p\x00-\x00k\x00b\x009\x005\x001\x003\x007\x006\x00-\x00v\x002\x00-\x00x\x008\x006
\x00-\x00e\x00n\x00u\x00_\x00e\x009\x00b\x006\x008\x00c\x005\x00e\x006\x003\x00a\x00c
\x00b\x005\x007\x008\x006\x00a\x000\x005\x00b\x005\x003\x00b\x004\x003\x003\x002\x004
\x006\x005\x00d\x00e\x00

175197993 http://www.uspto.gov/patft/index.html enter>\x0A<a href="http://www.
uspto.gov/patft/index.html"><img src="/net

175198500 http://www.uspto.gov/patft/help/help.htm e></a>\x0A<A HREF="http://www.
uspto.gov/patft/help/help.htm"><IMG BORDER="0

The file url_histogram.txt provides the histogram of the potential URLs. In that file,

UTF-16 formatted text is converted to UTF-8. Note that not all URLs contained in the

histogram file are accurate, and not all of them were actually typed into a web browser.

The following are lines taken from that ﬁle:

n=3922 http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul (utf16=2609)
n=859 http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xu (utf16=858)
...
n=2 http://math.nist.gov/~KRemington/papers/europvm.ps
n=2 http://math.nist.gov/~MDonahue/pubs/nan.ps.gz
n=2 http://math.nist.gov/~RBoisvert/publications/ADL95.ps.gz
n=2 http://math.nist.gov/~RBoisvert/publications/IMACS97.ps.gz

Because the histogram file converts the UTF-16 formatted text to UTF-8, the histogram

ﬁle is more human readable than the url.txt ﬁle alone. The ﬁles url_facebook.txt,

url_microsoft-live.txt, url_services.txt and url_searches.txt all extract specific types of

information from URLs. The most useful for digital media triage is likely the ﬁle

url_searches.txt because it shows a histogram of searches from the disk image. Searches

frequently convey intent. The following is an excerpt from that ﬁle:

n=60 1
n=53 exotic+car+dealer
n=41 ford+car+dealer
n=34 2009+Shelby
n=25 steganography
n=23 General+Electric
n=23 time+travel
n=19 steganography+tool+free
n=19 vacation+packages
n=16 firefox
n=16 quicktime
n=14 7zip
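The search terms are still URL-encoded, with "+" standing in for a space. They can be made readable with the Python standard library, as in the sketch below. It assumes tab-separated histogram lines of the form n=COUNT followed by the term; the function name is hypothetical.

```python
from urllib.parse import unquote_plus

def readable_search_terms(lines):
    """Return (count, decoded search term) pairs from url_searches.txt lines."""
    results = []
    for line in lines:
        if not line.startswith("n="):
            continue  # skip comment lines and anything unexpected
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        results.append((int(fields[0][2:]), unquote_plus(fields[1])))
    return results

for count, term in readable_search_terms(["n=53\texotic+car+dealer"]):
    print(count, term)  # prints: 53 exotic car dealer
```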

The ﬁle ccn.txt provides credit card numbers that have been found on the disk. Based

on the scenario set-up for this disk image, credit card numbers are not necessarily highly

relevant to this investigation. However, bulk_extractor did find some strings on this

disk image that look like credit card numbers but are not; this is common

behavior, so it is worth examining the file here to demonstrate how it can be used in

other investigations. The credit card number ﬁnder considers a pattern of digits and

uses the Luhn checksum algorithm and the distribution of digits and the local context to

identify potential credit card numbers. It is important to note that there are frequently

false positives. The ﬁrst few lines of the ccn.txt ﬁle for this disk image look like the

following:

88284672-GZIP-177427 5273347458642687 734B55CD5\x0A5273347458642687\x0AC0841BAFA1B4C28
4814857216-GZIP-793 4015751530102097 ebO.d=0;ebO.rnd=4015751530102097;ebO.title="";eb
4909069775 6543210123456788 \x0Addadd7540 add '6543210123456788' 0.499999999
4909069811 6543210123456788 499999999 -> '6543210123456788' Inexact Rounde
4909069861 6543210123456788 \x0Addadd7541 add '6543210123456788' 0.5
4909069897 6543210123456788 5 -> '6543210123456788' Inexact Rounde
4909069947 6543210123456788 \x0Addadd7542 add '6543210123456788' 0.500000001
5304221350 5678901234560000 +4 -> 5678901234560000\x0D\x0Addshi052 shift
5612375618 6543210123456788 \x0D\x0Aaddx6240 add '6543210123456788' 0.499999999
5612375654 6543210123456788 499999999 -> '6543210123456788' Inexact Rounde
5612375703 6543210123456788 \x0D\x0Aaddx6241 add '6543210123456788' 0.5
5612375739 6543210123456788 5 -> '6543210123456788' Inexact Rounde
5612375788 6543210123456788 \x0D\x0Aaddx6242 add '6543210123456788' 0.500000001
5612715901 5700122152274696 div4036 divide 5700122152274696 5700122152251

In the above example, ‘5273347458642687’ looks like it could be a valid credit card

number from the context (\x0A is a new line). The number ‘4015751530102097’ looks

like a random number in a piece of JavaScript. Note that both of those numbers were

compressed; the offset indicates they were found in GZIP streams (shown as a number

followed by ‘-GZIP’). The numbers whose context includes “Inexact Rounde” are actually

from Python source code and not credit card numbers at all. Again, the ccn.txt file tends

to alert on a lot of false positives.
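The Luhn checksum mentioned above is simple to reproduce, and doing so shows why false positives are common: any digit string with the right check digit passes. The following is a generic implementation of the checksum only. It is a sketch, not the bulk_extractor scanner, which also considers the digit distribution and the local context.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0

print(luhn_valid("6543210123456788"))  # prints: True
```

The number 6543210123456788 from the test data above passes the checksum even though it is not a credit card number, which is why the surrounding context still has to be reviewed by an investigator.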

The ccn_track2.txt file is empty for this disk image, but it is also

useful for credit card fraud and identity theft investigations. It contains any credit card

track 2 information found on the disk image.

Using the files produced by bulk_extractor described above, an investigator can rapidly

review a disk image for information that is relevant to an investigation and

find actionable intelligence.

9.3 Analyzing Imagery

The scenario described in the M57 Patents data is not necessarily relevant to an imagery

investigation. However, there is imagery information on the disk. We use that informa-

tion here to demonstrate how imagery information can be analyzed by an investigator

using bulk_extractor.

The ﬁle in the output directory, jpeg.txt, lists all JPEGs that were found on the disk

whether they were carved or not. bulk_extractor was run with default values meaning

that only encoded JPEGs were carved. The following excerpt from the JPEG ﬁle shows

information about JPEGs found on the disk image:

54798824 ../Output/charlie-2009-12-11/jpeg/54783488.jpg <fileobject><filename>
../Output/charlie-2009-12-11/jpeg/54783488.jpg</filename><filesize>15336</filesize>
<hashdigest type='md5'>13823ce7c21587d31f6eb4474612e660
</hashdigest></fileobject>

The JPEG described above was not carved because it was not encoded. However, the ﬁrst

section “../Output/charlie-2009-12-11/jpeg/54783488.jpg” shows where the ﬁle would be

found in the output directories if it had been carved. The next section of information

also gives the ﬁle size, the hash type (in this case ‘md5’) and the hash value of the ﬁle

(in this case 13823ce7c21587d31f6eb4474612e660). Note that this may not match the

hash value of the ﬁle in the original ﬁle system as bulk_extractor cannot properly carve

fragmented ﬁles.
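The XML portion of a jpeg.txt entry can be read with a standard XML parser, and the recorded hash can then be compared against a carved file. The sketch below assumes the element names shown in the excerpt (fileobject, filename, filesize, hashdigest); the helper names are illustrative.

```python
import hashlib
import xml.etree.ElementTree as ET

def parse_fileobject(xml_text: str) -> dict:
    """Extract filename, size and hash from a <fileobject> element."""
    root = ET.fromstring(xml_text)
    digest = root.find("hashdigest")
    return {
        "filename": root.findtext("filename").strip(),
        "filesize": int(root.findtext("filesize")),
        "hash_type": digest.get("type"),
        "hash_value": digest.text.strip().lower(),
    }

def matches_recorded_hash(data: bytes, info: dict) -> bool:
    """Check carved file contents against the hash recorded in jpeg.txt."""
    return hashlib.new(info["hash_type"], data).hexdigest() == info["hash_value"]

entry = ("<fileobject><filename>../Output/charlie-2009-12-11/jpeg/54783488.jpg"
         "</filename><filesize>15336</filesize><hashdigest type='md5'>"
         "13823ce7c21587d31f6eb4474612e660</hashdigest></fileobject>")
info = parse_fileobject(entry)
print(info["filesize"], info["hash_type"])  # prints: 15336 md5
```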

Information about encoded JPEGs can also be found in the jpeg.txt ﬁle. The following

excerpt shows a description of a JPEG found in a GZIP format on the disk:

3771686400-GZIP-8332 ../Output/charlie-2009-12-11/jpeg/3771686400-GZIP-0.jpg
<fileobject><filename>../Output/charlie-2009-12-11/jpeg/3771686400-GZIP-0.jpg
</filename><filesize>8332</filesize><hashdigest type='md5'>
5b77035c983b04996774370f735ea72a</hashdigest></fileobject>

The JPEG described above was carved and can be found in the /jpeg output directory

in the ﬁle named 3771686400-GZIP-0.jpg. The ﬁle also gives information about the

file size, hash type and hash value. That file appears in the directory listing below

along with all of the encoded JPEGs that were found on the disk image and were carved.

The contents of the /jpeg directory are as follows:

10037939712-GZIP-0.jpg 5324841013-ZIP-0.jpg

10117679783-ZIP-0.jpg 6039195136-GZIP-0.jpg

3761630720-GZIP-0.jpg 6039215616-GZIP-0.jpg

3764534784-GZIP-0.jpg 6039223808-GZIP-0.jpg

3771686400-GZIP-0.jpg 6039232000-GZIP-0.jpg

3771706880-GZIP-0.jpg 6039244288-GZIP-0.jpg

3771715072-GZIP-0.jpg 6039301632-GZIP-0.jpg

3771723264-GZIP-0.jpg 6039318016-GZIP-0.jpg

3771735552-GZIP-0.jpg 6883925636-ZIP-0.jpg

3771792896-GZIP-0.jpg 6884040324-ZIP-0.jpg

3771809280-GZIP-0.jpg 6884056948-ZIP-0.jpg

3771833856-GZIP-0.jpg 7276064256-GZIP-0.jpg

3771858432-GZIP-0.jpg 7279128576-GZIP-0.jpg

429788672-GZIP-0.jpg 8877243047-ZIP-0.jpg

5310405287-ZIP-0.jpg 9948655104-GZIP-0.jpg

All of these JPEG ﬁles can be viewed and used by investigators. The ﬁlename is the

forensic path of where the JPEG was found. The ﬁle 3771686400-GZIP-0.jpg mentioned

above is shown in Figure 18.

Figure 18: A JPEG carved from encoded data on the M57 Patents disk image

9.4 Password Cracking

The wordlist scanner generates a list of all the words found on the disk that are between 6 and

14 characters long. The word list that is generated by the scanner can be very useful in

determining combinations of words to use for password cracking. The scanner is disabled

by default because it slows down the bulk_extractor run signiﬁcantly. To show the word

list in this example, bulk_extractor was run again on the M57 Patents scenario data

with the wordlist scanner enabled. Running bulk_extractor on the command line with

it enabled produces the following output:

C:\be\>bulk_extractor -e wordlist -o ../Output/charlie-wordlist charlie-2009-12-11.E01

bulk_extractor version: 1.4.0

Input file: charlie-2009-12-11.E01

Output directory: ../Output/charlie-wordlist

Disk Size: 10239860736

Threads: 4

12:58:46 Offset 67MB (0.66%) Done in 1:14:55 at 14:13:41

...

14:03:24 Offset 10217MB (99.78%) Done in 0:00:08 at 14:03:32

All data are read; waiting for threads to finish...

Time elapsed waiting for 4 threads to finish:

(timeout in 60 min.)

Time elapsed waiting for 4 threads to finish:

8 sec (timeout in 59 min 52 sec.)

Thread 0: Processing 10200547328

Thread 1: Processing 10234101760

Thread 2: Processing 10183770112

Thread 3: Processing 10217324544

Time elapsed waiting for 1 thread to finish:

14 sec (timeout in 59 min 46 sec.)

Thread 3: Processing 10217324544

All Threads Finished!

Producer time spent waiting: 3627.92 sec.

Average consumer time spent waiting: 4.1518 sec.

*******************************************

bulk_extractor is probably CPU bound.

Run on a computer with more cores

to get better performance.

*******************************************

Phase 2. Shutting down scanners

Phase 3. Uniquifying and recombining wordlist

Phase 3. Creating Histograms

ccn histogram... ccn_track2 histogram... domain histogram...

email histogram... ether histogram... find histogram...

ip histogram... lightgrep histogram... tcp histogram...

telephone histogram... url histogram... url microsoft-live...

url services... url facebook-address... url facebook-id

url searches...Elapsed time: 4065.09 sec.

Overall performance: 2.51898 MBytes/sec

Total email features found: 152775

Note that it took 3991.71 seconds to run bulk_extractor without the wordlist scanner

enabled and, in this case, it took 4065.09 seconds with wordlist enabled. The new

output directory contains a ﬁle called wordlist.txt. That ﬁle has both ﬁlenames and

words in it. The following is an excerpt from that ﬁle:

50497556 usemodem.jpg
50497624 usemsn.jpg
50497692 usemsnnow.jpg
50497760 welcome.htm
50497828 whereNow.htm
50497896 xmlutil.js
50497987 ^Photoshop
50498009 Resolution
50498050 Global
50498057 Lighting
50498090 Global
50498097 Altitude
50498153 Copyright
50498181 Japanese
50498229 Halftone
50498238 Settings
50498335 Transfer

The wordlist contains ALL words found on the disk between 6 and 14 characters long.

Automated programs can be used to generate passwords from combinations of these

words. The wordlist scanner also generates a split wordlist containing the same words

found in the wordlist.txt ﬁle with all words deduplicated, sorted by size and alpha-

betized. The following is an excerpt from the ﬁle wordlist_split_000.txt generated

from the disk image:

concluded|1
concluder/2
concluder/M
concluir/XQ
conclurai/x
conclusion,
conclusion.
conclusione
conclusions
conclusive,

The split wordlist is the ﬁle that is typically fed to password cracking software.
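As a rough illustration of how candidate passwords can be built from combinations of these words, the sketch below joins deduplicated words in every order up to a small limit. It is a toy generator under stated assumptions (in-memory word list, hypothetical function name) and is no substitute for dedicated password cracking software, which applies far richer rules.

```python
from itertools import permutations

def candidate_passwords(words, max_words=2, separators=("",)):
    """Yield single words, then ordered combinations joined by each separator."""
    unique = sorted(set(words))
    for n in range(1, max_words + 1):
        # A separator only matters once two or more words are joined.
        seps = separators if n > 1 else ("",)
        for combo in permutations(unique, n):
            for sep in seps:
                yield sep.join(combo)

for candidate in candidate_passwords(["Global", "Lighting", "Global"]):
    print(candidate)
```

The number of candidates grows quickly with the size of the word list, so in practice the list would be filtered (for example, to words near the user's own files) before combinations are generated.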

9.5 Post Processing

The programs identify_filenames.py and bulk_diff.py can provide further insight

into the data contained on the disk image. The identify_filenames.py program can

be used on the feature ﬁles produced from the bulk_extractor run to show the ﬁle lo-

cation of the features that were found. Running the program on all of the feature ﬁles

produced by the bulk_extractor run produces the following output (where charlie-2009-

12-11 is the bulk_extractor output directory and charlieAnnotatedOutput is where all

the annotated ﬁles are written):

C:\be\>identify_filenames.py --all charlie-2009-12-11 charlieAnnotatedOutput

Reading file map by running fiwalk on charlie-2009-12-11.E01

Processed 1000 fileobjects in DFXML file

Processed 2000 fileobjects in DFXML file

...

Processed 39000 fileobjects in DFXML file

Processed 40000 fileobjects in DFXML file

feature_file: aes_keys.txt

feature_file: ccn.txt

feature_file: domain.txt

feature_file: email.txt

feature_file: ether.txt

feature_file: exif.txt

feature_file: ip.txt

feature_file: jpeg.txt

feature_file: json.txt

feature_file: rar.txt

feature_file: rfc822.txt

feature_file: telephone.txt

feature_file: url.txt

feature_file: windirs.txt

feature_file: winpe.txt

feature_file: winprefetch.txt

feature_file: zip.txt

******************************

Total Features: 754038

Total Located: 754038

******************************

Note that, in this example, fiwalk is installed on the computer running the

identify_filenames.py program. The directory charlieAnnotatedOutput contains all of the

annotated feature ﬁles, showing the ﬁle location of the features. The directory contents

are as follows:

annotated_aes_keys.txt annotated_rar.txt

annotated_ccn.txt annotated_rfc822.txt

annotated_domain.txt annotated_telephone.txt

annotated_email.txt annotated_url.txt

annotated_ether.txt annotated_windirs.txt

annotated_exif.txt annotated_winpe.txt

annotated_ip.txt annotated_winprefetch.txt

annotated_jpeg.txt annotated_zip.txt

annotated_json.txt

The annotated ﬁles display the feature with the ﬁle in which the feature was found (where

it was identiﬁed by the program). The following is an excerpt from the annotated_email.txt

ﬁle:

27767966 pat@m57.biz m: "Pat McGoo" <pat@m57.biz>\x0D\x0ATo: <charlie@ Documents
and Settings/Charlie/Application Data/Thunderbird/Profiles/4zy34x9h.default/Mail/Local
Folders/Inbox dcb794e350bd198c4279614eae6c8b76

27767985 charlie@m57.biz @m57.biz>\x0D\x0ATo: <charlie@m57.biz>,\x0D\x0A\x09<jo@m
57.biz Documents and Settings/Charlie/Application Data/Thunderbird/Profiles/4zy34x9h.
default/Mail/Local Folders/Inbox dcb794e350bd198c4279614eae6c8b76

27768022 terry@m57.biz jo@m57.biz>,\x0D\x0A\x09<terry@m57.biz>\x0D\x0AX-ASG-Orig-
Su Documents and Settings/Charlie/Application Data/Thunderbird/Profiles/4zy34x9h.def
ault/Mail/Local Folders/Inbox dcb794e350bd198c4279614eae6c8b76

The email address "pat@m57.biz" was found in the file Documents and Settings/Charlie/

Application Data/Thunderbird/Profiles/4zy34x9h.default/Mail/Local Folders/Inbox

and investigators can refer to that location on the disk image to view the full text.

The program bulk_diff.py shows the difference between two bulk_extractor runs. In

this case, we used a disk image from the same user ("charlie") taken almost a month be-

fore the disk image that has been used throughout this example. The disk image we have

been using throughout this example is dated December 11, 2009. The older disk image

we downloaded for comparison is dated November 17, 2009. The earlier disk image data

is stored in a ﬁle named charlie-2009-11-17.E01 and can be downloaded from http://

digitalcorpora.org/corp/nps/scenarios/2009-m57-patents/drives-redacted/.

After running bulk_extractor using the earlier disk image, we ran the program bulk_diff.py

on the output of that disk image and on the output of the charlie-2009-12-11.E01

run. To run, we typed the following, piping the output of the program to a ﬁle called

bulkdiffoutput.txt:

 bulk_diff.py /charlie-2009-11-17 /charlie-2009-12-11 > bulkdiffoutput.txt

The output shows the feature differences between the two disk images. The following is an excerpt

of that output:

domain_histogram.txt:
# in PRE   # in POST              Value
------------------------------------------------------------
     401       4,470      4,069   patft.uspto.gov
     181       3,151      2,970   www.wipo.int
     295       3,157      2,862   www.google.com
       0       2,537      2,537   l.yimg.com

The output speciﬁcally shows the diﬀerences in the histograms between the two runs

across all of the histogram ﬁles that were created. The excerpt above shows that "charlie"

(the disk user) visited the domain "patft.uspto.gov" frequently between the time the two

images were recorded. It was found 4,069 more times in the later disk image than in

the one taken earlier. It also shows that the domain "l.yimg.com" was not found on the

earlier disk image but was found 2,537 times on the later disk image. The results are

sorted by the amount of the diﬀerence. This means that features that are most diﬀerent

appear ﬁrst. This can be very helpful because those features generally give the most

insight into the disk user's activity over that period of time.
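The comparison can be pictured as a subtraction of two histograms followed by a sort on the size of the change. The sketch below is not the bulk_diff.py implementation; it only illustrates the idea, assuming tab-separated histogram lines of the form n=COUNT followed by the value, with hypothetical function names.

```python
def load_counts(lines):
    """Map each histogram value to its count (lines look like 'n=401<TAB>value')."""
    counts = {}
    for line in lines:
        if line.startswith("n="):
            fields = line.rstrip("\n").split("\t")
            counts[fields[1]] = int(fields[0][2:])
    return counts

def histogram_diff(pre, post):
    """Return (pre, post, difference, value) rows, largest change first."""
    rows = []
    for value in set(pre) | set(post):
        before, after = pre.get(value, 0), post.get(value, 0)
        if before != after:
            rows.append((before, after, after - before, value))
    rows.sort(key=lambda row: (-abs(row[2]), row[3]))
    return rows

pre = load_counts(["n=401\tpatft.uspto.gov"])
post = load_counts(["n=4470\tpatft.uspto.gov", "n=2537\tl.yimg.com"])
for row in histogram_diff(pre, post):
    print(*row, sep="\t")
```

With the counts from the excerpt above, the first row is patft.uspto.gov with a difference of 4,069, followed by l.yimg.com, which is absent from the earlier image.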

10 NPS DOMEX Users Image

NPS Test Disk Images are a set of disk images that have been created for testing com-

puter forensic tools. These images are free of non-public Personally Identiﬁable Infor-

mation (PII) and are approved for release to the general public. The NPS-created data

in the images is public domain and free of any copyright restriction; the images may

contain some copyrighted data that was made available by the copyright holder. These

copyrights, where known, are noted in the ﬁles themselves[?].

The NPS DOMEX users image is a disk image of a Windows XP SP3 system that has two

users, domexuser1 and domexuser2, who communicate with a third user (domexuser3)

via IM and email. The data is available for download at http://digitalcorpora.org/

corp/nps/drives/nps-2009-domexusers/. For this example, we use the ﬁle nps-2009-domexusers.E01

which includes the full system including the Microsoft Windows executables. Running

bulk_extractor on the command line produces the following output:

C:\be\>bulk_extractor -o ../Output/nps-2009-domexusers nps-2009-domexusers.E01

bulk_extractor version: 1.4.0

Input file: nps-2009-domexusers.E01

Output directory: ../Output/nps-2009-domexusers2

Disk Size: 42949672960

Threads: 4

16:50:53 Offset 67MB (0.16%) Done in 4:23:43 at 21:14:36

16:51:19 Offset 150MB (0.35%) Done in 3:58:37 at 20:49:56

...

16:13:12 Offset 42849MB (99.77%) Done in 0:00:11 at 16:13:23

16:13:13 Offset 42932MB (99.96%) Done in 0:00:01 at 16:13:14

All data are read; waiting for threads to finish...

Time elapsed waiting for 3 threads to finish:

(timeout in 60 min.)

Time elapsed waiting for 1 thread to finish:

6 sec (timeout in 59 min 54 sec.)

Thread 0: Processing 42932895744

Time elapsed waiting for 1 thread to finish:

12 sec (timeout in 59 min 48 sec.)

Thread 0: Processing 42932895744

All Threads Finished!

Producer time spent waiting: 4254.07 sec.

Average consumer time spent waiting: 89.309 sec.

*******************************************

bulk_extractor is probably CPU bound.

Run on a computer with more cores

to get better performance.

*******************************************

Phase 2. Shutting down scanners

Phase 3. Creating Histograms

ccn histogram... ccn_track2 histogram... domain histogram...

email histogram... ether histogram... find histogram...

ip histogram... lightgrep histogram... tcp histogram...

telephone histogram... url histogram... url microsoft-live...

url services... url facebook-address... url facebook-id...

url searches...Elapsed time: 4846.74 sec.

Overall performance: 8.86156 MBytes/sec

Total email features found: 8774

All of the results from the bulk_extractor run are stored in the output directory nps-

2009-domex. The contents of that directory after the run are as follows:

1 aes_keys.txt 1 kml.txt

0 alerts.txt 0 lightgrep.txt

1 ccn.txt 0 lightgrep_histogram.txt

1 ccn_histogram.txt 4 packets.pcap

0 ccn_track2.txt 1 rar.txt

0 ccn_track2_histogram.txt 424 report.xml

7364 domain.txt 536 rfc822.txt

44 domain_histogram.txt 1 tcp.txt

0 elf.txt 1 tcp_histogram.txt

1528 email.txt 48 telephone.txt

32 email_histogram.txt 4 telephone_histogram.txt

1 ether.txt 51888 url.txt

1 ether_histogram.txt 0 url_facebook-address.txt

152 exif.txt 0 url_facebook-id.txt

0 find.txt 1240 url_histogram.txt

0 find_histogram.txt 0 url_microsoft-live.txt

0 gps.txt 4 url_searches.txt

0 hex.txt 32 url_services.txt

4 ip.txt 0 vcard.txt

1 ip_histogram.txt 15228 windirs.txt

20 jpeg/ 26516 winpe.txt

380 jpeg.txt 1312 winprefetch.txt

316 json.txt 1956 zip.txt

For this example, we will focus on the ﬁles that are most important to malware inves-

tigations and cyber investigations, showing how those ﬁles can be interpreted and used

by investigators.

10.1 Malware Investigations

In a malware investigation, investigators are looking for information about program-

matic intrusions. In this example, we examine all ﬁles that provide information about

executables, Windows directory entries and information downloaded from web-based

applications. We recommend that "-e xor" be enabled for malware investigations.

The ﬁle windirs.txt provides information about FAT32 and NTFS directories. It con-

tains most of the disk entries. The following is an excerpt showing one line from the

ﬁle:

281954816 A0001801.dll <fileobject
src='mft'><atime>2008-10-21T00:45:51Z</atime><attr_flags>8224</attr_flags>
<filename>A0001801.dll</filename><filesize>1000000000000</filesize><filesize_alloc>
0</filesize_alloc><lsn>123437339</lsn><mtime>2008-10-21T00:45:51Z</mtime>
</fileobject>

The line from the ﬁle gives information about the disk entry A0001801.dll. It provides

some data about the ﬁle including the ﬁle size, ﬁle creation time (ctime) and time of last

ﬁle modiﬁcation (mtime). It is important to note that the error rate for FAT32 entries

is high and those entries should be ignored if the drive is not FAT.
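
A line such as the one above can be split into its parts with a few lines of Python. This is only a minimal sketch: it assumes the usual tab-separated feature-file layout (offset, feature, context), and the function name parse_windirs_line is made up for illustration. The sample line is abridged from the excerpt.

```python
import xml.etree.ElementTree as ET

def parse_windirs_line(line):
    """Split one windirs.txt line into (offset, name, metadata dict)."""
    # Assumption: fields are tab-separated as offset, feature, context.
    offset, name, xml_text = line.rstrip("\n").split("\t", 2)
    root = ET.fromstring(xml_text)
    # Each child element (atime, filename, mtime, ...) becomes a dict entry.
    meta = {child.tag: child.text for child in root}
    meta["src"] = root.get("src")
    return offset, name, meta

# Abridged version of the excerpt, with tabs between the three fields.
sample = (
    "281954816\tA0001801.dll\t"
    "<fileobject src='mft'><atime>2008-10-21T00:45:51Z</atime>"
    "<filename>A0001801.dll</filename>"
    "<mtime>2008-10-21T00:45:51Z</mtime></fileobject>"
)
offset, name, meta = parse_windirs_line(sample)
print(offset, name, meta["src"], meta["mtime"])
```

The offset is kept as a string because feature-file offsets can be forensic paths rather than plain integers.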

For investigations on Windows disk images, such as nps-2009-domexusers, the file
winpe.txt lists Windows executables in the Portable Executable (PE) format. These
file entries contain very long lines. The following is one line from the file:

42753536 87d84154e7789013878c6340a4d2d445 <PE><FileHeader Machine=
"IMAGE_FILE_MACHINE_I386" NumberOfSections="3" TimeDateStamp="1208131815"
PointerToSymbolTable="0" NumberOfSymbols="0" SizeOfOptionalHeader="224">
<Characteristics>
<IMAGE_FILE_LINE_NUMS_STRIPPED/><IMAGE_FILE_LOCAL_SYMS_STRIPPED/>
<IMAGE_FILE_32BIT_MACHINE/><IMAGE_FILE_DLL/></Characteristics>
</FileHeader><OptionalHeaderStandard Magic="PE32" MajorLinkerVersion="7"
MinorLinkerVersion="10" SizeOfCode="512" SizeOfInitializedData="1536"
SizeOfUninitializedData="0" AddressOfEntryPoint="0x1046" BaseOfCode=
"0x1000"/><OptionalHeaderWindows ImageBase="0x6c6c0000" SectionAlignment
="1000" FileAlignment="200" MajorOperatingSystemVersion="5"
MinorOperatingSystemVersion="1" MajorImageVersion="5"
MinorImageVersion="1" MajorSubsystemVersion="4" MinorSubsystemVersion="0"
Win32VersionValue="0" SizeOfImage="4000" SizeOfHeaders="400" CheckSum="
0x7485" SubSystem="" SizeOfStackReserve="40000" SizeOfStackCommit="1000"
SizeOfHeapReserve="100000" SizeOfHeapCommit="1000" LoaderFlags="0"
NumberOfRvaAndSizes="10"><DllCharacteristics>
<IMAGE_DLL_CHARACTERISTICS_NO_SEH/></DllCharacteristics>
</OptionalHeaderWindows><Sections><SectionHeader Name=".text" VirtualSize
="be" VirtualAddress="1000" SizeOfRawData="200" PointerToRawData="400"
PointerToRelocations="0" PointerToLinenumbers="0"><Characteristics>
<IMAGE_SCN_CNT_CODE/>
<IMAGE_SCN_MEM_READ/></Characteristics></SectionHeader><SectionHeader
Name=".rsrc" VirtualSize="400" VirtualAddress="2000" SizeOfRawData="400"
PointerToRawData="600" PointerToRelocations="0" PointerToLinenumbers="0"
><Characteristics><IMAGE_SCN_CNT_INITIALIZED_DATA/>
<IMAGE_SCN_MEM_READ/></Characteristics></SectionHeader>
<SectionHeader Name=".reloc" VirtualSize="8" VirtualAddress="3000"
SizeOfRawData="200" PointerToRawData="a00" PointerToRelocations="0"
PointerToLinenumbers="0"><Characteristics>
<IMAGE_SCN_MEM_DISCARDABLE/><IMAGE_SCN_MEM_READ/></Characteristics>
</SectionHeader></Sections></PE>

The first number is the offset, which tells you where to find the file. Most
executables are not fragmented. The second is the MD5 hash of the first 4K of the
file, which can be used to deduplicate the file and look it up in a hash database.
Finally, the bulk of the information is contained in the <PE> XML block, which
breaks out all of the Windows PE header information: the file header, the
characteristics of the file, the Windows optional header and the section headers.
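
The <PE> block can be read with any XML parser. The sketch below is an illustration only: it assumes tab-separated fields (offset, MD5, XML), the function name summarize_pe is hypothetical, and the sample line is heavily abridged from the excerpt above.

```python
import xml.etree.ElementTree as ET

def summarize_pe(line):
    """Return a small summary of one winpe.txt line."""
    # Assumption: fields are tab-separated as offset, MD5, <PE> XML.
    offset, md5, xml_text = line.rstrip("\n").split("\t", 2)
    pe = ET.fromstring(xml_text)
    header = pe.find("FileHeader")
    # Characteristics are stored as empty child elements of FileHeader.
    flags = [flag.tag for flag in header.find("Characteristics")]
    sections = [s.get("Name") for s in pe.iter("SectionHeader")]
    return {
        "offset": offset,
        "md5": md5,
        "machine": header.get("Machine"),
        "is_dll": "IMAGE_FILE_DLL" in flags,
        "sections": sections,
    }

# Abridged sample: most attributes from the real line are omitted.
sample = (
    "42753536\t87d84154e7789013878c6340a4d2d445\t"
    '<PE><FileHeader Machine="IMAGE_FILE_MACHINE_I386" NumberOfSections="3">'
    "<Characteristics><IMAGE_FILE_32BIT_MACHINE/><IMAGE_FILE_DLL/>"
    "</Characteristics></FileHeader><Sections>"
    '<SectionHeader Name=".text"/><SectionHeader Name=".rsrc"/>'
    '<SectionHeader Name=".reloc"/></Sections></PE>'
)
info = summarize_pe(sample)
print(info["machine"], info["is_dll"], info["sections"])
```

Collecting the md5 values from every line into a set is one simple way to deduplicate executables before looking them up in a hash database.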

The file winprefetch.txt contains information from carved Windows Prefetch files
that were discovered anywhere on the drive. bulk_extractor will carve Prefetch
files from unallocated space. This is extremely useful because Prefetch files are
frequently deleted. A single line in the prefetch output file is also very long.
The following is only the beginning of one line from the file:

55758336 MSIEXEC.EXE <prefetch><os>Windows
XP</os><filename>MSIEXEC.EXE</filename><header_size>152</header_size>
<file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CNTDLL.DLL
</file><file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNEL32.DLL
...

Printing the line out here would cover almost two pages. It includes a lot of information

about the Prefetch ﬁle including the name of the executable, the name of the DLLs,

the directory of DLLs, the atime, the number of runs, the serial number, and the ctime.

The Prefetch ﬁle is searchable and useable by investigators searching for EXEs or DLLs

related to a malware investigation.
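
Searching the Prefetch output for DLL paths is easier once the \x5C escapes are turned back into backslashes. The following sketch is one way to do that; the function names are hypothetical, and the sample line is shortened and closed with a </prefetch> tag that the truncated excerpt above does not show.

```python
import re

# Matches escapes such as \x5C (an escaped backslash) in feature files.
HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")

def unescape(text):
    """Replace each \\xNN escape with the character it encodes."""
    return HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

def prefetch_files(line):
    """Return the decoded paths from every <file> element in a line."""
    raw_paths = re.findall(r"<file>(.*?)</file>", line, re.S)
    return [unescape(path.strip()) for path in raw_paths]

# Shortened sample based on the excerpt; tabs between fields are assumed.
sample = (
    "55758336\tMSIEXEC.EXE\t<prefetch><os>Windows XP</os>"
    "<filename>MSIEXEC.EXE</filename>"
    r"<file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CNTDLL.DLL</file>"
    r"<file>\x5CDEVICE\x5CHARDDISKVOLUME1\x5CWINDOWS\x5CSYSTEM32\x5CKERNEL32.DLL</file>"
    "</prefetch>"
)
files = prefetch_files(sample)
for path in files:
    print(path)
```

A regular expression is used instead of an XML parser here so that a truncated line can still be searched.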

JSON is the JavaScript Object Notation (used in Facebook, etc). The ﬁle json.txt

provides the oﬀset, JSON and MD5 hash of the JSON information found on the disk

image. bulk_extractor is great at ﬁnding JSON in compressed streams and HIBER ﬁles.

The following are a few lines from the JSON ﬁle:

62836579 {"ask":["Ask"],"delicious":["Del.icio.us"],"digg":["Digg"],"email":["Email"],
"favorites":["Favorites"],"facebook":["Facebook"],"fark":["Fark"],"furl":["Furl"],
"google":["Google"],"live":["Live"],"myspace":["MySpace"],"myweb":["Yahoo MyWeb"
,"yahoo-myweb"],"newsvine":["Newsvine"],"reddit":["Reddit"],"sk*rt":["Sk*rt","skrt"],
"slashdot":["Slashdot"],"stumbleupon":["StumbleUpon","su"],"stylehive":["Stylehive"],
"tailrank":["Tailrank","tailrank2"],"technorati":["Technorati"],"thisnext":
["ThisNext"],"twitter":["Twitter"],"ballhype":["BallHype"],"yardbarker":
["Yardbarker"],"kaboodle":["Kaboodle"],"more":["More..."]}
26d3b8c5010f4d39250dab3a1c1b839e
62842797 ["6jb4","3j1d","v1me","gu83","uefc","fq1j","r5l7","ftho","gdq9","717h",
"24b7","d0en","ads7","m9b4","n0lq","42c3","p5mp","7hbi","f0g6","7v98","mv86",
"d0ns","9a8a","64gg","jogl","cehp","mu2r","6h7h","sntb","94ds","n1fv","3a2i",
"3end","l42s","a9j","q3dj","s150","di3s","3nu5","sk74","e39d","mkvj","482d","kfej",
"nlcv","eroi","m6ee","rvaa","9nis","ef6b","g00q","b4hp","kbpq","bm4l","f7iu",
"e5gb","1sbj","rk0a","ck86","1etp","26sr","fivt","3v95","foqq","vtmj","canb",
"bchv","ku35","q4p9","gdkt","gng8","mdb9","ejjg","27k9","30mf","nene",
"smmm","q204","83ot","6kbr","df1o","1q0j","nh32","ebso","d6t5","f2dp",
"3sqp","i4cs","6k7b","a1pv","ki2l","1f7","d6lv","u7r5","9t0e","5h0l","j8kn",
"7akj","9tj","jmu3","1ir1"] 5a04af7518ad74c497c9e74b7025736e
64044544-GZIP-610 ["Top","Left","Right","Bottom"] 5354ef6838974b1979e49ee379883c56

Some of the JSON features found, such as the one located at '62836579', contain a
lot of information. Other JSON features are very short, such as the feature located
in the GZIP compressed stream at '64044544-GZIP-610'. All of the lines contain the
MD5 hash of the JSON, which is used for deduplication.
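
Deduplication by MD5 can be sketched as follows. The code assumes tab-separated fields (forensic path, JSON text, MD5) and that comment lines begin with '#'. The function name unique_json_features is hypothetical, and the second sample line is an invented duplicate added only to show the effect.

```python
import json

def unique_json_features(lines):
    """Yield (forensic_path, parsed_json) once per distinct MD5."""
    seen = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # assumed comment/banner line or blank line
        path, rest = line.rstrip("\n").split("\t", 1)
        text, md5 = rest.rsplit("\t", 1)
        if md5 in seen:
            continue  # same JSON already reported at another location
        seen.add(md5)
        try:
            value = json.loads(text)
        except ValueError:
            continue  # escaped or truncated JSON that does not parse
        yield path, value

lines = [
    '64044544-GZIP-610\t["Top","Left","Right","Bottom"]\t'
    "5354ef6838974b1979e49ee379883c56",
    # Hypothetical duplicate of the same JSON at another offset.
    '70000000\t["Top","Left","Right","Bottom"]\t'
    "5354ef6838974b1979e49ee379883c56",
]
features = list(unique_json_features(lines))
print(features)
```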

The ﬁle elf.txt typically contains information about ELF executables, which is the

executable ﬁle format for Linux and Android systems. The sample corpus used in this

example is from a Windows machine and does not contain any ELF executables.

10.2 Cyber Investigations

Cyber investigations cover a wide variety of areas. However, most involve looking for

encryption keys, hash values or information about ethernet packets. bulk_extractor

ﬁnds all of those things on the disk and writes them to diﬀerent output ﬁles. Of note,

bulk_extractor also ﬁnds information in Base64 encoding and decompresses fragments

of Windows Hibernation ﬁles. There are not speciﬁc ﬁles created for that processing;

the information found in data with these encodings will be processed by other scanners

and stored in the appropriate feature ﬁles. The fact that a feature came from encoded

data will be indicated in the forensic path. The information contained therein may very

well be relevant to cyber investigations.
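
As an illustration, a forensic path such as 64044544-GZIP-610 (seen in json.txt above) can be taken apart to see whether a feature came from encoded data. This is a simplified sketch: the function name is hypothetical, and it assumes the path parts are separated by hyphens, with numeric parts being offsets and all other parts naming a decoder.

```python
def parse_forensic_path(path):
    """Return (offset in the image, list of decoder names in the path)."""
    parts = path.split("-")
    offset = int(parts[0])  # first part: byte offset in the disk image
    # Non-numeric parts (for example GZIP) name the decoding that was applied;
    # numeric parts after them are offsets inside the decoded data.
    encodings = [part for part in parts[1:] if not part.isdigit()]
    return offset, encodings

print(parse_forensic_path("64044544-GZIP-610"))
print(parse_forensic_path("62836579"))
```

An empty list of decoder names means the feature was found directly in the image data.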

AES encryption implementations sometimes leave keys in memory, and bulk_extractor
finds those keys, usually in RAM, swap or hibernation files. The keys can sometimes
be used to decrypt AES-encrypted material. The file aes_keys.txt contains the keys
that are found. Only one AES key was found on the nps-2009-domexusers disk image.
The following is the line that describes it in the keys file, including the offset,
key and key size descriptor (AES256):

1608580652 28 90 90 5e f7 ce b4 a7 2b 7d d9 45 d8 b0 56 99 97 f4 42
33 35 f1 54 9a 79 36 e7 1c 94 02 28 78 AES256
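
A quick sanity check on a key line is to confirm that the number of key bytes matches the size descriptor. The sketch below assumes tab-separated fields and space-separated hex bytes. The function name is hypothetical, and the AES128 and AES192 labels are assumptions, since only AES256 appears in this example.

```python
# Expected key length in bytes for each descriptor.
# Only AES256 appears in the manual; the other two labels are assumed.
KEY_BYTES = {"AES128": 16, "AES192": 24, "AES256": 32}

def parse_aes_line(line):
    """Return (offset, key bytes, label, length_matches_label)."""
    offset, hex_key, label = line.rstrip("\n").split("\t")
    key = bytes.fromhex(hex_key)  # fromhex skips the spaces between bytes
    return offset, key, label, len(key) == KEY_BYTES.get(label)

sample = (
    "1608580652\t"
    "28 90 90 5e f7 ce b4 a7 2b 7d d9 45 d8 b0 56 99 "
    "97 f4 42 33 35 f1 54 9a 79 36 e7 1c 94 02 28 78\t"
    "AES256"
)
offset, key, label, consistent = parse_aes_line(sample)
print(offset, label, len(key) * 8, consistent)
```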

The file hex.txt contains extracted hexadecimal strings of specific lengths. The
blocks contained within it are either 128 or 256 bits long because those are the
sizes used for encryption keys and hash values. The disk image used in this example
does not have any of those and the file is empty.

bulk_extractor produces network information including PCAP ﬁles, Ethernet addresses,

and TCP/IP connections. The ﬁles ether.txt and ether_histogram.txt provide a

list of ethernet addresses from packets and ASCII. These are the addresses found on the

disk and located in ether.txt:

2435863552 00:0C:29:26:BB:CD (ether_dhost)
2435863552 00:50:56:E0:FE:24 (ether_shost)
2435865088 00:0C:29:26:BB:CD (ether_dhost)
2435865088 00:50:56:E0:FE:24 (ether_shost)
22637986225 00:80:C7:8F:6C:96 apter.\x0AExample: 00:80:C7:8F:6C:96\x00\x00

The ﬁle ether_histogram.txt groups these ethernet addresses in a histogram:

n=2 00:0C:29:26:BB:CD
n=2 00:50:56:E0:FE:24
n=1 00:80:C7:8F:6C:96

Judging by the ether_shost (source) and ether_dhost (destination) labels in
ether.txt, packets likely traveled from 00:50:56:E0:FE:24 to 00:0C:29:26:BB:CD.
The other usage has Ethernet addresses in UTF-16 format.
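
The grouping done in a histogram file can be reproduced by counting the feature column of the corresponding feature file. This is only a sketch of the idea, not bulk_extractor's own histogram code. It assumes tab-separated fields and '#' comment lines, and the function name is hypothetical.

```python
from collections import Counter

def feature_histogram(lines):
    """Count how often each feature (second column) occurs."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # assumed comment/banner line or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # malformed line: no feature column
        counts[fields[1]] += 1
    return counts.most_common()  # highest counts first

lines = [
    "2435863552\t00:0C:29:26:BB:CD\t(ether_dhost)",
    "2435863552\t00:50:56:E0:FE:24\t(ether_shost)",
    "2435865088\t00:0C:29:26:BB:CD\t(ether_dhost)",
    "2435865088\t00:50:56:E0:FE:24\t(ether_shost)",
    "22637986225\t00:80:C7:8F:6C:96\tapter.\\x0AExample: 00:80:C7:8F:6C:96",
]
histogram = feature_histogram(lines)
for address, n in histogram:
    print(f"n={n}\t{address}")
```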

The ﬁle ip.txt contains IP addresses from packet carving, not from dotted quads. The

following is an excerpt from that ﬁle:

2435865102 inet_ntop win32 struct ip L (src) cksum-ok
2435865102 inet_ntop win32 struct ip R (dst) cksum-ok
2805534669 123.12.0.192 sockaddr_in
8694397397 135.5.0.234 sockaddr_in
9047318477 123.12.0.192 sockaddr_in
9446959573 135.5.0.234 sockaddr_in
11295228937 1.70.0.1 sockaddr_in

The L or R in the 'struct ip' information indicates Local or Remote. These lines
also show that the IP checksum is OK ("cksum-ok"). The value could also be listed
as "cksum-bad" to indicate that it is bad. A bad checksum may indicate a false
positive rather than a legitimate IP address. Finally, "sockaddr_in" indicates that
the IP address came from a sockaddr_in structure. The file ip_histogram.txt removes
the random noise that is found in ip.txt. Here is an excerpt from the histogram
file:

n=5 2.172.0.101
n=4 123.12.0.192
n=4 inet_ntop win32
n=3 135.5.0.234
n=2 209.85.147.109
n=2 65.55.15.242
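
Because a bad checksum may point to a false positive, one way to reduce noise when reviewing ip.txt is to skip lines marked "cksum-bad" before counting addresses. The sketch below assumes tab-separated fields. The function name is hypothetical, and the last sample line is invented to show the filter at work.

```python
from collections import Counter

def count_ips(lines, skip_bad_checksums=True):
    """Count addresses in ip.txt lines, optionally dropping cksum-bad ones."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # assumed comment/banner line or blank line
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) < 3:
            continue  # malformed line: context column missing
        _offset, address, context = fields
        if skip_bad_checksums and "cksum-bad" in context:
            continue  # likely a false positive
        counts[address] += 1
    return counts

lines = [
    "2435865102\tinet_ntop win32\tstruct ip L (src) cksum-ok",
    "2805534669\t123.12.0.192\tsockaddr_in",
    "9047318477\t123.12.0.192\tsockaddr_in",
    "8694397397\t135.5.0.234\tsockaddr_in",
    # Hypothetical line with a bad checksum, not taken from the sample image.
    "1234\t10.0.0.1\tstruct ip R (dst) cksum-bad",
]
counts = count_ips(lines)
print(counts.most_common())
```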

The file packets.pcap is a pcap file made from carved packets. To view that file,
use any packet analysis tool you like (such as tcpdump). Only packets carved from
a PCAP file will have the correct packet time stamp; others will be given a time
in 1970.
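
Packets with a 1970 time stamp can be picked out by reading the record headers directly. The following sketch assumes that packets.pcap uses the classic libpcap layout (a 24-byte global header followed by 16-byte record headers). The function name is hypothetical, and the data is built in memory so that the example is self-contained.

```python
import struct
from datetime import datetime, timezone

def pcap_timestamps(data):
    """Yield the UTC time stamp of every record in classic pcap data."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # little-endian file
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a classic pcap file")
    pos = 24  # skip the global header
    while pos + 16 <= len(data):
        ts_sec, _ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, pos)
        yield datetime.fromtimestamp(ts_sec, tz=timezone.utc)
        pos += 16 + incl_len  # record header plus captured bytes

# Build a tiny pcap in memory: one record at time 0, one in October 2008.
header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
record_1970 = struct.pack("<IIII", 0, 0, 4, 4) + b"\x00" * 4
record_2008 = struct.pack("<IIII", 1224547200, 0, 4, 4) + b"\x00" * 4
stamps = list(pcap_timestamps(header + record_1970 + record_2008))
no_real_time = [t for t in stamps if t.year == 1970]
print(len(stamps), len(no_real_time))
```

For a real run, the bytes would come from open("packets.pcap", "rb").read() in the output directory.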

Finally, the ﬁle tcp.txt contains details about TCP (and UDP) network ﬂows. It

contains more detail than ip.txt but investigators should be careful of false positives,

as there are often many in this ﬁle. The following are the two lines found in that ﬁle:

2435863566 inet_ntop win32:80 -> inet_ntop win32:1034 (TCP) Size: 1472
2435865102 inet_ntop win32:80 -> inet_ntop win32:1034 (TCP) Size: 1252

The ﬁle tcp_histogram.txt often provides further insight into the tcp information found

on the disk image. In this case, it does not because there were only two features found.

It is important to note that the histogram ﬁle still contains a lot of false positives.

11 Troubleshooting

Every forensic tool crashes at times because the tools are routinely used with data frag-

ments, non-standard codings, etc. One major issue is that the evidence that makes the

tool crash typically cannot be shared with the developer. The bulk_extractor system

implements checkpointing to protect the user and the results. bulk_extractor check-

points the current page in the ﬁle report.xml. After a crash, the user can just hit the

up-arrow at the command line prompt and return. bulk_extractor will restart at the

next page.

All bulk_extractor users should join the bulk_extractor users Google group for more
information and help with any issues encountered. To join, send an email to
bulk_extractor-users+subscribe@googlegroups.com.

For the most part, the only kind of debugging bulk_extractor users should be doing is

turning oﬀ scanners. If bulk_extractor crashes repeatedly on a data set, the scanners

can all be disabled and then turned back on, one by one, until it crashes again. Then,

the user can report the speciﬁc scanner that made bulk_extractor crash on their disk

image. In general, users who experience crashes should feel free to report issues and

problems to the developers via the Google users group.
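
The one-by-one procedure can be scripted by generating one command line per step. The sketch below only builds the argument lists and does not run anything. It uses the -x all, -e and -o options listed in the help output in Appendix A, while the function name and the numbered output directory names are made up for illustration.

```python
def isolation_commands(scanners, image, outdir_prefix):
    """Build one bulk_extractor command per step, adding one scanner each time."""
    commands = []
    for count in range(1, len(scanners) + 1):
        # Options are processed in order: disable everything first ...
        argv = ["bulk_extractor", "-x", "all"]
        # ... then re-enable the first `count` scanners.
        for name in scanners[:count]:
            argv += ["-e", name]
        # Each run needs a fresh output directory because it must not exist.
        argv += ["-o", f"{outdir_prefix}-{count:02d}", image]
        commands.append(argv)
    return commands

commands = isolation_commands(["gzip", "zip", "pdf"], "mydisk.raw", "output")
for argv in commands:
    print(" ".join(argv))
```

In the first run that crashes, the scanner that was added last is the one to report.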

Users running the 32-bit version of bulk_extractor may occasionally encounter memory

allocation errors. This problem is more likely to occur on machines with a greater

number of cores. Our testing has shown this to be an issue using one of our test data

sets on a 32-bit machine with 12 cores. If the user encounters memory allocation errors

with bulk_extractor they will likely see an error similar to the following:

bulk_extractor scan error: ’std::exception Scanner: gzip Exception:

std::bad_alloc sbuf.pos0: (|21894266880) bufsize=20971520’

Memory allocation errors such as the one shown above will contain the phrase “bad_alloc”

somewhere in the message. If the user encounters this error, they should try run-

ning bulk_extractor with fewer threads. For example, the following command will run

bulk_extractor with only 4 threads (the -j option changes this parameter):

 bulk_extractor -j 4 -o output mydisk.raw

Reducing the number of threads and re-running the program should eliminate the prob-

lem.
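
A wrapper script could look for the phrase "bad_alloc" in the captured output and choose a smaller value for -j on the next attempt. The sketch below shows only that decision. The function name is hypothetical, and halving the thread count is an arbitrary choice made for this example, not a rule from bulk_extractor.

```python
def next_thread_count(log_text, current_threads):
    """Suggest a -j value for the next run based on captured output."""
    if "bad_alloc" not in log_text:
        return current_threads  # no memory allocation error seen
    # Assumption for this sketch: halve the thread count, but keep at least 1.
    return max(1, current_threads // 2)

error_text = (
    "bulk_extractor scan error: 'std::exception Scanner: gzip Exception: "
    "std::bad_alloc sbuf.pos0: (|21894266880) bufsize=20971520'"
)
print(next_thread_count(error_text, 12))
print(next_thread_count("Phase 3. Creating Histograms", 12))
```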

Users may encounter errors if they are processing a large disk image and trying to
write the output of bulk_extractor to an output directory on a smaller drive. In
that case the user might see an error similar to the following:

bulk_extractor version: 1.5.0

Input file: G:\nps-2011-2tb\nps-2011-2tb.E01

Output directory: C:\Users\Mark Richer\Documents\BE Testing\OFD nps-2011-2tb 64bit

Disk Size: 2000054960128

Threads: 12

DISK FULL

*** carve: Cannot write(pos=7,0 len=24724184): No space left on device

DISK FULL

*** carve: Cannot write(pos=7,0 len=24724198): No space left on device

*** carve: Cannot write(pos=7,0 len=49160): No space left on device

*** carve: Cannot create C:\Users\Mark Richer\Documents\BE Testing\OFD nps-2011-2tb

64bit/kml/000/426602508288-ZIP-0.kml: No space left on device

Could not make directory C:\Users\Mark Richer\Documents\BE Testing\OFD nps-2011-2tb

64bit/kml/001: No space left on device

Phase 3. Creating Histograms

Cannot open histogram output file: C:\Users\Mark Richer\Documents\BE Testing\OFD

nps-2011-2tb 64bit/ccn_track2_histogram.txt

Elapsed time: 45111.4 sec.

Overall performance: 44.3359 MBytes/sec

Total email features found: 6716934

If this situation is encountered, the solution is to run bulk_extractor with an output

directory on a machine with more available disk space so that bulk_extractor has room

to create all the output ﬁles and directories required.
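
Free space on the target drive can be checked before a long run starts. The sketch below uses Python's shutil.disk_usage. The function name is hypothetical, and the required size is left to the user, because the space a run needs depends on the image and on which scanners carve files.

```python
import shutil

def enough_space(output_parent, required_bytes):
    """Return True if the drive holding output_parent has the requested free space."""
    free_bytes = shutil.disk_usage(output_parent).free
    return free_bytes >= required_bytes

# Example: ask for 50 GB (an arbitrary figure) on the current drive.
needed = 50 * 1024**3
if enough_space(".", needed):
    print("enough free space for the requested amount")
else:
    print("choose an output directory on a larger drive")
```

The path passed in should be the existing parent of the output directory, since bulk_extractor creates the output directory itself.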

12 Related Reading

There are numerous articles and presentations available related to digital forensics,

speciﬁcally bulk_extractor, and its practical and research applications. Some of those

articles are speciﬁcally cited throughout this manual. Other useful references include

but are not limited to:

• Garﬁnkel, S. File Cabinet Forensics, Journal of Digital Forensics, Security and Law,

Vol 6(4). http://www.jdfsl.org/subscriptions/abstracts/JDFSL-V6N4-column-

Garfinkel.pdf

• Garfinkel, S. Every Last Byte. J. of Digital Forensics, Security and Law, 6:7–8.
Column. http://www.jdfsl.org/subscriptions/abstracts/column-v6n2-Garfinkel.htm

• Phillips, Kenneth N; Aaron Pickett; Simson Garﬁnkel, Embedded with Facebook:

DoD Faces Risks from Social Media, CrossTalk, May/June 2011. http://www.

dtic.mil/cgi-bin/GetTRDoc?AD=ADA542587

• Rowe, Neil, Schwamm, Riqui, Garfinkel, Simson. Language Translation for File
Paths, DFRWS 2013, Aug 4-7, 2013. Monterey, CA.
http://www.dfrws.org/2013/proceedings/DFRWS2013-5.pdf

• Garﬁnkel, S., Nelson, A., Young, J., “A General Strategy for Diﬀerential Forensic

Analysis”, DFRWS 2012, Aug. 6-8, 2012, Washington, DC. http://www.dfrws.

org/2012/proceedings/DFRWS2012-6.pdf

• Garﬁnkel, S., “Lessons Learned Writing Computer Forensics Tools and Managing

a Large Digital Evidence Corpus”, DFRWS 2012, Aug. 6-8, 2012, Washington,

DC. http://simson.net/clips/academic/2012.DFRWS.DIIN382.pdf

• N. C. Rowe and S. L. Garﬁnkel, Finding anomalous and suspicious ﬁles from di-

rectory metadata on a large corpus. 3rd International ICST Conference on Digital

Forensics and Cyber Crime, Dublin, Ireland, October 2011. In P. Gladyshev and

M. K. Rogers (eds.), Lecture Notes in Computer Science LNICST 88, Springer-

Verlag, 2012, pp. 115-130. http://simson.net/clips/academic/2012.IICDFCC.

Anomalous.pdf

• Presentation - Using bulk_extractor for digital forensics triage and cross-drive anal-

ysis, DFRWS 2012. http://digitalcorpora.org/downloads/bulk_extractor/

doc/2012-08-08-bulk_extractor-tutorial.pdf

• Presentation - Digital Signatures: Current Barriers, Invited Talk, 10th Sympo-

sium on Identity and Trust on the Internet, Gaithersburg, MD, 2011. http://

middleware.internet2.edu/idtrust/2011/slides/07-digital-signatures-current

-barriers-garfinkel.pdf

• Courrejou, Timothy and Simson Garfinkel. A comparative analysis of file carving
software. Technical Report NPS-CS-11-006, Naval Postgraduate School, September 2011.
http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA550119

Appendices

A Output of bulk_extractor Help Command

C:\>bulk_extractor -h

bulk_extractor version 1.5.0

Usage: bulk_extractor [options] imagefile

runs bulk extractor and outputs to stdout a summary of what was found where

Required parameters:

imagefile - the file to extract

or -R filedir - recurse through a directory of files

HAS SUPPORT FOR E01 FILES

-o outdir - specifies output directory. Must not exist.

bulk_extractor creates this directory.

Options:

-i - INFO mode. Do a quick random sample and print a report.

-b banner.txt- Add banner.txt contents to the top of every output file.

-r alert_list.txt - a file containing the alert list of features to alert

(can be a feature file or a list of globs)

(can be repeated.)

-w stop_list.txt - a file containing the stop list of features (white list

(can be a feature file or a list of globs)s

(can be repeated.)

-F <rfile> - Read a list of regular expressions from <rfile> to find

-f <regex> - find occurrences of <regex>; may be repeated.

results go into find.txt

-q nn - Quiet Rate; only print every nn status reports. Default 0; -1 for no status at all

-s frac[:passes] - Set random sampling parameters

Tuning parameters:

-C NN - specifies the size of the context window (default 16)

-S fr:<name>:window=NN specifies context window for recorder to NN

-S fr:<name>:window_before=NN specifies context window before to NN for recorder

-S fr:<name>:window_after=NN specifies context window after to NN for recorder

-G NN - specify the page size (default 16777216)

-g NN - specify margin (default 4194304)

-j NN - Number of analysis threads to run (default 4)

-M nn - sets max recursion depth (default 7)

-m <max> - maximum number of minutes to wait after all data read

default is 60

Path Processing Mode:

-p <path>/f - print the value of <path> with a given format.

formats: r = raw; h = hex.

Specify -p - for interactive mode.

Specify -p -http for HTTP mode.

Parallelizing:

-Y <o1> - Start processing at o1 (o1 may be 1, 1K, 1M or 1G)

-Y <o1>-<o2> - Process o1-o2

-A <off> - Add <off> to all reported feature offsets

Debugging:

-h - print this message

-H - print detailed info on the scanners

-V - print version number

-z nn - start on page nn

-dN - debug mode (see source code)

-Z - zap (erase) output directory

Control of Scanners:

-P <dir> - Specifies a plugin directory

Default dirs include /usr/local/lib/bulk_extractor /usr/lib/bulk_extractor and

BE_PATH environment variable

-e <scanner> enables <scanner> -- -e all enables all

-x <scanner> disable <scanner> -- -x all disables all

-E <scanner> - turn off all scanners except <scanner>

(Same as -x all -e <scanner>)

note: -e, -x and -E commands are executed in order

e.g.: ’-E gzip -e facebook’ runs only gzip and facebook

-S name=value - sets a bulk extractor option name to be value

Settable Options (and their defaults):

-S work_start_work_end=YES Record work start and end of each scanner in report.xml file ()

-S enable_histograms=YES Disable generation of histograms ()

-S debug_histogram_malloc_fail_frequency=0 Set >0 to make histogram maker fail with memory allocations ()

-S hash_alg=md5 Specifies hash algorithm to be used for all hash calculations ()

-S dup_data_alerts=NO Notify when duplicate data is not processed ()

-S write_feature_files=YES Write features to flat files ()

-S write_feature_sqlite3=NO Write feature files to report.sqlite3 ()

-S report_read_errors=YES Report read errors ()

-S ssn_mode=0 0=Normal; 1=No ‘SSN’ required; 2=No dashes required (accts)

-S min_phone_digits=6 Min. digits required in a phone (accts)

-S carve_net_memory=NO Carve network memory structures (net)

-S word_min=6 Minimum word size (wordlist)

-S word_max=14 Maximum word size (wordlist)

-S max_word_outfile_size=100000000 Maximum size of the words output file (wordlist)

-S wordlist_use_flatfiles=NO Override SQL settings and use flatfiles for wordlist (wordlist)

-S hashdb_mode=none Operational mode [none|import|scan]

none - The scanner is active but performs no action.

import - Import block hashes.

scan - Scan for matching block hashes. (hashdb)

-S hashdb_block_size=4096 Hash block size, in bytes, used to generte hashes (hashdb)

-S hashdb_ignore_empty_blocks=YES Selects to ignore empty blocks. (hashdb)

-S hashdb_scan_path_or_socket=your_hashdb_directory File path to a hash database or

socket to a hashdb server to scan against. Valid only in scan mode. (hashdb)

-S hashdb_scan_sector_size=512 Selects the scan sector size. Scans along

sector boundaries. Valid only in scan mode. (hashdb)

-S hashdb_import_sector_size=4096 Selects the import sector size. Imports along

sector boundaries. Valid only in import mode. (hashdb)

-S hashdb_import_repository_name=default_repository Sets the repository name to

attribute the import to. Valid only in import mode. (hashdb)

-S hashdb_import_max_duplicates=0 The maximum number of duplicates to import

for a given hash value, or 0 for no limit. Valid only in import mode. (hashdb)

-S exif_debug=0 debug exif decoder (exif)

-S jpeg_carve_mode=1 0=carve none; 1=carve encoded; 2=carve all (exif)

-S min_jpeg_size=1000 Smallest JPEG stream that will be carved (exif)

-S zip_min_uncompr_size=6 Minimum size of a ZIP uncompressed object (zip)

-S zip_max_uncompr_size=268435456 Maximum size of a ZIP uncompressed object (zip)

-S zip_name_len_max=1024 Maximum name of a ZIP component filename (zip)

-S unzip_carve_mode=1 0=carve none; 1=carve encoded; 2=carve all (zip)

-S rar_find_components=YES Search for RAR components (rar)

-S raw_find_volumes=YES Search for RAR volumes (rar)

-S unrar_carve_mode=1 0=carve none; 1=carve encoded; 2=carve all (rar)

-S gzip_max_uncompr_size=268435456 maximum size for decompressing GZIP objects (gzip)

-S pdf_dump=NO Dump the contents of PDF buffers (pdf)

-S opt_weird_file_size=157286400 Weird file size (windirs)

-S opt_weird_file_size2=536870912 Weird file size2 (windirs)

-S opt_max_cluster=67108864 Ignore clusters larger than this (windirs)

-S opt_max_cluster2=268435456 Ignore clusters larger than this (windirs)

-S opt_max_bits_in_attrib=3 Ignore FAT32 entries with more attributes set than this (windirs)

-S opt_max_weird_count=2 Ignore FAT32 entries with more things weird than this (windirs)

-S opt_last_year=2019 Ignore FAT32 entries with a later year than this (windirs)

-S xor_mask=255 XOR mask string, in decimal (xor)

-S sqlite_carve_mode=2 0=carve none; 1=carve encoded; 2=carve all (sqlite)

These scanners disabled by default; enable with -e:

-e base16 - enable scanner base16

-e facebook - enable scanner facebook

-e hashdb - enable scanner hashdb

-e outlook - enable scanner outlook

-e sceadan - enable scanner sceadan

-e wordlist - enable scanner wordlist

-e xor - enable scanner xor

These scanners enabled by default; disable with -x:

-x accts - disable scanner accts

-x aes - disable scanner aes

-x base64 - disable scanner base64

-x elf - disable scanner elf

-x email - disable scanner email

-x exif - disable scanner exif

-x find - disable scanner find

-x gps - disable scanner gps

-x gzip - disable scanner gzip

-x hiber - disable scanner hiber

-x httplogs - disable scanner httplogs

-x json - disable scanner json

-x kml - disable scanner kml

-x net - disable scanner net

-x pdf - disable scanner pdf

-x rar - disable scanner rar

-x sqlite - disable scanner sqlite

-x vcard - disable scanner vcard

-x windirs - disable scanner windirs

-x winlnk - disable scanner winlnk

-x winpe - disable scanner winpe

-x winprefetch - disable scanner winprefetch

-x zip - disable scanner zip