The Sleuth Kit Informer

Brian Carrier

Issue #6
July 15, 2003



The sixth issue of The Sleuth Kit Informer looks at file hashes and hash databases. It is the first in a two part series and the next issue will focus on the details of The Sleuth Kit (TSK) implementation. This issue examines the concept of hashes and an algorithm to efficiently search for them.

What's New?

Keith Jones (Foundstone) has released three new open source Unix-based tools for analyzing Windows systems. Galleta parses cookie files, Pasco parses the 'index.dat' file, and Rifiuti examines the INFO2 file in the Recycle Bin.

Chuck Willis is presenting Forensics with Linux 101: How to do Forensics for Free at Blackhat and discusses using The Sleuth Kit and Autopsy.

The Sleuth Kit and Autopsy have been recently included in the following (or at least I just noticed them):

I was pointed to a Masters Thesis by Christian Johansson that was on text analysis with open source forensic tools, including Autopsy and The Sleuth Kit.

     Computer Forensic Text Analysis with Open Source Software

If you are presenting on or distributing The Sleuth Kit anywhere and want it mentioned in the Informer, let me know.

Hunting For Hashes (Part 1 of 2)


Physical crime investigators have an advantage over digital investigators with respect to having experience with what is normal about a crime scene because they live in and are familiar with the physical world. When confronted with a computer, investigators are challenged with focusing only on the potential evidence and ignoring the thousands of normal files.

One of the solutions to this digital problem is to use a cryptographic hash of files. Using algorithms such as MD5 and SHA-1, a relatively unique value is calculated for the file. The odds of having two files with the same hash value is extremely low and therefore it can be used to recognize files, similar to how a fingerprint can be used to recognize a person.

Using a secure hash, when any bit in a file changes, the resulting value will be much different. Therefore, if an attacker changes the contents of a file or if he replaces a file with a trojan version, the hash value will be different. Hashes are taken of known files so that the files that are known to be good can be ignored and the files that are known to be bad can be identified.

The Sleuth Kit (TSK) contains a tool called 'hfind' that allows an investigator to efficiently use hash databases. Autopsy allows one to use the 'hfind' tool to look files up in a database. This article covers the basic theory for hash databases and what Autopsy offers for hash databases. The next article will provide details on 'hfind'.

Hash Databases

There are two types of files that we want to identify using hashes:

In general, it is easier to create hashes of the files that are known be good. For example, the files that are included on original media from Microsoft are typically trusted to not contain malicious code (although they may contain vulnerabilities). When a database does not exist, investigators can easily obtain a copy of a trusted installation and compare it to a suspect installation.

To create hashes of files that are known to be bad requires the time intensive process of analyzing a system and identifying which files are good and which ones are bad. This requires him to ignore the files that are known to be good and investigate the unknown ones. Those that are identified as bad are then hashed for future investigations.

Several databases of hashes exist, so that investigators do not have to recreate the wheel every time. The Hashkeeper group has been creating hash databases of known good and bad files for many years. They produce a database for a specific tool or category of files. For example, there might be a database for Microsoft Office 2000 and a database for a series of known child pornography images. The database can be imported into many forensic applications.

The National Institute of Standards and Technology (NIST) has a group working on the National Software Reference Library (NSRL), which is a hash database of known files. The NSRL contains hashes of files that are taken from software installations and other "known" files from the Internet. The NSRL can be imported into several forensic applications. The NSRL is sold with a yearly subscription and has quarterly updates.

Maintaining the hash databases of numerous applications and operating systems is a time consuming process. Sun Microsystems has led the industry by providing an on-line interface to a hash database that contains all of the files that it has recently shipped. This design forces the user to trust that the on-line database has not been compromised and had its contents changed. I am not aware of any forensic applications that interface with the on-line database.

Another on-line database is maintained at This site has been developed by the Schmoo Group and contains hashes for FreeBSD, Linux, Mac OS X, and Solaris files. This database allows one to lookup a file using its name (to identify what the MD5 should be) or by using the MD5 (to identify if it is a valid file). As with all of these databases, ensure that the source of the data can be trusted and that you can testify to its origin and authenticity.

Most importantly, you can make your own database. This is useful for corporate environments where all computers start from the same image, for creating hashes of newly found bad files, and for those that trust no one and want to do everything on their own. The 'md5sum' (or simply 'md5') command line tool for both UNIX and Windows can easily do this. It calculates the hash of a given file and displays it. An example output is as follows:

     # md5sum /etc/passwd
     MD5 (/etc/passwd) = 2c264ba9cea77d422838c879b64ab4cf

The following would calculate the MD5 hashes on a UNIX system:

     # find / -exec /usr/bin/md5sum {} \; > redhat-8.1.txt

The 'md5deep' tool by Jesse Kornblum has the ability to recurse directories while calculating MD5 hashes on both UNIX and Windows. The following will calculate the MD5 hashes for a Windows system:

     C:\> md5deep -r C:\ > windows-xp.txt

Or, the following will add a single hash to an existing list of hashes:

     C:\> md5sum bad-file.jpg >> bad-picts.txt

Binary Search Algorithm

Once the list of hashes has been created, an efficient method of looking a hash up is needed. Unfortunately, there is no standard format for the databases. The Hashkeeper format is different then the NSRL and both are very different from the plain output of 'md5sum'. Fortunately though, all are in an ASCII text format so lookup tools can be easily customized for them.

The most simple, and least efficient, lookup method is to use a keyword search tool, such as 'grep' or the find option in a text editor. These tools work by looking at each word in the file to find a match. This sounds simple, but consider that the NSRL hash database is over 1GB in size. You can imagine that it takes a while to look at each word to match a single hash, especially since you would also be searching file names and other non-hash values. This search method is similar to looking up the definition of a word using a dictionary whose entries are not in alphabetical order.

Therefore, a more efficient technique is to sort the data based on the hash value. This allows the search tool to more quickly focus on where to search. A common search algorithm for sorted data is the binary search. The NSRL and Hashkeeper are not distributed in sorted order.

This article will show an efficient algorithm for finding a value in a sorted list. With a binary search algorithm, the list of possible values is decreased by a half each time. Eventually, the goal value is the identified. The algorithm has two data points: a lower bound and an upper bound. In the beginning, the lower bound is the database entry for the lowest hash value and the upper bound is the database entry for the highest hash value. While the algorithm is running, target (or goal) hash value will be between the lower and upper bounds. After each round of the algorithm, the bounds will get closer together until the goal value is found.

In the first round, the lower bound is the lowest hash entry and the upper bound is the highest hash entry. The middle entry between the lower and upper bound is identified and the hash value is compared with the goal hash value.

If the middle hash value is less then the goal hash value, then we know that the goal hash value is in the entries after the middle so the lower bound is increased to the current middle value. If the middle value is greater than the goal value then we know that the goal hash value is somewhere in the entries before the middle and the upper bound is decreased to the current middle value. In either case, we have reduced the number of entries to search by half.

For example, consider a database with 10 entries. The lower bound is entry #1 and the upper bound is entry #10. The middle entry is entry #5. The value in entry #5 is 55 and it is compared to the goal value, which is 30.

Index: 123 45 678 910
Value: 612213055 5862777881
Bounds: Lower   Mid     Upper

The middle value is greater than the goal and therefore the upper bound is decreased to #5 and the lower bound remains at #1.

Index: 123 45 678 910
Value: 612213055 5862777881
Bounds: Lower   Upper      

Another round is performed using the new bounds. The middle entry is identified and compared with the goal and the same criteria from the first round is used to reset the bounds.

Using the previous example, the lower bound is entry #1 and the upper bound is entry #5. The middle value would therefore be entry #3. If the value in entry #3 is 21, then it is less than the goal value (30). Therefore, the lower bound is increased to entry #3.

Index: 123 45 678 910
Value: 612213055 5862777881
Bounds:   Lower Upper      

This process continues until the middle entry is equal to the goal value. If the goal value is not found before the upper and lower bounds are equal, then the hash does not exist in the database.

In our example. the middle entry is now entry #4 and its value (30) is compared to the goal value (30). They are equal, so the search ends. If they had not been equal, then we would have known that the value did not exist in the list.

The algorithm sounds much more complex then going through a file word by word, but in reality it is much faster. Hash lookups can be performed using other search algorithms, but the binary search was presented as an example of an efficient search technique. The hash tools in The Sleuth Kit utilize a binary search algorithm, and it will be used in the next issue.

Hash Databases with Autopsy

As previously mentioned, several forensic tools allow one to import hash databases. Since this newsletter is called THE SLEUTH KIT Informer, you can probably guess that The Sleuth Kit (TSK) and Autopsy support hash databases. The details of TSK support will be detailed in the next issue. This article will focus on the support from Autopsy (which of course uses the command line tools from TSK).

There are three major types of hash databases that are supported in Autopsy (the current version is 1.73). Each database has a different scope level. The NIST NSRL is used at a global level and is specified at installation time. Every case that uses Autopsy can utilize the NIST NSRL, if it is installed. Its location is stored in the '' file in the Autopsy installation directory. If the NSRL is moved or added after the Autopsy installation, either type 'make' to reconfigure it or edit the '' file by hand.

The other two hash databases are at a host scope. One of the databases is for files that are known to be good and the other is for files that are known to be bad. Both of these databases must be custom and created by 'md5' or 'md5deep' (or equivalent). The reason that they are at a host level is so that they can be specific to a given operating system. For example, the known good file can contain the hashes from just Windows XP and or a secure build, instead of hashes from all Microsoft products. The known bad database can be for all known rootkits for a given operating system or for files specific to the type of case (child porn for example).

Autopsy currently uses the hash databases in three locations. When the 'Sort Files by Type' mode is used (i.e. the 'sorter' tool), the hash databases are used to ignore the known to be good files and alert on the known to be bad files. Refer to The Sleuth Kit Informer Issue #3 for more information on how 'sorter' works.

The second location where Autopsy uses hash databases is in the 'Meta Data' mode. The 'Meta Data' mode uses the 'istat' command from TSK to display all of the details about a file. It also displays the MD5 hash value. If hash databases are configured, then Autopsy allows one to look the file up in one or more of the hash databases.

The third location where Autopsy uses hash databases is from the Host Gallery view. The 'Hash Databases' button allows one to manage the hash databases and perform single lookups.


Hash databases are useful for reducing the amount of data that needs to be examined. Thus far, they have been frequently used by law enforcement for quickly identifying known series of child pornography. The NIST NSRL will help to make hashes of trusted files easily accessible and companies will likely be spending more time preparing for computer intrusions and creating hash databases to reduce their investigation costs.

This article has shown why hash databases are useful, how a hash database can be searched, and how Autopsy uses them. In next months article, the innards of the 'hfind' tool will be examined. I'm sure you will find it nice reading for a summer day on the beach.


Autopsy Forensic Browser

known goods




Solaris Fingerprints Database

The Sleuth Kit

The Sleuth Kit Informer Issue #3

Copyright © 2003 by Brian Carrier. All Rights Reserved