carrier at sleuthkit dot org
September 15, 2004
In this issue of The Sleuth Kit Informer, we have two articles on keyword searching and one on deleted file recovery in NTFS. All three articles are based on new features that have been integrated into recent versions of TSK and Autopsy or that are planned to be integrated into version 2. The first article is by Paul Bakker and covers his search tools that index keywords for faster searching. In the second article, I discuss the new sstrings tool in TSK and in the third article I discuss NTFS orphan files and the '-p' option to 'ifind'.
Since the July issue of The Informer, there have been two releases of both TSK and Autopsy. TSK 1.71 was released on July 30, 2004 and included some NTFS fixes. There were also several improvements to the NTFS code. TSK 1.72 was released on Sept 7, 2004 and included fixes for a couple of bugs with FAT and NTFS images. Several enhancements to the EXTxFS code were also made.
Autopsy 2.02 was released on July 30, 2004 with improved support for NTFS deleted files. Version 2.03 was released on Sept 7, 2004 and it included updates for Unicode searching and a few other minor updates.
I added a link to the 'comeforth' script by Dan Higgens, which helps you to process unallocated space. Here is a paragraph from the readme:
Parse raw filesystem blocks, or block image data produced by "dls", found in the Sleuth Kit. This was inspired by lazarus (www.porcupine.org/forensics) but provides a bit more flexibility for processing very large data sets. Blocks of certain file types or matching certain regular expressions are first found and saved in a scan phase. After scanning, blocks that have been saved can be viewed, and based on their contents files can be reassembled from various other blocks. An auto-assemble feature is provided which can reassemble a complete file in many cases, knowing only the first block in the file (only for ext2/ext3 filesystems).
In response to the article about TestDisk in the last issue of The Informer, Daniel Sedory mentioned that he has a page that gives screen shots and an example of using TestDisk to recover a partition.
[Above link no longer works - Brian 1/17/08]
The Sleuth Kit Informer is looking for articles on open source tools and techniques for digital investigations (computer / digital forensics) and incident response. Articles that discuss The Sleuth Kit and Autopsy are appreciated, but not required. Example topics include (but are not limited to):
Forensic investigations of hard drives and images benefit greatly from the many tools that are available. Some of these tools are Open Source or Free Software, and some are commercial.
Searching for keywords is probably one of the most frequently performed actions during forensic investigations. Depending on the tools that are used and on the size of the hard drive or image under investigation, a search can take a lot of time.
In order to speed up searches it is possible to make a time-space trade-off and create an index of the forensic image. Such an index takes up a portion of space on the hard drive but this index will greatly speed up searches afterwards.
At first I used a commercial Windows based tool to index my images and quickly search through them, but I had some very bad experiences with the tool. A number of times during important investigations the tool crashed or hung while creating an index, or it was unable to open an index it had just created, without any identifiable reason. As most forensic investigations are done within a very short timeframe, it is very frustrating if after a full day, night or weekend of indexing (depending on the size of the drive) you are presented with a crash or a failure to open an index file.
Because of these very frustrating experiences, I decided to design and create tools that could be used to perform indexed searches on images. In addition, I chose to make these tools an addition to the existing Open Source forensic tools Autopsy and Sleuthkit.
The collection of tools is called 'Searchtools' and can be found on http://www.brainspark.nl. This article gives an overview of how Searchtools works.
Because it is not viable to create an index containing all strings that are present inside an image, a number of parameters have consequences for the size of the index files and the strings that are present herein. Among these parameters the most important are:
The minimum and maximum string length determine the lengths of the strings that will be indexed. Indexing strings shorter than 4 characters will result in a large amount of rubbish, because there is a high chance that 3 indexable characters occur in succession in a piece of random or binary data. Indexing strings longer than 15 characters will not result in more useful information.
The characters that are indexed should depend on the needs of the investigator. If only words have to be searched, it is wise to only index alphabetic characters in order to limit the size of the index and thus the time it takes to generate.
The folding parameter specifies if diacritic characters should map to their non-diacritic character in the index. This allows for easier searching of words that contain diacritic characters. Currently only folding of diacritic iso_8859-1 characters is supported.
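To make the idea of folding concrete, here is a minimal sketch in Python. It is not the Searchtools implementation: it relies on Unicode decomposition instead of an ISO 8859-1 lookup table, and the function name `fold_diacritics` is only illustrative.

```python
import unicodedata


def fold_diacritics(text):
    """Map accented Latin characters to their unaccented base letter."""
    # NFD splits a character such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("café"))
print(fold_diacritics("Zürich"))
```

Characters that have no Unicode decomposition, such as "ø" or "ß", pass through unchanged with this approach, so a table-based folding would be needed to cover them.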
Currently Searchtools is able to create two different types of indexes:
The Raw index type contains all the strings that are located in the raw image. This means that this index type does not take into account any form of structure that might be available or present on the image.
If a string is located in a fragmented file and spans non-consecutive sectors, then it will not be found using the raw index. To find this string, the data on the image is indexed using the original file system structure. This index is called the raw fragment index. To reduce the raw fragment index size and prevent duplicate entries in the indexes, this index contains only the strings that start in one fragment and end in a non-consecutive fragment.
In order to visualize the data that is contained in an index, a small example is presented. The example creates a simplified raw index of a file containing only the string "This looks like a sentence: look looks looked". The default parameters are used, thus only strings with a length of 4 to 15 are indexed. A simplified parsing of the file results in the following index information:
    Offset  String      Offset  String
    0       this        22      ence
    5       looks       28      look
    6       ooks        33      looks
    11      like        34      ooks
    18      sentence    39      looked
    19      entence     40      ooked
    20      ntence      41      oked
    21      tence

Note: All locations are zero-based.
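The simplified parsing can be reproduced with a few lines of Python. This is only a sketch of the example, not the indexer itself: it assumes that only alphabetic characters are indexed, that strings are lowercased, and that every suffix of at least the minimum length gets its own entry. Truncating to the maximum length is also an assumption.

```python
import re

MIN_LEN = 4   # default minimum string length mentioned in the article
MAX_LEN = 15  # default maximum string length mentioned in the article

TEXT = "This looks like a sentence: look looks looked"


def build_simple_index(text):
    """Return a dict that maps each indexed string to its byte offsets."""
    index = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        # Every suffix of at least MIN_LEN characters becomes an entry.
        for start in range(len(word) - MIN_LEN + 1):
            key = word[start:start + MAX_LEN]  # assumed truncation rule
            index.setdefault(key, []).append(match.start() + start)
    return index


index = build_simple_index(TEXT)
for key in sorted(index, key=lambda k: index[k][0]):
    print(index[key], key)
```

Under these assumptions the sketch yields the same 15 offsets as the listing, with "looks" and "ooks" each occurring twice.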
Internally though the information is represented in a tree. So a more accurate representation would be the following simplified drawing:
    root
     |- e - n -+- c - e(22)
     |         `- t - e - n - c - e(19)
     |- l -+- i - k - e(11)
     |     `- o - o - k(28) -+- s(5,33)
     |                       `- e - d(39)
     |- n - t - e - n - c - e(20)
     |- o -+- k - e - d(41)
     |     `- o - k -+- s(6,34)
     |               `- e - d(40)
     |- s - e - n - t - e - n - c - e(18)
     `- t -+- e - n - c - e(21)
           `- h - i - s(0)
As can be seen, the internal representation uses the letters of the indexed strings as nodes in a tree. At the node of the final letter of an indexed string, the offsets of that string are located.
So if one now wanted to search for the string "look", only the letters of the string have to be walked in the tree from the root node to see that the string is present at location 28. If all strings starting with "look" are to be found, all nodes beneath that node have to be accounted for too, thus resulting in the locations 5, 28, 33 and 39.
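The two kinds of lookup can be sketched with a small in-memory tree in Python. The class and function names are hypothetical and the on-disk format used by Searchtools is not modeled here. The entries are the ones from the example index.

```python
class Node:
    def __init__(self):
        self.children = {}  # next letter -> child node
        self.offsets = []   # offsets of strings that end at this node


def insert(root, key, offset):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, Node())
    node.offsets.append(offset)


def find_node(root, key):
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def search_exact(root, key):
    """Offsets of strings that are exactly equal to key."""
    node = find_node(root, key)
    return sorted(node.offsets) if node else []


def search_prefix(root, key):
    """Offsets of all strings that start with key."""
    node = find_node(root, key)
    if node is None:
        return []
    found, stack = [], [node]
    while stack:  # visit the matched node and everything beneath it
        current = stack.pop()
        found.extend(current.offsets)
        stack.extend(current.children.values())
    return sorted(found)


ENTRIES = [
    (0, "this"), (5, "looks"), (6, "ooks"), (11, "like"), (18, "sentence"),
    (19, "entence"), (20, "ntence"), (21, "tence"), (22, "ence"),
    (28, "look"), (33, "looks"), (34, "ooks"), (39, "looked"),
    (40, "ooked"), (41, "oked"),
]

root = Node()
for offset, key in ENTRIES:
    insert(root, key, offset)

print(search_exact(root, "look"))   # [28]
print(search_prefix(root, "look"))  # [5, 28, 33, 39]
```

The cost of the exact lookup depends on the length of the keyword, not on the number of indexed strings, which is what makes the index fast to query.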
In order to facilitate indexed searches a directory is created: The index directory. This directory contains the resulting index for one image. The index itself currently consists of three different file types:
Exactly one index configuration is located in an index directory and this file contains the general information used for creating the index itself. This file is used by the different tools of the searchtools and contains a binary form of the configuration. This file is therefore not meant to be read by human beings.
The actual index is split into a number of raw index files and a number of raw fragment index files. The reason for not using a single large index file is simple. The current process of generating an index requires a lot of memory. Each file represents a single piece of full memory dumped into a file. Thus if the generating computer has an immense amount of memory a single index file would be the result.
The index files contain very compact and optimized tree representations created during the indexing process. As described above, each file contains the contents that could fit in one full piece of memory. The tree in memory is optimized for in-memory use. The tree in file format is optimized for searching with the fewest possible lookups, and thus the fewest disk seeks.
The collection of searchtools consists of:
This section continues with a short description/demonstration of the most used tools. Not all options will be demonstrated and this will definitely not be a complete manpage for these tools, but this demonstration will give a general idea of the possibilities and capabilities of the Searchtools.
Some commands will be timed by using the standard 'time' command integrated in most shells.
The image that we are using is a dd image of a 50 MB Linux ext2 partition that is packed with data. Packed means that almost all of the 50 MB is used by the files present on the partition.
    # ls -l test.img
    -rw-r--r-- 1 paul paul 50M Jul 27 21:10 test.img
First we will create a standard index of the image, using the default values for the most important parameters described above.
    # time indexer -v test.img idx_std
    Starting raw indexing.
    Done 100.0 percent: 282 kNodes 6447 kOffsets 27M Mem
    Saving.
    Read 52428800 bytes.
    Total nodes 369063.
    Total offsets 6447387.
    Starting raw fragment indexing.
    Done 100.0 percent: 1 kNodes 0 kOffsets 0M Mem 12824/12824 Inodes
    Saving.
    Total nodes 1380.
    Total offsets 437.

    real 0m35.398s
    user 0m28.750s
    sys 0m1.180s
The output of the indexer command shows us that with these index parameters a total of 6,447,387 raw offsets were indexed, along with a small total of 437 raw fragment offsets. The total time to index this small 50 MB image is around 35 seconds on this 2.4 GHz PC. Thus a rough extrapolation would result in about 11 minutes per gigabyte of image. Note though that whenever memory is filled (250 MB by default), the contents have to be written to disk in order to continue.
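The extrapolation can be checked with a short Python calculation. It assumes that indexing time scales linearly with image size, which ignores the extra cost of writing memory to disk on larger images. The two constants come from the indexer output above.

```python
IMAGE_BYTES = 52428800    # "Read 52428800 bytes." in the indexer output
ELAPSED_SECONDS = 35.398  # "real" time reported by the time command


def minutes_for(size_bytes):
    """Linear extrapolation of the measured indexing time."""
    return ELAPSED_SECONDS * (size_bytes / IMAGE_BYTES) / 60


print(round(minutes_for(10**9), 1))  # decimal gigabyte
print(round(minutes_for(2**30), 1))  # binary gigabyte
```

Depending on whether a gigabyte is counted as 10^9 or 2^30 bytes, the estimate lands between roughly 11 and 12 minutes.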
The resulting index directory contains the following files:
    # ls -l idx_std
         516 index.cnf
       20437 raw_frag_idx.000
    17672517 raw_idx.000
As can be seen, the raw index file is about 17 MB and this file contains all the 6,447,387 raw offsets that were found during the previous step.
Now that the index is created, we will search for 'notifications', which occurs once in the image.
    # time searcher test.img idx_std notifications
    Type: Raw
    50712898 notifications

    real 0m0.003s
    user 0m0.010s
    sys 0m0.000s
The output of the searcher command shows us that the string 'notifications' is located at byte offset 50,712,898 of the image and that the search took only a fraction of a second.
This time the image is searched for the string 'data', which occurs 23,270 times in the image. (The -i flag is used for case insensitive searching and -p for an output format that is easier to parse.)
    # time searcher test.img idx_std data -i -p
    raw 253452 database
    raw 254271 data
    <snipped lots of results>
    raw 52357097 DataBase
    raw_frag 1913 12274 DATA

    real 0m0.132s
    user 0m0.070s
    sys 0m0.060s
The search and retrieval of these results took much less than one second and returned 28,189 results. Wait a minute! Didn't I just point out that the string 'data' occurs only 23,270 times in this image? By default the searcher returns all strings starting with the search string. When the '-w' flag is specified, only the keywords that exactly match the search string are returned.
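The difference between the default prefix matching and whole-word matching can also be illustrated by post-processing the '-p' output in Python. This is only a sketch over the four result lines shown above. The meaning of the two numbers on the raw_frag line is not described in this article, so the code keeps them as an opaque list. The function names are hypothetical.

```python
SAMPLE_OUTPUT = """\
raw 253452 database
raw 254271 data
raw 52357097 DataBase
raw_frag 1913 12274 DATA
"""


def parse_line(line):
    """Split a '-p' result line into (index type, numbers, matched string)."""
    fields = line.split()
    # Assumed layout: type first, matched string last, numbers in between.
    return fields[0], [int(n) for n in fields[1:-1]], fields[-1]


def whole_word_hits(lines, keyword):
    """Keep only hits whose string equals keyword, ignoring case."""
    hits = [parse_line(line) for line in lines if line.strip()]
    return [hit for hit in hits if hit[2].lower() == keyword.lower()]


for hit in whole_word_hits(SAMPLE_OUTPUT.splitlines(), "data"):
    print(hit)
```

Of the four sample lines, only the 'data' and 'DATA' hits survive the filter, which mirrors the effect the '-w' flag has on the full result set.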
Sometimes just searching will not do. In order to find special words, you want to be able to look at the number of occurrences of a specific keyword, or at all keywords that are present within the image. The print_keywords command prints all the keywords in an index directory or in a specific index file. In order to facilitate scripting, it is possible to let print_keywords skip the count that is appended to the end of each line by default.
    # time print_keywords -d idx_std
    0000 2307
    00000 1982
    <snipped lots of results>
    priorities 37
    prioritized 76
    prioritizing 2
    priority 1824
    prioritydata 42
    prioritynames 2
    <snipped lots of results>
    zzzvz 1
    zzzz 2
    zzzzz 2

    real 0m2.761s
    user 0m1.650s
    sys 0m0.250s
This concludes the small demonstration of the searchtools.
This article only lightly discusses the internal workings of the searchtools, but I hope it is able to shed a little light on the subject for people interested in it. If any of you require extra information, don't hesitate to ask, as I probably want to create the documentation anyway and would then have an incentive to actually do it.
Almost all functionality described in this article is also available in some way from the Autopsy interface. This article only used the commandline versions of the tools in order to visualize the actions done under the hood by the Autopsy interface.
Version 1.72 of TSK included a new tool called 'sstrings'. This tool is nothing more than the 'strings' tool from GNU Binutils 2.15 that has been removed from the package and modified slightly so that it can compile on TSK platforms. The additional 's' was added to the name because it will some day be part of a bigger collection of search tools, like the ones just described by Paul.
There were two motivations for adding it to this release instead of waiting until version 2. One was that some Linux distributions, namely Fedora Core 2, are shipping with a version of 'strings' that does not support large files and compiling your own version of Binutils does not fix the problem. The second motivation was to provide a method of searching for Unicode strings.
This article provides a brief overview of how to extract Unicode and how to use the new functionality in Autopsy.
By default, the latest GNU Binutils version of strings will extract only the ASCII strings from a file. However, there is a set of '-e' flags that allow you to extract different types of encoded characters. Supplying '-e l' will extract Unicode characters stored in 16-bit values in little endian ordering and '-e L' will extract Unicode characters stored in 32-bit values in little endian ordering. Similarly, '-e b' will extract 16-bit big endian characters and '-e B' will extract 32-bit big endian characters. You can also use '-e s' to extract 7-bit ASCII or '-e S' for 8-bit ASCII.
By extracting Unicode strings you will find the file names in an NTFS file system and application data inside of files. NTFS uses the 16-bit little endian storage method for its characters. The long file names in FAT use Unicode, but they are stored in small chunks, so you will not see a name as a single string. The default output of strings when you extract the Unicode strings is ASCII, so the normal 'grep' tool can be used to search them.
Note that the strings tool uses the same comparisons for both Unicode and ASCII when it determines if a value is a printable character. Therefore, only Unicode representations of ASCII characters will be extracted. For example, the ASCII representation for the number 3 is '0x33' and the 16-bit Unicode representation is '0x0033'. If a language uses '0x1234' for the number 3, then it will not be found by strings.
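The behavior described above can be approximated in a few lines of Python. This is a sketch of the idea behind '-e l', not the Binutils code: it looks for runs of at least four printable ASCII characters (the default minimum length of strings), each stored as a 16-bit little endian value. The function name is hypothetical and, unlike strings, the sketch does not treat the tab character as printable.

```python
import re

# Four or more printable ASCII bytes, each followed by a zero high byte.
UTF16LE_ASCII = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def extract_utf16le_strings(data):
    """Return ASCII-range strings stored as 16-bit little endian values."""
    return [m.group().decode("utf-16-le") for m in UTF16LE_ASCII.finditer(data)]


sample = (
    b"\x01\x02"
    + "FILE1.DAT".encode("utf-16-le")  # found: every character is ASCII
    + chr(0x1234).encode("utf-16-le")  # skipped: not an ASCII code point
    + "ab".encode("utf-16-le")         # skipped: shorter than four characters
    + b"\xff"
)

print(extract_utf16le_strings(sample))  # ['FILE1.DAT']
```

The character with value 0x1234 is stored as the bytes 0x34 0x12, so it fails the "printable byte followed by zero" test even though it is a valid Unicode character.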
When conducting a search in Autopsy, you can now choose to search for the keywords in ASCII, Unicode, or both. The search results are saved to a file so that they can be easily recalled. Unicode searches are conducted by running 'sstrings -e l' on the image and using the 'grep' tool on the output.
You can also extract the Unicode strings from an image to make keyword searching faster. Autopsy uses the '-e l' argument to 'sstrings' and the results are saved to a file in the 'output' directory of the evidence locker. The file will have an extension of '.uni'.
Including Unicode searching abilities in Autopsy has made it more useful for analyzing Windows systems. The new design forced me to temporarily remove the keyword search function during live analysis, but that will be fixed in the future.
Version 1.71 of TSK added a new '-p' flag to 'ifind' and this article explains what it does and why it is needed. Directories in NTFS, like other file systems, store the names of the files and directories that have been created in them. Each name has the address where the file metadata is stored. With NTFS, the metadata is stored in the MFT. When a file is deleted, the name could be overwritten and the link between the full path and the unallocated MFT entry is erased.
This frequently occurs with NTFS, which makes it difficult to identify what file allocated a specific MFT entry that is now unallocated. In addition, it means that the deleted file will not be shown when you list the contents of a directory.
Fortunately, each MFT entry also stores the address of its parent directory. In other words, when the file was allocated, there was a pointer from the directory to the file and a pointer from the file to the directory. Plus, the MFT entry contains the name of the file, but not the full path. You can see this when you run 'istat' on an unallocated file. For example, here is the output from an unallocated MFT entry:
    # istat -f ntfs img.dd 180
    MFT Entry Header Values:
    Entry: 180 Sequence: 4
    $LogFile Sequence Number: 1608100
    Not Allocated File
    ...
    $FILE_NAME Attribute Values:
    Flags: Archive
    Name: FILE1.DAT
    Parent MFT Entry: 31 Sequence: 1
    ...
We see the name of the file is 'FILE1.DAT' and its parent directory is entry 31. We can use the 'ffind' tool to find the name of this entry.
    # ffind -f ntfs img.dd 31
    /DIR1
Therefore, we know that the full path of the file was '/DIR1/FILE1.DAT', even though the 'DIR1' directory no longer has a name pointer to 'FILE1.DAT'.
Let's say that you want to know all files that were in 'DIR1'. If you use only 'fls', then you will not see 'FILE1.DAT'. That is why 'ifind -p' was added. You give 'ifind' the MFT entry address of the directory and it will search for all unallocated entries that point back to the given directory. Therefore, it will find the 'FILE1.DAT' file from the previous example.
    # ifind -f ntfs -p 31 img.dd
    -/r * 180: FILE1.DAT
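The logic behind the parent search can be sketched in Python. This does not show how TSK reads the MFT. The `MftEntry` record and the `find_orphans` function are hypothetical, and the entries other than 31 and 180 are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class MftEntry:
    number: int      # MFT entry address
    allocated: bool  # False once the file has been deleted
    name: str        # name stored in the $FILE_NAME attribute
    parent: int      # MFT entry address of the parent directory


def find_orphans(entries, parent_dir):
    """Return unallocated entries that still point back to parent_dir."""
    return [
        (entry.number, entry.name)
        for entry in entries
        if not entry.allocated and entry.parent == parent_dir
    ]


entries = [
    MftEntry(31, True, "DIR1", 5),          # parent 5 (root) is assumed
    MftEntry(180, False, "FILE1.DAT", 31),  # the deleted file from the example
    MftEntry(181, True, "FILE2.DAT", 31),   # hypothetical allocated file
    MftEntry(182, False, "OLD.TXT", 40),    # hypothetical, other directory
]

print(find_orphans(entries, 31))  # [(180, 'FILE1.DAT')]
```

Because the directory no longer references the deleted file, the only way to find it is to scan every entry and test its parent pointer.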
As of version 2.02, Autopsy uses both 'fls' and 'ifind -p' when it lists the contents of an NTFS directory. Therefore, you will find more names of deleted files. Autopsy also added filtering to that version so that duplicate copies of file names are not shown.