The Sleuth Kit Informer

http://www.sleuthkit.org/informer
http://sleuthkit.sourceforge.net/informer

Brian Carrier
carrier at sleuthkit dot org

Issue #16
September 15, 2004


CONTENTS
--------------------------------------------------------------------

- Introduction
- What's New?
- Search Tools (By: Paul Bakker)
- sstrings and Unicode Searching
- NTFS Orphan Files


INTRODUCTION
--------------------------------------------------------------------

In this issue of The Sleuth Kit Informer, we have two articles on
keyword searching and one on deleted file recovery in NTFS. All
three articles are based on new features that have been integrated
into recent versions of TSK and Autopsy or that are planned for
version 2. The first article is by Paul Bakker and covers his search
tools, which index keywords for faster searching. In the second
article, I discuss the new sstrings tool in TSK, and in the third
article I discuss NTFS orphan files and the '-p' option to 'ifind'.


WHAT'S NEW?
--------------------------------------------------------------------

Since the July issue of The Informer, there have been two releases
of both TSK and Autopsy.

TSK 1.71 was released on July 30, 2004 and included some NTFS bug
fixes as well as several other improvements to the NTFS code. TSK
1.72 was released on Sept 7, 2004 and fixed a couple of bugs with
FAT and NTFS images. Several enhancements to the EXTxFS code were
also made.

http://www.sleuthkit.org/sleuthkit/

Autopsy 2.02 was released on July 30, 2004 with improved support for
NTFS deleted files. Version 2.03 was released on Sept 7, 2004 and
included updates for Unicode searching and a few other minor
updates.

http://www.sleuthkit.org/autopsy/

I added a link to the 'comeforth' script by Dan Higgens, which helps
you process unallocated space. Here is a paragraph from the readme:

    Parse raw filesystem blocks, or block image data produced by
    "dls", found in the Sleuth Kit. This was inspired by lazarus
    (www.porcupine.org/forensics) but provides a bit more
    flexibility for processing very large data sets. Blocks of
    certain file types or matching certain regular expressions are
    first found and saved in a scan phase. After scanning, blocks
    that have been saved can be viewed, and based on their contents
    files can be reassembled from various other blocks. An
    auto-assemble feature is provided which can reassemble a
    complete file in many cases, knowing only the first block in
    the file (only for ext2/ext3 filesystems).

http://www.sleuthkit.org/sleuthkit/download.php


COMMENTS FROM READERS
--------------------------------------------------------------------

In response to the article about TestDisk in the last issue of The
Informer, Daniel Sedory mentioned that he has a page with screen
shots and an example of using TestDisk to recover a partition.

http://therdcom.com/testdisk.html
[Above link no longer works - Brian 1/17/08]


CALL FOR PAPERS
--------------------------------------------------------------------

The Sleuth Kit Informer is looking for articles on open source tools
and techniques for digital investigations (computer / digital
forensics) and incident response. Articles that discuss The Sleuth
Kit and Autopsy are appreciated, but not required.
Example topics include (but are not limited to):

- Tutorials on open source tools
- User experiences and reviews of open source tools
- New investigation techniques using open source tools
- Open source tool testing results

http://www.sleuthkit.org/informer/cfp.html


SEARCHTOOLS, INDEXED SEARCHING IN FORENSIC IMAGES
--------------------------------------------------------------------
Paul Bakker

Description

Forensic investigations of hard drives and images benefit greatly
from the many tools that are available. Some of these tools are Open
Source or Free Software, and some are commercial.

Searching for keywords is probably one of the most frequently
performed actions during a forensic investigation, but depending on
the tools used and the size of the hard drive or image, it can take
a lot of time. To speed up searches, it is possible to make a
time-space trade-off and create an index of the forensic image. Such
an index takes up space on the hard drive, but it greatly speeds up
subsequent searches.

At first I used a commercial Windows-based tool to index my images
and quickly search through them, but I had some very bad experiences
with it. A number of times during important investigations, the tool
crashed or hung while creating an index, or it was unable to open an
index it had just created, without any identifiable reason. As most
forensic investigations are done within a very short timeframe, it
is very frustrating if, after a full day, night, or weekend of
indexing (depending on the size of the drive), you are presented
with a crash or a failure to open an index file.

Because of these frustrating experiences, I decided to design and
create tools that could be used to perform indexed searches on
images. I chose to make these tools an addition to the existing Open
Source forensic tools Autopsy and The Sleuth Kit. The collection of
tools is called 'Searchtools' and can be found at
http://www.brainspark.nl. This article gives an overview of how
Searchtools works.

The Workings of Searchtools

Index Parameters

Because it is not viable to create an index containing all strings
that are present in an image, a number of parameters determine the
size of the index files and the strings they contain. The most
important parameters are:

- Minimum string length (Default: 4)
- Maximum string length (Default: 15)
- Characters indexed (Default: alphanumeric characters)
- Folding parameter (Default: no folding)

The minimum and maximum string length determine the lengths of the
strings that will be indexed. Indexing strings shorter than 4
characters results in a large amount of rubbish because of the high
chance that three indexable characters occur in succession in a
piece of random or binary data. Indexing strings longer than 15
characters does not yield more useful information.

The characters that are indexed should depend on the needs of the
investigator. If only words have to be searched, it is wise to index
only alphabetic characters in order to limit the size of the index
and thus the time it takes to generate.

The folding parameter specifies whether diacritic characters should
be mapped to their non-diacritic equivalents in the index. This
allows for easier searching of words that contain diacritic
characters. Currently only folding of diacritic ISO 8859-1
characters is supported.
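To make the effect of these parameters concrete, here is a minimal
sketch in Python. It is not Searchtools' actual extraction code (for
one thing, the real indexer also indexes the suffixes of each
string, as the example in the next section shows); it only models
how the length bounds, character set, and folding parameter decide
what gets indexed:

import re
import unicodedata

MIN_LEN, MAX_LEN = 4, 15   # default string length bounds

def fold(text):
    # Map diacritic characters to their non-diacritic equivalents,
    # e.g. 'café' becomes 'cafe' (the optional folding parameter).
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def indexable_strings(data):
    # Yield (offset, string) for every run of indexable characters
    # that meets the minimum length; alphanumeric runs model the
    # default character set, and longer runs are cut at MAX_LEN.
    for match in re.finditer(r"[A-Za-z0-9]+", data):
        run = match.group()
        if len(run) >= MIN_LEN:
            yield match.start(), run[:MAX_LEN]

print(list(indexable_strings("GET /index.html HTTP/1.1")))
# -> [(5, 'index'), (11, 'html'), (16, 'HTTP')]
print(fold("café"))
# -> 'cafe'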
Index Types

Currently Searchtools is able to create two different types of
indexes:

- Raw index
- Raw fragment index

The raw index contains all the strings that are located in the raw
image. This index type does not take into account any file system
structure that might be present on the image. If a string is located
in a fragmented file and spans non-consecutive sectors, it will not
be found using the raw index. To find such a string, the data on the
image is also indexed using the original file system structure. This
index is called the raw fragment index. To reduce its size and
prevent duplicate entries in the indexes, the raw fragment index
contains only the strings that start in one fragment and end in a
non-consecutive fragment.

Simplified Index Example

To visualize the data that is contained in an index, a small example
is presented. The example creates a simplified raw index of a file
containing only the string "This looks like a sentence: look looks
looked". The default parameters are used, so only strings with a
length of 4 to 15 characters are indexed. A simplified parsing of
the file results in the following index information:

 0  this          22  ence
 5  looks         28  look
 6  ooks          33  looks
11  like          34  ooks
18  sentence      39  looked
19  entence       40  ooked
20  ntence        41  oked
21  tence

Note: All locations are zero-based.

Internally, though, the information is represented in a tree, so a
more accurate representation would be the following simplified
drawing:

root
 +- e - n -+- c - e (22)
 |         +- t - e - n - c - e (19)
 +- l -+- i - k - e (11)
 |     +- o - o - k (28) -+- s (5,33)
 |                        +- e - d (39)
 +- n - t - e - n - c - e (20)
 +- o -+- k - e - d (41)
 |     +- o - k -+- s (6,34)
 |               +- e - d (40)
 +- s - e - n - t - e - n - c - e (18)
 +- t -+- e - n - c - e (21)
       +- h - i - s (0)

[Ed: Use a fixed width font or refer to the version on the website]

As can be seen, the internal representation uses the letters of the
indexed strings as nodes in a tree. The offsets of an indexed string
are stored at the node of its final letter. So, to search for the
string "look", only the letters of the string have to be walked in
the tree from the root node to see that the string is present at
location 28. If all strings starting with "look" are to be found,
all nodes beneath that node have to be visited as well, resulting in
the locations 5, 28, 33 and 39.

Index Directories

To facilitate indexed searches, a directory is created: the index
directory. This directory contains the resulting index for one
image. The index itself currently consists of three different file
types:

- Index configuration file
- Raw index files
- Raw fragment index files

Exactly one index configuration file is located in an index
directory, and it contains the general information used for creating
the index itself. This file is used by the different tools in
Searchtools and stores the configuration in a binary form; it is
therefore not meant to be read by human beings.

The actual index is split into a number of raw index files and a
number of raw fragment index files. The reason for not using a
single large index file is simple: the current process of generating
an index requires a lot of memory, and each file represents one full
memory's worth of index data dumped into a file. Thus, if the
generating computer had an immense amount of memory, a single index
file would be the result. The index files contain very compact and
optimized tree representations created during the indexing process.
The tree in memory is optimized for in-memory use; the tree in file
format is optimized for searching with the fewest possible reads and
thus disk seeks.
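Before moving on to the tools themselves, here is a minimal sketch
in Python of the tree lookup just described. It models only the
in-memory walk from the simplified example, not Searchtools' compact
on-disk format:

class Node:
    def __init__(self):
        self.children = {}   # letter -> child Node
        self.offsets = []    # offsets of strings ending at this node

def insert(root, word, offset):
    node = root
    for letter in word:
        node = node.children.setdefault(letter, Node())
    node.offsets.append(offset)

def search(root, word, prefix=False):
    # Walk the letters of the word down from the root node. An exact
    # search returns the offsets at the final node; a prefix search
    # also collects the offsets of every node beneath it.
    node = root
    for letter in word:
        if letter not in node.children:
            return []
        node = node.children[letter]
    if not prefix:
        return node.offsets
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.offsets)
        stack.extend(current.children.values())
    return sorted(found)

root = Node()
for offset, word in [(0, "this"), (5, "looks"), (11, "like"),
                     (28, "look"), (33, "looks"), (39, "looked")]:
    insert(root, word, offset)

print(search(root, "look"))               # -> [28]
print(search(root, "look", prefix=True))  # -> [5, 28, 33, 39]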
Different Searchtools Overview

The collection of Searchtools consists of:

- indexer: Performs the actual indexing of the image and creates
  the indexes that can be used for indexed searching.
- searcher: Uses the indexes created by indexer and can perform
  quick searches.
- print_keywords: Prints a sorted list of the keywords found in an
  index directory or a specific index file.
- counter: Checks and counts the number of offsets and nodes that
  are contained within an index file.
- print_config: Prints the information from the config file
  contained in an index directory in a human-readable form.
- print_header: Prints the header of an index file in a
  human-readable form.

Demonstration

This section continues with a short demonstration of the most used
tools. Not all options will be demonstrated, and this is definitely
not a complete manpage for these tools, but the demonstration will
give a general idea of the possibilities and capabilities of
Searchtools. Some commands will be timed with the standard 'time'
command built into most shells.

The image that we are using is a dd image of a 50 MB Linux ext2
partition that is packed with data, meaning that almost all of the
50 MB is used by the files present on the partition.

# ls -l test.img
-rw-r--r--  1 paul  paul  50M Jul 27 21:10 test.img

First we will create a standard index of the image (with the most
important parameters as specified above):

# time indexer -v test.img idx_std
Starting raw indexing.
Done 100.0 percent: 282 kNodes 6447 kOffsets 27M Mem
Saving.
Read 52428800 bytes.
Total nodes 369063. Total offsets 6447387.
Starting raw fragment indexing.
Done 100.0 percent: 1 kNodes 0 kOffsets 0M Mem 12824/12824 Inodes
Saving.
Total nodes 1380. Total offsets 437.

real    0m35.398s
user    0m28.750s
sys     0m1.180s

The output of the indexer command shows us that, with these index
parameters, a total of 6,447,387 raw offsets were indexed, along
with a small total of 437 raw fragment offsets. The total time to
index this small 50 MB image is around 35 seconds on this 2.4 GHz
PC, so a rough extrapolation gives about 12 minutes per gigabyte of
image. Note, though, that whenever memory fills up (250 MB by
default), the contents have to be written to disk before indexing
can continue.

The resulting index directory contains the following files:

# ls -l idx_std
     516 index.cnf
   20437 raw_frag_idx.000
17672517 raw_idx.000

As can be seen, the raw index file is about 17 MB, and it contains
all 6,447,387 raw offsets that were found during the previous step.

Now that the index is created, we will search for 'notifications',
which occurs once in the image.

# time searcher test.img idx_std notifications
Type: Raw
50712898 notifications

real    0m0.003s
user    0m0.010s
sys     0m0.000s

The output of the searcher command shows us that the string
'notifications' is located at byte offset 50,712,898 of the image
and that the search took only a fraction of a second.
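Although not part of the demonstration, a natural follow-up is to
map such a byte offset back to a file with the standard TSK tools.
This is a sketch of the idea rather than verified output: if
'fsstat' reports a 4,096-byte block size for this image, byte offset
50,712,898 falls in block 12,381 (50712898 / 4096); 'ifind -d' maps
that data unit to a meta-data address, and 'ffind' maps the
meta-data address to a file name.

# fsstat -f linux-ext2 test.img | grep 'Block Size'
# ifind -f linux-ext2 -d 12381 test.img
# ffind -f linux-ext2 test.img INODE_FROM_IFIND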
This time the image is searched for the string 'data', which occurs
23,270 times in the image. (The -i flag is used for case-insensitive
searching, -p for a more parseable output format.)

# time searcher test.img idx_std data -i -p
raw 253452 database
raw 254271 data
...
raw 52357097 DataBase
raw_frag 1913 12274 DATA

real    0m0.132s
user    0m0.070s
sys     0m0.060s

The search and retrieval of these results took much less than one
second and gave 28,189 results back. Wait a minute! Didn't I just
point out that the string 'data' occurs only 23,270 times in this
image? By default the searcher returns all strings starting with the
search string. By specifying the '-w' flag, only the keywords that
exactly match the search string are returned.

Sometimes just searching will not do. In order to find special
words, you may want to look at the number of occurrences of a
specific keyword, or at all keywords that are present within the
image. The print_keywords command prints all the keywords in an
index directory or in a specific index file. To facilitate
scripting, it is possible to make print_keywords skip the count that
is appended to each keyword by default.

# time print_keywords -d idx_std
0000 2307
00000 1982
...
priorities 37
prioritized 76
prioritizing 2
priority 1824
prioritydata 42
prioritynames 2
...
zzzvz 1
zzzz 2
zzzzz 2

real    0m2.761s
user    0m1.650s
sys     0m0.250s

This concludes the small demonstration of Searchtools.

Conclusion

This article only lightly discusses the internal workings of
Searchtools, but I hope it sheds a little light on the subject for
people who are interested in it. If any of you require extra
information, don't hesitate to ask; I want to create the
documentation anyway, and questions give me an incentive to actually
do it.

Almost all functionality described in this article is also available
in some way from the Autopsy interface. This article used only the
command-line versions of the tools in order to show the actions that
Autopsy performs under the hood.


SSTRINGS AND UNICODE SEARCHING
--------------------------------------------------------------------
Brian Carrier

Overview

Version 1.72 of TSK included a new tool called 'sstrings'. This tool
is nothing more than the 'strings' tool from GNU Binutils 2.15,
removed from that package and modified slightly so that it compiles
on the platforms TSK supports. The additional 's' was added to the
name because it will some day be part of a bigger collection of
search tools, like the ones just described by Paul.

There were two motivations for adding it to this release instead of
waiting until version 2. One was that some Linux distributions,
namely Fedora Core 2, are shipping with a version of 'strings' that
does not support large files, and compiling your own version of
Binutils does not fix the problem. The second motivation was to
provide a method of searching for Unicode strings. This article
provides a brief overview of how to extract Unicode strings and how
to use the new functionality in Autopsy.

Extracting ASCII-based Unicode Strings

By default, the latest GNU Binutils version of strings will extract
only the ASCII strings from a file. However, there is a set of '-e'
flags that allows you to extract different types of encoded
characters. Supplying '-e l' will extract Unicode characters stored
in 16-bit values in little-endian ordering, and '-e L' will extract
Unicode characters stored in 32-bit values in little-endian
ordering. Similarly, '-e b' will extract 16-bit big-endian
characters and '-e B' will extract 32-bit big-endian characters. You
can also use '-e s' to extract 7-bit ASCII or '-e S' for 8-bit
ASCII.
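For example, a command-line Unicode keyword search could look like
the following (the image and keyword are hypothetical; '-t d' prints
the byte offset of each string in decimal, as in GNU strings):

# sstrings -t d -e l img.dd | grep -i 'invoice'

Because the extracted strings are written out as ASCII, as described
below, 'grep' can be used on the output directly.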
By extracting Unicode strings, you will find the file names in an
NTFS file system and application data inside of files. NTFS uses the
16-bit little-endian storage method for its characters. The long
file names in FAT also use Unicode, but they are stored in small
chunks and you will not see them as a single string. The default
output of strings when you extract Unicode strings is ASCII, so the
normal 'grep' tool can be used to search it.

Note that the strings tool uses the same comparisons for both
Unicode and ASCII when it determines whether a value is a printable
character. Therefore, only Unicode representations of ASCII
characters will be extracted. For example, the ASCII representation
for the number 3 is 0x33 and the 16-bit Unicode representation is
0x0033. If a language uses 0x1234 for the number 3, then it will not
be found by strings.

Integration With Autopsy

When conducting a search in Autopsy, you can now choose to search
for the keywords in ASCII, Unicode, or both. The search results are
saved to a file so that they can be easily recalled. Unicode
searches are conducted by running 'sstrings -e l' on the image and
using the 'grep' tool on the output.

You can also extract the Unicode strings from an image to make
keyword searching faster. Autopsy uses the '-e l' argument to
'sstrings', and the results are saved to a file with a '.uni'
extension in the 'output' directory of the evidence locker.

Summary

Including Unicode searching abilities in Autopsy has made it more
useful for analyzing Windows systems. The new design forced me to
temporarily remove the keyword search function during live analysis,
but that will be fixed in the future.


NTFS ORPHAN FILES
--------------------------------------------------------------------
Brian Carrier

Version 1.71 of TSK added a new '-p' flag to 'ifind', and this
article explains what it does and why it is needed.

Directories in NTFS, like those in other file systems, store the
names of the files and directories that have been created in them.
Each name is stored with the address where the file's metadata can
be found; with NTFS, the metadata is stored in the MFT. When a file
is deleted, its name can be overwritten, and the link between the
full path and the now unallocated MFT entry is erased. This
frequently occurs with NTFS, and it makes it difficult to identify
which file allocated a specific MFT entry that is now unallocated.
It also means that the deleted file will not be shown when you list
the contents of its directory.

Fortunately, each MFT entry also stores the address of its parent
directory. In other words, when the file was allocated, there was a
pointer from the directory to the file and a pointer from the file
to the directory. The MFT entry also contains the name of the file,
but not the full path. You can see this when you run 'istat' on an
unallocated file. For example, here is the output from an
unallocated MFT entry:

# istat -f ntfs img.dd 180
MFT Entry Header Values:
Entry: 180        Sequence: 4
$LogFile Sequence Number: 1608100
Not Allocated File
...

$FILE_NAME Attribute Values:
Flags: Archive
Name: FILE1.DAT
Parent MFT Entry: 31    Sequence: 1
...

We see that the name of the file is 'FILE1.DAT' and that its parent
directory is entry 31.
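This parent address is what makes orphan file recovery possible:
given the unallocated MFT entries, the recovered names can be
grouped by the parent entry they point back to. Here is a minimal
sketch in Python of that reverse lookup, which the 'ifind -p' option
described below automates (entry 180 is from the example above; the
other tuples are hypothetical):

# (MFT entry, name, parent entry) tuples recovered from
# unallocated MFT entries
unalloc = [(180, "FILE1.DAT", 31),
           (201, "FILE2.DAT", 31),
           (305, "TMP.LOG", 87)]

def orphans_in(parent, entries):
    # Return the deleted names whose $FILE_NAME attribute points
    # back at the given parent directory.
    return [(entry, name) for (entry, name, par) in entries
            if par == parent]

print(orphans_in(31, unalloc))
# -> [(180, 'FILE1.DAT'), (201, 'FILE2.DAT')]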
We can use the 'ffind' tool to find the name of entry 31:

# ffind -f ntfs img.dd 31
/DIR1

Therefore, we know that the full path of the file was
'/DIR1/FILE1.DAT', even though the 'DIR1' directory no longer has a
name pointer to 'FILE1.DAT'.

Let's say that you want to know all of the files that were in
'DIR1'. If you use only 'fls', then you will not see 'FILE1.DAT'.
That is why 'ifind -p' was added. You give 'ifind' the MFT entry
address of the directory, and it will search for all unallocated
entries that point back to the given directory. Therefore, it will
find the 'FILE1.DAT' file from the previous example:

# ifind -f ntfs -p 31 img.dd
-/r * 180:  FILE1.DAT

As of version 2.02, Autopsy uses both 'fls' and 'ifind -p' when it
lists the contents of an NTFS directory, so you will find more names
of deleted files. Filtering was also added in that version so that
duplicate copies of file names are not shown.

--------------------------------------------------------------------
Copyright (c) 2004 by Brian Carrier. All Rights Reserved

This article is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike License.

http://creativecommons.org/licenses/by-nc-sa/2.0/

Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305