Recovering image data from JPEG file fragments
The most basic task of any file system is to manage and organize data in a storage volume. Each file in the store is allocated a list of blocks (the basic unit of access), and when we access a file, the system retrieves the data in sequence from the list. Correspondingly, removing the relevant entry deletes the item (note that in most systems, deleting simply means that data is overwritten over time by newly saved files, rather than actually removed). Figure 1 depicts the layout of files distributed across blocks of a storage medium, and provides a simplified view of the data structure used to track allocated blocks. When the information for a volume is corrupt or missing, it is only possible to recover files by analyzing the fragmented raw data, a process known as file carving.
Today, file-carving tools play an important role in digital forensic investigations, where analyzing deleted files and salvaging data from damaged and faulty media are common procedures. However, when files are encoded and compressed (as with most multimedia files), recovery is dependent on the availability of the file header, which includes all the decoding parameters. If a file is partially intact with its header deleted, common carving techniques cannot recover any data. Here, we describe an algorithm1 that advances the latest developments in JPEG carving by introducing the ability to recover file fragments when the associated header is missing.
There are two main challenges associated with file carving. The first is the inability to lay out data blocks contiguously on the storage, as in the case of files b and f in Figure 1. Repeated execution of file operations, such as addition, deletion, and modification of files, over time leads to fragmentation of available free storage space. As a result, the newly generated files need to be broken into several parts to fit into the available unallocated blocks. Figure 1 illustrates this phenomenon, where files a and c are separated into two pieces. Even the new solid-state drives (which use integrated circuits, rather than disks, to store data) are susceptible to this phenomenon as they are designed to emulate the interface characteristics of hard disk drives.
The second challenge to successful file carving is that interpreting binary-formatted data requires the use of decoders, without which a block of data will reveal little or no information about the content of a file.
There has been significant recent progress in carving JPEG files, which are the most widely adopted still image compression standard today and commonly the subject of forensic investigations.2–6However, techniques for JPEG recovery still assume that a file header is always present. Without the header, the usual techniques cannot recover any data, even though the rest of the file may be intact. For example, in Figure 1, the fragments of files d, e, and g were overwritten by files c and f. Recovering an arbitrary chunk of compressed image data without the matching encoding metadata essentially requires reconstructing a new file header, which at least requires knowledge of entropy coding parameters and quantization tables, methods used during compression for downsampling the color information and image dimensions.
Most JPEG images are created on digital cameras. Consequently, one way to approach this problem is to examine the headers of images captured by a variety of cameras to determine what parameters and settings are commonly used. We considered half a million images captured by more than two thousand makes and models of camera.1 Our findings support the widely held belief that, of the four possible compression modes the JPEG standard offers (lossless, baseline, progressive, and hierachical), baseline mode (which stores data sequentially as one top-to-bottom scan of the image) is the most common compression technique, used by 99.5% of images. More critically, we determined that 98.6% of these JPEG images were entropy-coded using the same set of parameters, rather than using custom-built ones. These two observations considerably simplify the task of carving individual fragments.
However, the recovery process involves further analysis. Deletion (or loss) of file data essentially means that decoding has to start at some arbitrary bit in the compressed bitstream. JPEG encoding stuffs the encoded bitstream with additional data so that markers that appear within encoded data can be distinguished and disregarded during decoding. To do this, the decoder has to synchronize to the bitstream to decompress the incomplete file data and obtain a spatial domain representation. Next, she or he must detect the image width, determine the horizontal position of the recovered image blocks, and adjust brightness and color to produce a perceptually meaningful image.
We evaluated our proposed carving algorithm on a set of disk images created to test file-carving tools.7Starting with a secure digital camera memory card image, we created six consecutive images by deleting some of the photos in each iteration. Figure 2 shows the recovered partial images that could not be carved out by any of the existing techniques. These recovery results show that our approach can uncover a new class of evidence, and thus has significant implications for forensic investigations and intelligence gathering. Our next step is to extend this approach to other frequently used compressed file formats.
New York University
Abu Dhabi, United Arab Emirates
Husrev Taha Sencar is the local director for New York University Abu Dhabi's cyber security center and is a member of the faculty (on leave) of the Computer Engineering Department at TOBB University. His research interests include digital forensics, multimedia computing, and security.