

Journal of Digital Forensics, Security and Law

Volume 13 | Number 2 Article 7

October 2018

Fingerprinting JPEGs With Optimised Huffman Tables

Sean McKeown
Edinburgh Napier University, [email protected]

Gordon Russell
Edinburgh Napier University, [email protected]

Petra Leimich
Edinburgh Napier University, [email protected]

Follow this and additional works at: https://commons.erau.edu/jdfsl

Part of the Information Security Commons

This Article is brought to you for free and open access by the Journals at Scholarly Commons. It has been accepted for inclusion in Journal of Digital Forensics, Security and Law by an authorized administrator of Scholarly Commons. For more information, please contact [email protected].

(c)ADFSL

Recommended Citation
McKeown, Sean; Russell, Gordon; and Leimich, Petra (2018) "Fingerprinting JPEGs With Optimised Huffman Tables," Journal of Digital Forensics, Security and Law: Vol. 13 : No. 2, Article 7.
DOI: https://doi.org/10.15394/jdfsl.2018.1451
Available at: https://commons.erau.edu/jdfsl/vol13/iss2/7


Fingerprinting JPEGs With Optimised Huffman Tables

Cover Page Footnote
This research was supported by a scholarship provided by Peter KK Lee.

This article is available in Journal of Digital Forensics, Security and Law: https://commons.erau.edu/jdfsl/vol13/iss2/7


FINGERPRINTING JPEGS WITH OPTIMISED HUFFMAN TABLES

Sean McKeown, Gordon Russell, Petra Leimich
School of Computing
Edinburgh Napier University, Scotland
{S.McKeown, G.Russell, P.Leimich}@napier.ac.uk

ABSTRACT

A common task in digital forensics investigations is to identify known contraband images. This is typically achieved by calculating a cryptographic digest, using hashing algorithms such as SHA256, for each image on a given medium, and comparing individual digests with a database of known contraband. However, the large capacities of modern storage media and the time pressures placed on forensics examiners necessitate the development of more efficient processing methods. This work describes a technique for fingerprinting JPEGs with optimised Huffman tables which requires only the image header to be present on the media. Such fingerprints are shown to be robust across large datasets, with demonstrably faster processing times.

Keywords: digital forensics, image comparison, image processing, known file analysis, partial file analysis

1. INTRODUCTION

Digital Forensics is developing at a rapid pace to keep in step with advances in technology. Approaches which worked well decades ago may not remain effective, as the nature of the underlying evidence shifts, or as datasets surpass the terabyte scale (Beebe & Clark, 2005). The nature of the field demands ongoing research, with gaps existing in prominent areas such as triage and data reduction (Quick & Choo, 2014).

One common aspect of many public sector investigations is the detection of contraband material within a large number of images. Traditionally, this is achieved using file fingerprints generated by means of cryptographic hashing (Garfinkel, Nelson, White, & Roussev, 2010). The fingerprints extracted from a piece of evidence are compared with a database containing fingerprints of known files of interest. These fingerprints require that the entire file be processed, incurring IO and CPU overhead, and are sensitive to any change in the binary data.

However, if an equivalent fingerprint could be created from information residing near the beginning of a file, then significant performance improvements could be gained by avoiding the IO required to process the whole file with more traditional fingerprinting techniques. Even if this partial file technique created fingerprints with a larger collision domain, false positive matches could be additionally verified by a full file hash, and so such a combined hashing approach remains effective as long as the false positive rate is reasonably low.

The main contribution of this paper is an inexpensive method for creating fingerprints of JPEG files. Huffman tables are present in the header of every JPEG, with the fingerprinting method in this work exploiting Huffman tables which have been optimised for maximal compression, an increasingly common encoding technique used on the Web. An analysis of the distinctness of such fingerprints is provided, as well as an examination of the portion of the file which must be processed to extract them. Finally, the method is evaluated with timed benchmarks comparing the extraction process to a traditional hashing method.

This paper demonstrates that the proposed fingerprinting technique is faster than traditional hashing techniques on a large dataset of real world images. In analysing the false positive rate, the collision rate was shown to be almost zero.

When Huffman tables are optimised for the content of the image, the table communicates coarse information about the image, while changes to EXIF metadata and some small image manipulations have no effect on the table. This work is timely, as there is a technological trend towards optimally encoding images in the JPEG format, such as the encoders published recently by Mozilla (Mozilla, 2017) and Google (Alakuijala et al., 2017).

Experiments are carried out using three datasets: Flickr 1 Million (Huiskes, Thomee, & Lew, 2010), containing 1 million JPEGs; Govdocs (Garfinkel, Farrell, Roussev, & Dinolt, 2009), containing approximately 109,000 JPEGs; and a pre-processed version of Govdocs with optimised Huffman tables produced by the authors.

2. JPEG COMPRESSION OVERVIEW

The JPEG standard (Wallace, 1992) is a lossy compression technique for reducing the file size of images. The standard leverages properties of human vision in order to provide the best trade-off between perceived image quality and compression ratio, with several stages of compression being utilised.

During compression, images are typically converted to the YCbCr colour space, separating the luminance and chrominance channels, the latter of which may be optionally sub-sampled. The result is then divided into 8 × 8 pixel blocks, which are transformed to the frequency domain using the Discrete Cosine Transform (DCT) to produce a matrix of 64 coefficients. The coefficient at the top left of the matrix, known as the DC (Direct Current) coefficient, represents the mean colour value of the block, while the remaining 63 coefficients, known as AC (Alternating Current) coefficients, contain horizontal and vertical frequency information. As most of the human-sensitive aspects of the signal are concentrated in the top-left of the matrix, which hosts the low frequency coefficients, much of the higher frequency information may be discarded, or represented more coarsely. This is achieved by quantization, with the JPEG quantization table mapping the relative compression ratios of each DCT coefficient. The standard provides a generic quantization table, which can be scaled to vary image quality.

The quantization process results in many AC coefficients becoming zero. A run-length encoding scheme is then applied which compresses these runs of zeroes efficiently. Additionally, as the average colour (DC) of each 8 × 8 block is expected to change gradually throughout the image, differential coding is used to efficiently compress the colour differences between blocks. The coefficient compression utilises variable length encoding schemes, with the data stored as a set of bit length and value pairs. This information is further encoded using single byte codes, which in turn represent the magnitude of the DC, or combined magnitude and run-length for the AC coefficients. As these codes are repeated frequently, Huffman encoding can be used to compress their representation to variable length bit strings. This allows for frequently occurring codes to be represented in perhaps two or three bits, instead of a byte. The JPEG standard provides a default mapping of these Huffman bit strings for the AC and DC byte codes. However, for more efficient compression, a per-image optimised Huffman table may be generated based on the actual occurrences of these codes, resulting in smaller file sizes.
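The frequency-to-length relationship described above can be sketched as follows. This is an illustrative Huffman construction over hypothetical DCT byte-code counts, not libjpeg's actual optimiser (which, among other things, enforces the standard's 16-bit maximum code length):

```python
import heapq
from collections import Counter

def huffman_code_lengths(frequencies):
    """Assign each symbol a Huffman code length: frequent symbols end
    up nearer the root of the tree, and so receive shorter codes."""
    # Heap entries: (total frequency, tiebreak id, {symbol: depth})
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Hypothetical counts for five DCT byte codes observed in one scan
counts = Counter({0x00: 900, 0x01: 60, 0x11: 25, 0x21: 10, 0x31: 5})
lengths = huffman_code_lengths(counts)
# The dominant code is represented in a single bit, the rarest in four
assert lengths[0x00] == 1 and lengths[0x31] == 4
```

Because the assigned lengths depend directly on the observed code frequencies, two images with the same code distribution (and encoding parameters) yield the same optimised tables.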

Figure 1 depicts the beginning of a sample JPEG image. Immediately following the JPEG start marker are the application markers which specify the particular JPEG file format (such as JFIF or EXIF) and miscellaneous metadata, such as title, comments, camera settings, camera model, or editing software information. This is then followed by decompression information comprised of the quantization and Huffman tables. The most basic (baseline) JPEG makes use of two quantization tables, one for luminance, and another for chrominance, with four Huffman tables for the combinations of AC/DC and luminance/chrominance. This is then followed by the actual image scan data, which is stored sequentially, from the top of the image to the bottom. An alternative format, the progressive JPEG, may contain multiple image scans, starting with low resolution versions of the image and increasing in steps, allowing for images to increase in quality as they are loaded on the Web. Progressive JPEGs may contain more Huffman tables, which are used for each individual scan.

Figure 1: The structure of a sample JPEG as it is stored on disk. The metadata section may be long, and it is abbreviated here for clarity.

Mozilla’s MozJPEG (Mozilla, 2017) introduces tweaks to the encoding process by modifying the original JPEG libraries while remaining compliant with the specification. The technique uses optimised Huffman tables. All images are converted to the progressive JPEG format, and new quantization table presets are provided to better accommodate high resolution images. Google’s Guetzli (Alakuijala et al., 2017) takes a more aggressive approach, with coarse quantization presets, Huffman table optimisation and post-processing of the DCT coefficient matrix. Guetzli produces sequential images, rather than using progressive JPEGs.

3. RELATED WORK

This section outlines existing work in cryptographic file hashing in order to detect known files, before exploring work focusing on the analysis of JPEG header features.

3.1 Forensic File Hashing

Cryptographic hashes are fundamental to the digital forensics process, deployed as a mechanism for verifying the integrity of the data, subject to chain of custody assurance, as well as to fingerprint both known good and known bad files for later database lookups. J. Kornblum (2006) noted that such hashes, which are typically based on the entire content of a file or media, can easily be attacked by modifying a single bit in the original data, which produces an entirely different hash digest. Such changes may be used to interfere with automatic detection processes. In order to mitigate this, piecewise hashes may be used, where data is split into chunks which are hashed separately, such that the fingerprint is more resistant to manipulation. J. Kornblum (2006) extends prior piecewise hashing methods by applying a rolling hash system. This is incorporated into the ssdeep tool, which provides a conservative estimate of the number of identical bytes in a file, providing a similarity score from 0–100.
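The contrast with full-file hashing can be sketched with a fixed-block piecewise hash in Python. This is a simplification: ssdeep additionally uses a rolling hash to choose chunk boundaries, so that insertions do not shift every subsequent chunk.

```python
import hashlib

def piecewise_hashes(data: bytes, chunk_size: int = 512):
    """Hash fixed-size chunks independently; flipping one bit changes
    only the digest of the chunk containing it, whereas a whole-file
    hash changes entirely."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

a = bytes(2048)
b = bytearray(a)
b[0] ^= 1                                        # flip a single bit
ha, hb = piecewise_hashes(a), piecewise_hashes(bytes(b))
assert sum(x != y for x, y in zip(ha, hb)) == 1  # only one chunk differs
```

A similarity score can then be derived from the proportion of matching chunk digests between two files.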

Piecewise hashing can give false positives due to the existence of common data blocks in various file types. Roussev (2010) describes sdhash, a method for selecting statistically improbable features when producing data fingerprints, while Garfinkel et al. (2010) focus on removing common non-distinct blocks from the hash database.

Breitinger et al. (2013) compare and contrast the properties of full file cryptographic hashing, bytewise approximate matching, and semantic approximate matching. Binary methods were shown to be much faster, though much less resistant to content preserving modifications, while outperforming semantic methods in the realms of damaged or embedded file detection. The authors suggest that the relative merits of each method should be exploited with an ordered approach. Traditional file hashing can be used to flag up obvious contraband, followed by costly semantic methods to detect any modified images. Finally, damaged or embedded fragments may be detected using bytewise approximate matching.

McKeown, Russell, and Leimich (2017) exploit image encoding metadata and small blocks of scan data to create signatures for images in the PNG format. While the signatures in this work are not unique, the extraction time is much faster than full file hashing. The authors suggest that inexpensive and less accurate techniques may be used to rule out the majority of non-contraband, while more expensive methods may be used to verify potential contraband. This allows for reduced processing loads and efficient contraband detection.

3.2 Exploiting JPEG Header Features

Cryptographic hashing is usually applied in a manner which is indifferent to the specifics of the file format, processing the entire file at the binary level. However, there is much to be gained through careful analysis of the JPEG file structure, with the following work focusing on the utility of features found in the JPEG header.

As JPEGs are often a central part of many investigations, it is important that their integrity can be verified, so as to detect manipulation or forgeries. Piva (2013) provides an overview of techniques addressing both of these issues. Signal processing techniques can be used to identify imperfections or camera fingerprints caused by particular camera lenses; image sensors; and camera software traces, such as colour filters. Image manipulations are detected both in the JPEG pixel and compression domains, using 8×8 block boundaries, pixel and frequency domain gradients, and coefficient histograms.

In contrast, a body of work has developed which attempts to address the issues of integrity and source identification purely via the exploitation of JPEG headers. Quantization table fingerprinting has been explored for this purpose (Farid, 2006, 2008; J. D. Kornblum, 2008; Mahdian, Saic, & Nedbal, 2010), with findings suggesting that while quantization tables are not unique, they provide a good deal of discriminating information. Quantization tables may be unique to a particular camera model/software editor, or they may be able to identify a particular manufacturer, or group of source devices (Farid, 2006, 2008). Together with knowledge of the base tables provided in the JPEG standard, it is possible to detect mismatches and identify manipulation (J. D. Kornblum, 2008). However, this approach fails when adaptive quantization tables are used, which adjust the table on a per image basis, much in the same way that Huffman tables may be optimised (J. D. Kornblum, 2008).

Gloe (2012) provides an analysis of quantization tables as well as structural information pertaining to metadata, file markers and thumbnails. It was noted that the ordering and particular structure of metadata may be used as a mechanism to detect image manipulation. This is attributed to differences between how image editors record this data at the lowest level, essentially leaving behind a software fingerprint which can be detected, hindering untraceable modifications. It is suggested that convincing forgeries require advanced programming skills, in order to avoid the structural alterations introduced by existing tools.

Kee, Johnson, and Farid (2011) utilise a larger set of JPEG header features to generate signatures for cameras and software tools. In addition to quantization tables, features are extracted from Huffman tables, EXIF metadata, image dimensions and thumbnails. This was shown to be effective for 1.3 million Flickr images, with 62% of signatures identifying a single camera model, 80% narrowing the source to three or four cameras, and 99% identifying a unique manufacturer. The algorithm does not use complete representations of the original data structures, and instead uses only the number of Huffman codes of each length, and simple counts of EXIF fields. Of the extracted features, EXIF data was shown to be the most distinct, followed by the image dimensions.

In the domain of Content Based Image Retrieval (CBIR), Edmundson and Schaefer (Schaefer, Edmundson, Takada, Tsuruta, & Sakurai, 2012; Schaefer, Edmundson, & Sakurai, 2013; Edmundson & Schaefer, 2012, 2013) exploit optimised Huffman tables for inexpensive image comparison. This is possible as optimised tables are derived from the frequencies of DCT coefficients in the compressed data stream, and can therefore serve as a coarse proxy for image content. Image search performance was evaluated using a number of datasets, including Flickr 1 Million (Huiskes et al., 2010), with performance levels being comparable to prior CBIR methods, while offering a 30-fold computational improvement over non-Huffman based compressed domain methods, and a 150-fold improvement over the fastest pixel domain method.

Figure 2: The portions of a JPEG used for traditional cryptographic hashing and Huffman comparison.

4. METHODOLOGY

To generate ranked similarity lists and to detect image source devices and software, prior work utilised entire Huffman tables. Our approach is to use optimised Huffman tables to identify particular JPEG images. This means that only the header of the JPEG needs to be read, while traditional hash based fingerprinting processes the entire file, as depicted in Figure 2.

4.1 Extracting Huffman tables

Structures in the JPEG format are preceded by markers which consist of two bytes: the first is always 0xFF, with the second byte indicating the type of marker.

An example Huffman table is provided in Figure 3. Huffman tables are essentially stored as two arrays, the first containing the number of Huffman codes of each bit length, while the latter lists the corresponding DC and AC byte codes in the table. These arrays are all that is needed to reconstruct the entire Huffman decode tree. Pseudo-code for generating Huffman fingerprints is provided in Algorithm 1, showing that an ordered concatenation of all length/value arrays is all that is required to generate the fingerprint.

Figure 3: An example Huffman table as it appears on disk. In table types, Y corresponds to the luminance channel, while C corresponds to the chrominance channels (Cb/Cr).

In practice, there are many existing libraries for parsing JPEGs. The authors chose the Libjpeg (Independent JPEG Group, 2016) library to extract the Huffman tables using C++. Once the decompression object is initialised, header data, including the Huffman tables, can be acquired by calling the jpeg_read_header function. This header information also includes JPEG type and colour space data. At this point, the file has been processed up to the Start of Scan (SOS) marker, which indicates the beginning of the compressed data streams.

4.2 Properties and Limitations

Optimised Huffman tables provide coarse information about the relative frequencies of DCT codes in the compressed data stream, and therefore communicate some information about the frequency domain representation of the source image. As such, fingerprints derived from the DCT codes are robust to metadata modification and some modifications to image content.

A limitation of this technique is that both images must have been encoded using the same quantization tables, colour space, and channel sub-sampling. If this is not the case, the images will contain a different distribution of DCT byte codes, resulting in different Huffman tables when optimised.

Progressive JPEGs use multiple scans at different resolutions, potentially resulting in many more Huffman tables than baseline JPEGs. As such, encoding the same image as both a baseline and a progressive JPEG will result in a mismatched number of tables, though the tables from the baseline image may be very similar to the tables for coarse progressive scans. Huffman tables will appear before each scan in a progressive JPEG, such that they will be found throughout the file.

Algorithm 1: Generate Huffman Fingerprint. Tables are ordered by type to avoid issues with their ordering on disk.

Input: JPEG File
Output: JPEG Fingerprint

huffmanTables = {};
marker = nextMarker();
// Loop until SOS marker
while (marker != 0xFFDA) do
    // Check for Huffman marker
    if (marker == 0xFFC4) then
        length = readBytes(2);
        type = readBytes(1);
        // Read remaining length after length/type bytes
        htable = readBytes(length - 3);
        huffmanTables[type] = toString(htable);
    end
    marker = nextMarker();
end
// Order keys by type: 0x00, 0x01, 0x10, 0x11
orderedKeys = order(huffmanTables.keys());
fingerprint = '';
for (key in orderedKeys) do
    fingerprint += huffmanTables[key];
end
return fingerprint
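A byte-level rendering of Algorithm 1 might look as follows in Python. Like the pseudo-code, this sketch assumes one Huffman table per DHT segment (the JPEG format also permits several tables in a single segment):

```python
def huffman_fingerprint(data: bytes) -> str:
    """Collect DHT (0xFFC4) payloads keyed by table type, stop at the
    Start of Scan marker (0xFFDA), and concatenate in type order."""
    tables = {}
    i = 2  # skip the 0xFFD8 start-of-image marker
    while i + 2 <= len(data):
        if data[i] != 0xFF or data[i + 1] == 0xFF:
            i += 1              # not at a marker, or a 0xFF fill byte
            continue
        marker = data[i + 1]
        if marker == 0xDA:      # SOS: compressed stream begins
            break
        if marker in (0xD8, 0x01) or 0xD0 <= marker <= 0xD7:
            i += 2              # standalone markers carry no length field
            continue
        if i + 4 > len(data):
            break
        length = int.from_bytes(data[i + 2:i + 4], 'big')
        if marker == 0xC4:      # DHT: Define Huffman Table
            table_type = data[i + 4]
            tables[table_type] = data[i + 5:i + 2 + length]
        i += 2 + length
    # Order keys by type byte: 0x00, 0x01, 0x10, 0x11
    return b''.join(tables[k] for k in sorted(tables)).hex()

# A minimal fabricated header: SOI, two DHT segments, then SOS
sample = bytes.fromhex('ffd8' 'ffc4' '0007' '10' '01020304'
                       'ffc4' '0007' '00' 'aabbccdd' 'ffda')
assert huffman_fingerprint(sample) == 'aabbccdd01020304'
```

Here the table type byte doubles as the sort key, matching the 0x00, 0x01, 0x10, 0x11 ordering in Algorithm 1.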

The process of generating optimised Huffman tables involves inspecting the corresponding DC and AC streams, and counting DCT code frequencies. The most frequently occurring items are assigned the smallest Huffman codes. The jpegtran utility in the Libjpeg package may be used to create optimised images from unoptimised JPEGs with the -optimize option. However, this requires the entire file to be processed, and introduces substantial overhead. Therefore, while it is possible to acquire optimised tables for any given image, Huffman table comparison is best suited to images which are already optimised.

4.3 Evaluation

In order to demonstrate the utility of optimised Huffman table analysis for digital forensics, several features must be investigated. The first is the distinctness of optimised Huffman tables, and whether they are capable of uniquely identifying images within a large dataset. Secondly, the incidence of optimised Huffman tables in the wild must be explored, in order to determine how often this technique can be applied. The proportion of a JPEG required to read the Huffman tables, and any corresponding reduction in processing time, must also be quantified to measure the performance advantage of this technique.

Huffman fingerprints are constructed by extracting the Huffman data, ordering the data by table type, and concatenating the raw data. For the purposes of the experiment, where two fingerprints matched, the clash was verified using SHA256 digests of the entire images.
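The match-then-verify flow can be sketched as below; `known_index` and `extract_fingerprint` are hypothetical stand-ins for the fingerprint database and the Huffman extraction routine:

```python
import hashlib

def check_image(path, known_index, extract_fingerprint):
    """Two-stage lookup: a cheap header fingerprint rules most files
    out, and a full-file SHA256 digest confirms any candidate match.
    known_index maps fingerprint -> set of SHA256 hex digests."""
    with open(path, 'rb') as f:
        header = f.read(4096)  # a single media block usually suffices
    candidates = known_index.get(extract_fingerprint(header))
    if not candidates:
        return False           # ruled out without reading the whole file
    with open(path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest in candidates
```

Only files whose fingerprints collide with known contraband pay the cost of a full read and hash.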

The offset of the Start of Scan marker was acquired after reading the JPEG header by calling the stdio ftell function and then subtracting the remaining bytes in the src input buffer to get the correct value.

Benchmarks compare the proposed method against a traditional full file hashing method using the SHA256 algorithm. Benchmark times correspond to the duration for extracting signatures from a list of files, without storing the signatures or performing database lookups. In order to assess the IO costs of accessing small pieces of the file from the storage media, an additional benchmark was performed which read the first 4KiB of each file without any processing.
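A minimal version of that benchmark pair might look like this in Python; it is a sketch of the benchmark design, not the authors' harness:

```python
import hashlib
import time

def bench(paths):
    """Time two passes over the same file list: a full-file SHA256
    digest, and a header-only read of the first 4 KiB. Nothing is
    stored and no database lookups are performed, as in the paper."""
    t0 = time.perf_counter()
    for p in paths:
        with open(p, 'rb') as f:
            hashlib.sha256(f.read()).hexdigest()
    full_hash = time.perf_counter() - t0

    t0 = time.perf_counter()
    for p in paths:
        with open(p, 'rb') as f:
            f.read(4096)       # IO cost of touching only the header
    partial_read = time.perf_counter() - t0
    return full_hash, partial_read
```

On a warm cache the gap mostly reflects hashing CPU time; on cold media it also reflects the avoided data transfer.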

4.4 Datasets

Three datasets were used in this work. The first, the Flickr 1 Million dataset (Huiskes et al., 2010), contains 1 million JPEGs with optimised Huffman tables. The second is the Govdocs JPEG corpus (Garfinkel et al., 2009), which contains approximately 109,000 JPEGs possessing mixed properties. The third dataset is a copy of Govdocs where all images were optimised using jpegtran with the -copy all and -optimize flags. Duplicate images were not removed.

While the Flickr 1 Million dataset is almost completely comprised of optimised JPEGs, three images (621224.jpg, 636616.jpg, 646672.jpg) were found to use default Huffman tables, and therefore produced the same Huffman fingerprint. These three images were optimised using the same method as for the optimised Govdocs dataset. All optimised images use the YCbCr colour space and are non-progressive, while the unmodified Govdocs dataset contains mixed types.

5. FINDINGS

5.1 Huffman Distinctness in Flickr 1 Million

If Huffman tables are to be used to identify particular images in a large dataset, they must contain enough discriminating power to do so. To this end, the Huffman fingerprints for all images in the Flickr 1 Million dataset were used to construct equivalence classes, grouping together images with the same fingerprint. A class size of n indicates that n images possess the same fingerprint, with a class size of 1 indicating a unique fingerprint.

The Flickr 1 Million dataset contains many sets of duplicates, with a total of 746 images having at least one other image in the dataset with identical binary data (see Table 1). Once duplicates were removed from the results, all but two pairs of images were found to possess unique Huffman fingerprints. This shows that this fast fingerprinting technique has an almost zero percent false positive rate at scale.

Indeed, the two pairs of JPEGs with matching Huffman tables in this collection are almost identical, and in reality differ by a small number of pixels. Image differences were visualised using the Resemble.js library (Cryer, 2016), with differences highlighted in Figure 4. In the first case, 3 pixels differ as a semi-colon is added to the text rendered in the image, while in the second case one version of the image has two letters transposed. As the matching pairs also use the same quantization tables, such differences are small enough to result in the same optimised Huffman tables being generated.

Figure 4: Highlighted image differences for image pairs with matching Huffman tables in the Flickr 1 Million dataset. Images 985964.jpg and 986229.jpg are represented on the left, 431419.jpg and 431931.jpg on the right.

This result shows that optimised Huffman tables possess a great deal of discriminating power, with only two pairs of nearly identical images possessing the same Huffman fingerprint in a dataset of 1 million images. Indeed, this may be seen as a positive property, as the fingerprint can be tolerant to slight changes within the image.

5.2 Huffman Distinctness in Govdocs

Two versions of the Govdocs dataset were used: i) the unaltered original dataset, and ii) a version where all images have had their Huffman tables optimised, and were converted to baseline JPEGs, using jpegtran. The former is used to derive representative statistics for how common particular types of JPEG are in the wild, while the latter provides a secondary test dataset for optimised Huffman distinctness. A small number of JPEGs generated errors when optimising or extracting Huffman tables and other information, and those were omitted from this analysis.

Flickr 1 Million Equivalence Classes

    Size                              1 (Unique)      2     3     4     5
    No. Images (Huffman)                  999250    726    18     4     5
    No. Images (SHA256 Duplicates)           N/A    722    15     4     5
    No. Images (Huffman, No Dupes)        999250      4     0     0     0

Table 1: The number of images belonging to equivalence classes of each size for the Flickr 1 Million dataset. A class size of n indicates that there are n images with the same fingerprint.

    Colour Space    YCbCr    YCCK    CMYK    RGB    Greyscale
    No. Images      95783       3       0      9        13443

Table 2: The number of images for each colour space option in the Govdocs corpus.

The unaltered Govdocs corpus is heterogeneous, with a mix of JPEG modes, colour spaces and origin software. Of approximately 109,000 images, only 6,809 JPEGs (6.2%) use the progressive format, with the remainder using the baseline JPEG format. 37,879 (34.7%) baseline JPEGs use default Huffman tables, and, as such, produce the same fingerprint when extracted. 39,035 images (35.7%) contained Adobe application markers, which may either use optimised or pre-defined Huffman tables. The predominant colour space is overwhelmingly YCbCr, as shown in Table 2.

Prior to optimisation, 39,328 (36%) of all Govdocs images have a unique Huffman fingerprint, rising to 55.1% when excluding images with default tables. The remaining images are primarily grouped into very large classes, with 23,706 images belonging to groups of equivalent Huffman tables with 1000 or more members, the largest class containing over 10,000 images. This indicates that, even in the presence of many JPEGs using pre-defined tables, Huffman analysis can be used as the sole method of identifying images more than a third of the time. That is, this corpus suggests that optimised Huffman tables are used as often as default tables; however, the authors argue that the trend is towards optimised JPEGs, and indeed the Govdocs corpus itself is relatively old. The software developed by Mozilla (Mozilla, 2017) and Google (Alakuijala et al., 2017) may be an indication that optimised images will appear more frequently on the Web, where page load times and data transfers are relatively expensive compared to other domains.

The optimised Govdocs dataset provided similar results to the Flickr 1 Million dataset (see Table 3), in that images with matching Huffman tables were either identical in the binary domain, or were small variations of the same image. When combining both datasets into one corpus of over 1.1 million images, no new equivalence classes were found. This confirms that Huffman tables are very distinct for optimised JPEGs, even across datasets.

5.3 Start of Scan Marker Offsets

Statistics for the position of the Start of Scan marker are depicted in Table 4 for all datasets, with a visual representation for Flickr and unmodified Govdocs in Figure 5. As the SOS marker appears after the Huffman tables, the data shows that very few 4096 byte media blocks are required to read those Huffman tables. In the case of the Flickr dataset, a single block read suffices for 96.6% of the dataset, with three blocks being sufficient for 99.6% of images. The figures are slightly higher for the Govdocs corpus, which contains more metadata, where nine blocks are required to acquire 99% of Huffman tables. In

Page 11: Fingerprinting JPEGs With Optimised Huffman Tables/media/worktribe/output... · describes a technique for ngerprinting JPEGs with optimised Hu man tables which requires only the image

Optimised Govdocs Equivalence Classes

Class Size                           1 (Unique)      2      3      4      5
No. Images (Huffman Only)                108539    684      4      0      0
No. Images (Sha256 Duplicates)              N/A    676      0      0      0
No. Images (Huffman, No Dupes.)          108539      8      4      0      0

Table 3: The number of images belonging to equivalence classes of each size for the optimised Govdocs dataset. A class size of n indicates that there are n images with the same fingerprint.

Figure 5: Log-log distribution of Start of Scan offsets for Flickr 1 Million and unmodified Govdocs. Optimised Govdocs is almost identical to the original.

both cases, the distribution is long-tailed, with the majority of images requiring a single block.

Using mean values for marker offsets and file lengths, 1.6% of the file must be read on average to acquire the SOS marker in the Flickr 1 Million dataset, while both Govdocs datasets require 1.2% of the file to be processed. However, when considering that 4096 bytes may be the minimal transfer size on modern storage media, the figure for the Flickr dataset rises to 3.2%, while Govdocs remains all but unchanged.

Using the proposed fast fingerprinting method, only a small fraction of the file needs to be read, as opposed to the entire file for traditional hashing.
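A minimal sketch of such a partial read is given below: the header is consumed in 4096-byte blocks, collecting Define Huffman Table (DHT, marker 0xFFC4) payloads until the Start of Scan marker (0xFFDA) is reached. This is an illustrative assumption-laden sketch, not the paper's C++ implementation; it assumes a well-formed baseline JPEG with no stray fill bytes between segments, which production code would need to handle defensively.

```python
BLOCK = 4096  # assumed minimum transfer size on modern storage media

def read_huffman_tables(path, block=BLOCK):
    """Collect DHT payloads from a JPEG, reading as few blocks as possible.

    Returns (payloads, blocks_read), stopping as soon as the Start of
    Scan marker is seen, so the entropy-coded image data is never read.
    """
    with open(path, "rb") as f:
        buf = f.read(block)
        blocks_read = 1
        if buf[:2] != b"\xff\xd8":                  # SOI marker
            raise ValueError("not a JPEG")
        payloads, pos = [], 2
        while True:
            # ensure the 2-byte marker and 2-byte length field are buffered
            while len(buf) < pos + 4:
                more = f.read(block)
                if not more:
                    raise ValueError("truncated JPEG header")
                buf += more
                blocks_read += 1
            if buf[pos] != 0xFF:
                raise ValueError("expected a marker")
            marker = buf[pos + 1]
            if marker == 0xDA:                      # SOS: all tables seen
                return payloads, blocks_read
            seg_len = int.from_bytes(buf[pos + 2:pos + 4], "big")
            while len(buf) < pos + 2 + seg_len:     # buffer the whole segment
                more = f.read(block)
                if not more:
                    raise ValueError("truncated segment")
                buf += more
                blocks_read += 1
            if marker == 0xC4:                      # DHT segment payload
                payloads.append(buf[pos + 4:pos + 2 + seg_len])
            pos += 2 + seg_len
```

The `blocks_read` count corresponds directly to the block statistics discussed above: for most Flickr images it would be 1.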

Figure 6: Distributions for the number of DCT codes in each dataset. Flickr 1 Million at the top, Optimised Govdocs in the middle, and Govdocs at the bottom.

5.3.1 Number of Codes and Table Lengths

The maximum number of DCT byte codes possible in the baseline JPEG format is 348 (12 per DC table, 162 per AC table). However, the maximum number of codes observed for an optimised JPEG in this work was 277, suggesting that the number of codes may be used as a heuristic to distinguish between optimised and pre-defined tables. Note, though, that not all pre-defined tables use all codes, as can be seen in Figure 6. The spikes in the bottom graph of Figure 6, at 174 and 249 codes, are caused by images produced by Adobe Photoshop's 'save for web' settings, which optimise


Dataset               Percentile (B)                            Mean (B)
                      50      75      95      99      99.9
Flickr 1 Million      973     3560    3599    9060    28866     2054
Govdocs               623     4181    23926   36128   51658     4205
Govdocs Optimised     417     3972    23863   36205   51426     4080

Table 4: Start of Scan offsets in bytes for all datasets.

entropy encoding using alternative mechanisms. However, based on this data, images with fewer than 300 codes are very likely to be optimised JPEGs.
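This heuristic can be computed directly from the DHT payloads (the bytes following each segment's length field). Per the JPEG specification, each table within a payload is laid out as one table class/destination byte, a 16-byte vector of code counts per code length (the BITS vector), and then that many code values. The function names and the use of the 300-code threshold as a default are illustrative choices, not from the paper:

```python
def count_codes(dht_payloads):
    """Sum the Huffman codes defined across all DHT payloads.

    Each table is: 1 class/destination byte, 16 bytes giving the number
    of codes of each length, then that many code values. Baseline JPEG
    allows at most 348 codes in total across the four tables.
    """
    total = 0
    for payload in dht_payloads:
        pos = 0
        while pos < len(payload):
            n = sum(payload[pos + 1:pos + 17])  # BITS vector
            total += n
            pos += 17 + n                       # skip to the next table
    return total

def likely_optimised(dht_payloads, threshold=300):
    """Empirical heuristic from this data: < 300 codes suggests optimised."""
    return count_codes(dht_payloads) < threshold
```

Since the default tables defined in the JPEG standard use the full complement of codes, this check is cheap enough to run before any fingerprint comparison.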

Ignoring Huffman table markers, the length of the Huffman table data may be calculated by summing the number of entries in the value and length vectors for each table. The maximum possible number of codes is 376 for baseline JPEGs, plus 16 bytes of marker and metadata for each of the four Huffman tables, for a total of 440 bytes. The maximum length found for an optimised JPEG in this work is 300 bytes, with a mean of 191 bytes for the Flickr dataset.

Using this observation, it is possible to identify, with a high degree of certainty, some images which use unoptimised tables. Additionally, this table length information indicates the number of bytes which must be stored for JPEG Huffman fingerprints. To reduce storage, these fingerprints themselves could be hashed using a cryptographic hashing mechanism with fixed-length digests.
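For instance, a fixed 32-byte record per image could be obtained by hashing the concatenated table bytes. The sketch below uses SHA-256 as one possible digest choice (the paper does not prescribe a specific mechanism):

```python
import hashlib

def fingerprint_digest(dht_payloads):
    """Reduce a variable-length Huffman fingerprint to a fixed 32 bytes.

    Hashing the concatenated DHT payloads yields a constant-size key
    suitable for database storage and lookup, at the cost of discarding
    the raw table bytes themselves.
    """
    h = hashlib.sha256()
    for payload in dht_payloads:
        h.update(payload)
    return h.digest()
```

Note that plain concatenation ignores table boundaries; if that ambiguity matters, each payload's length could be hashed along with its bytes.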

5.4 Benchmark Results

To quantify the potential speed improvements over traditional cryptographic hashing, several benchmarks were run on a workstation (i5-6490k, 16GiB DDR3 RAM, Western Digital Red 4TB HDD, Crucial MX300 525GB SSD) and laptop (i7-5500U, 8GiB DDR3 RAM, Samsung 840 EVO 500GB SSD), with several multi-threading options. Times represent the total extraction time for all files, with no database lookup included. Testing was limited to the Flickr 1 Million dataset, as this is both the largest dataset and the worst case performance scenario (with the mean file size being 1/3 that of Govdocs,

such that the header is a larger proportion of the file). Benchmarks were carried out on Ubuntu 15.04 64-bit, with memory caches being cleared between runs. A C++ application was compiled with g++ using Boost 1.55 for thread pools, libjpeg62 for JPEG parsing, and OpenSSL for cryptographic hashing (SHA256).

The proposed method saw no improvement when images were stored on the HDD. This can be attributed to the relatively small file sizes of this dataset but, more importantly, to the poor small block read performance of the mechanical media.

However, substantial performance gains were seen when utilising solid state media, which are better suited to random access patterns. Figure 7 shows results for Huffman fingerprint extraction, full cryptographic file hashing, and reading the first 4096 bytes of the file with no further processing. Results are compared across both the EXT4 and NTFS file systems. While both file systems scale well to two threads, NTFS performance appears to plateau at this stage, while EXT4 performance improves with the number of threads. Overheads were explored by obtaining the logical block addresses (LBA) for each file and running the experiment with physical addresses, rather than looking up each file in the file system. When the initial pre-processing was not included in the recorded time, benchmark times using the LBA addresses performed nearly identically to those of EXT4. The relatively poor performance of NTFS could be attributed to overhead within the file system, which does not scale well with many concurrent accesses. This observation was verified using the Windows operating system to rule out the Linux ntfs-3g driver as a bottleneck. Thus, EXT4 and LBA-based addressing are preferable when performing this kind of fractional file access.

Figure 7: Comparison of the relative performance of Huffman and hash fingerprint extraction to reading the first 4K file block, across the EXT4 and NTFS file systems. EXT4 performance is close to using raw LBA block addresses.

Figure 8: Comparison of the relative performance of Huffman and hash fingerprint extraction to reading the first 4K file block, across two computers. The laptop SSD possesses better random 4K read performance and higher IOPS.

When comparing Huffman fingerprint extraction to full file hashing on the workstation, a speed increase of 2.5× was recorded on NTFS, and up to 5.7× on EXT4. In both cases Huffman extraction performance mirrored 4KiB file read performance very closely, typically with less than 10% overhead. This suggests that Huffman fingerprint extraction is close to the theoretical limits of storage media access for small file fractions.

When a storage device with higher random 4KiB read performance is used, the relative performance of Huffman extraction to full file hashing also improves. This is depicted in Figure 8, which compares the same methods using the benchmark workstation and laptop machines, which each possess different SSDs. The laptop SSD has higher small block throughput at low queue depths, which is of benefit

to both Huffman and hashing methods, despite the less performant CPU. In this case, Huffman extraction on the NTFS file system is 3× faster than full file hashing.

Partial file access performance with HDDs is dominated by seek time, with transfer time being less of a factor. However, as file sizes increase, the relative performance would be expected to improve (McKeown et al., 2017), both for mechanical media and solid state devices. SSDs have much smaller effective seek times, and thus the proposed technique holds significantly more appeal for flash media. Flash media are becoming increasingly common in modern systems, particularly in the laptop arena, in addition to already dominating the mobile device market, and may well be the dominant storage technology for personal computers in the near future. The observation that partial file access scales well on this type of media opens the door for flash-storage-optimised approaches to digital forensics.


6. CONCLUSIONS AND FUTURE WORK

This paper has explored the potential to use optimised Huffman tables to identify particular JPEGs in a collection, with tables being essentially unique across 1.1 million JPEG images with optimised Huffman tables. The Huffman fingerprint was shown to group very similar images together, while also being inexpensive to extract, resulting in speed increases of up to 5.7× on solid state media, despite the relatively small file sizes of the test dataset. On datasets of higher resolution images, this technique may be expected to perform well over an order of magnitude faster than file hashing, in line with prior benchmarks on similar fingerprint generation from PNG files (McKeown et al., 2017). This method also has the benefit of being usable for partially carved files, where only the header exists, potentially providing strong supplementary evidence in the absence of direct evidence.

The limitations are that this approach relies on Huffman tables being optimised ahead of time, and therefore excludes many camera images without initial pre-processing. However, the recent development of software to optimise JPEGs, in lieu of adopting a new compression standard, is promising and may indicate wider adoption of optimisation in the future. Additionally, it may be possible to exploit similar content-derived features in future generations of image codecs, or for other file formats, such as compressed video.

Future work should explore the possibility of using the same technique on progressive JPEGs and other compressed media types. Additionally, leveraging work in the field of Content-Based Image Retrieval, Huffman fingerprints may be used to detect similar images in a digital forensics context. Fractional file processing may be explored in other scenarios, such as de-duplication, or processing remote/networked devices, which have different performance characteristics that may facilitate greater performance improvements. Finally, flash-optimised digital forensics techniques may be explored, which may be of great benefit to the field in the future.

7. ACKNOWLEDGEMENTS

This research was supported by a scholarship provided by Peter KK Lee.

REFERENCES

Alakuijala, J., Obryk, R., Stoliarchuk, O., Szabadka, Z., Vandevenne, L., & Wassenberg, J. (2017). Guetzli: Perceptually Guided JPEG Encoder. arXiv preprint arXiv:1703.04421. Retrieved 2017-05-31, from https://arxiv.org/abs/1703.04421

Beebe, N., & Clark, J. (2005). Dealing with Terabyte Data Sets in Digital Investigations. In M. Pollitt & S. Shenoi (Eds.), Advances in Digital Forensics (Vol. 194, pp. 3–16). Springer US. Retrieved from http://dx.doi.org/10.1007/0-387-31163-7_1

Breitinger, F., Liu, H., Winter, C., Baier, H., Rybalchenko, A., & Steinebach, M. (2013). Towards a process model for hash functions in digital forensics. In International Conference on Digital Forensics and Cyber Crime (pp. 170–186). Springer.

Cryer, J. (2016). Resemble.js: Image analysis and comparison. Retrieved 2017-02-21, from https://github.com/Huddle/Resemble.js

Edmundson, D., & Schaefer, G. (2012). Fast JPEG image retrieval using optimised Huffman tables. In Pattern Recognition (ICPR), 2012 21st International Conference on (pp. 3188–3191). IEEE. Retrieved 2016-03-16, from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6460842

Edmundson, D., & Schaefer, G. (2013, November). Very Fast Image Retrieval Based on JPEG Huffman Tables. In 2013 2nd IAPR Asian Conference on Pattern Recognition (ACPR) (pp. 29–33). doi: 10.1109/ACPR.2013.18

Farid, H. (2006). Digital image ballistics from JPEG quantization (Tech. Rep. TR2006-583). Department of Computer Science, Dartmouth College.

Farid, H. (2008). Digital image ballistics from JPEG quantization: A follow-up study (Tech. Rep. TR2008-638). Department of Computer Science, Dartmouth College. Retrieved 2016-05-06, from http://www.cs.dartmouth.edu/farid/downloads/publications/tr08.pdf

Garfinkel, S., Farrell, P., Roussev, V., & Dinolt, G. (2009, September). Bringing science to digital forensics with standardized forensic corpora. Digital Investigation, 6, S2–S11. doi: 10.1016/j.diin.2009.06.016

Garfinkel, S., Nelson, A., White, D., & Roussev, V. (2010, August). Using purpose-built functions and block hashes to enable small block and sub-file forensics. Digital Investigation, 7, S13–S23. doi: 10.1016/j.diin.2010.05.003

Gloe, T. (2012). Forensic analysis of ordered data structures on the example of JPEG files. In Information Forensics and Security (WIFS), 2012 IEEE International Workshop on (pp. 139–144). IEEE. Retrieved 2016-04-29, from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6412639

Huiskes, M. J., Thomee, B., & Lew, M. S. (2010). New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In Proceedings of the International Conference on Multimedia Information Retrieval (pp. 527–536). ACM. Retrieved 2017-02-22, from http://dl.acm.org/citation.cfm?id=1743475

Independent JPEG Group. (2016). Libjpeg. Retrieved 2017-02-20, from http://www.ijg.org/

Kee, E., Johnson, M. K., & Farid, H. (2011). Digital image authentication from JPEG headers. Information Forensics and Security, IEEE Transactions on, 6(3), 1066–1075. Retrieved 2016-04-05, from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5732683

Kornblum, J. (2006, September). Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, 3, Supplement, 91–97. doi: 10.1016/j.diin.2006.06.015

Kornblum, J. D. (2008, September). Using JPEG quantization tables to identify imagery processed by software. Digital Investigation, 5, S21–S25. doi: 10.1016/j.diin.2008.05.004

Mahdian, B., Saic, S., & Nedbal, R. (2010). JPEG quantization tables forensics: a statistical approach. In Computational Forensics (pp. 150–159). Springer.

McKeown, S., Russell, G., & Leimich, P. (2017). Fast Filtering of Known PNG Files Using Early File Features. In Annual ADFSL Conference on Digital Forensics, Security and Law. Daytona Beach, Florida, USA.

Mozilla. (2017). Mozjpeg: Improved JPEG encoder. Retrieved 2017-02-20, from https://github.com/mozilla/mozjpeg

Piva, A. (2013). An Overview on Image Forensics. ISRN Signal Processing, 2013, 1–22. doi: 10.1155/2013/496701

Quick, D., & Choo, K.-K. R. (2014, December). Impacts of increasing volume of digital forensic data: A survey and future research challenges. Digital Investigation, 11(4), 273–294. doi: 10.1016/j.diin.2014.09.002

Roussev, V. (2010). Data fingerprinting with similarity digests. In Advances in Digital Forensics VI (pp. 207–226). Springer. Retrieved 2016-03-04, from http://link.springer.com/chapter/10.1007/978-3-642-15506-2_15

Schaefer, G., Edmundson, D., & Sakurai, Y. (2013, December). Fast JPEG Image Retrieval Based on AC Huffman Tables. In (pp. 26–30). IEEE. doi: 10.1109/SITIS.2013.16

Schaefer, G., Edmundson, D., Takada, K., Tsuruta, S., & Sakurai, Y. (2012, November). Effective and Efficient Filtering of Retrieved Images Based on JPEG Header Information. In (pp. 644–649). IEEE. doi: 10.1109/SITIS.2012.97

Wallace, G. K. (1992). The JPEG still picture compression standard. Consumer Electronics, IEEE Transactions on, 38(1), xviii–xxxiv. Retrieved 2016-03-11, from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=125072