Fuzzy Hashing :Fuzzy Hashing Jesse Kornblum
Overview :Overview A Reasonable Scenario
MD5 Explained
Where MD5 falls down
Similarity
Fuzzy Hashing
Examples
Questions
A Reasonable Scenario :A Reasonable Scenario Disgruntled employee leaves BigCompany
Joins new startup in the same field
New company introduces similar product
Using detailed information from Big Company
One particular document
Not publicly available
BigCompany sues for trade secrets violation
New company has eight employees
A Reasonable Scenario :A Reasonable Scenario 100GB per employee hard drive, plus 200GB on server
1TB of data total
536,000 reams of paper
About 29 semi-trucks Photo courtesy sumsinnow via Flickr
MD5 Explained :MD5 Explained How MD5 (roughly) works:
Start with an initial state
Look at fixed size block of input
Do mathy stuff with current state and block
Get new state
Advance to next block of input
Repeat steps 2 and 3 until out of input blocks
Ending state is the hash
MD5 Explained :MD5 Explained If you change one bit in the middle, you change the next state
Which ends up changing the end result
Is this a good thing or a bad thing? ?
Similarity :Similarity Different levels of similarity
Identical
Ones and zeros
Displays the same
Behaves the same
Thematically similar
Not similar
Piecewise Hashing :Piecewise Hashing Developed for integrity during imaging
Divide input into fixed sized sections and hash separately
Insert or delete changes all subsequent hashes 3b152e0baa367a8038373f6df 40c39f174a8756a2c266849b fdb05977978a8bc69ecc46ec
Rolling Hash :Rolling Hash It would be nice to set boundaries such that
Insertions and deletions are contained within a block
Disclaimer :Disclaimer I didn’t invent this math
Originally Dr. Andrew Tridgell
Samba
rsync was part of his thesis
Modified slightly for spamsum
Spam detector in his “junk code” folder
User report that rsync confuses similar Word documents
Rolling Hash :Rolling Hash A different kind of hash function
Produces a pseudorandom output for every position in a file
Depends only on last few bytes
Lots of academic work on these
Just mathy tricks
F o u r s c o r e -> 83,742,221
F o u r s c o r e -> 5
F o u r s c o r e -> 90,281
Rolling Hash :Rolling Hash To update state (c,x,y,z,window) for a byte d:
y = y − x
y = y + size * d
x = x + d
x = x − window[c mod size]
window[c mod size] = d
c = c + 1
z = z << 5
z = z XOR d
return (x + y + z)
Rolling Hash :Rolling Hash We use the rolling hash to generate block boundaries
Select some values as trigger points
When we hit a trigger point, end the block
Example
Excerpt from "The Raven" by Edgar Allan Poe
Triggers on ood and ore
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood there, wondering, fearing
Doubting, dreaming dreams no mortals ever dared to dream before;
But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,
Lenore?, This I whispered, and an echo murmured back the word,
"Lenore!" Merely this, and nothing more
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood there, wondering, fearing
Doubting, dreaming dreams no mortals ever dared to dream before;
But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,
Lenore?, This I whispered, and an echo murmured back the word,
"Lenore!" Merely this, and nothing more
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood
there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before
; But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,Lenore
?, This I whispered, and an echo murmured back the word,"Lenore
!" Merely this, and nothing more
Rolling Hash :Rolling Hash How do we choose the triggers?
Chosen randomly, before reading the file
Based on the size of the input file
Really just a set of numbers
Has nothing to do with type of input data
Fuzzy Hashing :Fuzzy Hashing Combine Rolling Hash with a Traditional Hash
Use Fowler/Noll/Vo (FNV) hash
That’s what Tridgell did
Faster and less complex than MD5
We’re only using a small part of the result
Start reading file, compute Rolling and Traditional Hashes
When Rolling Hash triggers
Record LSB of Traditional Hash value
When finished, combine LSBs to make signature
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood
there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before
; But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,Lenore
?, This I whispered, and an echo murmured back the word,"Lenore
!" Merely this, and nothing more
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood
there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before
; But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,Lenore
?, This I whispered, and an echo murmured back the word,"Lenore
!" Merely this, and nothing more 28163 491522 57 145410213 738210
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood
there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before
; But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,Lenore
?, This I whispered, and an echo murmured back the word,"Lenore
!" Merely this, and nothing more 28163 491522 57 145410213 738210
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood
there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before
; But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,Lenore
?, This I whispered, and an echo murmured back the word,"Lenore
!" Merely this, and nothing more
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood
there, wondering, fearing I AM THE LIZARD KING! Doubting, dreaming dreams no mortals ever dared to dream before
; But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,Lenore
?, This I whispered, and an echo murmured back the word,"Lenore
!" Merely this, and nothing more
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood
there, wondering, fearing I AM THE LIZARD KING! Doubting, dreaming dreams no mortals ever dared to dream before
; But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,Lenore
?, This I whispered, and an echo murmured back the word,"Lenore
!" Merely this, and nothing more 28163 82910 57 145410213 738210
Rolling Hash :Rolling Hash Deep into the darkness peering, long I stood
there, wondering, fearing I AM THE LIZARD KING! Doubting, dreaming dreams no mortals ever dared to dream before
; But the silence was unbroken, and the stillness gave no token,
And the only word there spoken was the whispered word,Lenore
?, This I whispered, and an echo murmured back the word,"Lenore
!" Merely this, and nothing more 28163 82910 57 145410213 738210
Matching :Matching Signature 1: 3 2 7 3 0
Signature 2: 3 0 7 3 0
Edit Distance
Number of insertions, modifications and deletions to turn Signature 1 into Signature 2.
For the example above, the edit distance is one.
Signatures (and thus files) match when the ratio of the edit distance to the length is small
Demonstration :Demonstration WARNING:
EXPLICIT IMAGERY LAW ENFORCEMENT SENSITIVE
DO NOT DUPLICATE
Demonstration :Demonstration LAW ENFORCEMENT SENSITIVE
DO NOT DUPLICATE
Demonstration :Demonstration Known kitty porn MATCH Corrupted File
Demonstration :Demonstration Known kitty porn No match Different File
Demonstration :Demonstration Known kitty porn MATCH File Header
Demonstration :Demonstration Known kitty porn MATCH File Footer
Demonstration :Demonstration Known kitty porn MATCH File Footer (attached to header)
Issues :Issues Does not work for similar looking graphics
Unable to handle cropping, resizing, and other edits
Confused by many small changes throughout input
Computationally intensive
7-10 times slower than MD5
No way to sort signatures
Must compare each input to all known signatures
Fuzzy Hashing :Fuzzy Hashing Matches similar but not identical bitstreams
Great for corrupted or partial documents
Also great for source code reuse
Free Software
http://ssdeep.sf.net/
Windows, GUI, *nix, and OS X
Paper in Digital Investigation
http://dfrws.org/2006/proceedings/12-Kornblum.pdf
Future Research :Future Research "That algorithm is our last hope."
"No, there is another."
Questions :Questions Jesse Kornblum
jesse.kornblum@mantech.com