Page 35 - T-I JOURNAL19-3
REINING IN ONLINE ABUSES 595
Figure 1. The MD5 hash of this image is 78ba217bccd6e6b4d032e54213006928.
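The signature quoted in the Figure 1 caption is an MD5 digest written as 32 hexadecimal characters. As a minimal sketch, assuming Python's standard hashlib module and placeholder bytes rather than a real image file, the following shows how such a digest might be computed and how altering a single byte yields a different signature. The helper name md5_signature and the sample data are illustrative assumptions, not part of the original work.

```python
import hashlib


def md5_signature(data: bytes) -> str:
    """Return the MD5 digest of raw file bytes as 32 hex characters."""
    return hashlib.md5(data).hexdigest()


# Stand-in for the bytes of an image file (hypothetical data, not a real image).
original = bytes(range(256)) * 4

# Simulate a tiny modification: flip one bit in the first byte.
modified = bytearray(original)
modified[0] ^= 0x01
modified = bytes(modified)

sig_original = md5_signature(original)
sig_modified = md5_signature(modified)

print(sig_original)
print(sig_modified)
print("match:", sig_original == sig_modified)
```

In practice the bytes would be read from the image file itself (for example, with open(path, "rb")), so any re-encoding of the file changes the input to the hash.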
be a first step to disrupting the global distribution of CP.

In collaboration with NCMEC and researchers at Microsoft, we set out to develop technology that could quickly and reliably identify images from the NCMEC database of known CP images. At first glance, this may seem like an easy problem to solve. Hard-hashing algorithms such as MD5 or SHA-1 (2,3) can be used to extract from an image a unique compact alphanumeric signature (Figure 1). This signature can then be compared against all uploads to an online service like Facebook or Twitter. In practice, however, this type of hard hash would not work because most online services automatically modify all uploaded images. Facebook, for example, resizes, recompresses, and strips metadata from every image. The result of these and similar modifications is that, although the original and modified images are perceptually similar, the signature (hash) is completely different. The reason is that hard hashing is designed to yield distinct signatures in light of any modification to the underlying image. Hard hashing, therefore, is ineffective at matching images that are modified in any way at the time of upload.

At a conceptual level, however, hashing has many desirable properties: A signature is computationally efficient to extract; the signature is unique and compact; and hashing completely sidesteps the difficult task of content-based image analysis that would be needed to recognize the presence of a person, determine the person’s age, and recognize the difficult-to-define concept of sexually explicit. Building on the basic framework of hard hashing, we sought to develop a robust hashing algorithm that generates a compact and distinct signature that is stable to simple modifications to an image, such as re-compression, resizing, color changes, and annotated text.

Although I will not go into too much detail on the algorithmic specifics, I will provide a broad overview of the robust hashing algorithm, named PhotoDNA, that we developed (see also (4,5)). Shown in Figure 2 is an overview of the basic steps involved in extracting a robust hash. First, a full-resolution color image is converted to grayscale and downsized to a lower and fixed resolution of 400 × 400 pixels. This step reduces the processing complexity in subsequent steps, makes the robust hash invariant to image resolution, and eliminates high-frequency differences that

