Page 35 - T-I JOURNAL19-3

REINING IN ONLINE ABUSES                          595

                 Figure 1. The MD5 hash of this image is 78ba217bccd6e6b4d032e54213006928.
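As a minimal illustration of the hard hashing behind the caption above, the following Python sketch computes MD5 signatures with the standard hashlib module and shows how a one-bit change to the input yields an unrelated signature. The byte string is a hypothetical stand-in for an image file, and md5_signature is an illustrative helper name, not part of any tool described in this article.

```python
import hashlib


def md5_signature(data: bytes) -> str:
    """Return the 32-character hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for the raw bytes of an image file (not a real image).
original = bytes(range(256)) * 16

# Simulate a minimal modification: flip the lowest bit of a single byte.
tampered = bytearray(original)
tampered[100] ^= 0x01
modified = bytes(tampered)

sig_original = md5_signature(original)
sig_modified = md5_signature(modified)

# Count the hex positions at which the two signatures happen to agree.
matching = sum(a == b for a, b in zip(sig_original, sig_modified))

print(sig_original)
print(sig_modified)
print(f"{matching} of 32 hex characters match")
```

Even though the two inputs differ by a single bit, the printed signatures should agree only at a handful of positions by chance, which is why an exact-match comparison of hard hashes cannot tolerate even trivial changes to a file.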


be a first step to disrupting the global distribution of CP.

In collaboration with NCMEC and researchers at Microsoft, we set out to develop technology that could quickly and reliably identify images from the NCMEC database of known CP images. At first glance, this may seem like an easy problem to solve. Hard-hashing algorithms such as MD5 or SHA-1 (2,3) can be used to extract from an image a unique compact alphanumeric signature (Figure 1). This signature can then be compared against all uploads to an online service like Facebook or Twitter. In practice, however, this type of hard hash would not work because most online services automatically modify all uploaded images. Facebook, for example, resizes, recompresses, and strips metadata from every image. The result of these and similar modifications is that, although the original and modified images are perceptually similar, the signature (hash) is completely different. The reason is that hard hashing is designed to yield distinct signatures in light of any modification to the underlying image. Hard hashing, therefore, is ineffective at matching images that are modified in any way at the time of upload.

At a conceptual level, however, hashing has many desirable properties: A signature is computationally efficient to extract; the signature is unique and compact; and hashing completely sidesteps the difficult task of content-based image analysis that would be needed to recognize the presence of a person, determine the person’s age, and recognize the difficult-to-define concept of sexually explicit. Building on the basic framework of hard hashing, we sought to develop a robust hashing algorithm that generates a compact and distinct signature that is stable to simple modifications to an image, such as re-compression, resizing, color changes, and annotated text.

Although I will not go into too much detail on the algorithmic specifics, I will provide a broad overview of the robust hashing algorithm, named PhotoDNA, that we developed (see also (4,5)). Shown in Figure 2 is an overview of the basic steps involved in extracting a robust hash. First, a full-resolution color image is converted to grayscale and downsized to a lower and fixed resolution of 400 × 400 pixels. This step reduces the processing complexity in subsequent steps, makes the robust hash invariant to image resolution, and eliminates high-frequency differences that