Imagine handing a complex set of construction plans to your building contractor, fresh off your fancy $12,000 Xerox WorkCentre scanner/copier/printer, only to find errors in the blueprints that weren't in the original. Or imagine passing the CFO of your company a set of budget figures that weren't the ones you fed into your WorkCentre, one of Xerox's midrange color multifunction machines that, like any scanner or photocopier, is expected to reproduce a picture of the document you put in. As anyone who uses one of these devices in an office, school or home knows, the camera doesn't lie, whether it is photographing a document or digitally scanning it. Or does it?
That's why German computer researcher David Kriesel was so confused when he recently retrieved scanned construction plans from his WorkCentre and found that the reproduction did not match the original he had put in, and differed in a strange way: perfectly legible digits on the original had turned into different ones in the copy. These were not just fuzzy numbers that could be a 5 or a 6, but a wholesale substitution of one number for another. Somewhere in the machine, the document seemed to enter a parallel universe and emerge as a non-copy of the original. His machine wasn't using optical character recognition (OCR) to read the document; it was supposed to be just taking a straight picture and reproducing the plans.
According to Kriesel's blogged account, he was intrigued and ran a set of cost figures from another document through the scanner, where new figures appeared in place of old ones in the same way. After testing additional documents and finding more swapped numbers, Kriesel theorized (correctly, as it turns out) that the culprit in this digit swapping is an image compression technique the WorkCentres use, called JBIG2, which creates its own library of image patches to make up for unclear data. Essentially an autocorrect with pictures instead of words, JBIG2 looks for a best-effort match, within a particular range of error, for image patches it can't fully process, in this case using other numbers from different parts of the document.
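To see how a patch library can swap one digit for another, consider the toy sketch below. It is not Xerox's firmware or a real JBIG2 encoder: the tiny 3x5 "digits", the function names and the `max_diff` tolerance are all invented for illustration, and real encoders segment and compare patches in far more sophisticated ways. The only idea it borrows from the description above is that a patch judged close enough to one already in the library is stored as a reference to that earlier patch.

```python
# Toy illustration of patch-library substitution. All names and the 3x5
# glyph bitmaps are hypothetical; this is not a real JBIG2 implementation.

SIX = ("###",
       "#..",
       "###",
       "#.#",
       "###")

EIGHT = ("###",
         "#.#",
         "###",
         "#.#",
         "###")

ONE = (".#.",
       "##.",
       ".#.",
       ".#.",
       "###")


def pixel_diff(a, b):
    """Count the pixels at which two equally sized patches differ."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def encode(patches, max_diff):
    """Build a patch library and store each patch as an index into it.

    A patch within max_diff pixels of a library entry reuses that entry
    instead of being stored itself. That reuse is where the loss occurs.
    """
    library, indices = [], []
    for patch in patches:
        for i, known in enumerate(library):
            if pixel_diff(patch, known) <= max_diff:
                indices.append(i)  # "close enough": reuse the earlier patch
                break
        else:
            library.append(patch)  # no match: add a new library entry
            indices.append(len(library) - 1)
    return library, indices


def decode(library, indices):
    """Rebuild the page by pasting library patches back in order."""
    return [library[i] for i in indices]


page = [SIX, EIGHT, ONE]

# With zero tolerance every distinct patch is kept, so the copy is exact.
exact = decode(*encode(page, max_diff=0))

# With a one-pixel tolerance the 8 is judged "close enough" to the 6
# seen earlier, so the decoded page shows a 6 where the 8 used to be.
lossy = decode(*encode(page, max_diff=1))

for row in zip(*lossy):
    print("  ".join(row))
```

In this sketch the 6 and the 8 differ by a single pixel, so a tolerance of one pixel is enough to make the second digit come out as a crisp, perfectly legible 6. The copy looks clean, which is exactly why this kind of error is so hard to spot.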