Sunday, 18 May 2014

Stopping PDF encryption hurting your productivity

Here's how to get around copy and paste issues with text from a protected PDF. No programming talent required (which is good, as it's not a talent I have...). Sharing this knowledge may put me in contravention of some anti-consumer law somewhere; I don't know, it's a weird world.

Scenario: my partner was working on updating a report, reusing and updating material from an earlier report produced by a colleague. This earlier report was only available as a PDF and it turned out this PDF had protection enabled to prevent copy and paste. Now, I could either let her manually type out the material (amounting to quite a few pages of text) or I could find a way around it.

Simple it turns out to be...

In summary: convert the PDF to an image and feed it into an OCR engine. With care on the PDF to image conversion, and presuming a decent font/typography in the original PDF, the resulting OCR output should be close to 100% accurate.

There are a bunch of ways to achieve this; what follows is the process I used (which should be more or less applicable anywhere, but was done with Linux on my partner's XPS 15). You should be able to more or less cut and paste the following commands to achieve a similar outcome on your own machine, once you've got the right software installed (Tesseract and Ghostscript).

First convert the PDF to an image (the TIFF format was used here as the PDF was multipage):

gs -o /path/to/outputfile.tif -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw /path/to/inputfile.pdf
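
A note on the numbers in that command: -r is the resolution in dots per inch and -g is the page size in pixels, so the two need to agree. 6120x7920 at 720 dpi works out to 8.5 x 11 inches, i.e. US Letter. If your PDF is a different size, something like the following rough sketch will work out the -g value for you. The function name page_pixels is made up for illustration; page sizes are given in PDF points (72 to the inch), and the A4 figures are rounded to whole points.

```shell
# page_pixels DPI WIDTH_PT HEIGHT_PT
# Prints WIDTHxHEIGHT in pixels, ready to paste after -g.
# (Hypothetical helper, not part of Ghostscript.)
page_pixels() {
    dpi=$1
    width_pt=$2
    height_pt=$3
    # 72 points per inch, so pixels = dpi * points / 72
    echo "$(( dpi * width_pt / 72 ))x$(( dpi * height_pt / 72 ))"
}

page_pixels 720 612 792   # US Letter: prints 6120x7920
page_pixels 720 595 842   # A4 (rounded): prints 5950x8420
```

Dropping the resolution (say to 300) gives much smaller files and may still OCR fine, but that's something to try on your own document rather than a promise.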

Then run OCR software to convert the image to text (tesseract adds the .txt extension to the output name itself):

tesseract /path/to/outputfile.tif /path/to/ocroutputfile
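
If you expect to do this more than once, the two commands can be wrapped up into a single shell function. This is only a sketch under a few assumptions: the name pdf_to_text is invented here, the page size is hard-coded to the same Letter-at-720-dpi values used above, and the checks at the top are just there to fail with a readable message if the input or either tool is missing.

```shell
# pdf_to_text INPUT.pdf OUTPUT_BASE
# Hypothetical helper: writes OUTPUT_BASE.tif and OUTPUT_BASE.txt.
pdf_to_text() {
    input=$1
    base=$2

    # Fail early with a readable message rather than halfway through.
    if [ ! -r "$input" ]; then
        echo "pdf_to_text: cannot read '$input'" >&2
        return 1
    fi
    for tool in gs tesseract; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "pdf_to_text: '$tool' is not installed" >&2
            return 1
        fi
    done

    # Step 1: render every page into one multipage greyscale TIFF.
    gs -o "$base.tif" -sDEVICE=tiffgray -r720x720 -g6120x7920 \
       -sCompression=lzw "$input" || return 1

    # Step 2: OCR the TIFF; tesseract appends .txt to the output name.
    tesseract "$base.tif" "$base"
}

# Example use (paths are placeholders):
#   pdf_to_text /path/to/inputfile.pdf /path/to/ocroutputfile
```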

Done. One text file, simply and cleanly formatted, containing all the text from the protected PDF. All DRM can be defeated this easily - if you can see it, it can be copied.

NB: If, in another scenario, you needed the final output to be a closer copy of the original PDF, you'd need to involve OCR layout functionality, but much of that's doable with Tesseract by changing the output format to hOCR. And then, if you really wanted to, you could use hocr2pdf to recreate the PDF...
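
For the curious, the hOCR route might look roughly like the sketch below. Treat it as untested guidance rather than a recipe: rebuild_pdf is a made-up name, hocr2pdf comes from the ExactImage package, and as far as I know it works on one image at a time, so a multipage TIFF would probably need splitting into pages first. The hOCR file's extension also depends on your tesseract version (.hocr on newer releases, .html on older ones), which is why the function checks for both.

```shell
# rebuild_pdf IMAGE OUTPUT_BASE
# Hypothetical helper: OCRs IMAGE to hOCR, then writes OUTPUT_BASE.pdf.
rebuild_pdf() {
    image=$1
    base=$2

    if [ ! -r "$image" ]; then
        echo "rebuild_pdf: cannot read '$image'" >&2
        return 1
    fi

    # The trailing "hocr" selects tesseract's hOCR output instead of plain text.
    tesseract "$image" "$base" hocr || return 1

    # Newer tesseract releases write BASE.hocr, older ones BASE.html.
    if [ -f "$base.hocr" ]; then
        hocr=$base.hocr
    else
        hocr=$base.html
    fi

    # hocr2pdf reads the hOCR on stdin and layers the text over the image.
    hocr2pdf -i "$image" -o "$base.pdf" < "$hocr"
}

# Example use (paths are placeholders):
#   rebuild_pdf /path/to/page1.tif /path/to/page1
```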