Context-Aware Text Recognition?


A scanned document, the text is askew. Next to it is a computer-generated version of the text. A passage is highlighted.

I've been playing with Google's Cloud Vision API. It is OCR (Optical Character Recognition) - but in THE CLOUD and uses MACHINE LEARNING! When it works, it is indistinguishable from magic. When it fails, it reveals a very limited understanding of human text. Let's take a look at this quick example - a piece of […]


Crowdsourcing Leveson


I've already blogged about the Leveson Inquiry's disturbing habit of releasing evidence as scanned-in PDFs. I had a suggestion from digital journalist Kevin Anderson: "@edent Put the Leveson docs up on Google Docs. I'd be curious how their OCR could handle them. Then click 'make public'" (Mr Anderson, @kevglobal, May 11, 2012) Google […]


Leveson - Death By A Thousand (Paper) Cuts


I've been listening to the Leveson inquiry. A large part of the exchanges seems to go like this:

Jay: Turning to page 51.
Witness: Which bundle?
Jay: 1606.
Witness: 1660?
Leveson: No, the page after.
Jay: Paragraph 7.
Witness: I don't have a paragraph 7.
Jay: Ah, I have an earlier print out.
Leveson: You'll […]
