9

A recent answer/comment to a different question prompted me to ask this: Why does Tolkien use neither quotes nor cursive writing, and all lower-case, in this specific "quote"?

Somebody seems to suggest that someone had painstakingly scanned every page of The Lord of the Rings, run OCR software to digitize the text, and then released the result as a pirate copy on the Internet. That would supposedly explain the mistake in (at least) one place where "The Prancing Pony" was spelled as "the prancing pony": the name was meant to be all-uppercase rather than formatted with cursive or quotes, and the OCR mechanism got it wrong.

But how realistic is this? Would a pirated book not originate from an actual "e-book" release, where somebody else has professionally done all the work of actually digitizing the book, so that the pirate just makes a copy of it (possibly removing DRM) and releases it? In fact, they wouldn't even begin from a printed copy in this case, because The Lord of the Rings was digitized decades ago and all copies since start from that digital version.

Also, pirates are lazy. I cannot imagine who would sit and scan in all the pages of a book like that only to release it as "warez", without any possibility of making any money from it. That just doesn't happen in reality, I should think.

Do people actually spend countless hours of painstaking work with a scanner at home to produce their own analog-to-digital "pirate e-books" instead of simply obtaining a professional e-book which they crack (if that's even necessary) and release for free?

6
  • My first copy of Gone With The Wind was a low-quality paperback from some Indian publisher. It consistently had Ups instead of lips, much to my confusion in the initial chapters ("hot words bubbled to her Ups"). I understood the mistake after seeing it a couple more times, but not how or why it could have happened. Years later I realized that this was something I'd expect to see from OCR going wrong.
    – muru
    Commented Apr 13, 2022 at 17:09
  • 8
    Pirates are lazy? Quite a few pirates spend a lot of time on something they're not getting paid for. Back in the 8-bit computer days, pirates had to crack various forms of DRM, which required serious intellectual effort. Many book people work to transcribe public domain works, and some of them work to scan and sometimes OCR older works that aren't in the public domain. Look at archive.org and its magnificent collection of 1980s-1990s computer magazines, virtually all technically pirated and scanned by third parties, for example.
    – prosfilaes
    Commented Apr 13, 2022 at 21:57
  • 4
    From personal experience, a LOT of pirated books are badly OCR'd. Especially those from the era before commercial ebooks became popular (aka the 2000s).
    – Vilx-
    Commented Apr 13, 2022 at 23:54
  • 1
    I imagine OCR pirating can happen, but using the linked question as evidence of it seems tenuous to me. In the linked question, text went from small caps to lower case. The main comment suggests someone may have copy-pasted text to make their version of the book, and didn't notice the small caps came out as lower case -- i.e. in the original, the small caps were a special font that rendered lower case letters with a small caps appearance, but the copy used a normal font throughout. (Another comment suggests OCR may have been used, but it's unlikely small caps would OCR as lower case.)
    – Bavi_H
    Commented Apr 14, 2022 at 12:55
  • 2
    "I cannot imagine who would sit and scan in all the pages of a book like that only to release it as "warez", without any possibility of making any money from it. That just doesn't happen in reality, I should think." Pirates aren't in it for the money. They're in it for recognition (or other factors). Pirated scans of paper comic books are certainly a thing. You're also assuming that the pirated version was created after a professional digital version was already widely available.
    – jamesdlin
    Commented Apr 14, 2022 at 20:37

2 Answers

14

Digitising books is common, although not trivial: libraries and other institutions do it routinely, and it can be done at home. But most people would probably find it easier to obtain an existing digital text, if one exists. Wikipedia's page on book scanning discusses the often very expensive and specialised machines designed for digitising books. These are used to capture images of books which are then made available online, either on library websites or sites such as Google Books.

The remaining steps in obtaining a text are firstly to use OCR (optical character recognition) to convert the image to text, and secondly to correct the OCRed text which (as mentioned) will often include errors. A first pass can be automatic, but human correction is really needed. Project Gutenberg is perhaps the most famous project of this kind, offering a vast archive of downloadable text files - according to Wikipedia its texts were originally typed in by hand, but since 1989 they have also used OCR software, with the texts then proofread against the original. More information can be found on the PG website FAQ. The quality of the OCR output depends on the input quality, so something digitised on a professional machine will come out much better than something you photographed on your phone camera.

In many cases, if you were planning on pirating a book there are easier alternatives than going through this procedure: you could hack an eBook, you could go online and search for an illegal copy, or if the text was legally available online (e.g. was out of copyright) you could download it. Tolkien is still in copyright, as this answer explains, so you could not get a text legally, but if something exists digitally you can probably hack it. I am sure texts of Tolkien are available online, but I don't propose to search for them.

Alternatively, some fans might be so devoted that they OCR a work out of love, not to make a profit. If you love an author who is obscure or out of print, this may be the only way of sharing their work. There are fansites devoted to legally OCRing texts, such as diybookscanner.org. It can be done with a cheap scanner that costs less than $100, or even a digital camera on a phone, but it is slow and difficult. However, enthusiasts have created bespoke set-ups or kits for a few hundred dollars to simplify the procedure, using specialised masks and stands to control the lighting and camera/scanner position. And if you work at a library or institution with a professional reader, you might be able to "borrow" it. A lot of people involved in the DIY scene will take a lot of care, but obviously if you don't bother about formatting or proofing, it's even easier.

10
  • Are you sure human correction is still needed? Machine learning has come a long way in the last decade. Commented Apr 14, 2022 at 11:58
  • 2
    @leftaroundabout Improvements to OCR have been made thanks to ML, but you still generally want a person proofing things, especially if the original is not of great quality (for example, printing errors often do not get fixed correctly even by the best OCR, because even noticing them requires natural language processing, which is still a very difficult task even for good ML systems). Commented Apr 14, 2022 at 12:48
  • "The remaining steps in obtaining a text are firstly to use OCR (optical character recognition) to convert the image to text, and secondly to correct the OCRed text which (as mentioned) will often include errors. " I doubt that these two steps are always entirely separate. Recognizing a whole word sounds like an easier problem than recognizing the individual characters independently, then correcting the resulting list of characters into a word.
    – Stef
    Commented Apr 14, 2022 at 13:55
  • @Stef that's not at all what that sentence is saying.
    – hobbs
    Commented Apr 14, 2022 at 15:15
  • @hobbs Hum, okay. Should I ask what is not at all what what sentence is saying?
    – Stef
    Commented Apr 14, 2022 at 16:32
8

Prior to the ebook explosion, OCR piracy was quite common. After the collapse of the Soviet Union, the former Soviet states had limited access to printed books from western countries, a strong demand for western literature, and a rather lax attitude towards intellectual property. It was quite common for someone to get a printed book, scan it, OCR it, optionally spell-check it, and give/sell electronic copies to other people.

That said, the quote in the other question is not showing the usual signs of OCR piracy: OCR frequently mis-interprets letter pairs, such as turning "li" into "h" or "in" into "m", and frequently drops punctuation, but rarely changes case. It looks far more like electronic copying where font information was dropped, turning small caps into lower case.
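To make that distinction concrete, here is a rough Python sketch of the heuristic, not a real OCR-forensics tool. The function name `classify_difference` and the `OCR_CONFUSIONS` table are made up for illustration; the table only contains the two letter-pair examples mentioned above, and the check only handles a single substitution per string.

```python
# Hypothetical confusion table: only the two letter-pair examples
# from the answer above ("li" -> "h", "in" -> "m"). Real OCR engines
# produce many more confusions than this.
OCR_CONFUSIONS = {"li": "h", "in": "m"}


def classify_difference(expected: str, observed: str) -> str:
    """Roughly classify how `observed` differs from `expected`.

    Returns one of: "identical", "case-only", "ocr-confusion", "other".
    Only a single confusion-pair substitution is considered.
    """
    if expected == observed:
        return "identical"
    # Same letters, different case: consistent with dropped font
    # information (e.g. small caps falling back to lower case).
    if expected.lower() == observed.lower():
        return "case-only"
    # Try replacing each occurrence of a confusable pair, one at a time.
    for src, dst in OCR_CONFUSIONS.items():
        start = expected.find(src)
        while start != -1:
            candidate = expected[:start] + dst + expected[start + len(src):]
            if candidate == observed:
                return "ocr-confusion"
            start = expected.find(src, start + 1)
    return "other"


if __name__ == "__main__":
    print(classify_difference("The Prancing Pony", "the prancing pony"))
    print(classify_difference("lips", "hps"))
```

Under these assumptions the Prancing Pony discrepancy comes out as "case-only", which fits the font-information explanation better than the OCR one, while something like "lips" turning into "hps" is flagged as a typical letter-pair confusion.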

4
  • Do you have references for any of your claims here?
    – bobble
    Commented Apr 13, 2022 at 21:49
  • 4
    @bobble, I hung around on warez sites during the OCR-piracy era, so this comes from things like reading discussions or noting that books tended to be uploaded by users with Cyrillic names. Pretty sure the sites are all long-dead these days, though the OCR-pirated ebooks are still circulating.
    – Mark
    Commented Apr 13, 2022 at 21:54
  • 3
    Here's an example of OCR reading small-caps as lower-case in this passage: Internet Archive's copy of the 2012 HarperCollins edition. Commented Apr 14, 2022 at 6:14
  • @bobble Mark can use me as a reference. (Yes, I'm being silly.) I have two famous SciFi books in RTF that have obvious OCR errors. I downloaded them for free from a site in Russia in 2005, before e-books as we now know them existed. I already owned the books in paperback, so I didn't feel so bad about copyright infringement. No money changed hands and I was already a paying customer of the author and publisher.
    – MTA
    Commented Apr 14, 2022 at 19:16
