Which software converts academic publications from PDF to Word without mangling complex formatting?

Question

I'm asking about this in StackExchange Academia rather than a software-recommendation site because I have conversion problems specific to the layout of academic publications. I'm helping a student who has problems with their eyesight. They find it hard to read academic papers and books on screen, when distributed as PDF. They prefer to have these converted to Word, then read that on screen. This helps them because they can change the font of any part that's hard to read, as well as highlighting section headings and other cues to help them navigate. There are many PDF-to-Word converters, but all the ones I've tried (see below) have defects. Can anyone recommend something better than those I've listed? The converter must not mangle tables, diagrams, footnotes, formulae, subscripts and superscripts, or other complicated content typical of academic publications.

The student does not like to read printed text, and only has an A4 printer, so blowing up the text and printing that isn't feasible. The pandemic introduces its own constraints, since they don't want to venture from their flat to use University computers. So they're restricted to what I think is a 22-inch screen, Windows 10, and their A4 printer. I only mention these because commenters suggested them. The student is comfortable reading Word files, and I want to help them do that, rather than force them to do something outside their preferences.

Getting back to PDF-to-Word converters, so far, we have tried:

Kofax Power PDF. This is very erratic. For example, when I converted an archaeological report which I had to ask Power PDF to OCR, it "thought" it had found characters in the thin lines indicating boundaries in a diagram of the archaeological site. Power PDF also mangles structural chemical formulae, even when told not to OCR. Footnotes and tables cause even more problems, as do in-line chemical formulae, and subscripts and superscripts.
SmallPDF. This claims to be "the platform that makes it super easy to convert and edit all your PDF files". It is not. It converts some so they work better than those done by Power PDF, but crashes on others.
PDF24. This probably does a better job than either Power PDF or SmallPDF, when it works. But on some PDFs that need OCR'ing, it just emits the original page-images. This may well be a bug, and I've reported it, but I've not had any reply.

Does anyone have any recommendations?

To the close voters: please recall that, according to the outcome of this meta discussion, we accept questions about software recommendations related to academic tasks, and this is sufficiently specific to be considered on topic. See also this FAQ. — Massimo Ortolano, Commented Jan 7, 2021 at 20:00
Comments are not for extended discussion; this conversation has been moved to chat. Please read this FAQ before posting another comment. — Massimo Ortolano, Commented Jan 8, 2021 at 8:47
By the nature of how PDF works internally, it is unlikely that you will find something that works with every PDF. You can try to find something that works well most of the time, but if reading PDFs is really not possible then I think some combination of solutions will be needed. — user2316602, Commented Jan 8, 2021 at 10:51
Unfortunately, you're right. I do realise that this is a hard problem: I said that in a comment below. But I don't think it's insoluble, I just haven't had time or money to evaluate all the converters available. — Phil van Kleur, Commented Jan 8, 2021 at 11:13

Mario Niepel · Accepted Answer · 2021-01-07 20:30:46Z

Have you tried the obvious: either exporting the PDF from Adobe Acrobat or importing the PDF into Word? I tried just now with two papers, and the results are really good with regard to figures, tables, footnotes, references, layout, ...

original PDF

exported from Acrobat

imported into Word

Make sure to download the documents instead of viewing them in Google Drive. They will look closer to the original in Word itself than when rendered by Google Drive. The only major issue I can see is the title page in both cases, but the PDF export seems to do a slightly better job.


Doesn't work with scanned papers! The OP has high standards!
– user151413
Commented Jan 7, 2021 at 21:00
@Mario Niepel Thanks for taking the time to show me that. I'll try that on some of the recent papers and see how the Word versions look. Unfortunately, user151413 is right about the scans. Some of our older archaeology papers have arrived as scans, and we do need to OCR them. There's not much we can do about that. But for the others, I'll try it. Question: has Word ever frozen on you when you've imported these PDFs?
– Phil van Kleur
Commented Jan 7, 2021 at 21:10
Honestly, this is the first time I tried, so my experience is n=1. And of course any import/export feature is only as good as the OCR you use. If OCR is an issue, I would divorce this from the PDF question.
– Mario Niepel
Commented Jan 7, 2021 at 21:13
@Mario Niepel I've been considering OCR and conversion together because that's how the software companies package them. All three converters I mentioned do OCR, and I think all the others I've glanced at do too.
– Phil van Kleur
Commented Jan 7, 2021 at 21:18
"Has Word ever frozen on you when you've imported these PDFs?" This might be due to a lack of memory.
– user2768
Commented Jan 8, 2021 at 9:31

user2768 · Accepted Answer · 2021-01-08 09:35:02Z

UPDATE: The question has gone through 10+ revisions since I wrote this answer. I haven't checked to see whether it is still relevant. I'll try to respond to comments.

I'll start with a disclaimer: my initial solutions do not follow the prescribed PDF-to-doc conversion (my last solution does, although it's probably a bad one), and I haven't tested them, which violates the OP's constraints. I disregard these constraints because I don't think they are particularly relevant.

It isn't clear whether on-screen, digital reading or printed, physical reading is preferred, nor how much adjustment in text size is desired. Clarification may allow more focus.

The trivial, obvious solutions, which have surely been tried, are a large screen plus zoom, and printing to a larger paper size, e.g., A3 or A2 (the former being more commonly available), rather than A4. Annotations and highlighting can be applied to PDFs with numerous tools.

For various other enhancements, numerous tools exist on Linux, which can be scripted to suit needs. For instance, a page could be split into smaller parts and manipulated in various ways, e.g., split a portrait page into top- and bottom-parts, and blow up each part onto a horizontal page (optionally of larger size). All operations can be automated. This approach cannot fail particularly badly for academic papers. At worst, perhaps the split is applied in the wrong place, but, the two parts can be displayed simultaneously, aligned at the lower/upper edge, or by folding the top edge of a printed page.
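
The geometry of the split-and-enlarge idea can be sketched in a few lines. This is only an illustration of the arithmetic, not any particular Linux tool: the names `Box`, `split_top_bottom`, `fit_scale`, and the `overlap` parameter are hypothetical, page sizes are given in PDF points (1/72 inch), and coordinates are measured from the lower-left corner. A real script would pass the resulting crop rectangles to whatever PDF tool it uses.

```python
from dataclasses import dataclass

# Page sizes in points (1/72 inch), portrait orientation: (width, height).
A4_PORTRAIT = (595.28, 841.89)
A3_PORTRAIT = (841.89, 1190.55)


@dataclass(frozen=True)
class Box:
    """A crop rectangle; coordinates measured from the lower-left corner."""
    left: float
    bottom: float
    right: float
    top: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.top - self.bottom


def split_top_bottom(page_w: float, page_h: float, overlap: float = 0.0):
    """Return (top_half, bottom_half) crop boxes for a portrait page.

    `overlap` (an assumption of this sketch) extends each half past the
    midline, so a line of text cut by the split appears whole in one half.
    """
    mid = page_h / 2
    top_half = Box(0.0, mid - overlap, page_w, page_h)
    bottom_half = Box(0.0, 0.0, page_w, mid + overlap)
    return top_half, bottom_half


def landscape(size):
    """Turn a (width, height) page size into landscape orientation."""
    w, h = size
    return (max(w, h), min(w, h))


def fit_scale(box: Box, target_w: float, target_h: float) -> float:
    """Largest uniform magnification at which `box` still fits the target."""
    return min(target_w / box.width, target_h / box.height)


if __name__ == "__main__":
    top, bottom = split_top_bottom(*A4_PORTRAIT)
    for name, size in (("A4", A4_PORTRAIT), ("A3", A3_PORTRAIT)):
        scale = fit_scale(top, *landscape(size))
        print(f"half an A4 page on landscape {name}: x{scale:.2f}")
```

Under these assumptions, half an A4 page enlarges by roughly 1.41 on a landscape A4 sheet and by about 2 on a landscape A3 sheet, which gives a sense of how much magnification the approach buys.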

Linux tools can also be used to convert to word processor format, for instance, pdf2ps and ps2txt can be combined to convert pdf to plaintext, which any word processor can open. (This is my worst solution. I doubt tables, footnotes, formulae, etc. will be handled particularly well.)

Converting to a word processor format seems like a misdirection. That's why my last solution probably won't work. Extracting another format from a pdf seemingly goes against the design goal of the pdf format, namely, to present documents in a [software independent] manner...encapsulat[ing, in each file,] a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images and other information needed to display it (source: Wikipedia).

My apologies if I'm way off mark here. I'm unsure of your precise requirements.

Thanks. I've updated my question to better state the constraints. A big printer isn't an option, sadly; nor is Linux; nor (during the pandemic), a bigger screen. I have actually tried an equivalent of what you suggest with PDF-to-PS-to-text, by running a PDF-to-text converter in the R programming language. R's good for scripting, and I could easily apply the converter to a list of files. But you're right, tables etc. were not handled well. — Phil van Kleur, Commented Jan 7, 2021 at 17:42
You're also right about the design goals of PDF. I realise that this is a hard problem, in the same sense that integration is harder than differentiation. On the other hand, it's not the 1960s any more, and AI has advanced beyond researchers believing that machine vision can be achieved in a summer ( news.ycombinator.com/item?id=12080565 ). The companies ought to be able to do better than they are. I bet the Department of Defense could ... — Phil van Kleur, Commented Jan 7, 2021 at 17:45
Many could, I guess there's not enough demand. Ordering a massive screen seems like an easy option, assuming there's a budget. Alternatively, asking someone willing to collect one from the department could work too. Equally, doing so out of hours could be a safe option. — user2768, Commented Jan 7, 2021 at 17:50
@PhilvanKleur Why isn't a bigger printer or screen possible? They can easily be delivered. — Azor Ahai -him-, Commented Jan 7, 2021 at 19:57
@ "Errr... money?" - So software that costs $300 would also not be a valid answer? You are narrowing your question down more and more. In essence, you want free software which can do what probably most journals' software can't do (at least, I have never seen a properly reformatted old paper; journals just scan and OCR them). — user151413, Commented Jan 7, 2021 at 20:59

Jeff · Accepted Answer · 2021-01-08 14:20:33Z

Converting scanned (OCR) PDF documents into Word documents with complete accuracy isn't possible.

When I teach my students about this, I tell them that when someone asks if they can get data out of a PDF document, the first answer should be "No." The second answer, if pressed, should be "No. Is there any other way to get it?" And if still pressed, the final answer is "Fine, but it won't be very reliable."

This is especially true with OCR. The software is using algorithms, often Google's Tesseract engine, to give its best guess at what it's seeing. There will nearly always be errors, especially if the scan isn't of the highest quality. You're going to see spacing messed up a lot, but also similar-looking characters confused, like a 1 and a 7, or a 0 and an O. Tables are especially prone to being butchered. If you upload any PDF to Google Docs, it gives you the option to convert it to text in the menu. You can also download Tesseract hooks for Python, for example.

Honestly, even non-OCR PDF documents can be unreliable. Have you ever copy-pasted text out of a PDF, and noticed that some of the spaces disappear? Well, software converting it to text encounters problems as well.

A PDF file isn't built around plain text the way a Word file is. For example, a trick not many people know is that you can unzip a .docx file into its component parts using regular unzip software, and one of those parts is a text (XML) file holding the document's content. This is not at all true with a PDF document.
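
As a rough illustration of the .docx point, the sketch below opens a .docx as a zip archive with Python's standard library and pulls the paragraph text out of `word/document.xml`. The helper names `docx_paragraphs` and `make_toy_docx` are made up for this example, and the toy archive it builds contains only the one XML part, so it is enough to demonstrate the idea but is not a file Word itself would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_file):
    """Return the text of each paragraph found in word/document.xml.

    `docx_file` may be a path or a file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


def make_toy_docx(lines):
    """Build a stripped-down, in-memory stand-in for a .docx (demo only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(line)}</w:t></w:r></w:p>" for line in lines
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    toy = make_toy_docx(["1. Introduction", "Water is H2O."])
    for text in docx_paragraphs(toy):
        print(text)
```

With a real file, calling `docx_paragraphs("paper.docx")` (a hypothetical filename) should work the same way, since the paragraph text is stored in reading order inside that one XML part.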

I feel like your student should focus on finding a platform and software that helps them read PDF documents. Something with a magnifying lens, for example, or on a tablet.

I didn't know that .docx files were zip files! But thanks for the explanation. I have often noticed that spacing gets messed up when I copy-paste from PDFs. That is very annoying, and affects me as well as the student, just because sometimes I need to quote from a paper and only have it as a PDF. How do you get around that? And, out of curiosity, how is the PDF representing the text and the spaces in it? — Phil van Kleur, Commented Jan 11, 2021 at 18:12
@PhilvanKleur I'm not totally up-to-date on the PDF specification, but it's my understanding that it defines the position of each piece of text using x-y coordinates that represent its distance from the upper-left hand corner. So there ARE no actual space characters; they're just white space left over in between the other pieces. There's unfortunately no perfect solution to get around this, other than finding sources that aren't in PDF formats. — Jeff, Commented Jan 11, 2021 at 18:54
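
To make the missing-spaces problem concrete, here is a minimal sketch of the kind of guesswork a text extractor has to do when all it has is positioned fragments. The `Fragment` type, the `join_line` function, and the `gap_ratio` threshold are assumptions made up for this illustration and are not taken from any real PDF library; only horizontal positions matter here, so the sketch ignores where the page origin sits (in PDF's default coordinate system it is the lower-left corner of the page). Some PDFs do store literal space characters, but an extractor cannot count on that.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text placed on a line: its left edge and width in points."""
    text: str
    x: float
    width: float


def join_line(fragments, font_size=10.0, gap_ratio=0.25):
    """Rebuild one line of text, guessing where the spaces were.

    A space is inserted whenever the horizontal gap between two fragments
    exceeds `gap_ratio * font_size`. The threshold is a heuristic: too high
    and words run together, too low and words get split apart.
    """
    threshold = gap_ratio * font_size
    pieces = []
    previous_end = None
    for frag in sorted(fragments, key=lambda f: f.x):
        if previous_end is not None and frag.x - previous_end > threshold:
            pieces.append(" ")
        pieces.append(frag.text)
        previous_end = frag.x + frag.width
    return "".join(pieces)


if __name__ == "__main__":
    normal = [Fragment("Hel", 0.0, 15.0), Fragment("lo", 15.2, 10.0),
              Fragment("world", 30.0, 25.0)]
    tight = [Fragment("Hel", 0.0, 15.0), Fragment("lo", 15.2, 10.0),
             Fragment("world", 27.0, 25.0)]
    print(join_line(normal))  # gap of 4.8 pt exceeds the 2.5 pt threshold
    print(join_line(tight))   # gap of 1.8 pt does not, so the space is lost
```

The second call shows how a tightly set line loses its space under this heuristic, which is the same symptom as the vanishing spaces seen when copy-pasting from a PDF.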
