
I'm using pandoc on my website so that my content creators can write in Word, which would simplify development immensely. The only issue I have is that pandoc seems to output very little information about document styling. I could manually re-add the styling with CSS later, but that somewhat defeats the purpose of using Word.

Hence my question: can I get pandoc to emit more information about how my document is styled, so that I can apply more generic styles and have the output look comparable to how it was written?

For example, I've got a table with user-defined centre-aligned text (the person centred the text in Word). I export my document to HTML using the following command:

`pandoc --from docx --to html --embed-resources --reference-doc ./reference.docx --section-divs` (I'm streaming in the contents of the file via stdin, as it isn't guaranteed to be in a consistent location.)

When the HTML comes back, the contents of the table elements are simply paragraphs. I'd like them to be wrapped in <center> tags or the equivalent - just something to identify that they need to be centred.

There are a number of cases like this that I'd like to handle, but for the moment centring text is the priority.

Thanks for any pointers

  • I too am searching for an answer to this question. It doesn't seem like it would be that hard (the alignment info is pretty clear in the document.xml file). I even tried using python-docx to convert the alignment info to a custom style I defined (that part works) but then pandoc just ignored it. Super frustrating
    – Cfreak
    Commented May 5 at 16:42

1 Answer


I kept at this and found a solution. It's not great, but it works. Changing --from docx to --from docx+styles wraps each paragraph that has a custom style in an additional <div data-custom-style="your-custom-style">.

I wish it would create a CSS class instead, and I'm still searching for a way to do that, but some post-processing of the HTML could handle it.
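As a rough illustration of that post-processing, here is a minimal sketch that rewrites each data-custom-style attribute into a class attribute. The function names and the slug rule (lowercase, non-alphanumeric runs become hyphens) are my own assumptions, not anything pandoc provides, and the regex assumes the div carries no class attribute already.

```python
import re

# Matches the attribute pandoc emits for docx+styles, e.g. data-custom-style="Align Center"
_CUSTOM_STYLE = re.compile(r'data-custom-style="([^"]*)"')


def slugify(style_name):
    """Turn a Word style name into a CSS-friendly class name (assumed convention)."""
    return re.sub(r"[^a-z0-9]+", "-", style_name.lower()).strip("-")


def custom_styles_to_classes(html):
    """Replace every data-custom-style attribute with an equivalent class attribute."""
    return _CUSTOM_STYLE.sub(
        lambda match: 'class="{}"'.format(slugify(match.group(1))), html
    )


html = '<div data-custom-style="Align Center"><p>Hi</p></div>'
print(custom_styles_to_classes(html))
```

If rewriting the markup isn't needed, a CSS attribute selector such as div[data-custom-style="Align Center"] can target the same div without any post-processing.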

Alignment applied by simply pressing right, center, justify, etc. in Word isn't a custom style, though, so it needs separate handling. (There are some other, weirder options, but for now I'm only worrying about those three. Left is the default.)

I'm doing something similar with user-contributed Word documents. Using Python with the python-docx and docxcompose packages, I did the following:

  1. Create a reference document (generated from Pandoc) with additional styles defined that handle alignment. I'm only worrying about body text but you could create ones for headings as well.

  2. Make a copy of the reference document that keeps the styles but has none of the content. This is to assist with applying the styles to the user's document (python-docx doesn't have an easy way to remove content from a document).

  3. Open the blank copy of the reference document using python-docx, then use docxcompose to append the content of the user's document to the blank reference. (You can add styles directly but python-docx doesn't support outputting them yet.)

  4. Save the combined document and re-open it (this might not be necessary, but I'm doing it all in memory and I wanted to ensure the XML is all created correctly).

  5. Loop through the paragraphs of the re-opened document. paragraph.alignment contains the alignment info, and you can then set paragraph.style = "custom-align-style" for the appropriate alignment.

  6. Save the resulting document and convert it with pandoc using the --from docx+styles option I mentioned above. Now you'll know which paragraphs need treatment in HTML.
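Steps 3 to 5 could look roughly like the sketch below. It is not my exact code: the style names (align-center and so on), the function names and the file paths are hypothetical, and the styles are assumed to exist already in the blank reference copy. The integer keys are the values of python-docx's WD_ALIGN_PARAGRAPH members. I've also walked top-level table cells, since document.paragraphs only covers body-level paragraphs.

```python
# Hypothetical style names; they must already be defined in the reference document.
ALIGNMENT_STYLES = {
    1: "align-center",   # WD_ALIGN_PARAGRAPH.CENTER
    2: "align-right",    # WD_ALIGN_PARAGRAPH.RIGHT
    3: "align-justify",  # WD_ALIGN_PARAGRAPH.JUSTIFY
}


def style_for_alignment(alignment):
    """Return the custom style name for an alignment value, or None (left/unset)."""
    if alignment is None:  # no direct alignment was applied to the paragraph
        return None
    return ALIGNMENT_STYLES.get(int(alignment))


def iter_paragraphs(document):
    """Yield body paragraphs, then paragraphs inside top-level table cells."""
    yield from document.paragraphs
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs


def apply_alignment_styles(blank_reference_path, user_doc_path, output_path):
    # Third-party imports are kept local so the helpers above work without them.
    from docx import Document
    from docxcompose.composer import Composer

    # Step 3: append the user's content to the content-free reference copy.
    composer = Composer(Document(blank_reference_path))
    composer.append(Document(user_doc_path))
    composer.save(output_path)  # step 4: save the combined document

    # Step 5: re-open and swap direct alignment for a named style.
    combined = Document(output_path)
    for paragraph in iter_paragraphs(combined):
        style_name = style_for_alignment(paragraph.alignment)
        if style_name is not None:
            paragraph.style = style_name
    combined.save(output_path)  # ready for pandoc --from docx+styles
```

Note that assigning the alignment style replaces whatever paragraph style was there before, and that python-docx raises a KeyError if the named style is missing from the document.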

I'm post-processing mine on screen in JavaScript. I'm also looking into whether a pandoc filter could do this, so that everything is contained in a single command.
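I haven't built the filter yet, but a JSON filter along these lines might work. This is an untested sketch under assumptions: it relies on pandoc's JSON AST representing a Div as {"t": "Div", "c": [[id, classes, key-value pairs], blocks]} with the style stored under the custom-style key, and the class naming rule and function names are made up for illustration.

```python
import json
import sys


def add_style_classes(node):
    """Recursively add a CSS class to every Div that carries a custom-style attribute."""
    if isinstance(node, list):
        for item in node:
            add_style_classes(item)
    elif isinstance(node, dict):
        if node.get("t") == "Div":
            (_ident, classes, keyvals), _blocks = node["c"]
            for key, value in keyvals:
                if key == "custom-style":
                    css_class = value.lower().replace(" ", "-")  # assumed naming rule
                    if css_class not in classes:
                        classes.append(css_class)
        for value in node.values():
            add_style_classes(value)


def run_filter(stdin, stdout):
    """Read a pandoc JSON document, add the classes, and write it back out."""
    document = json.load(stdin)
    add_style_classes(document)
    json.dump(document, stdout)


# When saved as a script, finish with: run_filter(sys.stdin, sys.stdout)
```

Saved as, say, style_classes.py (a hypothetical name) with that final call added, it would be passed to pandoc through the --filter option alongside --from docx+styles.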
