
When I export a file from Word or TextEdit, I get very bloated HTML, cluttered with styling on every paragraph, so I can't even clean it up by hand.

The only information I want preserved is:

  • <h1>, <h2>, <h3>, <p> tags.

  • Alignment (center, left, right)

  • links, external and internal (for the table of contents)

  • <img> tags

  • Word is notorious for building messy markup. Can you use a different program? Try importing the documents into Google Docs and downloading as HTML (Zipped).
    – Synetech
    Commented Feb 14, 2012 at 3:56
  • Google Docs HTML does everything with spans and CSS classes and has no newlines.
    – Nathan
    Commented Feb 14, 2012 at 4:33
  • Cannot reproduce issues with TextEdit. Can you provide a sample document that uses inline styles?
    – Daniel Beck
    Commented Feb 14, 2012 at 12:53
  • I'd also try OpenOffice/LibreOffice.
    Commented Feb 14, 2012 at 15:31
  • @DanielBeck This is a simple document, written in Pages, exported as .rtf, and saved as HTML, which is what I need to be able to do. snipt.org/uMr6
    Commented Feb 14, 2012 at 19:56

1 Answer


I once heard that the blog feature of Microsoft Word exports much better HTML than even filtered HTML under the Save As menu.

To try it, go to the Word Ribbon -> Publish -> Blog. You will need to set up a dummy account, but if the results are good enough it might be worth it.

Otherwise, since your expected output sounds so simple, you may even want to consider writing your own VBA script that walks each element in the document in order, builds an HTML string from each, and saves the result to disk.
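
If VBA is not an option, the same idea can be applied after the export instead. The following is only a rough sketch in Python, not the VBA approach described above: it walks the exported HTML with the standard-library `html.parser` module and re-emits only the tags from your list. The names `KEEP_TAGS`, `KEEP_ATTRS`, `WhitelistFilter`, and `clean_html` are made up for illustration, and it assumes alignment appears either as an `align` attribute or as `text-align` in an inline `style` on the heading or paragraph itself. Alignment set on a wrapping `<div>` or through a CSS class would be lost.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist based on the question: headings, paragraphs, links, images.
KEEP_TAGS = {"h1", "h2", "h3", "p", "a", "img"}
BLOCK_TAGS = {"h1", "h2", "h3", "p"}   # tags whose alignment is preserved
VOID_TAGS = {"img"}                    # tags that never get a closing tag
SKIP_CONTENT = {"style", "script", "title"}  # drop these tags and their text
# Attributes kept per tag; "id"/"name" keep internal link targets working.
KEEP_ATTRS = {
    "a": {"href", "name", "id"},
    "img": {"src", "alt"},
    "h1": {"id"}, "h2": {"id"}, "h3": {"id"}, "p": {"id"},
}
ALIGN_RE = re.compile(r"text-align\s*:\s*(left|right|center)", re.I)


def find_alignment(attrs):
    """Return 'left', 'right', 'center', or None from align= or inline style."""
    align = (attrs.get("align") or "").lower()
    if align in {"left", "right", "center"}:
        return align
    match = ALIGN_RE.search(attrs.get("style") or "")
    return match.group(1).lower() if match else None


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <style>, <script>, or <title>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP_TAGS:
            return  # unknown tag: drop the tag itself, keep its text
        attrs = dict(attrs)
        allowed = KEEP_ATTRS.get(tag, set())
        kept = [(k, v) for k, v in attrs.items() if k in allowed and v is not None]
        if tag in BLOCK_TAGS:
            align = find_alignment(attrs)
            if align:
                kept.append(("style", f"text-align:{align}"))
        rendered = "".join(f' {k}="{escape(v)}"' for k, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in KEEP_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_html(source):
    """Return source with everything outside the whitelist stripped."""
    parser = WhitelistFilter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Calling `clean_html(open("export.html", encoding="utf-8").read())` (the file name is a placeholder) would give you the stripped markup as a string. Tags outside the whitelist, such as `<span>` or `<font>`, are dropped while their text is kept, so you would probably still want to check the result by eye on a real export.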
