My father, an internist, has been self-publishing a short collection of stories on death and dying called Kind Farewells; each tells a story from his medical practice about a death that touched him, and others, deeply. So far, he has listed the book in hardcover, softcover, and PDF download formats. Recently, he decided that he wanted to offer it in various e-book formats as well.

This shouldn’t be hard. He wrote the book in OpenOffice; the two most common e-book formats (epub and mobi) are both collections of HTML files; OpenOffice is able to export to HTML; and the freely available Calibre tool converts HTML into many formats, including epub and mobi. So this should be a matter of choosing whether to export the HTML as a single file or as one file per chapter, converting it in Calibre, and reviewing the converted content. The finished file then goes to Amazon’s Digital Text Platform to make it available on the Kindle, to Barnes and Noble’s PubIt! to make it available for the Nook, and to any other e-bookstores that are easy to use and allow individuals to publish books.

He asked for some help.

In practice, our biggest difficulty was HTML output formatting. We encountered two kinds of problems. The first was a simple bug: after the closing </html> tag, OpenOffice appended a few lines of seemingly random text from within the file. The second was that the conversion was clearly intended to preserve as many WYSIWYG elements as possible, which clashed badly with the constraints of e-book displays.
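The first problem can be worked around before any HTML parsing happens. What follows is only a minimal sketch, not the code from my script: it assumes the export contains exactly one genuine closing </html> tag, and the function name truncate_after_html_close is made up for illustration.

```python
import re

# Matches the closing tag in any letter case (assumption: the first match
# is the real end of the document, since a valid file has only one).
_HTML_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)


def truncate_after_html_close(text):
    """Return text cut off right after the first closing </html> tag.

    If no closing tag is found, the text is returned unchanged.
    """
    match = _HTML_CLOSE.search(text)
    if match is None:
        return text
    # Keep everything up to and including the tag, then end with a newline.
    return text[: match.end()] + "\n"
```

Searching for the first closing tag rather than the last is deliberate: the stray text is copied from inside the file, so it could in principle contain markup of its own.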

I ended up using the Python lxml module to write a short script, which I made available on Bitbucket, to clean up the HTML output for use in an e-book. I’m not intending to do active long-term maintenance of the script, so I encourage anyone who needs to modify it to fork it on Bitbucket. If you use the “fork” button on Bitbucket, the fork will show up on the main page for other people to see.

When I run the script, I want to make sure that the changes affect only the styling and do not modify the text of the book, so I compare the output of lynx -dump before and after processing using diff -u. Since lynx does not honor CSS (as far as I know), my CSS changes have no impact on the output, and therefore bugs in my conversion script stand out as differences in the output from lynx.
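The same check can be scripted. The sketch below is an illustration rather than part of my script: it assumes lynx is installed and on the PATH, uses Python's difflib in place of diff -u, and the function names lynx_dump and text_diff are made up for this example.

```python
import difflib
import subprocess


def lynx_dump(path):
    """Render a local HTML file as plain text by running `lynx -dump`.

    Assumes the lynx binary is available on the PATH.
    """
    result = subprocess.run(
        ["lynx", "-dump", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def text_diff(before, after):
    """Return a unified diff of two text dumps, or "" if they are identical."""
    diff_lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="before",
        tofile="after",
    )
    return "".join(diff_lines)
```

Calling text_diff(lynx_dump("original.html"), lynx_dump("cleaned.html")), with file names that are placeholders here, should return an empty string when the cleanup touched only styling; anything else points at text the script changed.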

The script isn’t particularly quick; I intentionally wrote it to use XPath and multiple passes, one per fix, to make it easy to understand and modify.
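To give a feel for the structure, here is a minimal sketch of the XPath-and-multiple-passes approach using lxml. It is not the script itself: the two passes shown (dropping inline style attributes and unwrapping font tags) are assumed examples of WYSIWYG markup worth removing, and the function names are hypothetical.

```python
import lxml.html


def strip_inline_styles(tree):
    """Pass 1 (assumed fix): remove every inline style attribute."""
    for element in tree.xpath("//*[@style]"):
        del element.attrib["style"]


def unwrap_font_tags(tree):
    """Pass 2 (assumed fix): drop <font> tags but keep their text and children."""
    for element in tree.xpath("//font"):
        element.drop_tag()


def clean(html_text):
    """Parse the HTML, run each pass in turn, and serialize the result."""
    tree = lxml.html.fromstring(html_text)
    # Each pass walks the whole tree again: slower, but each fix stays
    # independent and easy to read, remove, or reorder.
    for cleanup_pass in (strip_inline_styles, unwrap_font_tags):
        cleanup_pass(tree)
    return lxml.html.tostring(tree, encoding="unicode")
```

Keeping every fix in its own function with its own XPath query costs a full tree traversal per pass, but it means a new fix can be added or an unwanted one removed without touching the others.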

The end result: while Barnes and Noble has taken days and still has not made Kind Farewells available in the NOOKstore, it showed up on Amazon in less than a day.