
Improving Boost Docs Project


HTML to Docbook

Problem

To have a complete conversion tool, we need to be able to convert existing documentation written in HTML to Quickbook. As things currently stand, good progress is being made on the following parts of the document-conversion pipeline:

docbook --[boostbook + xsltproc]--> HTML --[quickbook css]--> quickbook

However, this project still lacks an important part:

HTML --[html to docbook (missing)]--> docbook --[above pipeline]--> result

The aim of this subproject, then, is to investigate open-source solutions to this problem and to determine which one works best for Boost.

Converting HTML to docbook XML

What exactly should this tool do? As input it should take an HTML document (which may not necessarily be valid XHTML) and map the HTML tags to docbook XML. For example:

<h1>My Section</h1>
<p>Some text</p>

should become something like:

<section id="my_section">
<title>My Section</title>
<para>Some text</para>
</section>
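
The mapping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool this subproject is looking for: it assumes the input is already well-formed XHTML without a namespace, it handles only headings and paragraphs, and it does not nest sections by heading level. The names html_to_docbook and slugify are hypothetical, as is the wrapping article element.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h[1-6]$")

def slugify(text):
    """Turn a heading such as 'My Section' into an id such as 'my_section'."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

def html_to_docbook(xhtml_fragment):
    """Convert a flat run of headings and paragraphs into docbook sections."""
    # Wrap the fragment in a dummy root so several top-level tags still parse.
    root = ET.fromstring("<body>" + xhtml_fragment + "</body>")
    out = ET.Element("article")  # assumed container element
    current = out
    for node in root:
        text = "".join(node.itertext()).strip()
        if HEADING.match(node.tag):
            # Every heading opens a new section (no nesting by level here).
            current = ET.SubElement(out, "section", id=slugify(text))
            ET.SubElement(current, "title").text = text
        elif node.tag == "p":
            # Inline markup inside the paragraph is flattened to plain text.
            ET.SubElement(current, "para").text = text
    return out

if __name__ == "__main__":
    doc = html_to_docbook("<h1>My Section</h1>\n<p>Some text</p>")
    print(ET.tostring(doc, encoding="unicode"))
```

Even this toy version shows where the difficulty lies: anything that is not a heading or a paragraph is silently dropped, and inline tags lose their meaning.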

Two main problems present themselves. First, what should the tool do if the original document doesn't validate as XHTML? Second, there will certainly be a many-to-one mapping from HTML to docbook. Is it possible to determine a general solution for this?

Open-source solutions

For the first problem, I am comfortable recommending Tidy for cleaning up the input and producing valid XHTML. It is open-source and cross-platform. Furthermore, it is the original author's responsibility to ensure that his or her input is valid, so I feel that this task falls outside the scope of this subproject.
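
As a rough sketch of how Tidy could be run before the conversion step, the Python helper below builds and runs a Tidy command line. The file names and the function names tidy_command and run_tidy are hypothetical, and it assumes the tidy executable is on the PATH. The -numeric option is included on the assumption that numeric entities are easier for later XML tools to handle than named HTML entities.

```python
import subprocess

def tidy_command(src, dst):
    """Build a Tidy command line that writes `src` as XHTML to `dst`."""
    return ["tidy", "-quiet", "-asxhtml", "-numeric", "-output", dst, src]

def run_tidy(src, dst):
    """Run Tidy and report whether it finished without errors."""
    # Tidy exits with 0 (no problems), 1 (warnings) or 2 (errors).
    result = subprocess.run(tidy_command(src, dst))
    return result.returncode in (0, 1)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    if not run_tidy("index.html", "index.xhtml"):
        print("Tidy reported errors; the input needs fixing by hand")
```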

Regarding the second point, I have found the following resources for projects that attempt to address this problem:

  1. http://www.eecs.umich.edu/~ppadala/projects/tidy/
  2. http://wiki.docbook.org/topic/Html2DocBook

The first of these initially seemed promising (and was proposed as a possible solution by Matias) but I was unable to make it compile. This makes me wonder whether this project is dead or not. I have sent an email to the developer and am awaiting a response.

The second of these, as an XSL stylesheet, seems the more natural solution. It is still not perfect and does not completely obviate the need for manual rechecking and retagging, but I feel that using this and adapting it for our own needs may be fruitful. I have not yet tried hacking the stylesheet (this is the next thing I will try) but of the things I have found so far, this one seems the most promising.
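
If the stylesheet route is taken, applying it could look something like the sketch below, which drives xsltproc (already used elsewhere in the pipeline) from Python. The stylesheet file name html2db.xsl and the other file names are placeholders rather than the actual names used by the resource above, and the input is assumed to have been cleaned up into XHTML beforehand.

```python
import subprocess

def xsltproc_command(stylesheet, src, dst):
    """Build an xsltproc command line applying `stylesheet` to `src`."""
    return ["xsltproc", "--output", dst, stylesheet, src]

def html_to_docbook_file(stylesheet, src, dst):
    """Apply the stylesheet; raise CalledProcessError if xsltproc fails."""
    subprocess.run(xsltproc_command(stylesheet, src, dst), check=True)

if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    html_to_docbook_file("html2db.xsl", "index.xhtml", "index.docbook.xml")
```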

Conclusion (so far)

Based on the (still limited) investigation I have done so far, I think that the most natural way to convert one XML format to another is to use an XSL stylesheet. Short of developing one specifically for this project, it is best to use the one provided in solution 2 above, as it has been developed by someone who already has a lot of experience with docbook (I have not yet been in touch with him). Adapting it further to Boost's requirements seems to me the most fruitful approach.


Active developers


Glyn Matthews
Linked In profile
glyn dot matthews at gmail dot com


Last modified on Jul 17, 2007, 9:25:59 AM