
Version 14 (modified by Matias Capeletto, 15 years ago)


Improving Boost Docs Project


HTML to Docbook

Problem

To have a complete conversion tool, it must be possible to convert existing documentation, written in HTML, to quickbook. As things currently stand, good progress is being made on the following part of the document conversion pipeline:

docbook --[boostbook + xsltproc]--> HTML --[quickbook css]-->quickbook

However, this project still lacks an important part:

HTML --[html to docbook (missing)]--> docbook --> [above pipeline] --> result

The aim of this subproject, then, is to investigate open source solutions to this problem and determine which one will work best for boost.

Converting HTML to docbook XML

What exactly should this tool do? As input it should take an HTML document (which may not necessarily be valid XHTML) and map the HTML tags to docbook XML. For example:

<h1>My Section</h1>
<p>Some text</p>

should become something like:

<section id="my_section">
<title>My Section</title>
<para>Some text</para>
</section>
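
As a rough illustration of the mapping above, here is a minimal Python sketch. It assumes the input is already well-formed, namespace-free XHTML wrapped in a single body element, handles only h1 and p, and flattens any inline markup to plain text. The article wrapper and the names make_id and xhtml_to_docbook are hypothetical and not part of any existing tool.

```python
import re
import xml.etree.ElementTree as ET


def make_id(title):
    """Derive a section id such as 'my_section' from a heading's text."""
    return re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")


def xhtml_to_docbook(xhtml):
    """Map <h1> to <section>/<title> and <p> to <para>; other tags are skipped."""
    body = ET.fromstring(xhtml)
    root = ET.Element("article")  # hypothetical wrapper element
    current = root  # paragraphs before any heading attach to the root
    for node in body:
        # Flatten inline markup (<b>, <a>, ...) to plain text in this sketch.
        text = "".join(node.itertext()).strip()
        if node.tag == "h1":
            current = ET.SubElement(root, "section", {"id": make_id(text)})
            ET.SubElement(current, "title").text = text
        elif node.tag == "p":
            ET.SubElement(current, "para").text = text
    return root


doc = xhtml_to_docbook("<body><h1>My Section</h1><p>Some text</p></body>")
print(ET.tostring(doc, encoding="unicode"))
```

Even this toy version has to decide where a paragraph belongs when no heading precedes it, which hints at why a general mapping is harder than the example suggests.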

Two main problems present themselves. First, what should the tool do if the original document doesn't validate as XHTML? Second, the mapping from HTML to docbook will certainly be many-to-one. Is it possible to determine a general solution for this?

Open Source Solutions

For the first problem, I'm comfortable recommending Tidy to produce validating XHTML. It's open source and cross-platform. Furthermore, it's the original author's responsibility to ensure that their input is valid, and I feel that this task falls outside the scope of this sub-project.
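
A cleanup step with Tidy might be driven from a script along the following lines. This is only a sketch: it assumes the tidy executable is on the PATH, and the function names and file names are hypothetical. The options used (-quiet, -asxhtml, -numeric, -output) are standard Tidy command-line options.

```python
import subprocess


def tidy_command(html_path, xhtml_path):
    """Build an HTML Tidy command line that writes well-formed XHTML."""
    return [
        "tidy",
        "-quiet",      # suppress non-essential output
        "-asxhtml",    # convert the HTML input to well-formed XHTML
        "-numeric",    # emit numeric rather than named entities
        "-output", xhtml_path,
        html_path,
    ]


def run_tidy(html_path, xhtml_path):
    """Run Tidy; exit status 0 is clean, 1 means warnings, 2 means errors."""
    result = subprocess.run(tidy_command(html_path, xhtml_path))
    if result.returncode >= 2:
        raise RuntimeError(f"tidy could not repair {html_path}")
    return xhtml_path
```

Calling run_tidy("index.html", "index.xhtml") would then leave an XHTML file ready for the docbook conversion step. Warnings (exit status 1) are tolerated here, since most legacy HTML will trigger at least a few.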

For the second point, I have found the following resources for projects which have attempted to address this problem:

  1. http://www.eecs.umich.edu/~ppadala/projects/tidy/
  2. http://wiki.docbook.org/topic/Html2DocBook

The first of these seemed initially promising (and was proposed as a possible solution by Matias) but I was unable to make it compile. This makes me wonder whether this project is dead or not. I've sent an e-mail to the developer and I'm awaiting a response.

The second of these, an XSL stylesheet, seems the more natural solution. It's still not perfect and doesn't completely obviate the need for manual rechecking and retagging, but I feel that using it and adapting it for our own needs may be fruitful. I haven't tried hacking the stylesheet yet (this will be the next thing I try), but of the things I've found so far this seems the most promising.
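
Since xsltproc is already part of the pipeline, trying out the stylesheet could look roughly like this. The sketch assumes xsltproc is installed and that the stylesheet has been saved locally. The file name html2db.xsl and the function names are placeholders, not names taken from the Html2DocBook page.

```python
import subprocess


def xsltproc_command(stylesheet, xhtml_path, docbook_path):
    """Build an xsltproc command line: options, then stylesheet, then input."""
    return ["xsltproc", "--output", docbook_path, stylesheet, xhtml_path]


def convert(stylesheet, xhtml_path, docbook_path):
    """Apply the stylesheet to the XHTML file and write the docbook result."""
    # check=True raises CalledProcessError if xsltproc reports a failure.
    subprocess.run(
        xsltproc_command(stylesheet, xhtml_path, docbook_path), check=True
    )
    return docbook_path
```

For example, convert("html2db.xsl", "index.xhtml", "index.xml") would write the converted document to index.xml, which still needs to be checked and retagged by hand.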

Conclusion (so far)

With the (still limited) investigation I've done so far, I think the most natural way to convert one XML format to another is to use an XSL stylesheet. Short of developing one specifically for this project, it is best to use the one provided in solution 2, as this has been developed by someone who already has a lot of experience with docbook (I haven't been in touch with him yet). Further adapting it for boost's requirements may, I feel, be the most fruitful solution.


Active Developers


Glyn Matthews
glyn dot matthews at gmail dot com

