Trying out SVG and MathML parsing from Henri Sivonen on 2008-06-19 (public-html@w3.org from June 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 19 Jun 2008 14:35:39 +0300
To: "public-html@w3.org WG" <public-html@w3.org>
Message-Id: <3C2E021E-64C5-4963-A5E7-885F4122194D@iki.fi>

There was some discussion about SVG parsing on IRC today. Since I
happened to have something almost ready, I figured I'd put a build out
there before I head to Midsummer/St. John festivities (national
holiday; big deal over here).

Here's a parser build:
http://about.validator.nu/htmlparser/htmlparser-svg-demo.zip
(GPG sig: http://about.validator.nu/htmlparser/htmlparser-svg-demo.zip.sig )

It implements the MathML stuff (except for entities) and the SVG stuff
that used to be in the spec draft but was commented out soon after.

It comes with a sample program called HTML2XML and a shell script
(html2xml.sh) for running it easily. The script expects to see a
'java' executable for Java 5 or later in $PATH. The program parses a
file as HTML5 and converts it to XML.

If the program is run without arguments, it reads from stdin and
writes to stdout. If it is invoked with one argument, it reads from a
file specified as the argument and writes to stdout. If there are two
arguments, the first is the input file name and the second the output
file name.
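
As an illustration only, the argument convention described above can be
sketched in Python. This is not the HTML2XML source: the function name
resolve_io is hypothetical, None stands for stdin or stdout, and rejecting
more than two arguments is an assumption, since the message does not say
what the program does in that case.

```python
def resolve_io(args):
    """Map command-line arguments to (input_path, output_path).

    None means "use the standard stream": stdin for input, stdout for output.
    """
    if len(args) > 2:
        # Assumption: behavior for extra arguments is not described.
        raise ValueError("expected at most two arguments: [input [output]]")
    input_path = args[0] if len(args) >= 1 else None
    output_path = args[1] if len(args) == 2 else None
    return input_path, output_path


if __name__ == "__main__":
    import sys
    print(resolve_io(sys.argv[1:]))
```

With no arguments both streams default to the standard ones; with one
argument only the input comes from a file; with two, both are files.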

This is not a proper release but an unpolished demo version. Moreover,
the new code isn't properly tested at all at this point. I'm just
trying to get something that runs out there for interested parties to
try and break today.

Known bugs:
* When an element or attribute name is not an XML 1.0 + Namespaces
NCName, wrong things happen. The parser treats such names per the HTML5
spec, but the serializer applies the wrong kind of fixups when presented
with something that isn't an XML 1.0 infoset. (This is a regression from
previous versions of the parser.) So don't put colons in names, except
for the well-known xlink:- and xml:-prefixed names.
* The start tag handling for li, dt and dd is not in sync with the
latest spec text.
* The end tag handling for li, dt, dd and p is not in sync with the
latest spec text.
* The parser does not yet support MathML entities.
* The serializer outputs non-XML for astral characters, that is,
characters outside the Basic Multilingual Plane.
(https://issues.apache.org/jira/browse/XALANJ-2419)
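
To make the last item concrete, here is a small Python sketch of one way a
serializer can produce non-XML for an astral character. It assumes the
failure is of the kind where each UTF-16 surrogate is written as its own
numeric character reference; that is an assumption about the linked issue,
not something stated in this message. All function names are hypothetical.

```python
def to_utf16_surrogates(code_point):
    """Split an astral code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def well_formed_ncr(code_point):
    """One reference to the real code point: allowed in XML 1.0."""
    return "&#x%X;" % code_point


def surrogate_pair_ncrs(code_point):
    """One reference per surrogate: not well-formed, because surrogate
    code points are outside the XML 1.0 Char production."""
    high, low = to_utf16_surrogates(code_point)
    return "&#x%X;&#x%X;" % (high, low)


if __name__ == "__main__":
    script_a = 0x1D49C  # MATHEMATICAL SCRIPT CAPITAL A
    print(well_formed_ncr(script_a))
    print(surrogate_pair_ncrs(script_a))
```

Characters like this one matter for MathML in particular, since
mathematical alphanumeric symbols live outside the Basic Multilingual
Plane.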

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 19 June 2008 11:36:20 UTC