- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 19 Jun 2008 14:35:39 +0300
- To: "public-html@w3.org WG" <public-html@w3.org>
There was some discussion about SVG parsing on IRC today. Since I happened to have something almost ready, I figured I'd put a build out there before I head to Midsummer/St. John festivities (national holiday; big deal over here).

Here's a parser build:
http://about.validator.nu/htmlparser/htmlparser-svg-demo.zip
(GPG sig: http://about.validator.nu/htmlparser/htmlparser-svg-demo.zip.sig )

It implements the MathML stuff (except for entities) and the SVG stuff that used to be in the spec draft but was commented out soon after.

It comes with a sample program called HTML2XML and a shell script (html2xml.sh) for running it easily. The script expects to find a 'java' executable for Java 5 or later in $PATH.

The program parses a file as HTML5 and converts it to XML. If the program is run without arguments, it reads from stdin and writes to stdout. If it is invoked with one argument, it reads from the file named by the argument and writes to stdout. If there are two arguments, the first is the input file name and the second is the output file name.

This is not a proper release but an unpolished demo version. Moreover, the new code isn't properly tested at all at this point. I'm just trying to get something that runs out there for interested parties to try and break today.

Known bugs:

* When an element or attribute name is not an XML 1.0 + Namespaces NCName, wrong things happen. The parser treats them per the HTML5 spec, but the serializer does the wrong kind of fixups when presented with what isn't an XML 1.0 infoset. (This is a regression from previous versions of the parser.) So don't put colons in names except the well-known xlink:- and xml:-prefixed names.
* The start tag handling for li, dt and dd is not in sync with the latest spec text.
* The end tag handling for li, dt, dd and p is not in sync with the latest spec text.
* The parser does not support MathML entities yet.
* The serializer outputs non-XML for astral characters.
  (https://issues.apache.org/jira/browse/XALANJ-2419 )

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Thursday, 19 June 2008 11:36:20 UTC