
Trying out SVG and MathML parsing

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 19 Jun 2008 14:35:39 +0300
Message-Id: <3C2E021E-64C5-4963-A5E7-885F4122194D@iki.fi>
To: "public-html@w3.org WG" <public-html@w3.org>

There was some discussion about SVG parsing on IRC today. Since I  
happened to have something almost ready, I figured I'd put a build out  
there before I head to Midsummer/St. John festivities (national  
holiday; big deal over here).

Here's a parser build:
http://about.validator.nu/htmlparser/htmlparser-svg-demo.zip
(GPG sig: http://about.validator.nu/htmlparser/htmlparser-svg-demo.zip.sig)

It implements the MathML stuff (except for entities) and the SVG stuff  
that used to be in the spec draft but was commented out soon after.

It comes with a sample program called HTML2XML and a shell script  
(html2xml.sh) for running it easily. The script expects to see a  
'java' executable for Java 5 or later in $PATH. The program parses a  
file as HTML5 and converts it to XML.

If the program is run without arguments, it reads from stdin and  
writes to stdout. If it is invoked with one argument, it reads from a  
file specified as the argument and writes to stdout. If there are two  
arguments, the first is the input file name and the second the output  
file name.
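
The argument handling described above can be sketched roughly as follows. This is only an illustration, not the HTML2XML source from the zip; the class name ArgDispatchSketch and its methods are made up for this example, and the actual parsing and serialization step is left out.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration of the 0/1/2-argument convention;
// not the HTML2XML source shipped in the demo zip.
public class ArgDispatchSketch {

    // Summarizes where input comes from and where output goes.
    static String describe(String[] args) {
        switch (args.length) {
            case 0:
                return "stdin -> stdout";
            case 1:
                return args[0] + " -> stdout";
            case 2:
                return args[0] + " -> " + args[1];
            default:
                throw new IllegalArgumentException("expected at most two arguments");
        }
    }

    // First argument, if present, names the input file; otherwise read stdin.
    static InputStream openInput(String[] args) throws IOException {
        if (args.length >= 1) {
            return new FileInputStream(args[0]);
        }
        return System.in;
    }

    // Second argument, if present, names the output file; otherwise write stdout.
    static OutputStream openOutput(String[] args) throws IOException {
        if (args.length >= 2) {
            return new FileOutputStream(args[1]);
        }
        return System.out;
    }

    public static void main(String[] args) {
        // Diagnostic goes to stderr so that stdout stays free for the converted XML.
        System.err.println(describe(args));
    }
}
```

With the shell script, the three cases correspond to running html2xml.sh with no arguments, with an input file name, or with an input and an output file name.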

This is not a proper release but an unpolished demo version. Moreover,  
the new code isn't properly tested at all at this point. I'm just  
trying to get something that runs out there for interested parties to  
try and break today.

Known bugs:
  * When an element or attribute name is not an XML 1.0 + Namespaces  
NCName, wrong things happen. The parser treats such names per the HTML5  
spec, but the serializer applies the wrong kind of fixups when presented  
with something that isn't an XML 1.0 infoset. (This is a regression from  
previous versions of the parser.) So don't put colons in names, except  
for the well-known xlink:- and xml:-prefixed names.
  * The start tag handling for li, dt and dd is not in sync with the  
latest spec text.
  * The end tag handling for li, dt, dd and p is not in sync with the  
latest spec text.
  * The parser does not support MathML entities yet.
  * The serializer outputs non-XML for astral characters. (https://issues.apache.org/jira/browse/XALANJ-2419)
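
For anyone generating test input, the colon advice in the first known bug can be turned into a quick pre-check along these lines. This is a sketch only: the class and method names are hypothetical, nothing like it ships in the build, and it looks only at colons rather than implementing the full NCName production.

```java
// Hypothetical helper for screening test input; not part of the parser build.
public class ColonNameCheck {

    // Returns true when the name has no colon at all, or when it has exactly
    // one colon and the prefix is one of the two well-known ones (xlink, xml).
    // It does not validate the remaining characters against the NCName grammar.
    static boolean isSafeForDemoSerializer(String name) {
        int colon = name.indexOf(':');
        if (colon < 0) {
            return true;
        }
        String prefix = name.substring(0, colon);
        String local = name.substring(colon + 1);
        boolean knownPrefix = prefix.equals("xlink") || prefix.equals("xml");
        return knownPrefix && local.length() > 0 && local.indexOf(':') < 0;
    }

    public static void main(String[] args) {
        // Print a verdict for each name given on the command line.
        for (int i = 0; i < args.length; i++) {
            System.out.println(args[i] + ": " + isSafeForDemoSerializer(args[i]));
        }
    }
}
```

Names that fail this check are the ones likely to trip the serializer fixup bug described above.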

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Thursday, 19 June 2008 11:36:20 UTC
