Re: SVG in HTML proposal from Andrew Sidwell on 2008-07-14 (public-html@w3.org from July 2008)

From: Andrew Sidwell <w3c@andrewsidwell.co.uk>
Date: Mon, 14 Jul 2008 15:39:09 +0100
To: Erik Dahlström <ed@opera.com>
CC: public-html@w3.org, "www-svg@w3.org" <www-svg@w3.org>
Message-ID: <487B650D.6080803@andrewsidwell.co.uk>

Hello,

Erik Dahlström wrote:
> Hello HTML WG,
> 
> The SVG WG is happy to announce the first draft proposal for how to handle
> SVG in HTML (see attachment).

I've recently been writing an HTML5 parsing library in C (hubbub)[1] and 
have implemented MathML and SVG as written in that spec (with SVG 
handling as it is in commented-out portions of that spec).  Having read 
this proposal in full, I have a number of technical comments:


1. Making the tokeniser case-preserving doesn't help.

You go to quite a bit of effort to allow the tokeniser to preserve case
and to have the treebuilder lowercase HTML elements when inserted, I
assume so that authors can't write '<SVG xmlNS="...">' and have it work.
However, given that the tokeniser still accepts '<svg
xmlns=http://...>' and the like, it seems like a fairly pointless
change.  If everything from the first angle bracket gets passed to an
XML processor, then neither '<SVG xmlNS="">' nor '<svg xmlns=http://...'
will work anyway, since the XML processor will either misnamespace the
element or choke on the input.
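
To make the two failure modes concrete, here is a minimal sketch.  It
assumes Python's standard-library ElementTree (expat underneath) as a
stand-in for "an XML processor"; the proposal does not name one, and
try_parse is a hypothetical helper of mine.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def try_parse(markup):
    """Return the root's expanded name, or None if the input is not well-formed."""
    try:
        return ET.fromstring(markup).tag
    except ET.ParseError:
        return None

# Correctly written: the root ends up in the SVG namespace.
well_formed = try_parse('<svg xmlns="%s"/>' % SVG_NS)

# Odd casing: 'xmlNS' is just an ordinary attribute, so no namespace is declared.
odd_case = try_parse('<SVG xmlNS="%s"/>' % SVG_NS)

# Unquoted attribute value: not well-formed XML, so the parser rejects it.
unquoted = try_parse('<svg xmlns=%s></svg>' % SVG_NS)
```

Only the first input yields an element in the SVG namespace; the second
is misnamespaced and the third is rejected outright, regardless of what
the HTML tokeniser did with case beforehand.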

It's not that I believe the tokeniser should not be case-preserving; I 
just think that if your motivation is just to make weirdly-cased tags 
not trigger XML parsing, then that's not a useful route to pursue.


2. Requiring "A start tag whose case-sensitive tag name is "*:svg" that 
has a case-sensitive attribute "xmlns:*" with the value 
"http://www.w3.org/2000/svg", where '*' can be any string as long as 
it's the same in both the tagname and the xmlns attributename:" is bad; 
it adds too much complexity for little gain.

Neither Hubbub nor the Java parser behind Validator.nu does string
comparisons when dealing with lists of elements.  Instead, they hash the
element name once and then just compare hashes from then on.  (This is
obviously a massive performance gain.)  The requirement above hurts this
by forcing a string comparison on the name, and then in certain cases
forcing one to look through all the attributes of an element and perform
string comparisons on their names and values too.
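
A rough sketch of the difference, in Python for brevity.  The interning
table and all function names here are hypothetical illustrations, not
Hubbub's or Validator.nu's actual code; attributes are modelled as a
list of (name, value) pairs.

```python
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical interning table: each known element name gets a small integer.
ELEMENT_IDS = {"html": 0, "body": 1, "table": 2, "svg": 3}

def intern_name(name):
    """Look the name up once; every later check compares integers only."""
    return ELEMENT_IDS.get(name, -1)

def is_svg_simple(name_id):
    """The '<svg xmlns=...>' case: one integer comparison on the tag name."""
    return name_id == ELEMENT_IDS["svg"]

def is_svg_prefixed(tag_name, attrs):
    """The proposed '*:svg' + 'xmlns:*' rule: string work on name and attributes."""
    prefix, sep, local = tag_name.rpartition(":")
    if not sep or not prefix or local != "svg":
        return False
    wanted = "xmlns:" + prefix
    # Every attribute has to be inspected, comparing both name and value.
    for name, value in attrs:
        if name == wanted and value == SVG_NS:
            return True
    return False
```

The first check costs the same for every start tag; the second has to
split the tag name and then walk the attribute list for any tag name
that ends in ":svg".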

The spec to date has gone to some effort to avoid making implementations
search through attributes, because it is slow.  As far as I can
remember, there is only one place where attributes are checked in the
treebuilder, and that is <input type="hidden"> in the "in table" phase.

I understand you want to be compatible with existing SVG content, but 
this is a place where you shouldn't be.  <svg xmlns="..."> is quite enough.


3. There are various problems with the text of the algorithm for parsing 
XML fragments.

The lines:
"Save the tokeniser content model flag to old-state."
"Reset the tokeniser content model flag to the old-state."

are superfluous.  At no point in the course of parsing XML fragments is 
the tokeniser content model changed, so this text serves no purpose.

I was under the impression that an off-the-shelf XML processor should be 
able to be used to parse SVG-in-text/html.  If this is the case, the 
requirement "For each element that is successfully parsed, the XML 
parser must insert a foreign element." should probably be changed to 
"For each element the XML parser parses, insert a foreign element with 
the namespace, name, and attributes of that element", or the like, to 
avoid mandating that the XML parser must have behaviour that is not 
specified in the XML spec.  In general, I think the algorithm should 
specify what to do with things that the XML parser parses and not that 
e.g. the XML parser must do something.

Handling is not specified for what happens if an XML parser parses 
characters or processing instructions, and nothing is said about empty 
tags (basically that they should insert a new element and then pop that 
element off the stack).
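
As a sketch of what "specify what to do with things that the XML parser
parses" could look like, the following drives a toy tree builder from
the events of an off-the-shelf parser.  Python's xml.parsers.expat is
assumed here purely for illustration; Node and build_foreign_subtree are
hypothetical names, and the handling shown for characters and processing
instructions is one plausible choice, not something the proposal states.

```python
import xml.parsers.expat

class Node:
    def __init__(self, kind, name=None, attrs=None, data=None):
        self.kind = kind
        self.name = name
        self.attrs = attrs or {}
        self.data = data
        self.children = []

def build_foreign_subtree(xml_text):
    """Build a small tree from XML parser events (hypothetical glue code)."""
    root = Node("fragment")
    stack = [root]  # stack of open elements

    def start(name, attrs):
        # Insert a foreign element from what the parser reported, then push it.
        node = Node("element", name=name, attrs=attrs)
        stack[-1].children.append(node)
        stack.append(node)

    def end(name):
        # Fires for empty-element tags too, immediately after start().
        stack.pop()

    def chars(data):
        stack[-1].children.append(Node("text", data=data))

    def pi(target, data):
        stack[-1].children.append(Node("pi", name=target, data=data))

    # With a namespace separator, names arrive as "namespace-URI local-name".
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.buffer_text = True  # coalesce adjacent character data
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.ProcessingInstructionHandler = pi
    parser.Parse(xml_text, True)
    return root
```

Nothing here requires the XML parser itself to know about foreign
elements; all the insertion logic lives in the handlers.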

The sentence "Feed the XML parser the string corresponding to the start
tag of the element along with all its attributes." is unclear.  I
believe the intention is closer to: "Feed the XML parser the string
starting with the character that triggered entry into the 'tag open'
state and ending with the character that triggered emission of the
start tag token."
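
In other words, the parser gets a slice of the original source, not a
reserialisation of the token.  A much-simplified sketch, assuming the
input is available as a string; raw_start_tag is a hypothetical helper,
and its quote handling is far cruder than the real tokeniser's states:

```python
def raw_start_tag(source, lt_index):
    """Return the source text from the '<' at lt_index through the '>' ending the tag."""
    assert source[lt_index] == "<"
    quote = None
    for i in range(lt_index + 1, len(source)):
        ch = source[i]
        if quote:
            # Inside a quoted attribute value: only the matching quote ends it.
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            # This is the character that would trigger emitting the start tag.
            return source[lt_index:i + 1]
    return None  # input ended before the tag was complete
```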




My non-technical comments:

I think that to implement what the SVG WG proposes to a decent level of
performance will require building a new XML parser into the HTML5
parser.  Feeding XML to an XML processor through an API one byte at a
time will slow things down a lot.
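
For illustration, this is what incremental feeding looks like with one
off-the-shelf parser (Python's xml.parsers.expat, my choice of example;
collect_events is a hypothetical helper).  The sketch makes no timing
claims; it only shows that the same events come out either way, so the
cost at issue is one API call per byte rather than one per buffer.

```python
import xml.parsers.expat

def collect_events(chunks):
    """Feed the chunks to a fresh parser one call at a time and record the events."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    for chunk in chunks:
        parser.Parse(chunk, False)
    parser.Parse("", True)  # signal end of input
    return events

doc = "<svg><g><rect/></g></svg>"
whole = collect_events([doc])          # a single call with the whole buffer
per_char = collect_events(list(doc))   # one call per character
```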

I much prefer the HTML5 model over having to incorporate an XML parser 
as the SVG WG suggests, since XML fragments in text/html are 
underspecified, and will be until XML parsing is specified to the level 
of HTML5 somewhere.  Even when it is, I don't think there is a place for 
draconian error handling in text/html; it goes against the very grain of 
the language.

That said, I don't think the HTML5 model is perfect.  I tend towards 
believing that the tokeniser should have a case-preserving flag, which 
is flipped when entering "in foreign content", since that saves the 
headache of going over all element and attribute names and 
case-correcting them.  I understand the SVG WG's concerns, but I don't 
think that using an XML parser is the answer.  I think it would be much 
more productive if, when HTML5 parsing starts to be implemented in 
browsers, people make sure that those browsers allow export of 
well-formed XML versions of any foreign content included in them.
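
To illustrate the case-preserving flag mentioned above against the
case-correcting alternative, a small sketch.  Both functions are
hypothetical, and the fix-up table is a three-entry excerpt using real
SVG element names, not a complete list.

```python
# Excerpt of the kind of table needed if the tokeniser always lowercases.
SVG_CASE_FIXUPS = {
    "foreignobject": "foreignObject",
    "lineargradient": "linearGradient",
    "clippath": "clipPath",
}

def tag_name_with_flag(raw_name, in_foreign_content):
    """Flag approach: the tokeniser keeps the author's case while the flag is set."""
    return raw_name if in_foreign_content else raw_name.lower()

def tag_name_with_fixup(raw_name, in_foreign_content):
    """Fix-up approach: always lowercase, then case-correct the known names."""
    lowered = raw_name.lower()
    if in_foreign_content:
        return SVG_CASE_FIXUPS.get(lowered, lowered)
    return lowered
```

With the flag there is no table to maintain; with the fix-up, any name
missing from the table silently stays lowercase.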


Cheers,
a.

[1] http://www.netsurf-browser.org/projects/hubbub/
Received on Monday, 14 July 2008 14:39:55 UTC