- From: Andrew Sidwell <w3c@andrewsidwell.co.uk>
- Date: Mon, 14 Jul 2008 15:39:09 +0100
- To: Erik Dahlström <ed@opera.com>
- CC: public-html@w3.org, "www-svg@w3.org" <www-svg@w3.org>
Hello,

Erik Dahlström wrote:
> Hello HTML WG,
>
> The SVG WG is happy to announce the first draft proposal for how to handle
> SVG in HTML (see attachment).

I've recently been writing an HTML5 parsing library in C (hubbub)[1] and have implemented MathML and SVG as written in that spec (with SVG handling as it is in commented-out portions of that spec). Having read this proposal in full, I have a number of technical comments:

1. Making the tokeniser case-preserving doesn't help.

You go to quite a bit of effort to allow the tokeniser to preserve case and to have the treebuilder lowercase HTML elements when they are inserted, I assume so that authors can't write '<SVG xmlNS="...">' and have it work. However, given that the tokeniser would still accept "<svg xmlns=http://...>" and the like, it seems like a fairly pointless change. If everything from the first angle bracket gets passed to an XML processor, then '<SVG xmlNS="">' and "<svg xmlns=http://..." won't work anyway, since the XML processor will either misnamespace or choke.

It's not that I believe the tokeniser should not be case-preserving; I just think that if your motivation is just to make weirdly-cased tags not trigger XML parsing, then that's not a useful route to pursue.

2. Requiring

  "A start tag whose case-sensitive tag name is "*:svg" that has a case-sensitive attribute "xmlns:*" with the value "http://www.w3.org/2000/svg", where '*' can be any string as long as it's the same in both the tagname and the xmlns attributename:"

is bad; it adds too much complexity for little gain.

Neither Hubbub nor the Java parser behind Validator.nu does string comparisons when dealing with lists of elements. Instead, they hash the element name and then just compare hashes from then on. (This is obviously a massive performance gain.)
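To make the cost concrete, here is a minimal sketch in C of the hash-then-compare approach and of what a prefixed name does to it. It is not taken from Hubbub or Validator.nu: the hash function (32-bit FNV-1a), the `tag_name` struct and every function name are assumptions chosen for illustration, and collision handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical token name: hashed once, when the tokeniser emits the tag. */
typedef struct {
    const char *name;
    size_t len;
    uint32_t hash;
} tag_name;

/* 32-bit FNV-1a, picked only because it is short; any hash would do. */
static uint32_t fnv1a(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

static tag_name make_tag_name(const char *s)
{
    assert(s != NULL);
    tag_name t = { s, strlen(s), 0 };
    t.hash = fnv1a(t.name, t.len);
    return t;
}

/* Treebuilder-side check: a single integer comparison against a
 * precomputed hash. (A real parser would also have to rule out
 * collisions; this sketch does not.) */
static int tag_is(const tag_name *t, uint32_t known_hash)
{
    return t->hash == known_hash;
}

/* The "*:svg" rule cannot be answered from the hash alone, because the
 * prefix is arbitrary: the name has to be scanned for ':' and the local
 * part compared as a string. The matching "xmlns:*" attribute lookup,
 * which needs a scan over all attributes, is not shown. */
static int is_prefixed_svg(const tag_name *t)
{
    const char *colon = memchr(t->name, ':', t->len);
    if (colon == NULL)
        return 0;
    size_t local_len = t->len - (size_t)(colon + 1 - t->name);
    return local_len == 3 && memcmp(colon + 1, "svg", 3) == 0;
}
```

With a plain "svg" start tag the treebuilder only ever calls something like `tag_is()`; with the prefixed form it has to fall back to `is_prefixed_svg()` plus an attribute scan on start tags.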
The requirement above hurts this by forcing a string comparison on the name, and then in certain cases forcing one to look through all the attributes of an element and perform string comparisons on their names and values too. The spec to date has gone to effort to avoid making implementations search through attributes, because it is slow. As far as I can remember, there is one place that attributes are checked in the treebuilder, and that is <input type="hidden"> in the "in table" phase.

I understand you want to be compatible with existing SVG content, but this is a place where you shouldn't be. <svg xmlns="..."> is quite enough.

3. There are various problems with the text of the algorithm for parsing XML fragments.

The lines:

  "Save the tokeniser content model flag to old-state."
  "Reset the tokeniser content model flag to the old-state."

are superfluous. At no point in the course of parsing XML fragments is the tokeniser content model changed, so this text serves no purpose.

I was under the impression that an off-the-shelf XML processor should be able to be used to parse SVG-in-text/html. If this is the case, the requirement

  "For each element that is successfully parsed, the XML parser must insert a foreign element."

should probably be changed to

  "For each element the XML parser parses, insert a foreign element with the namespace, name, and attributes of that element",

or the like, to avoid mandating that the XML parser must have behaviour that is not specified in the XML spec. In general, I think the algorithm should specify what to do with things that the XML parser parses and not that e.g. the XML parser must do something.

Handling is not specified for what happens if an XML parser parses characters or processing instructions, and nothing is said about empty tags (basically that they should insert a new element and then pop that element off the stack).
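As a rough sketch of what "specify what to do with things that the XML parser parses" could look like, the C below maps parser events onto treebuilder actions. The event types, the `treebuilder` struct and all function names are hypothetical and belong to no real XML library or to Hubbub. Character data is only counted here, since the proposal does not say how to handle it, and processing instructions are left out for the same reason.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical events reported by an off-the-shelf XML parser. */
typedef enum {
    XML_EV_START,  /* <circle ...>   */
    XML_EV_EMPTY,  /* <circle .../>  */
    XML_EV_END,    /* </circle>      */
    XML_EV_CHARS   /* character data */
} xml_event_type;

typedef struct {
    xml_event_type type;
    const char *ns;    /* namespace of the element, if any  */
    const char *name;  /* local name of the element, if any */
    const char *text;  /* character data, if any            */
} xml_event;

#define MAX_DEPTH 64

/* Minimal stand-in for the treebuilder's stack of open elements. */
typedef struct {
    const char *open[MAX_DEPTH];
    size_t depth;
    size_t inserted;    /* number of foreign elements inserted */
    size_t text_nodes;  /* number of character events seen     */
} treebuilder;

static void insert_foreign_element(treebuilder *tb, const xml_event *ev)
{
    assert(tb->depth < MAX_DEPTH);
    tb->open[tb->depth++] = ev->name;
    tb->inserted++;
}

static void pop_current(treebuilder *tb)
{
    assert(tb->depth > 0);
    tb->depth--;
}

/* The treebuilder reacts to what the parser reports; the parser itself
 * is not required to do anything beyond what the XML spec says. */
static void handle_xml_event(treebuilder *tb, const xml_event *ev)
{
    switch (ev->type) {
    case XML_EV_START:
        insert_foreign_element(tb, ev);
        break;
    case XML_EV_EMPTY:
        /* Empty tag: insert the element, then pop it straight away. */
        insert_foreign_element(tb, ev);
        pop_current(tb);
        break;
    case XML_EV_END:
        pop_current(tb);
        break;
    case XML_EV_CHARS:
        tb->text_nodes++;  /* placeholder; handling is unspecified */
        break;
    }
}
```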
The sentence

  "Feed the XML parser the string corresponding to the start tag of the element along with all its attributes."

is unclear. I believe the intention is closer to:

  "Feed the XML parser the string starting with the character that triggered entry into the 'tag open' state and ending with the character that triggered emission of the start tag token."

My non-technical comments:

I think that to implement what the SVG WG proposes to a decent level of performance will require building a new XML parser into the HTML5 parser. Feeding XML to an XML processor through an API one byte at a time will slow things down a lot.

I much prefer the HTML5 model over having to incorporate an XML parser as the SVG WG suggests, since XML fragments in text/html are underspecified, and will be until XML parsing is specified to the level of HTML5 somewhere. Even when it is, I don't think there is a place for draconian error handling in text/html; it goes against the very grain of the language.

That said, I don't think the HTML5 model is perfect. I tend towards believing that the tokeniser should have a case-preserving flag, which is flipped when entering "in foreign content", since that saves the headache of going over all element and attribute names and case-correcting them.

I understand the SVG WG's concerns, but I don't think that using an XML parser is the answer. I think it would be much more productive if, when HTML5 parsing starts to be implemented in browsers, people make sure that those browsers allow export of well-formed XML versions of any foreign content included in them.

Cheers,
a.

[1] http://www.netsurf-browser.org/projects/hubbub/
Received on Monday, 14 July 2008 14:39:55 UTC