- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Mon, 21 Jul 2008 16:28:34 +0300
- To: HTML WG <public-html@w3.org>, www-svg <www-svg@w3.org>
I'm glad to see that the SVG WG has looked into the SVG-in-text/html issue. My comments are inline with quotes from CVS revision 1.14 of the proposal.

Summary: I think putting an XML parser into the text/html parsing process is a bad idea. I disagree with some of the requirements stated in the proposal. I suggest proceeding from the prior proposal that is commented out in the HTML5 draft.

Quoting from:
http://dev.w3.org/cvsweb/SVG/proposals/svg-html/svg-html-proposal.html?rev=1.14&content-type=text/html

> SVG is an XML language by design, and therefore has certain
> abilities and constraints at the syntactic level that are dissimilar
> to those of text/html (but consistent with XHTML); therefore, to
> work correctly, with the full range of features, SVG must follow the
> syntactic rules with which it was designed, in both HTML and XHTML.

I disagree with this characterization. SVG allowing scripting fundamentally makes it a DOM language. An XML serialization is one way to initialize the DOM tree, but the vocabulary-specific user agent code that implements SVG reads from the DOM and never sees the XML source.

> This consistency will aid developers, who will not need to learn two
> separate sets of rules for the syntax and feature sets.

The proposal is suggesting that developers learn two sets of syntax rules for working *within* text/html.

> It will maintain compatibility with existing SVG viewers,

SVG-in-text/html is a new feature. Syntactic "compatibility" with existing SVG viewers is a red herring whenever:

* The existing viewer rejects text/html upon HTTP Content-Type.
* The existing viewer throws an error trying to parse the HTML wrapper before it even gets to an SVG subtree.
* The content depends on both HTML and SVG parts getting rendered and the existing viewer only supports SVG.
* The existing viewer is unusably buggy anyway (something to keep in mind when reading about mobile SVG UA statistics).
> and continue to permit round-tripping in SVG authoring tools such as
> Inkscape, Adobe Illustrator, and CorelDraw, which rely on the XML
> format.

How do those tools behave if the SVG content is wrapped in some HTML? You'd need to extract the SVG part and put it in a standalone SVG file, right? The most sensible and robust way of doing that is reserializing an SVG subtree as XML in a browser (like Firefox offers to show a serialization of a MathML subtree today). When round-tripping is achieved this way, it is unnecessary to try to limit what goes over the wire to look like XML.

The SVG-in-text/html proposal that is in the comments of the HTML5 spec already makes this possible for SVG features that are defined in SVG 1.1 Full itself. It currently doesn't properly round-trip product-specific extensions to SVG that Inkscape and Illustrator put into their output and that browsers aren't expected to act on.

Aside about naming Illustrator specifically: Earlier in this thread <http://lists.w3.org/Archives/Public/public-html/2008Jul/0184.html>, Jeff Schiller implied that Adobe Illustrator's SVG export isn't under active development. I'm not sure if that is the case. However, considering the progress of the Web platform, I think it is a bad idea to allow a frozen authoring tool to dictate where the platform can go. If a given authoring product is frozen and the frozenness results in inconvenience either for the users of that product or for everyone, I think we should inconvenience the users of that product by requiring them to use a sanitizer that turns stuff from out there on the Web into a form that is safe for the frozen product to read, instead of trying to freeze the Web to fit that product in case someone happens to want to use the frozen product to edit arbitrary content from the Web.
To keep things in perspective, consider how things work in the output direction from Illustrator: If you take SVG output from Illustrator, you can't just paste the source text in the middle of an XHTML document and have it work, because Illustrator does bad things with DTDs and you can't put a doctype in the middle of an XHTML document without making it ill-formed.

> This document is a proposal for integrating SVG in both the text and
> XML serializations of HTML5. This proposal follows the model that
> works today in every major browser that supports XHTML (Firefox,
> Opera, and Safari), all of which also support SVG natively.

In those browsers, the model is that the SVG renderer operates on the DOM, and you can initialize the DOM from XML using an XML parser. To cast the same model into text/html, an HTML parser would build a DOM with SVG nodes in it. What this proposal does is more complex as it seeks to involve both an HTML parser and an XML parser in the process of building one DOM tree.

> It also works to a lesser extent in Internet Explorer, with the use
> of an SVG plugin (and a small bit of extra code).

I'd be interested in an elaborated explanation on this point. Do existing SVG plug-ins for IE get something they can reparse using an XML parser (according to this proposal) from the enclosing Trident instance? What's the parser interface between Trident and the existing SVG plug-ins like? What's the "small bit of extra code" like?

> The SVG WG believes that this satisfies the spirit and the letter of
> the HTML5 Design Principles, particularly in the aspects of
> compatibility.

I think this proposal fails to satisfy the HTML Design Principles in the following ways:

* Solve Real Problems: The main motivation behind introducing a counter-proposal to what was briefly in the HTML5 draft and was commented out seems to have been the desire to make text/html source code copyable on the source text level to a standalone SVG file that can be read by SVG tools.
Other than the use case of extracting content from the Web into a legacy SVG editor, making source text look like XML isn't a Real Problem. Taking an SVG image from the Web and loading it into a legacy SVG editor can be solved in a simpler and more robust way by providing a browser feature for showing an XML serialization of an SVG DOM subtree instead of trying to make the source text copyable and pasteable.
* Priority of Constituencies: This proposal places theoretical purity (XML-lookingness) over implementors (implementation ease, code footprint, performance) and authors (authors aren't helped by the complexity of arbitrary prefixes).
* Well-defined Behavior: This proposal doesn't define what happens when document.write() is called from a script element inside an SVG subtree. (document.write() writes UTF-16 strings, but this proposal seems to make the XML parser work from the byte layer.)
* Avoid Needless Complexity: Adding the XML parser inside an HTML parser is gratuitous complexity when implementation experience from the proposal that is commented out in the HTML5 draft shows that SVG can be integrated in text/html with fairly simple amendments to the HTML5 parsing algorithm.

> The SVG WG proposes to change the HTML5 specification so that SVG
> fragments are parsed by an XML parser.

As noted above, I disagree with the introduction of an XML parser into the mix. To get a performant result, one can't just use an off-the-shelf XML parser. In practice, this proposal would add developing an integrated XML parser to the burden of HTML5 parser implementors. It looks like a lot of pain for no gain for parser developers--regardless of the purpose of the parser (browser, conformance checker, non-browser app, JavaScript compatibility library for existing browsers). I don't see why the effort would be a good use of any developer's time. Furthermore, the code footprint wouldn't be nice in the case of a JS library.
(See the last paragraph of http://blog.whatwg.org/html5-live-dom-viewer for how close to reality the JS scenario already is.)

> A requirement for namespaces in XML for the SVG fragments
> [xml-namespaces] is also added.

What's the rationale for this requirement? SVG-in-text/html seems like a great opportunity to shield authors from the complexity of Namespaces on the serialization level even though it's too late to get rid of namespaces on the DOM level.

> If an SVG fragment is not XML well-formed, the fragment will be
> repaired by closing all elements up to and including the element
> where XML parsing began, and then control is handed over to the HTML
> parser. The point where the HTML parsing resumes is the character
> that triggered an XML error, or the character that follows the
> closing tag for the element where XML parsing began if there was no
> error.[foreign-elements].

If the XML error is within a tag (say, lack of space between attributes), having the HTML parser resume from inside a tag seems highly ungraceful.

> One problem with mixing HTML and SVG is that some elements and
> attributes have the same (case-insensitive) names. This proposal
> suggests that such clashes are handled by recommending authors to
> use prefixing inside of SVG fragments to avoid any problems with
> legacy user agents [name-collisions].

Prefixes are ugly, and (anecdotally) novice XML authors are confused by the freedom to choose the prefix and by the layer of indirection from the prefix to a namespace URI. I think the approach taken in the proposal that is commented out in the HTML5 draft is much better: Using the tree builder context to decide what namespace the homographs get assigned to.

> Going forward, HTML5 and SVG should strive to not introduce any new
> name-collisions.

I strongly agree. (Aside: SVG could also strive not to introduce any more names with capital letters in them.)
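
To illustrate the tree-builder-context idea for homographs mentioned above, here is a minimal sketch in Python. It is not the algorithm from the HTML5 draft: the token format, the `assign_namespaces` function, and the rule "everything below an open svg element is SVG" are simplifying assumptions made for illustration, and break-out rules, foreignObject, and MathML are ignored.

```python
HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"


def assign_namespaces(tokens):
    """Assign a namespace to each start tag from the open-element context.

    `tokens` is a hypothetical, simplified token stream: a list of
    ("start", name) and ("end", name) pairs. No prefixes or xmlns
    attributes are consulted; only the stack of open elements matters.
    """
    stack = []    # open elements as (namespace, name)
    result = []   # (namespace, name) for every start tag, in order
    for kind, raw_name in tokens:
        name = raw_name.lower()
        if kind == "start":
            current_is_svg = bool(stack) and stack[-1][0] == SVG_NS
            ns = SVG_NS if (name == "svg" or current_is_svg) else HTML_NS
            stack.append((ns, name))
            result.append((ns, name))
        else:
            # Pop up to and including the nearest matching open element;
            # an end tag with no matching open element is ignored.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][1] == name:
                    del stack[i:]
                    break
    return result


tokens = [
    ("start", "title"), ("end", "title"),   # HTML <title>
    ("start", "svg"),
    ("start", "title"), ("end", "title"),   # SVG <title>, same spelling
    ("end", "svg"),
    ("start", "a"),                          # back in HTML
]
namespaces = assign_namespaces(tokens)
```

In this sketch the author writes plain `<title>` in both places, and the same name ends up in the HTML namespace outside the `svg` element and in the SVG namespace inside it.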
> By using an XML parser the following important requirements are met:
>
> attribute values in SVG fragments are guaranteed to always be in
> quotes

I disagree with this requirement. If this requirement were really important, how could HTML have succeeded while allowing unquoted attributes since the dawn of the Web?

> attribute- and element-names are case-sensitive in SVG fragments

I think it is reasonable to require SVG names to use their canonical case in the DOM. However, it doesn't follow that the names had to use the canonical case in the serialization. In fact, implementation experience with the proposal that is commented out in the HTML5 draft shows that case-insensitivity in the parser, with fixups before the names reach the tree, can be implemented with zero perf cost in the case of element names and with the cost of introducing one layer of array access indirection in the attribute case if the cost is paid instead in static memory footprint.

Furthermore, since text/html is traditionally case-insensitive, making SVG parts case-sensitive in the serialization would make text/html inconsistent with itself. And yet, making it case-sensitive only makes the format less robust as more cases fail (like <SVG>). (In general, this proposal seems to fail more eagerly than the proposal commented out in the HTML5 draft. I don't see the value of eager failure.)

> custom namespaced data in SVG fragments is made available in the
> DOM in the correct namespace(s)

This requirement is indeed interesting if one wants to round-trip product-specific editor state. However, it seems silly to pay a performance and complexity tax for things that a browser is expected to ignore.

> moving to and from XHTML+SVG is made easier, since the syntax is
> more or less the same

The XHTML/HTML part isn't more or less the same for practical purposes, which makes the sameness of the SVG parts a lot less interesting. I think moving between syntaxes is better addressed with reserializers.
Trying to shield people from the awareness of differences (i.e. "lies to children") doesn't seem like a good way, since people always tend to hit the bounds of the polite fiction anyway.

> Requirements
> SVG should remain XML when inline in HTML.

I disagree with this requirement. I think requirements should address real problems such as use cases or implementability issues. This is just an arbitrary requirement to use a particular syntax without a rationale.

> Should be able to take a conforming SVG document and paste its
> contents into an HTML document and have it be the same DOM. (That
> is, it should be possible for authors to create an SVG document in
> Inkscape, take the contents of the file, and include it directly in
> the HTML without having to munge its syntax to get it to work.) This
> includes script content.

I agree with this requirement when it comes to markup features that
1) are specified in SVG 1.1 (i.e. excluding product-specific cruft) and
2) would be equally pasteable into XHTML (i.e. excluding Illustrator's doctype cruft) and
3) are used by default by the most popular SVG editors (i.e. excluding prefixed SVG elements since the popular tools by default use unprefixed element names).

> Should be able to take a conforming HTML document and copy the
> SVG fragment from it and paste it into a new file and that would be
> a conforming SVG document. (That is, it should be possible for
> authors to, when they come across an SVG-in-HTML fragment, copy and
> paste that source and open it up in Inkscape to edit.)

I agree with the use case on a general level. However, I don't agree that copyability needs to be able to happen on the source level. I think reserializing the DOM fragment is an acceptable intermediate step.

> Should be able to provide some sort of fallback mechanism for the
> SVG-in-HTML so that UAs that dont know how to handle these SVG
> fragments will display the fallback.
I agree that this is desirable--but I think it doesn't need to be able to go further than to enable fallback to an HTML <img> element. I wouldn't consider this a hard requirement, since it seems unlikely that authors would want to deal with the burden of producing both an SVG image *and* fallback.

> Should allow for unrestricted growth of the SVG language by the
> SVG specifications (though those specifications should also take
> into account the idea that SVG will, going forward, be used more
> commonly in concert with HTML). This means that there would be no
> "white list" of allowed SVG elements in HTML. It also means that the
> SVG spec should be more careful about element and attribute names
> going forward.

I disagree with this requirement. I think being "more careful" going forward is an acceptable price to pay for text/html integration. "More careful" *could* include not introducing new names with uppercase letters, for example. As you already acknowledge, it is reasonable for "more careful" to cover not adding new name collisions.

However, introducing new camelCase names and having a list of camelCase names in the parser can be reconciled: When a user agent implements support for a camelCase element in the rendering engine, the element name is added to the list in the parser of that UA. Adding an entry to a list of well-known tokens is *trivial* compared to implementing a new SVG feature in the rendering engine.

> Should allow for SVG Fonts to be included in HTML, and ideally to
> be usable in HTML text.

I think the proposal that is commented out in the HTML5 draft is too aggressive when it comes to breaking out of foreign content on <font>. However, I think the right way to proceed is to make <font> not break out of foreign content when used the way it is used in SVG. Tossing out what's now commented out in the draft and introducing an XML parser into the mix is not the right way forward, in my opinion.
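
As a rough illustration of the case fixups discussed earlier in this message (a parser-side list of camelCase names, plus one array access of indirection for attributes), here is a minimal sketch in Python. The tables are deliberately incomplete and the function names are hypothetical; this is not the actual parser code, only the shape of the idea.

```python
import string

# HTML folds only ASCII letters, so avoid str.lower(), which is Unicode-aware.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Illustrative subset: folded name -> canonical SVG element name.
# Supporting a new camelCase element means adding one entry here.
SVG_ELEMENT_CASE = {
    "clippath": "clipPath",
    "foreignobject": "foreignObject",
    "lineargradient": "linearGradient",
    "radialgradient": "radialGradient",
    "textpath": "textPath",
}

# Offsets into the per-attribute array of alternative names.
HTML, SVG = 0, 1

# Illustrative subset: folded name -> (name in HTML, name in SVG).
ATTRIBUTE_CASE = {
    "viewbox": ("viewbox", "viewBox"),
    "preserveaspectratio": ("preserveaspectratio", "preserveAspectRatio"),
}


def svg_element_name(raw):
    """Return the DOM name for an element created in the SVG context."""
    folded = raw.translate(_ASCII_FOLD)
    return SVG_ELEMENT_CASE.get(folded, folded)


def attribute_name(raw, context):
    """Return the DOM name for an attribute; `context` is HTML or SVG."""
    folded = raw.translate(_ASCII_FOLD)
    alternatives = ATTRIBUTE_CASE.get(folded)
    return alternatives[context] if alternatives else folded


element = svg_element_name("LINEARGRADIENT")
attr_in_svg = attribute_name("VIEWBOX", SVG)
attr_in_html = attribute_name("viewBox", HTML)
```

The author can write the names in any case; the DOM still sees the canonical spelling, and names that are not in the tables simply stay lowercase.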
> Should attempt to avoid breaking existing text/html pages.
> However, this must be balanced with the need for a clean,
> sustainable architecture.

I don't consider the mixing of the HTML parser and an XML parser "a clean, sustainable architecture".

> Should specify a tolerant error handling model for the SVG content.

This proposal fails to be tolerant of errors when it specifies the use of a (Draconian) XML parser.

> For fallback behavior (requirement #4) it's proposed that a 'switch'
> element inside of the SVG fragment is used to isolate the markup
> that will be displayed by legacy UA:s.

That seems more complex than simply sticking the fallback into an SVG element whose content isn't rendered by the SVG renderer. <desc> works if adding elements is considered out of bounds for the parsing algorithm spec: http://hsivonen.iki.fi/test/moz/desc-as-fallback.html

> Changes to the HTML5 Specification
> The following following changes to the HTML5 specification are
> suggested.
>
> Make tokeniser case-preserving

It isn't clear why this is needed. Wouldn't the nested XML parser handle the SVG tags?

Anyway, I strongly disagree with changes that defer case folding of HTML and unknown elements until after the tokenizer. For performance reasons, it is desirable to use static immutable token representations for well-known elements (i.e. all elements from HTML5, HTML 4.01, browser-sensitive legacy elements like <marquee> and <keygen>, SVG elements and MathML elements). For performance reasons, the tokenizer should be able to use a character buffer to find such a static immutable token object without intermediate allocation.

This isn't just about interning the element name string. For performance reasons, it is desirable for the token objects for well-known elements to carry other data such as flags for whether the element is scoping or special and the tree builder dispatch group of the token.
(All elements that always hit the same switch-case in the tree builder should share the dispatch group.)

With a setup like this, the token objects can also carry an interned camelCase name for the token. This way element node creation in the SVG context can use the camelCase name with zero perf hit. In particular, this way the common case (HTML) doesn't need to pay a perf tax for SVG support (other than one conditional jump per start tag token inspecting the "in foreign" state).

I have also implemented interning well-known attribute names to static objects. These objects contain an array of alternative names for HTML, SVG and MathML cases. When the token holding the attributes ends up starting an HTML element, the array is accessed by the offset for HTML. When the token ends up starting an SVG element, the array is accessed by an offset for SVG. Thus, the case fixup only adds one array access of indirection for attributes.

> If there are attribute tokens with the same name it is a parse
> error, discard all attribute tokens that are duplicates and the
> value that is associated with each such token (if any), keep the
> first occurrence of an attribute token whose name is duplicated.
> Then the UA must create an element for the normalized token in the
> HTML namespace, and then append this node to the current node, and
> push it onto the stack of open elements so that it is the new
> current node.
>
> Remove the paragraph: When the user agent leaves the attribute name
> state (and before emitting the tag token, if appropriate), the
> complete attribute's name must be compared to the other attributes
> on the same token; if there is already an attribute on the token
> with the exact same name, then this is a parse error and the new
> attribute must be dropped, along with the value that gets associated
> with it (if any).
Dropping attributes in the tree builder seems a bit worse for perf than dropping duplicates early and not supporting psychotic use of namespaces <http://www.flightlab.com/~joe/sgml/sanity.txt>. Anyway, why is this change needed in the HTML tokenizer if there's an XML parser involved as well?

> When the insertion mode is " in body", tokens must be handled as
> follows:
>
> ...
> A start tag whose case-sensitive tag name is "math" that has a
> case-sensitive attribute "xmlns" with the value
> "http://www.w3.org/Math/1998/MathML":
> A start tag whose case-sensitive tag name is "svg" that has a
> case-sensitive attribute "xmlns" with the value
> "http://www.w3.org/2000/svg":
> A start tag whose case-sensitive tag name is "*:math" that has
> a corresponding case-sensitive attribute "xmlns:*" with the value
> "http://www.w3.org/Math/1998/MathML", where '*' can be any string
> as long as it's the same in both the tagname and the xmlns
> attributename:
> A start tag whose case-sensitive tag name is "*:svg" that has a
> case-sensitive attribute "xmlns:*" with the value
> "http://www.w3.org/2000/svg", where '*' can be any string as long
> as it's the same in both the tagname and the xmlns attributename:

I disagree with allowing prefixed element names. Not allowing prefixes is good for parser performance and gives authors less rope to shoot themselves in the foot with.

> Create a new XML parser. Set the encoding to the character encoding
> used by the HTML parser.

This seems to imply that the XML parser would consume bytes instead of characters. Such a scheme would totally wreck any performant character decoding buffering scheme, since the HTML parser and the XML parser would require the decoder to run in different error handling modes. How would recovery happen if the XML parser threw a fatal error due to a bad byte sequence?
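
The two decoder error handling modes can be illustrated with a small Python sketch. It uses Python's built-in codec error handlers as stand-ins for the two behaviours: a lenient decoder that substitutes U+FFFD for a bad byte sequence and carries on, and a Draconian one that treats the same bytes as fatal. The function names and the sample bytes are made up for illustration; no browser or XML parser is involved.

```python
def decode_lenient(data, encoding="utf-8"):
    """Replace each malformed byte sequence with U+FFFD and continue."""
    return data.decode(encoding, errors="replace")


def decode_draconian(data, encoding="utf-8"):
    """Raise UnicodeDecodeError on the first malformed byte sequence."""
    return data.decode(encoding, errors="strict")


# A single invalid byte (0xFF) inside what would be an SVG subtree.
sample = b"<svg><title>caf\xff</title></svg>"

lenient_text = decode_lenient(sample)

try:
    decode_draconian(sample)
    fatal_offset = None
except UnicodeDecodeError as error:
    # Byte offset at which the strict decoder gave up.
    fatal_offset = error.start
```

A decoder shared by both parsers would have to switch between these modes at a subtree boundary that is only known after tokenizing, which is the buffering problem described above.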
> Feed the XML parser the string starting with the character that
> triggered entry into the 'tag open' state and ending with the
> character that triggered emittance of the start tag token.

This scheme seems very inefficient unless the XML parser is not really an off-the-shelf XML parser but instead a duplicate set of tokenizer states, which would be a serious addition of complexity.

> Let the XML parser attempt to parse and insert the foreign element.
> The namespace of the foreign element shall be decided by following
> namespaces in XML [XMLNS]. If the element was inserted successfully
> let it be the entry-point element.
>
> If the previous step was successful, then bypass the tokeniser, and
> continue to feed the unmodified input stream character by character
> directly to the XML parser until it:
>
> returns with an error [XML10]
> closes the entry-point element, with no errors

This seems very inefficient. Also, it requires an XML parser that you can push data into instead of the XML parser pulling data. Has this proposal been implemented experimentally?

> For each element the XML parser parses, insert a foreign element
> with the namespace, name, and attributes of that element. The
> namespace of the foreign element shall be decided by following
> namespaces in XML [XMLNS].
>
> If the XML parser returns an error:
>
> Close all open elements on the stack up to and including the
> entry-point element.
> If the element that had a parsing error in it was the
> entry-point element, then insert an HTML element corresponding to that
> token. Otherwise let the position in the input stream be the first
> character that followed the last successfully inserted foreign
> element.

If the XML parser bails out inside a tag, will the rest of the tag leak into HTML text content?

> Use of SVG Resources in HTML and CSS

This section seems irrelevant to parsing.

> Fallback Mechanisms

See earlier about <img> and <desc>.
If using <desc> like this seems distasteful, coining a new element called e.g. <fallback> could work. (I don't like the <ext> proposal. HTML and SVG should look like a coherent platform to authors without highlighting where working group boundaries are in the DOM tree.)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Monday, 21 July 2008 13:29:40 UTC