Re: SVG in HTML proposal from Henri Sivonen on 2008-07-21 (public-html@w3.org from July 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 21 Jul 2008 16:28:34 +0300
To: HTML WG <public-html@w3.org>, www-svg <www-svg@w3.org>
Message-Id: <908FCE26-BCAF-4B27-9CC1-8C1F5A17CEEB@iki.fi>
I'm glad to see that the SVG WG has looked into the SVG-in-text/html  
issue. My comments are inline with quotes from CVS revision 1.14 of  
the proposal.

Summary: I think putting an XML parser into the text/html parsing  
process is a bad idea. I disagree with some of the requirements stated  
in the proposal. I suggest proceeding from the prior proposal that is  
commented out in the HTML5 draft.

Quoting from:
http://dev.w3.org/cvsweb/SVG/proposals/svg-html/svg-html-proposal.html?rev=1.14&content-type=text/html
> SVG is an XML language by design, and therefore has certain  
> abilities and constraints at the syntactic level that are dissimilar  
> to those of text/html (but consistent with XHTML); therefore, to  
> work correctly, with the full range of features, SVG must follow the  
> syntactic rules with which it was designed, in both HTML and XHTML.
>
I disagree with this characterization. SVG allowing scripting  
fundamentally makes it a DOM language. An XML serialization is one way  
to initialize the DOM tree, but the vocabulary-specific user agent  
code that implements SVG reads from the DOM and never sees the XML  
source.
> This consistency will aid developers, who will not need to learn two  
> separate sets of rules for the syntax and feature sets.
>
The proposal is suggesting that developers learn two sets of syntax  
rules for working *within* text/html.
> It will maintain compatibility with existing SVG viewers,
>
SVG-in-text/html is a new feature. Syntactic "compatibility" with  
existing SVG viewers is a red herring whenever:
  * The existing viewer rejects text/html upon HTTP Content-Type.
  * The existing viewer throws an error trying to parse the HTML  
wrapper before it even gets to an SVG subtree.
  * The content depends on both HTML and SVG parts getting rendered  
and the existing viewer only supports SVG.
  * The existing viewer is unusably buggy anyway (something to keep in  
mind when reading about mobile SVG UA statistics).
> and continue to permit round-tripping in SVG authoring tools such as  
> Inkscape, Adobe Illustrator, and CorelDraw, which rely on the XML  
> format.
>
How do those tools behave if the SVG content is wrapped in some HTML?  
You'd need to extract the SVG part and put it in a standalone SVG  
file, right? The most sensible and robust way of doing that is  
reserializing an SVG subtree as XML in a browser (like Firefox offers  
to show a serialization of a MathML subtree today). When
round-tripping is achieved this way, it is unnecessary to try to limit
what goes over the wire to look like XML.
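
A minimal sketch of the reserialization step, assuming a DOM-like tree
that was built without any XML source. It uses Python's standard
xml.etree.ElementTree as a stand-in for a browser DOM; the function
name serialize_svg_subtree is hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def serialize_svg_subtree(root: ET.Element) -> str:
    """Serialize an in-memory SVG subtree as standalone XML text."""
    # Make SVG the default namespace so the output uses unprefixed names.
    ET.register_namespace("", SVG_NS)
    return ET.tostring(root, encoding="unicode")


# Build the tree directly, as a tree builder would, with no XML input involved.
svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": "100", "height": "100"})
ET.SubElement(svg, f"{{{SVG_NS}}}circle", {"cx": "50", "cy": "50", "r": "40"})

xml_text = serialize_svg_subtree(svg)

# The serialization is well-formed XML that an XML-based tool can read back.
reparsed = ET.fromstring(xml_text)
```

The point of the sketch is only that the XML text is produced on demand
from the tree, so what went over the wire does not have to be XML.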

The SVG-in-text/html proposal that is in the comments of the HTML5  
spec already makes this possible for SVG features that are defined in  
SVG 1.1 Full itself. It currently doesn't properly round-trip
product-specific extensions to SVG that Inkscape and Illustrator put
into their output and that browsers aren't expected to act on.

Aside about naming Illustrator specifically: Earlier in this thread
<http://lists.w3.org/Archives/Public/public-html/2008Jul/0184.html>,
Jeff Schiller implied that Adobe Illustrator's SVG export isn't
under active development. I'm not sure if that is the case. However,
considering the progress of the Web platform, I think it is a bad idea
to allow a frozen authoring tool to dictate where the platform can go.
If a given authoring product is frozen and the frozenness results in
inconvenience either for the users of that product or for everyone, I
think we should inconvenience the users of that product: require them
to use a sanitizer that turns content from the Web into a form that is
safe for the frozen product to read. That is better than trying to
freeze the Web to fit that product in case someone happens to want to
use it to edit arbitrary content from the Web.

To keep things in perspective, consider how things work in the output  
direction from Illustrator: If you take SVG output from Illustrator,  
you can't just paste the source text in the middle of an XHTML  
document and have it work, because Illustrator does bad things with  
DTDs and you can't put a doctype in the middle of an XHTML document  
without making it ill-formed.
> This document is a proposal for integrating SVG in both the text and  
> XML serializations of HTML5. This proposal follows the model that  
> works today in every major browser that supports XHTML (Firefox,  
> Opera, and Safari), all of which also support SVG natively.
>
In those browsers, the model is that the SVG renderer operates on the  
DOM, and you can initialize the DOM from XML using an XML parser. To  
cast the same model into text/html, an HTML parser would build a DOM  
with SVG nodes in it. What this proposal does is more complex as it  
seeks to involve both an HTML parser and an XML parser in the process  
of building one DOM tree.
> It also works to a lesser extent in Internet Explorer, with the use  
> of an SVG plugin (and a small bit of extra code).
>
I'd be interested in an elaborated explanation on this point. Do  
existing SVG plug-ins for IE get something they can reparse using an  
XML parser (according to this proposal) from the enclosing Trident  
instance? What's the parser interface between Trident and the existing  
SVG plug-ins like? What's the "small bit of extra code" like?
> The SVG WG believes that this satisfies the spirit and the letter of  
> the HTML5 Design Principles, particularly in the aspects of  
> compatibility.
>
I think this proposal fails to satisfy the HTML Design Principles in  
the following ways:

  * Solve Real Problems: The main motivation behind introducing a  
counter-proposal to what was briefly in the HTML5 draft and was  
commented out seems to have been the desire to make text/html source  
code copyable on the source text level to a standalone SVG file that  
can be read by SVG tools. Other than the use case of extracting  
content from the Web into a legacy SVG editor, making source text look  
like XML isn't a Real Problem. Taking an SVG image from the Web and  
loading it into a legacy SVG editor can be solved in a simpler and  
more robust way by providing a browser feature for showing an XML  
serialization of an SVG DOM subtree instead of trying to make the  
source text copyable and pasteable.

  * Priority of Constituencies: This proposal places theoretical  
purity (XML-lookingness) over implementors (implementation ease, code  
footprint, performance) and authors (authors aren't helped by the  
complexity of arbitrary prefixes).

  * Well-defined Behavior: This proposal doesn't define what happens  
when document.write() is called from a script element inside an SVG  
subtree. (document.write() writes UTF-16 strings, but this proposal  
seems to make the XML parser work from the byte layer.)

  * Avoid Needless Complexity: Adding the XML parser inside an HTML  
parser is gratuitous complexity when implementation experience from  
the proposal that is commented out in the HTML5 draft shows that SVG  
can be integrated in text/html with fairly simple amendments to the  
HTML5 parsing algorithm.
> The SVG WG proposes to change the HTML5 specification so that SVG  
> fragments are parsed by an XML parser.
>
As noted above, I disagree with the introduction of an XML parser into  
the mix.

To get a performant result, one can't just use an off-the-shelf XML  
parser. In practice, this proposal would add developing an integrated  
XML parser to the burden of HTML5 parser implementors. It looks like a  
lot of pain for no gain for parser developers--regardless of the  
purpose of the parser (browser, conformance checker, non-browser app,  
JavaScript compatibility library for existing browsers). I don't see  
why the effort would be a good use of any developer's time.  
Furthermore, the code footprint wouldn't be nice in the case of a JS
library. (See the last paragraph of
http://blog.whatwg.org/html5-live-dom-viewer for how close to reality
the JS scenario already is.)
> A requirement for namespaces in XML for the SVG fragments
> [xml-namespaces] is also added.
>
What's the rationale for this requirement? SVG-in-text/html seems like  
a great opportunity to shield authors from the complexity of  
Namespaces on the serialization level even though it's too late to get  
rid of namespaces on the DOM level.
> If an SVG fragment is not XML well-formed, the fragment will be  
> repaired by closing all elements up to and including the element  
> where XML parsing began, and then control is handed over to the HTML  
> parser. The point where the HTML parsing resumes is the character  
> that triggered an XML error, or the character that follows the  
> closing tag for the element where XML parsing began if there was no  
> error.[foreign-elements].
>
If the XML error is within a tag (say, lack of space between  
attributes), having the HTML parser resume from inside a tag seems  
highly ungraceful.
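
A small sketch of where such an error lands, using Python's
xml.parsers.expat as a stand-in XML parser and feeding it one
character at a time. The helper name first_xml_error_column and the
sample fragment are assumptions made for illustration.

```python
import xml.parsers.expat


def first_xml_error_column(source: str):
    """Feed source character by character; return the error column or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        for ch in source:
            parser.Parse(ch, False)
        # Signal end of input so any deferred error is reported.
        parser.Parse("", True)
    except xml.parsers.expat.ExpatError as err:
        return err.offset
    return None


# The attribute value is unquoted, which is fine in HTML but fatal in XML.
fragment = "<svg><circle r=1/></svg>"
column = first_xml_error_column(fragment)

# Character range covered by the <circle .../> tag in the fragment.
tag_start = fragment.index("<circle")
tag_end = fragment.index("/>") + 2
```

With this input the reported column falls inside the <circle> tag, so
an HTML parser resuming at the error position would start in the middle
of a tag.
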
> One problem with mixing HTML and SVG is that some elements and  
> attributes have the same (case-insensitive) names. This proposal  
> suggests that such clashes are handled by recommending authors to  
> use prefixing inside of SVG fragments to avoid any problems with  
> legacy user agents [name-collisions].
>
Prefixes are ugly, and (anecdotally) novice XML authors are confused  
by the freedom to choose the prefix and by the layer of indirection  
from the prefix to a namespace URI. I think the approach taken in the  
proposal that is commented out in the HTML5 draft is much better:  
Using the tree builder context to decide what namespace the homographs  
get assigned to.
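
A toy sketch of context-based namespace assignment, assuming
well-nested start and end tags and ignoring everything else a real
tree builder does. The function name and token shape are hypothetical.

```python
HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"


def assign_namespaces(tokens):
    """Return (namespace, name) for each start tag, based on the context.

    tokens is a list of ("start", name) or ("end", name) tuples.
    """
    depth_in_svg = 0  # number of open elements inside an <svg> subtree
    result = []
    for kind, name in tokens:
        name = name.lower()
        if kind == "start":
            if depth_in_svg or name == "svg":
                # Inside foreign content: a homograph such as "a" is SVG.
                depth_in_svg += 1
                result.append((SVG_NS, name))
            else:
                result.append((HTML_NS, name))
        elif kind == "end" and depth_in_svg:
            depth_in_svg -= 1
    return result


tokens = [
    ("start", "a"), ("start", "svg"), ("start", "a"), ("end", "a"),
    ("end", "svg"), ("start", "title"), ("end", "title"), ("end", "a"),
]
assigned = assign_namespaces(tokens)
```

The same tag name "a" ends up in the HTML namespace outside the svg
subtree and in the SVG namespace inside it, with no prefix in the
source.
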
> Going forward, HTML5 and SVG should strive to not introduce any new  
> name-collisions.
>
I strongly agree. (Aside: SVG could also strive not to introduce any  
more names with capital letters in them.)
> By using an XML parser the following important requirements are met:
>
> 	• attribute values in SVG fragments are guaranteed to always be in  
> quotes

I disagree with this requirement. If this requirement were really
important, how could HTML have succeeded while allowing unquoted
attributes since the dawn of the Web?

> 	• attribute- and element-names are case-sensitive in SVG fragments

I think it is reasonable to require SVG names to use their canonical
case in the DOM. However, it doesn't follow that the names have to use
the canonical case in the serialization. In fact, implementation
experience with the proposal that is commented out in the HTML5 draft
shows that case-insensitivity in the parser, with fixups before the
tree is built, can be implemented with zero perf cost in the case of
element names and with the cost of introducing one layer of array
access indirection in the attribute case if the cost is paid instead
in static memory footprint.
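
A minimal sketch of such a fixup for element names, assuming a lookup
table keyed by the lowercased name. The table holds only a few SVG 1.1
names and the identifiers are hypothetical.

```python
# Lowercased name -> canonical camelCase name, for a few SVG 1.1 elements.
SVG_ELEMENT_CASE_FIXUPS = {
    name.lower(): name
    for name in (
        "clipPath",
        "foreignObject",
        "linearGradient",
        "radialGradient",
        "textPath",
    )
}


def canonical_svg_name(raw_name: str) -> str:
    """Map a tag name in any case to the name used for the DOM node."""
    lowered = raw_name.lower()
    # Names without capital letters need no fixup beyond lowercasing.
    return SVG_ELEMENT_CASE_FIXUPS.get(lowered, lowered)
```

For example, canonical_svg_name("LINEARGRADIENT") gives
"linearGradient", so <SVG> and <LinearGradient> in the source still
produce nodes with the canonical names.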

Furthermore, since text/html is traditionally case-insensitive, making  
SVG parts case-sensitive in the serialization would make text/html  
inconsistent with itself. And yet, making it case-sensitive only makes  
the format less robust as more cases fail (like <SVG>).

(In general, this proposal seems to fail more eagerly than the  
proposal commented out in the HTML5 draft. I don't see the value of  
eager failure.)

> 	• custom namespaced data in SVG fragments is made available in the  
> DOM in the correct namespace(s)

This requirement is indeed interesting if one wants to round-trip  
product-specific editor state. However, it seems silly to pay a  
performance and complexity tax for things that a browser is expected  
to ignore.

> 	• moving to and from XHTML+SVG is made easier, since the syntax is  
> more or less the same

The XHTML/HTML part isn't more or less the same for practical  
purposes, which makes the sameness of the SVG parts a lot less  
interesting. I think moving between syntaxes is better addressed with  
reserializers. Trying to shield people from the awareness of  
differences (i.e. "lies to children") doesn't seem like a good way,  
since people always tend to hit the bounds of the polite fiction anyway.

> Requirements
> 	• SVG should remain XML when inline in HTML.

I disagree with this requirement. I think requirements should address  
real problems such as use cases or implementability issues. This is  
just an arbitrary requirement to use a particular syntax without a  
rationale.

> 	• Should be able to take a conforming SVG document and paste its  
> contents into an HTML document and have it be the same DOM. (That  
> is, it should be possible for authors to create an SVG document in  
> Inkscape, take the contents of the file, and include it directly in  
> the HTML without having to munge its syntax to get it to work.) This  
> includes script content.

I agree with this requirement when it comes to markup features that
  1) are specified in SVG 1.1 (i.e. excluding product-specific cruft)
and
  2) would be equally pasteable into XHTML (i.e. excluding  
Illustrator's doctype cruft)
and
  3) are used by default by the most popular SVG editors (i.e.  
excluding prefixed SVG elements since the popular tools by default use  
unprefixed element names)

> 	• Should be able to take a conforming HTML document and copy the  
> SVG fragment from it and paste it into a new file and that would be  
> a conforming SVG document. (That is, it should be possible for  
> authors to, when they come across an SVG-in-HTML fragment, copy and  
> paste that source and open it up in Inkscape to edit.)

I agree with the use case on a general level. However, I don't agree  
that copyability needs to be able to happen on the source level. I  
think reserializing the DOM fragment is an acceptable intermediate step.

> 	• Should be able to provide some sort of fallback mechanism for the  
> SVG-in-HTML so that UAs that don’t know how to handle these SVG  
> fragments will display the fallback.

I agree that this is desirable--but I think it doesn't need to be able  
to go further than to enable fallback to an HTML <img> element. I  
wouldn't consider this a hard requirement, since it seems unlikely  
that authors would want to deal with the burden of producing both an  
SVG image *and* fallback.

> 	• Should allow for unrestricted growth of the SVG language by the  
> SVG specifications (though those specifications should also take  
> into account the idea that SVG will, going forward, be used more  
> commonly in concert with HTML). This means that there would be no  
> "white list" of allowed SVG elements in HTML. It also means that the  
> SVG spec should be more careful about element and attribute names  
> going forward.

I disagree with this requirement. I think being "more careful" going  
forward is an acceptable price to pay for text/html integration. "More  
careful" *could* include not introducing new names with uppercase  
letters, for example. As you already acknowledge, it is reasonable for  
"more careful" to cover not adding new name collisions.

However, introducing new camelCase names and having a list of  
camelCase names in the parser can be reconciled: When a user agent  
implements support for a camelCase element in the rendering engine the  
element name is added to the list in the parser of that UA. Adding an  
entry to a list of well-known tokens is *trivial* compared to  
implementing a new SVG feature in the rendering engine.

> 	• Should allow for SVG Fonts to be included in HTML, and ideally to  
> be usable in HTML text.

I think the proposal that is commented out in the HTML5 draft is too  
aggressive when it comes to breaking out of foreign content on <font>.  
However, I think the right way to proceed is to make <font> not break  
out of foreign content when used the way it is used in SVG. Tossing  
out what's now commented out in the draft and introducing an XML  
parser into the mix is not the right way forward, in my opinion.

> 	• Should attempt to avoid breaking existing text/html pages.  
> However, this must be balanced with the need for a clean,  
> sustainable architecture.

I don't consider the mixing of the HTML parser and an XML parser "a  
clean, sustainable architecture".

> 	• Should specify a tolerant error handling model for the SVG content.

This proposal fails to be tolerant of errors when it specifies the use  
of a (Draconian) XML parser.
> For fallback behavior (requirement #4) it's proposed that a 'switch'  
> element inside of the SVG fragment is used to isolate the markup  
> that will be displayed by legacy UA:s.
>
That seems more complex than simply sticking the fallback into an SVG
element whose content isn't rendered by the SVG renderer. <desc>
works if adding elements is considered out of bounds for the parsing
algorithm spec:
http://hsivonen.iki.fi/test/moz/desc-as-fallback.html

> Changes to the HTML5 Specification
> The following changes to the HTML5 specification are
> suggested.
>
> Make tokeniser case-preserving

It isn't clear why this is needed. Wouldn't the nested XML parser  
handle the SVG tags?

Anyway, I strongly disagree with changes that defer case folding of  
HTML and unknown elements until after the tokenizer. For performance  
reasons, it is desirable to use static immutable token representations  
for well-known elements (i.e. all elements from HTML5, HTML 4.01,  
browser-sensitive legacy elements like <marquee> and <keygen>, SVG  
elements and MathML elements). For performance reasons, the tokenizer
should be able to use a character buffer to find such a static
immutable token object without intermediate allocation.

This isn't just about interning the element name string. For
performance reasons, it is desirable for the token objects for
well-known elements to carry other data such as flags for whether the
element is scoping or special and the tree builder dispatch group of
the token. (All elements that always hit the same switch-case in the
tree builder should share the dispatch group.)

With a setup like this, the token objects can also carry an interned
camelCase name for the token. This way element node creation in the SVG
context can use the camel case name with zero perf hit. In particular,  
this way the common case (HTML) doesn't need to pay a perf tax for SVG  
support (other than one conditional jump per start tag token  
inspecting the "in foreign" state).
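
A rough sketch of such token objects, assuming frozen dataclasses as
the "static immutable" representation. All identifiers, the dispatch
group labels and the tiny element table are illustrative, and the
lowercasing here allocates a string, so the allocation-free buffer
lookup described above is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementName:
    name: str             # lowercase name, used in the HTML context
    camel_case_name: str  # name used for node creation in the SVG context
    special: bool
    scoping: bool
    dispatch_group: str   # which tree builder switch-case handles the token


def _known(camel, special=False, scoping=False, group="OTHER"):
    return ElementName(camel.lower(), camel, special, scoping, group)


# One shared instance per well-known element (illustrative subset).
WELL_KNOWN_ELEMENTS = {
    e.name: e
    for e in (
        _known("p", special=True, group="P"),
        _known("table", scoping=True, group="TABLE"),
        _known("svg", group="SVG"),
        _known("foreignObject", group="FOREIGN_OBJECT"),
    )
}


def element_name_for(buffer: str) -> ElementName:
    """Find the shared token object for a tag name in any case."""
    key = buffer.lower()
    known = WELL_KNOWN_ELEMENTS.get(key)
    if known is not None:
        return known
    # Unknown elements get a fresh object with default flags.
    return ElementName(key, key, False, False, "OTHER")


def node_name(token: ElementName, in_foreign: bool) -> str:
    # The only SVG-related work per start tag is this one branch.
    return token.camel_case_name if in_foreign else token.name
```

Here element_name_for("FOREIGNOBJECT") and
element_name_for("foreignobject") return the same object, and the
camelCase name is read from it only when the tree builder is in foreign
content.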

I have also implemented interning well-known attribute names to static  
objects. These objects contain an array of alternative names for HTML,  
SVG and MathML cases. When the token holding the attributes ends up  
starting an HTML element, the array is accessed by the offset for  
HTML. When the token ends up starting an SVG element, the array is  
accessed by an offset for SVG. Thus, the case fixup only adds one  
array access of indirection for attributes.
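
A minimal sketch of that attribute scheme, assuming a three-slot tuple
per well-known attribute indexed by context. The class, constants and
the two sample attributes are chosen for illustration and are not taken
from any particular parser.

```python
# Offsets into the per-attribute tuple of alternative names.
HTML, MATHML, SVG = 0, 1, 2


class AttributeName:
    """Shared object holding the name of one attribute in each context."""

    def __init__(self, html: str, mathml: str, svg: str):
        self.local_names = (html, mathml, svg)

    def local(self, mode: int) -> str:
        # The case fixup is just this one indexed access.
        return self.local_names[mode]


WELL_KNOWN_ATTRIBUTES = {
    "viewbox": AttributeName("viewbox", "viewbox", "viewBox"),
    "definitionurl": AttributeName(
        "definitionurl", "definitionURL", "definitionurl"
    ),
}


def intern_attribute(raw_name: str) -> AttributeName:
    """Return the shared object for a well-known name, else a new one."""
    key = raw_name.lower()
    known = WELL_KNOWN_ATTRIBUTES.get(key)
    if known is not None:
        return known
    return AttributeName(key, key, key)
```

The tokenizer can intern the name before it knows what kind of element
the token will start; the tree builder later picks the slot that
matches the element's namespace.
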
> If there are attribute tokens with the same name it is a parse  
> error, discard all attribute tokens that are duplicates and the  
> value that is associated with each such token (if any), keep the  
> first occurrence of an attribute token whose name is duplicated.  
> Then the UA must create an element for the normalized token in the  
> HTML namespace, and then append this node to the current node, and  
> push it onto the stack of open elements so that it is the new  
> current node.
>
> Remove the paragraph: When the user agent leaves the attribute name  
> state (and before emitting the tag token, if appropriate), the  
> complete attribute's name must be compared to the other attributes  
> on the same token; if there is already an attribute on the token  
> with the exact same name, then this is a parse error and the new  
> attribute must be dropped, along with the value that gets associated  
> with it (if any).
>
Dropping attributes in the tree builder seems a bit worse for perf  
than dropping duplicates early and not supporting psychotic use of  
namespaces <http://www.flightlab.com/~joe/sgml/sanity.txt>.
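
A small sketch of dropping duplicates early, at the point where an
attribute is added to the tag token, as in the paragraph the proposal
wants removed. The function name and the dict-based token are
assumptions.

```python
def add_attribute(attributes: dict, name: str, value: str) -> bool:
    """Add an attribute to a tag token; return True on a parse error.

    The first occurrence of a name wins and later duplicates are dropped
    together with their values.
    """
    name = name.lower()
    if name in attributes:
        return True  # duplicate: parse error, new attribute is dropped
    attributes[name] = value
    return False


token_attributes = {}
first = add_attribute(token_attributes, "class", "a")
second = add_attribute(token_attributes, "CLASS", "b")
```

With this approach the tree builder never sees the duplicate, so it has
nothing to discard later.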

Anyway, why is this change needed in the HTML tokenizer if there's an  
XML parser involved as well?

> When the insertion mode is "in body", tokens must be handled as
> follows:
>
> ...
> A start tag whose case-sensitive tag name is "math" that has a
> case-sensitive attribute "xmlns" with the value
> "http://www.w3.org/Math/1998/MathML":
> A start tag whose case-sensitive tag name is "svg" that has a
> case-sensitive attribute "xmlns" with the value
> "http://www.w3.org/2000/svg":
> A start tag whose case-sensitive tag name is "*:math" that has a
> corresponding case-sensitive attribute "xmlns:*" with the value
> "http://www.w3.org/Math/1998/MathML", where '*' can be any string as
> long as it's the same in both the tagname and the xmlns
> attributename:
> A start tag whose case-sensitive tag name is "*:svg" that has a
> case-sensitive attribute "xmlns:*" with the value
> "http://www.w3.org/2000/svg", where '*' can be any string as long as
> it's the same in both the tagname and the xmlns attributename:

I disagree with allowing prefixed element names. Not allowing prefixes  
is good for parser performance and gives authors less rope to shoot  
themselves in the foot with.
> Create a new XML parser. Set the encoding to the character encoding  
> used by the HTML parser.
>
This seems to imply that the XML parser would consume bytes instead of  
characters. Such a scheme would totally wreck any performant character  
decoding buffering scheme, since the HTML parser and the XML parser  
would require the decoder to run in different error handling modes.  
How would recovery happen if the XML parser threw a fatal error due to  
a bad byte sequence?
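
A short sketch of the two error handling modes, using Python's built-in
UTF-8 decoder: substituting U+FFFD and carrying on, versus stopping at
the first bad byte as an XML parser must. The sample bytes are made up
for illustration.

```python
# 0xE9 on its own is not a valid UTF-8 sequence.
data = b"<svg><title>caf\xe9</title></svg>"

# HTML-style decoding: replace the bad byte with U+FFFD and continue.
html_style = data.decode("utf-8", errors="replace")

# XML-style decoding: a malformed byte sequence is a fatal error.
try:
    data.decode("utf-8", errors="strict")
    xml_style_failed = False
    bad_offset = None
except UnicodeDecodeError as err:
    xml_style_failed = True
    bad_offset = err.start  # byte offset of the offending byte
```

A single shared decoder buffer cannot be in both modes at once, which
is the difficulty the question above points at.
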
> Feed the XML parser the string starting with the character that  
> triggered entry into the 'tag open' state and ending with the  
> character that triggered emittance of the start tag token.
>
This scheme seems very inefficient unless the XML parser is not really  
an off-the-shelf XML parser but instead a duplicate set of tokenizer  
states, which would be a serious addition of complexity.
> Let the XML parser attempt to parse and insert the foreign element.  
> The namespace of the foreign element shall be decided by following  
> namespaces in XML [XMLNS]. If the element was inserted successfully  
> let it be the entry-point element.
>
> If the previous step was successful, then bypass the tokeniser, and  
> continue to feed the unmodified input stream character by character  
> directly to the XML parser until it:
>
> 		• returns with an error [XML10]
> 		• closes the entry-point element, with no errors

This seems very inefficient. Also, it requires an XML parser that you  
can push data into instead of the XML parser pulling data.

Has this proposal been implemented experimentally?
> For each element the XML parser parses, insert a foreign element  
> with the namespace, name, and attributes of that element. The  
> namespace of the foreign element shall be decided by following  
> namespaces in XML [XMLNS].
>
> If the XML parser returns an error:
>
> 		• Close all open elements on the stack up to and including the  
> entry-point element.
> 		• If the element that had a parsing error in it was the
> entry-point element, then insert an HTML element corresponding to
> that token. Otherwise let the position in the input stream be the
> first character that followed the last successfully inserted foreign
> element.

If the XML parser bails out inside a tag, will the rest of the tag  
leak into HTML text content?

> Use of SVG Resources in HTML and CSS

This section seems irrelevant to parsing.

> Fallback Mechanisms

See earlier about <img> and <desc>. If using <desc> like this seems  
distasteful, coining a new element called e.g. <fallback> could work.  
(I don't like the <ext> proposal. HTML and SVG should look like a  
coherent platform to authors without highlighting where working group  
boundaries are in the DOM tree.)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Monday, 21 July 2008 13:29:40 UTC