Feedback on SVGWG's SVG-in-text/html proposal

This e-mail contains feedback on:
   http://dev.w3.org/SVG/proposals/svg-html/svg-html-proposal.html

First, so that we all understand where we are each coming from, here
are some of the goals and requirements I have received from various
participants primarily from the HTML5 side, including especially
browser vendors:

 * It should be possible to drop an SVG file from a graphic editor into an 
   HTML5 document sent as text/html and usually have it validate and work.

 * The DOM aspect of this should be very similar to using SVG in XHTML, so 
   that there is no work required beyond parser changes for text/html.

 * Changes to the parser should be relatively small and localised. For 
   example, it shouldn't double the number of states in the tokeniser, or 
   add half a dozen tree construction insertion modes.

 * The parsing model should be very light-weight. It shouldn't require, 
   for example, extra buffering, or parsing text twice.

 * The markup should be as easy to edit by hand as regular HTML, modulo 
   complications due to the vocabulary itself.

 * The syntax shouldn't introduce two different syntaxes for HTML 
   elements in text/html. For example, it should be possible to take a big 
   blob of existing HTML, and wrap it in a <foreignObject> and have it 
   just work, without having to fix up missing end tags or namespace 
   declarations or whatever.

 * If possible, the same mechanism should work for both MathML and SVG, 
   and it should make it relatively easy to introduce other vocabularies 
   in future, at least for vocabularies designed with this mechanism in 
   mind.

 * Markup seen on real pages today, and errors of a similar vein, 
   shouldn't result in dramatically different renderings in browsers that 
   support this feature. Examples of pages today:
    - http://www.cocopahrv.com/map.html
    - http://www.laroseweb.com/calcs/fans.php
    - http://puysl.com/view.htm
    - http://albren.blogspot.com/2007/02/l05-traductor.html

 * We must not require users to declare namespace prefixes correctly.

 * If possible, we shouldn't expose users to namespace syntax at all, 
   though the DOM still needs to expose the namespaces.

Here are some non-goals, again primarily from HTML5 participants and
browser vendors:

 * It is not a goal to make this so generic that any random XML or content
   from any random namespace can be "smuggled" in text/html.

 * It is not a goal that any valid SVG file must be embeddable in 
   text/html. (Only the syntax that is actually widely used need be 
   supported.)

 * It is not a goal that the processing rules support the "draconian" 
   parsing model. (For example, missing quotes around attributes should, 
   as far as possible, be gracefully handled.)

 * It is not a goal that anything that is functional in text/html be 
   functional when copied and pasted into image/svg+xml.

 * It is not a goal that anything that is valid text/html be valid 
   image/svg+xml. In particular, whether to use case-sensitive or case- 
   insensitive tag and attribute names at the syntax level should be 
   driven from implementation performance choices, not conformance.

 * It is not a goal to support round-tripping of proprietary metadata or 
   metadata from namespaces other than those explicitly needed for SVG, 
   MathML, and HTML to work together.


The proposal linked above also includes a number of requirements.

Here are the goals where we appear to be aligned, or where the SVG 
proposal introduces new goals that, while not redundant with the above, 
are not contradictory:

 * Should be able to take a conforming SVG document and paste its
   contents into an HTML document and have it be the same DOM. (That
   is, it should be possible for authors to create an SVG document in
   Inkscape, take the contents of the file, and include it directly in
   the HTML without having to munge its syntax to get it to work.)
   This includes script content.

 * Should allow for unrestricted growth of the SVG language by the SVG
   specifications (though those specifications should also take into
   account the idea that SVG will, going forward, be used more
   commonly in concert with HTML). This means that there would be no
   "white list" of allowed SVG elements in HTML. It also means that
   the SVG spec should be more careful about element and attribute
   names going forward.

 * Should allow for SVG Fonts to be included in HTML, and ideally to
   be usable in HTML text.

 * Should specify a tolerant error handling model for the SVG content.


Next, here are some goals that are mostly compatible but may be somewhat 
problematic, with commentary regarding the conflicts:

 * Should attempt to avoid breaking existing text/html pages. However,
   this must be balanced with the need for a clean, sustainable
   architecture.

   -- it's likely that we disagree on the balance. In general, not 
   breaking existing HTML pages is pretty much the most important concern 
   in HTML5's development, and clean, sustainable architecture (i.e. the 
   concerns of the specification author) are the lowest priority.


 * SVG should remain XML when inline in HTML.

   -- it's not quite clear what this means.


 * Should be able to provide some sort of fallback mechanism for the
   SVG-in-HTML so that UAs that don't know how to handle
   these SVG fragments will display the fallback.

   -- this would be interesting, certainly, though so long as we show it 
   is possible we probably don't need special syntax for it.


Finally, there is one goal in the SVG proposal that directly conflicts with 
the aforementioned goals:

 * Should be able to take a conforming HTML document and copy the SVG
   fragment from it and paste it into a new file and that would be a
   conforming SVG document. (That is, it should be possible for
   authors to, when they come across an SVG-in-HTML fragment, copy and
   paste that source and open it up in Inkscape to edit.)

   -- the problem with this is that the parenthetical isn't made possible 
   by the goal. Most content will be invalid, so just making the 
   conforming content work will not make this always possible. In general 
   it's not really clear why this is such a great requirement anyway; it 
   seems like the better solution would be to just have SVG tools support 
   text/html. (This isn't a "tools will save us" argument because the 
   entire point here is that we're trying to use tools. If we didn't have 
   to use tools at all, then we wouldn't need it to be XML.)



While I'm listing goals and such, let me also go through the research we 
have. First, there's the data I found regarding how MathML and SVG are 
already present in some ways on pages; the four links I give above are 
pretty representative of such pages. They're not ubiquitous, but there are 
enough of them for us to pay attention. Some features that they indicate 
we have to support gracefully:

 * Unclosed <svg> elements in the middle of HTML, both with and without 
   xmlns="" attributes, for no apparent reason. For example, this kind of 
   markup occurs on the Web:

      <svg xmlns="http://www.w3.org/2000/svg">
      <circle .../>
      <p>...</p>
      <p>...</p>

 * Completely unexpected <math> elements, with MathML namespace and 
   everything, even where no MathML is involved. For example, this kind 
   of markup occurs on the Web:

      <html xmlns="http://www.w3.org/1999/xhtml" >
      <math xmlns="http://www.w3.org/1998/Math/MathML">
      <head>
      ...normal html page...

 * Actually correct MathML fragments (sometimes without a namespace) 
   aren't uncommon on the Web.

Next, there is Henri's research:

   http://lists.w3.org/Archives/Public/public-html/2008Aug/thread.html#msg93

Some factors of note:

 * SVG elements without namespace declarations occur in >1%.

 * SVG elements using prefixes occur in <0.5%.

 * XLink attributes that aren't prefixed by 'xlink' occur in ~0.005%.

 * Namespace-related entities declared in internal subsets occur in >5%.

There's also been other research, for example Andrew and Henri's
implementation experience.



Now, let us review the proposal, by evaluating it against the goals listed 
above:

 * It should be possible to drop an SVG file from a graphic editor into an 
   HTML5 document sent as text/html and usually have it validate and work.

In a significant number of cases (at least up to 5% if Henri's numbers are 
to be believed), this goal is not met by the SVGWG proposal. In 
particular, it fails for anything that uses DTDs to declare namespaces 
today, as well as anything that omits namespaces altogether. (It also 
makes it hard to copy and paste SVG from within non-SVG XML files into 
text/html files if the declarations for XLink are at the top of the file, 
which is common.)
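
To make the DTD case concrete, here is a minimal Python sketch using the 
standard library's xml.etree.ElementTree. The entity name "ns_svg" and both 
sample strings are made up for illustration; they only mimic the pattern of 
declaring the namespace through an internal subset. It is not a model of 
the SVGWG proposal's parser, just a demonstration that the copied fragment 
stops being well-formed once the declarations are left behind:

```python
import xml.etree.ElementTree as ET

# Hypothetical editor output: the namespace is declared through an entity
# in the internal subset rather than written out on the root element.
STANDALONE_FILE = (
    '<!DOCTYPE svg [<!ENTITY ns_svg "http://www.w3.org/2000/svg">]>'
    '<svg xmlns="&ns_svg;"><circle r="10"/></svg>'
)

# What is left after copying only the <svg> element into another document.
PASTED_FRAGMENT = '<svg xmlns="&ns_svg;"><circle r="10"/></svg>'


def root_tag(source):
    """Return the root tag in {namespace}local form, or None if the
    source is not well-formed XML."""
    try:
        return ET.fromstring(source).tag
    except ET.ParseError:
        return None


print(root_tag(STANDALONE_FILE))
print(root_tag(PASTED_FRAGMENT))
```

The standalone file resolves to the SVG namespace; the pasted fragment 
refers to an entity that no longer has a declaration.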


 * The DOM aspect of this should be very similar to using SVG in XHTML, so 
   that there is no work required beyond parser changes for text/html.

This goal is met.


 * Changes to the parser should be relatively small and localised. For 
   example, it shouldn't double the number of states in the tokeniser, or 
   add half a dozen tree construction insertion modes.

The proposal as defined embeds an entire XML parser into the HTML parsing 
rules. It's not entirely clear to me how this interface is expected to 
work (there is no spec that strictly defines how an XML parser works; the 
XML spec technically only defines the syntax and how to determine if a 
document is conforming or not). For example, XML parsers typically work on 
byte streams, but the proposal seems to pass characters to the XML parser. 

Now, I see two ways to actually implement this: either actually stick an 
XML parser into the implementation pretty much exactly how the proposal 
says to, or implement the relevant parts of an XML parser in the HTML 
parser. Realistically the former would have horrifying performance. Given 
how much browsers work to optimise their parsers, I would imagine they 
would all end up going the latter route. In fact, to make this 
well-defined enough, we'd probably have to do that in the spec too.

Embedding an entire XML parser in HTML's parser, even if we can drop 
things like DOCTYPE parsing, is an ungodly amount of complexity relative 
to just adding a couple of new states or insertion modes.

Furthermore, the proposal requires changing the way that token case 
handling happens throughout the entire tokenising and tree construction 
stages -- pretty much every string comparison is affected. That is hardly 
a localised change.

So in conclusion, I would have to say this goal isn't met.

Incidentally, it's not clear to me _where_ the proposal considers a 
well-formedness error to be. It would be important to define this to the 
byte, since the proposal relies on this definition for interoperable 
behaviour. XML defines a boolean (error present or not) but doesn't 
specifically define the byte or character where the error is. (There are 
ways to define this, but there are no ways to define this that I have seen 
that don't have pretty serious problems.)
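
As an illustration of why the error location matters, the following Python 
sketch asks one particular XML parser (the expat-based one behind 
xml.etree.ElementTree) where it thinks a fragment went wrong. The fragment 
is made up, and the position that comes back is this implementation's own 
choice, not something the XML spec defines, which is exactly the problem:

```python
import xml.etree.ElementTree as ET

# Made-up fragment with a deliberately unquoted attribute value.
BROKEN = '<svg xmlns="http://www.w3.org/2000/svg">\n<circle r=10/>\n</svg>'


def error_position(source):
    """Return the (line, column) this particular parser reports for the
    first well-formedness error, or None if the source parses cleanly."""
    try:
        ET.fromstring(source)
    except ET.ParseError as err:
        return err.position
    return None


print(error_position(BROKEN))
```

Another parser could legitimately report a different point for the same 
input, so behaviour keyed to "where the error is" needs its own definition.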


 * The parsing model should be very light-weight. It shouldn't require, 
   for example, extra buffering, or parsing text twice.

As specified, this goal isn't met (having an entire XML parser in there 
is hardly "light-weight"), but I think this could be worked around by just 
inlining the whole XML parser. There's also the issue of reparsing in the 
face of well-formedness errors, but exactly how heavy-weight that would be 
depends on the exact definition, which is as yet not present.

Also, making the tokeniser case-preserving removes a major class of 
performance optimisations that are currently possible, as has been 
documented by Henri and Andrew. Given the emphasis on performance that 
browser vendors have, we would be unlikely to get all browser vendors to 
actually implement this as written. (I am surprised, in fact, that Opera 
is willing to put its name behind this proposal.)


 * The markup should be as easy to edit by hand as regular HTML, modulo 
   complications due to the vocabulary itself.

The proposal introduces namespace prefixes, so this goal isn't met.


 * The syntax shouldn't introduce two different syntaxes for HTML 
   elements in text/html. For example, it should be possible to take a big 
   blob of existing HTML, and wrap it in a <foreignObject> and have it 
   just work, without having to fix up missing end tags or namespace 
   declarations or whatever.

The proposal requires HTML inside SVG to use the XML syntax, so this goal 
isn't met.


 * If possible, the same mechanism should work for both MathML and SVG, 
   and it should make it relatively easy to introduce other vocabularies 
   in future, at least for vocabularies designed with this mechanism in 
   mind.

This goal is met.


 * Markup seen on real pages today, and errors of a similar vein, 
   shouldn't result in dramatically different renderings in browsers that 
   support this feature. Examples of pages today:
    - http://www.cocopahrv.com/map.html
    - http://www.laroseweb.com/calcs/fans.php
    - http://puysl.com/view.htm
    - http://albren.blogspot.com/2007/02/l05-traductor.html

This goal isn't met. In particular, consider some markup like this:

   <!DOCTYPE HTML>
   <title> Test </title>
   <svg xmlns="http://www.w3.org/2000/svg">
   <p>Hello world.</p>
   <p>How do you do.</p>

The proposal would make the text disappear relative to today's rendering. 
This is a pretty big problem, and one that would only get worse during the 
transition period (where only some browsers support this, so people are 
seeing the syntax as people experiment, but are then copying it into 
documents that are only tested in legacy browsers). There are examples in 
the list of URLs above that are quite similar to this kind of markup, 
though right now they all end up being ok. However, consider this one:

   http://puysl.com/view.htm

...and note that this document was updated _yesterday_, according to the 
text. It's quite possible that the <p>&nbsp;</p> line would end up moved, 
and suddenly the "Welcome to" text would disappear.
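
For comparison, here is a rough Python sketch of what a tolerant, 
namespace-unaware tokeniser makes of the test markup above. It uses the 
standard library's html.parser, which is only a stand-in for a legacy 
browser (it does no tree construction); the class and function names are 
mine. The point is just that the paragraph text after the unclosed <svg> 
is still seen as ordinary text:

```python
from html.parser import HTMLParser

LEGACY_PAGE = """<!DOCTYPE HTML>
<title> Test </title>
<svg xmlns="http://www.w3.org/2000/svg">
<p>Hello world.</p>
<p>How do you do.</p>
"""


class TagAndTextCollector(HTMLParser):
    """Record start tags and non-blank text runs in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


def collect(markup):
    """Tokenise the markup and return (start tags, text runs)."""
    collector = TagAndTextCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags, collector.texts


tags, texts = collect(LEGACY_PAGE)
print(tags)
print(texts)
```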


 * We must not require users to declare namespace prefixes correctly.

This goal is somewhat met, in that they aren't actually _required_, but 
they are _encouraged_, which is almost as bad.


 * If possible, we shouldn't expose users to namespace syntax at all, 
   though the DOM still needs to expose the namespaces.

This goal is not met.


 * Should be able to take a conforming SVG document and paste its
   contents into an HTML document and have it be the same DOM. (That
   is, it should be possible for authors to create an SVG document in
   Inkscape, take the contents of the file, and include it directly in
   the HTML without having to munge its syntax to get it to work.)
   This includes script content.

This goal isn't met for all SVG images, mostly because of the requirement 
that the namespace prefixes be correct and that entities be declared even 
though there is no declaration mechanism.


 * Should allow for unrestricted growth of the SVG language by the SVG
   specifications (though those specifications should also take into
   account the idea that SVG will, going forward, be used more
   commonly in concert with HTML). This means that there would be no
   "white list" of allowed SVG elements in HTML. It also means that
   the SVG spec should be more careful about element and attribute
   names going forward.

This goal is met.


 * Should allow for SVG Fonts to be included in HTML, and ideally to
   be usable in HTML text.

This goal is met.


 * Should specify a tolerant error handling model for the SVG content.

This goal is not really met, in that syntax-level errors cause the parser 
to switch out of the "XML" mode altogether.


 * Should attempt to avoid breaking existing text/html pages. However,
   this must be balanced with the need for a clean, sustainable
   architecture.

This goal isn't met, at least insofar as there are clear ways in which 
text/html pages can end up rendering differently (missing text) with this 
proposal implemented. It also isn't met on the second front -- as at least 
one implementor has pointed out, mixing XML and HTML parsers together 
isn't "clean" or "sustainable".


 * Should be able to provide some sort of fallback mechanism for the
   SVG-in-HTML so that UAs that don't know how to handle
   these SVG fragments will display the fallback.

This goal is met, though I haven't carefully examined the mechanisms for 
it. I have no doubt that it is possible to address this requirement 
irrespective of how we design the parser model, so I'm going to mostly 
ignore this for the purposes of this e-mail, and focus on the parser.


 * Should be able to take a conforming HTML document and copy the SVG
   fragment from it and paste it into a new file and that would be a
   conforming SVG document. (That is, it should be possible for
   authors to, when they come across an SVG-in-HTML fragment, copy and
   paste that source and open it up in Inkscape to edit.)

This goal is met, but it is still possible, as defined, for documents 
with images in text/html that function correctly to not be valid XML 
documents when the user copies-and-pastes the content, so it seems like 
somewhat of a Pyrrhic victory. For example:

   <!DOCTYPE HTML><title></title>
   <?xml error?>
   <svg xmlns="http://www.w3.org/2000/svg"></svg>

That document will work fine, but copying the <?xml?> declaration along 
with the SVG will result in an error.

Another example, assuming that encoding errors are handled before the XML 
parser:

   <!DOCTYPE HTML><title></title>
   <svg xmlns="http://www.w3.org/2000/svg">
    <desc> x </desc>
   </svg>

...where "x" is some byte that isn't valid UTF-8.

Another example (missing the </svg> end tag):

   <!DOCTYPE HTML><title></title>
   <svg xmlns="http://www.w3.org/2000/svg">
    <circle cx="0" cy="0" r="100"/>
   <p>...



In general, my conclusions are somewhat negative:

 - There are a lot of goals that aren't met.

 - It seems to me that this proposal goes to great lengths to support
   some syntax (e.g. namespaces) despite evidence that doing so is not
   necessary, and it makes sacrifices regarding potential
   optimisations (like making the tokeniser case-insensitive, avoiding
   substring searches, avoiding attribute searches) despite evidence
   that browsers consider performance critical.

 - It leaves some aspects quite poorly defined, such as how encoding
   errors are handled, exactly where parse errors are to be
   established as occurring, and how the XML parser is expected to 
   interact with document.write().

 - It rather poorly handles typical authoring mistakes such as copying
   and pasting half of an SVG or MathML fragment into an HTML page, or
   omitting namespace declarations altogether.

I think all these problems would have to be resolved, as well as many
more of the goals above being met without caveats, before the proposal
could be considered for inclusion in HTML5.


On Tue, 15 Jul 2008, Robin Berjon wrote:
> On Jul 14, 2008, at 16:39 , Andrew Sidwell wrote:
> > It's not that I believe the tokeniser should not be case-preserving; I 
> > just think that if your motivation is just to make weirdly-cased tags 
> > not trigger XML parsing, then that's not a useful route to pursue.
> [...]
> > I understand you want to be compatible with existing SVG content, but 
> > this is a place where you shouldn't be.  <svg xmlns="..."> is quite 
> > enough.
> 
> Something is either compatible, or it's not, and in this case that's 
> not. There may be a case to be made that certain aspects of 
> compatibility could be dropped, but if so we should be clear that that's 
> what we're discussing.

I agree that we should be clear on what we want. In general, I think that 
we should aim to be compatible with the needs demonstrated by existing 
content (which, based on Henri's data, supports Andrew's suggestion).


On Mon, 21 Jul 2008, Henri Sivonen wrote:
>
> How do those tools behave if the SVG content is wrapped in some HTML? 
> You'd need to extract the SVG part and put it in a standalone SVG file, 
> right? The most sensible and robust way of doing that is reserializing 
> an SVG subtree as XML in a browser (like Firefox offers to show a 
> serialization of a MathML subtree today). When round-tripping is 
> achieved this way, it is unnecessary to try to limit what goes over the 
> wire to look like XML.

I do agree that the above, or making SVG editors simply support text/html 
natively, is in general a better solution than constraining the syntax to 
be limited to what today's tools support.


Regarding case-sensitivity as a mechanism for ensuring that future user 
agents don't have to do anything (beyond actually implementing the 
feature) to support an attribute with capital letters, Henri continued:

> However, introducing new camelCase names and having a list of camelCase 
> names in the parser can be reconciled: When a user agent implements 
> support for a camelCase element in the rendering engine the element name 
> is added to the list in the parser of that UA. Adding an entry to a list 
> of well-known tokens is *trivial* compared to implementing a new SVG 
> feature in the rendering engine.

I agree with this. The syntax side of things is trivial compared to the 
actual implementation and testing cost of a browser feature.
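
To show how small the syntax side is, here is a minimal Python sketch of 
the kind of list Henri describes. The table is hypothetical and 
deliberately incomplete (the element names are real SVG names, but the 
function and variable names are mine): the tokeniser stays 
case-insensitive, and a lookup restores the camelCase spelling for names 
the UA knows about:

```python
# Hypothetical, deliberately incomplete table: lower-cased token -> name as
# spelled in SVG. A real parser would list every camelCase name it supports.
SVG_CAMEL_CASE_NAMES = {
    name.lower(): name
    for name in ("clipPath", "foreignObject", "linearGradient",
                 "radialGradient", "textPath")
}


def adjust_svg_tag_name(raw_name):
    """Lower-case the tag name as the tokeniser would, then restore the
    camelCase spelling if the name is in the table."""
    lowered = raw_name.lower()
    return SVG_CAMEL_CASE_NAMES.get(lowered, lowered)


print(adjust_svg_tag_name("FOREIGNOBJECT"))
print(adjust_svg_tag_name("Circle"))
```

Supporting a new camelCase element then means adding one string to the 
tuple, next to whatever rendering work the feature itself needs.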


Henri also asks:

> Has this proposal been implemented experimentally?

This is a good question. Has anyone implemented this? It would be very 
useful to be able to compare notes. (Experimental implementations are a 
key part of HTML5's development process.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 28 August 2008 02:08:12 UTC