- From: Aryeh Gregor <ayg@aryeh.name>
- Date: Thu, 14 Jun 2012 13:23:14 +0300
- To: Elliott Sprehn <esprehn@gmail.com>
- Cc: Ryosuke Niwa <rniwa@webkit.org>, Boris Zbarsky <bzbarsky@mit.edu>, Ojan Vafai <ojan@chromium.org>, www-dom <www-dom@w3.org>
On Tue, Jun 12, 2012 at 11:55 PM, Elliott Sprehn <esprehn@gmail.com> wrote:
> I'd much rather the spec require serialization to respect the document mode
> and output the doctype required to ensure that reading in the serialized
> document produces the same document state.
>
> That is we should require Unserialize(Serialize(document)) == document

That's impossible in general, because some markup can be created in the DOM but has no serialized representation. For instance,

data:text/html,<script>document.appendChild(document.createComment("-->"))</script>

There you have a DOM that will never be created by an HTML *or* XML parser, for any input. (Gecko refuses to create the comment in this case, but that's a bug. Anyway, you can come up with other examples without much trouble, like consecutive or empty text nodes.)

You're talking about beyond the DOM level, though -- you want a serialized standards-mode document to parse as standards-mode. This is an interesting thought, but you could also talk about all sorts of other metadata that the parser infers and that doesn't later change even if the DOM does. For instance, would you also like to ensure that the charset that the document is serialized in matches the charset that will be parsed? What about the HTML vs. XML flag -- what if I have an HTML document and serialize it as XML? Do you want it to only become an HTML document, not an XML document? What happens if the root element has a manifest attribute, and script removes or changes it later on? Should the serializer reinsert it? What if there was a <base> that was in the markup, and then an <img>, so the <img>'s src was resolved relative to the <base>, but the <base> was later removed? Should the <img>'s src be changed, or a new <base> inserted, or what? What if the <base> was removed by a <script> that also removed itself?
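The consecutive-text-node case mentioned above can be reproduced outside a browser. The following is only a sketch, and it assumes Python's xml.dom.minidom as a stand-in for a browser DOM and serializer; the element name and text content are made up for illustration.

```python
from xml.dom.minidom import Document, parseString

# Build a DOM by hand with two adjacent text nodes under one element.
# A parser would never produce this shape on its own.
doc = Document()
root = doc.createElement("p")
doc.appendChild(root)
root.appendChild(doc.createTextNode("foo"))
root.appendChild(doc.createTextNode("bar"))

# Serialize the element, then parse the result back in.
serialized = root.toxml()
reparsed = parseString(serialized)

# Compare the number of child nodes before and after the round trip.
original_count = len(root.childNodes)
reparsed_count = len(reparsed.documentElement.childNodes)
print(serialized, original_count, reparsed_count)
```

The boundary between the two text nodes has no representation in the markup, so the reparsed element ends up with a single merged text node and the round trip does not reproduce the original tree.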
Note that in all of these cases -- including your proposal to inject doctypes -- the modifications would not only add complexity to the serializer, but also make the reparsed DOM not match the existing DOM. This could break all kinds of things, like if there's a script that relies on document.firstChild being <html> but now it's the doctype that you inserted.

HTML pages are dynamic, and serialized markup is never going to fully or correctly encode the current state of the page if scripts have messed with it. All else being equal, it's better if we serialize a more accurate reflection of the current page's state, but IMO this is a weak reason by itself to add complexity to serializers. This would only be a good idea if authors are actually hitting problems in practice.
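The document.firstChild concern can be illustrated the same way. This is a rough sketch, again assuming Python's xml.dom.minidom in place of a browser; serialize_with_doctype is a hypothetical helper standing in for a serializer that injects a doctype, not any real API.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# A document with no doctype: its first child is the root element.
original = parseString("<html><body/></html>")


def serialize_with_doctype(doc):
    """Hypothetical serializer that always prepends a doctype."""
    return "<!DOCTYPE html>" + doc.documentElement.toxml()


reparsed = parseString(serialize_with_doctype(original))

# nodeType is compared rather than nodeName, because the doctype
# node is also named "html".
first_before = original.firstChild.nodeType
first_after = reparsed.firstChild.nodeType
print(first_before == Node.ELEMENT_NODE, first_after == Node.DOCUMENT_TYPE_NODE)
```

A script that assumed the first child of the document was the root element would see a different node after the round trip, even though the root element itself is unchanged.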
Received on Thursday, 14 June 2012 10:24:08 UTC