- From: Larry Masinter <masinter@adobe.com>
- Date: Sun, 29 Apr 2012 19:54:40 -0700
- To: Robin Berjon <robin@berjon.com>, David Carlisle <davidc@nag.co.uk>, "Bjoern Hoehrmann (derhoermi@gmx.net)" <derhoermi@gmx.net>
- CC: "www-tag@w3.org" <www-tag@w3.org>
Since we're talking about XML-ER: I can't tell from looking at the doc how XML-ER deals with unclosed tags.

In http://lists.w3.org/Archives/Public/public-html-xml/2012Jan/0009.html Bjoern Hoehrmann wrote:

> HTML is first and foremost a bad language, simply because if you have a
> document with `<x><y>...` you cannot know whether the <y> is child or
> sibling to <x>, and whether `<y>` is character data or a start-tag,
> unless you recognize both elements. Which you might not since elements are
> added to the language all the time. Since documents do not convey this
> on their own this means that HTML parsers require maintenance, and it'd
> seem clear that something that requires maintenance is vastly more
> complex than something that does not.

In http://tools.ietf.org/html/bcp70#section-2 ("Guidelines for the Use of XML within IETF Protocols") we noted that one of the advantages of XML is that "data framing" is built in. I'm not sure this is the correct word: http://en.wikipedia.org/wiki/Frame_(networking) says: "making it possible for the receiver to detect the beginning and end of the packet in the stream of symbols or bits." But of course we're not looking at packets...

And then ... "If a receiver is connected to the system in the middle of a frame transmission, it ignores the data until it detects a new frame synchronization sequence." Which isn't really true for XML in general either -- you can't really build an XML stream processor which can start in the middle of an XML stream and "catch up", can you? (Not sure how XMPP handles this.)

So I'll call what is desirable about XML "self-delimiting" rather than "framing", but it's the same idea: if you're looking for <x> elements, can you just do a simple string scan for <x> before kicking in a more complicated parser? (OK, maybe you also have to scan for <x> OR entity declarations.)
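To make the "string scan first, parser second" idea concrete, here is a minimal sketch in Python. The function names (`might_contain`, `find_elements`) are made up for illustration, and it assumes unprefixed element names and treats the presence of any DOCTYPE as a conservative stand-in for "scan for entity declarations". It is not how any particular middleware does it.

```python
import re
import xml.etree.ElementTree as ET


def might_contain(doc: str, name: str) -> bool:
    """Cheap pre-filter: could `doc` contain a start-tag for `name`?

    False positives are fine (e.g. '<x>' inside a comment or CDATA);
    the full parser sorts those out. The goal is to avoid false negatives.
    """
    # '<name' followed by whitespace, '/' or '>', so 'x' does not match 'xy'.
    literal_tag = re.search(r"<%s[\s/>]" % re.escape(name), doc) is not None
    # Assumption: a DOCTYPE might declare entities that expand to markup,
    # which a plain string scan cannot see, so such documents always pass.
    has_doctype = "<!DOCTYPE" in doc
    return literal_tag or has_doctype


def find_elements(doc: str, name: str) -> list:
    """Run the full XML parser only if the cheap scan says it is worth it."""
    if not might_contain(doc, name):
        return []
    root = ET.fromstring(doc)
    return list(root.iter(name))
```

The pre-filter needs no knowledge of the vocabulary: it never has to be told what `<x>` or its neighbours mean, which is the property at issue here. The same shortcut is not available for HTML, where deciding whether `<y>` is even a start-tag depends on what `<x>` is.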
This isn't so much a performance feature (performance is dominated by memory latency of the data), but it is, as Bjoern points out, a maintenance issue. I think that being self-delimiting is a key contributor to the X ("eXtensible") of XML, and I can't quite tell from the discussion of XML-ER whether it's a requirement or even a feature of the current proposal.

Self-delimiting is clearly something HTML **doesn't have**, since you can't tell whether, in <x><y>, <y> is a sibling or child of <x> without knowing something about <x> and <y> and their relationship.

Self-delimiting is much more important for middleware in protocol pipelines, where the middle component is scanning or transforming content. Self-delimiting is not nearly as important if you're building an endpoint which is expected to interpret all the content anyway... either a browser or a very complete, broad-scale search engine or crawler.

Part of the XML/HTML tension has been disagreement about the importance of self-delimiting, and about the impact of having no way to design middleware systems that can scan or transform without full parsing.

If XML-ER _is_ self-delimiting and HTML is not, then how does XML-ER help with the XML/HTML divergence, if at all? And if not that, what does it help with?

Larry
--
http://larry.masinter.net
Received on Monday, 30 April 2012 02:55:24 UTC