XML-ER and self-delimiting from Larry Masinter on 2012-04-30 (www-tag@w3.org from April 2012)

From: Larry Masinter <masinter@adobe.com>
Date: Sun, 29 Apr 2012 19:54:40 -0700
To: Robin Berjon <robin@berjon.com>, David Carlisle <davidc@nag.co.uk>, "Bjoern Hoehrmann (derhoermi@gmx.net)" <derhoermi@gmx.net>
CC: "www-tag@w3.org" <www-tag@w3.org>
Message-ID: <C68CB012D9182D408CED7B884F441D4D194AC36D6C@nambxv01a.corp.adobe.com>

Since we're talking about XML-ER: I can't tell at all from looking at the doc
how XML-ER deals with unclosed tags.

In
http://lists.w3.org/Archives/Public/public-html-xml/2012Jan/0009.html
Bjoern Hoehrmann wrote:

> HTML is first and foremost a bad language, simply because if you have a
> document with `<x><y>...` you cannot know whether the <y> is child or
> sibling to <x>, and whether `<y>` is character data or a start-tag,
> unless you recognize both elements.  Which you might not since elements are
> added to the language all the time. Since documents do not convey this
> on their own this means that HTML parsers require maintenance, and it'd
> seem clear that something that requires maintenance is vastly more
> complex than something that does not.

In http://tools.ietf.org/html/bcp70#section-2 
("Guidelines for the Use of XML within IETF Protocols") we noted that
one of the advantages of XML is that "data framing" is built in. 
I'm not sure this is the correct word:

http://en.wikipedia.org/wiki/Frame_(networking) says:
  "making it possible for the receiver to detect the beginning and end 
   of the packet in the stream of symbols or bits." 

But of course we're not looking at packets...

And then ...

   "If a receiver is connected to the system in the middle of
   a frame transmission, it ignores the data until it detects a new frame 
   synchronization sequence."

Which isn't really true either for XML in general -- you can't really build
an XML stream processor which can start in the middle of an XML
stream and "catch up", can you? (not sure how XMPP handles this).
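
A minimal sketch of that problem, assuming Python's standard
xml.etree.ElementTree parser. The helper name `can_parse` and the sample
stream are made up for illustration. It only shows that an ordinary
parser rejects input that begins mid-document; it does not show that no
resynchronization scheme could exist.

```python
import xml.etree.ElementTree as ET

def can_parse(text: str) -> bool:
    """Return True if text parses as a complete, well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical stream, and the same stream as seen by a late joiner.
full = "<log><entry id='1'>a</entry><entry id='2'>b</entry></log>"
tail = "ry><entry id='2'>b</entry></log>"  # begins inside an end tag

print(can_parse(full))  # the whole document is well-formed
print(can_parse(tail))  # the mid-stream fragment is rejected
```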

So I'll call what is desirable about XML "self-delimiting" rather than
"framing", but it's the same idea: if you're looking for <x> elements,
can you just do a simple string scan for <x> before kicking in a more
complicated parser? (OK, maybe you also have to scan for <x> OR
entity declarations.) This isn't so much a performance feature
(performance is dominated by memory latency of the data), but it
is, as Bjoern points out, a maintenance issue.
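
The scan-then-parse idea can be sketched roughly as follows. This is an
illustration under stated assumptions, not a robust filter: the input is
assumed to be in an ASCII-compatible encoding such as UTF-8, the element
is unprefixed, and no entity declaration expands to markup. The names
`START_TAG` and `find_x_elements` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches "<x" followed by whitespace, "/" or ">", so "<xy>" is not a hit.
START_TAG = re.compile(rb"<x[\s/>]")

def find_x_elements(document: bytes):
    """Return the parsed <x> elements, skipping the parse when none can exist."""
    if not START_TAG.search(document):
        return []                       # cheap path: no "<x" start tag anywhere
    root = ET.fromstring(document)      # expensive path: full parse
    return list(root.iter("x"))         # the parser, not the scan, decides

print(len(find_x_elements(b"<a><x>1</x><y><x/></y></a>")))
print(find_x_elements(b"<a><b/></a>"))
```

A match inside a comment or CDATA section only costs an unnecessary
parse, since the full parser has the final say. The scan is safe as a
pre-filter only as long as a start tag cannot appear without its literal
text, which is why entity declarations matter.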

I think that being self-delimiting is a key contributor to the X
("eXtensible") of XML, and I can't quite tell from the discussion
of XML-ER whether it's a requirement or even a feature of
the current proposal.

Self-delimiting is clearly something HTML _doesn't have_, since
you can't tell whether, in <x><y>, <y> is a sibling or
child of <x> without knowing something about <x> and <y> and
their relationship.
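
A toy sketch of why that knowledge is needed. The rule table below is a
deliberately tiny, hypothetical stand-in for the implied-end-tag rules a
real HTML parser carries; `IMPLIED_END` and `relation` are names made up
for this example.

```python
# (open element, incoming start tag) pairs where the open element is
# closed implicitly. A real HTML parser has many more such rules and
# needs updating whenever elements are added to the language.
IMPLIED_END = {("p", "p"), ("li", "li")}

def relation(outer: str, inner: str) -> str:
    """Say how <inner> relates to <outer> in the input "<outer><inner>"."""
    if (outer, inner) in IMPLIED_END:
        return "sibling"   # <outer> is closed before <inner> opens
    return "child"         # no rule known, so <inner> nests inside <outer>

print(relation("p", "p"))    # sibling
print(relation("div", "p"))  # child
print(relation("x", "y"))    # child, but only because the table has no rule yet
```

In XML the answer never depends on such a table: in <x><y> the <y> is
always a child of <x>, because <x> stays open until its own end tag.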

Self-delimiting is much more important for middleware in protocol pipelines,
where the middle component is scanning or transforming content.

Self-delimiting is not nearly as important if you're building an endpoint
which is expected to interpret all the content anyway... either a 
browser or a very complete, broad-scale search engine or crawler.

Part of the XML/HTML tension has been disagreement about the 
importance of self-delimiting and the impact of not having any way of
designing middleware systems that can scan or transform 
without full parsing.

If XML-ER _is_ self-delimiting and HTML is not, then how, if at all,
does XML-ER help with the XML/HTML divergence? And if not that,
what does it help with?

Larry
--
http://larry.masinter.net

Received on Monday, 30 April 2012 02:55:24 UTC