Re: XML-ER and self-delimiting from David Carlisle on 2012-04-30 (www-tag@w3.org from April 2012)

From: David Carlisle <davidc@nag.co.uk>
Date: Mon, 30 Apr 2012 09:43:08 +0100
To: Larry Masinter <masinter@adobe.com>
Cc: Robin Berjon <robin@berjon.com>, "Bjoern Hoehrmann (derhoermi@gmx.net)" <derhoermi@gmx.net>, "www-tag@w3.org" <www-tag@w3.org>
Message-ID: <4F9E509C.80708@nag.co.uk>

On 30/04/2012 03:54, Larry Masinter wrote:
[Not sure why this was cc'ed to me rather than to the xml-er list, but 
anyway...]

> Since we're talking about XML-ER. I can't tell from looking at the doc
> at all how XML-ER deals with unclosed tags.

The only tricky/contentious part of the current xml-er draft is deciding 
what a tag is. Once you have that, the handling of unclosed tags is 
fairly trivial (and the same as HTML, apart from the HTML parser's 
built-in special handling of certain element names).
When you reach a close tag, you just close all elements on the stack 
up to and including the nearest open element of the right name (or you 
ignore the close tag if there is no such element). More or less: the 
devil is in the details, which are in the draft spec.

> So I'll call what is desirable about XML is "self-delimiting" rather than
> "framing", but it's the same idea: if you're looking for <x> elements,
> can you just do a simple string scan for <x> before kicking in a more
> complicated parser. (OK, maybe also you have to scan for <x> OR
> entity declarations.)

There are so many caveats that this is at best only just true.
You also have to look for <x >, and you have to skip over CDATA sections, 
comments, and processing instructions. Not to mention the black hole 
of needing to know what character encoding the document is using.
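
To make those caveats concrete, here is a minimal sketch of a scan that is somewhat less naive than a plain search for the string "<x>". It assumes the input has already been decoded to text (so the encoding problem is sidestepped, not solved), and it does not cope with ">" inside attribute values or with unterminated comments. The name `find_start_tags` is made up for the example.

```python
import re

# Comments, CDATA sections and processing instructions are tried first, so
# any "<x>" inside them is consumed and never reported as a tag.
_TOKEN = re.compile(
    r"<!--.*?-->"            # comment
    r"|<!\[CDATA\[.*?\]\]>"  # CDATA section
    r"|<\?.*?\?>"            # processing instruction
    r"|<(?P<name>[^\s<>/!?]+)(?:\s[^<>]*)?/?>",  # start or empty-element tag
    re.DOTALL,
)


def find_start_tags(text, name):
    """Return the offsets of start (or empty-element) tags called `name`."""
    return [m.start() for m in _TOKEN.finditer(text) if m.group("name") == name]


doc = "<a><!-- <x> --><![CDATA[<x>]]><?pi <x>?><x ><x/><xy></x></a>"
for offset in find_start_tags(doc, "x"):
    print(offset, doc[offset:offset + 4])
```

A plain `doc.find("<x>")` would land inside the comment here and would miss both `<x >` and `<x/>`.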

> Self-delimiting is clearly something HTML **doesn't have**, since
> you can't tell whether in <x><y> whether <y> is a sibling or
> child of <x> without knowing something about <x> and <y> and
> their relationship.

xml-er parsing in the current draft has no knowledge of any particular 
schema, so there is no predefined list of empty/void elements. So <x><y> 
(if that is the complete document) parses as <x><y/></x>: both are parsed 
as open tags, and the stack of open elements is closed off at EOF.
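
The <x><y> example can be reproduced with a toy parser along these lines. It is a sketch under heavy assumptions and not the draft's tokenizer: it recognises only bare start and end tags (no attributes, text, comments, or entities), and `Node`, `parse`, and `serialize` are names invented for the illustration.

```python
import re

# Deliberately tiny notion of "a tag": <name> or </name>, nothing else.
_TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)\s*>")


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []


def parse(text):
    root = Node("#document")
    stack = [root]
    for m in _TAG.finditer(text):
        is_close, name = m.group(1) == "/", m.group(2)
        if not is_close:
            # No void-element list: every start tag opens a new element.
            node = Node(name)
            stack[-1].children.append(node)
            stack.append(node)
        elif any(n.name == name for n in stack[1:]):
            # Close everything up to and including the nearest match.
            while stack.pop().name != name:
                pass
        # A close tag with no matching open element is ignored.
    # At end of input, whatever is still on the stack is implicitly closed.
    return root


def serialize(node):
    inner = "".join(serialize(child) for child in node.children)
    if node.name == "#document":
        return inner
    return f"<{node.name}>{inner}</{node.name}>" if inner else f"<{node.name}/>"


print(serialize(parse("<x><y>")))  # <x><y/></x>
```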

David

Received on Monday, 30 April 2012 08:43:34 UTC