Re: Options for dealing with IDs from Rick Jelliffe on 2003-01-16 (www-tag@w3.org from January 2003)

From: Rick Jelliffe <ricko@topologi.com>
Date: Fri, 17 Jan 2003 04:00:46 +1100
To: <www-tag@w3.org>
Message-ID: <002b01c2bd80$d0a0ae10$4bc8a8c0@AlletteSystems.com>

From: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>
 
> Why would you want to restrict the syntax of the documents 
> you can process? (Yes, I know SOAP does this. I think SOAP is wrong, 
> and this brain damage should not be encouraged to propagate into 
> other domains.)  I don't want to allow subsets of XML syntax to be 
> defined and required. It's an interoperability disaster. 

I think it is the current definition of well-formed that is the interoperability
"disaster".   

As Siméon and Wadler point out in 
http://www.research.avayalabs.com/user/wadler/papers/xml-essence/xml-essence.pdf,
one of the important properties of an external data-representation format is
round-tripping.

The current situation, where you don't know what infoset a parser will produce
when you give it a document, means that there is a flaw at the heart of XML,
one which should be removed sooner rather than later.  People wrongly attribute
the interoperability problem to "entities" in general (often while suggesting
some other kind of link whose influence on the information set is even less
well defined).

Now by "what infoset a parser will produce" I don't mean minor things like
the status of CDATA sections, but very major things: whether an attribute
is present, and (most significantly for downstream processing) whether that
attribute provides a namespace.  
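
A minimal Python sketch of the problem, using the standard library's expat
binding. The document, the namespace URI and the helper name root_attributes
are made up for illustration. Toggling expat's specified_attributes flag is
only a stand-in for two different conforming processors, one that applies
attribute declarations and one that ignores them:

```python
import xml.parsers.expat

# Hypothetical document: the only namespace declaration on <doc> comes
# from an attribute default in the internal DTD subset.
DOC = b"""<!DOCTYPE doc [
  <!ATTLIST doc xmlns CDATA "http://example.org/ns">
]>
<doc/>"""


def root_attributes(data, use_declarations):
    """Return the attributes reported for the root element.

    use_declarations=False emulates a processor that reports only the
    attributes literally present in the instance.
    """
    seen = []
    parser = xml.parsers.expat.ParserCreate()  # no namespace processing
    parser.specified_attributes = 0 if use_declarations else 1

    def start(name, attrs):
        seen.append(dict(attrs))

    parser.StartElementHandler = start
    parser.Parse(data, True)
    return seen[0]


with_defaults = root_attributes(DOC, use_declarations=True)
without_defaults = root_attributes(DOC, use_declarations=False)
```

The same bytes give a root element with an xmlns attribute in the first case
and with no attributes at all in the second, so a namespace-aware consumer
downstream would see two different element names.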

Which is why I think we need to move to four kinds of XML documents
and processors:

    - headless (e.g. for SOAP, similar to Norm's suggestion)
    - well-formed (deprecated)
    - infoset-complete but unvalidated (e.g. for XHTML)
    - valid

To recap, the infoset-complete-but-unvalidated documents/processors
would have exactly the same infoset as a valid document. However,
the parser would not need to understand content models, nor test
that enumerated attribute values matched their declarations.
For each element type, a processor would have to keep track of whether
it allows PCDATA (to report whitespace correctly), what the default values
for its attributes are, which attributes are IDs or IDREFs, and what
tokenizing or space-normalizing is needed for each attribute value.  But no
DFAs for content models.
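
As a rough sketch of how little per-element state that is, here is one
possible shape for it in Python. The names AttrDecl, ElementDecl,
complete_attributes and is_element_content_whitespace are invented for this
example, and the normalization is simplified to collapsing whitespace for
non-CDATA attributes:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AttrDecl:
    type: str = "CDATA"            # e.g. "CDATA", "ID", "IDREF", "NMTOKENS"
    default: Optional[str] = None  # declared default value, if any


@dataclass
class ElementDecl:
    allows_pcdata: bool = True
    attrs: Dict[str, AttrDecl] = field(default_factory=dict)


def complete_attributes(decl, specified):
    """Normalize specified attributes, then add declared defaults."""
    result = {}
    for name, value in specified.items():
        attr = decl.attrs.get(name)
        if attr is not None and attr.type != "CDATA":
            # Tokenized types: trim and collapse runs of whitespace.
            value = " ".join(value.split())
        result[name] = value
    for name, attr in decl.attrs.items():
        if name not in result and attr.default is not None:
            result[name] = attr.default
    return result


def is_element_content_whitespace(decl, text):
    """Whitespace is only ignorable where PCDATA is not allowed."""
    return not decl.allows_pcdata and text.strip() == ""


# Hypothetical declarations for a single element type.
para = ElementDecl(
    allows_pcdata=False,
    attrs={
        "id": AttrDecl("ID"),
        "class": AttrDecl("NMTOKENS"),
        "lang": AttrDecl("CDATA", "en"),
    },
)
attributes = complete_attributes(para, {"class": "  a   b "})
```

Nothing here looks at content models or checks enumerated values; it is a
table lookup per element and per attribute.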

Well-formed should be a category of minority interest to editor-application
developers, not something for public usage.

Cheers
Rick Jelliffe

Received on Thursday, 16 January 2003 11:59:14 UTC