Alternate wording in terms of applications and infoset (was Re: Request for Erratum to XML 1.0 and 1.1 Specs) from Rick Jelliffe on 2003-10-22 (xml-editor@w3.org from October to December 2003)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Wed, 22 Oct 2003 19:51:09 +1000
To: Chris Lilley <chris@w3.org>
Cc: xml-editor@w3.org, W3c I18n Group <w3c-i18n-ig@w3.org>, w3c-xml-plenary@w3.org
Message-ID: <3F96530D.2020300@allette.com.au>

Chris Lilley wrote:

 > In my view, adding another XML conformance level below well formed is
 > not an erratum. Its a major change to the language.

Actually, the class "not well-formed but processed as well-formed
anyway" exists in the XML Spec, because not all WF errors need be
reported by a non-validating processor. Only validating parsers
report all WF errors.**

<GIST>
I have thought through the comments that Chris, Noah and others have
raised, and gone through the spec again. I think that there is an
alternative way to achieve the same effect that I (and others such as
Martin) think is good (that this should be handled by XML APIs,
transparently to application programmers), still limited to standard
character names, but which does not create an apparent new conformance
level, nor change any definitions of WF and Valid, nor change the
status of any document.

How to do this impossible thing?

Instead of the previous proposal, just append something like the
following paragraph after the first paragraph of XML 4.4.3
Included If Validating
   http://www.w3.org/TR/REC-xml#include-if-valid

"Applications *may* replace any Unexpanded Entity Reference Information 
Items[1] which have no replacement text or system identifier defined 
with the value of the ISO/HTML 4 or the ISO/MathML standard character 
entities of the same name."

[1] http://www.w3.org/TR/xml-infoset/#infoitem.rse

and then the XML WG should make it clear (when releasing the rationale
for this) that the appropriate place for this to occur is in the SAX
processor rather than on an application-by-application basis.
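
A minimal sketch, in Python, of what the proposed paragraph would
permit. It assumes a toy representation of content as a list of items,
where a plain string is character data and a hypothetical
`UnexpandedRef` object stands for an Unexpanded Entity Reference
Information Item with nothing declared for it. Python's
`html.entities.name2codepoint` table stands in for the HTML 4 entity
sets; the MathML sets are not covered here.

```python
from dataclasses import dataclass
from html.entities import name2codepoint  # HTML 4 character entity names


@dataclass(frozen=True)
class UnexpandedRef:
    """Hypothetical stand-in for an Unexpanded Entity Reference Information Item."""
    name: str


def apply_default_entities(items):
    """Replace references that name a standard character entity; keep the rest."""
    result = []
    for item in items:
        if isinstance(item, UnexpandedRef) and item.name in name2codepoint:
            # The name matches a standard entity: substitute its character.
            result.append(chr(name2codepoint[item.name]))
        else:
            # Character data and unknown references pass through untouched.
            result.append(item)
    return result
```

References whose names are not in the standard table are left as they
were, so the application still sees them as unexpanded.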

Specifications for applications such as XSLT and XML Schema may also
specify that Unexpanded Entity Reference Information Items may/must be
replaced with default values before schema-processing, to give added
impetus.

XML needs an erratum to make it clear that this is currently
allowed, because people (including myself) have not been clear on
it, and W3C applications might like to put in similar errata as
well. The aim is to encourage this into SAX, where it belongs.
There is already text in 4.4.3 to clarify what applications
may do, but this important case has been left out and is currently
causing confusion.

</GIST>

Currently people think they are banned from doing anything with
an Unexpanded Entity Reference Information Item by Draconian
considerations, whereas actually the XML Spec is silent.

My mistake has been to conflate the "XML Processor" with "the thing
that produces a SAX stream". We can keep the definition of XML
the same, but get SAX parsers to expand default entities, with the
justification "this is application behaviour, but implemented tightly
with the parser".
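
As a rough illustration of "application behaviour implemented tightly
with the parser", here is a sketch of a SAX filter using Python's
`xml.sax` module. The class names `DefaultEntityFilter` and `Collector`
are made up for this example, and the HTML 4 names in
`html.entities.name2codepoint` stand in for the standard entity sets.
Whether a given parser actually reports an undeclared reference through
`skippedEntity` depends on the parser and the document, so the sketch
covers only the layer that sits after the parser.

```python
from html.entities import name2codepoint  # HTML 4 character entity names
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class DefaultEntityFilter(XMLFilterBase):
    """Turn skipped standard character entities into character events."""

    def skippedEntity(self, name):
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Not a standard name: report it downstream unchanged.
            self._cont_handler.skippedEntity(name)
        else:
            # Standard name: deliver the character as ordinary content.
            self._cont_handler.characters(chr(codepoint))


class Collector(ContentHandler):
    """Hypothetical downstream application that records what it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def characters(self, content):
        self.events.append(("characters", content))

    def skippedEntity(self, name):
        self.events.append(("skippedEntity", name))
```

An application would install the filter between the parser and its own
handler with `setContentHandler`, and would then receive character data
for the standard names without knowing a substitution took place.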

Now expanding an Unexpanded Entity Reference Information Item is
definitely something that applications are allowed to do:

"Browsers, for example, when encountering an external parsed entity 
reference, might choose to provide a visual indication of the entity's 
presence and retrieve it for display only on demand."
  http://www.w3.org/TR/REC-xml#include-if-valid

But this does not cover the case of what an application can do when
there is no replacement text or SYSTEM or PUBLIC identifier for
an Unexpanded Entity Reference Information Item.

So this is something that currently slips between the cracks of
the XML Infoset and the XML spec: the XML spec is concerned with
parsing rather than information, and the infoset is concerned with
information rather than parsing. There is nothing in either spec
that I can see that prevents an application from attempting to
dereference Unexpanded Entity Reference Information Items using the
standard entity sets.

So, actually, this is something that, I guess, XML Schema and
XSLT and XQuery etc could all specify independently. Or it
could be made part of some notional layer between XML processing
and the infoset. But I believe the simplest thing, and the
thing that would make it available to the broadest range of
XML users, would be to clarify that it is allowed so that
SAX (in particular) can add the feature and we can all go
back to our business.

That XML APIs do not make this available currently shows that
they have a mistaken view of what is required of an XML processor;
mistakes about what is allowed of XML processors are grounds
for an erratum.

I guess this approach also moves a bit towards Michael's suggestion,
in that it says the answer lies in something *after* XML proper
but somehow before Schema processing. In practice, I think it
is better to encourage this into generic SAX processors, though
if the schema spec also makes it a requirement of schema processing,
that does no harm that I can see (because the substitutions
can occur as a layer before other schema processing, and specs
such as XSLT 1 or Schematron that may want the effect without
heavyweight schema-processing can get it).

As for the concern that it would be bad if some documents that
were non-WF become WF, I think the rewording deals with that.
There may also be some value in adopting a stripped-down
version of Richard's proposal and reserving a special attribute
such as @xmlEntityDefaulting="true", which allows a non-validating
parser to perform this error recovery but makes other parsers
barf (due to the attribute name starting with "xml"), but I
don't think it is needed.

Background quotes from the XML Spec's Conformance section
  http://www.w3.org/TR/REC-xml#sec-conformance

"The behavior of a validating XML processor is highly predictable; it 
must read every piece of a document and report all well-formedness and 
validity violations. Less is required of a non-validating processor; it 
need not read any part of the document other than the document entity. "

and

"For maximum reliability in interoperating between different XML 
processors, applications which use non-validating processors should not 
rely on any behaviors not required of such processors. Applications 
which require facilities such as the use of default attributes or 
internal entities which are declared in external entities should use 
validating XML processors."



Cheers
Rick Jelliffe



** See http://www.w3.org/TR/REC-xml#sec-conformance

"Certain well-formedness errors, specifically those that require reading 
external entities, may not be detected by a non-validating processor. 
Examples include the constraints entitled Entity Declared, Parsed 
Entity, and No Recursion, as well as some of the cases described as 
forbidden in 4.4 XML Processor Treatment of Entities and References."

Received on Wednesday, 22 October 2003 05:51:14 UTC