Request for Erratum to XML 1.0 and 1.1 Specs from Rick Jelliffe on 2003-10-21 (xml-editor@w3.org from October to December 2003)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Tue, 21 Oct 2003 18:48:24 +1000
To: xml-editor@w3.org
Cc: W3c I18n Group <w3c-i18n-ig@w3.org>, w3c-xml-plenary@w3.org
Message-ID: <3F94F2D8.9030000@allette.com.au>
Request for Erratum to XML 1.0 and 1.1 Specs
----------------------------------------------
Rick Jelliffe, ricko@topologi.com, 2003-10-21


I request the XML Working Group please consider the following erratum
to XML 1.0 which should also apply to XML 1.1.

The following two paragraphs, or something to the same effect, should be 
appended to section 5.1 "Validating and Non-Validating Processors":



"A non-validating processor may, at user option, imply definitions for
all the character entities defined by HTML 4[1]. A document or entity 
for which definitions are implied is not well-formed. The processor must 
report a non-fatal error. NOTE: The document is 'not well-formed but 
processed'. Reliance on this feature by specifications is deprecated; 
this option may be withdrawn at some future time should it prove
dangerous."

"A non-validating processor which provides the HTML 4
definitions may, at user option, also imply definitions for other
MathML and ISO standard sets[2]. A processor must report a non-fatal
error. The document is 'not well-formed but processed'. NOTE: Reliance 
on this feature by specifications is deprecated; this option may be 
withdrawn at some future time should it prove dangerous."

[1] http://www.w3.org/TR/html401/sgml/entities.html
[2] http://www.w3.org/TR/MathML2/chapter6.html#chars_entity-tables
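
As an illustration of the behaviour the first paragraph describes, here is a
minimal sketch in Python. It is not part of the proposed wording and not a real
XML processor: it works on raw text, ignores comments and CDATA sections, and
uses a simplified pattern for names. The function name, the `declared` and
`enabled` parameters, and the error message text are all hypothetical. The
HTML 4 names come from Python's standard `html.entities.name2codepoint` table.

```python
import re
from html.entities import name2codepoint  # HTML 4 character entity names

# The five references predefined by XML itself; these are never touched.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Simplified pattern for a named reference (assumption: real XML Names
# allow many more characters than this).
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def imply_html4_entities(text, declared=frozenset(), enabled=True):
    """Expand undeclared HTML 4 named references in `text`.

    Returns (expanded_text, errors). A non-empty error list means the
    input was 'not well-formed but processed'.
    """
    errors = []

    def expand(match):
        name = match.group(1)
        if name in PREDEFINED or name in declared:
            return match.group(0)  # already defined: leave as is
        if enabled and name in name2codepoint:
            # Hypothetical wording for the non-fatal error report.
            errors.append(f"non-fatal error: definition implied for &{name};")
            return chr(name2codepoint[name])
        return match.group(0)  # left for the parser to reject as usual

    return NAMED_REF.sub(expand, text), errors
```

With the option off (`enabled=False`), or with input that contains no
undeclared references, the text comes back unchanged and the error list is
empty, which matches the claim that no existing document changes status.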



This suggested erratum has the following characteristics:

1) It does not require any change to any XML processor
2) It does not change the basic XML characteristic that the
only way to guarantee information is received at the other
end is to use a UTF-* encoding, no entities and no attribute
defaulting.
3) It maintains the current layering, so no re-architecting
or change in design is needed
4) It keeps the XML specification as the location on how to
go from characters to data+markup.

5) It does not make any existing valid XML document invalid
6) It does not make any existing invalid XML document valid
7) It does not make any existing WF document or entity non-WF
8) It does not make any existing non-WF document formally WF

9) It does allow the continued non-validating processing of
documents which are non-WF only because they contain standard
references
10) It limits this to user option
11) It does not allow other specifications to use this as
its default
12) It can be withdrawn

13) I believe it is practical and would be simple to implement.



I believe the beneficiaries of such an erratum include:

  * Users typing in editors with no adequate input methods
  for non-ASCII characters. I note that although Unicode
  editors can display many characters, not all operating
  systems have input methods to allow convenient data entry
  even of Latin1 characters. (I believe this is better provided
  by using decent XML markup editors, without prejudice.)

  * XHTML users who are used to named references without declarations
  in HTML.

  * Potential XInclude users, who may wish
  to treat a WF parsed entity from a document that uses
  standard character references as a microdocument

  * Potential XML Schemas, Schematron and RELAX NG users who
  may wish to upgrade from DTDs.

  * Potential XQuery users who are being hindered by the lack
  of XML Schemas.

  * XML pipeline systems which can pass XML without requiring
   tricky prologs

  * SOAP, RSS and RDF systems which must cope with data fragments
  from externally-generated documents being embedded

  * Programmers serializing data to XML, especially for internal
   systems, who may prefer to generate "&mdash;" or "&nbsp;"
   rather than the numeric or literal equivalents.

  * Vendors who make products for the above

  * Low-sight or motion-impaired users whose speech synthesizers
   or input methods only support ASCII characters. Aged, enraged
   or diminished capacity users who may be frustrated at having
   to look up the number for something they know the name for.
   (Though I do not want to suggest that "entity rage" is a hidden
   problem.)
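
For the serialization case in the list above, a minimal sketch of a serializer
helper that prefers HTML 4 names over literal characters might look like the
following. The function name is hypothetical; the lookup table is Python's
standard `html.entities.codepoint2name`. Escaping of markup characters such as
"<" and "&" is assumed to happen elsewhere.

```python
from html.entities import codepoint2name  # code point -> HTML 4 entity name


def to_named_references(text):
    """Replace non-ASCII characters that have an HTML 4 name with &name;.

    Characters without an HTML 4 name are passed through as literals.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 127 and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")
        else:
            out.append(ch)
    return "".join(out)
```

Under this sketch U+2014 is written as "&mdash;" and U+00A0 as "&nbsp;", and
the output is only usable by a receiver that has the implied definitions
enabled or that declares the entities itself.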


I suggest its benefits over other suggested approaches include:

  * It does not require change to subsequent processes, as PSVI
   processing would, nor any changes or additions to schema
   specifications

  * It does not require pre-processing, as a macro processor would

  * It does not require the introduction and deployment of new
   transcoders, as would Tim Bray and John Cowan's recent thought
   experiment "UTF-8+Names"

  * It does not require interaction with other standards groups, notably
   XML Schemas EG or IANA or IETF.

  * By providing it at user option, it can succeed or fail; if it is
  popular and successful, that is good; if it is unpopular or unsafe,
  it can be withdrawn.

  * By limiting itself to the HTML and the MathML/ISO entities, it
   avoids issues of user-defined entities, and the need to enumerate
   the entities.

  * It does not define mappings for those characters, but defers to
   HTML and MathML/ISO, which may provide standard mappings.

This gives a very wide constituency.

I note that Xerces' SAX 2 implementation provides features by which a
parser can continue processing after an error. This proposal could be
seen as a very limited nod of recognition of that kind of practice.


Cheers
Rick Jelliffe
Received on Tuesday, 21 October 2003 04:48:29 UTC