- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Tue, 21 Oct 2003 18:48:24 +1000
- To: xml-editor@w3.org
- Cc: W3c I18n Group <w3c-i18n-ig@w3.org>, w3c-xml-plenary@w3.org
Request for Erratum to XML 1.0 and 1.1 Specs
----------------------------------------------
Rick Jelliffe, ricko@topologi.com, 2003-10-21

I request the XML Working Group please consider the following erratum to XML 1.0, which should also apply to XML 1.1.

The following two paragraphs, or something to the same effect, should be appended to section 5.1 "Validating and Non-Validating Processors":

"A non-validating processor may, at user option, imply definitions for all the character entities defined by HTML 4[1]. A document or entity for which definitions are implied is not well-formed. The processor must report a non-fatal error. NOTE: The document is 'not well-formed but processed'. Reliance on this feature by specifications is deprecated; this option may be withdrawn at some future time should it prove dangerous."

"A non-validating processor which provides the HTML 4 definitions may, at user option, also imply definitions for other MathML and ISO standard sets[2]. A processor must report a non-fatal error. The document is 'not well-formed but processed'. NOTE: Reliance on this feature by specifications is deprecated; this option may be withdrawn at some future time should it prove dangerous."

[1] http://www.w3.org/TR/html401/sgml/entities.html
[2] http://www.w3.org/TR/MathML2/chapter6.html#chars_entity-tables

This suggested erratum has the following characteristics:

1) It does not require any change to any XML processor
2) It does not change the basic XML characteristic that the only way to guarantee information is received at the other end is to use a UTF-* encoding, no entities and no attribute defaulting.
3) It maintains the current layering, so no re-architecting or change in design is needed
4) It keeps the XML specification as the location on how to go from characters to data+markup.
5) It does not make any existing valid XML document invalid
6) It does not make any existing invalid XML document valid
7) It does not make any existing WF document or entity non-WF
8) It does not make any existing non-WF document formally WF
9) It does allow the continued non-validating processing of documents which are non-WF only because they contain standard references
10) It limits this to user option
11) It does not allow other specifications to use this as their default
12) It can be withdrawn
13) I believe it is practical and would be simple to implement.

I believe the beneficiaries of such an erratum include:

* Users typing in editors with no adequate input methods for non-ASCII characters. I note that although Unicode editors can display many characters, not all operating systems have input methods to allow convenient data entry even of Latin1 characters. (I believe this is better provided by using decent XML markup editors, without prejudice.)
* XHTML users who are used to named references without declarations in HTML.
* Potential XInclude users, who may wish to treat a WF parsed entity from a document that uses standard character references as a microdocument
* Potential XML Schemas, Schematron and RELAX NG users who may wish to upgrade from DTDs.
* Potential XQuery users who are being hindered by the lack of XML Schemas.
* XML pipeline systems which can pass XML without requiring tricky prologs
* SOAP, RSS and RDF systems which must cope with data fragments from externally-generated documents being embedded
* Programmers serializing data to XML, especially for internal systems, who may prefer to generate "&mdash;" or "&nbsp;" rather than the numeric or literal equivalents.
* Vendors who make products for the above
* Low-sight or motion-impaired users whose speech synthesizers or input methods only support ASCII characters. Aged, enraged or diminished capacity users who may be frustrated at having to look up the number for something they know the name for.
  (Though I do not want to suggest that "entity rage" is a hidden problem.)

I suggest its benefits over other suggested approaches include:

* It does not require change to subsequent processes, as PSVI processing would, nor any changes or additions to schema specifications
* It does not require pre-processing, as a macro processor would
* It does not require the introduction and deployment of new transcoders, as would Tim Bray and John Cowan's recent thought experiment "UTF-8+Names"
* It does not require interaction with other standards groups, notably XML Schemas WG or IANA or IETF.
* By providing it at user option, it can succeed or fail; if it is popular and successful, that is good; if it is unpopular or unsafe, it can be withdrawn.
* By limiting itself to the HTML and the MathML/ISO entities, it avoids issues of user-defined entities, and the need to enumerate the entities.
* It does not define mappings for those characters, but defers to HTML and MathML/ISO, who may provide standard mappings. This gives a very wide constituency.

I note that Xerces' SAX 2 provides features by which a parser can continue processing after an error. This proposal could be seen as a very limited nod of recognition of that kind of practice.

Cheers
Rick Jelliffe
Received on Tuesday, 21 October 2003 04:48:29 UTC