comments on NEL changes (PR-xml11-20031105) from David Carlisle on 2003-11-12 (xml-editor@w3.org from October to December 2003)

From: David Carlisle <davidc@nag.co.uk>
Date: Wed, 12 Nov 2003 14:17:35 GMT
To: xml-editor@w3.org
Message-Id: <200311121417.OAA06072@penguin.nag.co.uk>
1.3 Rationale for XML 1.1 states:
    
    In addition, XML 1.0 attempts to adapt to the line-end conventions of
    various modern operating systems, but discriminates against the
    conventions used on IBM and IBM-compatible mainframes. As a result, XML
    documents on mainframes are not plain text files according to the local
    conventions. XML 1.0 documents generated on mainframes must either
    violate the local line-end conventions, or employ otherwise unnecessary
    translation phases before parsing and after generation. Allowing
    straightforward interoperability is particularly important when data
    stores are shared between mainframe and non-mainframe systems (as
    opposed to being copied from one to the other). Therefore XML 1.1 adds
    NEL (#x85) to the list of line-end characters. For completeness, the
    Unicode line separator character, #x2028, is also supported
    

This rationale fails to mention the simpler and less disruptive
alternative that does not require "unnecessary translation phases before
parsing and after generation", namely to use a text encoding, specified
in the XML or text declaration, that maps NEL in the file to a Unicode
newline. This would have avoided the disruptive changes to the XML white
space rules. Even if (as I suspect will happen) the WG decides to keep
the addition of NEL to the line end normalisation characters, I think
that the option of using an encoding (and a rationale for why it wasn't
used) should be mentioned, or failing that, this rationale ought to be
removed from the spec (it's not clear that such discussion belongs in
the spec anyway).
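
To illustrate the encoding-based alternative, here is a minimal Python
sketch. It is only an illustration under assumptions that are not in the
spec: the file is in EBCDIC code page 037 (Python's "cp037" codec), its
line ends are the EBCDIC NL byte 0x15, and decode_nl_as_lf is a
hypothetical helper name. The point is that the mapping from NL to #xA
can live in the decoding layer, so the parser never sees NEL.

```python
EBCDIC_NL = 0x15  # assumed line-end byte in the mainframe file


def decode_nl_as_lf(data: bytes, codec: str = "cp037") -> str:
    """Decode EBCDIC bytes, mapping each NL byte (0x15) to #xA.

    The split is done on the raw bytes, so the result does not depend
    on how the codec itself would have mapped 0x15.
    """
    lines = data.split(bytes([EBCDIC_NL]))
    return "\n".join(line.decode(codec) for line in lines)


# Two lines of markup separated by the mainframe NL byte.
sample = "<a>".encode("cp037") + bytes([EBCDIC_NL]) + "</a>".encode("cp037")
print(repr(decode_nl_as_lf(sample)))  # '<a>\n</a>'
```

With a decoder of this kind declared as the entity's encoding, the XML
1.0 line-end rules would apply unchanged.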


The current rules in 2.11 End-of-Line Handling appear to be
self-contradictory.

First they say:

    the XML processor MUST behave as if it normalized all line breaks in
    external parsed entities (including the document entity) on input,
    before parsing, by translating all of the following to a single #xA
    character:

This would imply that a reasonable strategy would be to run an
off-the-shelf line end normaliser over the file before parsing.
However, if you do that you cannot (so easily) comply with the final
rule of that section:

    The characters #x85 and #x2028 cannot be reliably recognized and
    translated until an entity's encoding declaration (if present) has
    been read. Therefore, it is a fatal error to use them within the XML
    declaration or text declaration.

If these characters MUST (appear to) have been normalised away before
parsing, i.e. before the text declaration is recognised, how can you
tell that they appeared in a text declaration? Clearly some form of
words could be constructed that says what you mean here, but the fact
that the description needs to become more convoluted is perhaps an
indication that this change isn't as "straightforward" as the current
rationale implies.
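
The conflict can be sketched in a few lines of Python. This is an
illustration only: normalize_line_ends and declaration_has_nel_or_ls are
hypothetical names, the regular expression covers the five line-break
forms listed in 2.11, and the declaration check is deliberately crude
(it just looks at everything before the first "?>").

```python
import re

# The five forms 2.11 translates to #xA: #xD #xA, #xD #x85, #x85,
# #x2028, and any #xD not followed by #xA or #x85 (matched last).
_LINE_BREAKS = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_line_ends(text: str) -> str:
    """Off-the-shelf style normaliser: every line break becomes #xA."""
    return _LINE_BREAKS.sub("\n", text)


def declaration_has_nel_or_ls(text: str) -> bool:
    """Report #x85 or #x2028 inside a leading XML or text declaration.

    Simplified: treats anything starting with '<?xml' as a declaration.
    """
    if not text.startswith("<?xml"):
        return False
    end = text.find("?>")
    decl = text if end == -1 else text[:end]
    return "\x85" in decl or "\u2028" in decl


raw = '<?xml version="1.1"\x85encoding="UTF-8"?>\x85<doc/>'
print(declaration_has_nel_or_ls(raw))                       # True
print(declaration_has_nel_or_ls(normalize_line_ends(raw)))  # False
```

Once the normaliser has run, the information needed to raise the fatal
error is gone, so a processor has to inspect the declaration on the raw
input first.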

As a side remark on "fatal error" as it appears in the line quoted
above: is there any chance that this (and other similar terms) could be
hyperlinked to the definition of the term in the glossary? Currently
there is no typographic or hypertextual indication that this is a
defined term. Basically this means changing "fatal error" to
<termref def="dt-fatal">fatal error</termref> everywhere it appears,
if it's not already so marked.


David

Received on Wednesday, 12 November 2003 09:21:20 UTC