Re: [xml-dev] version numbers and infosets from John Cowan on 2002-07-26 (www-xml-blueberry-comments@w3.org from July 2002)

From: John Cowan <jcowan@reutershealth.com>
Date: Fri, 26 Jul 2002 09:06:10 -0400 (EDT)
To: elharo@metalab.unc.edu (Elliotte Rusty Harold)
Cc: xml-dev@lists.xml.org, www-xml-blueberry-comments@w3.org
Message-Id: <200207261318.JAA23890@mail2.reutershealth.com>

Elliotte Rusty Harold scripsit:

> This really goes to the heart of the problem: atoi and atof are ASCII 
> functions [...]

Not at all.  C is in no way bound to the ASCII character set; the native
character set can be anything that contains at least the 95 printing
characters of the ASCII repertoire plus space and newline.  Mapping the
C character "\n" to an EBCDIC NEL on an EBCDIC platform, or for that
matter to an extended-ASCII NEL on an ASCII mainframe platform, is an
eminently sensible and standards-conformant thing to do.  When you write
'printf("foo\nbar\n")', the intent is to generate two lines of plain text,
and that is just what happens.

For that matter, the Java situation is not open and shut either.
Although in Java it is guaranteed that '\n' == '\012', which is not
guaranteed in C, the specific encoding employed by PrintStream to print
characters is explicitly platform-specific, and it is not unreasonable
for a Java implementation to output a NEL when it is asked to print '\n'.

But to meet your larger point, there is nothing inappropriate in the use
of 8-bit functions in XML processing.  XML parsers that return UTF-8 are
not unknown, and every XML file I generate for publication (~200 a day)
is generated with 8-bit operations, and is either in UTF-8 or in 8859-1
(properly labeled).

> It's been a while since I've written C, but my recollection is that 
> the char type is always one-byte wide. 

Technically the width of a byte could be 16 bits if you wanted, though;
C requires only that a byte have at least 8 bits (CHAR_BIT) and sets no
upper limit.  In practice it is almost always 8.
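
A minimal sketch of how a program can check this for itself rather than
assume it.  The function names here are made up for illustration; only
CHAR_BIT and sizeof come from the language.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of bits in a C byte (and hence in a char) on this implementation.
   The C standard guarantees CHAR_BIT >= 8 and sets no upper limit. */
int bits_per_byte(void)
{
    return CHAR_BIT;
}

/* sizeof(char) is 1 by definition, whatever CHAR_BIT happens to be. */
size_t bytes_per_char(void)
{
    return sizeof(char);
}

/* Hypothetical start-up check for code that assumes 8-bit bytes. */
void require_octet_bytes(void)
{
    assert(CHAR_BIT == 8);
}
```

So "char is one byte wide" is always true, and "a byte is 8 bits" is the
part that depends on the implementation.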

> All of the other functions we're talking about are similar. Even with 
> NEL, you still shouldn't be using these to process XML. OS/390 needs 
> to get some modern libraries. XML does not need to change.  

The issue remains: XML files on the mainframe are not plaintext files
according to local conventions.

XML processing is specified to be done in terms of LF only, with all
other line-terminator conventions translated to LF.  Suppose this
had not been done, and all XML storage representations had been
defined to require LF only.  "What about Windows?"  "Oh well, they
can run an external program to convert CR/LF to LF before parsing,
and LF to CR/LF after generation."  If that had been the story, there
damned well would be no significant amount of XML on Windows.
You can rearrange this story using any line terminator and OS you like.
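
For concreteness, here is a rough sketch of the translation XML 1.0
already performs (section 2.11): CR LF becomes LF, and a CR not followed
by LF becomes LF.  The function name and the in-place buffer interface are
my own assumptions, not anything a particular parser exposes.  The bytes
are written as hex constants rather than '\r' and '\n' because, as noted
above, C does not tie those to particular code values.

```c
#include <assert.h>
#include <stddef.h>

enum { XML_CR = 0x0D, XML_LF = 0x0A };

/* Normalize line ends in place in an ASCII-compatible byte buffer.
   CR LF collapses to a single LF; a lone CR is replaced by LF.
   Returns the new length, which is never greater than len. */
size_t normalize_line_ends(unsigned char *buf, size_t len)
{
    size_t in = 0, out = 0;

    assert(buf != NULL || len == 0);
    while (in < len) {
        unsigned char c = buf[in++];
        if (c == XML_CR) {
            if (in < len && buf[in] == XML_LF)
                in++;            /* swallow the LF of a CR LF pair */
            c = XML_LF;
        }
        buf[out++] = c;
    }
    return out;
}
```

The point is that this step lives inside the parser, so no platform has
to run an external converter first.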

Mainframes and EBCDIC are far from dead.  XML 1.0 Appendix F makes a
point of talking about how to autodetect EBCDIC encodings, for example;
there is no reason why XML files can't start 4C 6F A7 94.
There is no reason not to convert the occasional 0x15 (or 0x85 in
the ASCII-compatible encoding) to an XML end of line, either.
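
A sketch of both steps, under stated assumptions: the function names are
hypothetical, the detection looks only at the four bytes quoted above, and
the NEL mapping is shown only for a single-byte ASCII-compatible encoding
such as 8859-1.  For EBCDIC input the 0x15 would be dealt with in the
transcoding step, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* "<?xm" in EBCDIC, the byte sequence listed in XML 1.0 Appendix F. */
static const unsigned char EBCDIC_XML_DECL[4] = { 0x4C, 0x6F, 0xA7, 0x94 };

enum { LATIN1_NEL = 0x85, XML_LF = 0x0A };

/* Return 1 if the buffer starts with the EBCDIC form of "<?xm", else 0. */
int looks_like_ebcdic_xml(const unsigned char *buf, size_t len)
{
    size_t i;

    if (len < 4)
        return 0;
    for (i = 0; i < 4; i++)
        if (buf[i] != EBCDIC_XML_DECL[i])
            return 0;
    return 1;
}

/* Replace every NEL byte (0x85) with LF (0x0A) in place and return the
   number of replacements.  Only valid for single-byte encodings: in
   UTF-8, 0x85 can be a continuation byte and must not be touched. */
size_t latin1_nel_to_lf(unsigned char *buf, size_t len)
{
    size_t i, count = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (buf[i] == LATIN1_NEL) {
            buf[i] = XML_LF;
            count++;
        }
    }
    return count;
}
```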

Speaking for myself and not necessarily the Core WG, I agree that there
is no need to redefine the S production, merely to do line-terminator
mapping on input.  IMHO, there is no reason for #xD to be part of S
either, as all real CRs are already mapped away, and having #xD be
part of S serves only to allow very strange abuse of character
references in entities containing attribute values and the like.
However, I am certainly not suggesting that #xD be removed from S.

-- 
John Cowan          http://www.ccil.org/~cowan        jcowan@reutershealth.com
To say that Bilbo's breath was taken away is no description at all.  There are
no words left to express his staggerment, since Men changed the language that
they learned of elves in the days when all the world was wonderful. --The Hobbit

Received on Friday, 26 July 2002 09:08:46 UTC