Re: [xml-dev] version numbers and infosets

At 9:06 AM -0400 7/26/02, John Cowan wrote:


>For that matter, the Java situation is not open and shut either.
>Although in Java it is guaranteed that '\n' == '\012', which is not
>guaranteed in C, the specific encoding employed by PrintStream to print
>characters is explicitly platform-specific, and it is not unreasonable
>for a Java implementation to output a NEL when it is asked to print '\n'.

Anybody using a PrintStream to do serious work deserves the bugs they 
get. They've got problems before they even start thinking about XML. 
In fact, I wrote one 600-page book inspired mostly by exactly the 
problems with PrintStream. Good code uses the other stream and writer 
classes, in which this behavior is unambiguously specified.
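
A minimal sketch of that approach, assuming a modern JDK (the 
StandardCharsets constants are newer than this message) and a 
hypothetical class name ExplicitXmlWriter. The encoding and the line 
ending are both chosen by the code, not by the platform:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not taken from any library.
class ExplicitXmlWriter {

    // Serializes a tiny document with an explicit encoding (UTF-8)
    // and an explicit line terminator (a single LF, written as "\n").
    static byte[] write() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write("<greeting>caf\u00e9</greeting>\n");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = write();
        // Pass the already-encoded bytes through untouched; no
        // platform-dependent character conversion happens here.
        System.out.write(bytes, 0, bytes.length);
        System.out.flush();
    }
}
```

Because the Writer is constructed with a named charset and the code 
writes "\n" itself, the same bytes come out on every platform.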

>But to meet your larger point, there is nothing inappropriate in the use
>of 8-bit functions in XML processing.  XML parsers that return UTF-8 are
>not unknown, and every XML file I generate for publication (~200 a day)
>is generated with 8-bit operations, and is either in UTF-8 or in 8859-1
>(properly labeled).
>

Do you really mean to suggest that using UTF-8 code units as C chars 
is adequate? I suppose you could do that, but it most certainly is 
not convenient and completely fails your stated goal of making XML 
files plain text files. You're basically suggesting we treat them as 
binary data rather than text.

>>  All of the other functions we're talking about are similar. Even with
>>  NEL, you still shouldn't be using these to process XML. OS/390 needs
>>  to get some modern libraries. XML does not need to change.
>
>The issue remains: XML files on the mainframe are not plaintext files
>according to local conventions.

Yes, that's true and the issue is *much* broader than merely adding 
NEL to the white space production. Even if we do this, XML files on 
mainframes will still not be plain text files. Adding NEL won't fix 
the problem.

This whole notion of the "plain text" file may be a red herring. The 
community has realized over the last several years that calling XML 
files plain text really isn't accurate on any platform. Hence the 
move from text/xml to application/xml.

>XML processing is specified to be done in terms of LF only, with all
>other line-terminator conventions translated to LF.  Suppose this
>had not been done, and all XML storage representations had been
>defined to require LF only.  "What about Windows?"  "Oh well, they
>can run an external program to convert CR/LF to LF before parsing,
>and LF to CR/LF after generation."  If that had been the story, there
>damned well would be no significant amount of XML on Windows.
>You can rearrange this story using any line terminator and OS you like.

You're confusing issues by merging together two different time 
frames: before and after XML 1.0 was released. Had IBM raised this 
issue during the development of XML, it could have been considered on 
different grounds. They failed to do so, and I see no justification 
for reopening the case now. It is far more important for XML to 
remain stable than to spare a minuscule number of users (possibly as 
few as zero) from upgrading their software to something that supports 
XML 1.0 conventions.

I find it completely reasonable to ask editors and other tools to 
support the line ending conventions of the files they're editing. I 
do this routinely on Mac, Windows, and Unix. I find it hard to 
believe that it is so much more difficult for mainframe programmers 
to do this.

>Mainframes and EBCDIC are far from dead.  XML 1.0 Appendix F makes a
>point of talking about how to autodetect EBCDIC encodings, for example;
>there is no reason why XML files can't start 4C 6F A7 94.
>There is no reason not to convert the occasional 0x15 (or 0x85 in
>the ASCII-compatible encoding) to an XML end of line, either.

Airline reservation clerks and bank tellers don't count. They never 
see the XML. How many actual users are there writing raw XML who have 
problems? So far I haven't seen any. A programmer generating XML from 
code can easily specify the line ending that XML requires. A 
programmer reading XML through a parser will just see line feeds 
anyway. You're trying to fix a non-existent problem.
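
A small sketch of the reading side, using the JAXP DOM parser that 
ships with the JDK. The class name LineEndDemo and the sample 
document are made up for illustration. The input mixes CR LF, a bare 
CR, and a bare LF; the end-of-line handling in XML 1.0 section 2.11 
means the application should see only line feeds:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example class; not taken from any library.
class LineEndDemo {

    // Parses the given XML and returns the text content of the root element.
    static String textOf(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Literal CR LF, bare CR, and bare LF inside element content.
        String mixed = "<a>one\r\ntwo\rthree\nfour</a>";
        String text = textOf(mixed);
        // Show the line breaks visibly instead of emitting them raw.
        System.out.println(text.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

Only literal line ends are normalized this way; a CR written as the 
character reference &#xD; is passed through to the application.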

>Speaking for myself and not necessarily the Core WG, I agree that there
>is no need to redefine the S production, merely to do line-terminator
>mapping on input.  IMHO, there is no reason for #xD to be part of S
>either, as all real CRs are already mapped away, and having #xD be
>part of S serves only to allow very strange abuse of character
>references in entities containing attribute values and the like.
>However, I am certainly not suggesting that #xD be removed from S.
>

Again, it's a time frame issue. We are not discussing what XML would 
be in an ideal world, had we known everything in 1996 that we know 
now. We are discussing what is best to do now. Failing to add NEL in 
no way justifies removing CR.
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          XML in a  Nutshell, 2nd Edition (O'Reilly, 2002)          |
|              http://www.cafeconleche.org/books/xian2/              |
|  http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/  |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
+----------------------------------+---------------------------------+

Received on Friday, 26 July 2002 12:36:48 UTC