- From: John Cowan <jcowan@reutershealth.com>
- Date: Fri, 26 Jul 2002 09:06:10 -0400 (EDT)
- To: elharo@metalab.unc.edu (Elliotte Rusty Harold)
- Cc: xml-dev@lists.xml.org, www-xml-blueberry-comments@w3.org
Elliotte Rusty Harold scripsit:

> This really goes to the heart of the problem: atoi and atof are ASCII
> functions [...]

Not at all. C is in no way bound to the ASCII character set; the native character set can be anything that contains at least the 95 printing characters of the ASCII repertoire plus space and newline. Mapping the C character "\n" to an EBCDIC NEL on an EBCDIC platform, or for that matter to an extended-ASCII NEL on an ASCII mainframe platform, is an eminently sensible and standards-conformant thing to do. When you write 'printf("foo\nbar\n")', the intent is to generate two lines of plain text, and that is just what happens.

For that matter, the Java situation is not open and shut either. Although in Java it is guaranteed that '\n' == '\012', which is not guaranteed in C, the specific encoding employed by PrintStream to print characters is explicitly platform-specific, and it is not unreasonable for a Java implementation to output a NEL when it is asked to print '\n'.

But to meet your larger point, there is nothing inappropriate in the use of 8-bit functions in XML processing. XML parsers that return UTF-8 are not unknown, and every XML file I generate for publication (~200 a day) is generated with 8-bit operations, and is either in UTF-8 or in 8859-1 (properly labeled).

> It's been a while since I've written C, but my recollection is that
> the char type is always one-byte wide.

Technically the width of a byte could be 16 bits if you wanted, though; C leaves the number of bits per byte open, subject to a minimum of 8. In practice it is nearly always 8.

> All of the other functions we're talking about are similar. Even with
> NEL, you still shouldn't be using these to process XML. OS/390 needs
> to get some modern libraries. XML does not need to change.

The issue remains: XML files on the mainframe are not plaintext files according to local conventions. XML processing is specified to be done in terms of LF only, with all other line-terminator conventions translated to LF.
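
As a rough illustration of the translation just described, here is a minimal Python sketch of line-terminator mapping on input. The function name `normalize_line_ends` and the `map_nel` flag are hypothetical, chosen for this example only. With the flag off it does what XML 1.0 specifies (CR LF and lone CR become LF); with the flag on it also maps NEL (U+0085), which is the change under discussion here, not something XML 1.0 requires.

```python
import re

# Terminators translated to LF on input. CR LF is listed first so the
# pair collapses to one LF instead of two.
_XML10_EOL = re.compile(r"\r\n|\r")

# Same set plus NEL (U+0085). Assumption: this mirrors the proposal
# discussed in this thread; it is not XML 1.0 behaviour.
_EOL_WITH_NEL = re.compile(r"\r\n|\r|\x85")


def normalize_line_ends(text: str, map_nel: bool = False) -> str:
    """Return text with every recognised line terminator replaced by LF."""
    pattern = _EOL_WITH_NEL if map_nel else _XML10_EOL
    return pattern.sub("\n", text)
```

For example, `normalize_line_ends("a\r\nb\rc")` gives `"a\nb\nc"`, and a NEL in the input is left alone unless `map_nel=True` is passed. The S production is never touched; only the input is translated before parsing.
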
Suppose this had not been done, and all XML storage representations had been defined to require LF only. "What about Windows?" "Oh well, they can run an external program to convert CR/LF to LF before parsing, and LF to CR/LF after generation." If that had been the story, there damned well would be no significant amount of XML on Windows. You can rearrange this story using any line terminator and OS you like.

Mainframes and EBCDIC are far from dead. XML 1.0 Appendix F makes a point of talking about how to autodetect EBCDIC encodings, for example; there is no reason why XML files can't start 4C 6F A7 94. There is no reason not to convert the occasional 0x15 (or 0x85 in the ASCII-compatible encoding) to an XML end of line, either.

Speaking for myself and not necessarily the Core WG, I agree that there is no need to redefine the S production, merely to do line-terminator mapping on input. IMHO, there is no reason for #xD to be part of S either, as all real CRs are already mapped away, and having #xD be part of S serves only to allow very strange abuse of character references in entities containing attribute values and the like. However, I am certainly not suggesting that #xD be removed from S.

-- 
John Cowan
http://www.ccil.org/~cowan    jcowan@reutershealth.com
To say that Bilbo's breath was taken away is no description at all. There
are no words left to express his staggerment, since Men changed the language
that they learned of elves in the days when all the world was wonderful.
        --The Hobbit
Received on Friday, 26 July 2002 09:08:46 UTC