Re: [xml-dev] XML and mainframes, yet again (was RE: [xml-dev] Some comments on the 1.1 draft) from Elliotte Rusty Harold on 2001-12-15 (www-xml-blueberry-comments@w3.org from December 2001)

From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
Date: Sat, 15 Dec 2001 09:08:27 -0400
To: "Champion, Mike" <Mike.Champion@SoftwareAG-USA.com>, xml-dev@lists.xml.org
Message-Id: <p04330100b840f557d186@[192.168.254.4]>
At 7:57 PM -0700 12/14/01, Champion, Mike wrote:


>I'm out of my depth here,  but this argument doesn't smell right to me.  I
>thought we concluded in the massive Blueberry thread a few months back that
>#x85 probably should have been included in the S production in the first
>place, and wasn't mainly because of a lack of mainframe expertise among the
>members of the original WG.

No, we didn't conclude that. A lot of us thought then and still think 
that XML 1.0 got this right, that #x85 should not have been part of 
the S production and still shouldn't be.

>pragmatism and leave them out. BUT there is an IMMENSE amount of data in
>mainframe databases that will probably be exposed via XML one day.  It's not
>IBM that will pay the cost of debugging all the programs that neglect to
>translate #x85 into a politically correct separator when exposing these
>legacy systems as web services.  And it is potentially OUR bank accounts and
>insurance policies in these legacy systems that are vulnerable to someone
>getting this wrong.
>

And exactly *none* of this data is in XML. If you want to take it out 
of the database and put it in XML, then it must be translated with or 
without XML 1.1. The same is true of Oracle, FileMaker, SQL Server, 
and all other legacy database products on the market. It is trivial 
to translate #x85 to #xA or #xD or both in the process. However, even 
that isn't necessary!

#x85 is allowed in character data, i.e. in element content and 
attribute values, today, with XML 1.0. All fields from IBM's databases 
that contain #x85 characters can be included in XML 1.0 documents 
without translation. The only place you can't put #x85 is inside tags, 
as the white space between an element name and its attributes or 
between one attribute and the next.
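
One way to check this is to hand both cases to an XML 1.0 parser. 
The sketch below assumes Python's xml.etree.ElementTree (which wraps 
expat); the helper name is_well_formed and the two sample documents 
are made up for illustration and are not from the original thread.

```python
import xml.etree.ElementTree as ET

NEL = "\u0085"  # the #x85 character under discussion


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc (hypothetical helper)."""
    try:
        ET.fromstring(doc.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


# #x85 inside an attribute value and inside element content.
in_data = f'<name att1="a{NEL}b">line one{NEL}line two</name>'

# #x85 used as the separator between the element name and an
# attribute, where XML 1.0 requires a character matching S.
in_tag = f'<name{NEL}att1="value"/>'

if __name__ == "__main__":
    print("in character data:", is_well_formed(in_data))
    print("in tag white space:", is_well_formed(in_tag))
```

The first document should parse with the #x85 characters preserved in 
the data; the second should be rejected as not well-formed.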

The issue is not IBM databases and never has been. The issue is that 
IBM has some brain-damaged text editors that insert a #x85 every time 
you hit the return key instead of inserting a #xA or #xD or both. 
Files created with these editors are not well-formed XML without an 
additional conversion pass. Similarly, IBM has some programming 
languages and tools that generate a #x85 when they do a println() or 
that language's equivalent.  That's all.
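
The conversion pass mentioned above can be very small. Here is a 
minimal sketch in Python, assuming the text has already been decoded 
to Unicode; the helper name nel_to_lf and the sample tag are 
hypothetical. Note that this blunt version also rewrites any #x85 
that appears in character data, which may or may not be what you 
want.

```python
NEL = "\u0085"  # #x85, the line end some tools emit


def nel_to_lf(text: str) -> str:
    """Replace every #x85 with #xA (hypothetical helper name)."""
    return text.replace(NEL, "\n")


# A start-tag whose line breaks were written as #x85.
start_tag = f'<name{NEL}   att1="value"{NEL}   att2="value"{NEL}>'

if __name__ == "__main__":
    print(nel_to_lf(start_tag))
```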

This has nothing to do with letting data move from IBM databases into 
XML. It has everything to do with IBM not wanting to update their 
software to the standards the rest of the world has been using for 
more than 20 years. Worst of all, IBM wants to start shipping around 
XML documents they generate with these strange line ending characters 
that will not behave appropriately in the installed base of software 
the rest of the world is using. I'm not just talking about XML here, 
but much more broadly installed things like text editors and 
programming languages. For instance, suppose an IBM tool generates a 
start-tag like this using #x85:

<name
   att1="value"
   att2="value"
   att3="value"
>

Looks like well-formed XML in plain ASCII, right? But it's not. 
Here's what you'll see if you open up the document containing that 
tag in a typical Windows text editor:

<name...  att1="value"...  att2="value"...  att3="value"...>

(Actual ellipsis characters will be used instead of three periods, 
but you get the idea.) Open it on a Mac and all the ellipses will 
change into O with two dots above instead.
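
The effect is easy to reproduce by decoding the same byte under 
different encodings. This is only a sketch: it uses Python's cp1252 
and mac_roman codecs as stand-ins for what a Windows or classic Mac 
editor would assume, and the helper name decode_byte_85 is made up.

```python
RAW = b"\x85"  # the single byte an IBM tool would write as a line end


def decode_byte_85(encoding: str) -> str:
    """Decode byte 0x85 using the named codec (hypothetical helper)."""
    return RAW.decode(encoding)


if __name__ == "__main__":
    for codec in ("cp1252", "mac_roman", "latin-1"):
        ch = decode_byte_85(codec)
        # Print the code point rather than the glyph, so the result
        # does not depend on the terminal's own encoding.
        print(f"{codec:10s} U+{ord(ch):04X}")
```

Under cp1252 the byte is the horizontal ellipsis (U+2026), under 
mac_roman it is O with diaeresis (U+00D6), and under ISO-8859-1 it 
is the C1 control character U+0085.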

This isn't just a question of recognizing the right encoding. It's a 
question of attaching the right semantics to the characters. #x85 
isn't just another character. It's a character with special meaning 
for many text-processing systems. Unfortunately IBM has chosen to 
assign different semantics to this character than pretty much 
everyone else in the world. Even if the document is labeled as 
ISO-8859-1 and the editor recognizes that and can tell that #x85 is 
not a graphics character, it still won't break the lines when it sees 
#x85!
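
As one illustration of the semantics problem, here is a sketch using 
Python's io.StringIO, whose default line handling ends a line only 
at #xA. It stands in for any line-oriented tool that does not treat 
#x85 as a line end; the helper name count_lines and the sample 
strings are made up.

```python
import io

NEL = "\u0085"  # #x85, correctly decoded as a control character


def count_lines(text: str) -> int:
    """Count lines as seen by a reader that only breaks at #xA."""
    return len(io.StringIO(text).readlines())


with_nel = f'<name{NEL}   att1="value"{NEL}>'
with_lf = '<name\n   att1="value"\n>'

if __name__ == "__main__":
    print("lines with #x85:", count_lines(with_nel))
    print("lines with #xA: ", count_lines(with_lf))
```

The character is decoded correctly in both cases; the only 
difference is whether the reader gives it line-ending meaning.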
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
+----------------------------------+---------------------------------+
Received on Saturday, 15 December 2001 09:14:50 UTC