RE: property value clarification from David G. Durand on 1998-11-01 (w3c-dist-auth@w3.org from October to December 1998)

From: David G. Durand <dgd@cs.bu.edu>
Date: Sat, 31 Oct 1998 21:31:07 -0400
To: w3c-dist-auth@w3.org
Message-Id: <v04011703b261671105f2@[24.0.249.126]>
At 8:36 PM -0400 11/1/98, Babich, Alan wrote:
>David Durand wrote:
>"There are issues here: most character encodings are a subset of Unicode,
>so that there are legitimate properties that are unlikely to be translatable
>into likely server encodings. This is not to say that we should prohibit
>internal translation, but we should require
>that _if_ the encoding changes, _then_ the characters will be
>preserved in the result returned by the server.
>
>Transcoding is a reality, but it's also a minefield, and servers
>that do it should be held accountable for not corrupting data.
>Furthermore, even transcodable character-set encodings are not
>always invertible: the precomposed characters in unicode are a
>canonical example."
>
>RDBMS's typically store things in one, maybe two, character sets.
>(Sometimes you can use both ASCII and Unicode in the same
>database.) If you want to put data in or get it out in another
>character set, a character set translation is invoked.
>Unfortunately, the world isn't perfect: Unicode is very
>popular, but as you have pointed out, losses in character
>set translation can and do occur. Facing reality, that fact
>must be accepted. So, the draft must allow that "best
>efforts" behavior.

No, it should prohibit that behavior. You can always just store bytes
(pardon me, octets), and avoid transcoding. Anyone who is not an expert in
internationalization should in fact avoid transcoding in almost every
situation.
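
To make the difference concrete, here is a minimal sketch in Python. The
two function names and the choice of Latin-1 as the "server encoding" are
my own assumptions for illustration, not anything from the draft. It
compares storing the octets untouched with a transcoding store that
normalizes and then re-encodes.

```python
import unicodedata


def store_verbatim(value: str) -> bytes:
    """Hypothetical store: keep the UTF-8 octets exactly as received."""
    return value.encode("utf-8")


def store_transcoded(value: str) -> bytes:
    """Hypothetical store: normalize to precomposed form (NFC), then
    transcode to Latin-1, replacing anything that does not fit with '?'."""
    composed = unicodedata.normalize("NFC", value)
    return composed.encode("latin-1", errors="replace")


decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Verbatim storage keeps the two spellings distinct and round-trips exactly.
assert store_verbatim(decomposed) != store_verbatim(precomposed)
assert store_verbatim(decomposed).decode("utf-8") == decomposed

# After transcoding, the two spellings are indistinguishable, so the
# original form cannot be recovered: the mapping is not invertible.
assert store_transcoded(decomposed) == store_transcoded(precomposed)

# Characters outside the target encoding are silently destroyed.
assert store_transcoded("\u4e2d") == b"?"
```

The verbatim store never has to know anything about the characters it
holds, which is why it is the safe default for anyone who is not an
internationalization expert.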

I'm not competent to make this argument, but I've watched the same thing
play out in several standards so far. If we explicitly loosen the XML
rules with regard to character sets, expect a bloody battle. The I18N
experts usually win, too, by the way. That's one reason that using XML is
smart -- it's dealt with those issues, and we don't have to.

David Megginson's parser is 37K of Java and does all this. It's hard to get
right, but pretty easy to take working code and not have to worry about
getting it right.

>It is reasonable to translate character sets mechanically
>and on the fly, but not languages. The best that you can do
>is to use an appropriate collation for the characters in
>queries. If losses occur in character set translation, then
>you can expect improper collation for those strings.
>This is reality, and the draft must accept this "best
>efforts" behavior.

I think that you will find that the IETF is full of people checking every
standard to prevent exactly this kind of shortcut from being taken. Since
these kinds of "solutions" have caused most of the problems in this area, I
don't blame them.

>"XML explicitly does _not_ ignore whitespace in any situation. When
>validating against a DTD (and only when validating against a DTD) it
>flags data (whitespace) that would be ignored by an SGML processor,
>and that many applications will also want to ignore.
>
>The flagging often confuses people, but it is an application issue as to
>whether the data can be ignored. Since WebDAV _has_ no DTD for all
>properties (since additional tags are not an error), there is _no_
>ignorable whitespace in any DAV property."

>I did not say XML ignored whitespace. I said whitespace is
>ignored. I meant that applications often ignore it.
>This is true whether or not you have a DTD. (Applications
>do not always follow the rules perfectly.) If I give
>a DAV store an integer property value:

Applications can do what they want. If a server is serving a dead property,
it can't do what you say.

>    <AgeInYears> 023 </AgeInYears>
>
>I would bet that the whitespace would be ignored by
>the server, the value would be stored in binary, and
>the value returned later by PROPFIND or query would be
>
>    <AgeInYears>23</AgeInYears>

What about:
<macFilename>   firstload INIT</macFilename>

This kind of filename has _significant_ leading spaces (I have several such
on my machine). A server that fixed this for me would simply be a
bug-ridden waste of bytes. Any server might be asked to save this property
value for me, and need not know anything about Macs (because only my client
cares, for instance).

If the server _doesn't_ understand the semantics of the property (based on
its namespace) it can't do _any_ normalization, and still be worth the
media it's encoded on, UNLESS we specify in advance what normalization is
performed (and how to defeat it if necessary).
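
As a rough sketch of the rule I'm describing (Python; the namespace URIs,
the set of "known" properties, and the function name are all hypothetical
and invented for illustration), a server would normalize a value only when
it recognizes the property by namespace and name, and otherwise store it
untouched:

```python
# Hypothetical registry of properties whose semantics this server knows.
# The namespace URI is an assumption made up for this example.
KNOWN_INTEGER_PROPERTIES = {("http://example.org/ns", "AgeInYears")}


def store_property(namespace: str, name: str, value: str) -> str:
    """Return the value the server would later hand back for this property."""
    if (namespace, name) in KNOWN_INTEGER_PROPERTIES:
        # Semantics are known: whitespace and leading zeros carry no meaning.
        return str(int(value.strip()))
    # Semantics are unknown: every character may be significant.
    return value


# A property the server understands may be rewritten.
assert store_property("http://example.org/ns", "AgeInYears", " 023 ") == "23"

# A property it does not understand comes back exactly as stored,
# leading spaces included.
assert (
    store_property("http://example.org/mac", "macFilename", "   firstload INIT")
    == "   firstload INIT"
)

# The same local name in an unrecognized namespace is also left alone.
assert store_property("http://example.org/other", "AgeInYears", " 023 ") == " 023 "
```

The point of keying on the namespace as well as the name is that a tag
called AgeInYears in someone else's namespace might not be an integer at
all, so the server has no business touching it.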

>There is nothing in the draft that prevents the server
>from doing that. And that, IMHO, is a very good thing -- it
>is the solution, not a problem. In order to avoid
>making this a longer e-mail, I'll refrain
>from listing all the reasons why, except to point out that
>as far as query is concerned, considering the following
>representations of the property value on the wire to be
>equivalent is a very good thing:
>
>    <AgeInYears>23</AgeInYears>
>
>    <AgeInYears> 023 </AgeInYears>
>
>    <AgeInYears> 23
>    </AgeInYears>
>
>"Many of the issues you raise are only appropriate for live properties."
>
>I disagree. In the above example, AgeInYears is a dead property.

With semantics known to the server, which allow it to _change_ the property.
I don't want to argue whether or not it's live because the server rewrote
it, though that is the way I would expect things to work. But if the server
does not know the meaning of the tag, all you've done is suggest that we
explicitly advocate the creation of server-dependent, hard-to-detect bugs.

  -- David
_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://www.dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
Received on Sunday, 1 November 1998 21:32:50 UTC