Re: ERB decisions on A.17, B.9, and other questions from David G. Durand on 1996-10-19 (w3c-sgml-wg@w3.org from October 1996)

From: David G. Durand <dgd@cs.bu.edu>
Date: Sat, 19 Oct 1996 17:55:12 -0400
To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <v02130501ae8ef3c84b7b@[128.148.157.46]>
At 2:56 PM 10/19/96, Michael Sperberg-McQueen wrote:
>On Sat, 19 Oct 1996 15:15:27 -0400 David G. Durand said:
>>>Instead of a declaration like
>>>
>>>  <!ENTITY auml SDATA "[auml    ]">
>>
>>How do I declare a character like "TENGWAR VOWEL LETTER A" or "ME
>>GLYPH GARDINER 201" (Middle Egyptian). It's just &#23434 or
>>something.
>>
>>My gut reaction to that is #$*#&$#?!
>
>I'm not sure what you're arguing here.  Would
>
>    <!ENTITY a.teng SDATA "[a.teng  ]">
>    <!ENTITY a.teng SDATA "[TENGWAR VOWEL LETTER A]">
>    <!ENTITY a.teng SDATA "[U+5B8A]">
>
>be (a) equally likely to be understood by all XML processors and (b)
>materially less likely to elicit the reaction &expletive;?

Yes, because I can be told, by the application, that "An unknown character
was found in the data stream, here's its name."

Otherwise, I get the information from the application: "An unknown
character was found, here's its number (in an officially undefined section
of unicode)".

In one case, the name may help me to understand the reference, or at least
give me a starting point for research, in the other, I have no further
recourse.

>If the ERB decides question C.5 the way I hope we will, then we can
>at least say
>
>    <!ENTITY auml   "&u00E4;"><!-- auml = a umlaut (dec 228) -->
>    <!ENTITY a.teng "&u5B8A;"><!-- a.teng = Tengwar vowel A
>                                   (decimal 23434) -->
Something like this is essential, but it addresses a different problem. The
problem with the above is that hiding critical information in comments is
not a good prescription for portability.

>which I, for one, prefer, since I don't have to be logged on to a
>system with a shell-level hex-to-decimal calculator, and it's
>easier to proofread the declarations against Unicode documentation.
>
>(But to be honest I suspect that a.teng belongs at 57344 (U+E000) or
>thereabouts in the Private Use Area.)
(Sure, I just made the number up by hitting my numeric keypad)

That sure doesn't help. U+E000 is almost, but not quite, utterly uninformative.
Tengwar vowel mark A, may help a human, ig not an application.

>
>Come on, now.  The question of including some entities automatically
>is on the table, and has been addressed by several postings, most in
>favor, as far as I can remember.  The proposals I've heard so far are
>(a) just lt, gt, and amp (possibly augmented by declarations for ' and
>"), (b) the above plus ISOLat1, and (c) everything in HTML (which
>version?).  If you are hoping for automatic inclusion of Tengwar and
>other similar scripts, I predict disappointment.  Or even Middle
>Egyptian, for that matter.  If you think we should decide to include
>them automatically without worrying about the form of their declaration,
>then I respectfully but vigorously disagree.

No, my point was that the set of possible character references is an open
set, and that numbers are never preferable to names, absent a well-defined
place to look up the number. (Private use does not solve that problem,
since all I can do is look it up and be told that it's officially
undefined.) Hey, I can point you to a unicode definition of where TENGWAR
LETTER A might appear in the private use set. But even if I knew nothing of
unicode that description is a lot more useful than any arbitrary code
value.

   If we applied the proposed SDATA removal policy to programming languages
we could replace annoying variable names with convenient numbers, and
improve the bandwidth of programmers. If we have unique labels for things,
why bother to make them meaningful?

>On the whole, I don't see how retaining or excising SDATA has any
>effect on the ease or difficulty of declaring things like TENGWAR
>VOWEL A and ME GLYPH GARDINER 201 in useful ways.  And so I don't
>see how these examples bear upon the decision to do without SDATA
>in version 1.0 of XML.

A comment that the parser strips that tells me the name of a character is
measurably less useful that an application available string containing the
same information. And a lot more

For instance, I might put URLs of .gif renditions of characters in my SDATA
entities, instead of character names. I don't think that we should
standardize the meaning of SDATA strings, but that we need the hook to
attach meaningful data to undefined refrences. For instance, many of the
less-common Asian characters might have meaningful string that could
document the ideogram, even for someone who only has a JIS font.

> extended metaphor about the promised land deleted...

This is all very well but we do have to get to the promised land. I don't
want to end up like the Red 'Lectroids in Buckaroo Banzai, whose triumphal
return home was terminated with the words:

   "We are not in the eighth dimension. We are over New Jersey."

If we could at least make it to Seattle, I'd be happier.

   -- David

_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
http://www.dynamicdiagrams.com/services_map_main.html
Received on Saturday, 19 October 1996 17:50:41 UTC