Re: ERB decisions on A.17, B.9, and other questions from lee@sq.com on 1996-10-23 (w3c-sgml-wg@w3.org from October 1996)

From: <lee@sq.com>
Date: Wed, 23 Oct 96 15:31:18 EDT
To: John_Lavagnino@brown.edu, w3c-sgml-wg@w3.org, U35395@UICVM.UIC.EDU
Message-Id: <9610231931.AA00836@sqrex.sq.com>
> (I'm assuming that 'you' is the XML processor; if John is assuming
> it's the client application, the two views are compatible.  On this
> interpretation, John's account of application behavior does assume
> that SDATA entity boundaries are visible to applications, which is a
> common usage but not one required by 8879, as far as I can tell.)

None of ESIS is mandated by 8879 as far as I know.

I have generally assumed that the XML specification will go much further
than specifying just the syntax of the language, and hence (in any normal
non-SGML terminology) will affect more than just the parser.

> 1.  An ISO 10646 character known to the application:
>     <!ENTITY auml '&#228;'>
>     vs.
>     <!ENTITY auml SDATA "[auml    ]" >
>     or
>     <!ENTITY auml SDATA "[LATIN SMALL LETTER A WITH DIAERESIS]" >

The following two make more sense to me:
    <!ENTITY auml "&#228;">
    <!ENTITY auml SDATA "228: LATIN SMALL LETTER A WITH DIAERESIS">

and I'd hope XML could allow both of them.

>  Applications which display text must maintain a
> Unicode-to-local mapping table, unless they have Unicode display
> drivers (which effectively embody such tables).
Or they can use the numbers, if they are more useful.

But they will still need to cope with &#0x4486;&#0x030C;, which is
OLD HANGUL SYLLABLE MIEUM-ALAE A-GIYEOG LIEUL with a CARON accent
over it, if they're going for strict conformance.

So the application at least needs to know which code points are to be
treated as non-spacing accents.
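
As a rough sketch of what that knowledge looks like in practice (Python,
purely illustrative and not anything in the draft): the function name is
made up, the table consulted is the one in Python's unicodedata module
rather than anything from 1996, and a Latin base letter is used for
simplicity.

```python
import unicodedata

def split_into_display_units(text):
    """Group each base character with the non-spacing marks that follow it."""
    units = []
    for ch in text:
        # General category "Mn" (Mark, nonspacing) covers accents such as
        # U+030C COMBINING CARON.
        if units and unicodedata.category(ch) == "Mn":
            units[-1] += ch      # attach the accent to the preceding base
        else:
            units.append(ch)     # start a new display unit
    return units

# A base letter followed by a caron stays together as one unit.
units = split_into_display_units("A\u030C B")
```

An application without such a table has no way to tell that the second
code point belongs over the first.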

> 2.  An ISO 10646 character unknown to the application:
> 
> The text entity provides only a Unicode code point; the SDATA entity
> provides either a quasi-mnemonic identifier or a full SC2-style
> character name.
> 
> For fallback processing, the text-entity user must rely on
> display-specific sets of entity declarations -- unfortunately,
> without public identifiers these cannot be reliably labeled.
If there were public identifiers for the unknown glyphs, they probably
wouldn't need to be unknown!

Ask someone like Glenn Adams or bc Krishna of FutureTense about this...

>  The
> SDATA user can use the entity text (or name, if provided) to
> generate fallback display text.
Yes.  This is the difference -- whether you get something meaningless in
the failure case or something you can use.

A [less than with tilde under it] B
can be read immediately and understood by millions of people the world over.
A &0x60000020; B
cannot be read by anyone.
(this is character 0x20 (32 decimal) in row 0 of plane 0 in group 0x60,
the first Private Use group; this is the lowest code point that is
generally available, unless you use the 8192 positions in plane 0 of
group 0 that are in the R-zone.)
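
The arithmetic, as a small sketch in Python (the function name is my own;
the only assumption is the usual UCS-4 layout of a 7-bit group octet
followed by plane, row and cell octets):

```python
def ucs4_position(code):
    """Split a 31-bit UCS-4 value into (group, plane, row, cell)."""
    group = (code >> 24) & 0x7F   # top octet, 7 bits used
    plane = (code >> 16) & 0xFF
    row = (code >> 8) & 0xFF
    cell = code & 0xFF
    return group, plane, row, cell

group, plane, row, cell = ucs4_position(0x60000020)
print(hex(group), plane, row, cell)   # prints: 0x60 0 0 32
```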


> 3.  A non-ISO-10646 character known to the application:
>     <!ENTITY a.teng '&#27344;'>
>     vs.
>     <!ENTITY a.teng SDATA "[a.teng  ]" >
>     or
>     <!ENTITY a.teng SDATA "[TENGWAR LETTER VOWEL A]" >
> 
> In both cases, private arrangements of a form not covered by the XML
> spec are required.
Not if the XML spec handles this case.

> The SDATA method seems to involve a slightly lower chance of
> accidental collisions arising when private agreements inadvertently
> use the same name or character position, and thus may have less need
> of mechanisms to signal the applicability of this or that private
> agreement.

Yes.


> 4.  A non-ISO-10646 character unknown to the application:
> 
> Same as case 2, except that the user presented with only a code
> point cannot find out, by consulting ISO 10646 or Unicode
> documentation, what character is involved.

Yes.
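
To make the difference in cases 2 and 4 concrete, here is a rough sketch
in Python.  Everything in it is hypothetical: the two tables, the function
names and the choice of fallback strings are mine, not something the XML
draft or 8879 specifies.

```python
# Hypothetical tables for one particular display application.
KNOWN_GLYPHS = {"auml": "\u00e4"}        # entities this display can render
SDATA_TEXT = {                           # entity text from the declarations
    "auml": "[auml    ]",
    "a.teng": "[TENGWAR LETTER VOWEL A]",
}

def display_sdata(name):
    """SDATA case: fall back to the entity text, which a person can read."""
    if name in KNOWN_GLYPHS:
        return KNOWN_GLYPHS[name]
    return SDATA_TEXT.get(name, "&" + name + ";")

def display_numeric(code, known_codes):
    """Text-entity case: an unknown character leaves only a bare number."""
    if code in known_codes:
        return chr(code)
    return "&#%d;" % code

sdata_result = display_sdata("a.teng")             # readable fallback text
numeric_result = display_numeric(27344, {228})     # just the number again
```

In the first case the reader is told it is a Tengwar vowel; in the second
all that survives is 27344.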

> On the whole, it seems to me that SDATA provides some advantages
> over text entities in cases 2 and 4; this is what I understand David
> and Lee and John to be arguing as well.

Yes.  The whole system becomes completely unusable with just numbers,
if you are working with anything outside the spec.
Latin scribal abbreviations might be too obscure to bother with,
as might mathematical symbols, since mathematicians will continue to
use TeX, I expect.

> In neither case do the advantages seem to me to be extremely large;

Well, when a system can't do what you need it to do at all, you don't
use it.  If we restrict ourselves to a subset of those documents that
can be handled by HTML, the advantages are vanishingly small.
But why not just use HTML for that?

Lee
Received on Wednesday, 23 October 1996 15:31:22 UTC