Re: ERB decisions on A.17, B.9, and other questions from Michael Sperberg-McQueen on 1996-10-23 (w3c-sgml-wg@w3.org from October 1996)

From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
Date: Wed, 23 Oct 96 10:14:33 CDT
To: John Lavagnino <John_Lavagnino@brown.edu>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
Message-Id: <199610231829.OAA11454@www10.w3.org>
[Summary:  John Lavagnino's note clarifies some SDATA issues but
also illustrates variations in descriptions of the right SDATA
behavior.  Discussion of four cases:  character in / not in ISO
10646, character known/not known to application.  SDATA seems to
offer some (small) advantages for unknown characters, text entities
some (small) advantages for known characters.  Reference to 8879.]

On Tue, 22 Oct 1996 22:05:04 -0400 John Lavagnino said:
>If you encounter an SDATA entity, you:
>
>--- take the entity text
>--- look it up in a table of SDATA-to-local-rendition conversions
>--- output the string that the table supplies, if there is one
>--- if not, complain (or not; this part is indeed undefined, but it is
>    possible to mandate some particular behavior)

This is a very clear account, and I thank John for providing it.

It should be noted, however, that this is not quite the same as what
has been described to me in private mail by other proponents of
SDATA, who say when you encounter an SDATA entity you:

 --- inform the application that an SDATA entity is beginning
 --- pass the application the entity text (and perhaps its name)
 --- inform the application that the SDATA entity is ending

I.e. you perform *no* lookup, you merely provide the application
with the power to perform lookup.
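
To make the difference concrete, here is a minimal sketch of the two
accounts in Python.  Everything in it is an assumption made for
illustration: the table contents, the function names, and the event
labels are hypothetical, and neither 8879 nor any XML draft defines
them.  Raising an error is only one possible reading of "complain".

```python
# Hypothetical table; a real application would supply its own entries.
LOCAL_RENDITIONS = {"[auml    ]": "\u00e4"}


def render_with_lookup(entity_text, table=LOCAL_RENDITIONS):
    """First account: look the entity text up and output the result."""
    try:
        return table[entity_text]
    except KeyError:
        # The 'complain' step is undefined; raising is just one choice.
        raise LookupError(f"no local rendition for SDATA text {entity_text!r}")


def sdata_events(name, entity_text):
    """Second account: no lookup, only a delimited pass-through."""
    yield ("sdata-start", name)
    yield ("sdata-text", entity_text)
    yield ("sdata-end", name)


def application_render(events, table=LOCAL_RENDITIONS):
    """A client application doing its own lookup on the passed text."""
    return "".join(
        render_with_lookup(payload, table)
        for kind, payload in events
        if kind == "sdata-text"
    )
```

Feeding the output of sdata_events() to application_render() gives the
same result as calling render_with_lookup() directly, which is the
sense in which the two accounts can be compatible.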

(I'm assuming that 'you' is the XML processor; if John is assuming
it's the client application, the two views are compatible.  On this
interpretation, John's account of application behavior does assume
that SDATA entity boundaries are visible to applications, which is a
common usage but not one required by 8879, as far as I can tell.)

If we assume that 'including SDATA in XML' means specifying the
behaviors above for the XML processor (normatively) and XML
applications (informatively), plus defining some convention for
specifying the replacement text so as to ensure it's not just
another character number, then it seems to me the cost/advantage
tradeoff has four cases, depending on whether the character is or is
not in ISO 10646, and whether the application does or does not
understand it.

1.  An ISO 10646 character known to the application:
    <!ENTITY auml '&#228;'>
    vs.
    <!ENTITY auml SDATA "[auml    ]" >
    or
    <!ENTITY auml SDATA "[LATIN SMALL LETTER A WITH DIAERESIS]" >

The text entity provides all the information a conforming
application needs.  Applications which display text must maintain a
Unicode-to-local mapping table, unless they have Unicode display
drivers (which effectively embody such tables).

The SDATA method requires each application to maintain a table of
SDATA-entity-text to character value or display value, or both.
Since the application must accept Unicode directly, it must also
maintain the same tables as for the text entity.

2.  An ISO 10646 character unknown to the application:

The text entity provides only a Unicode code point; the SDATA entity
provides either a quasi-mnemonic identifier or a full SC2-style
character name.

For fallback processing, the text-entity user must rely on
display-specific sets of entity declarations -- unfortunately,
without public identifiers these cannot be reliably labeled.  The
SDATA user can use the entity text (or name, if provided) to
generate fallback display text.
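
A rough sketch of the fallback just described, in Python.  The function
names and the bracketed output format are hypothetical choices made
for this illustration, not anything the XML draft or 8879 specifies.

```python
def fallback_from_sdata(entity_text):
    """Build fallback display text from SDATA entity text."""
    # Drop the conventional brackets and padding: "[auml    ]" becomes "auml".
    label = entity_text.strip().strip("[]").strip()
    return f"[{label}]"


def fallback_from_code_point(code_point):
    """Build fallback display text when only a code point is known."""
    # Without a further table, the number is all there is to show.
    return f"[U+{code_point:04X}]"
```

With the declarations from case 1, the first function would show
"[auml]" or the full character name in brackets, while the second
can show only "[U+00E4]".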

Text-entity applications must either provide less informative error
(or I-can't-display-this) messages, or else maintain a table mapping
Unicode code points to names.  SDATA applications can use the entity
text to provide an error message referring to "auml" or "latin small
letter a with diaeresis", which is at least as informative as
"character U+00E4".

3.  A non-ISO-10646 character known to the application:
    <!ENTITY a.teng '&#27344;'>
    vs.
    <!ENTITY a.teng SDATA "[a.teng  ]" >
    or
    <!ENTITY a.teng SDATA "[TENGWAR LETTER VOWEL A]" >

In both cases, private arrangements of a form not covered by the XML
spec are required.  In the SDATA case, these might take the form of
the user modifying the application's local lookup table to add the
desired characters; in the text entity case, they might take a
similar form, though the lookup table might look different.

The SDATA method seems to involve a slightly lower chance of
accidental collisions arising when private agreements inadvertently
use the same name or character position, and thus may have less need
of mechanisms to signal the applicability of this or that private
agreement.

4.  A non-ISO-10646 character unknown to the application:

Same as case 2, except that the user presented with only a code
point cannot find out, by consulting ISO 10646 or Unicode
documentation, what character is involved.

-----

On the whole, it seems to me that SDATA provides some advantages
over text entities in cases 2 and 4; this is what I understand David
and Lee and John to be arguing as well.  It seems to me that text
entities have the advantage in cases 1 and 3; I don't know if the
other participants in the discussion will take this view or not.

In neither case do the advantages seem to me to be extremely large;
they could easily be outweighed by other considerations (as indeed
they were, in the ERB discussion).

The key points appear to be

 - is the advantage in various cases large or small?
 - which cases are more important for the design of XML?

For what it's worth, here is the only passage I've found in 8879
which seems to bear on this issue (clause 8):

    A processing instruction that returns data must be defined as
    an "SDATA" entity and entered with an entity reference.  One
    that does not return data should be defined as a "PI" entity.

From this (and from no other passage) I infer that SDATA entities
can 'return data'.  This seems an apt description of the SDATA
behavior described by John Lavagnino and others, with one
difference:  in his commentary, Charles Goldfarb seems to assume
that the SGML parser does the lookup, while in current practice
SGML parsers do no such lookup but provide enough information to
allow a downstream application to do it.  The text of 8879 seems
(to this reader) not to specify who does the lookup.

-C. M. Sperberg-McQueen
Received on Wednesday, 23 October 1996 14:29:50 UTC