Re: ERB decisions on A.17, B.9, and other questions
At 10:14 AM 10/23/96, Michael Sperberg-McQueen wrote:
>[Summary: John Lavagnino's note clarifies some SDATA issues but
>also illustrates variations in descriptions of the right SDATA
>behavior. Discussion of four cases: character in / not in ISO
>10646, character known/not known to application. SDATA seems to
>offer some (small) advantages for unknown characters, text entities
>some (small) advantages for known characters. Reference to 8879.]
[summary: Apparent variation in behavior probably not intended by
Lavagnino, confirming alternate interpretation mentioned by Michael.
Re-analysis of the four cases in terms of when SDATA should sensibly be
used. Advantages now considered only for non ISO-10646 characters. In reply
to Michael's two mentions of 8879 on SDATA, I cite the ESIS proposal, the
most-widely implemented model of what an SGML implementation returns, and
its definition of SDATA processing.]
>This is a very clear account, and I thank John for providing it.
>It should be noted, however, that this is not quite the same as what
>has been described to me in private mail by other proponents of
>SDATA, who say when you encounter an SDATA entity you:
> --- inform the application that an SDATA entity is beginning
> --- pass the application the entity text (and perhaps its name)
> --- inform the application that the SDATA entity is ending
>I.e. you perform *no* lookup, you merely provide the application
>with the power to perform lookup.
>(I'm assuming that 'you' is the XML processor; if John is assuming
>it's the client application, the two views are compatible. On this
>interpretation, John's account of application behavior does assume
>that SDATA entity boundaries are visible to applications, which is a
>common usage but not one required by 8879, as far as I can tell.)
I talked with John the other day, and he was describing, I think, what a
reasonable application does, lumping parser and application behavior
together as you note in your parenthesis. XML should prescribe a lookup
table as described by John. SGML does not.
As for the point that SDATA entities should be visible to the application,
I will cite the ESIS spec, in Goldfarb, Appendix B, on page 592, point "j)".
I know that ESIS is not normative text, but it is very widely viewed as the
_minimal_ information available to an SGML-driven application.
" j) References to internal entities
The information passed to the application depends on the entity type:
SDATA Replacement text, identified as an individual SDATA entity.
PI Replacement text, identified as a processing instruction, but
not as an entity.
For other references, nothing is passed to the application."
I think this supports my contention that SDATA entities are application
visible, and among the few entities that are application visible at all,
excepting external SUBDOC and DATA entities.
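To make the ESIS point concrete, here is a rough Python sketch of the split I have in mind: the parser reports an SDATA reference as identified replacement text and does no lookup of its own, and the application consults its own table. Everything in it (the event classes, `ENTITIES`, `APP_TABLE`, `report_reference`, `render`) is a hypothetical name made up for illustration, not anything taken from 8879 or from any real parser.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Data:
    text: str  # ordinary character data; the entity reference is not visible


@dataclass
class SData:
    text: str  # replacement text, identified as an individual SDATA entity


@dataclass
class PI:
    text: str  # replacement text, identified as a processing instruction


Event = Union[Data, SData, PI]

# Hypothetical internal entity declarations: name -> (type, replacement text).
ENTITIES = {
    "a.teng": ("SDATA", "[a.teng ]"),
    "toc": ("PI", "insert-toc"),
    "pub": ("TEXT", "Boston University"),
}


def report_reference(name: str) -> Event:
    """Parser side: report a reference roughly as ESIS point j) describes."""
    kind, text = ENTITIES[name]
    if kind == "SDATA":
        return SData(text)
    if kind == "PI":
        return PI(text)
    # Other references: the application sees only data, not the entity.
    return Data(text)


# Application side: a local lookup table keyed on SDATA replacement text.
# The rendering strings are placeholders (assumptions).
APP_TABLE = {"[a.teng ]": "<glyph a.teng>"}


def render(event: Event) -> str:
    """Application side: do the lookup, or fall back to a readable message."""
    if isinstance(event, SData):
        return APP_TABLE.get(event.text, f"[unknown SDATA: {event.text}]")
    if isinstance(event, PI):
        return ""  # this sketch simply ignores processing instructions
    return event.text
```

With this split, an SDATA entity the application has never heard of still surfaces with its mnemonic replacement text rather than a bare number.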
> ....the cost/advantage
>tradeoff has four cases, depending on whether the character is or is
>not in ISO 10646, and whether the application does or does not
>Cases 1 + 2 (The ones concerning known ISO 10646 characters deleted)
SDATA usage could not be prevented for such characters, if SDATA entities
are available, but it should be deprecated and never implemented. _I_ am
certainly not proposing that SDATA should ever be used for any character
that has an ISO 10646 code outside of the private use area. It is the
application's job to handle 10646, and we should keep that in their
>3. A non-ISO-10646 character known to the application:
> <!ENTITY a.teng '櫐'>
> <!ENTITY a.teng SDATA "[a.teng ]" >
> <!ENTITY auml SDATA "[TENGWAR LETTER VOWEL A]" >
>In both cases, private arrangements of a form not covered by the XML
>spec are required. In the SDATA case, these might take the form of
>the user modifying the application's local lookup table to add the
>desired characters; in the text entity case, they might take a
>similar form, though the lookup table might look different.
>The SDATA method seems to involve a slightly lower chance of
>accidental collisions arising when private agreements inadvertently
>use the same name or character position, and thus may have less need
>of mechanisms to signal the applicability of this or that private
Michael lists this as a case where character codes have an advantage. I'm
not sure why. To me, using mnemonic names to identify things is always
better than using numbers to identify things. I think the chance of
collisions between private use character codes is relatively high, since
the simplest approach is to start with the first private use character and
go up from there. I think private use always requires an explicit
notification of convention, unless we assume that groups of individuals
only process documents by others who use the same private character sets.
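A small Python sketch of the collision argument, under the assumption that each group allocates private use codes the simple way, counting up from U+E000 (the start of the private use area in the BMP). The two groups and their character names are invented for illustration.

```python
PUA_START = 0xE000  # first code point of the BMP private use area


def assign_code_points(names, start=PUA_START):
    """Allocate private use codes the simplest way: first free slot upward."""
    return {name: start + i for i, name in enumerate(names)}


def code_point_collisions(a, b):
    """Return {code point: (name in a, name in b)} for codes both groups use."""
    inverse_a = {cp: name for name, cp in a.items()}
    return {cp: (inverse_a[cp], name) for name, cp in b.items() if cp in inverse_a}


def name_collisions(a, b):
    """Return the mnemonic names that both groups happen to use."""
    return set(a) & set(b)


# Two hypothetical private agreements, made independently of each other.
group_a = assign_code_points(["a.teng", "e.teng"])
group_b = assign_code_points(["math.slashedarrow", "math.bigop"])
```

Here `code_point_collisions(group_a, group_b)` reports a clash on both U+E000 and U+E001, while `name_collisions(group_a, group_b)` is empty. Names can collide too, of course, but only when two groups pick the same mnemonic, not merely because both started counting at the same place.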
>4. A non-ISO-10646 character unknown to the application:
>Same as case 2, except that the user presented with only a code
>point cannot find out, by consulting ISO 10646 or Unicode
>documentation, what character is involved.
>In neither case do the advantages seem to me to be extremely large;
>they could easily be outweighed by other considerations (as indeed
>they were, in the ERB discussion).
What were these other considerations?
>The key points appear to be
> - is the advantage in various cases large or small?
I think seeing an unknown character number message will be as annoying as
seeing an image with no "alt" tag is on the web.
> - which cases are more important for the design of XML?
Mathematics, Asian literary texts, and Asian commercial texts all seem to
have this problem with 10646. Are these important cases?
RE delenda est.
I am not a number. I am an undefined character.
David Durand email@example.com \ david@dynamicDiagrams.com
Boston University Computer Science \ Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams
MAPA: mapping for the WWW \__________________________