- From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
- Date: Wed, 23 Oct 96 10:14:33 CDT
- To: John Lavagnino <John_Lavagnino@brown.edu>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
[Summary: John Lavagnino's note clarifies some SDATA issues but also illustrates variations in descriptions of the right SDATA behavior. Discussion of four cases: character in / not in ISO 10646, character known/not known to application. SDATA seems to offer some (small) advantages for unknown characters, text entities some (small) advantages for known characters. Reference to 8879.]

On Tue, 22 Oct 1996 22:05:04 -0400 John Lavagnino said:
>If you encounter an SDATA entity, you:
>
>--- take the entity text
>--- look it up in a table of SDATA-to-local-rendition conversions
>--- output the string that the table supplies, if there is one
>--- if not, complain (or not; this part is indeed undefined, but it is
>    possible to mandate some particular behavior)

This is a very clear account, and I thank John for providing it. It should be noted, however, that this is not quite the same as what has been described to me in private mail by other proponents of SDATA, who say when you encounter an SDATA entity you:

--- inform the application that an SDATA entity is beginning
--- pass the application the entity text (and perhaps its name)
--- inform the application that the SDATA entity is ending

I.e. you perform *no* lookup, you merely provide the application with the power to perform lookup. (I'm assuming that 'you' is the XML processor; if John is assuming it's the client application, the two views are compatible. On this interpretation, John's account of application behavior does assume that SDATA entity boundaries are visible to applications, which is a common usage but not one required by 8879, as far as I can tell.)
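To make the difference between the two accounts concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the table contents, the event names, and the functions `resolve_sdata`, `sdata_events` and `render` are hypothetical and come from neither 8879 nor any XML draft. The first function does the lookup itself, as in John's account; the second only reports the entity boundaries and text, leaving the lookup to whoever consumes the events.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical SDATA-to-local-rendition table (illustrative entry only).
SDATA_TABLE: Dict[str, str] = {
    "[auml ]": "\u00e4",
}


class UnknownSdata(LookupError):
    """Raised when the entity text has no entry in the table."""


def resolve_sdata(
    entity_text: str,
    table: Dict[str, str] = SDATA_TABLE,
    on_missing: Optional[Callable[[str], str]] = None,
) -> str:
    """First account: look the entity text up and return the rendition.

    What happens on a miss is undefined in the account above; here the
    caller may supply a fallback, otherwise we complain.
    """
    if entity_text in table:
        return table[entity_text]
    if on_missing is None:
        raise UnknownSdata(entity_text)
    return on_missing(entity_text)


Event = Tuple[str, str]


def sdata_events(name: str, entity_text: str) -> Iterator[Event]:
    """Second account: no lookup, only boundaries and entity text."""
    yield ("sdata-start", name)
    yield ("sdata-text", entity_text)
    yield ("sdata-end", name)


def render(events: Iterable[Event], table: Dict[str, str] = SDATA_TABLE) -> str:
    """A downstream application doing its own lookup on the events."""
    out = []
    for kind, value in events:
        if kind == "sdata-text":
            # Fall back to the entity text itself when it is unknown.
            out.append(table.get(value, "[unknown: " + value + "]"))
    return "".join(out)
```

If the first function is taken to live in the application rather than the processor, the two views coincide: `render(sdata_events("auml", "[auml ]"))` and `resolve_sdata("[auml ]")` consult the same table.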
If we assume that 'including SDATA in XML' means specifying the behaviors above for the XML processor (normatively) and XML applications (informatively), plus defining some convention for specifying the replacement text so as to ensure it's not just another character number, then it seems to me the cost/advantage tradeoff has four cases, depending on whether the character is or is not in ISO 10646, and whether the application does or does not understand it.

1. An ISO 10646 character known to the application:

     <!ENTITY auml 'ä'>
   vs.
     <!ENTITY auml SDATA "[auml ]" >
   or
     <!ENTITY auml SDATA "[LATIN SMALL LETTER A WITH DIAERESIS]" >

The text entity provides all the information a conforming application needs. Applications which display text must maintain a Unicode-to-local mapping table, unless they have Unicode display drivers (which effectively embody such tables).

The SDATA method requires each application to maintain a table of SDATA-entity-text to character value or display value, or both. Since the application must accept Unicode directly, it must also maintain the same tables as for the text entity.

2. An ISO 10646 character unknown to the application:

The text entity provides only a Unicode code point; the SDATA entity provides either a quasi-mnemonic identifier or a full SC2-style character name.

For fallback processing, the text-entity user must rely on display-specific sets of entity declarations -- unfortunately, without public identifiers these cannot be reliably labeled. The SDATA user can use the entity text (or name, if provided) to generate fallback display text.

Text-entity applications must either provide less informative error (or I-can't-display-this) messages, or else maintain a table mapping Unicode code points to names. SDATA applications can use the entity text to provide an error message referring to "auml" or "latin small letter a with diaeresis", which is at least as informative as "character U+00E4".

3. A non-ISO-10646 character known to the application:

     <!ENTITY a.teng '櫐'>
   vs.
     <!ENTITY a.teng SDATA "[a.teng ]" >
   or
     <!ENTITY a.teng SDATA "[TENGWAR LETTER VOWEL A]" >

In both cases, private arrangements of a form not covered by the XML spec are required. In the SDATA case, these might take the form of the user modifying the application's local lookup table to add the desired characters; in the text entity case, they might take a similar form, though the lookup table might look different.

The SDATA method seems to involve a slightly lower chance of accidental collisions arising when private agreements inadvertently use the same name or character position, and thus may have less need of mechanisms to signal the applicability of this or that private agreement.

4. A non-ISO-10646 character unknown to the application:

Same as case 2, except that the user presented with only a code point cannot find out, by consulting ISO 10646 or Unicode documentation, what character is involved.

-----

On the whole, it seems to me that SDATA provides some advantages over text entities in cases 2 and 4; this is what I understand David and Lee and John to be arguing as well. It seems to me that text entities have the advantage in cases 1 and 3; I don't know if the other participants in the discussion will take this view or not. In neither case do the advantages seem to me to be extremely large; they could easily be outweighed by other considerations (as indeed they were, in the ERB discussion).

The key points appear to be

  - is the advantage in various cases large or small?
  - which cases are more important for the design of XML?

For what it's worth, here is the only passage I've found in 8879 which seems to bear on this issue (clause 8):

    A processing instruction that returns data must be defined as an
    "SDATA" entity and entered with an entity reference. One that does
    not return data should be defined as a "PI" entity.
From this (and from no other passage) I infer that SDATA entities can 'return data' -- this seems an apt description of the SDATA behavior being described by John Lavagnino and others, except that in his commentary, Charles Goldfarb seems to be assuming that the SGML parser is to do the lookup, while in current practice SGML parsers do no such lookup, but provide enough information to allow a downstream application to do the lookup. The text of 8879 seems (to this reader) not to specify who does the lookup.

-C. M. Sperberg-McQueen
Received on Wednesday, 23 October 1996 14:29:50 UTC