SDATA entities from Anders Berglund on 1996-10-21 (w3c-sgml-wg@w3.org from October 1996)

From: Anders Berglund <alb@ebt.com>
Date: Mon, 21 Oct 96 12:01:24 EDT
To: w3c-sgml-wg@w3.org
Cc: alb@ebt-inc.ebt.com
Message-Id: <9610211601.AA08106@worf>
>-C. M. Sperberg-McQueen writes:
> 
>If I am reading the postings from David Durand, Lee Quin, and John
>Lavagnino correctly, it seems that my earlier note embodied a large
>misconception.  I had asked for references to 8879 to explain the
>behavior they seem to be associating with SDATA entities; they
>seem, if I understand correctly, to be unanimous in saying that
>8879 does not in fact define the desired behavior at all, but that
>it's widely implemented and well understood (and, I assume, should
>be incorporated into XML 1.0).
> 
>So what seems to be at issue is *not* whether XML 1.0 should retain
>the SGML 'SDATA' keyword, but whether it should retain that keyword
>and prescribe some particular behavior for processors encountering
>it.
> 
>Here's a confession that if this behavior is widely implemented, I
>haven't seen it in the software I use most frequently.  Perhaps I
>just don't read the manuals carefully enough, but the only program
>that comes to mind as describing any behavior like this is Panorama,
>and its functionality (rendering correctly all the characters in the
>ISO entity sets for which the local system has font resources) can
>be done with character references to ISO 10646, since all the
>characters in the ISO entity sets are in 10646 (some glyphs
>representing ligatures are missing).

This is unfortunately a totally incorrect statement. The ISO entity
sets contain a LARGE number of entities for which there are NO CHARACTERS
in 10646. The ISO entity sets for maths and science (current versions
in 9573-13, published in '92) have several hundred entities that are
not in 10646. The current working draft of 9573-15 (entities for non-Latin
alphabets, which includes a number of sets for scholarly usage)
has an entity repertoire of which maybe half are in 10646.

Revision of 10646 is progressing very slowly - ISO/IEC JTC1/SC18 requested
the addition of the characters for maths and science from 9573-13
on both DIS 1 and 2 (4 years ago) and so far there is no sign of
any addition. Some standards bodies have suggested creating a
"competitor" standard in ECMA and then "fast-tracking" that as additions
to 10646. Thus I believe that it is highly important that XML
includes SDATA entities, and in a way that is decoupled from 10646.
Use of "private use" areas in 10646 will, I believe, lead to the same
mess that e.g. IBM ended up with in EBCDIC.

At the recent SGML Asia Pacific conference it was very interesting to
hear the complaints from "CJK" representatives on 10646.

>As to the other proposition, that it is widely understood, I can
>only say that if it is widely understood, then surely someone can be
>found to describe the behavior on this list explicitly enough that
>the rest of us can also understand it.  I for one do not understand
>the behavior, and would really like a description, preferably with
>reference to some documentation.

In DSSSL the following approach was taken. It is based on the observation
that most applications/implementations use common NAMES for the
SDATA entities (mainly those in the ISO entity sets). In the
implementation-specific files that use those common NAMES, they either
totally ignore the SDATA replacement text (e.g. FrameMaker SGML) or put
some system-specific syntax in the replacement text (e.g. DynaText):

- a DSSSL character is named and you have an infinitely large name
  space. A character is decoupled from the encoding BUT some specification
  shortcuts are possible for those characters that are in 10646  
  (especially useful for the character properties)
- a coded input character (eg ISO 8859-x, ISO/IEC 10646, EBCDIC) is mapped
  to a DSSSL character
- an SDATA entity is mapped to a DSSSL character, primarily based on
  the entity name (or on the replacement text for systems where the
  entity name is not available)
- a DSSSL character is mapped to a glyph identifier
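
The four mappings above can be sketched in a few lines of Python. This is
only an illustration of the idea, not DSSSL syntax and not any product's
implementation: the table contents, the character names, the glyph
identifiers and the function names are all assumptions made up for the
example, and "example-sign" is a placeholder entity, not one taken from
the ISO sets.

```python
# Sketch only: every table entry and identifier below is illustrative.

# Coded input character -> character name, keyed by (encoding, code value).
CODED_TO_NAME = {
    ("iso-8859-1", 0xE9): "latin-small-letter-e-with-acute",
    ("iso-10646", 0xE9): "latin-small-letter-e-with-acute",
    ("iso-10646", 0x3B1): "greek-small-letter-alpha",
}

# SDATA entity name -> character name. The last entry stands for a
# character that has no coded representation at all.
ENTITY_TO_NAME = {
    "eacute": "latin-small-letter-e-with-acute",
    "alpha": "greek-small-letter-alpha",
    "example-sign": "example-sign",
}

# Fallback for systems that only expose the SDATA replacement text.
REPLACEMENT_TO_NAME = {
    "[eacute]": "latin-small-letter-e-with-acute",
}

# Character name -> glyph identifier (hypothetical identifiers).
NAME_TO_GLYPH = {
    "latin-small-letter-e-with-acute": "glyph/eacute",
    "greek-small-letter-alpha": "glyph/alpha",
    "example-sign": "glyph/example-sign",
}


def name_from_coded(encoding, code):
    """Map a coded input character to a character name."""
    return CODED_TO_NAME[(encoding, code)]


def name_from_sdata(entity_name=None, replacement_text=None):
    """Map an SDATA entity to a character name.

    The entity name is tried first; the replacement text is used only
    when no known entity name is available.
    """
    if entity_name is not None and entity_name in ENTITY_TO_NAME:
        return ENTITY_TO_NAME[entity_name]
    if replacement_text is not None and replacement_text in REPLACEMENT_TO_NAME:
        return REPLACEMENT_TO_NAME[replacement_text]
    raise KeyError((entity_name, replacement_text))


def glyph_for(name):
    """Map a character name to a glyph identifier."""
    return NAME_TO_GLYPH[name]
```

In this sketch an e-acute arriving as an ISO 8859-1 code, as a 10646
code, or as the SDATA entity "eacute" ends up at the same named character
and hence the same glyph, while "example-sign" reaches a glyph without
ever needing a code position.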

This approach has the following advantages:

- decoupling from the input character code - you may mix and match
  input encodings
- decoupling from existing coding standards that are ALL inadequate
- infinite character repertoire - very important for publishing
  and scholarly applications

One should note that the DSSSL approach is "equivalent" to mapping 
an input character in a particular encoding to a particular glyph in
a font.

Anders Berglund

Editor ISO/IEC 9573 Techniques for Using SGML
where the (revised) ISO entity sets are included
Member of the ISO/IEC JTC1/SC18/WG8 RG on DSSSL
Received on Monday, 21 October 1996 12:03:57 UTC