Re: Kind of surprising commercial XML application (long) from Harvey Bingham on 1997-03-06 (w3c-sgml-wg@w3.org from March 1997)

From: Harvey Bingham <hbingham@ACM.org>
Date: Thu, 06 Mar 1997 00:14:02 -0500
To: w3c-sgml-wg@w3.org
Message-Id: <2.2.32.19970306051402.006e8390@tiac.net>
At 07:48 AM 3/5/97 -0800, Tim Bray wrote:
>Check out 
>
> http://www.ile.com/opentag/
>
>I don't understand the application domain enough to really be fully 
>in tune with all the implications, but it's an interesting read. - Tim

Thanks, Tim for the pointer.

Executive Summary: The primary connection to this wg is their marketing 
hype, claiming they do XML. The novel idea for natural language
processing is that text zones can be extracted from a word-processing
proprietary format and tagged to the opentag DTD, manipulated (natural
language translation of those text zones), and then restored to the
places where the original text zones were, yielding the same formatting
for a new document in the different natural language. Custom
translators to and from the proprietary format are needed for the
25 elements of the OpenTag DTD.

----
1. Analysis:

OpenTag focuses on localization (natural langage processing, NLP) of text
in documents in any proprietary formatting text processor. It uses its
standard DTD that will support open data encoding methods during 
the localization process. It provides a 25-element DTD that assertedly
preserves the significant clues useful for producing natural language
translation of each text object. Presumably translation is easier using 
the opentag markup. When the translation is complete, the reverse process
occurs: it substitutes the translated text objects back into the 
corresponding places in the original formatted document, so all the
original formatting markup is reused. The SGML version is only the 
intermediary. It need not persist beyond the translation process.

They assert "XML compliant (and therefore SGML compliant as well)". They
aren't there yet. [See Eve's message as well.]

The OpenTag DTD preserves a few styles that might provide aids to
translation [but not font, face, pointsize information, that might
convey hierarchy or captioning]. Presumably a human translator
would have the original formatted document to work with that could
convey this other information.

A tagged instance has sequential numeric ids assigned to 
paragraphs in a document and again ordered sequential numeric ids to 
each inline textual object in each paragraph. As the content model 
includes tags for inlines in a paragraph, the mixed content #PCDATA 
parts would remain in the same order separated by injections of inlines.
A generic x in-line code tag is available, that might be used to
surround #PCDATA.

It handles footnotes, index terms, and bibliographic references. 
It preserves column breaks and table cell breaks. 

They allow (character) codeset change within a document from
that specified in an initial PI.
 
Comparing different localizations for the same formatting language X
of the same original would all share the ordered ids. 

Transformation to a different target formatting language Y, such as RTF,
HTML, Wordperfect, etc. would not generally be possible, unless there
were a richer way to assign corresponding style recognizers and generators.

Some formatting languages such as MicroSoft Powerpoint scramble object
order. There is no implicit left-to-right, top-down text progression
as is true for western languages in most word-processing format languages.
(Even in those, table cell line-wrapping can break that model.)

I believe the opentag document instances will lose much information 
in the translation process of mapping everything onto their 25 generic
elements, nine of which are EMPTY. The consequence is that they will 
not be able to generally go from X to different formatting language Y.

2. The DTD -- opentag.dtd

The OpenTag DTD claims conformance with
"XML, Extensible Markup Language, working draft 11-14-96".

The assertion of compatiblity with XML is commendable, but overstated.

Omissible tag indicators should not be present. Or, if they are,
the omissible endtag on a content model is improper:

<!ELEMENT P - O ...> is inconsistent with XML.

All hierarchic structures are flattened into a sequence of paragraphs P.

Self-exclusion exceptions occur, on b,i,u,d bold, italics, underline,
doubleunderline) indicate a style attribute on something -- but why stop
with those few? smallcaps, overbar, strikethrough, subscript, superscript, etc.]

<!ELEMENT so - - RCDATA>

RCDATA is disallowed in XML.

There is no element declaration for level, used in the ixd content
model. Instead there is a lvl element, with pernicious mixed
content, ((%paratxt;)*,so) that is illegal in XML.

The PB tag semantics refer to an otherwise undefined definition group.


3. The OpenTag Format Specifications -- otspecs.htm

This includes a markup reference to XML, character entities (either
SGML &name;, or Unicode &#xxxx;), a file header, and a tag description
table, providing some semantics. 
An initial PI file header describes version, type, encoding, locale, 
and original. This suggests document==file. There is no mention of
external entities.

No attributes are on the base document element, opentag.
Almost universal is an id NUMBER attribute. Few other attributes exist.
The element type distinguishes use: target or reference for the id value.

The example for paragraph (which has a content model) "<p id=1/>" is 
inconsistent with SGML. It uses the XML empty tag variant "/>" in 
place of tagc, not now present in SGML.

I am confused by the use of id as a numeric (not ID) attribute. It is
appropriate for the textual definitions of footnotes,
index terms, and (biographic) references -- FND, IXD and
RFD, but also in their EMPTY points of reference in FN, IX, and RF. 
As the latter are actually references, in SGML I believe it would be 
better with a distinguishing attribute name, possibly "idr", rather than
overload id. I see no reason why the "idr" need to be unique in the
file. Several references to the same object should be allowed.
[The end of the RFD semantics seems in error, referring to footnote
definition, rather than reference definition.]

The use of sequential integers starting from 1 at each <p> for each 
in-line makes any use of the id values during language processing
awkward. Since each paragraph has its own sequence, presumably the 
paragraph ids, unique to the document, also a set of sequential numbers
starting at 1 as qualifiers to disambiguate the ids of the in-line children.
But since the inline id values are reused, they are not 
sufficientby themselves to locate any object, other than indirectly 
through its ancestors.

The generic inline group can be arbitrarily moved in a P. That seems to
means that all id values in that P need to be adjusted after such 
movement. [Any editing to add, reorder, or delete objects in a P has
the same problem.] Perhaps this is ruled out in the localization NLP.

Revisions across a set of localized versions of a document could use
opentag versions to linearize the structure and provide some degree
of common navigation to common places. There are no attributes that
would indicate any revision control.

From this cursory look, I believe OpenTag still needs significant work.

Regards/Harvey Bingham
mailto:hbingham@acm.org
Received on Thursday, 6 March 1997 00:15:10 UTC