- From: Harvey Bingham <hbingham@ACM.org>
- Date: Thu, 06 Mar 1997 00:14:02 -0500
- To: w3c-sgml-wg@w3.org
At 07:48 AM 3/5/97 -0800, Tim Bray wrote: >Check out > > http://www.ile.com/opentag/ > >I don't understand the application domain enough to really be fully >in tune with all the implications, but it's an interesting read. - Tim Thanks, Tim for the pointer. Executive Summary: The primary connection to this wg is their marketing hype, claiming they do XML. The novel idea for natural language processing is that text zones can be extracted from a word-processing proprietary format and tagged to the opentag DTD, manipulated (natural language translation of those text zones), and then restored to the places where the original text zones were, yielding the same formatting for a new document in the different natural language. Custom translators to and from the proprietary format are needed for the 25 elements of the OpenTag DTD. ---- 1. Analysis: OpenTag focuses on localization (natural langage processing, NLP) of text in documents in any proprietary formatting text processor. It uses its standard DTD that will support open data encoding methods during the localization process. It provides a 25-element DTD that assertedly preserves the significant clues useful for producing natural language translation of each text object. Presumably translation is easier using the opentag markup. When the translation is complete, the reverse process occurs: it substitutes the translated text objects back into the corresponding places in the original formatted document, so all the original formatting markup is reused. The SGML version is only the intermediary. It need not persist beyond the translation process. They assert "XML compliant (and therefore SGML compliant as well)". They aren't there yet. [See Eve's message as well.] The OpenTag DTD preserves a few styles that might provide aids to translation [but not font, face, pointsize information, that might convey hierarchy or captioning]. Presumably a human translator would have the original formatted document to work with that could convey this other information. A tagged instance has sequential numeric ids assigned to paragraphs in a document and again ordered sequential numeric ids to each inline textual object in each paragraph. As the content model includes tags for inlines in a paragraph, the mixed content #PCDATA parts would remain in the same order separated by injections of inlines. A generic x in-line code tag is available, that might be used to surround #PCDATA. It handles footnotes, index terms, and bibliographic references. It preserves column breaks and table cell breaks. They allow (character) codeset change within a document from that specified in an initial PI. Comparing different localizations for the same formatting language X of the same original would all share the ordered ids. Transformation to a different target formatting language Y, such as RTF, HTML, Wordperfect, etc. would not generally be possible, unless there were a richer way to assign corresponding style recognizers and generators. Some formatting languages such as MicroSoft Powerpoint scramble object order. There is no implicit left-to-right, top-down text progression as is true for western languages in most word-processing format languages. (Even in those, table cell line-wrapping can break that model.) I believe the opentag document instances will lose much information in the translation process of mapping everything onto their 25 generic elements, nine of which are EMPTY. The consequence is that they will not be able to generally go from X to different formatting language Y. 2. The DTD -- opentag.dtd The OpenTag DTD claims conformance with "XML, Extensible Markup Language, working draft 11-14-96". The assertion of compatiblity with XML is commendable, but overstated. Omissible tag indicators should not be present. Or, if they are, the omissible endtag on a content model is improper: <!ELEMENT P - O ...> is inconsistent with XML. All hierarchic structures are flattened into a sequence of paragraphs P. Self-exclusion exceptions occur, on b,i,u,d bold, italics, underline, doubleunderline) indicate a style attribute on something -- but why stop with those few? smallcaps, overbar, strikethrough, subscript, superscript, etc.] <!ELEMENT so - - RCDATA> RCDATA is disallowed in XML. There is no element declaration for level, used in the ixd content model. Instead there is a lvl element, with pernicious mixed content, ((%paratxt;)*,so) that is illegal in XML. The PB tag semantics refer to an otherwise undefined definition group. 3. The OpenTag Format Specifications -- otspecs.htm This includes a markup reference to XML, character entities (either SGML &name;, or Unicode &#xxxx;), a file header, and a tag description table, providing some semantics. An initial PI file header describes version, type, encoding, locale, and original. This suggests document==file. There is no mention of external entities. No attributes are on the base document element, opentag. Almost universal is an id NUMBER attribute. Few other attributes exist. The element type distinguishes use: target or reference for the id value. The example for paragraph (which has a content model) "<p id=1/>" is inconsistent with SGML. It uses the XML empty tag variant "/>" in place of tagc, not now present in SGML. I am confused by the use of id as a numeric (not ID) attribute. It is appropriate for the textual definitions of footnotes, index terms, and (biographic) references -- FND, IXD and RFD, but also in their EMPTY points of reference in FN, IX, and RF. As the latter are actually references, in SGML I believe it would be better with a distinguishing attribute name, possibly "idr", rather than overload id. I see no reason why the "idr" need to be unique in the file. Several references to the same object should be allowed. [The end of the RFD semantics seems in error, referring to footnote definition, rather than reference definition.] The use of sequential integers starting from 1 at each <p> for each in-line makes any use of the id values during language processing awkward. Since each paragraph has its own sequence, presumably the paragraph ids, unique to the document, also a set of sequential numbers starting at 1 as qualifiers to disambiguate the ids of the in-line children. But since the inline id values are reused, they are not sufficientby themselves to locate any object, other than indirectly through its ancestors. The generic inline group can be arbitrarily moved in a P. That seems to means that all id values in that P need to be adjusted after such movement. [Any editing to add, reorder, or delete objects in a P has the same problem.] Perhaps this is ruled out in the localization NLP. Revisions across a set of localized versions of a document could use opentag versions to linearize the structure and provide some degree of common navigation to common places. There are no attributes that would indicate any revision control. From this cursory look, I believe OpenTag still needs significant work. Regards/Harvey Bingham mailto:hbingham@acm.org
Received on Thursday, 6 March 1997 00:15:10 UTC