Re: Inclusions from C. M. Sperberg-McQueen on 2011-02-02 (xmlschema-dev@w3.org from February 2011)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Wed, 2 Feb 2011 13:18:03 -0700
To: Andrew Leslie <info@structuredinformation.co.uk>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, <xmlschema-dev@w3.org>
Message-Id: <8EE9EEDF-AA94-4D8D-A9ED-2E87DB5DC8B9@blackmesatech.com>

On Feb 2, 2011, at 4:42 AM, Andrew Leslie wrote:

> A customer of mine requires I translate S1000D 1.8 SGML to S1000D 3.0.1 XML.
> One of the issues with this is inclusions; they are allowed in SGML but not in XML.
>  
> Eg., within 1.8 SGML we have :
>  
> <!ELEMENT descript - o (para*,(%spcpara;),para0*) +(figure | foldout | table | caption) >
>  
>  
> Where figure, foldout, table and caption are allowed anywhere within descript and its subelements.
>  
> But within 3.0.1 XML we have
>  
> <!ELEMENT descript (((para*, ((warning*, caution*), note*), para0*) | ((figure | multimedia | foldout | table) | caption))*)>
>  
> Where figure, foldout, table and caption are allowed anywhere but only as direct descendants to descript.
>  
>  
> Other than simply extending the 3.0.1 schema to allow for inclusions (which I really do not want to do if at all possible), are there any other methods which may be more appropriate ?
>  
> The customer is not keen on normalizing their data in any way.

Are you constrained to use the translation into XML that you quote?
Or is that just the current best effort?

I understand your reluctance to modify the 3.0.1 DTD to allow
the relevant elements in the appropriate places (I've done it for
some vocabularies, and it can be tedious work), but in principle
if you want the SGML and the XML to have the same element
structure, the right thing to do really is to formulate an XML DTD
that enforces something like the same rules as the SGML DTD.

If I had to do another SGML-to-XML DTD conversion, I think
I would try to write a tool to automate the handling of inclusions
(under user control), to reduce the tedium and reduce the chance
of error.  

By far the simplest way to get the correct result is to define a
parameter entity I for the inclusions, change every element
reference X in the content model to (X, (%I;)*), and add (%I;)*
at the beginning of the model as well.  The result is not always
attractive: sometimes other formulations accept the same sequences
of children and are clearer.  With this approach, the content model
for descript becomes

    ((%I;)*, 
    (para, (%I;)*)*, 
    (((warning, (%I;)*)*, 
     (caution, (%I;)*)*), 
     (note, (%I;)*)*), 
    (para0, (%I;)*)*) 
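
The rewriting rule above is mechanical enough to script. What follows is
only a minimal sketch in Python, not an existing tool: the function name
add_inclusions and the TOKEN pattern are hypothetical, and it assumes
pure element content. Parameter-entity references other than the
inclusion entity are left untouched, so their replacement text would
need the same treatment separately; mixed content (#PCDATA) is rejected
rather than guessed at. Since the SGML inclusion also applies to
subelements of descript, the same step would presumably have to be run
on their content models too.

```python
import re

# Matches either a parameter-entity reference (%name;) or an element name.
TOKEN = re.compile(r"%[\w.:-]+;|#?[A-Za-z_:][\w.:-]*")


def add_inclusions(model: str, entity: str = "I") -> str:
    """Rewrite an element-content model so that (%entity;)* may appear
    at the start and after every element reference."""
    stripped = model.strip()
    if stripped in ("EMPTY", "ANY"):
        # Declared content keywords are returned unchanged.
        return stripped
    if "#PCDATA" in stripped:
        # XML restricts the form of mixed content; handle it by hand.
        raise ValueError("mixed content needs separate handling")

    incl = f"(%{entity};)*"

    def wrap(match: re.Match) -> str:
        token = match.group(0)
        if token.startswith("%"):
            # Leave parameter-entity references as they are.
            return token
        # Element reference X becomes (X, (%I;)*); any occurrence
        # indicator (*, +, ?) that follows stays attached to the group.
        return f"({token}, {incl})"

    return f"({incl}, {TOKEN.sub(wrap, stripped)})"


if __name__ == "__main__":
    print(add_inclusions("(para*, ((warning*, caution*), note*), para0*)"))
```

Apart from one extra pair of parentheses around the original model, the
output has the same shape as the hand-written model above.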

If this looks too ugly, and you know that all the documents are
in fact valid against the SGML DTD, then you might consider
using one of the various tools around that read a body of documents
and produce a DTD (or sometimes nowadays a schema in another
schema language).  That exercise will tell you where the inclusion
exceptions in the SGML DTD are actually used and have to be
accounted for in the XML DTD, as opposed to where they might
theoretically have been used.  That may help you produce simpler
content models.
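
Short of running a full DTD-inference tool, a rough survey of where the
included elements actually occur can be scripted. The sketch below is
only an illustration: it assumes the documents have already been
converted to well-formed XML, the function name count_inclusion_sites is
hypothetical, and the set of element names is taken from the SGML
declaration quoted above. It counts every parent/child pair, so it does
not distinguish occurrences that a content model already permits from
those that rely on the inclusion.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Elements named in the inclusion exception of the quoted SGML declaration.
INCLUDED = frozenset({"figure", "foldout", "table", "caption"})


def count_inclusion_sites(xml_text: str, included=INCLUDED) -> Counter:
    """Count (parent, child) pairs where an included element occurs."""
    root = ET.fromstring(xml_text)
    sites = Counter()
    for parent in root.iter():
        # Iterating an Element yields its direct children only.
        for child in parent:
            if child.tag in included:
                sites[(parent.tag, child.tag)] += 1
    return sites


if __name__ == "__main__":
    sample = (
        "<descript><para>text<figure/></para><table/>"
        "<para0><para><figure/></para></para0></descript>"
    )
    for (parent, child), n in sorted(count_inclusion_sites(sample).items()):
        print(f"{child} inside {parent}: {n}")
```

Summing the counters over the whole document set would show which
parent elements need (%I;)* in their content models at all.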

If I understand you correctly, the desiderata for the translation are

  (1) no re-arrangement ('normalization') of the data
  (2) output valid against an XML DTD to be specified (or:
      against the XML DTD you quote from)
  (3) no heavy lifting in modifying the XML DTD

I may be unduly pessimistic, but I don't think it's possible to get all
three of those in the normal case, especially given that the XML
DTD you quote from does not recognize anything like the same set
of documents as the SGML DTD.

I hope this helps.

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************

Received on Wednesday, 2 February 2011 20:18:37 UTC