Re: "Notes on logic grammars and XML Schema" meets Blindfold

[I'm cc'ing www-archive on a reply which includes text written by
myself and cmsmcq, who has given permission to be quoted like this.]

On 2002-09-19 07:17:16, C. M. Sperberg-McQueen wrote (privately):
> At 2002-09-18 06:33, Sandro Hawke wrote (privitely):
>  > C. M. Sperberg-McQueen Writes (privately):
>  >> Following up on a suggestion made by Allen Brown of MS at the XML
>  >> Schema meeting 1-2 August, have been reading about definite-clause
>  >> translation grammars (DCTGs) and thinking about how to apply them
>  >> to the formalization of the XML Schema specification.  I've gotten
>  >> some enjoyment from working out examples.
> 
>  >>    http://www.w3.org/People/cmsmcq/2002/dctgnotes
> 
>  >> So far, it seems quite plausible to believe that a rather compact
>  >> DCTG can serve as a schema processor for a hard-coded schema; I
>  >> have high hopes that a second-level DCTG can be used to read schema
>  >> documents and generate the instance-level DCTGs.  SWI-Prolog's XML
>  >> parser is a big advantage here.
> 
>  > I should add a few thoughts here.  I've looked over your notes,
>  > although not followed each example in detail.  I don't quite grok
>  > the PSVI in this context; my work here has been in parsing XML files
>  > and coming up with RDF triples saying what was meant;
> 
> This sounds like work I would be interested in reading about; it
> sounds as if it connects with work I'm doing with Claus Huitfeldt in
> Bergen and Allen Renear and David Dubin in Urbana [1], as well as with
> the schema annotation work EricM and Ralph and I are doing [2].  Are
> you referring to blindfold here?  Or one of the other things you
> mention on your Team page? For translations like this what mechanism
> do you now use to tell the software what RDF triples to generate from
> the input? And are you satisfied with that mechanism or is it
> temporary until something better emerges?  [Oops; questions look
> premature, you provide further info below.]

Yeah, it's blindfold, which is alas not very well documented.  I've
tended to work on it in brief periods and then get interupted, leaving
things kind of a mess.  

>   [1] Published:
>    http://www.w3.org/People/cmsmcq/2000/mim.html
>    http://www.w3.org/People/cmsmcq/2002/EML2002Sper0518.final
>   Unpublished: borderline-coherent work log:
>    http://www.w3.org/People/cmsmcq/2002/desperately-seeking-skeletons
> 
>   [2] Background:
>    http://www.w3.org/People/cmsmcq/2002/schema-annotation.html
>   Current working outline:
>    http://www.w3.org/2002/06/emmsm-schema-annotation.xml (an
>    XSLT-enabled browser makes this more legible)
> 
>  > the PSVI is RDF triples saying lots of stuff about the document as
>  > read under that schema.  Hrm...
> 
> Yes, I think that if you squint enough the problem I'm looking at
> (read XML, write out the PSVI) looks a lot like the problem of reading
> XML and writing out RDF graphs or logical expressions.  There are
> enough devils in the details (including the fact that an important
> part of the production of the PSVI is testing certain fairly specific
> and fairly elaborate constraints expressed in the schema) that it
> doesn't seem like an obvious application for a generic annotation
> language, unless the annotation language is Turing-complete.

Perhaps.  I wonder how much of the Turing-completeness is really
necessary, though.  I'd think first-order logic would actually be a
decent match, although of course it would end up looking like LISP or
Prolog on the more complex cases.

>  > My first approach was using DCGs, but after a while I switched to
>  > lex/yacc.  I think my main reason was a concern that DCG's could not
>  > be acceptable to the wider XML community due to prejudice against
>  > prolog, along with actual performance problems.  Some of my tests
>  > had the non-linear performance characteristic of bad parsers, which
>  > I imagined would be completely unacceptable to the database
>  > community.  Someone who better understands prolog under the hood (or
>  > got luckier, or was working on a slightly different task) might not
>  > have this problem.  ( I'm also just a lot more comfortable with C,
>  > despite a my love for prolog. )
> 
> Yes; performance is a bit of a risk, at least for any plans to make
> the DCTG stuff I'm doing into a system people would actually use day
> to day.  I would not mind that at all, but my initial goal is more
> modest: to show that one can go easily from the spec to the DCTG, with
> confidence that the DCTG correctly represents the rules of the spec,
> and then make the DCTGs run.  The goal here is compactness in the DCTG
> form, and clarity.  I'm a bit worried about the latter: I find most of
> the code fairly clear and easy to read, but there is a danger that the
> WG will not find it any clearer or more useful than the
> Gentzen-calculus rules that are in the draft Formal Description of XML
> Schema.
> 
> That is, if I can get a conforming schema processor by writing a
> relatively small DCTG, it won't matter if it's slow.  Possibly it
> won't matter even if it's very slow.  It just needs to be visibly
> correct.

Okay, I see what you mean.  I would like that, too, esp for the RDF
grammar, but I'm less excited by that (perhaps since I'm not on the
WG).

> I do have some questions about whether yacc and lex are optimal tools
> here; more below.
> 
>  > My approach to handling XML was to turn an XML stream into a
>  > character strean with extra tokens, so
> 
>  >  <a1 b="c">d</a>
> 
>  > turned into
> 
>  >    START_ELEMENT a 1 START_ATTRIBUTE_TAG b START_ATTRIBUTE_VALUE c
>  >   START_CONTENT d END_ELEMENT
> 
>  > I did this in some code called yylexpat (yacc expects the tokens to
>  > come from a function called yylex; my yylex was an expat wrapper).
>  > To simplify parsing, yylexpat also expanded namespaces and sorted
>  > attribute tags.  (If you don't sort attribute tags, the grammar
>  > needs to look for them in any order, which makes it exponentially
>  > convoluted when some of them are manditory.)
> 
> Yes; my current plan is to check for mandatory attributes via a
> special condition-check, rather than expressing it grammatically.  It
> makes it seem kind of ad hoc, but it can be combined conveniently with
> the task of supplying default values for attributes, which is
> something a schema processor needs to do.  And it's actually somewhat
> easier to understand as an ad hoc check-for-required-atts() call than
> it would be if compiled into the grammar.
> 
> Wait -- there's another way, which I'll want to look into.  Both
> required and defaulted attributes can be handled using attributes in
> an attribute grammar -- and one of the nicer features of DCGs and
> DCTGs is that they work very nicely with attribute grammars.
> 
> Suppose we have an element with declarations for attributes a, b, c,
> d, e, and f, with a and e (vowels) required. I think we can say
> something like this, in DCTG notation.
> 
>    /* the non-terminal all_e_atts can be rewritten as e_atts, as long
>     * as the 'names' attribute of the e_atts instance has a value
>     * which includes 'a' and 'e' as members.
>     */
>    all_e_atts ::= e_atts^^A, { member(a,A^^names), member(e,A^^names) }.
> 
>    /* e_atts can be empty.  If so, its names attribute has the value []
>     * (empty list)
>     */
>    e_atts ::= []
>      <:>
>      names([]).
> 
>    /* e_atts can consist of an att-for-e, followed by an e_atts.
>     * If so, the names attribute is calculated by consing the value
>     * of the name attribute of the att-for-e onto the (value of the)
>     * names attribute of the e_atts.
>     */
>    e_atts ::= att-for-e^^Att, e_atts^^List
>      <:>
>      names(A|L) ::- Att^^name(A), List^^names(L).
> 
>    /* att-for-e is an attribute named a, b, c, etc.  In each case
>     * the 'name' attribute of the non-terminal is the name of the
>     * attribute (a, b, c, etc).  This grammar omits typechecking and
>     * so on, to allow focus on using the name/names attributes to
>     * provide for simple check on required atts.
>     */
>    att-for-e  ::= [a=Avalue] <:> name(a).
>    att-for-e  ::= [b=Bvalue] <:> name(b).
>    att-for-e  ::= [c=Cvalue] <:> name(c).
>    att-for-e  ::= [d=Dvalue] <:> name(d).
>    att-for-e  ::= [e=Evalue] <:> name(e).
>    att-for-e  ::= [f=Fvalue] <:> name(f).

I'm having trouble getting my head around that, perhaps because of
the hour.   What's wrong with 
    attrs ::= a [b] [c] [d] e [f]         # ebnf style
or
    attrs ::= a b? c? d? e f?             # regexp (Kleene) style

>  > I had in mind the same two level plan as you.  My approach looked
>  > more like:
> 
>  >    1a.  read BNF-with-annotation; turn it into an abstract grammar
>  >    1b.  read XML-Schema-with-annotations; turn it into an abstract
>  >         grammar
>  >    1c.  ... etc
>  >
>  >    2.  Compile the abstract annoted grammar, via lex/yacc, into C, &
>  >        run it.
> 
>  > I wrote 1a by hand, with the plan that 1b, 1c, ... could be written
>  > in the language of 1a.
> 
> I guess this is the place to wonder about yacc.  Yacc's quite
> convenient for calculating synthesized (bottom-up) attributes, but I
> never found a good way to handle top-down (inherited) attributes.  (I
> did run across a pre-processor for attribute grammars done in the
> 1980s or so by a Dutch CS researcher named Katwijk, and he sent me
> source code on request, but I got distracted and don't actually know
> how well it worked.) 

I think: you use parser-global variables, like
  
     a ::= b c { mode=special } d e f { mode=original } g h i

if you want d,e,f to be read with some special attribute set.  I
sometimes implement a proper stack for this, so I pop the previous
mode.

I hestitate slightly to say this is the same thing, because I haven't
used attributes enough.  This can also get complicated with lookahead
and (with btyacc) backtracking, but in practice I haven't had it be a
problem.   I don't think there's any need for this in blindfold.

> I suppose that for production work along the
> lines you describe it would be convenient to be able to have both
> synthesized and inherited attributes; has that not been a problem for
> you?
> 
>  > My working examples involve the annotations-are-operations style,
>  > instead of annotations-are-declarations style but still give some
>  > flavor.  My canonical example is:
>  >
>  > http://dev.w3.org/cvsweb/2001/blindfold/sample/unix-etc-group6.bn
>  > f?rev=1.1&content-type=text/x-cvsweb-markup
>  >
>  > ... where I particularly want to point out the last annotation where
>  > I refer to (1) the text of the user name, and (2) the "entry", which
>  > is something *up* the parse tree.
> 
> Thanks; this example helps a lot.  I wonder whether using an
> attribute-grammar toolkit would be useful in this application area; it
> has been a while since I went Web-surfing in search of such systems,
> but I seem to remember seeing several that looked worth experimenting
> with.
> 
>  > My best XML example is half-done:
>  > 
> http://dev.w3.org/cvsweb/2001/blindfold/sample/wsdl/wsdl-manual.bnf?rev=1.4&c
> ontent-type=text/x-cvsweb-markup
>  > which shows how I extended my EBNF to to make XML grammars look much
>  > more natural -- very much like the informal XML grammar for WSDL
>  > which I happened to be working from.
> 
> One typo I notice: for
> 
>    # should be:  qname ::= external(xmlrec:qname)
>    qname ::= [a-zA-Z0-9:]+	
> 
> you probably want
> 
>    # should be:  qname ::= external(nsrec:qname)
>    qname ::= [a-zA-Z0-9:]+	

Did you forget to actually make the change you had in mind?  I don't
see a difference.  :-)

> and it would make me personally happier if it were
> 
>    qname ::= [a-zA-Z_][a-zZ-Z0-9_-]*:[a-zA-Z0-9_-]*	
> 
> which comes slightly closer to the definition (max one
> colon in a QName).

I really was planning to implement the "external" directive there, so
I didn't bother thinking too much about it.

>  > My biggest & most interesting research question is how to handle
>  > unexpected combinations of multiple namespaces.
> 
> Can you elaborate?

I think of the annotated grammar as defining a mapping from an
expression in one language (L1 ) to an expression in another language
such as FOL or N-Triples (L2).  What is the language of an XML
document?  At first glance, it's maybe identified by a tuple of "XML"
and the root element namespace.  For an RDF/XML document, the language
would be
    (XML, http://www.w3.org/1999/02/22-rdf-syntax-ns#)
and that works fine if only one namespace is used.  

If an element from another namespace is used, that's probably okay,
too.  In blindfold, for each syntactic child element, the grammar
annotations say what to do with (1) the source text of the child, (2)
the symbol denoting whatever the child denotes, and (3) the
information gleaned (about that child and about other things) from the
child element text.  [I probably need a fourth thing, a variant of
(1), for quoting XML text with namespace declarations pushed inside.]  

But I don't think that covers it for attributes from other
namespaces.   I can imagine an attribute like rdf:parseType, 
used on elements from other spaces, which wants to run the
show.   I'm not sure how to do that.    

I'll probably have a better idea when/if I actually try to write a
blindfold grammar for RDF/XML.  RDF/XML is one of my ludicrous test
cases, XHTML, XSLT, and XML-Schema are probably the others.  How fast
to do you think a resolution theorem prover would "exectute" an XSLT
program based on it's semantics as defined in XSLT's blindfold
grammar?  :-)   Oh yeah, so the blindfold grammar for XML-Schema would
pretty much be the validator you're working on.....   Hrrrrrm.

I wonder also about XML like

   <ns1:el>
       ...
          <ns2:el>
              ...
                  <ns1:el>

and whether the outer ns1:el should have special say over what ns2:el
does with the inner ns1:el.  I imagine people doing serious work with
XML have all sorts of interesting corner cases I haven't even imagined
yet.   What I'd really like is a simple unifying theory of how you
combine blindfold grammars (each concerned with its own namespace)
into onto joint grammar.    

>  > I remain convinced this kind of work is essential for the success of
>  > the semantic web.  Without it, the vast majority of
>  > nearly-semweb-data remains inaccessible.
> 
> Thank you for your note.  I am glad to have looked at your
> stuff.

You're welcome.  Thanks for taking the time!

    -- sandro

Received on Monday, 23 September 2002 07:05:35 UTC