Re: Parse XML with ixml

M Joel Dubinko writes:

> Idle thought,
>
> Is it possible to represent the XML grammar (of XML itself) in ixml?
> Not that this seems immediately useful, other than perhaps as a torture test...

Yes, I think so, though the closest anyone has come was a very simple
toy grammar to illustrate the principle and exhibit the complications
that arise in the process.

The toy grammar, which I have now retrieved and sanity checked with a
functioning ixml processor, looks like this:

    { A grammar for a small subset of XML, as an illustration. }

    document: ws?, element, ws?.
    element:  start-tag, content, end-tag; sole-tag.
    
    -start-tag:  -"<", @gi, (ws, attribute)*, ws?, -">".
    -end-tag:  -"</", @gi2, ws?, -">".
    -sole-tag:  -"<", @gi, (ws, attribute)*, ws?, -"/>".
    
    attribute:  @name, ws?, -"=", ws?, @value.
    @value: dqstring; sqstring.
    -dqstring: dq, ~['"']*, dq.
    -sqstring: sq, ~["'"]*, sq.
    -dq: -['"'].
    -sq: -["'"].
    
    -content:  (PCDATA-char; processing-instruction; comment; element)*.
    
    -PCDATA-char:  (~["<>&"]; "&amp;"; "&lt;"; "&gt;").
    processing-instruction:  -"<?", @name, ws, @pi-data, -"?>".
    comment:  -"<!--", commentdata, -"-->".

    name: [L; "_"], [L; Nd; "_-."]*.  { a slight simplification }
    -pi-data: ~["?"]*.  { another slight simplification }
    -commentdata: ~["-"]*.  { and another }
    
    gi: name.
    gi2: name.
    
    -ws:  -[#20; #A; #D; #9]+.

As you can see, it's somewhat simpler than the grammar in the XML spec,
but I don't know of any principled reason that the spec grammar could
not be translated in its entirety into ixml.  (The XML spec does use
some constructs like subtraction which would complicate the effort, but
I don't think it's impossible to translate those into ixml, just
error-prone and tedious.)

Among the input sequences which should be accepted by this grammar is
the following XML representation of a haiku.

<haiku author="Basho" date="1686">
    <line>When the old pond</line>
    <line>gets a new frog</line>
    <line>it's a new pond.</line>
</haiku>

When we ask a processor to parse that input against that grammar, the
results remind us of the fact that XML has a number of additional
constraints that are attached to the grammar but not part of it.  (For
those who care about such things, it may be noted that the
well-formedness and validity constraints, and in particular the idea of
attaching them to the relevant productions in the grammars, came from
the notion of attribute grammars.)
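
To make one of those constraints concrete: XML requires the name in an
end-tag to match the name in the start-tag.  The grammar does not
enforce that; it just records the two names separately as gi and gi2,
so the check has to happen after parsing.  A minimal sketch in Python
follows.  The function name and the sample parse result are made up for
illustration, and the sketch assumes the processor's output uses
'element' elements carrying gi and gi2 attributes.

```python
import xml.etree.ElementTree as ET

def mismatched_tags(parse_result):
    """Return (gi, gi2) pairs for elements whose end-tag name differs."""
    root = ET.fromstring(parse_result)
    problems = []
    for el in root.iter("element"):
        gi, gi2 = el.get("gi"), el.get("gi2")
        # Elements parsed via sole-tag carry no gi2 and need no check.
        if gi2 is not None and gi != gi2:
            problems.append((gi, gi2))
    return problems

# Hypothetical parse result for the input <a><b>x</c></a>, which the
# grammar accepts even though it is not well-formed XML.
sample = (
    '<document><element gi="a" gi2="a">'
    '<element gi="b" gi2="c">x</element>'
    '</element></document>'
)
print(mismatched_tags(sample))
```

An empty list would mean that every end-tag matched its start-tag.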

Also, ixml creates elements with the names of the nonterminals in the
grammar. So the output does not look like the input XML (although it is
easy to see that it has a similar structure and it would be an easy XSLT
exercise to translate it into conventional XML).  It looks like an XML
document describing the structure of another XML structure -- which,
I suppose, is what it is.

<document>
   <element gi="haiku" gi2="haiku">
      <attribute name="author" value="Basho"/>
      <attribute name="date" value="1686"/>
    
      <element gi="line" gi2="line">When the old pond</element>
    
      <element gi="line" gi2="line">gets a new frog</element>
    
      <element gi="line" gi2="line">it's a new pond.</element>
</element>
</document>

There is extra whitespace in the output because the grammar makes no
effort to suppress inter-element whitespace.  Again, easy to deal with
in an XSLT post-processing step.
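
For those who would rather not write the XSLT, the same post-processing
can be sketched in Python.  This is only an illustration under some
assumptions: the function name to_conventional is made up, comments and
processing instructions are ignored, and whitespace-only text is
treated as inter-element noise and dropped (which would be wrong for
mixed content where such whitespace matters).

```python
import xml.etree.ElementTree as ET

def to_conventional(node):
    """Turn one <element> of the ixml output into an ordinary XML element."""
    out = ET.Element(node.get("gi"))
    last = None  # last child appended to `out`; later text becomes its tail

    def add_text(text):
        # Assumption: whitespace-only text is inter-element noise; drop it.
        if text and text.strip():
            if last is None:
                out.text = (out.text or "") + text
            else:
                last.tail = (last.tail or "") + text

    add_text(node.text)
    for child in node:
        if child.tag == "attribute":
            out.set(child.get("name"), child.get("value"))
        elif child.tag == "element":
            last = to_conventional(child)
            out.append(last)
        # comment and processing-instruction children are skipped here
        add_text(child.tail)
    return out

# The parse result shown above, pasted in as a string.
parse_result = """<document>
   <element gi="haiku" gi2="haiku">
      <attribute name="author" value="Basho"/>
      <attribute name="date" value="1686"/>

      <element gi="line" gi2="line">When the old pond</element>

      <element gi="line" gi2="line">gets a new frog</element>

      <element gi="line" gi2="line">it's a new pond.</element>
</element>
</document>"""

haiku = to_conventional(ET.fromstring(parse_result).find("element"))
print(ET.tostring(haiku, encoding="unicode"))
```

Run on the output above, this should give back a haiku element with its
author and date attributes and three line children, without the
inter-element whitespace.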

I hope this helps, and thank you for the question.

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Saturday, 25 June 2022 02:27:41 UTC