XML 1.0 - overcomplicating the parse engine from Kent M Pitman on 1998-04-17 (xml-editor@w3.org from April to June 1998)

From: Kent M Pitman <kmp@harlequin.com>
Date: Fri, 17 Apr 98 05:13:03 EDT
To: xml-editor@w3.org
Cc: kmp@harlequin.com
Message-Id: <9804170913.AA01357@excel.harlequin.com>

XML, following in SGML's footsteps, seems to me to overcomplicate the
parse phase by trying to force errors to be detected during parsing,
when really the input ought to parse more simply and any errors be
detected later, if necessary.

An example is:

 [24] VersionInfo ::= S 'version' Eq ( "'" VersionNum "'" |
                                       '"' VersionNum '"' )

 [26] VersionNum ::= ( [a-zA-Z0-9_.:] | '-' )+


Is there a really good reason this has to have a special case rule
for parsing each and every string?  Really, the only thing the parser
needs should be:

 [24] VersionInfo ::= S 'version' Eq datastring

      datastring ::= ( "'" [^']* "'" ) | ( '"' [^"]* '"' )

and then everything else that needs data can use the same thing.  e.g.,

 [32] SDDecl ::= S 'standalone' Eq datastring

It should be a validity constraint that the value is either 'yes' or 'no'
in [32]--it should not affect the parsing.
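
As a rough illustration of the split proposed above, here is a minimal
Python sketch: one shared routine that parses any quoted datum, plus
separate checks applied afterwards. The names parse_datastring,
check_version and check_standalone are hypothetical and appear nowhere
in the spec; the version pattern is taken from production [26].

```python
import re

# Character set from production [26]; '-' is placed last so it is literal.
VERSION_NUM = re.compile(r"[a-zA-Z0-9_.:-]+")


def parse_datastring(text, pos=0):
    """Parse a quoted datum starting at text[pos].

    Returns (content, next_pos). This is the only routine that knows
    about quotes; it knows nothing about which values are acceptable.
    """
    if pos >= len(text) or text[pos] not in "'\"":
        raise SyntaxError("expected a quoted string at offset %d" % pos)
    quote = text[pos]
    end = text.find(quote, pos + 1)
    if end == -1:
        raise SyntaxError("unterminated string starting at offset %d" % pos)
    return text[pos + 1:end], end + 1


def check_version(value):
    """Later check: does the content match VersionNum [26]?"""
    return VERSION_NUM.fullmatch(value) is not None


def check_standalone(value):
    """Later check: is the content 'yes' or 'no', as [32] requires?"""
    return value in ("yes", "no")
```

With this arrangement, parse_datastring("'maybe'") succeeds and hands back
the content, and only check_standalone rejects it.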

The problem might be that you need a way of talking about what's in the
quotes, not the thing including the quotes, but that can be solved by
introduction of new terminology and/or syntax.  e.g., you could invent 
a notation such that you would write:

 [24] VersionInfo ::= S 'version' Eq @VersionNum

where @VersionNum meant that the parser would parse a quoted datum and
the spec refer to the quoted object as VersionNum.  Or you could define
a way of indicating the parsing as giving a name to the datastring with
the quotes, and then some way of saying that the data content has a name.

 [24] VersionInfo ::= S 'version' Eq VersionNumStr

      VersionNumStr ::= datastring

      VersionNum = the data content of VersionNumStr

Right now, if you're not using YACC and you're instead hand-parsing
this stuff, you end up with separate parsers for each of these things that
I think oughtn't be separate.  (Actually, YACC probably has separate parsers
too, but doesn't tell you.)  Anyway, I just think there's no good excuse
for not doing it the way people think of it, which is "here's a thing that takes a
string datum as an argument" and "oh, by the way, once we figure out what
the string is, we can tell you if it's the right string".  I don't think
users expect a "that's not well-formed syntax" error for x="maybe"; they expect
a "that's not a good value for that attribute" error--and you can't say that
unless you can parse it in the first place.  In other words, I claim improper
values ARE well-formed; just not valid.  But the present syntax doesn't permit
that view; the present syntax actively forces a more complex view.
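
To make the two kinds of error concrete, here is a small Python sketch
under stated assumptions: the exception classes NotWellFormed and
NotValid, the ALLOWED table, and the function names are all invented for
illustration, and the input is simplified to a single name=value pair.

```python
class NotWellFormed(Exception):
    """The input could not be parsed at all."""


class NotValid(Exception):
    """The input parsed, but the value is not acceptable."""


# Hypothetical table of per-name value constraints, checked after parsing.
ALLOWED = {"standalone": ("yes", "no")}


def read_quoted(text):
    """Return the content of a quoted datum, or raise NotWellFormed."""
    if len(text) < 2 or text[0] not in "'\"" or text[-1] != text[0]:
        raise NotWellFormed("expected a quoted string: %r" % text)
    content = text[1:-1]
    if text[0] in content:
        raise NotWellFormed("stray quote inside %r" % text)
    return content


def read_pseudo_attribute(source):
    """Parse name=datastring first, then apply any value constraint."""
    name, sep, rest = source.partition("=")
    if not sep:
        raise NotWellFormed("missing '=' in %r" % source)
    name = name.strip()
    value = read_quoted(rest.strip())
    allowed = ALLOWED.get(name)
    if allowed is not None and value not in allowed:
        raise NotValid("%r is not a good value for %s" % (value, name))
    return name, value
```

Here standalone="maybe" raises NotValid, because it parsed and only the
value is wrong, while standalone=maybe raises NotWellFormed, because no
quoted datum could be read.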

I think the present situation quite unfortunate because it disallows
some very intuitive (and more modular) parser implementations,
requiring them to be gratuitously larger and more convoluted,
involving more special cases that might break between versions.

-----------
DISCLAIMER:
 The above are my personal feelings and not necessarily 
 Harlequin's official position.

Received on Friday, 17 April 1998 05:09:45 UTC