RE: Errata in section 2.4 of Extensible Markup Language (XML) 1.0 (Fifth Edition) from Grosso, Paul on 2011-11-03 (xml-editor@w3.org from October to December 2011)

From: Grosso, Paul <pgrosso@ptc.com>
Date: Thu, 3 Nov 2011 16:49:20 -0400
To: "Daniel van Vugt" <vanvugt@gmail.com>, <xml-editor@w3.org>
Message-ID: <9B2DE9094C827E44988F5ADAA6A2C5DA03F74DC5@HQ-MAIL9.ptcnet.ptc.com>

Daniel,

Thank you for your interest in the XML spec and your
comments [1,2,3] on the XML 1.0 5th edition.

The XML Core Working Group discussed them and came to the 
following conclusion:

Regarding the several ambiguous grammar reports
-----------------------------------------------
You are correct that the productions as written do not themselves
specify a non-ambiguous grammar, and the alterations you are 
suggesting are exactly the kind that a parser writer should
be making if a non-ambiguous grammar is needed or desired.

However, the technical ambiguities in the productions in the XML
specification have been there since the first edition in 1998,
and it was never the intention to imply that the productions
in the document can be used without change as a non-ambiguous
grammar.  The original authors of the specification felt that
logical clarity was better served by the productions as written,
and parser writers are free to translate them into an equivalent 
non-ambiguous grammar.

Perhaps that sentiment should have been spelled out explicitly
in the document, but it does not seem necessary or prudent to 
do that or to alter the productions at this late date. 
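
As a rough illustration of the ambiguity being discussed (a minimal
sketch only, not text from the specification), the Python below counts
how many ways the fragment "CharData?" can derive a given string, once
with production [14] as written and once with the suggested "+" variant.
The function names and the allow_empty flag are hypothetical, introduced
just for this example.

```python
import re

def is_chardata(text: str, allow_empty: bool) -> bool:
    """Check text against production [14].

    allow_empty=True models [14] as written ('*');
    allow_empty=False models the suggested '+' variant.
    """
    if not allow_empty and text == "":
        return False
    # [^<&]* minus any string that contains ']]>'
    return re.fullmatch(r"[^<&]*", text) is not None and "]]>" not in text

def count_optional_chardata_parses(text: str, allow_empty: bool) -> int:
    """Count the derivations of text from the fragment 'CharData?'."""
    ways = 0
    if text == "":
        ways += 1  # derivation 1: CharData is omitted entirely
    if is_chardata(text, allow_empty):
        ways += 1  # derivation 2: CharData is present and matches text
    return ways

# As written, the empty string has two derivations; with '+' it has one.
print(count_optional_chardata_parses("", allow_empty=True))
print(count_optional_chardata_parses("", allow_empty=False))
```

Both variants accept the same strings for "CharData?", which is why a
parser writer can make this change without altering the language.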

Regarding the CharData construct
--------------------------------
CharData does not include character references.

The discussion in section 2.4 starts with "_Text_ consists of 
intermingled character data and markup."  The discussion in
the next few paragraphs about character references is talking
about character references in _Text_.  The CharData term, which
as you note does not allow the < or & character, is only
referenced from production [43] for "content", which is the
production for _Text_, and that production defines "content"
as being CharData interspersed with various markup constructs
including Reference (which includes entity and character
references).
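
To make the division of labour concrete, here is a minimal Python
sketch (an illustration under simplifying assumptions, not a conforming
parser) that splits a string into CharData and Reference tokens in the
spirit of production [43].  It ignores element, CDSect, PI and Comment,
and it approximates Name with an ASCII-only pattern; split_content and
both regular expressions are hypothetical names for this example.

```python
import re

# Simplified Reference: a decimal or hex CharRef, or an EntityRef whose
# Name is restricted to ASCII (an assumption; the real Name is broader).
REFERENCE = re.compile(
    r"&#[0-9]+;|&#x[0-9a-fA-F]+;|&[A-Za-z_:][A-Za-z0-9_:.\-]*;"
)
# Non-empty run of characters other than '<' and '&'.
CHARDATA = re.compile(r"[^<&]+")

def split_content(text):
    """Split text into ('CharData', s) and ('Reference', s) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        kind = "Reference"
        m = REFERENCE.match(text, pos)
        if m is None:
            kind = "CharData"
            m = CHARDATA.match(text, pos)
        if m is None:
            # A bare '&', or a '<' that this sketch does not handle.
            raise ValueError(f"unsupported or ill-formed input at offset {pos}")
        if kind == "CharData" and "]]>" in m.group():
            raise ValueError("']]>' is not allowed in CharData")
        tokens.append((kind, m.group()))
        pos = m.end()
    return tokens

# '&amp;' comes out as a Reference token, never as part of CharData.
print(split_content("a &amp; b"))
```

In this sketch the ampersand of "&amp;" is never consumed by the
CharData pattern; it is only accepted as the start of a Reference.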


Paul Grosso, co-chair of the XML Core WG

[1] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0000
[2] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0001
[3] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0002

> -----Original Message-----
> From: xml-editor-request@w3.org [mailto:xml-editor-request@w3.org] On
> Behalf Of Daniel van Vugt
> Sent: Thursday, 2011 October 20 0:20
> To: xml-editor@w3.org
> Subject: Errata in section 2.4 of Extensible Markup Language (XML) 1.0
> (Fifth Edition)
> 
> ERROR #1: Ambiguous grammar
> 
> These rules make the grammar ambiguous:
> 
> [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
> [43] content ::= CharData? ((element | Reference | CDSect | PI |
> Comment) CharData?)*
> 
> CharData is allowed to match an empty string due to its use of "*".
> However CharData is referenced as CharData? meaning this potentially
> empty string is optional. Therefore, if content is blank, it is
> ambiguous as to whether CharData is matched as the empty string or if
> CharData is omitted completely.
> 
> Functionally this is low severity. However grammar parsers such as my
> own will find both interpretations and treat it as an error because the
> grammar is ambiguous.
> 
> The fix is simple. Change:
> [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
> to:
> [14] CharData ::= [^<&]+ - ([^<&]* ']]>' [^<&]*)
> 
> 
> ERROR #2: CharData supports, and doesn't support, character references
> 
> Section 2.4 seems to suggest that Character Data may contain character
> references such as &amp;. However at the same time, the grammar rule
> [14] for CharData does not appear to be able to match ampersand
> character references at all:
> 
> [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
> 
> 
> Regards,
> 
> Daniel van Vugt
>

Received on Thursday, 3 November 2011 22:22:40 UTC