Re: Errata in section 2.4 of Extensible Markup Language (XML) 1.0 (Fifth Edition) from Daniel van Vugt on 2011-11-04 (xml-editor@w3.org from October to December 2011)

From: Daniel van Vugt <vanvugt@gmail.com>
Date: Fri, 04 Nov 2011 11:19:35 +0800
To: "Grosso, Paul" <pgrosso@ptc.com>, xml-editor@w3.org
Message-ID: <4EB359C7.1010103@gmail.com>
I am very surprised that you are not accepting corrections to the
standard for mistakes that you acknowledge exist, especially a
correction such as this one, which only requires changing a single
character.

However, this is not the first time I have encountered an official 
language specification with BNF grammar where the authors have stated 
they don't guarantee the grammar to be technically accurate...

For the benefit of the wider community, I think it would be helpful to 
still publish the errata, even indefinitely, and even if you have no 
intention of ever resolving the problems in the main document.

- Daniel


On 04/11/11 04:49, Grosso, Paul wrote:
> Daniel,
>
> Thank you for your interest in the XML spec and your
> comments [1,2,3] on the XML 1.0 5th edition.
>
> The XML Core Working Group discussed them and came to the
> following conclusion:
>
> Regarding the several ambiguous grammar reports
> -----------------------------------------------
> You are correct that the productions as written do not themselves
> specify a non-ambiguous grammar, and the alterations you are
> suggesting are exactly the kind that a parser writer should
> be making if a non-ambiguous grammar is needed or desired.
>
> However, the technical ambiguities in the productions in the XML
> specification have been there since the first edition in 1998,
> and it was never the intention to imply that the productions
> in the document can be used without change as a non-ambiguous
> grammar.  The original authors of the specification felt that
> logical clarity was better served by the productions as written,
> and parser writers are free to translate them into an equivalent
> non-ambiguous grammar.
>
> Perhaps that sentiment should have been spelled out explicitly
> in the document, but it does not seem necessary or prudent to
> do that or to alter the productions at this late date.
>
> Regarding the CharData construct
> --------------------------------
> CharData does not include character references.
>
> The discussion in section 2.4 starts with "_Text_ consists of
> intermingled character data and markup."  The discussion in
> the next few paragraphs about character references is talking
> about character references in _Text_.  The CharData term that,
> as you note, does not allow the < or & character, is only
> referenced from production [43] for "content" which is the
> production for _text_, and that production defines "content"
> as being CharData interspersed with various markup constructs
> including Reference (which includes entity and character
> references).
>
>
> Paul Grosso, co-chair of the XML Core WG
>
> [1] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0000
> [2] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0001
> [3] http://lists.w3.org/Archives/Public/xml-editor/2011OctDec/0002
>
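The point above, that "content" is CharData interspersed with Reference, can be illustrated with a minimal Python sketch. It is not taken from the specification: the names TOKEN and split_content are hypothetical, only the CharData and Reference alternatives of production [43] are handled (no elements, CDATA sections, PIs, or comments), and the Name pattern is a simplified ASCII approximation of the spec's Name production.

```python
import re

# Reference ::= EntityRef | CharRef (Name is simplified to ASCII here: an assumption).
# CharData is matched with '+' so the tokenizer always makes progress.
TOKEN = re.compile(
    r"(?P<Reference>&#[0-9]+;|&#x[0-9a-fA-F]+;|&[A-Za-z_:][A-Za-z0-9_:.\-]*;)"
    r"|(?P<CharData>[^<&]+)"
)


def split_content(text):
    """Split markup-free content into (kind, lexeme) pairs.

    Raises ValueError on anything this sketch does not model, such as a
    bare '&', a '<', or character data containing ']]>'.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unsupported or malformed input at offset {pos}")
        if m.lastgroup == "CharData" and "]]>" in m.group():
            raise ValueError("']]>' is excluded from CharData by production [14]")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for token in split_content("AT&amp;T &#169; 2011"):
        print(token)
```

Under these assumptions, a character reference such as &amp; is never part of a CharData token. It is always emitted as a separate Reference token between two CharData runs, which is consistent with production [14] excluding the & character.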
>> -----Original Message-----
>> From: xml-editor-request@w3.org [mailto:xml-editor-request@w3.org] On
>> Behalf Of Daniel van Vugt
>> Sent: Thursday, 2011 October 20 0:20
>> To: xml-editor@w3.org
>> Subject: Errata in section 2.4 of Extensible Markup Language (XML) 1.0
>> (Fifth Edition)
>>
>> ERROR #1: Ambiguous grammar
>>
>> These rules make the grammar ambiguous:
>>
>> [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
>> [43] content ::= CharData? ((element | Reference | CDSect | PI |
>> Comment) CharData?)*
>>
>> CharData is allowed to match an empty string due to its use of "*".
>> However CharData is referenced as CharData? meaning this potentially
>> empty string is optional. Therefore, if content is blank, it is
>> ambiguous as to whether CharData is matched as the empty string or if
>> CharData is omitted completely.
>>
>> Functionally this is low severity. However grammar parsers such as my
>> own will find both interpretations and treat it as an error because the
>> grammar is ambiguous.
>>
>> The fix is simple. Change:
>> [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
>> to:
>> [14] CharData ::= [^<&]+ - ([^<&]* ']]>' [^<&]*)
>>
>>
>> ERROR #2: CharData supports, and doesn't support, character references
>>
>> Section 2.4 seems to suggest that Character Data may contain character
>> references such as &amp;. However at the same time, the grammar rule
>> [14] for CharData does not appear to be able to match ampersand
>> character references at all:
>>
>> [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
>>
>>
>> Regards,
>>
>> Daniel van Vugt
>>
>
>
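The ambiguity reported as ERROR #1 in the quoted message can be reproduced with a small Python sketch. It is a simplified model, not a full XML parser: it considers only content that contains no markup, so production [43] reduces to a single "CharData?", and the function names are hypothetical. The quantifier argument selects between the published rule ('*') and the proposed one-character fix ('+').

```python
import re


def chardata_matches(text, quantifier):
    """Return True if text matches CharData using '[^<&]' + quantifier.

    The subtraction in production [14] is modelled by rejecting any text
    that contains ']]>'.
    """
    base = re.fullmatch(f"[^<&]{quantifier}", text) is not None
    return base and "]]>" not in text


def count_parses(text, quantifier):
    """Count the distinct ways 'CharData?' can derive markup-free text."""
    ways = 0
    if text == "":
        ways += 1  # the optional CharData is omitted entirely
    if chardata_matches(text, quantifier):
        ways += 1  # the optional CharData is present and matches text
    return ways


if __name__ == "__main__":
    print(count_parses("", "*"))       # published rule, empty content
    print(count_parses("", "+"))       # proposed fix, empty content
    print(count_parses("hello", "*"))  # non-empty content, published rule
    print(count_parses("hello", "+"))  # non-empty content, proposed fix
```

In this model, empty content has two derivations under the published rule and exactly one under the proposed rule, while non-empty character data has one derivation either way. The set of accepted strings is the same in both cases; only the number of parse trees changes.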
Received on Friday, 4 November 2011 03:21:55 UTC