Re: XML parsing query

Thanks for the quick response.

I have continued to look at the spec, and section 4.4.7 Bypassed
states that "When a general entity reference appears in the
EntityValue in an entity declaration, it must be bypassed and left as
is." which agrees with the part you quoted from 4.5.  The use of
"bypass" does suggest that the declaration does not need to be
checked.

This then confuses me slightly, since there is, I believe, nowhere
else that this declaration could occur, and so any use of the
reference will lead to trouble.  Indeed, my parser does leave the
general entity reference as is, but it still tries to check that the
entity is declared.

However, I would welcome any clarification on this.
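
To make the two stages concrete, here is a minimal sketch in Python of
how I read 4.4.7 and 4.5 for this test case. It is not my actual parser,
and the names (replacement_text, expand_content) are hypothetical. It
assumes the "Entity Declared" check only fires when a reference is
actually recognized in content, and that text inside a CDATA section is
never recognized as a reference. Character references, parameter
entities and recursion detection are left out.

```python
import re

# Hypothetical, simplified patterns; a real parser follows the spec grammar.
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")
CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}


def replacement_text(literal_value):
    """Build replacement text from a literal entity value.

    General-entity references are bypassed (left as is), so nothing is
    checked here. Character and parameter-entity references would be
    expanded at this point, but that is omitted from the sketch.
    """
    return literal_value


def expand_content(text, entities):
    """Expand general-entity references in content.

    CDATA sections are copied through as plain text, so anything that
    looks like a reference inside them is never checked or expanded.
    """
    out = []
    pos = 0
    while pos < len(text):
        cdata = CDATA.match(text, pos)
        if cdata:
            out.append(cdata.group(1))  # raw text, no markup recognized
            pos = cdata.end()
            continue
        ref = ENTITY_REF.match(text, pos)
        if ref:
            name = ref.group(1)
            if name in PREDEFINED:
                out.append(PREDEFINED[name])
            elif name in entities:
                # The replacement text is parsed as content in turn.
                out.append(expand_content(entities[name], entities))
            else:
                raise ValueError(f"Entity Declared: &{name}; is not declared")
            pos = ref.end()
            continue
        out.append(text[pos])
        pos += 1
    return "".join(out)


# The declaration from valid/sa/114.xml: &foo; is stored unexpanded.
entities = {"e": replacement_text("<![CDATA[&foo;]]>")}
print(expand_content("&e;", entities))
```

Under these assumptions the undeclared "foo" is never looked up, because
by the time the replacement text of "e" is parsed it sits inside a CDATA
section. An entity declared as plain "&foo;" would still fail the check
at the point where it is referenced in content.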

Cheers,

Colin

On Thu, Feb 12, 2009 at 2:37 AM, James Clark <jjc@public.jclark.com> wrote:
> Hmm, it's been a while since I looked at this stuff.  Section 4.5 says you
> are supposed to leave the entity reference unexpanded ("however,
> general-entity references MUST be left as-is, unexpanded").  The unstated
> implication is that you don't have to check that the general entity
> reference is declared.  At first glance it looks to me like the spec could
> be clearer here.  I'm CC'ing xml-editor@w3.org to see if the XML Core WG
> agrees.
>
> James
>
> On Thu, Feb 12, 2009 at 6:14 AM, Colin Ross <colin@vcolin.com> wrote:
>>
>> Good evening,
>>
>> I am in the process of writing myself an XML parser using the
>> specification at
>>
>> http://www.w3.org/TR/2008/REC-xml-20081126/
>>
>> and am using the "xmltest" suite published as part of
>>
>> http://www.w3.org/XML/Test/xmlts20080827.tar.gz
>>
>> which has your name in the profile.
>>
>> I am having some trouble with the test case in
>>
>> valid/sa/114.xml
>>
>> which is described in the xmltest.xml file as
>>
>> <TEST TYPE="valid" ENTITIES="none" ID="valid-sa-114"
>>  URI="valid/sa/114.xml" SECTIONS="2.7 [20]"
>>  OUTPUT="valid/sa/out/114.xml">
>>  Test demonstrates that all text within a valid CDATA section is
>> considered text and not recognized as markup. </TEST>
>>
>> The XML itself looks like the following:
>>
>> <!DOCTYPE doc [
>> <!ELEMENT doc (#PCDATA)>
>> <!ENTITY e "<![CDATA[&foo;]]>">
>> ]>
>> <doc>&e;</doc>
>>
>> My parser, as it currently is implemented, parses the <!ENTITY...  line as
>>
>> InternalSubsetMarkup (MarkupEntityDecl (GEDecl (Name "e") (EntityDef1
>> (EntityValueDQ [
>> RawEV "'<'",
>> RawEV "'!'",
>> RawEV "'['",
>> RawEV "'C'",
>> RawEV "'D'",
>> RawEV "'A'",
>> RawEV "'T'",
>> RawEV "'A'",
>> RawEV "'['",
>> ReferenceEV (EntityRef (Name "foo")),
>> RawEV "']'",
>> RawEV "']'",
>> RawEV "'>'"]))))
>>
>> Note the parsed EntityRef of "foo" here.
>>
>> This all looks to be in accordance with the spec, since
>>
>> EntityValue ::= '"' ([^%&"] | PEReference | Reference)* '"'
>>  | "'" ([^%&'] | PEReference | Reference)* "'"
>>
>> which states that the ampersand is not permitted to be part of the
>> [^%&"] group and so is parsed as part of a Reference instead.
>>
>> My parser rejects this as not well-formed because the EntityRef of
>> "foo" has not been declared, but perhaps this is wrong? I admit that
>> the specification leaves me slightly lost here:
>>
>> === SPEC (http://www.w3.org/TR/2008/REC-xml-20081126/#wf-entdeclared) ===
>>
>> Well-formedness constraint: Entity Declared
>>
>> In a document without any DTD, a document with only an internal DTD
>> subset which contains no parameter entity references, or a document
>> with " standalone='yes' ", for an entity reference that does not occur
>> within the external subset or a parameter entity, the Name given in
>> the entity reference MUST match that in an entity declaration that
>> does not occur within the external subset or a parameter entity,
>> except that well-formed documents need not declare any of the
>> following entities: amp, lt, gt, apos, quot. The declaration of a
>> general entity MUST precede any reference to it which appears in a
>> default value in an attribute-list declaration.
>>
>> Note that non-validating processors are not obligated to read and
>> process entity declarations occurring in parameter entities or in the
>> external subset; for such documents, the rule that an entity must be
>> declared is a well-formedness constraint only if standalone='yes'.
>> ===
>>
>> As I understand it, this document has only an internal DTD subset
>> which contains no parameter entity references.  Therefore, "the Name
>> given in the entity reference MUST match that in an entity
>> declaration".  Since in this case "foo" is not declared, my parser
>> fails.
>>
>> I note that in the test, it references section 2.7 and rule 20 which
>> is the "CData" definition. However, from what I can see, "CData" can
>> only be a part of the "CDSect" rule which in turn must be a part of
>> the "content" rule. The "content" rule in turn only appears between an
>> STag and an ETag, and this is not the case for this document.
>>
>> If you have the time, I would appreciate it if you could explain where I
>> am going wrong with my parsing of this document; otherwise, a pointer to
>> where best to ask my question would be very welcome.
>>
>> Many thanks in advance,
>>
>> Colin
>
>

Received on Thursday, 12 February 2009 13:02:48 UTC