Attribute value normalization of character references. from Paul Duffin on 2002-07-12 (xml-editor@w3.org from July to September 2002)

From: Paul Duffin <pduffin@volantis.com>
Date: Fri, 12 Jul 2002 09:21:28 -0400 (EDT)
To: xml-editor@w3.org
Message-ID: <3D2ED83B.3010006@volantis.com>
This question relates to the 2nd Edition of the XML Version 1.0
specification.
     http://www.w3.org/TR/REC-xml

I have looked in the archives and errata but cannot find any answer to
my question although there are some unanswered questions which touch on
this.

The second example in section 3.3.3 (Attribute-Value Normalization)
seems incorrect.

As I understand the rules listed in section 3.3.3, the attribute
specification
     a="&d;&d;A&a;&a;B&da;"

should be normalized to
     #xD #xD A #xA #xA B #xD #xA

for both CDATA and non-CDATA declared attributes.

Here is how I think the algorithm as described works.
1) Normalization of line breaks has no effect as there are no line
    breaks in the example.
2) Normalized string is "".
3) Step 3 applies to each character, entity reference or character
    reference in the UNNORMALIZED attribute value. It has four
    different rules, which I will refer to as 3a, 3b, 3c and 3d in
    document order.
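
The steps as read above can be sketched in Python. This is only a
minimal illustration, not a conforming parser: the names normalize,
TOKEN and entities are hypothetical, and entities is assumed to map each
entity name to whatever text rule 3b is applied to. The outcome depends
entirely on what that text is taken to be.

```python
import re

# One token of an attribute value: a hexadecimal or decimal character
# reference, an entity reference, or any other single character.
TOKEN = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([^;&\s]+);|(.)", re.DOTALL)


def normalize(value, entities, cdata=True):
    """Sketch of the section 3.3.3 steps for one attribute value.

    `entities` (hypothetical) maps entity names to the text that
    rule 3b recursively processes.
    """
    # Step 1: line-break normalization of the input.
    value = value.replace("\r\n", "\n").replace("\r", "\n")
    out = []  # Step 2: begin with an empty normalized value.

    def step3(text):
        for hex_ref, dec_ref, name, char in TOKEN.findall(text):
            if hex_ref:                  # 3a: append the referenced character
                out.append(chr(int(hex_ref, 16)))
            elif dec_ref:                # 3a: decimal form
                out.append(chr(int(dec_ref)))
            elif name:                   # 3b: recurse into the entity's text
                step3(entities[name])
            elif char in " \t\r\n":      # 3c: white space becomes #x20
                out.append(" ")
            else:                        # 3d: any other character
                out.append(char)

    step3(value)
    result = "".join(out)
    if not cdata:
        # Non-CDATA: trim #x20 at both ends and collapse runs of #x20.
        result = " ".join(part for part in result.split(" ") if part)
    return result
```

Called with entities = {"d": "&#xD;", "a": "&#xA;", "da": "&#xD;&#xA;"},
the sketch keeps the #xD and #xA characters. Called with the references
already expanded, {"d": "\r", "a": "\n", "da": "\r\n"}, rule 3c turns
each of them into #x20.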

Processing the UNNORMALIZED attribute value goes as follows.

   &d;
     is an entity reference so rule 3b applies so apply the rules to
     the entity's replacement text.
       &#xD;
         is a character reference so rule 3a applies which means that we
         have to add #xD to the NORMALIZED value.

   &d;
     ditto.

   A
     is another character so rule 3d applies so we add A to the
     NORMALIZED value.

   &a;
     is an entity reference so rule 3b applies so apply the rules to
     the entity's replacement text.
       &#xA;
         is a character reference so rule 3a applies which means that we
         have to add #xA to the NORMALIZED value.

   &a;
     ditto.

   B
     is another character so rule 3d applies so we add B to the
     NORMALIZED value.

   &da;
     is an entity reference so rule 3b applies so apply the rules to
     the entity's replacement text.
       &#xD;
         is a character reference so rule 3a applies which means that we
         have to add #xD to the NORMALIZED value.

       &#xA;
         is a character reference so rule 3a applies which means that we
         have to add #xA to the NORMALIZED value.

I have just done some more reading of the specification and realise that
example 2 is correct. The reason is that any character references in the
literal entity value have already been resolved, when the entity's
replacement text was constructed, before that text is processed by
the attribute value normalization rules.
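
A small Python sketch of that earlier step may make the difference
concrete. It is a simplification under stated assumptions: the name
replacement_text is hypothetical, only character references are
expanded, general-entity references are left untouched, and
parameter-entity references are ignored.

```python
import re

# Hexadecimal or decimal character reference.
CHAR_REF = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);")


def replacement_text(literal):
    """Simplified replacement-text construction for an internal entity.

    Character references in the literal entity value are replaced by the
    characters they refer to, in a single pass; the result is not
    scanned again.
    """
    def expand(match):
        hex_ref, dec_ref = match.groups()
        return chr(int(hex_ref, 16)) if hex_ref else chr(int(dec_ref))

    return CHAR_REF.sub(expand, literal)
```

Under these assumptions replacement_text("&#xD;") is the single
character #xD, which rule 3c later turns into #x20, whereas
replacement_text("&#38;#xD;") is the five-character string "&#xD;",
which is still a character reference when rule 3b processes it.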

It would be much clearer if rule 3b contained a reference to section 4.5
which describes the "Construction of Internal Entity Replacement Text".

An example which illustrates this would also be good, e.g.

Given
     <!ENTITY literal-a "&#38;#xA;">

then
     a="&literal-a;A&literal-a;B&literal-a;"

would be normalized to

     #xA A #xA B #xA

for both CDATA and non-CDATA attributes.

A more detailed working through of the examples would also be useful,
specifying the replacement text of the entities before normalization.
This could maybe be added to appendix D.

Received on Friday, 12 July 2002 09:59:35 UTC