3.3.3 of XML 1.0 from MURATA Makoto on 1998-11-06 (xml-editor@w3.org from October to December 1998)

From: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
Date: Fri, 06 Nov 1998 10:45:34 +0900
To: xml-editor@w3.org
Message-Id: <199811060145.AA02630@murata.apsdc.ksp.fujixerox.co.jp>

"3.3.3 Attribute-value normalization" is one of the hardest parts of the 
XML specification.

Here are some suggestions for improvement.

1)

The itemized list in the first para is actually a pseudo-program.  
Readers have to first understand that this is a conditional 
statement repeatedly executed for each character between quotes.
This should be explicitly stated, at least.

2)

Moreover, readers have to understand that this normalization happens 
after end-of-line normalization is already performed.

3)

It might be a good idea to provide an example.  See attachment from XML-dev.

4)

"declared value" in the second para should read "attribute type".

Hope this helps.

Cheers,

Makoto
------------------------------------------------------------------------------

MURATA Makoto wrote:
> From: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
> Date: Wed, 27 May 1998 16:17:27 +0900
> To: xml-dev@ic.ac.uk
> 
> While translating the XML specification, I find that I do not understand 
> the attribute normalization mechanism of XML.
> 
> I made an example XML document (shown below).  I used the latest version 
> of expat, Lark, Aelfred, xp, and MSXML.  I used DemoHandler of SAX to 
> invoke Lark, Aelfred, and xp.
> 
> xp says that the type of the attribute "a" is CDATA.  MSXML reports a 
> fatal error.  Aelfred says that the attribute value is always "test test".  
> Lark and expat normalize some but not all.  Which one is correct? 
> 
> <?xml version="1.0"?>
> <!DOCTYPE test 
> [
> <!ELEMENT test (#PCDATA|test)*>
> <!ATTLIST test 
> 	a NMTOKENS #IMPLIED>
> <!ENTITY D "&#xD;"> 
> <!ENTITY A "&#xA;">
> <!ENTITY DA "&#xD;&#xA;">  ]>
> <test>
> <test a="
> 
> test
> 
> test
> 
> "/>
> <test a="&D;&A;&D;&A;test&D;&A;&D;&A;test&D;&A;&D;&A;"/>
> <test a="&DA;&DA;test&DA;&DA;test&DA;&DA;"/>
> <test a="&#xD;&#xA;&#xD;&#xA;test&#xD;&#xA;&#xD;&#xA;test&#xD;&#xA;&#xD;&#xA;"/>
> <test a="&#xD;&#xD;test&#xD;&#xD;test&#xD;&#xD;"/>
> <test a="&#xA;&#xA;test&#xA;&#xA;test&#xA;&#xA;"/>
> </test>
> 
> Makoto
>  
> Fuji Xerox Information Systems
>  
> Tel: +81-44-812-7230   Fax: +81-44-812-7231
> E-mail: murata@apsdc.ksp.fujixerox.co.jp


Richard Tobin wrote:
> Date: Wed, 27 May 1998 12:35:14 +0100 (BST)
> From: Richard Tobin <richard@cogsci.ed.ac.uk>
> To: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, xml-dev@ic.ac.uk
> 
> > While translating the XML specification, I find that I do not understand 
> > the attribute normalization mechanism of XML.
> 
> The result produced by RXP and LT-XML is given at the end (except that
> carriage return characters have been replaced by the sequence ^M for
> ease of reading).  Here is my explanation for each case.  The relevant
> section of the standard is of course 3.3.3.
> 
> > <test a="
> > 
> > test
> > 
> > test
> > 
> > "/>
> 
> In this case, the linefeeds (or whatever record boundaries are in your
> system) are replaced by spaces. Then, the trailing spaces are removed and
> the other spaces compressed.  So the result is
> 
>   <test a="test test"/>
> 
> This is of course the intended way for NMTOKENS to work.
> 
> > <test a="&D;&A;&D;&A;test&D;&A;&D;&A;test&D;&A;&D;&A;"/>
> > <test a="&DA;&DA;test&DA;&DA;test&DA;&DA;"/>
> 
> In these cases the character entities were expanded (into carriage
> returns and linefeeds) when the general entities were defined.  So
> when the replacement text of the entities is "recursively processed",
> they get turned into spaces.  They then get stripped or replaced,
> producing the same result as the first case.
> 
> [However, if the attribute were of type CDATA, the result would be
> different from the first case: these would have 4 spaces instead of 2,
> because the cr/lf pairs in the first case were reduced to linefeeds
> (probably on input, see section 2.11), whereas in the second case they
> are not part of the *literal* entity value of the internal entity.]
> 
> > <test a="&#xD;&#xA;&#xD;&#xA;test&#xD;&#xA;&#xD;&#xA;test&#xD;&#xA;&#xD;&#xA;"/>
> > <test a="&#xD;&#xD;test&#xD;&#xD;test&#xD;&#xD;"/>
> > <test a="&#xA;&#xA;test&#xA;&#xA;test&#xA;&#xA;"/>
> 
> In these cases, the character references are appended, but unlike the
> case of general entity references the result is not recursively
> processed.  So there are no space characters to normalise, and the
> result is the same as if the attribute had had type CDATA - that is,
> the carriage returns and linefeeds appear in the normalised value.
> 
> Here is the RXP/LT-XML output:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE test [
> <!ELEMENT test (#PCDATA|test)*>
> <!ATTLIST test 
>         a NMTOKENS #IMPLIED>
> <!ENTITY D "&#xD;"> 
> <!ENTITY A "&#xA;">
> <!ENTITY DA "&#xD;&#xA;">  ]>
> <test>
> <test a="test test"/>
> <test a="test test"/>
> <test a="test test"/>
> <test a="^M
> ^M
> test^M
> ^M
> test^M
> ^M
> "/>
> <test a="^M^Mtest^M^Mtest^M^M"/>
> <test a="
> 
> test
> 
> test
> 
> "/>
> </test>
> 
> -- Richard
> 
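
The case-by-case explanation above follows the per-character procedure of
section 3.3.3. As a rough illustration only, it can be sketched in Python as
below. The function names, the `entities` dictionary (entity name mapped to
replacement text) and the `is_cdata` flag are assumptions made for this
sketch and do not come from any parser mentioned in this thread. Predefined
entities and well-formedness errors are not handled.

```python
import re

# Hypothetical sketch of XML 1.0 section 3.3.3; not taken from any real parser.
WHITESPACE = "\x20\x0d\x0a\x09"
REFERENCE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z_:][\w.:-]*);")


def normalize_line_ends(text):
    """Section 2.11: literal CR LF pairs and lone CRs in the input become LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_attribute_value(text, entities, is_cdata=True):
    """Normalize the text between the quotes of an attribute value.

    `text` is assumed to have been through normalize_line_ends already.
    `entities` maps general entity names to their replacement text.
    """
    out = []

    def process(s):
        i = 0
        while i < len(s):
            m = REFERENCE.match(s, i)
            if m:
                name = m.group(1)
                if name.startswith("#x"):
                    # Character reference: append the character, no rescanning.
                    out.append(chr(int(name[2:], 16)))
                elif name.startswith("#"):
                    out.append(chr(int(name[1:])))
                else:
                    # Entity reference: recursively process the replacement text.
                    process(entities[name])
                i = m.end()
            elif s[i] in WHITESPACE:
                out.append(" ")
                i += 1
            else:
                out.append(s[i])
                i += 1

    process(text)
    value = "".join(out)
    if not is_cdata:
        # Non-CDATA types: drop leading/trailing spaces, collapse runs of spaces.
        value = re.sub(" +", " ", value.strip(" "))
    return value


entities = {"D": "\r", "A": "\n", "DA": "\r\n"}
for raw in ("&DA;&DA;test&DA;&DA;test&DA;&DA;",
            "&#xD;&#xA;test&#xD;&#xA;test&#xD;&#xA;"):
    print(repr(normalize_attribute_value(raw, entities, is_cdata=False)))
```

With the attribute treated as NMTOKENS, the first call yields "test test",
while the second keeps its carriage returns and linefeeds, which is the same
pattern as the RXP/LT-XML output quoted above.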

Lars Marius Garshol wrote:
> * Richard Tobin
> |
> | Here is the RXP/LT-XML output:
> | 
> | [actual output snipped]
>  
> FWIW, this agrees with the canonical XML that xmlproc outputs (after I
> modified it a couple of hours ago):
> 
> <test>&#10;<test a="test test"></test>&#10;<test a="test test"></test>&#10;<test a="test test"></test>&#10;<test a="&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;"></test>&#10;<test a="&#13;&#13;test&#13;&#13;test&#13;&#13;"></test>&#10;<test a="&#10;&#10;test&#10;&#10;test&#10;&#10;"></test>&#10;</test>

MURATA Makoto wrote:
> > FWIW, this agrees with the canonical XML that xmlproc outputs (after I
> > modified it a couple of hours ago):
> 
> Good to hear that.  What happens if the keyword NMTOKENS is replaced 
> with CDATA?
> 
> Here is the revised document.
> 
> <?xml version="1.0"?>
> <!DOCTYPE test 
> [
> <!ELEMENT test (#PCDATA|test)*>
> <!ATTLIST test 
> 	a CDATA #IMPLIED>
> <!ENTITY D "&#xD;"> 
> <!ENTITY A "&#xA;">
> <!ENTITY DA "&#xD;&#xA;">  ]>
> <test>
> <test a="
> 
> test
> 
> test
> 
> "/>
> <test a="&D;&A;&D;&A;test&D;&A;&D;&A;test&D;&A;&D;&A;"/>
> <test a="&DA;&DA;test&DA;&DA;test&DA;&DA;"/>
> <test a="&#xD;&#xA;&#xD;&#xA;test&#xD;&#xA;&#xD;&#xA;test&#xD;&#xA;&#xD;&#xA;"/>
> <test a="&#xD;&#xD;test&#xD;&#xD;test&#xD;&#xD;"/>
> <test a="&#xA;&#xA;test&#xA;&#xA;test&#xA;&#xA;"/>
> </test>
> 
> Makoto
>  
> Fuji Xerox Information Systems

Richard Tobin wrote:
> 
> > Good to hear that.  What happens if the keyword NMTOKENS is replaced 
> > with CDATA?
> 
> RXP and LT-XML produce (in canonical XML - I should have thought of
> that last time):
> 
> <test>&#10;<test a="  test  test  "></test>&#10;<test a="    test    test    "></test>&#10;<test a="    test    test    "></test>&#10;<test a="&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;"></test>&#10;<test a="&#13;&#13;test&#13;&#13;test&#13;&#13;"></test>&#10;<test a="&#10;&#10;test&#10;&#10;test&#10;&#10;"></test>&#10;</test>
> 
> Expat agrees.
> 
> -- Richard
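
The two-versus-four space difference in the CDATA output comes down to where
end-of-line handling (section 2.11) applies. A minimal Python sketch,
assuming the source file uses CR LF line ends as Richard's earlier
explanation suggests; the helper names are hypothetical:

```python
def normalize_line_ends(text):
    """Section 2.11: applied to literal document text on input."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def whitespace_to_spaces(text):
    """The whitespace rule of section 3.3.3: each of space, tab, CR, LF becomes #x20."""
    return "".join(" " if ch in " \t\r\n" else ch for ch in text)


# First case: the line breaks are typed literally inside the quotes,
# so each CR LF pair has already been reduced to one LF.
literal = "\r\n\r\ntest\r\n\r\ntest\r\n\r\n"
case1 = whitespace_to_spaces(normalize_line_ends(literal))

# Second and third cases: CR and LF arrive through the replacement text of
# D, A and DA, which end-of-line handling never saw, so each counts separately.
from_entities = "\r\n\r\ntest\r\n\r\ntest\r\n\r\n"
case2 = whitespace_to_spaces(from_entities)

print(repr(case1))
print(repr(case2))
```

Under these assumptions the first value has two spaces in each run and the
second has four, as in the canonical output above.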

Richard Tobin wrote:
> >     "only a single #x20 is appended for a "#xD#xA" sequence that is
> >     part of [...] the literal entity value of an internal parsed
> >     entity"
> 
> The implementations are correct.
> 
> The key here is the word "literal".  None of the internal entities
> contains that sequence (ie carriage-return followed by linefeed)
> *literally* - ie in the very text between the quotes in the entity
> definition.  (There should be a link from the text in section 3.3.3 to
> the definition of literal entity value in 4.5.)  The DA entity contains
> that sequence in its replacement text, but not in its literal value.
> 
> A "natural" (to me, anyway) implementation will not have to do
> anything at all to comply with the phrase quoted above, because it
> will already have reduced literal #xD#xA sequences to #xA before
> parsing.
> 
> -- Richard
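
The distinction between the literal entity value and the replacement text can
be made concrete with a small Python sketch. It assumes, following section
4.5, that character references in an entity declaration are expanded when the
replacement text is built; `replacement_text` is a hypothetical helper that
handles only character references and leaves general entity references
untouched (parameter-entity references are ignored here).

```python
import re

CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def replacement_text(literal_value):
    """Expand character references in a literal entity value (sketch only)."""
    def expand(match):
        ref = match.group(1)
        if ref.startswith("x"):
            return chr(int(ref[1:], 16))
        return chr(int(ref))
    return CHAR_REF.sub(expand, literal_value)


# Literal value of the DA entity, exactly as written between the quotes.
literal_da = "&#xD;&#xA;"

print("\r\n" in literal_da)                    # False
print("\r\n" in replacement_text(literal_da))  # True
```

The CR LF sequence exists only in the replacement text, so the "single #x20"
exception quoted above never applies to DA.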

Makoto
 
Fuji Xerox Information Systems
 
Tel: +81-44-812-7230   Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp
Received on Thursday, 5 November 1998 20:40:44 UTC