- From: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
- Date: Fri, 06 Nov 1998 10:45:34 +0900
- To: xml-editor@w3.org

"3.3.3 Attribute-value normalization" is one of the hardest parts of the
XML specification. Here are some suggestions for improvement.

1) The itemized list in the first para is actually a pseudo-program.
Readers have to first understand that this is a conditional statement
repeatedly executed for each character between quotes. This should be
explicitly stated, at least.

2) Moreover, readers have to understand that this normalization happens
after end-of-line normalization is already performed.

3) It might be a good idea to provide an example. See attachment from
XML-dev.

4) "declared value" in the second para should read "attribute type".

Hope this helps.

Cheers,

Makoto

------------------------------------------------------------------------------

MURATA Makoto wrote:

> From: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
> Date: Wed, 27 May 1998 16:17:27 +0900
> To: xml-dev@ic.ac.uk
>
> While translating the XML specification, I find that I do not understand
> the attribute normalization mechanism of XML.
>
> I made an example XML document (shown below). I used the latest version
> of expat, Lark, Aelfred, xp, and MSXML. I used DemoHandler of SAX to
> invoke Lark, Aelfred, and xp.
>
> xp says that the type of the attribute "a" is CDATA. MSXML reports a
> fatal error. Aelfred says that the attribute value is always "test test".
> Lark and expat normalize some but not all. Which one is correct?
>
> <?xml version="1.0"?>
> <!DOCTYPE test
> [
> <!ELEMENT test (#PCDATA|test)*>
> <!ATTLIST test
>     a NMTOKENS #IMPLIED>
> <!ENTITY D "&#13;">
> <!ENTITY A "&#10;">
> <!ENTITY DA "&#13;&#10;">
> ]>
> <test>
> <test a="
>
> test
>
> test
>
> "/>
> <test a="&D;&A;&D;&A;test&D;&A;&D;&A;test&D;&A;&D;&A;"/>
> <test a="&DA;&DA;test&DA;&DA;test&DA;&DA;"/>
> <test a="&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;"/>
> <test a="&#13;&#13;test&#13;&#13;test&#13;&#13;"/>
> <test a="&#10;&#10;test&#10;&#10;test&#10;&#10;"/>
> </test>
>
> Makoto
>
> Fuji Xerox Information Systems
>
> Tel: +81-44-812-7230 Fax: +81-44-812-7231
> E-mail: murata@apsdc.ksp.fujixerox.co.jp

Richard Tobin wrote:

> Date: Wed, 27 May 1998 12:35:14 +0100 (BST)
> From: Richard Tobin <richard@cogsci.ed.ac.uk>
> To: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>, xml-dev@ic.ac.uk
>
> > While translating the XML specification, I find that I do not understand
> > the attribute normalization mechanism of XML.
>
> The result produced by RXP and LT-XML is given at the end (except that
> carriage return characters have been replaced by the sequence ^M for
> ease of reading). Here is my explanation for each case. The relevant
> section of the standard is of course 3.3.3.
>
> > <test a="
> >
> > test
> >
> > test
> >
> > "/>
>
> In this case, the linefeeds (or whatever record boundaries are in your
> system) are replaced by spaces. Then, the trailing spaces are removed and
> the other spaces compressed. So the result is
>
> <test a="test test"/>
>
> This is of course the intended way for NMTOKENS to work.
>
> > <test a="&D;&A;&D;&A;test&D;&A;&D;&A;test&D;&A;&D;&A;"/>
> > <test a="&DA;&DA;test&DA;&DA;test&DA;&DA;"/>
>
> In these cases the character entities were expanded (into carriage
> returns and linefeeds) when the general entities were defined. So
> when the replacement text of the entities is "recursively processed",
> they get turned into spaces. They then get stripped or replaced,
> producing the same result as the first case.
>
> [However, if the attribute were of type CDATA, the result would be
> different from the first case: these would have 4 spaces instead of 2,
> because the cr/lf pairs in the first case were reduced to linefeeds
> (probably on input, see section 2.11), whereas in the second case they
> are not part of the *literal* entity value of the internal entity.]
>
> > <test a="&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;"/>
> > <test a="&#13;&#13;test&#13;&#13;test&#13;&#13;"/>
> > <test a="&#10;&#10;test&#10;&#10;test&#10;&#10;"/>
>
> In these cases, the character references are appended, but unlike the
> case of general entity references the result is not recursively
> processed. So there are no space characters to normalise, and the
> result is the same as if the attribute had had type CDATA - that is,
> the carriage returns and linefeeds appear in the normalised value.
>
> Here is the RXP/LT-XML output:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE test [
> <!ELEMENT test (#PCDATA|test)*>
> <!ATTLIST test
>     a NMTOKENS #IMPLIED>
> <!ENTITY D "&#13;">
> <!ENTITY A "&#10;">
> <!ENTITY DA "&#13;&#10;">
> ]>
> <test>
> <test a="test test"/>
> <test a="test test"/>
> <test a="test test"/>
> <test a="^M
> ^M
> test^M
> ^M
> test^M
> ^M
> "/>
> <test a="^M^Mtest^M^Mtest^M^M"/>
> <test a="
>
> test
>
> test
>
> "/>
> </test>
>
> -- Richard
>

Lars Marius Garshol wrote:

> * Richard Tobin
> |
> | Here is the RXP/LT-XML output:
> |
> | [actual output snipped]
>
> FWIW, this agrees with the canonical XML that xmlproc outputs (after I
> modified it a couple of hours ago):
>
> <test> <test a="test test"></test> <test a="test test"></test> <test a="test test"></test> <test a=" test test "></test> <test a=" test test "></test> <test a=" test test "></test> </test>

MURATA Makoto wrote:

> > FWIW, this agrees with the canonical XML that xmlproc outputs (after I
> > modified it a couple of hours ago):
>
> Good to hear that. What happens if the keyword NMTOKENS is replaced
> with CDATA?
>
> Here is the revised document.
>
> <?xml version="1.0"?>
> <!DOCTYPE test
> [
> <!ELEMENT test (#PCDATA|test)*>
> <!ATTLIST test
>     a CDATA #IMPLIED>
> <!ENTITY D "&#13;">
> <!ENTITY A "&#10;">
> <!ENTITY DA "&#13;&#10;">
> ]>
> <test>
> <test a="
>
> test
>
> test
>
> "/>
> <test a="&D;&A;&D;&A;test&D;&A;&D;&A;test&D;&A;&D;&A;"/>
> <test a="&DA;&DA;test&DA;&DA;test&DA;&DA;"/>
> <test a="&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;test&#13;&#10;&#13;&#10;"/>
> <test a="&#13;&#13;test&#13;&#13;test&#13;&#13;"/>
> <test a="&#10;&#10;test&#10;&#10;test&#10;&#10;"/>
> </test>
>
> Makoto
>
> Fuji Xerox Information Systems

Richard Tobin wrote:

>
> > Good to hear that. What happens if the keyword NMTOKENS is replaced
> > with CDATA?
>
> RXP and LT-XML produce (in canonical XML - I should have thought of
> that last time):
>
> <test> <test a=" test test "></test> <test a=" test test "></test> <test a=" test test "></test> <test a=" test test "></test> <test a=" test test "></test> <test a=" test test "></test> </test>
>
> Expat agrees.
>
> -- Richard

Richard Tobin wrote:

> > "only a single #x20 is appended for a "#xD#xA" sequence that is
> > part of [...] the literal entity value of an internal parsed
> > entity"
>
> The implementations are correct.
>
> The key here is the word "literal". None of the internal entities
> contains that sequence (ie carriage-return followed by linefeed)
> *literally* - ie in the very text between the quotes in the entity
> definition. (There should be a link from the text in section 3.3.3 to
> the definition of literal entity value in 4.5.) The DA entity contains
> that sequence in its replacement text, but not in its literal value.
>
> A "natural" (to me, anyway) implementation will not have to do
> anything at all to comply with the phrase quoted above, because it
> will already have reduced literal #xD#xA sequences to #xA before
> parsing.
>
> -- Richard

Makoto

Fuji Xerox Information Systems

Tel: +81-44-812-7230 Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp
Received on Thursday, 5 November 1998 20:40:44 UTC
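
The per-character procedure discussed in the thread can be written out as a short program. What follows is only a minimal sketch in Python, assuming the rules as the thread describes them: end-of-line normalization happens first, character references are appended without further processing, entity replacement text is processed recursively, and non-CDATA values are then trimmed and collapsed. The names normalize_line_ends, normalize_attribute and the entities dictionary are hypothetical and are not taken from any parser mentioned above. The entities dictionary is assumed to hold replacement text, that is, with the character references in each entity declaration already expanded. Predefined entities such as &amp; and all error handling are left out.

```python
import re

# Characters that the normalization loop turns into a single #x20.
WHITESPACE = "\x20\x0d\x0a\x09"

# A hex character reference, a decimal character reference, or an entity reference.
TOKEN = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([^;&\s]+);")


def normalize_line_ends(text):
    """End-of-line handling: literal CR LF pairs and lone CRs become one LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def _process(text, entities, out):
    """Run the per-character conditional over text, appending to out."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            ch = text[pos]
            # Literal white space becomes a space; anything else is copied.
            out.append(" " if ch in WHITESPACE else ch)
            pos += 1
            continue
        hex_ref, dec_ref, name = m.groups()
        if hex_ref is not None:
            # Character reference: append the character, no further processing.
            out.append(chr(int(hex_ref, 16)))
        elif dec_ref is not None:
            out.append(chr(int(dec_ref)))
        else:
            # Entity reference: recursively process its replacement text.
            _process(entities[name], entities, out)
        pos = m.end()


def normalize_attribute(raw, entities, attr_type="CDATA"):
    """Normalize the text found between the quotes of an attribute."""
    out = []
    _process(normalize_line_ends(raw), entities, out)
    value = "".join(out)
    if attr_type != "CDATA":
        # Non-CDATA types: drop leading/trailing spaces, collapse runs of spaces.
        value = re.sub(" +", " ", value.strip(" "))
    return value


# Replacement text of the three entities in the example document (assumed
# to be a carriage return, a linefeed, and the pair of them).
entities = {"D": "\r", "A": "\n", "DA": "\r\n"}

literal = "\r\n\r\ntest\r\n\r\ntest\r\n\r\n"
by_entity = "&DA;&DA;test&DA;&DA;test&DA;&DA;"
by_charref = "&#13;&#13;test&#13;&#13;test&#13;&#13;"

for raw in (literal, by_entity, by_charref):
    print(repr(normalize_attribute(raw, entities, "NMTOKENS")),
          repr(normalize_attribute(raw, entities, "CDATA")))
```

Under these assumptions the sketch reproduces the distinctions drawn in the thread: literal line breaks and entity references both normalize to "test test" for NMTOKENS, the CDATA value has two spaces per gap in the literal case but four in the entity case, and carriage returns introduced by character references survive in the normalized value for either type.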