[widgets] white space handling from Cyril Concolato on 2009-12-17 (public-webapps@w3.org from October to December 2009)

From: Cyril Concolato <cyril.concolato@enst.fr>
Date: Thu, 17 Dec 2009 13:26:24 +0100
To: public-webapps <public-webapps@w3.org>
Message-ID: <4B2A2370.60009@enst.fr>

Hi Widget addicts,

While reading again through the spec, I'm wondering why there are differences between the P&C spec and the XML spec in terms of white space handling.

P&C defines:
* "space characters" as: U+0020, U+0009, U+000A, U+000B, U+000C, U+000D
* "Unicode white space characters" as: U+0009-U+000D, U+0020, U+0085, U+00A0, U+1680, U+180E, U+2000-U+200A, U+2028, U+2029, U+202F, U+205F, U+3000
* "control characters" as: U+0000-U+001F, U+007F
* "forbidden characters" as: control characters and U+003C, U+003E, U+003A, U+0022, U+002F, U+005C, U+007C, U+003F, U+002A, U+005E, U+0060, U+007B, U+007D, U+0021.
"space characters" are used in "Rule for Getting a Single Attribute Value", "Rule for Getting a List of Keywords From an Attribute", "Rule for Parsing a Non-negative Integer", "algorithm to derive the user agent locales" and ZIP handling.
"Unicode white space characters" are used only in "Rule for Getting Text Content with Normalized White Space".
"control characters" are used only in "forbidden characters", and "forbidden characters" are used only in ZIP processing.

XML defines "white space" as: U+0020, U+0009, U+000A, U+000D
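
As a rough illustration, the three definitions can be compared as sets of code points. This is a minimal Python sketch: the code points are copied from the lists above, while the variable names and the fmt helper are illustrative and come from neither spec.

```python
# Code points copied from the definitions quoted in this message.
PC_SPACE_CHARACTERS = {0x0020, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D}
PC_UNICODE_WHITE_SPACE = (
    set(range(0x0009, 0x000D + 1))
    | {0x0020, 0x0085, 0x00A0, 0x1680, 0x180E}
    | set(range(0x2000, 0x200A + 1))
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)
XML_WHITE_SPACE = {0x0020, 0x0009, 0x000A, 0x000D}


def fmt(code_points):
    """Render code points in U+XXXX notation, in ascending order."""
    return [f"U+{cp:04X}" for cp in sorted(code_points)]


# Characters that P&C treats as "space characters" but XML does not
# treat as "white space".
print(fmt(PC_SPACE_CHARACTERS - XML_WHITE_SPACE))  # ['U+000B', 'U+000C']

# Each definition is a superset of the previous one.
print(XML_WHITE_SPACE <= PC_SPACE_CHARACTERS <= PC_UNICODE_WHITE_SPACE)  # True
```

The only "space characters" missing from the XML definition are U+000B and U+000C, which is what the question about "space characters" below refers to.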

Given that, I have the following questions/remarks:

- Why do you define control characters? Can't you put their code points directly in "forbidden characters"? This would simplify the spec and make it easier to understand.

- Could you rename "forbidden characters" to "ZIP forbidden characters"? This would clearly indicate in which area they are forbidden and why they are defined.

- Why do the definitions of P&C "space characters" and "Unicode white space characters" differ from the XML "white space" definition?

For "Unicode white space characters", I could understand this difference, since it is used only in the "Rule for Getting Text Content with Normalized White Space", which first applies XML parsing and DOM3 textContent behavior and then applies additional P&C-defined behavior. But still, I'm wondering: is this difference really needed? If yes, can you add a note explaining the rationale and the difference from basic XML processing?

For "space characters", why did you add U+000B and U+000C?

- Ignoring U+000B and U+000C, the "Rule for Getting a Single Attribute Value" seems to me to be already defined in XML as "Attribute-Value Normalization" (http://www.w3.org/TR/xml/#AVNormalize). I could understand that you want a self-contained spec, but you should at least indicate that the behavior is the same as the basic XML processing.
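
To make the last point concrete, here is a minimal Python sketch of the two behaviors side by side. It rests on assumptions. The XML function covers only the white space steps of attribute-value normalization for attributes not declared as CDATA (for CDATA attributes XML only replaces each white space character with U+0020), and it leaves out reference expansion and line-end handling. The P&C function assumes that the rule collapses runs of "space characters" and trims the result; it is not quoted from the P&C spec. All function names are hypothetical.

```python
# Code points as quoted in this message.
XML_WHITE_SPACE = "\u0020\u0009\u000A\u000D"
PC_SPACE_CHARACTERS = XML_WHITE_SPACE + "\u000B\u000C"


def normalize(value: str, space_chars: str) -> str:
    """Map every character in space_chars to U+0020, then trim and collapse."""
    replaced = "".join(" " if ch in space_chars else ch for ch in value)
    # Splitting on U+0020 and dropping empty pieces removes leading and
    # trailing spaces and collapses runs of spaces in one step.
    return " ".join(piece for piece in replaced.split(" ") if piece)


def xml_non_cdata_value(value: str) -> str:
    # Simplified: reference expansion and line-end handling are left out.
    return normalize(value, XML_WHITE_SPACE)


def pc_single_attribute_value(value: str) -> str:
    # Assumption: the P&C rule collapses and trims "space characters".
    return normalize(value, PC_SPACE_CHARACTERS)


sample = "  en\t\tfr\u000Bde \n"
print(repr(xml_non_cdata_value(sample)))        # 'en fr\x0bde'
print(repr(pc_single_attribute_value(sample)))  # 'en fr de'
```

Under these assumptions the two results differ only when the value contains U+000B or U+000C.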

Best regards,

Cyril
-- 
Cyril Concolato
Maître de Conférences/Associate Professor
Groupe Multimedia/Multimedia Group
Telecom ParisTech
46 rue Barrault
75 013 Paris, France
http://concolato.blog.telecom-paristech.fr/

Received on Thursday, 17 December 2009 12:54:40 UTC