[XHTML 1.0] white space handling removal from 1st edition in 2nd edition (PR#10216) from mahmoudbahaa.eg@gmail.com on 2009-05-28 (public-xhtml2@w3.org from May 2009)

From: <mahmoudbahaa.eg@gmail.com>
Date: Thu, 28 May 2009 06:07:07 -0500
To: public-xhtml2@w3.org
CC: voyager-issues@mn.aptest.com
Message-Id: <200905281107.n4SB77We022405@htmlwg.mn.aptest.com>

In the 2nd edition of XHTML 1.0
http://www.w3.org/TR/2002/REC-xhtml1-20020801/#uaconf the following
paragraph from the 1st edition
http://www.w3.org/TR/2000/REC-xhtml1-20000126/#uaconf


> The XHTML user agent in addition, must treat the following characters as whitespace:
>
> Form feed (&#x000C;)
> Zero-width space (&#x200B;)
>
> In elements where the 'xml:space' attribute is set to 'preserve', the user agent must leave all whitespace characters intact (with the exception of leading and trailing whitespace characters, which should be removed). Otherwise, whitespace is handled according to the following rules:
>
> All whitespace surrounding block elements should be removed.
> Comments are removed entirely and do not affect whitespace handling. One whitespace character on either side of a comment is treated as two white space characters.
> Leading and trailing whitespace inside a block element must be removed.
> Line feed characters within a block element must be converted into a space (except when the 'xml:space' attribute is set to 'preserve').
> A sequence of white space characters must be reduced to a single space character (except when the 'xml:space' attribute is set to 'preserve').
> With regard to rendition, the User Agent should render the content in a manner appropriate to the language in which the content is written. In languages whose primary script is Latinate, the ASCII space character is typically used to encode both grammatical word boundaries and typographic whitespace; in languages whose script is related to Nagari (e.g., Sanskrit, Thai, etc.), grammatical boundaries may be encoded using the ZW 'space' character, but will not typically be represented by typographic whitespace in rendered output; languages using Arabiform scripts may encode typographic whitespace using a space character, but may also use the ZW space character to delimit 'internal' grammatical boundaries (what look like words in Arabic to an English eye frequently encode several words, e.g. 'kitAbuhum' = 'kitAbu-hum' = 'book them' == their book); and languages in the Chinese script tradition typically neither encode such delimiters nor use typographic whitespace in this way.

was removed from section 3.2, and I seriously don't know why. First
off, the first part of the removed text, which treats form feed
(&#x000C;) and zero-width space (&#x200B;) as white space as well, seems
consistent with the HTML 4.01 spec, particularly section 9.1 on white
space, http://www.w3.org/TR/html401/struct/text.html#h-9.1 where it
says:

> In HTML, only the following characters are defined as white space characters:
>
> ASCII space (&#x0020;)
> ASCII tab (&#x0009;)
> ASCII form feed (&#x000C;)
> Zero-width space (&#x200B;)

As for the rest, it actually explains how conforming user agents should
handle white space, which is what this part was all about: the first
line of the paragraph says "White space is handled according to the
following rules", and with these rules removed the paragraph now seems
to be missing an important part. The behavior described in the 1st
edition seems consistent with what HTML 4.01 user agents do. So does
that mean XHTML 1.0 2nd edition defines no specific behavior for user
agents to handle white space, or was the removal of this entire
paragraph unintentional?
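
For illustration, the removed rules for text inside a single block
element can be sketched in Python roughly as follows. This is only one
reading of the quoted 1st-edition text, not something either edition
defines as an algorithm. The names XHTML_WS and normalize_block_text
are hypothetical, and comment handling and white space surrounding
block elements are deliberately left out.

```python
import re

# White space per the quoted 1st-edition text: the four XML white space
# characters (space, tab, carriage return, line feed) plus form feed
# (U+000C) and zero-width space (U+200B).
XHTML_WS = "\u0020\u0009\u000D\u000A\u000C\u200B"

# One or more consecutive white space characters from the set above.
_WS_RUN = re.compile(f"[{XHTML_WS}]+")


def normalize_block_text(text: str, preserve: bool = False) -> str:
    """Hypothetical helper: normalize the text of one block element."""
    if preserve:
        # xml:space="preserve": interior white space stays intact; only
        # leading and trailing white space is removed.
        return text.strip(XHTML_WS)
    # Otherwise every run of white space (line feeds included) becomes
    # a single space, and leading/trailing white space is dropped.
    return _WS_RUN.sub(" ", text).strip(" ")
```

Under these assumptions, normalize_block_text("  Hello,\n\tworld\u200B!\f ")
returns "Hello, world !", while passing preserve=True only trims the
ends of the string.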

Received on Thursday, 28 May 2009 11:11:35 UTC