[Bug 10800] New: Reconsider form feed (U+000C) conformance

http://www.w3.org/Bugs/Public/show_bug.cgi?id=10800

           Summary: Reconsider form feed (U+000C) conformance
           Product: HTML WG
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: bugzilla@polizisten-duzer.de
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


As currently drafted, HTML5 allows the form feed (U+000C) character

* as syntactic whitespace
* in content (text and attribute values)

This is really an innovation of HTML5. HTML 2.0, 3.2, 4.0 and 4.01 all had SGML
declarations that excluded the form feed (in fact, all control characters
except horizontal tab, line feed and carriage return) from the document
character set <http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html>. In HTML
4.01, form feeds can therefore occur only as character references, so they
are not syntactic whitespace.

HTML 4.01 also mentions the form feed character in a section that is about
"printable" whitespace
<http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1>, but it's obscure, has
not been implemented consistently by any browser, and defining the rendering is
nowadays considered the job of CSS rather than HTML.

Now HTML5 allows the form feed as syntactic whitespace. This is rather
harmless, but not particularly useful either. What is more harmful is that
HTML5 also allows form feeds in content. So:

* While HTML 4.01 allowed all control characters in content (if written as
character references), HTML5 rules them out completely (even as character
references) except for the form feed character (which is now allowed even in
raw form). => Not consistent with anything known.
* XML 1.0 does not allow form feeds in any way. => Results in a class of
conforming HTML5 documents that can't be expressed in XML 1.0. This class could
be avoided rather easily (more easily than the other such cases).
* No browser currently implements the rendering of the form feed character in a
useful way. Internet Explorer and Opera render it as a collapsing space with
'white-space: normal', but as a box with 'white-space: pre'. Gecko and WebKit
always render it as a non-collapsing zero-width glyph; the CSS 'white-space'
property makes no difference (and they don't regard it as "printable"
whitespace at all; this can be seen when searching for 'word1 word2' in a
document that contains 'word1&#xC;word2').
* CSS 2.1 does not consider the form feed character to be "printable"
whitespace. It says "Control characters other than U+0009 (tab), U+000A (line
feed), U+0020 (space), and U+202x (bidi formatting characters) are treated as
characters to render in the same way as any normal character"
<http://www.w3.org/TR/CSS21/text.html#ctrlchars>. (The grammar of CSS 2.1 does
consider the form feed character to be syntactic whitespace, but this is not
helpful for the rendering part.)
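
The XML 1.0 point can be checked with any conforming XML parser. A minimal
sketch, assuming Python's standard xml.etree.ElementTree module (which is
backed by Expat); the helper name is_well_formed_xml and the sample strings
are made up for illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if the XML 1.0 parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Form feed as a raw character and as a character reference.
raw_ff = "<p>word1\x0cword2</p>"
ref_ff = "<p>word1&#xC;word2</p>"
# Horizontal tab as a character reference, for comparison.
ref_tab = "<p>word1&#x9;word2</p>"

print(is_well_formed_xml(raw_ff))   # False
print(is_well_formed_xml(ref_ff))   # False
print(is_well_formed_xml(ref_tab))  # True
```

Both form feed variants are rejected as not well-formed, while the tab
reference is accepted.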

In order to prevent another "single quirk" story where implementors waste more
time than they already did (in the past
<https://bugzilla.mozilla.org/show_bug.cgi?id=373268> and
<https://bugzilla.mozilla.org/show_bug.cgi?id=437915> and in the future maybe
<https://bugs.webkit.org/show_bug.cgi?id=13159>) on a character that has no
agreed semantics in any markup language, and in order to prevent authors from
expecting anything useful from it, I'm kindly asking for one of the following:

* Do what XML 1.0 does, i.e., disallow the form feed character entirely. (If
the treatment as syntactic whitespace is required for compatibility with legacy
content, it can become part of the error handling.)
* Revert to what HTML 4.01 did, i.e., allow the form feed character as a
character reference only, so that nobody thinks it is whitespace. This is what
XML 1.1 does, too. (I would not recommend this because it can't be extended to all
control characters - certainly not the C1 controls since they need to be
treated as Windows-1252 codepoints for compatibility - but still better than
the raw character. And again: If necessary for compatibility, it can be treated
as syntactic whitespace as part of the error handling.)
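
To illustrate what a conformance checker would have to flag under the first
option, here is a rough sketch in Python. The names FF_PATTERN and
find_form_feeds are hypothetical, and the pattern only covers the raw
character and semicolon-terminated numeric character references; it is not
a full implementation of HTML5 character reference parsing:

```python
import re

# Raw U+000C, or a numeric character reference to it in decimal (&#12;)
# or hexadecimal (&#xC;), with optional leading zeros.
FF_PATTERN = re.compile(r"\x0c|&#(?:0*12|[xX]0*[cC]);")


def find_form_feeds(markup):
    """Return (offset, matched text) for every form feed occurrence."""
    return [(m.start(), m.group()) for m in FF_PATTERN.finditer(markup)]


sample = "a\x0cb&#12;c&#x0C;d&#xc;"
for offset, text in find_form_feeds(sample):
    print(offset, repr(text))
```

Under the second option, the same checker would report only the raw
character and let the character references pass.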


Received on Wednesday, 29 September 2010 09:37:22 UTC