
[Bug 10800] Reconsider form feed (U+000C) conformance

From: <bugzilla@jessica.w3.org>
Date: Fri, 01 Oct 2010 03:30:25 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1P1WKH-00053h-1q@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=10800

--- Comment #4 from bugzilla@polizisten-duzer.de 2010-10-01 03:30:23 UTC ---
I think that the form feed character made its way into HTML by mistake rather
than by intention.

HTML 2.0 explicitly said "In SGML applications, the use of control characters
is limited in order to maximize the chance of successful interchange over
heterogeneous networks and operating systems. In the HTML document character
set only three control characters are allowed: Horizontal Tab, Carriage Return,
and Line Feed (code positions 9, 13, and 10)."
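As a minimal sketch of what that HTML 2.0 sentence implies, the following
checks a string for control characters outside the three allowed code
positions. The function name is hypothetical, and limiting the check to the C0
range (code positions below 32) is an assumption made for illustration.

```python
# Control characters permitted by the HTML 2.0 wording quoted above:
# Horizontal Tab (9), Line Feed (10), Carriage Return (13).
HTML2_ALLOWED_CONTROLS = {9, 10, 13}

def disallowed_controls(text):
    """Return the sorted code positions of C0 control characters in `text`
    that are not among the three allowed by the HTML 2.0 wording."""
    return sorted({ord(ch) for ch in text
                   if ord(ch) < 32 and ord(ch) not in HTML2_ALLOWED_CONTROLS})

print(disallowed_controls("a\tb\r\nc"))        # []
print(disallowed_controls("page 1\fpage 2"))   # [12]
```

Under that reading, a form feed (code position 12) is simply not part of the
document character set.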

In July 1997, a draft of HTML 4
<http://www.w3.org/TR/WD-html40-970708/struct/text.html> (the earliest that
mentioned form feeds in any way) said:

"In addition, for all elements except PRE, a sequence of contiguous white space
characters such as spaces, horizontal tabs, form feeds and line breaks, should
be replaced by a single word space. Since the notion of what word space is
varies from script (written language) to script, user agents should collapse
white space in script-sensitive ways. For example, in Latin scripts, a single
word space is just a space (ASCII decimal 32), while in Thai it is a zero-width
word separator."

Note how bogus this is. It mentions form feeds in a "such as" phrase (not quite
appropriate wording for a normative section) without adjusting the SGML
declaration accordingly. It also mentions the zero-width word separator, which
belongs to a totally different context. It reads more like brainstorming about
whitespace than like a specification. But *if* taken normatively, the IE/Opera
rendering (where form feeds collapse with 'white-space: normal') is closer to it.
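To make the difference concrete, here is a rough sketch of whitespace
collapsing under two candidate definitions, one that counts the form feed as
white space and one that does not. The two character classes and the function
name are assumptions for illustration; neither is taken from a specification
or from any browser's code.

```python
import re

# Two candidate definitions of "white space", written as regex character
# classes. Which one should apply is the open question in this bug.
WS_WITH_FF = r"[ \t\r\n\f]+"
WS_WITHOUT_FF = r"[ \t\r\n]+"

def collapse(text, ws_pattern):
    """Replace every run of white space characters with a single space."""
    return re.sub(ws_pattern, " ", text)

sample = "foo \f\n bar"
print(repr(collapse(sample, WS_WITH_FF)))     # 'foo bar'
print(repr(collapse(sample, WS_WITHOUT_FF)))  # 'foo \x0c bar'
```

With the first definition the form feed disappears into the collapsed space;
with the second it survives as a character of its own that has to be rendered
somehow.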

The next draft from November 1997
<http://www.w3.org/TR/PR-html40-971107/struct/text.html#h-9.1> says:

"HTML considers only the following characters to be white space characters:

* ASCII space (&#x0020;)
* ASCII tab (&#x0009;)
* ASCII form feed (&#x000C;)
* Zero-width space (&#x0009;)"

Note how it has managed to mix the form feed and the zero-width space, which
were previously mentioned in totally different contexts, into one category and
even gets the code point of the zero-width space wrong. The code point was
corrected shortly after, but the whole section has remained basically unchanged
and obscure. The issue has been brought up more than once

* http://lists.w3.org/Archives/Public/www-html-editor/1998JulSep/0131.html
* http://lists.w3.org/Archives/Public/www-html/2004May/0022.html
* http://bytes.com/topic/html-css/answers/169504-theory-question-u-000c-html-4-01-a

but has not been resolved in 13 years. On the contrary, it was propagated into
other specifications. For some time, even XHTML 1 treated the form feed as
whitespace <http://www.w3.org/TR/1999/PR-xhtml1-19991210/#uaconf> (fixed three
years later).
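The code point mix-up in the November 1997 list can be checked mechanically.
This is only a sketch: the helper names are hypothetical, and the
"<control>" placeholder is used because Unicode control characters have no
character name.

```python
import unicodedata

def parse_ref(ref):
    """Parse a hexadecimal reference of the form '&#xHHHH;' to a code point."""
    if not (ref.startswith("&#x") and ref.endswith(";")):
        raise ValueError("not a hexadecimal character reference: " + ref)
    return int(ref[3:-1], 16)

def describe(code_point):
    """Return the Unicode name, or '<control>' for unnamed control characters."""
    return unicodedata.name(chr(code_point), "<control>")

# References exactly as printed in the November 1997 draft quoted above.
for label, ref in [("ASCII tab", "&#x0009;"), ("Zero-width space", "&#x0009;")]:
    cp = parse_ref(ref)
    print(f"{label}: U+{cp:04X} {describe(cp)}")

# The character actually named ZERO WIDTH SPACE in Unicode:
print(f"U+{0x200B:04X} {describe(0x200B)}")
```

Both entries of the draft resolve to U+0009, the tab, while the Unicode
character named ZERO WIDTH SPACE is U+200B.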

Therefore, I'd like to be 100% sure that the form feed isn't allowed in HTML5
just because of a 13-year-old mistake. Besides, HTML5's treatment doesn't look
internally consistent. HTML5 rules out &#13;, presumably because that would give
an actual carriage return in the DOM and CSS isn't prepared to handle that (CSS
regards carriage returns as random control characters, not whitespace), and
that is reasonable. But then, CSS isn't prepared to handle form feeds either.
Is the ability to paste RFC text into HTML and still be conforming really a use
case that justifies this?

CSS added the form feed around the same time, btw (the last version
without form feeds was <http://www.w3.org/TR/WD-CSS2-971104/grammar.html>, the
first version with form feeds is
<http://www.w3.org/TR/1998/WD-css2-19980128/grammar.html>), but that's rather
harmless because a form feed in CSS doesn't get into the DOM. Class and
[attr~=val] selectors are where the two meet, however, because they split
attribute values from the document on whitespace. (For these, it would
IMHO make more sense if CSS followed the whitespace definition of the document
language instead of its own, but it's not too important as long as the only
character where it would make a difference is non-conforming.)
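A small sketch of why the choice of whitespace definition matters for class
and [attr~=val] matching. The splitting function and the "document language
without form feed" definition are hypothetical and only model the idea; they
are not how any particular browser implements selector matching.

```python
import re

def tokens(value, whitespace_class):
    """Split an attribute value on runs of the given whitespace characters."""
    return [t for t in re.split("[" + whitespace_class + "]+", value) if t]

CSS_WS = r" \t\r\n\f"      # whitespace set that includes the form feed
NO_FF_WS = r" \t\r\n"      # hypothetical document-language set without it

value = "note\fwarning"    # class attribute containing a literal form feed
print(tokens(value, CSS_WS))     # ['note', 'warning']
print(tokens(value, NO_FF_WS))   # ['note\x0cwarning']
```

With the first definition a selector such as .note would match the element;
with the second the attribute holds a single token that contains a form feed,
and .note would not match.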

One more bizarre thing: As said above, IE collapses form feeds with
'white-space: normal' (matching the original HTML 4 draft), but renders them as
boxes with 'white-space: pre' - unless they are preceded or followed by a
vertical tab. '&#11;&#12;' gets rendered as '&#9794;&#9792;' and '&#12;&#11;'
gets rendered as '&#9792;&#9794;'. '&#9794;' and '&#9792;' have code positions
11 and 12 in some DOS code pages. IE must be really desperate about making
something printable of them.
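For readers who don't have the decimal references memorised, this short sketch
resolves 9794 and 9792 to their Unicode names. The helper name is hypothetical;
it only decodes the references and does not model IE's rendering.

```python
import unicodedata

def describe(decimal_ref):
    """Return 'U+XXXX NAME' for a decimal character reference number."""
    return f"U+{decimal_ref:04X} {unicodedata.name(chr(decimal_ref))}"

# Decimal character references from the paragraph above.
for ref in (9794, 9792):
    print(f"&#{ref}; = {describe(ref)}")
# &#9794; = U+2642 MALE SIGN
# &#9792; = U+2640 FEMALE SIGN
```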

Received on Friday, 1 October 2010 03:30:27 UTC
