The form feed characters and other control codes

I just noticed an inconsistency in the HTML 4.01 specification.
It's of little practical importance, but I think it should still be fixed,
at least by adding a note to the "Errata".

At http://www.w3.org/TR/html4/struct/text.html#whitespace
ASCII form feed (&#x000C;) is defined as a white space character.

But at http://www.w3.org/TR/html4/sgml/sgmldecl.html the SGML declaration
says:

         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED

which means that character number 12 (decimal), i.e. form feed, is UNUSED
and therefore prohibited in a document.
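
To make the reading of those rows concrete, here is a minimal sketch in
Python. It assumes each DESCSET row means (first character number, count,
mapping) and covers only the three rows quoted above; the names DESCSET
and descset_status are made up for this illustration and are not part of
any SGML tool.

```python
# The three DESCSET rows quoted above, as (first code, count, mapping).
# Rows of the declaration that are not quoted in this message are left out.
DESCSET = [
    (0, 9, "UNUSED"),
    (9, 2, 9),
    (11, 2, "UNUSED"),
]

def descset_status(code, rows=DESCSET):
    """Return "UNUSED", the mapped character number, or None when
    none of the quoted rows covers the given code."""
    for start, count, mapping in rows:
        if start <= code < start + count:
            if mapping == "UNUSED":
                return "UNUSED"
            # A numeric mapping names the target of the first code in the row.
            return mapping + (code - start)
    return None

print(descset_status(9))   # tab, covered by the second row
print(descset_status(12))  # form feed, covered by the third row
```

Under that reading, tab (9) and line feed (10) map to themselves, while
11 and 12 fall in the UNUSED range.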

At least the W3C validator reports a form feed as "non SGML character
12", which looks OK to me - the formalized specification should take
precedence over prose text.

Of course, the form feed would be useless even if it were allowed, since
it would be equivalent to a space. But the contradiction should be removed
by removing the form feed from the set of white space characters.

This implies that the description of differences between HTML and XHTML
could be simplified, by completely removing clause C.15 from Appendix C,
http://www.w3.org/TR/html/#C_15
As far as I can see, all characters permitted in HTML are permitted in XML
and XHTML as well. The converse is not true, and this raises another
question:

XML permits C1 Controls, and HTML 4 forbids them. Since they are hardly
useful, and typically result from conversion errors or incorrectly
specified encoding (e.g., serving windows-1252 encoded data as
iso-8859-1), shouldn't XHTML have a separate rule that forbids them?
This would not make it possible to detect the problem as an error
in validation, but it would let other checkers report it objectively
as an error. Or is there some imaginable use for C1 Controls in XHTML?
(Note that C0 Controls other than tab, CR and LF are already forbidden
in XML.)
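
As a rough illustration of what such a non-validating checker could do,
here is a small Python sketch. The function name find_c1_controls is
hypothetical, and the sample string is invented; it only shows how
windows-1252 data read as iso-8859-1 ends up containing C1 Controls
(U+0080 to U+009F).

```python
def find_c1_controls(text):
    """Return (index, code point) pairs for every C1 Control in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if 0x80 <= ord(ch) <= 0x9F]

# Curly quotes are bytes 0x93 and 0x94 in windows-1252.
raw = "\u201cquoted\u201d".encode("cp1252")

# Reading the same bytes as iso-8859-1 turns them into U+0093 and U+0094.
misread = raw.decode("iso-8859-1")

for index, code in find_c1_controls(misread):
    print(f"C1 Control U+{code:04X} at position {index}")
```

A checker along these lines could not say what the author intended, but
it could point at the positions that most likely come from a mislabeled
encoding.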

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Thursday, 13 May 2004 07:54:42 UTC