Re: [css-text] Control characters

After reading those mozilla bugs, and thinking some more, I suggest the
following:

1. Render control characters U+0080-U+009F normally (i.e. show boxes if there
is no available glyph).
2. Treat U+000C (form feed), in addition to U+0009, U+000A and U+000D, as
whitespace.
3. Ignore other control characters for the purposes of rendering (as in the
current spec).
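
As a rough illustration of the three rules above, here is a minimal Python
sketch. The function name classify_control and the labels it returns are
made up for this example; they are not taken from CSS Text or any
implementation.

```python
import unicodedata

# Control characters the proposal treats as whitespace (rule 2).
WHITESPACE_CONTROLS = {0x0009, 0x000A, 0x000C, 0x000D}


def classify_control(ch: str) -> str:
    """Return a hypothetical rendering class for a single character."""
    cp = ord(ch)
    if unicodedata.category(ch) != "Cc":
        return "normal"      # not a control character at all
    if cp in WHITESPACE_CONTROLS:
        return "whitespace"  # rule 2: tab, LF, FF, CR
    if 0x0080 <= cp <= 0x009F:
        return "render"      # rule 1: C1 controls, glyph or missing-glyph box
    return "ignore"          # rule 3: remaining C0 controls and U+007F
```

Under this reading, U+000B (vertical tab) and U+007F (DEL) both fall
through to "ignore", since neither is listed in rules 1 or 2.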

Reasoning:

1. The most likely reason for a document containing C1 control characters
is that they are left over from conversion from one of the Windows 8-bit
legacy encodings.  Note that HTML treats numeric character references to
chars in this range specially [1]. Rendering them normally is a deviation
from Unicode, which requires a U+0085 to be rendered as blank space if
there is no available glyph; however, U+0085 as a whitespace character
(NEL) typically only results from a conversion from EBCDIC, which is almost
certainly much less common than the Windows legacy case.
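
To make the Windows legacy case concrete, here is a small Python sketch
(the helper names are hypothetical). It assumes the typical failure mode is
a windows-1252 byte being decoded as ISO-8859-1, and uses the standard
library's html.unescape, which applies the HTML5 remapping for numeric
character references in this range.

```python
import html
import unicodedata


def decode_both(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes as ISO-8859-1 and as windows-1252."""
    return raw.decode("latin-1"), raw.decode("cp1252")


def resolve_ncr(ref: str) -> str:
    """Resolve a character reference the way an HTML5 parser would."""
    return html.unescape(ref)


# Byte 0x93 is a left double quotation mark in windows-1252, but decoding
# it as ISO-8859-1 leaves a C1 control character (U+0093) in the text.
wrong, intended = decode_both(b"\x93")
is_c1_control = unicodedata.category(wrong) == "Cc"
```

Here wrong is U+0093 and intended is U+201C; resolve_ncr("&#x93;") also
gives U+201C, which is the special treatment referred to in [1].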

2. HTML [2] and Unicode both treat form feed as a whitespace character. It
is also still occasionally used as a whitespace character in real life: for
example, GNU Emacs has a set of commands that work on "pages", which by
default are separated by form feeds (e.g. C-x [ and C-x ] will move
backwards and forwards by pages), and formatted ASCII output uses form feed
to separate pages. Unicode also treats U+000B (vertical tab) as white space,
as does JavaScript; HTML doesn't (although it does treat it slightly
differently from other control characters [3]). However, I have never seen
U+000B intentionally used as whitespace.
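
The differences between these definitions can be written down as sets. This
is only a sketch covering control characters (class Cc); the constant and
function names are made up for illustration.

```python
# Control characters with the Unicode White_Space property.
UNICODE_WHITE_SPACE_CC = {0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0085}

# Control characters that HTML counts as space characters.
HTML_SPACE_CC = {0x0009, 0x000A, 0x000C, 0x000D}

# Control characters treated as whitespace under the proposal above.
PROPOSED_WHITESPACE_CC = {0x0009, 0x000A, 0x000C, 0x000D}


def unicode_only() -> list[int]:
    """Controls that are white space in Unicode but not in the proposal."""
    return sorted(UNICODE_WHITE_SPACE_CC - PROPOSED_WHITESPACE_CC)
```

With these sets the proposal matches HTML exactly, and unicode_only()
leaves just U+000B and U+0085 as the points where it departs from Unicode.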

3. Other control characters with code points below U+0020 are more likely
to be random crap, and showing it won't help the user (though it would be
useful to show them in some contexts such as view-source).

[1]
http://www.w3.org/html/wg/drafts/html/master/single-page.html#tokenizing-character-references
[2]
http://www.w3.org/html/wg/drafts/html/master/single-page.html#space-character
[3]
http://www.w3.org/html/wg/drafts/html/master/single-page.html#preprocessing-the-input-stream

James


On Thu, Mar 20, 2014 at 9:10 PM, Jonathan Kew <jfkthame@gmail.com> wrote:

> On 20/3/14 04:57, Robert O'Callahan wrote:
>
>> On Thu, Mar 20, 2014 at 11:00 AM, James Clark <jjc@jclark.com
>> <mailto:jjc@jclark.com>> wrote:
>>
>>     CSS Text says:
>>
>>         Control characters (Unicode class Cc) other than tab (U+0009),
>>         line feed (U+000A), and carriage return (U+000D) are ignored for
>>         the purpose of rendering.
>>
>>
>>     (This is a change from CSS 2.1, which says they are rendered as
>>     usual.) I was wondering what the thinking is here.  This requirement
>>     conflicts with Unicode (see
>>     http://www.unicode.org/faq/unsup_char.html) in a couple of ways:
>>
>>     1. In addition to 0x9, 0xA and 0xD, Unicode gives characters 0xB
>>     (VT), 0xC (FF) and 0x85 (NEL) the White_Space property.  Characters
>>     with the White_Space property are supposed to be rendered as a
>>     visible but blank space. (Of these, HTML includes only 0xC as a
>>     space character.)
>>
>>     2. Other control characters are supposed to be rendered normally (ie
>>     displayed with a missing glyph if not available in the font).
>>
>>
>> We had a discussion about this a while back within Mozilla; some people
>> like the idea of displaying control characters so that such 'soft
>> errors' in pages can be more easily detected and fixed.
>>
>> We ended up defining an internal CSS property
>> '-moz-control-character-visibility:visible|hidden', with initial value
>> hidden, but we set it to visible for devtools, plain text files, the
>> contents of text inputs, view-source, etc. We could easily standardize
>> that if other people are interested.
>>
>
> For some further discussion, see comments (arguing both for and against
> such a change) in relevant mozilla bugs, such as:
>
>   https://bugzilla.mozilla.org/show_bug.cgi?id=757521
>   https://bugzilla.mozilla.org/show_bug.cgi?id=909344
>   https://bugzilla.mozilla.org/show_bug.cgi?id=947588
>   https://bugzilla.mozilla.org/show_bug.cgi?id=963252
>
> JK
>
>

Received on Friday, 21 March 2014 01:55:48 UTC