
Re: New article for REVIEW: Upgrading from language-specific legacy encoding to Unicode encoding

From: Frank Yung-Fong Tang <franktang@gmail.com>
Date: Wed, 12 Oct 2005 17:57:31 -0700
Message-ID: <2e4dfd690510121757n763f5b9er@mail.gmail.com>
To: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
Cc: franktang@gmail.com, Richard Ishida <ishida@w3.org>, www-international@w3.org, member-i18n-geo@w3.org
The issue in CSS

CSS1
http://www.w3.org/TR/CSS1
"

The following is the tokenizer, written in flex [16]
<http://www.w3.org/TR/CSS1#ref16> notation. Note that this assumes
an 8-bit implementation of flex. The tokenizer is case-insensitive
(flex command line option -i).

unicode		\\[0-9a-f]{1,4}
"



CSS2
http://www.w3.org/TR/CSS21/syndata.html#q6
"

Third, backslash escapes allow authors to refer to characters they can't
easily put in a document. In this case, the backslash is followed by at most
six hexadecimal digits (0..9A..F), which stand for the ISO 10646
([ISO10646] <http://www.w3.org/TR/CSS21/refs.html#ref-ISO10646>)
character with that number, which must not be zero. (It is undefined in CSS
2.1 what happens if a style sheet *does* contain a zero.) If a character in
the range [0-9a-fA-F] follows the hexadecimal number, the end of the number
needs to be made clear. There are two ways to do that:

   1. with a space (or other whitespace character): "\26 B" ("&B"). In
   this case, user agents should treat a "CR/LF" pair (U+000D/U+000A) as a
   single whitespace character.
   2. by providing exactly 6 hexadecimal digits: "\000026B" ("&B")

"

and also, in the tokenizer:
unicode \\[0-9a-f]{1,6}(\r\n|[ \n\r\t\f])?

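The CSS \ escaping is tricky because in CSS1 an escape takes at most 4
hex digits and does not use a ' ' as a terminator, but in CSS2 it takes
up to 6 digits and does need a ' ' terminator (if it has fewer than 6
digits and a hex digit follows).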
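
To make the difference concrete, here is a minimal sketch that transcribes
the two "unicode" token patterns into Python's re module. The names
CSS1_UNICODE, CSS2_UNICODE and first_escape are labels made up for this
sketch, and re.IGNORECASE stands in for the flex -i option; it shows only
what a single escape consumes, not a full tokenizer.

```python
import re

# The two "unicode" token patterns quoted above, transcribed for Python's re.
# The names are labels for this sketch, not anything defined by the specs.
CSS1_UNICODE = re.compile(r"\\[0-9a-f]{1,4}", re.IGNORECASE)
CSS2_UNICODE = re.compile(r"\\[0-9a-f]{1,6}(\r\n|[ \n\r\t\f])?", re.IGNORECASE)

def first_escape(pattern, text):
    """Return the text consumed by one escape at the start of `text`."""
    match = pattern.match(text)
    return match.group(0) if match else None

for sample in (r"\26 B", r"\000026B"):
    # repr() makes a swallowed trailing space visible in the output.
    print(sample, "CSS1:", repr(first_escape(CSS1_UNICODE, sample)))
    print(sample, "CSS2:", repr(first_escape(CSS2_UNICODE, sample)))
```

With the CSS1 pattern the space after \26 is left in the text and
\000026B stops after four digits; with the CSS2 pattern the space is
swallowed and all six digits are taken.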

So it becomes very tricky to write U+4e00 + 'a' + U+0043 + 'b'.
1. \4e00 a\43 b
and
2. \004e00a\000043b
both represent the
4 characters U+4e00 + 'a' + U+0043 + 'b' in CSS2,
but 1 represents
U+4e00 + ' ' + 'a' + U+0043 + ' ' + 'b' in CSS1
and 2 represents
U+004e + '0' + '0' + 'a' + U+0000 + '4' + '3' + 'b' in CSS1
(each CSS1 escape stops after 4 hex digits).

And in CSS2, what does
\04e00a\00043b
represent?
Also, what does
\004E00a\43B
represent?
(Notice that I changed the e to E and the b to B.)

It is tricky, right?
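
One way to check such cases is a small decoder. This is only a sketch,
assuming the escape always takes as many hex digits as the pattern
allows and that hex digits are matched case-insensitively; real user
agents may behave differently. The function names decode_escapes and
code_points, and the parameters max_digits and eat_whitespace, are
invented for this example.

```python
import re

def decode_escapes(text, max_digits=6, eat_whitespace=True):
    """Expand backslash-hex escapes, taking as many hex digits as allowed.

    The defaults approximate the CSS 2.1 token quoted earlier; pass
    max_digits=4, eat_whitespace=False to approximate the CSS1 token.
    Both parameter names are inventions of this sketch.
    """
    tail = r"(?:\r\n|[ \n\r\t\f])?" if eat_whitespace else ""
    pattern = re.compile(r"\\([0-9a-fA-F]{1,%d})%s" % (max_digits, tail))
    # Values above U+10FFFF would make chr() raise; not handled here.
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)

def code_points(text):
    """Render each character as U+XXXX so results are easy to compare."""
    return ["U+%04X" % ord(ch) for ch in text]

for source in (r"\4e00 a\43 b", r"\004e00a\000043b",
               r"\04e00a\00043b", r"\004E00a\43B"):
    print(source, "->", code_points(decode_escapes(source)))
```

Under these assumptions the first two inputs decode to the same four
characters, \04e00a\00043b collapses to two code points (U+4E00A and
U+043B), and \004E00a\43B to three (U+4E00, 'a', U+043B).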

2005/10/12, Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>:
>
>
> Hi Frank
>
> Could you clarify: we're not sure what problem you refer to. Possibly:
>
> - if you change the encoding of your HTML, you should ensure there are no
> knock-on effects on other files
> or
> - class defined in another language
> or
> - something else?
>
> Many thanks
>
> Deborah
>
>
> -----Original Message-----
> From: www-international-request@w3.org on behalf of Frank Yung-Fong Tang
> Sent: Tue 8/23/2005 20:02
> To: Richard Ishida
> Cc: www-international@w3.org
> Subject: Re: New article for REVIEW: Upgrading from language-specific
> legacy encoding to Unicode encoding
>
>
> I think you should mention not only charset with HTML, but also the issues
> with CSS and separate JavaScript files. The issue with \ unicode escapes
> in CSS is quite tricky.
>
> Richard Ishida wrote on 8/23/2005, 1:45 PM:
>
> >
> >
> >
> > Title: Upgrading from language-specific legacy encoding to Unicode
> > encoding
> > http://www.w3.org/International/questions/qa-utf8-upgrade.html
> >
> > Comments are being sought on this article prior to final release.
> > Please send any comments to www-international@w3.org. We expect to
> > publish a final version in one to three weeks.
> >
> > This article provides an answer to the question: What should I
> > consider when upgrading my web pages from legacy encoding to Unicode
> > encoding?
> >
> >
> >
> > ============
> > Richard Ishida
> > W3C
> >
> > contact info:
> > http://www.w3.org/People/Ishida/
> >
> > W3C Internationalization:
> > http://www.w3.org/International/
> >
> > Publication blog:
> > http://people.w3.org/rishida/blog/
> >
> >
> >


--
Frank Yung-Fong Tang 譚永鋒
Îñţérñåţîöñåļîžåţîöñ

FrankTang@gmail.com
Skype: FrankYungFongTang
Yahoo IM: FrankYungFongTan
MSN IM: FrankYungFongTang@hotmail.com
Received on Thursday, 13 October 2005 00:57:56 GMT
