Re: Two new encoding related articles for review from Richard Ishida on 2014-03-24 (www-international@w3.org from January to March 2014)

From: Richard Ishida <ishida@w3.org>
Date: Mon, 24 Mar 2014 17:40:54 +0000
To: www-international@w3.org
Message-ID: <53306E26.1080901@w3.org>
I'm basically just repeating the text in the HTML5 spec, so please raise 
a bug against that, and then I'll adapt my text to suit.

For convenience, the HTML5 spec currently reads:

"Encodings in which a series of bytes in the range 0x20 to 0x7E can 
encode characters other than the corresponding characters in the range 
U+0020 to U+007E represent a potential security vulnerability: a user 
agent that does not support the encoding (or does not support the label 
used to declare the encoding, or does not use the same mechanism to 
detect the encoding of unlabeled content as another user agent) might 
end up interpreting technically benign plain text content as HTML tags 
and JavaScript. Authors should therefore not use these encodings. For 
example, this applies to encodings in which the bytes corresponding to 
"<script>" in ASCII can encode a different string. Authors should not 
use such encodings, which are known to include JIS_C6226-1983, 
JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings 
based on ISO-2022, and encodings based on EBCDIC. Furthermore, authors 
must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also 
fall into this category; these encodings were never intended for use for 
Web content. [RFC1345] [RFC1842] [RFC1468] [RFC2237] [RFC1554] [CP50220] 
[RFC1922] [RFC1557] [CESU8] [UTF7] [BOCU1] [SCSU]"
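
To make the quoted concern concrete, here is a minimal Python sketch using the standard library's utf-7 and iso2022_jp codecs. The byte strings are made-up samples chosen for illustration; they do not come from the spec or from any real page. The first shows bytes that look like harmless ASCII text but decode to markup as UTF-7; the second shows bytes that spell a tag in ASCII but are ordinary kanji in ISO-2022-JP.

```python
# Illustrative samples only; not taken from the spec or from real content.

# 1. UTF-7: every byte is printable ASCII, yet a UTF-7 decoder sees markup.
utf7_bytes = b"+ADw-script+AD4-"
as_ascii = utf7_bytes.decode("ascii")   # what an ASCII-assuming filter sees
as_utf7 = utf7_bytes.decode("utf-7")    # what a UTF-7-aware agent sees

# 2. ISO-2022-JP: after the escape sequence ESC $ B, pairs of bytes in the
#    0x21-0x7E range encode JIS X 0208 characters, so the bytes that spell
#    "<p>a" in ASCII decode to two kanji instead of a tag.
jis_bytes = b"\x1b$B<p>a\x1b(B"
as_jis = jis_bytes.decode("iso2022_jp")       # benign Japanese text
as_ascii_jis = jis_bytes.decode("ascii")      # an agent ignoring the escapes

print(repr(as_ascii), "->", repr(as_utf7))
print("tag visible when read as ASCII:", "<p>" in as_ascii_jis)
print("tag visible when read as ISO-2022-JP:", "<" in as_jis)
```

In both cases the same bytes in the 0x20 to 0x7E range mean different things depending on which decoder is applied, which is the mismatch the spec text warns about.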

Cheers,
RI



On 24/03/2014 06:47, Jungshik SHIN (신정식) wrote:
>
>
>
> On Sun, Mar 23, 2014 at 10:16 PM, "Martin J. Dürst"
> <duerst@it.aoyama.ac.jp> wrote:
>
>     What Jungshik says. In addition, even encodings such as iso-8859-*
>     can be understood in terms of the ISO-2022 framework/toolbox, so
>     'encodings based on ISO-2022' is really best avoided.
>
>
> Yup. I meant to mention that too, but forgot while actually writing it.
>
> Jungshik
>
>
>     Regards,    Martin.
>
>
>     On 2014/03/21 05:18, Jungshik SHIN (신정식) wrote:
>
>         Documents must not use JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312,
>         JOHAB (Windows code page 1361), encodings based on ISO-2022, or
>         encodings based on EBCDIC. This is because they allow ASCII code
>         points to represent non-ASCII characters, which poses a security
>         threat.
>
>         Well, JOHAB is ASCII-compatible (NOT that I would encourage anybody
>         to use it. Nobody has actually used it on the web except for
>         testing, so it may not be worth mentioning here. Whether it's
>         mentioned or not, nobody will use it). I don't know what kind of
>         encoding JIS X 0212-1990 is meant to be (it's a coded character set
>         that can be used in encodings such as EUC-JP under its
>         ISO-2022-based definition. Well, the new definition of EUC-JP in
>         the encoding standard does not allow it.).
>
>         Moreover, 'encodings based on ISO-2022' include EUC-JP and EUC-KR
>         (well, in the new encoding standard, EUC-KR is now synonymous with
>         Windows-949 and is not ISO-2022-based any more) as well as
>         ISO-2022-{KR,JP,CN} etc. Obviously, the encodings in the former
>         group are ASCII-compatible (with the possible exception of \x5C).
>
>         Therefore, to be precise (pedantic), 'encodings based on ISO-2022'
>         has to be replaced with 'ISO-2022-JP*, ISO-2022-KR, ISO-2022-CN*'.
>
>         Jungshik
>
>
>         On Thu, Mar 20, 2014 at 12:08 PM, Gunnar Bittersmann
>         <gunnar@bittersmann.de> wrote:
>
>             Richard Ishida scripsit (2014-03-17 17:29+01:00):
>
>             http://www.w3.org/International/questions/qa-choosing-encodings-new
>
>
>
>                     However, I don’t think that the keywords should be
>                     marked up as <strong class="kw">.
>
>                     Stick with code elements, or use span or b. Or for
>                     the character encodings, no markup at all, as before.
>
>                     (Don’t replace all occurrences of ‘strong’ with
>                     ‘code’; there’s a ‘strongly’ in the text.)
>
>
>                 The idea was to make them stand out visually. I replaced
>                 strong with b.
>
>
>             You were using ‘ASCII’, ‘UTF-8’, ‘UTF-16’ and ‘UTF-32’ with no
>             special visual emphasis throughout the upper three quarters of
>             the article. Why here?
>
>             To my taste, it does not improve the readability of the text;
>             quite the contrary.
>
>             If you really want to make them stand out visually: there’s
>             still ‘UTF-8’ and ‘ISO-8859-8-i’ without that markup in one of
>             these paragraphs. And in other articles, such keywords are
>             marked up as <code class="kw"> and set in normal font weight.
>             Here it’s <b class="kw">, in bold, which is inconsistent.
>
>             My proposal is: Display encoding names as normal text, no
>             markup.
>
>             ‘replacement’ and ‘x-user-defined’ are good candidates for that
>             keyword markup, though. Not in bold, however, but in a normal
>             monospaced font, i.e. use the code element.
>
>
>
>                     And shouldn’t this link to
>
>                     http://www.w3.org/International/questions/qa-visual-vs-logical#term_visualordering
>
>                     given that ‘logically ordered’ links to
>                     http://www.w3.org/International/questions/qa-visual-vs-logical#term_logicalordering
>
>                     ?
>
>                 No. That's what I wanted.
>
>
>             To me it’s strange that ‘logically ordered’ (marked up as
>             "termref") points to the description of that term, while
>             ‘visual encoding’ (also marked up as "termref") does not do
>             likewise, but points to the whole article instead.
>
>             I think the best phrase to use as link title for the whole
>             article would be ‘should also be avoided’.
>
>             The anchor links might be out of the scope of this article;
>             most of the target audience of qa-choosing-encodings don’t have
>             to deal with RTL scripts, and Hebrew in particular. And those
>             who do will read the entire article qa-visual-vs-logical anyway.
>
>             My proposal is: Link to that article just once, without a
>             fragment identifier:
>
>             … (Hebrew visual encoding) <a
>             href="/International/questions/qa-visual-vs-logical">should
>             also be avoided</a>, in favour of an encoding that works with
>             logically ordered text …
>
>
>
>             »»
>             that maps every octet to the Unicode code point
>             ««
>
>             This is the only time the term ‘octet’ is used in this article.
>             Would the term be clear to the reader? Or would it be better to
>             use ‘byte’ in this context (even though that might be less
>             accurate)?
>
>             Cheers,
>             Gunnar
>
>
>
>
Received on Monday, 24 March 2014 17:41:24 UTC