Re: [html5] r5258 - [e] (0) Some more references to UTF-8. from Leif Halvard Silli on 2010-08-10 (public-html@w3.org from August 2010)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 10 Aug 2010 20:34:16 +0200
To: "Tab Atkins Jr." <jackalmage@gmail.com>
Cc: whatwg@whatwg.org, HTMLwg <public-html@w3.org>, Ian Hickson <ian@hixie.ch>, commit-watchers@whatwg.org
Message-ID: <20100810203416530294.2a0cf4bb@xn--mlform-iua.no>

Tab Atkins Jr., Tue, 10 Aug 2010 10:11:25 -0700:
> On Tue, Aug 10, 2010 at 4:53 AM, Leif Halvard Silli
> <xn--mlform-iua@xn--mlform-iua.no> wrote:
>> whatwg@whatwg.org, Mon,  9 Aug 2010 18:16:12 -0700 (PDT):
>>> Author: ianh
>>> Date: 2010-08-09 18:16:10 -0700 (Mon, 09 Aug 2010)
>>> New Revision: 5258
>> 
>>>    <p>Authors are encouraged to use UTF-8. Conformance checkers may
>>> -  advise authors against using legacy encodings.</p>
>>> +  advise authors against using legacy encodings. <a
>>> href=#refsRFC3629>[RFC3629]</a></p>
>> 
>> Could we replace 'legacy encodings' with a clearer wording - or
>> alternatively define what 'legacy encodings' means? The current wording
>> could give the impression that any encoding other than UTF-8 is a
>> legacy encoding. But it is unclear whether that is actually what is
>> meant.
>> 
>> Specifically, it is not clear from the above whether conformance
>> checkers may advise authors against using UTF-16, since UTF-16
>> generally isn't associated with 'legacy encoding'.
> 
> That's precisely what's meant.  UTF-8 is the encoding of the web.  Any
> and all other encodings are legacy encodings.

But that is not clear from this paragraph in the specification.

This is because "legacy" is not synonymous with "deprecated", "unwanted" 
or "not optimal". Rather, it is synonymous with "old" and "outdated" 
(and therefore, as a consequence, deprecated/unwanted/suboptimal). Few 
readers of the spec will think of UTF-16 as "old" and "out of date". 
And contrary to what I feel you are suggesting, "legacy", when used 
about encodings, usually means the same thing irrespective of the 
context, be it "the Web" or anything else. It is generally accepted 
that non-Unicode encodings are legacy encodings, irrespective of the 
context. But it is not generally accepted that UTF-16, being a Unicode 
encoding in "good standing", is a legacy encoding.

Thus, if 'legacy encoding' is meant to cover UTF-16 as well, then the 
wording is quite unclear and open to more than one interpretation. I 
suggest using a wording that is more certain to get the message across: 
EITHER define what 'legacy encoding' refers to [*], OR avoid the term 
entirely [#]. It is not important to me which strategy is used; it 
only matters that the wording becomes clear.

[*] For example: "Encodings other than UTF-8 are considered legacy 
encodings by this specification and conformance checkers may advise 
against their use."
-- 
leif halvard silli

Received on Tuesday, 10 August 2010 18:34:52 UTC