RE: Encoding from Shawn Steele on 2015-02-25 (www-international@w3.org from January to March 2015)

From: Shawn Steele <Shawn.Steele@microsoft.com>
Date: Wed, 25 Feb 2015 16:53:20 +0000
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Glenn Adams <glenn@skynav.com>
CC: "Phillips, Addison" <addison@lab126.com>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <CY1PR0301MB073144641AD6F887F3626E9D82170@CY1PR0301MB0731.namprd03.prod.outlook.>
That's what I was afraid of.

Browsers do different things.  So if I have a browser that works fine for my enterprise app, those encodings cannot change.  Otherwise those apps will be broken.  So even if a "perfect" set of encodings is found, there will still be incompatibility problems.

The fix for data ambiguity is to use UTF-8.  That removes (most of) the ambiguity.  I'd prefer to see much stronger language that people SHOULD use UTF-8.
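
To make the ambiguity concrete, here is a minimal sketch in Python.  The byte value, the three candidate code pages, and the helper names are picked purely for illustration; nothing here comes from the spec.  It shows one legacy byte decoding "successfully" to three different letters, while the same byte is rejected outright as UTF-8.

```python
# Illustrative only: RAW, decode_candidates and is_valid_utf8 are hypothetical names.
RAW = b"\xe9"  # a single byte with no encoding label attached


def decode_candidates(data, encodings=("cp1252", "cp1251", "cp1253")):
    """Decode the same bytes under each candidate legacy code page."""
    return {name: data.decode(name) for name in encodings}


def is_valid_utf8(data):
    """Return True only if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Every legacy code page accepts the byte, each with a different meaning
# (Latin, Cyrillic and Greek letters respectively).
legacy_readings = decode_candidates(RAW)

# The UTF-8 form of the Latin reading is a two-byte sequence.
utf8_bytes = "é".encode("utf-8")

print(legacy_readings)
print(is_valid_utf8(RAW), is_valid_utf8(utf8_bytes))
```

A wrong guess among legacy code pages fails silently, whereas bytes that are not UTF-8 usually fail to decode at all, which is the "(most of)" part.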

I would rather that the portion about the legacy code pages be informative rather than normative.  It's fine as a reference, but the worst thing that could happen is that some data author with a minor bug on an otherwise working system sees a mismatch between their legacy code page mappings and this document.  Then they change the data to match the document's tables, perhaps fixing that bug, but breaking everything else that was working: tools, databases, reports, etc.
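
A rough Python sketch of that failure mode, under stated assumptions: Python's latin-1 and cp1252 codecs stand in for "the table my system has always used" and "the table in a document", and remap_breaks_roundtrip is a hypothetical helper, not anything from the spec.

```python
# Illustrative only: the codec pair and the helper name are assumptions.
RAW = b"\x80"

as_latin1 = RAW.decode("latin-1")  # U+0080, a C1 control character
as_cp1252 = RAW.decode("cp1252")   # U+20AC, the euro sign


def remap_breaks_roundtrip(text, old="latin-1", new="cp1252"):
    """True if text stored under `old` cannot be written back identically under `new`."""
    original = text.encode(old)
    try:
        return text.encode(new) != original
    except UnicodeEncodeError:
        # The new table has no slot for a character the old table produced.
        return True


print(repr(as_latin1), repr(as_cp1252))
print(remap_breaks_roundtrip(as_latin1))  # data that used the differing byte
print(remap_breaks_roundtrip("abc"))      # plain ASCII is unaffected
```

Only the data that touches the differing code points is affected, which is exactly why such a change can look harmless in testing and still break downstream tools.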

More complicated are things like Zawgyi and moved HKSCS mappings, which this document doesn't address.

-Shawn

-----Original Message-----
From: "Martin J. Dürst" [mailto:duerst@it.aoyama.ac.jp] 
Sent: Wednesday, 25 February, 2015 1:12 AM
To: Glenn Adams; Shawn Steele
Cc: Phillips, Addison; www-international@w3.org
Subject: Re: Encoding

I think the author would describe the goal as having a spec for browsers so that hopefully/eventually all browsers would treat encodings the same, and so that it would be easier to create a new browser, because there would be less stuff to reverse-engineer.

On top of that, the author may have (had) the hope that the power of the
(Web) platform would make (t)his spec eliminate all the other specs and variants. But there is a high chance that it was just something like https://xkcd.com/927/.


As for the W3C, the encoding spec was referenced from the HTML5 spec, and so there was a need to move the encoding spec forward process-wise in order for the HTML5 spec to be able to move forward.

Regards,   Martin.

On 2015/02/25 07:06, Glenn Adams wrote:
> On Tue, Feb 24, 2015 at 2:01 PM, Shawn Steele 
> <Shawn.Steele@microsoft.com>
> wrote:
>
>> I'm still struggling with the goals of the encoding work.
>> https://encoding.spec.whatwg.org/

>
>
> IMO, the reason for this specification is that the author had little 
> knowledge of character encoding, and used the exercise of writing a 
> new document as a way to acquire that knowledge, and, of course, to 
> rewrite the world of encodings in his PoV.
>
> I suppose the reason the author would give, however, is that it was 
> intended to document existing practice or best practice or something 
> in between. Again, one questions the authority to do something of that 
> sort from one new to the subject.
>
> That's just my two cents. Do not interpret my comments as an attack on 
> the author. I have a lot of respect for him. Just not on this subject.
>
>
>>
>>
>> Everything except UTF-8 is legacy, which is good, and I get a desire 
>> to quantify the landscape, however I'm not sure what point is served 
>> by standardizing the tables.
>>
>> Either A) Existing content is already correct per an existing 
>> standard (in which case a link would suffice), or B) Existing content 
>> was encoded using slightly different tables.
>>
>> In the case of existing content, it probably "works" for whomever's 
>> using it, though there may be interoperability issues.  To correct 
>> that data, they need to move to UTF-8.  Adding yet another "perfect" 
>> mapping table only causes further fragmentation as people may attempt to convert to that.
>>
>> For example, HKSCS is rolled up to big-5, however historically there 
>> have been multiple font-hack PUA and real Unicode code point 
>> assignments for that space.  Which makes it hard to say that one 
>> mapping or another is "right" for that space.  It likely depends on 
>> actual data, how the application uses it, and what its dependencies 
>> are.  Worse, I can't even reliably detect the quirks of the system 
>> where data originated as it may be currently hosted on some other platform.
>>
>> Currently different vendors/platforms/systems have slightly different 
>> mappings.  Clearly that isn't desirable, however a "standard" would 
>> obviously break existing data for at least some of those 
>> vendors/platforms/systems.
>>
>> So, what does the WG expect to happen from this process?
>>
>> A) Do they expect users to correct data to the WG standard mappings?
>> B) Do they expect applications (or users) to abandon previous 
>> behavior to the WG standard mappings?
>> C) For either of these, what timeframe does the WG expect it to happen in?
>> D) Does the WG expect that this problem will be "solved" as a result 
>> of this work?  (Solved == everything's codified so there is no more 
>> confusion.)
>>
>> Thanks,
>>
>> -Shawn
>>
>
Received on Wednesday, 25 February 2015 16:53:50 UTC