Re: [Encoding] false statement [I18N-ACTION-328][I18N-ISSUE-374]

On 2014/08/29 12:04, John Cowan wrote:
> John C Klensin scripsit:
>> it is also noteworthy that the number of web
>> browsers, or even web servers, in use is fairly small.  By
>> contrast, the number of SMTP clients and servers, including
>> independently-developed submission clients built into embedded
>> devices, is huge and the number of mail user agents even larger.
> Very true.  But the number of web pages reduces all the distinct
> Internet software programs in the world to nothing at all.

It's not so much the overall number that counts (otherwise we would 
have to count the number of emails, too) but the relationship between 
the number of 'senders' and 'recipients'. The Web is characterized by 
an extreme number of senders (essentially, each Web page is its own 
sender) and very few recipients (browsers). For mail, each sender is also a 
recipient. This creates very different ecosystems for the standards in 
those different fields.

>> So an instruction from the IETF (or W3C or some other entity) to
>> those email systems to abandon the IANA Registry's definitions
>> in favor of some other norm would, pragmatically, be likely to
>> make things worse rather than better,
> +1

It's not only email systems. There are many libraries that provide 
encoding conversion. These libraries are used in all kinds of contexts 
(programming languages, databases, applications,...). These contexts 
expect the same degree of consistency over time with respect to encoding 
definitions as Web pages expect from Web browsers.

As the creator and sometime maintainer of such a library (the one in the 
Ruby programming language since v1.9), I can assure everybody that such 
maintainers will not change what e.g. "US-ASCII" or "iso-8859-1" means 
anytime soon, because they would shoot themselves and most of their 
users in the foot.
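To make that concrete with Ruby's own library (a small sketch; the 
lookups shown are standard Ruby, while the comments summarize the 
browser-side behavior defined by the "Encoding" spec): Ruby resolves 
encoding labels per the IANA registry, so "iso-8859-1" stays 
ISO-8859-1, whereas browsers map that label to windows-1252.

```ruby
# Ruby resolves labels per the IANA registry: "iso-8859-1" names
# ISO-8859-1 itself. The WHATWG "Encoding" spec, by contrast, maps
# that label to windows-1252 in browsers.
latin1 = Encoding.find("iso-8859-1")    # lookup is case-insensitive
cp1252 = Encoding.find("windows-1252")

puts latin1.name        # "ISO-8859-1"
puts cp1252.name        # "Windows-1252"
puts latin1 == cp1252   # false -- they remain distinct encodings
```

Changing what `Encoding.find("iso-8859-1")` returns would silently 
alter the behavior of every program that relies on it today.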

And not making changes doesn't mean a library won't be compatible with the Web. 
It very much depends on where and how these libraries are used. Being 
used on a server to generate content will not be a problem, because for 
any *sane* data, the result will be okay by the "Encoding" spec, too. 
There's no problem converting non-ASCII data to numeric character 
references if a page is served as "US-ASCII", and there is no problem 
serving "iso-8859-1" content without bytes in the C1 range.
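In Ruby, for instance, that kind of safe server-side generation can be 
sketched with `String#encode` and its `xml:` option, which turns 
characters undefined in the target encoding into numeric character 
references (this is a minimal sketch, not a full templating pipeline):

```ruby
# Transcode UTF-8 text for a page served as US-ASCII: characters that
# don't exist in US-ASCII become numeric character references (&#x..;),
# so the result is valid under both the IANA definition and the
# "Encoding" spec.
text  = "caf\u00E9 \u20AC5"                # "café €5"
ascii = text.encode("US-ASCII", xml: :text)
puts ascii                                 # é and € as &#x..; references
puts ascii.encoding                        # US-ASCII
```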

Even when used to consume Web content, there's no big problem. The 
average Web spider project doesn't suffer significantly from a few 
encoding hiccups because statistically, the pages that conform to 
both the IANA registry and "Encoding" are a large majority. Even long 
before you get to Google scale, mislabelings probably are a bigger 
problem than encoding details in otherwise correctly labeled data.
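How small those "encoding details" are can again be shown in Ruby: for 
the "iso-8859-1" label, the IANA and "Encoding" interpretations differ 
only for the bytes 0x80-0x9F (the C1 range), which browsers decode as 
Windows-1252. A quick sketch:

```ruby
# Byte 0x80 under the IANA definition of ISO-8859-1 is the C1 control
# character U+0080; browsers following the "Encoding" spec decode the
# same byte as Windows-1252, giving the euro sign U+20AC.
byte = "\x80".b
puts byte.force_encoding("ISO-8859-1").encode("UTF-8").codepoints.first.to_s(16)  # 80
puts byte.force_encoding("Windows-1252").encode("UTF-8")                          # €

# Outside the C1 range the two interpretations agree:
puts "\xE9".b.force_encoding("ISO-8859-1").encode("UTF-8")    # "é", same in both
```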

On the other hand, changing how a transcoding library works risks 
putting off lots of users and creating lots of difficult-to-trace 
bugs, and so is best avoided.

That libraries have to stay on the conservative side is made clear by 
the fact that Microsoft has difficulties implementing the "Encoding" 
spec. IE uses a common Windows library for transcoding, and both 
changing this library and splitting it into separate Web and non-Web 
versions is highly unattractive to Microsoft. My guess is that Microsoft 
will just sit things out until UTF-8 is the only thing that counts.

So claims such as "Which applications don't want to be compatible
with the web?" (implying a single overreaching unification is desirable 
and possible) ignore the (messy) reality on the ground.

>> Sure.  But that and scale measured in numbers of deployed
>> independent implementations and the difficulties associated with
>> changing them, would seem to argue strongly for at least mostly
>> changing the web browsers to conform to what is in the IANA
>> registry

In theory, this is correct. If all browsers agreed to do it, they could 
do so easily. But it's a prisoner's dilemma type of situation, with more 
than two players who can ruin things, and there are enough equivalent 
situations in the HTML5,... area to strongly suggest that going back 
to the IANA registry won't happen.

>> (possibly there are Registry entries that might need
>> tuning too --the IETF Charset procedures don't allow that at
>> present but, at you point out, they could, at least in
>> principle, be changed)

You are right that RFC 2978 doesn't 
mention any procedure for updating registrations. But that hasn't made 
updates impossible. Updates can be and have been made by analogy with 
new registrations (think reregistration). But they need more backing than 
"Web browsers do it this way, so that's what everybody else has to do, 
too". So the problem is not one of process, but one of "rough consensus 
and running code". And there's lots of running code on the IANA side, too.

Regards,   Martin.

>> rather than trying to retune the Internet
>> to match what a handful of browser vendors are doing.
> Both are hopeless efforts, and each group must maintain its own standards.

Received on Friday, 29 August 2014 06:33:24 UTC