[W3C] Best practices / charsets from eduardo.casais@areppim.com on 2008-10-23 (public-bpwg-ct@w3.org from October 2008)

From: <eduardo.casais@areppim.com>
Date: Thu, 23 Oct 2008 12:40:34 +0200
To: <public-bpwg-ct@w3.org>
Cc: <Tom.Hume@futureplatforms.com>, <fd@w3.org>
Message-ID: <000001c934fb$c88d5130$9aa2fea9@AREPPIM002>

Hello,

I was absent until the beginning of October and then had to
catch up with work; it also seems that my previous message
for some reason did not go through.

So here are my answers to the questions raised regarding
the remarks on character sets.

Sorry for the late information and the inconvenience.

Eduardo Casais

-------------

To the W3C Mobile Web BPWG.

These are answers to the points raised in messages about 
charsets (LC-2023)

I. UTF-8, etc.

Question:
What's in scope is to speak about character encoding 
mappings that might be carried out that would do harm, 
probably, though then again, does this imply that we 
should have a clause that says "don't map character 
encodings to something the device does not support". 
Seems exceptionally banal and how many other statements 
of that form would we need to include for the sake of 
consistency?

Answer:

It is important to consider both the flow of information
from server to client and vice-versa (e.g. forms).

Let us consider s, the charset of content returned 
by the server, and M, the set of charsets accepted
by the mobile client.

1. If s is included in M, there is no role for
transcoding. The charset is common to both elements
in the chain -- whether by design (the server did
examine the Accept-Charset of the client) or by
happenstance is irrelevant.

2. If s is not included in M, then we may have to map
from s to one of the m which constitute M. Let us
consider s and m.

2.a There is a bijective mapping between s and m. 
Transcoding will work. Although there might be some
legacy encodings that exhibit this property, in 
practice such bijective mappings take place between
the various Unicode UTF-* and UCS-4 charsets.

2.b All code points in s can be mapped to m, but not
vice-versa. In this case, the content can be presented
to the client, but input from the client (e.g. forms)
cannot be reliably transferred to the server.

2.c All code points in m can be mapped to s, but not
vice-versa. In this case, the content cannot be presented
reliably to the client, but input from the client
(e.g. forms) can be reliably transferred to the server.

2.d There is no complete mapping between s and m (though
there might be overlaps, since most charsets include ASCII
as a subset). Content cannot be transmitted reliably in 
either direction.

At this stage, only UTF-* and UCS-4 for both client 
and server allow a reliable transcoding. Case 2.d is 
basically hopeless.
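
The four cases above can be sketched by comparing the sets of
Unicode code points that two charsets are able to encode. This is
only an illustration, assuming Python's bundled codecs as a
stand-in for what a real server or handset supports; the names
repertoire() and classify() are hypothetical, and scanning every
code point is slow, so the limit parameter can restrict the scan.

```python
import codecs
from functools import lru_cache

SURROGATES = range(0xD800, 0xE000)  # not encodable characters on their own


@lru_cache(maxsize=None)
def repertoire(charset: str, limit: int = 0x110000) -> frozenset:
    """Code points below `limit` that `charset` can encode (per Python's codec)."""
    encode = codecs.getencoder(charset)
    points = set()
    for cp in range(limit):
        if cp in SURROGATES:
            continue
        try:
            encode(chr(cp))
        except UnicodeEncodeError:
            continue
        points.add(cp)
    return frozenset(points)


def classify(s: str, m: str, limit: int = 0x110000) -> str:
    """Return the case label (2.a to 2.d) for server charset s and client charset m."""
    rs, rm = repertoire(s, limit), repertoire(m, limit)
    if rs == rm:
        return "2.a"  # same repertoire: mapping works in both directions
    if rs < rm:
        return "2.b"  # everything in s maps to m, but not vice-versa
    if rs > rm:
        return "2.c"  # everything in m maps to s, but not vice-versa
    return "2.d"      # no complete mapping in either direction
```

Under these assumptions, classify("utf-8", "utf-16") falls in
case 2.a and classify("iso-8859-1", "iso-8859-5") in case 2.d,
since the two legacy charsets share little beyond ASCII.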

2.c.I. Let us now consider a transcoder that analyses
content and, if it determines that the characters actually
used in the content (encoded in s) can all be mapped to m,
performs the transformation and delivers the transcoded
content to the client. This 
solves case 2.c, since the response of the client will 
anyway be compatible (via backwards mapping) with what 
the server handles. In these cases, at least a complete
transaction (user requests content, fills in form, sends
back content) can be successful.
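
A minimal sketch of such a content-analysing transcoder, assuming
Python codecs and a hypothetical function name: it attempts a
strict conversion and refuses to transcode as soon as one
character of the actual content has no counterpart in m.

```python
from typing import Optional


def transcode_if_representable(body: bytes, s: str, m: str) -> Optional[bytes]:
    """Convert body from charset s to charset m, or return None if that would lose data."""
    text = body.decode(s)  # raises UnicodeDecodeError if body is not valid in s
    try:
        # Strict encoding: any character outside m aborts the conversion.
        return text.encode(m)
    except UnicodeEncodeError:
        return None
```

For instance, a UTF-8 page containing only "café" can be delivered
as ISO-8859-1, whereas a UTF-8 page containing Japanese text
yields None and must be handled otherwise (e.g. error message).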

2.b.I. No such guarantee exists in 2.b: since m is greater 
than s (in terms of code points), users might find themselves
in the curious situation where the first half of a
transaction works out (i.e. request content), but the
second part (send back data) does not. This will not
happen with 2.c.I: either the complete transaction
succeeds, or it fails (e.g. error message) right from 
its first phase.
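
The difference between 2.b.I and 2.c.I can be illustrated with a
small simulation of a complete transaction. This is a sketch
under assumptions: simulate_transaction() is a hypothetical name,
Python codecs stand in for the real charsets, and the user is
assumed to type only characters available in m.

```python
def simulate_transaction(page: bytes, user_text: str, s: str, m: str):
    """Return (presented, submitted) for a page in charset s and a client using m."""
    try:
        page.decode(s).encode(m)  # first phase: server content towards client
        presented = True
    except UnicodeError:
        presented = False

    submitted = False
    if presented:
        form_bytes = user_text.encode(m)  # what the client would send back
        try:
            form_bytes.decode(m).encode(s)  # second phase: form data towards server
            submitted = True
        except UnicodeError:
            submitted = False
    return presented, submitted
```

With s = ISO-8859-1 and m = UTF-8 (case 2.b), a page showing
"Genève" is presented correctly, but a user typing "Łódź" into
the form gets (True, False): the second half of the transaction
fails. With s = UTF-8 and m = ISO-8859-1 (case 2.c.I), the
outcome is either (True, True) or (False, False).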

Once again, s > m in practice whenever s is UTF-* or UCS-4
(although there might be cases where the relation holds for
other charsets). 

That explains why I propose to restrict safe transcodings
to UTF-* and UCS-4 (in this respect, LC-2023 was unclear: it
only talked about UTF-8). 

As for ASCII, it can be used as a quasi-universal encoding,
since most charsets contain ASCII: everything outside ASCII
is encoded as Unicode numeric character references. In
this case, a transcoder can fall back on a situation similar
to 2.c.I (provided it thoroughly examines numeric character
references and takes into account the issues of private code
spaces used for whatever proprietary purposes -- e.g. 
pictograms). If the server actually uses ASCII in this
way, then it may well be ready to accept input containing
numeric references as well (introduced by the transcoder
or even by browsers).
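
As an illustration of ASCII with numeric character references,
here is a short sketch using the Python standard library. The
function names are hypothetical. Note that html.unescape() also
resolves named entities such as &amp;amp;, so it is not a strict
inverse when applied to full markup, and that the private-use
check only looks at the Unicode general category "Co".

```python
import html
import unicodedata


def to_ascii_with_ncr(text: str) -> bytes:
    """Encode as ASCII, replacing other characters by decimal references (&#NNN;)."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def from_ascii_with_ncr(data: bytes) -> str:
    """Decode ASCII bytes and resolve character references back to characters."""
    return html.unescape(data.decode("ascii"))


def has_private_use(text: str) -> bool:
    """True if the text contains private-use code points (e.g. proprietary pictograms)."""
    return any(unicodedata.category(ch) == "Co" for ch in text)
```

For example, to_ascii_with_ncr("café €") gives b"caf&#233; &#8364;",
and from_ascii_with_ncr() applied to that result restores the
original string.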

II. Client capabilities.

Question:

Between the CT-proxy and the end-user, the text we 
already have in 4.3.6.1 summarizes the point "make 
sure that if you transform, you create content that 
can be rendered by the end-user's device".

Answer:

This only considers Web browsing as a unidirectional
flow of information -- but there are forms. Hence, one
must consider both transcoding directions, and this is
no longer an issue of presentation -- it is an issue
of input to back-end systems (e.g. databases). This is
all the more difficult because there are few mechanisms
for servers to advertise what charsets they really are
able to handle, and those that exist (such as "accept-charset"
in forms) and their interactions with other parts of the
transcoding process are not elaborated upon in the 
guidelines.

This point is actually acknowledged later as follows:

This would also address the "bijective" part of the 
comment: do not transform the content encoding of a 
page that contains a form, unless you can ensure that 
you can convert the user's input back to the content 
encoding expected by the server.
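
One conservative way a proxy might apply this rule is sketched
below. It is an assumption, not something the guidelines specify:
the "accept-charset" attribute of each form is treated as the
server's declaration, and the page is transcoded only if every
form declares the client charset. The names FormCharsetCollector,
canonical() and may_transcode() are hypothetical.

```python
import codecs
from html.parser import HTMLParser


class FormCharsetCollector(HTMLParser):
    """Collect the accept-charset declaration of every <form> in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []  # one list of charset names per form

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            value = dict(attrs).get("accept-charset") or ""
            # HTML 4 allows space- and/or comma-separated lists.
            self.forms.append(value.replace(",", " ").split())


def canonical(name: str):
    """Normalise charset aliases; None if Python does not know the charset."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def may_transcode(page_html: str, client_charset: str) -> bool:
    """True if the page has no form, or every form declares client_charset."""
    collector = FormCharsetCollector()
    collector.feed(page_html)
    target = canonical(client_charset)
    return target is not None and all(
        target in {canonical(n) for n in declared}
        for declared in collector.forms
    )
```

A form without any accept-charset attribute makes may_transcode()
return False, which errs on the side of leaving the page alone.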



E. Casais
Received on Thursday, 23 October 2008 11:28:18 UTC