- From: <eduardo.casais@areppim.com>
- Date: Thu, 23 Oct 2008 12:40:34 +0200
- To: <public-bpwg-ct@w3.org>
- Cc: <Tom.Hume@futureplatforms.com>, <fd@w3.org>
Hello,

I was absent until the beginning of October, then had to catch up with work, and it seems that my previous message for some reason did not go through. So here are my answers to the questions raised regarding the remarks on character sets. Apologies for the late reply and the inconvenience.

Eduardo Casais

-------------

To the W3C Mobile Web BPWG.

These are answers to the points raised in messages about charsets (LC-2023).

I. UTF-8, etc.

Question: What is in scope is to speak about character encoding mappings that might be carried out that would do harm; but then again, does this imply that we should have a clause that says "don't map character encodings to something the device does not support"? That seems exceptionally banal, and how many other statements of that form would we need to include for the sake of consistency?

Answer: It is important to consider both the flow of information from server to client and vice versa (e.g. forms). Let s be the charset of content returned by the server, and M the set of charsets accepted by the mobile client.

1. If s is included in M, there is no role for transcoding. The charset is common to both elements in the chain -- whether by design (the server examined the Accept-Charset of the client) or by happenstance is irrelevant.

2. If s is not included in M, then we may have to map from s to one of the charsets m which constitute M. Let us consider s and m.

2.a There is a bijective mapping between s and m. Transcoding will work. Although some legacy encodings may exhibit this property, in practice such bijective mappings exist between the various Unicode UTF-* and UCS-4 charsets.

2.b All code points in s can be mapped to m, but not vice versa. In this case, the content can be presented to the client, but input from the client (e.g. forms) cannot be reliably transferred to the server.

2.c All code points in m can be mapped to s, but not vice versa.
In this case, the content cannot be presented reliably to the client, but input from the client (e.g. forms) can be reliably transferred to the server.

2.d There is no complete mapping between s and m (though there might be overlaps, since most charsets include ASCII as a subset). Content cannot be transmitted reliably in either direction.

At this stage, only UTF-* and UCS-4 for both client and server allow reliable transcoding. Case 2.d is basically hopeless.

2.c.I Let us now consider a transcoder that analyses content and, if it determines that the effective charset used in s is actually mappable to m, performs the transformation and delivers the transcoded content to the client. This solves case 2.c, since the response of the client will in any case be compatible (via the backwards mapping) with what the server handles. In these cases, at least a complete transaction (user requests content, fills in a form, sends back content) can be successful.

2.b.I This approach cannot work in case 2.b: since m is greater than s (in terms of code points), users might find themselves in the curious situation where the first half of a transaction works (i.e. requesting content), but the second part (sending back data) does not. This will not happen with 2.c.I: either the complete transaction succeeds, or it fails (e.g. with an error message) right from its first phase.

Once again, s > m in practice whenever s is UTF-* or UCS-4 (although there might be cases where the relation holds for other charsets). That explains why I propose to restrict safe transcodings to UTF-* and UCS-4 (in this sense, LC-2023 was unclear: it only talked about UTF-8).

As for ASCII, it can be used as a quasi-universal encoding, since most charsets contain ASCII; everything outside ASCII is then encoded as Unicode numeric character references.
In this case, a transcoder can fall back on a situation similar to 2.c.I (provided it thoroughly examines numeric character entities and takes into account the issue of private code spaces used for proprietary purposes -- e.g. pictograms). If the server actually uses ASCII in this way, then it may well be ready to accept input containing numeric references as well (introduced by the transcoder or even by browsers).

II. Client capabilities.

Question: Between the CT-proxy and the end-user, the text we already have in 4.3.6.1 summarizes the point: "make sure that if you transform, you create content that can be rendered by the end-user's device".

Answer: This only considers Web browsing as a unidirectional flow of information -- but there are forms. Hence, one must consider both transcoding directions, and this is no longer an issue of presentation: it is an issue of input to back-end systems (e.g. databases). This is all the more difficult in that there are few mechanisms for servers to advertise which charsets they are really able to handle, and those that exist (such as "accept-charset" in forms) and their interactions with other parts of the transcoding process are not elaborated upon in the guidelines.

This point is actually acknowledged later as follows: this would also address the "bijective" part of the comment: do not transform the content encoding of a page that contains a form, unless you can ensure that you can convert the user's input back to the content encoding expected by the server.

E. Casais
Received on Thursday, 23 October 2008 11:28:18 UTC