[W3C] Best practices / charsets / fuzzy 2 from eduardo.casais@areppim.com on 2008-10-29 (public-bpwg-ct@w3.org from October 2008)

From: <eduardo.casais@areppim.com>
Date: Wed, 29 Oct 2008 23:21:32 +0100
To: "'Francois Daoust'" <fd@w3.org>
Cc: <public-bpwg-ct@w3.org>, <Tom.Hume@futureplatforms.com>
Message-ID: <000001c93a14$b2d43b60$9aa2fea9@AREPPIM002>
Let us cut a lot of cited text...

> It is vague. User experience is vague in essence.
> Can you think of a better proposal?

One difficulty I see is that two aspects are mixed in
this notion of "user experience":

1. Ergonomics -- i.e. the fact that the choice of fonts
and colours, the placement of links, the size of images,
the length of pages, etc, makes navigation, reading and
typing hassle-free. This is essential, but I am not a
usability specialist, though I know one can quantify
various aspects of usability, and there are even best
practices as to these various aspects.

2. Feature support -- i.e. the fact that the character
sets, the colour space, the dimensions of images, 
the weight of pages, etc, are suitable for the terminal.
This is much more tractable as to formalization.

> I agree that the use of a normative statement here is clumsy, at best.
> The goal is to have some kind of recognition that one does not expect
> e.g. a Content-Transformation proxy to split pages in 1KB chunks when
> the user agent is a high end smartphone-like mobile device, which would
> affect user experience.

This is primarily about technology, so these issues
are largely formalizable. Here is a sketch:

Let us consider the terminal capabilities (termcap), as 
defined by:
a) the information sent in HTTP accept-*, user-agent and
related fields (mainly for IE devices);
b) the information contained in an attached user agent profile,
when published by the device via x-wap-profile;
c) additional information from the transcoder operator, 
representable in a schema compatible with (b), when
available.
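
To make the three sources concrete, here is a rough sketch of how
such a termcap could be assembled. The function names, the dictionary
layout and the precedence between sources (later ones override
earlier ones) are my own assumptions for illustration; the q-value
parsing is simplified and ignores the special cases of RFC 2616.

```python
def parse_q_list(header_value):
    """Parse e.g. 'ISO-8859-1,utf-8;q=0.7,*;q=0.7' into {token: q}."""
    result = {}
    for item in header_value.split(","):
        parts = [p.strip() for p in item.split(";")]
        token = parts[0].lower()
        if not token:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: a malformed weight means "not acceptable"
        result[token] = q
    return result


def build_termcap(headers, profile=None, operator_data=None):
    """Merge sources (a), (b) and (c) into one dictionary.

    Assumption: later sources override earlier ones.
    """
    termcap = {}
    if "Accept-Charset" in headers:
        termcap["charset"] = parse_q_list(headers["Accept-Charset"])
    if "Accept" in headers:
        termcap["content_type"] = parse_q_list(headers["Accept"])
    termcap.update(profile or {})        # (b) user agent profile, already parsed
    termcap.update(operator_data or {})  # (c) operator-supplied additions
    return termcap
```

Parsing the profile document itself is left out; the sketch assumes
it has already been turned into attribute/value pairs.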

In order to maximize user experience when transforming 
content, transcoders make sure that the characteristics 
of the output sent to the device, described in the same
attribute space as termcap, respect the following properties:

1. If the characteristic maps to a set attribute in the 
termcap, then the value of the characteristic corresponds
to one element of the set (e.g. document type). If it 
maps to a mono-valued attribute, then its value must be
equal to the value of the termcap attribute (e.g. 
colour capability).

2. If a q-value is attached to a termcap attribute value,
then the characteristic value corresponds to one of the
termcap values with the highest q-value (e.g. charset).

3. If the termcap attribute is a set of ordinal values,
then the characteristic corresponds to one of these 
values, preferably to the highest one (e.g. versions
of a content type -- WML 1.1, 1.2, 1.3).

4. If the termcap attribute is a numeric value, then the
output that minimizes the difference (termcap.value -
characteristic.value), under the constraint >= 0, is
selected (e.g. decksize, pixel depth). This applies
by analogy to composite values (e.g. screen size).

During the process, all values under consideration are
converted to their canonical form to ensure consistent
comparisons.
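
The four properties above can be sketched as small selection
functions. This is only one possible reading: the function names are
hypothetical, the canonical form is assumed to be trimmed lower-case
text, ordinal values are assumed to be comparable (here, version
tuples), and wildcard entries such as "*" are not handled.

```python
def canonical(value):
    """Canonical form for comparisons (assumption: trimmed lower-case text)."""
    return value.strip().lower() if isinstance(value, str) else value


def pick_from_set(supported, candidates):
    """Rule 1: the output value must be an element of the termcap set."""
    allowed = {canonical(v) for v in supported}
    return [c for c in candidates if canonical(c) in allowed]


def pick_by_q(weighted, candidates):
    """Rule 2: keep the candidates carrying the highest q-value."""
    scores = {canonical(k): q for k, q in weighted.items()}
    usable = [c for c in candidates if scores.get(canonical(c), 0.0) > 0.0]
    if not usable:
        return []
    best = max(scores[canonical(c)] for c in usable)
    return [c for c in usable if scores[canonical(c)] == best]


def pick_ordinal(supported, candidates):
    """Rule 3: prefer the highest supported ordinal value."""
    usable = [c for c in candidates if c in supported]
    return max(usable) if usable else None


def pick_numeric(limit, candidates):
    """Rule 4: minimize (limit - value) subject to limit - value >= 0."""
    usable = [c for c in candidates if limit - c >= 0]
    return max(usable) if usable else None
```

For instance, pick_numeric(60000, [10000, 58000, 512000]) keeps the
58000 candidate, the largest one that still fits the limit.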

Certainly not a final statement, and it does not encompass
the entire user experience (it cannot) -- but at least
this would go some way towards formalizing what "striving
for the best possible user experience that the user agent
supports" means. It takes care of some basic consistency
requirements as well. It is also reasonably testable by 
independent parties. And it nicely ties in with other
well-established standards.

As a result, a transcoder will select XHTML mobile
profile over WML, XHTML 1.2 over 1.1 or 1.0, with 
tables instead of lines with hard line breaks, a
page size as close as possible to 60000 instead of a too small 10000
or a too large 512000, 16bit colour pictures instead of
b&w (which will neither be 96x48 nor 240x320 if the 
screen is 176x220), and it will be encoded in KOI8-R 
instead of UTF-8 if that is what the terminal prefers.
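
For a composite value such as the screen size, one possible reading
of "by analogy" is to require every component to fit and then keep
the candidate covering the largest area. Both the function name and
the area criterion are assumptions made for this sketch.

```python
def pick_screen_fit(screen, candidates):
    """Return the (width, height) candidate that fits the screen
    in both dimensions and covers the largest area, or None."""
    width, height = screen
    fitting = [(w, h) for (w, h) in candidates if w <= width and h <= height]
    if not fitting:
        return None
    return max(fitting, key=lambda size: size[0] * size[1])
```

With a 176x220 screen and candidate pictures of 96x48, 160x200 and
240x320, this keeps 160x200: the first is needlessly small and the
last does not fit.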

> To hopefully clarify this: we do not envision a case where the
> transformation proxy transforms the character encoding of data submitted
> by an end user other than to rollback a previous change of character
> encoding in the page that contains the form. If there's any ambiguity
> here, then we should make it clear. I guess this discussion already
> proves that the text is not clear enough ;-)

A shame that the older version was eliminated, since the 
text was actually quite good and addressed my concern 
directly!

> For instance, regarding q-values, my browser sends in HTTP requests:
>   Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
> 
> That's supposed to be a "user preference" according to the HTTP RFC:
>   http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.2
> 
> As a user, I just don't see the difference between a page served in
> ISO-8859-1 or in utf-8. There is no relationship whatsoever with my user
> experience. I don't think it's supposed to represent a measure of the
> lack of support for certain characters either, although it may be the
> case in practice.

There is indeed a reason why iso-8859-1 is considered
"better" than utf-8 in your handset (and in many others).

iso-8859-1 is a single-byte encoding; it is therefore extremely
efficient to decode -- in fact there is no decoding at all: each byte
is an index into the symbol table, since the 256 code points of
iso-8859-1 map to the first 256 code points of Unicode. utf-8, by
contrast, requires the multi-byte decoding machinery to launch and
convert sequences of bytes into an index before accessing the symbol
in the Unicode table. There is therefore a performance issue
which impacts user experience, although for the difference to 
become perceptible, one would probably need very long texts 
with lots of accented letters and lots of special iso-8859-1 
symbols.
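
To illustrate the difference in decoding work, here is a minimal
sketch. It is not how any particular handset does it: the UTF-8
decoder assumes well-formed input and performs no validation, and
both function names are made up for the example.

```python
def decode_latin1(data):
    """iso-8859-1: each byte already is the Unicode code point."""
    return [b for b in data]


def decode_utf8(data):
    """Minimal UTF-8 decoder (assumes well-formed input, no validation)."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:        # 1-byte sequence (ASCII)
            cp, extra = first, 0
        elif first < 0xE0:      # lead byte of a 2-byte sequence
            cp, extra = first & 0x1F, 1
        elif first < 0xF0:      # lead byte of a 3-byte sequence
            cp, extra = first & 0x0F, 2
        else:                   # lead byte of a 4-byte sequence
            cp, extra = first & 0x07, 3
        for b in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (b & 0x3F)  # 6 payload bits per continuation byte
        code_points.append(cp)
        i += 1 + extra
    return code_points
```

For pure ASCII text both loops do the same amount of work per byte;
the extra branching and bit shuffling only kicks in for accented
letters and other symbols outside ASCII.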

> I'm trying to find examples of cases where changing the character 
> encoding supported by a client to another one also supported may be 
> needed or at least useful. Let me try one:
>   Let's suppose I'm in Russia with a Russian device. It supports
> ISO-8859-1, ISO-8859-3, and UTF-8.
>   A transcoding proxy detects that a page whose size is 1MB needs to be
> paginated. The page uses the ISO-8859-1 encoding. The proxy paginates
> the page, and adds a "next page" link. "Next page" cannot be written in
> Cyrillic in ISO-8859-1. ISO-8859-3 could be used if the page only
> contains characters common to both ISO-8859-1 and ISO-8859-3, UTF-8
> otherwise.

A straightforward way is to insert "next page" in Russian
as numeric character references and keep everything as
iso-8859-1. No analysis of the actual character set 
and conversions needed. Much faster. Slightly bulkier
(about 49 bytes instead of 9).
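
As a sketch of this approach, Python's standard "xmlcharrefreplace"
error handler does exactly this kind of substitution. The function
name and the Russian label used below are placeholders of mine; the
resulting size obviously depends on the wording chosen.

```python
def as_latin1_with_references(text):
    """Encode text as iso-8859-1, writing every character outside
    that character set as a decimal numeric character reference."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# Hypothetical five-letter Cyrillic link label, written with escapes.
label = "\u0414\u0430\u043b\u0435\u0435"
encoded = as_latin1_with_references(label)
# encoded == b"&#1044;&#1072;&#1083;&#1077;&#1077;"  (7 bytes per letter)
```

Characters that do exist in iso-8859-1, such as accented Latin
letters, pass through as single bytes, so the rest of the page is
left untouched.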

> Should the guidelines forbid the change of encoding on the grounds that
> the original page encoding was already supported by the device? Again,
> I agree it can't work in cases where a form is involved, because
> there is obviously no mapping from UTF-8 back to ISO-8859-1 for all
> characters.

I think this is the point: the changes must take into
account the end-to-end context. Forms are the most 
direct example where this is unavoidable.
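
The form case can be sketched as follows: a proxy that re-encoded a
page must convert the submitted data back to the charset the origin
server expects, and that conversion may simply not exist. The
function name and the choice to return None on failure are
assumptions for the example.

```python
def roll_back_form_data(submitted_bytes, proxy_charset, origin_charset):
    """Re-encode submitted form data for the origin server.

    Returns the re-encoded bytes, or None when some character has no
    representation in the origin charset.
    """
    text = submitted_bytes.decode(proxy_charset)
    try:
        return text.encode(origin_charset)
    except UnicodeEncodeError:
        return None
```

Text with accented Latin letters typed into a UTF-8 form converts
back to iso-8859-1 without loss; Cyrillic input does not, and the
proxy is left with no faithful value to send to the server.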

> I don't think that's the position of the Task Force.
> The position is more that *any* re-structuring operation must be carried
> out with care because it may break the user experience rather than
> improving it, or lead to broken scenarios such as the ones you mention.
> We acknowledge that, but do not think that the scope of the guidelines
> is to define what type of re-structuring may be done and what type of
> re-structuring should not be done, but rather to define a few control
> mechanisms for the content providers and the end users (and we're
> limited to existing technology in order to do that).

I am a bit puzzled by the fact that on the one side one 
acknowledges that transformations may break applications 
and calls upon transcoder deployers to perform them 
carefully, and on the other side one refuses to state
what the worst or best practices are in this respect, or
even to list the potentially troublesome consequences of
doing so.

> Does "lcd" stand for "Least Common Denominator"? I'm not sure I get this
> part.

Yes, sorry for the abbreviation. The reason is that there
are sites that will produce standard pages in one format,
one encoding, one language, no matter what user agent
from whatever IP address is thrown at them. Hence, no
need for a vary HTTP field.

> In the meantime, "Cache-Control: no-transform" is indeed the only
> reliable method we could find. It may not work for WML (although do WAP
> gateways actually respect this directive in practice?),

Well,

a) Shouldn't the CTG group have investigated the issue
and found the final answer to your question already?
All the more so since representatives of an operator that
has probably tested every WAP gateway for its own needs
are present on the editorial committee. If you know this
is an issue, why leave it hanging?

b) I believe some do, because at the introduction of
WAP 2.0, support for WBXML was to be dropped in some
terminals. These would therefore handle WML with a
normal XML parser (just like XHTML), and render it 
directly instead of launching a special interpreter for 
WBXML. This means sending textual WML content to the 
device, and hence WAP gateways had to refrain from 
performing the wml-to-wbxml encoding and recognize 
no-transform directives (gateways have always been
ready to accept either wml or wmlc/wbxml from servers).

> but it is far 
> more reliable than any other heuristics we could think of, because it 
> carries the expectations of the content provider: do not transform.

I wish transcoders deployed in the field respected
the directive...
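
Honouring it is not much work either. A minimal sketch of the check,
with a made-up function name, assuming the header value contains no
quoted strings with embedded commas:

```python
def forbids_transformation(cache_control):
    """True if a Cache-Control header value carries no-transform."""
    directives = (d.strip().lower() for d in cache_control.split(","))
    # Compare directive names only, ignoring any "=value" part.
    return any(d.split("=", 1)[0].strip() == "no-transform" for d in directives)
```

A proxy would call this on the response header (and on the request
header, where the directive is also allowed) before touching the
body.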

> Yes. Note I wouldn't mind putting a special emphasis on the dangers of
> playing with character encodings. I just think that the list of "beware"
> notes could grow quite easily (tidying markup, messing with scripts,
> restructuring of CSS style sheets, finding breaks for pagination, ...)

I do not see any problem with that; after all, shouldn't the
guidelines make it clear what is at stake? At least RFCs always
include a chapter "security considerations", and often others
such as "compatibility" (see RFC2616), or even chapters 
describing the rationale for doing or not doing certain things 
(see RFC3023). The CTG would benefit from a similar approach.

Cheers

E.Casais
Received on Thursday, 30 October 2008 07:12:48 UTC