Re: [W3C] Best practices / charsets / fuzzy

More comments below.
> Hello,
> I find your (i.e. W3C group's) proposal to handle this 
> issue a bit problematic.
>> We resolved to add "character encoding" in the list of examples of 
>> alterations a Content Transformation proxy should do with 
>> much care in section 
> 	"A proxy should strive for the best possible 
> 	user experience that the user agent supports."
> This is vague ("should strive" -- why not "must strive"?
> Are there any valid reasons not to strive?), "user experience" 
> is undefined, no way of measuring it is provided (even a 
> binary one distinguishing between acceptable and unacceptable
> user experiences). When asking the person responsible for a 
> CT deployment "Does your proxy strive for the best user 
> experience?", do you expect any answer other than "Of course"?

It is vague; user experience is inherently vague.
Can you think of a better proposal?
I agree that the use of a normative statement here is clumsy, at best. 
The goal is to have some kind of recognition that one does not expect, 
say, a Content Transformation proxy to split pages into 1 KB chunks when 
the user agent is a high-end, smartphone-like mobile device, as that 
would degrade the user experience.

> 	"It should only alter the format, layout, 
> 	dimensions etc. to match the specific 
> 	capabilities of the user agent."
> Two problems here:
> 1. What is, and what is not hidden behind the "etc"? What
> about these various elements, none of which is explicitly
> mentioned:
> 	a) the behaviour induced by scripts
> 	b) the page size
> 	c) the character encoding
> 	d) the behaviour induced by keypad events and keypad
> 	   assignments (e.g. access keys)
> Not mentioning these elements means that they are not
> formally covered by the guidelines, and hence negligible.

Yes, I think we could probably rephrase that sentence along the lines of:
  "It should only restructure content to match the specific capabilities 
of the user agent"
... and list examples afterwards to make it clear that the normative 
statement covers all restructuring operations.

> 2. Matching the capabilities of the user agent is necessary,
> but insufficient. As I highlighted with the example of forms,
> the capabilities of the application server must be taken into
> account as well -- we are talking about end-to-end systems
> after all. Because of the way (X)HTML and WML handle the
> encoding of form data sent back to the server, changing
> the encoding of the document has an influence on what the
> server experiences -- since the document encoding is second
> in line as the applicable encoding to form data (after 
> explicit statements in the "accept-charset" attribute of
> the form element). The guidelines only consider the client,
> to the exclusion of the server.

We had text to that effect in a previous draft version of the document:

"Proxies should not alter HTTP requests unless: [...]
   2. an unaltered request body is not consistent with the origin 
server's requirements in respect of Internet content type or character 
encoding (as may happen, for example, if the proxy has transformed an 
HTML form that results in this request);"

The text was adjusted, and is now to be found in section 4.1.5:
"the request is part of a sequence of requests to the same Web site and 
either it is technically infeasible not to adjust the request because of 
earlier interaction, or because doing so preserves consistency of user 
experience"

Part of the reason for rewording this was that it was felt it was 
obvious that the request body received by a Content Provider must be the 
one it expects.
To hopefully clarify this: we do not envision a case where the 
transformation proxy transforms the character encoding of data submitted 
by an end user other than to rollback a previous change of character 
encoding in the page that contains the form. If there's any ambiguity 
here, then we should make it clear. I guess this discussion already 
proves that the text is not clear enough ;-)

When I wrote "servers must not rely on the data they receive" in my 
previous email, I did not mean to say "servers should not expect data to 
match the character encoding they expect" but rather "servers must check 
that the data they receive is valid and consistent".
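
The rollback described here could be sketched roughly as follows; the 
function name and signature are our own invention, and a real proxy 
would also have to undo the percent-encoding of an 
application/x-www-form-urlencoded body first, which this sketch skips:

```python
# Hypothetical sketch of the "rollback" described above: if the proxy
# served a page transcoded from ISO-8859-1 to UTF-8, the browser will
# submit form data in UTF-8 (absent an accept-charset attribute on the
# form), and the proxy should re-encode the body back to ISO-8859-1
# before forwarding it to the origin server.

def rollback_form_body(body: bytes, served_charset: str,
                       original_charset: str) -> bytes:
    if served_charset.lower() == original_charset.lower():
        return body  # nothing was transcoded, nothing to roll back
    text = body.decode(served_charset)
    # May raise UnicodeEncodeError: not every UTF-8 string maps back to
    # ISO-8859-1 -- exactly the lossy-mapping problem discussed here.
    return text.encode(original_charset)

print(rollback_form_body("café".encode("utf-8"), "utf-8", "iso-8859-1"))
# → b'caf\xe9'
```

The UnicodeEncodeError case is precisely where "servers must check that 
the data they receive is valid and consistent" comes into play.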

>> This addresses the case when a page is served using an encoding that 
>> is compatible with the user agent. The Content Transformation proxy 
>> should not switch the encoding to something else in that case, 
>> especially since, as you point out, it is unlikely that the mapping 
>> works for all characters. That's what the guideline says.
> Not quite. The CTG does not prevent the transcoder
> from changing the encoding to another one that is also
> supported by the client. Notice that different
> charsets may have different q-values, and hence
> a change of encoding, even if it is supported, may 
> entail a degradation of the "user experience" -- 
> but this cannot be established, since increases 
> or decreases of user experience remain undefined 
> and hence unmeasurable.

Again, is there any way to define and measure "user experience"?

For instance, regarding q-values, my browser sends in HTTP requests:
  Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

That's supposed to be a "user preference" according to the HTTP RFC.

As a user, I just don't see the difference between a page served in 
ISO-8859-1 or in utf-8. There is no relationship whatsoever with my user 
experience. I don't think it's supposed to represent a measure of the 
lack of support for certain characters either, although it may be the 
case in practice.
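
For reference, the q-value machinery boils down to something like this 
simplified parser; `parse_accept_charset` is our own name, and real 
parsing per RFC 2616 has more corner cases (quoted strings, malformed 
q-values) than this sketch handles:

```python
# Minimal sketch: parse an Accept-Charset header into (charset, q)
# pairs, sorted by stated preference. The header value is the one
# from the example above.

def parse_accept_charset(value):
    prefs = []
    for item in value.split(","):
        parts = [p.strip() for p in item.split(";")]
        charset, q = parts[0], 1.0  # q defaults to 1 when absent
        for p in parts[1:]:
            if p.startswith("q="):
                q = float(p[2:])
        prefs.append((charset, q))
    # Higher q-values indicate stronger (stated) preference.
    return sorted(prefs, key=lambda cq: cq[1], reverse=True)

print(parse_accept_charset("ISO-8859-1,utf-8;q=0.7,*;q=0.7"))
# → [('ISO-8859-1', 1.0), ('utf-8', 0.7), ('*', 0.7)]
```

As the sorted output shows, the header ranks ISO-8859-1 above utf-8, 
whatever that is supposed to mean for the user in practice.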

I'm trying to find examples of cases where changing the character 
encoding supported by a client to another one also supported may be 
needed or at least useful. Let me try one:
  Let's suppose I'm in Russia with a Russian device. It supports 
ISO-8859-1, ISO-8859-3, and UTF-8.
  A transcoding proxy detects that a page whose size is 1MB needs to be 
paginated. The page uses the ISO-8859-1 encoding. The proxy paginates 
the page, and adds a "next page" link. "Next page" cannot be written in 
Cyrillic in ISO-8859-1. ISO-8859-3 could be used if the page only 
contains characters common to both ISO-8859-1 and ISO-8859-3; UTF-8 
could be used in any case.

Should the guidelines forbid the change of encoding on the grounds that 
the original page encoding was already supported by the device?
Again, I agree it can't work in cases where a form is involved, because 
there is obviously no mapping from UTF-8 back to ISO-8859-1 for all 
characters.
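
A quick sanity check of that scenario (the label text is a made-up 
Russian "next page" string, not from any actual proxy): Cyrillic cannot 
be represented in either of the Latin charsets, only in UTF-8:

```python
# Check which of the charsets from the scenario above can encode a
# Cyrillic "next page" label. ISO-8859-1 (Western European) and
# ISO-8859-3 (South European) have no Cyrillic characters; UTF-8
# covers all of Unicode.
label = "Вперёд"  # hypothetical Russian label for a "next page" link

for codec in ("iso-8859-1", "iso-8859-3", "utf-8"):
    try:
        label.encode(codec)
        print(codec, "ok")
    except UnicodeEncodeError:
        print(codec, "cannot represent the label")
```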

>> That the mapping may not work in all cases and/or in 
>> both directions is not a real problem IMO. 
> This is where we disagree -- effectively we disagree
> on the two-way scenario. I argue that form-filling
> scenarios encompass enough applications where faithful 
> transmission of user input is indispensable either
> because of security (on-line orders, banking, e-mail) 
> or usability reasons (directory queries, timetables),
> that one must provide precise guidelines to at least 
> ensure a minimum of consistency. 
> The W3C position can be reduced to the statement 
> that actually no practice need be enforced, since
> users know what is taking place. That is a position 
> -- but it must be made explicit as such and 
> substantiated in the CTG.

I don't think that's the position of the Task Force.
The position is more that *any* re-structuring operation must be carried 
out with care because it may break the user experience rather than 
improving it, or lead to broken scenarios such as the ones you mention. 
We acknowledge that, but do not think that the scope of the guidelines 
is to define what type of re-structuring may be done and what type of 
re-structuring should not be done, but rather to define a few control 
mechanisms for the content providers and the end users (and we're 
limited to existing technology in order to do that).

>> The Content Transformation proxy may add a warning at the top of the 
>> page along the lines of "Some characters can't be displayed on your 
>> phone" or "The form below is unlikely to work". It would not work 
>> reliably. But working a bit may be considered better than not working 
>> at all.
> "may" -- meaning there is no obligation to inform the 
> end user about the consequences of transcoding. The CTG
> provides for informing about a choice of representations,
> not about the consequences thereof.
>> In any case, this is out of scope of these guidelines: we are not 
>> trying to define the nature of the restructuring operations that may 
>> occur, but rather to define a few mechanisms by which content 
>> providers, content transformation proxies and end users may 
>> communicate with each other.
> This is probably impossible without a specific, new protocol
> (on top of HTTP) -- and in fact you have a half-baked one already
> with the x-device-* HTTP fields. Attempts to match existing HTTP
> protocol entities to CTG needs never quite work out (no-transform
> does not work for WML; vary does not work for sites that produce
> lcd content; via does not work since it is optional and even its
> content need not be directly recognizable by servers). To make
> it really work, you must either stipulate strong, unambiguous 
> heuristics to complement the protocol elements (e.g. DOCTYPE, 
> MIME types, domain names) or eventually do what was done with, 
> notably, uaprof -- i.e. define an own, self-contained protocol 
> with the necessary features.

Does "lcd" stand for "Least Common Denominator"? I'm not sure I get this 
point.

We wouldn't mind adding a few things to HTTP, but we're not chartered to 
create new technology, I'm afraid.
We already list a few ideas in the "Scope for future Work" appendix:

POWDER and the reintroduction of the Link HTTP header are good 
candidates to add semantics for content providers and content 
transformation proxies to communicate.

In the meantime, "Cache-Control: no-transform" is indeed the only 
reliable method we could find. It may not work for WML (although do WAP 
gateways actually respect this directive in practice?), but it is far 
more reliable than any other heuristics we could think of, because it 
carries the expectations of the content provider: do not transform.
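
As a sketch, honouring that directive amounts to a trivial check on the 
response headers; the function name is ours, and real Cache-Control 
parsing has more corner cases than this:

```python
# Minimal sketch of how a transformation proxy might honour
# "Cache-Control: no-transform" on a response before restructuring it.

def may_transform(headers: dict) -> bool:
    cache_control = headers.get("Cache-Control", "")
    directives = [d.strip().lower() for d in cache_control.split(",")]
    return "no-transform" not in directives

print(may_transform({"Cache-Control": "no-transform"}))  # → False
print(may_transform({"Cache-Control": "max-age=3600"}))  # → True
```

The check is simple precisely because the directive is an explicit 
statement of the content provider's expectations, rather than a 
heuristic to be guessed at.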

>> See: where this 
>> was discussed.
> These minutes were a bit confusing -- they discussed another 
> topic with charsets at the same time. Anyway, thanks for the 
> link!
>> I wrote that but I kind of disagree with myself here ;)
>> For the same reason as above, I don't think we should be that 
>> restrictive.
> That may be so, but I hope to have made the rationale
> behind these restrictions clear. On the other hand, the
> CTG specifies no restrictions whatsoever regarding encodings
> at this point -- "do for the best and be careful" cannot
> be construed as a guideline, which by definition must 
> restrict the set of possible behaviours to a more or 
> less large, but identifiable, class of what is acceptable -- 
> and that is where "best practices" would come into play.

Yes. Note that I wouldn't mind putting a special emphasis on the dangers 
of playing with character encodings. I just think that the list of 
"beware" notes could grow quite easily (tidying markup, messing with 
scripts, restructuring CSS style sheets, finding breaks for 
pagination, ...).

> Note: Is it really necessary to address the e-mail
> personally to you, or does everybody receive the
> messages through the mailing-list?

It is by no means compulsory. Everybody receives the message through the 
mailing-list. Addressing it personally makes it easier to see who was 
involved in a discussion when browsing the archives; and personally, 
messages that are addressed to me directly show up in my Inbox, whereas 
emails sent to mailing-lists go to specific folders, which makes it 
easier to notice that someone replied to a conversation I'm involved in.


Received on Wednesday, 29 October 2008 18:12:54 UTC