Re: URIs in HTML5 and issues arising from Robert J Burns on 2008-06-29 (public-html@w3.org from June 2008)

From: Robert J Burns <rob@robburns.com>
Date: Sun, 29 Jun 2008 21:56:17 +0300
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Philip Taylor <pjt47@cam.ac.uk>, Ian Hickson <ian@hixie.ch>, HTML WG <public-html@w3.org>
Message-Id: <299282EF-DA91-4DA4-83E2-7833BD47567B@robburns.com>

Hi Julian,

On Jun 29, 2008, at 3:11 PM, Julian Reschke wrote:

>
> Philip Taylor wrote:
>> Ian Hickson wrote:
>>>> According to
>>>> <http://lists.w3.org/Archives/Public/public-html/2008Jun/0358.html>,
>>>> Safari 3 uses question marks.
>>>
>>> According to:
>>>
>>>   http://hixie.ch/tests/adhoc/uri/encoding/017.html
>>>
>>> Safari trunk uses &-escaping.
>> That says "Query component: raw question mark" in Safari 3.1.2.
>> It says "Query component: %-escaped ASCII &#9786;" in nightly  
>> WebKit r34603.
>> Looks like it changed in
>> https://bugs.webkit.org/show_bug.cgi?id=15119
>
> Interesting.
>
> I'm not sure why the WebKit guys think this is any better than what
> FF does... Which, quoting
> <https://bugs.webkit.org/show_bug.cgi?id=15119#c1>: "amusingly gives
> the "correct" answer for Google".
>
> BR, Julian

Given all of the different approaches to this, I would say we
shouldn't feel constrained to endorse any one of them, since whichever
we pick potentially breaks content targeting a different browser.
That's good news, since it means we can focus more on what the UAs
should be doing rather than on codifying broken behavior. As I've said
before, I think Firefox comes the closest to the correct behavior on
this, though even more Unicode support would be preferred.
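
To make the divergence concrete, here is a rough sketch of the two
behaviors the test page reports in the quoted messages ("raw question
mark" and "%-escaped ASCII &#9786;"), next to plain UTF-8
percent-encoding for comparison. The function name and the choice of
ISO-8859-1 as the page encoding are my assumptions for illustration,
not something taken from the test itself.

```python
from urllib.parse import quote

def observed_query_encodings(text, page_charset="iso-8859-1"):
    """Hypothetical helper: encode `text` for a query component three ways."""
    # 1. Unrepresentable characters become a raw "?" (left unescaped).
    question_mark = quote(text.encode(page_charset, errors="replace"), safe="?")
    # 2. Unrepresentable characters become a numeric character reference,
    #    which is then %-escaped as ASCII.
    ncr = quote(text.encode(page_charset, errors="xmlcharrefreplace"), safe="")
    # 3. Ignore the page encoding and %-escape the UTF-8 bytes.
    utf8 = quote(text, safe="", encoding="utf-8")
    return {"question-mark": question_mark, "ncr": ncr, "utf-8": utf8}

for name, value in observed_query_encodings("\u263a").items():
    print(name, value)
```

For U+263A this gives "?", "%26%239786%3B" and "%E2%98%BA"
respectively, so a server cannot tell which convention a given UA
followed.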

For legacy support, perhaps we could add an accept-charset or similar
attribute to the root element, inherited by all descendant elements
and having an implied default value of UTF-8 on the root element. This
way legacy content could be repaired to work with HTML5 UAs simply by
adding the attribute wherever necessary. The attribute could then be
used on any element that has an attribute containing a URI, to
override the accept-charset for that URI. Non-HTML5 UAs will continue
to use whatever disparate legacy approach they currently use, ignoring
the accept-charset attribute on the root and other elements.
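
A minimal sketch of how a UA might resolve the inherited value,
assuming a single charset per attribute (not a list) and a toy element
tree. The Element class and both function names are hypothetical and
only illustrate the lookup order: nearest ancestor-or-self wins,
otherwise the implied UTF-8 default.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import quote

@dataclass
class Element:
    """Toy stand-in for a DOM element (hypothetical, for illustration)."""
    name: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Element"] = None

def effective_charset(element):
    """Walk up from the element; fall back to the implied root default."""
    node = element
    while node is not None:
        value = node.attrs.get("accept-charset")
        if value:
            return value
        node = node.parent
    return "utf-8"

def encode_query(element, query_text):
    """%-escape the query using the charset in effect for this element."""
    return quote(query_text, safe="=&",
                 encoding=effective_charset(element), errors="replace")

# Legacy document repaired by one attribute on the root element.
root = Element("html", {"accept-charset": "iso-8859-1"})
legacy_link = Element("a", parent=root)
# A single link that targets a UTF-8 server overrides the inherited value.
utf8_link = Element("a", {"accept-charset": "utf-8"}, parent=root)

print(encode_query(legacy_link, "q=caf\u00e9"))
print(encode_query(utf8_link, "q=caf\u00e9"))
```

The first link is encoded as "q=caf%E9" and the second as
"q=caf%C3%A9"; a document without the attribute anywhere would get the
UTF-8 form throughout.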

Perhaps accept-charset is not fine-grained enough, since we may also
have to use different encodings for each URI component:
  • Host (involving DNS for the internationalized domain name)
  • Path (for the server)
  • Query
  • Fragment Identifier
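
As a rough illustration of the four components being encoded
separately, the sketch below uses the IDNA (Punycode) form for the
host, UTF-8 percent-encoding for the path and fragment, and a
selectable charset for the query only. The function name, the sample
values and the choice of UTF-8 for path and fragment are my
assumptions here (the latter following the usual IRI-to-URI
conversion); Python's built-in "idna" codec implements the 2003
version of IDNA.

```python
from urllib.parse import quote

def encode_components(host, path, query, fragment, query_charset="utf-8"):
    """Hypothetical helper: encode each URI component on its own terms."""
    return {
        # DNS only sees ASCII, so the host goes through IDNA.
        "host": host.encode("idna").decode("ascii"),
        # Path and fragment: %-escaped UTF-8 (assumed here).
        "path": quote(path, safe="/", encoding="utf-8"),
        # Query: the one component where the charset is in dispute.
        "query": quote(query, safe="=&", encoding=query_charset,
                       errors="replace"),
        "fragment": quote(fragment, safe="", encoding="utf-8"),
    }

parts = encode_components("b\u00fccher.example", "/caf\u00e9",
                          "q=\u00e9", "r\u00e9sum\u00e9",
                          query_charset="iso-8859-1")
for name, value in parts.items():
    print(name, value)
```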

However, most of these seem to already be handled in a decent way,
except for the query component. Fragment identifiers probably need to
be interpreted in the encoding of the destination document, in which
case converting from the source document encoding to UTF-8 and then to
the destination document encoding makes the most sense. For local
fragment identifiers the source and destination documents are
obviously the same, so this collapses to a special case where no
actual conversion need take place.
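
The fragment round trip could be sketched roughly as follows. Both
function names are hypothetical, and the sketch assumes the fragment's
characters are representable in the destination encoding.

```python
from urllib.parse import quote, unquote_to_bytes

def fragment_to_uri(raw_fragment, source_charset):
    """Source document bytes -> characters -> %-escaped UTF-8."""
    text = raw_fragment.decode(source_charset)
    return quote(text, safe="", encoding="utf-8")

def fragment_for_destination(escaped, destination_charset):
    """%-escaped UTF-8 -> characters -> destination document bytes."""
    text = unquote_to_bytes(escaped).decode("utf-8")
    return text.encode(destination_charset)

source_bytes = "r\u00e9sum\u00e9".encode("iso-8859-1")
escaped = fragment_to_uri(source_bytes, "iso-8859-1")
print(escaped)
print(fragment_for_destination(escaped, "windows-1252"))
# Local fragment: same encoding on both sides, so the bytes come back
# unchanged and a UA could skip the conversion entirely.
print(fragment_for_destination(escaped, "iso-8859-1") == source_bytes)
```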

So by adding this attribute HTML5 can guide implementations toward  
universal UTF-8 support for the query URI component while still  
supporting legacy content and legacy application servers.

Take care,
Rob

Received on Sunday, 29 June 2008 18:56:59 UTC