Re: Testing IRI fragment identifiers from Leif Halvard Silli on 2008-12-17 (www-international@w3.org from October to December 2008)

From: Leif Halvard Silli <lhs@malform.no>
Date: Wed, 17 Dec 2008 14:52:23 +0100
To: Bjoern Hoehrmann <derhoermi@gmx.net>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <49490417.80909@malform.no>
Bjoern Hoehrmann 2008-12-17 05.49:
>  Leif Halvard Silli wrote:
>> In my testing, all current major browsers, except Opera, 
>> support IRI fragment identifiers in links.
> 
> I am not really sure what that means, usually fragments are identified
> by a sequence of characters, while fragment identifiers represent some
> sequence of octets. 

What I had in mind, then, was which characters are permitted in 
HTML (in particular, per the spec) and the User Agent's support 
for interpreting these characters (IRIs) as octets (URLs).

> To go from octets to characters you need character
> encodings, and to match two character sequences you need a definition
> of equality.

In that sense, in HTML 4, name="a" and name="A" are not 
considered unique fragments. (What about XHTML?) Hence they cannot 
coexist in the same document, even though UAs are fully able - and 
required - to distinguish between them. (But it doesn't seem as if 
any validator checks this uniqueness requirement of HTML 4 ...)
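
To make the two different requirements concrete, here is a minimal 
sketch in Python. The function names are hypothetical, and the use 
of casefold() for "differ only in case" is my assumption; HTML 4 
does not say how case is to be compared for non-ASCII names.

```python
from collections import defaultdict


def case_collisions(names):
    """Document-side rule: group anchor names that differ only in case."""
    groups = defaultdict(set)
    for name in names:
        # Assumption: casefold() stands in for "differs only in case".
        groups[name.casefold()].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def resolve(fragment, names):
    """UA-side rule: exact, case-sensitive match against the anchor names."""
    return fragment if fragment in names else None


if __name__ == "__main__":
    anchors = ["a", "A", "intro"]
    print(case_collisions(anchors))       # names a validator could flag
    print(resolve("a", set(anchors)))     # the UA still tells them apart
```

The first function is what a validator could check; the second is 
what the UA is required to do when following a link.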

On the other hand, validators don't accept the combining umlaut 
character that you use in your example below. I guess that this is 
a restriction that has been put in place due to document-internal 
uniqueness requirements.

> Now the interpretation of fragment identifiers is subject
> to the media type, and in case of text/html you'll find that RFC 2854
> defines neither of the two necessary parts. Also note it's disallowed
> to put non-ASCII characters into text/html href/src/etc. attributes.

The W3C Validator gives no errors ... Are you saying that href/src 
have stricter limitations than HTML 4's name="" and XHTML's id? I 
would have thought it was the opposite - that name and id have 
stricter requirements, due to the uniqueness issues.

> Nevertheless, you may find my test cases and findings in
> 
>   http://lists.w3.org/Archives/Public/www-html-editor/2002OctDec/0001.html
> 
> of interest; they should illustrate why these definitions are needed.

Here you take up the problem of Unicode normalization, with the 
example of the letter ö. Simplified, you have

 name="&ouml;"

which you try to link to via

    (1) href="#&ouml;"     (4) href="#%c3%b6"
    (2) href="#o&#x308;"   (5) href="#o%cc%88"
    (3) href="#%f6"

and where you expect (1) and (4), as well as (2) and (5), to be 
equal. You said in your message from 2002 that (1) and (2) are 
invalid because of "non-ASCII character in URI Reference". As of 
today, we have IRIs, and so (1) and (2) should be valid as IRIs - 
at least in XHTML, except that XHTML puts a restriction on using 
combining characters in fragments, and thus (2) cannot be used 
anyhow. (Honestly, I am not certain whether it is XHTML or the IRI 
spec that has made this ruling - but I assume it is XHTML.)
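
The relationship between the five links can be sketched in Python. 
This is only an illustration, not a description of what any 
browser does internally: the function names are hypothetical, the 
fragments are written as they look after the HTML character 
references are resolved, and I assume that percent-escapes are 
decoded as UTF-8. Link (3) is taken to be the single Latin-1 octet 
for ö.

```python
import unicodedata
from urllib.parse import unquote

TARGET = "\u00f6"  # name="&ouml;" (precomposed o with diaeresis)

# Fragments after character references have been resolved.
FRAGMENTS = {
    1: "\u00f6",   # href="#&ouml;"
    2: "o\u0308",  # href="#o&#x308;" (o + combining diaeresis)
    3: "%f6",      # Latin-1 octet for the letter, percent-encoded
    4: "%c3%b6",   # UTF-8 octets of U+00F6, percent-encoded
    5: "o%cc%88",  # "o" + UTF-8 octets of U+0308, percent-encoded
}


def matches_exact(fragment, target=TARGET):
    """Percent-decode as UTF-8, then compare code point by code point."""
    return unquote(fragment) == target


def matches_normalized(fragment, target=TARGET):
    """Same, but compare after Unicode normalization to NFC."""
    decoded = unicodedata.normalize("NFC", unquote(fragment))
    return decoded == unicodedata.normalize("NFC", target)


if __name__ == "__main__":
    for number, fragment in FRAGMENTS.items():
        print(number, matches_exact(fragment), matches_normalized(fragment))
```

With exact matching only (1) and (4) hit the target; with NFC 
normalization (2) and (5) hit it as well. Link (3) hits in neither 
case, because the octet F6 alone is not valid UTF-8.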

This, to me, seems like a very similar case to the name="a" and 
name="A" case, where the UA and the document have different 
requirements:

Firefox 3, at least in a UTF-8 document, only accepts (1) and 
(4). This is similar to how href="#a" only hits name="a" in a 
conforming User Agent, even if 'a' and 'A', for a human being, are 
just two forms of the same letter. This is the same as in 2002.

Safari, OTOH, accepts link (2) as well, for some reason - at 
*least* on Mac OS X (could the reason be linked to the fact that 
OS X uses the same normalisation in its file system? E.g. if I 
copy an "å" from my file system, then it will be pasted as an "a" 
+ a combining ring above.) Here the problem is that Safari is 
stretching what it considers equal. Safari was not part of your 
test.
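
The "å" example can be reproduced with Python's unicodedata 
module. This sketch only shows what canonical decomposition (NFD) 
does to the character; that the OS X file system stores names in 
such a decomposed form is my assumption from the copy-and-paste 
observation above, and the helper name is hypothetical.

```python
import unicodedata


def codepoints(text):
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(char):04X}" for char in text]


precomposed = "\u00e5"  # "å" as a single code point
decomposed = unicodedata.normalize("NFD", precomposed)

if __name__ == "__main__":
    print(codepoints(precomposed))  # one code point
    print(codepoints(decomposed))   # "a" followed by combining ring above
    # The two strings differ code point by code point ...
    print(precomposed == decomposed)
    # ... but are equal once both are normalized to the same form.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)
```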
 
Opera version 9.26 only accepts (1) - as it also did in 2002. 
(The latest Opera, 9.5 and later, currently doesn't find any 
non-ASCII fragment URIs ...)

In the Internet Explorer field, version 8 beta 2 is equal to what 
Opera was in 2002: It only accepts (1). Whereas version 7 is equal 
to the situation for version 6 in 2002. I guess this is, after 
all, progress.

It seems there might be some progress since 2002 with regard to 
specification. But in UA implementations, it is difficult to 
understand the logic, even if we only consider UTF-8 documents, 
and especially if we expand the fragments to cover non-valid 
identifiers, such as name="o&#x308;" (which I have not spoken 
about here). It is a little better if we limit ourselves to the 
valid fragments.

My 2 øre ... There should be some stuff to create test cases from, 
here ...

-- 
leif halvard silli
Received on Wednesday, 17 December 2008 13:53:04 UTC