
Re: why use IRIs?

From: David Clarke <w3@dragonthoughts.co.uk>
Date: Wed, 04 Jul 2012 10:45:16 +0100
Message-ID: <4FF410AC.4060503@dragonthoughts.co.uk>
To: public-iri@w3.org
I've been reading this thread with interest. I'm wondering how the 
originator would feel if URIs had been defined to use digits and 
punctuation only with no alphabetic characters?

From the point of view of someone who doesn't natively use the Latin
alphabet, that is equivalent to what he is proposing. Most literate 
people in the world are able to use the Latin alphabet, but will be 
better at recognising errors in their native script (when programming 
etc), and more likely to be able to remember host names, without error, 
that are in their native script.

As far as spoofing goes, most typefaces already make it easy to confuse
1 (DIGIT ONE), l (LOWER CASE LATIN LETTER L) and I (UPPER CASE LATIN
LETTER I), and likewise O (UPPER CASE LATIN LETTER O) and 0 (DIGIT
ZERO). Would it be reasonable to propose removing those characters from
URLs to reduce spoofing?
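
To make the point concrete, here is a minimal Python sketch of how one might flag ASCII-only host names that differ only in those look-alike characters. The fold table and the function names are hypothetical, cover only the five characters named above, and are not taken from any standard or library.

```python
# Hypothetical fold table: maps each look-alike to one representative.
# It covers only the characters named in the paragraph above.
ASCII_LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0"})


def skeleton(label: str) -> str:
    """Collapse visually confusable ASCII characters into one representative."""
    return label.translate(ASCII_LOOKALIKES)


def may_be_confused(a: str, b: str) -> bool:
    """Return True when two distinct labels collapse to the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)


print(may_be_confused("mail.example", "maiI.example"))    # True
print(may_be_confused("l00k.example", "lOOk.example"))    # True
print(may_be_confused("example.org", "example.com"))      # False
```

Both spoofing pairs are plain ASCII, so restricting identifiers to ASCII does not remove the problem.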

On 04/07/2012 09:49, "Martin J. Dürst" wrote:
> Hello Mark,
>
> On 2012/07/04 15:13, Mark Nottingham wrote:
>> I tend to agree with Peter.
>>
>> The experience of using IRIs as identifiers in Atom was, IME, a
>> disaster.
>
> Can you be specific? Can you provide pointers?
>
>
>> Identifiers need to be resistant to spoofing and mistakes.
>
> It's easy to create spoofing identifiers using ASCII/English only.
>
> It's also not too difficult to create spoofing/mistake-resistant
> identifiers in other scripts or languages, for people who are better
> versed in these scripts/languages. This may be difficult to understand
> for "English-centric" people, but it's indeed the case.
>
>
>> Björn said:
>>
>>> How would you like it if URIs could use only 20 of the 26 letters in
>>> the English alphabet and you would have to encode, decode and convert
>>> them all the time, or use awkward transliterations to avoid having to
>>> do so?
>>
>> URIs already have a constrained syntax; you can't use certain
>> characters in certain places.
>
> Yes. But not being able to use certain punctuation is different from
> not being able to use characters in the basic alphabet/character
> repertoire of the language. It's easy to replace spaces with hyphens
> or whatever. It's a different thing to replace one letter with
> another, or just drop it.
>
>> As long as people can put IRIs into HTML and browser address bars, I
>> don't think they'll care.
>>
>> Martin said:
>>
>>> I think the real motivation would be people looking at HTTP traces and
>>> preferring to see Unicode rather than lots of %HH strings. Of course
>>> the number of people looking at HTTP traces is low, and they are not
>>> end users.
>>
>> Is this use case really worth the pain,
>
> For that specific case, I'm not sure. That's why I used "would". But I
> also don't think the pain would be that high.
>
>
>> inefficiency,
>
> Conversion would indeed cost some cycles. But using raw bytes instead
> of %-encoding would save bytes (which, these days, as far as I have
> followed the SPDY debates so far, seems to be the more important side
> of the tradeoff).
>
>> and very likely security vulnerabilities caused by transcoding from
>> IRIs to URIs and back when hopping from HTTP 2.0 to 1.1 and back? I
>> don't think so.
>
> There are quite a lot of places where security blunders can happen.
> That conversion step wouldn't be the first one and wouldn't be the
> last one. And using %-encoding for basic ASCII characters is already
> allowed today, so the basic security vulnerability (firewalls can't
> just check on character strings) already exists today.
>
>> My English-centric .02; ŸṀṂṼ.
>
> 您里可变 (this is not real Chinese, but just four roughly
> corresponding characters put together).
>
> Regards,   Martin.
>
>
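
Two of the points in the quoted exchange (raw bytes versus %-encoding, and %-encoded ASCII defeating plain string checks) can be illustrated with a short Python sketch. It uses only `quote` and `unquote` from the standard `urllib.parse` module; the paths and the helper name `iri_path_to_uri_path` are made up for illustration, and the helper handles only a path component, not a full IRI.

```python
from urllib.parse import quote, unquote


def iri_path_to_uri_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of non-ASCII characters, keeping '/'."""
    return quote(path, safe="/")


# Hypothetical path containing one non-ASCII character.
iri_path = "/café"
uri_path = iri_path_to_uri_path(iri_path)

print(uri_path)                        # /caf%C3%A9
print(len(iri_path.encode("utf-8")))   # 6 bytes as raw UTF-8
print(len(uri_path.encode("ascii")))   # 10 bytes when %-encoded

# %-encoding is already allowed for ordinary ASCII letters, so a filter
# that compares raw strings can be bypassed without any IRIs involved.
disguised = "/%61dmin"
print("admin" in disguised)            # False
print("admin" in unquote(disguised))   # True
```

Each two-byte UTF-8 character costs six bytes once %-encoded, which is the saving the quoted message refers to. The last two lines show why a check has to decode before comparing, whether or not IRIs are in use.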
Received on Wednesday, 4 July 2012 09:45:51 GMT