RE: Mixing scripts (Re: Unicode versions (Re: Criteria for exceptional characters)) from Martin Duerst on 2006-12-25 (public-iri@w3.org from December 2006)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Mon, 25 Dec 2006 14:21:40 +0900
To: Michael Everson <everson@evertype.com>, <idna-update@alvestrand.no>
Cc: public-iri@w3.org
Message-Id: <6.0.0.20.2.20061225135712.18e0dec0@localhost>

At 23:00 06/12/24, Michael Everson wrote:
>At 22:00 +0900 2006-12-24, Martin Duerst wrote:

>It is Kurdish, and the two letters are for other functional reasons being proposed for addition to the standard. So for the sake of argument, assume that this particular reason does not apply.
>
>Why then would mixing Latin and Greek and Cyrillic at (at least) the same level not be disallowed in IDNs and IRIs to avoid security problems?

For IDNs, we are discussing this here, and even though the current
tendency seems to be not to do this at the protocol level,
I'm fairly sure that registries and browsers will do something
about it.

For IRIs, the situation is completely different. IRIs (like URIs)
are 'meta-syntax', a system that can encompass all kinds of
different syntactic conventions. There are extremely few things
you can actually check in an IRI as such. If you know the scheme
(such as http:, ftp:, mailto:,...), there are scheme-specific
rules that can be used for checking, but you can never assume
that a scheme is known everywhere. Implementing all these
checks would be expensive, and they are better delegated to
resolution, where knowledge of the scheme is required anyway.
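
To illustrate how little can be said about an IRI without knowing its
scheme, here is a minimal sketch in Python. The helper name
generic_parts is made up for this example; it only splits the generic
components and deliberately applies no scheme-specific rules.

```python
from urllib.parse import urlsplit

def generic_parts(iri):
    """Split an IRI into generic components; no scheme-specific checks."""
    parts = urlsplit(iri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }

# The same generic split works for very different schemes; anything
# beyond this needs knowledge of the scheme itself.
for iri in ("http://www.google.com/search?q=russian+русскйи",
            "mailto:duerst@it.aoyama.ac.jp"):
    print(generic_parts(iri))
```

For the mailto: example there is no authority at all, and the address
ends up in the path, which shows why further checks belong to code
that knows the scheme.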

Also, in a 'typical' (e.g. http: or ftp:) IRI, the place where
attacks can take place is the domain name. Anything else is
just between the server and the client. As an example, assume
that you create a font that makes the distinction between Latin
and Cyrillic very easy to see, and you create a Web page for it at
http://www.evertype.com/fonts/latinCYRILLIC.html (where the
'CYRILLIC' part is actually in Cyrillic). Because it's your
Web server, nobody will be able to spoof you, and nobody
should be able to tell you whether this particular page name
is a good idea or not (well, your customers may tell you it's
difficult to type, anyway).
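
As a rough sketch of the difference between the domain name and the
rest of the IRI, the following Python looks at which scripts occur in
each part. The function scripts_in is hypothetical, and so is the
Cyrillic spelling used in the path (the actual page name is not given
above). Taking the first word of a character's Unicode name is only a
crude stand-in for real script data.

```python
import unicodedata
from urllib.parse import urlsplit

def scripts_in(text):
    """Crude script guess: first word of each letter's Unicode name."""
    found = set()
    for ch in text:
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found

# Hypothetical spelling of the page name; only the path mixes scripts.
iri = "http://www.evertype.com/fonts/latinКИРИЛЛИЦА.html"
parts = urlsplit(iri)

print(scripts_in(parts.hostname))  # scripts used in the domain name
print(scripts_in(parts.path))      # scripts used in the path
```

A mixed-script check of this kind would only matter for the host part;
the path is the server owner's business.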

Going one step further, one important part of (URIs and) IRIs
is the query part. You wouldn't want to prohibit users from submitting
queries containing keywords in different scripts, would you?
Take a look at the following query (URI):
http://www.google.com/search?q=russian+%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B9%D0%B8
(or the following, the same as above but as an IRI:
http://www.google.com/search?q=russian+русскйи).
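
The relation between the two forms above can be reproduced with a
short Python sketch: the IRI form carries the keyword as characters,
and the URI form carries the same keyword as percent-encoded UTF-8
bytes. The variable names are chosen for this example only.

```python
from urllib.parse import quote, unquote

keyword = "русскйи"            # keyword as it appears in the IRI form
encoded = quote(keyword)       # percent-encode the UTF-8 bytes

uri = "http://www.google.com/search?q=russian+" + encoded
iri = "http://www.google.com/search?q=russian+" + unquote(encoded)

print(uri)
print(iri)
```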

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Monday, 25 December 2006 05:42:34 UTC