Re: BiDi IRI deployment?

From: Erik van der Poel <erikv@google.com>
Date: Thu, 24 Apr 2008 19:13:51 -0700
Message-ID: <c07a32650804241913t3e592c4eyb337bc3d3b902f51@mail.gmail.com>
To: "Jeremy Carroll" <jjc@hpl.hp.com>
Cc: "WWW International" <www-international@w3.org>

Here are some URIs with Arabic and Hebrew in the host, path and query
parts. Each "parent" is a page that linked to the URI listed above it at
one point (it may no longer contain that link). Note that the query part
is not always escaped UTF-8: the bytes in the Arabic query below are not
valid UTF-8 and appear to be in a legacy encoding such as windows-1256.
(I don't read these languages, so apologies if any of the text is
offensive.)

Arabic host: http://xn--wgbe9chb01aytce.com/
parent: http://www.oneurdu.com/index.php/topic,38697.msg249002.html

Hebrew host: http://www.xn--4dbbmod3aio.net/
parent: http://www.jeeptrip1.com/index8.htm

Arabic path: http://ar.wikipedia.org/wiki/%D8%A3%D9%88%D8%AF%D9%85%D9%88%D8%B1%D8%AA%D9%8A%D8%A7
parent: http://en.wikipedia.org/wiki/Udmurtia

Hebrew path: http://he.wikipedia.org/wiki/%D7%94%D7%9E%D7%90%D7%94_%D7%94-18
parent: http://en.wikipedia.org/wiki/18th_century

Arabic query: http://massai.ahram.org.eg/archive/maklat.asp?cirestriction=%E3%D1%D3%ED&curpos=0&pagetype=norm
parent: http://massai.ahram.org.eg/Index.asp?CurFN=mail2.htm&DID=9556

Hebrew query: http://www.toshav.co.il/contents/page.asp?contentPageID=147&%D7%9E%D7%A9%D7%A8%D7%93%20%D7%94%D7%A4%D7%A0%D7%99%D7%9D=
parent: http://www.toshav.co.il/contents/page.asp?contentPageID=151
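
To inspect the examples above by hand, here is a minimal Python sketch
using only the standard library. The helper names are made up for
illustration, and the windows-1256 fallback is an assumption: it is a
guess that happens to fit the Arabic query, not something the URI
itself declares.

```python
from urllib.parse import unquote_to_bytes


def decode_escapes(escaped, fallback="cp1256"):
    """Percent-decode `escaped`, trying UTF-8 first.

    Returns (text, encoding_used). The fallback encoding is an
    assumption; a URI does not say which encoding its escapes use.
    """
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


def decode_idn_label(label):
    """Decode one host label: strip the 'xn--' ACE prefix and run
    Punycode decoding. Labels without the prefix are returned as-is.
    This skips the IDNA validity checks a browser would apply."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label
```

For instance, decode_escapes("%D7%94%D7%9E%D7%90%D7%94_%D7%94-18")
decodes as UTF-8, while decode_escapes("%E3%D1%D3%ED") fails UTF-8 and
falls back to the assumed legacy encoding. The host labels can be
checked with decode_idn_label("xn--4dbbmod3aio").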

Counting only the non-ASCII characters in each part of the URI, grouped
by Unicode script, we get:

top 10 scripts in host name:

Han 11581
Latin 7288
Katakana 5823
Common 1914
Hangul 1452
Hiragana 1434
Arabic 654
Cyrillic 389
Hebrew 145
Thai 120

top 20 scripts in path:

Latin 341133
Han 297869
Katakana 188422
Common 138509
Hangul 104614
Cyrillic 103203
Hiragana 59508
Arabic 22234
Greek 21220
Hebrew 20702
Thai 15043
Telugu 8770
Devanagari 6866
Bengali 4896
Tamil 3994
Georgian 2285
Inherited 1875
Malayalam 1313
Kannada 1038
Armenian 820

top 20 scripts in query:

Han 453989
Latin 252454
Common 128371
Katakana 123491
Cyrillic 87741
Hangul 83077
Hiragana 67603
Devanagari 16661
Arabic 15476
Thai 13027
Greek 11717
Hebrew 11430
Bopomofo 3151
Georgian 2730
Tamil 2439
Telugu 2312
Bengali 1752
Inherited 1680
Kannada 1068
Armenian 850
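
A rough way to produce tallies of this shape, for anyone who wants to
try it on their own links. This is only a sketch and not the method
used for the numbers above: Python's standard library does not expose
the Unicode Script property, so the hypothetical rough_script() below
approximates it from the first word of each character's name. It cannot
produce the Common or Inherited buckets and will mislabel some
punctuation and symbols.

```python
import unicodedata
from collections import Counter
from urllib.parse import unquote


def rough_script(ch):
    """Approximate a character's script from its Unicode name.

    Heuristic only: 'HEBREW LETTER HE' -> 'Hebrew',
    'CJK UNIFIED IDEOGRAPH-4E2D' -> 'Han'.
    """
    name = unicodedata.name(ch, "")
    if not name:
        return "Unknown"
    first = name.split()[0]
    return "Han" if first == "CJK" else first.capitalize()


def count_scripts(text):
    """Tally the approximate scripts of the non-ASCII characters only."""
    return Counter(rough_script(ch) for ch in text if ord(ch) > 0x7F)
```

Applied to the unescaped Hebrew path above, for example
count_scripts(unquote("%D7%94%D7%9E%D7%90%D7%94_%D7%94-18")), only the
Hebrew letters are counted; the underscore, hyphen and digits are ASCII
and are skipped.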

0.036% of all links contain non-ASCII hosts, 1.9% of all links contain
non-ASCII paths, and 1.5% of all links contain non-ASCII queries.
(This is from a small sample of around 800,000 documents in Google's
index, so the percentages may not be very accurate.)
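
The per-part percentages depend on how "non-ASCII" is decided for each
part of a link. One possible definition, sketched below under my own
assumptions (the function name is hypothetical and this is not
necessarily how the sample was classified): a host counts if it has raw
non-ASCII characters or an "xn--" label, and a path or query counts if
it contains any byte above 0x7F once percent-escapes are removed.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def non_ascii_parts(url):
    """Return the subset of {'host', 'path', 'query'} that is non-ASCII."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    found = set()
    # Host: raw non-ASCII characters, or a Punycode (ACE) label.
    if any(ord(c) > 0x7F for c in host) or any(
        label.startswith("xn--") for label in host.split(".")
    ):
        found.add("host")
    # Path and query: look at the bytes behind the percent-escapes,
    # whatever encoding they happen to be in.
    for name, value in (("path", parts.path), ("query", parts.query)):
        if any(b > 0x7F for b in unquote_to_bytes(value)):
            found.add(name)
    return found
```

Working on bytes rather than decoded text means the check does not
need to know whether the escapes are UTF-8 or a legacy encoding.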

Erik

On Thu, Apr 24, 2008 at 5:43 AM, Jeremy Carroll <jjc@hpl.hp.com> wrote:
> Is there any deployment of bidi IRIs? Can I have an example of a real-life
> arabic or hebrew link?
>
> Jeremy
Received on Friday, 25 April 2008 02:14:30 GMT