- From: Erik van der Poel <erikv@google.com>
- Date: Thu, 24 Apr 2008 19:13:51 -0700
- To: "Jeremy Carroll" <jjc@hpl.hp.com>
- Cc: "WWW International" <www-international@w3.org>
Here are some URIs with Arabic and Hebrew in the host, path and query parts of the URI. The indicated parents had a link to the URI above at one point (may no longer contain that link). Note that the query part is not always escaped UTF-8. (I don't read these languages, so sorry if they are offensive.)

Arabic host:
http://xn--wgbe9chb01aytce.com/
parent:
http://www.oneurdu.com/index.php/topic,38697.msg249002.html

Hebrew host:
http://www.xn--4dbbmod3aio.net/
parent:
http://www.jeeptrip1.com/index8.htm

Arabic path:
http://ar.wikipedia.org/wiki/%D8%A3%D9%88%D8%AF%D9%85%D9%88%D8%B1%D8%AA%D9%8A%D8%A7
parent:
http://en.wikipedia.org/wiki/Udmurtia

Hebrew path:
http://he.wikipedia.org/wiki/%D7%94%D7%9E%D7%90%D7%94_%D7%94-18
parent:
http://en.wikipedia.org/wiki/18th_century

Arabic query:
http://massai.ahram.org.eg/archive/maklat.asp?cirestriction=%E3%D1%D3%ED&curpos=0&pagetype=norm
parent:
http://massai.ahram.org.eg/Index.asp?CurFN=mail2.htm&DID=9556

Hebrew query:
http://www.toshav.co.il/contents/page.asp?contentPageID=147&%D7%9E%D7%A9%D7%A8%D7%93%20%D7%94%D7%A4%D7%A0%D7%99%D7%9D=
parent:
http://www.toshav.co.il/contents/page.asp?contentPageID=151

Counting scripts of non-ASCII characters only, we get:

top 10 scripts in host name:

Han 11581
Latin 7288
Katakana 5823
Common 1914
Hangul 1452
Hiragana 1434
Arabic 654
Cyrillic 389
Hebrew 145
Thai 120

top 20 scripts in path:

Latin 341133
Han 297869
Katakana 188422
Common 138509
Hangul 104614
Cyrillic 103203
Hiragana 59508
Arabic 22234
Greek 21220
Hebrew 20702
Thai 15043
Telugu 8770
Devanagari 6866
Bengali 4896
Tamil 3994
Georgian 2285
Inherited 1875
Malayalam 1313
Kannada 1038
Armenian 820

top 20 scripts in query:

Han 453989
Latin 252454
Common 128371
Katakana 123491
Cyrillic 87741
Hangul 83077
Hiragana 67603
Devanagari 16661
Arabic 15476
Thai 13027
Greek 11717
Hebrew 11430
Bopomofo 3151
Georgian 2730
Tamil 2439
Telugu 2312
Bengali 1752
Inherited 1680
Kannada 1068
Armenian 850

0.036% of all links contain non-ASCII hosts, 1.9% of all links contain non-ASCII paths, and 1.5% of all links contain non-ASCII queries. (This is from a small sample of around 800,000 documents in Google's index, so the percentages may not be very accurate.)

Erik

On Thu, Apr 24, 2008 at 5:43 AM, Jeremy Carroll <jjc@hpl.hp.com> wrote:
> Is there any deployment of bidi IRIs? Can I have an example of a real-life
> arabic or hebrew link?
>
> Jeremy
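As a rough illustration of the note that the query part is not always escaped UTF-8, the sketch below percent-decodes two of the example URIs and tries strict UTF-8 first. The helper name decode_escaped and the fallback charset cp1256 (Windows Arabic) are assumptions made for this illustration only; the message does not say which legacy encoding the site actually uses.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def decode_escaped(escaped, fallback="cp1256"):
    """Percent-decode a URI component and report which charset decoded it.

    Strict UTF-8 is tried first. The fallback charset is only a guess
    (an assumption of this sketch), not something the URI itself declares.
    """
    raw = unquote_to_bytes(escaped)
    try:
        return "utf-8", raw.decode("utf-8")
    except UnicodeDecodeError:
        return fallback, raw.decode(fallback)


# Two of the example URIs from the message.
hebrew_path = urlsplit(
    "http://he.wikipedia.org/wiki/%D7%94%D7%9E%D7%90%D7%94_%D7%94-18"
).path
arabic_query = urlsplit(
    "http://massai.ahram.org.eg/archive/maklat.asp"
    "?cirestriction=%E3%D1%D3%ED&curpos=0&pagetype=norm"
).query

for label, component in (("Hebrew path", hebrew_path),
                         ("Arabic query", arabic_query)):
    charset, text = decode_escaped(component)
    print(f"{label}: decoded as {charset}: {text}")
```

The Hebrew path decodes cleanly as UTF-8, while the bytes E3 D1 D3 ED in the Arabic query are not valid UTF-8, so the sketch falls back to the guessed legacy charset. A crawler would normally need the parent page's declared encoding to make that choice reliably.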
Received on Friday, 25 April 2008 02:14:30 UTC