RE: BIDI IRI Display (was spoofing and IRIs) from Larry Masinter on 2010-03-04 (public-iri@w3.org from March 2010)

From: Larry Masinter <LMM@acm.org>
Date: Thu, 4 Mar 2010 11:33:32 -0800
To: "'Shawn Steele'" <Shawn.Steele@microsoft.com>, "'Slim Amamou'" <slim@alixsys.com>
Cc: <public-iri@w3.org>, "'Peter Constable'" <petercon@microsoft.com>, <unicode@unicode.org>
Message-ID: <004201cabbd1$96a2a180$c3e7e480$@org>

> I'd suggest an addendum to the bidi algorithm (in Unicode) to cover the IRI case. 

The problem is -- with a sequence of text that contains an IRI, 
how do you know when you are in the IRI case?

I think the only practical way of handling this is to leave the
bidi algorithm in Unicode alone (since changing it in all text
display systems is infeasible), and instead focus on
"best practices for generating IRIs which, when rendered using
the existing Unicode algorithm, will produce understandable
results."


-----Original Message-----
From: Shawn Steele [mailto:Shawn.Steele@microsoft.com] 
Sent: Thursday, March 04, 2010 11:28 AM
To: Larry Masinter; 'Slim Amamou'
Cc: public-iri@w3.org; Peter Constable; unicode@unicode.org
Subject: RE: BIDI IRI Display (was spoofing and IRIs)

> Are you suggesting that IRIs should never appear in plain text,

And that's the crux of the problem :).  Unicode is "plain text" insofar as it has a sequence of code points that describe behaviors.  However, if I just spit out some "glyph" for each Unicode code point I encounter, say in a fixed-width DOS-type box, I'll have a mess, even for some Latin sequences.

In order for Unicode to display properly, the rendering engine must make some decisions.  U+0308 has to be combined with the A before it to make Ä.  Even if you force NFC, some scripts still require combining characters for correct display.  And that doesn't even begin to touch complex script behavior... or BIDI.
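
To illustrate the combining point, here is a minimal sketch using Python's standard unicodedata module.  The sample strings are chosen only for illustration: A followed by U+0308 composes to a single code point under NFC, while a Devanagari consonant plus vowel sign has no precomposed form and stays as two code points that the renderer still has to combine.

```python
import unicodedata

# "A" followed by U+0308 COMBINING DIAERESIS: two code points, one visible letter.
decomposed = "A\u0308"
composed = unicodedata.normalize("NFC", decomposed)

# Devanagari KA (U+0915) + VOWEL SIGN I (U+093F): there is no precomposed
# code point for this pair, so NFC leaves it as two code points.
devanagari = "\u0915\u093f"
devanagari_nfc = unicodedata.normalize("NFC", devanagari)

print(len(decomposed), len(composed))        # code point counts before/after NFC
print(len(devanagari), len(devanagari_nfc))  # unchanged by NFC
```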

So, even "plain text" has rules for display.  The Unicode Bidi Algorithm is one set of those rules, without which "plain text" BIDI would be a mess.  Unfortunately the Bidi algorithm can't perfectly handle all cases, and IRIs are a case where its behavior isn't perfect.  I'd suggest an addendum to the bidi algorithm to cover the IRI case.  That would mean tweaking the presentation engines so that when they see a "plain text" IRI, it gets displayed appropriately.
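
A rough way to see why IRIs are awkward for the existing algorithm is to look at the bidi class of each character.  The sketch below uses Python's standard unicodedata.bidirectional(); the IRI itself is a made-up example, and bidi_classes is a hypothetical helper name.  The delimiters (":", "/", ".") have no strong direction of their own, so where they end up on screen depends on the letters around them.

```python
import unicodedata

# Bidi classes that carry a strong direction of their own.
STRONG = {"L", "R", "AL"}

def bidi_classes(text):
    """Return (character, bidi class) pairs for each code point in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Hypothetical IRI with a Hebrew path segment (alef, bet, gimel).
iri = "http://example.com/\u05d0\u05d1\u05d2/page"
classes = bidi_classes(iri)

# Characters whose placement is decided by their neighbours.
non_strong = [ch for ch, cls in classes if cls not in STRONG]

for ch, cls in classes:
    print(f"U+{ord(ch):04X} {cls}")
```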

Consistent "Plain Text" display of an IRI is an unattainable holy grail, especially with more complex scripts.  Proper display of Unicode requires a rendering engine.

That goes for source code too.  If an editor doesn't render complex scripts in ways that are readable (to a speaker at least), then it's pretty pointless to use Unicode; it'd be better to %encode everything.
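
For comparison, the "%encode everything" fallback looks like this with Python's standard urllib.parse.  The path is a made-up example; the result is pure ASCII and displays the same everywhere, but it is no longer readable to a speaker.

```python
from urllib.parse import quote, unquote

# Hypothetical path with two Hebrew letters (alef, bet).
path = "/\u05d0\u05d1"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character;
# "/" is kept as-is because it is listed in safe.
encoded = quote(path, safe="/")

# unquote() reverses the encoding and recovers the original text.
decoded = unquote(encoded)

print(encoded)
```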

So my question is "what is your definition of plain text?"  I'd say anything that isn't using extra presentation markup, but that still allows the rendering engine to make reasonable sense of the display.

-Shawn

Received on Thursday, 4 March 2010 19:34:14 UTC