RE: BIDI IRI Display (was spoofing and IRIs) from Shawn Steele on 2010-03-04 (public-iri@w3.org from March 2010)

From: Shawn Steele <Shawn.Steele@microsoft.com>
Date: Thu, 4 Mar 2010 19:45:25 +0000
To: Larry Masinter <LMM@acm.org>, 'Slim Amamou' <slim@alixsys.com>
CC: "public-iri@w3.org" <public-iri@w3.org>, Peter Constable <petercon@microsoft.com>, "unicode@unicode.org" <unicode@unicode.org>
Message-ID: <E14011F8737B524BB564B05FF748464A0565AB8C@TK5EX14MBXC139.redmond.corp.microsoft.>

I disagree somewhat :)  Mostly in that the BIDI speakers seem to have concerns about what's "understandable" :)

-Shawn

-----Original Message-----
From: Larry Masinter [mailto:masinter@gmail.com] On Behalf Of Larry Masinter
Sent: Thursday, March 04, 2010 11:34 AM
To: Shawn Steele; 'Slim Amamou'
Cc: public-iri@w3.org; Peter Constable; unicode@unicode.org
Subject: RE: BIDI IRI Display (was spoofing and IRIs)

> I'd suggest an addendum to the bidi algorithm (in Unicode) to cover the IRI case. 

The problem is -- with a sequence of text that contains an IRI, how do you know when you are in the IRI case?

I think the only practical way of handling this is to leave the bidi algorithm in Unicode alone (since changing it in all text display systems is infeasible), and instead focus on "best practices for generating IRIs which, when rendered using the existing Unicode algorithm, will produce understandable results."

-----Original Message-----
From: Shawn Steele [mailto:Shawn.Steele@microsoft.com]
Sent: Thursday, March 04, 2010 11:28 AM
To: Larry Masinter; 'Slim Amamou'
Cc: public-iri@w3.org; Peter Constable; unicode@unicode.org
Subject: RE: BIDI IRI Display (was spoofing and IRIs)

> Are you suggesting that IRIs should never appear in plain text,

And that's the crux of the problem :).  Unicode is "plain text" insofar as it has a sequence of code points that describe behaviors.  However, if I just spit out some "glyph" for each Unicode code point I encounter, say in a fixed-width DOS-type box, I'll have a mess, even for some Latin sequences.

In order for Unicode to display properly, the rendering engine must make some decisions.  U+0308 has to be combined with the A before it to make Ä.  Even if you force NFC, some scripts still require combining characters for correct display.  And that doesn't even begin to touch complex script behavior... or BIDI.
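As a rough illustration of the NFC point, here is a minimal Python sketch using the standard unicodedata module.  The Devanagari pair is my own choice of example, not something from the thread: it shows a sequence with no precomposed form, so the combining mark survives NFC and the renderer still has to position it (this vowel sign is drawn to the left of its consonant).

```python
import unicodedata

# "A" followed by U+0308 COMBINING DIAERESIS: two code points, one visible letter.
decomposed = "A\u0308"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 code points before NFC, 1 after

# Illustrative choice (not from the thread): DEVANAGARI LETTER KA + VOWEL SIGN I.
# No precomposed character exists, so NFC leaves both code points in place.
devanagari = "\u0915\u093F"
devanagari_nfc = unicodedata.normalize("NFC", devanagari)
print(len(devanagari_nfc))

# The vowel sign is a spacing combining mark (general category "Mc").
print(unicodedata.category("\u093F"))
```

So normalization reduces the number of combining sequences but cannot remove the need for a rendering engine.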

So, even "plain text" has rules for display.  The Unicode Bidi Algorithm is one set of those rules, without which "plain text" BIDI would be a mess.  Unfortunately the Bidi algorithm can't perfectly handle all cases, and IRIs are a case where its behavior isn't perfect.  I'd suggest an addendum to the bidi algorithm to cover the IRI case, thus tweaking the presentation engines so that when they see a "plain text" IRI, it gets displayed appropriately.
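To make the IRI problem a bit more concrete, the sketch below lists the bidi character class that Python's unicodedata module reports for each character of a made-up IRI (example.com and the Hebrew path segment are hypothetical, chosen only for illustration).  It does not run the Bidi Algorithm itself; it only shows the inputs the algorithm works from.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, bidi class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Hypothetical IRI: Latin scheme and host, a Hebrew path segment, a Latin segment.
sample = "http://example.com/\u05E9\u05DC\u05D5\u05DD/abc"

for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {cls}")
```

The letters come out as strongly typed (L or R), while delimiters such as ":", "/" and "." are reported as CS (common separator), which is not a strong type.  Their displayed position therefore depends on the surrounding text rather than on their role in the IRI syntax, which is the kind of case the algorithm was not designed around.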

Consistent "Plain Text" display of an IRI is an unattainable holy grail, especially with more complex scripts.  Proper display of Unicode requires a rendering engine.

That goes for source code too.  If an editor doesn't render complex scripts in ways that are readable (to a speaker at least), then it's pretty pointless to use Unicode; it'd be better to %encode everything.
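For reference, "%encode everything" would look something like the following sketch, which uses urllib.parse from the Python standard library.  The Hebrew segment is a hypothetical example; the encoding assumed here is UTF-8, which is quote()'s default.

```python
from urllib.parse import quote, unquote

# Hypothetical path segment of four Hebrew letters.
segment = "\u05E9\u05DC\u05D5\u05DD"

# safe="" percent-encodes every reserved character too, including "/".
encoded = quote(segment, safe="")
print(encoded)  # pure ASCII, so no bidi reordering applies when it is displayed

# The round trip recovers the original code points.
print(unquote(encoded) == segment)
```

The encoded form displays the same way everywhere, at the cost of being unreadable to the speakers the IRI was meant for.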

So my question is "what is your definition of plain text?"  I'd say anything that isn't using extra presentation markup, but that allows the rendering engine to make reasonable sense of the display.

-Shawn

Received on Thursday, 4 March 2010 19:46:04 UTC