
RE: [iri] #131: Using document charset causes interoperability problems

From: Dave Thaler <dthaler@microsoft.com>
Date: Tue, 24 Jul 2012 01:29:36 +0000
To: Larry Masinter <masinter@adobe.com>, "draft-ietf-iri-3987bis@tools.ietf.org" <draft-ietf-iri-3987bis@tools.ietf.org>, "stpeter@stpeter.im" <stpeter@stpeter.im>
CC: "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <9B57C850BB53634CACEC56EF4853FF653B719893@TK5EX14MBXW604.wingroup.windeploy.ntdev.microsoft.com>
> -----Original Message-----
> From: Larry Masinter [mailto:masinter@adobe.com]
> Sent: Saturday, July 21, 2012 9:07 AM
> To: draft-ietf-iri-3987bis@tools.ietf.org; stpeter@stpeter.im
> Cc: public-iri@w3.org
> Subject: RE: [iri] #131: Using document charset causes interoperability
> problems
> 
> I hate this feature, and would love to get rid of it, but let's acknowledge at least
> somewhere that it happens. That is, the interoperability problems are real, but
> not documenting it here doesn't solve the problem.
> 
> I think the text in the document intended that whether there _was_ a
> "document charset" at all depends on the format of the document: yes for
> HTML, maybe for Word (up to Word), no for PDF, maybe (not yet defined) for
> text/plain.
> 
> I can see two choices that might work:
> 
> * Any document format that wishes this kind of processing has to say that what
> they are using aren't really IRIs, they're funny strings that get preprocessed to
> turn them into IRIs or URIs.
> * The IRI spec (continues to) explicitly defines this document-charset-dependent
> behavior, but is more explicit about the rules for where "document charset"
> comes from.
> 
> I could go with either one of those. How do those seem to the group?

I'd argue for the first (and against the second).

-Dave

> 
> 
> -----Original Message-----
> From: iri issue tracker [mailto:trac+iri@grenache.tools.ietf.org]
> Sent: Thursday, July 19, 2012 3:04 PM
> To: draft-ietf-iri-3987bis@tools.ietf.org; stpeter@stpeter.im
> Cc: public-iri@w3.org
> Subject: [iri] #131: Using document charset causes interoperability problems
> 
> #131: Using document charset causes interoperability problems
> 
>  As reported by Dave Thaler...
> 
>  URIs and/or IRIs can appear in many contexts: in normal text in an email
>  message, in a PDF file, in a Word doc, or whatever else.
> 
>  Allowing the charset to vary complicates frameworks considerably, since the
>  doc charset now has to be passed from whatever extracts the URI from the
>  document (HTML or otherwise) to whatever else needs to know the
>  interpretation (normalizer code, comparison code, whatever). Various API
>  frameworks already have various sorts of "Uri" classes that take in a
>  URI-like string and let you do things like get the URI form, the IRI form,
>  or various components. This means the constructor needs to change, since
>  you cannot correctly interpret an IRI(bis) without knowing the document
>  charset.
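
To make the constructor point concrete, here is a minimal Python sketch. The `Uri` class, its `document_charset` parameter, and the `uri_form` method are hypothetical names, not any real framework's API. Only the query component is handled, host names are left untouched, and the query characters are taken to be U+00C3 U+00A9, which is what the percent-encoded forms in the ticket's example imply.

```python
from urllib.parse import quote, urlsplit, urlunsplit


class Uri:
    """Hypothetical URI wrapper; not any real framework's class."""

    def __init__(self, text: str, document_charset: str = "utf-8"):
        # The second parameter is the constructor change under discussion:
        # every caller that extracts a reference would have to supply it.
        self._parts = urlsplit(text)
        self._charset = document_charset

    def uri_form(self) -> str:
        """Percent-encode non-ASCII query characters in the stored charset."""
        # "=&%" stay as-is so query structure and existing escapes survive.
        query = quote(self._parts.query, safe="=&%", encoding=self._charset)
        return urlunsplit(self._parts._replace(query=query))


# Assumed input: the same character string, extracted from two documents.
text = "http://www.sw.it.aoyama.ac.jp/non-existent?\u00c3\u00a9"

print(Uri(text).uri_form())                # ...?%C3%83%C2%A9
print(Uri(text, "iso-8859-1").uri_form())  # ...?%C3%A9
```

The same input string yields two different URIs depending on an argument that existing callers do not have, so comparison or normalization code further along gets different answers unless the charset travels with the string.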
> 
>  I'm not yet convinced that's a change worth making.   Currently everything
>  assumes UTF-8.   With this change, we'll get random behavior until
>  everything is updated, which is a state worse than today in my view.
> 
>  Example:
>  http://www.sw.it.aoyama.ac.jp/non-existent?Ã©
>
>  If the charset were iso-8859-1 then, under RFC 3987 as I understand it,
>  this would become:
>
>  http://www.sw.it.aoyama.ac.jp/non-existent?%C3%83%C2%A9
>
>  In other words, you have to convert iso-8859-1 to UTF-8 and then
>  pct-encode the UTF-8.
>
>  But as I understand 3987bis, it would become:
>
>  http://www.sw.it.aoyama.ac.jp/non-existent?%C3%A9
>
>  which would then be passed around via various APIs and protocols that
>  would not pass the charset along with it. As such, it would be interpreted
>  by the receiving code as pct-encoded UTF-8:
>
>  http://www.sw.it.aoyama.ac.jp/non-existent?é
>
>  which of course it isn't.
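
The example can be reproduced with a short Python sketch. It assumes the query characters in the iso-8859-1 document are U+00C3 U+00A9 ("Ã©"), which is what the percent-encoded forms quoted above imply. The function names are illustrative only, and `doc_charset_style` reflects the ticket's reading of 3987bis rather than the draft text itself.

```python
from urllib.parse import quote, unquote

BASE = "http://www.sw.it.aoyama.ac.jp/non-existent?"
QUERY = "\u00c3\u00a9"  # assumed query characters in the iso-8859-1 document


def rfc3987_style(query: str) -> str:
    """Convert the characters to UTF-8, then percent-encode the bytes."""
    return BASE + quote(query, safe="", encoding="utf-8")


def doc_charset_style(query: str, doc_charset: str) -> str:
    """Percent-encode the bytes of the document charset instead."""
    return BASE + quote(query, safe="", encoding=doc_charset)


def receiver_view(uri: str) -> str:
    """What code sees when no charset travels with the URI: it assumes UTF-8."""
    base, _, query = uri.partition("?")
    return base + "?" + unquote(query, encoding="utf-8")


utf8_uri = rfc3987_style(QUERY)
charset_uri = doc_charset_style(QUERY, "iso-8859-1")

print(utf8_uri)                    # ...?%C3%83%C2%A9
print(charset_uri)                 # ...?%C3%A9
print(receiver_view(utf8_uri))     # round-trips to the original characters
print(receiver_view(charset_uri))  # decodes to U+00E9, a different character
```

Only the UTF-8 form survives a round trip through a receiver that has no charset information.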
> 
>  As such, we should make the RFC 3987 behavior (UTF-8, NOT the doc charset)
>  required for everything that doesn't explicitly pass the charset along
>  with the URI.
> 
> --
> -----------------------+--------------------------------------
>  Reporter:  stpeter@…  |      Owner:  draft-ietf-iri-3987bis@…
>      Type:  defect     |     Status:  new
>  Priority:  major      |  Milestone:
> Component:  3987bis    |    Version:
>  Severity:  -          |   Keywords:
> -----------------------+--------------------------------------
> 
> Ticket URL: <http://trac.tools.ietf.org/wg/iri/trac/ticket/131>
> iri <http://tools.ietf.org/wg/iri/>
> 

Received on Tuesday, 24 July 2012 01:30:10 GMT
