- From: Dave Thaler <dthaler@microsoft.com>
- Date: Tue, 24 Jul 2012 01:29:36 +0000
- To: Larry Masinter <masinter@adobe.com>, "draft-ietf-iri-3987bis@tools.ietf.org" <draft-ietf-iri-3987bis@tools.ietf.org>, "stpeter@stpeter.im" <stpeter@stpeter.im>
- CC: "public-iri@w3.org" <public-iri@w3.org>
> -----Original Message-----
> From: Larry Masinter [mailto:masinter@adobe.com]
> Sent: Saturday, July 21, 2012 9:07 AM
> To: draft-ietf-iri-3987bis@tools.ietf.org; stpeter@stpeter.im
> Cc: public-iri@w3.org
> Subject: RE: [iri] #131: Using document charset causes interoperability
> problems
>
> I hate this feature, and would love to get rid of it, but let's acknowledge at least
> somewhere that it happens. That is, the interoperability problems are real, but
> not documenting it here doesn't solve the problem.
>
> I think what the text in the document intended was that whether there _was_ a
> "document charset" at all depended on the format of the document... yes, for
> HTML, maybe for Word (up to word), no for PDF, maybe (not yet defined) for
> text/plain.
>
> I can see two choices that might work:
>
> * Any document format that wishes this kind of processing has to say that what
> they are using aren't really IRIs, they're funny strings that get preprocessed to
> turn them into IRIs or URIs.
> * The IRI spec (continues to) explicitly defines this document-charset-dependent
> behavior, but is more explicit about the rules for where "document charset"
> comes from.
>
> I could go with either one of those. How do those seem to the group?

I'd argue for the first (and against the second).

-Dave

>
>
> -----Original Message-----
> From: iri issue tracker [mailto:trac+iri@grenache.tools.ietf.org]
> Sent: Thursday, July 19, 2012 3:04 PM
> To: draft-ietf-iri-3987bis@tools.ietf.org; stpeter@stpeter.im
> Cc: public-iri@w3.org
> Subject: [iri] #131: Using document charset causes interoperability problems
>
> #131: Using document charset causes interoperability problems
>
> As reported by Dave Thaler...
>
> URIs and/or IRIs can appear in many contexts.
>
> In normal text in an email message, or in a PDF file or Word doc or whatever
> else.
>
> Allowing it to vary complicates frameworks considerably since now the doc
> charset has to be passed from whatever extracts the URI from the document
> (HTML or otherwise) and whatever else needs to know the interpretation
> (normalizer code, comparison code, whatever). Various API frameworks
> already have various sorts of "Uri" classes that take in a URI-like string and let
> you do things like get the URI form or the IRI form, or various
> components or whatever. This means the constructor needs to change since
> you cannot correctly interpret an IRI(bis) without knowing the document
> charset.
>
> I'm not yet convinced that's a change worth making. Currently everything
> assumes UTF-8. With this change, we'll get random behavior until
> everything is updated, which is a state worse than today in my view.
>
> Example:
> http://www.sw.it.aoyama.ac.jp/non-existent?é
>
> If the charset were iso-8859-1 then under RFC 3987 as I understand it, this
> would become:
>
> http://www.sw.it.aoyama.ac.jp/non-existent?%C3%83%C2%A9
>
> In other words, you have to convert iso-8859-1 to UTF-8 and then pct-encode
> the UTF-8.
>
> But as I understand 3987bis it would become:
>
> http://www.sw.it.aoyama.ac.jp/non-existent?%C3%A9
>
> which would then be passed around via various APIs and protocols that would
> not pass the charset along with it. As such it would be interpreted by the
> receiving code as pct-encoded UTF-8:
>
> http://www.sw.it.aoyama.ac.jp/non-existent?é
>
> which of course it isn't.
>
> As such, we should make the RFC 3987 behavior (UTF-8, NOT the doc charset)
> required for everything that doesn't explicitly pass the charset along with the
> URI.
>
> --
> -----------------------+--------------------------------------
> Reporter: stpeter@…    | Owner: draft-ietf-iri-3987bis@…
> Type: defect           | Status: new
> Priority: major        | Milestone:
> Component: 3987bis     | Version:
> Severity: -            | Keywords:
> -----------------------+--------------------------------------
>
> Ticket URL: <http://trac.tools.ietf.org/wg/iri/trac/ticket/131>
> iri <http://tools.ietf.org/wg/iri/>
>
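
The two percent-encoded strings in the quoted ticket's example can be reproduced with a short Python sketch. This is only one reading of the example: it assumes the query octets in the document are C3 A9 and that the document is labeled iso-8859-1. The function names rfc3987_style and raw_octet_style are hypothetical and do not come from either specification.

```python
from urllib.parse import quote, quote_from_bytes, unquote


def rfc3987_style(octets: bytes, doc_charset: str) -> str:
    """Decode with the document charset, re-encode as UTF-8, then pct-encode."""
    text = octets.decode(doc_charset)
    return quote(text, safe="", encoding="utf-8")


def raw_octet_style(octets: bytes) -> str:
    """Pct-encode the document's octets as-is, with no conversion to UTF-8."""
    return quote_from_bytes(octets, safe="")


# Assumption: these are the octets of the query component in the document.
query_octets = b"\xc3\xa9"

print(rfc3987_style(query_octets, "iso-8859-1"))  # %C3%83%C2%A9
print(raw_octet_style(query_octets))              # %C3%A9

# A receiver that is not told the charset decodes the second form as UTF-8.
print(unquote("%C3%A9") == "\u00e9")              # True
```

The last line illustrates the concern raised in the ticket: once the charset is no longer carried alongside the string, the receiver has nothing to go on except the UTF-8 default.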
Received on Tuesday, 24 July 2012 01:30:10 UTC