draft minutes from IRI BoF at IETF 76 from Peter Saint-Andre on 2009-11-25 (public-iri@w3.org from November 2009)

From: Peter Saint-Andre <stpeter@stpeter.im>
Date: Tue, 24 Nov 2009 19:44:15 -0700
To: public-iri@w3.org
Message-ID: <4B0C99FF.3030504@stpeter.im>

These are draft minutes from the "Birds of a Feather" session on IRIs
held at IETF 76 in Hiroshima, Japan on November 10, 2009. If you have
changes, please send them to the BoF chairs so that they can upload
final minutes.

/psa

========================================================================

IRI BoF Minutes
IETF 76
Tuesday, November 10, 2009, 1520-1700
(Afternoon Session II -- Cattleya East)

Chairs: Ted Hardie and Pete Resnick

Jabber Scribes / Note Takers: Pete Resnick and Peter Saint-Andre

Minutes Editor: Peter Saint-Andre

Agenda:

<http://www.ietf.org/proceedings/09nov/agenda/iri.html>

Slides:

<https://datatracker.ietf.org/meeting/76/materials.html#wg-iri>

Audio:

<ftp://videolab.uoregon.edu/pub/videolab/media/ietf76/ietf76-ch4-tue-afnoon2.mp3>

Chat Log:

<http://www.ietf.org/jabber/logs/iri/2009-11-10.txt>

Dramatis Personae

AR = Adam Roach
AM = Alexey Melnikov
BL = Barry Leiba
JH = Joe Hildebrand
JK = John Klensin
LD = Lisa Dusseault
LM = Larry Masinter
MD = Martin Duerst
MS = Michael Smith
PR = Pete Resnick
SC = Stuart Cheshire
TH = Ted Hardie

========================================================================

IETF NOTE WELL statement reiterated.

Agenda bashing.

TH: We are trying to restore interoperability to a part of the Internet
infrastructure where it has been lost. URI mechanism is one of the most
important pieces of the application space. URIs were originally designed
to be (1) under the hood and (2) ASCII. Assumption that IRIs were a way
of presenting data that was really represented in a URI. That assumption
has changed to make IRIs more of a first-class citizen. XML being a prime
example. Now there are no fewer than 9 communities working on IRIs (W3C,
etc.). Not trying to add a 10th.

JK: Counter-theory is that IRIs have been a disaster for
internationalization.

LM: There is a horrible mess, but there is the possibility of
making things a little bit better.

[slide] History of prior IRI specifications

[slide] Review of current documents

- draft-duerst-iri-bis-07
- draft-duerst-mailto-bis-07
- RFC 4395

Other documents explicitly left out of scope.

Issues:

1. IRI as protocol element vs. mapping of IRI to URI. Until now, IRI
has been defined as a sequence of Unicode characters that's converted
into a URI by translating to UTF-8 and then percent-encoding. The
meaning of the IRI was to be exactly the URI to which it was mapped.
Later, it became clear that most implementations parsed the UTF-8 and
translated to hex-encoding only if necessary. We also found that some
applications were using IRIs as strings for namespaces (e.g., XML
namespaces). In other words, applications were using IRIs directly as
protocol elements.

2. Normative reference to IDNA. The IRI spec defined translation of
domain name components using IDNA, but IDNA is under transition.

3. Different levels of "liberal processing". IRIs that weren't
actually valid were accepted in places like HTML documents. Two levels
here: one defined by XML community, another by the HTML community (i.e.,
what browsers currently implement).
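
Illustration of the mapping described in issue 1 (a minimal sketch, not
taken from the drafts under discussion): each non-ASCII character is
converted to UTF-8 and the resulting bytes are percent-encoded, while
ASCII characters pass through untouched. The function name
iri_to_uri_sketch and the example.org address are assumptions made for
illustration, and the sketch deliberately ignores the IDNA handling of
the host raised in issue 2.

```python
def iri_to_uri_sketch(iri: str) -> str:
    """Map an IRI to a URI by percent-encoding UTF-8 bytes.

    Hypothetical helper: ASCII characters (including URI delimiters
    and existing percent-escapes) are copied unchanged; every other
    character is replaced by the %XX form of its UTF-8 bytes.
    The host component gets no special (IDNA) treatment here.
    """
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


print(iri_to_uri_sketch("http://example.org/résumé?q=日本"))
```

With this input the sketch prints
http://example.org/r%C3%A9sum%C3%A9?q=%E6%97%A5%E6%9C%AC.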

[slide] Other documents and committees

- HTML5 work in WhatWG and W3C HTML WG
- IETF IDNABIS WG
- IETF EAI WG

TH: Start introducing discussion. In particular, let's have comments and
questions on IRI as protocol element.

MD: We still need to make sure that the conversion from IRI to URI is
well-defined.

PR: Is everybody OK with movement from presentation layer to protocol
element?

AM: Is there any difference on the wire? If so, it would be nice to show
some examples.

LM: There are protocols and protocol elements. That's why we came up
with IRIs vs. URIs in the first place.

AM: That was not my question. Will conversion processes described in
different versions of the spec produce the same data on the wire?

LM: No, because formerly the conversion was Unicode to UTF-8, then
hex-encoded. It was also listed as an option to convert Unicode to
punycode. This would give you two processing
paths, because you might end up with hex-encoded UTF-8 *or* ASCII
punycode. If you then took a URI that had percent-encoding and passed it
to a non-Unicode-aware resolver, it would give you different results
than if you had passed it a punycode hostname in the URI.
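
The two processing paths LM describes can be seen side by side in a
small sketch. It assumes Python's built-in "idna" codec (which
implements IDNA 2003, not the IDNABIS revision) as a stand-in for the
punycode path; the hostname and both function names are hypothetical.

```python
from urllib.parse import quote


def host_percent_encoded(host: str) -> str:
    """Path 1: encode as UTF-8, then percent-encode the non-ASCII bytes."""
    return quote(host, safe="")


def host_punycode(host: str) -> str:
    """Path 2: convert each label to its ASCII (punycode) form.

    Uses the standard library "idna" codec, i.e. IDNA 2003 rules.
    """
    return host.encode("idna").decode("ascii")


host = "bücher.example"
print(host_percent_encoded(host))  # b%C3%BCcher.example
print(host_punycode(host))         # xn--bcher-kva.example
```

The same Unicode hostname yields two different ASCII strings, so a
resolver that is not Unicode-aware sees two unrelated names.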

MD: RFC 3987 didn't explain that very well. But you don't know if
underneath you are dealing with DNS or some other kind of system. This
needs to be an open issue.

JK: The plenary on Thursday night will discuss these issues as well. The
argument there will be that it's not a good idea to have two different
encodings of the same information (UTF-8 and punycode). It's an even
worse idea to have three (UTF-8, UTF-8 hex-encoded, and punycode).

TH: That would appear to be opposed to deployed reality. Certainly we
have these IRIs/URIs in content (e.g., HTML files), not just on the wire.

JK: One of the problems here is that we're showing many signs of digging
a deep hole.

TH: So stop digging?

JK: Start digging a hole over which we have more control.

JH: I want to make sure there's an encoding that doesn't require
conversion to punycode or hex-encoding.

PR: You mean UTF-8?

JH: XMPP is all UTF-8, it would be nice if we could just use that.

LM: URIs are a sequence of characters, not bytes. That needs to be
encoded as UTF-8 or UTF-16 or whatever if you want a sequence of bytes
instead of characters.

SC: We'll discuss this on Thursday night at the plenary. Unicode
identifies characters, but you need to encode them.
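
A minimal sketch of the character/byte distinction raised by LM and SC:
the same one-character string becomes different byte sequences
depending on the encoding chosen. The helper name byte_forms is an
assumption made for illustration.

```python
def byte_forms(text: str) -> dict:
    """Return the bytes produced by two common Unicode encodings."""
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-be": text.encode("utf-16-be"),
    }


forms = byte_forms("é")  # a single character, U+00E9
print(forms["utf-8"].hex())      # c3a9
print(forms["utf-16-be"].hex())  # 00e9
```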

AR: I hear that we're trying to make changes based on implementation,
but are we trying to get rid of IDNA?

LM: No, we're discussing these issues with implementors, not ignoring
implementation reality.

MD: I agree that two representations are bad, and three are worse.
Problem is that domain names don't appear in an IRI or URI only in the
authority component; they can appear elsewhere (e.g., in query string).

TH: The reality is that things are very messy. Application developers
try very hard to get people where they want to go, even if that input is
not valid. Even if we came up with a totally new identifier, that would
not fix what's out there. My take is that scrapping what we have now and
starting over does us no good, because it might be completely better but
completely undeployed. We need to find something in the middle, because
leaving things as they are right now doesn't help.

MS: HTML5 spec has ended up including text about (1) error handling for
URIs and (2) character encodings in query strings. We hope to have spec
text that we can normatively reference in HTML5 so that implementors can
do the right thing going forward.

TH: Was there discussion of other URI schemes at that point (in the W3C
or WhatWG), or was it limited to HTTP?

MS: I believe it was limited to HTTP.

MS: The "goals" page linked to from the draft charter is very useful and
I'd like to see those concerns addressed.

[

Meeting Editor's Note:

  http://trac.tools.ietf.org/area/app/trac/wiki/DraftIriCharter
  http://trac.tools.ietf.org/area/app/trac/wiki/IriWorkGoals

]

JK: Ted, I agree with you, but URI syntax is currently employed as a
user interface, and I think we'll need to move away from that
eventually. I'd be happy if we could deprecate ASCII-only URIs.

TH: Do we have consensus to do the work necessary to deprecate
ASCII-only URIs?

JK: I would define it differently. Some URIs are not user-facing but
instead are network-facing only, so we haven't felt the need to define
internationalized versions of those.

TH: Yes, the browser location bar screwed us royally because HTTP URIs
were originally supposed to be hidden behind hyperlinks in web pages.

JK: The choices are either looking at things on a scheme-by-scheme
basis, or solving the problem globally for all schemes.

TH: How much work are people willing to put in? Because that's a lot of
work.

MD: I've already put in a lot of time, but I'm willing to continue
working. The deployed content that is not compliant is a small
percentage, but the mountain of content is extremely large.

LM: A few points. draft-duerst-iri-bis-07 recommends (1) renaming the
existing IANA registry of URI schemes to be a registry of URI/IRI
schemes, (2) adding as a requirement for new schemes to define the
non-ASCII characters that are appropriate for the scheme, and (3)
reviewing all the existing schemes for their appropriateness as IRI
identifiers, where the default is that it's an old-style URI. This would
go a long way to making all of the identifiers into IRIs.

TH: I remember an effort to bring all the URIs up to date, and we burned
out three people in the process. Now it would be just as much work, or
more.

MD: Say there is a URI scheme for IP addresses, then it's just numbers
and we don't need any internationalization. On the other hand, for
something like mailto you can only use ASCII because it is so old.

JK: If we had a URI scheme for IP addresses, you can be sure that we
would hear calls for encoding those digits in a localized version,
instead of in "Arabic numerals".

LM: I think we have enough problems without imagining new ones.

BL: Why is this not a presentation issue instead of a protocol issue?

TH: Barry, that was the theory, but it failed the test of implementation.

BL: As we discover that more things need internationalization in the
presentation layer, can't we just say that applications need to become
better at presentation?

AR: If I understand correctly that we'd go back and convert old URIs to
IRIs, I would suggest that the effort to do that for SIP alone would be
enormous.

LM: Currently you could hex-encode UTF-8.

AR: Non-ASCII characters are not allowed in SIP URIs. But SIP has messed
this up because there is no normative text about it. I would be stunned
if we're the only protocol in that boat.

LM: So either it works or it doesn't. If it doesn't, it's ASCII only
until someone fixes it. The registry would enable that to be defined.

AR: So you would be defining a framework, not fixing each scheme.

LM: Right. A framework that says it's either like IRIs now, like URIs
now, or some other definition.

TH: To Barry's point, we've never been able to force people to layer
things correctly in their applications. The danger is that we're going
to go back into defining human-friendly names. The reality is that the
protocol elements will bleed into the human side of things. But we can't
be so liberal that things break if there are no humans involved, e.g. if
the data is provided to a lower layer.

LM: Preventing bad things from happening to humans is a high priority --
dealing with issues of ambiguity and reliability. Let me speak a bit
about the bleeding of protocol elements into human interfaces, because a
big part of the Internet economy came from being able to advertise your
domain name -- which was bleeding of a protocol element into user space.

BL: I disagree, because if things worked right then a Chinese or
Japanese or Cyrillic name could be converted correctly into a protocol
element.

LM: I'm confused. The use of i18n identifiers works and it's deployed.
There are just a few issues around the edges. The problem is that we
need to address those issues in a coordinated fashion.

PR: The presentation layer leaks into protocol elements. Larry said IRIs
are used as protocol elements, but they are not encoded. And percent
encoded representations provide i18n, too. What we want to get away from
is conflating i18n with a particular encoding. E.g., people use UTF-8,
so what's being argued for is to standardize that usage by saying that
internationalized identifiers are to be encoded in UTF-8.

LM: Sorry that I was not specific enough. Where they are deployed is in
HTML documents.

PR: But those are not in UTF-8, they are in ISO 8859-foo. Do you want
these identifiers to be represented in *any* encoding? We need to be
careful about which path we're on.

LM: This isn't something that I *want*, it's what *is*. There is a lot
of software and content that treats a sequence of characters in the
encoding of that document, converts that into Unicode (usually UTF-16
but perhaps also UTF-8), and uses the result as a URI/IRI.
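
The processing LM describes can be sketched as follows, assuming a
document stored in ISO 8859-1 and the standard library function
urllib.parse.quote. The sample bytes and function names are
hypothetical; the second function shows what happens if the legacy
bytes are percent-encoded directly, without going through Unicode.

```python
from urllib.parse import quote


def via_unicode(raw: bytes, doc_encoding: str) -> str:
    """Decode document bytes to Unicode, then percent-encode as UTF-8."""
    chars = raw.decode(doc_encoding)
    return quote(chars)


def via_legacy_bytes(raw: bytes) -> str:
    """Percent-encode the document's bytes as they are stored."""
    return quote(raw)


raw = b"caf\xe9"  # "café" as stored in an ISO 8859-1 document
print(via_unicode(raw, "iso-8859-1"))  # caf%C3%A9
print(via_legacy_bytes(raw))           # caf%E9
```

The two results differ, which is why the choice of encoding has to be
pinned down rather than left to each application.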

BL: The inclusion of this in an HTML document does not make this into a protocol
element.

TH: Some people think they are protocol elements. Both views might be
correct. We can get sidetracked into which encodings we prefer. One of
the goals here is to minimize the number of translations that occur
between applications. Make it as simple as possible, but not simpler.
Don't simplify identifiers; reduce the number of iterations of
translation in any given protocol handoff. Not easy, but different from
what was just described.

PR: Let's reframe. In RFC 822, in the header an address is a protocol
element. In the body, the address is a protocol element in text.

BL: This is going in the wrong direction.

TH: Does anyone think we don't have a problem here?

[laughter]

BL: I'm not saying we shouldn't do this, only that we need to scope it.

[slide] Charter review

LM: There's a draft charter, perhaps we can review that? Can we address
the problem only by working on these three documents?

- draft-duerst-iri-bis-07
- draft-duerst-mailto-bis-07
- RFC 4395

Do we need to work on more? Can we get away with working on even fewer?

JK: First, I don't think we can narrow the scope to this. We need to
look at the impact on all schemes, not just HTTP URIs. Second, I'm leery
of having this WG try to fix mailto, especially in the context of EAI,
because mailto needs to be fixed by people who understand mail.

MD: The mailto draft says very little about EAI. I'm not an expert on
all the details of how you do escaping in email addresses and the
like, so we need people who know about that to provide comments.

AM: I'm happy about reducing the scope because success is good, but the
mailto/EAI interaction needs to be addressed. However, if this WG will
focus only on HTTP then it might make choices that are not generic
enough.

LM: The intent is not to focus on HTTP.

AM: But HTTP is the main application. Maybe the concern lacks a basis.

LM: The charter mentions explicit coordination with other groups. Not
meant to be an exclusive list, and mainly to coordinate requirements.

JK: We have once again fallen into the habit of always mentioning the
example of web browsers. I agree with Alexey that we need more
examples: at least one more, but three more would be even better. But
mailto is probably not a good example.

TH: Maybe look at URNs. They are similar but different enough to be
useful for this work.

MD: RFC 3987 references POP, IMAP, data URI scheme, URN, etc. The idea
to look at other schemes is important. But do we put out a new spec for
those?

LM: I don't think we need to update data: URIs.

MD: We need to look at other schemes, but we don't know which until we
investigate them in detail.

JH: I do think XMPP is a good example because it is more recent. I think
we also need recommendations for people defining new schemes. We can
discuss in the XMPP WG whether more formal cooperation is needed.

LM: So maybe take mailto out and put others in.

AM: Completing the mailto update can happen elsewhere.

LM: There is discussion in the charter of perhaps splitting the
documents, e.g., move everything about domain names into a separate
document. Also perhaps a separate BCP for informational text about why
some characters are problematic and others are not.

Chairs begin asking questions of the room...

TH: HUM Is there a problem here for the IETF to solve?

Hum "yes" -- many
Hum "no" -- silence
Hum "not enough information to decide" -- silence

TH: Raise your hand if you are willing to:
- be on a mailing list to discuss these issues
- review documents
- replace me and Pete at the front of the room

TH: A fair number of volunteers on the first two.

TH: I'm going to ask for two directional questions about the charter....

TH: HUM Should it be within the charter to scrap the existing approach
from RFC 3987 and start over with an entirely new approach?

Hum "yes" -- about one-third.
Hum "no" -- about one-third.
Hum "not enough information to decide" -- about one-third.

TH: HUM Should the charter include an explicit list of schemes to review?

Hum "yes" -- about one-third.
Hum "no" -- about one-third.
Hum "not enough information to decide" -- about one-third.

Clarifying question from the mic: would that include making it a
minimum list, not a limiting list?

LD: Can we ask if this is a deal-breaker?

TH: I don't think we need to go there now, let's clarify the charter
first.

TH: Re-hum?

JK: I don't care which specific schemes are selected for review, just as
long as the WG reviews multiple schemes.

TH: HUM Must we decide specifically which schemes need to be investigated
*before* the WG is chartered, or can the WG sort that out?

Hum "yes" -- none for "before"
Hum "no" -- many for "WG can sort it out"

TH: Must the charter specify a minimum number of schemes to investigate,
or can the WG sort that out?

Hum "yes" -- one hum for "minimum number"
Hum "no" -- many for "WG can sort it out"

TH: So we seem to have consensus that there are volunteers to work on
IRIs at the IETF. Thanks!

END

========================================================================
Received on Wednesday, 25 November 2009 02:44:51 UTC