Re: replacing all URIs with IRIs [charmodReview-17] from Martin Duerst on 2002-05-24 (www-tag@w3.org from May 2002)

From: Martin Duerst <duerst@w3.org>
Date: Sat, 25 May 2002 08:11:43 +0900
To: Aaron Swartz <me@aaronsw.com>, www-tag@w3.org
Message-Id: <4.2.0.58.J.20020525073222.02feb188@localhost>

Hello Aaron,

First some procedural points, starting with the end
of your mail:

>I'm considering appealing this decision,

The Character Model is in last call, so you can raise a comment.
If you think this is important to you, please use the form at
http://www.w3.org/2002/05/charmod/LastCall to make sure we get
your comment.

If the I18N WG then decides to reject your comment, you can
appeal that decision at that stage.


>but I wanted to hear the TAG's position first,

If you think it's important, you should send in a comment yourself.
And the WG should treat this comment as seriously as any
other, on its merits. If the TAG has the same issue, they
should send it in themselves.


At 14:58 02/05/24 -0500, Aaron Swartz wrote:
>I would like to draw the TAG's attention to this requirement in charmod:
>
>"""
>W3C specifications that define protocol or format elements (e.g. HTTP 
>headers, XML attributes, etc.) which are to be interpreted as URI 
>references (or specific subsets of URI references, such as absolute URI 
>references, URIs, etc.) SHOULD use Internationalized Resource Identifiers 
>(IRI) [I-D IRI] (or an appropriate subset thereof).
>"""
>  - http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-URIs
>
>RDF, for example, has recently moved to replace URIs with IRIs (or 
>something like them).

Strictly speaking, the RDF Core WG is working on clarifying the following
note in the RDF Model and Syntax Specification:
(http://www.w3.org/TR/REC-rdf-syntax/)

 >>>>
  Note: Although non-ASCII characters in URIs are not allowed by [URI], 
[XML] specifies a convention to avoid unnecessary incompatibilities in 
extended URI syntax. Implementors of RDF are encouraged to avoid further 
incompatibility and use the XML convention for system identifiers. Namely, 
that a non-ASCII character in a URI be represented in UTF-8 as one or more 
bytes, and then these bytes be escaped with the URI escaping mechanism 
(i.e., by converting each byte to %HH, where HH is the hexadecimal notation 
of the byte value).
 >>>>
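
The convention in that note can be sketched in a few lines of Python.
This is only an illustration, not text from any specification: the
function name escape_non_ascii and the example.org address are made up
for the example. It leaves ASCII characters alone, encodes every other
character as UTF-8, and writes each resulting byte as %HH.

```python
from urllib.parse import unquote


def escape_non_ascii(iri: str) -> str:
    """Escape only the non-ASCII characters, as UTF-8 bytes written %HH.

    Hypothetical helper for illustration; ASCII characters (including
    any '%' already present) are copied through unchanged.
    """
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # One character may become several bytes, hence several %HH.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


# Made-up example address with two accented characters.
readable = "http://example.org/r\u00e9sum\u00e9"
escaped = escape_non_ascii(readable)
print(escaped)

# The standard library reverses the escaping, decoding as UTF-8 by default.
print(unquote(escaped) == readable)
```

The readable form and the escaped form identify the same thing; the
escaped one is simply what gets handed to software that only accepts
ASCII.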


>I find this seriously problematic since it will break many utilities which 
>have made the assumption that RDF identifiers are ASCII strings with no 
>spaces, etc.

I have heard such things in other contexts years ago.
I very much hope the Semantic Web is leading technology,
not just repeating past errors.

Making things work without arbitrary restrictions is easy if
you do it from the start, and not too difficult to fix later
in most cases.


>I can understand presenting strings this way for user-display and 
>user-entry but storing them this way and making them the official encoding 
>seems to be going too far.

XML can 'store' them without problems. N3 also should be able to do it.


>I would think that simply using UTF-8 %-encoding would be fine for these 
>purposes.

Why do you think so? Do you think it would make sense to replace
     mailto:me@aaronsw.com
with something like
     mailto:%6d%65@%61%61%72%6f%6e%73%77.%63%6f%6d
or maybe even more appropriately, with something like the above
but using Greek letters instead of Latin ones? This is just about
how people who use a script other than Latin in their day-to-day
work would feel. Why should they have to use special tools
(which have to do syntax analysis to figure out where a % is an
escape character and where it is not, ...) just to be able to
read the text, just because some tools make overly restrictive
assumptions?
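
To make the comparison concrete, here is a rough Python sketch of what
that would look like. The helper name escape_letters is invented for
the illustration. It escapes every letter and digit of the address the
way non-Latin text would have to be escaped, then shows that a tool is
needed to get the readable form back.

```python
from urllib.parse import unquote


def escape_letters(text: str) -> str:
    """Write every letter or digit as %hh escapes of its UTF-8 bytes.

    Hypothetical helper for illustration; punctuation such as '@' and
    '.' is left as it is.
    """
    return "".join(
        "".join(f"%{b:02x}" for b in ch.encode("utf-8")) if ch.isalnum() else ch
        for ch in text
    )


address = "me@aaronsw.com"
escaped = "mailto:" + escape_letters(address)
print(escaped)

# Nobody reads this by eye; it takes software to undo the escaping.
print(unquote(escaped))
```

Both strings carry the same information. Only one of them can be read,
typed, or checked by a person without help.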


>An example from the RDF test cases shows an HTTP URI with embedded 
>accented characters in Unicode.

Great. I think we should get a few more examples, with different
scripts. I'd be glad to contribute. That way, tools will quickly
be upgraded.


Regards,    Martin.
Received on Friday, 24 May 2002 19:12:45 UTC