Re: charmod-uri from Jeremy Carroll on 2002-04-11 (w3c-rdfcore-wg@w3.org from April 2002)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Thu, 11 Apr 2002 18:13:28 +0200
To: <w3c-i18n-ig@w3.org>, <w3c-rdfcore-wg@w3.org>
Message-ID: <MABBLGKMPIJFCKFGDBEPKEKJCAAA.jjc@hplb.hpl.hp.com>

I just checked what the current drafts of charmod and IRI actually say
about normalization. The picture is still, IMO, a bit confused: charmod does
not mention normalization of IRIs explicitly, although the algorithm given
in the IRI spec conforms to charmod's early uniform normalization model.

Charmod:

[[[ from section 4.3
http://www.w3.org/TR/charmod/#sec-NormalizationApplication

[S] Specifications of text-based formats and protocols MUST, as part of
their syntax definition, require that the text be in normalized form.

]]]

[[[ Section 8
http://www.w3.org/TR/charmod/#sec-URIs

[S] W3C specifications that define protocol or format elements (e.g. HTTP
headers, XML attributes, etc.) which are to be interpreted as URI references
(or specific subsets of URI references, such as absolute URI references,
URIs, etc.) SHOULD use Internationalized Resource Identifiers (IRI) [I-D
URI-I18N] (or an appropriate subset thereof).

]]]

[[[ IRI Section 2.3
http://www.w3.org/International/2001/draft-masinter-url-i18n-08.txt

Part I is skipped if the
input is already in an UCS-based encoding (e.g. UTF-8 or UTF-16). In
that case, it is assumed that the IRI is already in NFC.

   Part I)

   1) Represent the IRI characters as a sequence of characters from the
      UCS.

   2) Normalize the character sequence according to Normalization Form
      C, as defined in [UNI15].  (See further discussion in Section
      3.1.)
]]]
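
A minimal sketch of Part I as quoted above, in Python. A Python str is
already a sequence of UCS characters, so step 1 is implicit; the function
name part_one and the example.org IRIs are made up for illustration and do
not come from either spec.

```python
import unicodedata


def part_one(iri: str) -> str:
    """Step 2 of Part I: normalize the character sequence to NFC."""
    return unicodedata.normalize("NFC", iri)


# "e" followed by U+0301 COMBINING ACUTE ACCENT versus precomposed U+00E9.
decomposed = "http://example.org/caf" + "e\u0301"
composed = "http://example.org/caf\u00e9"

# The two strings differ code point by code point...
assert decomposed != composed
# ...but are identical once normalized to NFC.
assert part_one(decomposed) == composed
```

Note that the quoted text skips Part I for input already in a UCS-based
encoding, so a consumer following it would not call part_one on UTF-8 input
and would simply assume the IRI is in NFC.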

Charmod would benefit from stating more explicitly that the IRI spec uses
the same normalization model.


The IRI draft raises an issue for test003, particularly in point (b) of:

[[[
2.3 Mapping of IRIs to URIs

This section defines how to map an IRI to a URI. Everything in
this section applies also to IRI references and URI references, as
well as components thereof (e.g. fragment identifiers).

This mapping has two purposes:

  a) Syntactical: Many URI schemes and components define additional
     syntactical restrictions not captured in Section 2.2. Such
     restrictions can be applied to IRIs by noting that IRIs are only
     valid if they map to syntactically valid URIs. This means that
     such syntactical restrictions do not have to be defined again
     on the IRI level.

  b) Interpretational: URIs identify resources in various ways. IRIs
     also identify resources. The resource that an IRI identifies is
     the same as the one identified by the URI obtained after
     converting the IRI according to the procedure defined here.
     This means that there is no need to define the association
     between identifier and resource again on the IRI level.
]]]
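
For concreteness, a rough sketch of the mapping in Python. The quoted
passage does not spell out the conversion steps, so this assumes the usual
approach of encoding each non-ASCII character as UTF-8 and writing each
octet as %HH; the function name iri_to_uri and the example.org IRI are
hypothetical. It deliberately does no normalization, since Part I is
skipped for UCS-based input.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 octets of every non-ASCII character.

    Assumption: ASCII characters pass through untouched, including any
    '%' that is already part of an escape sequence.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


iri = "http://example.org/caf\u00e9#frag"
uri = iri_to_uri(iri)
assert uri == "http://example.org/caf%C3%A9#frag"
```

Under point (b), iri and uri here would identify the same resource, even
though they are different strings.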

This seems to suggest that we should do the mapping before the model theory,
which is in tension with the usual refusal to normalize URIs for scheme
case, hostname case, port number, missing default path, or anything else,
except as part of actually executing the protocol.
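
The tension can be sketched in Python, assuming identifiers are compared as
plain strings with no normalization at all. The helper same_node and the
example.org identifiers are hypothetical, and the mapped form assumes UTF-8
percent-encoding of the non-ASCII character.

```python
def same_node(a: str, b: str) -> bool:
    """Compare two identifiers with no normalization of any kind."""
    return a == b


iri = "http://example.org/caf\u00e9"
mapped = "http://example.org/caf%C3%A9"        # assumed result of the mapping
upper = "HTTP://EXAMPLE.ORG/caf%C3%A9"         # differs only in scheme/host case

# Without any mapping, the IRI and its URI form are distinct identifiers,
# although point (b) says they identify the same resource.
assert not same_node(iri, mapped)

# Case differences in scheme and hostname are likewise left alone.
assert not same_node(mapped, upper)
```

Applying the IRI mapping before comparison would merge the first pair while
still leaving the second pair distinct, which is the asymmetry at issue.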

The IRI draft is also potentially self-inconsistent here, since it also says:

[[[
However, this mapping SHOULD only be applied when necessary, as late
as possible.
]]]

Jeremy
Received on Thursday, 11 April 2002 12:06:37 UTC