Re: Fwd: proposed text on IRIEverywhere-27 from Martin Duerst on 2003-02-17 (www-international@w3.org from January to March 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 17 Feb 2003 15:59:56 -0500
To: www-tag@w3.org
Cc: Michel Suignard <michelsu@microsoft.com>, www-international@w3.org
Message-Id: <4.2.0.58.J.20030217155807.053a72c8@localhost>
Just to clarify for people who might have misunderstood the
two algorithms below: I intended them not as 'choose one or
the other, but only one', but as 'choose one or the other
depending on the operation' (i.e. namespace comparison or
retrieval).

Regards,   Martin.


At 13:14 03/02/03 -0500, Martin Duerst wrote:

>This is the text that I sent to Chris relating to the
>action item from the TAG, quite a while ago.
>
>>Date: Mon, 02 Dec 2002 14:14:35 +0900
>>To: Chris Lilley <chris@w3.org>
>>From: Martin Duerst <duerst@w3.org>
>>Subject: proposed text on IRIEverywhere-27
>>Cc: w3t-archive
>>
>>Hello Chris,
>>
>>Here is some text that we may want to use for our action items.
>>
>>The text to put in depends on the operations the spec in question
>>is doing with IRIs. At the moment, we are covering two operations,
>>'equivalence' comparison (for some definition of equivalence) and
>>resolution.
>>
>>In all cases, the proposed texts can be used more or less
>>using cut-and-paste, but please check carefully to adjust
>>details (wording, terms, level of detail, references)
>>where necessary.
>>
>>
>>For equality comparison, assuming %7e != %7E != ~ :
>>
>>In order to check whether two IRIs match according to
>>this kind of equivalence, proceed according to the
>>following steps:
>>
>>- Represent the two IRIs as a string of characters from the
>>   UCS (Universal Character Set, [ISO10646]/[Unicode]).
>>   For IRIs taken from an XML document, the 'IRI as a string
>>   of characters' refers to the sequence of character information
>>   items in the infoset (i.e. after parsing). For IRIs taken
>>   from other contexts, define/use something similar.
>>- Compare the two strings character by character (without using
>>   any additional equivalences, e.g. case equivalences, i.e.
>>   comparing codepoint-to-codepoint). If you find any difference,
>>   the two IRIs do not match. If you find no differences,
>>   the two IRIs match.
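
[Editorial sketch, not part of the original message.] The strict
comparison above can be illustrated in Python, assuming both IRIs are
already available as Python `str` values (sequences of Unicode code
points, e.g. as delivered by an XML parser). The function name
`iris_match_exact` is hypothetical and chosen only for this example.

```python
def iris_match_exact(iri_a: str, iri_b: str) -> bool:
    """Compare two IRIs code point by code point.

    No case folding, no Unicode normalization and no %-unescaping
    is applied, so %7e, %7E and ~ are all treated as different.
    """
    # Python's == on str compares the underlying code point sequences.
    return iri_a == iri_b


if __name__ == "__main__":
    print(iris_match_exact("http://example.org/~user", "http://example.org/~user"))
    print(iris_match_exact("http://example.org/%7euser", "http://example.org/%7Euser"))
    print(iris_match_exact("http://example.org/%7Euser", "http://example.org/~user"))
```

Under this kind of equivalence only the first pair matches; the other
two pairs differ in at least one code point.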
>>
>>
>>
>>For equality comparison, assuming %7e == %7E == ~ :
>>
>>In order to check whether two IRIs match according to
>>this kind of equivalence, proceed according to the
>>following steps (or any procedure that produces the
>>same results):
>>
>>- Represent the two IRIs as a string of characters from the
>>   UCS (Universal Character Set, [ISO10646]/[Unicode]).
>>   For IRIs taken from an XML document, the 'IRI as a string
>>   of characters' refers to the sequence of character information
>>   items in the infoset (i.e. after parsing). For IRIs taken
>>   from other contexts, define/use something similar.
>>- For each of the two strings obtained above, calculate
>>   an 'escaped string' as described in the following:
>>   - Separate the string into groups. A group consists of
>>     either a '%' and the following two characters (a %-group),
>>     or of a single character that is not part of a %-group.
>>   - For each group, do the following:
>>     - If the group is a %-group, convert all letters between
>>       'A' and 'F' to their lowercase equivalents.
>>     - If the group is not a %-group, and if the character is
>>       one of the following 14 characters, then use that character
>>       directly:      % # [ ] ; / ? : @ & = + $ ,
>>       (This will escape characters such as:
>>          SPACE, < > " { } | \ ^ `
>>        It is currently not clear whether these will be allowed
>>        as parts of IRIs, but whether they get escaped or not
>>        will not affect the result of the comparison operation
>>        if they are not allowed and therefore don't appear in
>>        input.)
>>     - If the group is not a %-group, and the character is not
>>       listed in the previous clause, then encode the character
>>       into a sequence of bytes using UTF-8, and then convert
>>       each of these bytes into a sequence of a '%' character
>>       followed by two hexadecimal characters together expressing
>>       the hexadecimal value of the byte. Use the letters 'a' - 'f'
>>       (lower case).
>>     (Note: different ways of escaping/unescaping may be chosen
>>      by an implementation, but if this is done, care has to be
>>      taken that all different forms of escaping are mapped to
>>      the same output.)
>>   - Concatenate the result of converting each group, in the order
>>     of the original groups, to obtain the escaped string.
>>   (The escaped strings are to be used just for the comparison
>>   in the next step below. They may be stored to be reused in
>>   subsequent comparisons, but they must not be used for any
>>   other purpose, and must not be exposed.)
>>- Compare the two escaped strings obtained in the previous
>>   step character by character (without using
>>   any additional equivalences, e.g. case equivalences, i.e.
>>   comparing codepoint-to-codepoint). If you find any difference,
>>   the two IRIs do not match. If you find no differences,
>>   the two IRIs match.
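
[Editorial sketch, not part of the original message.] The escaping
procedure above might be written in Python roughly as follows. The
names `escaped_string`, `iris_match_escaped`, `KEPT` and `LOWER_HEX`
are hypothetical. Two points are assumptions, because the text does
not settle them: a '%' followed by fewer than two characters is
treated as a single character and kept as-is, and the two characters
after a '%' are not checked for being hexadecimal digits.

```python
# The 14 characters that are copied through unchanged.
KEPT = frozenset("%#[];/?:@&=+$,")
# Lowercases only the letters 'A'-'F', as the text requires for %-groups.
LOWER_HEX = str.maketrans("ABCDEF", "abcdef")


def escaped_string(iri: str) -> str:
    """Build the comparison-only 'escaped string' for one IRI."""
    parts = []
    i, n = 0, len(iri)
    while i < n:
        ch = iri[i]
        if ch == "%" and i + 3 <= n:
            # %-group: '%' plus the following two characters.
            parts.append(iri[i:i + 3].translate(LOWER_HEX))
            i += 3
        elif ch in KEPT:
            # One of the 14 listed characters (assumption: this also
            # covers a trailing '%' that cannot form a %-group).
            parts.append(ch)
            i += 1
        else:
            # Any other character: UTF-8 bytes, each as %xx in lowercase hex.
            parts.append("".join(f"%{b:02x}" for b in ch.encode("utf-8")))
            i += 1
    return "".join(parts)


def iris_match_escaped(iri_a: str, iri_b: str) -> bool:
    """Compare two IRIs so that %7e, %7E and ~ are equivalent."""
    # The escaped strings are used for this comparison only.
    return escaped_string(iri_a) == escaped_string(iri_b)


if __name__ == "__main__":
    print(escaped_string("~"), escaped_string("%7E"))
    print(iris_match_escaped("http://example.org/%7Euser",
                             "http://example.org/~user"))
```

Note that every character outside the list of 14 is escaped, including
plain ASCII letters, so the escaped strings are only meaningful as
comparison keys, in line with the restriction that they must not be
exposed.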
>>
>>
>>- Text for resolution: to be done.
>>
>>
>>Regards,    Martin.
Received on Monday, 17 February 2003 19:54:29 UTC