- From: by way of Martin Duerst <timothy@hplb.hpl.hp.com>
- Date: Sat, 04 Sep 2004 10:58:51 +0900
- To: uri@w3.org
Hi Martin,

Thanks for the detailed comments. I'll look over all the suggested
clarifications to the text in general but it's the URI/IRI issue that
most concerns me. Every time I think I've understood the right way to
treat this issue, you say something that suggests otherwise. It's
happened again.

In a previous exchange with you I wrote:

> I was looking for a way for tag to be internationalisation-compatible
> while adding as little as possible to the spec except in the way of
> external references to internationalisation specs. I had a closer look
> at draft-duerst-iri-09 with that in mind.
>
> In Section 3 intro and 3.1, you distinguish identifiers that are used
> for resource retrieval from those that aren't. Then in 5.1 you make
> the distinction again, in the context of string comparison.
>
> Well, tags are identifiers that are *not* used for resource retrieval.
> So it seems to me that we fall squarely into the class of identifiers
> for which it is "not necessary to map the IRI to a URI".
>
> This matches my conception of how to treat tags from an
> internationalisation perspective: they always and only appear in their
> IRI forms. So a Chinese tag would look like the example I sent
> previously -- as a string of Chinese characters (with our separators in
> between). There is no need to map that into a (2396bis) URI.
>
> So I propose to add a little text on Internationalisation as an
> addendum to our syntax, referring to your IRI draft and saying that our
> domain name component may be replaced by an IDN (refer to RFC3490);
> that, when the left-hand side of email addresses gets an international
> standard, that could be used instead; and that the "specific" part of
> the tag may be any string of "ipchar" (your draft).
>
> I don't think I need to mention percent-encoded UTF-8 (or such) at all.
> I know the emphasis in the syntactic detail is then rather one-sided,
> but I'm trying to be pragmatic.

In response to that, you seemed to agree with me.
But in your comments on draft 06 below you have put pct-encoded syntax
in!

What is it I'm missing in thinking that (URI) tags containing
pct-encoded characters are:

(a) self-defeating -- tags are supposed to be tractable for humans

(b) redundant -- it's never necessary to turn a tag containing, say,
    Chinese characters into URI form; we need be sure only that it's in
    canonical form and thus comparable with other tags.

Cheers,

Tim.

Martin Duerst wrote:
>Hello Tim,
>Finally I get around to comment on the newest version of your TAG draft,
>a pre-draft at http://taguri.org/06/draft-kindberg-tag-uri-06.txt.
>The main comment is that you try to have two separate definitions,
>one for TAG URIs and the other for TAG IRIs, but that isn't how the
>URI spec and the IRI spec work. For further background, please also
>see the issue and discussion at
>http://www.w3.org/International/iri-edit#iri-scheme-38
>I also give some comments on general issues that I found, mostly
>editorial.
>
>At 13:27 04/08/24 +0900, Martin Duerst wrote:
>
>>Network Working Group                                     T. Kindberg
>>Internet-Draft                            Hewlett-Packard Corporation
>>Expires: January 27, 2005                                    S. Hawke
>>                                            World Wide Web Consortium
>>                                                        July 29, 2004
>>
>>
>>                        The 'tag' URI scheme
>>                      draft-kindberg-tag-uri-06
>
>[snip] [also snipped all page breaks]
>
>>Abstract
>>
>>   This document describes the "tag" Uniform Resource Identifier (URI)
>>   scheme,
>
>This comma is somewhat confusing. It's probably best to end the sentence
>here and integrate the points in the remaining clause into the rest of
>the paragraph.
>
>>for identifiers that are unique across space and time. Tag
>>   URIs (also known as "tags") are distinct from most other URIs in that
>>   there is no authoritative resolution mechanism. A tag may be used
>>   purely as an entity identifier. Unlike UUIDs or GUIDs
>
>Abbreviations shouldn't appear without expansion. (see RFC guidelines)
>Also, there should be references for these terms, but referencing
>doesn't fit well into an abstract.
>I'd concentrate on the description
>of tags themselves in the abstract, in positive terms (what tags do,
>not what they don't), and put comparison with other schemes into a
>section in the body of the document, with references.
>
>>such as "uuid"
>
>So the uuid scheme is a UUID? Or a GUID? Or both? Some readers
>will be confused by such minor term differences without clear
>explanation.
>
>>   URIs and "urn:oid" URIs, tags are designed to be tractable to humans.
>>
>>   Furthermore, using tags has some advantages over the common practice
>>   of using "http" URIs as identifiers for non-HTTP-accessible
>>   resources.
>
>[snip]
>
>>1.  Introduction
>>
>>   A tag is a type of Uniform Resource Identifier (URI) [1] designed to
>>   meet the following requirements:
>>
>>   1.  Identifiers are likely to be unique across space and time,
>
>How likely? Very likely? Designed to make it easy to be?
>
>>and
>>       come from a practically inexhaustible supply.
>>   2.  Identifiers are relatively convenient for humans to mint
>>       (create), read, type, remember etc.
>>   3.  No registration is necessary,
>
>-> no central registration is necessary
>
>>at least for holders of domain
>>       names or email addresses;
>
>I think that each such holder who creates tags has to keep their
>own registry to avoid local conflicts. The draft should be quite
>a bit more explicit about this.
>
>>and there is negligible cost to mint
>>       each new identifier.
>>   4.  The identifiers are independent of any particular resolution
>>       scheme.
>>
>>   For example, the above requirements may apply in the case of a user
>>   who wants to place identifiers on their documents:
>
>These are the requirements met by tags, yes? It'd be better
>to just say so.
>
>>   a.  They
>
>Who? The documents? The identifiers? The users? Please rework the
>whole list so that all the items follow the same syntactic structure.
>
>>want to be reasonably sure that the identifier is unique.
>>       Global uniqueness is valuable because it prevents identifiers
>>       from becoming unintentionally ambiguous.
>>   b.  It is useful for the identifier to be tractable to humans:
>
>'to humans' -> 'by humans'?
>
>>they
>>       should be able to mint new identifiers conveniently, and to type
>>       them into emails and forms.
>
>For more aspects of this (memorize,...), see the 'overview and motivation'
>section of IRIs.
>
>>   c.  They do not want to have to communicate with anyone else in order
>>       to mint identifiers for their documents.
>>   d.  The user wants to avoid identifiers that might be taken to imply
>>       the existence of an electronic resource accessible via a default
>>       resolution mechanism, when no such electronic resource exists.
>>
>>   Existing identification schemes satisfy some but not all of the
>>   general requirements above.
>
>Why 'general'? I read it as if these requirements would always apply.
>
>>For example:
>>
>>   UUIDs [8], [9] are hard for humans to read.
>>
>>   OIDs [10], [11] and Digital Object Identifiers [12] require naming
>>   authorities to register themselves,
>
>'themselves': If the identifiers register themselves, that would be
>great. But the problem is that registration requires work by
>a user.
>
>>even if they already hold a
>>   domain name registration.
>
>So 'they' is users, not ids? But users don't register themselves,
>they register some ids or schemes,...
>
>>   URLs (in particular, "http" URLs) are sometimes used as identifiers
>>   that satisfy most of our requirements.
>
>'our': Who is 'we'? Better avoid.
>
>>Many users and organisations
>>   have already registered a domain name, and the use of the domain name
>>   to mint identifiers comes at no additional cost. But there are
>>   drawbacks to URLs-as-identifiers:
>>
>>   o  An attempt may be made to resolve a URL-as-identifier, even though
>>      there is no resource accessible at the "location".
>>   o  Domain names change hands and the new assignee of a domain name
>>      can't be sure that they are minting new names. For example, if
>>      example.org is assigned first to a user Smith and then to a user
>>      Jones, there is no systematic way for Jones to tell whether Smith
>>      has already used a particular identifier such as http://
>>      example.org/9999.
>>   o  Entities could rely on purl.org
>
>add: or a similar service.
>Also, use 'http://purl.org' rather than just 'purl.org', or provide
>a reference.
>
>>as a (first-come, first-served)
>>      assigner of unique URIs; but a solution without reliance upon
>>      another entity such as the Online Computer Library Center (OCLC,
>>      which runs purl.org) may be preferable.
>>
>>   Lastly, many entities -- especially individuals -- are assignees of
>>   email addresses but not domain names. It would be preferable to
>>   enable those entities to mint unique identifiers.
>>
>>2.  Tag Syntax and Rules
>>
>>   This section first specifies the syntax of tag URIs and gives
>>   examples. It then describes a set of rules for minting tags designed
>>   to make them unique. Finally, it discusses the resolution and
>>   comparison of tags.
>>
>>2.1  Tag Syntax and Examples
>>
>>   The general syntax of a tag URI, in ABNF, is:
>
>You need a reference to the ABNF RFC
>(http://www.ietf.org/rfc/rfc2234.txt),
>and to check the ABNF with some tool
>(see advice to Internet Draft and RFC authors).
>
>>      tagURI = "tag:" taggingEntity ":" [specific]
>
>Is it possible for 'specific' to be empty? In that case,
>is the ':' necessary? Is there any specific meaning for
>this case? If this is allowed, please provide an example.
>Also, later, 'specific' is defined as *(...),
>so the [] parentheses are not at all necessary.
>
>>   Where:
>>
>>      taggingEntity = authorityName "," date
>>      authorityName = DNSname / emailAddress
>>      date = 4dig ["-" 2dig ["-" 2dig ]] ; see ISO8601 [2]
>
>It would be much clearer if this were:
>    date = year ["-" month ["-" day ]] ; see ISO8601 [2]
>and then
>    year = 4*DIGIT
>    month = "01" / "02" / "03" / ...
>    day = ("0" %x31-39) / (("1" / "2") DIGIT) / "30" / "31"
>or some such. This easily catches a lot of illegal stuff, and makes
>the semantics much more obvious.
>
>>      DNSname = DNScomp / DNSname "." DNScomp ; see RFC1035 [3]
>
>It's much better to write this rule in a non-recursive fashion:
>    DNSname = DNScomp *( "." DNScomp )
>And you had better not cite RFC 1035 directly.
>
>>      DNScomp = alphaNum [*(alphaNum /"-") alphaNum]
>
>To allow Internationalized Domain Names, you have to add
>pct-encoded here:
>    DNScomp = dnsChar [*(dnsChar / "-") dnsChar]
>    dnsChar = alphaNum / pct-encoded
>
>>      emailAddress = 1*(alphaNum /"-"/"."/"_") "@" DNSname
>
>I'd strongly recommend to also add pct-encoded here, making this
>future-proof for potential internationalization of the LHS:
>    emailAddress = 1*(alphaNum /"-"/"."/"_"/pct-encoded) "@" DNSname
>
>>      alphaNum = DIGIT / ALPHA
>>      specific = *( pchar / "/" / "?" ) ; pchar from RFCXXXX [1]
>
>pchar includes pct-encoded, so this is okay in terms of basic syntax.
>
>>      ALPHA = %x41-5A / %x61-7A ; any char in the range "A"-"Z"
>>                                  or "a"-"z"
>>      DIGIT = %x30-39 ; any char in the range "0" through "9"
>
>Just import ALPHA and DIGIT from the ABNF RFC, don't repeat them here.
>
>At this point, you should say some general things about pct-encoded.
>What you want to say probably is:
>- pct-encoded (including in the case of pchar) is only allowed for
>  octets above %7F.
>- pct-encoded (including in the case of pchar) is only allowed in
>  sequences that are valid UTF-8 octet sequences.
>- pct-encoded is used to encode characters using UTF-8.
>- There may be additional restrictions for each of the components
>  allowing pct-encoded.
>- That pct-encoded is only allowed to allow the minting of tag IRIs,
>  but that tags created as URIs from the start should/must never
>  contain any pct-encoded pieces, and that tag IRIs also should/must
>  never contain any pct-encoded pieces.
>
>>   The component "taggingEntity" is the name space part of the URI. To
>>   avoid ambiguity, the domain name in "authorityName" (whether an email
>>   address or a simple domain name) MUST be fully qualified. It is
>>   RECOMMENDED that the domain name should be in lowercase form.
>>   Alternative formulations of the same authority name will be counted
>>   as distinct
>
>'counted' -> 'treated', or even better just say that these *are*
>different tags.
>
>>and hence tags containing them will be unequal (see
>>   Section 2.4). For example, tags beginning "tag:HP.com,2000:" are
>>   never equal to those beginning "tag:hp.com,2000:", even though they
>>   refer to the same domain name.
>>
>>   Authority names could, in principle, belong to any syntactically
>>   distinct namespaces whose names are assigned to a unique entity at a
>>   time. Those include, for example, certain IP addresses, certain MAC
>>   addresses, and telephone numbers. However, to simplify the tag
>>   scheme, we restrict authority names to be domain names and email
>>   addresses. Future standards efforts may allow use of other authority
>>   names following syntax that is disjoint from this syntax. To allow
>>   for such developments, software that processes tags MUST NOT reject
>>   them on the grounds that they are outside the syntax for
>>   authorityName defined above.
>
>Here, say that a DNSName must, after decoding of percent-encoding and
>interpretation of the resulting octet sequence as UTF-8,
>be an Internationalized Domain Name according to IDNA [RFC 3490].
>You may also want to say that a DNSName, after decoding of
>percent-encoding and interpretation of the resulting octet sequence
>as UTF-8, should be normalized as defined by Nameprep [RFC 3491] to
>avoid producing TAGs that look very similar but are not the same.
>Also, say that pct-encoded is allowed on the left hand side of
>emailAddress (before the "@") for future-compatibility, and is only
>to be used if and when there is an IETF Standards-Track document
>specifying how internationalized email address left hand sides
>are handled.
>
>>   The component "specific" is the name-space-specific part of the URI:
>>   it is a string of URI characters (see restrictions in syntax
>>   specification) chosen by the minter of the URI. It is RECOMMENDED
>>   that specific identifiers should be human-friendly.
>
>Add some text here that after decoding of percent-encoding and
>interpretation of the resulting octet sequence as UTF-8,
>"specific" should be in NFC and preferably even in NFKC.
>
>>   Examples of tag URIs are:
>>
>>      tag:timothy@hpl.hp.com,2001:web/externalHome
>>      tag:sandro@w3.org,2004-05:Sandro
>>      tag:my-ids.com,2001-09-15:TimKindberg:presentations:UBath2004-05-19
>>      tag:blogger.com,1999:blog-555
>>      tag:yaml.org,2002:int
>
>An example without 'specific', and some I18N examples, should be added
>(I can help).
>
>>2.2  Rules for Minting Tags
>>
>>   As Section 2.1 has specified, each tag consists of a "tagging entity"
>>   followed, optionally, by a specific identifier. The tagging entity
>>   is designated by an "authority name" -- a fully qualified domain name
>>   or an email address containing a fully qualified domain name --
>>   followed by a date. The date is chosen to make the tagging entity
>>   globally unique, exploiting the fact that domain names and email
>>   addresses are assigned to at most one entity at a time. That entity
>>   then ensures that it mints unique identifiers.
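
To make the draft-06 syntax of Section 2.1 concrete, the following is a minimal Python sketch of a tag parser. It transcribes only the ASCII ABNF quoted from the draft (none of the pct-encoded additions proposed in the comments), the name parse_tag is hypothetical, and the pchar character set is an assumption taken from the generic URI syntax.

```python
import re

# Patterns transcribed from the draft-06 ABNF; all names here are illustrative.
ALNUM = r"[A-Za-z0-9]"
DNS_COMP = rf"{ALNUM}(?:(?:{ALNUM}|-)*{ALNUM})?"
DNS_NAME = rf"{DNS_COMP}(?:\.{DNS_COMP})*"
EMAIL = rf"[A-Za-z0-9\-._]+@{DNS_NAME}"
DATE = r"[0-9]{4}(?:-[0-9]{2}(?:-[0-9]{2})?)?"
# Assumed pchar: unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
SPECIFIC = rf"(?:{PCHAR}|[/?])*"

TAG_RE = re.compile(
    rf"tag:(?P<authority>{EMAIL}|{DNS_NAME}),(?P<date>{DATE}):(?P<specific>{SPECIFIC})"
)


def parse_tag(tag):
    """Split a tag URI into authority, date and specific, or return None."""
    match = TAG_RE.fullmatch(tag)
    return match.groupdict() if match else None


print(parse_tag("tag:timothy@hpl.hp.com,2001:web/externalHome"))
print(parse_tag("tag:hp.com,2000:"))  # empty "specific" is accepted
```

Since the draft says software MUST NOT reject tags whose authorityName falls outside this syntax, a None result is better treated as grounds for a warning than as an error.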
>
>The following paragraph can be reworded (and probably simplified)
>once the changes to the syntax rules have been made.
>
>>   The date specifies, according to the Gregorian calendar and UTC, any
>>   particular day on which the authority name was assigned to the
>>   tagging entity at 00:00 UTC (the start of the day). The date MAY be
>>   a past or present date on which the authority name was assigned at
>>   that moment. The date is specified using one of the "YYYY",
>>   "YYYY-MM" and "YYYY-MM-DD" formats allowed by the ISO 8601 standard
>>   [2]. The tag specification permits no other formats. Tagging
>>   entities MUST ascertain the date with sufficient accuracy
>>   to avoid accidentally using a date on which the authority name was
>>   not in fact assigned (many computers and mobile devices have poorly
>>   synchronised clocks). The date MUST be reckoned from UTC -- which
>>   may differ from the date in the tagging entity's local timezone at
>>   00:00 UTC.
>
>I think some readers may be confused by "reckoned from UTC". Why not
>just say that the date is always in UTC?
>
>
>>That distinction can generally be safely ignored in
>>   practice, but not on the day of the authority name's assignment. In
>>   principle it would otherwise be possible on that day for the previous
>>   assignee and the new assignee to use the same date and thus mint the
>>   same tags.
>>
>>   In the interests of brevity, the month and day default to 01. A day
>>   value of 01 MAY be omitted; a month value of 01 MAY be omitted unless
>>   it is followed by a day value other than 01.
>
>I'd quote all the 01 (i.e. "01") for easier readability. It is easy here
>to confuse MAY with the month of May.
>
>>For example, "2001-07"
>>   is the date 2001-07-01 and "2000" is the date 2000-01-01. All date
>>   formulations specify a moment (00:00 UTC) of a single day, and not a
>>   period of a day or more such as "the whole of July 2001" or "the
>>   whole of 2000".
>>   Assignment at that moment is all that is required to
>>   use a given date formulation.
>
>formulation -> format? or just 'use a given date'?
>
>>   Tagging entities should be aware that alternative formulations of the
>>   same date will be counted as distinct and hence tags containing them
>>   will be unequal. For example, tags beginning "tag:hp.com,2000:" are
>>   never equal to those beginning "tag:hp.com,2000-01-01:", even though
>>   they refer to the same date (see Section 2.4).
>
>Here and elsewhere: The IETF prefers to use domain names such as
>example.com.
>
>>   An entity MUST NOT mint tags under an authority name that was
>>   assigned to a different entity at 00:00 UTC on the given date, and it
>>   MUST NOT mint tags under a future date.
>>
>>   An entity that acquires an authority name immediately after a period
>>   during which the name was unassigned MAY mint tags as if the entity
>>   was assigned the name during the unassigned period. This practice
>>   has considerable potential for error and MUST NOT be used unless the
>>   entity has substantial evidence that the name was unassigned during
>>   that period. The authors are currently unaware of any mechanism that
>>   would count as evidence, other than daily polling of the "whois"
>>   registry.
>>
>>   For example, Hewlett-Packard holds the domain registration for hp.com
>>   and may mint any tags rooted at that name with a current or past date
>>   when it held the registration. It must not mint tags such as
>>   "tag:champignon.net,2001:" under domain names not registered to it.
>>   It must not mint tags dated in the future, such as
>>   "tag:hp.com,2999:". If it obtains assignment of
>>   "extremelyunlikelytobeassigned.org" on 2001-05-01, then it must not
>>   mint tags under "extremelyunlikelytobeassigned.org,2001-04-01" unless
>>   it has evidence proving that that name was continuously unassigned
>>   between 2001-04-01 and 2001-05-01.
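
The date rules quoted above (omitted month and day default to "01", no future dates, and different spellings of one date giving different tags) can be illustrated with a short Python sketch. The helper names expand_tag_date and is_future are hypothetical, and the caller is assumed to supply the current UTC date.

```python
from datetime import date


def expand_tag_date(text):
    """Return the day denoted by a "YYYY", "YYYY-MM" or "YYYY-MM-DD" tag date."""
    parts = text.split("-")
    well_formed = (
        1 <= len(parts) <= 3
        and [len(p) for p in parts] == [4, 2, 2][:len(parts)]
        and all(p.isascii() and p.isdigit() for p in parts)
    )
    if not well_formed:
        raise ValueError(f"not a tag date: {text!r}")
    # Omitted month and day default to "01".
    year, month, day = (int(p) for p in parts + ["01"] * (3 - len(parts)))
    return date(year, month, day)  # date() itself rejects e.g. month 13


def is_future(text, today_utc):
    """True if the tag date lies after today_utc, so it must not be used."""
    return expand_tag_date(text) > today_utc


# Same moment in time, yet the tags are different strings and so unequal.
print(expand_tag_date("2000") == expand_tag_date("2000-01-01"))
print("tag:hp.com,2000:" == "tag:hp.com,2000-01-01:")
```

A check like this cannot establish that the authority name was actually assigned to the minter on that date; that still depends on the record-keeping the draft asks for.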
>>
>>   A tagging entity mints specific identifiers that are unique within
>>   its context, in accordance with any internal scheme that uses only
>>   URI characters. Some tagging entities (e.g. corporations, mailing
>>   lists) consist of many people, in which case group decision-making
>>   and record-keeping procedures SHOULD be used to achieve uniqueness.
>
>Record-keeping is important for individuals, too.
>
>>2.3  Resolution of Tags
>>
>>   There is no authoritative resolution mechanism for tags. Unlike most
>>   other URIs, tags can only be used as identifiers, and are not
>>   designed to support resolution. If authoritative resolution is a
>>   desired feature, a different URI scheme should be used.
>>
>>2.4  Equality of Tags
>>
>>   Tags are simply strings of characters and are considered equal if and
>>   only if they are completely indistinguishable in their machine
>>   representations. That is, one can compare tags for equality by
>>   comparing the numeric codes of their characters, in sequence, for
>>   numeric equality. This equality-criterion allows for simplification
>
>equality-criterion -> equality criterion
>
>>   of tag-handling software, which does not have to transform tags in
>>   any way to compare them.
>
>
>>3.  Internationalisation
>>
>>   So far, we have considered tags as URIs, which are represented in a
>>   subset of US-ASCII characters. As befits our requirement for
>>   identifiers to be tractable to humans, tags can also be minted as
>
>The 'can also be minted as' probably needs some more explanation.
>In general, any URI scheme that allows pct-encoded in the right way can
>also be used with IRIs. See below.
>
>>   Internationalized Resource Identifiers (IRIs) [4]. That is, they can
>>   be minted in languages that use any characters from the Universal
>>   Character Set.
>
>Does a tag have a language? I think it's better to just say:
>they can be minted using any characters from ...
>
>The following procedure can probably be removed.
>If not, the following details should be fixed:
>
>>   The procedure for minting tags as IRIs is to use the specification of
>>   Section 2 but with the following syntactic changes:
>>   o  An International Domain Name (IDN) [5] represented according to
>>      the rules of 'nameprep' [6] may be used in place of a domain name
>>      in authorityName. That includes a domain name appearing on the
>>      right-hand side of an email address.
>>   o  If a standard arises for expressing email addresses in
>>      international form -- that is, including the left-hand side of
>>      email addresses -- then that form will be allowed in
>>      authorityName.
>>   o  An international authorityName MUST appear in at least Normalized
>>      Form C (NFC) and SHOULD appear in Normalized Form KC (NFKC) [7].
>
>This should not be necessary, because Nameprep takes care of this.
>But it may be a good thing to say for 'specific'.
>
>>   o  The specific component of a tag IRI may be any string allowed by
>>      the ABNF term *( ipchar / "/" / "?" ) defined in [4].
>
>I recommend adding some normalization restrictions here, for the benefit
>of transcribability,...
>
>>   Two tag IRIs are equal if and only if they are identical as character
>>   sequences -- and thus that their machine representations are
>>   identical when using the same character encodings.
>
>It may be a good idea to repeat here explicitly that:
>- The use of pct-encoding in the syntax rules is only allowed in
>  order to define the syntax of IRIs allowed in the tag scheme.
>- pct-encoding should not be used in tags generated using only
>  US-ASCII characters.
>- pct-encoding should not be used in tags generated including
>  non-ASCII characters (i.e. IRIs).
>- A tag IRI is not equivalent to the tag URI resulting after
>  mapping the IRI to a URI according to Section 3.1 of [IRI].
>  To reduce any problems resulting from this:
>  - tags should be used mainly with technology that can transport and
>    handle IRIs (such as RDF).
>  - If tags are temporarily converted to URIs because they have
>    to be passed to some infrastructure that isn't able to handle
>    IRIs, they should be converted back to IRIs when being received
>    back from that infrastructure.
>
>
>>4.  Security Considerations
>>
>>   Minting a tag, by itself, is an operation internal to the tagging
>>   entity with no external consequences. The consequences of using an
>>   improperly minted tag (due to malice or error) in an application
>>   depends on the application, and must be considered in the design of
>>   any application that uses tags.
>>
>>   There is a significant possibility of minting errors by people who
>>   fail to apply the rules governing dates, or who use a shared
>>   (organizational) authority-name without prior organization-wide
>>   agreement. Tag-aware software MAY help catch and warn against these
>>   errors. As stated in Section 2, however, to allow for future
>>   expansion, software MUST NOT reject tags which do not conform to the
>>   syntax specified in Section 2.
>>
>>   A malicious party could make it appear that the same domain name or
>>   email address was assigned to each of two or more entities. Tagging
>>   entities SHOULD use reputable assigning authorities, and verify
>>   assignment wherever possible.
>>
>>   Entities SHOULD also avoid the potential for malicious exploitation
>>   of clock skew, by using authority names that were assigned
>>   continuously from well before to well after 00:00 UTC on the date
>>   chosen for the tagging entity -- preferably by intervals in the order
>>   of days.
>>
>>5.  References
>>
>>5.1  Normative References
>>
>>   [1]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource
>>        Identifier (URI): Generic Syntax (Note to the RFC Editor: Please
>>        update this reference with the RFC resulting from
>>        draft-fielding-uri-rfc2396bis-xx.txt, and remove this Note)",
>>        draft-fielding-uri-rfc2396bis-06 (work in progress), July 2004.
>>
>>   [2]  "Data elements and interchange formats -- Information
>>        interchange -- Representation of dates and times", ISO
>>        (International Organization for Standardization) ISO 8601:1988,
>>        1988.
>>
>>   [3]  Mockapetris, P., "Domain names - implementation and
>>        specification", STD 13, RFC 1035, November 1987.
>>
>>   [4]  Duerst, M. and M. Suignard, "Internationalized Resource
>>        Identifiers (IRIs)", draft-duerst-iri-09 (work in progress),
>>        July 2004.
>
> This should have a similar RFC Editor comment as [1].
>
>>   [5]  Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing
>>        Domain Names in Applications (IDNA)", RFC 3490, March 2003.
>>
>>   [6]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for
>>        Internationalized Domain Names (IDN)", RFC 3491, March 2003.
>>
>>   [7]  Duerst, M. and M. Davis, "Unicode Normalization Forms", Unicode
>>        Standard Annex #15 http://www.unicode.org/unicode/reports/tr15/
>>        tr15-23.html, April 2003.
>>
>>5.2  Informative References
>>
>>   [8]  Leach, P. and R. Salz, "UUIDs and GUIDs", draft-leach-uuids-01
>>        (work in progress), 1997.
>>
>>   [9]  "Information technology - Open Systems Interconnection - Remote
>>        Procedure Call (RPC)", ISO (International Organization for
>>        Standardization) ISO/IEC 11578:1996, 1996.
>>
>>   [10] "Specification of abstract syntax notation one (ASN.1)", ITU-T
>>        recommendation X.208, (see also RFC 1778), 1988.
>>
>>   [11] Mealling, M., "A URN Namespace of Object Identifiers", RFC
>>        3061, February 2001.
>>
>>   [12] Paskin, N., "Information Identifiers", Learned Publishing Vol.
>>        10, No. 2, pp. 135-156, (see also www.doi.org), April 1997.
>
>[snip]
>
>Regards, Martin.

--
Tim Kindberg

hewlett-packard laboratories
filton road
stoke gifford
bristol bs34 8qz
uk

purl.org/net/TimKindberg
timothy@hpl.hp.com
voice +44 (0)117 312 9920
fax ++44 (0)117 312 8003
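
The disagreement over pct-encoding comes down to the fact that a tag IRI and the URI obtained by mapping it are different strings, so they are unequal under the draft's character-by-character rule. The Python sketch below illustrates this under stated assumptions: the tag "tag:example.org,2004:日本語" is a made-up example, the function names are hypothetical, and the mapping shown is only the UTF-8 percent-encoding step, not the full procedure of the IRI draft (IDNA handling of the domain name is left out).

```python
import unicodedata
from urllib.parse import quote, unquote


def iri_to_uri(iri):
    """Percent-encode each non-ASCII character as its UTF-8 octets."""
    return "".join(c if ord(c) < 128 else quote(c, safe="") for c in iri)


def uri_to_iri(uri):
    """Decode percent-encoded UTF-8 (assumes only non-ASCII was encoded)."""
    return unquote(uri, errors="strict")


def tags_equal(a, b):
    """Draft equality rule: compare the character sequences as they are."""
    return a == b


def is_nfc(text):
    """True if the text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


iri = "tag:example.org,2004:日本語"  # hypothetical tag, not from the draft
uri = iri_to_uri(iri)
print(uri)
print(tags_equal(iri, uri))              # the two forms are distinct tags
print(tags_equal(uri_to_iri(uri), iri))  # converting back restores the IRI
```

Under these assumptions, software that must pass a tag through URI-only infrastructure would convert it back to the IRI form before comparing, which matches the advice quoted in the comments on Section 3.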
Received on Saturday, 4 September 2004 01:59:03 UTC