draft-kindberg-tag-uri from Martin Duerst on 2004-08-30 (uri@w3.org from August 2004)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 30 Aug 2004 18:17:57 +0900
To: Tim Kindberg <timothy@hpl.hp.com>
Cc: uri@w3.org, Sandro Hawke <sandro@w3.org>
Message-Id: <4.2.0.58.J.20040829160538.04f78a98@localhost>

Hello Tim,

I have finally gotten around to commenting on the newest version of your
TAG draft, a pre-draft at http://taguri.org/06/draft-kindberg-tag-uri-06.txt.

The main comment is that you try to have two separate definitions,
one for TAG URIs and the other for TAG IRIs, but that isn't how the
URI spec and the IRI spec work. For further background, please also
see the issue and discussion at
http://www.w3.org/International/iri-edit#iri-scheme-38

I also give some comments on general issues that I found, mostly
editorial.


At 13:27 04/08/24 +0900, Martin Duerst wrote:
>Network Working Group                                        T. Kindberg
>Internet-Draft                               Hewlett-Packard Corporation
>Expires: January 27, 2005                                       S. Hawke
>                                                World Wide Web Consortium
>                                                            July 29, 2004
>
>
>                           The 'tag' URI scheme
>                        draft-kindberg-tag-uri-06

[snip] [also snipped all page breaks]

>Abstract
>
>    This document describes the "tag" Uniform Resource Identifier (URI)
>    scheme,

This comma is somewhat confusing. It's probably best to end the sentence
here and integrate the points in the remaining clause into the rest of
the paragraph.


>for identifiers that are unique across space and time.  Tag
>    URIs (also known as "tags") are distinct from most other URIs in that
>    there is no authoritative resolution mechanism.  A tag may be used
>    purely as an entity identifier.  Unlike UUIDs or GUIDs

Abbreviations shouldn't appear without expansion. (see RFC guidelines)
Also, there should be references for these terms, but referencing
doesn't fit well into an abstract. I'd concentrate on the description
of tags themselves in the abstract, in positive terms (what tags do,
not what they don't), and put comparison with other schemes into a
section in the body of the document, with references.


>such as "uuid"

So the uuid scheme is a UUID? Or a GUID? Or both? Some readers
will be confused by such minor term differences without clear
explanation.


>    URIs and "urn:oid" URIs, tags are designed to be tractable to humans.
>
>    Furthermore, using tags has some advantages over the common practice
>    of using "http" URIs as identifiers for non-HTTP-accessible
>    resources.

[snip]


>1.  Introduction
>
>    A tag is a type of Uniform Resource Identifier (URI) [1] designed to
>    meet the following requirements:
>
>    1.  Identifiers are likely to be unique across space and time,

How likely? Very likely? Designed to make it easy to be?


>and
>        come from a practically inexhaustible supply.
>    2.  Identifiers are relatively convenient for humans to mint
>        (create), read, type, remember etc.
>    3.  No registration is necessary,

-> no central registration is necessary


>at least for holders of domain
>        names or email addresses;

I think that each such holder who creates tags has to keep their
own registry to avoid local conflicts. The draft should be quite
a bit more explicit about this.


>and there is negligible cost to mint
>        each new identifier.
>    4.  The identifiers are independent of any particular resolution
>        scheme.
>
>    For example, the above requirements may apply in the case of a user
>    who wants to place identifiers on their documents:

These are the requirements met by tags, yes? It'd be better
to just say so.


>    a.  They

Who? The documents? The identifiers? The users? Please rework the
whole list so that all the items follow the same syntactic structure.


>want to be reasonably sure that the identifier is unique.
>        Global uniqueness is valuable because it prevents identifiers
>        from becoming unintentionally ambiguous.
>    b.  It is useful for the identifier to be tractable to humans:

'to humans' -> 'by humans'?

>they
>        should be able to mint new identifiers conveniently, and to type
>        them into emails and forms.

For more aspects of this (memorize,...), see the 'overview and motivation'
section of IRIs.


>    c.  They do not want to have to communicate with anyone else in order
>        to mint identifiers for their documents.
>    d.  The user wants to avoid identifiers that might be taken to imply
>        the existence of an electronic resource accessible via a default
>        resolution mechanism, when no such electronic resource exists.
>
>    Existing identification schemes satisfy some but not all of the
>    general requirements above.

Why 'general'? I read it as if these requirements would always apply.


>For example:
>
>    UUIDs [8], [9] are hard for humans to read.
>
>    OIDs [10], [11] and Digital Object Identifiers [12] require naming
>    authorities to register themselves,

'themselves': If the identifiers register themselves, that would be
great. But the problem is that registration requires work by
a user.


>even if they already hold a
>    domain name registration.

So 'they' is users, not ids? But users don't register themselves,
they register some ids or schemes,...


>    URLs (in particular, "http" URLs) are sometimes used as identifiers
>    that satisfy most of our requirements.

'our': Who is 'we'? Better to avoid it.


>Many users and organisations
>    have already registered a domain name, and the use of the domain name
>    to mint identifiers comes at no additional cost.  But there are
>    drawbacks to URLs-as-identifiers:
>
>    o  An attempt may be made to resolve a URL-as-identifier, even though
>       there is no resource accessible at the "location".
>    o  Domain names change hands and the new assignee of a domain name
>       can't be sure that they are minting new names.  For example, if
>       example.org is assigned first to a user Smith and then to a user
>       Jones, there is no systematic way for Jones to tell whether Smith
>       has already used a particular identifier such as http://
>       example.org/9999.
>    o  Entities could rely on purl.org

add: or a similar service.
Also, use 'http://purl.org' rather than just 'purl.org', or provide
a reference.


>as a (first-come, first-served)
>       assigner of unique URIs; but a solution without reliance upon
>       another entity such as the Online Computer Library Center (OCLC,
>       which runs purl.org) may be preferable.
>
>    Lastly, many entities -- especially individuals -- are assignees of
>    email addresses but not domain names.  It would be preferable to
>    enable those entities to mint unique identifiers.
>
>2.  Tag Syntax and Rules
>
>    This section first specifies the syntax of tag URIs and gives
>    examples.  It then describes a set of rules for minting tags designed
>    to make them unique.  Finally, it discusses the resolution and
>    comparison of tags.
>
>2.1  Tag Syntax and Examples
>
>    The general syntax of a tag URI, in ABNF, is:

You need a reference to the ABNF RFC
(http://www.ietf.org/rfc/rfc2234.txt),
and to check the ABNF with some tool
(see advice to Internet Draft and RFC authors).


>       tagURI        = "tag:" taggingEntity ":" [specific]

Is it possible for 'specific' to be empty? In that case,
is the ':' necessary? Is there any specific meaning for
this case? If this is allowed, please provide an example.
Also, later, 'specific' is defined as *(...),
so the [] brackets are not at all necessary.


>    Where:
>
>       taggingEntity = authorityName "," date
>       authorityName = DNSname / emailAddress
>       date          = 4dig ["-" 2dig ["-" 2dig ]] ; see ISO8601 [2]

It would be much clearer if this were:
        date          = year ["-" month ["-" day ]] ; see ISO8601 [2]
and then
        year          = 4*DIGIT
        month         = "01" / "02" / "03" / ...
        day           = ("0" %x31-39) / (("1" / "2") DIGIT) / "30" / "31"
or some such. This easily catches a lot of illegal stuff, and makes
the semantics much more obvious.


>       DNSname       = DNScomp / DNSname "." DNScomp  ; see RFC1035 [3]

It's much better to write this rule in a non-recursive fashion:

        DNSname       = DNScomp *( "." DNScomp )

And you had better not cite RFC 1035 directly.


>       DNScomp       = alphaNum [*(alphaNum /"-") alphaNum]

To allow Internationalized Domain Names, you have to add
pct-encoded here:

        DNScomp       = dnsChar [*(dnsChar / "-") dnsChar]
        dnsChar       = alphaNum / pct-encoded


>       emailAddress  = 1*(alphaNum /"-"/"."/"_") "@" DNSname

I'd strongly recommend also adding pct-encoded here, making this
future-proof for potential internationalization of the LHS:

        emailAddress  = 1*(alphaNum /"-"/"."/"_"/pct-encoded) "@" DNSname


>       alphaNum      = DIGIT / ALPHA
>       specific      = *( pchar / "/" / "?" ) ; pchar from RFCXXXX [1]

pchar includes pct-encoded, so this is okay in terms of basic syntax.


>       ALPHA         = %x41-5A / %x61-7A ; any char in the range "A"-"Z"
>       or "a"-"z"
>       DIGIT         = %x30-39 ; any char in the range "0" through "9"

Just import ALPHA and DIGIT from the ABNF RFC, don't repeat them here.
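
As a rough cross-check of the grammar with the changes suggested above
(non-recursive DNSname, pct-encoded in DNScomp and emailAddress), the rules
can be transcribed into a regular expression. This is only a sketch under
my own assumptions: the names TAG_URI and parse_tag are hypothetical, the
date is kept as in the draft (4dig, 2dig, 2dig), and pchar is taken to be
unreserved / pct-encoded / sub-delims / ":" / "@" as in the URI draft.

```python
import re

PCT = r"%[0-9A-Fa-f]{2}"                                  # pct-encoded
DNS_CHAR = rf"(?:[A-Za-z0-9]|{PCT})"                      # dnsChar
DNS_COMP = rf"{DNS_CHAR}(?:(?:{DNS_CHAR}|-)*{DNS_CHAR})?"  # DNScomp
DNS_NAME = rf"{DNS_COMP}(?:\.{DNS_COMP})*"                # DNSname
EMAIL = rf"(?:[A-Za-z0-9._-]|{PCT})+@{DNS_NAME}"          # emailAddress
DATE = r"[0-9]{4}(?:-[0-9]{2}(?:-[0-9]{2})?)?"            # date
PCHAR = rf"(?:[A-Za-z0-9._~!$&'()*+,;=:@-]|{PCT})"        # pchar (assumed)
SPECIFIC = rf"(?:{PCHAR}|[/?])*"                          # specific

TAG_URI = re.compile(
    rf"tag:(?P<authority>{EMAIL}|{DNS_NAME})"
    rf",(?P<date>{DATE})"
    rf":(?P<specific>{SPECIFIC})"
)


def parse_tag(tag: str):
    """Split a tag into authority, date and specific, or return None."""
    match = TAG_URI.fullmatch(tag)
    return match.groupdict() if match else None
```

All five example tags given in Section 2.1 of the draft match this
expression. Note that the draft says software processing tags MUST NOT
reject tags whose authorityName is outside this syntax, so a check like
this is only appropriate when minting, not when receiving.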


At this point, you should say some general things about pct-encoded.
What you want to say probably is:
- pct-encoded (including in the case of pchar) is only allowed for
   octets above %7F.
- pct-encoded (including in the case of pchar) is only allowed in
   sequences that are valid UTF-8 octet sequences.
- pct-encoded is used to encode characters using UTF-8.
- There may be additional restrictions for each of the components
   allowing pct-encoded.
- pct-encoded is only allowed in order to permit the minting of tag IRIs;
   tags created as URIs from the start should/must never contain any
   pct-encoded pieces, and tag IRIs also should/must never contain any
   pct-encoded pieces.
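
The first two restrictions in the list above are easy to check
mechanically. A minimal Python sketch, assuming a component string taken
from a tag URI (the function name pct_encoding_acceptable is hypothetical):

```python
import re

# One or more consecutive pct-encoded triplets, e.g. "%C3%A9".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def pct_encoding_acceptable(component: str) -> bool:
    """Check that pct-encoded octets are all above %7F and form valid UTF-8."""
    for run in _PCT_RUN.findall(component):
        octets = bytes.fromhex(run.replace("%", ""))
        if any(octet <= 0x7F for octet in octets):
            return False  # pct-encoding of a US-ASCII octet
        try:
            octets.decode("utf-8")  # strict: rejects truncated/overlong forms
        except UnicodeDecodeError:
            return False
    return True
```

For example, "caf%C3%A9" passes, while "a%20b" (an encoded US-ASCII octet)
and "caf%C3" (a truncated UTF-8 sequence) do not.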

>    The component "taggingEntity" is the name space part of the URI.  To
>    avoid ambiguity, the domain name in "authorityName" (whether an email
>    address or a simple domain name) MUST be fully qualified.  It is
>    RECOMMENDED that the domain name should be in lowercase form.
>    Alternative formulations of the same authority name will be counted
>    as distinct

'counted' -> 'treated', or even better just say that these *are*
different tags.


>and hence tags containing them will be unequal (see
>    Section 2.4).  For example, tags beginning "tag:HP.com,2000:" are
>    never equal to those beginning "tag:hp.com,2000:", even though they
>    refer to the same domain name.
>
>    Authority names could, in principle, belong to any syntactically
>    distinct namespaces whose names are assigned to a unique entity at a
>    time.  Those include, for example, certain IP addresses, certain MAC
>    addresses, and telephone numbers.  However, to simplify the tag
>    scheme, we restrict authority names to be domain names and email
>    addresses.  Future standards efforts may allow use of other authority
>    names following syntax that is disjoint from this syntax.  To allow
>    for such developments, software that processes tags MUST NOT reject
>    them on the grounds that they are outside the syntax for
>    authorityName defined above.

Here, say that a DNSName must, after decoding of percent-encoding and
interpretation of the resulting octet sequence as UTF-8,
be an Internationalized Domain Name according to IDNA [RFC 3490].
You may also want to say that a DNSName, after decoding of
percent-encoding and interpretation of the resulting octet sequence
as UTF-8, should be normalized as defined by Nameprep [RFC 3491] to
avoid producing TAGs that look very similar but are not the same.
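
As an illustration of the decoding steps just described, here is a small
Python sketch. It relies on Python's built-in "idna" codec, which
implements RFC 3490; the function name dnsname_to_ace is hypothetical and
the whole thing is meant as a sanity check, not as normative behaviour.

```python
from urllib.parse import unquote


def dnsname_to_ace(dns_name: str) -> str:
    """Decode pct-encoding as UTF-8, then apply IDNA ToASCII per label.

    Raises UnicodeError if the octets are not valid UTF-8 or if a label
    is not acceptable to the IDNA codec.
    """
    decoded = unquote(dns_name, encoding="utf-8", errors="strict")
    return decoded.encode("idna").decode("ascii")
```

If this raises, the DNSname is not something that should appear in a tag.
If it succeeds, the result is the ASCII-compatible form that the DNS
itself would use for the same name.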

Also, say that pct-encoded is allowed on the left hand side of
emailAddress (before the "@") for future-compatibility, and is only
to be used if and when there is an IETF Standards-Track document
specifying how internationalized email address left hand sides
are handled.


>    The component "specific" is the name-space-specific part of the URI:
>    it is a string of URI characters (see restrictions in syntax
>    specification) chosen by the minter of the URI.  It is RECOMMENDED
>    that specific identifiers should be human-friendly.

Add some text here that after decoding of percent-encoding and
interpretation of the resulting octet sequence as UTF-8,
"specific" should be in NFC and preferably even in NFKC.


>    Examples of tag URIs are:
>
>       tag:timothy@hpl.hp.com,2001:web/externalHome
>       tag:sandro@w3.org,2004-05:Sandro
>       tag:my-ids.com,2001-09-15:TimKindberg:presentations:UBath2004-05-19
>       tag:blogger.com,1999:blog-555
>       tag:yaml.org,2002:int

An example without 'specific', and some I18N examples, should be added
(I can help).


>2.2  Rules for Minting Tags
>
>    As Section 2.1 has specified, each tag consists of a "tagging entity"
>    followed, optionally, by a specific identifier.  The tagging entity
>    is designated by an "authority name" -- a fully qualified domain name
>    or an email address containing a fully qualified domain name --
>    followed by a date.  The date is chosen to make the tagging entity
>    globally unique, exploiting the fact that domain names and email
>    addresses are assigned to at most one entity at a time.  That entity
>    then ensures that it mints unique identifiers.

The following paragraph can be reworded (and probably simplified)
once the changes to the syntax rules have been made.

>    The date specifies, according to the Gregorian calendar and UTC, any
>    particular day on which the authority name was assigned to the
>    tagging entity at 00:00 UTC (the start of the day).  The date MAY be
>    a past or present date on which the authority name was assigned at
>    that moment.  The date is specified using one of the "YYYY",
>    "YYYY-MM" and "YYYY-MM-DD" formats allowed by the ISO 8601 standard
>    [2].  The tag specification permits no other formats.  Tagging
>    entities MUST ascertain the date with sufficient accuracy
>    to avoid accidentally using a date on which the authority name was
>    not in fact assigned (many computers and mobile devices have poorly
>    synchronised clocks).  The date MUST be reckoned from UTC -- which
>    may differ from the date in the tagging entity's local timezone at
>    00:00 UTC.

I think some readers may be confused by "reckoned from UTC". Why not
just say that the date is always in UTC?
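
The difference between the local date and the UTC date is easy to show.
A minimal Python sketch (the function name utc_tag_date is hypothetical;
it only converts a moment to its UTC calendar date and says nothing about
whether the authority name was actually assigned at 00:00 UTC on that day):

```python
from datetime import datetime, timezone


def utc_tag_date(moment: datetime) -> str:
    """Return the UTC calendar date (YYYY-MM-DD) for an aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    return moment.astimezone(timezone.utc).date().isoformat()
```

For somebody in Japan at 05:00 on 2004-08-30 local time (+0900), this
gives "2004-08-29", because it is still the previous day in UTC.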



>That distinction can generally be safely ignored in
>    practice, but not on the day of the authority name's assignment.  In
>    principle it would otherwise be possible on that day for the previous
>    assignee and the new assignee to use the same date and thus mint the
>    same tags.
>
>    In the interests of brevity, the month and day default to 01.  A day
>    value of 01 MAY be omitted; a month value of 01 MAY be omitted unless
>    it is followed by a day value other than 01.

I'd quote all the 01 (i.e. "01") for easier readability. It is easy here
to confuse MAY with the month of May.


>For example, "2001-07"
>    is the date 2001-07-01 and "2000" is the date 2000-01-01.  All date
>    formulations specify a moment (00:00 UTC) of a single day, and not a
>    period of a day or more such as "the whole of July 2001" or "the
>    whole of 2000".  Assignment at that moment is all that is required to
>    use a given date formulation.

formulation -> format? or just 'use a given date'?
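
The defaulting rule quoted above can be sketched in a few lines of Python.
The function name expand_tag_date is hypothetical, and the input is assumed
to be a syntactically valid tag date already.

```python
def expand_tag_date(date: str) -> str:
    """Fill in the defaults: a missing month or day means "01"."""
    parts = date.split("-")
    parts += ["01"] * (3 - len(parts))
    return "-".join(parts)


# Both denote the same moment (00:00 UTC on 2000-01-01) ...
same_moment = expand_tag_date("2000") == expand_tag_date("2000-01-01")

# ... but the tags themselves are compared as plain strings and differ.
same_tag = "tag:example.com,2000:x" == "tag:example.com,2000-01-01:x"
```

Here same_moment is True and same_tag is False. The expansion is only
useful for reasoning about which moment a date denotes; it must not be
applied before comparing tags.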


>    Tagging entities should be aware that alternative formulations of the
>    same date will be counted as distinct and hence tags containing them
>    will be unequal.  For example, tags beginning "tag:hp.com,2000:" are
>    never equal to those beginning "tag:hp.com,2000-01-01:", even though
>    they refer to the same date (see Section 2.4).

Here and elsewhere: The IETF prefers to use domain names such as
example.com.


>    An entity MUST NOT mint tags under an authority name that was
>    assigned to a different entity at 00:00 UTC on the given date, and it
>    MUST NOT mint tags under a future date.
>
>    An entity that acquires an authority name immediately after a period
>    during which the name was unassigned MAY mint tags as if the entity
>    was assigned the name during the unassigned period.  This practice
>    has considerable potential for error and MUST NOT be used unless the
>    entity has substantial evidence that the name was unassigned during
>    that period.  The authors are currently unaware of any mechanism that
>    would count as evidence, other than daily polling of the "whois"
>    registry.
>
>    For example, Hewlett-Packard holds the domain registration for hp.com
>    and may mint any tags rooted at that name with a current or past date
>    when it held the registration.  It must not mint tags such as
>    "tag:champignon.net,2001:" under domain names not registered to it.
>    It must not mint tags dated in the future, such as
>    "tag:hp.com,2999:".  If it obtains assignment of
>    "extremelyunlikelytobeassigned.org" on 2001-05-01, then it must not
>    mint tags under "extremelyunlikelytobeassigned.org,2001-04-01" unless
>    it has evidence proving that that name was continuously unassigned
>    between 2001-04-01 and 2001-05-01.
>
>    A tagging entity mints specific identifiers that are unique within
>    its context, in accordance with any internal scheme that uses only
>    URI characters.  Some tagging entities (e.g.  corporations, mailing
>    lists) consist of many people, in which case group decision-making
>    and record-keeping procedures SHOULD be used to achieve uniqueness.

Record-keeping is important for individuals, too.


>2.3  Resolution of Tags
>
>    There is no authoritative resolution mechanism for tags.  Unlike most
>    other URIs, tags can only be used as identifiers, and are not
>    designed to support resolution.  If authoritative resolution is a
>    desired feature, a different URI scheme should be used.
>
>2.4  Equality of Tags
>
>    Tags are simply strings of characters and are considered equal if and
>    only if they are completely indistinguishable in their machine
>    representations.  That is, one can compare tags for equality by
>    comparing the numeric codes of their characters, in sequence, for
>    numeric equality.  This equality-criterion allows for simplification

equality-criterion -> equality criterion


>    of tag-handling software, which does not have to transform tags in
>    any way to compare them.


>3.  Internationalisation
>
>    So far, we have considered tags as URIs, which are represented in a
>    subset of US-ASCII characters.  As befits our requirement for
>    identifiers to be tractable to humans, tags can also be minted as

The 'can also be minted as' probably needs some more explanation.
In general, any URI scheme that allows pct-encoded in the right way can
also be used with IRIs. See below.


>    Internationalized Resource Identifiers (IRIs) [4].  That is, they can
>    be minted in languages that use any characters from the Universal
>    Character Set.

Does a tag have a language? I think it's better to just say:
they can be minted using any characters from ...


The following procedure can probably be removed. If not, the following
details should be fixed:

>    The procedure for minting tags as IRIs is to use the specification of
>    Section 2 but with the following syntactic changes:
>    o  An International Domain Name (IDN) [5] represented according to
>       the rules of 'nameprep' [6] may be used in place of a domain name
>       in authorityName.  That includes a domain name appearing on the
>       right-hand side of an email address.
>    o  If a standard arises for expressing email addresses in
>       international form -- that is, including the left-hand side of
>       email addresses -- then that form will be allowed in
>       authorityName.
>    o  An international authorityName MUST appear in at least Normalized
>       Form C (NFC) and SHOULD appear in Normalized Form KC (NFKC) [7].

This should not be necessary, because Nameprep takes care of this.
But it may be a good thing to say for 'specific'.


>    o  The specific component of a tag IRI may be any string allowed by
>       the ABNF term *( ipchar / "/" / "?" ) defined in [4].

I recommend adding some normalization restrictions here, for the benefit
of transcribability,...
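
One way to test such a normalization restriction on "specific" is sketched
below in Python, using the standard unicodedata module. The function name
specific_normalization is hypothetical. Any pct-encoding is decoded as
UTF-8 first, so the same check works for the URI form and the IRI form.

```python
import unicodedata
from urllib.parse import unquote


def specific_normalization(specific: str) -> dict:
    """Report whether the decoded 'specific' part is in NFC and in NFKC."""
    text = unquote(specific, encoding="utf-8", errors="strict")
    return {
        "nfc": unicodedata.normalize("NFC", text) == text,
        "nfkc": unicodedata.normalize("NFKC", text) == text,
    }
```

A precomposed "caf%C3%A9" is in both forms; "cafe%CC%81" (e followed by a
combining acute accent) is in neither; and "%EF%AC%81" (the "fi" ligature,
U+FB01) is in NFC but not in NFKC.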

>    Two tag IRIs are equal if and only if they are identical as character
>    sequences -- and thus that their machine representations are
>    identical when using the same character encodings.

It may be a good idea to repeat here explicitly that:
- The use of pct-encoding in the syntax rules is only allowed in
   order to define the syntax of IRIs allowed in the tag scheme.
- pct-encoding should not be used in tags generated using only
   US-ASCII characters.
- pct-encoding should not be used in tags generated including
   non-ASCII characters (i.e. IRIs).
- A tag IRI is not equivalent to the tag URI resulting after
   mapping the IRI to a URI according to Section 3.1 of [IRI].
   To reduce any problems resulting from this:
   - tags should be used mainly with technology that can transport and
     handle IRIs (such as RDF).
   - If tags are temporarily converted to URIs because they have
     to be passed to some infrastructure that isn't able to handle
     IRIs, they should be converted back to IRIs when being received
     back from that infrastructure.
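
To make the last point concrete, here is a simplified Python sketch of the
round trip. It only handles the percent-encoding step (non-ASCII characters
as UTF-8 octets) and ignores the other details of the mapping in the IRI
draft; the function names iri_to_uri and uri_to_iri are hypothetical.

```python
import re


def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character as its UTF-8 octets."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        for ch in iri
    )


# Runs of pct-encoded octets above %7F, the only ones to be decoded.
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def uri_to_iri(uri: str) -> str:
    """Turn pct-encoded runs that are valid UTF-8 back into characters."""
    def decode(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # leave invalid sequences untouched
    return _NON_ASCII_RUN.sub(decode, uri)
```

With a tag IRI ending in "café", iri_to_uri gives a string ending in
"caf%C3%A9". Compared character by character, the two are different tags,
which is why the URI form should be converted back with something like
uri_to_iri before it is compared or stored.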



>4.  Security Considerations
>
>    Minting a tag, by itself, is an operation internal to the tagging
>    entity with no external consequences.  The consequences of using an
>    improperly minted tag (due to malice or error) in an application
>    depends on the application, and must be considered in the design of
>    any application that uses tags.
>
>    There is a significant possibility of minting errors by people who
>    fail to apply the rules governing dates, or who use a shared
>    (organizational) authority-name without prior organization-wide
>    agreement.  Tag-aware software MAY help catch and warn against these
>    errors.  As stated in Section 2, however, to allow for future
>    expansion, software MUST NOT reject tags which do not conform to the
>    syntax specified in Section 2.
>
>    A malicious party could make it appear that the same domain name or
>    email address was assigned to each of two or more entities.  Tagging
>    entities SHOULD use reputable assigning authorities, and verify
>    assignment wherever possible.
>
>    Entities SHOULD also avoid the potential for malicious exploitation
>    of clock skew, by using authority names that were assigned
>    continuously from well before to well after 00:00 UTC on the date
>    chosen for the tagging entity -- preferably by intervals in the order
>    of days.
>
>5.  References
>
>5.1  Normative References
>
>    [1]  Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource
>         Identifier (URI): Generic Syntax (Note to the RFC Editor: Please
>         update this reference with the RFC resulting from
>         draft-fielding-uri-rfc2396bis-xx.txt, and remove this Note)",
>         draft-fielding-uri-rfc2396bis-06 (work in progress), July 2004.
>
>    [2]  "Data elements and interchange formats -- Information
>         interchange -- Representation of dates and   times", ISO
>         (International Organization for Standardization) ISO 8601:1988,
>         1988.
>
>    [3]  Mockapetris, P., "Domain names - implementation and
>         specification", STD 13, RFC 1035, November 1987.
>
>    [4]  Duerst, M. and M. Suignard, "Internationalized Resource
>         Identifiers (IRIs)", draft-duerst-iri-09 (work in progress),
>         July 2004.

      This should have an RFC Editor note similar to the one in [1].


>    [5]  Faltstrom, P., Hoffman, P. and A. Costello, "Internationalizing
>         Domain Names in Applications (IDNA)", RFC 3490, March 2003.
>
>    [6]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile for
>         Internationalized Domain Names (IDN)", RFC 3491, March 2003.
>
>    [7]  Duerst, M. and M. Davis, "Unicode Normalization Forms", Unicode
>         Standard Annex #15 http://www.unicode.org/unicode/reports/tr15/
>         tr15-23.html, April 2003.
>
>5.2  Informative References
>
>    [8]   Leach, P. and R. Salz, "UUIDs and GUIDs", draft-leach-uuids-01
>          (work in progress), 1997.
>
>    [9]   "Information technology - Open Systems Interconnection - Remote
>          Procedure Call (RPC)", ISO (International Organization for
>          Standardization) ISO/IEC 11578:1996, 1996.
>
>    [10]  "Specification of abstract syntax notation one (ASN.1)", ITU-T
>          recommendation X.208,  (see also RFC 1778), 1988.
>
>    [11]  Mealling, M., "A URN Namespace of Object Identifiers", RFC
>          3061, February 2001.
>
>    [12]  Paskin, N., "Information Identifiers", Learned Publishing Vol.
>          10, No. 2, pp. 135-156,  (see also www.doi.org), April 1997.

[snip]



Regards,    Martin.
Received on Monday, 30 August 2004 09:18:34 UTC