
RE: HTML CHANGE PROPOSAL; change definition of URL to normative reference to IRIBIS

From: Larry Masinter <LMM@acm.org>
Date: Thu, 25 Feb 2010 18:45:10 -0800
To: "'Maciej Stachowiak'" <mjs@apple.com>
CC: <public-html@w3.org>, "'Ted Hardie'" <ted.ietf@gmail.com>
Message-ID: <000301cab68d$b6aa79b0$23ff6d10$@org>

With regard to ISSUE-56, ACTION-171:

Rationale:

The Issue this proposal is trying to address is:
"Bring URLs section/definition and IRI specification in alignment."

(1) The fundamental rationale is that URLs in HTML and similar
identifiers in other Internet systems need to have the same syntax
and semantics. The advantages of doing this in technical
specifications include all of those articulated for modular
specifications.

(2) The IETF has approved an IRI working group whose charter
specifically includes working with the W3C HTML working group,
as noted in:
http://lists.w3.org/Archives/Public/public-html/2010Feb/0476.html
and
http://tools.ietf.org/wg/iri/charters which includes:

" The IRI specification(s) must (continue to) be suitable
  for normative reference with Web and XML standards from W3C
  specifications. The group should coordinate with the W3C working
  groups on HTML5, XML Core, and Internationalization, as well
  as with IETF HTTPBIS WG to ensure acceptability.  "

Evidence of interest in contributing to this work from outside
the current membership of the W3C HTML working group is the
extensive participation and time already spent in meetings,
including:

  * Meetings at the last W3C TPAC
  * Two working group development sessions at IETF meetings
    with significant participation by non-HTML-WG members
    http://www.alvestrand.no/pipermail/idna-update/2009-October/005720.html
    http://lists.w3.org/Archives/Public/public-iri/2009Nov/0040.html
    http://www.alvestrand.no/pipermail/idna-update/2009-July/004598.html
  * Interest in, and discussions with, members of the Unicode
    Consortium Technical Committee.

In addition, there is evidence that this work can succeed:
the discussion in the mailing list for the IRI working group
http://lists.w3.org/Archives/Public/public-iri/ is active;
most of the recent active contributions have been by
W3C HTML Working Group members, with additional contributions
from the broader community of Internet application
development.


The first F2F meeting of the IRI working group in IETF
will be Friday, March 25, but of course, as with all IETF
working groups, the primary work of the group is on the
mailing list, and there is no cost or fee for participation
there.

(3) Recent public-iri discussion seems to raise the issue that the
current definition of URLs in the existing HTML5 specification 
may not match implementations in any case. The analysis of
how currently deployed systems work, and how they should work
in the face of changes to the Internationalization of Domain 
Names, should be done in a context where the affected communities
(IDN, Unicode Technical Committee, HTML WG, etc.) can come
to agreement.

(4) Additional information in the HTML5 bug report
 http://www.w3.org/Bugs/Public/show_bug.cgi?id=8207
indicates that the reason for rejecting this as a "bug"
is that the IRI document is 'vague' and does not contain
sufficient normative language to satisfy some who believe
that MUST language with normative algorithms is necessary.
However, these requirements should be handled as updates
to the IRI specification, so that the HTML5 specification
does not contain implementation advice that diverges from
that used by every other application that uses URLs/IRIs.

(5) While there may be additional adjustments necessary
to align the boundary between the HTML5 document and the
IRIBIS document, this work should proceed as bugs on the
drafts, as amended by this change proposal.

===============================================================
Proposal:

The actual proposal itself was available as an attachment to
http://lists.w3.org/Archives/Public/public-html/2009Nov/0670.html
http://lists.w3.org/Archives/Public/public-html/2009Nov/att-0670/iri-rewrite-draft.html

A minor update of that proposal (edited to update the reference
to point to the IETF document) is attached to this message
and also made available in plain text here:


================================================================


NOTE: This is a draft of one way of rewriting section 2.5.1 of the
HTML 5 editor's draft of 25 August 2009, provided as an example.


2.5.1 Terminology

Historically the term "URI" was used for "Universal Resource
Identifier" [RFC 1630], with a Uniform Resource Locator (URL) being the
form of URI which expresses an address which maps onto an access
algorithm using network protocols. Further technical specifications
([RFC 1738], [RFC 1808], [RFC 2396] and [RFC 3986]) subsequently
defined a "relative URL", elaborated the distinction between Uniform
Resource Names (URN) and URLs, and led to the adoption of "URI" as
Uniform Resource Identifier. [RFC 3987] then introduced the notion of
an "Internationalized Resource Identifier" (IRI) as a syntactic form
which allows (unencoded) non-ASCII Unicode characters.
[HTML 4.01] (from which this specification evolved) used "URI" as
specified by [RFC 2396], but contained recommended processing rules
for HTML agents (in [HTML 4.01] appendix B.2) for handling invalid
values containing non-ASCII characters, roughly corresponding to the
guidance in [RFC 3987].

Popular informal usage continues to use "URL" to refer to any of these
variations, although, for the most part, the term "URL" alone
indicates an "absolute" form including a scheme (see below).

Definition: In this document, the term "URL" is used for any string
used to identify a resource, including relative forms; the
distinction between the various forms is made in context, with
qualifiers, or by processing rules, as to whether the URL corresponds
to a URI or a "relative reference" (as specified in [RFC 3986]), to the
"internationalized" forms of those, IRI and relative IRI reference (as
specified in [draft-ietf-iri-3987bis]), or to a string which (after
being preprocessed by the rules defined in Section 7.2 of
[draft-ietf-iri-3987bis]) results in one of those forms.

Definition: a valid URL is a string that matches the production of
"iri-reference" in [draft-ietf-iri-3987bis].

Definition: a valid absolute URL is a string that matches the
production of "IRI" in [draft-ietf-iri-3987bis].

Definition: an absolute URL is a string which results in a valid
absolute URL (defined above) after being processed by the rules of
"Web Address Processing" in section 7.2 of [draft-ietf-iri-3987bis].
Note that this basically means any string which, after preprocessing,
starts with an initial string matching the "scheme" production of
[draft-ietf-iri-3987bis], followed by a colon.
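
As a rough illustration of the note above, the following Python sketch
tests whether an already preprocessed string begins with a scheme
followed by a colon. It assumes that the "scheme" production of
[draft-ietf-iri-3987bis] is the same as the one in [RFC 3986]; the
names SCHEME_PREFIX and looks_absolute are made up for this example,
and the section 7.2 preprocessing itself is not implemented here.

```python
import re

# "scheme" as defined in RFC 3986:
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# Assumption: the draft's "scheme" production is identical.
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def looks_absolute(preprocessed: str) -> bool:
    """Return True if the string starts with a scheme followed by a colon.

    The input is assumed to have been preprocessed already; this check
    does not validate the rest of the string against the "IRI" production.
    """
    return SCHEME_PREFIX.match(preprocessed) is not None
```

Under this check, "http://example.org/" and "mailto:user@example.org"
count as absolute, while "../g" and "//example.org/x" do not.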

Definition: A relative URL is a URL that is not an absolute URL;
similarly, a valid relative URL is a valid URL that is not an absolute
URL.

Definition: To parse a URL into its component parts means to first
preprocess the string according to section 7.2 of
[draft-ietf-iri-3987bis] "Web Address Processing", and then to parse
the results of preprocessing (as per section 3.2 of
[draft-ietf-iri-3987bis]) against the "iri-reference" (if parsing a
URL) or the "IRI" production (if parsing an absolute URL). Note that
the preprocessing steps generally result in a valid URL or a valid
relative URL. Matching BNF components results in the following parts:

    * <scheme>: substring that matched "scheme", if any
    * <host>: substring that matched "ireg-name", if any
    * <port>: substring that matched "port", if any
    * <hostport>: if there is a scheme component and a port component
      and the port given by the port component is different from the
      default port defined for the scheme component (if the default
      port for the scheme is known), then <hostport> is the substring
      that starts with the substring matched by the host production
      and ends with the substring matched by the port production, and
      includes the colon in between the two. Otherwise, it is the same
      as the host component.
    * <path>: substring that matched "ipath", if any
    * <query>: substring that matched "iquery", if any
    * <fragment>: substring that matched "ifragment", if any
    * <host-specific>: the substring that follows the substring
      matched by the "iauthority" production, or the whole string
      (that is, the input to the matching algorithm, which is the
      result of preprocessing by section 7.2) if the "iauthority"
      production wasn't matched.
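
The component list above can be sketched in Python as follows. This is
only an approximation under stated assumptions: it uses the generic
splitting regular expression from Appendix B of [RFC 3986] rather than
the full "iri-reference" grammar, it does no preprocessing or
validation, it treats any host form (not only "ireg-name") as <host>,
and it reads an unknown default port as "different". The names URI_RE,
HOST_PORT_RE, DEFAULT_PORTS, Parts and split_parts are invented for
the example, and the port table is illustrative.

```python
import re
from typing import NamedTuple, Optional

# Generic component-splitting regex from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
# Host (bracketed IP literal or colon-free name) plus optional ":port".
HOST_PORT_RE = re.compile(r"^(\[[^\]]*\]|[^:]*)(?::(\d*))?$")
# Illustrative default-port table (an assumption, not part of the proposal).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


class Parts(NamedTuple):
    scheme: Optional[str]
    host: Optional[str]
    port: Optional[str]
    hostport: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]
    host_specific: str


def split_parts(s: str) -> Parts:
    """Split an already preprocessed string into the parts listed above."""
    m = URI_RE.match(s)  # always matches: every group is optional
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    host = port = hostport = None
    host_specific = s  # whole string when no authority was matched
    if authority is not None:
        host_specific = s[m.end(3):]  # everything after "//authority"
        hostpart = authority.rpartition("@")[2]  # drop any userinfo
        hm = HOST_PORT_RE.match(hostpart)
        if hm:
            host, port = hm.group(1), hm.group(2) or None
        else:
            host = hostpart
        hostport = host
        # Keep the port only when it differs from the scheme's default.
        if scheme and port and DEFAULT_PORTS.get(scheme.lower()) != int(port):
            hostport = f"{host}:{port}"
    return Parts(scheme, host, port, hostport, path, query, fragment,
                 host_specific)
```

For "http://example.org:8080/a/b?x=1#frag" this gives a <hostport> of
"example.org:8080" and a <host-specific> of "/a/b?x=1#frag"; for
"http://example.org:80/" the <hostport> is just "example.org".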

Definition: The phrasing "resolve ... relative to ..." (as in
"resolve a URL relative to another URL") is used to describe the
process of combining two strings, an original URL and a base URL
(usually an absolute URL), to obtain parsed components; these parsed
components may then be recombined to construct a new URL. This is
accomplished by parsing the original and base URLs (preprocessing by
section 7.2 of [draft-ietf-iri-3987bis] first, then matching against
the productions of section 3.2 of [draft-ietf-iri-3987bis]) and then
combining the original and base components following the algorithms in
section 5.2 of [RFC 3986], applied to the Unicode characters which
constitute the original and base.
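
For illustration only, Python's standard urllib.parse.urljoin performs
reference resolution along the lines of section 5.2 of [RFC 3986] and
operates on Unicode strings without percent-encoding them, so it can
stand in for the combining step. The wrapper name resolve is made up
for this example, and the section 7.2 preprocessing is assumed to have
been applied to both inputs beforehand.

```python
from urllib.parse import urljoin


def resolve(original: str, base: str) -> str:
    """Resolve `original` relative to `base` and recombine the result.

    Both strings are assumed to be preprocessed already; urljoin does
    the merging and dot-segment removal.
    """
    return urljoin(base, original)


# Base URL used by the examples in RFC 3986, section 5.4.
BASE = "http://a/b/c/d;p?q"

print(resolve("g", BASE))
print(resolve("../g", BASE))
print(resolve("../thé", "http://example.org/café/menu"))
```

With the base from section 5.4 of [RFC 3986], "g" resolves to
"http://a/b/c/g" and "../g" to "http://a/b/g"; non-ASCII characters in
the path pass through unchanged.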

Definition: the document base URL of a Document object is the absolute
URL defined by:

   1. Let fallback base url be the document's address (an absolute
      URL).
   2. If fallback base url is the string about:blank and the
      Document's browsing context has a creator browsing context, then
      let fallback base url be the document base URL of the creator
      Document instead.
   3. If there is no base element that is both a child of the head
      element and has an href attribute, then the document base URL is
      fallback base url.
   4. Otherwise, the document base URL is the result of resolving
      the href attribute of the first such element relative to fallback
      base url (note that the base href attribute isn't affected by
      xml:base attributes).
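
The four steps above might be sketched as follows. The Document class
here is a hypothetical stand-in with only the fields the algorithm
needs (it is not a real DOM API), base_href is assumed to hold the href
of the first base element that is a child of head, and urljoin is used
in place of the full resolution procedure, without the section 7.2
preprocessing.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin


@dataclass
class Document:
    """Hypothetical minimal model of a Document, for illustration only."""
    address: str                          # the document's address (absolute URL)
    base_href: Optional[str] = None       # href of the first <base href> in <head>
    creator: Optional["Document"] = None  # creator Document, if any


def document_base_url(doc: Document) -> str:
    # Step 1: start from the document's address.
    fallback = doc.address
    # Step 2: about:blank documents inherit the creator's document base URL.
    if fallback == "about:blank" and doc.creator is not None:
        fallback = document_base_url(doc.creator)
    # Step 3: with no suitable base element, the fallback is the result.
    if doc.base_href is None:
        return fallback
    # Step 4: otherwise resolve the base href relative to the fallback.
    return urljoin(fallback, doc.base_href)
```

For a document at "http://example.org/dir/page.html" whose base href is
"/assets/", this yields "http://example.org/assets/", and an
about:blank document created by it inherits the same value.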

 





Received on Friday, 26 February 2010 02:46:05 UTC
