Re: Change proposal for ISSUE-56 from Adam Barth on 2010-07-15 (public-html@w3.org from July 2010)

From: Adam Barth <w3c@adambarth.com>
Date: Wed, 14 Jul 2010 18:12:15 -0700
To: Maciej Stachowiak <mjs@apple.com>
Cc: HTML WG <public-html@w3.org>, Sam Ruby <rubys@intertwingly.net>
Message-ID: <AANLkTikTNA73BKE8yiqTc3rvtZD2zC0R4f3UDYwn8tl2@mail.gmail.com>
Here is the updated text of my change proposal.  Hopefully the updated
proposal is sufficiently specific about the text it proposes
restoring.

== Summary ==

There is no need to align "URL" processing in HTML documents with the
IRI specifications because HTML documents do not contain IRIs (or URIs
for that matter).  We should restore the removed text that explained
how to translate input strings contained in text/html documents into
URIs.

== Rationale ==

ISSUE-56 was raised in error by Michael(tm) Smith based on a message
Roy sent to the working group.  Roy said that "pretending to define a
new URL standard as part of HTML5 is not acceptable ... HTML will
never define the identifiers for the Web. That would be a fundamental
violation of the Web architecture."  Based on my current understanding
of the web architecture and of how a sequence of characters in a
text/html document becomes a URI, he is correct.  However, that does
not imply that we ought to remove the "URL" processing requirements
from the HTML5 specification.

In a recent message to the IRI working group [1], Roy writes:

[[
RFC 3986 defines how to parse URIs (for recipients) and provides many
rules for scheme-specific specs to define how to generate URIs of a
given scheme (for producers) within the overall constraint of matching
the URI syntax (the formal ABNF).

[...]

Please understand that browsers almost never parse URI or IRI or
anything in between.  Browsers have input strings that contain one or
more references, usually in the document encoding, and so there is a
sequence of context-specific and charset-specific and
media-type-specific processing that occurs before you even get to the
individual URI-reference or IRI-reference that are defined by
3986/3987.

Some people have proposed that most of that pre-processing be added to
the IRIbis spec, but I have seen no evidence to suggest that such
pre-processing is even remotely standardizable (it seems to be
different for every input context).  If you can demonstrate or get
agreement on a single way to preprocess an input string, or at least a
few named processes (like single-ref and multi-ref), then that would
be useful.
]]

From this more detailed message, it appears that it is fully
appropriate for HTML5 to define an algorithm for translating input
strings containing one or more references into one or more URIs (or
IRIs, as appropriate).  In particular, Roy expects such translations
to be context-specific, charset-specific, and (importantly)
media-type-specific.  To wit: HTML5 ought to define the pre-processing
rules that are specific to the text/html media type.

To lend even more credence to this rationale, I quote from the very
same email message [2] written by Roy that Michael(tm) Smith cited in
the description of ISSUE-56.  This quote was omitted from the
description of ISSUE-56 for reasons unknown to me and to Michael(tm)
Smith:

[[
I suggest that the section be removed or replaced with the limited and
specific needs for parsing href and src attribute values such that the
attribute's value string is mapped to a URI-reference with a defined
base-URI.  HTML owns that process of extracting a valid URI-reference
from an attribute's value string.  A simple string parsing
description, with associated context-specific error-handling, is more
than sufficient to satisfy the needs of HTML5 without appearing to
override an existing standard that has recently been agreed to by all
vendors, including the few browser vendors that care about HTML5.
]]

In effect, this change proposal urges the working group to adopt Roy's
proposal: HTML5 should define how to extract a URI-reference from
strings contained in text/html documents, complete with
context-specific error handling.
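
To make the distinction between an input string and the URI it
represents concrete, here is a minimal sketch in Python.  It is not
the algorithm from the removed text, only an illustration under
stated assumptions: the helper name attribute_value_to_uri and the
constant HTML_SPACE are hypothetical, the document encoding is
assumed to be UTF-8, and the only error handling shown is
percent-encoding of characters that the RFC 3986 syntax does not
allow.

```python
from urllib.parse import quote, urljoin

# Hypothetical constant: whitespace to strip around an attribute value.
HTML_SPACE = " \t\n\f\r"

# Characters left untouched: RFC 3986 reserved characters plus "%",
# so percent-escapes already present in the input are preserved.
# (Letters, digits, and "_.-~" are always left untouched by quote().)
URI_SAFE = "/:?#[]@!$&'()*+,;=%"


def attribute_value_to_uri(value: str, base: str) -> str:
    """Illustrative only: map an href/src attribute value to a URI."""
    # Context-specific step: strip surrounding whitespace.
    trimmed = value.strip(HTML_SPACE)
    # Charset-specific step (UTF-8 assumed): percent-encode anything
    # outside the URI syntax, such as spaces and non-ASCII characters.
    reference = quote(trimmed, safe=URI_SAFE)
    # Generic step covered by RFC 3986: resolve against the base URI.
    return urljoin(base, reference)


print(attribute_value_to_uri("  /caf\u00e9 menu?q=1  ", "http://example.com/a/b"))
```

Only the last step is something RFC 3986 describes; the first two are
the kind of pre-processing that depends on the media type.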

For those who prefer rationales expressed in terms of objections, this
change proposal makes the following objections:

1) I object to HTML5 deferring to RFC 3987 for parsing input strings
containing one or more references because RFC 3987 does not define an
algorithm for parsing input strings containing one or more references
that takes into account the context-specific, charset-specific, and
media-type-specific rules required by user agents to interoperably
parse such input strings in text/html documents.

2) I object to HTML5 being blocked on the IRIbis working group for
defining an algorithm for extracting URI-references from strings
contained in text/html documents for two reasons:
  a) Defining such an algorithm is out of scope for that working
group's charter [3] because these strings are not IRIs and therefore
are not subject to the requirements contained in RFC 3987.
  b) The IRIbis working group has made essentially no technical
progress since its inception.  To wit: the working group has published
only a -00 version of a single Internet-Draft.  In contrast to Larry's
claim in his change proposal, the mailing list is essentially dead:
    i) There have been only two messages in June.
    ii) The messages in May consisted (essentially) of a discussion of
how to render BIDI URIs on billboards.
    iii) The messages in April consisted of coordinating with this
working group.

3) I (strongly) object to HTML5 not defining how to interoperably
process a hyperlink because a hyperlink is the essential feature of a
*hypertext* markup language.

== Proposal Details ==

The proposal details herein take the form of a set of edit
instructions, specific enough that they can be applied without
ambiguity:

1) Revert http://svn.whatwg.org/webapps@3245.  (Note: the editor and
the working group should feel free to continue to improve this text
after adopting this change proposal.)

== Impact ==

1) Positive effects: User agents will be able to implement
interoperable error handling for translating strings in HTML documents
into URIs.
2) Negative effects: Readers of the HTML5 specification will need to
learn the difference between these input strings and the URIs they
represent.

Q: What conformance classes will have to change?
A: User agents.

Q: What are the risks?
A: We might actually be able to process hyperlinks interoperably,
leading to joy and happiness.  With so much joy in the work, purveyors
of whisky might go out of business.

[1] http://lists.w3.org/Archives/Public/public-iri/2010May/0008.html
[2] http://lists.w3.org/Archives/Public/public-html/2008Jun/0435.html
[3] http://tools.ietf.org/wg/iri/charters
Received on Thursday, 15 July 2010 01:13:16 UTC