file URL is overspecified

The URI reference <http://www.w3.org/TR/REC-html40/references.html> in
HTML 4 points to RFC 2396 <http://www.ietf.org/rfc/rfc2396.txt>, which has
been obsoleted by RFC 3986 <http://www.ietf.org/rfc/rfc3986.txt>. The latter
document has a new section 2.5, "Identifying Data", containing the following
new material:

URI characters provide identifying data for each of the URI components,
serving as an external interface for identification between systems.
Although the presence and nature of the URI production interface is hidden
from clients that use its URIs (and is thus beyond the scope of the
interoperability requirements defined by this specification), it is a
frequent source of confusion and errors in the interpretation of URI
character issues.  Implementers have to be aware that there are multiple
character encodings involved in the production and transmission of URIs:
local name and data encoding, public interface encoding, URI character
encoding, data format encoding, and protocol encoding.

Local names, such as file system names, are stored with a local character
encoding.  URI producing applications (e.g., origin servers) will typically
use the local encoding as the basis for producing meaningful names.  The URI
producer will transform the local encoding to one that is suitable for a
public interface and then transform the public interface encoding into the
restricted set of URI characters (reserved, unreserved, and
percent-encodings). Those characters are, in turn, encoded as octets to be
used as a reference within a data format (e.g., a document charset), and
such data formats are often subsequently encoded for transmission over
Internet protocols.
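
To make the chain of encodings in the quoted text concrete, here is a minimal sketch in Python. The file name and the choice of Latin-1 as the local encoding are my own assumptions for illustration; the RFC does not prescribe either.

```python
from urllib.parse import quote, unquote

# Hypothetical local file name, stored by the file system in Latin-1.
local_octets = b"caf\xe9.html"

# Local encoding -> public interface encoding (here: Unicode text).
public_name = local_octets.decode("latin-1")

# Public interface encoding -> restricted set of URI characters.
# quote() percent-encodes the UTF-8 octets of each non-ASCII character.
uri_path = quote(public_name)

# URI characters -> octets inside a data format (an ASCII-compatible charset).
document_octets = uri_path.encode("ascii")

print(uri_path)  # caf%C3%A9.html

# Whoever resolves the reference has to undo the URI character encoding.
print(unquote(uri_path) == public_name)  # True
```

The last step is the one that matters for local resources: some component has to reverse the percent-encoding before the name can be matched against the file system.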

The new statements above are slightly incompatible with what the HTML URI
encoding specification
<http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1> says:

URIs do not contain non-ASCII values 

That statement is true for what the RFC calls "public interface encoding":
it seems reasonable that the user agent should use a URL when it requests
an external resource. However, requiring HTML documents to use a public URI
for resources that the user agent is expected to serve without communicating
with an external server, such as local files identified using the file
scheme, seems an excessive complication to me.  Internet Explorer does not
respect this prohibition <http://blogs.msdn.com/ie/atom.xml> because it uses
IRIs, not URIs, internally, and converts them to URLs if needed when it
communicates with an external server.  If an external URL is specified in
the source document as percent-encoded, it is passed through unaltered
because no further encoding is needed and the server is responsible for
decoding; however, there is no server to decode a local URL, so it remains
unresolved.  That is not compliant with the current standard, but I think in
this case the implementation is right and the standard needs some freedom
with respect to local URLs.

Of course, one could always dismiss this with the argument that an HTML
document containing a reference to a local resource cannot be published and
can therefore be authored as noncompliant.  However, this is only partially
true.  The reason is that the prohibition of B.2.1 has propagated to the
XSLT specification, which refers to it explicitly where it specifies how URI
attributes should be transformed by the html output method
<http://www.w3.org/TR/xslt#section-HTML-Output-Method> .  In effect, a
document produced by a conforming XSLT processor for local usage is
perfectly valid and perfectly useless: hyperlinks are broken and images do
not show up.
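
As a rough illustration of the escaping that B.2.1 recommends (convert each non-ASCII character to UTF-8 and write each octet as %HH), here is a small Python sketch. The function name and the file path are hypothetical, and this is only my reading of the recommendation, not the code of any actual XSLT processor.

```python
def escape_non_ascii(uri_ref: str) -> str:
    """Escape only the non-ASCII characters of a URI attribute value."""
    escaped = []
    for ch in uri_ref:
        if ord(ch) < 128:
            # ASCII characters are passed through unchanged.
            escaped.append(ch)
        else:
            # Non-ASCII: UTF-8 octets, each written as %HH.
            escaped.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(escaped)


# Hypothetical local reference as an author might write it in the source.
href = "file:///C:/docs/café.html"
print(escape_non_ascii(href))  # file:///C:/docs/caf%C3%A9.html
```

The escaped form is what ends up in the href attribute of the output document; whether the link then works depends on the user agent decoding it again before it looks up the local file.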

* My suggestion: The constraints for URLs denoting local resources should
  be relaxed.

I understand that this is fixed by HTML 5
<http://www.whatwg.org/specs/web-apps/current-work/multipage/section-document.html>,
so this is perhaps the good news:

The href content attribute, if specified, must contain a URI (or IRI).

Best regards,

Christopher Yeleighton

Received on Friday, 15 June 2007 11:57:28 UTC