Re: more 'file' suggestions for draft-hoffman-file-uri from Mike Brown on 2004-09-22 (uri@w3.org from September 2004)

From: Mike Brown <mike@skew.org>
Date: Wed, 22 Sep 2004 14:50:27 -0600
To: Larry Masinter <LMM@acm.org>
Cc: uri@w3.org
Message-id: <4151E593.3010002@skew.org>

Larry Masinter wrote:

>>- The syntax of a file URI is that of absolute-URI, except that
>>  its scheme component must be 'file', case-insensitively.
>
>I think this just confuses things, since it doesn't really say
>anything.
>
OK, but I think a statement of lexical syntax, independent of semantics, 
should be made. I don't much mind what the syntax is, although I am 
under the impression that rfc2396bis requires all URI schemes to 
acknowledge that the resource identifier may contain components that are 
not meaningful to a particular scheme. If we make a statement like the 
one currently in the standard, that a file URI's syntax is 
file://<host>/<path>, then that implies certain restrictions. For example:

FILE://what@the:/heck?is;this  - file URI or no?

I say it is, on the grounds that it has the file scheme and matches 
absolute-URI. Sure, it contains a bunch of other junk, but that doesn't 
preclude it from being useful as an identifier of a file resource.

Tack on a fragment, though...

FILE://what@the:/heck?is;this#eh?

Is it still a file URI? This is a nuance of the rfc2396bis grammar that 
I'm unsure about. What is lexically a "URI" can have a fragment, but 
sec. 4.3 says "Some protocol elements allow only the absolute form...". 
Is a scheme definition a 'protocol element'? Or does a scheme have to 
acknowledge that a fragment may be present? If the latter, then strike 
what I said about absolute-URI: a file URI should just match the URI 
syntax rule, and a statement to that effect would be redundant (though 
what harm is there in having it?).
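
To make the two test cases concrete, here is a minimal Python sketch of the
check discussed above: the scheme must be 'file' case-insensitively, and the
presence of a fragment is reported separately so that either reading of the
grammar can be applied. It relies on the standard library's
urllib.parse.urlsplit, which is a lenient splitter and not a validator of the
rfc2396bis grammar. The function name and the returned dictionary are
illustrative assumptions, not part of any draft.

```python
from urllib.parse import urlsplit

def describe_file_uri(uri):
    """Split a URI and report whether its scheme is 'file' (any case)."""
    parts = urlsplit(uri)  # urlsplit returns the scheme in lower case
    return {
        "is_file_scheme": parts.scheme == "file",
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        # True when a "#" is present; absolute-URI does not allow a fragment.
        "has_fragment": "#" in uri,
    }

for uri in ("FILE://what@the:/heck?is;this",
            "FILE://what@the:/heck?is;this#eh?"):
    print(describe_file_uri(uri))
```

Both inputs report the file scheme with authority "what@the:", path "/heck"
and query "is;this"; only the second has a fragment.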

>>- The *typical* syntax of a file URI is more restrictive
>>  (no query component, authority is usually empty,
>>   path usually starts with "/")
>
>Well, hmmm, the 'authority' is interpreted differently, but
>it is either empty, 'localhost', or some other value; in
>some implementations, it is a host name in some local name
>space. For example, many Windows implementations treat
>the 'authority' component as a UNC host, e.g.,
> file://hostname/path/to/file  =>  \\hostname\path\to\file
>
Actually I'd rather not make a statement about the "usual" syntax, per 
se. Information about commonly used values is better provided in a 
separate section that states what each component of a file URI 
represents and how it is usually (or should be) interpreted.
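
For reference, the UNC mapping quoted above can be sketched in a few lines of
Python. This is only an illustration of that one convention, assuming a
non-empty host and a path with no Windows-specific decorations; the function
name is hypothetical, and real implementations differ in how they treat
'localhost', drive letters and percent-encoded characters.

```python
from urllib.parse import urlsplit, unquote

def file_uri_to_unc(uri):
    """Map file://host/path to \\\\host\\path (one common Windows reading)."""
    parts = urlsplit(uri)
    if parts.scheme != "file" or not parts.hostname:
        raise ValueError("need a file URI with a non-empty host")
    # Percent-decode the path, then switch "/" to the Windows delimiter.
    path = unquote(parts.path).replace("/", "\\")
    return "\\\\" + parts.hostname + path

print(file_uri_to_unc("file://hostname/path/to/file"))
# prints \\hostname\path\to\file
```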

>>- The authority component of a file URI is considered by this
>>  specification to contain a host component exactly as defined
>>  by the rfc2396bis grammar. (I don't want there to be any
>>  ambiguity about what the "host component" is).
>
>Well, is it ever anything other than empty, 'localhost' or
>a host name?
>
No, I mean that I don't want there to be any confusion as to whether 
"host component" means everything that comes between the 2nd and 3rd 
slash, as is implied by the current spec. That's actually the authority. 
Once we clarify what piece of the authority is the "host", we can make a 
statement about what it represents -- the host associated with the file 
the URI identifies -- and about what common & special values it has -- 
empty, 'localhost', or a URI-friendly value that is derived from the 
host's name.
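
The authority/host distinction can be shown with a short Python sketch. It
uses urllib.parse.urlsplit from the standard library, whose notion of
"hostname" is close to, but not guaranteed to match, the rfc2396bis host
rule; the function name and the dictionary keys are assumptions made for
illustration.

```python
from urllib.parse import urlsplit

def authority_parts(uri):
    """Separate the whole authority from the host inside it."""
    parts = urlsplit(uri)
    # Everything before the last "@" is userinfo, if an "@" is present.
    userinfo, at, _ = parts.netloc.rpartition("@")
    return {
        "authority": parts.netloc,           # all text between "//" and the path
        "userinfo": userinfo if at else None,
        "host": parts.hostname or "",        # only the host piece, lower-cased
    }

print(authority_parts("file://junk@myhost/a/b/c"))
print(authority_parts("file:///a/b/c"))
```

In the first URI the authority is "junk@myhost" but the host is only
"myhost"; in the second both are empty.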

>>- The path component of a file URI represents an identifier 
>>for the file
>>  as would be used in the host's principal file system interface
>>  (i.e., the path component of a file URI usually represents a file's
>>  "local path" on the host's file system). "File system interface" is
>>  assumed to be a well-understood concept.
>
>Actually, I disagree. What it *should* be is a translation of
>the local file system's path to a file, in the local character
>encoding for the file system, into (hex-encoded) UTF-8, where
>"/" is used consistently for directory delimiters, and with
>an appropriate platform-specific encoding for other top-level
>decorations of the file syntax.
>
Your statement and mine are not in conflict. Mine is a statement of what 
information is *conceptually* represented by a certain literal piece of 
the URI, and yours is a statement that implicitly relies on mine: once 
it is assumed that this piece of the URI represents a file system path 
(which I describe in less presumptive terms, since "path" is generally 
only meaningful in file systems that use hierarchical identification 
conventions), then you can provide the details of how to derive, from 
that file system path, a URI-safe value to be used as the path component 
of a file URI.
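
As a rough illustration of that second step, the following Python sketch
derives a URI-safe path from a list of path segments by percent-encoding each
segment as UTF-8 and joining the results with "/". Taking segments rather
than a native path string is an assumption made to sidestep platform
delimiters; drive letters and other "top-level decorations" are not handled,
and the function name is hypothetical.

```python
from urllib.parse import quote

def segments_to_uri_path(segments):
    """Percent-encode each segment as UTF-8 and join with "/"."""
    # safe="" so that a "/" inside a segment is encoded, not read as a delimiter.
    return "/" + "/".join(quote(seg, safe="") for seg in segments)

print("file://" + segments_to_uri_path(["home", "mike", "r\u00e9sum\u00e9 v2.txt"]))
# prints file:///home/mike/r%C3%A9sum%C3%A9%20v2.txt
```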


>>- Other components of a file URI, if defined, are not defined
>>  by this standard as necessarily representing anything in particular,
>>  but they do contribute to the identification of the file represented
>>  by the URI. Thus, a query component present in a file URI may or may
>>  not affect how the URI is dereferenced on a particular platform,
>>  but even when it does not affect anything, it cannot be assumed,
>>  in the absence of a standard stating otherwise, that a file URI with
>>  a query component is equivalent to a file URI without one.
>
>I think this is useless. Let's describe what usually works.
>
>I think the query component should be ignored when dereferencing the
>resource, but dynamic content of a file may be able to access 
>"the URI used to reference it", and take advantage of the
>query component in that way.
>
I don't feel it is useless to try to address these issues, but I concede 
that the way I phrased it is far from ideal.

As an implementer I'd like to know whether any extra junk in the file 
URI is to be completely ignored. Can I assume that 
file://junk@myhost/a/b/c?morejunk is equivalent to file://myhost/a/b/c 
as an identifier of file /a/b/c on host myhost? Am I allowed to 
dereference the two URIs in different ways based on the presence/absence 
of the junk, so long as I get a representation of the same file? I can't 
think of a good reason why not.
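
One possible policy, and only a sketch of it, is to compare file URIs on host
and path alone and to ignore userinfo and query entirely. Nothing in the
current standard mandates this; the Python below simply shows what "the junk
is ignored" would mean in practice. The function names are illustrative.

```python
from urllib.parse import urlsplit

def file_key(uri):
    """Reduce a file URI to the (host, path) pair used for comparison."""
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError("not a file URI")
    # Userinfo, port, query and fragment are deliberately dropped here.
    return (parts.hostname or "", parts.path)

def same_file(a, b):
    """True when two file URIs agree on host and path under this policy."""
    return file_key(a) == file_key(b)

print(same_file("file://junk@myhost/a/b/c?morejunk", "file://myhost/a/b/c"))
# prints True
```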


>>- The manner in which a host component represents a host is
>>  this: If the component is empty or is "localhost" (what if it is
>>  the percent-encoded equivalent of "localhost"?), the component
>>  represents the host on which the URI is being interpreted. No
>>  guidelines are given for the interpretation of any other values;
>>  they may take the form of IP addresses, DNS names, or any other
>>  identifier. No guidelines are given for how to dereference such
>>  identifiers (hey, I'm just describing current practice).
>
>I see no point in giving 'no guidelines', and you're not
>actually 'describing current practice', you're trying to
>avoid describing it by disclaiming any knowledge of current
>practice.
>

I feel that all we can say about the host part of the URI is that:

1. regardless of its value, it is a representation of the host;

2. it may have the special values 'localhost' or the empty string in order 
to represent the host on which the URI is interpreted;

3. any other value is a representation (%-encoded etc.) of the name of 
the host. The nature of the name being represented and how to 
dereference it are beyond the scope of the standard. This is the same 
position taken by rfc2396bis. At most, we can say that it is typical to 
use DNS-based lookups etc. and that it is inadvisable for there to be 
any surprises in this regard, but we shouldn't require any particular 
host-locating mechanism as a "must".

Further, now that rfc2396bis lets us %-encode the authority, we have to 
figure out what to do with file://local%68%6F%73%74/foo. I am 80% sure 
that we are required to make no distinction between that and 
file://localhost/foo, but a literal interpretation of the current 
standard would fail to require treatment of the %-encoded version as 
representing the host on which the URI is interpreted. (An implementer 
is free to do so anyway, but they're also free to do a DNS lookup of the 
percent-decoded version). So a decision should be made. Personally I 
think the 'localhost' constraint is garbage and should never be 
enforced; if I want to define 'localhost' to be something other than 
127.0.0.1 on my system, that's my prerogative, and I expect 
file://localhost/foo to resolve accordingly.
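
The percent-encoding question can be made concrete with a small Python
sketch. It assumes the reading in which the host is percent-decoded before
being compared with 'localhost'; that reading is exactly the decision still
to be made, so treat this as one option and not as settled behaviour. The
function name is hypothetical.

```python
from urllib.parse import urlsplit, unquote

def names_local_host(uri):
    """True if the host is empty or decodes to 'localhost' (any case)."""
    # urlsplit lower-cases the host but does not percent-decode it.
    raw_host = urlsplit(uri).hostname or ""
    host = unquote(raw_host)
    return host == "" or host.lower() == "localhost"

for uri in ("file:///foo",
            "file://localhost/foo",
            "file://local%68%6F%73%74/foo",
            "file://example.org/foo"):
    print(uri, names_local_host(uri))
```

Under this reading the first three URIs name the local host and the last does
not. A literal comparison without the unquote step would reject the third.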

-Mike
Received on Wednesday, 22 September 2004 20:50:29 UTC