- From: Harry Halpin <hhalpin@ibiblio.org>
- Date: Mon, 4 Apr 2005 22:56:22 -0400 (EDT)
- To: www-tag@w3.org
- Cc: www-rdf-interest@w3.org, semantic-web@w3.org
Again, the usual questions about the SemWeb seem to be popping up, in particular http-range-14, and there doesn't seem to be much progress on these issues. Here are some notes that I think may be helpful. They try to distinguish between URIs as names for reference and URIs as locations for physical access, and to define the elusive term "on the Web" as describing something that would also be destroyed if the Web were destroyed. I also distinguish between the use of "representation" in REST and its use in AI/philosophy, which are not always the same. I think these distinctions, and taking them seriously, are very important to http-range-14.

The full text is here, and benefited from some discussion with Pat Hayes: http://www.ibiblio.org/hhalpin/homepage/notes/uri.html

Text version below:

-----------------------------------------------------------------------

URIs as Names for Reference and as Locations for Access
httpRange-14 notes
By Harry Halpin

Thanks to Pat Hayes for some examples and commentary, although any errors are due to me of course!

What do URIs identify?

In essence, one reason the Web works is that, using a web protocol like HTTP (Hypertext Transfer Protocol), a client can send a request to a server to perform an operation such as HTTP GET on a given URI and dereference something, often a web-page. However, this very basic feature of the Web is bedeviled by a question: "What is the range of the HTTP dereference function?" In other words, what do URIs identify?

In theory this question has been solved by the W3C TAG's AWWW: URIs refer to anything. Upon inspection, the official definition is actually circular: "We do not limit the scope of what might be a resource...it is used in a general sense for whatever might be identified by a URI." The question then arises: if a resource is just anything that could theoretically be identified with a URI, is there anything that cannot be identified? It would seem not.
This view is given by the AWWW as "our use of the term resource is intentionally more broad. Other things, such as cars and dogs ... are resources too." However, referring to a web-page and referring to the car in my garage are similar, but not exactly the same. The essential difference is this: in the first case we have physical, connected access to the web-page on the Web, while in the second case, if we are using Semantic Web logic to refer to my car, we only have the ability to refer to my car by a URI name, and this gives no direct, connected, or physical access. When one uses a URI as a name there is a disconnect, as the thing named may not be on the Web.

The division between representation and resource existed, but was not explicitly stated to, and definitely not noticed by, most of the users of the original hypertext Web. URLs seem to have been originally meant to identify the location of representations, such as HTML web-pages, or possibly sets of representations, such as when, through content negotiation, a news website figures out where you live and then serves you your local news. With the advent of the Semantic Web, the problem of httpRange-14 comes up precisely because a URI can be used to refer to anything, not just web pages. To be more precise, the issue comes up because URIs can refer to things that are not "on the Web" and so do not necessarily have a Web-accessible representation. Despite this, these things that are "not on the Web" are fundamentally "on the Web" in another sense, since they can be reasoned about by the Semantic Web. The crucial point is: what does "on the Web" mean? To answer that question we must pursue the historical chain of events from URL to URN to URI.

Locations

Uniform Resource Locators (URLs) did not suffer from the httpRange-14 issue, unlike their nearly identical brethren, URIs. Unlike URIs, URLs identified a specific type of thing: a location, which is a physical place. This location was assumed to be on the Web.
By "on the Web," we mean something that is physically connected to the Web. A URL denotes a location on some web-server which serves representations (HTML document, music file to download, whatever) to visiting web clients. A location can be connected to the Web because it is - even after endless redirection - in a physical place.

Take a mundane example: my address. An address is just a location that has a thing that can (usually) be found at that location, and there exists a specified system for finding the location of an address. This allows multiple locations to be ordered in a way that humans (in the case of street addresses) or machines (in the case of IP addresses) can navigate easily. In the case of my address, if someone wants to find me, they can try looking for me at the location of my address - and I'm sometimes not there, so my address can give the person trying to find me a metaphysical 404 error.

A location can, and should, give you direct, connected, physical access to the thing at the location. URLs are used as names of locations, and sending an HTTP GET (or POST, or HEAD, and so on) to a server requires the server, if possible, to go to the location and physically access the thing at the location, usually by copying it and sending a copy to your computer. Or sending a very real 404 error.

On the Web

Something can be found on the Web if it is physically and causally connected to the Web. This means that whatever is "on the Web" can be encoded into bits and transferred over the Web. However, this is "on the Web" only in the strongest sense: as in always on the Web. A thing can be on the Web only sometimes, or only partially, or only rarely. By our definition, a thing is strongly on the Web if it could not be removed from the Web without loss of its functionality. One can imagine a whole range of possibilities, from being "strongly" on the Web (all the time) to "weakly" on the Web (occasionally).
Thus, both documents and servers are "on the Web," while humans are "on the Web" only in a weak sense, since they interact with the Web only indirectly, through typing on keyboards. Things like the Eiffel Tower or Louis XVI are definitely "not on the Web," since Louis XVI is long gone and cannot at any point directly connect physically to the Web, while the Eiffel Tower is only represented on the Web and is not physically sending any bytes to anyone itself. The Eiffel Tower is composed not of bytes, but of iron. This brings us to "representations" on the Web.

What is the difference between something merely having a representation on the Web and something being fully on the Web? Rephrasing Brian Smith: something is on the Web if, were the Web itself destroyed, that thing would also be destroyed. If not, it's not fully on the Web. If someone destroyed the Web, this would not damage me if I were being denoted by a URI, but my homepage at that URI would go up in smoke if that's what people were using to refer to me. I am not on the Web in a strong sense, but my homepage sure is.

There are lots of middling cases: my computer is weakly on the Web, more so than myself. If my httpd daemon went down and my computer could no longer access the Web, or the Web itself collapsed, the computer qua computer would still exist, but the computer qua Web server would go up in smoke with the rest of the Web. One good question yet to be answered is: when are humans on the Web in a strong sense? Would it require our credit card details to be in a chip beneath our skin with a URI, and wireless internet monitoring us with a GPS that sent messages over the Internet? Those examples also seem too simplistic and extreme. Still, what is the difference between something being represented on the Web and being on the Web?
One necessary but not nearly sufficient condition for "representation" would be that a thing X represents another thing Y only if you can destroy thing X and thing Y remains unscathed. Representations qua representations are on the Web, and would be destroyed if the Web was destroyed. However, what they represent would not be destroyed, unless what the representation represented was also on the Web.

Representations: REST and AI

Before going any further, we have to distinguish two different uses of the word "representation." The first is the use of "representation" as it is used in artificial intelligence, cognitive science, and philosophy. In this use, a representation is something that "denotes" or "is about" something else, although often additional requirements are put on exactly what type of things the representation or its denotation may be. This will be called "representationAI." The second is the use of "representation" in REST (the Representational State Transfer web architecture theory of Roy Fielding), where a representation is whatever is returned for a URI in response to an HTTP request. This will be called a "representationREST." A representationREST, unlike a representationAI, does not necessarily refer to or denote any other thing - although it might! The two definitions are not the same, but they are not mutually exclusive either.

So, the difference between "on the Web" and "not on the Web" is also a test for both types of representation. A representationAI can, qua representationAI, be entirely on the Web if what it represents is also on the Web. Lots of representations, such as an analog photo on my desk, are not on the Web at all. In another case, a picture of me on the Web is on the Web qua itself but not on the Web qua me, because it denotes me, not something on the Web. If the Web was destroyed, it would only destroy the bytes of the representationAI, not necessarily what the representation denoted.
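The destruction test for things and for representationsAI can be made concrete with a small model. This is only an illustrative sketch, not part of the argument: the `Thing` class, its field names, and the reduction of "strongly" versus "weakly" on the Web to a single boolean are all assumptions made for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thing:
    """Hypothetical model of anything a URI might identify."""
    label: str
    on_web: bool                        # simplification: True means "strongly on the Web"
    denotes: Optional["Thing"] = None   # what it is "about", if it is a representationAI


def destroyed_with_web(thing: Thing) -> bool:
    """The destruction test: would this thing vanish if the Web vanished?"""
    return thing.on_web


def is_representation_ai(thing: Thing) -> bool:
    """A representationAI is anything that denotes something else."""
    return thing.denotes is not None


def entirely_on_web(rep: Thing) -> bool:
    """A representationAI is entirely on the Web only if both it and
    the thing it denotes are on the Web."""
    return is_representation_ai(rep) and rep.on_web and rep.denotes.on_web


# The examples from the text, encoded under the assumptions above.
author = Thing("the author", on_web=False)
homepage = Thing("the author's homepage", on_web=True)
web_photo = Thing("photo of the author on the Web", on_web=True, denotes=author)
desk_photo = Thing("analog photo on a desk", on_web=False, denotes=author)
page_about_homepage = Thing("page describing the homepage", on_web=True, denotes=homepage)
```

In this model the photo on the Web is destroyed with the Web while the author is not, so the photo is on the Web "qua itself" but is not entirely on the Web; only `page_about_homepage`, which denotes another on-the-Web thing, passes `entirely_on_web`.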
Also, representationsAI may have layers of representationAI, as one representation may denote other representationsAI, leading to all sorts of interesting chains of reference. However, representationsREST are by definition on the Web, and would be destroyed if the Web was destroyed, at least as the possible objects of HTTP operations. This is because representationsREST are defined precisely as the bytes that are sent over the Web. One could argue that copies of them archived to a computer might survive. However, those copies would no longer be representationsREST qua the Web, but just whatever they are without the Web being involved. This argument does reveal that both sorts of representation are functional categories that depend on their context: something is never a representationREST without being on the Web (or, in some parallel universe, on another system that implements REST), and something is never a representationAI without something being represented.

Virtual Locations and Digitality

This idea of physically being on the Web can be abstracted from the concept of location. "Being on the Web" does not mean a thing has one URL or even one physical location. Something could be on the Web and have multiple URLs, or multiple copies in different physical locations. A location can be a virtual location, an abstraction over a set of possible physical representations, as long as it really is a location.

What exactly is the "thing" at a URL location? It's not just a particular server, nor is it some abstract resource. It is actually some bytes, a representationREST or set of representationsREST, which one has to actually GET with a web client to determine whether it is also a representationAI. The place where the actual representationREST lives is denoted by another type of location: wherever it is on the server, and the server has a very concrete IP address.
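The notion of a virtual location as an abstraction over several physical copies can be sketched as a small data structure. Everything here is hypothetical: the class names, the `region` key used to pick a copy, the example.org URL, and the documentation-range IP addresses are assumptions chosen for illustration, and real mirror selection happens in DNS or at the server, not in client code like this.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PhysicalCopy:
    """One concrete place where the bytes live (illustrative values only)."""
    ip_address: str   # the server's very concrete address
    path: str         # where the bytes sit on that server
    body: bytes       # the representationREST actually sent to the client


class VirtualLocation:
    """A single URL standing for a set of possible physical copies."""

    def __init__(self, url: str, copies: Dict[str, PhysicalCopy], default_region: str) -> None:
        if default_region not in copies:
            raise ValueError("default_region must be one of the known copies")
        self.url = url
        self.copies = copies
        self.default_region = default_region

    def resolve(self, region: str) -> PhysicalCopy:
        """Forward the virtual location to a concrete copy, falling back to the default."""
        return self.copies.get(region, self.copies[self.default_region])

    def get(self, region: str) -> bytes:
        """Simulate an HTTP GET: return the bytes of whichever copy was chosen."""
        return self.resolve(region).body


site = VirtualLocation(
    "http://example.org/",
    {
        "east": PhysicalCopy("192.0.2.10", "/var/www/index.html", b"<html>east mirror</html>"),
        "west": PhysicalCopy("198.51.100.20", "/srv/www/index.html", b"<html>west mirror</html>"),
    },
    default_region="east",
)
```

Calling `site.get("west")` and `site.get("east")` returns different bytes from different machines, yet both calls name the same virtual location, which is the point of the abstraction.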
A URL can be a name that denotes a virtual location, which is then forwarded to the place where the concrete bits are stored. These bits are usually on a server somewhere. When I access http://www.w3c.org, if I am in Japan I get the mirror of the W3C web-pages in Japan, and if I'm in the US I get the one hosted at MIT, but I get the same "resource" regardless. Here the concept of resource as stated by the TAG starts making some sense. It's a concept about the contents of a representationREST. However, this resource is not identical to the thing physically received as bytes (that's the representationREST). A resource seems to be the abstract idea of the common information between all the possible representationsREST returned. To properly understand "resource," then, one needs a thorough inspection of theories of information and content, which is beyond the scope of this little note. Still, what is physically returned by an HTTP GET is just the representationREST, which may differ between MIT and Kyoto, while it might not between INRIA and MIT.

The fact that the Web is digital becomes crucially important: the "copyability" of representationsREST, due to their digital nature, is crucial to why the Web works, just as crucial as a universal naming scheme. Yet things not "on the Web" (Pat Hayes qua Pat Hayes, my dog, etc.) don't have this property of copyability. A picture on the Web of Pat Hayes is digital, but Pat Hayes is not, no matter how much time he spends online.

What's in a Name?

A name is entirely different from a location. Unlike a location, a name does not necessarily give you access to the thing named, and this thing named we will call the referent of the name. The set of all referents of a name (or denotations of a representation, for that matter) we will call its interpretation. In fact, names are usually used when connected, physical access is impossible, and as such they are place-holders for the physical thing precisely because there is no physical access.
This concept of "names" is more in line with the URN effort, which essentially tries to provide names that serve as rigid designators, in the Kripkean sense, for the Web. Since a name has no physical connection to its referent, putting a name on the Web via a URI (such as a URN) does absolutely nothing at all to the referent of the name. When anyone accesses the resource "Pat Hayes" from the URI http://www.ihmc.us/users/phayes/PatHayes.html, Pat Hayes does not magically appear next to them. What that URI currently can return from an HTTP GET is a representationREST: a web-page in HTML, encoded as very physical bytes somewhere, that gets sent to me over a wire as very physical bytes and is then displayed by a very physical computer, showing the social security number of Pat Hayes and other defining details. It could even theoretically return a definition of Pat Hayes in RDF. Yet this particular representationREST also serves double duty as a representationAI, since it contains pictures of the actual Pat Hayes, relevant facts about him, and so on. Pat Hayes himself is not on the Web, since if the Web were destroyed Pat Hayes would merrily go along, probably with more spare time.

So, the use of a URI as a "name" causes a URI to be used as a representationAI. However, what exactly the interpretation of a URI as a "name" actually is goes beyond the physics of transferring bytes. This interpretation is either the yet-to-come metaphysics of the Semantic Web, social meaning, or something else - who knows? What is important is that it is a non-physical, non-causal, non-connected relationship, unlike the relationship of a location, which is a physical, connected, causal relationship. Note that URIs used as names-for-reference are common in the Semantic Web, and the Semantic Web depends on there being names with interpretations to reason over.
Because there is no direct access to the thing the URI-as-name identifies, unlike the use of a URI-as-location, the Semantic Web uses URIs without any necessary use of representationsREST. URIs in the Semantic Web are used more like "place-holders" or even (stretching it a bit) "keys," without any HTTP operation returning any bytes from a server as a representationREST. Thus, the Semantic Web uses URIs as representationsAI, while the Good-Old HyperText Web uses URIs as representationsREST.

Double Lives as Names and Locations

The key to the confusion is that HTTP will fundamentally dereference whatever a URI refers to, and there are two distinct types of functional roles a URI can play: name and location. A URI can serve as an identifier-as-a-name, which is a non-physical relation of reference, and as an identifier of a location, which is a physical relation of access. Just naming something has no effect on the thing named: naming something does not bathe the thing named in any type of energy that we can detect via a physical radar. There is no way to build a detector to detect what exactly someone means by a URI, although we can guess from talking to them or accessing representations they give us. Locations give you physical, connected access to a thing. If you go to a location to get something, and the thing is there, you return with it physically in hand. A name might, but does not have to and usually does not, give one any sort of physical, connected access to the thing named.

The word "identifier" is even more vague than name or location, and here the problem of the "identity crisis" appears: how do we know if the URI is being used for something as a name or as a location? The URI itself does not tell us. Even worse, what does "identify" mean, and how can we tell if two things identify the same thing? With representationsAI that is sometimes very clear, as in photographs, and sometimes not so clear, as in abstract art.
Even the integers have problems with identification: does "11" identify eleven in decimal or three in binary? We won't know - and can't know - unless we are given some sort of decoding scheme. In the programming language tradition "identifier" has a pretty secure meaning, and in that context the access/reference distinction is theoretically important but not of great practical significance, since everything you can refer to is physically accessible by the computer and has an address in memory. This is not true of logic, and definitely not true of model-theoretic semantics. Importantly, the access and reference distinction holds on the Web for many things that have URIs. In an information space, things may be identified without being accessed via a physical connection. In terms of the AWWW, an "information resource" is probably similar to the use of URI-as-access, while the use of a URI for reference without access corresponds to a "non-information resource."

Solving the Identity Crisis

Then there's the identity crisis: a single URI can actually play both roles (name with no access and location with access) at the same time, which gives us a powerful device for some applications. The official view, that representations are supposed to be interpreted by applications depending on MIME types, is clearly focused on the use of a URI as a location for access; yet nothing forbids a URI that returns a representationREST or some other data from being used to tell the web client that this URI is also a name for reference in addition to a location for access. In fact, for a URI used only as a name, MIME types are clearly irrelevant. At least for the time being!

It would be useful to distinguish when a URI is used as a "name" or as a "location," and whether some URIs can only be used as names or only as locations. In other words, this depends on whether the thing (which would be the "resource") identified by the URI is on the Web or not.
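One mechanical way such a distinction might be drawn is by URI scheme. The sketch below is an assumption-laden illustration rather than a proposal: the `Role` enum and `possible_roles` function are invented names, the `wpn` and `tdb` schemes are the ones mentioned in this note, and treating `urn` as name-only and `http`/`https` as playing both roles is a simplification made for the example.

```python
from enum import Enum
from typing import Set
from urllib.parse import urlsplit


class Role(Enum):
    NAME = "name for reference"
    LOCATION = "location for access"


# Assumption: schemes that could only ever name, never locate.
NAME_ONLY_SCHEMES = {"wpn", "tdb", "urn"}
# Assumption: schemes that can be dereferenced, and so may play both roles.
ACCESS_SCHEMES = {"http", "https"}


def possible_roles(uri: str) -> Set[Role]:
    """Return the roles a URI could play, judged by its scheme alone."""
    scheme = urlsplit(uri).scheme.lower()
    if scheme in NAME_ONLY_SCHEMES:
        return {Role.NAME}
    if scheme in ACCESS_SCHEMES:
        # The identity crisis: the URI alone cannot say which role is meant.
        return {Role.NAME, Role.LOCATION}
    return set()  # unknown scheme: this sketch makes no claim
```

For an `http` URI the function returns both roles, which is exactly the ambiguity at issue: without a dedicated scheme or some other signal, the URI itself does not tell us how it is being used.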
The distinction between name and location already reduces to the "non-information resource" and "information resource" distinction on some level, and so is not a return to the historical Dark Ages of the Web. Since they share a common syntax, it does make sense to unite URLs and URNs on a level as URIs, and even to use URLs as "names." The identity crisis can be solved pretty easily, as shown by the Web Proper Names proposal. First, a separate URI scheme (wpn:// or tdb://) can distinguish the use of URIs as names for reference from the use of URIs as locations for access. Second, to capitalise even further on the identity crisis, the two uses can be distinguished without a new URI scheme, through the representationREST itself: a representation format could state that this URI is a "name" as opposed to a "location." In fact, one could even have a special MIME type to distinguish names for things: imagine the "name" MIME type, or the "application/xhtml+xml+name" type.

The Future...

However, one subject which needs more exploration is the "interpretation" of URIs as names. How does one tell, if a URI is used as a name for reference, what its interpretation is? All the RDF statements that apply to that URI? And if so, how do we get them in a decentralized system? SPARQL? URIQA? Magic? In other words, assuming the URI gave you machine-readable descriptions in some Semantic Web language, should the use of a URI-as-a-name really mean that this URI refers to (or denotes) whatever is necessary to satisfy the Semantic Web description? The Semantic Web allows one to build a number of roles and assertions, and one would assume that the interpretation of a URI is those other Semantic Web URIs that satisfy these roles and assertions. However, the SemWeb as it stands just has URIs as Semantic Web objects referring as names to other URIs as Semantic Web objects, and does not fulfill what the Semantic Web really needs: a way to move out of the Web and into the wide world beyond the Web.
The Web needs to be integrated more into the world, and there lies the true holy grail of the Semantic Web. This is not just a problem for the Web, but the fundamental problem that proved to be the ultimate bane of AI. Indeed, it's easy to just attach a model theory to any formal system and say "We have semantics." Yes, that's strictly true - but let's not forget the adjective "model-theoretic." And models of the real world can be wrong, and often are. The real burden of the Semantic Web will lie on the ability of people and machines to produce models using SemWeb languages whose model-theoretic interpretations are relevant to the real world, and that match it in interesting and useful ways, allowing the Web to do things that are either impossible or very difficult on the current Web. Can people and machines do this in a large, decentralized manner? Are the SemWeb standards sufficient for the task? While the answers to those questions are unknown, the winds seem favorable.
Received on Tuesday, 5 April 2005 02:56:23 UTC