- From: Sandro Hawke <sandro@w3.org>
- Date: Mon, 23 Dec 2002 17:59:42 -0500
- To: www-rdf-interest@w3.org
Let me try again to explain what I now think is broken in RDF's use of URI-References (and how to fix it with very little pain). Forgive me for starting with the obvious stuff, but it seems necessary.

1. Addressable Locations

The web is a distributed system in which computer systems cooperate to present users with discrete chunks of individually addressable information, usually called "web pages". Each chunk is maintained (or all too often fails to be maintained!) at a virtual location; that location has an address which people use to access the information.

Pointing to locations is what makes the web work: people click on underlined text or icons to go there. They can read an address on the side of a bus or a carton of milk, type it in, and see whatever the location's owner has published there. Search engines and web catalogs can scan and index the sites, by their addresses, and then help people find the pages they want.

2. Fragment Addresses

In presenting information on the web, authors and web designers face a choice about chunk size. If they make the chunks too big, how can they point people to the right information? If an event is being advertised, the address should lead directly to information about the event, not to an overwhelming list of events. If the chunks are too small, on the other hand, each page will be unable to convey a coherent concept: users arriving at the page need to know some background, and search engines need enough content to perform proper indexing.

One web feature to help with this dilemma is the "fragment" address: at http://www.w3.org/TR/REC-xml one can see the XML specification, while at http://www.w3.org/TR/REC-xml#sec-pi one can see the part of the specification covering XML "processing instructions." The "#sec-pi" part tells your browser that after fetching the information, it should jump to the part labeled "sec-pi" in the internal markup.
If the information is presented in small enough chunks, fragment addresses are superfluous, but with big documents (like the XML spec!) they can be very useful.

3. Identifying Things

In RDF (and other knowledge representation languages) we want to formally convey information about all sorts of things: people, places, times, mathematical functions, numbers, emotions, qualities, prices, and (of course) books. We also want to talk about web sites and web pages.

How should we use the web's existing infrastructure to help us identify all the things we want to talk about?

3.1. Non-clickable links

The simplest answer is to use strings which look like web addresses but don't really lead to web pages. These could be UUIDs, tag: URIs, or even http URLs which are not properly served. All of these approaches let people generate a string with confidence that no one else will accidentally generate the same string; strings like this serve to unambiguously identify things, but the connection between the string and the thing is not expressed in the web.

This works (and was the first approach I liked), but it does not really use the web.

3.2 Reusing the Fragment Syntax

Another approach is to generalize the fragment syntax. The semantics of address#fragment are not fully specified in existing standards, mostly because the meaning of a fragment depends logically on the language in which the information-chunk is being conveyed. Pointing into a text document is different from pointing into an audio recording or a 3-D image. To leave the door open for new formats, RFC 2396 says the semantics of an address with a fragment part depend on the media-type of the content served at that address.

This open door allows us to define an RDF media type (application/rdf+xml) where "fragments" are not fragments, but rather arbitrary things.
When we say "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" we do not mean some part of the document at that address; we mean some abstract concept of a type-relation, because that document is an RDF one.

Do we need to know the media type of the document? Some people say not: the use of that string as an RDF node or arc label is not governed by RFC 2396; RDF stands on its own and can use URI-like strings in its own way. This may work, but as with UUIDs, it fails to use the web very well. Moreover, it dilutes the power of URIs: any string on the planet which starts with "http://" and does not work as a web address is a wasted opportunity for communication and another chance to confuse and disappoint people. We can do better.

Reusing the fragment syntax also causes a few technical problems. What happens if the content at the given address is NOT only application/rdf+xml? Maybe that's just a misconfigured system, but it could be a useful one. I think it would be nice for existing browsers to get human-readable HTML at the same address where an RDF-capable client gets its information. Like other forms of content negotiation, this allows all the forms of addressing (links, advertising, search engines, etc.) to index the information itself, regardless of its presentation format.

Even if people choose not to use content negotiation, there is still a strong and growing view that the fragment syntax is used for, well... fragments. RDF/XML documents are XML documents, and IMHO the XML community rightly expects XML's fragment syntax (XPointer) to work consistently. As with media types, fragment-syntax reuse may skirt the letter of the law here, since foo#bar only means the XML element with the ID "bar" when served with certain media types, and the rdf:ID attribute isn't really an XML ID, but it seems to me that RDF/XML is running (unnecessarily!) against the spirit of both XML and web addressing.
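
For concreteness, the address#fragment split that this section keeps referring to can be sketched in a few lines of Python. This is only an illustration: the function name split_uriref is made up for this sketch, and it says nothing about what the fragment *means*, which (per RFC 2396) depends on the media type served at the address.

```python
from urllib.parse import urldefrag


def split_uriref(uriref):
    """Split a URI-Reference into (address, fragment).

    The fragment is the empty string when there is no '#' part.
    split_uriref is a hypothetical helper name used only for this sketch.
    """
    address, fragment = urldefrag(uriref)
    return address, fragment


# A fragment pointing into a big document (the XML spec):
print(split_uriref("http://www.w3.org/TR/REC-xml#sec-pi"))

# The same syntax reused by RDF for an abstract type-relation:
print(split_uriref("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"))
```

Both strings split the same way; nothing in the syntax tells a client whether "type" is a part of a document or an arbitrary thing, and that is the tension described above.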

3.3 Using Descriptive Web-Content

A third approach is to say that when a web page is about one thing, we can use the page's address as a kind of identifier for that thing. If you visit http://www.w3.org/Consortium/ you'll see it is clearly a page about the W3C. We can use that to identify the W3C itself, calling the W3C "the subject of http://www.w3.org/Consortium/".

This is not the same as saying "http://www.w3.org/Consortium/ is a Consortium." That's like pointing at a photographic image of the Eiffel Tower and telling someone "that's the Eiffel Tower!": it works perfectly well with humans, but it introduces more ambiguity than we want in machine processing. (Some humans, of course, might take the opportunity to be pedantic and point out "No, it's a PICTURE of the Eiffel Tower." Some of us try hard not to be like that.)

This approach to identification makes excellent use of the billions of existing web pages and the pointers to them throughout our world. Here we say that IF there's a page which has a single, conspicuous subject, we can use that page to help identify the thing. If we want an identifier for something, we can find a page, make a page, or even just allocate an address for the page. If someone sees such an identifier, their browser stands a good chance of explaining to them what object is being identified AND telling them some useful information about it. (Using content negotiation or a page of mixed HTML and RDF, I would hope the web server would communicate its information to an RDF-aware application in RDF/XML via that same address.)

This approach uses existing web pages, existing search engines, existing retrieval mechanisms, and existing social practice to strongly connect identifiers, the things they identify, and information about the identified things.

4. Node Labels (Subject, Container, Overloaded, and Distinguished)

The challenge to using descriptive web-content to identify things is that we risk confusing the page with its subject.
If we just label an RDF node "http://www.w3.org/Consortium/", who knows if we are talking about a web location or an industry consortium?

I suggest that ideally we would have two kinds of labels, which I'll call "Subject" and "Container" labels. A node with the Subject label of "http://www.w3.org/Consortium/" represents a consortium; we would expect to see arcs from it saying, perhaps, that its director is Tim Berners-Lee. A node with a Container label with the same text represents the web location itself, a container for some information; from it we might find arcs saying its last-modified date was "Wed, 13 Nov 2002 21:57:38 GMT".

As I read the working drafts and look at current usage, RDF currently has neither subject nor container labels. I see two interpretations for what it has now, which I call "overloaded" and "distinguished" labels.

A node with the "overloaded" label "http://www.w3.org/Consortium/" represents both a web location AND an industry consortium. This use is both absurd and generally workable. It works because the RDF arcs to and from the node are likely to treat it as one or the other; it being both will never be noticed by most users. My strongest argument against this practice is that it flies in the face of accepted system design methods: one should classify objects in the problem domain according to their qualities as perceived by the people who work with them. The idea that something could be both a web site and a consortium hardly seems natural to my small, biased, expert sample (myself).

A node with the "distinguished" label "http://www.w3.org/Consortium/" represents a web location. Distinguished labels are considered to be subject labels when they contain a "#" character and container labels when they do not. I think the vast majority of deployed RDF uses distinguished labels.
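
As a rough sketch of the distinguished-label rule just described, here it is in Python. The function name classify_distinguished_label is invented for this illustration and is not taken from any RDF draft or library; the rule itself is simply "contains a '#' or not".

```python
def classify_distinguished_label(label):
    """Classify a node label under the "distinguished" reading.

    Labels containing '#' are read as subject labels (they name some
    arbitrary thing); labels without '#' are read as container labels
    (they name the web location itself). The function name is a
    hypothetical one chosen for this sketch.
    """
    return "subject" if "#" in label else "container"


for label in (
    "http://www.w3.org/Consortium/",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "http://www.w3.org/TR/REC-xml#sec-pi",
):
    print(label, "->", classify_distinguished_label(label))
```

Note what the rule cannot express: a label with a "#" can never name the page fragment itself, and a label without one can never name the thing a whole page is about.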
The likely trouble spots are when the RDF graph is conveying information about web page fragments (e.g. EARL, and Annotea in some versions) and when RDF authors have chosen not to use fragment syntax for abstract concepts (e.g. Dublin Core). The DC situation is somewhat eased by considering RDF to use only subject labels for arcs (but distinguished labels for nodes).

5. Delabeling (x:uriRef and x:primarySubject)

So how can we talk, in RDF, about web-page fragments and about things which are the subject of an entire web page? Distinguished labels don't allow either of these. Talking about fragments is important in at least Annotea and EARL; talking about the subjects of entire web pages allows vastly more of the existing web to be used for identification purposes.

The most practical approach, I think, is to extend the concept of node delabeling. Conventional node delabeling turns

   <foo> <bar> <baz>.

into something like:

   _:a <bar> _:b.
   _:a x:identifier "foo".
   _:b x:identifier "baz".

where x:identifier is a property of something linking it to a string which is an unambiguous identifier for it.

We need to extend this to handle our different kinds of labels. I suggest x:uriRef for container labels and x:primarySubject (which points the other way) for subject labels. Thus:

   _:SomePageAboutW3C x:uriRef "http://www.w3.org/Consortium/".

and

   _:SomePageAboutW3C x:primarySubject _:W3C.

If we want to reverse the direction of x:primarySubject, we could perhaps call it x:descriptionURIRef, but I'm not as fond of that name.

x:uriRef is an owl:InverseFunctionalProperty while x:primarySubject is an owl:FunctionalProperty.

6. Conclusion

It might be nice to have subject and container labeling throughout RDF, but it's late in the game for that. Instead, RDF Core should:

   1. Define RDF labeling as being distinguished labeling for nodes and subject labeling for arcs.

   2. Define/recommend x:uriRef for talking about fragments of web pages (and whole web pages, when desired).

   3. Define/recommend x:primarySubject for talking about things which are the subject of an entire web page (and fragments, when desired).

People should then use them. :-)

[I'm sending this to rdf-interest instead of rdf-comments because I'd rather get any bugs in this proposal worked out by interested parties before bothering the WG.]

   -- sandro
Received on Monday, 23 December 2002 18:02:27 UTC