Message-Id: <9212010635.AA23999@pixel.convex.com>
To: Edward Vielmetti <emv@msen.com>
Cc: "Tony Johnson (415) 926 2278" <TONYJ@scs.slac.stanford.edu>,
Subject: Global HyperLinks was: quotes around tags and escape sequences
In-Reply-To: Your message of "Mon, 30 Nov 92 23:37:23 EST." <m0mwPMf-0000A7C@garnet.msen.com>
Date: Tue, 01 Dec 92 00:35:22 CST
From: Dan Connolly <connolly@pixel.convex.com>

OK, now you're asking for it. I've been mulling this stuff over in my
head for a couple weeks, and I've got some pretty good ideas as to how
it all fits together.

My model of global hypermedia includes the following terms:

Entity -- SGML and MIME use this term. WAIS calls it a document.
        Gopher calls it an item or a textfile or something. WWW used
        to call it a document, and now calls it a resource. The
        meaning is the same in all of them: a unit of retrieval
        [from the URL document].

Content-Type -- MIME coined this term. SGML calls it a NOTATION.
        WAIS used to call it :type, but they'll call it :content-type
        if they follow up on what they told me. Most gopher types
        fall under this scheme (telnet, cso, and other types that
        don't use gopher protocol don't fit).

Reference -- This is the WWW anchor, the Gopher Menu item, the
        WAIS :document-id structure, the MIME message/external-body.
        It is enough information to
                1) decide whether to retrieve the entity,
                2) perform the retrieval transaction, and
                3) process the entity once you've got it.

>Really, though, the gopher reference is (in gopherspeak)
>
>Name=An arbitrary, but meaningful name
>Host=gopher.micro.umn.edu
>Port=70
>Type=0
>Path=Some Stuff

NOTE: Some Stuff is terminated by a newline, and may not contain tabs.

>And the "href=" is just a way to squash it down to a single string.
>It could just as well be a set of attributes and not a single one.
>E.g.
>
><a gopherhost="gopher.micro.umn.edu"
>   gopherport="70"
>   gopherpath="/Some Stuff"
>   gophertype="0">
>An arbitrary, but meaningful, name</a>

NOTE: for type 7 items, you need gophersearch="terms" too.

>expresses the meaning of what's going on in a way that's far closer to
>how SGML might do it as far as I have been able to make out...Dan is
>that actually legal SGML?

Sure, that's legal. I suggested that URLs be expressed in SGML a long
time ago. Tim said it was overkill, and I'm starting to agree.

Let's take a closer look at references:

1) What features allow users and clients to decide to retrieve an entity:

WWW
        context and content of the anchor (Is it relevant?)
MIME
        content-id (do I have this entity cached already?)
        content-description (relevant?)
        content-type (can I process it once I've got it?)
        SIZE (is it too big to bother?)
WAIS
        :score (relevant to my query?)
        :headline (relevant?)
        :doc-id (in cache?)
                original/distributor-server,database,local-id
                particularly useful
        :number-of-lines, :number-of-bytes (too big?)
        :type, :content-type (can I process it?)
        :date (how old is it?)
Gopher
        name (is it interesting?)
        type (can I process it?)

2) What features allow the client to make the transfer?

WWW
        URL -- protocol, host, port, path, type, size, search terms
        handles local files, HTTP, gopher, WAIS connections.
        includes search terms for fulltext indexes.
        scheme mechanism allows gateways to new protocols
MIME
        access-type, etc.: handles ftp, anon-ftp, local-file
        Ghost body allows arbitrary extra data.
Gopher
        host, port, path, search words
WAIS
        source (host, port, database), doc-id, search terms,
        relevant documents (these are the novel feature. Quite handy)

3) What features allow the client to process the entity?
(Keep in mind that these are features of the reference -- this is
information we have _before_ we transfer the entity).

WWW
        processing is tied to the protocol.
        Content-Type of local files is inferred from file extensions.
        Entities from HTTP connections are assumed to be text/x-html.
        Gopher entities are typed: 0=text/plain, 1=application/x-gopher,
                w=text/x-html.
        WAIS entities are typed: TEXT=text/plain, WSRC=application/x-wais.
MIME
        content-type mechanism is quite expressive.
        Any content-type can be encapsulated in a message/rfc822 entity.
        Multiple entities can be encapsulated in a multipart/mixed entity.
Gopher
        gopher type tells you what to do with the data.
        text/plain, application/x-gopher are universally supported.
        other types are supported by pilot projects.
WAIS
        :type tells what to do.
        text/plain and application/x-wsrc are supported.
        Other types are supported by pilot projects.

Now let's see how we should change the WWW reference mechanism. Here's
what we've got currently:

<!ELEMENT A - - (#PCDATA)>
<!ATTLIST A
        NAME ID #IMPLIED
        HREF CDATA #IMPLIED
        TYPE CDATA #IMPLIED
        >

What's the TYPE used for? It's not a data type. There's some code in
LineMode to handle it, but I'm not sure what it does.

The NAME identifies the anchor as the target of some other anchor. We
should have NAME (or ID) attributes on pretty much all elements, for
example:

<DL>
<DT ID=term>term<DD>definition
</DL>

The HREF attribute is enough information to retrieve an entity. Good.
But it's got this #anchor stuck on the end. That should be a separate
attribute. It should be an IDREF, so that we can validate that it
references an existing ID with an SGML parser.

"But," you say, "what if it references an ID outside the current
document?"

I suggest we treat a group of nodes that reference each other not as
separate documents, but as entities of one big document. That way, an
author can validate the internal links in his/her web.

I suggest two new elements: XREF, for intra-document links (i.e. links
within the local web), and SEE for inter-document links (i.e. links
that go outside the local web).

<!ELEMENT XREF - - (#PCDATA)
        -- This element is for links within an HTML document.
           (a document is a collection of entities, or a web of nodes). -->
<!ATTLIST XREF
        CONTEXT CDATA #IMPLIED -- entity containing the XREF is implied --
        -- SGML purists would make this attribute an ENTITY reference,
           and put the URL in the SYSTEM identifier in the prologue.
           For expediency, we put the URL right in the attribute. --
        ORIGIN CDATA #IMPLIED
        -- another URL, used as an identifier, rather than a locator.
           Ala the WAIS original-server,database,local-id triple. --
        REF IDREF #REQUIRED -- ID of referent element --
        >

<!ELEMENT SEE - - (#PCDATA)
        -- This element is for links from an HTML document to any
           entity in the global web. The location and content-type
           of the entity are sufficient to resolve the reference.
           The other attributes could be specified in the text of the
           SEE content, but by making them attributes, the client
           software can process them, for example, to display a table
           of references sorted by date. -->
<!ATTLIST SEE
        LOCATION CDATA #REQUIRED -- URL of referent entity --
        CONTENT-TYPE CDATA #REQUIRED -- MIME Content-Type for the entity --
        CHUNK CDATA #IMPLIED
        -- This is the analogue of the #anchor mechanism. If CONTEXT
           is an SGML entity, this would be an ID, though it won't be
           validated. However, if CONTEXT is a text file, this could
           be a line number. The meaning is defined by the
           content-type. --
        ORIGIN CDATA #IMPLIED
        FROM CDATA #IMPLIED -- email address or name of author/provider --
        DATE NUMBER #IMPLIED -- in ISO format: YYYYMMDDHHMMSSZ --
        BYTES NUMBER #IMPLIED -- useful in many cases --
        MD5 CDATA #IMPLIED -- data signature --
        >

What do you think?

Dan
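
A minimal sketch of how a client might split a current-style HREF (URL
plus "#anchor") into the LOCATION and CHUNK attributes proposed for SEE
above. Python is used only for illustration; the function name
href_to_see and the example.org URL are hypothetical, and the content
type has to be supplied by the caller because the HREF does not carry
one.

```python
from urllib.parse import urldefrag
from xml.sax.saxutils import quoteattr


def href_to_see(href, content_type, text):
    """Render an old-style HREF as a SEE element (hypothetical helper)."""
    # Separate the "#anchor" suffix from the URL proper.
    location, chunk = urldefrag(href)
    attrs = [("LOCATION", location), ("CONTENT-TYPE", content_type)]
    # CHUNK is #IMPLIED in the proposed ATTLIST, so omit it when absent.
    if chunk:
        attrs.append(("CHUNK", chunk))
    rendered = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs)
    # The link text is passed through as-is (no escaping) to keep the sketch short.
    return f"<SEE {rendered}>{text}</SEE>"


print(href_to_see("http://www.example.org/guide.html#intro",
                  "text/x-html", "the guide"))
```

The anchor ends up in its own attribute, so the URL in LOCATION can be
used for retrieval without any further parsing.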