Message-Id: <9207241532.AA26087@imagine.convex.com>
To: cni-arch@uccvma.bitnet
Cc: wais-talk@think.com, www-talk@nxoc01.cern.ch
Subject: SGML for URLs
Date: Fri, 24 Jul 92 10:32:14 CDT
From: Dan Connolly <connolly@imagine.convex.com>

--cut-here
Content-Type: multipart/alternative; boundary=alt

--alt

OBJECTIVE

The issue of what to call these things we're defining has been discussed at length. First it was Universal Document Identifier. The name has changed as the objective has been refined. The latest name is Universal Resource Locator. The provisional charter is:

To define a printable string syntax to allow

 1. The expression of the address on the network of any accessible object using existing information retrieval protocols;
 2. The expression of the name of any object held in a directory system or unique naming space on the network;
 3. The distinction to be made easily in the syntax between such protocols and directories and name spaces;
 4. New protocols, directories and naming schemes to be included as and when they are developed. [1]

Clearly what we are about is defining a language, i.e. a syntax and semantics for communicating some information.

The information is the location and/or identity of some information object in the global hypertext. It's a citation or a reference or a hypertext link anchor.

I propose a specification for the language of URLs, in the context of a specification for a language of global hypertext references.

These global hypertext references include more semantics than just differentiating between protocols and accessing data. There are also issues of determining the type and the identity of the referent data.

SGML as a syntactic specification tool

That's what it's for, after all. What I propose is a DTD that (with the default SGML declaration) defines the language of global hypertext references.

Some examples of the language:

<http host="info.cern.ch" path="hypertext/TheProject.html">
<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">
<http host="info.cern.ch" path="XFIND" search="SGML">
<prospero host="archie.mgil.ca" path="pub/ftp">
<file host="snoopy" path="~connolly/bin/cgrep.pl" type=appl subtype=x-perl>
<ftp host="export.lcs.mit.edu" dir="contrib" name="XcRichText-1.2.tar.Z">
<usenet group="comp.infosystems.gopher">
<usenet article="<abc@convex.com>">
<wais host="quake.think.com" database="INFO" search="help">
<wais host="quake.think.com" database="INFO" wtype="TEXT" size=1000
    path="/usr/local/wais/README" >
<telnet host="info.cern.ch">
<gopher host="boombox.umn.edu" port=70>
<gopher host="boombox.umn.edu" selector="foo &#34;bar&#34;" gtype=0 >

The DTD uses only the most basic features of SGML, and thus the resulting language is not very complex. Implementing a parser for this particular SGML language is a vastly simpler task than implementing a general SGML parser. At the same time, we get the benefits of a rigorously defined language based on established standards.

Note:
    I haven't studied the HyTime standard very carefully. I think it's beyond the scope of the task at hand, but I'd like to have that opinion substantiated by someone who really knows. In particular, its Finite Coordinate Systems could be used to model positions within documents: characters, lines, paragraphs.

RELEVANT ISSUES

Verbosity
    This syntax is somewhat verbose, but I think that implicit markup (punctuation rather than names) will lead to a mass of quoting in many cases. And the consistency between schemes is not necessarily very high.

Long URLs
    Extra whitespace between tokens has no effect. There is still the problem of quoted strings that are longer than a mailer allows. Certainly there's some SGML feature that I'm not aware of that addresses the issue.

    I don't believe there's a way to restrict the length of an element, though the default SGML declaration does limit a single attribute value literal to 240 characters (LITLEN) and all the attribute specifications in one tag to 960 characters (ATTSPLEN).

Quoting
    The SGML numeric character reference (e.g. &#128;) allows an attribute value literal to represent any sequence of bytes.

NAMELEN
    The default SGML declaration specifies that names of elements and attributes be 8 characters or fewer. It's a conceptually simple matter to operate under an SGML declaration where NAMELEN is higher.

Extensibility
    One problem with the current UDI syntax specification is that it seems to allow new schemes to add arbitrary complexity to the grammar. This specification limits the language to an SGML start tag.

    If we adopt this spec, we need to give it a public text identifier, and maintain a registry of the names used (probably with the IANA).

DEPLOYMENT AND USAGE

The first place to try this specification out is in the WWW browser. (I'll try to make the code changes if I find time.) It's a simple matter of elevating UDIs from SGML attributes to URLs as SGML elements.

I'd like someone who really knows SGML to have a look at this DTD and see if it can be improved. And I'd like to study the HyTime standard, the Davenport DASH, the CFCM standard, etc. to see how this element meshes with their citation strategies.

Also, it would be nice to have explicit support from WAIS and Gopher clients -- drag and drop comes to mind.

SGML and semantics

SGML is famous for being divorced from application semantics. Most of the semantics of URLs is in the constituent protocols. All we need to do is define a way to parse a URL and pass the various bits to the protocol.

But as long as we're going to all the trouble to gather information accessible with all these protocols into one specification, it makes sense to define some semantics common to most applications that will use URLs.
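
To make the "parse a URL and pass the various bits to the protocol" step concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions: the names parse_url_tag and dispatch and the handler table are made up for this example, names are folded to lower case, and only the forms used in the examples above are handled (quoted literals, bare tokens, and numeric character references). It is not an SGML parser and does no validation against the DTD.

```python
import re

# Assumption: names are folded to lower case, and values are either quoted
# literals or bare tokens, which covers the examples in this message.
NAME = r'[A-Za-z][A-Za-z0-9.-]*'
ATTR = re.compile(
    r'\s+(' + NAME + r')\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([A-Za-z0-9.-]+))')
CHAR_REF = re.compile(r'&#(\d+);')


def parse_url_tag(tag):
    """Split one start tag, e.g. <telnet host="h">, into (scheme, attrs)."""
    start = re.match(r'\s*<(' + NAME + r')', tag)
    if start is None:
        raise ValueError('not a start tag: %r' % tag)
    scheme, pos, attrs = start.group(1).lower(), start.end(), {}
    while True:
        found = ATTR.match(tag, pos)
        if found is None:
            break
        raw = next(g for g in found.group(2, 3, 4) if g is not None)
        # Expand numeric character references such as &#34; inside the value.
        attrs[found.group(1).lower()] = CHAR_REF.sub(
            lambda ref: chr(int(ref.group(1))), raw)
        pos = found.end()
    if re.fullmatch(r'\s*>\s*', tag[pos:]) is None:
        raise ValueError('malformed tag near offset %d' % pos)
    return scheme, attrs


def dispatch(tag, handlers):
    """Pass the parsed attributes to a per-scheme handler (hypothetical table)."""
    scheme, attrs = parse_url_tag(tag)
    return handlers[scheme](**attrs)
```

A browser might keep one handler per scheme and call something like dispatch('<telnet host="info.cern.ch">', handlers); everything protocol-specific then stays inside the handlers.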

DATA TYPES

Some of the schemes have explicit type information (wais, gopher), some have implicit typing (html, USENET), and some have no typing at all (file, ftp). The MIME content-type system is general and useful enough to warrant support. An application should be able to determine the content-type of the data regardless of the protocol.

RESOURCE IDENTITY

Many applications have use for determining whether two URLs refer to the same information. Various schemes (such as USENET article id's) may have semantics for identifying resources. But I think this capability is so widely useful that it should be coherently supported for all protocols.

connolly@convex.com

--alt
Content-Type: text/x-html

<!DOCTYPE html SYSTEM>
<title>Using SGML to define Universal Resource Locators</title>

<H1>Objective</H1>

The issue of what to call these things we're defining has been discussed at length. First it was Universal Document Identifier. The name has changed as the objective has been refined. The latest name is Universal Resource Locator. The provisional charter is:

<a HREF="x-message-id:<9206262004.AA29919@zippy.lcs.mit.edu>">
<h4>To define a printable string syntax to allow</h4>
<ol>
<li>The expression of the address on the network of any accessible object using existing information retrieval protocols;
<li>The expression of the name of any object held in a directory system or unique naming space on the network;
<li>The distinction to be made easily in the syntax between such protocols and directories and name spaces;
<li>New protocols, directories and naming schemes to be included as and when they are developed.
</ol>
</a>

<p>
Clearly what we are about is defining a language, i.e. a syntax and semantics for communicating some information.

<p>
The information is the location and/or identity of some information object in the global hypertext. It's a citation or a reference or a hypertext link anchor.

<p>
I propose a specification for the language of URLs, in the context of a specification for a language of global hypertext references.

<p>
These global hypertext references include more semantics than just differentiating between protocols and accessing data. There are also issues of determining the type and the identity of the referent data.

<H2>SGML as a syntactic specification tool</H2>

That's what it's for, after all. What I propose is a DTD that (with the default SGML declaration) defines the language of global hypertext references.

<p>
<h4>Some examples of the language:</h4>
<XMP>
<http host="info.cern.ch" path="hypertext/TheProject.html">
<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">
<http host="info.cern.ch" path="XFIND" search="SGML">
<prospero host="archie.mgil.ca" path="pub/ftp">
<file host="snoopy" path="~connolly/bin/cgrep.pl" type=appl subtype=x-perl>
<ftp host="export.lcs.mit.edu" dir="contrib" name="XcRichText-1.2.tar.Z">
<usenet group="comp.infosystems.gopher">
<usenet article="<abc@convex.com>">
<wais host="quake.think.com" database="INFO" search="help">
<wais host="quake.think.com" database="INFO" wtype="TEXT" size=1000
    path="/usr/local/wais/README" >
<telnet host="info.cern.ch">
<gopher host="boombox.umn.edu" port=70>
<gopher host="boombox.umn.edu" selector="foo &#34;bar&#34;" gtype=0 >
</XMP>

The DTD uses only the most basic features of SGML, and thus the resulting language is not very complex. Implementing a parser for this particular SGML language is a vastly simpler task than implementing a general SGML parser. At the same time, we get the benefits of a rigorously defined language based on established standards.

<dl><dt>Note:
<dd>I haven't studied the HyTime standard very carefully. I think it's beyond the scope of the task at hand, but I'd like to have that opinion substantiated by someone who really knows.
In particular, its Finite Coordinate Systems could be used to model positions within documents: characters, lines, paragraphs.
</dl><p>

<h3>Relevant Issues</h3>

<dl>
<dt>Verbosity
<dd>This syntax is somewhat verbose, but I think that implicit markup (punctuation rather than names) will lead to a mass of quoting in many cases. And the consistency between schemes is not necessarily very high.

<dt>Long URLs
<dd>Extra whitespace between tokens has no effect. There is still the problem of quoted strings that are longer than a mailer allows. Certainly there's some SGML feature that I'm not aware of that addresses the issue.
<p>
I don't believe there's a way to restrict the length of an element, though the default SGML declaration does limit a single attribute value literal to 240 characters (LITLEN) and all the attribute specifications in one tag to 960 characters (ATTSPLEN).

<dt>Quoting
<dd>The SGML numeric character reference (e.g. &amp;#128;) allows an attribute value literal to represent any sequence of bytes.

<dt>NAMELEN
<dd>The default SGML declaration specifies that names of elements and attributes be 8 characters or fewer. It's a conceptually simple matter to operate under an SGML declaration where NAMELEN is higher.

<dt>Extensibility
<dd>One problem with the current UDI syntax specification is that it seems to allow new schemes to add arbitrary complexity to the grammar. This specification limits the language to an SGML start tag.
<p>
If we adopt this spec, we need to give it a public text identifier, and maintain a registry of the names used (probably with the IANA).
</dl>

<h3>Deployment and Usage</h3>

The first place to try this specification out is in the WWW browser. (I'll try to make the code changes if I find time.) It's a simple matter of elevating UDIs from SGML attributes to URLs as SGML elements.

I'd like someone who really knows SGML to have a look at this DTD and see if it can be improved. And I'd like to study the HyTime standard, the Davenport DASH, the CFCM standard, etc.
to see how this element meshes with their citation strategies.

Also, it would be nice to have explicit support from WAIS and Gopher clients -- drag and drop comes to mind.

<h2>SGML and semantics</h2>

SGML is famous for being divorced from application semantics. Most of the semantics of URLs is in the constituent protocols. All we need to do is define a way to parse a URL and pass the various bits to the protocol.

But as long as we're going to all the trouble to gather information accessible with all these protocols into one specification, it makes sense to define some semantics common to most applications that will use URLs.

<h3>Data Types</h3>

Some of the schemes have explicit type information (wais, gopher), some have implicit typing (html, USENET), and some have no typing at all (file, ftp). The MIME content-type system is general and useful enough to warrant support. An application should be able to determine the content-type of the data regardless of the protocol.

<h3>Resource Identity</h3>

Many applications have use for determining whether two URLs refer to the same information. Various schemes (such as USENET article id's) may have semantics for identifying resources. But I think this capability is so widely useful that it should be coherently supported for all protocols.

<address>connolly@convex.com</>
</HTML>

--alt--

--cut-here

<!-- Universal Resource Locator specification
     derived from http://info.cern.ch/hypertext/WWW/Addressing/BNF.html
     on 24 July 1992 by connolly@convex.com -->

<!-- Typical usage:
         <!DOCTYPE url SYSTEM>
         (we need a public identifier)
     or as part of another SGML document type:
         <!ELEMENT url SYSTEM>
         &url;
-->

<!-- minimization? I believe you can omit the name= part of an SGML
     attribute specification in some circumstances. I don't think it
     works with CDATA attributes because order is not significant. -->

<!-- news: scheme renames USENET -->

<!-- file: is somewhat vague.
     I suggest explicit support for FTP: -->

<!ENTITY % schemes "http|file|ftp|usenet|telnet|prospero|gopher|wais">

<!ELEMENT url - - (%schemes;)* >

<!-- content model of URL:
     more than one element in a URL? (obviously an application can use
     multiple URLs. The question is whether to define semantics for
     multiple elements in a single URL.)

     Also, what about type, size, search information? Perhaps one
     element should describe the connection information, another
     element or elements describes the path to the data (allowing us to
     define semantics of hierarchical databases) and another element
     defines the type of information there.
-->

<!ELEMENT (%schemes;) - O EMPTY >

<!-- TCP connection info: internet domain address and port number -->
<!ENTITY % host "host CDATA #REQUIRED" >
<!ENTITY % hostp "%host; port NUMBER #IMPLIED" >

<!ENTITY % types "text|image|audio|video|message|multi|appl">
<!ENTITY % stypes "plain|richtext|
    gif|g3fax|
    basic|
    mpeg|
    rfc822|external|partial|
    mixed|altern|parallel|
    octets|ps|oda">
<!-- content-type parameters? -->

<!ENTITY % cte "7bit|8bit|qp|base64|binary"
    -- we could define several of the gopher types in terms of
       encodings and types e.g. x-binhex, application/x-stuffit -->

<!ENTITY % MD5 "datasig CDATA #IMPLIED" -- MD5 data signature -->
<!ENTITY % bytes "bytes NUMBER #IMPLIED">
<!ENTITY % lines "lines NUMBER #IMPLIED">

<!ATTLIST http
    -- information accessing attributes --
    %hostp;
    path CDATA #REQUIRED -- server local name --
        -- must match xalpha [/ path ] --
        -- can a CDATA attribute contain an arbitrary bytestream?
        --
    search CDATA #IMPLIED -- search terms --
    anchor CDATA #IMPLIED -- HTML anchor name --
    -- information content attributes --
    type (%types) text
    subtype (%stypes) #IMPLIED
    encoding (%cte) 7bit
    %MD5;
    %bytes;
>

<!ATTLIST prospero
    %hostp;
    path CDATA #REQUIRED
        -- prospero path should not be constrained to WWW path syntax --
    -- information content attributes --
    type (%types) appl
    subtype (%stypes) octets
    encoding (%cte) binary
    %MD5;
    %bytes;
>

<!ATTLIST file
    %host;
    path CDATA #REQUIRED
        -- unix path should not be constrained to WWW path syntax --
    -- information content attributes --
    type (%types) appl
    subtype (%stypes) octets
    encoding (%cte) binary
    %MD5;
    %bytes;
>

<!ATTLIST ftp
    %hostp;
    dir CDATA #REQUIRED -- directory for cd command --
    name CDATA #REQUIRED -- name for get command --
    user CDATA "anonymous" -- anonymous ftp by default --
    password CDATA #IMPLIED -- not always needed --
    -- information content attributes --
    type (%types) appl
    subtype (%stypes) octets
    encoding (%cte) binary -- use 7bit for ascii transfers --
    %MD5;
    %bytes;
>

<!ATTLIST usenet
    group CDATA #IMPLIED -- usenet newsgroup name --
    article CDATA #IMPLIED -- article message-id --
    -- information content attributes --
    type (%types) message
    subtype (%stypes) rfc822
    encoding (%cte) 7bit
    %MD5;
    %lines;
        -- you can add headers without changing a USENET article,
           so bytes isn't a good measure --
>

<!-- should we split this into two nodes so that we can put
     #REQUIRED on the size and type for documents? -->
<!ATTLIST wais
    %hostp;
    database CDATA #IMPLIED -- WAIS database name --
    search CDATA #IMPLIED -- search terms --
        -- what about relevant documents? --
    wtype CDATA #IMPLIED -- WAIS data type --
        -- this should be obsoleted by the MIME type system --
    bytes NUMBER #IMPLIED
    path CDATA #IMPLIED -- split into original x, y?
        --
    -- information content attributes --
    type (%types) text
    subtype (%stypes) plain
    encoding (%cte) binary
    %MD5;
>

<!ATTLIST telnet
    %hostp;
    user CDATA #IMPLIED -- username --
>

<!ATTLIST gopher
    %hostp;
    gtype CDATA "1" -- gopher type --
        -- again, MIME types should be used --
        -- www browser can be inundated by non-text data
           unless it recognizes other types --
    selector CDATA "" -- gopher object selector --
    search CDATA #IMPLIED -- fulltext search terms --
    -- information content attributes --
    type (%types) #IMPLIED
    subtype (%stypes) #IMPLIED
    encoding (%cte) binary
    %MD5;
    %bytes;
>

--cut-here
Content-type: text/sgml
Content-Description: Example URLs

<!DOCTYPE url SYSTEM>
<url>
<http host="info.cern.ch" path="hypertext/TheProject.html">
<http host="info.cern.ch" path="hypertext/people.html" anchor="timbl">
<http host="info.cern.ch" path="XFIND" search="SGML">
<prospero host="archie.mgil.ca" path="pub/ftp">
<file host="snoopy" path="~connolly/bin/cgrep.pl" type=appl subtype=x-perl>
<ftp host="export.lcs.mit.edu" dir="contrib" name="XcRichText-1.2.tar.Z">
<usenet group="comp.infosystems.gopher">
<usenet article="<abc@convex.com>">
<wais host="quake.think.com" database="INFO" search="help">
<wais host="quake.think.com" database="INFO" wtype="TEXT" size=1000
    path="/usr/local/wais/README" >
<telnet host="info.cern.ch">
<gopher host="boombox.umn.edu" port=70>
<gopher host="boombox.umn.edu" selector="foo &#34;bar&#34;" gtype=0 >
</url>

--cut-here--
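
The ATTLIST declarations above carry defaults (for instance user "anonymous" on ftp, gtype "1" on gopher) and #REQUIRED markers that a lightweight, non-SGML parser would have to apply itself. A minimal sketch of that step in Python follows. The tables are transcribed by hand from the DTD for three schemes only, and the names DEFAULTS, REQUIRED and apply_defaults are invented for this illustration; a fuller implementation would presumably read them from the DTD.

```python
# Defaults and required attributes copied by hand from the ATTLIST
# declarations for ftp, gopher and usenet (an assumption of this sketch;
# the other schemes are omitted for brevity).
DEFAULTS = {
    'ftp': {'user': 'anonymous', 'type': 'appl',
            'subtype': 'octets', 'encoding': 'binary'},
    'gopher': {'gtype': '1', 'selector': '', 'encoding': 'binary'},
    'usenet': {'type': 'message', 'subtype': 'rfc822', 'encoding': '7bit'},
}
REQUIRED = {
    'ftp': ('host', 'dir', 'name'),
    'gopher': ('host',),
    'usenet': (),
}


def apply_defaults(scheme, attrs):
    """Return attrs completed with the declared defaults for scheme.

    Raises ValueError when a #REQUIRED attribute is missing.
    """
    missing = [name for name in REQUIRED[scheme] if name not in attrs]
    if missing:
        raise ValueError('%s is missing required attribute(s): %s'
                         % (scheme, ', '.join(missing)))
    merged = dict(DEFAULTS[scheme])
    merged.update(attrs)  # explicitly specified values override defaults
    return merged
```

For the ftp example above, apply_defaults('ftp', {...}) with host, dir and name supplied fills in user, type, subtype and encoding from the declared defaults and leaves the supplied values alone.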