From: Gerald Oskoboiny <gerald@cs.ualberta.ca>
Date: Tue, 26 Dec 1995 04:26:59 -0700 (MST)
To: www-talk@w3.org

Daniel W. Connolly writes:
> Hypermail is great. Mhonarc is even better. But I've got a lot of
> ideas for improvements:

I wrote something called "HURL: the Hypertext Usenet Reader & Linker."
(A better name would be "a hypertext interface to news archives").
More info is at: <URL:http://ugweb.cs.ualberta.ca/~gerald/hurl/>.

Before I get anyone's hopes up, I should point out:

 - you can't actually play with it now, because all existing builds
   are on sunsite.unc.edu, which recently suffered a major disk crash;

 - I *still* don't have a distribution package ready, although I've
   been promising one for ages. Hopefully within a couple of weeks.

If you want to see what the interface looks like, there are screen
shots available at the URL above. Hopefully Sunsite will be back to
normal RSN.

> Requirements:
>
> 0. Support MIME ala mhonarc.

HURL was originally designed for Usenet archives, and since MIME isn't
widely used on Usenet (yet), this hasn't been a high priority. Right
now it treats everything as text/plain and puts it in <PRE> blocks.
I don't know what mhonarc does; I could probably make HURL handle
text/html easily enough, but there's other stuff I'd like to work on
first, I think.

> 1. Base the published URLs on the global message-ids, not on local
>    sequence numbers. So in stead of:
>
>        http://www.foo.com/archive/mlist/00345.html
>
>    I want to see:
>
>        http://www.foo.com/archive/mlist?message-id=234234223@bar.net
>
>    This allows folks to write down the URL of the message as soon
>    as they post it -- they don't have to wait for it to show
>    up in the archive.

Yup. HURL's URLs are something like:

    http://www.foo.com/www-talk/msgid?foo@bar.net

Broken links are a big peeve of mine, so I've tried to make sure that
any URLs created by HURL will work forever.

FYI, Kevin Hughes at EIT has written a script that redirects
message-ID-based queries to the appropriate URL within his Hypermail
archives of the www-* lists; for more information see
<URL:http://www.eit.com/www.lists/refer.html>.

> Hhmmm... I wonder if deployed web clients handle relative query
> urls correctly, e.g.:
>
>   References: <a href="?message-id=0923408.xxx.yyy">09823408.xxx.yyy</a>

With HURL this is just
<a href="msgid?0923408.xxx.yyy">09823408.xxx.yyy</a>.

Message-ID references only get linked if the article actually exists
in the archive. So if there are 5 articles in the References: line,
and only 4 of them happen to be in the archive (possibly due to a
thread that was dragged in from another group), the 5th one doesn't
get a link. This was expensive, but worthwhile IMO because it prevents
error messages like "sorry, that article isn't in the archive."

Also, msgid refs get linked within the body of articles, not just in
the References line.
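The linking step itself is conceptually simple: check each candidate
message-id against the set of ids known to be in the archive, and only
emit an href when the target really exists. Roughly, in Python (just
an illustrative sketch, not HURL's actual code):

    import re
    from html import escape
    from urllib.parse import quote

    MSGID_RE = re.compile(r'<([^<>\s]+@[^<>\s]+)>')

    def link_references(refs_header, archive_ids, base='msgid?'):
        """Render a References: header as HTML, linking only the
        message-ids that actually exist in the archive (archive_ids
        is the set of ids known to be present)."""
        parts = []
        for msgid in MSGID_RE.findall(refs_header):
            if msgid in archive_ids:
                parts.append('<a href="%s%s">%s</a>'
                             % (base, quote(msgid, safe='@'), escape(msgid)))
            else:
                parts.append(escape(msgid))  # in the header, not in the archive
        return ' '.join(parts)

    # e.g.:
    #   link_references('<foo@bar.net> <gone@elsewhere.org>', {'foo@bar.net'})
    #   -> '<a href="msgid?foo@bar.net">foo@bar.net</a> gone@elsewhere.org'

Doing that existence check for every reference is what makes it
expensive, but it's also exactly what keeps readers from ever landing
on a "sorry, that article isn't in the archive" page.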
> 2. Support format negotiation. Make the original message/rfc822 data
>    available as well as the enhanced-with-links html format -- at the
>    same address. This _should_ allow clients to treat the message as a
>    message, i.e. reply to it, etc. by specifying:
>
>        Accept: message/rfc822

Hmm. No format negotiation, but there's a "see original article" link
that shows the current article with full headers and without all the
extra hypertext junk. Also, the original RFC822-style article can be
retrieved using a URL of:

    http://site.org/path/original?foo@bar.com

instead of:

    http://site.org/path/msgid?foo@bar.com .

> 3. Keep the index pages to a reasonable size. Don't list 40000
>    messages by default. The cover page should show the last 50 or so
>    messages, plus a query form where folks can select articles...

Yup. A query result puts you in something called the Message List
Browser, which shows (by default) 100 messages per "page", with "Next"
and "Previous" links to other pages, etc.

> 4. Allow relational queries: by date, author, subject, message-id,
>    keywords, or any combination. Essentially, treat the archive as a
>    relational database table with fields message-id, from, date,
>    subject, keywords, and body.

There's a query page that allows for any article headers to be queried
and combined with AND logic. OR logic can be specified on individual
parts of each query using a comma, so you can do this:

    Subject: center,centre AND Date: 94 AND From: netscape

to "find articles with Subject containing center or centre posted in
1994 by someone whose From line matches netscape".

Unfortunately, back when I started this project I couldn't find a good
freely-distributable database package, so the current query system is
an ugly hack (it works, but it's inefficient). However, a friend of
mine has been working on an Isite-based replacement, apparently with
good results. I hope to replace my hack with his stuff sometime in the
future.
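The matching logic is nothing deep: every AND clause has to hold, and
within a clause any of the comma-separated alternatives is enough.
Something like this toy Python sketch captures the idea (illustrative
only, assuming plain case-insensitive substring matches; it's not the
code HURL actually runs):

    def parse_query(query):
        """Turn 'Subject: center,centre AND Date: 94 AND From: netscape'
        into [('subject', ['center', 'centre']),
              ('date', ['94']), ('from', ['netscape'])]."""
        clauses = []
        for clause in query.split(' AND '):
            header, _, values = clause.partition(':')
            clauses.append((header.strip().lower(),
                            [v.strip().lower() for v in values.split(',')]))
        return clauses

    def matches(headers, clauses):
        """headers maps lower-cased header names to values for one article.
        Every AND clause must hold; within a clause, any comma-separated
        alternative matching as a substring is enough."""
        for name, alternatives in clauses:
            value = headers.get(name, '').lower()
            if not any(alt in value for alt in alternatives):
                return False
        return True

    # e.g.:
    #   q = parse_query('Subject: center,centre AND Date: 94 AND From: netscape')
    #   matches({'subject': 'Re: how do I center text?',
    #            'date': 'Mon, 7 Mar 94 12:34:56',
    #            'from': 'someone@netscape.com'}, q)   # -> True

The hard (and currently ugly) part is doing this efficiently over a
large archive, which is where a real database -- or the Isite-based
replacement -- comes in.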
> Goals:
>
> 5. Generate HTML on the fly, not in batch.

Yup.

>    Cache the most recent pages of course (in memory?),
>    but don't waste all that disk space.

I don't think this would be a win for HURL, because:

 - pages have state info encoded in them (such as a cookie identifying
   the current query result), so each returned article is unique;

 - the article-displaying script isn't "slow" (relatively, anyway).

(and the HTML version is never stored on the server).

Better for HURL would be to cache query results, which is on my list
of things to do.

>    Update the index in real-time, as messages arrive, not in batch.

I have nightly builds of several archives (and builds are rotated into
place so there's no downtime), but there's no incremental indexing
(yet). I had initially envisioned the builds taking place infrequently
so this wasn't a high priority, but it's one of the things I want to
implement next.

> 6. Allow batch query results. Offer to return the raw message/rfc822
>    data (optionally compressed) for, e.g. "all messages from july 7 to
>    dec 1 with fred in the from field".

I plan to add the ability to download a .tar.gz or .zip file of the
messages comprising the current query result in their original RFC822
format.

> 7. Export a harvest gatherer interface, so that collections of mail
>    archives can be combined into harvest broker search services where
>    folks can do similar relational and full-text queries.

I've had some good preliminary results with full-text searches against
individual archives using Glimpse, but nothing like Harvest yet...

> 8. Allow annotations (using PICS ratings???) for "yeah, that
>    was a really good post!" or "hey: if you liked that, you
>    should take a look at ..."

The original motivation for this project was to do exactly that: take
150,000 articles from talk.bizarre and sort them into the Good, the
Bad, and the Ugly. (For talk.bizarre, the Good are few and far
between, but when they're good, they're really good).

ObAttribution: Mark-Jason Dominus was the one with the original idea
to do this article-scoring stuff (in fall of 93, I think), and he was
the one with the incredible foresight to archive everything posted to
t.b. for the last five years (and counting).

I'm not sure how this voting stuff should proceed exactly; it could be
as simple as vote-on-a-scale-from-one-to-ten, or something much, much
more complex (and powerful).

> 9. Make it a long-running process exporting an ILU interface, rather
>    than a fork-per-invocation CGI script. Provide a CGI-to-ILU hack
>    for interoperability with pre-ILU web servers.

Uh. I'll just pretend I didn't see this.

> Major brownie points to anybody who builds something that supports at
> least 1 thru 4 and makes it available to the rest of us.

Mmm, brownies. I still fail the "make it available to the rest of us"
condition, though. Even if I wanted to make a distribution package
today, I couldn't, due to the Sunsite crash which has left me
(temporarily) without access to my most recent code.

> I'd really like to use it for all the mailing lists around here.

I've been meaning to do a build of www-* and html-wg for a while now,
and Sunsite recently got another 20 gigs of disk, so as soon as things
settle down over there I might take care of this...

> Ah! I just remembered one more:
>
> >Goals:
>
> 10. Support a WYSIWYG-ASCII format, alal SeText or WikiWikiWeb[1]
>     so that folks can send reasonable looking plain text email,
>     but it can be converted to rich HTML in an automated fashion.

I'm not quite clear on this, but I try to add links within articles
whenever appropriate. This sort of interface definitely has lots of
neat possibilities; some of the things I've done so far include:

 - auto-recognition and linking of URLs within articles (of course);

 - if you're reading an article and see a word you don't understand,
   you can click on "Filters..." and then "Webster" to reload the
   current article with each (non-trivial) word linked to the Webster
   gateway at cmu.edu (other filters include rot13 decoding, etc.);

and things I'd like to do include:

 - auto-recognizing footnote references like your [1] above (i.e.,
   putting an HREF to a fragment ID of "foot1" and putting a matching
   NAME around the appropriate footnote at the bottom of articles).
   I'd have to change my article-displaying loop to be a two-pass
   process if I wanted these links to be "reliable", though.

 - auto-recognizing any occurrences of "ISBN x-xxxx-xxxxx-x" and
   linking them to a complete Library of Congress entry on that
   publication.

 - adding a nice way to specify newsgroup-specific filters; for
   instance, for an archive of comp.lang.perl.misc you might want to
   link any perl function references within each article to a
   hypertext man page somewhere, or link perl5 package references to
   more information on the package; for an archive of rec.food.recipes
   you might want to add links from each ingredient listed to an
   imperial-to-metric converter...
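Most of these filters boil down to regular-expression substitutions
over the article text before it goes into the <PRE> block. A rough
Python sketch of the idea (illustrative only; the ISBN gateway address
is a made-up placeholder, and none of this is HURL's actual code):

    import re
    from html import escape
    from urllib.parse import quote

    URL_RE  = re.compile(r'\b(?:https?|ftp|gopher)://[^\s<>"]+')
    ISBN_RE = re.compile(r'\bISBN[ :]*([-0-9Xx]{10,17})')
    PATTERN = re.compile('%s|%s' % (URL_RE.pattern, ISBN_RE.pattern))

    # Made-up placeholder; the real target would be whatever Library of
    # Congress / catalogue gateway ends up being used.
    ISBN_GATEWAY = 'http://catalog.example.org/isbn?'

    def make_link(m):
        """Turn one recognized token (URL or ISBN reference) into an href."""
        text = m.group(0)
        if text.startswith('ISBN'):
            return '<a href="%s%s">%s</a>' % (ISBN_GATEWAY,
                                              quote(m.group(1)), escape(text))
        return '<a href="%s">%s</a>' % (escape(text), escape(text))

    def add_links(body):
        """HTML-escape an article body for display inside <PRE>, turning
        recognized URLs and ISBN references into links along the way."""
        out, pos = [], 0
        for m in PATTERN.finditer(body):
            out.append(escape(body[pos:m.start()]))  # plain text between matches
            out.append(make_link(m))
            pos = m.end()
        out.append(escape(body[pos:]))
        return ''.join(out)

Newsgroup-specific filters would then just be extra (pattern,
link-maker) pairs selected per archive.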
> and Daniel reminded me:
>
> 12. Interoperate with USENET somehow. (perhaps the archive
>     parallels a moderated newsgroup?)

Interoperating with Usenet wasn't part of the original plan, since I
figured that most of the time people would be reading really really
old stuff: does it make sense to followup to a 4-year-old article?
How about a 10-year-old article? (We have stuff from net.bizarre,
too.) It's a possibility, though.

Gerald

247 lines!? ugh, sorry...

--
Gerald Oskoboiny <gerald@cs.ualberta.ca>  http://ugweb.cs.ualberta.ca/~gerald/

Received on Tuesday, 26 December 1995 06:28:59 UTC