Re: comments? mirrors.txt (aka site metadata)

I think this is good metadata to provide, but I'm not convinced that
'/mirrors.txt' is the right way to do it. So-called "well-known locations"
are locked to the granularity of a whole Web site, which may not be what
you want; you may only want to mirror part of it (either a subdirectory, or
only certain types of files, etc.). They also force a name on Web sites;
while you might be reasonably sure that mirrors.txt isn't used by most Web
sites, what if it is? Do those Web sites have to change their URLs?
Effectively, you're dictating how people lay out their URLs, and that
isn't a good thing (see TBL's arguments re: URI opacity).

So, what you need is a) a way to describe what parts of the site are
mirrored and b) a way to discover the description.

The simplest way to do this might be to define an HTML link tag:

  <link rel="alternate" href="http://mirror.other.com/" title="The Other Mirror" />

There can be as many of these as you like in a document, to advertise the
different mirrors available.

The only problem here is that "alternate" means that the href is an
alternate for THIS document; you also want to say that its children can be
used in place of this document's children as well. This might be
accomplished by defining a "mirror" link-type as well:

  <link rel="alternate mirror" href="http://mirror.other.com/" title="The Other Mirror" />

Now, this tag can be placed on the root ('/') HTML document of a site, and
machines will be able to figure out what its mirrors are. You can also
place it in other places (e.g., '/~bob/') and they will be able to find
mirrors of just that resource and its descendants. This would need to
be more fully specified, but you get the idea.
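
To make the idea concrete, here is a rough Python sketch of what a client
might do with such tags. Everything in it is an assumption for illustration:
the "mirror" link-type is only proposed above, the names MirrorLinkParser,
find_mirrors and to_mirror are made up, and the prefix-swap mapping is just
one possible reading of "its children can be used in place of this
document's children". It also assumes both the scope and the mirror href end
in '/'.

```python
from html.parser import HTMLParser


class MirrorLinkParser(HTMLParser):
    """Collects href values of <link> tags whose rel includes "mirror"."""

    def __init__(self):
        super().__init__()
        self.mirrors = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated list of link-types, e.g. "alternate mirror".
        rels = (attr.get("rel") or "").lower().split()
        if "mirror" in rels and attr.get("href"):
            self.mirrors.append(attr["href"])


def find_mirrors(html):
    """Return the mirror hrefs advertised in an HTML document."""
    parser = MirrorLinkParser()
    parser.feed(html)
    return parser.mirrors


def to_mirror(url, scope, mirror_base):
    """Map a URL under `scope` onto `mirror_base` (hypothetical rule).

    `scope` is the URL of the document carrying the link tag; anything
    outside it is not covered by the mirror, so None is returned.
    """
    if not url.startswith(scope):
        return None
    return mirror_base + url[len(scope):]
```

With the tag placed on 'http://example.org/~bob/', a client could rewrite
'http://example.org/~bob/a.html' to 'http://mirror.other.com/a.html' and
leave URLs outside '/~bob/' alone.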

HTTP headers could also be used in a similar fashion; either the somewhat
defunct Link header (which is the complement of the link tag), or a
specialized Mirror header, like:

  Mirror: base="/"; href="http://mirror.other.com/"; title="The Other Mirror"

The most common complaint about this approach is that it's hard for some
people to modify their HTTP headers, but the mechanism should still be
provided for those that want to use it; it can be quite useful.
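
For what it's worth, a header like that would be easy to consume. The
sketch below is only an illustration: the Mirror header is not defined
anywhere, so the grammar (semicolon-separated name="value" pairs, no escaped
quotes inside values) and the function name parse_mirror_header are my
assumptions, based on nothing more than the example above.

```python
import re

# One name="value" parameter, followed by ';' or the end of the header.
# Assumed grammar: values are always double-quoted and contain no quotes.
_PARAM = re.compile(r'\s*([A-Za-z][A-Za-z0-9_-]*)\s*=\s*"([^"]*)"\s*(?:;|$)')


def parse_mirror_header(value):
    """Parse a hypothetical Mirror header value into a dict of parameters."""
    value = value.strip()
    params = {}
    pos = 0
    while pos < len(value):
        match = _PARAM.match(value, pos)
        if match is None:
            raise ValueError("malformed Mirror header near offset %d" % pos)
        # Parameter names are treated as case-insensitive.
        params[match.group(1).lower()] = match.group(2)
        pos = match.end()
    return params
```

Feeding it the example value yields a dict with the keys "base", "href" and
"title"; anything that doesn't fit the assumed grammar raises ValueError
rather than being guessed at.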

For more complex configurations (or if you just have a lot of mirrors!),
you may indeed need a separate file to describe how a site is mirrored. In
these cases, I'd very much suggest using an XML-based format, and
discovering its location with Link tags and HTTP headers as before:

  <link rel="mirror-index" href="/mirror-index.xml" />

Note that the location and name of the index are NOT fixed, just hinted
through the link tag. This way, you can put this tag (or HTTP header) on
the root resource on the site, and have the index auto-discovered.
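
As a sketch of that auto-discovery step (again, only an illustration: the
"mirror-index" link-type is proposed here and not registered anywhere, and
the names IndexLinkParser and mirror_index_url are made up), a client could
pull the hint out of the page and resolve it against the page's own URL:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IndexLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="mirror-index"> seen."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "mirror-index" in rels:
            self.href = attr.get("href")


def mirror_index_url(page_url, html):
    """Return the absolute URL of the mirror index, or None if not hinted."""
    parser = IndexLinkParser()
    parser.feed(html)
    if parser.href is None:
        return None
    # The href may be relative, so resolve it against the page it came from.
    return urljoin(page_url, parser.href)
```

Because the href is resolved rather than assumed, the index can live at
'/mirror-index.xml', under '/~bob/', or on another host entirely without the
client needing to know in advance.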

In some cases, it may be desirable to avoid fetching the root of the
site's HTML to discover where the metadata is. There's been some
discussion of how to do this (e.g., OPTIONS * with content negotiation for
the metadata's media type), but there isn't any clear consensus yet. This
problem is why well-known locations are common; robots.txt and p3p.xml use
them because there may be problems if the metadata is in the HTML (i.e.,
the site may not want robots to fetch the root HTML, or the privacy policy
of the root HTML may be unacceptable, respectively), but there isn't such
a constraint on site mirrors.

It's interesting that this problem shows up again and again, with roughly
the same solutions offered; well-known locations, LINK and META tags, HTTP
headers, and some unspecified out-of-band discovery mechanism. IMHO it
would be very, VERY good to settle on one means (or set of means) of
discovering metadata and one framework to describe it, rather than
reinventing the wheel and hitting the same potholes each time. (Andre,
this complaint is directed at the Web community, not you! ;)

RDF was supposed to address these things. I think it's the right approach,
because you need a general model for metadata, multiple ways to express it
(in XML, in HTTP headers, etc.), and a way to combine metadata from
multiple sources to come up with a definitive view of the world.
Unfortunately, it has moved on to grander schemes before solving these
simple problems.

BTW, we ran into problems very similar to these with metadata at Akamai;
one of the outcomes of that was URISpace
(http://www.w3.org/TR/urispace.html). It's probably too heavyweight for
your particular project, but you might find it interesting. I've been
thinking of expressing entire Web site configurations (Apache .conf files,
P3P.xml, robots info, etc.) in URISpace, and then writing a transform to
create the appropriate configuration files from one source. Anybody
interested?

Cheers,


----- Original Message -----
From: "Andre John Mas" <ajmas@newtradetech.com>
To: <www-talk@w3.org>
Sent: Monday, March 31, 2003 9:53 AM
Subject: comments? mirrors.txt


>
> Hi,
>
> Mirroring a web site or ftp site is a great way of reducing load
> and improving access times. The only thing though is that there is
> no method for telling a web browser to automatically go to a mirror.
> For this reason I have been thinking that a 'mirrors.txt' file might
> be of use at the root of a web site that is either the master or a
> mirror, in the same way that a robots.txt file is made available.
>
> Following is an example of what such a file would contain:
>
> ----start of example
> #this is a comment
>
> title:   Project Gutenberg
> description: Project Gutenberg is the Internet's oldest producer of FREE
>    electronic books (eBooks or eTexts).
> master:  http://gutenberg.net/
> search:  master
>
> mirror.name: University of North Carolina - HTTP
> mirror.city: Chapel Hill
> mirror.state: North Carolina
> mirror.country: USA
> mirror.gridref:
> mirror.url: http://www.ibiblio.org/gutenberg/
> mirror.update.freq: daily
> mirror.comment: Main Project Gutenberg Collection Site
>
> mirror.name: University of North Carolina - FTP
> mirror.city: Chapel Hill
> mirror.state: North Carolina
> mirror.country: USA
> mirror.gridref: 0/+1000,-1000
> mirror.url: ftp://ibiblio.org/pub/docs/books/gutenberg/
> mirror.update.freq: daily
> mirror.comment: Main Project Gutenberg FTP Site -- If it doesn't allow
>    access, please try the corresponding HTTP site above
>
> ----end of example
>
> Most of the fields should be self explaining, though for the less
> obvious:
>   - search: values would be mirror or master. This is important if
>     only the master offers a search facility
>   - mirror.gridref: the grid coordinates of the mirror. The slash
>     is there for a future use, such as defining planet ID as prefix.
>     The grid ref would always be the last child. I know this is
>     overkill, and probably no one will take this seriously, but I
>     would like to make this future proof, if there is no extra cost.
>   - mirror.update.freq: how often the mirror is updated (should this
>     be a numerical, textual value or both?)
>
> Some sites mirror several others, so the site would probably need more
> than one mirror file. Two suggestions are to have the additional mirror
> files have a numeric suffix, e.g. mirrors.txt, mirrors2.txt, etc. or
> to have a mirrors.txt file that refers to the other mirrors.txt files.
>
> Also, search engines, such as Google, could make use of this information
> to tie together mirrors under one link, to make for smarter navigation.
> Something such as:
>
>    PROJECT GUTENBERG -
>    Project Gutenberg is the Internet's oldest producer of FREE
>    electronic books (eBooks or eTexts).
>    gutenberg.org/ - 18k - Master - Closest Mirror - Other Mirrors
>
> This is a first jab at something that could well be of use, so I would
> certainly appreciate your comments and whether this is something that
> could be added as a web standard?
>
> regards
>
> Andre
>
> P.S. I am not associated with Project Gutenberg, I am just using it as
> a useful example of real site that could benefit from such a solution.
>
>
>

Received on Tuesday, 1 April 2003 12:06:56 UTC