Re: Meta Tag Draft - New version.

Martijn Koster (m.koster@webcrawler.com)
Thu, 21 Dec 1995 13:30:32 -0700


Message-Id: <v02140802acff54cb00f0@[199.221.45.139]>
Date: Thu, 21 Dec 1995 13:30:32 -0700
To: davidmsl@anti.tesi.dsi.unimi.it (Davide Musella),
From: m.koster@webcrawler.com (Martijn Koster)
Subject: Re: Meta Tag Draft - New version.
Cc: musella@dsi.unimi.it

At 4:06 PM 12/20/95, Davide Musella wrote:
>Hello to everybody.. Here there is the new version of the meta-tag draft.

This one keeps going in various form(u)s :-)

First of all I don't think HTML tags are the ideal place for generalised
meta-information about documents, because it's limited to HTML, only allows
a single viewpoint, etc.  I'd much prefer seeing a URC, or at least the use
of LINK elements to point to separate documents with META data (as
suggested by Murray Maloney), which you can then negotiate.

Having said that, it seems to refuse to die, and it is targeted as a
quickly deployable interim measure, so I thought I'd better comment on
the parts I disagree with (summary at the end).

>   Now the synopsis of the META HTTP-EQUIV Tag is not severe, allowing so
>   the use of different key words to define the same things.

I had to read this twice before I understood.  This may be better
re-phrased as "Currently (in HTML 2.0) the synopsis is not
well-defined".

> 3. HTTP-EQUIV.

I'd much prefer to change the focus of this draft away from HTTP-EQUIV,
and concentrate on NAME, to which http-equiv might be added only if
required. That way you separate the two purposes: embedding META information
in HTML, and associating HTTP headers with HTML documents.

The draft contains no rationale for sending generalised META info in HTTP,
so let's think about this... I can think of a few reasons:

1: to allow retrieval of this info via a HEAD request

   One could argue this is useful for indexing, but in practice robots don't
   do this: they want the entire document before deciding what/how to index,
   at which point they can parse the HTML and use the META info straight
   away from there.  Also, because this is not widely implemented, doing
   a HEAD instead of a GET would usually result in receiving no META data, and
   requiring a GET after all; this double round-trip is enough reason to
   just do a GET, in which case you might as well parse it from the HTML
   <HEAD> element.

2: to allow proxies and clients access to this without parsing the HTML

   That is lazy, and current wisdom says servers should not have to do this
   stuff if the client can do it.

3: to set/override HTTP server settings

   One could e.g. put in an HTTP-EQUIV="Content-Language" header,
   if your server supports HTTP-EQUIV but not configuration
   of Content-Language.

   This would be a computationally expensive way of doing it, even if
   the server caches this info. Much better to do that out of band.
   Dubious from a security aspect too (e.g. Location:).

4: to get HTTP-headers via other protocols (file://, gopher://...)

   I find this dubious:
   - if you have conflicting values in your server configuration and
     your HTTP-Equiv, what is a server to do? What is a client to do?
   - I doubt many browsers' architectures allow this to be done easily.
   - this only works for HTML. Why not solve it properly in the protocol
     so it works for all media types?

So I don't think it's a good idea. If there is a good reason it should
be mentioned in the draft.
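
To illustrate reason 1 above: a robot that has already done a GET can pick
the META info straight out of the HTML, with no help from the server.  The
following is only a minimal sketch, assuming Python and its standard
html.parser module; the names MetaCollector and extract_meta are made up
for this illustration and do not come from the draft.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (kind, key, content) triples from META elements."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return  # a META element without CONTENT carries no value
        # Prefer NAME; fall back to HTTP-EQUIV if NAME is absent.
        for kind in ("name", "http-equiv"):
            if attributes.get(kind):
                self.meta.append((kind, attributes[kind], content))
                break


def extract_meta(html_text):
    """Return the META triples found in an already-fetched document."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.meta
```

A single GET followed by extract_meta(body) yields the same information a
HEAD request would have carried, without the second round-trip.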


>   It is possible to use any text string [in http-equiv],

No, not if this is to go straight over the wire in an HTTP header:
you're syntactically constrained by the HTTP spec here.
The constraints should be explicitly mentioned,
or at the very least referenced.
Of course if you do just NAME and CONTENT you can relax this.

This also opens up possible clashes in the name spaces between independent
extensions by the HTML META tag and the HTTP spec.  For example, if someone
puts in an http-equiv "Payment: loads money", and later the HTTP spec
decides to add a header "Payment" with a rigorous syntax, then you have
servers sending bogus headers.

For this reason I'd be a lot happier if you were required to prepend
"Meta-" to any unspecified string.  I'd be even happier if general strings
weren't allowed at all (there is little point if the syntax and semantics
aren't defined).
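
A rough sketch of both points, in Python: checking that an http-equiv
string is a legal HTTP header name (a "token" in the HTTP grammar, i.e. no
control characters and none of the "tspecials"), and prepending "Meta-" to
anything that is not a known header.  The function names and the
KNOWN_HEADERS set are assumptions made up for this illustration; a real
list would have to come from the HTTP spec.

```python
# Characters that may not appear in an HTTP token ("tspecials" plus SP, HT).
TSPECIALS = set('()<>@,;:\\"/[]?={} \t')

# Hypothetical list of headers with defined syntax; illustration only.
KNOWN_HEADERS = {"expires", "content-language"}


def is_http_token(name):
    """True if name is non-empty, printable ASCII, and free of tspecials."""
    return bool(name) and all(
        32 < ord(ch) < 127 and ch not in TSPECIALS for ch in name
    )


def header_name_for(http_equiv):
    """Map an http-equiv value to the header name a server might send."""
    if not is_http_token(http_equiv):
        raise ValueError("not a legal HTTP header name: %r" % http_equiv)
    lowered = http_equiv.lower()
    if lowered in KNOWN_HEADERS or lowered.startswith("meta-"):
        return http_equiv
    # Unspecified strings get a prefix so they cannot clash with headers
    # that a later version of the HTTP spec might define.
    return "Meta-" + http_equiv
```

With this, "Payment" would go out as "Meta-Payment", while "Expires" passes
through unchanged and a string containing a space or colon is rejected.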

> but if you want to define
>   these properties you have to use the following words:
>
>  ...
>         expire:  to indicate the expire date of the document
>        language: to indicate the language of the document

Without a syntax you can't do much with those fields...

Anyway this is strange; in HTTP the respective headers are "Expires"
and "Content-Language", and have pre-determined syntaxes.
It seems to me these should be the same, and for them to be used the
syntax and semantics need to be mentioned or referenced.

>        public (Boolean): to indicate if the document is available to
>                         everybody or not

First of all I question the value of this; whether a document is public or
not should be determined by the protocol or policy, not the document.  What
if a browser sees a Public: 'NO', should it drop the document?  You
can't guarantee that.  You also can't specify target groups who have
access, so as a security measure it's not all that valuable.

Anyway, if you specify a type 'Boolean' you need to specify the values
(0-1, YES-NO, ON-OFF ?), otherwise it doesn't really help.

>   An HTTP server must process these tags for an HEAD HTTP request,
>   Do not name an HTTP-EQUIV attribute the same as a response header
>   that should typically only be generated by the HTTP server. Some
>   inappropriate names are "Server", "Date", and "Last-Modified".
>   Whether a name is inappropriate depends on the particular server
>   implementation. It is recommended that servers ignore any META
>   elements that specify HTTP equivalents (case insensitively) to their
>   own reserved response headers.

This brings me to a feeling of unease about this draft: Is it a way of
associating meta-data info with a document, or is it a way of configuring a
server, or conveying HTTP info in HTML via other protocols?  Don't these
conflict somewhere?

> 4. NAME.
>
>   This attributes can be used to define some properties such as
>   author, publication date etc. If absent the name can be assumed to be
>   the same as the value of HTTP-EQUIV.

According to section 3, the HTTP-Equiv may also be absent, so both may be
absent, leaving the Content useless :-) At least one of them should be
required.
Personally I'd prefer the emphasis to be on the NAME=>CONTENT pair rather
than the HTTP-EQUIV=>CONTENT pair.

> 5. CONTENT
>
>   Used to supply a value for a named property.
>   If it's used with the HTTP-EQUIV it can contain more than one single
>   information; it is possible to use the Boolean operator (AND, OR) to
>   insert a Boolean definition of the field.
>   The AND operator will be represented by the SPACE (ASCII[32]) and the
>   OR operator by the COMMA (ASCII[44]).
>   The AND operator is processed before the OR operator. So a string
>   like this: "Red ball, White ball" means :"ball AND (red OR white)".
>   Examples:
>
>   <META HTTP-EQUIV= "Keywords" CONTENT= "Italy Product, Italy Tourism">
>
>   The spaces between a comma and a word or vice versa are ignored.

I find this strange and confusing.  First of all, this holds true only for
those fields you have defined in section 3, not for HTTP headers.
Secondly, why not simply say "Keyword phrases are separated by commas",
without delving into a non-obvious boolean system?
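
The simpler reading is also trivial to implement.  A minimal sketch in
Python, assuming the comma-separated interpretation suggested here (not
the Boolean scheme in the draft); the function name keyword_phrases is
made up for this illustration.

```python
def keyword_phrases(content):
    """Split a CONTENT value into keyword phrases at commas.

    Whitespace around each phrase is dropped, runs of whitespace inside a
    phrase are collapsed to one space, and empty phrases are skipped.
    """
    return [
        " ".join(phrase.split())
        for phrase in content.split(",")
        if phrase.strip()
    ]
```

For the example in the draft, keyword_phrases("Italy Product, Italy Tourism")
gives the two phrases "Italy Product" and "Italy Tourism".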

> 6. Cataloging an HTML document
>
>   These 'keywords' were specifically conceived for exaustively and
>   completely catalogue the HTML document.

I guess you mean "exhaustively" and "cataloguing (?)"?

I don't think you should claim anything "exhaustive and complete", because
things, especially meta-data, never are.

>   This allows the software agents to index at best your own document.

"This allows you to aid web robots in indexing your document."?

>   To do a preliminary indexing, it's important to use at least the
>   http-equiv meta-tag "keywords".

This sentence doesn't run...

I'm also missing a "Security Considerations" section, which seems much
needed to warn about people spamming and abusing this tag, especially
when it could override proper HTTP headers.

Sorry to be a bit negative here, but I really think this should be well
thought-out if it is to end up in a spec the entire networking community
will have to live with.

So in summary: rather than (just) this meta tag, look at using LINK to
associate META data, seriously reconsider (euphemism for "don't do")
general HTTP-EQUIV, specify syntax as well as semantics for the fields, and
consider the security issues.

-- Martijn

Email: m.koster@webcrawler.com
WWW: http://info.webcrawler.com/mak/mak.html