Embedded (inline) indexing tags from Thomas Hedden on 2004-12-31 (www-html@w3.org from January 2005)

From: Thomas Hedden <thomas@hedden.org>
Date: Fri, 31 Dec 2004 06:00:20 +0000
To: www-html@w3.org
Message-ID: <41D2E966.7040108@hedden.org>
I have spent some time trying to find
out whether the following topic has been
discussed, and have not been able to
find anything about it. However, I am
new to this list, so if this is old hat,
please excuse me.

I have always thought that there should
be some way of tagging words, phrases,
sentences, graphics (actually anything)
with an indexing tag that can be used to
generate a proper index. This is distinct
from META data: META elements live in the
document head and can only be used to find
WEB PAGES, not individual parts of web
pages. What I have in mind would be tags
embedded in the text: "inline" indexing
tags, if you will.

Here is an example of what I have in mind.
This is not very well thought out, and I
don't really know the spec, so if someone
has a better idea, all the better.

<index level="1" term="Rocky and Bullwinkle", term="Bullwinkle the Moose";
       level="2" term="Bullwinkle the Moose">Bullwinkle</index>
<index level="1" term="Rocky and Bullwinkle", term="Rocky the Flying Squirrel";
       level="2" term="Rocky the Flying Squirrel">Rocky the Flying Squirrel</index>

(Another thing which needs to be done is to specify
the level 1 term under which a level 2 term should
appear, and I'm having trouble thinking of the
best way to do that right now.)

A program could be run on the marked-up page
to generate an index that would look something
like this:

Bullwinkle the Moose
Rocky and Bullwinkle
    Bullwinkle the Moose
    Rocky the Flying Squirrel
Rocky the Flying Squirrel
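
A minimal sketch of such a program, in
Python, might look like the following. It
assumes the tags have already been read into
simple records, and it assumes one possible
answer to the open question above: each
level 2 term is filed under the FIRST level 1
term of the same tag. The names build_index,
format_index, and TAGS are hypothetical and
only for illustration.

```python
# Each record mirrors one <index> tag from the example:
# its level 1 terms and its level 2 terms.
TAGS = [
    {"level1": ["Rocky and Bullwinkle", "Bullwinkle the Moose"],
     "level2": ["Bullwinkle the Moose"]},
    {"level1": ["Rocky and Bullwinkle", "Rocky the Flying Squirrel"],
     "level2": ["Rocky the Flying Squirrel"]},
]


def build_index(tags):
    """Map each level 1 term to the set of level 2 terms filed under it."""
    index = {}
    for tag in tags:
        # Every level 1 term gets a main entry, even with no subentries.
        for term in tag["level1"]:
            index.setdefault(term, set())
        # Assumption: level 2 terms go under the first level 1 term.
        if tag["level1"]:
            index[tag["level1"][0]].update(tag["level2"])
    return index


def format_index(index):
    """Render the index alphabetically, subentries indented four spaces."""
    lines = []
    for term in sorted(index):
        lines.append(term)
        for sub in sorted(index[term]):
            lines.append("    " + sub)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_index(build_index(TAGS)))
```

Under that assumption the sketch reproduces
the five-line index shown above. A different
rule for choosing the parent term would only
change the marked line in build_index.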

There could be defaults to make the tags
simpler to write. For example, if no term
is specified, the term would default to
the tagged word or phrase, and the default
level would be "1".
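
These defaults can be sketched in Python with
the standard html.parser module. This is only
an illustration under simplifying assumptions:
each hypothetical <index> tag carries at most
one "term" and one "level" attribute, and the
names IndexTagParser and extract_entries are
made up for this example.

```python
from html.parser import HTMLParser


class IndexTagParser(HTMLParser):
    """Collect (level, term) pairs from hypothetical <index> tags."""

    def __init__(self):
        super().__init__()
        self.entries = []    # collected (level, term) pairs
        self._attrs = None   # attributes of the open <index> tag, if any
        self._text = []      # text seen inside the open <index> tag

    def handle_starttag(self, tag, attrs):
        if tag == "index":
            self._attrs = dict(attrs)
            self._text = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "index" and self._attrs is not None:
            # Normalize whitespace in the tagged text.
            text = " ".join("".join(self._text).split())
            # Default: no term given, so use the tagged word or phrase.
            term = self._attrs.get("term") or text
            # Default: no level given, so use level 1.
            level = int(self._attrs.get("level") or 1)
            self.entries.append((level, term))
            self._attrs = None


def extract_entries(markup):
    """Return the list of (level, term) pairs found in the markup."""
    parser = IndexTagParser()
    parser.feed(markup)
    parser.close()
    return parser.entries
```

For example, a bare <index>Bullwinkle</index>
would yield a level 1 entry for "Bullwinkle",
while a tag with explicit attributes would
keep whatever level and term it specifies.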

IMHO, the entire w3 community hasn't paid
proper attention to indexing for the simple
reason that whole-text searching is now
free, very quick, and adequate for many
purposes. However, once we get over the
initial euphoria of being able to perform
whole-text searching, we should realize that
at the end of the day it's really not very
good. It requires searching for synonyms,
since one author might use one term and
another author a synonym, and it finds all
manner of unrelated rubbish. Not only that,
but a particular passage might be of
interest for a certain topic even if it
does not contain the term under which it
should properly be indexed.

Making a proper index takes time and effort
on the part of a human indexer, and to
facilitate this I think a tag should be
made available which authors can use, or
if they are not inclined to do this, one
which an indexer could go back and add later,
with the goal of generating a GOOD-QUALITY
index. Generating the index itself would be
very simple once the content is properly
tagged.

Of course anyone is free to do this on his/her
own, but it would only be really useful if
there were some standardization, so that true
indexing engines could produce true indexes,
rather than having whole-text search engines
give us everything including the kitchen sink.

Thank you for your time.

Thomas Hedden


-- 
--------------------------------------------------------------
| Thomas Hedden            | Voice & fax   +1 (978) 371-2126 |
| 98 East Riding Drive     | Skype     callto://thomashedden |
| Carlisle, MA  01741-1602 | Cell          +1 (978) 930-0462 |
| U.S.A.                   | E-mail thomas AT hedden DOT org |
| Planet Earth             | WWW       http://www.hedden.org |
--------------------------------------------------------------
| Linux Counter registration # 203894, http://counter.li.org |
--------------------------------------------------------------
Received on Thursday, 6 January 2005 22:34:25 UTC