Re: Embedded (inline) indexing tags from Edward Lass on 2005-01-11 (www-html@w3.org from January 2005)

From: Edward Lass <elass@goer.state.ny.us>
Date: Tue, 11 Jan 2005 14:32:01 -0500
To: <thomas@hedden.org>,<www-html@w3.org>
Message-Id: <s1e3e36a.085@mail.goer.state.ny.us>
You could store the information in an XML document and use XSLT[1] to
output it in (X)HTML. Using XML this way is very similar to a
database-driven website, but arguably more flexible. The XML document
might look something like this:

<show>
   <title>Rocky and Bullwinkle</title>
   <character>
      <name>Bullwinkle the Moose</name>
      ...insert any other information here...
   </character>
   <character>
      <name>Rocky the Squirrel</name>
      ...ditto...
   </character>
</show>

This sort of data could be transformed to make all the show titles
into headers and then all the character names into list items in an
unordered list.  And then each character could get his or her own page
with the name as a header and the other information in paragraphs.
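
To make the first of those transformations concrete, here is a rough
sketch. It is written in Python with the standard library rather than in
XSLT, purely as a stand-in for illustration; the function name
show_to_html and the choice of <h1> for the show title are assumptions
made for this sketch, not anything prescribed above.

```python
import xml.etree.ElementTree as ET
from html import escape

# Sample data in the shape shown above (the elided "other information"
# is simply left out here).
SHOW_XML = """<show>
   <title>Rocky and Bullwinkle</title>
   <character>
      <name>Bullwinkle the Moose</name>
   </character>
   <character>
      <name>Rocky the Squirrel</name>
   </character>
</show>"""


def show_to_html(xml_text):
    """Turn one <show> into a header plus an unordered list of names."""
    show = ET.fromstring(xml_text)
    title = show.findtext("title", default="").strip()
    lines = ["<h1>" + escape(title) + "</h1>", "<ul>"]
    # One list item per <character>, using its <name> child.
    for character in show.findall("character"):
        name = character.findtext("name", default="").strip()
        lines.append("  <li>" + escape(name) + "</li>")
    lines.append("</ul>")
    return "\n".join(lines)


print(show_to_html(SHOW_XML))
```

An XSLT stylesheet would express the same mapping declaratively, with
one template for the show and one for each character.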

This is good for content management, especially reusing and indexing
objects within a particular site.  Search engines, however, obviously
wouldn't see what's going on behind the scenes.

You could also transform the data into RDF/XML syntax[2] and attach it
to the (X)HTML document as follows:

<link rel="alternate" type="application/rdf+xml" href="...URL here..." />

This could help search engines in principle, but the major ones don't
index RDF data in any meaningful way, at least not yet.

In general, Thomas, I would suggest reading up on the W3C's efforts for
a so-called Semantic Web[3].

Ed.

[1] http://www.w3.org/TR/xslt
[2] http://www.w3.org/TR/rdf-syntax-grammar 
[3] http://www.w3.org/2001/sw/

>>> Thomas Hedden <thomas@hedden.org> 12/31/2004 1:00:20 AM >>>




I have spent some time trying to find
whether the following topic has been
discussed, and have not been able to
find anything about it. However, I am
new to this list, so if this is old hat
please excuse me.

I have always thought that there should
be some way of tagging words, phrases,
sentences, graphics (actually anything)
with an indexing tag that can be used to
generate a proper index. This is distinct
from META data, since META data is in the
header, and can only be used to find WEB
PAGES, not individual parts of web pages,
while what I have in mind would be in tags
embedded in the text: "inline" indexing tags,
if you will.

Here is an example of what I have in mind.
This is not very well thought out, and I
don't really know the spec, so if someone
has a better idea, all the better.

<index level="1" term="Rocky and Bullwinkle", term="Bullwinkle the Moose"; level="2" term="Bullwinkle the Moose">Bullwinkle</index>
<index level="1" term="Rocky and Bullwinkle", term="Rocky the Flying Squirrel"; level="2" term="Rocky the Flying Squirrel">Rocky the Flying Squirrel</index>

(Another thing which needs to be done is to specify
the level 1 term under which a level 2 term should
appear, and I'm having trouble thinking of the
best way to do that right now.)

A program could be run on the markup page to
generate an index that would look something like
this:

Bullwinkle the Moose
Rocky and Bullwinkle
    Bullwinkle the Moose
    Rocky the Flying Squirrel
Rocky the Flying Squirrel

There could be defaults to make it simpler
to write the tags, for example if no term
is specified then the term would default to
the tagged word/phrase, etc., the default
level would be "1", etc.
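
As a minimal sketch of such a program, the Python below builds the index
shown above from a well-formed variant of the tags. The variant is an
assumption made only for this sketch: each <index> element takes an
optional "term" attribute (defaulting to the tagged text, as suggested
above) and a hypothetical "parent" attribute naming the level 1 term
under which it should also be listed. Neither the attribute names nor
the function name build_index come from any spec.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed variant of the inline tags: "parent" is an
# invented attribute naming the level 1 heading for a level 2 entry.
PAGE = """<p>
<index term="Bullwinkle the Moose" parent="Rocky and Bullwinkle">Bullwinkle</index>
and
<index parent="Rocky and Bullwinkle">Rocky the Flying Squirrel</index>
appear in the same show.
</p>"""


def build_index(markup):
    """Collect <index> tags into a two-level, alphabetically sorted index."""
    root = ET.fromstring(markup)
    index = {}  # level 1 term -> set of level 2 terms listed under it
    for tag in root.iter("index"):
        text = "".join(tag.itertext()).strip()
        # Default: with no explicit term, index under the tagged text.
        term = tag.get("term", text)
        index.setdefault(term, set())
        parent = tag.get("parent")
        if parent:
            index.setdefault(parent, set()).add(term)
    lines = []
    for top in sorted(index):
        lines.append(top)
        lines.extend("    " + sub for sub in sorted(index[top]))
    return "\n".join(lines)


print(build_index(PAGE))
```

Naming the level 1 term on the tag itself is only one possible answer to
the open question above about how to say where a level 2 term belongs.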

IMHO, the entire w3 community hasn't paid
proper attention to indexing for the simple
reason that whole-text searching is now
free, very quick, and is adequate for many
purposes. However, after we get over the
initial euphoria of being able to perform
whole-text searching, we should realize that
at the end of the day it's really not very good:
It requires searching for synonyms, since one
author might use one term and another author
might use a synonym, and it finds all manner
of unrelated rubbish. Not only that, but a
particular passage might be of interest for
a certain topic even if it does not contain
the term under which it should properly be
indexed.

Making a proper index takes time and effort
on the part of a human indexer, and to
facilitate this I think a tag should be
made available which authors can use, or
if they are not inclined to do this, one
which an indexer could go back and add later,
with the goal of generating a GOOD-QUALITY
index. This would be very simple to do
once the content is properly tagged.

Of course anyone is free to do this on his/her
own, but it would only be really useful if
there were some standardization so that true
indexing engines could produce true indexes,
rather than having whole-text search engines
give us everything including the kitchen sink.

Thank you for your time.

Thomas Hedden


-- 
--------------------------------------------------------------
| Thomas Hedden            | Voice & fax   +1 (978) 371-2126 |
| 98 East Riding Drive     | Skype     callto://thomashedden |
| Carlisle, MA  01741-1602 | Cell          +1 (978) 930-0462 |
| U.S.A.                   | E-mail thomas AT hedden DOT org |
| Planet Earth             | WWW       http://www.hedden.org |
--------------------------------------------------------------
| Linux Counter registration # 203894, http://counter.li.org |
--------------------------------------------------------------





Received on Tuesday, 11 January 2005 19:32:59 UTC