Re: Exploring new vocabularies for HTML from Ian Hickson on 2008-03-30 (public-html@w3.org from March 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Sun, 30 Mar 2008 09:55:59 +0000 (UTC)
To: Michael Kohlhase <m.kohlhase@jacobs-university.de>
Cc: Bruce Miller <bruce.miller@nist.gov>, public-html@w3.org, www-math@w3.org
Message-ID: <Pine.LNX.4.62.0803300948260.28180@hixie.dreamhostps.com>

Again with my rarely-used Google hat on, and not my vendor-neutral editor 
hat on:

On Sun, 30 Mar 2008, Michael Kohlhase wrote:
> > 
> > At least insofar as Google is concerned, we definitely only want one 
> > representation for the purposes of search engines.
>
> I cannot but agree. So, if there is more than one aspect of the 
> information we should keep them in one place. This is at the core of the 
> <semantics>/<annotation-xml> proposal.

When I say "only one representation", I mean only one, not two 
representations smuggled in under one resource. The 
<semantics>/<annotation-xml> feature is a way of including two 
representations.

> > We have found that whenever we use one representation for searching 
> > and another is presented to the user, the two end up being out of sync 
> > and the results presented to the user are less useful than if we 
> > ignore the "semantic" version and base our algorithms exclusively on 
> > the "presentational" version that the user sees.
>
> That may be true for the particular information retrieval method (bag of 
> words) that google uses. But this is certainly not true for mathematical 
> formulae, where the bag of glyphs used in a formula gives almost no 
> indication of the meaning.

This is what people have said for license metadata, geographical 
addresses, prose semantics, accessibility enhancements, and many other 
areas. Despite this, in all these cases we have found the same thing: 
having more than one copy of the data anywhere results in the non-primary 
instances (the instances that are not shown to the user) being of 
significantly lower fidelity, falling out of sync, and being erroneous. 
In all cases we have found that simply ignoring the supposedly "semantic" 
data and focusing on the data directly seen by the user in regular 
interactions with the user agent gives results that are orders of 
magnitude better in quality.

> Maybe this helps: I think it is symptomatic that google is near useless 
> for finding math (formulae).

Google's lack of quality in terms of finding maths stems directly from the 
lack of attention this use case has received in general, and has no 
bearing whatsoever on whether Google uses presentational MathML or content 
MathML or StarMath 5.0 or LaTeX representations to create its index.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Sunday, 30 March 2008 09:56:40 UTC