- From: Ian Hickson <ian@hixie.ch>
- Date: Sun, 30 Mar 2008 09:55:59 +0000 (UTC)
- To: Michael Kohlhase <m.kohlhase@jacobs-university.de>
- Cc: Bruce Miller <bruce.miller@nist.gov>, public-html@w3.org, www-math@w3.org
Again with my rarely-used Google hat on, and not my vendor-neutral
editor hat on:

On Sun, 30 Mar 2008, Michael Kohlhase wrote:
> >
> > At least insofar as Google is concerned, we definitely only want one
> > representation for the purposes of search engines.
>
> I cannot but agree. So, if there is more than one aspect of the
> information we should keep them in one place. This is at the core of the
> <semantics>/<annotation-xml> proposal.

When I say "only one representation", I mean only one, not two
representations smuggled in under one resource. The
<semantics>/<annotation-xml> feature is a way of including two
representations.

> > We have found that whenever we use one representation for searching
> > and another is presented to the user, the two end up being out of sync
> > and the results presented to the user are less useful than if we
> > ignore the "semantic" version and base our algorithms exclusively on
> > the "presentational" version that the user sees.
>
> That may be true for the particular information retrieval method (bag of
> words) that google uses. But this is certainly not true for mathematical
> formulae, where the bag of glyphs used in a formula gives almost no
> indication of the meaning.

This is what people have said for license metadata, geographical
addresses, prose semantics, accessibility enhancements, and many other
areas. However, in all these cases we have found the same thing: having
more than one copy of the data anywhere results in the non-primary
instances (the ones not shown to the user) being of significantly lower
fidelity, drifting out of sync, and being erroneous. In all cases we
have found that simply ignoring the supposedly "semantic" data and
focusing on the data the user sees directly in regular interactions
with the user agent gives results that are orders of magnitude better
in quality.

> Maybe this helps: I think it is symptomatic that google is near useless
> for finding math (formulae).

Google's lack of quality in terms of finding maths stems directly from
the lack of attention this use case has received in general, and has no
bearing whatsoever on whether Google uses presentational MathML or
content MathML or StarMath 5.0 or LaTeX representations to create its
index.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Sunday, 30 March 2008 09:56:39 UTC