Intractable problems serving web pages with MathML? from Richard Kaye on 2006-04-27 (www-math@w3.org from April 2006)

From: Richard Kaye <R.W.Kaye@bham.ac.uk>
Date: Thu, 27 Apr 2006 15:47:35 +0100
To: www-math@w3.org
Message-Id: <200604271547.35999.R.W.Kaye@bham.ac.uk>

Dear all,

I am a mathematician working in a university maths department, and I use
MathML in my web pages. At the moment I am the only member of my species that I
know about.  I would like to encourage others, once the technicalities are
ironed out.

My minimum requirements are:

(a) on the client side:

1. Web pages should be viewable correctly in the most common
properly-equipped browsers: currently Mozilla and IE+MathPlayer.

2. Web pages should be viewable partially in other common
browsers, such as IE (without MathPlayer), Safari, Konqueror, ...

3. Web pages should be clearly listed by all the main search engines.

(b) on the server side:

4. Web pages should be served with a minimum of specialist software
or configuration on the server.

I hope no-one thinks this is unreasonable.

My set-up currently meets requirement 1 but not 2 or 3.  It uses a
fairly old Apache with only a few small tweaks, which I regard as meeting
requirement 4 (though I am aware many other people won't be
allowed or able to make any changes to *their* server).

More specifically: I was advised to use content negotiation 
and serve pages as application/xhtml+xml with a text/html 
fall-back, which is what I do.

Unfortunately IE+MathPlayer sets its "Accept" header
to "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
application/vnd.ms-excel, application/vnd.ms-powerpoint,
application/msword, */*" (and something completely different when
"refresh" is pressed---ARGGH!), so the header does not
distinguish between application/xhtml+xml and text/html.  I therefore
have to set the qs value for application/xhtml+xml a bit higher to make
sure these clients see the correct pages.  The problem is then that
IE users *without* MathPlayer also get the application/xhtml+xml
pages, which they can't view at all (short of saving to disk,
changing the extension and re-opening, something that few people
would consider doing and that is in general highly dangerous on an
MS-Windows machine).
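
A minimal sketch of why the qs setting ends up deciding the outcome for
IE. It assumes a simplified model of the negotiation Apache documents for
mod_negotiation: multiply the client's q by the server's qs and pick the
highest, with wildcards demoted to q=0.01 ("*/*") or q=0.02 ("type/*")
when the header carries no explicit q values at all. The qs numbers, the
VARIANTS table and the function names are hypothetical; only the two
Accept strings come from this message.

```python
# Hypothetical qs (quality-of-source) values. The only thing taken from the
# text is that application/xhtml+xml is set "a bit higher" than text/html.
VARIANTS = {"application/xhtml+xml": 1.0, "text/html": 0.9}


def parse_accept(header):
    """Parse an Accept header into (media_range, q, q_was_explicit) tuples."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q, has_q = 1.0, False
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q, has_q = float(value), True
        ranges.append((media_range, q, has_q))
    return ranges


def client_quality(ranges, media_type):
    """Return the client's q for media_type, using the most specific match."""
    any_explicit_q = any(has_q for _, _, has_q in ranges)
    main_type = media_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q, _ in ranges:
        if media_range == media_type:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
            if not any_explicit_q:
                q = 0.02  # assumed wildcard demotion, as described above
        elif media_range == "*/*":
            specificity = 0
            if not any_explicit_q:
                q = 0.01  # assumed wildcard demotion, as described above
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def choose_variant(accept_header, variants=VARIANTS):
    """Pick the media type with the highest q * qs product."""
    ranges = parse_accept(accept_header)
    return max(variants, key=lambda t: client_quality(ranges, t) * variants[t])


IE_ACCEPT = ("image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, "
             "application/vnd.ms-excel, application/vnd.ms-powerpoint, "
             "application/msword, */*")
MSNBOT_ACCEPT = "text/html, text/plain, text/xml, application/*"

print(choose_variant(IE_ACCEPT))      # application/xhtml+xml
print(choose_variant(MSNBOT_ACCEPT))  # text/html
```

Under these assumptions IE matches both variants only through "*/*",
so the higher qs wins and every IE user gets XHTML, with or without
MathPlayer, while a client that names text/html explicitly still gets
the HTML fall-back.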

Until recently googlebot did seem to prefer text/html, and
Google indexed my pages properly. It seems that some recent
change (in the last 2-3 months?) has been made at Google, and there
is no longer any preference for text/html. See
  http://mat140.bham.ac.uk/~richard/googlelisting.png
for a snippet showing how my pages are listed.  I really object to
Google listing my page with an incorrect "title" (in fact,
it uses the <?xml ...?> and <!DOCTYPE ...> declarations as the
title) and saying that my standards-compliant page is "File Format:
Unrecognized". What's more, I am sure my readers will automatically
distrust the document because of this, or click the wrong link, or both.
I am also not sure that Google is indexing my page fully or reading the
keywords properly.

But I should add that Google is one of the better ones.  At least my
pages are listed *partially* there!  Ironically, by far the *best*
one for me at the moment is MSN ( http://search.msn.com/ ).
The msnbot asks for "text/html, text/plain, text/xml, application/*"
and therefore gets the plain HTML page I serve and indexes it
properly.  The majority of other bots ask for "*/*" and get
the XHTML file, which they can't handle.  (Does anyone else log
the "Accept" header?  I have only just started doing so, so I cannot
verify my suspicion that googlebot has changed.)
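
Once the "Accept" header is being logged, a change in a bot's behaviour
shows up as more than one distinct value for the same agent. A rough
sketch of that check, assuming a hypothetical tab-separated log with
three fields per line (date, user agent, Accept header); the log layout,
the function names and the sample agent names are all made up for
illustration.

```python
from collections import defaultdict


def accept_by_agent(log_lines):
    """Map each user agent to the set of distinct Accept headers it sent."""
    seen = defaultdict(set)
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not fit the assumed layout
        _date, agent, accept = fields
        seen[agent].add(accept)
    return dict(seen)


def agents_with_changes(log_lines):
    """Return, sorted, the agents that sent more than one Accept value."""
    return sorted(agent for agent, headers in accept_by_agent(log_lines).items()
                  if len(headers) > 1)


# Hypothetical sample lines; the agent names are placeholders.
SAMPLE_LOG = [
    "2006-04-01\texamplebot/1.0\ttext/html, text/plain\n",
    "2006-04-20\texamplebot/1.0\t*/*\n",
    "2006-04-21\totherbot/2.0\t*/*\n",
    "a malformed line\n",
]

print(agents_with_changes(SAMPLE_LOG))
```

An agent that appears in the result has sent different Accept values at
different times, which is the kind of evidence needed to confirm or rule
out a change on the bot's side.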

So that's where I am at.  What are the solutions?  Here are some
baked and half-baked ideas I have had or have had suggested to me.

1. Perhaps I shouldn't try to cater for IE at all.  This is very 
tempting until I remember the queues of students I will have knocking
on my door asking why they can't view my web pages properly.  There
are already very clear instructions on my pages saying 
  (1) Use Firefox, don't use IE and 
  (2) if you do insist on using IE you must install MathPlayer
but I have had endless numbers of people saying they *still* couldn't
view the pages properly.  Eventually, in every case, I discovered they
were using IE without MathPlayer, having gone through a page saying
they MUST use Firefox or install MathPlayer.

2. Perhaps I should issue instructions or provide a script to change
the registry on MS-Windows machines.  I have found a key
  "My Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\Accepted Documents"
which seems to contain the HTTP "Accept" header that IE uses.  I changed
it and I got a different document back.  But I didn't find any
documentation for this on the web. (I wonder if there are other keys
I should know about too...)  More particularly, I really do not
understand why MathPlayer doesn't change this registry key on
installation to indicate it can now handle application/xhtml+xml.
That would solve *all* of my problems!  Alternatively, does anyone
know how to write such scripts?  (I program in Unix myself :)
Of course I would still have to persuade users to run a script from
an unknown source... ouch!

3. I could write some JavaScript that would try to identify the
browser and reload with the most suitable page.  This seems to be
the only solution that doesn't involve changes on the client or server
side.  There are problems, including the performance hit of having to
load each document twice, and having to arrange things so the bots
index the page correctly (they don't run JavaScript, I presume?).

4. A more specialist server set-up might solve all the problems.
This server could identify the agent and serve the best document.  I am
aware such things are being developed and may try this out on my private
"experimental" server in the non-critical period of the summer vacation. 
The downsides are
  a. maintenance of the server is required every time a new agent
     or plugin is released or updated
  b. a potential server-side performance hit
  c. I probably won't be able to persuade the web master of our 
     main Departmental server to install such software.  (He's 
     very helpful and interested but has distinctly finite amounts
     of time.)
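
A rough sketch of what such agent identification might look like. It
assumes that a MathML-capable IE can be recognised by a "MathPlayer"
token in its User-Agent string; that token, the example User-Agent
strings and the function names are assumptions, not something confirmed
in this message. The token list is exactly the part that would need the
maintenance mentioned in (a).

```python
XHTML = "application/xhtml+xml"
HTML = "text/html"

# Assumption: substrings of User-Agent strings that indicate MathML support
# even though the Accept header does not say so. Needs manual upkeep.
MATHML_AGENT_TOKENS = ("MathPlayer",)


def explicitly_accepts(accept_header, media_type):
    """True if media_type is named in the Accept header with q > 0."""
    for part in accept_header.split(","):
        fields = [f.strip().lower() for f in part.split(";")]
        if fields[0] != media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        if q > 0:
            return True
    return False


def choose_media_type(user_agent, accept_header):
    """Serve XHTML only to clients known or declared to handle it."""
    if explicitly_accepts(accept_header, XHTML):
        return XHTML
    agent = user_agent.lower()
    if any(token.lower() in agent for token in MATHML_AGENT_TOKENS):
        return XHTML
    return HTML  # safe default for "*/*" clients such as most bots


# Hypothetical User-Agent strings, for illustration only.
print(choose_media_type("ExampleIE (MathPlayer 2.0)", "*/*"))
print(choose_media_type("ExampleIE", "*/*"))
print(choose_media_type("ExampleBrowser", "application/xhtml+xml, text/html;q=0.9"))
```

The design choice here is to treat "*/*" as meaning text/html unless
something else identifies the client, which would give plain IE and the
"*/*" bots the HTML fall-back without lowering the qs for everyone.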

I'd love to hear your views, comments and suggestions. Congratulations and 
many thanks if you managed to get this far in this rather long post!

Best wishes

Richard
Received on Thursday, 27 April 2006 14:51:31 UTC