Given a university's name, retrieve URL for university's home page.

Dear all,

As I am something of an LOD noob, please feel free to point me in the
direction of other mailing lists or sources of advice if you feel they
are more appropriate than public-lod is for my request below.

I wish to solve the following problem: given a string that represents
one of perhaps several common orthographic representations of a
university's name (e.g. "Cambridge University" might be given, instead
of "University of Cambridge"), retrieve the URL of that university's
home page on the WWW.

My first attempt at a solution is a two-step process. The first step
is to query the Wikipedia API in order to obtain, with any luck, the
title of the university's article in Wikipedia, e.g.:
http://en.wikipedia.org/w/api.php?action=query&list=search&srprop=score&srredirects=true&srlimit=1&format=json&srsearch=Cambridge%20University
yields {"query-continue":{"search":{"sroffset":1}},"query":{"searchinfo":{"totalhits":86254},"search":[{"ns":0,"title":"University of Cambridge"}]}}
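
A minimal Python sketch of step 1, using only the standard library.
The function names (build_search_url, first_title, search_title) are
illustrative and not part of any existing client; the parameters are
copied from the example URL above. The network call has not been
checked against the live API, which may impose requirements of its own
(for example, a descriptive User-Agent header).

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL in this message.
WIKIPEDIA_API = "http://en.wikipedia.org/w/api.php"


def build_search_url(name):
    """Build the search URL for one spelling of a university's name."""
    params = {
        "action": "query",
        "list": "search",
        "srprop": "score",
        "srredirects": "true",
        "srlimit": 1,
        "format": "json",
        "srsearch": name,
    }
    # quote_via=quote encodes spaces as %20, matching the example URL.
    return WIKIPEDIA_API + "?" + urlencode(params, quote_via=quote)


def first_title(response_text):
    """Return the title of the first search hit, or None if there is none."""
    data = json.loads(response_text)
    hits = data.get("query", {}).get("search", [])
    return hits[0]["title"] if hits else None


def search_title(name):
    """Fetch the search results and return the best title (network call)."""
    with urlopen(build_search_url(name)) as response:
        return first_title(response.read().decode("utf-8"))
```

Keeping URL construction and response parsing separate from the
network call makes both easy to check against the sample response
quoted above.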

The second step is to use that title to submit a SPARQL query to
DBpedia in the hope of obtaining the university's website's URL, e.g.
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fwebsite%0D%0AWHERE++{+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FUniversity_of_Cambridge%3E+dbpprop%3Awebsite+%3Fwebsite+.+}&format=text%2Fhtml&timeout=0
yields an HTML table containing the desired result.
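
Step 2 can be sketched in Python along the same lines. The helper
names are hypothetical. The sketch assumes, based on the examples
here, that a DBpedia resource name is the Wikipedia title with spaces
replaced by underscores, and that the endpoint predefines the dbpprop:
prefix as the example URL implies. Titles containing other special
characters may need further escaping.

```python
from urllib.parse import urlencode

# Endpoint and default graph taken from the example URL in this message.
DBPEDIA_SPARQL = "http://dbpedia.org/sparql"
DEFAULT_GRAPH = "http://dbpedia.org"


def resource_uri(title):
    """Map a Wikipedia title to a DBpedia resource URI (assumed convention)."""
    return "http://dbpedia.org/resource/" + title.replace(" ", "_")


def build_website_query(title):
    """Build the SPARQL query asking for the resource's dbpprop:website."""
    return (
        "SELECT ?website WHERE { <%s> dbpprop:website ?website . }"
        % resource_uri(title)
    )


def build_sparql_url(title, fmt="text/html"):
    """Build the full request URL; fmt is passed as the 'format' parameter."""
    params = {
        "default-graph-uri": DEFAULT_GRAPH,
        "query": build_website_query(title),
        "format": fmt,
        "timeout": 0,
    }
    return DBPEDIA_SPARQL + "?" + urlencode(params)
```

Letting urlencode escape the whole query also encodes the braces and
angle brackets, which the hand-built URL above leaves partly unescaped.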

This attempt suffers from several shortcomings:

(1) Step 1 does not reliably yield a result unless the string is
varied slightly and resubmitted, e.g.
http://en.wikipedia.org/w/api.php?action=query&list=search&srprop=score&srredirects=true&srlimit=1&format=json&srsearch=Pennsylvania%20State%20University%20-%20University%20Park
does not yield an article title, but
http://en.wikipedia.org/w/api.php?action=query&list=search&srprop=score&srredirects=true&srlimit=1&format=json&srsearch=Pennsylvania%20State%20University-University%20Park
does.
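
The "vary slightly and resubmit" workaround could be automated roughly
as follows. This is only a sketch: the choice of variants (normalising
the spacing around hyphens) is an assumption drawn from the single
example above, and candidate_queries and first_match are hypothetical
names. The search argument stands for any function that maps a query
string to an article title or None.

```python
import re


def candidate_queries(name):
    """Return spelling variants of a name, most literal first, no duplicates."""
    collapsed = " ".join(name.split())  # normalise runs of whitespace
    variants = [
        collapsed,
        re.sub(r"\s*-\s*", "-", collapsed),  # "A - B" becomes "A-B"
        re.sub(r"\s*-\s*", " ", collapsed),  # "A - B" becomes "A B"
    ]
    ordered = []
    for variant in variants:
        if variant not in ordered:
            ordered.append(variant)
    return ordered


def first_match(name, search):
    """Try each variant in turn and return the first title that is found."""
    for query in candidate_queries(name):
        title = search(query)
        if title is not None:
            return title
    return None
```

A name without hyphens produces a single candidate, so the common case
costs no extra requests.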

(2) Step 2 does not reliably yield a result, even if step 1 is
successful and Wikipedia has a record of the university's website,
e.g. http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fwebsite%0D%0AWHERE++{+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FHarvard_University%3E+dbpprop%3Awebsite+%3Fwebsite+.+}&format=text%2Fhtml&timeout=0
yields no URL.

(3) In step 2, I am using HTML output from the SPARQL query only
because the JSON output seems to be unreliable. For example,
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fwebsite%0D%0AWHERE++{+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FUniversity_of_California,_Los_Angeles%3E+dbpprop%3Awebsite+%3Fwebsite+.+}&format=text%2Fhtml&timeout=0
yields the desired URL in the output but
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fwebsite%0D%0AWHERE++{+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FUniversity_of_California,_Los_Angeles%3E+dbpprop%3Awebsite+%3Fwebsite+.+}&format=json&timeout=0
does not.
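
For comparison, here is a sketch of how a JSON result could be read if
the endpoint returns the standard SPARQL Query Results JSON layout
(a "results" object holding a list of "bindings"). Whether this
endpoint does so for format=json is not established by the example
above; requesting the standard media type
application/sparql-results+json instead is one thing to try, offered
as an assumption rather than a known fix. The function name is
hypothetical.

```python
import json

# Standard media type for SPARQL JSON results; using it as the 'format'
# parameter is an assumption, not something verified against DBpedia.
SPARQL_JSON = "application/sparql-results+json"


def website_values(response_text):
    """Extract every ?website value from a SPARQL JSON results document."""
    data = json.loads(response_text)
    bindings = data.get("results", {}).get("bindings", [])
    # Each binding maps a variable name to {"type": ..., "value": ...}.
    return [b["website"]["value"] for b in bindings if "website" in b]
```

An empty list then means the query matched nothing, which is easier to
distinguish from a malformed response than an empty HTML table.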

I therefore suspect that there are better approaches, e.g.: better
ways for me to use the APIs of the resources I am querying (i.e.
Wikipedia and DBpedia), or better resources to query, or some
combination of the two. If you can suggest any such improvements (or,
as I mentioned above, more appropriate sources of advice), I would be
grateful.

Many thanks in advance,

Sam

Received on Monday, 13 May 2013 18:39:48 UTC