Re: Given a university's name, retrieve URL for university's home page. from Lushan Han on 2013-05-14 (public-lod@w3.org from May 2013)

From: Lushan Han <lushan1@umbc.edu>
Date: Tue, 14 May 2013 08:42:14 -0400
To: Sam Kuper <sam.kuper@uclmail.net>
Cc: public-lod <public-lod@w3.org>
Message-ID: <CAOyMU3gfStOQaZVHL1QzEDoE1-vs4jW2MjC=3FuACZ1UMna0Sg@mail.gmail.com>
Hi Sam,

Why don't you use Google or Bing? Typically the first or second result
that is not from the Wikipedia site will be what you want. I have done
this before and it worked pretty well.
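
A minimal sketch of the filtering step, assuming you already have the
result URLs in rank order from whichever search API you use. The function
name and the sample list are illustrative only, and the search call itself
is left out:

```python
from urllib.parse import urlparse


def first_non_wikipedia(result_urls):
    """Return the first URL whose host is not on wikipedia.org, or None."""
    for url in result_urls:
        host = (urlparse(url).hostname or "").lower()
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            continue  # skip Wikipedia hits, as suggested above
        return url
    return None


# Hypothetical ranked results; a real list would come from the search API.
results = [
    "https://en.wikipedia.org/wiki/University_of_Cambridge",
    "https://www.cam.ac.uk/",
]
print(first_non_wikipedia(results))
```

You would probably want to check the second non-Wikipedia result as well
when the first one looks like a directory or news page.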

Best regards,

Lushan Han


On Mon, May 13, 2013 at 2:39 PM, Sam Kuper <sam.kuper@uclmail.net> wrote:

> Dear all,
>
> As I am something of an LOD noob, please feel free to point me in the
> direction of other mailing lists or sources of advice if you feel they
> are more appropriate than public-lod is for my request below.
>
> I wish to solve the following problem: given a string that represents
> one of perhaps several common orthographic representations of a
> university's name (e.g. "Cambridge University" might be given, instead
> of "University of Cambridge"), retrieve the URL of that university's
> home page on the WWW.
>
> My first attempt at a solution is a two-step process. It is to query
> the Wikipedia API in order to obtain, with any luck, the title for the
> university's article in Wikipedia, e.g.:
>
> http://en.wikipedia.org/w/api.php?action=query&list=search&srprop=score&srredirects=true&srlimit=1&format=json&srsearch=Cambridge%20University
> yields
> {"query-continue":{"search":{"sroffset":1}},"query":{"searchinfo":{"totalhits":86254},"search":[{"ns":0,"title":"University of Cambridge"}]}}
>
> The second step is to use that title to submit a SPARQL query to
> DBpedia in the hope of obtaining the university's website's URL, e.g.
>
> http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fwebsite%0D%0AWHERE++{+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FUniversity_of_Cambridge%3E+dbpprop%3Awebsite+%3Fwebsite+.+}&format=text%2Fhtml&timeout=0
> yields an HTML table containing the desired result.
>
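
The two steps just described could be sketched roughly as follows. This
only builds the request URLs and parses the sample response quoted above;
it makes no network calls. The function names are illustrative, and the
query relies on the endpoint predefining the dbpprop: prefix, as the
quoted URL does:

```python
import json
from urllib.parse import quote, urlencode

WIKIPEDIA_API = "http://en.wikipedia.org/w/api.php"
DBPEDIA_SPARQL = "http://dbpedia.org/sparql"


def wikipedia_search_url(name):
    """Step 1: build the Wikipedia search request for a university name."""
    params = {
        "action": "query", "list": "search", "srprop": "score",
        "srredirects": "true", "srlimit": 1, "format": "json",
        "srsearch": name,
    }
    # quote_via=quote encodes spaces as %20, matching the quoted URLs.
    return WIKIPEDIA_API + "?" + urlencode(params, quote_via=quote)


def title_from_response(body):
    """Extract the top article title from the JSON response, or None."""
    hits = json.loads(body).get("query", {}).get("search", [])
    return hits[0]["title"] if hits else None


def dbpedia_website_url(title):
    """Step 2: build the SPARQL request for the article's website property."""
    resource = "http://dbpedia.org/resource/" + title.replace(" ", "_")
    query = "SELECT ?website WHERE { <%s> dbpprop:website ?website . }" % resource
    params = {"default-graph-uri": "http://dbpedia.org", "query": query,
              "format": "text/html", "timeout": 0}
    return DBPEDIA_SPARQL + "?" + urlencode(params)


# The response body quoted in the message above.
sample = ('{"query-continue":{"search":{"sroffset":1}},"query":{"searchinfo":'
          '{"totalhits":86254},"search":[{"ns":0,"title":"University of Cambridge"}]}}')
title = title_from_response(sample)
print(wikipedia_search_url("Cambridge University"))
print(dbpedia_website_url(title))
```
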
> This attempt suffers from several shortcomings:
>
> (1) Step 1 does not reliably yield a result unless the string is
> varied slightly and resubmitted, e.g.
>
> http://en.wikipedia.org/w/api.php?action=query&list=search&srprop=score&srredirects=true&srlimit=1&format=json&srsearch=Pennsylvania%20State%20University%20-%20University%20Park
> does not yield an article title, but
>
> http://en.wikipedia.org/w/api.php?action=query&list=search&srprop=score&srredirects=true&srlimit=1&format=json&srsearch=Pennsylvania%20State%20University-University%20Park
> does.
>
> (2) Step 2 does not reliably yield a result, even if step 1 is
> successful and Wikipedia has a record of the university's website,
> e.g.
> http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fwebsite%0D%0AWHERE++{+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FHarvard_University%3E+dbpprop%3Awebsite+%3Fwebsite+.+}&format=text%2Fhtml&timeout=0
> yields no URL.
>
> (3) In step 2, I am using HTML output from the SPARQL query only
> because the JSON output seems to be unreliable. For example,
>
> http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fwebsite%0D%0AWHERE++{+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FUniversity_of_California,_Los_Angeles%3E+dbpprop%3Awebsite+%3Fwebsite+.+}&format=text%2Fhtml&timeout=0
> yields the desired URL in the output but
>
> http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+%3Fwebsite%0D%0AWHERE++{+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FUniversity_of_California,_Los_Angeles%3E+dbpprop%3Awebsite+%3Fwebsite+.+}&format=json&timeout=0
> does not.
>
> I therefore suspect that there are better approaches, e.g.: better
> ways for me to use the APIs of the resources I am querying (i.e.
> Wikipedia and DBpedia), or better resources to query, or some
> combination of the two. If you can suggest any such improvements (or,
> as I mentioned above, more appropriate sources of advice), I would be
> grateful.
>
> Many thanks in advance,
>
> Sam
>
>
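
On shortcoming (1) in the quoted message, one option is to generate a few
spelling variants of the name up front and try each in turn until the
search returns a title. A rough sketch follows; only the "no spaces around
the hyphen" variant comes from the example in the message, and the other
variants are guesses that may or may not help:

```python
import re


def name_variants(name):
    """Return the name plus a few hyphen-related variants, without duplicates."""
    candidates = [
        name,
        re.sub(r"\s*-\s*", "-", name),  # "A - B" -> "A-B" (the case reported above)
        re.sub(r"\s*-\s*", " ", name),  # "A - B" -> "A B" (assumed variant)
        re.split(r"\s+-\s+", name)[0],  # text before a spaced hyphen (assumed variant)
    ]
    seen, variants = set(), []
    for candidate in candidates:
        candidate = " ".join(candidate.split())  # normalise whitespace
        if candidate and candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants


for variant in name_variants("Pennsylvania State University - University Park"):
    print(variant)
```

Each variant would then be submitted as the srsearch value, stopping at
the first one that yields an article title.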
Received on Tuesday, 14 May 2013 12:42:45 UTC