
Re: ISO 639 Cookbook was ... LD? algorithm and questions.

From: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>
Date: Tue, 27 Mar 2012 14:25:00 +0300
To: Gannon Dick <gannon_dick@yahoo.com>
Cc: public-lod <public-lod@w3.org>, "eGov IG (Public)" <public-egov-ig@w3.org>, public-gld-wg@w3.org
Message-ID: <20120327112500.GF8005@iit.demokritos.gr>
Gannon, hi.

I am adding GLD to the list of recipients, as this is relevant
to ISSUE-26.

There is a balance to be achieved here between the utility of closed
sets, all instances of which can safely be assumed to be universally
understood, and the open nature of both the world and the Semantic Web.

In other words, if each piece of published data were to come up with its
own identifiers for each language variation that happened to be
pertinent to the data, that would make the data less understood, less
linked, and less useful. If, on the other hand, one had to choose one of
a closed set of identifiers none of which is appropriate, this would
make the data less accurate and, again, less useful.

But we do not need to choose between these two extremes, because it is
exactly situations like this that differentiate semantic technologies
from relational data stores: the ability to extend vocabularies in a way
that allows consumers that do not know about the extension to retrieve
some (although not all) of the semantics of the data.

Coming back to language code lists, IMHO the best approach is to allow
language properties to range over URIs beyond the ISO language codes,
but only if such language fillers are linked to their closest match in
the ISO codeset. In other words, allow ad-hoc extensions of the codeset
only if the extended codepoint is linked to the codepoint that would
have been used if no extension were allowed.

As a concrete example, let us define a new property, possibly a
sub-property of skos:related, that has:

 relatedToLanguage rdfs:domain dc:LinguisticSystem .
 relatedToLanguage rdfs:range <http://id.loc.gov/vocabulary/iso639-1/iso639-1_Language> .

We can now define arbitrarily fine language varieties and historical
forms of languages without losing the link to the main entry. For
example:

 ex:en_16c rdf:type dc:LinguisticSystem ;
           rdfs:label "16c English" ;
           relatedToLanguage <http://id.loc.gov/vocabulary/iso639-1/en> .

 ex:en_Gla rdf:type dc:LinguisticSystem ;
           rdfs:label "English as spoken in Glasgow" ;
           relatedToLanguage <http://id.loc.gov/vocabulary/iso639-1/en> .
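A consumer that only knows the ISO codeset can then still recover the
coarse semantics by following this link. As a sketch (assuming data
items point at their language via dc:language, with the obvious prefix
declarations), selecting everything in any variety of English:

 SELECT ?resource WHERE {
   ?resource dc:language ?lang .
   ?lang relatedToLanguage <http://id.loc.gov/vocabulary/iso639-1/en> .
 }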

Coming back to governments: under the regime above, language lists like
ISO 639 cannot be used as an excuse not to provide for local or even
ad-hoc extensions.

Best,
Stasinos



On Sat Mar 17 14:52:19 2012 Gannon Dick said:

> "A criticism voiced by detractors of Linked Data suggest that Linked Data modeling is too hard or time consuming."
> 
> There are some sets of standard codes which are infrequently updated.  It might pay for a data set repository to build identifiers to order.  In this way, the standards can be maintained complete and, more to the point, applications can "assume" they are complete.
> 
> There is an example (ISO 639 Language Codes) here: http://www.rustprivacy.org/2012/urn/lang/loc.tar.gz
> 
> This includes two mysql databases:
> 1. A "lite" version with just the tables needed to specify either "terminology" or "bibliographic" codes (including currency).  I used the D2R Server.
> 
> 2. A full maintainable version, which starts with a "maintain table" and regenerates the tables which address the sticky bits.
> 
> (The following in case you get caught playing with this at your day job, otherwise, have fun)
> 
> There are a number of little technical issues, but for Government, one huge Moral Hazard.  The languages of Legislation, Policy and Statistical Reporting are coupled with Jurisdiction.  The Moral Hazard arises from the situation where a person speaking a language not understood by a psychiatrist is then considered insane.  Nobody wants a government who acts like that, and the Open Data Community doesn't want data sets which skip over distinct populations (without saying so) either. 
> 
> 
> --Gannon
> 
> 
> 
> 
> ________________________________
>  From: Bernadette Hyland <bhyland@3roundstones.com>
> To: Hugh Glaser <hg@ecs.soton.ac.uk>; Yury Katkov <katkov.juriy@gmail.com> 
> Cc: Semantic Web <semantic-web@w3.org>; public-lod <public-lod@w3.org> 
> Sent: Friday, March 16, 2012 4:11 PM
> Subject: Re: How to find the data I need in LD? algorithm and questions.
>  
> 
> Hi,
> Hugh - I responded earlier today to Yury, off-list.  So I would offer a different perspective, perhaps because the sun is out here today and it is Friday afternoon and the plum blossoms are blooming...
> 
> We've moved from:
> * shouting (circa 2003-2006) to
> * the meme of Linked Data by TimBL (2007) [1] 
> * proof-of-concepts (2008-2010) to
> * a couple academic books, conference talks & keynotes on real world deployments involving LD/LOD (2010, 2011) to
> * developers books, W3C Recommendations, published use cases/CXO guides (2012)
> 
> FWIW, I offered to fold in some of Yury's guidance to the draft Linked Data Cookbook[2] and suggested the cookbook as a possible resource for his students.
> 
> If you are open to a different viewpoint, here is what I see on the ground in 2012.  There are publishers, both in the private & public sector, who are beginning to publish data as Linked Data.  It is of course a new approach to data publishing and consumption and there are some really entrenched players, so it isn't going to happen within one or two years.  Furthermore, everyone has a "day job" and learning yet another way to publish your data doesn't sound like a career-building activity at face value ...
> 
> I contend, it will take some public successes, plus a couple of pragmatic Linked Data books for developers, some cookbooks or how-to's, and some well-formed W3C Recommendations for Linked Open Data to be pervasive ... all of which is in progress.
> 
> It will take probably 10 years before LD/LOD publishing is 'mainstream' but make no mistake, it will happen.  A Linked Data approach to publishing data (on the Web of data) is as disruptive as the Web of documents was circa 1995.   
> 
> It will save organizations millions and governments billions of dollars (or their currency equivalents) in enterprise information integration.  Do I have documented ROIs in a glossy printed consulting report to back that up - no, not yet.  I believe we (as in the Linked Data ecosystem) will have this soon.   The numbers & case studies will come from big international organizations involved in issue tracking & customer care, business publishing, healthcare, logistics and defense (the non-secret-squirrel-part of defense).
> 
> Regardless of whether orgs are doing LD behind the firewall or in front of it, publishing Linked Data makes good economic sense, but we're in the early days.  Don't lose heart.
> 
> I see university students are learning about LD now in undergrad CS classes.  About 20 of us from the UK, Netherlands, Spain, US, India, Australia in government / academe / private sector meet weekly in the W3C Gov't Linked Data Working Group to nut out vocabs, best practices & a cookbook for gov't publication & consumption.  
> 
> FYR, data.gov recently featured a blogpost [3] by a uni student who did a mashup where he didn't know the publisher of the US Gov't content, although he did work under the supervision of someone who knows a bit about RDF.
> 
> 
> Kind regards,
> 
> Bernadette Hyland
> 
> 
> [1] http://www.w3.org/DesignIssues/LinkedData.html
> [2] http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook
> [3] http://www.data.gov/communities/node/116/blogs/6170
> 
> 
> On Mar 16, 2012, at 4:15 PM, Hugh Glaser wrote:
> 
> Hi Yury
> >Well I am sorry to see you have had no response, but it is not so surprising, really.
> >You will find that essentially there are very few people doing what you are trying to do.
> >The Semantic Web and Linked Data world is made up of people who publish, and rarely consume.
> >It is almost unheard of for someone to consume someone else's data, unless they know the publisher.
> >Everyone is shouting, but not many listening.
> >OK, I might not be in a great mood today, but I'm not far wrong.
> >
> >To your problem.
> >Your steps seem reasonable.
> >I would, however, add the use of VoiD (http://www.w3.org/TR/void/, http://semanticweb.org/wiki/VoiD).
> >VoiD is designed to deliver what you want, I think (if it doesn't, then it should be made to).
> >Some sites do publish VoiD descriptions, and these can often be located automatically by looking in the sitemap, which can in turn be discovered by looking in robots.txt.
> >Keith Alexander has a store of collected VoiD descriptions (http://kwijibo.talis.com/voiD/), as do we (http://void.rkbexplorer.com).
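> >For reference, a minimal VoiD description is only a few triples (the dataset URI, title and endpoint below are placeholders):
> >
> > @prefix void: <http://rdfs.org/ns/void#> .
> > @prefix dcterms: <http://purl.org/dc/terms/> .
> >
> > <http://example.org/dataset> a void:Dataset ;
> >     dcterms:title "Example Dataset" ;
> >     void:sparqlEndpoint <http://example.org/sparql> ;
> >     void:exampleResource <http://example.org/resource/42> .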
> >I would also suggest that my own site, http://sameas.org might lead from interesting URIs to other related URIs, and hence interesting stores.
> >
> >Hope that helps.
> >Best
> >Hugh
> >
> >On 16 Mar 2012, at 04:58, Yury Katkov wrote:
> >
> >>Hi!
> >>
> >>What do you usually do when you want to find a dataset for your needs?
> >>I'm preparing a tiny tutorial on this topic for my students and ask
> >>you to share your experience.
> >>
> >>My typical algorithm is the following:
> >>0) Define the topic. I have to know precisely what kind of data I need.
> >>1) Look at the Linked Data cloud and other visualizations to ensure that
> >>the needed data is present somewhere. If, for example, I want to
> >>improve Mendeley or Zotero, I look at these visualizations and search
> >>for publication data.
> >>2) Search for the needed properties and classes with Sindice, Sig.ma and Swoogle.
> >>3) Look at the CKAN description of the dataset, its XML sitemap and VoiD metadata.
> >>4) Explore the datasets found in the previous step with some
> >>simple SPARQL queries like these:
> >>
> >>SELECT DISTINCT ?p WHERE {
> >>  ?s ?p ?o
> >>}
> >>
> >>SELECT DISTINCT ?class WHERE {
> >>  { ?class a rdfs:Class . }
> >>  UNION
> >>  { ?class a owl:Class . }
> >>}
> >>
> >>SELECT DISTINCT ?label WHERE {
> >>  { ?a rdfs:label ?label }
> >>  UNION
> >>  { ?a dc:title ?label }
> >>  # and possibly some more patterns to search foaf:name and so on
> >>}
> >>
> >>I can also use COUNT and GROUP BY to get some quick statistics
> >>about the datasets.
> >>5) When I find some interesting URIs I use the semantic web browsers
> >>Marbles and Sig.ma to navigate through the dataset.
> >>6) Ask the smart guys on the Semantic Web and Public LOD mailing
> >>lists. Probably go to SemanticOverflow and ask for help there
> >>as well.
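> >>As a sketch of the statistics idea (a standard SPARQL 1.1 aggregate;
> >>the pattern counts instances per class):
> >>
> >>SELECT ?class (COUNT(?s) AS ?n) WHERE {
> >>  ?s a ?class
> >>}
> >>GROUP BY ?class
> >>ORDER BY DESC(?n)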
> >>
> >>======================
> >>Here are my questions:
> >>
> >>1) What else do you typically do to find a dataset?
> >>2) Is there a resource where I can find a brief description of a
> >>dataset in terms of the properties and classes mentioned in it? And
> >>those cool arrows in Richard Cyganiak's diagram: is there a resource
> >>where I can find information about the relationships between a given
> >>dataset and the rest of the world?
> >>3) I have a similar algorithm for searching for vocabularies. Can
> >>resources like Schemapedia help me in searching for datasets?
> >>4) Do you know any other SPARQL queries that can be handy when I
> >>search for something in a dataset?
> >>
> >>Sincerely yours,
> >>-----
> >>Yury Katkov
> >
> >>
> >-- 
> >Hugh Glaser,  
> >            Web and Internet Science
> >            Electronics and Computer Science,
> >            University of Southampton,
> >            Southampton SO17 1BJ
> >Work: +44 23 8059 3670, Fax: +44 23 8059 3045
> >Mobile: +44 75 9533 4155 , Home: +44 23 8061 5652
> >http://www.ecs.soton.ac.uk/~hg/
> >
> >
> >