Re: ISO 639 Cookbook was ... LD? algorithm and questions.

Comments below



________________________________
 From: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>
To: Gannon Dick <gannon_dick@yahoo.com> 
Cc: public-lod <public-lod@w3.org>; eGov IG (Public) <public-egov-ig@w3.org>; public-gld-wg@w3.org 
Sent: Tuesday, March 27, 2012 6:25 AM
Subject: Re: ISO 639 Cookbook was ... LD? algorithm and questions.
 
Gannon, hi.

I am adding GLD to the list of recipients, as this is relevant
to ISSUE-26.

There is a balance to be achieved here between the utility of closed
sets, where all instances of the set can safely be assumed to be
universally understood, and the open nature of both the world and the
Semantic Web.

In other words, if each piece of published data were to come up with its
own identifiers for each language variation that happened to be
pertinent to the data, that would make the data less understood, less
linked, and less useful. If, on the other hand, one had to choose one of
a closed set of identifiers none of which is appropriate, this would
make the data less accurate and, again, less useful.
=============
I agree
=============
But we do not need to choose between these two extremes, because it is
exactly situations like this that differentiate semantic technologies
from relational data stores: the ability to extend vocabularies in a way
that allows consumers that do not know about the extension to retrieve
some (although not all) of the semantics of the data.
=============
I would say the two extremes exist at all times and navigation is
critical. The Discovery path is not the inverse of the (data) Supply
path - there is a phase change which involves the collection of
statistics. The data is *either* faithful or commercially efficient,
but it cannot always be both. I can only demonstrate that the case
exists *sometimes*. For example, the UN saying, "give a man a fish and
you have fed him for a day, teach a man to fish and you have fed him
for a lifetime" (or something like that). If you ask the Russian
Federation or the Greek Government, they point you to their Fishing
Lessons page (en). If you have fish and want a list of Russians or
Greeks to whom the page is not available, the answer is that every
Russian and every Greek can learn to fish if they like. Open Data is
just that. That the fishing lessons are in English is of no
consequence *other* than rendering anonymous the good fisherman who
wrote the lesson.
=============
Coming back to language code lists, IMHO the best approach is to allow
language properties to range over URIs beyond the ISO language codes,
but only if such language fillers are linked to their closest match in
the ISO codeset. In other words, allow ad-hoc extensions of the
codeset only if the extended codepoint is linked to the codepoint that
would have been used if no extension were allowed.

As a concrete example, let us define a new property, possibly a
sub-property of skos:related, that has:

ex:relatedToLanguage rdfs:domain dc:LinguisticSystem .
ex:relatedToLanguage rdfs:range <http://id.loc.gov/vocabulary/iso639-1/iso639-1_Language> .

We can now define arbitrarily fine language varieties and historical
forms of languages without losing the link to the main entry. For
example:

ex:en_16c rdf:type dc:LinguisticSystem ;
          rdfs:label "16c English" ;
          ex:relatedToLanguage <http://id.loc.gov/vocabulary/iso639-1/en> .

ex:en_Gla rdf:type dc:LinguisticSystem ;
          rdfs:label "English as spoken in Glasgow" ;
          ex:relatedToLanguage <http://id.loc.gov/vocabulary/iso639-1/en> .
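
A consumer that only knows the ISO codes can still reach such data by
following the link. A minimal SPARQL sketch of that fallback, assuming
resources carry a dc:language property and treating ex: as a
placeholder prefix:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ex: <http://example.org/ns#>

# Resources tagged directly with ISO 639-1 English ...
SELECT ?doc WHERE {
  { ?doc dc:language <http://id.loc.gov/vocabulary/iso639-1/en> . }
  UNION
  # ... or tagged with an ad-hoc variety that names English as its
  # closest ISO match.
  { ?doc dc:language ?variety .
    ?variety ex:relatedToLanguage <http://id.loc.gov/vocabulary/iso639-1/en> . }
}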

==========================================
With reference to the above, this is a good backward-looking
interoperability mechanism, as long as you remember that there is no
forward-looking "solution". There are no identifiable people in
Glasgow who speak the average "this", although there is a large group
of anonymous speakers who speak "this" fluently. The group (not class)
probably includes several Greek and Russian fishing experts on
holiday, too. This is both a feature of governance and a bug of
discovery.
==========================================
Coming back to governments, under the regime above, language lists
like ISO 639 cannot be used as an excuse not to provide for local or
even ad-hoc extensions.
==========================================
Governments gain nothing from solving the forward-looking problem,
although it would be nice for the commercial world if they did.
Governments do gain a great deal by leveling ambiguity - using a
language understood in Glasgow, London, Moscow, Athens and elsewhere -
and they can hope no one notices that the fishing lessons sound much
the same (the backward problem). They use Artificial Bureaucracy. It
differs from Artificial Intelligence in this way: you are in Vienna
and notice that the Danube is a handy way to get to Budapest. Rome
requires quite a bit more rowing, or more "intelligent" access to
resources (airplanes are good). From the viewpoint of a government-run
data repository, the trip down the Danube looks like this:

http://www.rustprivacy.org/2012/urn-lex/danube.html

(sorry, not all the links work, it is a screen shot). This is the
domain model and the direction of discovery is left to right. However,
"propaganda" has no reason to travel right to left - aside from
tourism promotion; the sunny beaches of Greenland and the shark-free
bathtubs of Australia are just good marketing.

--Gannon

Best,
Stasinos



On Sat Mar 17 14:52:19 2012 Gannon Dick said:

> "A criticism voiced by detractors of Linked Data suggest that Linked Data modeling is too hard or time consuming."
> 
> 
> There are some sets of standard codes which are infrequently
> updated. It might pay for a data set repository to build identifiers
> to order. In this way, the standards can be maintained complete and,
> more to the point, applications can "assume" they are complete.
> 
> There is an example (ISO 639 Language Codes) here: http://www.rustprivacy.org/2012/urn/lang/loc.tar.gz
> 
> This includes two MySQL databases:
> 
> 1. A "lite" version with just the tables needed to specify either
> "terminology" or "bibliographic" codes (including currency). I used
> the D2R Server.
> 
> 2. A full maintainable version, which starts with a "maintain table"
> and regenerates the tables which address the sticky bits.
> 
> (The following in case you get caught playing with this at your day job, otherwise, have fun)
> 
> 
> There are a number of little technical issues, but for Government,
> one huge Moral Hazard. The language of Legislation, Policy and
> Statistical Reporting is coupled with Jurisdiction. The Moral Hazard
> arises from the situation where someone who speaks a language not
> understood by a psychiatrist is then considered insane. Nobody wants
> a government that acts like that, and the Open Data Community
> doesn't want data sets which skip over distinct populations (without
> saying so) either.
> 
> 
> --Gannon
> 
> 
> 
> 
> ________________________________
>  From: Bernadette Hyland <bhyland@3roundstones.com>
> To: Hugh Glaser <hg@ecs.soton.ac.uk>; Yury Katkov <katkov.juriy@gmail.com> 
> Cc: Semantic Web <semantic-web@w3.org>; public-lod <public-lod@w3.org> 
> Sent: Friday, March 16, 2012 4:11 PM
> Subject: Re: How to find the data I need in LD? algorithm and questions.
>  
> 
> Hi,
> 
> Hugh - I responded earlier today to Yury, off-list. So I would offer
> a different perspective, perhaps because the sun is out here today
> and it is Friday afternoon and the plum blossoms are blooming...
> 
> We've moved from:
> * shouting (circa 2003-2006) to
> * the meme of Linked Data by TimBL (2007) [1] to
> * proof-of-concepts (2008-2010) to
> * a couple of academic books, conference talks & keynotes on real world deployments involving LD/LOD (2010, 2011) to
> * developers' books, W3C Recommendations, published use cases/CXO guides (2012)
> 
> 
> FWIW, I offered to fold in some of Yury's guidance to the draft
> Linked Data Cookbook [2] and suggested the cookbook as a possible
> resource for his students.
> 
> If you are open to a different viewpoint, here is what I see on the
> ground in 2012. There are publishers, both in the private & public
> sector, who are beginning to publish data as Linked Data. It is of
> course a new approach to data publishing and consumption and there
> are some really entrenched players, so it isn't going to happen
> within one or two years. Furthermore, everyone has a "day job" and
> learning yet another way to publish your data doesn't sound like a
> career-building activity at face value ...
> 
> 
> I contend it will take some public successes, plus a couple of
> pragmatic Linked Data books for developers, some cookbooks or
> how-tos, and some well-formed W3C Recommendations for Linked Open
> Data to be pervasive ... all of which is in progress.
> 
> It will probably take 10 years before LD/LOD publishing is
> 'mainstream' but make no mistake, it will happen. A Linked Data
> approach to publishing data (on the Web of data) is as disruptive as
> the Web of documents was circa 1995.
> 
> It will save organizations millions and governments billions of
> dollars (or their currency equivalents) in enterprise information
> integration. Do I have documented ROIs in a glossy printed
> consulting report to back that up - no, not yet. I believe we (as in
> the Linked Data ecosystem) will have this soon. The numbers & case
> studies will come from big international organizations involved in
> issue tracking & customer care, business publishing, healthcare,
> logistics and defense (the non-secret-squirrel part of defense).
> 
> Regardless of whether orgs are doing LD behind the firewall or in
> front of it, publishing Linked Data makes good economic sense but
> we're in the early days. Don't lose heart.
> 
> I see university students are learning about LD now in undergrad CS
> classes. About 20 of us from the UK, Netherlands, Spain, US, India,
> Australia in government / academe / private sector meet weekly in
> the W3C Gov't Linked Data Working Group to nut out vocabs, best
> practices & a cookbook for gov't publication & consumption.
> 
> FYR, data.gov recently featured a blog post [3] by a uni student who
> did a mashup where he didn't know the publisher of US Gov't content,
> although he did work under the supervision of someone who knows a
> bit about RDF.
> 
> 
> Kind regards,
> 
> Bernadette Hyland
> 
> 
> [1] http://www.w3.org/DesignIssues/LinkedData.html
> [2] http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook
> [3]  http://www.data.gov/communities/node/116/blogs/6170
> 
> 
> On Mar 16, 2012, at 4:15 PM, Hugh Glaser wrote:
> 
> >Hi Yury
> >Well I am sorry to see you have had no response, but it is not so surprising, really.
> >You will find that essentially there are very few people doing what you are trying to do.
> >The Semantic Web and Linked Data world is made up of people who publish, and rarely consume.
> >It is almost unheard of for someone to consume someone else's data, unless they know the publisher.
> >Everyone is shouting, but not many listening.
> >OK, I might not be in a great mood today, but I'm not far wrong.
> >
> >To your problem.
> >Your steps seem reasonable.
> >I would, however, add the use of VoiD (http://www.w3.org/TR/void/, http://semanticweb.org/wiki/VoiD).
> >VoiD is designed to deliver what you want, I think (if it doesn't, then it should be made to).
> 
> >Some sites do publish VoiD descriptions, and these can often be
> >located automatically by looking in the sitemap, which can in turn
> >be discovered by looking in robots.txt.
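> >Once located, a VoiD description can itself be queried; a minimal
> >sketch, assuming only the standard void:Dataset and
> >void:sparqlEndpoint terms:
> >
> >PREFIX void: <http://rdfs.org/ns/void#>
> >
> ># List each described dataset and where it can be queried.
> >SELECT ?dataset ?endpoint WHERE {
> >  ?dataset a void:Dataset ;
> >           void:sparqlEndpoint ?endpoint .
> >}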
> >Keith Alexander has a store of collected VoiD descriptions (http://kwijibo.talis.com/voiD/), as do we (http://void.rkbexplorer.com).
> >I would also suggest that my own site, http://sameas.org might lead from interesting URIs to other related URIs, and hence interesting stores.
> >
> >Hope that helps.
> >Best
> >Hugh
> >
> >On 16 Mar 2012, at 04:58, Yury Katkov wrote:
> >
> >>Hi!
> >>
> >>What do you usually do when you want to find a dataset for your needs?
> >>I'm preparing a tiny tutorial on this topic for the students and
> >>ask you to share your experience.
> >>My typical algorithm is the following:
> >>0) Define the topic. I have to know precisely what kind of data I need.
> >>1) Look at the Linked Data cloud and other visualizations to ensure
> >>that the needed data is present somewhere. If for example I want to
> >>improve Mendeley or Zotero, I look at these visualizations and
> >>search for publication data.
> >>2) Search for the needed properties and classes with Sindice,
> >>Sig.ma and Swoogle.
> >>3) Look at the CKAN description of the dataset, its XML sitemap and
> >>VoiD metadata.
> >>4) Explore the datasets that were found in the previous step with
> >>some simple SPARQL queries like these:
> >>
> >>SELECT DISTINCT ?p WHERE {
> >>  ?s ?p ?o
> >>}
> >>
> >>SELECT DISTINCT ?class WHERE {
> >>  { ?class a rdfs:Class . }
> >>  UNION
> >>  { ?class a owl:Class . }
> >>}
> >>
> >>SELECT DISTINCT ?label WHERE {
> >>  { ?a rdfs:label ?label }
> >>  UNION
> >>  { ?a dc:title ?label }
> >>  # and possibly some more patterns to search foaf:name's and so on
> >>}
> >>
> >>I can also use COUNTing and GROUPing BY to get some quick
> >>statistics about the datasets.
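> >>For instance, a quick class-frequency count (just a sketch, using
> >>SPARQL 1.1 aggregates):
> >>
> >>SELECT ?class (COUNT(?s) AS ?n) WHERE {
> >>  ?s a ?class
> >>} GROUP BY ?class
> >>ORDER BY DESC(?n)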
> >>5) When I find some interesting URIs I use the semantic web
> >>browsers Marbles and Sig.ma to navigate through the dataset.
> >>6) Ask these smart guys on the Semantic Web mailing list and the
> >>Public LOD mailing list. Probably go to semanticoverflow and ask
> >>for help there as well.
> >>======================
> >>Here are my questions:
> >>
> >>1) What else do you typically do to find a dataset?
> >>2) Is there a resource where I can find a brief description of a
> >>dataset in terms of the properties and classes that are mentioned
> >>there? And those cool arrows in Richard Cyganiak's diagram: is
> >>there a resource where I can find information about the
> >>relationships between a given dataset and the rest of the world?
> >>3) I have a similar algorithm for searching vocabularies. Can
> >>resources like Schemapedia help me in searching for datasets?
> >>4) Do you know of any other SPARQL queries that can be handy when I
> >>search for something in a dataset?
> >>
> >>Sincerely yours,
> >>-----
> >>Yury Katkov
> >
> >-- 
> >Hugh Glaser,  
> >            Web and Internet Science
> >            Electronics and Computer Science,
> >            University of Southampton,
> >            Southampton SO17 1BJ
> >Work: +44 23 8059 3670, Fax: +44 23 8059 3045
> >Mobile: +44 75 9533 4155 , Home: +44 23 8061 5652
> >http://www.ecs.soton.ac.uk/~hg/
> >
> >
> >

Received on Tuesday, 27 March 2012 22:03:57 UTC