RE: LD4LT Teleconference, Today 3pm CET, on Linghub

Hi John and all.

All metadata at META-SHARE come from the consortium partners: at the bottom of http://www.meta-share.eu/ you can see who the partners are. Each partner maintains their own repo where they describe (according to the META-SHARE schema) and store their resources; metadata records (and only metadata) are harvested from the repos to the managing nodes, which all share the same catalogue.

This means you will most probably get duplicates from META-SHARE and CLARIN, because some partners are in both; for instance, ELRA metadata are also in META-SHARE. Still, I would expect differences in the metadata descriptions: I think, though I'm not sure, that CLARIN harvests the ELRA records from OLAC (Khalid, is that right?), so the metadata schema is different. However, in ELRA records the ELRA id can help identify the duplicates.
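
To illustrate how the ELRA id could be used to spot duplicates, here is a minimal Python sketch. It assumes that ELRA ids follow the pattern seen in the Linghub URLs quoted in this thread (for example ELRA_S0288, i.e. a letter followed by four digits) and that the id appears somewhere in each record's identifier. The function names elra_id and group_by_elra_id are hypothetical, and this is not a description of how Linghub actually detects duplicates.

```python
import re
from collections import defaultdict

# Assumption: an ELRA id is "ELRA", a "-" or "_" separator, one capital
# letter and four digits (e.g. ELRA_S0288 in the Linghub URLs in this thread).
ELRA_ID = re.compile(r"ELRA[-_]([A-Z]\d{4})")


def elra_id(identifier):
    """Return the normalised ELRA id (e.g. 'ELRA-S0288') found in a string, or None."""
    match = ELRA_ID.search(identifier)
    return f"ELRA-{match.group(1)}" if match else None


def group_by_elra_id(identifiers):
    """Map each ELRA id to the list of record identifiers that mention it."""
    groups = defaultdict(list)
    for ident in identifiers:
        key = elra_id(ident)
        if key is not None:
            groups[key].append(ident)
    return dict(groups)


# Usage: the first identifier is taken from this thread; the second is a
# hypothetical META-SHARE-side value, used only to show the grouping.
records = [
    "http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0288",
    "ELRA-S0288",
]
duplicates = {k: v for k, v in group_by_elra_id(records).items() if len(v) > 1}
```

Records without an ELRA id are simply skipped here, so they would still need some other matching strategy.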

Best,

Penny

From: johnmccrae@gmail.com [mailto:johnmccrae@gmail.com] On Behalf Of John P. McCrae
Sent: Friday, March 20, 2015 4:33 PM
To: Khalid Choukri
Cc: public-ld4lt@w3.org
Subject: Re: LD4LT Teleconference, Today 3pm CET, on Linghub

 

Hi Khalid,

We will actually pick up duplicates; this should be pushed to production in the next few weeks.

I will mark the records from the CLARIN VLO as "ELRA (via CLARIN VLO)" (Tracker: https://github.com/liderproject/linghub/issues/18)

For META-SHARE we don't really have any information for most records as to where they came from before META-SHARE, e.g., 

http://metashare.elda.org/repository/browse/eurom1e-english/f01a4c96de6811e2b1e400259011f6eaf6ec06978e9b4d5e89cd122f3f96961a/
http://linghub.org/metashare/f01a4c96de6811e2b1e400259011f6eaf6ec06978e9b4d5e89cd122f3f96961a

 

Regards,

John

 

On Fri, Mar 20, 2015 at 2:18 PM, Khalid Choukri <choukri@elda.org> wrote:

Hi John
Thanks for the clarification.
This is an essential and tricky issue: we should insist that harvested data is labelled as such, and hence prevent people from harvesting things from secondary sources.
I am not sure you can, at this stage, filter duplicate records.

For us (ELRA), all the records were provided to you, including via META-SHARE; in the worst case something like "ELRA (via META-SHARE)" could be OK (though we think ELRA should be the sole source for ELRA-catalogued resources).

Best regards
Khalid


 

On 20/03/2015 12:41, John P. McCrae wrote:

Hi Khalid, 

 

The source property is intended to indicate where we got the record from, in this case actually from CLARIN! Would it be better if I clarify it by making it something like, for example "Meertens Institute (via CLARIN VLO)", or "ELRA (via CLARIN VLO)"?
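
A minimal Python sketch of how such a label could be built, assuming we know both the original provider and the aggregator the record was harvested from (the function name source_label and its parameters are hypothetical, not part of the Linghub code):

```python
def source_label(provider, aggregator=None):
    """Build a source label such as 'ELRA (via CLARIN VLO)'.

    provider:   the organisation that originally catalogued the resource.
    aggregator: the catalogue the record was harvested from, if different.
    """
    if aggregator is None or aggregator == provider:
        # Record obtained directly from its original provider.
        return provider
    return f"{provider} (via {aggregator})"


# Usage with the two examples mentioned above.
labels = [
    source_label("Meertens Institute", "CLARIN VLO"),
    source_label("ELRA", "CLARIN VLO"),
]
```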

Regards,

John

 

On Thu, Mar 19, 2015 at 7:03 PM, Khalid Choukri <choukri@elda.org> wrote:

Hi John

I am so sorry I missed the telco.
I managed to review the slides and the web site, and I realise that you harvested many sources which themselves harvest other sources, in a very long loop, leading to many wrong labels (but at least I understand why you mention that our community has over 100K resources).

As an example, searching the titles for "Speecon", a resource available only from the ELRA catalogue, I see that it is labelled as Source: CLARIN.




Thai Speecon database <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0288>
    Description <http://purl.org/dc/elements/1.1/description>: Desktop/Microphone
    Language <http://purl.org/dc/terms/language>: Thai <http://www.lexvo.org/id/iso639-3/tha>
    Source <http://purl.org/dc/elements/1.1/source>: CLARIN
    Title <http://purl.org/dc/elements/1.1/title>: Thai Speecon database

Czech Speecon database <http://linghub.org/lremap/efd68ccbda3ae46c3f4c04db49d34989>
    Language <http://purl.org/dc/terms/language>: Czech <http://www.lexvo.org/id/iso639-3/ces>
    Title <http://purl.org/dc/elements/1.1/title>: Czech Speecon database
    Type <http://purl.org/dc/terms/type>: Corpus <http://babelnet.org/rdf/s00022825n>

Czech Speecon database <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0298>
    Description <http://purl.org/dc/elements/1.1/description>: Desktop/Microphone
    Language <http://purl.org/dc/terms/language>: Czech <http://www.lexvo.org/id/iso639-3/ces>
    Source <http://purl.org/dc/elements/1.1/source>: CLARIN
    Title <http://purl.org/dc/elements/1.1/title>: Czech Speecon database



The same applies to Eurom1:



EUROM1_fr <http://linghub.org/clarin/Speech_and_Language_Data_Repository/oai_sldr_org_sldr000035>
    Contributor <http://purl.org/dc/elements/1.1/contributor>: SAM_A European project
    Creator <http://purl.org/dc/elements/1.1/creator>: Institut de la communication parlée (ICP, Grenoble FR)
    Description <http://purl.org/dc/elements/1.1/description>: The EUROM1 database contains recordings of 60 speakers in eleven European Languages: Danish, Dutch, British English, French, German, Norwegian, Swedish, Dutch, Greek, Portuguese and Spanish. It was explicitly designed to aid the phonetic comparison of languages, with similar materials and recording protocols in all languages. Only French EUROM1 is accessible here. It was used as a resource for the MULTEXT project. This version has been reformatted for compliance with long-term preservation specifications.
    Rights <http://purl.org/dc/elements/1.1/rights>: info:eu-repo/date/submitted/2008-09-01
    Source <http://purl.org/dc/elements/1.1/source>: CLARIN
    Subject <http://purl.org/dc/elements/1.1/subject>:
    Title <http://purl.org/dc/elements/1.1/title>: EUROM1_fr


and to others ....



GlobalPhone Korean <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0200>
    Description <http://purl.org/dc/elements/1.1/description>: Desktop/Microphone
    Language <http://purl.org/dc/terms/language>: Korean <http://www.lexvo.org/id/iso639-3/kor>
    Source <http://purl.org/dc/elements/1.1/source>: CLARIN
    Title <http://purl.org/dc/elements/1.1/title>: GlobalPhone Korean



As you can imagine, this is misleading, and I am wondering if we can help you correct this.

All the best
Khalid 









On 19/03/2015 15:08, John P. McCrae wrote:

For those who have not yet joined, the link to the GoToMeeting is here:

https://global.gotomeeting.com/join/360074461

Regards,

John

 

On Thu, Mar 19, 2015 at 2:16 PM, John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de> wrote:

Dear all,

In the teleconference this afternoon we will present Linghub <http://linghub.org/>, the work of several members of this group and the LIDER project.

Here are some slides I will present to start the discussion:

 

https://docs.google.com/presentation/d/1ZDzHYcgHvqzp_zK77vGFZ36kEMEmNt9rw7kBJmrhetQ/edit?usp=sharing

Regards,

John P. McCrae

 

 

-- 

************************************************* 
Khalid CHOUKRI 
ELRA General Secretary & ELDA CEO 
email: choukri@elda.org ; Web: www.elra.info www.elda.org
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30
*************************************************** 
** Info on LREC: www.lrec-conf.org
**************************************************** 

 

 


 

Received on Friday, 20 March 2015 20:10:30 UTC