- From: Penny Labropoulou <penny@ilsp.gr>
- Date: Fri, 20 Mar 2015 22:09:55 +0200
- To: "'John P. McCrae'" <jmccrae@cit-ec.uni-bielefeld.de>, "'Khalid Choukri'" <choukri@elda.org>
- Cc: <public-ld4lt@w3.org>
- Message-ID: <005701d06349$d69efb10$83dcf130$@ilsp.gr>
Hi John and all.

All metadata at META-SHARE come from the consortium partners: at the bottom of http://www.meta-share.eu/ you can see who the partners are. Each partner maintains their own repo where they describe (according to the META-SHARE schema) and store their resources; metadata records (and only metadata) are harvested from the repos to the managing nodes, which all share the same catalogue.

This means that you will most probably get duplicates from META-SHARE and CLARIN, because some partners are in both; for instance, ELRA metadata are also in META-SHARE. Still, I would expect differences in the metadata descriptions: I think, but I’m not sure, that CLARIN harvests the ELRA records from OLAC (Khalid, is this it?), so the metadata schema is different. However, in ELRA records the ELRA id can help in the identification of the duplicates.

Best,
Penny

From: johnmccrae@gmail.com [mailto:johnmccrae@gmail.com] On Behalf Of John P. McCrae
Sent: Friday, March 20, 2015 4:33 PM
To: Khalid Choukri
Cc: public-ld4lt@w3.org
Subject: Re: LD4LT Teleconference, Today 3pm CET, on Linghub

Hi Khalid,

We will actually pick up duplicates; this should be pushed to production in the next few weeks. I will mark the records from the CLARIN VLO as "ELRA (via CLARIN VLO)" (Tracker: https://github.com/liderproject/linghub/issues/18).

For META-SHARE we don't really have any information for most records as to where they came from before META-SHARE, e.g.,

http://metashare.elda.org/repository/browse/eurom1e-english/f01a4c96de6811e2b1e400259011f6eaf6ec06978e9b4d5e89cd122f3f96961a/
http://linghub.org/metashare/f01a4c96de6811e2b1e400259011f6eaf6ec06978e9b4d5e89cd122f3f96961a

Regards,
John

On Fri, Mar 20, 2015 at 2:18 PM, Khalid Choukri <choukri@elda.org> wrote:

Hi John

Thanks for the clarification. This is an essential and tricky issue; we should insist that harvested data is labeled as such and hence prevent people from harvesting things from secondary sources.
I am not sure you can, at this stage, filter duplicate records. For us (ELRA), all the records were provided to you, including via META-SHARE; in the worst case, something like "ELRA (via META-SHARE)" could be OK (though we think ELRA should be the sole source for ELRA-catalogued resources).

Best regards
Khalid

On 20/03/2015 12:41, John P. McCrae wrote:

Hi Khalid,

The source property is intended to indicate where we got the record from, in this case actually from CLARIN! Would it be better if I clarified it by making it something like "Meertens Institute (via CLARIN VLO)" or "ELRA (via CLARIN VLO)"?

Regards,
John

On Thu, Mar 19, 2015 at 7:03 PM, Khalid Choukri <choukri@elda.org> wrote:

Hi John

I am so sorry I missed the telco. I managed to review the slides and the web site, and I realise that you harvested many sources which themselves harvest other sources in a very long loop, leading to many wrong labels (but at least I understand why you mention that our community has over 100K resources).

As an example, searching the titles for "Speecon", a resource available only from the ELRA catalogue, I see that it is labeled as:

Source: CLARIN

Thai Speecon database <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0288>
    Description <http://purl.org/dc/elements/1.1/description>: Desktop/Microphone
    Language <http://purl.org/dc/terms/language>: Thai <http://www.lexvo.org/id/iso639-3/tha>
    Source <http://purl.org/dc/elements/1.1/source>: CLARIN
    Title <http://purl.org/dc/elements/1.1/title>: Thai Speecon database

Czech Speecon database <http://linghub.org/lremap/efd68ccbda3ae46c3f4c04db49d34989>
    Language <http://purl.org/dc/terms/language>: Czech <http://www.lexvo.org/id/iso639-3/ces>
    Title <http://purl.org/dc/elements/1.1/title>: Czech Speecon database
    Type <http://purl.org/dc/terms/type>: Corpus <http://babelnet.org/rdf/s00022825n>

Czech Speecon database <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0298>
    Description <http://purl.org/dc/elements/1.1/description>: Desktop/Microphone
    Language <http://purl.org/dc/terms/language>: Czech <http://www.lexvo.org/id/iso639-3/ces>
    Source <http://purl.org/dc/elements/1.1/source>: CLARIN
    Title <http://purl.org/dc/elements/1.1/title>: Czech Speecon database

The same applies to Eurom1:

EUROM1_fr <http://linghub.org/clarin/Speech_and_Language_Data_Repository/oai_sldr_org_sldr000035>
    Contributor <http://purl.org/dc/elements/1.1/contributor>: SAM_A European project
    Creator <http://purl.org/dc/elements/1.1/creator>: Institut de la communication parlée (ICP, Grenoble FR)
    Description <http://purl.org/dc/elements/1.1/description>: The EUROM1 database contains recordings of 60 speakers in eleven European Languages: Danish, Dutch, British English, French, German, Norwegian, Swedish, Dutch, Greek, Portuguese and Spanish. It was explicitly designed to aid the phonetic comparison of languages, with similar materials and recording protocols in all languages. Only French EUROM1 is accessible here. It was used as a resource for the MULTEXT project. This version has been reformatted for compliance with long-term preservation specifications.
    Rights <http://purl.org/dc/elements/1.1/rights>: info:eu-repo/date/submitted/2008-09-01
    Source <http://purl.org/dc/elements/1.1/source>: CLARIN
    Subject <http://purl.org/dc/elements/1.1/subject>:
    Title <http://purl.org/dc/elements/1.1/title>: EUROM1_fr

and to others ....
GlobalPhone Korean <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0200>
    Description <http://purl.org/dc/elements/1.1/description>: Desktop/Microphone
    Language <http://purl.org/dc/terms/language>: Korean <http://www.lexvo.org/id/iso639-3/kor>
    Source <http://purl.org/dc/elements/1.1/source>: CLARIN
    Title <http://purl.org/dc/elements/1.1/title>: GlobalPhone Korean

As you can imagine, this is misleading, and I am wondering if we can help you correct this.

All the best
Khalid

On 19/03/2015 15:08, John P. McCrae wrote:

For those who have not yet joined, the link to the GoToMeeting is here:

https://global.gotomeeting.com/join/360074461

Regards,
John

On Thu, Mar 19, 2015 at 2:16 PM, John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de> wrote:

Dear all,

In the teleconference this afternoon we will present Linghub <http://linghub.org/>, the work of several members of this group and the LIDER project. Here are some slides I will present to start the discussion:

https://docs.google.com/presentation/d/1ZDzHYcgHvqzp_zK77vGFZ36kEMEmNt9rw7kBJmrhetQ/edit?usp=sharing

Regards,
John P. McCrae

--
*************************************************
Khalid CHOUKRI
ELRA General Secretary & ELDA CEO
email: choukri@elda.org; Web: www.elra.info www.elda.org
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30
***************************************************
** Info on LREC: www.lrec-conf.org
****************************************************
Received on Friday, 20 March 2015 20:10:30 UTC