- From: Marta Villegas <marta.villegas@gmail.com>
- Date: Thu, 26 Mar 2015 10:15:22 +0100
- To: "dave.lewis@cs.tcd.ie" <dave.lewis@cs.tcd.ie>
- Cc: public-ld4lt@w3.org
- Message-ID: <CAPq_VFnD82wSxq-VtW7S84+++KEYqGJzOS_YcNMRLFgk=pM7kQ@mail.gmail.com>
Hi all,

The DASISH project addressed this kind of problem. They get data from CLARIN, DARIAH, CESSDA, ... Maybe you can check:

- "Data Service Infrastructure for the Social Sciences and Humanities" at http://dasish.eu/publications/projectreports/DASISH-D5.2_AB_final__25nov-R.PDF
- https://github.com/TheLanguageArchive/oai-harvest-manager
- workflow of the Joint Metadata Domain at https://github.com/DASISH/jmd-scripts
- http://teresah.dasish.eu/tools/pie-slice/data-sources/teresah as an example.

The dc:hasPart approach is interesting.

I hope it helps!

2015-03-24 17:48 GMT+01:00 dave.lewis@cs.tcd.ie <dave.lewis@cs.tcd.ie>:

> John,
> PROV-O would be one way of capturing this, using prov:wasDerivedFrom and
> optionally additional activity metadata about the harvesting process
> itself.
>
> cheers,
> Dave
>
> On 20/03/2015 13:18, Khalid Choukri wrote:
>
> Hi John,
> Thanks for the clarification.
> This is an essential and tricky issue; we should insist that harvested
> data is labeled as such and hence prevent people from harvesting things
> from secondary sources.
> I am not sure you can, at this stage, filter duplicate records.
>
> For us (ELRA) all the records were provided to you, including via
> META-SHARE; in the worst case something like "ELRA (via META-SHARE)" could
> be OK (though we think ELRA should be the sole source for ELRA
> catalogued resources).
>
> Best regards
> Khalid
>
> On 20/03/2015 12:41, John P. McCrae wrote:
>
> Hi Khalid,
>
> The source property is intended to indicate where *we* got the record
> from, in this case actually from CLARIN! Would it be better if I clarify it
> by making it something like, for example "Meertens Institute (via CLARIN
> VLO)", or "ELRA (via CLARIN VLO)"?
>
> Regards,
> John
>
> On Thu, Mar 19, 2015 at 7:03 PM, Khalid Choukri <choukri@elda.org> wrote:
>
>> Hi John,
>>
>> I am so sorry I missed the telco.
>> I managed to review the slides and the web site, and I realise that you
>> harvested so many sources which also harvest other sources in a very long
>> loop, leading to many wrong labels (but at least I understand why
>> you mention that our community has over 100K resources).
>>
>> As examples, searching the titles for "Speecon", a resource available
>> only from the ELRA catalogue, I see that it is labeled as Source: CLARIN.
>>
>> Thai Speecon database
>> <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0288>
>>   Description <http://purl.org/dc/elements/1.1/description> Desktop/Microphone
>>   Language <http://purl.org/dc/terms/language> Thai <http://www.lexvo.org/id/iso639-3/tha>
>>   Source <http://purl.org/dc/elements/1.1/source> CLARIN
>>   Title <http://purl.org/dc/elements/1.1/title> Thai Speecon database
>>
>> Czech Speecon database
>> <http://linghub.org/lremap/efd68ccbda3ae46c3f4c04db49d34989>
>>   Language <http://purl.org/dc/terms/language> Czech <http://www.lexvo.org/id/iso639-3/ces>
>>   Title <http://purl.org/dc/elements/1.1/title> Czech Speecon database
>>   Type <http://purl.org/dc/terms/type> Corpus <http://babelnet.org/rdf/s00022825n>
>>
>> Czech Speecon database
>> <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0298>
>>   Description <http://purl.org/dc/elements/1.1/description> Desktop/Microphone
>>   Language <http://purl.org/dc/terms/language> Czech <http://www.lexvo.org/id/iso639-3/ces>
>>   Source <http://purl.org/dc/elements/1.1/source> CLARIN
>>   Title <http://purl.org/dc/elements/1.1/title> Czech Speecon database
>>
>> The same applies to Eurom1:
>>
>> EUROM1_fr
>> <http://linghub.org/clarin/Speech_and_Language_Data_Repository/oai_sldr_org_sldr000035>
>>   Contributor <http://purl.org/dc/elements/1.1/contributor> SAM_A European project
>>   Creator <http://purl.org/dc/elements/1.1/creator> Institut de la communication parlée (ICP, Grenoble FR)
>>   Description <http://purl.org/dc/elements/1.1/description> The EUROM1 database
>>     contains recordings of 60 speakers in eleven European Languages: Danish,
>>     Dutch, British English, French, German, Norwegian, Swedish, Dutch, Greek,
>>     Portuguese and Spanish. It was explicitly designed to aid the phonetic
>>     comparison of languages, with similar materials and recording protocols in
>>     all languages. Only French EUROM1 is accessible here. It was used as a
>>     resource for the MULTEXT project. This version has been reformatted
>>     for compliance with long-term preservation specifications.
>>   Rights <http://purl.org/dc/elements/1.1/rights> info:eu-repo/date/submitted/2008-09-01
>>   Source <http://purl.org/dc/elements/1.1/source> CLARIN
>>   Subject <http://purl.org/dc/elements/1.1/subject>
>>   Title <http://purl.org/dc/elements/1.1/title> EUROM1_fr
>>
>> and to others ....
>>
>> GlobalPhone Korean
>> <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0200>
>>   Description <http://purl.org/dc/elements/1.1/description> Desktop/Microphone
>>   Language <http://purl.org/dc/terms/language> Korean <http://www.lexvo.org/id/iso639-3/kor>
>>   Source <http://purl.org/dc/elements/1.1/source> CLARIN
>>   Title <http://purl.org/dc/elements/1.1/title> GlobalPhone Korean
>>
>> As you can imagine this is misleading, and I am wondering if we can help
>> you correct this.
>>
>> All the best
>> Khalid
>>
>> On 19/03/2015 15:08, John P. McCrae wrote:
>>
>> For those who have not yet joined, the link to the GoToMeeting is here:
>>
>> https://global.gotomeeting.com/join/360074461
>>
>> Regards,
>> John
>>
>> On Thu, Mar 19, 2015 at 2:16 PM, John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de> wrote:
>>
>>> Dear all,
>>>
>>> In the teleconference this afternoon we will present Linghub
>>> <http://linghub.org/>, the work of several members of this group and
>>> the LIDER project.
>>>
>>> Here are some slides I will present to start the discussion:
>>>
>>> https://docs.google.com/presentation/d/1ZDzHYcgHvqzp_zK77vGFZ36kEMEmNt9rw7kBJmrhetQ/edit?usp=sharing
>>>
>>> Regards,
>>> John P. McCrae
>>
>> --
>> *************************************************
>> * Khalid CHOUKRI *
>> ELRA General Secretary & ELDA CEO
>> email: choukri@elda.org ; Web: www.elra.info www.elda.org
>> Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30
>> *************************************************
>> * Info on LREC: www.lrec-conf.org <http://www.lrec-conf.org>
>> *************************************************
>
> --
> *************************************************
> * Khalid CHOUKRI *
> ELRA General Secretary & ELDA CEO
> email: choukri@elda.org ; Web: www.elra.info www.elda.org
> Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30
> *************************************************
> * Info on LREC: www.lrec-conf.org <http://www.lrec-conf.org>
> *************************************************
>
> --
> Director - Knowledge and Data Engineering Group
> The CNGL Centre for Global Intelligent Content
> School of Computer Science and Statistics
> Trinity College Dublin

--
Marta Villegas
marta.villegas@gmail.com
Received on Thursday, 26 March 2015 09:15:58 UTC