Re: LD4LT Teleconference, Today 3pm CET, on Linghub

Hi all,

The DASISH project addressed this kind of problem. They get data from CLARIN,
DARIAH, CESSDA, etc. Maybe you can check:


- "Data Service Infrastructure for the Social Sciences and Humanities" at
http://dasish.eu/publications/projectreports/DASISH-D5.2_AB_final__25nov-R.PDF

- https://github.com/TheLanguageArchive/oai-harvest-manager

- workflow of the Joint Metadata Domain at
https://github.com/DASISH/jmd-scripts

- http://teresah.dasish.eu/tools/pie-slice/data-sources/teresah as an
example. The dc:hasPart approach is interesting.

I hope it helps!




2015-03-24 17:48 GMT+01:00 dave.lewis@cs.tcd.ie <dave.lewis@cs.tcd.ie>:

>  John,
> Prov-o would be one way of capturing this, using prov:wasDerivedFrom and
> optionally additional activity meta-data about the harvesting process
> itself.
>
> cheers,
> Dave
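
As a rough illustration of the PROV-O suggestion above, here is a minimal
sketch that emits N-Triples for one harvested record. The function name,
the upstream record URI and the activity URI are hypothetical and only for
illustration; only the prov: terms themselves (wasDerivedFrom,
wasGeneratedBy, used, endedAtTime, Activity) come from PROV-O. It is not
how Linghub actually models this.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def provenance_triples(record_uri, upstream_uri, activity_uri=None, harvested_at=None):
    """Return N-Triples lines linking a harvested record to its upstream record.

    record_uri   -- URI of the record as published by the harvester
    upstream_uri -- URI of the record it was copied from
    activity_uri -- optional URI naming the harvesting run (hypothetical)
    harvested_at -- optional xsd:dateTime string for the end of that run
    """
    # The core statement: this record was derived from the upstream one.
    lines = [f"<{record_uri}> <{PROV}wasDerivedFrom> <{upstream_uri}> ."]
    if activity_uri is not None:
        # Optional metadata about the harvesting process itself.
        lines.append(f"<{activity_uri}> <{RDF_TYPE}> <{PROV}Activity> .")
        lines.append(f"<{record_uri}> <{PROV}wasGeneratedBy> <{activity_uri}> .")
        lines.append(f"<{activity_uri}> <{PROV}used> <{upstream_uri}> .")
        if harvested_at is not None:
            lines.append(
                f'<{activity_uri}> <{PROV}endedAtTime> '
                f'"{harvested_at}"^^<{XSD_DATETIME}> .'
            )
    return lines


if __name__ == "__main__":
    # The linghub.org URI is taken from the thread; the other two are made up.
    for line in provenance_triples(
        "http://linghub.org/clarin/European_Language_Resources_Association/"
        "oai_catalogue_elra_info_ELRA_S0288",
        "http://example.org/upstream/ELRA_S0288",
        activity_uri="http://example.org/harvest/run-1",
        harvested_at="2015-03-19T00:00:00Z",
    ):
        print(line)
```

Chaining wasDerivedFrom statements this way would let a consumer walk back
from an aggregated copy to the primary catalogue, which is the labelling
problem discussed further down in this thread.
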
>
>
>
> On 20/03/2015 13:18, Khalid Choukri wrote:
>
> Hi John
> Thanks for the clarification,
> This is an essential and tricky issue. We should insist that harvested
> data is labeled as such and hence prevent people from harvesting things
> from secondary sources.
> I am not sure you can, at this stage, filter duplicate records.
>
> For us (ELRA), all the records were provided to you, including via
> META-SHARE. In the worst case, something like "ELRA (via META-SHARE)" could
> be OK (though we think ELRA should be the sole source for ELRA-catalogued
> resources).
>
> Best regards
> Khalid
>
>
> On 20/03/2015 12:41, John P. McCrae wrote:
>
> Hi Khalid,
>
>  The source property is intended to indicate where *we* got the record
> from, in this case actually from CLARIN! Would it be better if I clarify it
> by making it something like, for example "Meertens Institute (via CLARIN
> VLO)", or "ELRA (via CLARIN VLO)"?
>
>  Regards,
>  John
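
The "X (via Y)" labelling proposed above could be sketched roughly as
follows. The record class and its field names (origin, via) are
assumptions made for this example, not Linghub's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HarvestedRecord:
    title: str
    origin: str                # catalogue that owns the record, e.g. "ELRA"
    via: Optional[str] = None  # aggregator it was harvested from, if any


def source_label(record: HarvestedRecord) -> str:
    """Build a source string that names both the owner and the aggregator."""
    if record.via and record.via != record.origin:
        return f"{record.origin} (via {record.via})"
    # Harvested directly from the owning catalogue: no qualifier needed.
    return record.origin


if __name__ == "__main__":
    rec = HarvestedRecord("Thai Speecon database", "ELRA", "CLARIN VLO")
    print(source_label(rec))
```

Keeping the owner and the aggregator as separate fields, and only joining
them for display, would also make it possible later to prefer the record
that came straight from the owning catalogue.
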
>
> On Thu, Mar 19, 2015 at 7:03 PM, Khalid Choukri <choukri@elda.org> wrote:
>
>>  Hi John
>>
>> I am so sorry I missed the telco
>> I managed to review the slides and the web site, and I realise that you
>> harvested many sources which themselves harvest other sources in a very
>> long loop, leading to many wrong labels (but at least I understand why
>> you mention that our community has over 100K resources).
>>
>> As an example, searching the titles for "Speecon", a resource available
>> only from the ELRA catalogue, I see that it is labeled as "Source: CLARIN":
>>
>> Thai Speecon database
>> <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0288>
>>   Description <http://purl.org/dc/elements/1.1/description>: Desktop/Microphone
>>   Language <http://purl.org/dc/terms/language>: Thai <http://www.lexvo.org/id/iso639-3/tha>
>>   Source <http://purl.org/dc/elements/1.1/source>: CLARIN
>>   Title <http://purl.org/dc/elements/1.1/title>: Thai Speecon database
>>
>> Czech Speecon database
>> <http://linghub.org/lremap/efd68ccbda3ae46c3f4c04db49d34989>
>>   Language <http://purl.org/dc/terms/language>: Czech <http://www.lexvo.org/id/iso639-3/ces>
>>   Title <http://purl.org/dc/elements/1.1/title>: Czech Speecon database
>>   Type <http://purl.org/dc/terms/type>: Corpus <http://babelnet.org/rdf/s00022825n>
>>
>> Czech Speecon database
>> <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0298>
>>   Description <http://purl.org/dc/elements/1.1/description>: Desktop/Microphone
>>   Language <http://purl.org/dc/terms/language>: Czech <http://www.lexvo.org/id/iso639-3/ces>
>>   Source <http://purl.org/dc/elements/1.1/source>: CLARIN
>>   Title <http://purl.org/dc/elements/1.1/title>: Czech Speecon database
>>
>>
>>
>> The same applies to Eurom1:
>>
>> EUROM1_fr
>> <http://linghub.org/clarin/Speech_and_Language_Data_Repository/oai_sldr_org_sldr000035>
>>   Contributor <http://purl.org/dc/elements/1.1/contributor>: SAM_A European project
>>   Creator <http://purl.org/dc/elements/1.1/creator>: Institut de la communication parlée (ICP, Grenoble FR)
>>   Description <http://purl.org/dc/elements/1.1/description>: The EUROM1 database
>>     contains recordings of 60 speakers in eleven European Languages: Danish,
>>     Dutch, British English, French, German, Norwegian, Swedish, Dutch, Greek,
>>     Portuguese and Spanish. It was explicitly designed to aid the phonetic
>>     comparison of languages, with similar materials and recording protocols in
>>     all languages.
>>     Only French EUROM1 is accessible here. It was used as a resource for the
>>     MULTEXT project.
>>     This version has been reformatted for compliance with long-term
>>     preservation specifications.
>>   Rights <http://purl.org/dc/elements/1.1/rights>: info:eu-repo/date/submitted/2008-09-01
>>   Source <http://purl.org/dc/elements/1.1/source>: CLARIN
>>   Subject <http://purl.org/dc/elements/1.1/subject>: (no value)
>>   Title <http://purl.org/dc/elements/1.1/title>: EUROM1_fr
>>
>>
>> and to others ....
>>
>> GlobalPhone Korean
>> <http://linghub.org/clarin/European_Language_Resources_Association/oai_catalogue_elra_info_ELRA_S0200>
>>   Description <http://purl.org/dc/elements/1.1/description>: Desktop/Microphone
>>   Language <http://purl.org/dc/terms/language>: Korean <http://www.lexvo.org/id/iso639-3/kor>
>>   Source <http://purl.org/dc/elements/1.1/source>: CLARIN
>>   Title <http://purl.org/dc/elements/1.1/title>: GlobalPhone Korean
>>
>>
>>
>> As you can imagine, this is misleading, and I am wondering if we can help
>> you correct this.
>>
>> All the best
>> Khalid
>>
>>
>>
>>
>>
>>
>>
>> On 19/03/2015 15:08, John P. McCrae wrote:
>>
>>  For those who have not yet joined, the link to the GoToMeeting is here:
>>
>> https://global.gotomeeting.com/join/360074461
>>
>>  Regards,
>>  John
>>
>> On Thu, Mar 19, 2015 at 2:16 PM, John P. McCrae <
>> jmccrae@cit-ec.uni-bielefeld.de> wrote:
>>
>>>   Dear all,
>>>
>>>  In the teleconference this afternoon we will present Linghub
>>> <http://linghub.org/>, the work of several members of this group and
>>> the LIDER project.
>>>
>>>  Here are some slides I will present to start the discussion:
>>>
>>>
>>> https://docs.google.com/presentation/d/1ZDzHYcgHvqzp_zK77vGFZ36kEMEmNt9rw7kBJmrhetQ/edit?usp=sharing
>>>
>>>  Regards,
>>>  John P. McCrae
>>>
>>
>>
>>   --
>>
>> *************************************************
>> * Khalid CHOUKRI *
>> ELRA General Secretary & ELDA CEO
>> email: choukri@elda.org ; Web: www.elra.info www.elda.org
>> Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30
>> ***************************************************
>> **
>> * Info on LREC: www.lrec-conf.org <http://www.lrec-conf.org>
>> **************************************************** *
>>
>
>
>
> --
> Director - Knowledge and Data Engineering Group
> The CNGL Centre for Global Intelligent Content
> School of Computer Science and Statistics
> Trinity College Dublin
>
>


-- 
Marta Villegas
marta.villegas@gmail.com

Received on Thursday, 26 March 2015 09:15:58 UTC