Re: Metashare as used by LingHub from Marta Villegas on 2015-02-05 (public-ld4lt@w3.org from February 2015)

From: Marta Villegas <marta.villegas@gmail.com>
Date: Thu, 5 Feb 2015 12:15:10 +0100
To: Penny Labropoulou <penny@ilsp.gr>
Cc: "John P. McCrae" <jmccrae@cit-ec.uni-bielefeld.de>, public-ld4lt@w3.org
Message-ID: <CAPq_VFnBfUFao0OLC1YwEyB0336QNtuYHbmEb0BzgaKZz__G0w@mail.gmail.com>
Hi again, I forgot the attached files (sorry)

I hope this helps!

2015-02-05 12:12 GMT+01:00 Marta Villegas <marta.villegas@gmail.com>:

> Hi John, Penny and all
>
> I'm sending you the xsl file we use to analyse the MS schema and the output
> we get.
>
> The xsl script generates all possible 'xpaths' from the root element
> (resourceInfo) to all terminal nodes.
> Each node corresponds to an xml element. The output looks like:
>
> */resourceInfo[n](1)/identificationInfo[](11)/resourceName[n](6)@*
>
> where [] holds the element's cardinality and () = count(element + siblings).
> In this example, *resourceName[n](6)* means resourceName is unbounded
> and has 5 siblings.
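
As a rough illustration of the notation above, here is a minimal Python sketch that splits one output line into its steps. It is not the attached XSL; the names `Step`, `STEP_RE` and `parse_path` are made up for this example, and it assumes every step has the form name[cardinality](count), with a trailing @ marking a terminal node.

```python
import re
from typing import List, NamedTuple, Tuple


class Step(NamedTuple):
    name: str         # XML element name
    cardinality: str  # text inside [], e.g. "", "n" or "unbounded"
    count: str        # text inside (), e.g. "6" or "*"


# Hypothetical pattern for a single step such as resourceName[n](6)
STEP_RE = re.compile(
    r"^(?P<name>[^\[\]()/@]+)\[(?P<card>[^\]]*)\]\((?P<count>[^)]*)\)$"
)


def parse_path(line: str) -> Tuple[List[Step], bool]:
    """Return (steps, is_terminal) for one line of the path listing."""
    text = line.strip()
    is_terminal = text.endswith("@")
    if is_terminal:
        text = text[:-1]
    steps = []
    for part in text.split("/"):
        if part in ("", "..."):  # skip the leading slash or an elided prefix
            continue
        match = STEP_RE.match(part)
        if match is None:
            raise ValueError(f"unrecognised step: {part!r}")
        steps.append(Step(match["name"], match["card"], match["count"]))
    return steps, is_terminal


steps, is_terminal = parse_path(
    "/resourceInfo[n](1)/identificationInfo[](11)/resourceName[n](6)@"
)
for step in steps:
    print(step.name, repr(step.cardinality), step.count)
```

The cardinality and count are kept as plain strings because the listing also uses values such as "unbounded" and "*".
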
>
> Terminal nodes (ending with @) are simple typed elements that become data
> type properties.
>
> Non-terminal nodes correspond to 'embedded' XML elements. They are all
> complex elements and, generally, they have @type or @ref attributes in the
> schema. A few are described locally (complex elements with neither @type
> nor @ref).
>
> In principle, complex elements (all non-terminal ones) should generate a
> Class + an Object Property. This, however, 'over-generates' the resulting
> graph. Nodes marked [] suggest it is better not to follow the rule.
>
> - *Nodes with [](1) and [](*)* can be removed (these are nodes with
> cardinality 1 and no siblings)
>
> for example:
>
>
> *.../creationInfo[](17)/creationTool[unbounded](4)/targetResourceNameURI[](1)@*
>
> where:     X  creationTool  url.
>
> is better than:    X  creationTool [y  targetResourceNameUri  uri ] .
>
> similarly in:
>
>
> *.../evaluationReport[n](9)/documentInfo[](*)/title[unbounded](20)@*
>
>  X evaluationReport [ Y a documentType ; title 'some title' ]
>
> is better than
>
> X evaluationReport [y a documentationInfoType ; documentInfo [ z a
> documentInfoType ; title 'some title']].
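
The removal rule in the two examples above could be sketched as follows. This is only an illustration under one reading of the rule: a step is dropped when its cardinality is empty and its count is 1 or *. The names `Step`, `is_removable` and `collapse` are hypothetical, and steps are plain (name, cardinality, count) tuples.

```python
from typing import List, Tuple

Step = Tuple[str, str, str]  # (element name, cardinality text, count text)


def is_removable(step: Step) -> bool:
    """True for nodes written as name[](1) or name[](*) in the listing."""
    _name, cardinality, count = step
    return cardinality == "" and count in ("1", "*")


def collapse(steps: List[Step]) -> List[Step]:
    """Drop removable nodes so their content attaches to the parent step."""
    return [step for step in steps if not is_removable(step)]


creation_tool = [
    ("creationInfo", "", "17"),
    ("creationTool", "unbounded", "4"),
    ("targetResourceNameURI", "", "1"),
]
evaluation_report = [
    ("evaluationReport", "n", "9"),
    ("documentInfo", "", "*"),
    ("title", "unbounded", "20"),
]

print([name for name, _, _ in collapse(creation_tool)])
print([name for name, _, _ in collapse(evaluation_report)])
```

In the first path the URI value ends up directly on creationTool, and in the second the title ends up directly on the evaluationReport node, which matches the preferred forms given above. Other [] nodes, such as creationInfo[](17), are left alone because they still need manual review.
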
>
> Compare the following lines:
>
>
> *.../licenceInfo[n](5)/licensor[n](11)/personInfo[](*)/surname[n](6)@*
>
> *       .../metadataInfo[](11)/metadataCreator[n](9)/surname[n](6)@*
>
>
> In the first case the personInfo node is superfluous (here *licensor* is
> typed as Actor, which in turn is defined as a choice between Person and
> Organisation. XML does not help!!)
>
> In the second case, *metadataCreator* is typed as Person. The
> corresponding XML instances show this 'problem':
>
> <*metadataInfo*>
>     <metadataCreationDate>2006-05-04</metadataCreationDate>
>     *<metadataCreator>*
>         <surname lang="en-US">surname0</surname>
>         <givenName lang="en-US">givenName0</givenName>
>         <sex>male</sex>
> <*licenceInfo*>
>     <licence>CC-BY</licence>...
>     *<licensor>*
> *        <personInfo>*
>             <surname lang="en-US">surname0</surname>
>             <givenName lang="en-US">givenName0</givenName>
>             <sex>male</sex>
> These are XML problems which can be easily addressed in OWL, with
> something like: an Actor superclass with subclasses for Person &
> Organisation; one metadataCreator property with range Person and one
> licensor property with range Actor.
>
> X metadataCreator [ y a Person ; surname ?surname ] .
> X licensor [ y a Person ; surname ?surname ] .
>
> - *other [] nodes *need careful revision.
>
> for example:
>
> /resourceInfo[n](1)/*identificationInfo[]*(11)/resourceName[n](6)@
> /resourceInfo[n](1)/*identificationInfo[]*(11)/description[n](6)@
> /resourceInfo[n](1)/*identificationInfo[]*(11)/resourceShortName[n](6)@
> /resourceInfo[n](1)*/identificationInfo[]*(11)/url[n](6)@
> /resourceInfo[n](1)*/identificationInfo[]*(11)/metaShareId[](6)@
> /resourceInfo[n](1)*/identificationInfo[]*(11)/identifier[n](6)@
>
> also look at:
>
>
> /resourceInfo[n](1)/resourceComponentType[](11)/toolServiceInfo[](*)/toolServiceEvaluationInfo[](9)/evaluationReport[n](9)/documentInfo[](*)/documentType[](20)@
>
>
>
>
> 2015-01-28 13:40 GMT+01:00 Penny Labropoulou <penny@ilsp.gr>:
>
>>
>>
>>
>>
>> On Tue, Jan 27, 2015 at 10:03 PM, Penny Labropoulou <penny@ilsp.gr>
>> wrote:
>>
>> Hi John and all.
>>
>> Thanx for the quick work!
>>
>> Below are a few comments/replies in between the lines.
>>
>>
>>
>> 1) Some names have been shortened, e.g.,
>> 'ConformanceToBestStandardsAndPractices' ->
>> 'StandardsBestPractices', should we accept such names or stay true to
>> MetaShare?
>> I think we should decide this on a case-by-case basis; although some
>> names are long, they are self-explanatory. In general, at ld4lt we have
>> changed some names (e.g. resource to language resource) when it was agreed
>> that the new label is better.
>>
>> Hmm... change for the sake of change is difficult, particularly when it
>> affects only a small part of the vocabulary; that creates gotchas.
>>
>> There are some typos that we have spotted and also comments raised by
>> various.
>>
>>
>> 2) A lot of MetaShare names have (unnecessarily) the words 'Info', 'Type'
>> or 'InfoType', we could eliminate these.
>> All “info” elements are in fact component names: in accordance with the
>> CMDI principles, elements (and other components) are grouped into
>> semantically coherent components. For instance, the identificationInfo
>> groups together elements that are used for the identification of a
>> resource, such as the resourceId, a url used as landing page, the
>> resourceName and shortName, the description etc. If I have understood well,
>> this structure is not needed/not a good practice for RDF and this is why
>> they have been eliminated already at the IULA/UPF mapping.
>>
>> “type” elements are used in MetaShare for components that can be re-used:
>> e.g. persons can be licensors, contact points, resource creators etc., but
>> in all cases they are encoded using the personInfoType, which groups
>> together given name, surname, communication information etc. Again, I think
>> this is not mapped in RDF as such, if I understand well.
>>
>> Yeah that is my feeling too, I would like to shorten the names, however
>> it seems hard to do this consistently as it would create clashes, e.g.,
>> ActualUse/ActualUseInfo, DocumentType/DocumentInfo
>>
>>
>> 3) IULA have split the AnnotationType class into 5 subclasses
>> (DiscourseAnnotation, etc.)
>>
>> That’s an improvement from the original model and I suggest we stick to
>> it.
>>
>> 4) There are many properties suggested by IULA or in the 'DISTRIBUTION'
>> model that have no correspondence in the MetaShare data... we should
>> discuss these on a case-by-case basis, right?
>>
>> We have already discussed with Victor the distribution and licensing
>> module and have come up with a proposal re-introducing some of the original
>> MetaShare elements that were not mapped in the IULA/UPF version and using
>> the odrl (mainly) and cc vocabularies ; the general ideas are to be found
>> at
>> https://www.w3.org/community/ld4lt/wiki/Metashare_vocabulary_for_licenses
>> and https://www.w3.org/community/ld4lt/wiki/Examples and the mappings
>> were documented in the previous googlesheet. I will add these to the new
>> googlesheet by next week.
>>
>> I incorporated all the functional (non-documentary) information from the
>> distribution model already... or at least I tried, let me know if I missed
>> anything.
>>
>> Ok; to be checked
>>
>> 5) The Prev. Google Doc proposed mapping to both SWRC and BIBO, do we
>> need to do BIBO as well (SWRC seems sufficient)?
>>
>> 6) I added the license modelling that LingHub does in ODRL, could one of
>> our ODRL experts look at it and fix the last one?
>>
>> Please, see also the two wikis on licensing, especially the examples. And
>> as discussed, together with Victor we will provide a file with the RDF
>> representations in odrl of the licenses used in MetaShare (of course, only
>> of those that have not already been RDFized).
>>
>> This refers to "R4 To neatly represent conditions of use"... but I
>> couldn't find the structured definitions of conditions of use so I wrote my
>> own in the sheet titled "License Modelling"
>>
>> To be checked and finalized by Monday.
>>
>> 7) Some property values, especially *resource types*, such as *ontology*
>> or *corpus* were created as classes in the Google Doc, shall we confirm
>> this usage pattern?
>>
>> This needs some more thinking, checking the various cases. Is there a
>> list of these?
>>
>> These seem to be individuals of the classes 'ResourceType' and
>> 'LexicalConceptualResourceType', approximately; here are the lists for
>> reference:
>>
>> In Prev. Google Doc: BabelNet*, ComputationalLexicon, Corpus,
>> CorpusAudio*, CorpusCollection*, CorpusImage*, CorpusText*,
>> CorpusTextNgram*, CorpusTextNumerical*, CorpusVideo*, Framenet,
>> LexicalConceptualResource, Lexicon, MachineReadableDictionary, Ontology,
>> TerminologicalResource, Thesaurus, ToolService*, WordList, WordNet
>>
>> From Metashare: computationalLexicon, framenet, lexicon,
>> machineReadableDictionary, ontology, other*, terminologicalResource,
>> thesaurus, wordList, wordnet, corpus, languageDescription*,
>> lexicalConceptualResource
>>
>> *Unique elements
>>
>> We might need a telco discussion for this, but first let me check the
>> current mappings.
>>
>>
>>
>> 8) *See attached diagram.* There is a big difference in granularity
>> between the XSD and IULA-UPF's ontology. For example, there are 4 tags
>> between the resource and its actual usage in the XML, e.g.,
>>
>> <resourceInfo> ...
>>
>>   <usageInfo> ...
>>
>>     <actualUsageInfo> ....
>>
>>       <useNLPspecific>parsing</useNLPspecific> ....
>>
>> Whereas in the IULA model this is considerably simplified to
>>
>> :resource a ms:Resource ;
>>
>>   ms:actualUse ms:parsing
>>
>>
>>
>> This would be great, but it also loses information, for example, the IULA
>> schema associates the *availability* with the *Resource*. However, the
>> XSD schema associates an *availability* with each *Distribution*
>> (download file). In fact, there are resources that have different
>> availability for different downloads (e.g., BabelNet), so there is
>> information loss here. Thus, LingHub is very conservative and sticks to the
>> XSD, e.g.,
>>
>> :resource a ms:ResourceInfo ;
>>
>>   ms:usageInfo [
>>
>>     ms:actualUsageInfo [
>>
>>       ms:useNLPspecific ms:parsing ] ]
>>
>> What shall we recommend here?
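
To make the conservative option concrete, here is a small Python sketch that turns the nested XML into nested blank nodes, one per tag, in the style of the LingHub example above. It is an illustration only: `SAMPLE` is a completed version of the elided snippet, `to_turtle` is a made-up helper, and treating the leaf text as an `ms:` individual is an assumption taken from the example rather than a rule of the schema.

```python
import xml.etree.ElementTree as ET

# Completed, well-formed version of the elided XML snippet (assumed structure)
SAMPLE = """<resourceInfo><usageInfo><actualUsageInfo>
<useNLPspecific>parsing</useNLPspecific>
</actualUsageInfo></usageInfo></resourceInfo>"""


def to_turtle(element: ET.Element, prefix: str = "ms") -> str:
    """Serialise the children of `element` as nested blank nodes."""
    parts = []
    for child in element:
        if len(child) == 0:
            # Leaf element: its text becomes the object of the property
            value = (child.text or "").strip()
            parts.append(f"{prefix}:{child.tag} {prefix}:{value}")
        else:
            # Wrapper element: keep it as a blank node, as in the XSD
            parts.append(f"{prefix}:{child.tag} [ {to_turtle(child, prefix)} ]")
    return " ; ".join(parts)


root = ET.fromstring(SAMPLE)
print(f":resource a ms:ResourceInfo ; {to_turtle(root)} .")
```

Every wrapper element survives as a blank node here, so nothing is lost, at the cost of the deeper graph discussed above.
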
>>
>>
>>
>> Again, discuss on a case-by-case basis. For instance, for availability,
>> we have re-introduced the distribution element, as  otherwise we lose in
>> semantics. For other cases, I think we should see them more closely. The
>> grouping into components made sense in XSD because it brought together
>> elements. I will have to look at them more closely and explain for each
>> case why this grouping was meant, so that we can decide if this should also
>> remain in the RDF mapping. Is there an easy way of spotting these cases?
>>
>> OK, we should discuss this in a telco.
>>
>>
>>
>> A final question: how will we add the comments/decisions from the
>> previous googlesheet to the current one? As said, I can do this for the
>> distribution/licensing module elements but for the rest?
>>
>> Add any comments you want (possibly copied from previous doc). Apart from
>> that I would like to keep the sheet itself clean until the next ld4lt
>> telco at least.
>>
>> Regards,
>> John
>>
>>
>>
>> Best,
>>
>> Penny
>>
>>
>>
>>
>>
>
>
>
> --
> Marta Villegas
> marta.villegas@gmail.com
>



-- 
Marta Villegas
marta.villegas@gmail.com
Attachments

text/plain attachment: MSxpath.txt
text/xml attachment: MSxsdxpath.xsl
Received on Thursday, 5 February 2015 11:15:54 UTC