RE: Metashare as used by LingHub

Hi all!

The only additional thing to consider is that the IULA-based model doesn't contain all the elements used at MS; they have focused mainly on text corpora and lexical/conceptual resources. But I think as soon as we decide on the basics (e.g. removal of …info), the differences will not be that many. I agree, something to decide at the telco.

Best,

Penny

 

 

From: Jorge Gracia [mailto:jgracia@fi.upm.es] 
Sent: Thursday, February 05, 2015 3:54 PM
To: Marta Villegas
Cc: Penny Labropoulou; John P. McCrae; public-ld4lt@w3.org
Subject: Re: Metashare as used by LingHub

 

Thanks John, Penny and Marta for this useful thread! :-) 

 

> how will we add the comments/decisions from the previous googlesheet to the current one? 

 

I wonder whether this is the fastest way to proceed or whether, on the contrary, we should stick to the previous IULA-based model and continue implementing changes there. I say this because the IULA version is already "cleaned" of superfluous nodes and already follows the rules described by Marta for simplifying the complexities of the XSD model. Well, something to be decided later on in the telco...

 

Best regards,

Jorge

 

 

2015-02-05 12:15 GMT+01:00 Marta Villegas <marta.villegas@gmail.com>:

Hi again, I forgot the attached files (sorry)

 

I hope this helps!

 

2015-02-05 12:12 GMT+01:00 Marta Villegas <marta.villegas@gmail.com>:

Hi John, Penny and all

 

I'm sending you the XSL file we use to analyse the MS schema and the output we get.

 

The xsl script generates all possible 'xpaths' from the root element (resourceInfo) to all terminal nodes.

Each node corresponds to an xml element. The output looks like:

 

/resourceInfo[n](1)/identificationInfo[](11)/resourceName[n](6)@

 

where [] gives the element's cardinality and () = count(element + siblings).

In this example, resourceName[n](6) means resourceName is unbounded and has 5 siblings.
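As a rough illustration of how such output lines could be read programmatically, here is a minimal Python sketch. It is not the XSL script mentioned above; the `Step` class, the `parse_path` function and the regular expression are hypothetical names of my own, and the sketch only assumes the `name[cardinality](count)` layout shown in the example line, with a trailing `@` for terminal nodes.

```python
import re
from dataclasses import dataclass

# One path step looks like "resourceName[n](6)": name, [cardinality], (count).
STEP = re.compile(r"^(?P<name>[^\[\]()/@]+)\[(?P<card>[^\]]*)\]\((?P<count>[^)]*)\)$")

@dataclass
class Step:
    name: str
    cardinality: str  # text inside [], e.g. "", "n" or "unbounded"
    count: str        # text inside (), e.g. "6" or "*"

def parse_path(line):
    """Split one output line into steps; return (steps, is_terminal)."""
    line = line.strip()
    is_terminal = line.endswith("@")
    if is_terminal:
        line = line[:-1]
    steps = []
    for part in line.split("/"):
        if not part or part == "...":
            continue  # skip the empty leading piece and any elided prefix
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unrecognised step: {part!r}")
        steps.append(Step(match["name"], match["card"], match["count"]))
    return steps, is_terminal

steps, terminal = parse_path(
    "/resourceInfo[n](1)/identificationInfo[](11)/resourceName[n](6)@"
)
for step in steps:
    print(step.name, repr(step.cardinality), step.count)
print("terminal:", terminal)
```

The count is kept as a string on purpose, because some lines in the output use `(*)` rather than a number.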

 

Terminal nodes (ending with @) are simple typed elements that become data type properties.

 

Non-terminal nodes correspond to 'embedded' XML elements. They are all complex elements and, generally, they have @type or @ref attributes in the schema. A few are described locally (complex elements with neither @type nor @ref).

 

In principle, complex elements (all non-terminal ones) should each generate a Class plus an Object Property. This, however, 'over-generates' the resulting graph. For nodes marked with [], it is often better not to follow the rule.

 

- Nodes with [](1) and [](*) can be removed (these are nodes with cardinality 1 and no siblings)

 

for example:

 

         .../creationInfo[](17)/creationTool[unbounded](4)/targetResourceNameURI[](1)@

 

where:     X  creationTool  uri .

 

is better than:    X  creationTool [y  targetResourceNameUri  uri ] .

 

similarly in:

 

          .../evaluationReport[n](9)/documentInfo[](*)/title[unbounded](20)@

 

 X evaluationReport [ Y a documentType ; title 'some title' ]

 

is better than

 

X evaluationReport [y a documentationInfoType ; documentInfo [ z a documentInfoType ; title 'some title']].
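The removal rule for [](1) and [](*) nodes can be sketched in a few lines of Python. This is only my reading of the rule as stated in this message, not the actual procedure used for the IULA mapping; `is_removable` and `simplify` are hypothetical helper names, and `*` is treated as a literal marker because its exact meaning is not spelled out here.

```python
import re

# One path step looks like "documentInfo[](*)": name, [cardinality], (count).
STEP = re.compile(r"^([^\[\]()/@]+)\[([^\]]*)\]\(([^)]*)\)$")

def is_removable(cardinality, count):
    """Rule from the message: empty [] combined with (1) or (*)."""
    return cardinality == "" and count in ("1", "*")

def simplify(line):
    """Return the element names that survive after dropping removable steps."""
    body = line.strip().rstrip("@")
    kept = []
    for part in body.split("/"):
        if not part or part == "...":
            continue  # skip the empty leading piece and any elided prefix
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unrecognised step: {part!r}")
        name, cardinality, count = match.groups()
        if not is_removable(cardinality, count):
            kept.append(name)
    return kept

print(simplify(".../creationInfo[](17)/creationTool[unbounded](4)/targetResourceNameURI[](1)@"))
# ['creationInfo', 'creationTool']
print(simplify(".../evaluationReport[n](9)/documentInfo[](*)/title[unbounded](20)@"))
# ['evaluationReport', 'title']
```

Note that creationInfo[](17) is kept: it has empty brackets but a count of 17, so it falls under the "needs careful revision" case rather than the automatic removal rule.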

 

Compare the following lines:

 

        .../licenceInfo[n](5)/licensor[n](11)/personInfo[](*)/surname[n](6)@

 

       .../metadataInfo[](11)/metadataCreator[n](9)/surname[n](6)@

 

 

In the first case the personInfo node is superfluous (here licensor is typed as Actor which in turn is defined as a choice between Person and Organisation. XML does not help!!)

 

In the second case, metadataCreator is typed as Person. The corresponding XML instances show this 'problem':

 

<metadataInfo>

    <metadataCreationDate>2006-05-04</metadataCreationDate>

    <metadataCreator>

        <surname lang="en-US">surname0</surname>

        <givenName lang="en-US">givenName0</givenName>

        <sex>male</sex>

<licenceInfo>

    <licence>CC-BY</licence>...

    <licensor>

        <personInfo>

            <surname lang="en-US">surname0</surname>

            <givenName lang="en-US">givenName0</givenName>

            <sex>male</sex>

These are XML problems which can be easily addressed in OWL, by having something like: an Actor superclass with subclasses for Person & Organisation; one metadataCreator property with range Person and one licensor property with range Actor.

 

X metadataCreator [ y a Person ; surname ?surname ] .

X licensor [ y a Person ; surname ?surname ] .
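To make the Actor/Person/Organisation idea concrete, here is a small Python sketch of the class hierarchy and the two property ranges. It is a simplification under stated assumptions: the dictionaries and the `is_a` and `fits_range` helpers are hypothetical, and the sketch treats a range as a check on the value's class, whereas in OWL a range declaration licenses inference rather than validation.

```python
# Class hierarchy and ranges as suggested in the message (names are illustrative).
SUBCLASS_OF = {"Person": "Actor", "Organisation": "Actor"}
RANGE = {"metadataCreator": "Person", "licensor": "Actor"}

def is_a(cls, ancestor):
    """True if cls equals ancestor or reaches it by following SUBCLASS_OF."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

def fits_range(prop, value_class):
    """Check whether a value of value_class is compatible with prop's range."""
    return is_a(value_class, RANGE[prop])

print(fits_range("licensor", "Person"))               # True
print(fits_range("licensor", "Organisation"))         # True
print(fits_range("metadataCreator", "Organisation"))  # False
```

With this arrangement no intermediate personInfo node is needed: the licensor value is simply typed as Person or Organisation.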

 

- other [] nodes need careful revision.

 

for example:

 

/resourceInfo[n](1)/identificationInfo[](11)/resourceName[n](6)@

/resourceInfo[n](1)/identificationInfo[](11)/description[n](6)@

/resourceInfo[n](1)/identificationInfo[](11)/resourceShortName[n](6)@

/resourceInfo[n](1)/identificationInfo[](11)/url[n](6)@

/resourceInfo[n](1)/identificationInfo[](11)/metaShareId[](6)@

/resourceInfo[n](1)/identificationInfo[](11)/identifier[n](6)@

 

also look at:

 

/resourceInfo[n](1)/resourceComponentType[](11)/toolServiceInfo[](*)/toolServiceEvaluationInfo[](9)/evaluationReport[n](9)/documentInfo[](*)/documentType[](20)@

 

 

 

 

2015-01-28 13:40 GMT+01:00 Penny Labropoulou <penny@ilsp.gr>:

 

 

On Tue, Jan 27, 2015 at 10:03 PM, Penny Labropoulou <penny@ilsp.gr> wrote:

Hi John and all.

Thanx for the quick work!

Below are a few comments/replies in between the lines.

 

1) Some names have been shortened, e.g., 'ConformanceToBestStandardsAndPractices' -> 'StandardsBestPractices', should we accept such names or stay true to MetaShare?
I think we should decide this on a case-by-case basis; although some names are long, they are self-explanatory. In general, at ld4lt we have changed some names (e.g. resource to language resource) when it was agreed that the new label is better.

Hmm... change for the sake of change is difficult, particularly when it is only a small part of the vocabulary, that creates gotchas. 

There are some typos that we have spotted and also comments raised by various.


2) A lot of MetaShare names have (unnecessarily) the words 'Info', 'Type' or 'InfoType', we could eliminate these.
All “info” elements are in fact component names: in accordance with the CMDI principles, elements (and other components) are grouped into semantically coherent components. For instance, the identificationInfo groups together elements that are used for the identification of a resource, such as the resourceId, a url used as landing page, the resourceName and shortName, the description etc. If I have understood correctly, this structure is not needed (or not good practice) in RDF, which is why these components have already been eliminated in the IULA/UPF mapping.

“type” elements are used in MetaShare for components that can be re-used: e.g. persons can be licensors, contact points, resource creators etc., but in all cases they are encoded using the personInfoType, which groups together given name, surname, communication information etc. Again, if I understand correctly, this is not mapped as such in RDF.

Yeah that is my feeling too, I would like to shorten the names, however it seems hard to do this consistently as it would create clashes, e.g., ActualUse/ActualUseInfo, DocumentType/DocumentInfo


3) IULA have split the AnnotationType class into 5 subclasses (DiscourseAnnotation, etc.)

That’s an improvement from the original model and I suggest we stick to it.

4) There are many properties suggested by IULA or in the 'DISTRIBUTION' model that have no correspondence in the MetaShare data... we should discuss these on a case-by-case basis, right?

We have already discussed the distribution and licensing module with Victor and have come up with a proposal that re-introduces some of the original MetaShare elements that were not mapped in the IULA/UPF version and uses (mainly) the ODRL and CC vocabularies; the general ideas are to be found at https://www.w3.org/community/ld4lt/wiki/Metashare_vocabulary_for_licenses and https://www.w3.org/community/ld4lt/wiki/Examples and the mappings were documented in the previous googlesheet. I will add these to the new googlesheet by next week.

I incorporated all the functional (non-documentary) information from the distribution model already... or at least I tried, let me know if I missed anything.

Ok; to be checked

5) The Prev. Google Doc proposed mapping to both SWRC and BIBO, do we need to do BIBO as well (SWRC seems sufficient)?

6) I added the license modelling that LingHub does in ODRL, could one of our ODRL experts look at it and fix the last one?

Please, see also the two wikis on licensing, especially the examples. And as discussed, together with Victor we will provide a file with the RDF representations in odrl of the licenses used in MetaShare (of course, only of those that have not already been RDFized). 

This refers to "R4 To neatly represent conditions of use"... but I couldn't find the structured definitions of conditions of use so I wrote my own in the sheet titled "License Modelling"

To be checked and finalized by Monday.

7) Some property values, especially resource types, such as ontology or corpus were created as classes in the Google Doc, shall we confirm this usage pattern?

This needs some more thinking, checking the various cases. Is there a list of these?

This seems to be individuals of the classes 'ResourceType' and 'LexicalConceptualResourceType', approximately, here are the lists for reference:

In Prev. Google Doc: BabelNet*, ComputationalLexicon, Corpus, CorpusAudio*, CorpusCollection*, CorpusImage*, CorpusText*, CorpusTextNgram*, CorpusTextNumerical*, CorpusVideo*, Framenet, LexicalConceptualResource, Lexicon, MachineReadableDictionary, Ontology, TerminologicalResource, Thesaurus, ToolService*, WordList, WordNet

From Metashare: computationalLexicon, framenet, lexicon, machineReadableDictionary, ontology, other*, terminologicalResource, thesaurus, wordList, wordnet, corpus, languageDescription*, lexicalConceptualResource

*Unique elements
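For anyone wanting to re-check the starred entries, the comparison can be reproduced with a short Python sketch. It assumes that names match when they differ only in letter case (e.g. WordNet vs wordnet), which is my assumption rather than an agreed rule, and the variable names are hypothetical.

```python
# The two lists from this message, with the asterisks removed.
prev_doc = [
    "BabelNet", "ComputationalLexicon", "Corpus", "CorpusAudio",
    "CorpusCollection", "CorpusImage", "CorpusText", "CorpusTextNgram",
    "CorpusTextNumerical", "CorpusVideo", "Framenet",
    "LexicalConceptualResource", "Lexicon", "MachineReadableDictionary",
    "Ontology", "TerminologicalResource", "Thesaurus", "ToolService",
    "WordList", "WordNet",
]
metashare = [
    "computationalLexicon", "framenet", "lexicon",
    "machineReadableDictionary", "ontology", "other",
    "terminologicalResource", "thesaurus", "wordList", "wordnet", "corpus",
    "languageDescription",
    "lexicalConceptualResource",  # spelled as in the Prev. Google Doc list (assumption)
]

# Index each list by lower-cased name so that case differences do not count.
prev_by_key = {name.lower(): name for name in prev_doc}
ms_by_key = {name.lower(): name for name in metashare}

only_prev = sorted(prev_by_key[k] for k in prev_by_key.keys() - ms_by_key.keys())
only_ms = sorted(ms_by_key[k] for k in ms_by_key.keys() - prev_by_key.keys())

print("Only in Prev. Google Doc:", only_prev)
print("Only in Metashare:", only_ms)
```

Under that matching assumption, the nine names unique to the previous Google Doc and the two unique to Metashare coincide with the starred entries above.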

We might need a telco discussion for this, but first let me check the current mappings.

 

8) See attached diagram. There is a big difference in granularity between the XSD and IULA-UPF's ontology. For example, there are 4 tags between the resource and its actual usage in the XML, e.g.,

<resourceInfo> ...

  <usageInfo> ...

    <actualUsageInfo> ....

      <useNLPspecific>parsing</useNLPspecific> ....

Whereas in the IULA model this is considerably simplified to

:resource a ms:Resource ;

  ms:actualUse ms:parsing   

 

This would be great, but it also loses information, for example, the IULA schema associates the availability with the Resource. However, the XSD schema associates an availability with each Distribution (download file). In fact, there are resources that have different availability for different downloads (e.g., BabelNet), so there is information loss here. Thus, LingHub is very conservative and sticks to the XSD, e.g.,

:resource a ms:ResourceInfo ;

  ms:usageInfo [

    ms:actualUsageInfo [

      ms:useNLPspecific ms:parsing ] ]

What shall we recommend here?
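To make the trade-off concrete, here is a small Python sketch that walks the XML fragment above and prints both a nested, XSD-faithful property chain and a flattened single property. It is not how LingHub or the IULA mapping is implemented; `leaf_paths` is a hypothetical helper, the fragment is completed with closing tags for the sake of parsing, and the sketch does not attempt the renaming to actualUse that the IULA model applies.

```python
import xml.etree.ElementTree as ET

# The usage fragment from this message, closed so that it parses.
XML = """<resourceInfo>
  <usageInfo>
    <actualUsageInfo>
      <useNLPspecific>parsing</useNLPspecific>
    </actualUsageInfo>
  </usageInfo>
</resourceInfo>"""

def leaf_paths(elem, prefix=()):
    """Yield (tuple_of_tags, text) for every element without child elements."""
    path = prefix + (elem.tag,)
    children = list(elem)
    if not children:
        yield path, (elem.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)

root = ET.fromstring(XML)
for path, value in leaf_paths(root):
    nested = " / ".join(path[1:])  # one property per XML level below the root
    flat = path[-1]                # only the final element is kept
    print("nested:", nested, "->", value)
    print("flat:  ", flat, "->", value)
```

The flattened form drops the intermediate tags, which is harmless here but is exactly where per-distribution information such as availability would be lost.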

 

Again, discuss on a case-by-case basis. For instance, for availability, we have re-introduced the distribution element, as otherwise we lose semantics. Other cases need a closer look. The grouping into components made sense in XSD because it brought related elements together. I will have to go through them and explain, for each case, why the grouping was introduced, so that we can decide whether it should also remain in the RDF mapping. Is there an easy way of spotting these cases?

OK, we should discuss this in a telco.

 

A final question: how will we add the comments/decisions from the previous googlesheet to the current one? As said, I can do this for the distribution/licensing module elements but for the rest?

Add any comments you want (possibly copied from previous doc). Apart from that I would like to keep the sheet itself clean until the next ld4lt telco at least.

Regards,
John

 

Best,

Penny

 

 





 

-- 

Marta Villegas
marta.villegas@gmail.com





 






 

-- 

Jorge Gracia, PhD
Ontology Engineering Group
Artificial Intelligence Department
Universidad Politécnica de Madrid
http://jogracia.url.ph/web/

Received on Friday, 6 February 2015 08:48:35 UTC