RE: Metashare as used by LingHub from Penny Labropoulou on 2015-01-28 (public-ld4lt@w3.org from January 2015)

From: Penny Labropoulou <penny@ilsp.gr>
Date: Wed, 28 Jan 2015 14:34:35 +0200
To: "'John P. McCrae'" <jmccrae@cit-ec.uni-bielefeld.de>, "'Marta Villegas'" <marta.villegas@gmail.com>
Cc: <public-ld4lt@w3.org>
Message-ID: <26b301d03af6$c74e55f0$55eb01d0$@ilsp.gr>

Hi again; I'm not going to get into the discussion for the XSD – OWL mappings; you're definitely the experts!

Just a clarification: the "targetResourceNameURI" was meant to bring together 3 types of data: free text, url and link to another resource; this was not implemented for various reasons, but it was kept as a reminder for later implementation. The textNumericalContentInfo is a different case: it contains one element of type free text. I suppose that's why it's considered superfluous.

On Wed, Jan 28, 2015 at 12:38 PM, Marta Villegas <marta.villegas@gmail.com> wrote:

Dear all,

Some comments about the way we (IULA) proceeded when RDFying the MS model, which I hope will help:

General XSD2RDF rules can be summarised as follows:

XSD                                   OWL

xs:simpleType                         rdfs:Datatype

xs:simpleType with xs:enumeration     rdfs:Datatype, plus an instance for every enumerated value

I assume you mean owl:Class here ^^

xs:complexType                        owl:Class

global element with simple type       rdfs:Datatype

local element with complex type       owl:ObjectProperty

local element with simple type        owl:DatatypeProperty
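As a rough illustration, the rules above can be written as a lookup table. This is only a sketch: the names RULES and owl_term are made up for this note and are not taken from the IULA scripts. The enumeration row is left out because the thread questions whether its target should be rdfs:Datatype or owl:Class.

```python
# Sketch of the XSD -> OWL rules listed above.
# Key: (XSD construct, scope, content type); value: OWL/RDFS term.
# RULES and owl_term are illustrative names, not part of any real tool.
RULES = {
    ("simpleType", None, None): "rdfs:Datatype",
    ("complexType", None, None): "owl:Class",
    ("element", "global", "simple"): "rdfs:Datatype",
    ("element", "local", "complex"): "owl:ObjectProperty",
    ("element", "local", "simple"): "owl:DatatypeProperty",
}


def owl_term(construct, scope=None, content=None):
    """Return the term the rules assign to an XSD construct."""
    try:
        return RULES[(construct, scope, content)]
    except KeyError:
        # Combinations not covered by the rules are reported, not guessed.
        raise ValueError(f"no rule for {construct!r}, {scope!r}, {content!r}")


print(owl_term("complexType"))
print(owl_term("element", scope="local", content="complex"))
```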

The MS model follows a ‘document-centric’ approach (in some cases, elements merely act as a way to organize information for human consumption). Applying the rules above to the original XSD schema would yield a graph filled with ‘superfluous’ nodes. Thus, we decided to identify these nodes before the actual RDFication process.

The criteria applied when 'simplifying' take into account the (i) tree structure of the nodes, (ii) their cardinality and (iii) the XPath axes.

rule1:

Embedded complex elements with cardinalityMax=1 can be removed, provided they contain neither text nor attributes. This allows for a simplification of the model, for example:

I am using the schema below, which does not seem to have this property; is this the same one you are using?

http://metashare.ilsp.gr/META-XMLSchema/v3.0/

resource/identificationInfo/resourceName

resource/identificationInfo/description

resource/identificationInfo/resourceShortName

resource/identificationInfo/url

…

becomes

resource/resourceName

resource/description

resource/resourceShortName

resource/url

…

Removing the identificationInfo element implies removing the resulting relation and class. Assuming that (i) a class in OWL is a classification of individuals into groups which share common characteristics, and that (ii) individuals belong to some class, we infer that for an XML element to become an OWL class there should exist individuals belonging to that class. (It is hard to 'imagine' individuals of an IdentificationInfo class.)

Note that such a rule can be applied only if it does not result in ‘sibling conflicts’. Nodes define the scope in which embedded elements occur and this needs to be taken into account. If we remove the identificationInfo element, its child nodes become children of the resourceInfo node. This means that resourceName, description, resourceShortName, etc. become sibling nodes of contactPerson, validationInfo, etc. Promoted nodes need to be unique on their new axis.

(in the article we report problems concerning naming conventions in the MS model. Example:

resource/metadataInfo/source

resource/metadataInfo/originalMetadataSchema

resource/metadataInfo/originalMetadataLink

resource/metadataInfo/metadataLanguageName

resource/metadataInfo/metadataLanguageId

resource/metadataInfo/metadataLastDateUpdated

resource/metadataInfo/revision****

resource/metadataInfo/metadataCreator

resource/versionInfo/version

resource/versionInfo/revision****

resource/versionInfo/lastDateUpdated

resource/versionInfo/updateFrequency

In this case sibling conflicts forbid the removal of metadataInfo and versionInfo, unless we rename things.)
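The sibling-conflict check can be sketched in a few lines. This is an illustration only, not the script mentioned in this thread: promote_children and sibling_conflicts are hypothetical helper names, and the paths are the ones listed above. A full check would also compare promoted nodes against the siblings already present under the parent.

```python
from collections import Counter


def promote_children(paths, wrappers):
    """Drop the given wrapper segments from slash-separated element paths."""
    return [
        "/".join(seg for seg in path.split("/") if seg not in wrappers)
        for path in paths
    ]


def sibling_conflicts(paths):
    """Return the paths that occur more than once after promotion."""
    counts = Counter(paths)
    return sorted(path for path, n in counts.items() if n > 1)


paths = [
    "resource/metadataInfo/source",
    "resource/metadataInfo/originalMetadataSchema",
    "resource/metadataInfo/originalMetadataLink",
    "resource/metadataInfo/metadataLanguageName",
    "resource/metadataInfo/metadataLanguageId",
    "resource/metadataInfo/metadataLastDateUpdated",
    "resource/metadataInfo/revision",
    "resource/metadataInfo/metadataCreator",
    "resource/versionInfo/version",
    "resource/versionInfo/revision",
    "resource/versionInfo/lastDateUpdated",
    "resource/versionInfo/updateFrequency",
]

flat = promote_children(paths, {"metadataInfo", "versionInfo"})
conflicts = sibling_conflicts(flat)
print(conflicts)  # ['resource/revision']
```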

We identified 11 wrapping elements in the MS schema.

Interesting... I will see if I can remove some elements using the same principle

rule2

When a complex element contains one and only one simple element, the simple element can be removed. Example:

/validationTool/targetResourceNameURI.

becomes

validationTool

We identified a total of 9 ‘superfluous nodes’ in the MS schema occurring in the following contexts: accessTool, annotationTool, creationTool, derivedResource, originalSource, relatedResource, resourceAssociatedWith, textNumericalContentInfo and validationTool.
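One possible way to spot rule 2 candidates is to scan the schema for complex elements whose content is a single child of a built-in type. The sketch below runs on a toy fragment written for this note; it is not the real META-SHARE schema and not the IULA script. It assumes inline complexType/sequence declarations and treats any type starting with "xs:" as simple. It does not check cardinality.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Toy fragment for illustration only; the real schema is structured differently.
TOY_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="validationTool">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="targetResourceNameURI" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="versionInfo">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="version" type="xs:string"/>
        <xs:element name="revision" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def single_simple_child_elements(xsd_text):
    """List (parent, child) pairs where the parent has exactly one simple child."""
    root = ET.fromstring(xsd_text)
    hits = []
    for elem in root.iter(XS + "element"):
        children = elem.findall(f"{XS}complexType/{XS}sequence/{XS}element")
        # Simplification: a built-in xs: type stands in for "simple element".
        if len(children) == 1 and children[0].get("type", "").startswith("xs:"):
            hits.append((elem.get("name"), children[0].get("name")))
    return hits


candidates = single_simple_child_elements(TOY_XSD)
print(candidates)  # [('validationTool', 'targetResourceNameURI')]
```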

Interestingly, I got nearly the same result but for a different reason... that is I assumed that the meta-type targetResourceInfoType was exactly a URL... this covers all the cases you identified other than textNumericalContentInfo

Personally, I'd rather follow the MS model as accurately as possible (and avoid never-ending discussions), but I think you cannot avoid addressing (i) the issues that arise when moving from XSD to RDF and (ii) possible improvements (not many) to the original schema that have already been identified by the MS team.

You can find much more detailed info in an article in LREC, and I can send you the list of 'removed' elements and the script used to identify them.

That would be super

Regards,

John

Hope this helps

2015-01-28 11:28 GMT+01:00 John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>:

On Tue, Jan 27, 2015 at 10:03 PM, Penny Labropoulou <penny@ilsp.gr> wrote:

Hi John and all.

Thanx for the quick work!

Below are a few comments/replies in between the lines.

1) Some names have been shortened, e.g., 'ConformanceToBestStandardsAndPractices' ->
'StandardsBestPractices', should we accept such names or stay true to MetaShare?
I think we should decide this on a case-by-case basis; although some names are long, they are self-explanatory. In general, at ld4lt we have changed some names (e.g. resource to language resource) when it was agreed that the new label is better.

Hmm... change for the sake of change is difficult, particularly when it affects only a small part of the vocabulary; that creates gotchas.

2) A lot of MetaShare names have (unnecessarily) the words 'Info', 'Type' or 'InfoType', we could eliminate these.
All “info” elements are in fact component names: in accordance with the CMDI principles, elements (and other components) are grouped into semantically coherent components. For instance, the identificationInfo groups together elements that are used for the identification of a resource, such as the resourceId, a url used as landing page, the resourceName and shortName, the description etc. If I have understood correctly, this structure is not needed/not a good practice for RDF and this is why they have already been eliminated in the IULA/UPF mapping.

“type” elements are used in MetaShare for components that can be re-used: e.g. persons can be licensors, contact points, resource creators etc., but in all cases they are encoded using the personInfoType, which groups together given name, surname, communication information etc. Again, I think this is not mapped in RDF as such, if I understand well.

Yeah that is my feeling too, I would like to shorten the names, however it seems hard to do this consistently as it would create clashes, e.g., ActualUse/ActualUseInfo, DocumentType/DocumentInfo
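A quick way to see where shortening would clash is to strip the suffixes and group names that collapse to the same short form. This is a sketch; strip_suffix and clashes are hypothetical helpers, and the name list contains only the examples mentioned in this thread.

```python
from collections import defaultdict

# Longest suffix first, so 'InfoType' is tried before 'Info' and 'Type'.
SUFFIXES = ("InfoType", "Info", "Type")


def strip_suffix(name):
    """Remove one trailing 'InfoType', 'Info' or 'Type' from a name."""
    for suffix in SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return name


def clashes(names):
    """Map each shortened name to the original names that collapse onto it."""
    groups = defaultdict(list)
    for name in names:
        groups[strip_suffix(name)].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


names = ["ActualUse", "ActualUseInfo", "DocumentType", "DocumentInfo", "personInfoType"]
result = clashes(names)
print(result)
```

Here ActualUse/ActualUseInfo and DocumentType/DocumentInfo are reported as clashes, while personInfoType shortens without conflict.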

3) IULA have split the AnnotationType class into 5 subclasses (DiscourseAnnotation, etc.)

That’s an improvement from the original model and I suggest we stick to it.

4) There are many properties suggested by IULA or in the 'DISTRIBUTION' model that have no correspondence in the MetaShare data... we should discuss these on a case-by-case basis, right?

We have already discussed the distribution and licensing module with Victor and have come up with a proposal re-introducing some of the original MetaShare elements that were not mapped in the IULA/UPF version and using (mainly) the odrl and cc vocabularies; the general ideas are to be found at https://www.w3.org/community/ld4lt/wiki/Metashare_vocabulary_for_licenses and https://www.w3.org/community/ld4lt/wiki/Examples and the mappings were documented in the previous googlesheet. I will add these to the new googlesheet by next week.

I incorporated all the functional (non-documentary) information from the distribution model already... or at least I tried, let me know if I missed anything.

5) The Prev. Google Doc proposed mapping to both SWRC and BIBO, do we need to do BIBO as well (SWRC seems sufficient)?

6) I added the license modelling that LingHub does in ODRL, could one of our ODRL experts look at it and fix the last one?

Please, see also the two wikis on licensing, especially the examples. And as discussed, together with Victor we will provide a file with the RDF representations in odrl of the licenses used in MetaShare (of course, only of those that have not already been RDFized).

This refers to "R4 To neatly represent conditions of use"... but I couldn't find the structured definitions of conditions of use so I wrote my own in the sheet titled "License Modelling"

7) Some property values, especially resource types, such as ontology or corpus were created as classes in the Google Doc, shall we confirm this usage pattern?

This needs some more thinking, checking the various cases. Is there a list of these?

These seem to be, approximately, individuals of the classes 'ResourceType' and 'LexicalConceptualResourceType'; here are the lists for reference:

In Prev. Google Doc: BabelNet*, ComputationalLexicon, Corpus, CorpusAudio*, CorpusCollection*, CorpusImage*, CorpusText*, CorpusTextNgram*, CorpusTextNumerical*, CorpusVideo*, Framenet, LexicalConceptualResource, Lexicon, MachineReadableDictionary, Ontology, TerminologicalResource, Thesaurus, ToolService*, WordList, WordNet

From Metashare: computationalLexicon, framenet, lexicon, machineReadableDictionary, ontology, other*, terminologicalResource, thesaurus, wordList, wordnet, corpus, languageDescription*, lexicalConceptualResource

*Unique elements
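The starred entries can be re-derived with a case-insensitive comparison of the two lists. This is a sketch: unique_to_each is a hypothetical helper, and the MetaShare value is assumed to be spelled lexicalConceptualResource.

```python
prev_doc = [
    "BabelNet", "ComputationalLexicon", "Corpus", "CorpusAudio",
    "CorpusCollection", "CorpusImage", "CorpusText", "CorpusTextNgram",
    "CorpusTextNumerical", "CorpusVideo", "Framenet",
    "LexicalConceptualResource", "Lexicon", "MachineReadableDictionary",
    "Ontology", "TerminologicalResource", "Thesaurus", "ToolService",
    "WordList", "WordNet",
]

# Assumption: 'lexicalConceptualResource' is the intended spelling.
metashare = [
    "computationalLexicon", "framenet", "lexicon",
    "machineReadableDictionary", "ontology", "other",
    "terminologicalResource", "thesaurus", "wordList", "wordnet", "corpus",
    "languageDescription", "lexicalConceptualResource",
]


def unique_to_each(a, b):
    """Return the names found only in a and only in b, ignoring case."""
    a_keys = {name.lower() for name in a}
    b_keys = {name.lower() for name in b}
    only_a = sorted(name for name in a if name.lower() not in b_keys)
    only_b = sorted(name for name in b if name.lower() not in a_keys)
    return only_a, only_b


only_prev, only_ms = unique_to_each(prev_doc, metashare)
print(only_prev)
print(only_ms)  # ['languageDescription', 'other']
```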

8) See attached diagram. There is a big difference in granularity between the XSD and IULA-UPF's ontology. For example, there are 4 tags between the resource and its actual usage in the XML, e.g.,

<resourceInfo> ...

<usageInfo> ...

<actualUsageInfo> ....

<useNLPspecific>parsing</useNLPspecific> ....

Whereas in the IULA model this is considerably simplified to

:resource a ms:Resource ;

ms:actualUse ms:parsing

This would be great, but it also loses information. For example, the IULA schema associates the availability with the Resource, whereas the XSD schema associates an availability with each Distribution (download file). In fact, there are resources that have different availability for different downloads (e.g., BabelNet), so there is information loss here. Thus, LingHub is very conservative and sticks to the XSD, e.g.,

:resource a ms:ResourceInfo ;

ms:usageInfo [

ms:actualUsageInfo [

ms:useNLPspecific ms:parsing ] ]
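The two options can be compared side by side with a small sketch that turns the nested XML structure into triples, once with a blank node per intermediate element and once flattened. The function names and the rename table (useNLPspecific to actualUse) are assumptions made for this illustration, based only on the two snippets above; neither LingHub nor the IULA conversion is claimed to work this way.

```python
from itertools import count


def nested_triples(subject, tree, fresh=None):
    """Conservative mapping: one blank node per intermediate XML element."""
    if fresh is None:
        fresh = count(1)
    triples = []
    for prop, value in tree.items():
        if isinstance(value, dict):
            node = f"_:b{next(fresh)}"
            triples.append((subject, f"ms:{prop}", node))
            triples.extend(nested_triples(node, value, fresh))
        else:
            triples.append((subject, f"ms:{prop}", f"ms:{value}"))
    return triples


def flattened_triples(subject, tree, rename=None):
    """Simplified mapping: attach every leaf directly to the subject."""
    rename = rename or {}
    triples = []
    for prop, value in tree.items():
        if isinstance(value, dict):
            triples.extend(flattened_triples(subject, value, rename))
        else:
            triples.append((subject, f"ms:{rename.get(prop, prop)}", f"ms:{value}"))
    return triples


usage = {"usageInfo": {"actualUsageInfo": {"useNLPspecific": "parsing"}}}

nested = nested_triples(":resource", usage)
flat = flattened_triples(":resource", usage, {"useNLPspecific": "actualUse"})

for triple in nested + flat:
    print(" ".join(triple))
```

The flattened form needs one triple where the nested form needs three, but it can no longer say which intermediate node a value belonged to, which is the information loss described above for availability and Distribution.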

What shall we recommend here?

Again, discuss on a case-by-case basis. For instance, for availability, we have re-introduced the distribution element, as otherwise we lose in semantics. For other cases, I think we should see them more closely. The grouping into components made sense in XSD because it brought together elements. I will have to look at them more closely and explain for each case what this grouping was meant for, so that we can decide if this should also remain in the RDF mapping. Is there an easy way of spotting these cases?

OK, we should discuss this in a telco.

A final question: how will we add the comments/decisions from the previous googlesheet to the current one? As said, I can do this for the distribution/licensing module elements but for the rest?

Add any comments you want (possibly copied from the previous doc). Apart from that, I would like to keep the sheet itself clean until the next ld4lt telco at least.

Regards,
John

Best,

Penny

Marta Villegas
marta.villegas@gmail.com

Received on Wednesday, 28 January 2015 12:35:52 UTC