Re: Metashare as used by LingHub from John P. McCrae on 2015-01-28 (public-ld4lt@w3.org from January 2015)

From: John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>
Date: Wed, 28 Jan 2015 13:15:16 +0100
To: Marta Villegas <marta.villegas@gmail.com>
Cc: Penny Labropoulou <penny@ilsp.gr>, public-ld4lt@w3.org
Message-ID: <CAC5njqqYcQXsSDSTZu+neguU7MOvH-pog9zsHormyir8DW1-Mw@mail.gmail.com>
On Wed, Jan 28, 2015 at 12:38 PM, Marta Villegas <marta.villegas@gmail.com>
wrote:

> Dear all,
>
> Some comments about the way we (IULA) proceeded when RDFying the MS model,
> which I hope will help:
>
> General XSD2RDF rules can be summarised as follows:
>
> *XSD*                              *OWL*
> xs:simpleType                      rdfs:Datatype
> xs:simpleType with xs:enumeration  rdfs:Datatype, plus an instance for
>                                    every enumerated value.
>
I assume you mean owl:Class here ^^

> xs:complexType                     owl:Class
> global element with simple type    rdfs:Datatype
> local element with complex type    owl:ObjectProperty
> local element with simple type     owl:DatatypeProperty
>
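A minimal sketch of how the type and local-element rows of this table could be applied mechanically. The toy schema and the function name classify are invented for illustration and are not the META-SHARE XSD or the IULA script; the enumeration and global-element rows are left out, and every element is assumed to reference a named type.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Toy schema, invented for illustration; it is not the META-SHARE XSD.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="langCode">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
  <xs:complexType name="personInfoType">
    <xs:sequence>
      <xs:element name="surname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="resourceInfoType">
    <xs:sequence>
      <xs:element name="contactPerson" type="personInfoType"/>
      <xs:element name="languageId" type="langCode"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
""".strip()


def classify(schema_xml):
    """Map each named type and local element to the OWL/RDFS term from the table."""
    root = ET.fromstring(schema_xml)
    complex_names = {t.get("name") for t in root.findall(XS + "complexType")}
    mapping = {}
    for t in root.findall(XS + "simpleType"):
        mapping[t.get("name")] = "rdfs:Datatype"
    for t in root.findall(XS + "complexType"):
        mapping[t.get("name")] = "owl:Class"
        # Local elements are the ones declared inside a complex type.
        # Assumption: each one points at a named type via its type attribute.
        for el in t.iter(XS + "element"):
            is_complex = el.get("type") in complex_names
            mapping[el.get("name")] = (
                "owl:ObjectProperty" if is_complex else "owl:DatatypeProperty"
            )
    return mapping


if __name__ == "__main__":
    for name, term in classify(SCHEMA).items():
        print(name, "->", term)
```

Elements with anonymous inline types would need extra handling that this sketch does not attempt.
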
> The MS model follows a ‘document-centric’ approach (in some cases, elements
> merely act as a way to organize information for human consumption).
> Applying the rules above to the original XSD schema would produce a
> graph filled with ‘superfluous’ nodes. Thus, we decided to identify these
> nodes before the actual RDFication process.
>
> The criteria applied when 'simplifying' take into account the (i) tree
> structure of the nodes, (ii) their cardinality and (iii) the XPath axes.
>
> *rule1:*
> Embedded complex elements with cardinalityMax=1 can be removed, provided
> they contain neither text nor attributes. This allows for a simplification
> of the model, for example:
>
I am using the schema below, which does not seem to have this property. Is
this the same one you are using?

http://metashare.ilsp.gr/META-XMLSchema/v3.0/

>
> resource/*identificationInfo*/resourceName
> resource/*identificationInfo*/description
> resource/*identificationInfo*/resourceShortName
> resource/*identificationInfo*/url
> …
>
> becomes
>
> resource/resourceName
> resource/description
> resource/resourceShortName
> resource/url
> …
>
> Removing the *identificationInfo* element implies removing the resulting
> relation & class. Assuming that (i) a class in OWL is a classification of
> individuals into groups which share common characteristics, and that (ii)
> individuals belong to some Class, we infer that for an XML element to
> become an OWL Class it is expected that there exist individuals belonging
> to this Class (it is hard to 'imagine' individuals of the
> *identificationInfo* class).
>
> Note that such a rule can be applied provided this does not result in
> ‘sibling conflicts’. Nodes define the scope in which embedded elements
> occur and this needs to be taken into account. If we remove the
> *identificationInfo* element, its child nodes become children of the
> *resourceInfo* node. This means that *resourceName, description,
> resourceShortName*, etc. become sibling nodes of *contactPerson,
> validationInfo*, etc. Promoted nodes need to be unique in their new axis.
>
> In the article we report problems concerning naming conventions in the MS
> model. Example:
>
> resource/metadataInfo/source
> resource/metadataInfo/originalMetadataSchema
> resource/metadataInfo/originalMetadataLink
> resource/metadataInfo/metadataLanguageName
> resource/metadataInfo/metadataLanguageId
> resource/metadataInfo/metadataLastDateUpdated
> resource/metadataInfo/*revision*
> resource/metadataInfo/metadataCreator
>
> resource/versionInfo/version
> resource/versionInfo/*revision*
> resource/versionInfo/lastDateUpdated
> resource/versionInfo/updateFrequency
>
> In this case sibling conflicts prevent the removal of metadataInfo and
> versionInfo (unless we rename things).
>
> We identified 11 wrapping elements in the MS schema.
>
Interesting... I will see if I can remove some elements using the same
principle
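
A minimal sketch of how the rule 1 check, including the sibling-conflict test, could look. The Node class and the helper names are hypothetical and are not taken from the IULA script; the toy tree reuses only the element names quoted above, with cardinalities assumed to be 1.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    max_occurs: int = 1        # cardinalityMax of the element
    has_text: bool = False     # element carries text content
    has_attrs: bool = False    # element carries attributes
    children: list = field(default_factory=list)


def leaf(name):
    """A simple element that holds text (hypothetical helper)."""
    return Node(name, has_text=True)


def is_wrapper(node):
    """Rule 1 candidate: embedded complex element, max 1, no text, no attributes."""
    return (bool(node.children) and node.max_occurs == 1
            and not node.has_text and not node.has_attrs)


def removable_wrappers(parent):
    """Wrappers under `parent` whose children stay unique once promoted."""
    wrappers = [c for c in parent.children if is_wrapper(c)]
    # Names that would sit directly under `parent` if every wrapper were removed.
    promoted = Counter(g.name for w in wrappers for g in w.children)
    promoted.update(c.name for c in parent.children if not is_wrapper(c))
    return [w.name for w in wrappers
            if all(promoted[g.name] == 1 for g in w.children)]


# Toy tree built from the paths quoted in this thread (abridged).
resource = Node("resource", children=[
    Node("identificationInfo", children=[
        leaf("resourceName"), leaf("description"),
        leaf("resourceShortName"), leaf("url")]),
    Node("metadataInfo", children=[
        leaf("source"), leaf("revision"), leaf("metadataCreator")]),
    Node("versionInfo", children=[
        leaf("version"), leaf("revision"), leaf("lastDateUpdated")]),
])

if __name__ == "__main__":
    print(removable_wrappers(resource))
```

On this toy tree only identificationInfo is reported as removable, because revision would appear twice if metadataInfo and versionInfo were both removed.
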

>
> *rule2:*
> When a complex element contains one and only one simple element, the
> simple element can be removed. Example:
>
> /validationTool/*targetResourceNameURI*
>
> becomes
>
> validationTool
>
> We identified a total of 9 ‘superfluous nodes’ in the MS schema occurring
> in the following contexts: accessTool, annotationTool, creationTool,
> derivedResource, originalSource, relatedResource, resourceAssociatedWith,
> textNumericalContentInfo and validationTool.
>
Interestingly, I got nearly the same result but for a different reason...
that is, I assumed that the meta-type targetResourceInfoType was exactly a
URL... this covers all the cases you identified other than
textNumericalContentInfo.
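
A minimal sketch of the rule 2 check as I read it. The dictionary layout, with each complex element mapped to a list of (child name, is simple) pairs, and the function name collapsible are assumptions made for illustration, not the structure used by the IULA script.

```python
def collapsible(children_by_element):
    """Return {element: only_child} for elements whose single child is simple."""
    result = {}
    for element, children in children_by_element.items():
        # Rule 2: exactly one child, and that child is a simple element.
        if len(children) == 1 and children[0][1]:
            result[element] = children[0][0]
    return result


# Toy input using element names quoted in this thread.
example = {
    "validationTool": [("targetResourceNameURI", True)],
    "versionInfo": [("version", True), ("revision", True)],
}

if __name__ == "__main__":
    print(collapsible(example))
```

Here validationTool is reported with targetResourceNameURI as the child to drop, while versionInfo is left alone because it has more than one child.
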

>
>
> Personally, I would prefer to follow the MS model as accurately as
> possible (and avoid never-ending discussions), but I think you cannot avoid
> addressing (i) those issues that arise when moving from XSD to RDF and (ii)
> possible improvements (not many) to the original schema that are already
> identified by the MS team.
>
> You can find much more detailed info in an article in LREC, and I can send
> you the list of 'removed' elements and the script used to identify them.
>
That would be super

Regards,
John

>
> Hope this helps
>
>
>
>
> 2015-01-28 11:28 GMT+01:00 John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>:
>
>>
>>
>> On Tue, Jan 27, 2015 at 10:03 PM, Penny Labropoulou <penny@ilsp.gr>
>> wrote:
>>
>>> Hi John and all.
>>>
>>> Thanx for the quick work!
>>>
>>> Below are a few comments/replies in between the lines.
>>>
>>>
>>>
>>> 1) Some names have been shortened, e.g.,
>>> 'ConformanceToBestStandardsAndPractices' ->
>>> 'StandardsBestPractices', should we accept such names or stay true to
>>> MetaShare?
>>> I think we should decide this on a case-by-case basis; although some
>>> names are long, they are self-explanatory. In general, at ld4lt we have
>>> changed some names (e.g. resource to language resource) when it was agreed
>>> that the new label is better.
>>>
>> Hmm... change for the sake of change is difficult, particularly when it
>> affects only a small part of the vocabulary; that creates gotchas.
>>
>>>
>>> 2) A lot of MetaShare names have (unnecessarily) the words 'Info',
>>> 'Type' or 'InfoType', we could eliminate these.
>>> All “info” elements are in fact component names: in accordance with the
>>> CMDI principles, elements (and other components) are grouped into
>>> semantically coherent components. For instance, the identificationInfo
>>> groups together elements that are used for the identification of a
>>> resource, such as the resourceId, a url used as landing page, the
>>> resourceName and shortName, the description etc. If I have understood
>>> correctly, this structure is not needed/not a good practice for RDF and
>>> this is why they have been eliminated already in the IULA/UPF mapping.
>>>
>>> “type” elements are used in MetaShare for components that can be
>>> re-used: e.g. persons can be licensors, contact points, resource creators
>>> etc., but in all cases they are encoded using the personInfoType, which
>>> groups together given name, surname, communication information etc. Again,
>>> I think this is not mapped in RDF as such, if I understand correctly.
>>>
>> Yeah, that is my feeling too. I would like to shorten the names; however,
>> it seems hard to do this consistently as it would create clashes, e.g.,
>> ActualUse/ActualUseInfo, DocumentType/DocumentInfo
>>
>>>
>>> 3) IULA have split the AnnotationType class into 5 subclasses
>>> (DiscourseAnnotation, etc.)
>>>
>>> That’s an improvement from the original model and I suggest we stick to
>>> it.
>>>
>>> 4) There are many properties suggested by IULA or in the 'DISTRIBUTION'
>>> model that have no correspondence in the MetaShare data... we should
>>> discuss these on a case-by-case basis, right?
>>>
>>> We have already discussed with Victor the distribution and licensing
>>> module and have come up with a proposal re-introducing some of the original
>>> MetaShare elements that were not mapped in the IULA/UPF version and using
>>> the odrl (mainly) and cc vocabularies; the general ideas are to be found
>>> at
>>> https://www.w3.org/community/ld4lt/wiki/Metashare_vocabulary_for_licenses
>>> and https://www.w3.org/community/ld4lt/wiki/Examples and the mappings
>>> were documented in the previous googlesheet. I will add these to the new
>>> googlesheet by next week.
>>>
>> I incorporated all the functional (non-documentary) information from the
>> distribution model already... or at least I tried, let me know if I missed
>> anything.
>>
>>> 5) The Prev. Google Doc proposed mapping to both SWRC and BIBO, do we
>>> need to do BIBO as well (SWRC seems sufficient)?
>>>
>>> 6) I added the license modelling that LingHub does in ODRL, could one of
>>> our ODRL experts look at it and fix the last one?
>>>
>>> Please, see also the two wikis on licensing, especially the examples.
>>> And as discussed, together with Victor we will provide a file with the RDF
>>> representations in odrl of the licenses used in MetaShare (of course, only
>>> of those that have not already been RDFized).
>>>
>> This refers to "R4 To neatly represent conditions of use"... but I
>> couldn't find the structured definitions of conditions of use so I wrote my
>> own in the sheet titled "License Modelling"
>>
>>> 7) Some property values, especially *resource types*, such as *ontology*
>>> or *corpus* were created as classes in the Google Doc, shall we confirm
>>> this usage pattern?
>>>
>>> This needs some more thinking, checking the various cases. Is there a
>>> list of these?
>>>
>> These seem to be, approximately, individuals of the classes
>> 'ResourceType' and 'LexicalConceptualResourceType'. Here are the lists for
>> reference:
>>
>> In Prev. Google Doc: BabelNet*, ComputationalLexicon, Corpus,
>> CorpusAudio*, CorpusCollection*, CorpusImage*, CorpusText*,
>> CorpusTextNgram*, CorpusTextNumerical*, CorpusVideo*, Framenet,
>> LexicalConceptualResource, Lexicon, MachineReadableDictionary, Ontology,
>> TerminologicalResource, Thesaurus, ToolService*, WordList, WordNet
>>
>> From Metashare: computationalLexicon, framenet, lexicon,
>> machineReadableDictionary, ontology, other*, terminologicalResource,
>> thesaurus, wordList, wordnet, corpus, languageDescription*,
>> lexicalConceptualResource
>>
>> *Unique elements
>>
>>>
>>>
>>> 8) *See attached diagram.* There is a big difference in granularity
>>> between the XSD and IULA-UPF's ontology. For example, there are 4 tags
>>> between the resource and its actual usage in the XML, e.g.,
>>>
>>> <resourceInfo> ...
>>>
>>>   <usageInfo> ...
>>>
>>>     <actualUsageInfo> ....
>>>
>>>       <useNLPspecific>parsing</useNLPspecific> ....
>>>
>>> Whereas in the IULA model this is considerably simplified to
>>>
>>> :resource a ms:Resource ;
>>>
>>>   ms:actualUse ms:parsing
>>>
>>>
>>>
>>> This would be great, but it also loses information, for example, the
>>> IULA schema associates the *availability* with the *Resource*. However,
>>> the XSD schema associates an *availability* with each *Distribution*
>>> (download file). In fact, there are resources that have different
>>> availability for different downloads (e.g., BabelNet), so there is
>>> information loss here. Thus, LingHub is very conservative and sticks to the
>>> XSD, e.g.,
>>>
>>> :resource a ms:ResourceInfo ;
>>>
>>>   ms:usageInfo [
>>>
>>>     ms:actualUsageInfo [
>>>
>>>       ms:useNLPspecific ms:parsing ] ]
>>>
>>> What shall we recommend here?
>>>
>>>
>>>
>>> Again, discuss on a case-by-case basis. For instance, for availability,
>>> we have re-introduced the distribution element, as otherwise we lose
>>> semantics. For other cases, I think we should see them more closely. The
>>> grouping into components made sense in XSD because it brought together
>>> elements. I will have to look at them more closely and explain for each
>>> case why this grouping was meant, so that we can decide if this should also
>>> remain in the RDF mapping. Is there an easy way of spotting these cases?
>>>
>> OK, we should discuss this in a telco.
>>
>>>
>>>
>>> A final question: how will we add the comments/decisions from the
>>> previous googlesheet to the current one? As said, I can do this for the
>>> distribution/licensing module elements but for the rest?
>>>
>> Add any comments you want (possibly copied from the previous doc). Apart
>> from that I would like to keep the sheet itself clean until the next ld4lt
>> telco at least.
>>
>> Regards,
>> John
>>
>>>
>>>
>>> Best,
>>>
>>> Penny
>>>
>>>
>>>
>>
>>
>
>
> --
> Marta Villegas
> marta.villegas@gmail.com
>
Received on Wednesday, 28 January 2015 12:15:48 UTC