- From: John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>
- Date: Wed, 28 Jan 2015 13:15:16 +0100
- To: Marta Villegas <marta.villegas@gmail.com>
- Cc: Penny Labropoulou <penny@ilsp.gr>, public-ld4lt@w3.org
- Message-ID: <CAC5njqqYcQXsSDSTZu+neguU7MOvH-pog9zsHormyir8DW1-Mw@mail.gmail.com>
On Wed, Jan 28, 2015 at 12:38 PM, Marta Villegas <marta.villegas@gmail.com> wrote:

> Dear all,
>
> Some comments about the way we (IULA) proceeded when RDFying the MS model
> that I hope will help:
>
> General XSD2RDF rules can be summarised as follows:
>
> *XSD*                                *OWL*
> xs:simpleType                        rdfs:Datatype
> xs:simpleType with xs:enumeration    rdfs:Datatype, plus an instance for
>                                      every enumerated value

I assume you mean owl:Class here ^^

> xs:complexType                       owl:Class
> global element with simple type      rdfs:Datatype
> local element with complex type      owl:ObjectProperty
> local element with simple type       owl:DatatypeProperty
>
> The MS model follows a ‘document-centric’ approach (in some cases, elements
> merely act as a way to organize information for human consumption).
> Applying the rules above to the original XSD schema would result in a
> graph filled with ‘superfluous’ nodes. Thus, we decided to identify these
> nodes before the actual RDFication process.
>
> The criteria applied when 'simplifying' take into account (i) the tree
> structure of the nodes, (ii) their cardinality and (iii) the XPath axes.
>
> *rule1:*
> Embedded complex elements with cardinalityMax=1 can be removed, provided
> they contain neither text nor attributes. This allows for a simplification
> of the model, for example:

I am using the schema below, which does not seem to have this property; is this the same one you are using?
http://metashare.ilsp.gr/META-XMLSchema/v3.0/

> resource/*identificationInfo*/resourceName
> resource/*identificationInfo*/description
> resource/*identificationInfo*/resourceShortName
> resource/*identificationInfo*/url
> …
>
> becomes
>
> resource/resourceName
> resource/description
> resource/resourceShortName
> resource/url
> …
>
> Removing the *identificationInfo* element implies removing the resulting
> relation & class.
> Assuming that (i) a class in OWL is a classification of
> individuals into groups which share common characteristics, and that (ii)
> individuals belong to some Class, we infer that for an XML element to
> become an OWL Class it is expected that there exist individuals belonging
> to this Class. (It is hard to 'imagine' individuals of the
> *IdentificationInfo* class.)
>
> Note that such a rule can be applied provided this does not result in
> ‘sibling conflicts’. Nodes define the scope in which embedded elements
> occur and this needs to be taken into account. If we remove the
> *identificationInfo* element, its child nodes become children of the
> *resourceInfo* node. This means that *resourceName, description,
> resourceShortName*, etc. become sibling nodes of *contactPerson,
> validationInfo*, etc. Promoted nodes need to be unique in their new axis.
>
> In the article we report problems concerning naming conventions in the MS
> model. Example:
>
> resource/metadataInfo/source
> resource/metadataInfo/originalMetadataSchema
> resource/metadataInfo/originalMetadataLink
> resource/metadataInfo/metadataLanguageName
> resource/metadataInfo/metadataLanguageId
> resource/metadataInfo/metadataLastDateUpdated
> resource/metadataInfo/*revision*
> resource/metadataInfo/metadataCreator
>
> resource/versionInfo/version
> resource/versionInfo/*revision*
> resource/versionInfo/lastDateUpdated
> resource/versionInfo/updateFrequency
>
> In this case sibling conflicts forbid removal of metadataInfo and
> versionInfo (unless we rename things).
>
> We identified 11 wrapping elements in the MS schema.

Interesting... I will see if I can remove some elements using the same principle

> *rule2*
> When a complex element contains one and only one simple element, that
> simple element can be removed. Example:
>
> /validationTool/*targetResourceNameURI*
>
> becomes
>
> validationTool
>
> We identified a total of 9 ‘superfluous nodes’ in the MS schema occurring
> in the following contexts: accessTool, annotationTool, creationTool,
> derivedResource, originalSource, relatedResource, resourceAssociatedWith,
> textNumericalContentInfo and validationTool.

Interestingly, I got nearly the same result but for a different reason... that is, I assumed that the meta-type targetResourceInfoType was exactly a URL... this covers all the cases you identified other than textNumericalContentInfo

> Personally, I'd prefer to follow the MS model as accurately as
> possible (and avoid never-ending discussions) but I think you cannot avoid
> addressing (i) those issues that come up when moving from XSD to RDF and
> (ii) possible improvements (not many) to the original schema that are
> already identified by the MS team.
>
> You can find much more detailed info in an article in LREC and I can send
> you the list of 'removed' elements and the script used to identify them.

That would be super

Regards,
John

> Hope this helps
>
> 2015-01-28 11:28 GMT+01:00 John P. McCrae <jmccrae@cit-ec.uni-bielefeld.de>:
>
>> On Tue, Jan 27, 2015 at 10:03 PM, Penny Labropoulou <penny@ilsp.gr>
>> wrote:
>>
>>> Hi John and all.
>>>
>>> Thanx for the quick work!
>>>
>>> Below are a few comments/replies in between the lines.
>>>
>>> 1) Some names have been shortened, e.g.,
>>> 'ConformanceToBestStandardsAndPractices' ->
>>> 'StandardsBestPractices', should we accept such names or stay true to
>>> MetaShare?
>>>
>>> I think we should decide this on a case-by-case basis; although some
>>> names are long, they are self-explanatory. In general, at ld4lt we have
>>> changed some names (e.g. resource to language resource) when it was
>>> agreed that the new label is better.
>>>
>> Hmm... change for the sake of change is difficult, particularly when it
>> is only a small part of the vocabulary; that creates gotchas.
>>
>>> 2) A lot of MetaShare names have (unnecessarily) the words 'Info',
>>> 'Type' or 'InfoType', we could eliminate these.
>>>
>>> All “info” elements are in fact component names: in accordance with the
>>> CMDI principles, elements (and other components) are grouped into
>>> semantically coherent components. For instance, the identificationInfo
>>> groups together elements that are used for the identification of a
>>> resource, such as the resourceId, a url used as landing page, the
>>> resourceName and shortName, the description etc. If I have understood
>>> correctly, this structure is not needed/not a good practice for RDF and
>>> this is why they have been eliminated already at the IULA/UPF mapping.
>>>
>>> “type” elements are used in MetaShare for components that can be
>>> re-used: e.g. persons can be licensors, contact points, resource creators
>>> etc., but in all cases they are encoded using the personInfoType, which
>>> groups together given name, surname, communication information etc. Again,
>>> I think this is not mapped in RDF as such, if I understand correctly.
>>>
>> Yeah that is my feeling too, I would like to shorten the names, however
>> it seems hard to do this consistently as it would create clashes, e.g.,
>> ActualUse/ActualUseInfo, DocumentType/DocumentInfo
>>
>>> 3) IULA have split the AnnotationType class into 5 subclasses
>>> (DiscourseAnnotation, etc.)
>>>
>>> That’s an improvement from the original model and I suggest we stick to
>>> it.
>>>
>>> 4) There are many properties suggested by IULA or in the 'DISTRIBUTION'
>>> model that have no correspondence in the MetaShare data... we should
>>> discuss these on a case-by-case basis, right?
>>>
>>> We have already discussed with Victor the distribution and licensing
>>> module and have come up with a proposal re-introducing some of the
>>> original MetaShare elements that were not mapped in the IULA/UPF version
>>> and using the odrl (mainly) and cc vocabularies; the general ideas are to
>>> be found at
>>> https://www.w3.org/community/ld4lt/wiki/Metashare_vocabulary_for_licenses
>>> and https://www.w3.org/community/ld4lt/wiki/Examples and the mappings
>>> were documented in the previous googlesheet. I will add these to the new
>>> googlesheet by next week.
>>>
>> I incorporated all the functional (non-documentary) information from the
>> distribution model already... or at least I tried, let me know if I missed
>> anything.
>>
>>> 5) The Prev. Google Doc proposed mapping to both SWRC and BIBO, do we
>>> need to do BIBO as well (SWRC seems sufficient)?
>>>
>>> 6) I added the license modelling that LingHub does in ODRL, could one of
>>> our ODRL experts look at it and fix the last one?
>>>
>>> Please see also the two wikis on licensing, especially the examples.
>>> And as discussed, together with Victor we will provide a file with the RDF
>>> representations in odrl of the licenses used in MetaShare (of course, only
>>> of those that have not already been RDFized).
>>>
>> This refers to "R4 To neatly represent conditions of use"... but I
>> couldn't find the structured definitions of conditions of use so I wrote my
>> own in the sheet titled "License Modelling"
>>
>>> 7) Some property values, especially *resource types*, such as *ontology*
>>> or *corpus* were created as classes in the Google Doc, shall we confirm
>>> this usage pattern?
>>>
>>> This needs some more thinking, checking the various cases. Is there a
>>> list of these?
>>>
>> These seem to be individuals of the classes 'ResourceType' and
>> 'LexicalConceptualResourceType', approximately; here are the lists for
>> reference:
>>
>> In Prev.
>> Google Doc: BabelNet*, ComputationalLexicon, Corpus,
>> CorpusAudio*, CorpusCollection*, CorpusImage*, CorpusText*,
>> CorpusTextNgram*, CorpusTextNumerical*, CorpusVideo*, Framenet,
>> LexicalConceptualResource, Lexicon, MachineReadableDictionary, Ontology,
>> TerminologicalResource, Thesaurus, ToolService*, WordList, WordNet
>>
>> From Metashare: computationalLexicon, framenet, lexicon,
>> machineReadableDictionary, ontology, other*, terminologicalResource,
>> thesaurus, wordList, wordnet, corpus, languageDescription*,
>> lexicalConceptualResource
>>
>> *Unique elements
>>
>>> 8) *See attached diagram.* There is a big difference in granularity
>>> between the XSD and IULA-UPF's ontology. For example, there are 4 tags
>>> between the resource and its actual usage in the XML, e.g.,
>>>
>>> <resourceInfo> ...
>>>   <usageInfo> ...
>>>     <actualUsageInfo> ....
>>>       <useNLPspecific>parsing</useNLPspecific> ....
>>>
>>> Whereas in the IULA model this is considerably simplified to
>>>
>>> :resource a ms:Resource ;
>>>   ms:actualUse ms:parsing
>>>
>>> This would be great, but it also loses information, for example, the
>>> IULA schema associates the *availability* with the *Resource*. However,
>>> the XSD schema associates an *availability* with each *Distribution*
>>> (download file). In fact, there are resources that have different
>>> availability for different downloads (e.g., BabelNet), so there is
>>> information loss here. Thus, LingHub is very conservative and sticks to
>>> the XSD, e.g.,
>>>
>>> :resource a ms:ResourceInfo ;
>>>   ms:usageInfo [
>>>     ms:actualUsageInfo [
>>>       ms:useNLPspecific ms:parsing ] ]
>>>
>>> What shall we recommend here?
>>>
>>> Again, discuss on a case-by-case basis. For instance, for availability,
>>> we have re-introduced the distribution element, as otherwise we lose in
>>> semantics. For other cases, I think we should see them more closely.
>>> The grouping into components made sense in XSD because it brought
>>> together elements. I will have to look at them more closely and explain
>>> for each case why this grouping was meant, so that we can decide if this
>>> should also remain in the RDF mapping. Is there an easy way of spotting
>>> these cases?
>>>
>> OK, we should discuss this in a telco.
>>
>>> A final question: how will we add the comments/decisions from the
>>> previous googlesheet to the current one? As said, I can do this for the
>>> distribution/licensing module elements but for the rest?
>>>
>> Add any comments you want (possibly copied from previous doc). Apart from
>> that I would like to keep the sheet itself clean until the next ld4lt
>> telco at least
>>
>> Regards,
>> John
>>
>>> Best,
>>>
>>> Penny
>>>
>
> --
> Marta Villegas
> marta.villegas@gmail.com
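
For reference, rule 1 and its sibling-conflict check as described in the thread can be sketched roughly as follows. This is only an illustration under stated assumptions, not the IULA script mentioned in the message: the function name `removable_wrappers` and the input layout are hypothetical, the cardinalityMax=1 / no-text / no-attributes conditions are assumed to have been checked beforehand, and the element names are taken from the examples quoted in the message.

```python
from collections import Counter


def removable_wrappers(parent_children, wrappers):
    """Return the wrapper elements whose children can be promoted safely.

    parent_children: names of the direct children of the parent node.
    wrappers: dict mapping each candidate wrapper to its child names.
              Assumption: every candidate already satisfies
              cardinalityMax=1 and carries no text or attributes.
    """
    # Count every name that would sit directly under the parent
    # if all candidate wrappers were flattened at once.
    counts = Counter(c for c in parent_children if c not in wrappers)
    for children in wrappers.values():
        counts.update(children)

    # A wrapper is removable only if each promoted child stays unique
    # among its new siblings (no 'sibling conflict').
    return sorted(
        name
        for name, children in wrappers.items()
        if all(counts[child] == 1 for child in children)
    )


# Element names from the examples in the thread (hypothetical layout).
resource_children = [
    "identificationInfo", "metadataInfo", "versionInfo",
    "contactPerson", "validationInfo",
]
candidates = {
    "identificationInfo": [
        "resourceName", "description", "resourceShortName", "url",
    ],
    "metadataInfo": [
        "source", "originalMetadataSchema", "originalMetadataLink",
        "metadataLanguageName", "metadataLanguageId",
        "metadataLastDateUpdated", "revision", "metadataCreator",
    ],
    "versionInfo": [
        "version", "revision", "lastDateUpdated", "updateFrequency",
    ],
}

result = removable_wrappers(resource_children, candidates)
print(result)  # ['identificationInfo']
```

The check is deliberately conservative: because `revision` occurs under both metadataInfo and versionInfo, neither wrapper is reported as removable, which matches the reading in the message that both stay unless names are changed.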
Received on Wednesday, 28 January 2015 12:15:48 UTC