Re: NLP Interchange Format was: Re: Let's drop RDFa in the requirements ! from Tadej Stajner on 2012-05-11 (public-multilingualweb-lt@w3.org from May 2012)

From: Tadej Stajner <tadej.stajner@ijs.si>
Date: Fri, 11 May 2012 16:04:03 +0200
To: public-multilingualweb-lt@w3.org
Message-ID: <4FAD1C53.5040705@ijs.si>
Hi, Sebastian, some good points -

On 5/11/2012 10:47 AM, Sebastian Hellmann wrote:
> Dear Felix,
>
> On 05/11/2012 09:33 AM, Felix Sasaki wrote:
>> Thanks a lot for this, Sebastian.
>>
>> https://www.w3.org/International/multilingualweb/lt/track/issues/2
> The link to the issues says "Unauthorized - These pages are restricted 
> to W3C Members."
>
>> we envisage ITS attributes (its-*) for HTML5 and an automatic 
>> conversion to
>> RDFa
>>
>> "
>>
>>     - the working group will provide an algorithm to convert its- 
>> attributes
>>     into RDFa and Microdata markup, to serve the needs of the 
>> Semantic Web
>>     community and of search engine optimization.
>>     - The conversion to RDFa will add URIs to each metadata item in 
>> an HTML5
>>     document. This is needed as reference points for the metadata 
>> items after
>>     extraction of RDF.
>>
>> "
> Are RDFa and Microdata enough to fulfill your requirements? The use 
> case to deploy these is normally limited: i.e. a Web Content (XHTML) 
> publisher embeds additional structured information into his own web 
> pages.  Is this enough?

Besides that, our situation also considers that HTML markup can be 
generated early in the process and is then annotated and transformed by 
various agents. While most tools in this pipeline will operate on more 
specific XML-based formats with ITS2 extensions, we'd like the 
possibility to convert the final results to RDFa. We're aware of the 
limitations, so we initially declared this to be an output format 
instead of an in-pipeline working format.

>
> Although not technically impossible, I would consider these issues 
> difficult to tackle with RDFa:
> - overlapping annotations
> - multiple annotations, i.e. from more than one layer or provider
> - merging of annotated documents (I am not sure, if this is possible)
> - third-party annotations (RDFa and Microdata can only be embedded by 
> the "owner" of the web document)
>

- Overlaps where two annotations only partially overlap each other are 
technically impossible in any inline format that could fit into HTML5 
anyway. However, overlaps where one annotation covers only a 
subsequence of the other should be possible and valid; your example 
with the 'semantic' hasLexicalEntry is a good illustration of this.
- Multiple providers could be handled by the Provenance data category 
that is also in the draft.
- Aside from dealing with namespace clash (same local node names), we 
don't see major issues with merging documents with inline annotations. 
For document-wide annotations, the author needs to resolve the merge 
anyway, regardless of format.
- In the MLW-LT world, there are so many agents involved in the process 
that it's hard to pinpoint a single 'owner' for the document anyway. For 
the sake of completeness, let's assume that the agent (person or 
component involved) that does the serialization into RDFa assumes this 
role. There are also some data categories proposed that annotate this 
more explicitly.
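
To make the distinction in the first point concrete, here is a minimal 
Python sketch. The function name classify_spans and the (start, end) 
convention with an exclusive end are assumptions for illustration only, 
not part of ITS or NIF; the offsets in the usage note are the ones from 
Sebastian's example.

```python
def classify_spans(a, b):
    """Classify two (start, end) character spans, end exclusive.

    Returns "disjoint", "nested" (one span contains the other, so both
    can be inlined as nested elements) or "crossing" (partial overlap,
    which properly nested inline markup cannot express).
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end <= b_start or b_end <= a_start:
        return "disjoint"
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return "nested" if a_in_b or b_in_a else "crossing"
```

With this, classify_spans((717, 729), (717, 725)) reports the 
"Semantic Web" / "semantic" pair as nested, i.e. representable inline.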


>
>> Tadej is likely to work describing that conversion algorithm (which I 
>> guess
>> will be pretty straightforward). Sebastian or others, how would NIF fit
>> into this picture? What alignment between the conversion to RDFa and
>> potentially to NIF is needed?
> As far as I know there are only few, if any, approaches mixing 
> stand-off and inline annotations, but with RDF and RDFa this might 
> actually be no far stretch at all.
> We would definitely need to add another URI scheme to NIF which allow 
> the transition from and to RDFa. The RDF properties can be reused, 
> directly.
> @Tadej, did you already make an RDFa example? I think choosing the 
> right URIs is challenging and a general problem, so you would need to 
> tackle it anyhow.
> Blank nodes are not advisable (they are difficult to merge and IIRC 
> increase complexity from P to NP in RDFS entailment. Please ask if 
> you want the references, I have them somewhere...)
>

Yes, I am working on RDFa, RDFa Lite and HTML5+ITS2 serializations 
(+microdata), right now only for some data categories. They're a 
combination of inline and stand-off. To summarize, the major sticking 
point was 'how to represent literals as subject nodes', and we think 
that the NIF recipes of having some rules to generate URIs are a good 
idea and fix most of this mismatch. However, the offset-based algorithm 
is inappropriate for us: in CMSs that use templates, the actual offset 
is not known at authoring time, but only at render time. So I'm mostly 
using the hash-based one, but this one is also not ideal, since our 
annotations are meant to be inlined, which changes the contexts and 
invalidates the hashes in URIs.
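
A rough Python sketch of why inlining invalidates a hash-based URI. 
This is not the actual NIF recipe: the function name hash_uri, the 
choice of MD5, and the exact string that gets digested (left context, 
the string in parentheses, right context) are assumptions made only to 
illustrate the effect.

```python
import hashlib


def hash_uri(text, start, end, context_len=4):
    """Build a hash-style URI fragment for text[start:end].

    ASSUMPTION: the digest covers left context + "(" + string + ")" +
    right context. The real NIF recipe may differ in detail.
    """
    anchor = text[start:end]
    left = text[max(0, start - context_len):start]
    right = text[end:end + context_len]
    message = f"{left}({anchor}){right}".encode("utf-8")
    digest = hashlib.md5(message).hexdigest()
    return f"#hash_{context_len}_{len(anchor)}_{digest}_{anchor[:20]}"


plain = "Welcome to Dublin!"
inlined = 'Welcome to <span translate="no">Dublin</span>!'

# Same annotated string, but the surrounding context differs once the
# annotation has been inlined as markup.
i = plain.index("Dublin")
j = inlined.index("Dublin")
uri_plain = hash_uri(plain, i, i + len("Dublin"))
uri_inlined = hash_uri(inlined, j, j + len("Dublin"))
```

Here uri_plain and uri_inlined differ in their digest part, because the 
four characters of context on each side are now markup rather than the 
original text.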

Some examples that I'm currently generating (Let's have 'Welcome to 
Dublin!' as the running example - formatted for clarity):
HTML5+ITS2:

<div>
  Welcome to
  <span translate="no"
        itsx-mentions="http://dbpedia.org/resource/Dublin"
        itsx-entity-type="http://schema.org/Place">
    Dublin
    <span hidden="hidden" itsx-alternative-label="Dublin" lang="en"></span>
  </span>!
</div>

RDFa:

<div xmlns:itsx="http://www.w3.org/20XX/XX/its2.0">
  Welcome to
  <span about="#hash_4_6_60ccf35ef21554243c7b87bcc467e3ba_Dublin" translate="no">
    <a rel="itsx:mentions" resource="http://dbpedia.org/resource/Dublin"></a>
    <a rel="itsx:entityType" resource="http://schema.org/Place"></a>
    Dublin
    <!-- the following lines could also be stand-off -->
    <span about="http://dbpedia.org/resource/Dublin" hidden="hidden">
      <span property="rdfs:label" lang="en">Dublin</span>
    </span>
  </span>!
</div>

RDFa Lite:

<div prefix="itsx: http://www.w3.org/20XX/XX/its2.0">
  Welcome to
  <span about="#hash_4_6_60ccf35ef21554243c7b87bcc467e3ba_Dublin" translate="no">
    <a property="itsx:mentions" resource="http://dbpedia.org/resource/Dublin"></a>
    <a property="itsx:entityType" resource="http://schema.org/Place"></a>
    Dublin
    <!-- the following lines could also be stand-off -->
    <span about="http://dbpedia.org/resource/Dublin" hidden="hidden">
      <span property="rdfs:label" lang="en">Dublin</span>
    </span>
  </span>!
</div>

The general algorithm for conversion is mostly changing its-* to 
property="its-*" or rel="its-*" while generating URIs for inline 
annotations. At some point, I was declaring them as blank nodes, but 
wasn't aware that it's not advisable - thanks for the info. There are 
also some subtle exceptions to these patterns in my examples: while the 
RDFa (Lite) examples express labels of the entity as RDFa statements, 
the HTML5+ITS2 has a dedicated property that expresses alternative 
labels in different languages.
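
A minimal Python sketch of the attribute-to-property step, assuming the 
RDFa Lite shape used in the examples. The function name to_rdfa_lite 
and its parameters are hypothetical, and the sketch deliberately leaves 
out translate="no", labels and URI generation.

```python
import re
from html import escape


def to_rdfa_lite(text, subject_uri, its_attrs, prefix="itsx"):
    """Render one inline annotation as RDFa Lite markup.

    its_attrs maps hyphenated attribute names (e.g. "itsx-entity-type")
    to resource URIs. Each entry becomes an empty
    <a property="..." resource="..."> child element, with the local
    name camel-cased (entity-type -> entityType).
    """
    def to_property(attr):
        local = attr[len(prefix) + 1:]  # drop the "itsx-" part
        local = re.sub(r"-(\w)", lambda m: m.group(1).upper(), local)
        return f"{prefix}:{local}"

    links = "".join(
        f'<a property="{to_property(name)}" '
        f'resource="{escape(uri, quote=True)}"></a>'
        for name, uri in its_attrs.items()
    )
    about = escape(subject_uri, quote=True)
    return f'<span about="{about}">{links}{escape(text)}</span>'


markup = to_rdfa_lite(
    "Dublin",
    "#id_dublin",  # hypothetical subject URI for the annotated span
    {
        "itsx-mentions": "http://dbpedia.org/resource/Dublin",
        "itsx-entity-type": "http://schema.org/Place",
    },
)
```

For the RDFa (non-Lite) variant the same sketch would emit rel= instead 
of property= for resource-valued attributes.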

The examples you posted earlier make sense, although I can't find any 
definition of sso:oen in the SSO documentation - I also imagine that 
this property is roughly equivalent to itsx:mentions.

In terms of URI generation recipes, I am considering another one that 
would help this use case: de-referencing via element id or, more 
generally, XPath. The hash-based one still has the issue of not playing 
nice with inline annotations (but is fine with stand-off). The 
compromise is that such URIs can't point to arbitrary substrings, but 
only to uniquely identifiable DOM elements, and they need stronger 
infrastructure, since they operate on the DOM and not on a raw character 
sequence. For us, that tradeoff is not that problematic since we would 
usually inline and identify annotations in tags anyway (in the case we 
don't inline, the hash-based recipe is still fine).

For example:
Having a <p id="paragraph1">..</p>, the XPath expression would be 
//p[@id='paragraph1']. The corresponding URI could then be: 
#xpath_%2F%2Fp%5B%40id%3D'paragraph1'%5D
I'd also be happy with just accessing by id, producing a URI 
#id_paragraph1.
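
A small Python sketch of both recipes, assuming plain percent-encoding 
of the expression. The function names are hypothetical; keeping the 
single quote unescaped is a choice made only to match the URI shown 
above.

```python
from urllib.parse import quote, unquote

XPATH_PREFIX = "#xpath_"
ID_PREFIX = "#id_"


def xpath_uri(xpath):
    """Percent-encode an XPath expression into a URI fragment."""
    return XPATH_PREFIX + quote(xpath, safe="'")


def id_uri(element_id):
    """Build the simpler id-based fragment."""
    return ID_PREFIX + quote(element_id, safe="")


def xpath_from_uri(fragment):
    """Recover the XPath expression from an xpath-style fragment."""
    if not fragment.startswith(XPATH_PREFIX):
        raise ValueError("not an xpath-style fragment: " + fragment)
    return unquote(fragment[len(XPATH_PREFIX):])


uri = xpath_uri("//p[@id='paragraph1']")
```

Decoding the fragment with xpath_from_uri gives back the original 
expression, which a DOM-aware consumer could then evaluate.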

-- Tadej

>
> I have not read all the documents regarding MLW-LT, I hope to get a 
> much clearer picture at the workshop.
>
> All the best,
> Sebastian
>
>>
>> Felix
>>
>>
>> 2012/5/10 Sebastian Hellmann<hellmann@informatik.uni-leipzig.de>
>>
>>> Dear all,
>>> I was following the conversation about RDFa and would like to draw your
>>> attention to the NLP Interchange Format (NIF), which we are still
>>> developing within LOD2. Although I am not 100% up-to-date with all your
>>> requirements, I would assume, that NIF tackles some of the issues 
>>> you are
>>> having, i.e. the no literals as subject problem or a general 
>>> uncertainty
>>> how to handle things.
>>>
>>> Please find the latest document (one week old) about it here:
>>> http://svn.aksw.org/papers/2012/WWW_NIF/public/string_ontology.pdf
>>>
>>>
>>> We are currently gathering requirements for NIF version 2.0. We will
>>> prepare a draft within the next two months and then a community 
>>> reviewing
>>> phase.
>>> I will be at Dublin, so please feel free to ask me any questions.
>>>
>>> NIF is already compatible to the lemon model and NERD.
>>>
>>> So to compare it to Tadej example, I made one here:
>>> It concerns the first occurrence of "Semantic Web" on 
>>> http://www.w3.org/DesignIssues/LinkedData.html
>>> highlighted here:
>>> http://pcai042.informatik.uni-leipzig.de/~swp12-9/vorprojekt/index.php?annotation_request=http%3A%2F%2Fwww.w3.org%2FDesignIssues%2FLinkedData.html%23hash_10_12_60f02d3b96c55e137e13494cf9a02d06_Semantic%2520Web
>>>
>>>
>>> Here is the NIF example for it (sso:oen is probably the same as
>>> itsx:mentions):
>>> <http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729>
>>>       a str:StringInContext ;
>>>       itsx:mentions <http://dbpedia.org/resource/Semantic_Web> .
>>>       sso:oen <http://dbpedia.org/resource/Semantic_Web> .
>>>
>>> Additionally "semantic" could have a lexical entry. Note that 1. the
>>> offset is 4 shorter and that the DBpedia Wiktionary link is working 
>>> already
>>> of type lemon:LexicalEntry .
>>>
>>> <http://www.w3.org/DesignIssues/LinkedData.html#offset_717_725>
>>>     a str:StringInContext ;
>>>     sso:hasLexicalEntry <http://wiktionary.dbpedia.org/resource/semantic> .
>>>
>>>
>>> All the best,
>>> Sebastian
>>>
>>>
>>>
>>> On 05/08/2012 03:46 PM, Dave Lewis wrote:
>>>
>>>> Hi Maxime,
>>>> Thank you for this further clarification.
>>>>
>>>> I think a formulation you define, where the literal would be the
>>>> _object_ of the triple while the span is the subject, may be 
>>>> sufficient for
>>>> what ITS is looking for. We only want to mark the literal for further
>>>> processing, rather than wanting to make direct assertions about it 
>>>> as a
>>>> subject.
>>>>
>>>> The question of whether we should be using RDFa for this at all is a
>>>> broader one. It would be good to get other views, especially from 
>>>> potential
>>>> implementors of ITS2.0 on this?
>>>>
>>>> Also, to reinforce Maxime's point, the ontolex members and their
>>>> expertise would be very welcome at the upcoming Dublin workshop. On 
>>>> the 11
>>>> June we are looking at future roadmaps for convergence of the 
>>>> multilingual
>>>> web with LOD. On the 12 and 13th we will be focussing directly on the
>>>> requirements for the ITS2.0 recommendation that the MLW-LT WG is 
>>>> currently
>>>> producing. We've not finalised the schedule yet, but I imagine that 
>>>> these
>>>> RDFa issues would be examined early on the 12th in the context of
>>>> terminology management and its tool support in localization.
>>>>
>>>> Kind Regards,
>>>> Dave
>>>>
>>>>
>>>> On 02/05/2012 11:08, Maxime Lefrançois wrote:
>>>>
>>>>> Hi Dave, The MSW-CG and MLW-LT-XG members,
>>>>> my answers below
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>>     *From: *"David Lewis"<dave.lewis@cs.tcd.ie>
>>>>>     *To: *public-multilingualweb-lt@w3.org
>>>>>     *Sent: *Tuesday 1 May 2012 02:23:47
>>>>>     *Subject: *Re: Let's drop RDFa in the requirements !
>>>>>
>>>>>     Hi Maxime,
>>>>>     Some comments below:
>>>>>
>>>>>     On 27/04/2012 15:57, Maxime Lefrançois wrote:
>>>>>
>>>>>         Hi,
>>>>>
>>>>>         in mail
>>>>>         http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Apr/0131.html ,
>>>>>         I wrote a possible RDFa markup to represent the fact that "a
>>>>>         fragment of text is identified as a named entity". I stressed
>>>>>         that there is a shift of meaning : the meaning using RDFa is:
>>>>>         "there is a resource in the document that its:lexicalizes a
>>>>>         named entity, and that has for its:value in english some
>>>>>         fragment of text".
>>>>>
>>>>>         Actually, there will always be a shift of meaning if we
>>>>>         are to use RDFa, and this is a strong conceptualization
>>>>>         incompatibility between ITS and RDF. In fact, in ITS one
>>>>>         annotates fragments of text (literals), but in RDF literals
>>>>>         can't be subject of a triple. As simple as that.
>>>>>
>>>>>
>>>>>     But does wrapping the literal in a span and then adding an id
>>>>>     attribute to that not make it dereferencable and then therefore
>>>>>     the potential subject of a triple?
>>>>>
>>>>> Yes and no,
>>>>>   - the uri could be the subject of a triple anywhere on the web, 
>>>>> but the
>>>>> uri refers to the span, and not to the text fragment that the 
>>>>> span
>>>>> contains.
>>>>>   - if you want to add a triple in the very same document, you 
>>>>> need RDFa,
>>>>> and in RDF/RDFa there is no mechanism to use a literal as a 
>>>>> subject, it is
>>>>> forbidden. In RDFa lite, the minimal triple needs a property="" 
>>>>> attribute
>>>>> to define the property of the triple, and the text fragment is the 
>>>>> object
>>>>> of the triple:
>>>>> <span id="myid" property="its:property">mytext</span>  ----->  [:myid its:property "mytext"]
>>>>>
>>>>
>>>>
>>> -- 
>>> Dipl. Inf. Sebastian Hellmann
>>> Department of Computer Science, University of Leipzig
>>> Projects: http://nlp2rdf.org , http://dbpedia.org
>>> Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
>>> Research Group: http://aksw.org
>>>
>>>
>>>
>>
>
>
Received on Friday, 11 May 2012 14:05:00 UTC