Re: ITS > NIF conversion - one testing contributor minimum needed from Felix Sasaki on 2013-05-23 (public-multilingualweb-lt-tests@w3.org from May 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Thu, 23 May 2013 09:30:08 +0200
To: Phil Ritchie <philr@vistatec.ie>
CC: Multilingual Web LT-TESTS Public <public-multilingualweb-lt-tests@w3.org>
Message-ID: <519DC580.6050404@w3.org>
Cool, thanks a lot, Phil, that was fast! I will then create a few
input/output files for LQI by Monday, and let's see who else will step up.

Best,

Felix

On 23.05.13 09:24, Phil Ritchie wrote:
> Felix
>
> I volunteer. If the single test category could be LQI all the better.
>
> Phil
>
>
>
> On 23 May 2013, at 08:17, "Felix Sasaki" <fsasaki@w3.org 
> <mailto:fsasaki@w3.org>> wrote:
>
>> Hi all,
>>
>> we have one feature in ITS2 that is not yet tested: the conversion to 
>> NIF. To fill this gap, I would propose the following approach.
>>
>> 1) The conversion is tested with example files for one data category. 
>> No need to have the conversion output of several data categories in 
>> one output file.
>>
>>
>> 2) As in the definition of the ITS > NIF conversion algorithm
>> http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
>> the conversion output represents ITS information, but no information 
>> about how it was generated (local markup, global, inheritance, defaults).
>>
>>
>> 3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle, 
>> ... For testing purposes, having just one representation would be good, 
>> for ease of comparison.
>>
>>
>> 4) It doesn't make sense to add the NIF conversion input / output to 
>> the test suite master file and to take it into account for 
>> comparison. The reason is that the test suite master file does a line 
>> by line comparison of test suite output. That doesn't provide useful 
>> info for NIF.
>>
>>
>> 5) So how to compare output? The "meat" of the conversion to NIF is 
>> that for each node in a document that holds ITS information, a triple 
>> of the following form is generated:
>> SubjectURI ITS2DataCategorySpecificPredicate Value
>>
>> "SubjectURI" is a URI that consists of the document base URI plus 
>> "#" plus character offsets. For example, for "Dublin" in
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
>> this is
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18
>>
>> "ITS2DataCategorySpecificPredicate" is an RDF predicate that 
>> identifies the data category information in question, e.g. for 
>> "Dublin" it is "itsrdf:taIdentRef".
>>
>> "Value" is the data category value. For "Dublin" in the above 
>> example, this is http://dbpedia.org/resource/Dublin
>>
>> So we would need to compare whether two implementations create the 
>> same triples of the above form for a given input document. "The same" 
>> does not mean "the same offsets", e.g. "#char=12,18", since white 
>> space normalization for the NIF conversion is not defined in ITS2. 
>> "The same" means: for each node that contains ITS information there 
>> must be a triple like the above, with the same predicate and object.
>>
>>
>> 6) Do we need automatic comparison?
>> With the definition of "the same" given above, an automatic comparison 
>> of the output is hard. But it is not needed IMO: NIF conversion is 
>> one feature, like e.g. Translate "global"; so having 5-10 input files 
>> that we can check manually would be sufficient. Also, we don't need 
>> to cover several data categories in one NIF conversion test file, and 
>> we might even restrict the testing to one data category: "text 
>> analysis", "translate", ...
>>
>>
>> 7) How to do this practically?
>>
>> Below is a template for the output of the ITS2NIF conversion, in Turtle 
>> serialization.
>>
>> [
>> @prefix : <XXX-base-uriXXX#> .
>>      @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
>>      @prefix nif: 
>> <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
>>      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>
>>     :char=0,XXX-complete-length-XXX     a nif:Context;
>>          nif:isString " XXX-complete-source-file-text-content-XXX";
>>          nif:sourceUrl <XXX-base-uriXXX> .
>>
>> :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a 
>> nif:RFC5147String;
>>          nif:anchorOf "XXX-annotated-string-XXX";
>>          nif:referenceContext :char=0,XXX-complete-length-XXX;
>>
>>
>>          XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
>> ]
>>
>> The part starting with
>> ":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a 
>> nif:RFC5147String;"
>> and ending with
>> "XXX-annotation-predicate-XXX XXX-annotation-value-XXX ."
>> would be needed for each annotation.
>>
>> Here is an example of how the template would be filled in for
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
>> Note that in the test file below, I am only processing "text analysis" 
>> information, see 6) above for the rationale.
>>
>> [
>> @prefix : 
>> <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#> 
>> .
>>      @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
>>      @prefix nif: 
>> <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
>>      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>
>>     :char=0,30     a nif:Context;
>>          nif:isString " Welcome to Dublin in Ireland!";
>>          nif:sourceUrl 
>> <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html> 
>> .
>>
>>     :char=12,18     a nif:RFC5147String;
>>          nif:anchorOf "Dublin";
>>          nif:referenceContext :char=0,30;
>>          itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .
>>
>> :xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D) 
>> nif:convertedFrom :char=12,18 .
>> ]
>>
>>
>> 8) What effort is needed from test suite contributors?
>>
>> For Leroy / TCD as the test suite owner, I'd say no effort is needed 
>> except creating directories for NIF input / output and documenting 
>> them on the test suite main page, saying that they are part of the 
>> normative conformance testing.
>>
>> We need at least one test suite contributor that would - in addition 
>> to me - implement the conversion exemplified in 7). The 
>> implementation should be pretty straightforward:
>> 0 Re-use the code that generates your test suite output
>> 1 Create the template under 7), fill in file base URI 
>> "XXX-base-uriXXX", "XXX-complete-source-file-text-content-XXX" and 
>> "XXX-complete-length-XXX"
>> 2 For each node that has ITS 2 annotation:
>> 2.1 create a subject URI. That is, instead of generating XPath in a 
>> line like /html/body[1]/h2[1]/span[1] , you
>> 2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters in 
>> all element nodes preceding the current node
>> 2.1.2 generate XXX-ITS-Annotation-End-XXX: take the string length of 
>> the current node and add it to XXX-ITS-Annotation-Start-XXX
>> 2.2 Now you have the subject URI, built from the above offsets and the 
>> base URI. And now you can create this triple:
>> :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a 
>> nif:RFC5147String;
>> 2.3 create the nif:anchorOf triple by putting the string value of the 
>> current node in, e.g. "Dublin";
>> 2.4 create the nif:referenceContext triple by using the count of the 
>> complete document string length, e.g. :char=0,30;
>> 2.5 for each annotation, create the triples, e.g. "itsrdf:taIdentRef 
>> <http://dbpedia.org/resource/Dublin>"
>> 3 after the last annotation, put the dot "." at the end instead of ";".
>>
>> 9) How much testing do we need?
>>
>> I have implemented the conversion for Text Analysis local, see
>> http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
>> and an example conversion, in which the triples are shown with the 
>> RDF validator http://tinyurl.com/qhhjgmb
>> So if we had one more implementer who could do 8), we would be 
>> done.
>> I'd be happy to contribute input files. I would also be happy to do 
>> this for any other data category. But before starting I'd like to 
>> know which data category to use, so that I don't need to redo the 
>> input files.
>>
>>
>> 10) When do we need this?
>>
>> We need this for finalizing ITS2, that is, the testing needs to be 
>> done within the next three weeks. I hope that with the test suite 
>> based description under 8) the processing of the actual conversion 
>> is straightforward and a question of a few hours for those who are 
>> producing test suite output anyway.
>>
>>
>> 11) How critical is this?
>>
>> We need - in addition to me - one more volunteer, otherwise we can't 
>> finalize ITS2.
>>
>>
>> Best,
>>
>> Felix
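
Point 5) in the quoted message defines "the same" output as matching predicate and object per annotated node, regardless of character offsets. As a possible aid to the manual checking proposed in 6), the following is a minimal Python sketch of that comparison. It assumes each implementation's output has already been parsed into (subject, predicate, object) tuples of plain strings; the function names and the example.org base URI are hypothetical and not part of the test suite. It compares multisets of (predicate, object) pairs, so it cannot tell whether the pairs belong to the same node; that still needs a human look.

```python
from collections import Counter

# Namespace of the ITS RDF vocabulary, as used in the Turtle template.
ITSRDF = "http://www.w3.org/2005/11/its/rdf#"


def annotation_signature(triples):
    """Return a multiset of (predicate, object) pairs for ITS annotation
    triples. Subjects are dropped so that differing character offsets
    (e.g. #char=12,18 vs. #char=11,17) do not affect the comparison."""
    return Counter(
        (predicate, obj)
        for (_subject, predicate, obj) in triples
        if predicate.startswith(ITSRDF)
    )


def same_its_output(triples_a, triples_b):
    """True if both outputs carry the same ITS predicate/object pairs."""
    return annotation_signature(triples_a) == annotation_signature(triples_b)


# Hypothetical outputs of two implementations for the same input document.
base = "http://example.org/doc.html"
impl_a = [(base + "#char=12,18", ITSRDF + "taIdentRef",
           "http://dbpedia.org/resource/Dublin")]
impl_b = [(base + "#char=11,17", ITSRDF + "taIdentRef",
           "http://dbpedia.org/resource/Dublin")]

print(same_its_output(impl_a, impl_b))
```

Non-ITS triples such as nif:anchorOf are ignored on purpose, since they depend on how each implementation normalizes white space.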
>
>
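
The steps under 8) in the quoted message can be sketched roughly as follows. This is a minimal Python illustration, not the XSLT implementation mentioned under 9): it assumes the document text and the annotation offsets have already been computed (steps 2.1.1 and 2.1.2), and the names its_to_nif and turtle_string are hypothetical. It copies the template's ":char=start,end" notation as-is; a strict Turtle parser might need those subjects written as full IRIs in angle brackets.

```python
def turtle_string(s):
    """Quote a Python string as a Turtle string literal (basic escaping only)."""
    escaped = s.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def its_to_nif(base_uri, text, annotations):
    """Fill in the template from 7).

    annotations: iterable of (start, end, predicate, value) tuples, where
    predicate and value are already in Turtle syntax, e.g.
    ("itsrdf:taIdentRef", "<http://dbpedia.org/resource/Dublin>").
    """
    context = f":char=0,{len(text)}"
    lines = [
        f"@prefix : <{base_uri}#> .",
        "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .",
        "@prefix nif: "
        "<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .",
        "",
        # Step 1: the context resource for the complete document text.
        f"{context} a nif:Context;",
        f"    nif:isString {turtle_string(text)};",
        f"    nif:sourceUrl <{base_uri}> .",
    ]
    for start, end, predicate, value in annotations:
        lines += [
            "",
            # Steps 2.2 to 2.5: one resource per annotated node.
            f":char={start},{end} a nif:RFC5147String;",
            f"    nif:anchorOf {turtle_string(text[start:end])};",
            f"    nif:referenceContext {context};",
            # Step 3: the last statement ends with "." instead of ";".
            f"    {predicate} {value} .",
        ]
    return "\n".join(lines) + "\n"


# The "Dublin" example from the quoted message.
base_uri = ("http://www.w3.org/International/multilingualweb/lt/drafts/"
            "its20/examples/html5/EX-HTML-whitespace-normalization.html")
turtle = its_to_nif(
    base_uri,
    " Welcome to Dublin in Ireland!",
    [(12, 18, "itsrdf:taIdentRef", "<http://dbpedia.org/resource/Dublin>")],
)
print(turtle)
```

The sketch emits one predicate per annotated node; a node with several annotations would need the intermediate statements joined with ";" as the template describes.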
Received on Thursday, 23 May 2013 07:30:43 UTC