Re: ITS > NIF conversion - one testing contributor minimum needed from Leroy Finn on 2013-05-23 (public-multilingualweb-lt-tests@w3.org from May 2013)

From: Leroy Finn <finnle@tcd.ie>
Date: Thu, 23 May 2013 11:53:01 +0100
To: Felix Sasaki <fsasaki@w3.org>
Cc: Phil Ritchie <philr@vistatec.ie>, Multilingual Web LT-TESTS Public <public-multilingualweb-lt-tests@w3.org>
Message-ID: <CAMYWBwtNNA7-WMVHPfqB=93WnyScgvZeS52g4GsLTkQdruzn-g@mail.gmail.com>
Sorry, I meant ITS and NIF.

Leroy


On 23 May 2013 11:51, Leroy Finn <finnle@tcd.ie> wrote:

> Felix,
>
> Dave has discussed implementing NIF and RDF so we could be in a position
> to test also.
>
> Cheers,
> Leroy
>
>
> On 23 May 2013 08:30, Felix Sasaki <fsasaki@w3.org> wrote:
>
>>  Cool, thanks a lot, Phil, that was fast! I will then create a few input
>> - output files for LQI by Monday & let's see who else will step up.
>>
>> Best,
>>
>> Felix
>>
>> On 23.05.13 09:24, Phil Ritchie wrote:
>>
>> Felix
>>
>>  I volunteer. If the single test category could be LQI all the better.
>>
>> Phil
>>
>>
>>
>> On 23 May 2013, at 08:17, "Felix Sasaki" <fsasaki@w3.org> wrote:
>>
>>   Hi all,
>>
>> we have one feature in ITS2 that is not yet tested: the conversion to
>> NIF. To fill this gap, I would propose the following approach.
>>
>> 1) The conversion is tested with example files for one data category. No
>> need to have the conversion output of several data categories in one output
>> file.
>>
>>
>> 2) As in the definition of the ITS > NIF algorithm
>> http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
>> the conversion output represents ITS information, but no information
>> about how it was generated (local markup, global, inheritance, defaults).
>>
>>
>> 3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle, ...
>> For testing purposes, having just one representation would be good, for
>> ease of comparison.
>>
>>
>> 4) It doesn't make sense to add the NIF conversion input / output to the
>> test suite master file and to take it into account for comparison. The
>> reason is that the test suite master file does a line by line comparison of
>> test suite output. That doesn't provide useful info for NIF.
>>
>>
>> 5) So how to compare output? The "meat" of the conversion to NIF is that
>> for each node in a document that holds ITS information, a triple of the
>> following form is generated:
>> SubjectURI ITS2DataCategorySpecificPredicate Value
>>
>> "SubjectURI" is a URI that consists of the document base URI plus "#"
>> plus character offsets. Example for "Dublin" in
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
>> this is
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18
>>
>> "ITS2DataCategorySpecificPredicate" is an RDF predicate that identifies
>> the data category information in question, e.g. for "Dublin" it is
>> "itsrdf:taIdentRef".
>>
>> "Value" is the data category value. For "Dublin" in the above example,
>> this is http://dbpedia.org/resource/Dublin
>>
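
A minimal Python sketch of the triple shape described above, assuming the document text is already available as one string. The function name subject_uri and the variable names are illustrative only and come from neither ITS2 nor NIF; the text, offsets, predicate and value are the ones quoted in this message.

```python
def subject_uri(base_uri: str, start: int, end: int) -> str:
    """Build a subject URI: document base URI + '#' + character offsets."""
    return f"{base_uri}#char={start},{end}"


# Base URI and text content as quoted in the message.
base = ("http://www.w3.org/International/multilingualweb/lt/drafts/"
        "its20/examples/html5/EX-HTML-whitespace-normalization.html")
text = " Welcome to Dublin in Ireland!"

# Offsets of the annotated string within the complete text.
start = text.index("Dublin")
end = start + len("Dublin")

# SubjectURI, ITS2DataCategorySpecificPredicate, Value
triple = (
    subject_uri(base, start, end),
    "itsrdf:taIdentRef",
    "http://dbpedia.org/resource/Dublin",
)
```

With the text as given, this yields the "#char=12,18" fragment used in the example URI.
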
>> So we would need to compare whether two implementations create the same
>> triples of the above form for a given input document. "The same" does not
>> mean "the same offsets", e.g. "#char=12,18", since white space
>> normalization for the NIF conversion is not defined in ITS2. "The same"
>> means: for each node that contains ITS information there must be a triple
>> like above, with the same predicate and object.
>>
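
A rough Python sketch of that notion of "the same", assuming each implementation's output has already been parsed into (subject, predicate, object) tuples. The helper names and the filter on the "itsrdf:" prefix are assumptions made for illustration. It only compares the multiset of predicate/object pairs and does not check that each pair sits on the corresponding node, so it is an aid to a manual check rather than a full comparison.

```python
from collections import Counter


def annotation_pairs(triples):
    """Keep only ITS annotation triples and drop the offset-bearing subject."""
    return Counter(
        (predicate, obj)
        for _subject, predicate, obj in triples
        if predicate.startswith("itsrdf:")  # assumed prefix for ITS predicates
    )


def same_output(triples_a, triples_b):
    """True if both outputs carry the same predicate/object pairs."""
    return annotation_pairs(triples_a) == annotation_pairs(triples_b)


# Two hypothetical implementations that normalize white space differently.
impl_a = [("doc#char=12,18", "itsrdf:taIdentRef",
           "http://dbpedia.org/resource/Dublin")]
impl_b = [("doc#char=11,17", "itsrdf:taIdentRef",
           "http://dbpedia.org/resource/Dublin")]
# A third output with a different object for the same predicate.
impl_c = [("doc#char=12,18", "itsrdf:taIdentRef",
           "http://dbpedia.org/resource/Ireland")]

offsets_differ_but_same = same_output(impl_a, impl_b)
objects_differ = same_output(impl_a, impl_c)
```
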
>>
>> 6) Do we need automatic comparison?
>> With the definition of "the same" like above an automatic comparison of
>> the output is hard. But it is not needed IMO: NIF conversion is one feature
>> like e.g. Translate "global"; so having 5-10 input files that we can check
>> manually would be sufficient. Also, we don't need to cover several data
>> categories in one NIF conversion test file, and we might restrict the
>> testing even to one data category: "text analysis", "translate", ...
>>
>>
>> 7) How to do this practically?
>>
>> Below is a template for the output ITS2NIF conversion, in Turtle
>> serialization.
>>
>> [
>>     @prefix : <XXX-base-uriXXX#> .
>>     @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
>>     @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
>>     @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>
>>     :char=0,XXX-complete-length-XXX     a nif:Context;
>>          nif:isString " XXX-complete-source-file-text-content-XXX";
>>          nif:sourceUrl <XXX-base-uriXXX> .
>>
>>     :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX     a nif:RFC5147String;
>>          nif:anchorOf "XXX-annotated-string-XXX";
>>          nif:referenceContext :char=0,XXX-complete-length-XXX;
>>          XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
>> ]
>>
>> The part starting with
>> ":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX     a
>> nif:RFC5147String;"
>> and ending with
>> "XXX-annotation-predicate-XXX XXX-annotation-value-XXX ."
>> would be needed for each annotation.
>>
>> Here is an example of how the template would be filled in for
>>
>> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
>> Note that in the test file below, I am only processing "text analysis"
>> information, see 6) above for the rationale.
>>
>> [
>>     @prefix : <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#> .
>>     @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
>>     @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
>>     @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>
>>     :char=0,30     a nif:Context;
>>          nif:isString " Welcome to Dublin in Ireland!";
>>          nif:sourceUrl <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html> .
>>
>>     :char=12,18     a nif:RFC5147String;
>>          nif:anchorOf "Dublin";
>>          nif:referenceContext :char=0,30;
>>          itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .
>>
>>     :xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D) nif:convertedFrom :char=12,18 .
>> ]
>>
>>
>> 8) What effort is needed from test suite contributors?
>>
>> For Leroy / TCD as the test suite owner, I'd say no effort is needed
>> except creating directories for NIF input / output and documenting them on
>> the test suite main page, saying that they are part of the normative
>> conformance testing.
>>
>> We need at least one test suite contributor that would - in addition to
>> me - implement the conversion exemplified in 7). The implementation should
>> be pretty straightforward:
>> 0 Re-use the code that generates your test suite output
>> 1 Create the template under 7), fill in file base URI "XXX-base-uriXXX",
>> "XXX-complete-source-file-text-content-XXX" and "XXX-complete-length-XXX"
>> 2 For each node that has ITS 2 annotation:
>> 2.1 create a subject URI. That is, instead of generating XPath in a line
>> like /html/body[1]/h2[1]/span[1] , you
>> 2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters in all
>> element nodes preceding the current node
>> 2.1.2 generate XXX-ITS-Annotation-End-XXX: take the string length of the
>> current node and add XXX-ITS-Annotation-Start-XXX
>> 2.2 Now you have the subject URI, built from the above offsets and the
>> base URI. And now you can create this triple:
>>        :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX
>> a nif:RFC5147String;
>> 2.3 create the nif:anchorOf triple by putting the string value of the
>> current node in, e.g. "Dublin";
>> 2.4 create the nif:referenceContext triple by using the count of the
>> complete document string length, e.g. :char=0,30;
>> 2.5 for each annotation, create the triples, e.g. "itsrdf:taIdentRef
>> <http://dbpedia.org/resource/Dublin>"
>> 3 after the last annotation, put the dot "." at the end instead of ";".
>>
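
The steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the XSLT implementation mentioned in this thread. The function names, the (node index, predicate, value) annotation tuples and the split of the example text into text nodes are all hypothetical. The output mirrors the template's ":char=..." shorthand; a strict Turtle parser may need those names escaped or written as full IRIs, which is worth checking.

```python
def turtle_string(s):
    """Quote a Python string as a Turtle string literal."""
    escaped = s.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def node_offsets(text_nodes, index):
    """Steps 2.1.1 and 2.1.2: start = characters before the node, end = start + length."""
    start = sum(len(t) for t in text_nodes[:index])
    return start, start + len(text_nodes[index])


def its_to_nif(base_uri, text_nodes, annotations):
    """annotations: list of (node_index, predicate, value) tuples (hypothetical shape)."""
    text = "".join(text_nodes)
    context = f":char=0,{len(text)}"
    # Step 1: fill in base URI, complete text content and complete length.
    lines = [
        f"@prefix : <{base_uri}#> .",
        "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .",
        "@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .",
        "",
        f"{context} a nif:Context ;",
        f"    nif:isString {turtle_string(text)} ;",
        f"    nif:sourceUrl <{base_uri}> .",
    ]
    # Step 2: one block per annotated node.
    for index, predicate, value in annotations:
        start, end = node_offsets(text_nodes, index)
        lines += [
            "",
            f":char={start},{end} a nif:RFC5147String ;",              # 2.2
            f"    nif:anchorOf {turtle_string(text_nodes[index])} ;",  # 2.3
            f"    nif:referenceContext {context} ;",                   # 2.4
            f"    {predicate} {value} .",                              # 2.5 and 3
        ]
    return "\n".join(lines)


# Assumed text-node split of the example document's text content.
nodes = [" Welcome to ", "Dublin", " in ", "Ireland", "!"]
turtle = its_to_nif(
    "http://www.w3.org/International/multilingualweb/lt/drafts/"
    "its20/examples/html5/EX-HTML-whitespace-normalization.html",
    nodes,
    [(1, "itsrdf:taIdentRef", "<http://dbpedia.org/resource/Dublin>")],
)
```

The sketch writes one annotation per node; a node carrying several annotations would need the extra predicate/value lines separated by ";" before the final ".".
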
>> 9) How much testing do we need?
>>
>> I have implemented the conversion for Text Analysis local, see
>>
>> http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
>> and an example conversion, in which the triples are shown with the RDF
>> validator http://tinyurl.com/qhhjgmb
>> So if we had one more implementer who could do 8) we would be done.
>> I'd be happy to contribute input files. I would also be happy to do this
>> for any other data category. But before starting I'd like to know which
>> data category to use, so that I don't need to redo the input files.
>>
>>
>> 10) When do we need this?
>>
>> We need this for finalizing ITS2, that is the testing needs to be done
>> within the next three weeks. I hope that with the test suite based
>> description under 8) the processing of the actual conversion is
>> straightforward and a question of a few hours for those who are producing
>> test suite output anyway.
>>
>>
>> 11) How critical is this?
>>
>> We need - in addition to me - one more volunteer, otherwise we can't
>> finalize ITS2.
>>
>>
>> Best,
>>
>> Felix
>>
>>
>>
>
Received on Thursday, 23 May 2013 10:53:33 UTC