Re: ITS > NIF conversion - one testing contributor minimum needed from Leroy Finn on 2013-05-23 (public-multilingualweb-lt-tests@w3.org from May 2013)

From: Leroy Finn <finnle@tcd.ie>
Date: Thu, 23 May 2013 11:51:59 +0100
To: Felix Sasaki <fsasaki@w3.org>
Cc: Phil Ritchie <philr@vistatec.ie>, Multilingual Web LT-TESTS Public <public-multilingualweb-lt-tests@w3.org>
Message-ID: <CAMYWBwswqPBvcR_wAPB0Hp7YPsnKHOP81a38VQ219Ds-ve-TQw@mail.gmail.com>
Felix,

Dave has discussed implementing NIF and RDF, so we could be in a position
to test as well.

Cheers,
Leroy


On 23 May 2013 08:30, Felix Sasaki <fsasaki@w3.org> wrote:

>  Cool, thanks a lot, Phil, that was fast! I will then create a few
> input/output files for LQI by Monday & let's see who else will step up.
>
> Best,
>
> Felix
>
> Am 23.05.13 09:24, schrieb Phil Ritchie:
>
> Felix
>
>  I volunteer. If the single test category could be LQI all the better.
>
> Phil
>
>
>
> On 23 May 2013, at 08:17, "Felix Sasaki" <fsasaki@w3.org> wrote:
>
>   Hi all,
>
> we have one feature in ITS2 that is not yet tested: the conversion to NIF.
> To fill this gap, I would propose the following approach.
>
> 1) The conversion is tested with example files for one data category. No
> need to have the conversion output of several data categories in one output
> file.
>
>
> 2) Like the definition of the ITS > NIF algorithm
> http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
> the conversion output represents ITS information, but no information about
> how it was generated (local markup, global, inheritance, defaults).
>
>
> 3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle, ...
> For testing purposes, having just one representation would be good, for
> ease of comparison.
>
>
> 4) It doesn't make sense to add the NIF conversion input / output to the
> test suite master file and to take it into account for comparison. The
> reason is that the test suite master file does a line by line comparison of
> test suite output. That doesn't provide useful info for NIF.
>
>
> 5) So how to compare output? The "meat" of the conversion to NIF is that
> for each node in a document that holds ITS information, a triple of the
> following form is generated:
> SubjectURI ITS2DataCategorySpecificPredicate Value
>
> "SubjectURI" is an URI that consists of the document base URI plus "#"
> plus character offsets. Example for "Dublin" in
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
> this is
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18
>
> "ITS2DataCategorySpecificPredicate" is an RDF predicate that identifies
> the data category information in question, e.g. for "Dublin" it is
> "itsrdf:taIdentRef".
>
> "Value" is the data category value. For "Dublin" in the above example,
> this is http://dbpedia.org/resource/Dublin
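
To make the shape of that triple concrete, here is a minimal Python sketch. The function names subject_uri and make_triple are hypothetical and only for illustration; the offsets, predicate and value are the ones from the "Dublin" example above.

```python
def subject_uri(base_uri: str, start: int, end: int) -> str:
    """Build the subject URI: document base URI + '#' + character offsets."""
    return f"{base_uri}#char={start},{end}"


def make_triple(base_uri, start, end, predicate, value):
    """Return one (subject, predicate, object) triple for an annotated node."""
    return (subject_uri(base_uri, start, end), predicate, value)


# Values taken from the "Dublin" example in this mail.
base = ("http://www.w3.org/International/multilingualweb/lt/drafts/"
        "its20/examples/html5/EX-HTML-whitespace-normalization.html")
triple = make_triple(base, 12, 18, "itsrdf:taIdentRef",
                     "http://dbpedia.org/resource/Dublin")
```
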
>
> So we would need to compare whether two implementations create the same
> triples of the above form for a given input document. "The same" does not
> mean "the same offsets", e.g. "#char=12,18", since white space
> normalization for the NIF conversion is not defined in ITS2. "The same"
> means: for each node that contains ITS information there must be a triple
> like above, with the same predicate and object.
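
One possible reading of this definition of "the same", as a rough Python sketch: drop the subjects (they carry the offsets) and compare only the predicate/object pairs of ITS triples. The helper names and the "itsrdf:" prefix filter are assumptions made for illustration, and the check is a simplification: it does not verify that the pairs belong to the same node.

```python
from collections import Counter


def its_pairs(triples, prefix="itsrdf:"):
    """Count (predicate, object) pairs of ITS triples, ignoring subjects."""
    return Counter((p, o) for _s, p, o in triples if p.startswith(prefix))


def same_its_output(triples_a, triples_b):
    """True if both outputs carry the same ITS predicate/object pairs."""
    return its_pairs(triples_a) == its_pairs(triples_b)


# Two hypothetical implementations that normalize white space differently,
# so the offsets differ while the annotation itself is identical.
impl_a = [(":char=12,18", "itsrdf:taIdentRef",
           "http://dbpedia.org/resource/Dublin")]
impl_b = [(":char=11,17", "itsrdf:taIdentRef",
           "http://dbpedia.org/resource/Dublin")]
result = same_its_output(impl_a, impl_b)
```
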
>
>
> 6) Do we need automatic comparison?
> With the definition of "the same" like above, an automatic comparison of
> the output is hard. But it is not needed IMO: NIF conversion is one feature
> like e.g. Translate "global"; so having 5-10 input files that we can check
> manually would be sufficient. Also, we don't need to cover several data
> categories in one NIF conversion test file, and we might restrict the
> testing even to one data category: "text analysis", "translate", ...
>
>
> 7) How to do this practically?
>
> Below is a template for the output ITS2NIF conversion, in Turtle
> serialization.
>
> [
> @prefix : <XXX-base-uriXXX#> .
>      @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
>      @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
>      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>
>     :char=0,XXX-complete-length-XXX     a nif:Context;
>          nif:isString " XXX-complete-source-file-text-content-XXX";
>          nif:sourceUrl <XXX-base-uriXXX> .
>
>     :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX     a
> nif:RFC5147String;
>          nif:anchorOf "XXX-annotated-string-XXX";
>          nif:referenceContext :char=0,XXX-complete-length-XXX;
>
>
>          XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
> ]
>
> The part starting with
> ":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX     a
> nif:RFC5147String;"
> and ending with
> "XXX-annotation-predicate-XXX XXX-annotation-value-XXX ."
> would be needed for each annotation.
>
> Here is an example of how the template would be filled in for
>
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
> Note that in the test file below, I am only processing "text analysis"
> information, see 6) above for the rationale.
>
> [
> @prefix : <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#> .
>      @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
>      @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
>      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>
>     :char=0,30     a nif:Context;
>          nif:isString " Welcome to Dublin in Ireland!";
>          nif:sourceUrl <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html> .
>
>     :char=12,18     a nif:RFC5147String;
>          nif:anchorOf "Dublin";
>          nif:referenceContext :char=0,30;
>          itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .
>
>     :xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D)
> nif:convertedFrom :char=12,18 .
> ]
>
>
> 8) What effort is needed from test suite contributors?
>
> For Leroy / TCD as the test suite owner, I'd say no effort is needed
> except creating directories for NIF input / output and documenting them on
> the test suite main page, saying that they are part of the normative
> conformance testing.
>
> We need at least one test suite contributor that would - in addition to me
> - implement the conversion exemplified in 7). The implementation should be
> pretty straightforward:
> 0 Re-use the code that generates your test suite output
> 1 Create the template under 7), fill in file base URI "XXX-base-uriXXX",
> "XXX-complete-source-file-text-content-XXX" and "XXX-complete-length-XXX"
> 2 For each node that has ITS 2 annotation:
> 2.1 create a subject URI. That is, instead of generating XPath in a line
> like /html/body[1]/h2[1]/span[1] , you
> 2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters in all
> element nodes preceding the current node
> 2.1.2 generate XXX-ITS-Annotation-End-XXX: take the string length of the
> current node and add it to XXX-ITS-Annotation-Start-XXX
> 2.2 Now you have the subject URI, built from the above offsets and the base
> URI. And now you can create this triple:
>        :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX     a
> nif:RFC5147String;
> 2.3 create the nif:anchorOf triple by putting the string value of the
> current node in, e.g. "Dublin";
> 2.4 create the nif:referenceContext triple by using the count of the
> complete document string length, e.g. :char=0,30;
> 2.5 for each annotation, create the triples, e.g. "itsrdf:taIdentRef
> <http://dbpedia.org/resource/Dublin>"
> 3 after the last annotation, put the dot "." at the end instead of ";".
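
The steps above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the XSLT implementation mentioned under 9): the function name its_to_nif_turtle is made up, the offsets are assumed to be computed already (steps 2.1.1 and 2.1.2), each node carries exactly one annotation, each value is assumed to be valid Turtle already, and no string escaping is done.

```python
def its_to_nif_turtle(base_uri, text, annotations):
    """Fill in the Turtle template from 7) for one document.

    annotations: list of (start, end, predicate, value) tuples. The offsets
    index into `text`; `value` must already be valid Turtle, e.g.
    "<http://dbpedia.org/resource/Dublin>". No escaping is performed.
    """
    context = f":char=0,{len(text)}"
    lines = [
        f"@prefix : <{base_uri}#> .",
        "@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .",
        "@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .",
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "",
        f"{context} a nif:Context;",
        f'    nif:isString "{text}";',
        f"    nif:sourceUrl <{base_uri}> .",
    ]
    for start, end, predicate, value in annotations:
        lines += [
            "",
            f":char={start},{end} a nif:RFC5147String;",
            f'    nif:anchorOf "{text[start:end]}";',
            f"    nif:referenceContext {context};",
            f"    {predicate} {value} .",  # last triple of the node ends with "."
        ]
    return "\n".join(lines)


# The "Dublin" example, with a placeholder base URI.
turtle = its_to_nif_turtle(
    "http://example.org/doc.html",
    " Welcome to Dublin in Ireland!",
    [(12, 18, "itsrdf:taIdentRef", "<http://dbpedia.org/resource/Dublin>")],
)
```

A node with several annotations would need the predicate lines joined with ";" and only the last one closed with ".", as step 3 says.
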
>
> 9) How much testing do we need?
>
> I have implemented the conversion for Text Analysis local, see
>
> http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
> and an example conversion, in which the triples are shown with the RDF
> validator http://tinyurl.com/qhhjgmb
> So if we would have one more implementer who could do 8) we would be done.
> I'd be happy to contribute input files. I would also be happy to do this
> for any other data category. But before starting I'd like to know which
> data category to use, so that I don't need to redo the input files.
>
>
> 10) When do we need this?
>
> We need this for finalizing ITS2, that is, the testing needs to be done
> within the next three weeks. I hope that with the test-suite-based
> description under 8) the processing of the actual conversion is
> straightforward and a question of a few hours for those who are producing
> test suite output anyway.
>
>
> 11) How critical is this?
>
> We need - in addition to me - one more volunteer, otherwise we can't
> finalize ITS2.
>
>
> Best,
>
> Felix
>
Received on Thursday, 23 May 2013 10:52:31 UTC