- From: Leroy Finn <finnle@tcd.ie>
- Date: Thu, 23 May 2013 11:51:59 +0100
- To: Felix Sasaki <fsasaki@w3.org>
- Cc: Phil Ritchie <philr@vistatec.ie>, Multilingual Web LT-TESTS Public <public-multilingualweb-lt-tests@w3.org>
- Message-ID: <CAMYWBwswqPBvcR_wAPB0Hp7YPsnKHOP81a38VQ219Ds-ve-TQw@mail.gmail.com>
Felix,

Dave has discussed implementing NIF and RDF so we could be in a position to test also.

Cheers,
Leroy

On 23 May 2013 08:30, Felix Sasaki <fsasaki@w3.org> wrote:

> Cool, thanks a lot, Phil, that was fast! I will then create a few input /
> output files for LQI by Monday & let's see who else will step up.
>
> Best,
>
> Felix
>
> On 23.05.13 09:24, Phil Ritchie wrote:
>
> Felix
>
> I volunteer. If the single test category could be LQI, all the better.
>
> Phil
>
> On 23 May 2013, at 08:17, "Felix Sasaki" <fsasaki@w3.org> wrote:
>
> Hi all,
>
> we have one feature in ITS2 that is not yet tested: the conversion to
> NIF. To fill this gap, I would propose the following approach.
>
> 1) The conversion is tested with example files for one data category.
> No need to have the conversion output of several data categories in one
> output file.
>
> 2) As in the definition of the ITS > NIF conversion algorithm,
> http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
> the conversion output represents ITS information, but no information
> about how it was generated (local markup, global rules, inheritance,
> defaults).
>
> 3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle, ...
> For testing purposes, having just one representation would be good, for
> ease of comparison.
>
> 4) It doesn't make sense to add the NIF conversion input / output to the
> test suite master file and to take it into account for comparison. The
> reason is that the test suite master file does a line-by-line comparison
> of test suite output. That doesn't provide useful info for NIF.
>
> 5) So how to compare output? The "meat" of the conversion to NIF is that
> for each node in a document that holds ITS information, a triple of the
> following form is generated:
>
> SubjectURI ITS2DataCategorySpecificPredicate Value
>
> "SubjectURI" is a URI that consists of the document base URI plus "#"
> plus character offsets.
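The offset-based subject URI construction can be sketched in Python (the helper name is invented for illustration, and the offsets here come from a simple substring search rather than real node traversal):

```python
# Illustrative helper (invented for this sketch): build an RFC 5147
# "#char=start,end" subject URI for an annotated substring.
def subject_uri(base_uri: str, text: str, annotated: str) -> str:
    start = text.index(annotated)   # character offset where the annotation begins
    end = start + len(annotated)    # offset just past its last character
    return f"{base_uri}#char={start},{end}"

base = ("http://www.w3.org/International/multilingualweb/lt/drafts/its20/"
        "examples/html5/EX-HTML-whitespace-normalization.html")
text = " Welcome to Dublin in Ireland!"
print(subject_uri(base, text, "Dublin"))  # -> ...#char=12,18
```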
Example for "Dublin" in
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
> this is
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18
>
> "ITS2DataCategorySpecificPredicate" is an RDF predicate that identifies
> the data category information in question, e.g. for "Dublin" it is
> "itsrdf:taIdentRef".
>
> "Value" is the data category value. For "Dublin" in the above example,
> this is http://dbpedia.org/resource/Dublin
>
> So we would need to compare whether two implementations create the same
> triples of the above form for a given input document. "The same" does not
> mean "the same offsets", e.g. "#char=12,18", since white space
> normalization for the NIF conversion is not defined in ITS2. "The same"
> means: for each node that contains ITS information there must be a triple
> like the above, with the same predicate and object.
>
> 6) Do we need automatic comparison?
>
> With the definition of "the same" as above, an automatic comparison of
> the output is hard. But it is not needed IMO: NIF conversion is one
> feature like e.g. Translate "global"; so having 5-10 input files that we
> can check manually would be sufficient. Also, we don't need to cover
> several data categories in one NIF conversion test file, and we might
> restrict the testing even to one data category: "text analysis",
> "translate", ...
>
> 7) How to do this practically?
>
> Below is a template for the output of the ITS2NIF conversion, in Turtle
> serialization.
>
> [
> @prefix : <XXX-base-uriXXX#> .
> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
> @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>
> :char=0,XXX-complete-length-XXX a nif:Context;
>   nif:isString " XXX-complete-source-file-text-content-XXX";
>   nif:sourceUrl <XXX-base-uriXXX> .
>
> :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
>   nif:anchorOf "XXX-annotated-string-XXX";
>   nif:referenceContext :char=0,XXX-complete-length-XXX;
>   XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
> ]
>
> The part starting with
> ":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;"
> and ending with
> "XXX-annotation-predicate-XXX XXX-annotation-value-XXX ."
> would be needed for each annotation.
>
> Here is an example of how the template would be filled in for
> http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
> Note that in the test file below, I am only processing "text analysis"
> information; see 6) above for the rationale.
>
> [
> @prefix : <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#> .
> @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
> @prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>
> :char=0,30 a nif:Context;
>   nif:isString " Welcome to Dublin in Ireland!";
>   nif:sourceUrl <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html> .
>
> :char=12,18 a nif:RFC5147String;
>   nif:anchorOf "Dublin";
>   nif:referenceContext :char=0,30;
>   itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .
>
> :xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D)
>   nif:convertedFrom :char=12,18 .
> ]
>
> 8) What effort is needed from test suite contributors?
>
> For Leroy / TCD as the test suite owner, I'd say no effort is needed
> except creating directories for NIF input / output and documenting them
> on the test suite main page, saying that they are part of the normative
> conformance testing.
>
> We need at least one test suite contributor who would - in addition to
> me - implement the conversion exemplified in 7). The implementation
> should be pretty straightforward:
>
> 0 Re-use the code that generates your test suite output.
> 1 Create the template under 7); fill in the file base URI
>   "XXX-base-uriXXX", "XXX-complete-source-file-text-content-XXX" and
>   "XXX-complete-length-XXX".
> 2 For each node that has ITS 2 annotation:
> 2.1 Create a subject URI. That is, instead of generating an XPath like
>   /html/body[1]/h2[1]/span[1], you
> 2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters in all
>   element nodes preceding the current node;
> 2.1.2 generate XXX-ITS-Annotation-End-XXX: take the string length of the
>   current node and add XXX-ITS-Annotation-Start-XXX.
> 2.2 Now you have the subject URI, using the above offsets and the base
>   URI. And now you can create this triple:
>   :char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
> 2.3 Create the nif:anchorOf triple by putting in the string value of the
>   current node, e.g. "Dublin";
> 2.4 Create the nif:referenceContext triple by using the count of the
>   complete document string length, e.g. :char=0,30;
> 2.5 For each annotation, create the triples, e.g.
itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .
> 3 After the last annotation, put a dot "." at the end instead of ";".
>
> 9) How much testing do we need?
>
> I have implemented the conversion for Text Analysis local, see
> http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
> and an example conversion, in which the triples are shown with the RDF
> validator: http://tinyurl.com/qhhjgmb
> So if we had one more implementer who could do 8), we would be done. I'd
> be happy to contribute input files. I would also be happy to do this for
> any other data category. But before starting, I'd like to know which data
> category to use, so that I don't need to redo the input files.
>
> 10) When do we need this?
>
> We need this for finalizing ITS2; that is, the testing needs to be done
> within the next three weeks. I hope that with the test suite based
> description under 8) the processing of the actual conversion is
> straightforward and a question of a few hours for those who are producing
> test suite output anyway.
>
> 11) How critical is this?
>
> We need - in addition to me - one more volunteer, otherwise we can't
> finalize ITS2.
>
> Best,
>
> Felix
>
> ************************************************************
> VistaTEC Ltd. Registered in Ireland 268483.
> Registered Office, VistaTEC House, 700, South Circular Road,
> Kilmainham. Dublin 8. Ireland.
>
> The information contained in this message, including any accompanying
> documents, is confidential and is intended only for the addressee(s).
> The unauthorized use, disclosure, copying, or alteration of this
> message is strictly forbidden. If you have received this message in
> error please notify the sender immediately.
> ************************************************************
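The per-annotation recipe in steps 2.1-2.5 and 3 of point 8) can be sketched as follows (a Python illustration with invented names; offsets are found by a plain substring search rather than real node traversal, so it only stands in for the counting described in 2.1.1 / 2.1.2):

```python
# Sketch of steps 2.1-2.5 / 3 from 8): emit the per-annotation Turtle
# triples of the template in 7). Function and variable names are invented.
def nif_triples(text, annotations):
    """annotations: list of (annotated_string, predicate, value) tuples."""
    context = f":char=0,{len(text)}"           # whole-document context node
    lines = []
    for annotated, predicate, value in annotations:
        start = text.index(annotated)          # 2.1.1: characters before the node
        end = start + len(annotated)           # 2.1.2: start + string length
        lines.append(f":char={start},{end} a nif:RFC5147String;")  # 2.2
        lines.append(f'  nif:anchorOf "{annotated}";')             # 2.3
        lines.append(f"  nif:referenceContext {context};")         # 2.4
        lines.append(f"  {predicate} <{value}> .")                 # 2.5 / 3
    return "\n".join(lines)

print(nif_triples(
    " Welcome to Dublin in Ireland!",
    [("Dublin", "itsrdf:taIdentRef", "http://dbpedia.org/resource/Dublin")],
))
```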
Received on Thursday, 23 May 2013 10:52:31 UTC