- From: Leroy Finn <finnle@tcd.ie>
- Date: Thu, 23 May 2013 11:53:01 +0100
- To: Felix Sasaki <fsasaki@w3.org>
- Cc: Phil Ritchie <philr@vistatec.ie>, Multilingual Web LT-TESTS Public <public-multilingualweb-lt-tests@w3.org>
- Message-ID: <CAMYWBwtNNA7-WMVHPfqB=93WnyScgvZeS52g4GsLTkQdruzn-g@mail.gmail.com>
Sorry, I meant ITS and NIF.

Leroy

On 23 May 2013 11:51, Leroy Finn <finnle@tcd.ie> wrote:

Felix,

Dave has discussed implementing NIF and RDF, so we could be in a position to test as well.

Cheers,
Leroy

On 23 May 2013 08:30, Felix Sasaki <fsasaki@w3.org> wrote:

Cool, thanks a lot, Phil, that was fast! I will then create a few input/output files for LQI by Monday, and let's see who else will step up.

Best,

Felix

On 23.05.13 09:24, Phil Ritchie wrote:

Felix

I volunteer. If the single test category could be LQI, all the better.

Phil

On 23 May 2013, at 08:17, "Felix Sasaki" <fsasaki@w3.org> wrote:

Hi all,

we have one feature in ITS2 that is not yet tested: the conversion to NIF. To fill this gap, I would propose the following approach.

1) The conversion is tested with example files for one data category. There is no need to have the conversion output of several data categories in one output file.

2) As in the definition of the ITS-to-NIF conversion algorithm
http://www.w3.org/TR/2013/WD-its20-20130521/#conversion-to-nif
the conversion output represents ITS information, but no information about how it was generated (local markup, global rules, inheritance, defaults).

3) RDF can be represented in various formats: RDF/XML, RDFa, Turtle, etc. For testing purposes, having just one representation would be good, for ease of comparison.

4) It doesn't make sense to add the NIF conversion input/output to the test suite master file and to take it into account for comparison. The reason is that the test suite master file does a line-by-line comparison of test suite output, which doesn't provide useful information for NIF.

5) So how do we compare output? The "meat" of the conversion to NIF is that for each node in a document that holds ITS information, a triple of the following form is generated:

SubjectURI ITS2DataCategorySpecificPredicate Value

"SubjectURI" is a URI that consists of the document base URI plus "#" plus character offsets. For example, for "Dublin" in
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
this is
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#char=12,18

"ITS2DataCategorySpecificPredicate" is an RDF predicate that identifies the data category information in question; for "Dublin" it is "itsrdf:taIdentRef".

"Value" is the data category value. For "Dublin" in the above example, this is http://dbpedia.org/resource/Dublin

So we would need to compare whether two implementations create the same triples of the above form for a given input document. "The same" does not mean "the same offsets", e.g. "#char=12,18", since whitespace normalization for the NIF conversion is not defined in ITS2. "The same" means: for each node that contains ITS information there must be a triple like the above, with the same predicate and object.
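A minimal sketch of this comparison rule, assuming both outputs are serialized as Turtle and using Python with rdflib (an illustrative choice, not something the test suite prescribes; pairing annotations by their nif:anchorOf string is likewise an assumption):

# Sketch only: compare two NIF conversion outputs in the sense of 5):
# same ITS predicate and object per annotated node; character offsets
# such as #char=12,18 are deliberately ignored.
from rdflib import Graph, Namespace

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = "http://www.w3.org/2005/11/its/rdf#"

def annotation_set(turtle_file):
    """Return a set of (anchored string, predicate, object) for every annotation."""
    g = Graph()
    g.parse(turtle_file, format="turtle")
    result = set()
    for subj, anchor in g.subject_objects(NIF.anchorOf):
        for pred, obj in g.predicate_objects(subj):
            # keep only the ITS data category triples, not the NIF bookkeeping
            if str(pred).startswith(ITSRDF):
                result.add((str(anchor), str(pred), str(obj)))
    return result

def same_annotations(file_a, file_b):
    """Two outputs count as "the same" if they agree on this set."""
    return annotation_set(file_a) == annotation_set(file_b)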
6) Do we need automatic comparison?

With the definition of "the same" above, an automatic comparison of the output is hard. But it is not needed IMO: NIF conversion is one feature like, e.g., Translate "global", so having 5-10 input files that we can check manually would be sufficient. Also, we don't need to cover several data categories in one NIF conversion test file, and we might even restrict the testing to one data category: "text analysis", "translate", ...

7) How to do this practically?

Below is a template for the output of the ITS2NIF conversion, in Turtle serialization.

[
@prefix : <XXX-base-uriXXX#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:char=0,XXX-complete-length-XXX a nif:Context;
  nif:isString " XXX-complete-source-file-text-content-XXX";
  nif:sourceUrl <XXX-base-uriXXX> .

:char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
  nif:anchorOf "XXX-annotated-string-XXX";
  nif:referenceContext :char=0,XXX-complete-length-XXX;
  XXX-annotation-predicate-XXX XXX-annotation-value-XXX .
]

The part starting with
":char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;"
and ending with
"XXX-annotation-predicate-XXX XXX-annotation-value-XXX ."
would be needed for each annotation.

Here is an example of how the template would be filled in for
http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html
Note that in the test file below, I am only processing "text analysis" information; see 6) above for the rationale.

[
@prefix : <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:char=0,30 a nif:Context;
  nif:isString " Welcome to Dublin in Ireland!";
  nif:sourceUrl <http://www.w3.org/International/multilingualweb/lt/drafts/its20/examples/html5/EX-HTML-whitespace-normalization.html> .

:char=12,18 a nif:RFC5147String;
  nif:anchorOf "Dublin";
  nif:referenceContext :char=0,30;
  itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin> .

:xpath(/html/body%5B1%5D/h2%5B1%5D/span%5B1%5D/text()%5B1%5D)
  nif:convertedFrom :char=12,18 .
]
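For illustration only (not part of the test suite requirements): the filled-in example above could be produced with any Turtle writer. The sketch below uses Python's rdflib and hard-codes the offsets and strings of the "Dublin" example just to show the shape of the output; the optional nif:convertedFrom back-pointer is omitted.

# Sketch: emit the filled-in template above with rdflib (an illustrative
# choice; any Turtle writer works). Offsets and strings are hard-coded for
# the "Dublin" example.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

BASE = ("http://www.w3.org/International/multilingualweb/lt/drafts/its20/"
        "examples/html5/EX-HTML-whitespace-normalization.html")
NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")

g = Graph()
g.bind("nif", NIF)
g.bind("itsrdf", ITSRDF)

text = " Welcome to Dublin in Ireland!"          # complete text content of the file
context = URIRef(f"{BASE}#char=0,{len(text)}")   # XXX-complete-length-XXX = 30
g.add((context, RDF.type, NIF.Context))
g.add((context, NIF.isString, Literal(text)))
g.add((context, NIF.sourceUrl, URIRef(BASE)))

start, end = 12, 18                              # offsets of "Dublin"
annotated = URIRef(f"{BASE}#char={start},{end}")
g.add((annotated, RDF.type, NIF.RFC5147String))
g.add((annotated, NIF.anchorOf, Literal(text[start:end])))
g.add((annotated, NIF.referenceContext, context))
g.add((annotated, ITSRDF.taIdentRef, URIRef("http://dbpedia.org/resource/Dublin")))

print(g.serialize(format="turtle"))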
8) What effort is needed from test suite contributors?

For Leroy / TCD as the test suite owner, I'd say no effort is needed except creating directories for NIF input / output and documenting them on the test suite main page, saying that they are part of the normative conformance testing.

We need at least one test suite contributor who would - in addition to me - implement the conversion exemplified in 7). The implementation should be pretty straightforward:

0 Re-use the code that generates your test suite output.
1 Create the template under 7), filling in the file base URI "XXX-base-uriXXX", "XXX-complete-source-file-text-content-XXX" and "XXX-complete-length-XXX".
2 For each node that has ITS 2 annotation:
2.1 Create a subject URI. That is, instead of generating an XPath expression like /html/body[1]/h2[1]/span[1], you
2.1.1 generate XXX-ITS-Annotation-Start-XXX: count the characters of the text in all nodes preceding the current node, and
2.1.2 generate XXX-ITS-Annotation-End-XXX: take the string length of the current node and add it to XXX-ITS-Annotation-Start-XXX (a sketch of this offset counting is appended at the end of this message).
2.2 Now you have the subject URI via the above offsets and the base URI, and you can create this triple:
:char=XXX-ITS-Annotation-Start-XXX,XXX-ITS-Annotation-End-XXX a nif:RFC5147String;
2.3 Create the nif:anchorOf triple by putting in the string value of the current node, e.g. "Dublin";
2.4 Create the nif:referenceContext triple by using the complete document string length, e.g. :char=0,30;
2.5 For each annotation, create the corresponding triples, e.g. itsrdf:taIdentRef <http://dbpedia.org/resource/Dublin>
3 After the last annotation, put a dot "." at the end instead of ";".

9) How much testing do we need?

I have implemented the conversion for Text Analysis local, see
http://www.w3.org/People/fsasaki/its20-general-processor/tools/its-ta-2-nif.xsl
and an example conversion in which the triples are shown with the RDF validator: http://tinyurl.com/qhhjgmb

So if we had one more implementer who could do 8), we would be done. I'd be happy to contribute input files. I would also be happy to do this for any other data category, but before starting I'd like to know which data category to use, so that I don't need to redo the input files.

10) When do we need this?

We need this for finalizing ITS2, that is, the testing needs to be done within the next three weeks. I hope that with the test-suite-based description under 8) the actual conversion is straightforward and a question of a few hours for those who are producing test suite output anyway.

11) How critical is this?

We need - in addition to me - one more volunteer, otherwise we can't finalize ITS2.

Best,

Felix

************************************************************
VistaTEC Ltd. Registered in Ireland 268483.
Registered Office, VistaTEC House, 700, South Circular Road,
Kilmainham. Dublin 8. Ireland.

The information contained in this message, including any accompanying
documents, is confidential and is intended only for the addressee(s).
The unauthorized use, disclosure, copying, or alteration of this
message is strictly forbidden. If you have received this message in
error please notify the sender immediately.
************************************************************
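The offset-counting sketch referenced from step 2.1.2 above: a rough illustration only (not Felix's implementation; his XSLT linked under 9) covers Text Analysis), assuming well-formed XML/XHTML input parsed with Python's standard library and leaving whitespace untouched, since ITS2 does not define normalization for the NIF conversion.

# Sketch of steps 2.1.1 / 2.1.2: derive #char=start,end offsets for a node
# by counting the characters of all text that precedes it in document order.
import xml.etree.ElementTree as ET

def text_chunks(elem):
    """Yield (element, text) pairs in document order: the element's own text,
    then its children recursively, then each child's tail text."""
    if elem.text:
        yield elem, elem.text
    for child in elem:
        yield from text_chunks(child)
        if child.tail:
            yield elem, child.tail

def char_offsets(root, target):
    """Return (start, end) of target's own text within the document text."""
    pos = 0
    for elem, chunk in text_chunks(root):
        if elem is target and chunk == target.text:
            return pos, pos + len(chunk)
        pos += len(chunk)
    raise ValueError("target element not found or has no text")

root = ET.fromstring(
    "<body><h2>Welcome to <span>Dublin</span> in Ireland!</h2></body>")
span = root.find(".//span")
# Prints (11, 17) for this stripped-down input; the full
# EX-HTML-whitespace-normalization.html file yields char=12,18 because its
# text content starts with a whitespace character.
print(char_offsets(root, span))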
Received on Thursday, 23 May 2013 10:53:33 UTC