Re: [aksw-core] Charter for W3C Community Group Natural Language Interfaces for the Web of Data

From: Petr Baudis <pasky@ucw.cz> · Date: Sun, 10 Apr 2016 04:47:41 +0200

  Hi!

  I was actually wondering about what the main purpose and relevance
of this W3C community would be - but the idea of proposing some common
reference benchmarks and suites for training+testing machine-learned
information retrieval systems is excellent and makes absolute sense!

  The most widespread common benchmarks for natural language interfaces
to the Web of Data are probably:

  * Semantic Parsing datasets like GeoQuery, Free917, WebQuestions,
    SimpleQuestions and QALD.

  * Fulltext Question Answering datasets like the TREC9-12 QA dataset.

  * Domain-specific datasets (which may also be hybrid), e.g. BioASQ.

  There are various issues that need to be addressed - e.g. uneven
sampling of user inputs (say, numerical questions are poorly covered),
difficulties in evaluating output correctness, temporal instability and
dependencies on continuously evolving corpora.
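
  To make the output correctness point concrete, here is a rough
sketch of one possible measure: a set-based F1 per question, averaged
over all questions.  The function names and the normalization rule
(lowercasing, collapsing whitespace) are my own assumptions for
illustration, not the official scorer of any of the datasets above.

```python
from typing import Iterable, List, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def answer_f1(gold: Iterable[str], predicted: Iterable[str]) -> float:
    """F1 between the gold and predicted answer sets of a single question."""
    gold_set = {normalize(a) for a in gold}
    pred_set = {normalize(a) for a in predicted}
    if not gold_set and not pred_set:
        return 1.0  # nothing expected, nothing returned
    overlap = len(gold_set & pred_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs: List[Tuple[List[str], List[str]]]) -> float:
    """Average the per-question F1 over all (gold, predicted) pairs."""
    if not pairs:
        return 0.0
    return sum(answer_f1(g, p) for g, p in pairs) / len(pairs)
```

  Exact string matching like this is deliberately naive; it does not
handle aliases, dates or numbers written in different forms, which is
part of why agreeing on an evaluation procedure is not trivial.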

  Plus, getting started requires a lot of preprocessing of the inputs
(ranging from converting between formats, as the datasets share no
common one, to different methods for entity linking).  This work could
be shared but often isn't (or is just thrown into public repositories
with little or no documentation), and that makes things hard for
newcomers.
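
  As an illustration of what a shared preprocessing step might look
like, here is a minimal sketch that maps differently shaped raw records
onto one common record and serializes them as JSON lines.  The
QuestionRecord layout and the raw key names it tries ("question",
"text", "utterance", "answers", "answer", "id") are hypothetical, not
the real schema of any particular dataset.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class QuestionRecord:
    """One benchmark question in a hypothetical shared layout."""
    qid: str
    text: str
    answers: List[str]
    source: str  # name of the dataset the question came from


def from_raw(raw: Dict[str, Any], source: str, index: int) -> QuestionRecord:
    """Map a raw record onto QuestionRecord, tolerating differing key names."""
    # Illustrative guesses at key names; adjust per dataset.
    text = raw.get("question") or raw.get("text") or raw.get("utterance") or ""
    answers = raw.get("answers") or raw.get("answer") or []
    if isinstance(answers, str):
        answers = [answers]  # a single answer string becomes a one-item list
    # Fall back to a generated id when the raw record carries none.
    qid = str(raw.get("id", f"{source}-{index:06d}"))
    return QuestionRecord(qid=qid, text=text.strip(),
                          answers=list(answers), source=source)


def to_json_lines(records: List[QuestionRecord]) -> str:
    """Serialize records one JSON object per line, with stable key order."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
```

  Entity linking and other heavier steps are left out on purpose; the
point is only that the format conversion itself can be small and
documented.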

  And I'm sure others who approach this from different viewpoints will
see different issues to address.  (My background is in machine learning
and NLP rather than semantic web and ontologies.)

  In our activities related to YodaQA, we are maintaining and
evolving several datasets; maybe they could serve as starting points
for some common benchmarks:

  * https://github.com/brmson/dataset-factoid-curated for an evolving,
    much cleaned-up version of the TREC9-12 dataset; many of these
    questions cannot be answered from the Web of Data (at least for
    now), though

  * https://github.com/brmson/dataset-factoid-movies for domain-specific
    questions on movies (which makes for an attractive and well defined
    subset for good coverage)

  * https://github.com/brmson/dataset-factoid-webquestions for a suite
    of tools and post-processed versions of the popular WebQuestions
    dataset

  I think making progress on commonly accepted benchmarks and query
datasets would be a valuable contribution of this community!  Would
others in this community be interested in working towards this?

On Wed, Apr 06, 2016 at 04:38:19PM +0200, Ricardo Usbeck wrote:
> Dear all, 
> 
> thanks for the online as well as offline discussion. So far, we identified some common points for deliverables of this group:
> 
> * At least one common format for benchmarks
> * At least one test suite for extensive benchmarking of components, e.g., like [2]
> 
> We also identified discussions pertaining to:
> * How and how often to communicate
> 
> @Edgard: I think we will focus on protocols like [1]
> 
> Furthermore, if you want to edit the charter [3], let me know your github user name. 
> 
> Best regards
> Ricardo
> 
> [1] Andreas Both, Dennis Diefenbach, Kuldeep Singh, Saeedeh Shekarpour, Didier Cherix and Christoph Lange. Qanary -- An Extensible Vocabulary for Open Question Answering Systems
> [2] GERBIL -- General Entity Annotation Benchmark Framework by Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe Rizzo, Harald Sack, René Speck, Raphaël Troncy, Jörg Waitelonis, and Lars Wesemann in 24th WWW conference
> [3] https://github.com/Natural-Language-Interfaces-CG/charter
> 
> On 16 Mar 2016, at 15:53, Edgard Marx <marx@informatik.uni-leipzig.de> wrote:
> > 
> > Hi Ricardo,
> > 
> > Thanks for leading the discussion and organization.
> > 
> > >>  * scope and goals (e.g., an ontology to ease communication of modules across platforms and deployments)
> > 
> > First of all, I would like to start a discussion regarding the scope of the working group.
> > In my opinion, a good start is to define some boundaries.
> > 
> > For instance, will the group work on interfaces as (a) Communication Protocols or (b) User Interfaces?
> > We can be even more specific, e.g. in case (a) our work would just be to define the message format etc.
> > 
> > In my opinion the scope should be at the functional level of NLP processes, e.g. input/output, not even specifying the format.
> > Programming languages do it, and it works.
> > 
> > >> * communication process (monthly telcos?)
> > 
> > I think nowadays there are nice social media tools that help people to follow and participate in discussions, e.g. Facebook, Doodle.
> > I would organize calls only if it is extremely necessary. However, I am not against having them :-).
> > 
> > >>  * deliverables? how to coordinate a specification
> > Yes, this works just fine, tasks/goals/roles.
> > 
> > best regards,
> > Edgard
> > 
> > 
> > On Wed, Mar 16, 2016 at 9:37 AM, Ricardo Usbeck <usbeck@informatik.uni-leipzig.de> wrote:
> > *** Apologies for cross-posting ***
> > 
> > Dear all,
> > 
> > we are currently looking for input to our charter for the W3C Community Group Natural Language Interfaces for the Web of Data https://www.w3.org/community/nli/.
> > 
> > The current draft can be found here http://natural-language-interfaces-cg.github.io/charter/charter-nli.md
> > 
> > Main issues currently:
> >  * communication process (monthly telcos?)
> >  * scope and goals (e.g., an ontology to ease communication of modules across platforms and deployments)
> >  * deliverables? how to coordinate a specification
> > And of course, anything else you are interested in to clarify the direction of this CG.
> > 
> > 
> > Feel free to contribute directly to the git repository https://github.com/Natural-Language-Interfaces-CG/charter
> > 
> > Best regards,
> > Ricardo 
> > 
> > _______________________________________________
> > aksw-core mailing list
> > aksw-core@lists.informatik.uni-leipzig.de
> > http://lists.informatik.uni-leipzig.de/mailman/listinfo/aksw-core
> > 
> > 
> 

-- 
				Petr Baudis
	If you have good ideas, good data and fast computers,
	you can do almost anything. -- Geoffrey Hinton