From embedded structured data to queryable websites from William Van Woensel on 2017-01-27 (public-lod@w3.org from January 2017)

From: William Van Woensel <William.Van.Woensel@Dal.Ca>
Date: Fri, 27 Jan 2017 13:00:16 +0000
To: "public-lod@w3.org" <public-lod@w3.org>
CC: Sven Casteleyn <sven.casteleyn@uji.es>
Message-ID: <8utpiovkn1m4aqxyvnhd8m32.1485521966468@email.android.com>

Hi all,

I've been following with great interest the latest discussion on this mailing list about useful ways to leverage HTML+RDFa. (Apologies for my late input.) As a PhD student, I authored a few papers [1, 2] on using semantic annotations to increase the accuracy of website adaptation / augmentation. Indeed, one of the next steps in this line of research was to make websites queryable: for example, by crawling websites, extracting embedded structured data as RDF (reified, to retain links to the accompanying web content), and then making the RDF data available for querying, with the necessary crawling, querying, etc. middleware deployed either in JavaScript, on the website's server, or on some specialized server. On top of such a queryable-website infrastructure, one can deploy all sorts of useful applications, such as website adaptation mechanisms and many others. In fact, you may remember that we had a relevant discussion some time ago on the public-lod mailing list (https://lists.w3.org/Archives/Public/public-lod/2016Oct/) on the advantages of *embedded* structured data, as opposed to metadata that is made available separately.
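
To make the extract-then-query step above concrete, here is a minimal sketch in Python. It is not the implementation mentioned in this message. It assumes a heavily simplified subset of RDFa Lite (only `typeof`, `resource`, `property` and `content`), leaves terms unexpanded against `vocab`, and does not handle a `property` nested inside another `property`. The names `Statement`, `EmbeddedDataExtractor`, `extract`, `query` and the sample page are all hypothetical. "Reified" is approximated by storing, with each triple, a locator for the HTML element it came from.

```python
from html.parser import HTMLParser
from typing import NamedTuple


class Statement(NamedTuple):
    """A triple plus a pointer back to the HTML element that carried it."""
    subject: str
    predicate: str
    obj: str
    source: str  # hypothetical locator of the form "tag@line:column"


class EmbeddedDataExtractor(HTMLParser):
    """Collects statements from a simplified RDFa Lite subset (an assumption)."""

    VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}

    def __init__(self):
        super().__init__()
        self.statements = []
        self._stack = []       # per open element: the subject it introduced, or None
        self._pending = None   # (subject, predicate, source, depth, text parts)
        self._counter = 0      # used to mint blank-node labels

    def _current_subject(self):
        for subject in reversed(self._stack):
            if subject is not None:
                return subject
        return None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        line, col = self.getpos()
        source = f"{tag}@{line}:{col}"
        parent = self._current_subject()
        subject = None
        if "typeof" in a:
            # A typed element introduces a new subject (blank node if unnamed).
            self._counter += 1
            subject = a.get("resource") or f"_:b{self._counter}"
            self.statements.append(Statement(subject, "rdf:type", a["typeof"], source))
        elif "property" in a and parent is not None:
            if "content" in a:
                self.statements.append(Statement(parent, a["property"], a["content"], source))
            elif tag not in self.VOID_TAGS:
                # Literal value is the element's text; emit it at the end tag.
                self._pending = (parent, a["property"], source, len(self._stack), [])
        if tag not in self.VOID_TAGS:
            self._stack.append(subject)

    def handle_data(self, data):
        if self._pending is not None:
            self._pending[4].append(data)

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self._stack:
            return
        self._stack.pop()
        if self._pending is not None and len(self._stack) == self._pending[3]:
            subject, predicate, source, _, parts = self._pending
            self.statements.append(Statement(subject, predicate, "".join(parts).strip(), source))
            self._pending = None


def extract(html):
    """Parse one page and return its statements in document order."""
    parser = EmbeddedDataExtractor()
    parser.feed(html)
    parser.close()
    return parser.statements


def query(statements, subject=None, predicate=None, obj=None):
    """Match statements against a triple pattern; None acts as a wildcard."""
    return [s for s in statements
            if (subject is None or s.subject == subject)
            and (predicate is None or s.predicate == predicate)
            and (obj is None or s.obj == obj)]


# Made-up sample page, for illustration only.
PAGE = """<html><body vocab="http://schema.org/">
<div typeof="Product" resource="#p1">
  <h1 property="name">Trail Shoe</h1>
  <meta property="sku" content="TS-42">
  <p property="description">A light <b>running</b> shoe.</p>
</div>
</body></html>"""

statements = extract(PAGE)
names = query(statements, subject="#p1", predicate="name")
```

Because every statement keeps its `source` locator, an application built on top (a website adaptation mechanism, say) can go from a query result back to the element on the page that the data describes. A real deployment would more likely use a full RDFa/Microdata parser and a SPARQL engine than this hand-rolled matcher.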

After having privately discussed this idea of "queryable websites" (as well as some other related ideas) a while ago with Ruben, mentioning my own (partial) implementation and offering to cooperate on the effort, I am quite surprised to see this idea reappear here now. Unfortunately, this means that there are now two separate approaches and implementations, likely with a lot of overlapping code and duplicated work. We're currently in the process of writing a journal paper on this work. Of course, having multiple existing systems also reflects the interest of the community in this topic, which is not bad.

Regardless, in the same line of research, another major issue is the extent to which *useful* embedded structured data are actually present in websites for third-party scenarios. Recent work in this field confirms that the large majority of new schema.org classes and properties are not being adopted [3], and semantic annotations are often limited to title and description [4, 5]. Quote from Bizer et al. [4] on a third-party e-commerce scenario: "this means that applications that for instance want to find out which websites offer a specific product need to employ additional information extraction techniques on these fields in order to gain a deeper understanding of their content (exact product type, product features), following the promise that a little semantics goes a long way."

Consequently, a first useful step would be to study the scope of the available embedded structured data, and for what kind of third-party scenarios they could be useful. The Web Data Commons initiative recently released a new corpus - right on time for this kind of effort :)

Kind regards,

William

[1] Sven Casteleyn, William Van Woensel, Olga De Troyer. Assisting Mobile Web Users: Client-Side Injection of Context-Sensitive Cues into Websites. In Proceedings of the 12th International Conference on Information Integration and Web-based Applications & Services, pp. 443–450, ACM, Paris, France, 2010, ISBN: 978-1-4503-0421-4.

[2] William Van Woensel, Sven Casteleyn, Olga De Troyer. A Generic Approach for On-The-Fly Adding of Context-Aware Features to Existing Websites. In Proceedings of the 22nd Conference on Hypertext and Hypermedia, pp. 143–152, ACM, Eindhoven, Netherlands, 2011, ISBN: 978-1-4503-0256-2.

[3] Robert Meusel, Christian Bizer, Heiko Paulheim. A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (WIMS2015), Limassol, Cyprus, July 2015.

[4] Christian Bizer, Kai Eckert, Robert Meusel, Hannes Mühleisen, Michael Schuhmacher, Johanna Völker. Deployment of RDFa, Microdata, and Microformats on the Web - A Quantitative Analysis. In Proceedings of the 12th International Semantic Web Conference, Part II: In-Use Track, pp. 17–32 (ISWC2013).

[5] Robert Meusel, Petar Petrovski, Christian Bizer. The WebDataCommons Microdata, RDFa and Microformat Dataset Series. In Proceedings of the 13th International Semantic Web Conference: Replication, Benchmark, Data and Software Track (ISWC2014).

Received on Friday, 27 January 2017 13:00:53 UTC