Re: Whither the schema.org JSON-LD context? from Martin Hepp on 2014-01-07 (public-vocabs@w3.org from January 2014)

From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Date: Tue, 7 Jan 2014 12:45:16 +0100
To: Markus Lanthaler <markus.lanthaler@gmx.net>
Cc: "'Dan Brickley'" <danbri@danbri.org>, "'Sandro Hawke'" <sandro@hawke.org>, "'Gregg Kellogg'" <gregg@greggkellogg.net>, "'Dan Brickley'" <danbri@google.com>, "'Ramanathan Guha'" <guha@google.com>, "'W3C Web Schemas Task Force'" <public-vocabs@w3.org>, "'Linked JSON'" <public-linked-json@w3.org>
Message-Id: <5843248D-0E2A-42BC-BCBA-200C47DC37B2@ebusiness-unibw.org>

Hi all:

A related observation: I am seeing ***huge*** amounts of traffic on the HTML definitions of a few selected http://www.productontology.org types. I suspect the reason is that some browsers or other components dereference the URIs of types used in Microdata or RDFa markup (which does not make sense in 99% of cases, IMHO), at least when those URIs appear in a <link> element in HTML, e.g.:

<div itemscope itemtype="http://schema.org/Product" itemid="#product">
    <link itemprop="additionalType" href="http://www.productontology.org/id/Automobile" />
    <span itemprop="name">.. a short name for the object ...</span>
...
</div>

I have not yet fully investigated the reasons, but I suspect that some clients improperly apply prefetching [1] to URIs of conceptual elements mentioned in Microdata or RDFa markup.

Either the clients themselves or components in between try to load the URIs of types. The latter could include pre-caching techniques operated by mobile network providers that aim to accelerate Web browsing on mobile devices.

I am raising this because the same could likely happen to a majority of schema.org element URIs - with the disastrous effect that millions of clients try to dereference schema.org URIs as soon as they process an HTML document that mentions a schema.org element in a <link> element.

The problem is worse in my scenario since I use content negotiation and different URIs for the element (http://www.productontology.org/id/Automobile) and the document (http://www.productontology.org/doc/Automobile), which means that a client dereferencing, e.g., the URI

    http://www.productontology.org/id/Automobile

will get an HTTP 303 redirect to

    http://www.productontology.org/doc/Automobile

which, to my knowledge, cannot be cached. So even with proper HTTP cache control, I cannot stop the traffic on the URIs of the conceptual elements.
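
To make the effect concrete, here is a minimal Python sketch of the /id/ to /doc/ setup described above. It simulates the server and a caching client in-process (no network involved). The names `respond` and `CachingClient`, the `max-age` value, and the assumption that the client never caches the 303 response are all illustrative and not taken from the actual productontology.org implementation.

```python
from collections import Counter

ID_PREFIX = "/id/"    # URIs of the conceptual elements
DOC_PREFIX = "/doc/"  # URIs of the documents describing them


def respond(path):
    """Hypothetical server: 303 for element URIs, cacheable 200 for documents."""
    if path.startswith(ID_PREFIX):
        return 303, {"Location": DOC_PREFIX + path[len(ID_PREFIX):]}
    if path.startswith(DOC_PREFIX):
        return 200, {"Cache-Control": "max-age=86400"}
    return 404, {}


class CachingClient:
    """Hypothetical client that caches 200 responses but never 303 redirects."""

    def __init__(self):
        self.cache = set()
        self.server_hits = Counter()

    def fetch(self, path):
        while True:
            if path in self.cache:
                return path  # served from cache, no request sent
            self.server_hits[path] += 1
            status, headers = respond(path)
            if status == 303:
                path = headers["Location"]  # follow the redirect
                continue
            if status == 200 and "max-age" in headers.get("Cache-Control", ""):
                self.cache.add(path)
            return path


client = CachingClient()
for _ in range(3):
    final = client.fetch("/id/Automobile")
print(final, dict(client.server_hits))
```

Under these assumptions, the document is requested only once, but every dereference of the element URI still reaches the server, because only the target of the redirect ends up in the cache.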

Since schema.org uses the same URIs for the page of an element and the element, HTTP caching will be more effective; still, there is the risk of huge amounts of useless requests.

Martin


[1] http://en.wikipedia.org/wiki/Link_prefetching


On Jan 7, 2014, at 11:16 AM, Markus Lanthaler wrote:

> On Monday, January 06, 2014 8:56 PM, Dan Brickley wrote:
>> I think it's reasonable to expect a static file published this
>> quarter.
> 
> Great!
> 
> 
>> However you're right that we do have concerns about the
>> schema.org *website* forming an integral part of numerous unknown
>> software systems and applications. It ought to be possible to do
>> useful things with schema.org-based json-ld without a dependency on
>> the Web site.
> 
> Sure.. and as you know it's quite simple. All you would have to do is to change the example to start with
> 
> {
>  "@context": {
>    "@vocab": "http://schema.org/"
>  },
>  ...
> }
> 
> instead of the slightly simpler
> 
> {
>  "@context": "http://schema.org/",
>  ...
> }
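
As an illustration of the inline variant quoted above, the following Python sketch expands plain keys against an inline "@vocab" without any request to schema.org. It is a deliberately tiny approximation and not the JSON-LD expansion algorithm; the function name `expand_keys` and the sample document are hypothetical.

```python
def expand_keys(doc):
    """Expand top-level keys against an inline @vocab; nothing is fetched."""
    context = doc.get("@context")
    if not isinstance(context, dict) or "@vocab" not in context:
        # A string context such as "http://schema.org/" would need a fetch.
        raise ValueError("expected an inline @context with @vocab")
    vocab = context["@vocab"]
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        if key == "@type" and isinstance(value, str) and ":" not in value:
            expanded[key] = vocab + value  # type names are vocabulary-relative
        elif key.startswith("@"):
            expanded[key] = value          # other keywords are kept as-is
        else:
            expanded[vocab + key] = value
    return expanded


# Hypothetical sample document using the inline form.
doc = {
    "@context": {"@vocab": "http://schema.org/"},
    "@type": "Person",
    "name": "Jane Doe",
}
expanded = expand_keys(doc)
print(expanded)
```

With the string form of "@context", this sketch raises an error instead of fetching anything, which is exactly the dependency on the Web site that the inline form avoids.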
> 
> 
>> W3C's experience with XML parsers that auto-fetch
>> http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd and
>> http://www.w3.org/1999/xhtml when parsing XML is relevant here:
> [...]
>> 
>> If JSON is the new XML and JSON-LD is the emerging best practice for
>> interoperable JSON, it isn't unreasonable to expect XML-levels of
>> usage. So let's try to learn from the W3C XML DTD experience.
> 
> I think there's a very important difference to that experience. XML namespaces are not links and are thus not *expected* to be dereferenced. Thus, AFAICT, for a long time those URLs returned non-cacheable HTTP error responses. If you know that a document is going to be requested often, you can plan for it (CDN, long cache validity etc.). I know it's important to keep these things in mind, but I'm still not convinced that serving a small static file (even if it is requested millions of times) incurs much cost. Otherwise, all the free JavaScript library CDNs etc. would have been shut down a long time ago.
> 
> 
> On Tuesday, January 07, 2014 9:14 AM, Dan Brickley wrote:
>> On 7 January 2014 02:03, Sandro Hawke <sandro@hawke.org> wrote:
>>> There's a kind of natural feedback loop here that if schema.org starts
>>> to get overloaded and slow, clients will have more motivation to cache.
>>> Perhaps that's the solution to the many-people-on-one-IP-address; rather
>>> than giving a 429, just de-prioritize or temporarily tar-pit folks
>>> asking too fast.   It would sure be nice if there was a way to give an
>>> error message, or at least know who to contact.   I bet user-agent
>>> fields are not set very well in general....
>> 
>> I'm going to ignore the remarks about controlling data on the Web, and
>> focus on the fact that this sounds like a giant science experiment.
>> 
>> How about if content-negotiated requests for the json-ld version of
>> schema.org's homepage had a 60 second (or so) pause built-in?
> 
> For the first request or for subsequent requests? I think it's a great idea to 
> 
>> encourage better use of caching and
>> avoidance of fresh fetches within tight code loops.
> 
> but a very bad idea if everyone has to pay that price.
> 
> 
>> BTW is a redirect URL a legitimate response to such a request, or does
>> the JSON have to be returned directly?
> 
> It is, but it considerably increases latency and should thus be avoided. Again, changing the examples on schema.org and other places to use, e.g., http://schema.org/context would solve the conneg problem. In summary, I think there are enough options. We just have to choose one and execute it as soon as possible. The longer we wait, the more difficult it becomes.
> 
> 
> Cheers,
> Markus
> 
> 
> --
> Markus Lanthaler
> @markuslanthaler
> 
> 

--------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================
* Project Main Page: http://purl.org/goodrelations/
Received on Tuesday, 7 January 2014 11:45:47 UTC