Re: Whither the JSON-LD context?

On 01/06/2014 03:24 PM, Gregg Kellogg wrote:
> On Jan 6, 2014, at 11:56 AM, Dan Brickley <> wrote:
>> +Cc: Guha
>> On 6 January 2014 18:48, Gregg Kellogg <> wrote:
>>> For some time, we've been expecting to publish a JSON-LD context at schema.org via content negotiation when the request is made with an Accept header including application/ld+json. On behalf of the Linked JSON Community Group, I'd like to get an update on this.
>>> To get around this, many (most) JSON-LD tool suppliers have provided their own context based on the vocabulary definition, but this is prone to error and to differences of implementation between the various tools. I understand that there could be some concern about excessive requests for the context when it's not necessary; however, it's hard to see that this would even approach the number of requests for schema.org itself, from tools that encounter it in HTML.
>>> Any timeline on when this might be available?
>> I think it's reasonable to expect a static file published this
>> quarter. However, you're right that we do have concerns about the
>> schema.org *website* forming an integral part of numerous unknown
>> software systems and applications. It ought to be possible to do
>> useful things with JSON-LD without a dependency on
>> the schema.org Web site.
>> W3C's experience with XML parsers that auto-fetch DTDs and schemas
>> from w3.org when parsing XML is relevant here.
>> Excerpting from the W3C systems team's post on excessive DTD traffic:
>> "Handling all these requests costs us considerably: servers,
>> bandwidth and human time spent analyzing traffic patterns and devising
>> methods to limit or block excessive new request patterns."
>> If someone has millions of JSON-LD documents that
>> they want to parse into RDF or otherwise consume via JSON-LD tooling,
>> are there code snippets and examples for the popular toolkits that
>> make it likely that schema.org will see one request (per session, day,
>> application invocation etc.) rather than millions?
> I think that this is reasonable; we can discuss it on the next JSON-LD call. Using HTTP headers that allow caching and that let a client wait 24 hours before checking back using Last-Modified or ETag would do this. On your part, if your terms of use restrict overuse of the service, returning something like a status 429 (Too Many Requests) would allow you to black-list sites that are abusing the system and create push-back on vendors to adhere to the terms of use and caching policy. A request to schema.org with Accept: application/ld+json would then return something like the following HTTP headers:
> Content-Type: application/ld+json
> Last-Modified: ...
> ETag: ...
> Cache-Control: public, max-age=86400
> Vary: Accept
> Some allowance should be made for production vs. testing environments, so returning a 429 should probably be avoided unless a truly excessive number of requests is detected, or through some webmaster intervention.

Right.   The Web already has mechanisms for dealing with this: cache 
control.

One could even set cache headers for much higher than 24 hours.
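To make that concrete, here is a rough sketch of the kind of client-side caching this implies: keep a local copy for as long as max-age allows, and revalidate with If-None-Match afterward. This is illustrative Python, not any particular toolkit's API; the class and helper names are made up for the example.

```python
import time


def parse_max_age(cache_control):
    """Extract max-age (in seconds) from a Cache-Control header, if present."""
    for part in (cache_control or "").split(","):
        part = part.strip()
        if part.startswith("max-age="):
            try:
                return int(part[len("max-age="):])
            except ValueError:
                return None
    return None


class ContextCache:
    """Tiny in-memory cache keyed by URL, honoring max-age and ETag.

    A real client would persist this across invocations so schema.org
    sees one request per day, not one per document.
    """

    def __init__(self):
        self._entries = {}  # url -> (body, etag, fetched_at, max_age)

    def store(self, url, body, headers, now=None):
        now = time.time() if now is None else now
        max_age = parse_max_age(headers.get("Cache-Control"))
        self._entries[url] = (body, headers.get("ETag"), now, max_age or 0)

    def fresh(self, url, now=None):
        """Return the cached body if still within max-age, else None."""
        now = time.time() if now is None else now
        entry = self._entries.get(url)
        if entry and now - entry[2] < entry[3]:
            return entry[0]
        return None

    def conditional_headers(self, url):
        """Headers for a revalidation request (the server answers 304 if unchanged)."""
        entry = self._entries.get(url)
        return {"If-None-Match": entry[1]} if entry and entry[1] else {}
```

With Cache-Control: max-age=86400 and an ETag, such a client makes at most one full fetch per day and cheap 304 revalidations thereafter.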

Several interesting questions come up....

For some APIs it might make sense to have the context never change. That 
is, when the context changes, the API URI is going to change anyway.   
In this case, you could set the max-age to 10 years, and encourage 
software developers to ship clients with this pre-loaded in the cache.

I wonder whether it's possible to set things up so there's one cache 
duration for when the client sees everyone working well, and another 
when it isn't (like there's a missing prefix).    Probably more tricky 
than it's worth.

How to get clients to do the right thing?    They'll probably want to be 
lazy and not cache in many cases.    But if schema.org actually enforces 
a policy of giving 429s after 100 requests in an hour, that might teach 
people to do client-side caching.
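Server-side, that policy amounts to a per-client counter, roughly like this sketch (a fixed hourly window; the 100/hour figure is the number from this discussion, not any real schema.org policy, and the names are illustrative):

```python
import time


class HourlyRateLimiter:
    """Fixed-window counter: allow up to `limit` requests per client key
    per `window` seconds; beyond that, answer 429 (Too Many Requests)."""

    def __init__(self, limit=100, window=3600):
        self.limit = limit
        self.window = window
        self._counts = {}  # client_key -> (window_start, count)

    def check(self, client_key, now=None):
        """Record one request; return 200 if allowed, 429 if over the limit."""
        now = time.time() if now is None else now
        start, count = self._counts.get(client_key, (now, 0))
        if now - start >= self.window:  # the hour has rolled over
            start, count = now, 0
        count += 1
        self._counts[client_key] = (start, count)
        return 200 if count <= self.limit else 429
```

Keying on IP address is exactly where the NAT problem below bites: one key may stand for hundreds of well-behaved clients.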

On the other hand, maybe that's not practical.    You can't necessarily 
tell it's the same client.  There might be an auditorium with hundreds 
of people running the same software, all behind one NAT IP address.

If there were a way to monetize the traffic, it wouldn't be a problem.   
I'd expect for individual search engines this wouldn't be a problem, but 
I can see how it might be for schema.org as an organization.  That is, 
Google can handle the traffic, and decide it's worth it because it puts 
data on the web more under their control, and they can monetize that 
control in lots of ways.... but perhaps their partners in schema.org 
wouldn't like that.

There's a kind of natural feedback loop here: if schema.org starts 
to get overloaded and slow, clients will have more motivation to 
cache.    Perhaps that's the solution to the 
many-people-on-one-IP-address problem; rather than giving a 429, just 
de-prioritize or temporarily tar-pit folks asking too fast.   It would 
sure be nice if there were a way to give an error message, or at least 
know who to contact.   I bet user-agent fields are not set very well in 
practice.

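A tar-pit along those lines could be as simple as this sketch: instead of refusing requests outright, compute an artificial delay that grows with how far a client is over some threshold. The threshold, step, and cap are illustrative numbers, not a proposed policy.

```python
def tarpit_delay(recent_requests, threshold=10, step=0.5, cap=10.0):
    """Seconds of artificial delay to add before responding: zero while a
    client is under `threshold` recent requests, then growing by `step`
    per excess request, capped at `cap` so nobody is blocked entirely."""
    over = max(0, recent_requests - threshold)
    return min(over * step, cap)
```

Well-behaved clients never notice; greedy ones slow themselves down, which is the feedback loop doing its job without a hard error.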
     -- Sandro

> Gregg
>> If JSON is the new XML and JSON-LD is the emerging best practice for
>> interoperable JSON, it isn't unreasonable to expect XML-levels of
>> usage. So let's try to learn from the W3C XML DTD experience.
>> Dan

Received on Tuesday, 7 January 2014 02:04:30 UTC