Re: Using W3id for more than just persistent/permanent identifiers? from Stian Soiland-Reyes on 2015-11-30 (public-perma-id@w3.org from November 2015)

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Mon, 30 Nov 2015 09:14:45 +0000
To: "Haag, Jason" <jason.haag.ctr@adlnet.gov>
Cc: public-perma-id <public-perma-id@w3.org>
Message-ID: <CAPRnXt=AKMygqZ_j1oGte5rV7RGOJiZ8OQ3gHxbXVJjYRnKzog@mail.gmail.com>
In theory this is a good idea.

However, the "raw" GitHub URLs don't provide correct Content-Type,
which could be a bit confusing for RDF clients.

e.g.

stain@biggie:/tmp$ curl -I https://raw.githubusercontent.com/adlnet/xapi-vocabulary/master/ontology.ttl | grep Content-Type
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  2633    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
X-Content-Type-Options: nosniff
Content-Type: text/plain; charset=utf-8
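
To make the consequence concrete, here is a minimal Python sketch of the
check an RDF client might perform on that header before choosing a parser.
The function names and the set of accepted media types are assumptions for
illustration, not taken from any particular RDF library.

```python
# Media types a hypothetical RDF client is prepared to parse (assumed list).
RDF_MEDIA_TYPES = {
    "text/turtle",
    "application/rdf+xml",
    "application/ld+json",
    "application/n-triples",
}


def media_type(content_type):
    """Strip parameters such as charset and normalise the case."""
    return content_type.split(";", 1)[0].strip().lower()


def looks_like_rdf(content_type):
    """True if the Content-Type value names one of the assumed RDF types."""
    return media_type(content_type) in RDF_MEDIA_TYPES


if __name__ == "__main__":
    # Header value returned by the raw GitHub URL in the transcript above.
    print(looks_like_rdf("text/plain; charset=utf-8"))
```

A client that relies on Content-Type in this way would not treat the
text/plain response as Turtle, even though the body is valid Turtle.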


The raw URL domain name also seems to change every year or so, so it
does not seem very reliable to me.


GitHub Pages [1] does however (finally) seem to serve more proper
content types for extensions like .ttl and .rdf, and this has been used
by myself [2] and others [3] at w3id - the second example shows
simplistic Accept handling.  Your suggestion has slightly better browser
detection - nevertheless this is quite crude content negotiation,
limited to what is possible with .htaccess.
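
To illustrate what "crude" means here: a regex on the Accept header only
tests whether a media type is mentioned, whereas HTTP content negotiation
also weighs the q= values and the specificity of each media range. The
Python sketch below contrasts the two. All function names are
hypothetical, and the negotiation is simplified (media-type parameters
other than q are ignored).

```python
def crude_pick(accept):
    """Mimic a regex-style check: first listed type mentioned anywhere wins."""
    for candidate in ("text/turtle", "text/html"):
        if candidate in accept.lower():
            return candidate
    return None


def parse_accept(accept):
    """Parse an Accept header into a list of (media_range, q) pairs."""
    ranges = []
    for part in accept.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0  # default quality when no q= parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0
        ranges.append((fields[0].lower(), q))
    return ranges


def specificity(pattern, candidate):
    """2 = exact match, 1 = type/*, 0 = */*, -1 = no match."""
    if pattern == candidate:
        return 2
    ptype, _, psub = pattern.partition("/")
    if psub == "*" and ptype == candidate.partition("/")[0]:
        return 1
    if pattern == "*/*":
        return 0
    return -1


def quality(candidate, ranges):
    """q of the most specific media range that matches the candidate."""
    best_spec, best_q = -1, 0.0
    for pattern, q in ranges:
        spec = specificity(pattern, candidate)
        if spec > best_spec:
            best_spec, best_q = spec, q
    return best_q


def negotiate(accept, available):
    """Return the available type with the highest q, or None if none is acceptable."""
    ranges = parse_accept(accept)
    best, best_q = None, 0.0
    for candidate in available:
        q = quality(candidate, ranges)
        if q > best_q:
            best, best_q = candidate, q
    return best


if __name__ == "__main__":
    header = "text/turtle;q=0.5, text/html;q=0.9"
    print(crude_pick(header))
    print(negotiate(header, ["text/turtle", "text/html"]))
```

For a client that prefers HTML but tolerates Turtle, the substring check
and the q-aware negotiation give different answers, which is the gap the
.htaccess approach cannot close.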

To use GitHub Pages you would typically need a gh-pages branch instead
of master, and access it at https://$username.github.io/$repository/$path



I have also found the third-party service http://rawgit.com/ to be a very
good proxy for accessing arbitrary resources in GitHub repositories.
Example:


stain@biggie:/tmp$ curl -I https://rawgit.com/adlnet/xapi-vocabulary/master/ontology.ttl | grep Content-T
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
Content-Type: text/turtle;charset=utf-8
X-Content-Type-Options: nosniff


Like the GitHub resources, this also provides correct content-type and
the Access-Control-Allow-Origin header so these resources can be
accessed by browsers.

The cdn.rawgit.com alternative, as suggested for use in production,
gives very good performance, and caches the resource, possibly
"forever". You would therefore probably use a tag instead of "master";
the downside is that you would need to modify the w3id redirection for
every new release.
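
As a rough illustration of the branch-versus-tag trade-off, the Python
sketch below builds rawgit URLs following the pattern visible in the
examples in this message (host/owner/repository/ref/path). The function
name and the tag "v1.0" are hypothetical; the URL layout is inferred from
those examples rather than from rawgit's documentation.

```python
from urllib.parse import quote


def rawgit_url(owner, repo, ref, path, cdn=False):
    """Build a rawgit URL for a file at a given branch, tag or commit.

    cdn=True selects cdn.rawgit.com, which caches aggressively, so `ref`
    should then be a tag or commit rather than a moving branch.
    """
    host = "cdn.rawgit.com" if cdn else "rawgit.com"
    segments = [owner, repo, ref] + path.strip("/").split("/")
    return "https://" + host + "/" + "/".join(quote(s, safe="") for s in segments)


if __name__ == "__main__":
    # Development: follows the moving "master" branch.
    print(rawgit_url("adlnet", "xapi-vocabulary", "master", "ontology.ttl"))
    # Production: pinned to a (hypothetical) release tag.
    print(rawgit_url("adlnet", "xapi-vocabulary", "v1.0", "ontology.ttl", cdn=True))
```

Each new release changes the pinned URL, which is why the w3id
redirection would have to be updated alongside it.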


rawgit is used by many w3id resources, for example [4], or with
content negotiation [5].


Perhaps an interesting approach here is the one taken by
https://w3id.org/library/holding: it uses the gh-pages branch, but
serves the RDF resources through rawgit. This means
browsers get redirected to
http://dini-ag-kim.github.io/holding-ontology/holding.html while RDF
clients go to https://cdn.rawgit.com/dini-ag-kim/holding-ontology/gh-pages/holding.ttl
-- I assume that this is because .ttl was not handled by GitHub
Pages earlier - so it would really be a question of preference between
the Fastly CDN and MaxCDN.
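
The split described above can be sketched in Python as a single routing
decision. This is only an approximation under stated assumptions: the
actual conditions live in the .htaccess file at [5], and the rule used
here (Turtle when text/turtle is mentioned in Accept, HTML otherwise) and
the function name are guesses for illustration.

```python
HTML_URL = "http://dini-ag-kim.github.io/holding-ontology/holding.html"
TURTLE_URL = "https://cdn.rawgit.com/dini-ag-kim/holding-ontology/gh-pages/holding.ttl"


def redirect_target(accept_header):
    """Pick the redirect target from the Accept header (assumed rule).

    Like the .htaccess approach, this is a plain substring check and
    ignores q= values.
    """
    if "text/turtle" in accept_header.lower():
        return TURTLE_URL
    return HTML_URL


if __name__ == "__main__":
    print(redirect_target("text/html,application/xhtml+xml"))
    print(redirect_target("text/turtle"))
```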




[1] https://pages.github.com/
[2] https://github.com/perma-id/w3id.org/blob/master/bundle/.htaccess
[3] https://github.com/perma-id/w3id.org/blob/master/dgarijo/ro/.htaccess
[4] https://github.com/perma-id/w3id.org/blob/master/cwl/.htaccess
[5] https://github.com/perma-id/w3id.org/blob/master/library/.htaccess


On 23 November 2015 at 14:25, Haag, Jason <jason.haag.ctr@adlnet.gov> wrote:
> Thanks everyone for the feedback. It might be a good idea to update
> the instructions to reflect these guidelines for w3id permanent
> identifiers.
>
> I'm also curious about leveraging github in general for the target URL
> in .htaccess. Is anyone else doing this? Since github supports raw
> data and many communities of practice are already using github to host
> their RDF data would this be a good practice? For example,
>
> Options +FollowSymLinks
> RewriteEngine on
> # A RewriteCond only applies to the RewriteRule directly after it, so a single
> # pattern with an optional trailing slash covers both forms of the URI.
> # Rewrite rule to serve HTML content from the vocabulary URI if requested
> RewriteCond %{HTTP_ACCEPT} !application/rdf\+xml.*(text/html|application/xhtml\+xml)
> RewriteCond %{HTTP_ACCEPT} text/html [OR]
> RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
> RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
> RewriteRule ^ontology/?$ http://xapi.vocab.pub/ontology/index.html [R=303]
>
> # Rewrite rule to serve Turtle content from the vocabulary URI if requested
> RewriteCond %{HTTP_ACCEPT} text/turtle
> RewriteRule ^ontology/?$ https://raw.githubusercontent.com/adlnet/xapi-vocabulary/master/ontology.ttl [R=303]
>
>
> -------------------------------------------------------
> Advanced Distributed Learning Initiative
> +1.850.266.7100(office)
> +1.850.471.1300 (mobile)
> jhaag75 (skype)
> http://linkedin.com/in/jasonhaag
>
>
> On Mon, Nov 23, 2015 at 5:09 AM, Stian Soiland-Reyes
> <soiland-reyes@cs.manchester.ac.uk> wrote:
>> Agree that actually hosting should not be our concern, as it
>> brings along issues like copyright, maintenance, ownership,
>> licensing.
>>
>> There could potentially be room for this community to override a
>> redirection to a web archived version if the original provider's
>> server goes AWOL (e.g. what almost happened with the VoID ontology
>> earlier this year).   Such "reawakenings" should however still
>> redirect to a third-party server, e.g. archive.org
>>
>>
>>
>> On the other side, in my view, doing basic content negotiation as part of
>> .htaccess I think *could* be part of w3id, e.g. send browsers to
>> http://example.com/resource.html, APIs to
>> http://api.example.com/resource.jsonld and RDF clients to
>> http://example.com/resource.ttl
>>
>> Even redirecting non-JSON-LD clients to online services like Gregg Kellogg's
>> RDF Distiller could be in scope.
>>
>> The .htaccess way of doing content negotiation (e.g. by checking the Accept
>> header with a regex) is not strictly according to the HTTP spec as you would be
>> ignoring the quality parameter q= and multiple requested types -- but at least
>> it would be better than no content negotiation at all.
>>
>> (I believe to get Apache to do the content negotiation properly you would need
>> some actual files with actual content types bound - this could in theory be
>> some empty dummy files with a .htaccess redirect overriding those files again -
>> but this sounds like a technology debt trap not for the faint hearted:)
>>
>> One other issue here is that getting such .htaccess files right requires a
>> fair amount of trial-and-error which we can't really be doing per pull request.
>>
>> But I believe doing some kind of Docker-version of w3id could be a way to test
>> it locally - kind of extending the current .travis.yml to set up a w3id
>> redirector on http://localhost:8080/ for you to test - and then we could have
>> a recommended template to copy. (where you delete the file types you don't
>> support)
>>
>>
>>
>> Excerpts from David I. Lehn's message of 2015-11-17 21:39:06 +0000:
>>> On Thu, Nov 12, 2015 at 3:16 PM, Haag, Jason <jason.haag.ctr@adlnet.gov> wrote:
>>> > I know that w3id was really established to provide a persistent identifier
>>> > mechanism for RDF resource and vocabulary IRIs. However, the fact that it is
>>> > already configured for content negotiation provides a real opportunity for
>>> > communities that don't have the resources (both servers & expertise).
>>> > Technical expertise such as contentneg and having a collaborative workflow
>>> > are often seen as barriers to getting started with linked data.
>>> >
>>> > Are there any objections to also using the w3id server to host linked data
>>> > files such as html/rdfa, turtle, json-ld? I know that is not what it is
>>> > intended for, but thought I would ask. It could help automate my particular
>>> > workflow where I'm working with several different organizations that are
>>> > wanting to publish their linked data collaboratively and while also having a
>>> > tool to generate the persistent identifiers.
>>> >
>>>
>>> While it would be easy to allow arbitrary data, I don't think we
>>> should take w3id.org in that direction at this time.  The service was
>>> designed to be a simple redirector.  If we also add in hosting user
>>> data there are three problems I see.
>>>
>>> 1. A likely small increase in resource usage and cost.
>>> 2. The current system involves manual interaction for updates.  Part
>>> of the choice for doing that was the assumption that link updates
>>> would be rare but content the links point to could be updated however
>>> the owners want without w3id.org in the loop.  Better tooling would
>>> fix this issue.
>>> 3. If the service hosts user content, we'll have to worry about what
>>> that content is.  I don't think we want to be in the middle of
>>> copyright issues, other similar claims, or content policing.  With
>>> links we mostly avoid such problems.
>>>
>>> There are many cloud services out there that can host data and would
>>> be easy to point w3id.org links at.  They also likely have better
>>> interfaces for dealing with files than what w3id.org is using now.  If
>>> it in fact is too difficult for some people, perhaps that's an
>>> opportunity for another service?
>>>
>>> -dave
>>
>> --
>> Stian Soiland-Reyes, eScience Lab
>> School of Computer Science
>> The University of Manchester
>> http://soiland-reyes.com/stian/work/    http://orcid.org/0000-0001-9842-9718
>>



-- 
Stian Soiland-Reyes, eScience Lab
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/    http://orcid.org/0000-0001-9842-9718
Received on Monday, 30 November 2015 09:15:37 UTC