Re: Problems and Opportunities at purl.org from Shane McCarron on 2015-12-01 (public-perma-id@w3.org from December 2015)

From: Shane McCarron <shane@aptest.com>
Date: Tue, 1 Dec 2015 16:21:23 -0600
To: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Cc: Norman Gray <norman@astro.gla.ac.uk>, Permanent Identifier CG <public-perma-id@w3.org>
Message-ID: <CAOk_reFi7z37g8ZjfgAipxZ=VMwCvSzHpecm0QOFniQ_FY7v0w@mail.gmail.com>
Good news - The W3C (almost) has a Recommendation for CSV.  Finally
standardizing the format.

http://www.w3.org/TR/2015/PR-tabular-data-model-20151117/

On Tue, Dec 1, 2015 at 2:15 PM, Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk> wrote:

> Thank you,
>
> I agree on all points :))
>
> Some simple file format should be sufficient for most cases, then generate
> to $currentMainstream technology, or translate to $otherFormat.
>
> What I like about CSV, as awkwardly unspecified as it is, is that it is
> easy for non-techies to understand, so we could keep the current GitHub
> pull request model for a while.
>
> YAML could be another candidate, but then there's indentation to worry
> about. JSON and XML are hard to hand-edit.
>
> Content-negotiation is one "fancy" thing that has been mentioned on w3id
> list (purl.org doesn't do this), but if we were to agree to support it,
> that should be sufficient to add later as an additional mediaType column or
> similar.
>
> OK, so let's get some quick code repository up and running that can do
> csv->htaccess or similar! Python or Ruby? :) (deliberately not proposing
> Node.js here..)
>
> Former OCLC/purl guys, what is the current schema, or in what form could
> we get the data? I can kind of deduce most of it from the UI and have seen
> the documentation for batch updates (which I never got to work myself).
> On 30 Nov 2015 17:04, "Norman Gray" <norman@astro.gla.ac.uk> wrote:
>
>>
>> Greetings.
>>
>> [apologies for the delay here: it was Stian's recent message that
>> reminded me I wanted to reply to his earlier one]
>>
>> On 23 Nov 2015, at 11:45, Stian Soiland-Reyes wrote:
>>
>>> What I see as a danger with proposing some new $shinyServerSoftware is
>>> that we can easily bind ourselves into the same trap as purl.org -
>>> becoming high maintenance sysadmin-wise, and potentially relying on
>>> abandoned technology. Apache HTTP server also scales very well, and you
>>> can't say it's proprietary or at immediate risk of being abandoned. :)
>>>
>>
>> I think we should remember that the _short term_ here is one or two
>> decades, and that in this context the 'long term' implies preservation
>> 'beyond one technology generation', and thus axiomatically dealing with
>> Apache httpd's successor, rather than merely what mod_rewrite's manual
>> looks like in 2025.
>>
>> I don't think we have to worry about the transition to URLs' successor,
>> since PURL++ will surely be swept up in whatever web-wide transition path
>> that requires.
>>
>> Thus I believe that at this stage we should not be thinking of
>> $shinyServerSoftware at all, but of what the preservation data format is
>> (.csv files?) and how the schema is documented (.txt files).  Turning that
>> into an actual service (doubtless using httpd and .htaccess files to begin
>> with) is Just A Matter Of Code.
>>
>> So...
>>
>>> What I like are the ideas that have been proposed to have a kind of
>>> "build" stage with more manageable CSV files or something, that then
>>> "compile" into .htaccess or XML or whatever you fancy using a
>>> straightforward Python/Ruby/nodejs script.
>>>
>>> [...]
>>>
>>> This would also mean that libraries and researchers could use & archive
>>> the w3id "database" without having to parse .htaccess or do thousands of
>>> HTTP requests.  (We might want to clarify the license on that database!)
>>>
>>
>> ...I think I'm agreeing with Stian here, but possibly being more emphatic
>> about it.
>>
>> So, a concrete question: what is the format of the purl.org data?  I
>> imagine a rather simple db schema.  For the reasons above, I think that we
>> should not regard a tree of .htaccess files as anything other than
>> disposable, or intermediate, implementation technology.
>>
>>>>> but we would need to migrate the existing w3id.org <http://w3id.org/>
>>>>> PURLs forward, I think.
>>>>>
>>>> In the same spirit, is that _really_ the case?
>>>>
>>>
>>> Not migrating would undermine the whole reason for having w3id.org -
>>> how would anyone trust us if we suddenly wipe the existing identifiers?
>>>
>>
>> First: I doubt it would be necessary in fact to abandon anything at
>> w3id.org.  That said, I think it would be good to retain the option in
>> principle.  If there's a robust model for purl++ which happens to undermine
>> one or two of the more creative current .htaccess redirections, then the
>> long-term preservability (ie, 2--10 decades) is arguably more important
>> than preserving a redirection that's been in existence for only a fraction
>> of that.  This is an argument about priorities, not a proposal for deletion.
>>
>> That would count as a decent rationale for 'deaccessioning' those w3ids
>> (and deaccessioning is something archivists are forced to do from time to
>> time).
>>
>>> The current collection should be quite manageable to convert manually in a
>>> couple of days - so I don't see this as a big issue.
>>>
>>
>> Ditto.
>>
>>> Being able to support pretty much all of the existing purl.org redirects
>>> is however much more important.  They should all be rewritable to .htaccess
>>>
>>
>> ...or to something which is mechanically implementable as .htaccess.
>>
>> All the best,
>>
>> Norman
>>
>>
>> --
>> Norman Gray  :  https://nxg.me.uk
>> SUPA School of Physics and Astronomy, University of Glasgow, UK
>>
>

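As a rough illustration of the csv->htaccess step discussed in the quoted
thread, a minimal Python sketch might look like the following. The column
names (path, target, status), the default 302 status, and the example.org
URLs are all assumptions made up for illustration; the actual purl.org
schema is not given in this thread. The sketch emits mod_alias Redirect
lines, which is only one of several ways to express a redirect in .htaccess.

```python
import csv
import io


def csv_to_htaccess(csv_text):
    """Turn CSV rows into Apache mod_alias Redirect lines, one per row.

    Assumed (hypothetical) columns: path, target, status.
    An empty or missing status falls back to 302.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = row["path"].strip()
        target = row["target"].strip()
        status = (row.get("status") or "").strip() or "302"
        # Redirect expects a URL-path that begins with a slash.
        if not path.startswith("/"):
            raise ValueError(f"path must start with '/': {path!r}")
        lines.append(f"Redirect {status} {path} {target}")
    return "\n".join(lines) + "\n"


# Made-up sample data, not real w3id or purl.org entries.
EXAMPLE = """path,target,status
/example/vocab,https://example.org/vocab,302
/example/docs,https://example.org/docs,
"""

if __name__ == "__main__":
    print(csv_to_htaccess(EXAMPLE), end="")
```

Keeping the CSV as the source of truth means the generated .htaccess can be
thrown away and regenerated at any time, and a further column (a mediaType
one for content negotiation, say) could be added later without touching
existing rows.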

-- 
Shane McCarron
Managing Director, Applied Testing and Technology, Inc.
Received on Tuesday, 1 December 2015 22:22:02 UTC