Re: Namespace persistence etc from Dan Brickley on 2016-08-24 (public-sdw-wg@w3.org from August 2016)

From: Dan Brickley <danbri@google.com>
Date: Wed, 24 Aug 2016 12:52:13 +0100
To: Phil Archer <phila@w3.org>
Cc: SDW WG Public List <public-sdw-wg@w3.org>, Scott Simmons <ssimmons@opengeospatial.org>
Message-ID: <CAK-qy=7RmbyFvqvrDSv0LUJUewQjnCmCiHp=zic=TJYO0J-SYw@mail.gmail.com>
On 24 August 2016 at 12:17, Phil Archer <phila@w3.org> wrote:
> Hey Dan, pls see inline below.
>
>
> On 24/08/2016 10:14, Dan Brickley wrote:
>>
>> (excuse the belatedness of this reply, I thought I had responded but
>> don't see it in the thread)
>>
>> On 13 July 2016 at 06:12, Phil Archer <phila@w3.org> wrote:
>>>
>>> @Scott - please chime in with any variance to this from an OGC
>>> perspective.
>>>
>>> Dear all,
>>>
>>> I must begin by apologising for not being on the SSN call today/last
>>> night.
>>> I could make up some convoluted reason but the truth is that I forgot.
>>>
>>> I know one of the topics discussed was the issue around vocabulary term
>>> persistence so I should set out a few things about that.
>>>
>>> The principle is, I think, straightforward: any change made to a
>>> vocabulary
>>> shouldn't break existing implementations. Since we don't know who has an
>>> implementation, we can't write to everyone and ask "if we change this
>>> will
>>> your thing break?" Therefore we have to be cautious.
>>
>>
>> We discussed this a bit further f2f last time. If you want to be this
>> strict you will literally only be allowing yourself meaningless
>> changes to a term's definition. For example, if you change the case,
>> spelling, indentation, punctuation, phrasing order or other minor
>> aspects of the rdfs:comment of a type or property, you're not
>> affecting 1.) for a type, the things that are in it 2.) for a
>> property, the pairs of things that it relates. As soon as you start
>> tweaking the text to clarify meaning, you affect 1.) or 2.), and these
>> can always potentially create breakage. The notion that some changes
>> are broadening and some are restricting does not affect whether those
>> changes might break things; all that is needed for potential breakage
>> is any change from previous conditions. Software and applications can
>> be very fragile, and embody all kinds of assumptions.
>>
>> Consider the example of Course markup, and a CourseInstance type with
>> a courseMode property. Imagine version one of the definition gave
>> "face-to-face" as a (text or URL-based) value option for that
>> property. A later revision might want to clarify whether Skype
>> sessions (or VR or whatever) counted as face-to-face. Prior to that
>> clarification applications could have assumed it did, or that it
>> didn't; there's always the risk of breakage even with modest
>> improvements. This is not a radical change in meaning, but can make
>> the difference between something working as intended and not. It is
>> also not a theoretical example but comes from Google's review of the
>> draft Courses schema,
>>
>> https://www.w3.org/community/schema-course-extend/wiki/Mode_of_study_or_delivery
>
>
> I guess it's a question of balance, then. It is only search engines and, I
> think, even amongst those, only Google, that has access to this kind of view
> of the real world. So you're able to look at how terms are actually used and
> make an assessment. The rest of the world works without such access and so I
> tend to err on the side of caution/conservatism. If there is clear evidence,
> wherever it comes from, that a term's definition should be amended to match
> the ground truth then, OK, that seems right to do so. But that evidence
> needs to be available I think, otherwise, a new term should probably be
> minted.

Actually it is surprisingly hard to find out how things are used even
within (a fast moving complex company like) Google, never mind our Web
search competitors or other consumers of Web data.

My perspective is more that for schemas that aspire to global adoption
- and I'll count Dublin Core, FOAF and Schema.org in that direction -
it is very easy to allow the metaphorical "concrete to set around your
feet" and to be paralyzed into inaction through fear of breaking
things via schema changes / improvements. And that this can have huge
cost for adoption. Dublin Core became prematurely conservative about
change in the late '90s, then even in the much more informal FOAF
effort we also worried (imho) too early about breaking things if we
changed the schema.

The insight from doing this at Google is *not* really that we can
assess precisely how the data is used everywhere. We do have some
insight into how it is *published* (like webdatacommons but scaled
up). Usage in the sense of data consumption is another matter. The "view
from Google" in my experience is much more about appreciating how
schema definition nuances are often of less impact than other
pragmatic considerations. Many publishers don't read the definitions
anyway, but work from examples, tutorials and other supporting
materials that are not within a formal versioning system. They are
also often tool-guided, for better or for worse. Published data often
has syntactic, formatting or other errors. For all the sites that are
worrying carefully whether "face-to-face" includes Skype, there are
100s of relevant sites that aren't yet adopting, or whose adoption
could be improved. It is important to respect early adopters
(publishers and consumers), but also important to keep a focus on
simplicity, usability and future users, and such a focus can be
difficult to reconcile with the formal release of a new version for
every non-trivial change. The choice at schema.org was for data
consumers to carry more of the burden for handling changes,
improvements and smoothing out bugs in the data.
https://en.wikipedia.org/wiki/Robustness_principle remains reasonable
guidance.
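
As a rough illustration of a consumer carrying that burden, here is a
minimal Python sketch of a tolerant reader for courseMode values. The
function name, the alias table and the canonical labels are all
hypothetical and are not taken from any schema.org definition; the
point is only that unknown or oddly formatted values are kept and
flagged rather than rejected.

```python
# Hypothetical consumer-side normalizer for courseMode values.
# The aliases and canonical labels below are illustrative assumptions,
# not part of any published schema definition.
KNOWN_MODES = {
    "face-to-face": "face-to-face",
    "face to face": "face-to-face",
    "f2f": "face-to-face",
    "online": "online",
    "on-line": "online",
}


def normalize_course_mode(raw):
    """Return (value, recognized) for a published courseMode value."""
    if not isinstance(raw, str):
        # Be liberal in what we accept: report, do not raise.
        return (None, False)
    # Collapse case and whitespace differences before looking up aliases.
    key = " ".join(raw.strip().lower().split())
    if key in KNOWN_MODES:
        return (KNOWN_MODES[key], True)
    # Unknown values are passed through, so a later clarification of the
    # definition does not turn existing data into hard failures.
    return (key, False)
```

A consumer written this way can absorb a later clarification (for
example about whether video calls count as face-to-face) by editing
the alias table, without publishers having to change anything.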


> Then we get into how long does something have to be published before it's
> locked? If I publish a new term today and think better of it tomorrow, am I
> required to keep it as it is in case someone somewhere used my original? In
> 24 hours, no. In a week, almost certainly not. A month? 6? A year? There's
> no right answer to that.


Yep - there are no hard and fast rules. At schema.org we changed
http://schema.org/Language recently after several years of it
including "computer languages", for another example. It now says,
"Natural languages such as Spanish, Tamil, Hindi, English, etc. Formal
language code tags expressed in BCP 47 can be used via the
alternateName property. The Language type previously also covered
programming languages such as Scheme and Lisp, which are now best
represented using ComputerLanguage.". This fairly soft nudge towards a
new idiom hopefully is reasonably respectful of its previous wider
definition, and better than introducing /v2/Language into the
namespace as another fiddly thing for publishers to have to try to
understand and remember.


>>> That's what leads to W3C saying that vocabulary terms may not be deleted
>>> or
>>> their semantics changed radically. But it only applies at the namespace
>>> level. If you have a new namespace, you can do what you like since
>>> nothing
>>> will break. *However* it's going to be really confusing if some terms in
>>> the
>>> old and new namespaces are the same but with radically different
>>> semantics.
>>> So my interpretation is:
>>>
>>> Same namespace:
>>> ===============
>>> No deletions.
>>> No changing or tightening of semantics (i.e. don't add a new domain or
>>> range
>>
>>
>> FWIW the approach we took at schema.org was to use weaker domain-like
>> and range-like properties that give us more wiggle-room
>> (domainIncludes, rangeIncludes). It is a kind of promise that things
>> might continue evolving.
>
>
> Yes and if you'd done that when you were editing RDF Schema it might have
> been a good idea, but, well, the RDF WG wrote it as it is.

The RDFS design was pretty much complete in 1998 :) Let's not get into
the horrors of versioning schema languages...

We aren't obliged to use all the formal machinery from the schema
languages. It is tempting to use this stuff just because it is there
and it seems neat and tidy to write these things down for machines.
But sometimes mechanical simplicity is outweighed by other
considerations. OWL and RDFS have no idea about the passage of time.
OWL can't distinguish a property with "at most one value (at any point
in time)" from "at most one value (ever)". Depending on your
preferences this can either be an argument for standardizing ever more
powerful ontology/schema formalisms, or an argument that human-facing
definitions deserve as much attention as the axioms.
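
To make concrete what the formal machinery commits you to, here is a
small Python sketch contrasting rdfs:domain with the weaker
domainIncludes property mentioned earlier in this thread. It applies
only the RDFS domain rule (a triple "s p o" plus "p rdfs:domain C"
entails "s rdf:type C"). The ex: terms and the infer_types helper are
hypothetical, and triples are plain tuples rather than any real RDF
library's types.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
DOMAIN_INCLUDES = "schema:domainIncludes"


def infer_types(triples):
    """Apply the RDFS domain rule: (s p o) + (p rdfs:domain C) => (s rdf:type C)."""
    # Collect the declared rdfs:domain classes for each property.
    domains = {}
    for s, p, o in triples:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)
    # Every use of a property with a declared domain types its subject.
    inferred = set()
    for s, p, o in triples:
        for cls in domains.get(p, ()):
            inferred.add((s, RDF_TYPE, cls))
    return inferred


# Hypothetical vocabulary using the formal rdfs:domain.
strict = [
    ("ex:courseMode", RDFS_DOMAIN, "ex:CourseInstance"),
    ("ex:thing1", "ex:courseMode", "online"),
]

# Same data, but the vocabulary only says domainIncludes.
loose = [
    ("ex:courseMode", DOMAIN_INCLUDES, "ex:CourseInstance"),
    ("ex:thing1", "ex:courseMode", "online"),
]
```

With the strict vocabulary, infer_types concludes that ex:thing1 is an
ex:CourseInstance; with the loose one it concludes nothing, because
domainIncludes triggers no such rule. That missing entailment is the
wiggle-room: the expected types can be widened later without
invalidating conclusions that software has already drawn.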


>> You mention tightening and (later) weakening. What about clarifying?
>> Realizing that definitions were not as tight as originally hoped is a
>> hugely important class of schema edit.
>>
>>> - make a sub class|property and put the new restrictions on that)
>>> Deprecation is OK.
>>> Loosening semantics is OK (so you *can* remove a domain or range
>>> restriction
>>
>>
>> This can also cause breakage, if downstream clients expect the data to
>> already embody those restrictions. There are also restrictions that
>> are not embodied in domain/range but are carried in the textual
>> definitions.
>
> OK, so I'm tending towards conservative.

Even if the cost is dozens of extra namespace URIs in play? And a
built-in bias towards fragmentation and stagnation, since larger
vocabularies will suffer from version proliferation more than tiny
ones. If you have a vocabulary with > 100 terms, will you really want
to bother releasing a new version of the whole thing just to improve a
single term's definition? It will be tempting to leave the formal
definition in a stale state and simply update "best practice",
tutorials, examples and tools with hints instead. At which point it
may be cleaner and fairer to say "sorry, we changed our mind slightly"
and update the core specification too (while leaving a record of the
changes).


>>> since it is extremely unlikely that doing so will break anyone's existing
>>> implementation).
>>> Adding new terms is fine.
>>> Clarifying existing definitions is OK.
>>> Adding new translations of labels is expressly encouraged.
>>
>> +1
>>>
>>>
>>> Different namespace
>>> ===================
>>> We can be a little more relaxed here. Recall that documents on w3.org are
>>> persistent so the original documentation will always be there (at the
>>> original URI or redirected from it).
>>>
>>> No need to replicate the whole of the old vocabulary, so no need to
>>> include
>>> deprecated terms - they are deprecated by not being included in the new
>>> namespace.
>>>
>>> Assuming the vocabulary has the same name then terms that appear in both
>>> old
>>> and new should broadly be the same although semantics can change a
>>> little.
>>> It's a matter of judgement.
>>>
>>> The case I keep in mind is Dublin Core/DC Terms. dc:creator took either
>>> text
>>> or a URI as a value - which was confusing. dcterms:creator should take a
>>> URI.
>>
>>
>> Minor nitpick, DC doesn't say quite that. See
>>
>> http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#terms-creator
>> It says that the value of a dcterms:creator property will be a
>> dcterms:Agent. For which see
>> http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#Agent
>> "A resource that acts or has the power to act.",
>> "Examples of Agent include person, organization, and software agent.".
>>
>> So (in json-ld) you could have something like,
>>
>>  {
>>    "...": "......",
>>    "dcterms:creator":
>>     {
>>       "@type": "dcterms:Agent",
>>       "foo": "bar", ....
>>     }
>>   }
>>
>> There are those in the Linked Data community who take the view that
>> every time you mention an entity you should give a URI for it, but
>> that viewpoint is not currently baked into DC Terms. All that DC Terms
>> says is that a creator is something that can act, which is pretty
>> broad. But it does as you point out discourage us from using names of
>> those things as values for the property.
>
>
> Understood. But please bear in mind that not everyone has several hollowed
> out mountains full of servers to interpret fuzziness.

I wasn't saying it was good or bad to use bnodes, only that
dcterms:creator is agnostic on this point. Requiring publishers to
know a Linked Data URI for every entity also comes with some cost...
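
For what it's worth, a consumer can tell these cases apart cheaply. The
Python sketch below classifies a dcterms:creator value in a
JSON-LD-style dict as a plain literal, a blank node or a URI-identified
node. The creator_kind helper is hypothetical, and it assumes expanded
data with no @context that coerces plain strings into IRIs.

```python
def creator_kind(doc):
    """Classify the dcterms:creator value of a JSON-LD-style dict."""
    value = doc.get("dcterms:creator")
    if value is None:
        return "missing"
    if isinstance(value, str):
        # Assumption: no @context turns plain strings into IRIs.
        return "literal"
    if isinstance(value, dict):
        if "@value" in value:
            # A JSON-LD value object is still a literal.
            return "literal"
        node_id = value.get("@id")
        # No @id, or an "_:" identifier, means a blank node.
        if node_id is None or node_id.startswith("_:"):
            return "blank node"
        return "identified"
    return "other"
```

For example, {"dcterms:creator": {"@type": "dcterms:Agent"}} comes out
as "blank node", which matches the range of dcterms:creator as quoted
above without the publisher having to know a URI for the agent.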

cheers,

Dan
Received on Wednesday, 24 August 2016 11:52:49 UTC