Re: Schema.org property cardinality and use of plural (WAS Re: SoftwareApplication proposal for schema.org)

From: Tantek <tantek@cs.stanford.edu> · Date: Fri, 2 Mar 2012 18:51:35 -0800

On Fri, Mar 2, 2012 at 7:58 AM, Dan Brickley <danbri@danbri.org> wrote:
> Hi Tantek,
>
> On 2 March 2012 01:41, Tantek <tantek@cs.stanford.edu> wrote:
>> tl;dr: making all properties optional and plural at a
>> syntactical/schema level is more future proof.
>>
>> slightly longer:
>>
>> I saw the wiki page: http://www.w3.org/wiki/WebSchemas/Singularity but
>> I have such fundamental disagreements with its content that I'm
>> following up in email instead (wiki follow-up is my preference by default).

I've corrected my mistake and documented my experience/principles
about cardinality, and the lack thereof, on the microformats wiki for
better reference/discovery.

http://microformats.org/wiki/cardinality

> I hope some of that disagreement comes from poor description on my
> part, rather than deeper differences.
>
> I yesterday asked the W3C team to add a syntax highlighter to the main
> W3C wiki (apparently some groups were already using it). And they've
> done that already :) This means we can now post markup examples there
> much more easily; I've just made a start on this.
>
> So I've added a couple to the top of
> http://www.w3.org/wiki/WebSchemas/Singularity to give a bit more
> context and concrete illustration of our current problem.

Neat. Is that a defined/published MediaWiki extension?

We've been using a "<source>" tag MediaWiki extension on the
microformats wiki that allows per-language highlighting. Might be
worth passing along to the W3C team for the W3C wiki as well:

http://www.mediawiki.org/wiki/Extension:SyntaxHighlight_GeSHi

>> 1. bad idea to put cardinality into formats/schemas.
>> 2. better to make each property plural at a format/schema level.
>> 3. move any notion of "singular" semantics up to the per application
>> level (use first occurrence/value of a property if necessary).
>>
>> Our experience with microformats has shown all this and thus we are
>> moving forward with 1-3 above in microformats-2 (explicitly making all
>> properties both optional and plural at a syntax level, permitting
>> applications to apply any needed singular semantics to first instance
>> of a property).
>
> The schema.org schema is essentially a dictionary; it doesn't have a
> notion of mandatory properties. It does hint that using common
> properties (url etc) is useful, but there is no notion of a schema.org
> description that is invalid due to missing information. This (as
> you've found with microformats) generally works better at the level of
> applications. So while Rich Snippets or Yandex tools might complain
> that they don't have enough information to go on, missing out
> information is not 'wrong' in some broader schema.org-sense.

Good to know. This might make for a good FAQ, as it can be hard to
tell if the objections from Google/MS/Yandex validators are officially
part of schema or not (since there's often an equivalence drawn
between schema.org and Google/MS/Yahoo).

> I haven't looked through yet, property by property, to see which of
> the would-be 'singular' properties are most plausible vs questionable.
> Personally I find it a bad idea to encode such decisions in the
> spelling of the property, because it makes any change vastly more
> expensive.

I think we're agreed on this.

> I suspect I'm more comfortable than you with the idea of expressing
> such information in schemas, because the schema languages I generally
> work with are in the passive/declarative family (RDFS/OWL).

As a programmer, I used to be a big fan of encoding/expressing such
information in schemas.

However, the need for more author-scalable design (design that
scales to many more authors) trumps any CS/programmer sense of
neatness/cleanness etc.

Thus it's not a matter of "comfort" but rather design priorities.

> Schema.org's approach to schema is much in that tradition, as it
> expresses information about a vocabulary rather than a concrete
> format.  In this style of schema, if you say that e.g. dateOfBirth is
> a functional property, you're saying something about the meaning of
> the term, rather than setting an expectation that every description
> should include 1 of those.

I'm not sure there's much difference in practice.

> When dealing with format schemas (XSD etc.) I also prefer to see
> cardinality and other rules expressed more at the application level,
> with Schematron being a fine example of an (XML) system that allows
> different rules to be policed (or skipped) in different workflow
> stages or applications.

Does anyone use Schematron with arbitrary content they parse from the web?

> So +1 on everything optional, and perhaps a difference of perspective
> around value of expressing things in (some kind of) schema.
>
> I think also (from our discussion around expected values for 'gender'
> property in microformats, vCard, FOAF) we also share a concern that
> engineers tend to push too hard to get a complex world into tidier,
> rigid data structures, at the expense sometimes of the people they're
> describing.
>
> So often the impulse to express a rigid constraint is
> worth questioning, particularly when a vocabulary seeks wide-scale use
> by very diverse parties and applications.

Right, that. Now multiply it across every property and every format
used on the web.

Pretty much every time a (spec) engineer has placed a (pseudo)tidy
requirement into a format for the web, they've been wrong.

HTML: all the crazy block vs. inline element nesting rules. HTML5
fixes most of this.
Atom: requiring published and updated dates with (artificial) seconds
precision, never mind other fields. Required fields are often just
filled with boilerplate to placate validators.

Those of us who have worked with at least a generation or two (or
more) of formats on/for the web have learned this lesson, but we're a
tiny minority.

There is a very strong "tidy/rigid data structure" culture among
programmers which has nothing to do with designing formats for the
web.

This cultural flaw is predominant enough to actually cause
consensus-based web standards decision-making to fail. Atom's set of
"required" elements etc. is a failure - and *plenty* of smart people
worked on Atom for *years* and carefully negotiated what should be
required and what should be optional. Nevermind the whole draconian
XML experiment as a giant categorical tidy/rigid failure.

I have no idea how to address this problem (other than going off and
creating alternatives outside of such broader "consensus").
Suggestions welcome.

>> See point (2.) here:
>>
>> http://microformats.org/wiki/microformats-2#Summary
>>
>> Related: making properties optional has also been a hard lesson
>> learned, with repeated examples, from hCard to hAtom - which to be
>> fair took their notions of "required" properties from vCard and Atom,
>> though a mistake propagated is still a mistake.
>
> Yup, I think we talked about this at SocialGraphFoo a while back. The
> more widely a vocabulary aspires to be used, the less it can be strict
> about the exact shape of descriptions that use it.

Yes. And the web is perhaps the widest aspiration for a vocabulary.

>> On Thu, Mar 1, 2012 at 10:01 AM, Dan Brickley <danbri@danbri.org> wrote:
>>> On 24 February 2012 21:25, Will Norris <will@willnorris.com> wrote:
>>>> I had the same question when I first started looking at this.  There is a
>>>> certain simplicity in not requiring microdata vocabularies to define
>>>> cardinality of properties, and leaves the door open to interesting use cases
>>>> that may not have been initially imagined.
>>>
>>> Yes. Well there are two things here: do we define a cardinality
>>
>> Experience with microformats has shown attempts to define cardinality
>> in formats for publishing on the web create an unnecessary point of
>> failure/fragility.
>>
>> So, no, don't bother.
>
> I should probably have avoided the word 'cardinality'; it has at least
> a couple of quite different meanings, depending on the underlying
> format or data model.

Here is a concrete existing use in our sphere of vocabulary that I
believe is consistent with this thread:

http://tools.ietf.org/html/rfc6350#section-3.3

from that URL:
"
   Property cardinalities are indicated using the following notation,
   which is based on ABNF (see [RFC5234], Section 3.6):

    +-------------+--------------------------------------------------+
    | Cardinality | Meaning                                          |
    +-------------+--------------------------------------------------+
    |      1      | Exactly one instance per vCard MUST be present.  |
    |      *1     | Exactly one instance per vCard MAY be present.   |
    |      1*     | One or more instances per vCard MUST be present. |
    |      *      | One or more instances per vCard MAY be present.  |
    +-------------+--------------------------------------------------+
"
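
As a rough illustration of how that notation reads, here is a minimal
Python sketch that maps each symbol to a (minimum, maximum) instance
count. The mapping and the names CARDINALITY_BOUNDS and satisfies are
assumptions made for this example; they are an interpretation of the
table, not text from RFC 6350.

```python
from typing import Dict, Optional, Tuple

# Interpretation of the RFC 6350 table: symbol -> (min, max) instances.
# None as the maximum means "no upper bound". Names are hypothetical.
CARDINALITY_BOUNDS: Dict[str, Tuple[int, Optional[int]]] = {
    "1": (1, 1),      # exactly one, required
    "*1": (0, 1),     # at most one, optional
    "1*": (1, None),  # one or more, required
    "*": (0, None),   # any number, optional
}


def satisfies(cardinality: str, count: int) -> bool:
    """Return True if `count` instances fit the given cardinality symbol."""
    low, high = CARDINALITY_BOUNDS[cardinality]
    return count >= low and (high is None or count <= high)
```

Note that only "*" accepts every possible count, which is the
"optional and plural" case argued for in this thread.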

>> Simple answer (using hRecipe as a real world designed/implemented
>> source of examples)
>>
>> http://microformats.org/wiki/hrecipe
>>
>> 1. use singular forms of English nouns, even for (expected)
>> multivalued properties. E.g. "ingredient" is the property name even
>> though pretty much all recipes have multiple ingredients (and thus
>> instances of that property)
>
> I agree.
>
>> 2. plural forms of English nouns should only be used when it implies
>> specific meaning about any instance of the property (e.g. amounts),
>> not some implication that the property is or may be multivalued. E.g.
>> "instructions" is the property because a property value itself likely
>> contains multiple human readable instructions, and in practice recipes
>> have a single instance of this property. Related: the "calories"
>> extension property (which is an amount).
>
> Yup

Ok good, I've documented these as well:

http://microformats.org/wiki/cardinality#should_English_plurals_as_property_names_imply_cardinality

>>> I'm in favour of defining cardinality when it makes sense to do so
>>
>> It never makes sense to do so at a format level.
>
> If we can distinguish format vs vocabulary levels, perhaps we can agree here?

I think in practice it's been shown that it makes sense for neither.

At least if your format target is the web.

>>> (and in the full expectation people will ignore or mess up whatever we
>>> try to impose).
>>
>> Exactly why. Instead define application processing of such cases.
>>
>>
>>> But I don't think the experiment of using plural 's'
>>> markers has worked well.
>>
>> Yes, the use of plural forms to indicate anything syntactically or
>> semantically automatically was/is a mistake. Let's stop propagating it
>> (and shame on whoever thought that experiment was a good idea :P)
>
> Re 'let's stop propagating it', that was a goal of
> http://www.w3.org/wiki/WebSchemas/Singularity

That's good to hear.

Thanks for that clarification.  The "singularity" name of the page
seemed to imply a framing that singularity was a concept to be
captured / preserved - that's probably what confused me.

>>> http://schema.org/Person has 'spouse' rather than 'spouses'. Are we
>>> really to assume the property can have at most one singular value?
>>> What about re-marriage, or societies (the Web having global reach)
>>> where multiple spouses are common?
>>
>> Great example.
>>
>> Cultural differences are easily a source of disagreements of
>> cardinality. Avoid this problem by leaving out cardinality.
>>
>>
>>> I'd much rather see cardinality expressed schematically
>>> than through spelling,
>>
>> Both are bad.
>
> I believe schemas can be managed and evolved more easily than such
> spelling rules, not least because you can change a schema and
> associated documentation more easily than changing millions of Web
> pages. But our main concern here is whether to move various property
> names, e.g. 'actor' to 'actors'. We can debate the level, formality
> and content of schema separately.

Ok.

>>> since changing the expected spelling has impact
>>> on a *lot* of instance data.
>>
>> Separate issue: unnecessary renaming in general is bad, and I'd advise
>> anyone who makes decisions on schema property names to consider
>> re-using existing property names, perhaps singularized as necessary
>> (as we've done with microformats), rather than using new names (as is
>> rampant throughout schema.org - lots of unnecessary NIH,
>> even/especially where we (previous to schema) had format convergence
>> on the web e.g. Person vs. vCard/hCard, Event vs. iCalendar/hCalendar
>> etc.).
>
> Noted in wiki.

Appreciated.

> The list at http://www.w3.org/wiki/WebSchemas/Singularity#Details
> isn't yet cross-referenced to associated classes and other
> documentation, but if you have e.g. a 'top 10' wishlist for name
> changes that could be made part of this proposal, please record it in
> the wiki page. Fixing by removing the final 's' is the simplest option

Yes, I think schema is "new" enough that fixing these things is doable
now but will get progressively more painful.

In general I think that often re-using terminology (while applying
strict design principles) can help minimize these problems up front.

I'd flip this around: rather than asking for changes, document
property name provenance, and that will reveal problems.

Each schema property should note its provenance, that is, where did
the name of the property come from?

If a schema property cannot document where its name came from, that's
a sign that there is unnecessary invention happening and thus the
property should be considered suspect.
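
A minimal Python sketch of that check, assuming a hypothetical
registry that maps each property name to the format its name was taken
from. The registry layout, the function name suspect_properties, and
the "madeUpProperty" entry are illustrative assumptions, not part of
schema.org or microformats.

```python
from typing import Dict, List, Optional

# Hypothetical registry: property name -> documented source of the name.
# None means nobody has recorded where the name came from.
PROVENANCE: Dict[str, Optional[str]] = {
    "ingredient": "hRecipe",
    "instructions": "hRecipe",
    "madeUpProperty": None,  # illustrative placeholder only
}


def suspect_properties(registry: Dict[str, Optional[str]]) -> List[str]:
    """List property names that have no documented provenance."""
    return sorted(name for name, source in registry.items() if not source)


print(suspect_properties(PROVENANCE))
```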

> but it would be good to see a specific proposal for any name
> improvements that could happen instead.

I'll see what I can do. The big ones are around established
microformats (which themselves were for the most part based on other
existing dominantly implemented formats); many of these were already
supported by the search engines before schema.org.

>>>> I think the same would apply to cardinality.  We provide guidance on
>>>> expected cardinality of properties, but always do the best we can with
>>>> whatever we get.
>>>
>>> Yes. With FOAF we declared some properties as having 'at most one
>>> proper value', or implying that
>>> there can be at most one entity with any given value. Sites got it
>>> wrong all the time, but at least the
>>> declaration helped track down some data problems.
>>
>> And before that vCard (up through v3) made the same mistake of
>> declaring many properties to have at most one value.
>>
>> Much of this was addressed in vCard4, where previously singular
>> properties were made plural.
>>
>> In short, vocabulary designers get cardinality wrong all the time, so
>> you might as well give up trying. Seriously, y'all are not that smart.
>> None of us are. ;)
>
> The FOAF usage was generally more to express 'at most one thing has
> any given value for this property' (indirect identification by
> description) than to say 'there's at most one value of this property'.

That is perhaps a different meaning of cardinality.

> We say that things have at most one gender;

Some biologists might disagree.

> that documents can have at
> most one primaryTopic;

I've found that the notion of "primaryTopic" of a document is often
dependent on who is reading the document. Thus it doesn't really make
sense in practice to have a singular "primaryTopic". Perhaps a set of
topics, which may have different applicability / "primariness" to
different audiences.

> people one birthday and age.

Certainly people have one physical biological birthday (discounting
birth reincarnation systems which have their own identity problems).

However, if multiple calendar scales were allowed (as vCard4 allows I
believe), it might make sense to document someone's known birthday
with multiple property values for different calendar systems. E.g.
historical figures.

And that's just one example of a property we all think is purely
singular. Even properties which have physically singular values may
have multivalued uses.

> The rest were 'inverse
> functional' constraints, eg. that a homepage is a homepage of at most
> one 'thing' (ditto primary mailbox).

Also assumptions that tend to fail on the web over time. Amazon allows
multiple accounts per email address for example.

Such assumptions are fine to bake into applications, but I think
incorrect to place in formats/schema.
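
To make the split concrete, here is a minimal Python sketch, assuming
a hypothetical parsed item in which every property is a list of
values. The Item type, the first_value helper, and the sample data are
assumptions made for this example; they are not an API from
microformats-2 or schema.org.

```python
from typing import Dict, List, Optional

# Hypothetical parsed item: every property maps to a list of values,
# even when the publisher supplied only one.
Item = Dict[str, List[str]]


def first_value(item: Item, name: str) -> Optional[str]:
    """Application-level "singular" view: first value, or None if absent."""
    values = item.get(name, [])
    return values[0] if values else None


person: Item = {
    "name": ["Example Person"],
    "spouse": ["First Spouse", "Second Spouse"],  # plural is allowed
}

# An application that only wants one spouse takes the first one.
print(first_value(person, "spouse"))
# A missing property is not an error at the format level.
print(first_value(person, "birthday"))
```

If an application later decides it needs all values, only first_value
callers change; the published data stays as it is.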

>> It's easier (and more future-proof) to simply allow every property to
>> be plural, and then define any perceived singular semantics at a
>> higher application level (which is where any notion of singular vs
>> plural actually matters if at all). If it changes, changing the
>> application is much easier than the format.
>>
>>> If we have to choose between the JSON being a bit weird, or the
>>> HTML-based markups being a bit weird, I would go for the former. JSON
>>> feeds are relatively invisible, whereas the HTML source has a wider
>>> and more varied audience.
>>
>> Agreed. HTML impacts more authors, thus takes design precedence over JSON.
>
> Noted
>
>>>> This same problem occurred with PortableContacts when you compare the XML
>>>> and JSON
>>>> representations: http://portablecontacts.net/draft-schema.html#anchor5.  For
>>>> what it's worth, PoCo used plural naming where properties were expected to
>>>> be multi-valued.
>>
>> Which was also a mistake.
>>
>>> Yup, it's hard designing a schema to work nicely in two quite
>>> different syntaxes.
>>
>> Yes it is hard, but not impossible.
>>
>> We've taken a shot at doing so for HTML and JSON in microformats 2.0 [1].
>>
>> Comments appreciated (though perhaps better redirected to microformats-new[2]).
>
> I'll take a look. I liked the general direction it was taking last year.
>
>> [1] http://microformats.org/wiki/microformats-2
>> [2] http://microformats.org/mailman/listinfo/microformats-new/

Appreciated.

And in general I appreciate your attention to detail on this matter of
cardinality. This is a tough problem, especially since it's one of
those problems where a little bit of expertise/experience tends to
lead to the wrong conclusion, and much more experience/expertise to
the opposite one.

Thanks,

Tantek