Re: Formats, schemas, vocabularies, data models and section 7.4 of the Best Practices document

Dear Joao Paulo, Carlos,

I agree with your concerns. They have been voiced many times. The 'technology neutral' focus does make the problem more visible but, I think, doesn't change much: the situation is already a bit messy in the Linked Data world alone.

I won't discuss data formats now, because that's not the point of 7.4 (it may be very useful to have the discussion for other sections though; I just don't have the time).

My issue with what Joao Paulo describes as 'schemas' (and what it is suggested we call 'data models') is that it misses a part of what is called 'vocabularies' in the Linked Data world (and in other communities). To reuse the point from an earlier email: do you think that the ISO language codes are a schema (or a data model) of their own?

In a previous group I was involved in, on Library Linked Data, we faced a similar problem of naming things. We ended up with 'metadata element sets' for schemas/ontologies and 'controlled vocabularies' for thesauri, code lists, etc.
http://www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset/
Note that we already faced the need to be a bit more technology neutral then: these 'controlled vocabularies' existed long before RDF (porting them into RDF was actually why the SKOS 'schema' was created).
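
To make the distinction a bit more concrete, here is a rough sketch in Python. All names in it (Record, LANGUAGE_CODES) are mine and purely illustrative, and the three codes are only a tiny excerpt of ISO 639-1. The class plays the role of a 'metadata element set' (it says which elements a description has), while the code list plays the role of a 'controlled vocabulary' (it says which values an element may take).

```python
from dataclasses import dataclass

# Controlled vocabulary: a closed list of values with labels.
# Illustrative excerpt of ISO 639-1 language codes, not the full list.
LANGUAGE_CODES = {"en": "English", "fr": "French", "pt": "Portuguese"}


# Metadata element set: the structure of a description.
# 'Record' and its fields are hypothetical names chosen for this sketch.
@dataclass
class Record:
    title: str
    language: str

    def __post_init__(self):
        # The element set only says that a 'language' element exists;
        # the controlled vocabulary constrains which values it may take.
        if self.language not in LANGUAGE_CODES:
            raise ValueError(f"unknown language code: {self.language!r}")


record = Record(title="Os Lusíadas", language="pt")
print(record.title, "is in", LANGUAGE_CODES[record.language])
```

The code list is useful on its own and can be reused by any other element set, which is why I would hesitate to call it a schema or a data model.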

Now, we may decide to rule 'vocabularies that don't qualify as data models' (like ISO language codes) out of the best practices. I find that a bit of a pity, because these are valuable artefacts, as the past decade of Linked Data has shown. And our current best practices apply to them, too.

Back to the BP document now. From my past experience, we won't have time to fix this in two days. There are much easier and more urgent issues to fix - *once* we have noted this vocabulary issue down for future resolution, of course.
Also, and because I've seen these discussions before, we probably won't find a good solution, i.e. we'll always have to exemplify the term we choose, as in "this section is about 'X', which gathers ontologies, schemas, relational models, etc".

So what I suggest is to create an issue saying that the section needs terminological discussion and input, and maybe go as far as removing the 'controlled vocabularies' from the picture. Is that all right?

Best,

Antoine

On 1/22/15 5:01 AM, Joao Paulo Almeida wrote:
> Dear All,
>
> I think that we have reached a crucial point in the discussions around the Best Practices document.
>
> Many have raised the concern that the term "vocabulary" may be a problem in the document, in part because of its lack of precision and in part because it is biased towards the RDF(S)/OWL(S) technological space.
>
> I completely agree with that, and we need to do our best to ensure precision and to be agnostic with respect to the various technological spaces.
>
> The problem has also appeared in the discussion surrounding the term "format", which I also believe is problematic if not properly defined and qualified (as are the term "schema" and the other terms used in section 7.4 of the BP document).
>
> So, this is a call for the group to settle on some concepts (and ultimately terms) that should help us to structure our discussions, give us a basis to communicate and help our audience to understand us.
>
> I offer here a sketchy initial attempt; I'm hoping (fingers crossed) not to incite a terminological debate, but a conceptual one... As long as we agree on the concepts, we can always adjust the terms to make this more intuitive to the majority of the people in our audience.
>
> Some of it is inspired by [1] to avoid re-inventing the wheel. (I wanted to, but did not manage to touch upon the "metadata" and "ontology" terms. I also did not manage to link OWL and SKOS into this.) And remember, this is just a starting point.
>
> regards,
> João Paulo
>
> ----
>
> By "data representation" we mean any convention for the arrangement of symbols in such a way as to enable information to be encoded by a data producer and later decoded by data consumers.
>
> A particular convention for data representation is often referred to as a "data format".
>
> Adapted from [1]:
>
>     In existing computer systems there is typically a long chain of relations connecting the physical phenomena by which data are represented with the data being represented. Each link in the chain connects two layers of representation: each layer organizes information available at the next lower level into structures at a higher (or at least different) layer of abstraction, and in this way provides information used in turn by the next higher level in the representation.
>
>     For example, the representation of an email message may involve the following layers:
>
>     Physical layer: holes in cards or tape, magnetic charges, color changes on optical disks or scan codes, tones on a telephone connection, or similar phenomena are interpreted as representing sequences of bits.
>
>     Bit layer: those sequences of bits may be interpreted as representations of other different sequences of bits (for example five bits may be written to the physical medium to represent four bits of data, in such a way as to guarantee a minimum and maximum amount of space between magnetic flux events in the media).
>
>     Byte / octet layer: the sequences of bits read from the storage device are grouped into octets: units of eight bits often referred to as bytes.
>
>     Character layer: an octet sequence may be interpreted as a sequence of characters as defined by the appropriate character-set standard.
>
>     Application-specific data structure layer: the email reader will read the character stream and distinguish the mail header from the message body, and may distinguish multiple alternative representations of the message and attachments within the message body. Within the mail header, mail software will distinguish important fields like date, sender, and addressee.
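
(Inline note: a rough sketch of the upper three layers of the email example above, in Python. The message, the addresses and the choice of UTF-8 are made up for illustration, and the standard-library email parser merely stands in for "the email reader".)

```python
from email import message_from_string

# Octet layer: the raw bytes as read from storage or from the network.
# The message content and addresses are invented for this sketch.
octets = (
    b"From: alice@example.org\n"
    b"To: bob@example.org\n"
    b"Subject: Meeting\n"
    b"\n"
    b"Caf\xc3\xa9 at ten.\n"
)

# Character layer: the octets are interpreted according to a
# character-set standard (UTF-8 is an assumption here).
characters = octets.decode("utf-8")

# Application-specific data structure layer: the mail software
# distinguishes the header fields from the message body.
message = message_from_string(characters)
sender = message["From"]
body = message.get_payload()

print(len(octets), "octets ->", len(characters), "characters")
print("sender:", sender)
print("body:", body.strip())
```

The same octets decoded under a different character set would yield different characters, which is why each layer needs its own documented convention.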
>
>
> We assume here that this group is concerned with data representation beyond the octet layer, not concerning itself with "data formats" for physical, bit and byte level data representation. Data representation at the lowest level in this context is thus the octet sequence.
>
> Different applications will almost always have different application-specific data structures. The variety of applications and uses of the data on the web leads to an unbounded number of data formats.
>
> The need to support the definition of suitable data formats for data interchange on the web has led to the development of languages and frameworks for families of formats, examples of which include XML, SGML, JSON and RDF.
>
> (Here it is important to note that we should avoid saying that data is represented in XML or RDF - but instead, we should say that data is represented in an XML-based format, or in an RDF-based format. So XML data is data represented in an XML-based format, RDF data is data represented in an RDF-based format.)
>
> These languages and frameworks ultimately establish conventions to encode data into sequences of octets. These conventions are often called "serialization formats" or "serialization syntaxes" (e.g., [TURTLE], [RDF11-XML], [JSON-LD]). In addition, these languages often establish a "data model" or "abstract syntax" (e.g., [RDF11-CONCEPTS]) which defines the structure of data independent of a particular serialization format.
>
> Some of these families of formats are accompanied by languages or (meta-)formats to specify a format, to enable some level of automation for processing data in the format. For example, an XML-based format can be specified with a "schema document" in the XML Schema Definition language, enabling XML documents to be checked for conformance to the format defined in the schema document [XML-SCHEMA]. Likewise, an RDF-based format can be specified using RDF Schema [RDF11-SCHEMA].
>
> These "schemas" are often used as a means to anchor natural language descriptions to guide humans in the interpretation of data produced using the format. Often, labels are used in these schemas to convey intuitive meaning and guide interpretation, in which case these labels serve the role of "terms" in communication. The collection of terms as used in the schema is then referred to as a "vocabulary".
>
>
> Some requirements (adapted from [1]):
>
> Any data representation relied on for interoperability must have clear, well written, published documentation. If the format is not documented, the likelihood that the information it represents can be recovered without loss is small.
>
> The specification documents for data formats should be controlled by public bodies, preferably consensus-based organizations in the international standardization system or by relevant industry consortia.
>
>
> [1] C. M. Sperberg-McQueen, David Dubin. "Data Representation", DH Curation Guide: a community resource guide to data curation in the digital humanities, http://guide.dhcuration.org/representation/.
>
> [RDF11-SCHEMA]
> Dan Brickley, R. V. Guha. RDF Schema 1.1. W3C Recommendation, 25 February 2014. URL: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/. The latest published version is available at http://www.w3.org/TR/rdf-schema/.
>
> [RDF11-XML]
> Fabien Gandon, Guus Schreiber. RDF 1.1 XML Syntax. W3C Recommendation, 25 February 2014. URL: http://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/. The latest published version is available at http://www.w3.org/TR/rdf-syntax-grammar/.
>
> [TURTLE]
> Eric Prud'hommeaux, Gavin Carothers. RDF 1.1 Turtle: Terse RDF Triple Language. W3C Recommendation, 25 February 2014. URL: http://www.w3.org/TR/2014/REC-turtle-20140225/. The latest edition is available at http://www.w3.org/TR/turtle/
>
> [OWL2-OVERVIEW]
> W3C OWL Working Group. OWL 2 Web Ontology Language Document Overview (Second Edition). 11 December 2012. W3C Recommendation. URL: http://www.w3.org/TR/owl2-overview/
>
> [JSON-LD]
> Manu Sporny, Gregg Kellogg, Markus Lanthaler, Editors. JSON-LD 1.0. 16 January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/
>
> [XML-SCHEMA]
> XML Schema: Primer
> World Wide Web Consortium. XML Schema Part 0: Primer Second Edition, ed. Priscilla Walmsley and David C. Fallside. W3C Recommendation, 28 October 2004. See http://www.w3.org/TR/xmlschema-0/

Received on Thursday, 22 January 2015 07:46:55 UTC