Formats, schemas, vocabularies, data models and section 7.4 of the Best Practices document

Dear All,

I think that we have reached a crucial point in the discussions around the
Best Practices document.

Many have raised the concern that the term "vocabulary" may be a problem in
the document, in part because of its lack of precision and in part because
it is biased towards the RDF(S)/OWL(S) technological space.

I completely agree with that, and we need to do our best to ensure
precision and to be agnostic with respect to the various technological
spaces.

The problem has also appeared in the discussion surrounding the term
"format", which I also believe is problematic if not properly defined and
qualified (as are the term "schema" and the other terms used in section
7.4 of the BP document).

So, this is a call for the group to settle on some concepts (and ultimately
terms) that should help us to structure our discussions, give us a basis
to communicate and help our audience to understand us.

I offer here a sketchy initial attempt; I'm hoping (fingers crossed) not to
incite a terminological debate, but a conceptual one... As long as we agree
on the concepts, we can always adjust the terms to make this more intuitive
to the majority of the people in our audience.

Some of it is inspired by [1] to avoid re-inventing the wheel. (I wanted
to, but did not manage to touch upon the "metadata" and "ontology" terms. I
also did not manage to link OWL and SKOS into this.) And remember, this is
just a starting point.

regards,
João Paulo

----

By "data representation" we mean any convention for the arrangement of
symbols in such a way as to enable information to be encoded by a data
producer and later decoded by data consumers.

A particular convention for data representation is often referred to as a
"data format".

Adapted from [1]:

In existing computer systems there is typically a long chain of relations
connecting the physical phenomena by which data are represented with the
data being represented. Each link in the chain connects two layers of
representation: each layer organizes information available at the next
lower level into structures at a higher (or at least different) layer of
abstraction, and in this way provides information used in turn by the next
higher level in the representation.

For example, the representation of an email message may involve the
following layers:

Physical layer: holes in cards or tape, magnetic charges, color changes on
optical disks or scan codes, tones on a telephone connection, or similar
phenomena are interpreted as representing sequences of bits.

Bit layer: those sequences of bits may be interpreted as representations of
other different sequences of bits (for example five bits may be written to
the physical medium to represent four bits of data, in such a way as to
guarantee a minimum and maximum amount of space between magnetic flux
events in the media).

Byte / octet layer: the sequences of bits read from the storage device are
grouped into octets: units of eight bits often referred to as bytes.

Character layer: an octet sequence may be interpreted as a sequence of
characters as defined by the appropriate character-set standard.

Application-specific data structure layer: the email reader will read the
character stream and distinguish the mail header from the message body, and
may distinguish multiple alternative representations of the message and
attachments within the message body. Within the mail header, mail software
will distinguish important fields like date, sender, and addressee.
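
As a rough illustration of the upper layers above, here is a minimal Python
sketch that goes from an octet sequence to characters and then to an
application-specific structure. The addresses and message text are made up
for the example, and the use of UTF-8 is an assumption; it is only meant to
show where each layer's convention comes into play.

```python
from email import message_from_string

# Octet layer: a raw sequence of bytes, as it might be read from storage.
# The addresses and the text are hypothetical.
octets = (
    b"From: alice@example.org\r\n"
    b"To: bob@example.org\r\n"
    b"Subject: Meeting\r\n"
    b"\r\n"
    b"See you at the caf\xc3\xa9.\r\n"
)

# Character layer: the same octets only become characters once a
# character-set convention is chosen (UTF-8 is assumed here).
characters = octets.decode("utf-8")

# Choosing a different convention yields different characters.
misread = octets.decode("latin-1")

# Application-specific layer: the mail software separates header and body
# and picks out individual header fields.
message = message_from_string(characters)
sender = message["From"]
addressee = message["To"]
body = message.get_payload()

print(sender, addressee)
print(body.strip())
```

The same octets decoded as Latin-1 (the "misread" variable) give a garbled
last word, which is why the convention in use at each layer has to be known
to the consumer.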


We assume here that this group is concerned with data representation beyond
the octet layer, not concerning itself with "data formats" for physical,
bit and byte level data representation. Data representation at the lowest
level in this context is thus the octet sequence.

Different applications will almost always have different
application-specific data structures. The variety of applications and uses
of the data on the web leads to an unbounded number of data formats.

The need to support the definition of suitable data formats for data
interchange on the web has led to the development of languages and
frameworks for families of formats, examples of which include XML, SGML,
JSON and RDF.

(Here it is important to note that we should avoid saying that data is
represented in XML or RDF - but instead, we should say that data is
represented in an XML-based format, or in an RDF-based format. So XML data
is data represented in an XML-based format, RDF data is data represented in
an RDF-based format.)

These languages and frameworks ultimately establish conventions to encode
data into sequences of octets. These conventions are often called
"serialization formats" or "serialization syntaxes" (e.g., [TURTLE],
[RDF11-XML], [JSON-LD]). In addition, these languages often establish a
"data model" or "abstract syntax" (e.g., [RDF11-CONCEPTS]) which defines
the structure of data independent of a particular serialization format.
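
To make the distinction between a data model and its serializations a bit
more concrete, here is a small Python sketch. It assumes a toy data model
(a set of subject/predicate/object triples) and two invented serializations,
one line-based and one JSON-based. These are stand-ins only: they are not
Turtle, RDF/XML or JSON-LD, and the example.org identifiers are hypothetical.

```python
import json

# Abstract data: a set of (subject, predicate, object) triples.
# All identifiers below are hypothetical.
triples = {
    ("http://example.org/doc1", "http://example.org/title", "Best Practices"),
    ("http://example.org/doc1", "http://example.org/creator",
     "http://example.org/jp"),
}


def to_lines(ts):
    """Toy serialization 1: one tab-separated triple per line, as UTF-8."""
    return "\n".join("\t".join(t) for t in sorted(ts)).encode("utf-8")


def from_lines(octets):
    """Decode toy serialization 1 back into a set of triples."""
    text = octets.decode("utf-8")
    return {tuple(line.split("\t")) for line in text.splitlines()}


def to_json(ts):
    """Toy serialization 2: a JSON array of three-element arrays, as UTF-8."""
    return json.dumps(sorted(ts)).encode("utf-8")


def from_json(octets):
    """Decode toy serialization 2 back into a set of triples."""
    return {tuple(item) for item in json.loads(octets.decode("utf-8"))}


line_octets = to_lines(triples)
json_octets = to_json(triples)

# Different octet sequences, same abstract data.
print(line_octets != json_octets)
print(from_lines(line_octets) == from_json(json_octets) == triples)
```

The two octet sequences differ, but decoding either one gives back the same
set of triples, which is the sense in which the data model is independent of
the serialization format.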

Some of these families of formats are accompanied by languages or
(meta-)formats to specify a format, to enable some level of automation for
processing data in the format. For example, an XML-based format can be
specified with a "schema document" in the XML Schema Definition language,
enabling XML documents to be checked for conformance to the format defined
in the schema document [XML-SCHEMA]. Likewise, an RDF-based format can be
specified using RDF Schema [RDF11-SCHEMA].
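
The idea of checking a document against a machine-readable description of a
format can be sketched as follows. Note that this is not XML Schema: the
"schema" here is a hypothetical Python dictionary that only names the root
element and its required child elements, and the element names are invented
for the example. A real schema language expresses far more than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy "schema": the expected root element and the child
# elements that must be present. This is a stand-in, not XSD.
SCHEMA = {"root": "dataset", "required_children": ["title", "publisher"]}


def conforms(xml_text, schema):
    """Return True if the document has the expected root element and
    contains every required child element at least once."""
    root = ET.fromstring(xml_text)
    if root.tag != schema["root"]:
        return False
    return all(
        root.find(child) is not None
        for child in schema["required_children"]
    )


good = "<dataset><title>Bus stops</title><publisher>City</publisher></dataset>"
bad = "<dataset><title>Bus stops</title></dataset>"

print(conforms(good, SCHEMA))
print(conforms(bad, SCHEMA))
```

The point is only that, once the format is described in a form a program can
read, conformance checking no longer depends on a human reading the
documentation.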

These "schemas" are often used as a means to anchor natural language
descriptions to guide humans in the interpretation of data produced using
the format. Often, labels are used in these schemas to convey intuitive
meaning and guide interpretation, in which case these labels serve the role
of "terms" in communication. The collection of terms as used in the schema
is then referred to as a "vocabulary".


Some requirements (adapted from [1]):

Any data representation relied on for interoperability must have clear,
well-written, published documentation. If the format is not documented, the
likelihood that the information it represents can be recovered without loss
is small.

The specification documents for data formats should be controlled by public
bodies, preferably consensus-based organizations in the international
standardization system or by relevant industry consortia.


[1] C. M. Sperberg-McQueen, David Dubin. "Data Representation", DH
Curation Guide: a community resource guide to data curation in the digital
humanities, http://guide.dhcuration.org/representation/.

[RDF11-SCHEMA]
Dan Brickley, R. V. Guha. RDF Schema 1.1. W3C Recommendation, 25 February
2014. URL: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/. The latest
published version is available at http://www.w3.org/TR/rdf-schema/.

[RDF11-XML]
Fabien Gandon, Guus Schreiber. RDF 1.1 XML Syntax. W3C Recommendation, 25
February 2014. URL:
http://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/. The latest
published version is available at http://www.w3.org/TR/rdf-syntax-grammar/.

[TURTLE]
Eric Prud'hommeaux, Gavin Carothers. RDF 1.1 Turtle: Terse RDF Triple
Language. W3C Recommendation, 25 February 2014. URL:
http://www.w3.org/TR/2014/REC-turtle-20140225/. The latest edition is
available at http://www.w3.org/TR/turtle/

[OWL2-OVERVIEW]
W3C OWL Working Group. OWL 2 Web Ontology Language Document Overview
(Second Edition). 11 December 2012. W3C Recommendation. URL:
http://www.w3.org/TR/owl2-overview/

[JSON-LD]
Manu Sporny, Gregg Kellogg, Markus Lanthaler, Editors. JSON-LD 1.0. 16
January 2014. W3C Recommendation. URL: http://www.w3.org/TR/json-ld/

[XML-SCHEMA]
Priscilla Walmsley, David C. Fallside, Editors. XML Schema Part 0: Primer
Second Edition. W3C Recommendation, 28 October 2004. URL:
http://www.w3.org/TR/xmlschema-0/

Received on Thursday, 22 January 2015 04:01:45 UTC