RE: fyi on the Repository Schema

I look forward to seeing how Repository Schema can be applied to StratML
documents.  To facilitate explicit referencing, the StratML schema will
enable the association of globally unique identifiers with each element in
StratML documents.  


Joe built a couple of prototypes referencing the StratML collection.  However, they were based upon an
early draft of the StratML schema, which does not include the <Identifier>
element.


The only StratML instance document that has been created thus far using the
latest draft of the schema is the eGov IG's note.  However, identifiers have not been
associated with each element in that document because, as far as I am aware,
InfoPath, which I used to create it, cannot generate GUIDs.


Hopefully, tools will soon emerge to provide such capabilities within eForms
applications.  A number of services are already available on the Web to
generate GUIDs.
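In the meantime, a GUID can also be generated programmatically. A minimal sketch using Python's standard uuid module (an illustration, not InfoPath or StratML tooling):

```python
import uuid

def make_identifier() -> str:
    """Return a globally unique identifier, e.g. for use in a
    StratML <Identifier> element."""
    return str(uuid.uuid4())

print(make_identifier())  # a random GUID, different on each run
```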




From: []
On Behalf Of Jose M. Alonso
Sent: Saturday, May 16, 2009 9:41 AM
To: eGov IG
Subject: Fwd: fyi on the Repository Schema


I see this did not make it to the list somehow. Forwarding it. Sorry for
noticing so late.

-- Jose


Begin forwarded message:

From: Daniel Bennett <>

Date: 16 April 2009 15:57:46 GMT+02:00

To: "Jose M. Alonso" <>, Joe Carmel <>,
"Sheridan, John" <>, Dazza Greenwood <>, Greg Elin
<>, Chris Wallace <>

Subject: fyi on the Repository Schema


Jose, et al,

The Repository Schema
is a method to map an already-published collection of XML/XHTML documents on
the Internet so that it can be used automatically as an object database. Note
that the author of a Repository Schema can be different from the original
publisher of the documents, and that there can be competing/complementary
schemas of the same repository (as opposed to the repository publisher
creating an API with a single view/entry-point to a database).

The first part is to accept the URL as the unique identifier for the
objects, and then to gather/discover all or any of the objects: by
discerning a pattern in the URLs, by using XPath to discover links to
non-patterned URLs, or by doing both. An example might be press releases or
blog postings where the URL pattern is the year, month, and day, but where
XPath is needed to discover the actual postings that have long string names.
(This is more than URL templating to create the original URLs; this is
reverse engineering and/or discovering the URLs. Note that using a web
site's search facility or site map with XPath discovery of URLs is also an
option.)
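A rough sketch of that first part in Python. The archive layout, base URL, and regex-based link scan below are illustrative assumptions, not part of any actual repository:

```python
import re

# Hypothetical archive layout: monthly index pages under /YYYY/MM/
# (an assumption for illustration; real repositories define their own pattern).
def archive_urls(base: str, year: int, months: range) -> list:
    """Enumerate candidate index-page URLs from a discerned date pattern."""
    return [f"{base}/{year}/{m:02d}/" for m in months]

def discover_posting_urls(index_html: str) -> list:
    """Discover non-patterned posting URLs by scanning an index page's
    links (a real implementation would use XPath on well-formed XHTML)."""
    return re.findall(r'href="([^"]+)"', index_html)

urls = archive_urls("http://example.org/blog", 2009, range(1, 4))
print(urls[0])  # -> http://example.org/blog/2009/01/
```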

The second part is to describe the potential parts or sub-objects in each
document using a description of the object, the XPath discovery of the
sub-object, and the XML Schema of the object. For example, in a Wikipedia
page, an object within that document might be the "content" of the page,
which excludes all of the templating/navigation/etc.; that XPath would be
//div[@id="bodyContent"]. A calendar event in a web page would use both
XPath and XML Schema, depending on whether the event was created using RDFa
or a Microformat. XPath could also allow objects with the same XML Schema to
be differentiated, like separating a list of "friends" contact information
from a list of "foes."

The third part is a descriptive list of usable XSL transformations so that
the documents can be transformed in real time to another usable format: for
example, to PDF, to an RDF version, or to a stripped-down version for
conversion into JSON for an application.
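For illustration, a minimal stand-in for the "stripped down to JSON" case, written in Python rather than XSLT (in a Repository Schema this role would be played by a listed XSL stylesheet; the element names here are assumptions):

```python
import json
import xml.etree.ElementTree as ET

def to_json(xhtml: str) -> str:
    """Strip a toy page down to a JSON list of its postings."""
    root = ET.fromstring(xhtml)
    posts = [{"title": div.findtext("h2"), "body": div.findtext("p")}
             for div in root.iter("div")]
    return json.dumps(posts)

doc = "<body><div><h2>Hello</h2><p>First post.</p></div></body>"
print(to_json(doc))  # -> [{"title": "Hello", "body": "First post."}]
```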

The fourth part is to point to indexes of the repository. This is crucial
for real-time processing. The index could be created by anyone who has
previously grabbed all the documents and indexed them. As an example, an
XQuery database engine could attach/grab the index when running a query
against the entire repository.
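A toy sketch of such a pre-built index: a simple inverted index keyed by document URL, the kind a query engine could grab instead of re-crawling the repository (the URLs and text are hypothetical):

```python
def build_index(docs: dict) -> dict:
    """Map each word to the set of document URLs that contain it."""
    index = {}
    for url, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

docs = {
    "http://example.org/2009/01/a": "repository schema draft",
    "http://example.org/2009/02/b": "schema for press releases",
}
index = build_index(docs)
print(sorted(index["schema"]))  # both documents mention "schema"
```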

The goal of the Repository Schema is to allow real-time access and
processing of essentially static XML/XHTML documents as if they were a
database, so that tools can be built in advance that use the identical
approach/widgets/tools for any repository. It also frees publishers of data
from needing to anticipate every use of their data, allows for a
standardization of using XML documents as a database, and frees databases
from needing to store/screen-scrape all the data internally before acting on
it (which many XQuery engines can do already, but with a performance hit
when there is no pre-built index). It may also push publishers of documents
on the web to abide by standards, such as making XHTML well-formed and
valid, and encourage the use of Cool URIs that are both human-readable and
machine-processable.

Note that much of this is made easier by the XQuery doc() function's
relatively undocumented ability to act on any well-formed/valid document on
the World Wide Web in real time, with the URL created by concatenating
strings (which is a real standard, as opposed to using curl or other
methods).
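A rough Python analogue of that concatenation step (the base URL and path scheme are assumptions for illustration):

```python
from urllib.parse import urljoin

# Assemble a document URL from parts; in real use, the result would then
# be fetched and parsed. In XQuery, this is what lets doc() open a
# constructed URL: doc(concat($base, $year, "/", $month, "/"))
base = "http://example.org/blog/"
year, month = 2009, 4
url = urljoin(base, f"{year}/{month:02d}/")
print(url)  # -> http://example.org/blog/2009/04/
```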

Thanks tremendously to Chris Wallace for doing much of the ground-breaking
work and some of the proof-of-concept pieces in the XQuery Wikibook <>.

Joe Carmel has been building a Rosetta Stone standard that would help
"relate" repositories. 

Daniel Bennett

Received on Saturday, 16 May 2009 14:33:49 UTC