Re: CSV on the Web Working Group | XBRL from James McKinney on 2014-02-03 (public-csv-wg@w3.org from February 2014)

From: James McKinney <james@opennorth.ca>
Date: Sun, 2 Feb 2014 21:29:18 -0500
To: eric.e.cohen@us.pwc.com
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <7FC10B9F-C5EE-4EC3-9AF6-7F996520D126@opennorth.ca>
Thanks, Eric! I see many parallels between the XBRL GL Data Definition File (DDF) and the Open Knowledge Foundation's (OKF) JSON Table Schema (JTS) [1]. Unless I misunderstand, both are mechanisms for describing the fields within a CSV file - one using XML Schema (and starting with the immediate use cases of the XBRL community) and the other using JSON. I'm sure both face many of the same issues and challenges. JTS has so far focused on describing individual CSV files; according to the roadmap you describe, using DDF, we would be able, with one DDF, to describe a family of CSV files that have many common fields while allowing for extension/differences. JTS has looked at other issues, like declaring primary keys and describing data validation rules.
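
To illustrate what such a field description buys you, here is a minimal Python sketch. It assumes a descriptor shaped roughly like a JSON Table Schema document (a list of named fields plus a primary key); the field names, the sample data and the check_table helper are all made up for illustration and are not taken from JTS or from the DDF.

```python
import csv
import io

# Hypothetical descriptor in the general shape of a JSON Table Schema
# document: a list of fields plus a primary key. Names are illustrative.
SCHEMA = {
    "fields": [
        {"name": "account_id", "type": "string"},
        {"name": "description", "type": "string"},
    ],
    "primaryKey": "account_id",
}


def check_table(text, schema):
    """Return a list of problems found when checking CSV text against the schema."""
    reader = csv.DictReader(io.StringIO(text))
    expected = [field["name"] for field in schema["fields"]]
    problems = []
    # The header row must list exactly the declared fields, in order.
    if reader.fieldnames != expected:
        problems.append("header mismatch: %r" % (reader.fieldnames,))
        return problems
    # Every primary key value may appear only once.
    seen = set()
    key = schema["primaryKey"]
    for row in reader:
        if row[key] in seen:
            problems.append("duplicate key: %s" % row[key])
        seen.add(row[key])
    return problems


duplicate_problems = check_table(
    "account_id,description\n1000,Cash\n1000,Bank\n", SCHEMA
)
header_problems = check_table("id,description\n1,x\n", SCHEMA)
```

The same checks could presumably be driven from an XML description instead of a JSON one, which is where the two efforts overlap.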

If we can have ways of describing CSV fields with both JSON and XML (and, ideally, have the JSON and XML formats share similarities), that'd be an outcome that serves a wide range of audiences.

On the question of comma versus semicolon, etc., see the OKF's work on describing CSV dialects [2]. (For others going through the documents: the previously attached XML file and slide 12 of the PowerPoint are XML examples.)
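
As a rough illustration of the dialect idea, the Python sketch below parses the same logical row from a comma-separated and a semicolon-separated export. The dialect dictionaries are a pared-down assumption (delimiter and quote character only) and do not reproduce the CSV dialect description format itself; the sample rows are invented.

```python
import csv
import io

# Hypothetical, pared-down dialect descriptions; a real description
# would carry more properties (line terminator, escaping, and so on).
COMMA_DIALECT = {"delimiter": ",", "quotechar": '"'}
SEMICOLON_DIALECT = {"delimiter": ";", "quotechar": '"'}


def read_rows(text, dialect):
    """Parse delimited text into a list of rows using the given dialect."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=dialect["delimiter"],
        quotechar=dialect["quotechar"],
    )
    return [row for row in reader]


# Illustrative data: the semicolon variant uses a decimal comma, which is
# one reason regional exports avoid the comma as a field separator.
comma_rows = read_rows('1000,"Cash, petty",12.50\n', COMMA_DIALECT)
semicolon_rows = read_rows('1000;Cash, petty;12,50\n', SEMICOLON_DIALECT)
```

Both calls yield a single row of three fields; only the dialect description differs, not the parsing code.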

1. http://dataprotocols.org/json-table-schema/
2. http://dataprotocols.org/csv-dialect/

James

On 2014-01-29, at 9:53 AM, eric.e.cohen@us.pwc.com wrote:

> Please note, the opinions expressed here are my own and do not represent my employer or XBRL International. 
> 
> I am very pleased to see a new W3C effort in the area of CSV and the Web. (-1)
> 
> I see a goal of "provid[ing] technologies whereby data dependent applications on the Web can provide higher interoperability when working with datasets using the CSV (Comma-Separated Values) or similar formats. As well as single CSV files, the group will define mechanisms for interpreting a set of CSVs as relational data. This will include the definition of a vocabulary for describing tables expressed as CSV and locatable on the web, and the relationships between them." 
> 
> As a co-founder of XBRL (Extensible Business Reporting Language) and the primary architect of a specification called XBRL Global Ledger Taxonomy Framework (XBRL GL), I have done a lot of thinking about that very issue as it relates to the domain of the Business Reporting Supply Chain, and CSV/text-to-XML/XBRL. Many ERP systems have native export to CSV capabilities; a CSV to Web interface will integrate these into the Web environment with a minimum of effort. 
> 
> We have been developing internal drafts on a descriptive language for describing the CSV to XML/XBRL connection and facilitating the conversion, and have vendors who have prototyped applications that demonstrate the process. 
> 
> It may provide some helpful background for the group, and if I can help hone it into a business case suitable for your use, I would be pleased to work with you. If what I present here is completely irrelevant, I will not be hurt if you ignore it completely. However, I would covet your consideration - for our purposes, if not yours - and would be glad to help with yours if I can.
> 
> XBRL GL: detailed business master files and transaction data expressed with XML - some people prefer CSV 
> 
> As I noted, my particular focus in the XBRL community is something called XBRL GL (Global Ledger Taxonomy Framework) (0), the use of XBRL to represent detailed business transactional data (the setup, master, transactional and historical data files of ERP systems). XBRL GL is being used for purposes such as consolidating data across disparate corporate ERP systems (1), bringing together government agency data in Brazil (2), and serving as the electronic bookkeeping archiving format for tax purposes in Turkey (3). It is highly hierarchical; it represents content normally found in relational databases by instead using XML/XML Schema/XBRL.
> 
> Detailed accounting data files can be very large in their native database format. Turn that into XML, and things, unzipped, may balloon. As those involved in XML are likely aware, XML was not designed to be terse; it is, in fact, "verbose by design". (4) (Of course, it zips well.) That leads to large ERP extracts when expressed in XML/XBRL GL. Such extracts are used in system integration, consolidation, data migration and archival processes.
> 
> A highly visible community of users of ERP extracts is auditors - in particular internal auditors, financial auditors (CPAs, CAs), and tax auditors. For that reason, some of the auditor community that needed to transport extracts of ERP data from corporates wanted the benefits of standardized ERP metadata such as XBRL GL (e.g., robust data representations, easier interpretability of data, validation, multiple language labels for data fields, applicability of standardized business rules with RIF, ISO Schematron or XBRL Formula) with the terseness of delimited or fixed length text for transport and memory-consumption/processing issues. 
> 
> How to get the best of CSV and the best of XBRL GL? 
> 
> To that end, we (the XBRL GL Working Group) began considering conventions for describing delimited and fixed-length text files using the semantics of XBRL GL. Does the second grouping of characters in the comma-delimited file represent the main account number (gl-cor:accountMainID)? Does the text starting 31 characters in and going 40 characters represent the inventory description (gl-bus:measurableDescription)? (As we know, the "C" of CSV doesn't account for regional separators, such as the semi-colon or pipe ("|"), nor fixed-length alternatives; we wanted to take all that into account.)
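
The two questions above can be sketched in Python as follows. This is only a minimal illustration, not the DDF itself: the layout dictionaries and helper names are hypothetical, the sample records are invented, and "starting 31 characters in" is read here as skipping the first 31 characters. Only the two concept names come from the message.

```python
import csv
import io

# Hypothetical layout descriptions. The concept names come from the
# message above; the structure and the sample data are illustrative only.
DELIMITED_LAYOUT = {
    "delimiter": ",",
    # 1-based field position -> XBRL GL concept
    "fields": {2: "gl-cor:accountMainID"},
}
FIXED_LAYOUT = {
    # concept -> (0-based start offset, length in characters)
    "fields": {"gl-bus:measurableDescription": (31, 40)},
}


def map_delimited(line, layout):
    """Pick fields out of one delimited line by position."""
    row = next(csv.reader(io.StringIO(line), delimiter=layout["delimiter"]))
    return {concept: row[pos - 1] for pos, concept in layout["fields"].items()}


def map_fixed(line, layout):
    """Slice fields out of one fixed-length line, dropping padding on the right."""
    return {
        concept: line[start:start + length].rstrip()
        for concept, (start, length) in layout["fields"].items()
    }


delimited_record = map_delimited("2014-01-29,1000,Cash\n", DELIMITED_LAYOUT)
fixed_line = "ITEM-0001".ljust(31) + "Blue widget, large".ljust(40) + "00012"
fixed_record = map_fixed(fixed_line, FIXED_LAYOUT)
```

Keeping the layout outside the data file means the same export can be reinterpreted without touching the source system.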
> 
> The first thought was just to use fully qualified XBRL GL concepts as a header row in the text file. It had to be better than nothing - instead of variations of "account#", or "accountNo", or "Account Number", or "Identificador de la Cuenta", or "勘定科目番号" for a column representing the account number, just use the standardized gl-cor:accountMainID. That was certainly better than nothing, at least for human interpretation! 
> 
> The next step was thinking about a reusable configuration file that would describe one or more text output formats and be used without any modification to the source data. The embedded header approach, for example, doesn't let you span CSV files easily. It doesn't permit mapping between source content and enumerations, or facilitate provisional/calculated fields.
> 
> XBRL GL Data Definition File: Mapping from CSV to XBRL GL
> 
> With a standardized configuration file (we used XML Schema to define and XML to instantiate) that lets you identify each field in the text by a (standardized) XBRL GL description, you could keep your text file for transport and handling while beginning to gain the benefits of XBRL GL, and transform the content into at least minimal valid XBRL GL (if desired) for validation and consumption. Even without transforming the source data into XBRL GL, a savvy application could look at XBRL GL's definitions, find that a certain field should represent a date, or amount, or a valid ISO 4217 code and check the text content directly. We unimaginatively call that standardized XML file an XBRL GL Data Definition File, or XBRL GL DDF, and created an XML Schema file to define it (provided below).
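
A minimal Python sketch of the "check the text content directly" idea in the paragraph above. It assumes the application already knows, from some field description, which kind of value each column should hold; the column labels, the ISO 8601 date format and the short currency list are assumptions for illustration, not part of XBRL GL or the DDF.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Tiny illustrative subset of ISO 4217 codes; a real check would use the full list.
KNOWN_CURRENCIES = {"USD", "EUR", "JPY", "BRL", "TRY", "CAD"}


def is_date(value):
    """True if the value is a calendar date written as YYYY-MM-DD (assumed format)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def is_amount(value):
    """True if the value parses as a finite decimal number."""
    try:
        return Decimal(value).is_finite()
    except InvalidOperation:
        return False


def is_currency(value):
    """True if the value is one of the known currency codes."""
    return value in KNOWN_CURRENCIES


CHECKS = {"date": is_date, "amount": is_amount, "currency": is_currency}

# Hypothetical column declarations, in column order: (label, kind).
COLUMNS = [("posting_date", "date"), ("amount", "amount"), ("currency", "currency")]


def check_row(row):
    """Return the (label, value) pairs that fail their declared check."""
    return [
        (label, value)
        for (label, kind), value in zip(COLUMNS, row)
        if not CHECKS[kind](value)
    ]


good = check_row(["2014-01-29", "12.50", "USD"])
bad = check_row(["29/01/2014", "12,50", "XXX"])
```

The second row fails all three checks because it uses a regional date order, a decimal comma and an unlisted currency code, which is the kind of variation a dialect or format description would need to declare.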
> 
> As with the goal of this group, we had to consider not just a single table/CSV but multiple tables, representing, for example, headers and line items for invoices or orders.
> 
> The need for a standardized configuration file for this purpose was renewed urgently a few months back; the American Institute of Certified Public Accountants (AICPA) published a new specification for describing accounting information to be shared between an audit client and their auditors. The series of specifications, called the Audit Data Standards (5), defines specific groupings of data important for doing a traditional financial audit, and will grow to meet the needs of a broader audit community (internal audit, tax audit, etc.). The ADS specifies the desired content and rules for its formatting, and defines two syntaxes for its representation: pipe-delimited text and XBRL GL. 
> 
> Developing a standards-based approach to be able to losslessly transform files between the two formats seemed important. For those environments where the PipeDF was the primary option, such as older report writer systems that can produce text but not XML, opening up the world of XBRL GL (for standards-based validation of content, applicability of standard business rules, greater reusability and scalability, etc.) seemed important. To that end, I redoubled my effort to update the earliest purely conceptual DDF design to something we could begin to test in a working environment. 
> 
> Similarly and more historically, the OECD published syntax-independent guidance documents for a series of Standard Audit Files (Taxation, Payroll); while recommending XML and especially XBRL GL, it recognized that tax administrations may wish to use any format. Being able to have a standardized definition of the relationship between a text version in one country and an XML version in another could break down some barriers.
> 
> What we have works on simple test cases. We know more is necessary. But as you are beginning to explore the same area, I thought it was important to share our thoughts. 
> 
> Broader applicability? 
> 
> I have attached some files: 
> 
> i. A Word document (that is woefully in need of update) providing background on the attached XSD. If this topic is of interest to the group, I will accelerate its being updated. For example, the Schema was updated as we recognized the need for additional constraints and calculated fields, and so the Schema shows an additional structure not documented in the Word file that began to lay this area out.
> 
> ii. The aforementioned XSD, which provides the structure of the XBRL GL Data Definition File. The file is not an official public working draft, and has no official status other than internal draft. However, I am sharing it here for educational purposes. (There is no reason the XBRL GL DDF has to be limited to transforming from text to XBRL GL; it has applicability to transforming from text to other XML. I just can't promise it will work with other schema-based designs, and we have barely stress-tested it with XBRL GL itself.)
> 
> iii. A Powerpoint presentation describing the effort. 
> 
> iv. An example, not perfected, for transforming between the pipe-delimited and XBRL GL formats of one of the tables in the AR Audit Data Standard from (5) below.
> 
> So the primary goal is to be able to use the semantics of XBRL GL without being bound to its syntax in transport and handling. (6) The thinking behind another XBRL publication, something called Inline XBRL, also brings additional tools that can be leveraged in these transformations: they handle the regionalization and the varieties of date and numeric formatting that will be found in the CSV files and need to be transformed into what XML requires. (7)
> 
> We can then leverage the fact that virtually every accounting software product can create a standard text export; with a series of DDF files, those exports can automatically be transformed into XBRL GL, and automatically be turned into XBRL GL profiled to represent the AICPA Audit Data Standard (or any other profile of data, such as a tax audit data standard file).
> 
> Our work is purely prototype and our cases (from single-table Audit Data Standard in pipe-delimited format to XBRL GL) are simple. We look forward to learning from your group's effort, contributing as appropriate, and being able to incorporate it into the XBRL GL environment as possible.
> 
> <eccn /> 
> 
> 
> 
> 
> (-1) http://www.w3.org/2013/csvw/wiki/Main_Page 
> (0) http://www.xbrl.org/GLTaxonomy 
> (1) http://www.fujitsu.com/global/services/software/interstage/download/Fujitsus-Internal-Financial-Reporting-Platform-2009Mar.html
> (2) http://raw.rutgers.edu/28wcars and especially http://raw.rutgers.edu/docs/wcars/28wcars/28wcars%20presentations/28WCARS%20SICONFI%202013.11.09.pdf 
> (3) http://www.edefter.gov.tr/web/guest/2 
> (4) http://www.w3.org/XML/1999/XML-in-10-points.html.en 
> (5) http://www.aicpa.org/interestareas/frc/assuranceadvisoryservices/pages/auditdatastandardworkinggroup.aspx 
> (6) http://www.omg.org/news/meetings/tc/agendas/va/FDTF_pdf/Cohen_XBRL.pdf 
> (7) http://xbrl.org/Specification/inlineXBRL-specifiedTransformations/REC-2010-04-20/inlineXBRL-specifiedTransformations-REC-2010-04-20+corrected-errata-2011-08-17.html 
> 
> 
> Eric E Cohen 
>   
> PwC | XBRL Global Technical Leader
> Office: 1-585-271-4070 | Mobile: 1-585-317-4799
> Email: eric.e.cohen@us.pwc.com
> PricewaterhouseCoopers LLP
> Rochester, NY USA
> http://www.pwc.com/us
> 
> Thoughts don't need paper to take shape. 
> <GLDDF_1304041.docx><XBRLGLDDF_1304091.pptx><xbrl-gl_ddf_20-Dec-2012-r4.xsd><Customer_Master_YYYYMMDD_XBRLGLDDF_r4.xml>
Received on Monday, 3 February 2014 02:29:47 UTC