
From CSV data on the Web to CSV data in the Web

In our daily work with Government data such as statistics, geographical data, etc. we often deal with Comma-Separated Values (CSV) files. Now, they are really handy as they are easy to produce and to consume: almost any language and platform I have come across so far has some support for parsing CSV files, and I can export CSV files from virtually any sort of (serious) application.

There is even a – probably not widely known – standard for CSV files (RFC 4180) that specifies the grammar and registers the normative MIME media type text/csv for CSV files.

So far, so good.

From a Web perspective, CSV files really are data objects, albeit rather coarse-grained ones. If I want to use a CSV file, I always have to use the entire file. There is no agreed-upon concept that allows me to refer to a certain cell, row or column. This was my main motivation to start working on what I called Addrable (from Addressable Table) earlier this year. I essentially hacked together a rather simple implementation of Addrables in JavaScript that understands URI fragment identifiers such as:

  • #col:temperature
  • #row:10
  • #where:city=Galway,reporter=Richard
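To make the idea concrete, here is a minimal sketch of how such fragment identifiers could be parsed into a selector object. The function name and the shape of the returned object are mine for illustration, not the actual Addrable internals:

```javascript
// Hypothetical sketch: parse an Addrable-style fragment identifier
// (#col:..., #row:..., #where:...) into a selector object.
// Names are illustrative, not the actual Addrable API.
function parseFragment(fragment) {
  const frag = fragment.replace(/^#/, "");
  const sep = frag.indexOf(":");
  const scheme = frag.slice(0, sep);   // "col", "row" or "where"
  const value = frag.slice(sep + 1);
  if (scheme === "col") return { type: "col", name: value };
  if (scheme === "row") return { type: "row", index: parseInt(value, 10) };
  if (scheme === "where") {
    // "city=Galway,reporter=Richard" -> { city: "Galway", reporter: "Richard" }
    const conditions = {};
    for (const pair of value.split(",")) {
      const [key, val] = pair.split("=");
      conditions[key] = val;
    }
    return { type: "where", conditions: conditions };
  }
  throw new Error("unsupported selector: " + scheme);
}
```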

Let’s have a closer look at what the result of processing such a fragment identifier against an example CSV file could be. I’m going to use the last one in the list above, that is, addressing a slice where the city column has the value ‘Galway’ and the reporter column has the value ‘Richard’.

The client-side implementation in jQuery provides a visual rendering of the selected part; see the screenshot below (if you want to toy around with it, either clone or download it and open it locally in your browser):

There is also a server-side implementation using node.js available (deployed at addrable.no.de), outputting JSON:

{  
  "header":
    ["date","temperature"],
  "rows":
    [
      ["2011-03-01", "2011-03-02", "2011-03-03"],
      ["4","10","5"]
    ]
}
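Conceptually, the #where selection keeps the rows matching every condition and then drops the constrained columns from the result. A minimal sketch of that step (function and field names are mine, not the Addrable internals; note that it emits one array per row, rather than per column as in the JSON above):

```javascript
// Hypothetical sketch of the #where selection: keep the rows matching
// every condition, then project away the constrained columns.
function selectWhere(header, rows, conditions) {
  const keys = Object.keys(conditions);
  // a row matches if it has the required value in every constrained column
  const matched = rows.filter(row =>
    keys.every(key => row[header.indexOf(key)] === conditions[key]));
  // columns to keep: everything not fixed by a condition
  const keep = header
    .map((name, i) => ({ name: name, i: i }))
    .filter(col => keys.indexOf(col.name) === -1);
  return {
    header: keep.map(col => col.name),
    rows: matched.map(row => keep.map(col => row[col.i]))
  };
}
```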

Note: the processing of the fragment identifier is meant to be performed by the User Agent after the retrieval action has been completed. However, the server-side implementation demonstrates a workaround for the fact that the fragment identifier is not sent to the Server (see also the related W3C document on Repurposing the Hash Sign for the New Web).
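One way such a workaround could look: since the User Agent never sends the fragment, a client can move it into the query string before calling the service. The URL shape here is purely illustrative, not the actual addrable.no.de API:

```javascript
// Hypothetical sketch of the workaround: rewrite the fragment identifier
// into a query parameter so it actually reaches the server.
// The "selector" parameter name is an assumption for illustration.
function toServerUrl(csvUrl) {
  const hash = csvUrl.indexOf("#");
  if (hash === -1) return csvUrl;            // no fragment, nothing to do
  const base = csvUrl.slice(0, hash);
  const fragment = csvUrl.slice(hash + 1);   // e.g. "where:city=Galway"
  return base + "?selector=" + encodeURIComponent(fragment);
}
```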

Fast forward a couple of weeks.

Now, having an implementation is fine, but why not push the envelope and take it a step further, in order to help make the Web a better place?

Enter Erik Wilde, who did ‘URI Fragment Identifiers for the text/plain Media Type’ aka RFC 5147 some three years ago; and yes, I admit I was a bit biased already through my previous contributions to the Media Fragments work. We decided to join forces to work on ‘text/csv Fragment Identifiers’, based on the Addrable idea.

As a first step (well, besides the actual writing of the Internet-Draft to be submitted to the IETF) I had a quick look at what we can expect in terms of deployment. That is, a rather quick and naive survey based on some 60 CSV files manually harvested from the Web. The following figure gives you a rough idea of what is going on:

To sum up the preliminary findings: almost half of the CSV files are (wrongly) served as text/plain, followed by some other non-conforming and partially exotic Media Types such as text/x-comma-separated-values. The bottom line is: only 10% of the CSV files are served correctly with text/csv. Why do we care, you ask? Well, for example, because the spec says that the header row is optional, but its presence can be flagged via the optional header parameter of the text/csv Media Type. Just wondering what the chances are ;)
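The classification step of such a survey boils down to bucketing the Content-Type header of each fetched file. A sketch of that step, with category names reconstructed from the findings above (the bucketing logic is mine, not the actual survey code):

```javascript
// Hypothetical sketch of the survey's classification step: bucket the
// Content-Type header value of a fetched CSV file.
function classifyMediaType(contentType) {
  // strip parameters such as "; charset=utf-8" or "; header=present",
  // and normalise case (media type names are case-insensitive)
  const mediaType = contentType.split(";")[0].trim().toLowerCase();
  if (mediaType === "text/csv") return "conforming";
  if (mediaType === "text/plain") return "text/plain";
  return "other non-conforming";  // e.g. text/x-comma-separated-values
}
```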

Now, I admit that my sample here is rather small, but I think the distribution will roughly stay the same. By the way, is anyone aware of a good way to find CSV files, besides filetype:csv in Google or contains:csv in Bing, as I did?

We’d be glad to hear from you – do you think this is useful for your application? If yes, why? How would you use it? Or, maybe you want to do a proper CSV crawl to help us with the analysis?


About mhausenblas

Chief Data Engineer EMEA @MapR #bigdata #hadoop #apachedrill

Discussion

10 thoughts on “From CSV data on the Web to CSV data in the Web”

  1. Interesting. Is it possible to address an individual cell (like, column 3, row 5)? To me, the things I’d like to address in a CSV file are: the header; each header columns; each row; each cell. The JSON snippet posted above seems weird to me, it says “rows” but then the values within are columns and not rows.

    Posted by Richard Cyganiak | 2011-04-16, 19:57
    • Richard,

      Thanks for the valuable feedback! Regarding individual cell addressing – currently not in Addrable, but hey, why not allow a combination of row: and col: to address a cell. Noted for the IETF draft. I think we need to allow positional values for col: anyway (as the header row is optional, though often present).

      Let’s see how far we meet your requirements ATM: 1. the header (yes, with #col:*); 2. each header column (no, though the iterator is available through the previous operation); 3. each row (yes, with #row:n); 4. each cell (no, but see above, good addition).

      Concerning the JSON snippet: I guess eagle-eyed Richard has spotted a bug in the implementation ;)

      Cheers,
      Michael

      Posted by mhausenblas | 2011-04-16, 20:51
    • Thanks for the reply Michael. Regarding “#col:*” for addressing the header, yes this works, but I’d prefer if this was less obscure. How about “#header”? +1 for positional columns.

      Posted by Richard Cyganiak | 2011-04-16, 21:00
    • Sure, #header would be more explicit. On the other hand I’d like to keep the number of selectors/keywords down. But I guess one more keyword doesn’t hurt …

      Posted by mhausenblas | 2011-04-16, 21:07
  2. I should add, the “where” thing is clever and interesting but I have trouble understanding what it does. Does it address the rows that match a query?

    Posted by Richard Cyganiak | 2011-04-16, 19:59
    • Yeah, sort of indirect selection. The original use case was to select a certain cell through defining the values of all but one column (as in table2.csv#where:city=Galway,date=2011-03-01,reporter=Richard) which one can extend to higher dimensions. Anyway, I guess it deserves a bit more detailed explanation, which will certainly be done in the Internet-Draft.

      Cheers,
      Michael

      Posted by mhausenblas | 2011-04-16, 20:59
  3. I’ve been going at this from a less creative direction; Grinder is a tool for turning csv into RDF (or, anything else!). https://github.com/cgutteridge/Grinder

    It works by turning the tabular data into something easy to process with XSLT.

    I also made a very hacky tool to turn csv into linked data via a web service http://graphite.ecs.soton.ac.uk/stuff2rdf/ (just a toy, but may give some jumping off points)

    Posted by Christopher Gutteridge | 2011-04-17, 18:56
  4. TopBraid supports similar ideas. We call this Semantic Tables, and it takes a spreadsheet (CSV, TSV or Excel) and creates a canonical RDF model from it. Each column becomes a property, each table becomes a class. The properties point back to the column index and the classes point back to the sheet index (in Excel). Each row obviously becomes an instance of those classes. This RDF can then be queried and further processed, e.g. with SPARQL.

    Posted by Holger Knublauch | 2011-04-19, 07:43
  5. OpenLink Software’s (my employer) Virtuoso has been able to import or otherwise consume CSV and re-present it as SQL tables and/or fully dereferenceable and SPARQL-accessible RDF for quite some time.

    The Virtuoso Sponger includes a cartridge for Web-accessible or local files; Virtuoso has some built-in functions for similar action; and there is a bulk-load routine as well.

    Posted by Ted Thibodeau Jr | 2011-04-19, 16:28
  6. At ANZ we have been using a structured form of CSV files to store configuration data for our test environments. There are three columns: subject, predicate and object. This allows non-developers to enter data in Excel; we have Jython code which reads it in and uses it for automated deployments and deployment history recording. We convert to RDF and load it up in Joseki for web-based reporting. I have a document describing the CSV format. Often thought it should be an RFC. CSV is much more useful than RDF N-Triples.

    Posted by Bill Birch | 2011-06-05, 03:59
