Volume 18, Number 3/4
The 7th International Digital Curation Conference: A Personal View
University of Edinburgh
The International Digital Curation Conference (IDCC) is an established annual event with a unique place in the digital curation community, engaging individuals, organisations and institutions across all disciplines and domains involved in curating data and providing an opportunity to get together with like-minded data practitioners to discuss policy and practice. The most recent IDCC took place in Bristol, UK, 5-8 December 2011. The Digital Curation Centre (DCC) is funded by the UK organisation JISC. The event was organised in conjunction with the Coalition for Networked Information (CNI). This year the conference looked at the issues arising from the creation of an open data landscape.
We assembled at the Bristol Marriott Royal Hotel, on one corner of College Green, with the OccupyBristol encampment on the far side of the green. The main conference took place on the 6th and 7th December, but there were both pre- and post-conference workshops on the 5th and the 8th, which were well attended. The conference was organised jointly by the Digital Curation Centre (DCC) and the US Coalition for Networked Information (CNI). Amplification of the event was facilitated by JISC. The Marriott chain has many things going for it, but it does not provide free wifi for customers, even for conferences which have a focus on communications and data management. Connectivity for the attendees was therefore a little constrained at first. Kevin Ashley, Director of the DCC, closed the dinner on the first conference day with the hugely popular announcement that free wifi access had been negotiated with the hotel.
The title of the event was "Public? Private? Personal? Navigating the open data landscape", which reflects the fact that there are plenty of questions around at the moment, particularly in connection with social networking, cloud computing, big data and data mining. Most of them revolve around what is public, what is private, ownership, rights, re-use, and the practicality of rights if we want to encourage re-use of data. These questions have always been around, but they are currently occupying significant yardage in discussion. And we are aware of the things we might do, if we can operate in an open data landscape.
In this article I'll discuss the pre-conference workshops I attended on December 5th, two highlight presentations given during the main conference on December 6th and 7th, and the all-day workshop on Domain Names on December 8th. I'm writing this account from the perspective of nine or ten weeks after the event took place, and even that short time is sufficient to allow us to see how timely and relevant some of the IDCC discussions were. In those weeks we have seen a number of interesting phenomena which may have a long-term impact on data sharing and reuse, at least in the public sphere: notably the closure of a number of file-sharing sites, the biggest of which (Megaupload) was taken down on the 19th of January because it contained illegal files (i.e., files for which the uploaders did not own the copyright). The US Justice Department, which instigated the take-down, did not specify how many files of this kind there were, nor what proportion of the total number of uploaded files on the site they represented. The fact that the files were there at all was apparently sufficient reason to close down the site.
This is significant in that it is very difficult for sites which promote legal file sharing to keep copyright-infringing files off their sites altogether. Should we expect Myspace, Flickr, YouTube, and other file-sharing sites to be taken down altogether because there are files there which have been uploaded without the permission of the copyright owners? The principal infringers are being targeted first (the essence of the case is that Megaupload represents a criminal conspiracy), but it isn't clear where this process might stop. Some of the points in the indictment have been described by legal commentators as involving 'selective interpretations' and 'legal concepts' representing 'novel theories' of US law, which could be challenged in court.
Perhaps the most disturbing aspect of the Megaupload closure was that there were no arrangements made to stop the removal (the deletion in fact) of the work of hundreds of thousands of users, who were making perfectly legitimate use of the service. User files and metadata are vulnerable to both legal action and to policy changes in commercial companies. These threats and changes are not going to go away, and we are in fact very far from being in an open data landscape. It is possible that in a few years open data might be restricted to academic institutions and networks. And if academic publishers succeed in arguing, as some are doing, that data mining of books and articles is misuse of material they own, then doing clever things with publicly accessible aggregations of data of any sort may become very difficult [1].
One of the pre-conference workshops which attracted my interest, "Data for Impact: Can research assessment create effective incentives for best practice in data sharing?", was focussed on research assessment exercises as a key driver for best practice in data sharing [2]. In the UK this is definitely something which periodically concentrates minds, and results in practical, technical, and organisational development. Identifying best practice from systems and arrangements which work, however, is difficult when the landscape and the assessment exercises are fast changing. Best practice for the last round of assessment is unlikely to survive the changes in the requirements of the assessment exercise.
In the end I chose another workshop to attend, "Delivering post-graduate research data management training", organised by the MANTRA project among others. Recent projects in the UK and US have addressed current gaps in the training of PhD and other postgraduate students in the management and planning aspects of research data. The workshop looked at lessons which could be learned from these, and what we might do next in terms of provision. Breakout sessions looked at the creation and repurposing of discipline-specific learning materials (this is a tough one, since 'discipline specific' materials are not easy to repurpose); the modes of delivery for materials (face-to-face, online, etc.); and how we might engage with existing postgraduate training programmes. The DaMSSI project (Data Management Skills and Support Initiative) made some recommendations at the end of the workshop.
In the afternoon I attended the "e-Science Workshop on Data re-use: How can metadata stimulate re-use". This workshop was led by Birte Christensen-Dalsgaard, of the Danish National Library. The thrust of the workshop (organised by LIBER's e-science working group) was that libraries, which traditionally have played an important role as mediators of research publications, are now exploring their role in assisting the research community in the data landscape. New services being developed by libraries vary from provision of data storage to creation of new ontologies and tools for digital preservation. These questions are currently under discussion in universities, usually with little direct participation of the researchers who are to benefit from these new tools and services. Providing data storage may be an analogue of the provision of shelf space, but the other things are harder, and not something that libraries can provide by themselves.
After a number of presentations, these questions were addressed in the three working groups into which we were divided. The first group considered the infrastructure and access to research data that we need to provide, the assumption being that open access, open source and open data are a fundamental substrate to this provision. The second looked at the technical aspects of privacy, digital identity and digital rights management for research data, which is probably the most important and complex area of all, both within academia and for anyone engaged in using third party materials across networks, whether in the cloud, on the web, or in repositories. (The question of digital identity on the web was also discussed at length in one of the day-long workshops on the final day.) The third working group looked at the support of cross- and multi-disciplinary research (and whether, for data, 'one size fits all' solutions have any mileage). Underlying it all was the question: what is the role for libraries in this area? Each group created a poster reflecting their discussions, and then each poster was explained to the other groups in turn, who then commented via post-it notes. The main points which emerged were summarised at the end of the workshop.
There were many great presentations at this conference. I have singled out two.
Changing the Story for Data
Ewan McIntosh of the company 'No Tosh' gave a TED-style opening keynote, "Public data opportunities", on Tuesday [3]. He pointed out that he did not have a clear picture of what his audience does with data, since they are not accustomed to talking about it beyond their own community boundaries, so it is hard for an outsider to find out. He asked what the 'secret sauce' is for data, and why it is that some data has more impact. And impact is now really important, so we need to be able to locate our data and our activity within some kind of publicly intelligible narrative. We need to tell good stories about our data in order to get impact. He suggested data has both a snobbery and a communication problem: he talks about 'data', where we talk about 'Data'. He suggested five lessons derived from working with seven-year-olds: tell a story, create curiosity, create wonder, solve pain, and create a reason to trade data. We should also be aware that the general public is one of our main stakeholders, and that ultimately they are who we are creating data for. Ewan's argument is essentially that "nobody knows who you are or cares what you do because you don't use your data to tell stories or inspire wonder." But we can change the world one story at a time, using data. He gave a nice example of "creating a sense of wonder": the "Debtris" visualization for international debt [4].
For me, the standout presentation of the conference was "Reproducible research", given by Victoria Stodden, Assistant Professor in the Department of Statistics at Columbia University, who suggested that we are at a watershed point in connection with reproducibility and data replication in science, and that the concept of reproducible research can help us to frame the agenda for digital curation [5]. She may be exactly right about this. She explicitly connected her remarks about openness in science with the origin of organised science in the seventeenth century, and asked the question "Why is science open?" She answered the question by saying that the main purpose of publishing materials in the scientific method is to 'root out error'. That was the goal in the seventeenth century, and it should still be our goal in the twenty-first century, but in our digitized frenzy, we seem to have lost sight of this, and of the sceptical and critical dialogue necessary to 'root out error'. This was the main reason why research papers were first published, and the reason for the foundation of the Transactions of the Royal Society, and why data was collected by that society. It was why data was passed around the community of those interested and active in science. Crucially, she noted that many scientists are not sure about what to share or when to share it, whereas the concept of reproducibility helps to make this clear. Open data and open science do not have a concrete meaning in the everyday practice of many scientists, whereas reproducibility does. This is the real reason why we should have open data and OA. We should be arguing for reproducibility, not 'open this', and 'open that'.
The conference programme is available from the DCC website, as well as live-blogged summaries of presentations, slide presentations, posters, audience tweets, and video of the speakers.
Domain Names and Persistence Workshop
The "Domain names and persistence workshop" was led by Henry Thompson from the University of Edinburgh. Persistence of names on the web depends to a significant extent on the persistence of domain names. This is clear from the fact that the proprietors of some of the larger non-http based URI schemes (such as DOI and HDL) provide 'actionable' http versions of their URIs: ID strings which resolve as URLs (via dx.doi.org and hdl.handle.net). This workshop revisited a discussion which took place over two days in Glasgow in 2005 [6], which failed to produce agreement on whether or not names in general, and domain names in particular, should be human readable. Again, at this meeting, there was no clear consensus on this point. There was some argument to the effect that we would be fooling ourselves if 'we think that URIs are understandable outside a small community of experts'. Another area of revisitation was the question of whether or not natural language meanings in the URI strings 'are important both for user confidence, and for branding'.
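The 'actionable' http form mentioned above can be illustrated with a minimal Python sketch. Only the two resolver hosts (dx.doi.org and hdl.handle.net) come from the workshop discussion; the function name, the "doi:" and "hdl:" input prefixes, and the sample identifiers are assumptions made for illustration, and real identifiers may need more careful escaping than is shown here.

```python
from urllib.parse import quote

# Resolver hosts named in the text. The "doi:" / "hdl:" prefixes are an
# assumed input convention for this sketch, not part of either scheme's spec.
RESOLVERS = {
    "doi": "http://dx.doi.org/",
    "hdl": "http://hdl.handle.net/",
}


def actionable_url(identifier: str) -> str:
    """Turn an identifier such as 'doi:10.1000/182' into an http URL."""
    scheme, sep, name = identifier.partition(":")
    scheme = scheme.lower()
    if not sep or not name or scheme not in RESOLVERS:
        raise ValueError(f"unsupported identifier: {identifier!r}")
    # Percent-encode anything unusual, but keep the prefix/suffix slash.
    return RESOLVERS[scheme] + quote(name, safe="/")


if __name__ == "__main__":
    print(actionable_url("doi:10.1000/182"))
    print(actionable_url("hdl:1234/5678"))  # hypothetical handle
```

The point of the sketch is that the identifier string itself stays the same; its day-to-day usability depends on the continued existence of the two domain names in the table.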
Even now it is difficult to grasp and define the real impediments to arriving at a consensus about this question. Human readable URLs came first, because the whole point of the web at the time was the enabling of easy linkage to resources on other machines. Then, we thought of URLs being specific instances of URIs, in that URLs located resources, and that the URI identified something more abstract. The fact that they might look exactly the same to users was a complication we might have to put up with. A URI does not have to dereference to a resource or a representation (i.e., function as a link), so outside the community of experts, what the string represents is not understandable, because to most users, an http type string which does not dereference is simply a broken link.
There was useful discussion of terminology, particularly in relation to key concepts such as binding and resolution. In this context, 'binding' refers to a means of creating a relationship between a new name and the entity named, and 'resolution' to a process for looking up a name in order to discover the entity named. For persistent names, binding happens once and is intended to be unique and irrevocable, whereas the process of resolution can be achieved in a number of ways.
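The distinction between binding and resolution can be made concrete with a toy registry. This is a sketch under stated assumptions, not a description of any system discussed at the workshop: the class name, method names, and the sample name and entity are all hypothetical, and an in-memory dictionary stands in for whatever lookup mechanism a real resolver would use.

```python
class PersistentNameRegistry:
    """Toy registry: each name is bound once and can never be re-bound."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, entity):
        # Binding happens once and is meant to be irrevocable, so a second
        # attempt to bind the same name is refused.
        if name in self._bindings:
            raise ValueError(f"{name!r} is already bound")
        self._bindings[name] = entity

    def resolve(self, name):
        # Resolution is a separate lookup step; here it is a dictionary
        # lookup, but it could equally be a network query or a redirect.
        return self._bindings[name]


if __name__ == "__main__":
    registry = PersistentNameRegistry()
    registry.bind("example-name-1", "a report held in some repository")
    print(registry.resolve("example-name-1"))
```

Keeping `bind` and `resolve` as separate operations mirrors the point made in the discussion: the binding is fixed, while the resolution mechanism behind it is free to change.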
One possible approach to domain name persistence was suggested. Gavin Brown of the UK registrars CentralNic pointed out that .arpa has special status, and it is not managed under contract by a registrar, as are all other Generic Top Level Domains (gTLDs), but rather it's managed by IANA (Internet Assigned Numbers Authority). The .arpa domain is the "Address and Routing Parameter Area" domain and is designated to be used exclusively for Internet-infrastructure purposes. It is administered by the IANA in cooperation with the Internet technical community under the guidance of the Internet Architecture Board (IAB). This approach (i.e., managing generic top level domains using IANA rather than via normal registries) would allow the creation of new persistable domain names (i.e., ones whose persistence might be enforced in some way), but would not in itself change the status of existing ones.
Henry Thompson suggested that a further step might be to use the creation of, for example, .org.w3.arpa as a way to simultaneously create a new 'robust' domain name, and to change the status of the existing w3.org domain to have the same properties of persistence. However, this further step would require not only agreement from the IAB but also new regulations at the level of ICANN's contracts with some gTLD registrars, at the very least the Public Interest Registry (PIR) and VeriSign, for both the .org and .net gTLDs. There was no consensus as to the likelihood of achieving such agreement. Getting a new gTLD with its own new governance rules through ICANN would involve enormous political complexity, which would be tough to manage to a successful conclusion.
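The naming pattern in this suggestion (w3.org mirrored as .org.w3.arpa) appears to reverse the labels of the existing domain and place them under .arpa. The short sketch below assumes that reading; the only example given at the workshop was w3.org, so the general rule and the function name are illustrative assumptions rather than a proposal that was actually specified.

```python
def arpa_mirror(domain: str) -> str:
    """Reverse the labels of a domain and append 'arpa' (assumed pattern)."""
    labels = [label for label in domain.lower().strip(".").split(".") if label]
    if not labels:
        raise ValueError("empty domain name")
    return ".".join(reversed(labels)) + ".arpa"


if __name__ == "__main__":
    print(arpa_mirror("w3.org"))
```

For w3.org this yields org.w3.arpa, matching the example in the text.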
The edited IRC log of the "Domain names and persistence workshop", together with links to all the talks, is available. A summary report is also available.
There were many questions here, and surprisingly in some cases, a few possible answers. In any case, many of the questions focussed on are ones which will become more pressing, as the spheres of the private and the public continue to blur into one another. The development of the web and the Internet has tended to remove old rules and patterns of behaviour. We still need rules and working practices, and reasons for doing things one way rather than another. But not the old rules and ways of doing things kitted out as new. The open data landscape needs better.
[1] Richard Van Noorden, "Trouble at the text mine", Nature, 7th March 2012, http://www.nature.com/news/trouble-at-the-text-mine-1.10184.
[2] Higher Education Funding Council for England (HEFCE), Research Excellence Framework, http://www.hefce.ac.uk/research/ref/.
[3] Ewan McIntosh: Public data opportunities, http://vimeo.com/33410539.
[4] YouTube. "Debtris US", http://youtu.be/K7Pahd2X-eE.
[5] Victoria Stodden: Reproducible research, http://vimeo.com/33627936.
[6] Philip Hunter, "DCC Workshop on Persistent Identifiers", Ariadne, http://www.ariadne.ac.uk/issue44/dcc-pi-rpt/.
About the Author
Philip Hunter is currently Digital Library Grants and Projects Co-ordinator for the Digital Library Section of Edinburgh University Library (December 2010 onwards). Before that he managed the Research Publications Service, which brought together a Research Assessment Exercise (RAE) related publications repository, and the open access Edinburgh Research Archive (November 2008 - November 2010). He was the project manager for IRIScotland and other open access related projects from 2005 to 2008. He worked for the DCC in 2004-5 on secondment from UKOLN, implementing the technical and organisational infrastructure for the International Journal of Digital Curation, now in its sixth year of publication.