D-Lib Magazine, January/February 2015

Data without Peer: Examples of Data Peer Review in the Earth Sciences
Sarah Callaghan

Abstract

Peer review of data is an important process if data are to take their place as a first-class research output. Much has been written about the theoretical aspects of peer review, but not as much about the actual process of doing it. This paper takes an experimental view, selecting seven datasets, all from the Earth Sciences and all with DOIs from DataCite, and attempts to review them, with varying levels of success. Key issues identified from these case studies include the necessity of human-readable metadata, accessibility of datasets, and permanence of links to, and accessibility of, metadata stored in other locations.

1 Introduction

Much has been written and said about peer review in recent years (Bohannon, 2013; Bornmann, 2011; Lee, Sugimoto, Zhang, & Cronin, 2013; Weller, 2001), in particular focussing on its problems and the choke points in the whole academic publishing system. Peer review of journal articles by external reviewers is in fact only a relatively recent development: the journal Nature instituted formal peer review only in 1967.¹ It is unsurprising, therefore, that even though the community generally agrees that peer review should be applied to the other outputs of research, such as data and software, the prospect of actually implementing these processes is daunting. Data in particular are much more heterogeneous than publications, and require more tools in order to interpret them. The size of modern datasets means that it is no longer possible to publish them as part of the journal article describing them. Instead they must be permanently linked to the article, and in such a way that the reader of the article can also understand the data well enough to be reassured that the data do indeed support the arguments made in the article.

One of the foundations of the scientific process is reproducibility: without it, conclusions are not valid, and the community can be sent down costly and wasteful side tracks. Yet, for reproducibility to be achieved, the data must be made open to scrutiny, and in such a way that, at the very least, a researcher with knowledge in the field of study will be able to understand and interpret the data. Researchers who produce data want recognition for their efforts, and quite rightly. Research funders also want to know what impact their funding has had on the community and wider society. Hence the drive to make data open also includes a drive to assess the quality of the data. This can be done in a number of ways. For those communities where large datasets are the norm (e.g. the climate modelling and high energy physics communities), the data tend to be stored in custom-built repositories, with standardised metadata and quality control checks performed as part of the repository ingestion process (Adelman et al., 2010; Stockhause, Höck, Toussaint, & Lautenschlager, 2012). For the majority of research groups, lacking discipline-specific repositories capable of performing quality checks, the only ways of getting their data assessed are to wait for others to use it and hope that the new users provide feedback², or to submit it to a data journal, where it will go through peer review as part of the publication process. Data journals are a new type of academic publication in which authors write a short article describing, and permanently linking to, a dataset stored in a trusted data repository.
The reviewers then review the article and the linked dataset as one unit, providing assurance to the user community that the data are useful and can be reused by researchers other than the original data producers. How this peer review process should be done is a matter of debate and, in all likelihood, will be very discipline-specific. Other articles deal with the generalities and theories of data peer review (Lawrence, Jones, Matthews, Pepler, & Callaghan, 2011; Mayernik, Callaghan, Leigh, Tedds, & Worley, 2014; M. A. Parsons & Fox, 2013; Mark A. Parsons, Duerr, & Minster, 2010); this article instead focuses on worked examples of peer review of data, in particular outlining the many pitfalls that can occur in the process!

2 Methodology

This paper is an expansion of the blog post (Callaghan, 2013) in which I performed peer review on two datasets held in two different repositories. This paper uses the same methods as the blog post, but broadens the sample space to include seven other datasets. For all the datasets reviewed, the same set of questions was applied and answered. As my expertise is in the Earth Sciences, the questions are somewhat biased towards those that would need to be answered in that field. Many questions are common, however, and deal with fundamental issues regarding the accessibility of the data and the understandability of the metadata. The first four questions in the series can be viewed as editorial questions:
In other words, these are questions that should be answered in the affirmative by an editorial assistant at the data journal before the paper and dataset even get sent out for review. If the dataset fails any of these, then it will not be possible for the reviewer to get any further in the review process, and asking them to try would be a waste of their time and goodwill. I apologise for the repetition involved in answering the same questions for each of the seven datasets, but hope that it will make the thought processes behind the review clearer to the reader.

2.1 Finding the datasets

In the blog post, I deliberately went to two data repositories that I had personal experience with, and searched within their catalogues for a particular search term: "precipitation". For this experiment, I widened the net. Firstly I attempted to use Google to search for suitable datasets, with no success: many results were obtained, but trying to filter them to find links to actual datasets was very difficult. So instead I went to the DataCite metadata store and searched the catalogue of datasets there, using the search terms "rain" and "precipitation". The results still required a certain amount of filtering, as DataCite DOIs can be applied to publications and other research outputs, as well as datasets. The seven datasets chosen are not intended to be a representative sample of the datasets in DataCite, but are illustrative of the type of issues I came across. As all the datasets were chosen from DataCite, they all have a DOI, so I decided to omit the first editorial question from each of the review examples, as the answer is the same in all cases.
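As an aside, this kind of search, and the first editorial check of whether a DOI resolves at all, can also be scripted. The sketch below is illustrative only and is not the workflow I actually used: it queries the present-day DataCite REST API (api.datacite.org) rather than the metadata store search interface, and the parameter choices and result handling are my own assumptions.

```python
# Sketch: search DataCite for candidate datasets and check that their DOIs
# resolve. Assumes the current DataCite REST API; not the interface used in
# the article itself.
import requests

DATACITE_API = "https://api.datacite.org/dois"

def find_datasets(term, rows=10):
    """Yield (doi, title) pairs for datasets matching a search term."""
    params = {"query": term, "resource-type-id": "dataset", "page[size]": rows}
    r = requests.get(DATACITE_API, params=params, timeout=30)
    r.raise_for_status()
    for item in r.json().get("data", []):
        titles = item.get("attributes", {}).get("titles", [])
        title = titles[0].get("title", "(no title)") if titles else "(no title)"
        yield item["id"], title

def doi_resolves(doi):
    """First editorial check: does the DOI redirect to a live landing page?"""
    try:
        r = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
        return r.status_code < 400
    except requests.RequestException:
        return False

for doi, title in find_datasets("precipitation"):
    status = "OK  " if doi_resolves(doi) else "FAIL"
    print(status, doi, title)
```

A HEAD request is only a rough proxy for "the DOI resolves to a usable landing page"; some repositories reject HEAD requests or block scripts, so any failure still needs to be checked by hand.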
2.2 Caveats and biases

As mentioned earlier, my background is in atmospheric science, particularly the space-time variability of rain fields and the atmospheric impacts on radio communication systems (Callaghan, 2004). Hence my choice of search terms: I wished to find datasets that I had sufficient domain knowledge to review. As I work for a discipline-specific repository (the British Atmospheric Data Centre), one of NERC's federation of environmental data centres, I also excluded datasets from any of the NERC data centres due to potential conflicts of interest. The results given in this paper are not intended to be exhaustive or statistically meaningful. I made deliberate choices of datasets to review, based on what I thought would provide interesting and illuminating examples. A far less biased approach would have been to choose a number of datasets at random from the list of returned search results; however, I hope my (biased) selection process is more illuminating. It's also worth pointing out that even though the researchers who created these datasets have made them available, they probably didn't expect them to be reviewed in this fashion.

3 Datasets

Dataset 1: Hubbard Brook Rain Gages

Citation: Campbell, John (2004): Hubbard Brook Rain Gages; USDA Forest Service. http://doi.org/10.6073/AA/KNB-LTER-HBR.100.2

Figure 1: Landing page for (Campbell, 2004)

This dataset nearly fell at the very first hurdle, in that the landing page for the DOI is an XML file without any style information associated with it, which makes the whole thing very difficult to read, and hence to review. (If I were reviewing this for real, I would have rejected it as soon as I saw the raw XML, but for the purposes of this paper, I have tried to answer the review questions.)
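As a practical aside, a reviewer faced with a bare, unstyled XML record can at least make it skimmable with a few lines of code. This is a minimal sketch rather than part of the review itself; it assumes the DOI still redirects to the raw XML record and that the response is well-formed XML.

```python
# Sketch: fetch an unstyled XML landing page and re-indent it so a human
# reviewer can at least skim the metadata fields.
import requests
from xml.dom import minidom

url = "https://doi.org/10.6073/AA/KNB-LTER-HBR.100.2"  # DOI from the citation above
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(minidom.parseString(resp.content).toprettyxml(indent="  "))
```

Pretty-printing helps the determined reviewer, but it doesn't solve the underlying problem: the landing page itself should be human-readable.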
And it's at this stage that I give up on this particular dataset.

Verdict: Revise and resubmit. This case study is a really good example of how important it is to have the landing pages for your DOIs presented in a way that is good for humans to read, not just machines. If the landing page were more user-friendly, then I'd probably have got further with the review.

Dataset 2: Daily Rainfall Data (FIFE)

Citation: Huemmrich, K. F.; Briggs, J. M. (1994): Daily Rainfall Data (FIFE); ORNL Distributed Active Archive Center. http://doi.org/10.3334/ORNLDAAC/29

Figure 2: Landing page for (Huemmrich and Briggs, 1994)
It looks like the registration process is easy, and I couldn't find any information there about restrictions on users, but given that I was on Dataset 2 and hadn't managed to properly review anything yet, I decided to move on and find another, more open, dataset.

Verdict: Don't know. This case does show, however, that even minor access control restrictions can put users off accessing and reusing data for other purposes.

Dataset 3: ARM: Total Precipitation Sensor

Citation: Cherry, Jessica (2006): ARM: Total Precipitation Sensor. http://doi.org/10.5439/1025305

Figure 3: Error page received when attempting to resolve the DOI for Cherry, Jessica (2006): ARM: Total Precipitation Sensor; http://doi.org/10.5439/1025305

On my first attempt to review this dataset, the DOI resolved to the dataset landing page for long enough that I could take a first glance and identify it as a candidate for review. However, when I came back to the page after an hour or so, the page wouldn't reload, and the DOI repeatedly failed to resolve. The same thing happened when I tried again four days later.

Verdict: Reject. If the DOI doesn't resolve, don't even bother sending it out for review.

Dataset 4: Rain

Citation: Lindenmayer, David B.; Wood, Jeff; McBurney, Lachlan; Michael, Damian; Crane, Mason; MacGregor, Christopher; Montague-Drake, Rebecca; Gibbons, Philip; Banks, Sam C. (2011): rain; Dryad Digital Repository. http://doi.org/10.5061/DRYAD.QP1F6H0S/3

Figure 4: Landing page for (David B. Lindenmayer et al., n.d.-b)
Verdict: Revise and resubmit. This is a small part of a research project that wasn't really looking at rain at all, yet the data collected could potentially be amalgamated with other datasets to provide a wider-ranging, more useful dataset. The title definitely needs fixing, and extra metadata about the calibration, type of gauge, and latitude and longitude need to be supplied. (This information may be in the associated paper, but seeing as that is behind a paywall, it's not much use to a user at this time.) It's also interesting to see that you can import the data citation into Mendeley from Dryad (by clicking the "Save to Mendeley" plugin button I have in Google Chrome). Unfortunately, the dates of the data citations get a bit messed up by this.

Dataset 5: Meteorological Records from the Vernagtferner Basin Gletschermitte Station, for the Year 1987

Citation: Weber, Markus; Escher-Vetter, Heidi (2014a): Meteorological records from the Vernagtferner basin Gletschermitte Station, for the year 1987; PANGAEA Data Publisher for Earth & Environmental Science. http://doi.org/10.1594/PANGAEA.832561

Figure 5: Landing page for (Weber & Escher-Vetter, 2014a)
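For readers who want to look at the data themselves, PANGAEA offers the data table as a tab-delimited text download from the landing page. The sketch below is my own illustration rather than part of the review: the filename is hypothetical, and it assumes the export follows PANGAEA's usual layout of a /* ... */ metadata header followed by a tab-separated table.

```python
# Sketch: load a PANGAEA tab-delimited export for a quick inspection.
# Assumes the file has been saved locally (hypothetical name below) and
# starts with a /* ... */ metadata header block.
import pandas as pd

path = "vernagtferner_1987.tab"  # hypothetical local filename
with open(path, encoding="utf-8") as f:
    lines = f.readlines()

# Skip everything up to and including the line that closes the header ("*/").
start = next((i + 1 for i, line in enumerate(lines) if line.strip() == "*/"), 0)

df = pd.read_csv(path, sep="\t", skiprows=start)
print(df.columns.tolist())  # parameter names, e.g. date/time and precipitation
print(df.describe())        # quick sanity check of the value ranges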
Verdict: Accept. Of all the datasets reviewed in the process of writing this paper, this was the best documented, and therefore the most useful.

Dataset 6: National Oceanic and Atmospheric Administration. Weather Measurements: Monthly Surface Data: Total Precipitation | Country: USA | State: South Carolina [Data-file]

Citation: Data-Planet by Conquest Systems, Inc. (2014). National Oceanic and Atmospheric Administration. Weather Measurements: Monthly Surface Data: Total Precipitation | Country: USA | State: South Carolina [Data-file]. Retrieved from http://www.data-planet.com, Viewed: July 8, 2014. Dataset-ID: 018-002-006. http://doi.org/10.6068/DP143A169EBCB2

Or (DataCite citation): Conquest System Datasheet (2013): Average Daily Precipitation from the Weather Measurements: Monthly Surface Data Dataset shown in Inches; Conquest Systems, Inc. http://doi.org/10.6068/DP13F0712607393

Figure 6: Landing page for (Data-Planet by Conquest Systems, Inc. 2014)
Figure 7: Screenshot from the DataCite search pages, showing the list of identically-titled datasets in this repository.
Verdict: Reject. Don't even send it out for review. On a more positive note, I like the way they've very clearly spelled out how the dataset should be cited. And, to be fair, I would expect that if the dataset authors had submitted their dataset for review, they would have arranged access. It's interesting to note, though, that in the citation Data-Planet is given as the dataset creator, when I'd expect it to be the publisher. Oh, and the citation Data-Planet gives isn't the same as the DataCite citation, which will cause confusion.

Dataset 7: ECHAM5-HAM Precipitation and Aerosol Optical Depth Data

Citation: Grandey, Benjamin S. (2014): ECHAM5-HAM precipitation and aerosol optical depth data; figshare. http://doi.org/10.6084/M9.FIGSHARE.1061414

Figure 8: Landing page for (Benjamin S. Grandey, 2014)
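The data files attached to this record are NetCDF. As an illustration, and an aside not in the original review, a reviewer can list the metadata embedded in such a file with a few lines of Python using the netCDF4 library; the filename below is hypothetical, standing in for one of the files downloaded from the figshare record.

```python
# Sketch: list the metadata embedded in a NetCDF file without specialist
# climate tooling. The filename is hypothetical.
from netCDF4 import Dataset

with Dataset("echam5ham_precipitation.nc") as nc:
    # Global attributes: title, history, conventions, etc.
    for name in nc.ncattrs():
        print(f"{name}: {getattr(nc, name)}")
    # Variables, with their dimensions and units (where a units attribute exists).
    for var_name, var in nc.variables.items():
        units = getattr(var, "units", "no units attribute")
        print(var_name, var.dimensions, units)
```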
Verdict: Accept. Reviewing this dataset took me outside my comfort zone, as I'm not a climate modeller, and if I'd been asked to review this for real, I would have declined. In terms of usability and metadata, the metadata provided on the landing page isn't really good enough to help the reader judge the dataset. Having metadata inside the data files is all well and good, but it does require software to open them. NetCDF is a common format, but wouldn't be accessible to researchers outside the Earth Science and climate fields. I'd recommend that the dataset authors update their description section with the full reference for the paper (marked as "in press, 2014") that analyses the data. I also really like how figshare have an "export to Mendeley" button set up on their dataset pages.

4 Conclusions

If data are to become recognised outputs of the research process (as they should), then they need to be available for other users to scrutinise, for the verification and reproducibility of the research. As the case studies presented here show, peer review of data can be done, though it potentially has many problems. These problems aren't necessarily with the datasets themselves, but with the way they are presented and made available (or not) to the user, making them primarily the repository's area of concern.
The true impact of a dataset, like the true impact of a research article, can really only be determined over a long period of time, by the number of researchers using and citing it. Yet, if a dataset is difficult to use and understand, it is very likely that its impact will be seriously reduced. Peer review of data, as presented in this paper, provides a way of checking the understandability and usability of the data, allowing other users to filter the available datasets down to those that will be of use to them, without a significant overhead in opening and interpreting the data and metadata.

Acknowledgements

The work required to write this paper was funded by the European Commission as part of the project OpenAIREplus (FP7-2011-2, Grant Agreement no. 283595).

Notes

1. "History of the journal Nature". Timeline, Nature.com, 2013.

2. Quality control through user feedback is an interesting topic, but unfortunately out of scope for this paper. The CHARMe project is looking into this in greater detail and will be producing a software mechanism to support it.

References

[1] Callaghan, S. A. (2004). Fractal analysis and synthesis of rain fields for radio communication systems. PhD thesis, University of Portsmouth.

[2] Callaghan, S. (2013). How to review a dataset: a couple of case studies [blog post].

[3] Campbell, J. (2004). Hubbard Brook Rain Gages. USDA Forest Service. http://doi.org/10.6073/AA/KNB-LTER-HBR.100.2

[4] Data-Planet by Conquest Systems, Inc. (2014). National Oceanic and Atmospheric Administration. Weather Measurements: Monthly Surface Data: Total Precipitation | Country: USA | State: South Carolina [Data-file]. Retrieved from http://www.data-planet.com, Viewed: July 8, 2014. Dataset-ID: 018-002-006. http://doi.org/10.6068/DP143A169EBCB2

[5] Adelman, J., Baak, M., Boelaert, N., D'Onofrio, M., Frost, J. A., Guyot, C., ... Wilson, M. G. (2010). ATLAS offline data quality monitoring. Journal of Physics: Conference Series, 219(4), 042018. http://doi.org/10.1088/1742-6596/219/4/042018

[6] Bohannon, J. (2013). Who's afraid of peer review? Science, 342(6154), 60-65. http://doi.org/10.1126/science.342.6154.60

[7] Bornmann, L. (2011). Scientific peer review. Annual Review of Information Science and Technology, 45, 197-245. http://doi.org/10.1002/aris.2011.1440450112

[8] Grandey, B. S. (2014). ECHAM5-HAM precipitation and aerosol optical depth data. figshare. http://doi.org/10.6084/m9.figshare.1061414

[9] Grandey, B. S., Stier, P., & Wagner, T. M. (2013). Investigating relationships between aerosol optical depth and cloud fraction using satellite, aerosol reanalysis and general circulation model data. Atmospheric Chemistry and Physics, 13(6), 3177-3184. http://doi.org/10.5194/acp-13-3177-2013

[10] Lawrence, B., Jones, C., Matthews, B., Pepler, S., & Callaghan, S. (2011). Citation and Peer Review of Data: Moving Towards Formal Data Publication. International Journal of Digital Curation, 6(2), 4-37. http://doi.org/10.2218/ijdc.v6i2.205

[11] Lee, C. J., Sugimoto, C. R., Zhang, G., & Cronin, B. (2013). Bias in peer review. Journal of the American Society for Information Science and Technology, 64(1), 2-17. http://doi.org/10.1002/asi.22784

[12] Lindenmayer, D. B., Wood, J., McBurney, L., Michael, D., Crane, M., MacGregor, C., ... Banks, S. C. (n.d.-a). Data from: Cross-sectional versus longitudinal research: a case study of trees with hollows and marsupials in Australian forests. http://doi.org/10.5061/dryad.qp1f6h0s
[13] Lindenmayer, D. B., Wood, J., McBurney, L., Michael, D., Crane, M., MacGregor, C., ... Banks, S. C. (n.d.-b). rain. Dryad Digital Repository. http://doi.org/10.5061/dryad.qp1f6h0s/3

[14] Lindenmayer, D. B., Wood, J., McBurney, L., Michael, D., Crane, M., MacGregor, C., ... Banks, S. C. (2011). Cross-sectional vs. longitudinal research: a case study of trees with hollows and marsupials in Australian forests.

[15] Mayernik, M. S., Callaghan, S., Leigh, R., Tedds, J., & Worley, S. (2014). Peer Review of Datasets: When, Why, and How. Bulletin of the American Meteorological Society. http://doi.org/10.1175/BAMS-D-13-00083.1

[16] Parsons, M. A., Duerr, R., & Minster, J. B. (2010). Data citation and peer review. Eos. http://doi.org/10.1029/2010EO340001

[17] Parsons, M. A., & Fox, P. A. (2013). Is Data Publication the Right Metaphor? Data Science Journal, 12, WDS32-WDS46. http://doi.org/10.2481/dsj.WDS-042

[18] Stockhause, M., Höck, H., Toussaint, F., & Lautenschlager, M. (2012). Quality assessment concept of the World Data Center for Climate and its application to CMIP5 data. Geoscientific Model Development, 5, 1023-1032. http://doi.org/10.5194/gmd-5-1023-2012

[19] Weber, M., & Escher-Vetter, H. (2014a, May 16). Meteorological records from the Vernagtferner basin Gletschermitte Station, for the year 1987. PANGAEA. http://doi.org/10.1594/PANGAEA.832561

[20] Weber, M., & Escher-Vetter, H. (2014b, May 16). Meteorological records from the Vernagtferner basin Gletschermitte Station, for the years 1968 to 1987. PANGAEA. http://doi.org/10.1594/PANGAEA.832562

[21] Weller, A. C. (2001). Editorial Peer Review: Its Strengths and Weaknesses (p. 342). Information Today, Inc.

[22] Huemmrich, K. F., & Briggs, J. M. (1994). Daily Rainfall Data (FIFE). ORNL Distributed Active Archive Center. http://doi.org/10.3334/ORNLDAAC/29

[23] Cherry, J. (2006). ARM: Total Precipitation Sensor. http://doi.org/10.5439/1025305

About the Author