Drowning in data sets? Here’s how to cut them down to size
Within the next decade, a pair of giant radio telescopes in South Africa and Australia will be able to generate about 700 petabytes of data each year, the equivalent of about 149 million DVDs, a stack nearly 180 kilometres high.

The telescopes are part of the Square Kilometre Array Observatory (SKAO), which will include more than 100,000 Christmas-tree-like wire antennas in Australia and some 200 dishes in South Africa when it is completed in 2029. These telescopes will pick up radio signals from celestial objects, and their developers hope that they will shed light on some of astronomy's long-standing questions, such as what dark matter is and how galaxies form.

But 700 petabytes is only about 1% of the data that the array could generate. Shari Breen, head of science operations at the SKAO in Jodrell Bank, UK, estimates that it could produce some 60 exabytes (60,000 petabytes) each year if researchers used all of its systems continuously and retained all of the data.

"The amount of money that it would take to hold our rawest forms of data is insane — I don't even know where we would fit that many computers," says Breen. "So, we have to make some compromises."

Disciplines such as astronomy and the Earth and biological sciences have long grappled with unwieldy data sets. As the volume, processing speed and variety of data continue to grow, storage capacity is struggling to keep pace. At the same time, the boom in machine-learning and artificial-intelligence technologies is creating an incentive to hoard information. But unconstrained data retention is not financially viable, and it uses a great deal of energy.

"This is a problem that libraries have been dealing with for as long as libraries have existed," says Kristin Briney, a librarian at the California Institute of Technology (Caltech) in Pasadena. "We cannot physically collect all the books that we want to collect, and in 50 years, the book may not be useful any more."

Data sets, she says, are the same. "There has to be some curation that determines what is worth keeping and what is worth throwing away."

Field-specific rules

There is no one-size-fits-all rulebook for data curation, and best practice often depends on the discipline and on the scale of a project.

The SKAO, for instance, will store the products that it makes according to what scientists ask for in advance, says Breen. The products can range from raw data to highly processed images. So if an astronomer requests an image based on interferometry data, the underlying data set will be discarded once the picture's quality has been deemed sufficient, she says.

Breen, who is a principal investigator on a large astronomical survey, says that in the past, she would request raw data. "Now, I'm like, 'No, please don't!'," she says. "The reality of these next-generation telescopes is that then you'll spend all your time bogged down by enormous data sets rather than delivering the awesome science that was the whole point." Instead, she typically asks for an interactive 3D array of pixels known as an image cube, which is easier to wrangle, she says.
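For readers wondering what "easier to wrangle" looks like in practice, an image cube is essentially a three-dimensional array, two sky axes plus a frequency (or velocity) axis, so many everyday tasks reduce to simple array slicing. Below is a minimal Python sketch assuming a cube stored in the FITS format widely used in radio astronomy; the file name, axis ordering and pixel coordinates are invented for illustration, not SKAO specifics.

    # Minimal sketch of slicing an image cube. The file name, axis
    # order and pixel coordinates are assumptions for illustration.
    import numpy as np
    from astropy.io import fits

    with fits.open("survey_cube.fits") as hdul:  # hypothetical file
        cube = hdul[0].data                      # assume shape (n_chan, n_y, n_x)

    channel_map = cube[100]        # 2D sky image in one frequency channel
    spectrum = cube[:, 512, 512]   # spectrum through a single sky pixel
    noise = np.nanstd(cube[50])    # rough noise estimate for one channel

    print(cube.shape, channel_map.shape, spectrum.shape, noise)

Because the cube has already been calibrated and imaged, operations like these run on a comparatively small product rather than on the far larger raw data sets behind it.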
Meteorologists, by contrast, still prefer to work with the raw data. The World Meteorological Organization (WMO) receives data from thousands of satellites, marine platforms, aerial surveys and ground-based stations around the world, which record parameters such as atmospheric pressure, wind speed, air temperature and humidity, often hourly.

"We have a principle in meteorology, which is that we have to archive all the original data in order to enable us to always produce any product we have ever produced out of the original data," says WMO scientific officer Peer Hechler in Geneva, Switzerland. The meteorology community uses original data to create projections and models, but "it doesn't make sense economically to store all these derivative data sets", he says.

Similarly, the Wellcome Sanger Institute, a genomics research organization in Hinxton, UK, keeps most of the raw data it generates, says sequencing informatics team leader David Jackson. Its DNA database already contains some 90 petabytes of data. As a result, Jackson says, the organization needs clear data-retention policies, and soon. "You get to the point where the data becomes more of a liability than an asset," he says.

What needs to be kept

Whatever the discipline, the first step in managing massive data sets is working out what needs to be kept and what can be thrown away. Although practices vary, librarians and data specialists say that there are some overarching principles.

Some data sets must be kept because they are irreplaceable or legally required to be retained. Others might have been used in a publication or to inform a government decision, and need to be stored so that future readers can see the evidence on which the work or decision was based.

Many funders, including the US National Institutes of Health, require that data remain available to other researchers. To comply, researchers can use shared repositories such as the generalist Zenodo and Dryad databases, or more specialized systems, such as the Open Data Commons for Spinal Cord Injury. The Registry of Research Data Repositories provides an index of nearly 3,500 such resources.

The US National Science Foundation requires grant recipients to submit a data-management plan, including information about the size of data sets and how they will be stored, as well as how much of the grant will be allocated to this. It offers guidance that is tailored to different disciplines. For instance, the guidelines for the biological sciences contain information about how to handle sensitive data relating to human participants, whereas those for mathematics have provisions for making code and software open source and contain suggestions about data formats.

The UK Natural Environment Research Council has developed a checklist that covers the data's legal status, potential reuse, and historical and scientific value, says Sam Pepler, curation manager at the Centre for Environmental Data Analysis in Leicester, UK. The list could be useful for other fields of research, too, Pepler says, but he cautions that it is subjective and that disciplines often have their own requirements.

One thing that is not subjective, however, is the importance of the metadata that describe a data set. Helen Glaves, a senior data scientist at the British Geological Survey in Nottingham, UK, says that metadata are "absolutely fundamental". If data sets have poor metadata, she explains, their value for reuse might be limited.
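In practice, capturing that metadata can start with something as simple as a machine-readable record saved alongside the files before they are deposited in a repository. The Python sketch below writes a minimal JSON "sidecar" record; the field names are hypothetical choices loosely echoing common metadata elements (title, creator, licence, provenance), not the required schema of any particular repository.

    # A loose illustration, not any repository's required schema:
    # write a minimal machine-readable metadata record next to a data set.
    # All field names and values here are hypothetical examples.
    import json
    from pathlib import Path

    metadata = {
        "title": "Example image cube",       # hypothetical data set
        "creators": ["A. Researcher"],
        "description": "Processed image cube; raw data discarded after QA.",
        "license": "CC-BY-4.0",
        "created": "2025-01-01",             # ISO 8601 date
        "instrument": "example-telescope",   # provenance of the observation
        "processing": "calibration v1.2, imaging pipeline v3.0",
        "units": "Jy/beam",
    }

    Path("survey_cube.meta.json").write_text(json.dumps(metadata, indent=2))

However minimal, a record like this preserves the who, what and how that determine whether a data set can be found, understood and reused later.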