Science AI Analysis

Drowning in data sets? Here’s how to cut them down to size

AI Legal Analyst
March 24, 2026, 12:05 AM, 9 min read

## Summary
Two giant radio telescopes being built for the Square Kilometre Array Observatory will be able to generate about 700 petabytes of data each year, yet that is only about 1% of what the array could produce. Researchers in astronomy, meteorology and genomics describe how they decide which data to keep and which to throw away.

Nature 651, 1121-1122 (2026). doi: https://doi.org/10.1038/d41586-026-00880-7

## Article Content
Illustration: The Project Twins
Within the next decade, a pair of giant radio telescopes in South Africa and Australia will be able to generate about 700 petabytes of data each year, the equivalent of about 149 million DVDs, a stack nearly 180 kilometres high.
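The DVD comparison above can be checked with a few lines of arithmetic. This is a rough sketch, not the article's own calculation: it assumes a single-layer DVD holding 4.7 GB and 1.2 mm thick, and decimal units throughout. The function name `dvd_equivalent` is made up for illustration.

```python
# Assumed values (not stated in the article): a single-layer DVD holds
# 4.7 GB and is 1.2 mm thick. Decimal units: 1 PB = 10**15 bytes.
PETABYTE = 10**15
GIGABYTE = 10**9


def dvd_equivalent(petabytes, dvd_gb=4.7, dvd_thickness_mm=1.2):
    """Return (number of DVDs, stack height in km) for a data volume."""
    n_dvds = petabytes * PETABYTE / (dvd_gb * GIGABYTE)
    stack_km = n_dvds * dvd_thickness_mm / 1_000_000  # mm -> km
    return n_dvds, stack_km


n_dvds, stack_km = dvd_equivalent(700)
print(f"{n_dvds / 1e6:.0f} million DVDs, stack of about {stack_km:.0f} km")
# prints "149 million DVDs, stack of about 179 km"
```

Under those assumptions the result matches the quoted figures of about 149 million DVDs and a stack nearly 180 kilometres high.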
The telescopes are part of the Square Kilometre Array Observatory (SKAO), which will include more than 100,000 Christmas-tree-like wire antennas in Australia and some 200 dishes in South Africa when it is completed in 2029. These telescopes will pick up radio signals from celestial objects, and their developers hope that they will shed light on some of astronomy’s long-standing questions, such as what dark matter is and how galaxies form.
But 700 petabytes is only about 1% of the data that the array could generate. Shari Breen, head of science operations at the SKAO in Jodrell Bank, UK, estimates that it could produce some 60 exabytes — 60,000 petabytes — each year if researchers used all of its systems continuously and retained all of the data.
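As a quick check of the "about 1%" figure, the two quoted volumes can be compared directly. This is a minimal sketch using only the numbers in the text; the variable and function names are illustrative.

```python
PB_PER_EB = 1000  # decimal units: 1 exabyte = 1,000 petabytes

planned_pb_per_year = 700    # expected yearly output quoted in the article
potential_eb_per_year = 60   # Breen's estimate of the maximum yearly output


def retained_fraction(kept_pb, potential_eb):
    """Fraction of the potential data volume that is actually produced."""
    return kept_pb / (potential_eb * PB_PER_EB)


fraction = retained_fraction(planned_pb_per_year, potential_eb_per_year)
print(f"{fraction:.1%}")  # prints "1.2%"
```

The ratio is 700 / 60,000, or roughly 1.2%, consistent with the article's "about 1%".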
“The amount of money that it would take to hold our rawest forms of data is insane — I don’t even know where we would fit that many computers,” says Breen. “So, we have to make some compromises.”
Disciplines such as astronomy and the Earth and biological sciences have long grappled with unwieldy data sets. As the volume, processing speed and variety of data continue to grow, storage capacity is struggling to keep pace. At the same time, the boom in machine-learning and artificial-intelligence technologies is creating an incentive to hoard information. But unconstrained data retention is not financially viable and uses a great deal of energy.
“This is a problem that libraries have been dealing with for as long as libraries have existed,” says Kristin Briney, a librarian at the California Institute of Technology (Caltech) in Pasadena. “We cannot physically collect all the books that we want to collect, and in 50 years, the book may not be useful any more.”
Data sets, she says, are the same. “There has to be some curation that determines what is worth keeping and what is worth throwing away.”
### Field-specific rules
There is no one-size-fits-all rulebook for data curation, and best practice often depends on the discipline and on the scale of a project.
The SKAO, for instance, will store the products that it makes according to what the scientists ask for in advance, says Breen. The products can range from raw data to highly processed images. So if an astronomer requests an image based on interferometry data, then the underlying data set will be discarded once the picture’s quality has been deemed sufficient, she says.
Breen, who is a principal investigator on a large astronomical survey, says that in the past, she would request raw data. “Now, I’m like, ‘No, please don’t!’,” she says. “The reality of these next-generation telescopes is that then you’ll spend all your time bogged down by enormous data sets rather than delivering the awesome science that was the whole point.” Instead, she typically asks for an interactive 3D array of pixels known as an image cube, which is easier to wrangle, she says.
Meteorologists, by contrast, still prefer to work with the raw data. The World Meteorological Organization (WMO) receives data from thousands of satellites, marine platforms, aerial surveys and ground-based stations around the world, which record parameters such as atmospheric pressure, wind speed, air temperature and humidity, often hourly.
“We have a principle in meteorology, which is that we have to archive all the original data in order to enable us to always produce any product we have ever produced out of the original data,” says WMO scientific officer Peer Hechler in Geneva, Switzerland. The meteorology community uses original data to create projections and models, but “it doesn’t make sense economically to store all these derivative data sets”, he says.
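The principle Hechler describes, keeping the originals and regenerating products on demand, can be illustrated with a toy example. This is only a sketch under stated assumptions: the archive layout, the station ID and the readings are all hypothetical, and the daily mean stands in for whatever derived product a real service would compute.

```python
from statistics import mean

# Hypothetical archive of original air-temperature readings (degrees C),
# keyed by station ID and date. All values are invented for illustration.
raw_archive = {
    ("STATION_A", "2026-03-01"): [4.0, 5.0, 7.5, 9.5, 8.0, 5.0],
}


def daily_mean(station, date):
    """Derive a product on demand from the archived original readings."""
    return mean(raw_archive[(station, date)])


# The derived value is recomputed when needed instead of being stored.
print(daily_mean("STATION_A", "2026-03-01"))  # prints 6.5
```

Because the derived value can always be rebuilt from `raw_archive`, only the original readings need long-term storage.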
Similarly, the Wellcome Sanger Institute, a genomics research organization in Hinxton, UK, keeps most of the raw data it generates, says sequencing informatics team leader David Jackson. Its DNA database already contains some 90 petabytes of data. As a result, Jackson says, the organization needs clear data-retention policies, and soon. “You get to the point where the data becomes more of a liability than an asset,” he says.
### What needs to be kept
Whatever the discipline, the first step in managing massive data sets is working out what needs to be kept and what can be thrown away. Although practices vary, librarians and data specialists say that there are some overarching principles.
Some data sets must be kept because they are irreplaceable or legal requisites. Others might have been used in a publication or for a government decision, and need to be stored so that future readers

---

## Expert Analysis

### Merits
N/A

### Areas for Consideration
- Unconstrained data retention is not financially viable and uses a great deal of energy. As Kristin Briney, a librarian at the California Institute of Technology (Caltech) in Pasadena, puts it: “There has to be some curation that determines what is worth keeping and what is worth throwing away.”
- There is no one-size-fits-all rulebook for data curation, and best practice often depends on the discipline and on the scale of a project.

### Implications
- Within the next decade, a pair of giant radio telescopes in South Africa and Australia will be able to generate about 700 petabytes of data each year, the equivalent of about 149 million DVDs, a stack nearly 180 kilometres high.
- The telescopes are part of the Square Kilometre Array Observatory (SKAO), which will include more than 100,000 Christmas-tree-like wire antennas in Australia and some 200 dishes in South Africa when it is completed in 2029.
- These telescopes will pick up radio signals from celestial objects, and their developers hope that they will shed light on some of astronomy’s long-standing questions, such as what dark matter is and how galaxies form.
- That 700 petabytes is only about 1% of the data that the array could generate.

### Expert Commentary
This article covers data, research and science topics, and raises areas of concern. Readability: Flesch-Kincaid grade 0.0. Word count: 1533.
