Big Everything: the Future of Astronomical Data


Big data is a hot topic these days: from science to security to advertising to the stock market, the techniques and technologies used to deal with big data now touch our everyday lives. Even the White House is thinking about Big Data. In astrophysics and cosmology, we deal with big everything: big datasets, big simulations and big collaborations! We already have information on billions of astronomical objects, and we expect to make measurements of many billions more in the next ten years. The problem with such a large dataset lies not in its size alone, but in its complexity. We want to use this data in many different ways, from locating a single rare star in a haystack of billions of near-identical stars, to understanding the spatial correlation of every single galaxy in the Universe. In addition, we often want to combine information from very different sources, for example X-ray data from a satellite telescope and optical data from a ground-based telescope. On Thursday morning at KIPAC@10 we discussed some of the big data issues unique to astronomy, and Debbie Bard talked to David Hogg (NYU) afterwards.



The field of particle physics has been dealing with Big Data for years, with the LHC producing tens of terabytes of data a night. The LHC uses a tiered network of computing sites to process all this data, with a million computing jobs completed every day! Such a highly distributed computing model requires a robust system, with automated job monitoring, bookkeeping and tracking, so that the job success rate stays high enough to keep up with the pace of the science. Big Data astronomy projects like LSST can learn from this experience: for example, how to set up reliable networks and a sophisticated file transfer system to deal with the distributed data load. We can also learn that the biggest cost in all this is the investment in people with the expertise to keep this kind of system running.

In cosmology, we try to make measurements of the structure and evolution of the Universe. The Universe is a big place, and to make theoretical predictions of what the Universe looks like we need big simulations! The science required for these simulations is less of an issue than the technology required to run them. We can’t simulate every possible cosmological model, so computer programs are being developed to take what we can simulate and extend the information to other, similar cosmological models. If we want to understand the structure and evolution of the Universe, we also need to understand how the galaxies we see can reveal the presence of dark matter. Currently, the science of galaxy formation is a big bottleneck.
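The idea of extending a handful of simulations to nearby cosmological models can be illustrated with a toy sketch. This is only an assumption about the general approach, not the method used by any particular project: the function name `emulate`, the single parameter, and the numbers are all made-up placeholders, and real tools interpolate over many parameters with far more sophisticated statistics than a straight line.

```python
from bisect import bisect_left


def emulate(param, sim_params, sim_outputs):
    """Estimate a simulated quantity at a parameter value we never simulated.

    sim_params  : sorted parameter values at which simulations were run
    sim_outputs : the quantity each of those simulations produced
    Uses plain linear interpolation between the two nearest simulations.
    """
    if not sim_params[0] <= param <= sim_params[-1]:
        raise ValueError("parameter outside the simulated range")
    i = bisect_left(sim_params, param)
    if sim_params[i] == param:
        return sim_outputs[i]  # we actually ran this model
    x0, x1 = sim_params[i - 1], sim_params[i]
    y0, y1 = sim_outputs[i - 1], sim_outputs[i]
    t = (param - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)


# Hypothetical placeholder values, NOT real simulation results.
sim_params = [0.25, 0.30, 0.35]
sim_outputs = [0.70, 0.80, 0.90]

estimate = emulate(0.275, sim_params, sim_outputs)
```

The sketch refuses to extrapolate beyond the simulated range, which reflects the "similar models only" caveat above: the estimate is only trustworthy close to models that were actually simulated.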

In the past, more data meant employing more astronomers. Today, it means we have to turn to machines. A particular problem exists for transient astronomical events like supernova explosions, which may only last a couple of weeks. We want to be able to follow up any sighting on the same timescale, but it can take months for an astronomer to process the data. The dream is to have a fully automated analysis pipeline. Machine learning algorithms are already identifying and classifying objects with great success, and will be put to the test with datasets the size of the ones LSST will provide.

Of course, there are some things that machines are not so good at. The human visual cortex is a very powerful scientific tool! Some tasks, for example classifying galaxies or identifying strong lens systems, can be done better by human brains than by computer algorithms. This is useful for scientific progress, as well as a reason to reach out to the non-scientific community. We just ask them for help! For example, we can use crowd-sourced classifications as training data for machine-learning algorithms. Perhaps more importantly, professional astronomers can leverage the intelligence of the interested public to create a community of lay-astronomers. If you give them the tools, they can make the discoveries themselves!
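As a rough illustration of how crowd-sourced classifications could become training data, here is a minimal sketch that keeps only the objects on which volunteers clearly agree. The function name `aggregate_votes`, the thresholds, and the example votes are all hypothetical; real citizen-science projects may weight and combine votes quite differently.

```python
from collections import Counter


def aggregate_votes(votes, min_votes=3, min_agreement=0.7):
    """Turn volunteer votes into consensus labels for training.

    votes: dict mapping an object id to the list of labels volunteers gave it.
    An object is kept only if it has at least `min_votes` votes and the most
    common label accounts for at least `min_agreement` of them.
    """
    labels = {}
    for obj_id, obj_votes in votes.items():
        if len(obj_votes) < min_votes:
            continue  # too few volunteers looked at this object
        label, count = Counter(obj_votes).most_common(1)[0]
        if count / len(obj_votes) >= min_agreement:
            labels[obj_id] = label
    return labels


# Made-up example votes, for illustration only.
votes = {
    "galaxy_001": ["spiral", "spiral", "spiral", "elliptical"],
    "galaxy_002": ["elliptical", "spiral", "elliptical", "spiral"],
    "galaxy_003": ["elliptical", "elliptical"],
}

training_labels = aggregate_votes(votes)
print(training_labels)  # {'galaxy_001': 'spiral'}
```

Only the first object survives: the second splits the volunteers evenly, and the third has too few votes. Dropping ambiguous objects trades a smaller training set for cleaner labels, which is one reasonable choice among several.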


You can watch all the talks in this session on the KIPAC YouTube channel.

You can also read more about KIPAC@10 on the conference blog home page.