Just as paleontologists use index fossils to date rock layers and assess ancient environments, astronomers look for specific patterns of light from space to mark periods of the universe’s history. For example, early galaxies give off a type of ultraviolet light called Lyman-alpha, or Lyα, emission, which is produced when electrons in hydrogen atoms fall from their second-lowest to their lowest energy state.
For decades, astronomers have associated Lyα emission with a time period within the first billion years after the Big Bang, called the epoch of reionization, when the average rate of star formation in galaxies was much higher than today. When they find a galaxy that strongly emits Lyα light, they classify it as a Lyα emitter, or LAE, and can be confident it dates back to the epoch of reionization. Observing LAEs tells astronomers more about the history of the Milky Way and other galaxies like ours, including whether they were likely LAEs at some point in their early history.
However, researchers face confounding factors when looking for LAEs. The expansion of the universe stretches light toward longer, redder wavelengths in a process called cosmological redshift. More significantly, dust both within and between galaxies obscures Lyα light. Astronomers can analyze the full spectrum of light from a galaxy to find evidence of Lyα emission, but it would be much faster to have a tool that predicts whether a galaxy is likely an LAE from more readily available measurements.
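To make the effect of redshift concrete, here is a minimal Python sketch of the standard relation between emitted and observed wavelength, observed = rest × (1 + z). The rest wavelength of Lyman-alpha light (about 121.567 nanometres) is a textbook value; the function names and the sample redshifts are illustrative choices and do not come from the study.

```python
LYA_REST_NM = 121.567  # rest-frame Lyman-alpha wavelength in nanometres


def observed_wavelength(rest_nm: float, z: float) -> float:
    """Wavelength measured today for light emitted at redshift z."""
    if z <= -1.0:
        raise ValueError("redshift must be greater than -1")
    return rest_nm * (1.0 + z)


def redshift_from_observed(observed_nm: float, rest_nm: float = LYA_REST_NM) -> float:
    """Invert the relation: infer redshift from a measured wavelength."""
    return observed_nm / rest_nm - 1.0


# Sample redshifts chosen for illustration only
for z in (0.0, 3.0, 6.0):
    print(f"z = {z}: Lyman-alpha observed at {observed_wavelength(LYA_REST_NM, z):.1f} nm")
```

At high redshift the line moves out of the ultraviolet and into red and infrared wavelengths, which is part of why searching for it directly takes a detailed spectrum.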
One team of astronomers developed a model for just this problem using a machine learning technique known as a neural network. This technology loosely mimics how neurons in the brain work, with several interconnected layers receiving and sending signals based on an initial input and yielding a final output. The trick is that the programmers specify only what kind of input goes in and what kind of output should come out. The algorithm itself must figure out how best to set up the connections in the middle, what to look for, and how to rank the importance of each input.
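The layered structure described above can be sketched in a few lines of plain Python. This is a toy illustration, not the team’s model: the layer sizes, the activation functions, the random starting weights, and the sample input are all assumptions made for the example. Training, which is not shown, is the step that would adjust the weights until the outputs match known answers.

```python
import math
import random


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in: int, n_out: int, rng: random.Random):
    """Random starting weights and zero biases; training would adjust both."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron takes a weighted sum of all inputs."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features, layers):
    """Pass the input through every layer in turn and return the single output."""
    signal = features
    for (weights, biases), activation in layers:
        signal = dense(signal, weights, biases, activation)
    return signal[0]


rng = random.Random(0)
# Assumed shape: 6 inputs -> 4 hidden neurons -> 1 output between 0 and 1
layers = [(init_layer(6, 4, rng), relu), (init_layer(4, 1, rng), sigmoid)]
example = [0.2, -1.0, 0.5, 0.1, 0.0, 0.7]  # made-up, already-scaled input values
p = forward(example, layers)
print(f"untrained output: {p:.3f}")
```

Because the weights here are random, the printed number carries no meaning; it only shows how an input flows through the hidden layer to a single output.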
The team started with data from 2 surveys of light sources in space: 926 galaxies from VANDELS, only 520 of which were LAEs, and 507 from MUSE, all of which were LAEs. They used 80% of this data to train the algorithm by explicitly telling it which sources were actual LAEs and which ones weren’t. They saved the remaining 20% of the data for testing.
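As a rough sketch of that 80/20 split, the Python below builds a stand-in catalog using only the counts quoted above and divides it at random. The article does not say how the team drew the split, so the random shuffle, the seed, and the function name are assumptions.

```python
import random


def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records, then cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder entries (survey name, is_LAE) built only from the quoted counts
vandels = [("VANDELS", True)] * 520 + [("VANDELS", False)] * (926 - 520)
muse = [("MUSE", True)] * 507
catalog = vandels + muse

train, test = train_test_split(catalog)
print(len(catalog), len(train), len(test))
```

With 1,433 galaxies in total, this leaves 1,146 for training and 287 for testing. Note that 1,027 of the 1,433 are LAEs, so the combined sample is weighted toward LAEs.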
Through this initial test, the team identified 6 parameters for their neural network to focus on when evaluating galaxies for LAE potential. These parameters were each galaxy’s rate of star formation, the total mass of its stars, the brightness of its ultraviolet light, patterns in its ultraviolet emissions, its age, and its dustiness. They programmed the network to output an estimate of the probability that a given galaxy was an LAE, and they considered anything above 70% to mean that the algorithm classified it as an LAE.
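The last step, turning a probability into a yes-or-no label, is simple enough to show directly. In this sketch the 70% cutoff comes from the article, while the function name and the three sample scores are hypothetical. Treating exactly 70% as “not an LAE” is also an assumption, based on the phrase “anything above 70%.”

```python
LAE_THRESHOLD = 0.70  # cutoff quoted in the article


def classify_as_lae(probability: float, threshold: float = LAE_THRESHOLD) -> bool:
    """Label a galaxy an LAE only if the network's probability exceeds the cutoff."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return probability > threshold


# Hypothetical network outputs for three made-up galaxies
scores = {"galaxy_a": 0.93, "galaxy_b": 0.70, "galaxy_c": 0.41}
labels = {name: classify_as_lae(p) for name, p in scores.items()}
print(labels)
```

Raising the cutoff would make the network more cautious, trading missed LAEs for fewer false alarms; lowering it would do the reverse.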
Once they had used the training data to create the neural network, the team went through several additional rounds of testing. With the initial test data, they found that their network correctly identified LAEs 77% of the time with only a 14% chance of false positives. When they looked into what their network prioritized to make these predictions, they found that the most important factors were the pattern of a galaxy’s UV emissions, the brightness of its ultraviolet light, and the mass of its stars.
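For readers who want to see how such percentages are computed, here is a small sketch using the usual definitions: the true positive rate is the share of real LAEs the network catches, and the false positive rate is the share of non-LAEs it wrongly flags. The article does not spell out which definitions the team used, so reading the 77% and 14% figures this way is an assumption, and the eight galaxies below are made up.

```python
def detection_rates(truth, predicted):
    """Return (true positive rate, false positive rate) from paired labels."""
    tp = sum(t and p for t, p in zip(truth, predicted))          # LAEs caught
    fn = sum(t and not p for t, p in zip(truth, predicted))      # LAEs missed
    fp = sum((not t) and p for t, p in zip(truth, predicted))    # false alarms
    tn = sum((not t) and (not p) for t, p in zip(truth, predicted))
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one LAE and one non-LAE")
    return tp / (tp + fn), fp / (fp + tn)


# Toy test set: four real LAEs followed by four non-LAEs
truth = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]

tpr, fpr = detection_rates(truth, predicted)
print(f"true positive rate: {tpr:.0%}, false positive rate: {fpr:.0%}")
```

In this toy case the network catches three of four LAEs and wrongly flags one of four non-LAEs.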
Following this initial success, the team applied their network to another survey, COSMOS2020, and its subset of LAEs, SC4K, which had less detail than the training data surveys. From these datasets, the team’s neural network identified true LAEs 72% of the time.
The team’s final result came when they applied their neural network to data from NASA’s James Webb Space Telescope (JWST). Since the ultimate goal of their model is to study the distant past of the universe, and JWST aims to look farther and at fainter sources than ever before, a successful test on LAEs already confirmed with JWST would be a good sign for the model’s future use. They found a 91% true positive rate on JWST data, demonstrating the validity of their approach and illuminating a path toward finding out more about the history of the universe.