If you’ve ever had to prove to a website that you aren’t a robot by looking at a jumble of distorted letters and guessing what word they spell, then you’ve used a Completely Automated Public Turing test to tell Computers and Humans Apart, or CAPTCHA. It’s a security feature, and software companies also use data on how humans solve these tests to train AI algorithms to read words in conditions other than flat text on a computer screen. Can a similar approach benefit the natural sciences?
One team of astronomers wanted to see if they could get volunteers and AI to identify galaxies. They obtained data from an ongoing study of how fast the universe is expanding, called the Hobby-Eberly Telescope Dark Energy Experiment, or HETDEX. The scientists working on HETDEX are attempting to create a 3D map of distant galaxies, 9 to 11 billion light-years away, and measure their speeds to understand the universe’s early history, 6 to 10 billion years ago. The problem is that they need to sort through billions of astronomical measurements, called spectra, and trillions of individual pixels, each potentially containing useful information.
To work through this massive pool of data, the astronomers developed a program to train interested volunteers from the public to classify HETDEX data without relying on unfamiliar jargon. The program gave the volunteers a cleaned-up image that looked similar to TV static, and the volunteers had two options: keep it if they thought the image contained a galaxy, or throw it back into the original image collection if it was unclear. The astronomers named this program Dark Energy Explorers, and with the help of 17,000 volunteers worldwide, it has classified almost 200,000 galaxy candidates.
Next, the scientists wanted to use AI algorithms to find trends, predict results, and identify new uses for the resulting data. This process is known as machine learning. To prepare the data for machine learning, the scientists assigned numbers to the volunteers’ qualitative classifications. The team had at least 10 volunteers assess each image individually. If a volunteer chose to keep the galaxy, the scientists assigned that answer a 1. If a volunteer decided to throw back the galaxy, the scientists assigned that answer a 0. An image’s final score was the average of all its answers, a number between 0 and 1. Then, the researchers could feed the images and their scores into an algorithm called t-distributed stochastic neighbor embedding, or t-SNE.
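The scoring step described above is simple arithmetic, and a short sketch may make it concrete. The Python below is only an illustration under stated assumptions: the function name `image_score`, the constant `MIN_VOTES`, and the example votes are all hypothetical and are not taken from the Dark Energy Explorers project or its data.

```python
MIN_VOTES = 10  # the article says at least 10 volunteers assess each image


def image_score(votes):
    """Average a list of keep/throw-back votes into a score between 0 and 1.

    Each vote is 1 (keep: the volunteer thought they saw a galaxy)
    or 0 (throw back). The function name and checks are illustrative.
    """
    if len(votes) < MIN_VOTES:
        raise ValueError(f"need at least {MIN_VOTES} votes, got {len(votes)}")
    if any(v not in (0, 1) for v in votes):
        raise ValueError("each vote must be 0 or 1")
    return sum(votes) / len(votes)


# Made-up votes for two hypothetical images (not real HETDEX data).
likely_galaxy = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # 8 of 10 volunteers chose keep
likely_noise = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # 1 of 10 volunteers chose keep

print(image_score(likely_galaxy))
print(image_score(likely_noise))
```

An image that most volunteers kept ends up with a score near 1, and an image that most threw back ends up near 0, which is what lets a single number stand in for a group's qualitative judgment.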
Once they had trained the AI on data sorted by the volunteers, the astronomers set it to work on new images, assigning each one a score. The scientists expected to receive a lot of random static in their images because of the collection method. Hence, their priority was to train the AI to find and throw out potential galaxies with low scores. They found that their AI, the t-SNE algorithm, was 92% accurate in characterizing images with scores of 0.3 or lower. The AI’s decisions matched human averages with a 98% agreement for images with scores of 0.1 or lower. They also found that the AI could effectively identify artifacts, which are consistent errors in the data caused by flaws in the telescope or cleaning software.
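To show what an agreement figure at a score cutoff could mean, here is a minimal Python sketch. The article does not say how the team computed its percentages, so the definition used here is an assumption: among images the volunteers scored at or below a threshold, the fraction that the AI also scored at or below that threshold. The function name `agreement_below` and all scores are hypothetical.

```python
def agreement_below(threshold, human_scores, model_scores):
    """Fraction of low-scoring images on which the model agrees with volunteers.

    Looks only at images whose volunteer score is <= threshold and counts
    how many of them the model also scored <= threshold. Returns None if
    no image has a volunteer score at or below the threshold.
    This is one assumed definition, not necessarily the team's.
    """
    if len(human_scores) != len(model_scores):
        raise ValueError("score lists must have the same length")
    low = [(h, m) for h, m in zip(human_scores, model_scores) if h <= threshold]
    if not low:
        return None
    matches = sum(1 for _, m in low if m <= threshold)
    return matches / len(low)


# Made-up scores for six hypothetical images (not real HETDEX data).
human = [0.0, 0.1, 0.2, 0.3, 0.8, 0.9]
model = [0.05, 0.1, 0.4, 0.25, 0.7, 0.95]

print(agreement_below(0.3, human, model))
print(agreement_below(0.1, human, model))
```

In this toy data, the model disagrees on one of the four images that volunteers scored 0.3 or lower, and agrees on both images scored 0.1 or lower, mirroring the pattern in the article where agreement is highest for the lowest scores.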
While the scientists have so far applied the t-SNE algorithm to only 1.2 million images, their eventual goal is to use their AI on the entire HETDEX data set, which is 1,000 times larger. They suggested their AI proved helpful in throwing out bad data, since it flagged 5% of these images to be removed from any further analysis. However, they acknowledged a possible systematic error in the AI’s sorting: it throws out more sources the closer they are to Earth. This bias risks higher data contamination as they probe farther into the universe, but they maintained that the contamination is low overall.
The researchers also acknowledged that experts can misclassify sources, so the final check on their classifications isn’t flawless. However, collaborative research approaches between teams of scientists, trained volunteers, and AIs may provide, in their words, “a more robust result.” They concluded that collaborative projects like these would increase the amount of information astronomers could process and build a bigger science community.