In recent years, machine learning has risen in popularity as an exciting frontier in data science. This family of statistical techniques can yield insight in a wide range of applications, from suggesting the perfect song on Spotify to predicting the weather.

The newest application? Classifying plankton fossils. Recently, a Yale-led team created a first-of-its-kind machine learning model that can identify the species pictured in almost 7,000 plankton fossil images.

The model works extremely well, outperforming the average human expert.

“Our best-performing model gets the answer right 87.4 percent of the time, which is better than our average human expert accuracy,” said Allison Hsiang, a postdoctoral researcher at the GeoBio-Center at Ludwig Maximilian University of Munich and the lead author of the study.

Besides improving accuracy, the model saves experts time.

When working with thousands of microscopic fossils, “you just can’t have a human sit there and [classify images]. You have to find a way to automate that, which is where machine learning comes in,” said Nelson Rios, head of biodiversity informatics and data science at Yale’s Peabody Museum of Natural History.

Rios assisted the research team in creating an online database called Endless Forams, which makes their results publicly available.

These findings are useful to the biology community at large.

“The fossils at the focus of this study are used to do things like investigate the temperature of the ocean in the past, or the atmospheric CO2 level, so getting the identities right is actually important,” said Pincelli Hull, assistant professor of geology and geophysics and a leader of the international group that authored the study.

Hull added that correctly identifying plankton species is difficult, so the research design called on several expert taxonomists, whose consensus served as the human validation for the computer model’s guesses.
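For readers curious how a consensus among experts can be turned into a score for a model, here is a minimal sketch in Python. It is not the study's actual pipeline: the strict-majority rule, the function names, the image IDs and the placeholder species labels are all assumptions made for illustration.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.5):
    """Return the label chosen by a strict majority of experts, or None.

    Assumption for illustration: a label counts as "consensus" only when
    more than `min_agreement` of the experts picked it.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


def accuracy_against_consensus(predictions, expert_votes):
    """Fraction of consensus-labeled images where the model agrees with the experts."""
    scored = 0
    correct = 0
    for image_id, votes in expert_votes.items():
        truth = consensus_label(votes)
        if truth is None:
            continue  # no expert consensus, so this image is left out of the score
        scored += 1
        if predictions.get(image_id) == truth:
            correct += 1
    return correct / scored if scored else float("nan")


# Hypothetical votes from four experts per image (placeholder labels, not real data).
expert_votes = {
    "img_001": ["species_a", "species_a", "species_b", "species_a"],
    "img_002": ["species_b", "species_b", "species_b", "species_b"],
    "img_003": ["species_a", "species_b", "species_c", "species_a"],  # no strict majority
}

# Hypothetical model guesses for the same images.
model_predictions = {
    "img_001": "species_a",
    "img_002": "species_c",
    "img_003": "species_a",
}

accuracy = accuracy_against_consensus(model_predictions, expert_votes)
print(f"Accuracy on consensus-labeled images: {accuracy:.2f}")
```

In this toy example the third image is dropped because the experts split evenly, which reflects why relying on several taxonomists rather than one matters: a single opinion could have been recorded as the "right" answer even when the experts disagree.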

Hsiang said that the model can be used to quickly build big datasets containing species information, which will allow more efficient study of “large-scale patterns and changes in foram biodiversity and communities through time and space.”

Rios also said that understanding the plankton helps researchers model past ocean conditions.

The researchers also said they hope their results will be used to train future experts in the field. Hsiang added that the Endless Forams database is an important resource that can be used to train students to identify species.

“We’ve been losing expertise, not gaining it. It’s important that this data is out there for people to get excited about it,” said Rios.

 

Jessica Pevner | jessica.pevner@yale.edu