YData spring-semester elective allows student sabermetricians to analyze baseball
Professor Ethan Meyers is teaching his S&DS 173 course, “YData: Analysis of Baseball Data,” for a second straight spring.
Zully Arias, Production and Design Editor
As Major League Baseball teams begin their spring training games this Sunday, Yale students are able to delve into the sport and its data through Professor Ethan Meyers’ S&DS 173 course: “YData: Analysis of Baseball Data.”
The connector class is an elective that can be taken concurrently with or after the main course, “YData: An Introduction to Data Science.” S&DS 173 allows students to use Python programming skills and knowledge of statistical concepts to analyze baseball-related data and sabermetrics, the empirical analysis of baseball that has become crucial to the sport’s 21st-century front office.
“At a fundamental level, I view this class as a data science course … which allows students to start with large datasets and narrow them down to get particular insights,” Meyers said. “Baseball is hopefully an exciting example to learn the process. I’ve always grown up enjoying [the sport], which is full of suspense, and is so rich in terms of data and strategy.”
Meyers’ passion for baseball stems from early childhood. In an interview with the News, he recalled what he called a life-changing experience: his first live baseball game when he was seven years old, a July 16, 1988, matchup between the Kansas City Royals and the Boston Red Sox. At the game, after five scoreless innings, the Red Sox overcame a six-run deficit with a daring four-run sixth inning, two runs in the eighth and a walk-off home run in the ninth.
“I became a Red Sox fan [right after that game], and there was obviously a lot of rivalry since all my friends and classmates were Yankees fans,” Meyers, who is from upstate New York, said. “That only made me more adamant about liking my team.”
Meyers has been a visiting assistant professor at Yale for two years now and also serves as an assistant professor of statistics at Hampshire College. He is also a research affiliate at the Massachusetts Institute of Technology’s Center for Brains, Minds and Machines. Meyers said some of his main interests in data science are centered around the use of computational tools to analyze high-dimensional neural recordings and how information is coded in neural activity.
While at Hampshire, Meyers taught a class about introductory statistics through baseball. When he came to Yale last year and was asked to teach a YData course connector, he was inspired to continue combining his passions for data science and baseball.
Meyers’ personal experience with the sport has influenced his approach to teaching about its intersection with data science. Students in the class will even analyze data from Meyers’ first live game — the dramatic Red Sox win in 1988 — as part of this week’s course homework by examining data that plots the probability Boston would win the matchup.
Students use two main datasets in the course. The first is the Lahman dataset, which has season-level statistics that go all the way back to 1871. The second is the Retrosheet dataset, which has specific pitch-by-pitch and play-by-play information beginning with games from the early 20th century. Datasets are also supplemented by texts such as “Astroball: The New Way to Win It All” by Ben Reiter ’02, who visited Yale in 2018 to discuss the book.
With this information, the course delves into a variety of sabermetric forms of analysis. Meyers said students will use regression to find the optimal statistic for expected runs. They will also look at run expectancy based on the 24 base-out states, or RE24, which measures how many runs a team can expect to score in a given game situation. Using that data, students can compare players’ performances across different decades. Examples include case studies on Hall of Famers Ted Williams and Wade Boggs, both of whom played for the Red Sox.
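To make the idea of run expectancy concrete, here is a minimal Python sketch. It is not taken from the course materials. It assumes each play has already been reduced to a hypothetical record of the runners on base, the number of outs and the runs the batting team scored from that point to the end of the half-inning. The records below are made up for illustration and are not real Retrosheet data.

```python
from collections import defaultdict
from itertools import product

# The 24 base-out states: 8 runner configurations times 0, 1 or 2 outs.
# "100" means a runner on first only; "111" means the bases are loaded.
ALL_STATES = [("".join(bases), outs)
              for bases in product("01", repeat=3)
              for outs in range(3)]


def run_expectancy(plays):
    """Average runs scored from each base-out state to the end of the half-inning.

    `plays` is an iterable of (bases, outs, runs_rest_of_inning) tuples.
    Only states that appear in `plays` are included in the result.
    """
    totals = defaultdict(lambda: [0, 0])  # state -> [sum of runs, count]
    for bases, outs, runs_rest in plays:
        entry = totals[(bases, outs)]
        entry[0] += runs_rest
        entry[1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


# Made-up toy records (assumption), not real play-by-play data.
toy_plays = [
    ("000", 0, 0),
    ("000", 0, 1),
    ("000", 0, 2),
    ("100", 1, 0),
    ("100", 1, 1),
    ("111", 2, 3),
]

table = run_expectancy(toy_plays)
for (bases, outs), value in sorted(table.items()):
    print(f"bases={bases} outs={outs} expected runs={value:.2f}")
```

With a real play-by-play dataset, the same averaging would fill in all 24 states. A single play can then be valued as the run expectancy after the play, minus the run expectancy before it, plus any runs that scored on the play.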
The class also explores data wrangling; probability and simulations with games; parametric hypothesis tests and confidence intervals; linear and multiple regression; transformations in regression; and cross-validation.
According to Meyers, the two main programming languages for data science are Python and R. The main “YData: An Introduction to Data Science” class also uses Python, and Meyers said the language is the industry standard when exploring sabermetrics.
Max Krupnick ’23 grew up in St. Louis, Missouri, and has been a longtime Cardinals fan. He told the News that baseball was a way to connect with his friends and loved ones back home, and that getting to learn about both baseball and statistics at the same time “is a huge win.”
“I took the YData main course last spring, but going online disrupted some of the coding progress I had been making,” said Krupnick, who is a history major. “This semester I wanted to make sure that I would actually be able to learn Python to the extent that I could use it in real life.”
This spring, with the class held on a virtual platform for the whole semester, Meyers, who also taught the course last spring, said he had to slightly restructure S&DS 173 to maintain its usual dynamism and high level of student engagement. For example, he is making greater use of Zoom breakout rooms.
Meyers mentioned that his past students have gone on to apply what they have learned in a variety of settings, including by improving their own performances as members of the Yale baseball team or by interning for official MLB teams like the Houston Astros.
“Though the class is small, it is already leading to opportunities for students,” Meyers said. “It’s so nice to see everyone so enthusiastic. Some [students] have even expressed their interest in going into baseball later. It’s always fun to see where students will end up.”
The course has asynchronous components, but also meets synchronously every Wednesday from 1:30 to 3:20 p.m.
Wei-Ting Shih | email@example.com