YaleStudents website generates student debate, reveals potential data governance gap in University policy
Vaibhav Sharma, Photo Editor
Editor’s note, Nov. 17, 8:28 a.m.: On Tuesday morning, after the story’s publication, the site’s creators introduced new data protection policies in a major walkback from their initial site.
“When YaleStudents was first released, all users were automatically opted in. We have since realized that this was an improper and inappropriate use of data — data should only be displayed with explicit consent,” a new banner reads. It appears as a pop up window when visitors enter the site.
All Yale student users are now opted out by default on the website. To be visible on the “Lookalikes,” “Neighbors” and “Maps” pages, students must specifically choose to opt in.
As of early Tuesday morning, just two dots appear on the site’s map — belonging to Gunderson and Yao — a far cry from the thousands that represented students scattered across the world just hours prior.
According to the founders, the security issue that left students’ GPS coordinates exposed in the plain text of the code has since been addressed. Sections from the site’s visualized statistics, which included average house price values by popular first name, state and major, have also vanished.
“Ultimately, we decided these changes were more responsible uses of Yale student data and better reflected the goals of our website,” wrote Gunderson and Yao in an email to the News.
Here’s the original story:
Facial recognition technology is used widely in smartphones, controversially in law enforcement surveillance and now, on a newly developed website, by Yale students on other Yale students.
Earlier this month, John Gunderson ’24 and Chris Yao ’24 trained computer programs on the profiles of around 6,000 Yale undergraduates. Their final product, a website called YaleStudents that scrapes data from the official Yale Facebook, the University directory and other publicly available sites, allows students to search for classmates who live near them and who, according to the site’s algorithms, “look like” them. The site also displays data visualizations that relate student names, birthdays, median house prices, majors and residential colleges.
The site’s release last Thursday has prompted conversations about how students should and should not use their classmates’ data — and what data ethics questions the University can and cannot regulate.
“People say [these uses] are not prohibited,” said Nathaniel Raymond, a lecturer at the Jackson Institute for Global Affairs who is teaching a course this semester on data ethics and governance. “That’s not the question. The question is whether they should be.”
After debuting via the anonymous student forum Librex, YaleStudents has quickly made the rounds among the student body. According to Gunderson and Yao, more than 1,700 unique users have visited the site, generating a total of 80,000 hits as of Sunday night. That’s an average of roughly 47 hits per user.
“We decided to put [the site] out there without commenting, so people can draw their own conclusions,” Gunderson said.
According to Gunderson, he and Yao created the site to invite community conversations about data privacy, campus diversity and facial recognition technology.
“The sole benefit is not the application but in the conversations we generate,” Gunderson said. “It’s more effective to start conversations about things if you actually show the thing instead of talking about it.”
Access and use: How the site works
“YaleStudents displays data that Yale makes public,” the website’s disclaimer reads — a rejoinder that has appeared in similar language on other student-created sites like Yalies.io.
The initial disclaimer read, “we are against Yale’s policy of displaying students’ data on the Yale Face Book without explicit consent.” One day after the website went live, generating immediate controversy, Gunderson and Yao removed that line.
“Directory information,” as defined by the Family Educational Rights and Privacy Act, includes information contained in a student’s records that “would not generally be considered harmful or an invasion of privacy if disclosed.” As a result, the University does not have to ask individual students for consent to share their directory information, although students can elect to opt out. Students can also remove themselves from the YaleStudents website directly.
Students who log in to the student directory with their net IDs already have access to their classmates’ basic information, which includes, among other details, students’ profile pictures, home addresses, residential colleges and majors. This explains how Gunderson and Yao were able to write the programs they did.
The partners said that the project has been in the works since early October, after Erik Boesen ’24 launched a student database combining data from the Yale Facebook and Yale Directory and shared his source code on GitHub.
Technically, they would have been able to build the site’s “Lookalikes” and “Neighbors” features without Boesen’s packaged application programming interface, or API, but directly pulling information from the Facebook and the directory would have been much more time-intensive, Yao said. Therefore, Gunderson and Yao incorporated code from Boesen’s project when making their own site.
The site is hidden behind a Yale Central Authentication Service pop-up, which means that only students who log in with their network credentials should have access to the site. Other student-created sites, including Yalies.io and the course comparison site CourseTable, also make their platforms accessible only to Yale users.
On Monday, students.yale.edu, the usual address for the official Yale Facebook website from which Gunderson and Yao pulled some of their data, was unavailable, with no indication that the site was undergoing a planned outage or maintenance.
Neither of the two Yale IT support staff members contacted by the News could identify why the site was down.
Recognize that? Algorithmically-generated ‘lookalikes’
If Gunderson and Yao’s goal in making YaleStudents was to create conversation, they succeeded.
For one, students criticized a prominent feature of YaleStudents that relies on facial recognition algorithms, which work by constellating vectors on a human face and returning similar images from a predetermined database, according to Himnish Hunma ’22, who studies computer science and mathematics. For Gunderson and Yao’s project, this database was readily available to holders of a valid Yale ID — they used the University’s Facebook, a browsable database of almost all Yale students.
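The “returning similar images” step described above can be illustrated with a minimal sketch. It assumes each photo has already been reduced to a numeric vector by some face model; the toy vectors, the student labels and the choice of cosine similarity are all illustrative assumptions, not the site’s actual code.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two equal-length vectors, from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_lookalikes(query_id, embeddings, top_k=3):
    """Rank every other entry in the database by similarity to the query."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical three-dimensional vectors; real face models use far more dimensions.
toy_embeddings = {
    "student_a": [1.0, 0.0, 0.0],
    "student_b": [0.9, 0.1, 0.0],
    "student_c": [0.0, 1.0, 0.0],
    "student_d": [0.0, 0.0, 1.0],
}

print(find_lookalikes("student_a", toy_embeddings, top_k=2))
```

A search of this kind always returns the closest entries in the database, however poor the match, so a “lookalike” is only as meaningful as the vectors behind it.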
The immediate reaction of Charlotte Wakefield ’23 after a friend shared the link was “horror” and “disgust.” Wakefield, who is a trans woman, said that the ID photo the website uses to compare her to ostensible lookalikes is from early in her transition.
“I don’t want my face plastered on a random website,” Wakefield said. “It feels very dysphoric and gives me anxiety.”
After reaching out to the creators herself, she said her profile has since been taken down from the site. But even so, Wakefield said that students who aren’t aware their information is viewable on the site are “vulnerable to doxxing without a choice.”
Data privacy risks are always more “impactful for people in marginalized groups,” she added.
Gunderson said that “ultimately, the information that is displayed on Lookalikes is from an algorithm that most people can see is pretty bad. Any result from the algorithm should not be taken seriously.”
Numerous studies, including a federal study published last year, have found that facial recognition algorithms misidentify non-white and non-male faces at far higher rates than white male faces.
“It’s well known that neural networks produce especially unsatisfactory results when the image being tested is darker,” Abraham Mensah ’22 wrote in an email to the News. “I consistently ended up with poor results when I tested the [YaleStudents] neural network with darker skinned people, especially African Americans.”
He added that Yale’s relatively small Black student body — 7.6 percent of Yale’s undergraduate body is Black, according to the University’s 2019-20 Office of Institutional Research dataset — gives the site’s neural network significantly fewer options to compare when searching for accurate lookalikes.
On Monday afternoon, the site’s creators added a disclaimer that facial recognition technology often uses biased datasets and expressed hopes that the page will prompt conversations about those issues.
Hunma said that he was interested in the site’s FaceNet recognition system — which was developed by Google researchers in 2015 — and its Generative Adversarial Network, the sophisticated deep learning framework used on YaleStudents to generate fake images of “Yale students.”
But Hunma also expressed reservations about the site’s purposes and potential uses.
“The technical and nerdy side of me was like, ‘This is cool,’ but the realistic side of me was like, ‘This is slightly worrisome,’” Hunma said.
Twelve of the most realistic GAN-generated fake images are currently presented on the website, along with a link to download other photos produced by the neural network.
Gunderson said that he wanted to try using GAN for “personal reasons” and thought that the resulting images were “cool enough to publish.” He did not indicate what those reasons were, nor did he suggest that the computer-generated profile pictures were intended to serve any particular function.
Elizabeth Cordova ’23 said that she did not find the site to be inherently problematic, citing the fact that it restricts its audience to Yale students, uses an available API and presents an opt-out option. She added that while the site’s “Stats” and “YaleGAN” sections seem to present harmless and interesting trends, the “Lookalikes” function could “cause more harm than good,” and the location-gathering tools could have both beneficial and malicious uses.
“The site seems harmless for now, but I would warn both the users and creators to use it responsibly and pay close attention to reviews, complaints and any misuse, as the ethical line is easily blurred,” Cordova said.
Valuing houses, evaluating harm
Lettered in Times New Roman, with a somewhat-hidden hyperlink leading to Rick Astley’s “Never Gonna Give You Up,” the YaleStudents site may appear to some as a flippant campus joke.
The humor doesn’t escape the creators themselves — Yao said he thought it was “funny” and “interesting,” for example, to visualize house price distributions by variables such as students’ first names.
The house price distribution program, one of several visualized statistics presented on the website, scrapes real estate information from Zillow for students’ anonymized home addresses.
According to the site page, the house price averages displayed exclude international students and Yale students living in apartments, and the first-name-specific real estate visualization excludes students with uncommon first names.
Yao added that it serves as an opportunity to “quantify and provide real, concrete evidence” about the socioeconomic composition of Yale’s student body.
“I thought the website was an impressive demonstration of machine learning techniques, and I believe Chris and John were well-intentioned,” Boesen wrote to the News. Still, he echoed concerns about “the inclusion of housing information and the sensitivity of face recognition algorithms.”
Boesen pointed out that students have long been able to look up house valuations on Zillow by searching for directory-listed addresses themselves. But he also encouraged students to remove their data from the official Yale Facebook site that lists these addresses and other information.
Wakefield argued that, even if individual students are not identified on the website, sections of the visualization, such as “Top 20 Median House Prices by First Name,” are still potentially harmful because they segment students into identifiable smaller groups.
Another feature, the site’s “Neighbors” map, identifies Yale students living close to each other, according to addresses compiled by the University’s directory, and charts their approximate distance from one another.
According to Yao — who said he didn’t know any classmates who live near his hometown in Iowa — the tool “enables Yalies to reach out to other Yalies in their area.”
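Charting the approximate distance between two map points is commonly done with the haversine formula. The sketch below is a generic illustration of that technique, not the site’s code; the marker labels, coordinates and 50-kilometer radius are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def neighbors_within(origin, others, radius_km):
    """Return (label, distance_km) pairs inside radius_km, nearest first."""
    lat0, lon0 = origin
    hits = []
    for label, (lat, lon) in others.items():
        distance = haversine_km(lat0, lon0, lat, lon)
        if distance <= radius_km:
            hits.append((label, round(distance, 1)))
    return sorted(hits, key=lambda pair: pair[1])

# Hypothetical points: one about a tenth of a degree north, one four degrees north.
origin = (41.0, -93.0)
others = {"marker_1": (41.1, -93.0), "marker_2": (45.0, -93.0)}

print(neighbors_within(origin, others, radius_km=50))
```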
Gunderson and Yao explained that a small degree of random variation is added to the GPS coordinate marker associated with each student as a privacy measure. The map itself restricts the user’s ability to zoom in and read specific street names, and it does not list full addresses.
Although addresses are obscured on the site’s user interface, the plain text of the map’s source code contains GPS coordinates for students’ addresses that can be connected to a name. A user who wants to can therefore find the specific coordinates of another student’s address.
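Random variation of this kind only protects an address if it is applied before the data reaches the browser. The following is a hedged sketch of one way that could be done, so that only offset coordinates, and no names, are serialized for the map. The field names, the 0.02-degree offset and the three-decimal rounding are assumptions made for illustration.

```python
import json
import random

def jitter(lat, lon, max_offset_deg=0.02, rng=random):
    """Shift a coordinate by a random amount of up to max_offset_deg per axis."""
    return (
        lat + rng.uniform(-max_offset_deg, max_offset_deg),
        lon + rng.uniform(-max_offset_deg, max_offset_deg),
    )

def public_markers(records, max_offset_deg=0.02, rng=random):
    """Build the payload sent to the browser: jittered, rounded, with no names."""
    markers = []
    for record in records:
        lat, lon = jitter(record["lat"], record["lon"], max_offset_deg, rng)
        # Only the offset coordinates are kept; the exact values never leave this function.
        markers.append({"lat": round(lat, 3), "lon": round(lon, 3)})
    return markers

# Hypothetical record; a fixed seed keeps the example repeatable.
records = [{"name": "Example Student", "lat": 41.3, "lon": -72.9}]
payload = json.dumps(public_markers(records, rng=random.Random(0)))
print(payload)
```

A fixed offset per student, rather than a fresh one on every page load, would be the safer design, since averaging many independently jittered readings would recover the original point.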
The site founders said that they are not planning to introduce any new features to the site besides responding to individual concerns. They expressed a willingness to take the site down if administrators intervene.
Governing ‘an extremely valuable asset’
For Raymond, the YaleStudents website bears an uncanny resemblance to FaceMash, Mark Zuckerberg’s short-lived predecessor to Facebook, which had permitted Harvard students to rate female students by attractiveness.
Zuckerberg created the site — which was, at that time, technically not prohibited by standing Harvard regulations — with APIs he had access to as a student.
“He saw the absence of governance as permission,” Raymond said.
Instead, Raymond proposed that students should think about data privacy not in terms of permissibility and legality, but rather in terms of harms and values.
“The corpus of Yale undergraduate student data is … an extremely valuable asset,” Raymond said. “If these gentlemen were just messing around, so be it. But the fact of the matter is that there are many actors and entities that want to use this data in many of the ways that the API did, including facial recognition, which may not be around messing around.”
Raymond added that, even setting aside its specific features, the site should generate campus-wide conversations on how the data assets of Yale students are brokered, manipulated and commodified.
“Are you allowed to sell food from the dining hall? Are you allowed to rent out [University] gym equipment?” Raymond asked. “These are assets of the community.”
A ‘broader reality’
Unless specifically designated as private, Yale directory information is currently classified by the University as “low-risk” under the 1604 Data Classification Policy. YaleStudents does not modify or remove original University data, nor does it seek to obtain unauthorized access to the Yale Directory, given that users must first enter their credentials into the CAS login page.
However, Raymond added that the “mosaic effect” created when Yale’s low-risk student directory information is mixed with external data, like house valuations, makes it difficult to tell how the site would be classified and thus regulated by the University’s information access and technology policies.
What is more concerning to Raymond is not simply that students created a potentially sensitive site with data made available to them by the University, but that the University has no playbook to adjudicate similarly murky uses in the future.
Several University administrators — including Chief Information Officer John Barden and Chief Information Security Officer Paul Rivers — did not respond to a request for comment.
Neither did Gregory Bok, Pamela Chambers, Harold Rose and Arabella Yip, four attorneys in the University’s Office of the General Counsel who specialize in digital media and information technology law.
University spokesperson Karen Peart said she is not familiar with the site. The University’s chief privacy officer, Susan Bouregy, said that she was also not aware of this site before the News requested a comment and would have to do further research.
“This is an example of the broader reality that we live in now,” said Kate Pundyk ’22, a former Science and Technology editor for the News. “The question is not about whether sophomores are fully equipped to answer all of the data ethics questions of the day, but rather if our administrators are bearing responsibility for the duty of care they hold to students.”
She added that this information-driven reality is one where imagination is needed both to encourage innovation and to field risks.
“We’re good at encouraging people to have that mindset to innovate,” she said. “What happens when we don’t know how to deal with the consequences or potential consequences?”
Mensah also told the News that he believes the University should give students more agency over what information the directory discloses.
“To put it simply, data is people, literally,” Raymond said. “Data is a way of exerting power over people. And so the question here is — how are Yale students going to take this incident from a moment of outrage du jour and make this the beginning of a structured conversation about how we want to govern the data assets produced by the Yale community in a way that is consigned with some sense of values?”
An official request to withhold student directory information must be filed with the University Registrar.
Emily Tian