Yale started the new month with another server crash.
Beginning at roughly 8 a.m. Saturday, Yale technology platforms, including EliApps, Yale.edu websites, Classes*v2 and Oracle E-Business applications, went down and were not fully restored until six hours later. The outage follows a similar incident last month, when students, faculty and staff could not access Yale websites and applications for over 24 hours. Although Yale Information Technology Services said the two power outages were independent incidents, students affected by the consecutive server failures remained skeptical.
“The two Data Center power outages that we have experienced recently are not related and happened at two different data centers that are 10 miles apart,” Yale Chief Information Officer Len Peters said. “They were both caused due to unexpected issues with mechanical [and] electrical infrastructure.”
Peters said the most recent crash was caused by a component of the electrical infrastructure. He added that an investigation found that one of the uninterruptible power supply (UPS) backups failed, causing 75 percent of the systems at a West Campus data center to lose power.
Peters said power at the center was restored by 1:15 p.m., once the failed UPS was bypassed. By 1:35 p.m. on Saturday, Yale.edu was back up. At 2:07 p.m., Classes*v2 was also restored.
“We needed to make sure that all of the underlying storage was up and available in order to bring up our services quickly,” Peters said. “We regret the inconvenience that this may have caused and are investigating the root cause for this failure.”
The two most recent incidents do not necessarily indicate a larger issue with Yale technology, said Robert Juchnicki ’15, an ITS student technician.
He said any technological service could go down for an hour or two, possibly because new code or a new feature is being implemented.
“If it is very recurring and starts to get in people’s way, it is a problem,” Juchnicki said. “A third or fourth time would be a serious issue, but these have been pretty isolated incidents.”
But Marcus Russi ’17 said the recent server crashes already indicate a larger issue for Yale ITS. He cited the lack of communication from ITS as a major problem for students. During the failure, ITS posted updates via an “obscure” Twitter account rather than on a “status page” hosted on a separate server, he said.
There is an entire field of computer science called reliability engineering, which is devoted to keeping data centers operational in the face of unpredictable equipment failures, Russi said. He added that since a modern data center should have multiple sources of backup power and detailed contingency plans, this semester’s outages have been unacceptable.
“A second failure of this sort … is not a stroke of bad luck, it’s just what happens if you never bother to test your backup plans,” Russi said. “The question then becomes one of blame: Do we hold ITS staff accountable for not following standard industry practices in reliability engineering, or do we blame whoever is tasked with supervising ITS for not recognizing that they are failing in their duties?”
Peters said his team has taken steps to improve the overall resiliency of its data centers, including a multi-year program started in 2012 to shorten the recovery time of servers, applications and data if a data center is lost. He cited other steps that ITS has taken to improve its servers, including contracting with an out-of-state facility to store and back up data, creating an “aggressive recovery timeline” for all core services, scheduling a full system “failover” test and reviewing the current resiliency setup.
He added that ITS will be specifically reviewing the recovery time objectives for Classes*v2.
“During the recent outages, due to the work accomplished in the last couple of years, we were able to achieve our recovery time for most of our core infrastructure services and applications ahead of our program schedule, but not all,” Peters said. “In its full operational state, we will be focusing on conducting regular disaster recovery planning exercises that will continue to test the resiliency of our infrastructure and keep us prepared in the event of real disaster such as the events experienced on Oct. 9 and Nov. 1.”
Although the outage occurred on a Saturday morning, students interviewed said it still had a large impact on their studies.
“I actually can’t do the majority of my homework at the moment,” Eve Romm ’18 said Saturday afternoon. “My readings are on Classes*v2 and I can’t get a book because the library catalogues are down.”
Still, she said ITS does a fairly good job providing reliable internet access throughout the year.
Katherine Garvey ’16, who is a chemistry tutor, said she tried to email her class to let them know she would be available to meet with students. However, she had no way to reach them since EliApps was not functional.
“It is sort of frustrating,” she said. “I am confused why it happened twice so recently — I feel we didn’t get a full explanation the first time.”
She added that since nearly everything for students is on Classes*v2, there was little work that she could do until access was restored.
Russi said he could not access the readings for his English class or watch any of the online materials for his French class, both of which had assignments due the following morning.
He added that the Zoo — the computer science department’s server — was down for the duration of the outage and has been broken since because the sudden loss of power corrupted data on its hard drives. He said it is the second time the computer science department has been inconvenienced by an ITS failure.
Another unplanned network outage lasted from 10:40 p.m. on Nov. 1 until 5:07 a.m. the following day.
“It is weird that it keeps happening,” Hanna Karimipour ’17 said. “I hope they fix it.”