Yale digitizes documents

Two recent federal grants will allow Yale to digitize rare primary sources of Middle Eastern history, making them accessible to researchers worldwide.

The Yale University Library has received a $650,000 four-year grant from the U.S. Department of Education to digitize Syrian and Palestinian government records, and a $240,000 joint grant from the National Endowment for the Humanities and the Joint Information Systems Committee to digitize Middle Eastern scholarly materials, according to a press release Thursday. The library will use advanced technology to translate the scanned images into searchable text, which will be available online.

“Both grants are ‘building block’ projects,” Associate University Librarian Ann Okerson said in an e-mail. “By doing them we will provide standards and infrastructure, linking and collaborative opportunities for other libraries.”

The government records will provide researchers worldwide with the first accounts of what happened from 1919 to 1948, a tumultuous time of political changes in the Middle East, said Simon Samoeil, Curator of the Yale Library’s Near East Collection. The records from British Mandate Palestine, irregularly published due to political instability, are on 40 microfilm reels and five supplementary print volumes. The Syrian collection, in printed format, is the only complete original copy in the U.S. and one of five known in the world. The digital copies will be available within four to five years in the Arabic and Middle Eastern Electronic Library (AMEEL) repository, a collaborative project between Yale and other libraries.

The library will share the second grant with the University of London’s School of Oriental and African Studies (SOAS). The two will collaborate to digitize manuscripts, manuscript catalogs and dictionaries, which researchers will be able to access for free online. The manuscripts, selected by and from Yale and SOAS holdings, will center on Arabic medical, scientific and philosophical works.

Yale will digitize the government records using optical character recognition (OCR) for Arabic text, a technology that translates scanned images of text into machine-searchable text.

Using OCR with Arabic may result in machine errors because the script expresses vowels through dots added to each of the 22 consonants, said Beatrice Gruendler, a professor of Arabic.

“Arabic writing is very homogenous, which makes it very sleek and beautiful, but it needs additional markers to remove ambiguity,” Gruendler said. She suggested that researchers compare the text with a published scholarly, or critical, edition of the work.

The rarer manuscripts are housed mainly in the Medical Historical Library and the Beinecke Rare Book and Manuscript Library. The manuscripts must be digitized on site, but items that are largely routine will be outsourced to a company under contract.

“This and all of the earlier projects we’ve done at Yale was to create an electronic resource for Arabic and Middle Eastern studies,” said Samoeil, who was part of the team that spearheaded current and past initiatives. “Which will make all the information and the texts available for researchers all over the world for free.”

The Yale Library collaborated with several other libraries across the world in its AMEEL project to create an electronic library about the Middle East. The Department of Education first funded AMEEL with a grant that ran from 2001 to 2004, followed by two more grants: one that began in 2005 and ended last month, and another that took effect on October 1 and will run until September 30, 2013.

The scanning machines that Kirtas donated from a book digitization deal with Microsoft may be used on the manuscripts if they are gentle enough, Okerson said.

In 2007, Yale contracted Kirtas to scan about 100,000 volumes unique to Yale in the multimillion-dollar Microsoft project, but when the software giant unexpectedly ceased funding last spring, the contract with Kirtas was terminated.

Correction: October 12, 2009

A previous version of this article contained several errors. The software translates scanned images into fully machine-searchable, not -editable, text. The rarer manuscripts are not in the Sterling Memorial Library; they are housed mainly in the Medical Historical Library and the Beinecke Rare Book and Manuscript Library. The grant to the Arabic and Middle Eastern Electronic Library beginning in 2005 and ending last month was the second grant from the Department of Education. A third grant took effect on Oct. 1, and will extend the project until Sept. 30, 2013. The scanning company, Kirtas, donated its machines to the library; these machines were only used to digitize manuscripts, not government records.
