Suganthi Balasubramanian, associate research scientist in Mark Gerstein’s lab in Yale’s Molecular Biophysics and Biochemistry Department, was the lead Yale author of a paper published in the Feb. 17 issue of Science as part of a collaborative effort called the 1000 Genomes Project. This project aims to sequence the genomes of a large number of people to build a comprehensive resource on genetic variation. The research deals with genetic changes called loss-of-function variants, which are predicted to seriously disrupt protein coding within the body. As part of an initiative led by Yale and the Wellcome Trust Sanger Institute, Balasubramanian and fellow researchers worked to experimentally filter a range of these variants to develop a high-quality catalogue of the variants that cause true loss of function.

Q What kind of data analysis did you do, and how did it fit into the larger project?

A Our main role at Yale was to annotate these genetic variations. We know [the variations] are in the genome, but we need some kind of identification, some signpost to see where each one is and what it means. So we looked to see if [the variations] are in a protein-coding region and, if so, how they change the amino acid sequence of the protein. This was essentially our role in the 1000 Genomes Project: to map and provide functional annotation of all the coding variants.

Q What light does your research shed on other genetics research today?

A Essentially, this project contributes several things. First, people generally assume that loss-of-function variants are rare and, when observed, very harmful because they lead to disease. People haven’t questioned why we see so many loss-of-function variants. Our careful analyses show that it’s very important to validate these variants. There are many ways to make erroneous variation calls, and you have to make sure you are really seeing what you think you see. The 1000 Genomes Project provided us with 3,000 loss-of-function variants, and after a long process of analysis with computational and experimental filters we came to only 1,285 true loss-of-function variants. There are lots of sequencing studies being done right now that look for genetic variation, but in order for them to be clinically relevant, the variants have to be carefully validated. Our study provides a high-quality catalogue of loss-of-function variants.

Q Could you describe the process of collecting the data and building the catalogue?

A After the work of many different groups, we received a huge file that told us where the different variants are in the human genome. We looked at these files for 185 different people of different ancestries, and we mapped the variants to protein-coding regions and annotated them. I’d look at the genome in a linear representation, and I wanted to know what site the variant was in: Was it in a coding gene? Was it in an intron? Was it in a non-coding region? This is what annotation is. I wanted to know where the variants landed, and I’m particularly interested in the ones that land in protein-coding genes, which constitute less than 2 percent of the genome. So to functionally annotate these variants, we first map them to a coding gene, then we figure out what each variant does to the protein and what changes in function it causes. This has been done before, but we’re doing it on a large scale and fast.
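For readers who want a concrete picture of the mapping step described above, the following is a minimal sketch in Python, not the project's actual pipeline. The gene model, coordinates, and the names `Gene` and `annotate` are all invented for illustration, and it makes the simplifying assumption that any position inside a gene but outside its coding exons is an intron (real annotation also handles untranslated regions, multiple transcripts, and overlapping genes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene model with 1-based, inclusive coordinates."""
    name: str
    chrom: str
    start: int
    end: int
    exons: tuple  # tuple of (start, end) coding exon intervals


def annotate(chrom, pos, genes):
    """Classify a variant position as coding, intron, or non-coding."""
    for gene in genes:
        if gene.chrom == chrom and gene.start <= pos <= gene.end:
            for exon_start, exon_end in gene.exons:
                if exon_start <= pos <= exon_end:
                    return ("coding", gene.name)
            # Inside the gene but outside every coding exon (simplification).
            return ("intron", gene.name)
    return ("non-coding", None)


# Toy data, invented for this example only.
genes = [Gene("GENE_A", "chr1", 100, 500, ((100, 200), (400, 500)))]
variants = [("chr1", 150), ("chr1", 300), ("chr1", 900)]

annotations = [annotate(chrom, pos, genes) for chrom, pos in variants]
for (chrom, pos), (region, gene_name) in zip(variants, annotations):
    print(chrom, pos, region, gene_name)
```

A linear scan over genes is fine for a toy example; at the scale of millions of variants per genome, an indexed structure such as an interval tree would be the more likely choice.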

Q What are the clinical implications of this catalogue?

A Essentially, we now have a candidate set of loss-of-function-containing genes, and this can now be used for target gene prioritization for diagnosing and treating diseases.

Q What are the project’s next steps?

A This work was based on only 185 samples. The next step is called Phase I of the 1000 Genomes Project, and Phase I has data on over 1,000 samples now. This work here was only the pilot phase. So we are going to get a much bigger catalogue. Our goal is to use the empirical rules we learned to build a more comprehensive loss-of-function catalogue from the 1000 Genomes data … We’ll also add on other filters. So far we’ve looked at variations one at a time, but now we’ll look at other variations in the same gene to see the overall effect on the gene. Of course we’re also very interested in building an experimental chip where you can basically go and look at these variations in thousands of people. We want to target some specific variations, and we want to develop a system where you can probe only these specific variations. This is our big hope, and we hope it’ll lead to some good clinical discoveries.

Q What would be the clinical benefit of only probing the specific variations?

A When you sequence someone, you typically have 3 to 4 million variations, and it’s impossible to know which ones might be interesting in terms of biological function or need a closer look. It’s like finding a needle in a haystack; we need some way to figure out which variants to look at. So with a small subset to probe, you have a more defined, targeted data set to research experimentally.

Clarification: Feb. 28, 2012

This article was updated to reflect the exact date of publication of an article in Science in which Balasubramanian was the lead Yale author.