Big Data and the Social Character of Genes

A new study at the University has used “big data” analytical methods to reveal the “social character” of genes – a phenomenon in certain diseases whereby genes operate jointly rather than independently. “The problem is that the possible number of combinations of different genes is enormous, and it is almost impossible to examine them all effectively and reliably,” the researchers explain. “Our study offers a solution to this problem.” The study was undertaken as part of a master’s thesis by Pavel Goldstein from the Department of Statistics and was led by Dr. Anat Reiner-Benaim of the same department, in cooperation with Professor Abraham B. Korol from the Department of Evolution and Environmental Biology. It proposes a new method for discovering complex and rare genetic effects that form part of the mechanisms underlying complex diseases, such as autoimmune diseases.

One of the most active fields of genetic research at present focuses on the connection between genetic markers – DNA segments situated along the genome that effectively represent genes – and the expression of different genes, that is, the creation of the proteins they encode. Various studies over recent years have shown that in complex biological mechanisms, such as those involved in most diseases, gene expression is not the product of the action of a single marker, but rather of a combination of several markers, some close to the location of the gene on the DNA chain and others more distant. In the Human Genome Project, for example, the researchers initially found that some 98 percent of the human genome does not code for proteins and appeared not to “do” anything. However, it later emerged that some of these regions are in fact active – not independently, but as part of a network of genes. Thus the influence of a given genetic marker may depend on the influence of other markers – a phenomenon known as epistasis.
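To make the idea of epistasis concrete, the following minimal sketch uses invented toy data (two markers coded 0/1 and four made-up expression values, none of which come from the study). It shows a case in which neither marker has any effect on its own, yet the effect of one marker reverses depending on the other.

```python
from statistics import mean

# Hypothetical toy data: two biallelic markers coded 0/1, with one balanced
# observation per genotype combination. The expression values are invented
# so that each marker's effect flips depending on the other marker.
expression = {
    (0, 0): 1.0,
    (0, 1): 3.0,
    (1, 0): 3.0,
    (1, 1): 1.0,
}


def main_effect(data, marker_index):
    """Mean expression with the marker at 1 minus the mean with it at 0."""
    high = mean(v for g, v in data.items() if g[marker_index] == 1)
    low = mean(v for g, v in data.items() if g[marker_index] == 0)
    return high - low


def interaction_effect(data):
    """Effect of marker A when B = 1 minus effect of marker A when B = 0."""
    effect_a_given_b1 = data[(1, 1)] - data[(0, 1)]
    effect_a_given_b0 = data[(1, 0)] - data[(0, 0)]
    return effect_a_given_b1 - effect_a_given_b0


effect_a = main_effect(expression, 0)
effect_b = main_effect(expression, 1)
epistasis = interaction_effect(expression)
print(effect_a, effect_b, epistasis)
```

In this toy case a search that examines one marker at a time would find nothing, because both single-marker effects are zero; only a test of the marker pair reveals the dependence.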

The problem is that the theoretical number of combinations in which different genes could cooperate is enormous: it equals the number of potential connections between markers multiplied by the number of genetic expressions that those connections might affect. Accordingly, it is difficult even to decide where to look for these connections.

In the new study, published in the journal PLOS ONE, the researchers propose a new method of calculation that significantly reduces the number of possibilities, making the identification of interactions between genes a feasible task. Their method is based on innovative statistical tools from the “big data” field of analysis, and its first goal is to reduce substantially both the number of genetic markers and the number of genetic expressions to be examined. The method shrinks the number of markers to be tested by applying a hierarchical filter that identifies DNA areas containing at least one epistatic effect, so that the search can focus solely on genetic markers within these areas. It reduces the number of genetic expressions by clustering similar expressions together.
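The two reductions described above can be illustrated with a small sketch. It is not the researchers' algorithm: the greedy correlation clustering, the threshold of 0.9, the trait and marker names, and the list of flagged region pairs are all hypothetical stand-ins chosen only to show the shape of the idea.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_traits(traits, threshold=0.9):
    """Greedy stand-in for clustering: a trait joins the first cluster whose
    first member it correlates with at or above the threshold."""
    clusters = []
    for name, values in traits.items():
        for cluster in clusters:
            if pearson(values, traits[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def candidate_marker_pairs(regions, flagged_region_pairs):
    """Keep only marker pairs drawn from region pairs flagged at stage one."""
    pairs = []
    for r1, r2 in flagged_region_pairs:
        pairs.extend((m1, m2) for m1 in regions[r1] for m2 in regions[r2])
    return pairs


# Invented example: three expression traits and five markers in three regions.
traits = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [4.0, 1.0, 3.0, 2.0],
}
regions = {"R1": ["m1", "m2"], "R2": ["m3"], "R3": ["m4", "m5"]}
flagged = [("R1", "R3")]  # assumed output of a region-level test

clusters = cluster_traits(traits)
pairs = candidate_marker_pairs(regions, flagged)
all_pairs = list(combinations([m for ms in regions.values() for m in ms], 2))
print(len(all_pairs) * len(traits), "tests before;",
      len(pairs) * len(clusters), "tests after")
```

In this toy setting, testing every marker pair against every trait would need 30 tests, while testing only marker pairs from the flagged regions against the two clusters needs 8.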

The researchers used a simulation study to illustrate the advantages of their proposed method for the discovery of epistasis over two other methods. The use of genetic expression clusters and the hierarchical search of DNA areas with the potential presence of epistasis significantly increased the chances of discovering epistasis, while reducing the rate of false discoveries to a very low level. The proposed method was then applied to the genome of the thale cress plant (Arabidopsis thaliana). The genetic mapping of this plant and data on its genetic expression are stored in The Arabidopsis Information Resource (TAIR) and, as is customary in the field, are accessible to the entire research community. The analysis addressed some 7,200 non-zero genetic expressions and 500 molecular markers situated along the five chromosomes of the plant genome. A search for epistasis based on marker pairs yields a total of nine million connections to be examined.

In the present study, the genetic expressions were grouped into some 300 clusters based on their mutual correlation. The 500 genetic markers were represented by 47 “regional” markers. As a result, the nine million possibilities for epistasis were reduced to just 340,000.
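As a rough check on the scale of the reduced search, the sketch below counts the tests under one simple assumption that the article does not state explicitly: every unordered pair of regional markers is tested once against every cluster. The function name `pairwise_tests` is introduced here for illustration only.

```python
from math import comb


def pairwise_tests(n_markers, n_traits):
    """Number of tests if each unordered marker pair is tested against each
    trait or cluster (a counting assumption, not taken from the study)."""
    return comb(n_markers, 2) * n_traits


regional_pairs = comb(47, 2)              # unordered pairs of the 47 regional markers
reduced_tests = pairwise_tests(47, 300)   # using the article's "some 300" clusters
print(regional_pairs, reduced_tests)
```

With exactly 300 clusters this counting gives 1,081 regional marker pairs and 324,300 tests, the same order of magnitude as the roughly 340,000 reported; the exact figure depends on the precise number of clusters, which the article gives only approximately.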

The researchers explain that the proposed method succeeded in the challenging task of discovering weak effects, that is, effects relating to a group or network of genetic expressions in which each gene makes only a small contribution to the overall effect. By contrast, an analysis of the expressions of individual genes that were not included in the clusters revealed only strong effects. This suggests that strong epistatic effects exist in the expression of single genes, while weak effects exist across groups of genes. “The fact that we also observed that genetic traits involved in a strong effect showed low connectivity with other traits, and accordingly were not identified as part of the clusters, raises a hypothesis regarding the ‘social’ character of gene behavior, namely that a strong gene does not require cooperation with additional genes in order for an effect to be present, whereas a weak gene must create some type of associative mechanism, such as genetic networks, in order for an effect to be present,” the researchers concluded. It is interesting to note that similar phenomena have also been found in the social sciences; for further discussion of this aspect, see Briñol et al., 2007 and Galinsky et al., 2008.
