Yoojin Chung and Gyeong-Min Kim
Dept. of Computer Engineering, Hankuk Univ. of Foreign Studies
Yongin Kyonggi-do, Korea
email: chungyj@hufs.ac.kr
Motivation: A major post-genomic scientific pursuit is to describe the functions performed by the proteins encoded by the genome. One strategy is to first identify the protein-protein interactions in a proteome, then determine the pathways and overall structure relating these interactions, and finally statistically infer the functional roles of individual proteins. Although a huge amount of genomic data is at hand, current experimental protein interaction assays must overcome technical problems before they can scale up to high-throughput analysis. In the meantime, bioinformatics approaches may help bridge the information gap required for inference of protein function, and parallel processing can speed up their throughput. In this paper, we try to predict protein-protein interactions directly from amino acid sequences, in parallel on a 16-node PC cluster, using a Support Vector Machine (SVM).
Results: SVM learning is a method from statistical learning theory; it is used in much recent bioinformatics research and has many advantages for processing biological data. We train an SVM learning system to recognize and predict protein-protein interactions using the Database of Interacting Proteins (DIP; http://www.dip.doe-mbi.ucla.edu/).
We use the 15117 yeast protein interaction entries in the DIP database, sample 4000 entries at random, and partition the sampled data into training and testing sets at an approximately 1:1 ratio. We define 4 models of data set according to the data partition and build 4 data sets from each model, for a total of 16 data sets.
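The sampling and partitioning step can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `sample_and_split`, the integer placeholder IDs, and the use of a random seed to make each data set reproducible are all assumptions, and the abstract does not say how the 4 partition models differ, so they are not modeled here.

```python
import random

def sample_and_split(entries, sample_size=4000, seed=0):
    """Sample entries without replacement, then split the sample 1:1."""
    rng = random.Random(seed)              # seeded so a data set can be rebuilt
    sampled = rng.sample(entries, sample_size)
    half = sample_size // 2
    return sampled[:half], sampled[half:]  # (training set, testing set)

# Placeholder IDs stand in for the 15117 yeast interaction entries (assumption).
entries = list(range(15117))
train_set, test_set = sample_and_split(entries, seed=0)
print(len(train_set), len(test_set))  # prints "2000 2000"
```

Calling the function with different seeds would give different random samples, which is one hypothetical way to produce several data sets from the same pool of entries.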
We represent an interaction pair by concatenating the amino acid sequences of the two interacting proteins and replacing each residue of the sequence with the property value of the corresponding amino acid. Each amino acid has diverse properties such as hydrophobicity, polarity, charge, and surface tension. We select only one of these properties and combine it with the amino acid sequences of the interacting proteins. In experiments on the 16 different data sets, we obtain approximately 78.9% to 99.6% using the hydrophobicity property. With properties other than hydrophobicity, the results are not good enough to predict interactions.
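The encoding described above can be sketched as follows. The abstract does not name the hydrophobicity scale, so the Kyte-Doolittle hydropathy index is used here purely as an assumption; the function name `encode_pair` and the example sequences are hypothetical. The abstract also does not say how pairs of different lengths are brought to a common length for the SVM, so this sketch leaves each vector at its natural length.

```python
# Kyte-Doolittle hydropathy index (assumed scale; the source names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode_pair(seq_a, seq_b, scale=KYTE_DOOLITTLE):
    """Concatenate two sequences and map each residue to its property value."""
    return [scale[residue] for residue in seq_a + seq_b]

# Two short made-up sequences, for illustration only.
features = encode_pair("MKV", "GA")
print(features)  # prints [1.9, -3.9, 4.2, -0.4, 1.8]
```

Swapping in a different dictionary for `scale` would encode the same pair with another single property, such as polarity or charge.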
However, SVM training requires resources that are at least quadratic in the number of training examples, so it is important to use a parallel mixture of SVMs to reduce the training time on huge biological data sets. Experiments using a parallel mixture of SVMs on a 16-node PC cluster are in progress.
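The motivation for splitting the training set can be illustrated with simple arithmetic. The sketch below assumes roughly 2000 training examples (half of the 4000 sampled entries), one shard per cluster node, and a cost of exactly n squared units, which is only the lower bound stated above. The helper names are hypothetical, and the cost of combining the experts and of communication between nodes is ignored.

```python
def quadratic_cost(n_examples):
    """Lower-bound training cost in arbitrary units (assumed exactly n^2)."""
    return n_examples ** 2

def shard(indices, n_shards):
    """Split indices round-robin into n_shards roughly equal subsets."""
    return [indices[i::n_shards] for i in range(n_shards)]

train_indices = list(range(2000))   # about half of the 4000 sampled entries
shards = shard(train_indices, 16)   # one shard per node of the cluster

single_cost = quadratic_cost(len(train_indices))
per_node_cost = max(quadratic_cost(len(s)) for s in shards)
total_sharded_cost = sum(quadratic_cost(len(s)) for s in shards)

print(single_cost, per_node_cost, total_sharded_cost)
# prints "4000000 15625 250000"
```

Under these assumptions the total work drops by a factor of 16, and the work per node by a factor of 256, because each SVM sees only 125 examples.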