An Efficient Algorithm for Oligonucleotides Selection in a Large EST Databases

Adebiyi, E. F. (2007) An Efficient Algorithm for Oligonucleotides Selection in a Large EST Databases. In: Proceedings of the First Southern African Bioinformatics Workshop, 28–30 January 2007, University of the Witwatersrand, Johannesburg.

Abstract

Identifying unique oligonucleotide (oligo) probe sequences is an important step in PCR and microarray experiments. While the number of complete and annotated genomes is growing, the largest collection of publicly available genetic sequences consists of expressed sequence tag (EST) sequences. Furthermore, for many organisms that are important to society, such as barley, ESTs are the major source of data on expressed genes. For EST sequences, the unique oligo problem is the selection of oligos each of which appears (exactly) in one EST sequence but does not appear (exactly or approximately, for a given Hamming distance d) in any other EST sequence. OligoSpawn, which works in two phases, has been implemented to efficiently select oligos from ESTs. The notion of a “seed” was used in the construction of OligoSpawn, and its run time depends exponentially on q (the length of the “seed”). For q = 11, it ran on a previous barley dataset of 28MB for 2 hours and 26 minutes using a 1.2GHz AMD machine, but it is very inefficient for large datasets, like the new 43MB barley dataset. We observed that, for q = 11, OligoSpawn runs on this dataset for about 6 days using a 3.0GHz Pentium IV machine. Furthermore, selection of some important unique oligos (e.g., those for which q = 13) is unwieldy for OligoSpawn. In this work, using the suffix tree, we give a careful theoretical characterization of the set of seeds required, and prove that these seeds can be extracted by a subquadratic time algorithm. Using this result, we present an efficient algorithm that takes advantage of new results that simplify the solution of the least common ancestor (LCA) problem via the range minimum query (RMQ) problem. The run time of our resulting algorithm is O(n3qd/42q). For q = 11 and q = 13, our algorithm runs on the new 43MB barley dataset for 4 days, also using a 3.0GHz Pentium IV machine.
As far as we know, our algorithm is the fastest oligonucleotide selection algorithm for large databases of tens of thousands of EST sequences, such as the barley ESTs.
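
To make the problem statement concrete, the following is a minimal brute-force sketch of the unique oligo definition given in the abstract. It is not the suffix-tree algorithm of the paper and not OligoSpawn; it only spells out the selection criterion on toy data. The function names, the oligo length parameter `k`, and the example sequences are assumptions introduced for illustration.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(oligo: str, seq: str, d: int) -> bool:
    """True if some window of seq is within Hamming distance d of oligo."""
    k = len(oligo)
    return any(
        hamming(oligo, seq[i:i + k]) <= d
        for i in range(len(seq) - k + 1)
    )


def unique_oligos(ests: List[str], k: int, d: int) -> Dict[int, List[str]]:
    """Brute-force reference for the unique oligo problem.

    For each EST index, return the length-k substrings of that EST that
    have no occurrence within Hamming distance d (d = 0 means exact)
    in any other EST. Hypothetical helper, quadratic in the input size.
    """
    result: Dict[int, List[str]] = {}
    for i, est in enumerate(ests):
        seen = set()
        keep: List[str] = []
        for p in range(len(est) - k + 1):
            oligo = est[p:p + k]
            if oligo in seen:
                continue  # report each distinct oligo once per EST
            seen.add(oligo)
            others = (s for j, s in enumerate(ests) if j != i)
            if not any(occurs_within(oligo, s, d) for s in others):
                keep.append(oligo)
        result[i] = keep
    return result


if __name__ == "__main__":
    toy_ests = ["ACGTAC", "ACGTTT", "GGGGGG"]  # made-up sequences
    print(unique_oligos(toy_ests, k=4, d=1))
```

A sketch like this compares every oligo against every window of every other sequence, which is only feasible for tiny inputs; the seed-based filtering described in the abstract exists to avoid exactly this exhaustive comparison.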

Item Type:	Conference or Workshop Item (Paper)
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions:	Faculty of Engineering, Science and Mathematics > School of Electronics and Computer Science
Depositing User:	Mrs Patricia Nwokealisi
Date Deposited:	01 May 2017 13:16
Last Modified:	01 May 2017 13:16
URI:	http://eprints.covenantuniversity.edu.ng/id/eprint/8083
