SST: An algorithm for searching sequence databases in time proportional to the logarithm of the database size

Giladi, E. and Walker, M. and Wang, J. and Volkmuth, W. (2000) SST: An algorithm for searching sequence databases in time proportional to the logarithm of the database size. Technical Report. Stanford InfoLab. (Publication Note: Fourth Annual International Conference on Computational Molecular Biology (RECOMB 2000)April 8-11, 2000, Tokyo, Japan)

Preview

PDF
209Kb

Abstract

W e have developed an algorithm, called SST (Sequence Search T that searches database of DNA sequences for near exact matches, in time proportional to the logarithm of the database size n. In SST, we partition each sequence into fragments of fixed length called "windows" using multiple offsets. Each window is mapped into a vector of dimension 4 k which contains the frequency of occurrence of its component k-tuples, with k a parameter typically in the range 4 6. Then we create a tree-structured index of the windows in vector space, using tree structured vector quantization (TSVW e identify the nearest-neighbors of a query sequence by partitioning the query into windows and searching the tree-structured index for nearest neighbor windows in the database. This yields an O(log n) complexity for the search. SST is most effective for applications in which the target sequences show a high degree of similarity to the query sequence, such as assembling shotgun sequences or matching ESTs to genomic sequence. The algorithm is also an effective filtration method. Specifically , it can be used as a preprocessing step for other search methods to reduce the complexity of searching one large database against another. F or the problem of identifying overlapping fragments in the assembly of 120,000 fragments from a 1.5 megabase genomic sequence, SST is 17 to 35 times faster than BLAST when we consider both building and searching the tree. F or searching alone (i.e., after building the tree SST is 50 to 100 times faster than BLAST.

Item Type:	Techreport (Technical Report)
Uncontrolled Keywords:	DNA sequence database, tree structure, vector quantization
Subjects:	Computer Science
Projects:	Miscellaneous
Related URLs:	Project Homepage	http://infolab.stanford.edu/
ID Code:	460
Deposited By:	Import Account
Deposited On:	25 Feb 2000 16:00
Last Modified:	27 Dec 2008 14:19

Download statistics

Repository Staff Only: item control page