Suffix trees for inputs larger than main memory
A suffix tree is a fundamental data structure for string searching algorithms. Unfortunately, in real-life applications, current methods for constructing suffix trees do not scale to large inputs. Because suffix trees are larger than their input sequences and quickly outgrow main memory, the first attempts at building large suffix trees focused on algorithms that avoid massive random access to the tree being built. However, all existing practical algorithms perform random access to the input string, which in essence requires the input to be small enough to fit in main memory. The constantly growing pool of string data, especially biological sequences, requires us to build suffix trees for much larger strings. We are the first to present an algorithm that can construct suffix trees for input sequences significantly larger than the available main memory. Both the input string and the suffix tree are kept on disk, and the algorithm is designed to avoid multiple random I/Os to either of them. As a proof of concept, we show that our method makes it possible to build the suffix tree for 12 GB of real DNA sequences in 26 h on a single machine with 2 GB of RAM. This input is four times the size of the human genome, and the construction of suffix trees for inputs of this magnitude has not been reported before.
Published in The VLDB Journal in 2005.
Published in Information Processing Letters in 1996.
Published in Software: Practice and Experience in 2010.
Published in IEEE Transactions on Knowledge and Data Engineering in 2005.
Published in Knowledge and Information Systems in 2012.
Published in Journal of Discrete Algorithms in 2005.
Published in Parallel Computing in 2007.
Published in 2015.
Published in Theoretical Computer Science in 2009.