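Once you have transcriptome or genome data in hand, and before running assembly or any of the downstream analyses, it is common to carry out a k-mer frequency distribution analysis. Below are several of the algorithms and programs currently in common use for counting k-mers.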
A. Bloom Filter-based Approach
This method exploits the fact that, in real data, a large number of k-mers are singletons that arise from sequencing errors, so they can be screened out with a Bloom filter instead of being stored in the counting table. The Bloom filter-based approach uses the least memory, but is slightly slower than the JELLYFISH hashing approach.
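The idea can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular tool: the `BloomFilter` class, the `count_kmers_bloom` function, and the filter size and hash count are all assumptions made for the example. A k-mer only enters the exact count table on its second sighting, so true singletons cost a few bits each rather than a full table entry.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """A tiny Bloom filter backed by a bytearray (illustrative only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def count_kmers_bloom(reads, k):
    """Count k-mers, keeping exact counts only for k-mers seen at least twice."""
    bloom = BloomFilter()
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif bloom.add(kmer):
                # Second sighting: promote the k-mer to the exact table.
                counts[kmer] = 2
    return counts
```

Because a Bloom filter can return false positives, a singleton may occasionally be promoted with a count of 2. A second pass over the reads to recount the promoted k-mers exactly would remove that error, at the cost of extra running time.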
B. Hashing-based Approach as in JELLYFISH
Jellyfish is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Because of their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the specific task of k-mer counting, which is important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.
Their manual is available here.
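To make the hashing idea concrete, here is a small single-threaded Python sketch. It is not Jellyfish's implementation: an ordinary `Counter` stands in for the lock-free hash table, and the function names (`count_kmers_hash`, `decode`, `frequency_histogram`) are made up for the example. It packs each base into 2 bits, which is one way to see why a limit of 31 bases is convenient: 31 bases need 62 bits and so fit in a single 64-bit word.

```python
from collections import Counter

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers_hash(reads, k):
    """Count k-mers with a hash table keyed by 2-bit packed integers."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31 to fit in 64 bits")
    mask = (1 << (2 * k)) - 1
    counts = Counter()
    for read in reads:
        code, valid = 0, 0
        for base in read.upper():
            value = ENCODE.get(base)
            if value is None:
                # Ambiguous base (for example N): restart the window.
                code, valid = 0, 0
                continue
            # Shift in the new base and drop the oldest one.
            code = ((code << 2) | value) & mask
            valid += 1
            if valid >= k:
                counts[code] += 1
    return counts


def decode(code, k):
    """Turn a packed integer back into its k-mer string."""
    return "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))


def frequency_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers with that count."""
    return Counter(counts.values())
```

Calling `frequency_histogram(count_kmers_hash(reads, k))` gives the k-mer frequency distribution mentioned at the top of this post: for each multiplicity, how many distinct k-mers occur that many times.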
C. Meryl
We are not sure how efficient the algorithm is. Its website describes it as follows:
An out-of-core k-mer counter. The amount of sequence that can be processed for any size k depends only on the amount of free disk space.
More here.
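For readers unfamiliar with the term, "out-of-core" means the working data lives on disk rather than in RAM. The sketch below shows one generic way to do that and is not Meryl's actual algorithm, which we have not inspected: k-mers are spilled into bucket files by hash, and each bucket is then counted on its own, so only one bucket has to fit in memory at a time. The function name `count_kmers_out_of_core` and the `num_buckets` parameter are hypothetical.

```python
import os
import tempfile
import zlib
from collections import Counter


def count_kmers_out_of_core(reads, k, num_buckets=4):
    """Count k-mers by spilling them to bucket files, then counting bucket by bucket."""
    totals = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, f"bucket_{b}.txt") for b in range(num_buckets)]
        files = [open(path, "w") for path in paths]
        try:
            for read in reads:
                for i in range(len(read) - k + 1):
                    kmer = read[i:i + k]
                    # Identical k-mers always hash to the same bucket file.
                    bucket = zlib.crc32(kmer.encode()) % num_buckets
                    files[bucket].write(kmer + "\n")
        finally:
            for handle in files:
                handle.close()
        for path in paths:
            with open(path) as handle:
                bucket_counts = Counter(line.strip() for line in handle)
            # For simplicity the results are merged in memory here; a real
            # out-of-core tool would write each bucket's counts back to disk.
            totals.update(bucket_counts)
    return totals
```

With more buckets, each bucket gets smaller, so the memory needed per counting step shrinks while the disk usage stays roughly the same.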
D. Tallymer – Suffix Array-based Approach