454 Replicate Filter

This tool clusters and filters out artificially replicated sequences in 454 data. It returns a FASTA file of unique sequences and a list of the sequences in each cluster. The tool is described in: Gomez-Alvarez V, Teal TK, Schmidt TM. Systematic artifacts in metagenomes from complex microbial communities. ISME J. 2009 Jul 9.

Upload file

Upload a FASTA file of 454 metagenomic reads


--------------------------------

Cluster replicates

The default settings identify artificially replicated sequences. To identify only exact duplicates (as in MG-RAST), use 1.0 for the sequence identity cutoff and 1.0 for the length difference requirement.
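To make the exact-duplicate case concrete, here is a minimal Python sketch of what grouping only identical reads amounts to. It is an illustration, not the tool's implementation (the tool clusters with CD-HIT); the function name `group_exact_duplicates` and the sample reads are hypothetical.

```python
from collections import defaultdict


def group_exact_duplicates(reads):
    """Group read IDs whose sequences are identical (ignoring case).

    `reads` is a list of (read_id, sequence) pairs. Hypothetical helper
    for illustration only; the tool itself relies on CD-HIT.
    """
    groups = defaultdict(list)
    for read_id, seq in reads:
        groups[seq.upper()].append(read_id)
    # One list of read IDs per distinct sequence, in first-seen order.
    return list(groups.values())


# Made-up example reads.
reads = [
    ("r1", "ACGTACGT"),
    ("r2", "ACGTACGT"),
    ("r3", "ACGTACGA"),
    ("r4", "acgtacgt"),
]
clusters = group_exact_duplicates(reads)
unique_ids = [ids[0] for ids in clusters]  # keep one representative per group
```

In this sketch the first read seen in each group is kept as the representative; how the tool chooses its representative is not stated here.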

Sequence identity cutoff:
This value should be between 0.85 and 1.0

Length difference requirement:
This value should be between 0 and 1.0
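If you prepare settings in a script before submitting them, a small range check like the following sketch may help. The function name `validate_settings` and its parameter names are hypothetical; only the ranges come from the form above.

```python
def validate_settings(identity_cutoff, length_difference):
    """Check both values against the ranges stated on the form.

    Hypothetical helper: returns the pair unchanged if valid,
    raises ValueError otherwise.
    """
    if not 0.85 <= identity_cutoff <= 1.0:
        raise ValueError("Sequence identity cutoff must be between 0.85 and 1.0")
    if not 0.0 <= length_difference <= 1.0:
        raise ValueError("Length difference requirement must be between 0 and 1.0")
    return identity_cutoff, length_difference


# Exact-duplicate settings pass the check.
settings = validate_settings(1.0, 1.0)
```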


It can take up to 10 minutes to upload these large files, so please be patient!

Advanced options

Number of beginning base pairs to check:

By default, this program checks that the first 3 base pairs of every sequence in a cluster are identical. This ensures that any sequences removed are both similar to each other and start at the same position, which are our criteria for replicates caused by 454 sequencing artifacts. You can change this setting here. In datasets from complex communities, this check typically excludes sequences that should be clustered together, because sequencing errors at the beginning of a sequence cause a mismatch. In simpler communities with more coverage, the check ensures that true biological replicates are not clustered and excluded. Be familiar with your data and check the output before changing this setting.
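The beginning-base-pair check can be pictured as splitting each similarity cluster by its leading bases. The sketch below assumes a cluster is given as (read ID, sequence) pairs; `split_cluster_by_prefix` and `prefix_length` are hypothetical names, and the tool's scripts may implement this step differently.

```python
from collections import defaultdict


def split_cluster_by_prefix(cluster, prefix_length=3):
    """Split one similarity cluster into groups sharing the same first bases.

    `cluster` is a list of (read_id, sequence) pairs that a clustering
    step has already grouped together. Hypothetical helper for illustration.
    """
    groups = defaultdict(list)
    for read_id, seq in cluster:
        groups[seq[:prefix_length].upper()].append(read_id)
    return dict(groups)


# Made-up cluster: r3 is similar overall but starts with a different base.
cluster = [("r1", "ACGTTT"), ("r2", "ACGTTA"), ("r3", "TCGTTT")]
by_prefix = split_cluster_by_prefix(cluster)
```

Here r1 and r2 would be treated as replicates of each other, while r3 stays separate. In this sketch, a `prefix_length` of 0 puts every read in one group, which corresponds to skipping the check.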

--------------------------------

Note:
Sequences that are clustered together by CD-HIT and start with the same beginning base pairs are identified as replicates and clustered. If many sequences are expected to look similar and start at the same position (e.g. 454 tag data), this is not the right tool for your data.

Availability:
These scripts are all open source and distributed under the GNU GPL. They can also be run from the command line without the web interface. The scripts are available for download here.
If you would like to be added to the mailing list for updates to the scripts, please contact the authors.

Comments/Questions:
If you have any comments or questions about these programs, please contact the authors.