SNPs: Single Nucleotide Polymorphism Prediction on the Coffee Rust Genome

OVERVIEW

The main objective is to search for SNPs in 454 sequencing data of H. vastatrix. To this end, polymorphism-rich regions will be identified. Primers flanking these regions will then be designed for use as a tool to identify isolates or races of coffee rust.

MARTHLAB SNPs SEARCH ON Hemileia vastatrix

This section presents results of a preliminary test of SNP discovery using the Marth Lab (Boston College) software package.

GUIDED ASSEMBLY OF RUST

The MOSAIK aligner tools were used to assemble 516,834 coffee rust reads. The 45,244 contigs produced by the Newbler assembler (Roche) were used as reference sequences for the guided assembly.

PyroBayes

PyroBayes was run on the Roche 454 .sff file to extract the reads, recall their bases and generate new quality values. In total, 516,834 sequences were called. A screenshot of the command that was run is shown below.

pyrobayes.png

MosaikBuild

Binary files (.dat) for the reads (FASTA and quality files) and the contigs (FASTA) were created using MosaikBuild.

For the reference sequences 454AllContigs.fna:

MBP:Roya david$ MOSAIK/bin/MosaikBuild -fr 454AllContigs.fna -oa ./AssembleRoyaBostonLab/RoyaAllContigsMac/454AllContigs.dat
------------------------------------------------------------------------
MosaikBuild 1.0.1384 2010-01-24
Michael Stromberg Marth Lab, Boston College Biology Department
------------------------------------------------------------------------

- converting 454AllContigs.fna to a reference sequence archive.

- parsing reference sequences:
ref seqs: 45,244 (45,198.8 ref seqs/s)

- writing reference sequences:
100%[=======================================] 45,244.0 ref seqs/s in 1 s

- calculating MD5 checksums:
100%[=======================================] 45,244.0 ref seqs/s in 1 s

- writing reference sequence index:
100%[=======================================] 45,244.0 ref seqs/s in 1 s

- creating concatenated reference sequence:
100%[=======================================] 45,244.0 ref seqs/s in 1 s

- writing concatenated reference sequence… finished.
- creating concatenated 2-bit reference sequence… finished.
- writing concatenated 2-bit reference sequence… finished.
- writing masking vector… finished.

MosaikBuild CPU time: 2.339 s, wall time: 4.201 s

For the reads on fasta (RoyaPyroBayes.fasta) and their quality values (RoyaPyroBayes.fasta.qual):

MBP:Roya david$ ./MOSAIK/bin/MosaikBuild -fr ./AssembleRoyaBostonLab/RoyaLenght850/RoyaPyroBayes.fasta -fq ./AssembleRoyaBostonLab/RoyaLenght850/RoyaPyroBayes.fasta.qual -out ./AssembleRoyaBostonLab/RoyaAllContigsMac/RoyaReads.dat -st 454
------------------------------------------------------------------------
MosaikBuild 1.0.1384 2010-01-24
Michael Stromberg Marth Lab, Boston College Biology Department
------------------------------------------------------------------------

- setting read group ID to: ZBYGX09JC8I
- setting sample name to: unknown
- setting sequencing technology to: 454
- trimming leading and lagging N’s. Mates with >4 interior N’s will be deleted.

- parsing FASTA files:
reads: 516,834 (5,738.1 reads/s)

Filtering statistics:
========================================

  1. reads written: 516834
  2. bases written: 187871784

MosaikBuild CPU time: 87.559 s, wall time: 90.333 s
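
As a quick sanity check on the conversion, the mean read length can be derived from the two filtering statistics reported above. This is only a minimal sketch; the variable names are illustrative and the two numbers are copied from the MosaikBuild output.

```python
# Figures copied from the MosaikBuild filtering statistics above.
reads_written = 516_834
bases_written = 187_871_784

# Mean read length in base pairs.
mean_read_length = bases_written / reads_written
print(f"mean read length: {mean_read_length:.1f} bp")  # 363.5 bp
```

A mean of roughly 360 bp per read is what these two counts imply; it says nothing about the length distribution.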

MosaikJumpDatabase

A jump database of the contigs was then created to speed up the alignment. The parameters used were a hash size of 15 and 3 GB of RAM for building the database:

MBP:Roya david$ ./MOSAIK/bin/MosaikJump -ia ./AssembleRoyaBostonLab/RoyaAllContigsMac/454AllContigs.dat -out ./AssembleRoyaBostonLab/RoyaAllContigsMac/454RoyaJump_15 -hs 15 -mem 3

------------------------------------------------------------------------
MosaikJump 1.0.1384 2010-01-24
Michael Stromberg Marth Lab, Boston College Biology Department
------------------------------------------------------------------------

- retrieving reference sequence… finished.

- hashing reference sequence:
100%[=========================================] 5,003,756 hashes/s in 7 s

- serializing final sorting vector… finished.

- writing jump positions database:
100%[=================================] 24,281.4 hash positions/s in 09:41

- serializing jump keys database (17 blocks):
blocks: 17 (0.0345 blocks/s)

MosaikJump CPU time: 225.963 s, wall time: 1214.830 s
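
To put the hash size in perspective, the short sketch below computes how many distinct hashes of length 15 exist over a four-letter alphabet and compares that with a rough count of hashed positions taken from the progress line above. The rate and duration in that line are rounded, so the second figure is only an estimate, and the variable names are illustrative.

```python
hash_size = 15                     # value passed to -hs
possible_hashes = 4 ** hash_size   # four nucleotides per position
print(f"possible hashes: {possible_hashes:,}")  # 1,073,741,824

# Rough estimate from the "hashing reference sequence" progress line.
hashes_per_second = 5_003_756
seconds = 7
approx_hashed_positions = hashes_per_second * seconds
print(f"approx. hashed positions: {approx_hashed_positions:,}")  # 35,026,292

# Upper bound on the share of all possible hashes present in the reference.
share = approx_hashed_positions / possible_hashes
print(f"at most {share:.1%} of all possible hashes")
```

Because the same hash can occur at several positions, the real share of distinct hashes present in the contigs is lower than this bound.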

MosaikAligner

MosaikAligner was then executed with the following parameters: hash size 15, maximum percentage of the read length allowed to be errors 5%, maximum number of hash positions 100, alignment candidate threshold 26, number of processors 4 and bandwidth 51.
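
For readers who want to script this step, the following sketch assembles the same argument list in Python. The flags and paths are the ones used in the commands on this page; the function name `build_aligner_command` and the `jump_db` argument are illustrative assumptions, and the sketch only prints the command rather than running it.

```python
import shlex

BASE = "./AssembleRoyaBostonLab/RoyaAllContigsMac"

def build_aligner_command(jump_db=None):
    """Return the MosaikAligner argument list (hypothetical helper)."""
    args = [
        "./MOSAIK/bin/MosaikAligner",
        "-in", f"{BASE}/RoyaReads.dat",       # reads archive
        "-out", f"{BASE}/RoyaAlignedMac.dat", # alignment archive
        "-ia", f"{BASE}/454AllContigs.dat",   # reference archive
        "-hs", "15",    # hash size
        "-mmp", ".05",  # maximum mismatch percentage (5% of read length)
        "-mhp", "100",  # maximum number of hash positions
        "-act", "26",   # alignment candidate threshold
        "-p", "4",      # number of processors
        "-bw", "51",    # Smith-Waterman bandwidth
    ]
    if jump_db is not None:
        args += ["-j", jump_db]  # optional jump database prefix
    return args

print(shlex.join(build_aligner_command()))
print(shlex.join(build_aligner_command(f"{BASE}/454RoyaJump_15")))
```

Keeping the arguments in a list makes it easy to run the same alignment with or without the jump database.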

MBP:Roya david$ ./MOSAIK/bin/MosaikAligner -in ./AssembleRoyaBostonLab/RoyaAllContigsMac/RoyaReads.dat -out ./AssembleRoyaBostonLab/RoyaAllContigsMac/RoyaAlignedMac.dat -ia ./AssembleRoyaBostonLab/RoyaAllContigsMac/454AllContigs.dat -hs 15 -mmp .05 -mhp 100 -act 26 -j ./AssembleRoyaBostonLab/RoyaAllContigsMac/454RoyaJump_15 -p 4 -bw 51

------------------------------------------------------------------------
MosaikAligner 1.0.1384 2010-01-24
Michael Stromberg & Wan-Ping Lee Marth Lab, Boston College Biology Department
------------------------------------------------------------------------

- Using the following alignment algorithm: all positions
- Using the following alignment mode: aligning reads to all possible locations
- Using a maximum mismatch percent threshold of 0.05
- Using a hash size of 15
- Using 4 processors
- Using a Smith-Waterman bandwidth of 51
- Using an alignment candidate threshold of 26bp.
- Setting hash position threshold to 100
- Using a jump database for hashing. Storing keys & positions in memory.
- Using a homo-polymer gap open penalty of 4
- loading jump keys database into memory… finished.
- loading jump positions database into memory… finished.
- loading reference sequence… finished.

Aligning read library (516834): 0% [ ]
/ERROR: A position (1099511627520) was specified that is larger than the jump positions database (101041800).

The jump database could not be used because of the error shown above (“ERROR: A position (1099511627520) was specified that is larger than the jump positions database (101041800)”). The alignment was therefore performed without it, using the same parameters:
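
The position reported in the error has a regular bit pattern, which the sketch below makes visible. This is only an observation on the two numbers in the error message; it does not identify the cause, and the variable names are illustrative.

```python
# Numbers copied from the MosaikAligner error message.
reported_position = 1_099_511_627_520
database_size = 101_041_800

print(hex(reported_position))             # 0xffffffff00
print(reported_position == 2**40 - 256)   # True

# How far the requested position lies beyond the end of the database.
ratio = reported_position / database_size
print(f"position is about {ratio:,.0f} times the database size")
```

A value of this form looks more like an invalid or uninitialised offset than a genuine position in a database of about 101 million entries, which may be useful when reporting the problem or rebuilding the jump database.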

MBP:Roya david$ ./MOSAIK/bin/MosaikAligner -in ./AssembleRoyaBostonLab/RoyaAllContigsMac/RoyaReads.dat -out ./AssembleRoyaBostonLab/RoyaAllContigsMac/RoyaAlignedMac.dat -ia ./AssembleRoyaBostonLab/RoyaAllContigsMac/454AllContigs.dat -hs 15 -mmp .05 -mhp 100 -act 26 -p 4 -bw 51

------------------------------------------------------------------------
MosaikAligner 1.0.1384 2010-01-24
Michael Stromberg & Wan-Ping Lee Marth Lab, Boston College Biology Department
------------------------------------------------------------------------

- Using the following alignment algorithm: all positions
- Using the following alignment mode: aligning reads to all possible locations
- Using a maximum mismatch percent threshold of 0.05
- Using a hash size of 15
- Using 4 processors
- Using a Smith-Waterman bandwidth of 51
- Using an alignment candidate threshold of 26bp.
- Setting hash position threshold to 100
- Using a homo-polymer gap open penalty of 4

Hashing reference sequence:
100%[======================================] 475,611.8 ref bases/s in 01:18

- loading reference sequence… finished.

Aligning read library (516834):
100%[==========================================] 97.7 reads/s in 1:28:08

Alignment statistics (mates):
===============================

  1. failed hash: 649 ( 0.1 %)
  2. filtered out: 481578 ( 93.2 %)
  3. unique: 33447 ( 6.5 %)
  4. non-unique: 1160 ( 0.2 %)
    ------------------------------
    total: 516834
    total aligned: 34607 ( 6.7 %)

Miscellaneous statistics:
==============================
aligned mate bp: 11583039
alignment candidates/s: 4130.6

MosaikAligner CPU time: 10439.013 s, wall time: 5374.956 s

As can be seen, only 6.7% (34607) of the total reads (516834) were aligned. This is because approximately half of the reads belong to a half-plate that was not paid for, and the contigs sent to us were assembled by Newbler using only the remaining reads. In addition, other reads were discarded because of low quality and other MOSAIK filter parameters. Note that the reads were extracted from the Roche .sff file, which contains all reads of the complete plate.
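
The figures in the alignment statistics can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers copied from the MosaikAligner output above; the variable names are illustrative.

```python
# Counts copied from the "Alignment statistics (mates)" block.
total_reads = 516_834
counts = {
    "failed hash": 649,
    "filtered out": 481_578,
    "unique": 33_447,
    "non-unique": 1_160,
}

# The four categories should account for every read.
assert sum(counts.values()) == total_reads

for name, n in counts.items():
    print(f"{name:>12}: {n:>7} ({100 * n / total_reads:.1f} %)")

aligned = counts["unique"] + counts["non-unique"]
print(f"total aligned: {aligned} ({100 * aligned / total_reads:.1f} %)")  # 34607 (6.7 %)

# Mean aligned length per read, from "aligned mate bp".
aligned_bp = 11_583_039
mean_aligned_length = aligned_bp / aligned
print(f"mean aligned length: {mean_aligned_length:.1f} bp")  # 334.7 bp

# CPU time divided by wall time: average number of busy processors.
cpu_seconds, wall_seconds = 10439.013, 5374.956
busy_processors = cpu_seconds / wall_seconds
print(f"average busy processors: {busy_processors:.2f}")  # 1.94
```

The last figure suggests that, on average, about two of the four requested processors were busy during the run.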

MosaikAssembler

Finally, MosaikAssembler was run.

MBP:BostonLab david$ ../MOSAIK/bin/MosaikAssembler -in ./RoyaAllContigsMac/RoyaSortMac.dat -out ./RoyaAllContigsMac/Assemble/RoyaAssemblyMac -ia ./RoyaAllContigsMac/454AllContigs.dat -f gig

MosaikAssembler CPU time: 44.382 s, wall time: 720.523 s

In the output shown above, the read counts and alignment statistics for each contig were omitted because the list is very long. The complete output can be found in the MosaikAssembler.txt file.

GIGABAYES

Using the binary output of MosaikAssembler (.gig files), GigaBayes was run with two different configurations.

Configuration 1:

./bin/gigaBayes --gig ./Assemble/gig/RoyaAssemblyMac_contig04554.gig --gff ./Assemble/gff/RoyaAssemblyMac_contig04554.gig.gff --sample unknown --CRL 50 --PSL 0.5 --TB 0 --O 2 --debug

Configuration 2:

./bin/gigaBayes --gig ./Assemble/gig/RoyaAssemblyMac_contig37379.gig --gff ./Assemble/gff/RoyaAssemblyMac_contig37379.gig.gff --sample unknown --CRL 100 --PSL 0.5 --TB 0 --O 2 --debug

Results for the two parameter configurations are shown in Tables 1 and 2 below. P50-R50: 50 reads with a Bayesian probability threshold of 50%; P50-R100: 100 reads with a probability threshold of 50%. In both runs the sample was set to “unknown”, meaning that the sample is assumed to come from different entities (different clonal populations or colonies). In Table 1, SNPs with a Bayesian probability below 0.7 (70%) are highlighted in red.
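
The highlighting rule used in Table 1 can be expressed as a simple filter. The sketch below is illustrative only: the (position, probability) pairs are hypothetical values made up for the example, not results from Tables 1 or 2, and the names are assumptions.

```python
# Hypothetical (position, probability) pairs, NOT values from the tables.
candidate_snps = [(120, 0.55), (348, 0.92), (771, 0.69), (1024, 0.70)]

HIGHLIGHT_BELOW = 0.7  # Table 1 highlights SNPs below this probability

# SNPs that would be highlighted in red (probability below 0.7).
highlighted = [snp for snp in candidate_snps if snp[1] < HIGHLIGHT_BELOW]

# SNPs at or above the 0.7 level.
not_highlighted = [snp for snp in candidate_snps if snp[1] >= HIGHLIGHT_BELOW]

print("highlighted:", highlighted)          # [(120, 0.55), (771, 0.69)]
print("not highlighted:", not_highlighted)  # [(348, 0.92), (1024, 0.7)]
```

All four hypothetical candidates pass the 50% reporting threshold described above; the 0.7 cut-off only separates the weaker calls from the stronger ones.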

gigabayes_table_1.png

gigabayes_table2.png

REFERENCES

1. PYROBAYES: An improved base-caller for SNP discovery in pyrosequences. Quinlan AR, Stewart DA, Strömberg MP, Marth GT. Nature Methods. 2008;5:179-81.

2. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Sachidanandam, R, Weissman, D, Schmidt, SC, Kakol, JM, Stein, LD, Marth, G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL, Hunt SE, Cole CG, Coggill PC, Rice CM, Ning Z, Rogers J, Bentley DR, Kwok PY, Mardis ER, Yeh RT, Schultz B, Cook L, Davenport R, Dante M, Fulton L, Hillier L, Waterston RH, McPherson JD, Gilman B, Schaffner S, Van Etten WJ, Reich D, Higgins J, Daly MJ, Blumenstiel B, Baldwin J, Stange-Thomann N, Zody MC, Linton L, Lander ES, Altshuler D; The International SNP Map Working Group. Nature 409, 928-33 (2001).

3. Whole-genome sequencing and variant discovery in C. elegans. Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER. Nature Methods. 2008;5:183-8.

4. A general approach to single-nucleotide polymorphism discovery. Gabor T. Marth, Mark D. Yandell, Ian Korf, Zhijie Gu, Raymond T. Yeh, Hamideh Zakeri, Nathan O. Stitziel, LaDeana Hillier, Pui-Yan Kwok and Warren Gish. Nature Genetics 23, 452-456 (1999).