View Hybrid Assembly of 454 and Illumina data with CLC
Hybrid Assembly:
Command:
biosge002:/home/dm-riano122/Illumina# time clc_novo_assemble -o ./Hybrid/HvHybridAssemblyCLC.fasta -q 454RoyaTodas_trim_seqclean.fasta -q -p fb ss 200 400 Hv387_trim.fasta Hv494_trim.fasta HvCat_4_trim.fasta HvDQ952_trim.fasta HvH_179_trim.fasta HvH_569_trim.fasta HvH_701_trim.fasta HvMar_1_trim.fasta -v —cpus 20&
Screen Output:
Read count: 376582310
Nucleotide count: 40657409144
Word size: 25Progress: 100.0 %
real 181m38.836s
user 2922m1.737s
sys 7m31.200s
Reference Assembly:
Command:
biosge002:/home/dm-riano122/Illumina# clc_ref_assemble_long -o HybridReferenceAssemblyCLC.cas -d HvHybridAssemblyCLC.fasta -q 454RoyaTodas_trim_seqclean.fasta -q -p fb ss 200 400 Hv387_trim.fasta Hv494_trim.fasta HvCat_4_trim.fasta HvDQ952_trim.fasta HvH_179_trim.fasta HvH_569_trim.fasta HvH_701_trim.fasta HvMar_1_trim.fasta —cpus 20&
Reports:
Commands:
biosge002:/home/dm-riano122/Illumina# sequence_info -n -r HvHybridAssemblyCLC.fasta > HybridAssembly.txt
biosge002:/home/dm-riano122/Illumina# sequence_info -k HvHybridAssemblyCLC.fasta > HybridLenght.txt
biosge002:/home/dm-riano122/Illumina# sequence_info -l HvHybridAssemblyCLC.fasta > HybridContigLenghts.txt
biosge002:/home/dm-riano122/Illumina# assembly_info -p fb ss 200 400 HybridReferenceAssemblyCLC.cas >HybridAssemblyInfo.txt
This is a Summary:
Number of sequences 539797 Residue counts: Number of A's 110198494 32.40 % Number of C's 56978612 16.75 % Number of G's 57229488 16.82 % Number of T's 110193824 32.39 % Number of N's 5557029 1.63 % Total 340157447 Sequence lengths: Minimum 200 Maximum 73632 Average 630.16 N50 1199 Read info: Contigs 539797 Reads 376582310 Unassembled reads 9380494 Assembled reads 367201816 Multi hit reads 44468504 Potential pairs 187163075 Paired 42942190 Not paired 144220885 Paired end info: Pairs 43770853 Average distance 323.21 99.9 % of pairs between 200 - 400 99.0 % of pairs between 205 - 400 95.0 % of pairs between 227 - 400 Not pairs 144520302 Both seqs not matching 3195188 One seq not mathing 2990118 Both seqs matching 138334996 Different contigs 130565115 Wrong directions 428038 Too close 4319462 Too far 3022381 Coverage info: Total sites 340157447 Average coverage 113.20
Here you can download reports:
File:HybridAssembly.txt.zip
File:HybridAssemblyInfo.txt.zip
File:HybridContigLenghts.txt.zip
File:HybridLenght.txt.zip
Summary
Checking the HybridAssemblyInfo.txt with R
Loading the file in R:
a<-read.table(”../HybridAssemblyInfo.txt”,header=T)
Summary of the data:
Contig Sites Reads Coverage Min. : 1 Min. : 200.0 Min. : 0.0 Min. : 0.00 1st Qu.:134950 1st Qu.: 227.0 1st Qu.: 7.0 1st Qu.: 2.57 Median :269899 Median : 281.0 Median : 88.0 Median : 29.77 Mean :269899 Mean : 630.2 Mean : 680.3 Mean : 138.39 3rd Qu.:404848 3rd Qu.: 498.0 3rd Qu.: 440.0 3rd Qu.: 64.87 Max. :539797 Max. :73632.0 Max. :1860452.0 Max. :123921.64
Thus, we see that there is a contig with more than 1.8 million reads and one (a differente one) with coverage 123.000x. Those are repeats. We should mask low complexity regions in the illumina reads before the assembly, as we did for the 454 reads.
This results are interesting to study there repeats, though.
Let’s plot the distribution of the number of reads, coverage ans size per contig: As we have more than 500K contigs, I will take a random sample and plot only that, otherwise plotting will take ages. Plots are sone in R using ggplot2, which is amazing!
First, take a sample of the data (10% of the full data set):
> a.sample=a[sample(1: 539797,size= 53980,replace=F),]
> summary(a.sample)
Contig Sites Reads Coverage Min. : 1 Min. : 200.0 Min. : 0.0 Min. : 0.00 1st Qu.:133934 1st Qu.: 227.0 1st Qu.: 7.0 1st Qu.: 2.57 Median :269478 Median : 280.0 Median : 86.0 Median : 29.66 Mean :269752 Mean : 629.2 Mean : 694.1 Mean : 144.28 3rd Qu.:405492 3rd Qu.: 497.0 3rd Qu.: 430.0 3rd Qu.: 64.82 Max. :539788 Max. :23972.0 Max. :982953.0 Max. :87942.17
Now plot the distribution of number of reads per contig:
library(ggplot2)
m=ggplot(a.sample,aes(x=Reads))
m + geom_histogram(aes(x=Reads),binwidth=1) + xlab(“Number of Reads in Contig”) + xlim(0,430)
Now plot the distribution of contig size:
m=ggplot(a.sample,aes(x=Sites))
m + geom_histogram(aes(x=Sites),binwidth=1) + xlab(“Contig Size”) + xlim(0,5000)
m + geom_histogram(aes(x=Sites),binwidth=1) + xlab(“Contig Size”) + xlim(0,1000)
And finally, the distribution of average coverage per contig:
m=ggplot(a.sample,aes(x=Coverage))
m + geom_histogram(aes(x=Coverage),binwidth=0.3) + xlab(“Coverage per contig”) + xlim(0,200)
m + geom_histogram(aes(x=Coverage),binwidth=0.3) + xlab(“Coverage per contig”) + xlim(0,64)
Here is a function in R to label each contig according to its coverage, it is useful for some plots:
add_labelsFromCoverage=function(data){ labels<-matrix(nrow=nrow(data),ncol=2) colnames(labels)<-c('Contig','CoverageCategory') for (i in 1:nrow(data)){ if(data[i,'Coverage']<5){ labels[i,1]<-data[i,'Contig'] labels[i,2]<-'Low' } else if(data[i,'Coverage']>=5 && data[i,'Coverage']<45){ labels[i,1]<-data[i,'Contig'] labels[i,2]<-'Good' } else if(data[i,'Coverage']>=45 && data[i,'Coverage']<100){ labels[i,1]<-data[i,'Contig'] labels[i,2]<-'High' } else{ labels[i,1]<-data[i,'Contig'] labels[i,2]<-'Ups' } } labels }
Can be used like:
categoriesCoverage=add_labelsFromCoverage(originalDataFrame)
newDataFrame=merge(as.data.frame(categoriesCoverage,originalDataFrame))
Probably the last command do not function so prove that:
> newDataFrame=merge(as.data.frame(categoriesCoverage), as.data.frame(originalDataFrame))
I used the above function to make the following graphic:
hist_cut = ggplot(newDataFrame, aes(x=Sites, fill=CoverageCategory))
hist_cut + geom_bar(position=“fill”,binwidth=150) + xlab(“Contig Size”) + ylab(“Proportion of Coverage Category”)
Where Low means coverage smaller than 5x, Good is coverage between 5x and 45x, High is coverage between 45x and 100x and Ups is coverage higher than 100x, these categories are quite arbitrary, if the are better ones, I am (DMRP) all ears. The bins for the contig size are 150bp wide. The average coverage of a good proportion of the contigs of this assembly is OK (those with High and Good in the Figure). However we have many contigs with very high coverage, i.e., 11438 contigs with a Coverage higher or equal than 1000x. David Will run RepeatMasker on these contigs.
Here is the same figure for the whole dataset, in contrast with the 10% sample of the previous figure.
From the summaries is crystal clear that we have a genome with an important contribution of repetitive sequences. We have to deal with that in order to improve the assemblies and scaffolding.