
View Hybrid Assembly of 454 and Illumina data with CLC

Hybrid Assembly:

Command:

biosge002:/home/dm-riano122/Illumina# time clc_novo_assemble -o ./Hybrid/HvHybridAssemblyCLC.fasta -q 454RoyaTodas_trim_seqclean.fasta -q -p fb ss 200 400 Hv387_trim.fasta Hv494_trim.fasta HvCat_4_trim.fasta HvDQ952_trim.fasta HvH_179_trim.fasta HvH_569_trim.fasta HvH_701_trim.fasta HvMar_1_trim.fasta -v --cpus 20 &

Screen Output:

Read count: 376582310
Nucleotide count: 40657409144
Word size: 25

Progress: 100.0 %

real 181m38.836s
user 2922m1.737s
sys 7m31.200s
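For a rough sense of scale, the wall-clock time above corresponds to about 34,500 reads per second. A quick awk check, using only the read count and the real time reported above:

```shell
# Assembly throughput: 376582310 reads in 181m38.836s of wall-clock time.
awk 'BEGIN { printf "%.0f reads/s\n", 376582310 / (181 * 60 + 38.836) }'
```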

Reference Assembly:

Command:

biosge002:/home/dm-riano122/Illumina# clc_ref_assemble_long -o HybridReferenceAssemblyCLC.cas -d HvHybridAssemblyCLC.fasta -q 454RoyaTodas_trim_seqclean.fasta -q -p fb ss 200 400 Hv387_trim.fasta Hv494_trim.fasta HvCat_4_trim.fasta HvDQ952_trim.fasta HvH_179_trim.fasta HvH_569_trim.fasta HvH_701_trim.fasta HvMar_1_trim.fasta --cpus 20 &

Reports:

Commands:

biosge002:/home/dm-riano122/Illumina# sequence_info -n -r HvHybridAssemblyCLC.fasta > HybridAssembly.txt

biosge002:/home/dm-riano122/Illumina# sequence_info -k HvHybridAssemblyCLC.fasta > HybridLenght.txt

biosge002:/home/dm-riano122/Illumina# sequence_info -l HvHybridAssemblyCLC.fasta > HybridContigLenghts.txt

biosge002:/home/dm-riano122/Illumina# assembly_info -p fb ss 200 400 HybridReferenceAssemblyCLC.cas > HybridAssemblyInfo.txt
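As a sanity check on the reports, N50 can be recomputed from a column of contig lengths with a short awk sketch. The input here is a toy list; the actual layout of HybridContigLenghts.txt may include a name column, in which case sort and sum on the length field instead:

```shell
# Toy list of contig lengths, one per line; swap in the real lengths file.
# N50 = length of the contig at which the running sum of sorted (descending)
# lengths first reaches half the total assembly length.
printf '200\n300\n500\n1000\n2000\n' |
sort -rn |
awk '{ total += $1; len[NR] = $1 }
     END {
       half = total / 2; run = 0
       for (i = 1; i <= NR; i++) {
         run += len[i]
         if (run >= half) { print "N50:", len[i]; exit }
       }
     }'
```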

Summary of the assembly report:

Number of sequences                539797

Residue counts:
  Number of A's                 110198494   32.40 %
  Number of C's                  56978612   16.75 %
  Number of G's                  57229488   16.82 %
  Number of T's                 110193824   32.39 %
  Number of N's                   5557029    1.63 %
  Total                         340157447

Sequence lengths:
  Minimum                             200
  Maximum                           73632
  Average                             630.16
  N50                                1199

Read info:

  Contigs                      539797
  Reads                     376582310
    Unassembled reads         9380494
    Assembled reads         367201816
      Multi hit reads        44468504
  Potential pairs           187163075
    Paired                   42942190
    Not paired              144220885

Paired end info:

  Pairs                      43770853
    Average distance              323.21
    99.9 % of pairs between       200 - 400
    99.0 % of pairs between       205 - 400
    95.0 % of pairs between       227 - 400

  Not pairs                 144520302
    Both seqs not matching    3195188
    One seq not matching      2990118
    Both seqs matching      138334996
      Different contigs     130565115
      Wrong directions         428038
      Too close               4319462
      Too far                 3022381

Coverage info:

  Total sites               340157447

  Average coverage                113.20
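From the Read info block, the overall read placement rate can be computed directly; a small awk sketch using the counts reported above:

```shell
# 367201816 assembled reads out of 376582310 total (from the Read info block).
awk 'BEGIN { printf "%.2f%% of reads assembled\n", 100 * 367201816 / 376582310 }'
```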

Here you can download reports:

File:HybridAssembly.txt.zip
File:HybridAssemblyInfo.txt.zip
File:HybridContigLenghts.txt.zip
File:HybridLenght.txt.zip

Summary

Checking the HybridAssemblyInfo.txt with R

Loading the file in R:

a<-read.table("../HybridAssemblyInfo.txt",header=T)

Summary of the data:

   Contig           Sites             Reads              Coverage        
 Min.   :     1   Min.   :  200.0   Min.   :      0.0   Min.   :     0.00  
 1st Qu.:134950   1st Qu.:  227.0   1st Qu.:      7.0   1st Qu.:     2.57  
 Median :269899   Median :  281.0   Median :     88.0   Median :    29.77  
 Mean   :269899   Mean   :  630.2   Mean   :    680.3   Mean   :   138.39  
 3rd Qu.:404848   3rd Qu.:  498.0   3rd Qu.:    440.0   3rd Qu.:    64.87  
 Max.   :539797   Max.   :73632.0   Max.   :1860452.0   Max.   :123921.64 

Thus, we see that there is a contig with more than 1.8 million reads and another one (a different contig) with roughly 123,000x coverage. Those are repeats. We should mask low-complexity regions in the Illumina reads before the assembly, as we did for the 454 reads.
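As a toy illustration of what soft-masking does, the GNU sed sketch below lowercases homopolymer runs of eight or more bases in a FASTA record. Real masking would be done with a dedicated low-complexity masker (this one-liner handles single-base repeats only, and `\L` in the replacement is a GNU sed extension):

```shell
# Soft-mask (lowercase) homopolymer runs of 8+ bases in a toy FASTA record.
# A dedicated masker also catches dinucleotide repeats etc.; this does not.
printf '>c1\nACGTAAAAAAAAAAGCGT\n' |
sed -E 's/(A{8,}|C{8,}|G{8,}|T{8,})/\L&/g'
```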

These results are interesting for studying those repeats, though.

Let's plot the distributions of the number of reads, coverage, and size per contig. As we have more than 500K contigs, I will take a random sample and plot only that; otherwise plotting would take ages. Plots are done in R using ggplot2, which is amazing!

First, take a sample of the data (10% of the full data set):

> a.sample=a[sample(1:539797, size=53980, replace=F),]
> summary(a.sample)

    Contig           Sites             Reads             Coverage       
 Min.   :     1   Min.   :  200.0   Min.   :     0.0   Min.   :    0.00  
 1st Qu.:133934   1st Qu.:  227.0   1st Qu.:     7.0   1st Qu.:    2.57  
 Median :269478   Median :  280.0   Median :    86.0   Median :   29.66  
 Mean   :269752   Mean   :  629.2   Mean   :   694.1   Mean   :  144.28  
 3rd Qu.:405492   3rd Qu.:  497.0   3rd Qu.:   430.0   3rd Qu.:   64.82  
 Max.   :539788   Max.   :23972.0   Max.   :982953.0   Max.   :87942.17

Now plot the distribution of number of reads per contig:

library(ggplot2)
m=ggplot(a.sample,aes(x=Reads))
m + geom_histogram(aes(x=Reads),binwidth=1) + xlab("Number of Reads in Contig") + xlim(0,430)

ReadsPerContigUpto430Reads.png

Now plot the distribution of contig size:

m=ggplot(a.sample,aes(x=Sites))
m + geom_histogram(aes(x=Sites),binwidth=1) + xlab("Contig Size") + xlim(0,5000)

ContigSizeUpto5000bp.png

m + geom_histogram(aes(x=Sites),binwidth=1) + xlab("Contig Size") + xlim(0,1000)

ContigSizeUpto1000bp.png

And finally, the distribution of average coverage per contig:

m=ggplot(a.sample,aes(x=Coverage))
m + geom_histogram(aes(x=Coverage),binwidth=0.3) + xlab("Coverage per contig") + xlim(0,200)

CoveragePercontigUpto200bp.png

m + geom_histogram(aes(x=Coverage),binwidth=0.3) + xlab("Coverage per contig") + xlim(0,64)

CoveragePercontigUpto64bp.png

Here is a function in R to label each contig according to its coverage, it is useful for some plots:

# Label each contig with a coverage category; returns a two-column
# character matrix (Contig, CoverageCategory) that can be merged back
# into the original data frame.
add_labelsFromCoverage=function(data){
 labels<-matrix(nrow=nrow(data),ncol=2)
 colnames(labels)<-c('Contig','CoverageCategory')
 for (i in 1:nrow(data)){
  if(data[i,'Coverage']<5){                                  # below 5x
   labels[i,1]<-data[i,'Contig']
   labels[i,2]<-'Low'
  }
  else if(data[i,'Coverage']>=5 && data[i,'Coverage']<45){   # 5x to 45x
   labels[i,1]<-data[i,'Contig']
   labels[i,2]<-'Good'
  }
  else if(data[i,'Coverage']>=45 && data[i,'Coverage']<100){ # 45x to 100x
   labels[i,1]<-data[i,'Contig']
   labels[i,2]<-'High'
  }
  else{                                                      # 100x and above
   labels[i,1]<-data[i,'Contig']
   labels[i,2]<-'Ups'
  }
 }
 labels
}

Can be used like:

categoriesCoverage=add_labelsFromCoverage(originalDataFrame)
newDataFrame=merge(as.data.frame(categoriesCoverage,originalDataFrame))

The last command above probably does not work as written (the closing parenthesis of as.data.frame is misplaced); use this instead:

> newDataFrame=merge(as.data.frame(categoriesCoverage), as.data.frame(originalDataFrame))

I used the above function to make the following graphic:

hist_cut = ggplot(newDataFrame, aes(x=Sites, fill=CoverageCategory))
hist_cut + geom_bar(position="fill",binwidth=150) + xlab("Contig Size") + ylab("Proportion of Coverage Category")

ContigSizeAndCoverage.png

Here, Low means coverage below 5x, Good is coverage between 5x and 45x, High is coverage between 45x and 100x, and Ups is coverage above 100x. These categories are quite arbitrary; if there are better ones, I (DMRP) am all ears. The bins for the contig size are 150 bp wide. The average coverage of a good proportion of the contigs of this assembly is OK (those labelled Good and High in the figure). However, we have many contigs with very high coverage: 11438 contigs have an average coverage of 1000x or more. David will run RepeatMasker on these contigs.
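That high-coverage count can be extracted directly from HybridAssemblyInfo.txt with awk, assuming the column layout that read.table saw above (Contig, Sites, Reads, Coverage, with one header line). The input below is a toy stand-in; pointing the same awk at the real file should reproduce the 11438 figure:

```shell
# Toy stand-in for HybridAssemblyInfo.txt (columns: Contig Sites Reads Coverage).
# Count contigs with average coverage >= 1000x, skipping the header line.
printf 'Contig Sites Reads Coverage\n1 200 10 5.0\n2 300 2000000 1500.5\n3 250 900 999.9\n' |
awk 'NR > 1 && $4 >= 1000 { n++ } END { print n+0 }'
```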

Here is the same figure for the whole dataset, in contrast with the 10% sample of the previous figure.

ContigSizeAndCoverageFullDataset.png

From the summaries it is crystal clear that we have a genome with a substantial contribution from repetitive sequences. We have to deal with that in order to improve the assemblies and the scaffolding.