Maker RNASeq Cufflinks ThirdHybridAssembly CLC

First we selected the contigs from the third hybrid assembly (CLC) that are represented in the RNA-Seq data, and then used them to run MAKER.

Open R and install the package cummeRbund:

biocLite("cummeRbund")

setwd("~/tmp/RNASeqRoya/cuffdiff_Hv701_Hv955_HvCatNor")
library(cummeRbund)
cuff<-readCufflinks()
cuff

CuffSet instance with:
  3 samples
  21345 genes
  25297 isoforms
  22469 TSS
  0 CDS
  64035 promoters
  67407 splicing
  0 relCDS

gene.features <- features(genes(cuff))
head(gene.features)

     gene_id class_code nearest_ref_id gene_short_name                    locus length coverage gene_id
1 XLOC_000001       <NA>           <NA>            <NA>       contig_1:1677-2020     NA       NA    <NA>
2 XLOC_000002       <NA>           <NA>            <NA>      contig_100003:0-214     NA       NA    <NA>
3 XLOC_000003       <NA>           <NA>            <NA>  contig_100006:2319-5847     NA       NA    <NA>
4 XLOC_000004       <NA>           <NA>            <NA>  contig_100006:5992-7648     NA       NA    <NA>
5 XLOC_000005       <NA>           <NA>            <NA> contig_100006:7844-14750     NA       NA    <NA>
6 XLOC_000006       <NA>           <NA>            <NA>  contig_100006:1502-2070     NA       NA    <NA>

write.table(gene.features, "mydata.txt", sep="\t")

Extract the 5th column (locus), then pull the corresponding contigs out of ThirdHybridAssembly with cdbfasta and cdbyank; these contigs are the input for MAKER.
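The contig list for cdbyank can be derived from mydata.txt without opening it in an editor. The following is a minimal sketch, not the exact commands used here: the helper name locus_to_contigs and the file names contig_ids.txt and ThirdHybridAssembly.fasta are assumptions. It also assumes the table was written tab-separated with the write.table defaults (row names on, strings quoted), in which case the locus, the 5th column of the data frame, is the 6th field of every data row.

```shell
# Hypothetical helper: print the unique contig IDs found in the locus column.
# A locus looks like "contig_1:1677-2020"; everything from the colon on is dropped.
locus_to_contigs() {
    awk -F'\t' 'NR > 1 { gsub(/"/, "", $6); sub(/:.*/, "", $6); print $6 }' "$1" |
        LC_ALL=C sort -u
}

# Possible usage (file names are assumptions):
#   locus_to_contigs mydata.txt > contig_ids.txt
#   cdbfasta ThirdHybridAssembly.fasta        # writes ThirdHybridAssembly.fasta.cidx
#   cdbyank ThirdHybridAssembly.fasta.cidx < contig_ids.txt \
#       > RNASeqCufflinksThirdHybridAssemblyCLC.fasta
```

Sorting with -u matters because several XLOC entries can sit on the same contig, and each contig should be requested only once.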

maker -CTL

vim maker_opts.ctl

#-----Genome (Required for De-Novo Annotation)
genome=RNASeqCufflinksThirdHybridAssemblyCLC.fasta #genome sequence file in fasta format
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
genome_gff= #re-annotate genome based on this gff3 file
est_pass=0 #use ests in genome_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ests in genome_gff: 1 = yes, 0 = no
protein_pass=0 #use proteins in genome_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in genome_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in genome_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no
other_pass=0 #passthrough everything else in genome_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est=/data/process/Roya/blastTranscriptomVsNr/HvCatNor7_Trans.fasta #non-redundant set of assembled ESTs in fasta format (classic EST analysis)
est_reads= #unassembled nextgen mRNASeq in fasta format (not fully implemented)
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #EST evidence from an external gff3 file
altest_gff= #Alternate organism EST evidence from a separate gff3 file

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/data/process/DBs/fungi/Protein_DBs/Pucciniales_proteins.fasta #protein sequence file in fasta format
protein_gff=  #protein homology evidence from an external gff3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/opt/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to run repeat masking on prokaryotes (don't change this), 1 = yes, 0 = no

#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species=ustilago_maydis #Augustus gene prediction species model
fgenesh_par_file= #Fgenesh parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #gene prediction from protein homology (prokaryotes only), 1 = yes, 0 = no
unmask=0 #Also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #features to pass-through to final output from an external GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases  memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #force start and stop codon into every gene, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Add unsupported gene prediction to final annotation set, 1 = yes, 0 = no
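
The values changed from the generated defaults could also be set from the shell instead of vim, which makes the run easier to repeat. This is only a sketch: set_ctl_opt is a hypothetical helper, not part of MAKER, and it assumes each option sits on a single line of the form key=value #comment, as in the listing above.

```shell
# Hypothetical helper: set KEY to VALUE in a MAKER control file, keeping the
# trailing "#comment". Usage: set_ctl_opt KEY VALUE FILE
# Assumes VALUE contains no "|" or "&" (both are special to this sed expression).
set_ctl_opt() {
    sed "s|^$1=[^#]*|$1=$2 |" "$3" > "$3.tmp" && mv "$3.tmp" "$3"
}

# Possible usage, with the values shown in the listing above:
#   set_ctl_opt genome RNASeqCufflinksThirdHybridAssemblyCLC.fasta maker_opts.ctl
#   set_ctl_opt augustus_species ustilago_maydis maker_opts.ctl
```

Anchoring the pattern on the key followed directly by = keeps genome from also matching genome_gff.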

mkdir gff

find RNASeqCufflinksThirdHybridAssemblyCLC.maker.output/ -name '*.gff' -exec cp {} gff \;
cat gff/*.gff | grep '^contig' > RNASeqCufflinksThirdHybridAssemblyCLC.gff.tab

vim RNASeqCufflinksThirdHybridAssemblyCLC.gff.tab

Add this line at the beginning of the file:

seqid   source  type    start   end     score   strand  phase   attributes
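
The header can also be prepended without an editor. A minimal sketch, assuming the columns are tab-separated like the GFF3 rows themselves; prepend_gff_header is a hypothetical helper name.

```shell
# Hypothetical helper: put the nine GFF3 column names on the first line of FILE.
# Writes to a temporary file first so the input is never read and truncated at once.
prepend_gff_header() {
    { printf 'seqid\tsource\ttype\tstart\tend\tscore\tstrand\tphase\tattributes\n'
      cat "$1"; } > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Possible usage:
#   prepend_gff_header RNASeqCufflinksThirdHybridAssemblyCLC.gff.tab
```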

/opt/scripts/tabToSql.pl RNASeqCufflinksThirdHybridAssemblyCLC.gff.tab > RNASeqCufflinksThirdHybridAssemblyCLC.gff.tab.sql

Load the .sql file into PostgreSQL:

/opt/local/lib/postgresql90/bin/psql -h 192.168.194.70 -d limsHLM -U root -f RNASeqCufflinksThirdHybridAssemblyCLC.gff.tab.sql

In postgres execute the query:

select seqid,source,count(source) as cantSource
from tmp.rnaseqcufflinksthirdhybridassemblyclc_gff
where source != '.'
group by seqid,source
order by count(source) desc;

Result: File:RNASeqCufflinksThirdHybridAssemblyCLCMakerCount.zip
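
As a cross-check that does not need the database, roughly the same counts can be computed straight from the .gff.tab file. This is a sketch under assumptions: count_sources is a hypothetical helper, the file is tab-separated, and its first line is the header added earlier, so that line is skipped.

```shell
# Hypothetical helper: count rows per (seqid, source), ignoring source ".",
# and list them with the largest count first, like the SQL query above.
count_sources() {
    tab=$(printf '\t')
    awk -F'\t' 'NR > 1 && $2 != "." { n[$1 FS $2]++ }
                END { for (k in n) print k FS n[k] }' "$1" |
        LC_ALL=C sort -t "$tab" -k3,3nr -k1,1 -k2,2
}

# Possible usage:
#   count_sources RNASeqCufflinksThirdHybridAssemblyCLC.gff.tab | head
```

Ties are broken by seqid and then source so that repeated runs give the same order; the SQL query leaves the order of ties unspecified.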

Extract proteins from the Augustus annotation:

mkdir -p annotAugustus
find RNASeqCufflinksThirdHybridAssemblyCLC.maker.output/ -name '*.augustus' -exec cp {} annotAugustus \;