Non-Reference-Based mRNA Sequencing
Next-generation Sequencing
Service Overview
mRNA sequencing enables profiling of all mRNAs transcribed in cells under specific conditions. For species without a reference genome, the sequenced fragments are assembled into unigenes, which serve as the reference sequence for downstream analysis. It is a powerful strategy for revealing the molecular mechanisms and regulatory networks of species without a reference genome. To date, mRNA sequencing has been widely employed in fundamental research, clinical diagnostics, drug development, molecular breeding, etc.
Data quality control
The base-calling error rate is influenced by the instrument, reagents, samples, etc. On Illumina platforms, the error rate commonly climbs slowly along the length of the read as reagents are consumed.
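The trend along the read can be checked directly from the quality strings in a FASTQ file. The sketch below assumes Phred+33 quality encoding (an assumption, not stated above) and uses the standard relation between a Phred score Q and the error probability, e = 10^(-Q/10). The function names are illustrative.

```python
def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to a base-calling error probability."""
    return 10 ** (-q / 10)


def mean_error_by_position(quality_strings, offset=33):
    """Average error probability at each read position.

    quality_strings: iterable of FASTQ quality lines (one per read).
    offset: ASCII offset of the encoding (33 is assumed here).
    Reads may have different lengths; each position is averaged over
    the reads that are long enough to cover it.
    """
    totals, counts = [], []
    for qs in quality_strings:
        for i, ch in enumerate(qs):
            if i == len(totals):
                totals.append(0.0)
                counts.append(0)
            totals[i] += phred_to_error(ord(ch) - offset)
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

Plotting the returned list against position would show whether the error rate rises toward the end of the read.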
Transcriptome data assembly
The high-quality clean data obtained from the quality control steps are assembled with Trinity. Sequencing depth is largely determined by the amount of data and by the abundance of specific transcripts. Because depth directly affects assembly quality, adequate depth is crucial for a more complete assembly of low-abundance transcripts. Therefore, data from samples of the same species are combined for assembly to increase the sequencing depth. Samples from different species have different genomes, so their data are assembled separately.
Gene expression analysis
RNA-Seq provides a highly sensitive estimate of gene expression. Normally, the detectable range of transcript expression, measured in FPKM (fragments per kilobase of transcript per million mapped fragments), is from 10^-2 to 10^6. Box plots and FPKM density distributions visualize the dispersion of gene expression within a single sample and allow the overall expression level to be compared among all samples.
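As a reminder of what the FPKM value measures, here is a minimal sketch of the standard formula: the fragment count of a transcript is normalized by its length in kilobases and by the library size in millions of mapped fragments. The function and parameter names are illustrative.

```python
def fpkm(fragments: float, length_bp: float, total_mapped_fragments: float) -> float:
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    fragments: fragments mapped to this transcript (or unigene)
    length_bp: transcript length in base pairs
    total_mapped_fragments: all mapped fragments in the sample
    """
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (length_bp * total_mapped_fragments)
```

For instance, 500 fragments on a 2,000 bp unigene in a library of 25 million mapped fragments gives an FPKM of 10.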
Correlation assessment on reproducibility
Correlation assessment between biological replicates examines the reproducibility of the experiment. A good correlation strongly supports the reliability of the differential expression analysis results. It also serves as a way to screen for abnormal samples.
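One way to quantify replicate agreement is the Pearson correlation of expression values between two replicates. The sketch below makes two assumptions that are not stated above: expression is log2(FPKM + 1) transformed before correlating, and the squared coefficient (r²) is reported. The function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def replicate_r2(fpkm_a, fpkm_b):
    """r^2 between two replicates on log2(FPKM + 1) values (assumed transform)."""
    log_a = [math.log2(v + 1) for v in fpkm_a]
    log_b = [math.log2(v + 1) for v in fpkm_b]
    return pearson_r(log_a, log_b) ** 2
```

A replicate whose r² against the other replicates of its group is clearly lower than the rest is a candidate abnormal sample.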
Differential expression analysis
Differential expression analysis results are presented in various forms, including volcano plots, MA plots, Venn diagrams, hierarchical clustering heatmaps, protein-protein interaction networks, etc. A volcano plot shows -log10(FDR) against log2(fold change), so it clearly displays the differences in gene expression between two samples together with the corresponding significance.
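The coordinates of a single gene on a volcano plot can be sketched as follows, using the common convention of plotting -log10(FDR) so that more significant genes sit higher. The cutoffs (|log2 fold change| >= 1 and FDR < 0.05) are illustrative assumptions, not thresholds stated by the service.

```python
import math


def volcano_point(fold_change, fdr, lfc_cut=1.0, fdr_cut=0.05):
    """Return (x, y, label) for one gene on a volcano plot.

    x = log2(fold change), y = -log10(FDR).
    lfc_cut and fdr_cut are assumed, illustrative thresholds.
    """
    x = math.log2(fold_change)
    y = -math.log10(fdr)
    if fdr < fdr_cut and x >= lfc_cut:
        label = "up"
    elif fdr < fdr_cut and x <= -lfc_cut:
        label = "down"
    else:
        label = "not significant"
    return x, y, label
```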
DEG functional enrichment analysis
The GO (Gene Ontology) database is a structured biological annotation system containing a standard vocabulary for the functions of genes and gene products. It is organized into multiple levels: the lower the level, the more specific the function.
DEG protein-protein interaction networks
STRING is a database of known and predicted protein-protein interactions (PPI) for a collection of species. The interactions include both direct physical interactions and indirect functional interactions. The PPI network is built from the DEGs identified in the differential expression analysis and the interaction information already in the database.
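Conceptually, building the DEG network amounts to keeping only those database interactions whose two partners are both DEGs. The sketch below illustrates that filtering step on plain identifier pairs; the data layout and function name are hypothetical, and it assumes the DEG identifiers have already been matched to the identifiers used in the interaction list.

```python
def deg_ppi_network(interactions, degs):
    """Keep interactions in which both partners are DEGs.

    interactions: iterable of (protein_a, protein_b) pairs from the database
    degs: iterable of DEG identifiers (assumed to share the same ID space)
    """
    deg_set = set(degs)
    return [
        (a, b)
        for a, b in interactions
        if a != b and a in deg_set and b in deg_set
    ]
```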
1. How should the data be interpreted?
Generally, data analysis can be divided into three aspects: gene identification, differentially expressed genes (DEGs), and SNPs. Genes can be examined by ID, gene name, sequence, functional annotation, Venn diagrams of gene expression between samples, WGCNA, etc. In differential expression analysis, DEGs shared between different comparison groups can be identified with a Venn diagram. Genes with similar expression patterns across treatments are likely to have similar functions, so hierarchical clustering is a useful way to extract such genes for further functional analysis. SNP analysis includes PCA, identification of differential SNPs between samples, SNP searches in targeted regions, etc. All of the analyses mentioned above can be performed on the BMKCloud platform.
2. In the KEGG pathway annotation file, what does "K Number Count" stand for?
K Number Count represents the number of enzymes involved. For example, 8(6) means that 8 genes are annotated to the pathway and that they correspond to 6 enzymes in this pathway (two or more genes can be related to the same enzyme).
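The "genes(enzymes)" notation can be illustrated with a small sketch: count the annotated genes, then count the distinct K numbers they map to. The gene IDs and K numbers used in the test data for this sketch are placeholders, and the function name is hypothetical.

```python
def k_number_count(gene_to_k_number):
    """Format 'genes(distinct K numbers)' for one pathway.

    gene_to_k_number: dict mapping each annotated gene to its K number.
    """
    n_genes = len(gene_to_k_number)
    n_enzymes = len(set(gene_to_k_number.values()))
    return f"{n_genes}({n_enzymes})"
```

With eight genes of which two pairs share a K number, the function returns the string 8(6).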
3. In the results, the significance indices of the GO and KEGG enrichment analyses are KS and Q-value, respectively. However, the common threshold used in the literature is P-value<0.05. How do we set the corresponding thresholds for KS and Q-value?
GO enrichment analysis is performed with the topGO R package, which reports KS, and KEGG enrichment analysis is performed with a self-written program based on Fisher's exact test, which reports Q-value. KS<0.05 is equivalent to P-value<0.05. Q-value is the adjusted P-value and can be interpreted as the FDR (the proportion of false positives among all positives). These two indices are not commonly used in papers.
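To make the relation between P-value and adjusted P-value concrete, here is a minimal sketch of a pathway enrichment test. It is not the self-written program mentioned above. It assumes a one-sided Fisher's exact test (computed as a hypergeometric tail) and a Benjamini-Hochberg adjustment, which is one common way to obtain FDR-style adjusted P-values; the correction actually used by the service is not stated.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test P-value, P(X >= k).

    N: all annotated genes, K: genes in the pathway,
    n: all DEGs, k: DEGs in the pathway.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Under these assumptions, each pathway gets one P-value from `enrichment_p`, and `bh_adjust` converts the full list of pathway P-values into adjusted values that can be thresholded as an FDR.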
4. For a specific gene A, the sequencing reads map to the 3' and 5' ends, but the part between them is missing. Is it possible to extend the known parts to obtain the sequence in the middle?
(1) Experimental method: design primers based on the known sequences at the 3' and 5' ends, then extend and amplify the sequence by PCR. (2) Bioinformatic method: find a homologous gene in a related species and map the reads from this region onto that gene.