RNA-Seq/Microarray DEG Analysis...analysis (K-means clustering, Hierarchical analysis, t-test,...
- 1 -
Version 4.0
RNA-Seq & Microarray
DEG Analysis Manual v4.1
- 2 -
<Contents>
1. Differentially Expressed Gene (DEG) Analysis (ExDEGA v.1.6.6)
2. Clustering Heatmap (MeV Software)
3. Pathway Analysis (KEGG Mapper)
4. Functional Annotation Analysis (DAVID)
5. Gene Set Enrichment Analysis (GSEA) Analysis (MSigDB)
6. Protein-Protein Interaction (PPI) Analysis (STRING)
- 3 -
1. Differentially Expressed Gene (DEG) Analysis (ExDEGA v.1.6.6)
EBIOGEN provides customers with an ExDEGA (Excel-based Differentially Expressed Gene Analysis)
data report for its NGS, microarray, and antibody array experimental services. ExDEGA is an
Excel-based data analysis tool that includes convenient functions such as data mining and
graphic visualization. It is designed to be user-friendly for researchers who are unfamiliar with
data analysis or with Excel, and it is continuously updated.
The ExDEGA setup file and data are provided after the NGS, microarray, or antibody array experiment
is complete. As shown in Figure 1-1, unzip the ExDEGA.zip file and run ExDEGASetup.exe. The
ExDEGA data report then opens automatically. If other Excel files are already open, close them
all and open the data report again.
Figure 1-1. ExDEGA Set Up
- 4 -
In the ExDEGA Report.xls file, the Gene Ontology (GO) analysis tool is on the left, the mRNA
expression data is in the middle, and the Differentially Expressed Gene (DEG) analysis tool is on
the right (Figure 1-2). Common GOs are preset in the Gene Category, and they can be edited by
manually adding or modifying a gene list in Gene Category Settings. Meaningful data can be obtained
quickly when the Gene Category and DEG analysis functions are used together. DEG analysis lets the
user select significantly differentially expressed genes and visualize gene expression data more
effectively. With these functions, researchers can easily analyze NGS, microarray, or antibody
array data in ExDEGA.
Figure 1-2. mRNA Expression Data Format Made by EBIOGEN
- 5 -
1-1. Gene category
Functional grouping is an efficient way to analyze mRNA expression data from tens of thousands of
genes. Most biologists use the gene ontology (GO) database and pathway databases for biological
function analysis. The 15 GOs preset in the Gene Category are commonly studied in biology. For
example, to analyze genes related to aging, select ‘Aging’ in the Gene Category (Figure 1-3).
Multiple categories can be selected at once. With multiple categories selected, ‘AND’ keeps only
genes that belong to all of the selected GOs, whereas ‘OR’ keeps genes that belong to at least one
of them.
Figure 1-3. Gene Ontology (Aging) Selection
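The ‘AND’ and ‘OR’ options can be pictured as set operations. The following is a minimal sketch, not ExDEGA's own code: the category contents, the gene symbols, and the function name filter_by_categories are all illustrative assumptions.

```python
# Hypothetical gene categories; the gene lists are illustrative only
# and are not taken from ExDEGA.
gene_categories = {
    "Aging": {"TP53", "SIRT1", "FOXO3"},
    "Cell differentiation": {"SOX2", "FOXO3", "MYOD1"},
}


def filter_by_categories(categories, selected, mode="OR"):
    """Return genes in at least one (OR) or in all (AND) selected categories."""
    gene_sets = [categories[name] for name in selected]
    if not gene_sets:
        return set()
    if mode == "AND":
        return set.intersection(*gene_sets)
    if mode == "OR":
        return set.union(*gene_sets)
    raise ValueError("mode must be 'AND' or 'OR'")


selected = ["Aging", "Cell differentiation"]
genes_or = filter_by_categories(gene_categories, selected, "OR")
genes_and = filter_by_categories(gene_categories, selected, "AND")
print(sorted(genes_or))
print(sorted(genes_and))
```

With these example lists, ‘AND’ leaves only the gene shared by both categories, while ‘OR’ returns every gene from either category.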
If the GO you are interested in is not in the Gene Category, other GOs can be added from the
QuickGO site. To modify the Gene Category settings, first click the ‘View All Data’ button, then
click the ‘Gene Category Settings’ button (Figure 1-4). Clicking the ‘?’ button opens instructions
on how to add a GO.
Figure 1-4. Gene Category Settings
- 6 -
If you have a gene list for another functional group, you can manually create a new gene category
as follows: 1) click the ‘Gene Category Settings’ button, 2) select ‘New’, 3) enter a name for the
new gene category, 4) enter (or copy and paste) the desired gene list, and 5) click ‘OK’ to save it
(Figure 1-5-a, b).
Figure 1-5-a. Adding Genes to Make a New Gene Category
Figure 1-5-b. Adding Genes to Make a New Gene Category
- 7 -
1-2. Significant Gene selection
In the DEG Analysis section on the right, the ‘Significant Gene Selection’ window filters, from the
total results, the genes that differ significantly between control and test samples. In Figure 1-6,
the fold change is set to 2.00, the normalized data (log2) to 4.00, and the t-test p-value to 0.05,
so that 59 genes are filtered from the total of 24,496 genes. The fold change, p-value, and
normalized data (log2) thresholds can be adjusted to suit the results. The p-value is calculated
only for replicated data.
The ‘AND’ and ‘OR’ functions in Significant Gene Selection follow the same concept as described in
part 1-1. They filter genes that differ significantly in all of the selected sample comparisons
(‘AND’) or in at least one of them (‘OR’).
Figure 1-6. Selection of Significantly Expressed Genes
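The three thresholds can be read as a simple per-gene test. The sketch below is one possible interpretation, not the ExDEGA implementation: it assumes that the fold change cutoff applies in both directions (2-fold up or down), that the normalized data cutoff is met when either sample reaches it, and that the p-value must be below the cutoff. The gene names and values are made up.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    symbol: str
    fold_change: float   # test/control ratio on a linear scale
    norm_control: float  # normalized expression (log2), control sample
    norm_test: float     # normalized expression (log2), test sample
    p_value: float       # t-test p-value (needs replicated data)


def is_significant(gene, fc_cut=2.0, norm_cut=4.0, p_cut=0.05):
    """Apply the three cutoffs; how they combine is an assumption."""
    changed = gene.fold_change >= fc_cut or gene.fold_change <= 1.0 / fc_cut
    expressed = max(gene.norm_control, gene.norm_test) >= norm_cut
    return changed and expressed and gene.p_value < p_cut


# Illustrative values only.
genes = [
    Gene("GENE_A", 2.5, 5.1, 6.4, 0.01),   # up-regulated, passes
    Gene("GENE_B", 0.4, 6.0, 4.7, 0.03),   # down-regulated, passes
    Gene("GENE_C", 3.0, 1.2, 2.8, 0.02),   # expression too low
    Gene("GENE_D", 2.2, 5.0, 6.1, 0.20),   # p-value too high
    Gene("GENE_E", 1.3, 7.0, 7.4, 0.001),  # fold change too small
]
selected = [g.symbol for g in genes if is_significant(g)]
print(selected)
```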
- 8 -
Gene Category and Significant Gene Selection can be used together. If you select ‘Cell
differentiation’ in the Gene Category as in Figure 1-7, only 5 genes from the A and B samples are
filtered. These 5 genes are the significantly differentially expressed genes related to cell
differentiation.
Figure 1-7. Significantly Differentially Expressed Genes in Cell Differentiation
To visualize the results after setting up the Significant Gene Selection, click the 'Filter Gene
Category Chart' button. A pie graph and a bar graph pop up after a moment, showing the ratio and
number of differentially expressed genes in each GO. When you click a part of either graph that you
are interested in, the up- or down-regulated genes of that GO are automatically filtered in the
ExDEGA spreadsheet (Figure 1-8). The number above each bar in the bar graph is the number of
genes.
Figure 1-8. Gene Category Chart
- 9 -
1-3. Analysis Graph
A scatter plot, volcano plot, and Venn diagram can be easily drawn through Analysis Graph (Figure
1-9).
Figure 1-9. Analysis Graph Tool
1-3-1. Scatter plot
To draw a scatter plot, choose two variables and set the Fold Threshold Line value, then click
‘Graph View’. The scatter plot is created automatically for the selected conditions. In the plot,
1) the x- and y-axes are relative expression levels, 2) red dots are genes whose y-value is higher
than the x-value, and 3) green dots are genes whose x-value is higher than the y-value (Figure
1-10). When you click a spot in the plot, its gene symbol is displayed; right-click to remove it.
To label multiple genes at the same time, copy the corresponding gene ID list into the ‘Gene Select
(ID Input)’ window and click ‘Add’.
Figure 1-10. Analysis Graph Tool – Scatter Plot
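The coloring rule can be sketched as follows, assuming both axes are log2 expression values so that a fold threshold of 2 corresponds to a difference of 1 between y and x. The function name scatter_color and the "neutral" label for points between the threshold lines are assumptions; the manual does not state how such points are drawn or whether the boundary is inclusive.

```python
import math


def scatter_color(x_log2, y_log2, fold_threshold=2.0):
    """Classify one gene by comparing its y and x expression (log2 scale)."""
    limit = math.log2(fold_threshold)  # threshold line offset in log2 units
    diff = y_log2 - x_log2
    if diff >= limit:
        return "red"      # higher on the y-axis sample
    if diff <= -limit:
        return "green"    # higher on the x-axis sample
    return "neutral"      # between the two fold threshold lines


for x, y in [(4.0, 5.5), (6.0, 5.0), (5.0, 5.5)]:
    print(x, y, scatter_color(x, y))
```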
- 10 -
1-3-2. Volcano plot
The Volcano Plot works almost the same way as the Scatter Plot. Select the variables, set the Fold
Threshold Line value and the p-value, then click ‘Graph View’. The volcano plot is created
automatically for the comparison selected on the left. In the plot, 1) the x-axis is the fold
change on a log2 scale, 2) the y-axis is the p-value on a -log10 scale, and 3) red or green dots
are genes that changed significantly under the conditions that were set (Figure 1-11). When you
click a spot in the plot, its gene symbol is displayed; right-click to remove it. To label multiple
genes at the same time, copy the corresponding gene ID list into the ‘Gene Select (ID Input)’
window and click ‘Add’.
Figure 1-11. Analysis Graph Tool – Volcano Plot
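The axis transforms of the volcano plot can be written out directly. This is a hedged sketch: the function names are hypothetical, and mapping red to up-regulated and green to down-regulated genes is an assumption carried over from the scatter plot convention, since the manual only says that red or green dots are significant.

```python
import math


def volcano_point(fold_change, p_value):
    """Return (x, y) = (log2 fold change, -log10 p-value)."""
    return math.log2(fold_change), -math.log10(p_value)


def volcano_color(fold_change, p_value, fold_threshold=2.0, p_cut=0.05):
    """Color a gene by the fold change and p-value cutoffs (assumed rule)."""
    if p_value >= p_cut:
        return "neutral"
    x, _ = volcano_point(fold_change, p_value)
    limit = math.log2(fold_threshold)
    if x >= limit:
        return "red"     # assumed: significantly up-regulated
    if x <= -limit:
        return "green"   # assumed: significantly down-regulated
    return "neutral"


x, y = volcano_point(4.0, 0.01)
print(x, y, volcano_color(4.0, 0.01))
```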
- 11 -
1-3-3. Venn diagram
Venn diagrams for all possible logical relations between 2, 3 or 4 samples can be created. To draw
a Venn Diagram, select Sample Comparison first. Then set the Fold Change value and p-value and
click the ‘Diagram View’ (Figure 1-12). Up to 4 sample comparisons can be selected.
Figure 1-12. Analysis Graph Tool – Venn Diagram
The numbers shown in the Venn diagram results (Figure 1-13) indicate the following, based on the
preset conditions: 1) an italic number is the number of up-regulated genes, 2) a red number is the
number of genes that changed in opposite directions between sample comparisons (contra-regulated),
and 3) an underlined number is the number of down-regulated genes.
Figure 1-13. An Example of Up, Down, and Contra-regulated in Venn Diagram
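The up, down, and contra-regulated counts for two sample comparisons can be sketched as below. This is an illustration under stated assumptions, not ExDEGA's algorithm: only the fold change cutoff is applied (the p-value filter is omitted for brevity), the region names are hypothetical, and the fold change values are made up.

```python
def direction(fold_change, fold_cut=2.0):
    """Return 'up', 'down', or None for a linear-scale fold change."""
    if fold_change >= fold_cut:
        return "up"
    if fold_change <= 1.0 / fold_cut:
        return "down"
    return None


def venn_counts(fc_first, fc_second, fold_cut=2.0):
    """Count up/down/contra genes per Venn region for two comparisons."""
    regions = ("first_only", "both", "second_only")
    counts = {r: {"up": 0, "down": 0, "contra": 0} for r in regions}
    for gene in fc_first.keys() & fc_second.keys():
        d1 = direction(fc_first[gene], fold_cut)
        d2 = direction(fc_second[gene], fold_cut)
        if d1 and d2:
            # Changed in both comparisons: same direction or opposite.
            counts["both"][d1 if d1 == d2 else "contra"] += 1
        elif d1:
            counts["first_only"][d1] += 1
        elif d2:
            counts["second_only"][d2] += 1
    return counts


# Illustrative fold changes for the comparisons B/A and C/A.
b_vs_a = {"G1": 3.0, "G2": 0.3, "G3": 2.5, "G4": 1.1, "G5": 0.4}
c_vs_a = {"G1": 2.2, "G2": 4.0, "G3": 1.0, "G4": 0.2, "G5": 0.45}
counts = venn_counts(b_vs_a, c_vs_a)
print(counts)
```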
- 12 -
To see which genes correspond to a region of the Venn diagram, place the mouse cursor on that
region and right-click. For example, to see the genes that are up-regulated only in B/A,
right-click the B/A area of the Venn diagram and select ‘Up-regulated’. In this example, three
genes are filtered in the Excel spreadsheet (Figure 1-14).
Figure 1-14. Filtering 2fold Up-regulated Gene List in Venn Diagram
All images provided by ExDEGA can be saved by right-clicking the plot or Venn diagram and
selecting 'Save image' (Figure 1-15).
Figure 1-15. Saving Image
- 13 -
1-4. Clustering Heatmap Support
The DEG Analysis tool of ExDEGA supports data mining through Significant Gene Selection or the Venn
diagram, and it can easily create a clustering heatmap for the resulting gene list.
The recommended clustering heatmap program is MeV. ExDEGA can automatically generate an input
file that can be imported into MeV; details on how to create a clustering with the MeV software are
described in 2. Clustering Heatmap (MeV Software) on page 15.
To create the clustering heatmap input file for the filtered gene list, two types of data can be
used (Figure 1-16). First, to use fold change values, check ‘Fold change’ under Type and check the
sample comparisons under Export Data Select, then click ‘Data Export’ and save the result as a
tab-delimited text file. Second, to use expression values (normalized data), check ‘Z-score’ and
follow the same steps. The z-score, which indicates how far a value is from the mean in units of
standard deviation, is available only when three or more samples are selected.
The formula for calculating the standard score (z-score) is given below:
Z-score = {Normalized data (log2) – average of Normalized data (log2)} / standard deviation of
Normalized data (log2)
Figure 1-16. Clustering Heatmap Support
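The z-score formula can be applied to one gene's normalized values across samples as in the sketch below. Using the sample standard deviation (n - 1 denominator) is an assumption, since the manual does not say which form ExDEGA uses; the function name z_scores and the example values are also illustrative.

```python
import statistics


def z_scores(normalized_values):
    """Z-score of each sample value for one gene (log-scale input)."""
    if len(normalized_values) < 3:
        # The manual states that z-scores need three or more samples.
        raise ValueError("z-score requires three or more samples")
    mean = statistics.mean(normalized_values)
    sd = statistics.stdev(normalized_values)  # assumption: sample std. dev.
    if sd == 0:
        return [0.0] * len(normalized_values)  # flat profile, no variation
    return [(value - mean) / sd for value in normalized_values]


row = [4.0, 5.0, 6.0]  # illustrative normalized data for one gene
z = z_scores(row)
print(z)
```

Switching to statistics.pstdev would give the population form instead; the resulting heatmap pattern is the same up to a constant scale factor per gene.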
- 14 -
1-5. Selected Gene Plot & Gene Search
The ‘Selected Gene Plot’ tool draws a graph of the expression patterns of selected genes. Genes
filtered by the Significant Gene Selection settings and genes chosen by the researcher can both be
used. To create the plot, copy the ID list of the selected genes, paste it into the Selected Gene
Plot window, and click ‘Expression Plot View’. Two plots pop up, one showing the normalized data
(log2) and one showing the fold change (log2) values (Figure 1-17).
‘Gene Search’ searches for specific keywords. For example, if you enter ‘insulin’ in the gene
search box, all genes that contain the word ‘insulin’ are automatically found and filtered in the
Excel data sheet (Figure 1-18).
Figure 1-17. Gene Graph
Figure 1-18. Genes Related to Insulin
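A keyword search of this kind amounts to a substring match over the annotation columns. The sketch below assumes a case-insensitive match over a symbol and a description field; the field names, the function search_genes, and the matching rule are assumptions, and the rows are a small illustrative table.

```python
# Small illustrative annotation table.
genes = [
    {"symbol": "INS", "description": "insulin"},
    {"symbol": "IGF1", "description": "insulin like growth factor 1"},
    {"symbol": "INSR", "description": "insulin receptor"},
    {"symbol": "ACTB", "description": "actin beta"},
]


def search_genes(rows, keyword):
    """Return symbols whose symbol or description contains the keyword."""
    needle = keyword.lower()
    return [
        row["symbol"]
        for row in rows
        if needle in row["symbol"].lower() or needle in row["description"].lower()
    ]


hits = search_genes(genes, "insulin")
print(hits)
```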
- 15 -
2. Clustering Heatmap (MeV Software)
MeV, developed by the Dana-Farber Cancer Institute in the United States, is a free program for
analyzing microarray and mRNA-Seq data. It provides clustering and statistical analyses (K-means
clustering, Hierarchical analysis, t-test, Significance Analysis of Microarrays, Gene Set
Enrichment Analysis, and EASE). Visit the website to download the latest programs and manuals
(MeV software download website: https://sourceforge.net/projects/mev-tm4/).
To use MeV, 1) download MeV, 2) unzip the file, and 3) run the installer, ‘MeV’ or ‘TMEV’ (Figure
2-1). Then confirm that three windows appear when the MeV program opens, as shown in Figure 2-2.
Data analysis is performed in the ‘Multiple Array Viewer’. To create one, click ‘File’ and ‘New’
on the ‘MultiExperiment Viewer’ bar. Several Multiple Array Viewers can be created.
Figure 2-1. Installation File for MeV Program
- 16 -
Figure 2-2. MeV Program Windows
Clustering analysis can be performed with MeV. The input file saved automatically by ‘Clustering
Heatmap Support’, as described on page 13, can be used directly. Alternatively, you can list the
genes that you want to cluster yourself: open a new Excel file, then copy and paste the list of
gene names together with their fold change values or normalized values. The file must be saved as
a ‘tab-delimited text file’ (Figure 2-3) and is limited to 20,000 genes. Depending on the number
of samples, a file with about 15,000 genes may still fail to be analyzed.
Figure 2-3. An Example of Data Format
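A tab-delimited input file can also be written programmatically. The sketch below is an assumption about the layout (one header row with a gene column followed by sample columns, then one row per gene), because Figure 2-3 is not reproduced here; the column label "Gene", the function name, and the values are illustrative. It writes to an in-memory buffer, and an open text file could be passed instead.

```python
import csv
import io

MAX_GENES = 20000  # limit stated in the manual


def write_tab_delimited(handle, sample_names, rows, max_genes=MAX_GENES):
    """Write a gene-by-sample table as tab-delimited text."""
    if len(rows) > max_genes:
        raise ValueError(f"too many genes: {len(rows)} > {max_genes}")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Gene"] + list(sample_names))  # assumed header layout
    for symbol, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{symbol}: expected {len(sample_names)} values")
        writer.writerow([symbol] + [f"{value:.3f}" for value in values])


buffer = io.StringIO()
write_tab_delimited(
    buffer,
    ["B/A", "C/A"],
    [("GENE_A", [1.25, -0.5]), ("GENE_B", [-2.0, 0.75])],
)
text = buffer.getvalue()
print(text)
```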
- 17 -
After the input data is saved, click ‘File’ and ‘Load Data’ on the ‘Multiple Array Viewer’ of the MeV
program (Figure 2-4). Click ‘Browse’ and select the input file.
Figure 2-4. Data Uploading Method
Click ‘Analysis’, ‘Clustering’, and ‘HCL’ (Figure 2-5).
Figure 2-5. Hierarchical Clustering Selection
- 18 -
Various options for clustering analysis can be selected (Figure 2-6). ‘Gene Tree’ creates a cluster of
genes that have similar fold change or normalized values. ‘Sample Tree’ creates a cluster of samples
that show similar aspects of the gene expression. Among many options, ‘Euclidean Distance’ and
‘Average linkage clustering’ have been widely used for the clustering analysis in research. After the
setup is complete, click ‘OK’.
Figure 2-6. Hierarchical Clustering Options
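To make the two commonly used options concrete, the following is a naive sketch of agglomerative clustering with Euclidean distance and average linkage. It is written for readability on a handful of genes and is not MeV's implementation, which may differ in details such as tie handling; the function names and the example profiles are illustrative.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Merge clusters step by step; return (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            total = sum(euclidean(profiles[i], profiles[j]) for i in a for j in b)
            dist = total / (len(a) * len(b))
            if best is None or dist < best[0]:
                best = (dist, a, b)
        dist, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
    return merges


# Illustrative fold change profiles for four genes in two comparisons.
profiles = {"G1": [0.0, 0.0], "G2": [0.0, 1.0], "G3": [5.0, 5.0], "G4": [5.0, 7.0]}
merges = average_linkage(profiles)
for a, b, dist in merges:
    print(a, b, round(dist, 3))
```

The merge distances correspond to the branch heights of the tree: the two tight pairs join first at short distances, and the final merge between the pairs happens at a much larger distance.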
- 19 -
The HCL clustering result appears in the list on the left, and the HCL tree is displayed on the
right when you click ‘HCL Tree’ (Figure 2-7). In the example of Figure 2-7, the tree diagram at
the top is the result of sample clustering and the tree diagram on the left is the result of gene
clustering. Each tree diagram has a distance scale bar for measuring branch length. A shorter
distance indicates that the expression patterns of the genes or samples are more similar, whereas
a longer distance indicates that they are more different.
Figure 2-7. A Result of Hierarchical Clustering
The size and color of the clustering image can be modified (Figure 2-8).
Figure 2-8. Clustering Size Option
- 20 -
The range of the color scale bar (lower limit, midpoint value, and upper limit) can be set by
clicking ‘Display’ and ‘Set Color Scale Limits’ (Figure 2-9). Generally, the lower and upper limits
are set to the same absolute value and the midpoint value is set to 0, as illustrated in Figure
2-9. Up-regulated gene expression is shown in red and down-regulated gene expression is shown in
blue.
Figure 2-9. Color Scale Option
To save the image, click ‘File’ and ‘Save Image’. The file name must include a file extension,
such as .jpg (Figure 2-10).
Figure 2-10. Saving Clustering Image
- 21 -
3. Pathway Analysis (KEGG Mapper)
Pathway analysis using KEGG Mapper helps to find specific pathways related to the genes obtained
from NGS, microarray, or antibody array results. The procedure for using the KEGG mapping tool is
described in Figure 3-1.
Figure 3-1. Process of Pathway Analysis by Using KEGG Mapper
Pathway analysis is simple with ExDEGA. Figure 3-2 shows how to import into KEGG Mapper the data
for genes selected with a 2-fold change and normalized data (log2) > 4. The KEGG input values are
located between the ‘Raw Data(RC)’ and ‘Annotation’ columns. First, specify the genes by setting
‘Fold change’, ‘Normalized Data’, and ‘p-value’ (the p-value is available only when replicates
were run) in ‘Significant Gene Selection’ in the right filter, then click the sample comparison to
apply the settings. Afterward, copy the KEGG input data, both the Entrez ID and the FC Color
(black colored #Number), for use in KEGG Mapper.
Figure 3-2. Process of Making KEGG Mapper Input Data in ExDEGA
Figure 3-1 steps:
1) Copy the Entrez ID column and the fold change color column.
2) Open the KEGG Mapper website (Search&Color Pathway):
http://www.genome.jp/kegg/tool/map_pathway2.html
3) Paste the copied items and run the pathway search.
4) Check the pathway results and search for pathways of interest.
- 22 -
The steps for pathway analysis in KEGG Mapper are as follows: 1) visit the KEGG Mapper website
(http://www.genome.jp/kegg/tool/map_pathway2.html), 2) enter the organism code in ‘Search against’
(if you do not know the organism code, click ‘org’ and search for it as described in Figure 3-3),
3) select ‘KEGG identifiers’ as the primary ID, 4) paste the Entrez ID and color data copied from
‘KEGG input’ in ExDEGA into the 'Enter objects one per line followed by bgcolor, fgcolor' box,
5) check ‘Include Aliases’ and ‘Use uncolored diagrams’, and 6) click ‘Exec’.
Figure 3-3. Process of Setting Up KEGG Mapper
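If the KEGG input column is not at hand, equivalent lines can be built from a table of fold changes. The sketch below is an assumption about the input format (one gene ID per line followed by a background color) and uses color names rather than the #Number values exported by ExDEGA; red for up-regulated and green for down-regulated genes follows the convention described for the pathway map. The Entrez IDs and fold changes are illustrative.

```python
def kegg_color_lines(fold_changes, fold_cut=2.0):
    """Build 'ID color' lines for genes passing the fold change cutoff."""
    lines = []
    for entrez_id, fold_change in fold_changes.items():
        if fold_change >= fold_cut:
            color = "red"      # up-regulated
        elif fold_change <= 1.0 / fold_cut:
            color = "green"    # down-regulated
        else:
            continue           # unchanged genes are left out
        lines.append(f"{entrez_id} {color}")
    return lines


# Illustrative Entrez IDs with made-up fold changes.
example = {"7157": 2.6, "3630": 0.3, "60": 1.1}
lines = kegg_color_lines(example)
print("\n".join(lines))
```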
The ‘Pathway Search Result’ from KEGG Mapper is displayed as illustrated in Figure 3-4. The listed
pathways are those related to the genes you entered, and the number beside each pathway name is
the number of input genes found in that pathway. Click the number to check the genes. Click the
name of a pathway you are interested in to draw its pathway map. Red indicates up-regulated genes
and green indicates down-regulated genes. To save the image of the pathway map, right-click it and
select ‘Save As’. To keep the items linked to the image, save the page as ‘HTML’.
- 23 -
Figure 3-4. Pathway Search Result in KEGG Mapper
- 24 -
4. Functional Annotation Analysis (DAVID)
DAVID provides a comprehensive set of functional annotation tools, based on numerous databases,
for understanding the biological meaning of genes derived from NGS, microarray, or antibody array
results. The process is described in Figure 4-1.
Figure 4-1. Process of Functional Annotation Analysis by Using DAVID
Because DAVID cannot analyze more than 3,000 genes, select 3,000 genes or fewer first.
Significantly differentially expressed genes extracted from mRNA-Seq data in ExDEGA can also be
used, as in Chapters 2 and 3. Visit the DAVID homepage (http://david.abcc.ncifcrf.gov/) and click
‘Functional Annotation’ (Figure 4-2).
Figure 4-2. DAVID Homepage
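Before pasting a list into DAVID, it can help to clean it and check it against the 3,000-gene limit. This is a minimal sketch; the function name prepare_gene_list and the choice to drop blank entries and exact duplicates are assumptions, not part of DAVID or ExDEGA.

```python
DAVID_MAX_GENES = 3000  # limit stated in the manual


def prepare_gene_list(symbols, max_genes=DAVID_MAX_GENES):
    """Strip, de-duplicate, and size-check a gene list; return paste text."""
    cleaned = []
    seen = set()
    for symbol in symbols:
        symbol = symbol.strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            cleaned.append(symbol)
    if len(cleaned) > max_genes:
        raise ValueError(f"{len(cleaned)} genes exceed the limit of {max_genes}")
    return "\n".join(cleaned)  # one gene per line


paste_text = prepare_gene_list(["TP53", " INS", "TP53", ""])
print(paste_text)
```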
Website access
• http://david.abcc.ncifcrf.gov/
• Click ‘Functional Annotation’.
Steps 1 to 4
• Copy and paste the gene list (gene symbol, GenBank No., or others).
• Select Identifier ---> check ‘Gene List’ ---> click ‘Submit List’.
Database check
• Click ‘Chart’ for Gene Ontology, Pathway, or other DBs.
• Identify the term of interest in the ‘Chart’ and its corresponding genes.
- 25 -
Step 1: Enter Gene List. Copy the list of ‘Gene Symbol’ values from ExDEGA (or GenBank numbers,
if you have them) and paste it into the ‘A: Paste a list’ box. Step 2: Select Identifier. Select
‘OFFICIAL_GENE_SYMBOL’ (or ‘GENBANK_ACCESSION’ if GenBank numbers are used). Step 3: List Type.
Check ‘Gene List’. Step 4: Submit List. Click ‘Submit List’. Finally, read the popup message and
click ‘OK’ (‘확인’ on a Korean-language system) to confirm it.
Figure 4-3. Process of Functional Annotation Analysis in DAVID
If the species is not found under ‘Current Background’ as shown in Figure 4-4, select the correct
species with the ‘Population Manager’ on the ‘Background’ page and click ‘Use’.
Only the number of genes shown in ‘Species(number)’ on the ‘List’ page is used in the functional
annotation analysis, because only those genes are identified in the database even if more genes
were entered.
- 26 -
Figure 4-4. Specifying Species Information
To review the results, select one of the categories, click ‘+’, click ‘Chart’, then click one of
the ‘Terms’ in the popup window. Figure 4-5 shows an example for ‘Gene_Ontology’: click ‘+’ beside
‘Gene_Ontology’ and click ‘Chart’ for ‘GOTERM_BP_FAT’. The relevant biological processes, 55 chart
records in this case, pop up. In the new window, clicking one of the terms opens QuickGO to
display its information. The genes related to a GO term can be identified by clicking the bar in
the ‘Genes’ column.
- 27 -
Figure 4-5. Results of Gene Ontology Analysis
The ‘Pathways’ results are checked in the same way (Figure 4-6). Click ‘+’ beside ‘Pathways’ and
click ‘Chart’ for ‘KEGG_PATHWAY’. A list of relevant pathways, 2 pathways in this case, pops up.
In the new window, click one of the pathways to see its image. A red star in the image marks a
gene that you entered in the previous step. Click the gene to see its details.
Figure 4-6. Results of Pathway Analysis
- 28 -
DAVID is useful for GO and pathway analysis. However, because DAVID uses only the input data, it
cannot produce GO or pathway results when the number of genes, either the number entered or the
number matched to a term, is too small. By default, DAVID reports a term only when it contains at
least two input genes and its EASE score is 0.1 or lower. These criteria can be adjusted in
‘Option’. The ‘Help and Tool Manual’ for DAVID is located at the top of the window, as shown in
Figure 4-7.
Figure 4-7. Help and Tool Manual for DAVID Tool
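DAVID describes the EASE score as a modified Fisher exact p-value, in which one gene is removed from the list hits before the one-sided test is computed. The sketch below follows that description using a hypergeometric upper tail; the function names are hypothetical, and the counts in the usage example (3 of 300 list genes in a term that holds 40 of 30,000 background genes) are illustrative.

```python
from math import comb


def fisher_upper_tail(list_hits, list_total, pop_hits, pop_total):
    """One-sided Fisher exact p-value: P(X >= list_hits), hypergeometric."""
    upper = min(list_total, pop_hits)
    numerator = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(list_hits, upper + 1)
    )
    return numerator / comb(pop_total, list_total)


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """EASE score: the same test with one list hit removed (more conservative)."""
    return fisher_upper_tail(max(list_hits - 1, 0), list_total, pop_hits, pop_total)


p_fisher = fisher_upper_tail(3, 300, 40, 30000)
p_ease = ease_score(3, 300, 40, 30000)
print(f"Fisher exact: {p_fisher:.4f}, EASE score: {p_ease:.4f}")
```

Because one hit is removed, the EASE score is always at least as large as the plain Fisher p-value, which penalizes terms supported by very few genes.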
- 29 -
5. Gene Set Enrichment Analysis (GSEA) Analysis (MSigDB)
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a
priori defined set of genes shows statistically significant, concordant differences between two
biological states. The analysis process is shown in Figure 5-1.
Figure 5-1. Process of GSEA Analysis
Visit MSigDB, click ‘Investigate Gene Sets’, and enter a registered email address to log in.
If you have not registered, registration must be completed first in order to view the MSigDB gene
sets and/or download the GSEA software (Figure 5-2 and Figure 5-3).
Figure 5-2. GSEA Main Page
Website access
• http://software.broadinstitute.org/gsea/msigdb/index.jsp
• Click ‘Investigate gene sets’ in the left menu ---> enter your email and click ‘login’.
Enter gene list
• Gene Identifiers: copy and paste the gene list (gene symbol or Entrez Gene ID).
• Select the DB you want under Compute Overlaps ---> after selecting options, click ‘compute
overlaps’.
Analysis results
• Check the enriched functions and pathways, and save them as Excel.
• Gene/gene set overlap matrix
- 30 -
Figure 5-3. GSEA Login Page
Enter the list of genes (Gene Symbol, Entrez Gene ID, or public ID) in ‘Gene Identifiers’, check
the DB of interest under ‘Compute Overlaps’, and click ‘compute overlaps’ at the bottom (Figure
5-4). For more information on the gene sets in a DB, click the DB name shown in blue.
Figure 5-4. Investigating Gene Sets
After the analysis is complete, the results of GSEA analysis (Gene Set and Gene/geneset overlap
matrix) are available as shown in Figure 5-5 and Figure 5-6.
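The two result views can be pictured with a small sketch: a list of gene sets that share genes with the query, and a gene-by-gene-set overlap matrix. The gene set names and members below are hypothetical and are not MSigDB collections, and ranking by overlap size is a simplification chosen for this example rather than the statistic MSigDB reports.

```python
def compute_overlaps(query_genes, gene_sets):
    """Return (set name, set size, shared genes), largest overlap first."""
    query = set(query_genes)
    rows = []
    for name, members in gene_sets.items():
        shared = query & set(members)
        if shared:
            rows.append((name, len(members), sorted(shared)))
    rows.sort(key=lambda row: len(row[2]), reverse=True)
    return rows


def overlap_matrix(query_genes, gene_sets):
    """Map each query gene to True/False membership per gene set."""
    return {
        gene: {name: gene in members for name, members in gene_sets.items()}
        for gene in query_genes
    }


# Hypothetical gene sets for illustration only.
gene_sets = {
    "SET_GLUCOSE": {"INS", "INSR", "GCK", "SLC2A4"},
    "SET_APOPTOSIS": {"TP53", "BAX", "CASP3"},
    "SET_UNRELATED": {"ACTB"},
}
query = ["INS", "INSR", "TP53", "MYC"]
overlaps = compute_overlaps(query, gene_sets)
matrix = overlap_matrix(query, gene_sets)
print(overlaps)
```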
- 31 -
Figure 5-5. A Result of GSEA Analysis (Gene Set)
Figure 5-6. A Result of GSEA Analysis (Gene/Gene-set Overlap Matrix)
- 32 -
6. Protein-Protein Interaction (PPI) Analysis (STRING)
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a database which statistically
analyzes known and predicted protein-protein interactions. The interactions include physical and
functional associations to build and analyze interactome networks. The analysis process is shown
in Figure 6-1.
Figure 6-1. Process of Analysis of Protein-Protein Interaction by STRING
Before using the web-based STRING, note that it can analyze only fewer than 100 genes at a time.
ExDEGA is designed to make it easy to prepare meaningful data for protein-protein interaction
analysis. In ExDEGA, sort and select the genes that you want to analyze, then copy their gene
symbols or Entrez Gene IDs. Visit the STRING homepage (http://string-db.org/), click ‘Multiple
proteins’, and paste the list into the ‘List Of Names’ box. Select the scientific name of the
species under ‘Organism’, then click ‘Search’ (Figure 6-2).
Website access
• http://string-db.org/
• Click ‘Multiple proteins’.
Input gene list
• Copy and paste the gene list (gene symbol or Entrez Gene ID; fewer than 100 genes).
• Enter the organism (e.g., Homo sapiens, Mus musculus) ---> click ‘Search’.
Network & analysis
• Click ‘Continue’ ---> network construction ---> check the results.
• Click ‘Analysis’ ---> check the enriched functions, interactions, etc.
- 33 -
Figure 6-2. Multiple Proteins Search
As shown in Figure 6-3, confirm that the listed proteins match the input genes. If there are no
problems, click ‘Continue’ to proceed.
Figure 6-3. Gene Confirmation Steps
When the analysis is complete, a network like the one in Figure 6-4 is displayed. It is the network
built from the STRING DB. To check ‘Functional enrichments in your network’ as in Figure 6-5,
click ‘Analysis’. To view all items with an FDR below 0.05, click ‘More’.
- 34 -
Figure 6-4. A Result of STRING Network
Figure 6-5. A Result of Functional Enrichments
- 35 -
If you click any of the interesting functions in the result of ‘Functional enrichments in your network’,
genes will be displayed with red color on the network figure (Figure 6-6). To get more details about
the gene that you are interested in, click one of the genes on the network figure (Figure 6-7).
Figure 6-6. Selection One of Functions
Figure 6-7. Gene Details
- 36 -
The ‘Legend’ tab provides a detailed description of the nodes, edges, and input genes (Figure 6-8).
To save network image and genetic information, click the ‘Tables Exports’ (Figure 6-9).
Figure 6-8. Legend of Network
Figure 6-9. Exporting of Network