EXPRESSION ANALYSIS - IASRI


EXPRESSION ANALYSIS

Sunil Archak NBPGR, New Delhi-110 012

JMP Genomics offers a comprehensive solution for expression analysis. It includes a number of analytical processes (APs) for Quality Control, Normalization, Pattern Discovery and Row-by-Row Modeling, which together form a standard pipeline for the analysis of expression data. Thirty-eight individual analytical processes are grouped within a series of submenus under the main Genomics menu.

The Quality Control submenu provides ten processes (Distribution Analysis, Correlation and Principal Components, Correlation and Grouped Scatterplots, Filter Intensities, Feature Flagger, Effect Removal via PLS Normalization, Missing Value Imputation, Pseudo Image, Ratio Analysis, Surface Summary) that assess the quality of raw data. The Normalization submenu provides nine processes (Data Standardization, ANOVA Normalization, Mixed Model Normalization, Batch Normalization, Batch Scoring, Loess Normalization, Factor Analysis Normalization, Partial Least Squares Normalization, Quantile Normalization) that help normalize data. The Pattern Discovery submenu provides eight processes (Hierarchical Clustering, K-Means Clustering, Principal Components Analysis, Plot Intensities, Cross Correlation, Distance Matrix, Multidimensional Scaling, Partial Correlation Diagram) that investigate the nature and magnitude of, and the underlying reasons for, the relationships among the observations in the data. The Row-by-Row Modeling submenu provides eleven processes (One-Way ANOVA, ANOVA, Mixed Model Analysis, Survival Analysis, Estimate Builder/Compare Means, Difference Chooser, Two-Way Plotter, P-Value Adjustment, P-Value Quantile Plotter, Chromosome Color Theme, P-Value Browser) that help fit statistical models.

The Basic Expression Workflow is organized as a series of links, without compromising on options or flexibility. Although researchers analyzing expression data often need independent APs such as hierarchical clustering, ANOVA and K-means clustering, first-time users may best appreciate the power of JMP Genomics by following the Basic Expression Workflow. Once familiar with it, advanced users may investigate the Expression QC workflow and the Expression Statistics workflow, which offer greater flexibility and more options.

The following section familiarizes the reader with the features of the Basic Expression Workflow, using option (v) below for hands-on experience. The programme offers multiple options at every stage, from choosing input files to preparing, inspecting and analyzing expression data. Input data files can be selected in several ways:

(i) by using the Choose button to select input data files that are already formatted as JMP Genomics requires;
(ii) by importing data from a source in the lab that has generated Affymetrix or Illumina raw data file sets;
(iii) by downloading directly from a vendor's web site;
(iv) by using the Affymetrix project wizard; or
(v) by simply loading the sample data files or one's own files stored in the library.

All files must be in SAS data set format. The File menu contains options for saving and opening data sets. To save a data set in JMP Genomics, use the standard File > Save As... option. In the “Save as type” box, choose SAS Data Set. Keep the default options for the three checkboxes.

The Input files

For expression analysis, JMP Genomics requires two files: a design file and a data file. The design file contains information about the sample attributes. It should record all the information about the experiment, including technical, experimental and field/lab variables. Including this type of information makes it easier to understand the sources of variance in the experiment when the quality control processes are run. The design file has two required columns (variables). The Array column is numeric and holds a unique number for each array. The ColumnName column holds a unique identifier for each array. JMP Genomics provides tools for creating these variables in the Experimental Design submenu.

The Procedure

When Illumina or Affymetrix data are imported as described above, a pop-up window appears that can be used to load the Basic Expression Workflow automatically. Clicking “Basic Expression Workflow” populates the dialog with the experimental design file, the data file and the output directory (the location where the input data set is stored). One may elect to change the output directory. There are eight tabs in the Basic Expression Workflow.
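The two design-file requirements described under The Input files (a numeric, unique Array column and a unique ColumnName column) can be checked before the file is loaded. The following is only a minimal sketch in Python, not part of JMP Genomics: the function name check_design, the sample rows and the extra Treatment and Batch variables are all made up for illustration.

```python
def check_design(rows):
    """Return a list of problems found in a design table (a list of dicts)."""
    problems = []
    for column in ("Array", "ColumnName"):
        if any(column not in row for row in rows):
            problems.append(f"missing required column: {column}")
    if problems:
        return problems

    arrays = [row["Array"] for row in rows]
    names = [row["ColumnName"] for row in rows]
    # Array must be numeric (bool is excluded because it is a subclass of int).
    if not all(isinstance(a, (int, float)) and not isinstance(a, bool) for a in arrays):
        problems.append("Array must be numeric")
    if len(set(arrays)) != len(arrays):
        problems.append("Array values must be unique")
    if len(set(names)) != len(names):
        problems.append("ColumnName values must be unique")
    return problems


# Made-up design table; Treatment and Batch are illustrative extra variables.
design = [
    {"Array": 1, "ColumnName": "ctrl_rep1", "Treatment": "Control", "Batch": "A"},
    {"Array": 2, "ColumnName": "ctrl_rep2", "Treatment": "Control", "Batch": "B"},
    {"Array": 3, "ColumnName": "trt1_rep1", "Treatment": "Treatment 1", "Batch": "A"},
    {"Array": 4, "ColumnName": "trt1_rep2", "Treatment": "Treatment 1", "Batch": "B"},
]

print(check_design(design))
```

An empty list means both required columns are present and well formed. Recording technical variables such as Batch alongside the experimental ones is what later allows their contribution to the variance to be examined.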

The General tab

Fill in the “Study Name”.
Click Open to inspect the input file.
From the available variables, select Probe_set_ID for By Variables, Label Variables and Variables to Keep in Output ….

The Experimental Design tab

Click Open to inspect the design file.
From the Available Variables, populate Color Variables (the variables by which to color QC plots and grouped sets of scatterplots), Label Variable and Variable Defining Plotting Groups (used for categorization of the results and calculation of variance components).
In the Variance Component Effects field, enter Color Variables<space>Variable Defining Plotting Groups<space>Color Variables*Variable Defining Plotting Groups, that is, the two chosen variables separated by spaces, followed by their interaction. Any variance component effects specified on this tab will be analyzed for their contribution to the overall variation in the data set.

Leave the rest of the fields blank.

The QC and Normalization tab

Select all three QC processes (Distribution Analysis, Correlation and Principal Components, Correlation and Grouped Scatterplots) to gather quality measures for the input data. For the Principal Components Analysis portion of the Correlation and Principal Components QC procedure, you can either include as many principal components as are necessary to explain a specified proportion of variance, or specify the number of principal components directly. The Cumulative Proportion of Variation to Explain with Principal Components is best left at its default of 0.9, because this limit is usually more than enough to give a high-level understanding of the sources of variation. Select the Normalization Method to be used (if necessary). Note that one can opt to Run Analysis Above for post-import normalization, if desired, and choose when to run the QC processes: before normalization, after it, or both.

The ANOVA tab

Analysis of variance is the stage where one sets up the model and desired comparisons. Although it is run along with quality control in the workflow, ideally one would like to confirm the data quality prior to the ANOVA.

The Class Variables box should contain the variables whose levels are to be compared, along with any other variable likely to affect expression levels. Class Variables are always categorical: a treatment variable may take values such as Control, Treatment 1 and Treatment 2, but not continuous values such as 0, 0.15 or 0.30. For instance, the variables selected to calculate Variance Component Effects on the Experimental Design tab may be used.

In the Model these Fixed Effects box, specify the variables that may affect expression values. At least one Class Variable must be specified as a Fixed Effect. Other fixed effects can be included, as can continuous variables. Interaction and nested effects can also be specified, using the syntax A*B and A(B), respectively. JMP Genomics automatically calls the proper process depending on the complexity of the model.

The Adjust Variability for these Random Effects field specifies a variable that affects the expression analysis and contributes unwanted variance. For example, arrays that are run in batches over a period of time may differ because of a batch effect. Batch can be listed as a Class Variable, and the variance due to the batch effect will be removed when it is listed as a Random Effect in the model.

The Separate and Journal Results by Chromosome option is useful for large arrays, as it allows the results to be interrogated separately for each chromosome. To make the most of this option, however, a chromosome variable must be specified either in the input data set on the General tab or in the annotation file on the Annotation tab.

The LSMeans tab

This tab contains options for forming pairwise comparisons and volcano plots. In the Estimate LSMeans for these Fixed Effects box, paste the text copied from the Model these Fixed Effects box on the previous tab. Then select a radio button under LSMeans Difference Set for Volcano Plots to decide on what basis the volcano plots are constructed.

If you are forming differences with a control, specify the control value under LSMeans Control Levels; JMP Genomics will choose one by default if none is entered. Note that the value must be enclosed in double quotes and must exactly match the value in the design table.

Optionally, click the Difference Chooser button (under Data Set containing LSMeans Differences to include) to select and/or reverse the differences to be calculated between your experimental groups. This can be useful for trimming the number of comparisons displayed in the output table and for reversing comparisons that are not in the desired order. By default, the direction of each difference is determined by alphabetical order.

The Multiple Testing tab

This tab contains options for controlling the multiple testing correction that is applied to the results. If volcano plots are requested on the LSMeans tab, the correction is applied to all pairwise comparisons. Otherwise, it is applied to the fixed effects and covariates specified on the ANOVA tab. In the Multiple Testing Method drop-down menu, select FDR, a correction that controls the false discovery rate. All other options may be left at their default values.
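To give a feel for what a false discovery rate correction does, the sketch below applies the Benjamini-Hochberg step-up procedure to a handful of p-values. This is an assumption made for illustration: the text above says only that the method is called FDR, and the exact procedure JMP Genomics applies may differ. The function name and the p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up raw p-values for five comparisons.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = bh_adjust(pvalues)
significant = [p for p in adjusted if p <= 0.05]

print(adjusted)
print(len(significant))
```

In this made-up example four raw p-values fall below 0.05, but only two remain below 0.05 after adjustment, which is why the choice of correction matters when thousands of probe sets are tested at once.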

The Annotation tab

Choose the Annotation Data Set, which contains supplemental information such as gene symbols, accession numbers and other identifying information. This step facilitates interpretation of the results by appending the supplemental information to the output tables. The annotation data set can come from array manufacturers, GEO or other sources. If GenBank accession numbers, gene symbols or LocusLink IDs are present in the annotation file, specifying them properly in the dialog on the Annotation tab allows the creation of buttons in the output that link from specific results in plots to the appropriate Web databases. Alternatively, annotation information can be merged after the analysis.

The Tracks tab

The Tracks tab enables the creation of graphical displays of transcript information to enrich positional plots. However, this option is more applicable to exon data.

Results

Basic Expression Workflow results are returned in Journal format. Click Open Workflow Dialog to launch the Workflow Builder. The Workflow to Run box contains a list of settings for the processes used in the Basic Expression Workflow. Any setting can be opened in editing mode, so that the settings for a process can be modified and saved and the analysis rerun, preferably with a new Workflow Folder location.

Quality Control

Distribution: Click DataDistribution in the journal to view the results of this process. The Distribution Details window shows histograms, box plots, quantiles and statistical moments for each row of the experimental design. Opening the distribution plots generates two graphs: a side-by-side box plot and an overlay of the arrays’ kernel density estimates. The Box Plots summary window shows the distributions and the outliers for all the variables in the input data set. Drill down on the box plots by clicking the red triangle to obtain more information, such as quantiles. The resulting values can be converted into a JMP table by right-clicking and selecting Make into Data Table.

The other graph, Overlaid Kernel Density Estimates, is a view of the intensity profiles of the arrays in the data set, with intensity on the x-axis and relative bin size on the y-axis. The window shows the raw univariate distributions for all arrays. If unusually low or high values are detected, they can be modified or removed, thereby deleting genes that do not meet certain criteria. This can be accomplished through the Filter Intensities process. Additionally, any arrays deemed to be outliers can be removed from future analyses by selecting them and then clicking Create Subset Experimental Design Data Set, Excluding Selected Rows.

Data Correlation: Click DataCorrelation in the journal to view the results of this process, mainly a 3D PCA analysis and a hierarchical clustering of correlation coefficients. The chart shows the distribution of the p-values for the test that each pairwise correlation equals zero. The two plots in the Variance Components Charts window illustrate the relative contribution of each component in the analysis to the variance. The top plot shows the overall distribution of the weighted average proportions of variation for all specified variance component effects across the principal components. The bottom plot shows the proportion of variation of all specified variance component effects within each principal component (each effect is represented by a line in the plot). It is used to discover which variance component(s) dominate each principal component, namely those with the larger within-principal-component proportion(s) of variation.

The variance components charts explain the contribution of technical and experimental factors to the overall variation across the data set, as well as the distribution of that variation across the different principal components. The number of principal components appearing in this analysis is determined by the choices made in the Basic Workflow dialog. In this case, nine principal components are required to explain 80% of the variation in the data set. See the scree plot for information about the variation explained by individual components. The 3D PCA plot (Scatterplot 3D) allows the relative effects of three principal components to be compared at the same time. Examination of the 3D PCA plot shows that the points cluster into certain groups. Note that these graphs are highly interactive; selecting a group in the hierarchical cluster highlights the same points in the other plots. Additionally, clicking the red triangle on the Scatterplot 3D window allows you to connect points or draw ellipsoids around points grouped by any design variable.

Heat Map: The question of which principal component is most responsible for this clustering is answered by examining the correlation heat map and studying the labels on its left-hand side. The heat map visualizes the strength of correlation between all pairs of samples. The clustering is unsupervised; positively correlated samples appear in red regions, negatively correlated samples in blue regions, and uncorrelated samples in grey.
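The values behind a correlation heat map are simply the pairwise correlations between samples. The sketch below computes Pearson correlations for three made-up samples in plain Python; the sample names and intensities are invented, and the choice of the Pearson coefficient is an assumption rather than a statement about the coefficient JMP Genomics uses.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up log intensities of four probe sets on three arrays.
samples = {
    "ctrl_1": [1.0, 2.0, 3.0, 4.0],
    "ctrl_2": [1.1, 2.1, 2.9, 4.2],
    "trt_1": [4.0, 3.0, 2.0, 1.0],
}

# One entry per pair of samples, i.e. one cell of the heat map.
corr = {(a, b): pearson(samples[a], samples[b]) for a in samples for b in samples}

for (a, b), value in corr.items():
    print(a, b, round(value, 3))
```

Here the two control arrays correlate strongly and positively (a red cell in the colour scheme described above), while the control and treated arrays are negatively correlated (a blue cell).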

Array Group Correlation: Click ArrayGroupCorrelation in the journal to view the results. This presents scatterplots of the arrays within the groups specified earlier in the Basic Expression Workflow dialog. To make all the windows larger, hold CTRL, grab a corner of a graph and expand it to the desired size. Note that clicking points or drawing boxes around them selects them in the output tables. Each grid displays a set of scatterplots in which data from pairs of samples are plotted against each other. The chart provides a summary view of the correlation between replicate arrays in each treatment (experiment) group.

ANOVA: Click OneWayANOVA in the journal to view the results. A number of graphs are created following the ANOVA analysis. First, volcano plots for all comparisons are shown. A volcano plot summarizes all probe sets for a single comparison, with each point representing a single probe set. The x-axis displays differences (log2 ratios); that is, a difference of 1 corresponds to a twofold change. The y-axis shows the -log10(p-value) for the comparison between the two groups. The dashed red line marks the significance threshold after correction for the false discovery rate. Note that the p-values themselves are not adjusted for the false discovery rate, only the threshold for significance. To adjust the p-values themselves, use P-Value Adjustment, found under Genomics > Row-by-Row Modeling. Highlight points with your mouse to highlight the corresponding rows in the table. You can create a new data table containing only these points using the Tables > Subset function.

The hierarchical clustering of Standardized LSMeans diagram contains the results for all probe sets with at least one significant difference between analysis groups. Highlighting a branch of this graph also highlights the corresponding probe sets in the output table. A subset of the highlighted probe sets can be created as described above.
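The volcano plot axes described above can be reproduced with two small calculations: the fold change implied by a log2 difference, and the -log10 transform of the p-value. The sketch below is illustrative only; the probe set names, differences, p-values and the 0.01 cutoff standing in for the dashed line are all made up.

```python
import math


def volcano_coordinates(log2_difference, p_value):
    """Return the (x, y) position of a probe set on a volcano plot."""
    return log2_difference, -math.log10(p_value)


def fold_change(log2_difference):
    """Fold change implied by a difference on the log2 scale."""
    return 2 ** abs(log2_difference)


# Made-up results: (probe set, log2 difference, raw p-value).
results = [("ps_1", 1.0, 0.001), ("ps_2", -2.0, 0.0001), ("ps_3", 0.2, 0.4)]

cutoff = 0.01  # hypothetical p-value threshold drawn as the dashed line
line_height = -math.log10(cutoff)

above_line = [
    name for name, difference, p in results
    if volcano_coordinates(difference, p)[1] >= line_height
]

print(fold_change(1.0), fold_change(-2.0))
print(above_line)
```

Points above the line with a large absolute difference, such as ps_2 in this example (a fourfold change), are the ones usually followed up first.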

The parallel plots of standardized and unstandardized Least Squares Means also contain only the significant probe sets. With a large number of significant probe sets, these plots can be difficult to interpret. Small numbers of probe sets can be graphed after parsing the tables, or the Data Filter can be used to pare down the plots by selecting samples with differences or p-values in certain ranges for the comparisons of interest. The distributions of r-square and residuals can be helpful in understanding the quality of the results. In general, distributions skewed towards high r-square and low residuals indicate that most of the variance was explained by your model. The opposite finding indicates that there may be sources of variance in your data that are not accounted for in your model.

Parsing Tables: In the ANOVA results table, each probe set is given a significance index of 0 or 1 for each comparison, with 1 indicating significance. These significance indices, together with other variables such as differences, can be used to obtain subset tables. First, click in a results table with the suffix _owa or _ars to make sure it is the active table. Go to Genomics > Annotation Analysis > Venn Diagram. Select up to five significance indices from the list of columns. For two- and three-way diagrams, selecting Proportional Areas causes the regions of the diagram to be drawn with areas proportional to the number of results they contain. Controls for changing colors, labels and label placement are found on the right of the Venn Diagram display.

Other Action Button Functions: The Annotation Summary button in the Action Buttons menu creates a dynamic Web page with links to selected Web sites. The other buttons open one Web page for each selected row, so use them only when a limited number of probe sets is selected.
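The regions of a Venn diagram built from significance indices are just counts of probe sets that share the same combination of 1s. The sketch below tallies those regions in Python for two made-up comparisons; the column names Sig_A_vs_B and Sig_A_vs_C and the table contents are hypothetical and do not reflect the column names JMP Genomics generates.

```python
from collections import Counter


def venn_counts(table, indicators):
    """Count rows by the set of indicator columns that equal 1."""
    counts = Counter()
    for row in table:
        members = frozenset(name for name in indicators if row[name] == 1)
        if members:  # rows significant in no comparison lie outside the diagram
            counts[members] += 1
    return counts


# Made-up results table with two hypothetical significance index columns.
table = [
    {"Probe_set_ID": "ps_1", "Sig_A_vs_B": 1, "Sig_A_vs_C": 1},
    {"Probe_set_ID": "ps_2", "Sig_A_vs_B": 1, "Sig_A_vs_C": 0},
    {"Probe_set_ID": "ps_3", "Sig_A_vs_B": 0, "Sig_A_vs_C": 1},
    {"Probe_set_ID": "ps_4", "Sig_A_vs_B": 0, "Sig_A_vs_C": 0},
    {"Probe_set_ID": "ps_5", "Sig_A_vs_B": 1, "Sig_A_vs_C": 1},
]

counts = venn_counts(table, ["Sig_A_vs_B", "Sig_A_vs_C"])

for members, count in counts.items():
    print(sorted(members), count)
```

The count for the set containing both column names is the overlap region; the single-name sets are the regions unique to each comparison.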

Pattern Discovery

Hierarchical Clustering: Hierarchical clustering can be performed on tall or wide data sets, depending on the desired results. Select Genomics > Pattern Discovery > Hierarchical Clustering. Load the wide data set. Placing one or more experimental variables in the Compare Variables box colors the different groups, which provides a good visual cue in the output for seeing how the samples group. The samples run along the y-axis and the probe sets along the x-axis. The reverse can be done with tall data sets.

K-Means Clustering: K-means clustering groups probe sets according to their patterns. The user can choose how many clusters to create, or the number of clusters can be determined on the basis of the correlation between probe sets. Select Genomics > Pattern Discovery > K-Means Clustering. Select the significant results table for clustering. Either the mean values or the standardized mean values can be used. On the Analysis tab, select Automated Radius K-Means. The correlation radius is a measure of how closely the probe sets within a cluster should correlate. Higher correlations give fewer probe sets per cluster and more clusters; lower correlations give more probe sets per cluster and fewer clusters. Select “Center Rows” if working with mean values, and clear this option if working with standardized mean values. The counts for each cluster and the clusters themselves are presented as part of the results. The probe sets within each cluster can easily be parsed using Rows > Data Filter and selecting the Cluster variable on the output table.

Output Files

The results of an analysis in JMP Genomics do not have to be recreated each time you wish to view them. Every time a process is run, a JSL file is created. This JSL file is linked to the output tables, and running it in JMP Genomics recreates the analysis results.
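The effect of the correlation radius on the number of clusters can be illustrated with a simple greedy grouping. The algorithm behind Automated Radius K-Means is not described here, so the sketch below is a stand-in chosen for illustration only: each probe set joins the first cluster whose founding profile it correlates with at or above the radius, and otherwise starts a new cluster. All names and profiles are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def group_by_radius(profiles, radius):
    """Greedy grouping: join the first cluster whose founder correlates
    with the profile at or above the radius, else found a new cluster."""
    clusters = {}  # founder name -> list of member names
    for name, profile in profiles.items():
        for founder in clusters:
            if pearson(profile, profiles[founder]) >= radius:
                clusters[founder].append(name)
                break
        else:
            clusters[name] = [name]
    return list(clusters.values())


# Made-up mean expression profiles of four probe sets over four groups.
profiles = {
    "ps_1": [1.0, 2.0, 3.0, 4.0],
    "ps_2": [1.0, 2.0, 3.0, 5.0],
    "ps_3": [1.0, 3.0, 2.0, 4.0],
    "ps_4": [4.0, 3.0, 2.0, 1.0],
}

print(group_by_radius(profiles, 0.7))
print(group_by_radius(profiles, 0.9))
```

With these profiles, a radius of 0.7 yields two clusters and a radius of 0.9 yields three, matching the rule of thumb that a higher correlation radius produces more, smaller clusters.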