Microarry Prac

Versions

Aims

This practical involves microarray data analysis using the R language for statistical programming. The commands have been tested on CSE linux machines, such as those found in any of the labs. No testing has been done on other platforms. However, it should be possible to complete this on your own machine if you install R.

Part 1 of this practical involves some basic microarray data analysis using the built-in marray package. Part 2 is on clustering and checking for relevant enriched annotation categories in the clustered microarray data.

This will give you practice in using R and some of the packages desgined for microarray analysis from the BioConductor project, which is rapidly becoming a standard for open source software for such tasks. The objective is simply for you to work through some steps to give you a feeling for the process of using R in this way.

Note: the material in the lecture notes is not sufficient by itself to allow you to complete this practical. You will need to refer to either the on-line R documentation or to the other materials referred to in this specification.

Background

Microarray Data Analysis

Microarray data is unusual in a number of respects (refer to lecture notes). As such, certain methods have been developed for its analysis in order to answer some of the biological questions underlying microarray experiments.

We will start with some basic analysis, then move on to tackle some larger-scale problems of inference from the data.

The R project

R is "Gnu S", i.e. an open source project to develop an implementation of the statistical programming language S (or S-Plus) originally from Bell Labs. The linux version we will use is mainly driven from a command-line interface.

Introduction

The first thing to do is get a feel for the R language. An easy way to start is by running R and going through the some initial chapters of the online documentation. This should take up to an hour depending on how much you experiment.

On one of the CSE linux machines simply type R at the command-line:

% R

... lines of messages ...

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.

>

You are now in the top-level R interpreter. As you will discover, R is basically a functional language with object-oriented features. This means whatever you type at the command-line will be evaluated. And usually, what you type is assigning the results of some expression to some object.

To get the HTML browser interface to the help system type:

> help.start()
Click on the link "An Introduction to R". You will then see a structured set of links. For this prac you should go through the following sections:

It may make sense to go through the sample session early on. You may also need to refer to the sections Function and variable index and Concept index.

Part 1 - Normalization

Getting started

You will need to use R Version 1.8.0 or above. The version of R installed on CSE linux machines is 2.2.1 and this should work with the microarray packages you need.

You will need some specialised microarray packages from the BioConductor project.

R uses a particular method to install and load packages. The easiest way is to install them via the Web:

source("http://www.bioconductor.org/biocLite.R")
biocLite()
biocLite("siggenes")

However, if you have limited disk space, you can access then as follows (this should work on CSE linux machines):

Normalization case study using Swirl zebrafish microarrays

You will follow the steps below to load in a set of microarray included in the marray package. The final steps will generate a series of plots. Once you get these, first save them to a graphics format (see the R manual) and answer the supplied questions with reference to the lecture notes.

For ease of copy and paste the R commands are given without the prompt.

First load the packages you installed using the library() command and set up the swirl data:

% R

...

library(Biobase)
library(multtest)
library(siggenes)
library(limma)
library(marray)

You can find out more about the data set using help (using ?). Note the difference between the second command, which invokes help on the object swirl, and the third command, which invokes the object itself to produce a lot of output:

data(swirl)
?swirl
swirl
datadir <- system.file("swirldata", package="marray")
dir(datadir)

The microarray layout information

Consider first the function read.marrayLayout, which may be used to read in and store information on the layout of spots in a batch of arrays. The main quantities are the dimensions of the grid and spot matrices. In addition, it is useful to keep track of information on the location and nature of control spots, and the printtipgroup and plate origin of the probes. The following command stores such layout information in the object swirl.layout of class marrayLayout. The location of the control spots is extracted from the fourth (ctl.col = 4)) column of the file fish.gal

The file fish.gal is a gal file generated by the program GenePix; it contains information on individual probe sequences, such as gene names, spot ID, spot coordinates.

You need these commands:

swirl.layout <- read.marrayLayout(fname = file.path(datadir,
        "fish.gal"), ngr = 4, ngc = 4, nsr = 22, nsc = 24, skip = 21,
        ctl.col = 4)
ctl <- rep("Control", maNspots(swirl.layout))
ctl[maControls(swirl.layout) != "control"] <- "N"
maControls(swirl.layout) <- factor(ctl)
swirl.layout

This loads information on probe sequences and target samples:

swirl.samples <- read.marrayInfo(file.path(datadir, "SwirlSample.txt"))
swirl.samples
swirl.gnames <- read.marrayInfo(file.path(datadir, "fish.gal"),
        info.id = 4:5, labels = 5, skip = 21)
swirl.gnames

The following commands read in all the Spot} files residing in the datadir directory. The arguments further specify that the red and green foreground intensities are stored under the headings Rmean and Gmean, and that the red and green background intensities are store under the headings morphR and morphG, respectively.

fnames <- dir(path = datadir, pattern = paste(".*", "spot", sep = "."))
swirl.raw <- read.marrayRaw(fnames, path = datadir, name.Gf = "Gmean",
        name.Gb = "morphG", name.Rf = "Rmean", name.Rb = "morphR",
        layout = swirl.layout, gnames = swirl.gnames, targets = swirl.samples)

Perform the microarray normalization

Use these commands:

library("marray", verbose=FALSE)
data(swirl)
maPlate(swirl)<-maCompPlate(swirl,n=384)
swirl.norm <- maNormMain(swirl)
swirl.norm <- maNorm(swirl, norm="p")
swirl.norms <- maNorm(swirl, norm="s")
swirl.normg <- maNormScale(swirl.norm, norm="g")

View the effects of normalization by plotting

To see the effect of normalization, compare the pre-and post-normalization boxplots and MA-plots.
boxplot(swirl[,3], xvar="maPrintTip", yvar="maM", main="Swirl array 93: pre--normalization")

boxplot(swirl.norm[,3], xvar="maPrintTip", yvar="maM", main="Swirl array 93: post--normalization")
 
plot(swirl[,3], main="Swirl array 93: pre--normalization MA--plot")

plot(swirl.norm[,3], main="Swirl array 93: post--normalization MA--plot")

You should save these plots to PostScript, PDF, Jpeg or PNG files for inclusion in your report.

Questions (1 mark each)

1) [box-plots] Look at the M values across different print-tip groups in the pre- and post- normalization box-plots. What is M ? (see lecture notes) Since the print-tip group reflects spatial distribution of sample on the slide, what would you say is the effect of the normalization in this case ?

2) [scatter-plots] Look at the M versus A values across different print-tip groups in the pre- and post- normalization scatter-plots. What is A ? (see lecture notes) Can you explain the variation in the M-A plot ? (see lecture notes) What would you say is the effect of the normalization in this case ?

You should include all 4 plots and the answers to these questions in your report.

Part 2 - Differential expression and Clustering

The aim of this part is to identify a subset of genes which are differentially expressed and search for possible relationships between them using clustering. This part of the work will require generating some output from microarray data using R and BioConductor and evaluating the results from yeast gene annotation from the Saccharomyces Genome Database (SGD).

We will work with a data set taken from a yeast experiment using a signalling pathway. In this domain the Ca2+/calmodulin-dependent protein phosphatase calcineurin is activated by specific environmental conditions including exposure to Ca2+ and Na+, and induces gene expression by regulating the Crz1p/Tcn1p transcription factor.

The microarray data includes measurements taken from yeast cells following addition of Ca2+ and Na+ in addition to a number of other experimental conditions.

First download the calcineurin data to your working directory, then start R and load the required packages using the library() command:

% R

...

library(Biobase)
library(genefilter)

Now the data should be loaded into R. Since it is in comma-separated format there is a built-in command to handle that. We also do some quick checks on what we have loaded in.

marr <- read.csv("calcineurin.csv",header=TRUE)
n <- ncol(marr)
print(n)
m <- nrow(marr)
print(m)

You should see that there are 25 columns and 6102 rows in this data set. We can look at the names of the columns as a further check:

attach(marr)
names(marr)

In order to set the data up for what follows we need to do some more work. First, we select the expression measurements for all genes and samples from the input table as a matrix.

marrall <- marr[1:nrow(marr),2:ncol(marr)]
matmarr <- as.matrix(marrall)
eset <- new('exprSet',exprs=matmarr)
gStr <- marr[,1]
geneNames(eset) <- gStr
show(eset)

Notice the use of eset which is designed to allow the association of gene and sample names with the data matrix of microarray expression measurements.

We would like to filter out the genes which are not involved in the pathway under study. To do this we will use the genefilter library from BioConductor. This involves defining a number of functions, then giving them to the genefilter function to apply in turn to select only the genes passing through the set of filters. To see more details use the help function ?genefilter

There are two basic types of filtering which can be used. The first is non-specific, which applies to genes in terms of all the samples. Here we will use a filter which will remove genes whose expression does not vary much over all the samples. The particular function used is IQR which looks at the range of values for a gene (specifically the "interquartile" range). Only genes with an interquartile range above a certain threshold will pass through the filter.

fnsp <- function(x) (IQR(x, na.rm = T) > 0.5)

The second type of filter is to select genes which are differentially expressed according to some criterion applied to one or more samples. We will use a very simple version of such a criterion on two of the samples. The details are quite involved, but the basic idea is to select genes which show an inhibition of expression under conditions in which this could indicate involvement in the calcineurin pathway.

fnna15 <- function(x) { (!is.na(x[9])) }
cnca15 <- function(x) { (x[9] < -1.0) }
fnna30 <- function(x) { (!is.na(x[10])) }
cnca30 <- function(x) { (x[10] < -1.0) }

Having defined our filter functions, we put them together and apply them to the data to select a subset of genes.

ff <- filterfun(fnsp,fnna15,cnca15,fnna30,cnca30)
selected <- genefilter(exprs(eset),ff)
sum(selected)
esetSub <- exprs(eset)[selected,]

The sum function shows that 93 of the genes have been selected. Clearly this is a significant reduction on the size of the original set. This gives us a more manageable number of genes to which we can apply clustering.

The clustering first requires the construction of a distance matrix using the R built-in function dist. The hierarchical clustering is then done using the complete linkage method (see lecture notes).

d <- dist(esetSub)
hcl <- hclust(d, method = "complete", members=NULL)
plot(hcl)

For another look at our selected genes we can construct and display a heatmap which shows how genes and samples can be clustered together on the basis of similarity of expression.

heatmap(esetSub,
        Rowv=as.dendrogram(hcl),
        col = topo.colors(64),
        margins = c(10,10),
        xlab = "Samples", ylab= "Genes"
)
In this map the dark blue cells show genes which are relatively down-expressed (negative values) and reddish-pink cells are relatively up-expressed (positive values). Intermediate levels from negative to positive pass through green to yellow. White indicates missing values in the data. Notice that samples with similar names are grouped together along the bottom of the heatmap.

Now we shall investigate the properties of some of the genes clustered togther, to see if we can find any possible similar functions. First, we plot the hierarchical clustering again but this time as a dendrogram to give a slightly more regular display. Then we will use the identify function to select lists of genes from the cluster.

plot(as.dendrogram(hcl))
identify(hcl, function(k) print(k))

Once the clustering is displayed, you can left-click in the plot window and the list of genes in the dendrogram below the point clicked on will be written to the terminal. Try selecting a number of points in the dendrogram and see how many genes (yeast ORFs, looking something like this: YGR121C) are printed out. Recall that the higher up in the dendrogram you go, the more genes are in the cluster. See if you can select a cluster with a group of 10-20 genes. To finish identifying genes, right-click in the plot window.

Now check to see if any relevant annotation categories are enriched, i.e., over-represented, for the genes in the cluster you selected. In your web browser go to the FunSpec web site. Copy the genes (select only the names) from the R terminal and paste them into the box at the left hand side of the FunSpec tool. Leave all the parameters at their default setting and click on the Submit Query button. You should then see a results page headed "FunSpec Cluster Interpreter". Scroll down to view the following annotation types headed: MIPS Phenotypes, MIPS Functional Classification, GO Biological Process and GO Molecular Function.

Now you can complete the prac by completing the following and adding the results to your report.

Questions (1 mark for question 3, 2 marks for question 4)

3) [heatmap] Save the heatmap to file and include it in your report.

4) [hierarchical clustering] Find two (2) annotation categories of different types which you can argue are evidence of the following biological properties:

For each category, give the the annotation type, category name, p-value and list of names of genes in that category, and your justification for their relation to the stated properties. If you cannot find two such categories, state why you think no such annotation was found.

Submission

The hand-in for this prac is a written report containing your answers. It is due at 16:59pm Thursday June 7. Reports can be in PDF, PostScript or Word (.doc). Submit the report by give, as follows:

         give bi9010 arrayprac myreport
where myreport is a file called either report.pdf or report.ps or report.doc.

There are 5 marks available. The actual course mark will be scaled out of 4%.