Versions
Part 1 of this practical involves some basic microarray data analysis using the built-in marray package. Part 2 is on clustering and checking for relevant enriched annotation categories in the clustered microarray data.
This will give you practice in using R and some of the packages desgined for microarray analysis from the BioConductor project, which is rapidly becoming a standard for open source software for such tasks. The objective is simply for you to work through some steps to give you a feeling for the process of using R in this way.
Note: the material in the lecture notes is not sufficient by itself to allow you to complete this practical. You will need to refer to either the on-line R documentation or to the other materials referred to in this specification.
We will start with some basic analysis, then move on to tackle some larger-scale problems of inference from the data.
On one of the CSE linux machines simply type R at the command-line:
% R ... lines of messages ... Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for a HTML browser interface to help. Type 'q()' to quit R. >
You are now in the top-level R interpreter. As you will discover, R is basically a functional language with object-oriented features. This means whatever you type at the command-line will be evaluated. And usually, what you type is assigning the results of some expression to some object.
To get the HTML browser interface to the help system type:
> help.start()Click on the link "An Introduction to R". You will then see a structured set of links. For this prac you should go through the following sections:
It may make sense to go through the sample session early on. You may also need to refer to the sections Function and variable index and Concept index.
You will need to use R Version 1.8.0 or above. The version of R installed on CSE linux machines is 2.2.1 and this should work with the microarray packages you need.
You will need some specialised microarray packages from the BioConductor project.
R uses a particular method to install and load packages. The easiest way is to install them via the Web:
source("http://www.bioconductor.org/biocLite.R")
biocLite()
biocLite("siggenes")
However, if you have limited disk space, you can access then as follows (this should work on CSE linux machines):
% cd
.libPaths("~mike/public_html/myrlibrary")
You will follow the steps below to load in a set of microarray included in the marray package. The final steps will generate a series of plots. Once you get these, first save them to a graphics format (see the R manual) and answer the supplied questions with reference to the lecture notes.
For ease of copy and paste the R commands are given without the prompt.
First load the packages you installed using the library() command and set up the swirl data:
% R ... library(Biobase) library(multtest) library(siggenes) library(limma) library(marray)
You can find out more about the data set using help (using ?). Note the difference between the second command, which invokes help on the object swirl, and the third command, which invokes the object itself to produce a lot of output:
data(swirl)
?swirl
swirl
datadir <- system.file("swirldata", package="marray")
dir(datadir)
The file fish.gal is a gal file generated by the program GenePix; it contains information on individual probe sequences, such as gene names, spot ID, spot coordinates.
You need these commands:
swirl.layout <- read.marrayLayout(fname = file.path(datadir,
"fish.gal"), ngr = 4, ngc = 4, nsr = 22, nsc = 24, skip = 21,
ctl.col = 4)
ctl <- rep("Control", maNspots(swirl.layout))
ctl[maControls(swirl.layout) != "control"] <- "N"
maControls(swirl.layout) <- factor(ctl)
swirl.layout
This loads information on probe sequences and target samples:
swirl.samples <- read.marrayInfo(file.path(datadir, "SwirlSample.txt"))
swirl.samples
swirl.gnames <- read.marrayInfo(file.path(datadir, "fish.gal"),
info.id = 4:5, labels = 5, skip = 21)
swirl.gnames
The following commands read in all the Spot} files residing in the datadir directory. The arguments further specify that the red and green foreground intensities are stored under the headings Rmean and Gmean, and that the red and green background intensities are store under the headings morphR and morphG, respectively.
fnames <- dir(path = datadir, pattern = paste(".*", "spot", sep = "."))
swirl.raw <- read.marrayRaw(fnames, path = datadir, name.Gf = "Gmean",
name.Gb = "morphG", name.Rf = "Rmean", name.Rb = "morphR",
layout = swirl.layout, gnames = swirl.gnames, targets = swirl.samples)
Use these commands:
library("marray", verbose=FALSE)
data(swirl)
maPlate(swirl)<-maCompPlate(swirl,n=384)
swirl.norm <- maNormMain(swirl)
swirl.norm <- maNorm(swirl, norm="p")
swirl.norms <- maNorm(swirl, norm="s")
swirl.normg <- maNormScale(swirl.norm, norm="g")
boxplot(swirl[,3], xvar="maPrintTip", yvar="maM", main="Swirl array 93: pre--normalization") boxplot(swirl.norm[,3], xvar="maPrintTip", yvar="maM", main="Swirl array 93: post--normalization") plot(swirl[,3], main="Swirl array 93: pre--normalization MA--plot") plot(swirl.norm[,3], main="Swirl array 93: post--normalization MA--plot")
You should save these plots to PostScript, PDF, Jpeg or PNG files for inclusion in your report.
1) [box-plots] Look at the M values across different print-tip groups in the pre- and post- normalization box-plots. What is M ? (see lecture notes) Since the print-tip group reflects spatial distribution of sample on the slide, what would you say is the effect of the normalization in this case ?
2) [scatter-plots] Look at the M versus A values across different print-tip groups in the pre- and post- normalization scatter-plots. What is A ? (see lecture notes) Can you explain the variation in the M-A plot ? (see lecture notes) What would you say is the effect of the normalization in this case ?
You should include all 4 plots and the answers to these questions in your report.
The aim of this part is to identify a subset of genes which are differentially expressed and search for possible relationships between them using clustering. This part of the work will require generating some output from microarray data using R and BioConductor and evaluating the results from yeast gene annotation from the Saccharomyces Genome Database (SGD).
We will work with a data set taken from a yeast experiment using a signalling pathway. In this domain the Ca2+/calmodulin-dependent protein phosphatase calcineurin is activated by specific environmental conditions including exposure to Ca2+ and Na+, and induces gene expression by regulating the Crz1p/Tcn1p transcription factor.
The microarray data includes measurements taken from yeast cells following addition of Ca2+ and Na+ in addition to a number of other experimental conditions.
First download the calcineurin data to your working directory, then start R and load the required packages using the library() command:
% R ... library(Biobase) library(genefilter)
Now the data should be loaded into R. Since it is in comma-separated format there is a built-in command to handle that. We also do some quick checks on what we have loaded in.
marr <- read.csv("calcineurin.csv",header=TRUE)
n <- ncol(marr)
print(n)
m <- nrow(marr)
print(m)
You should see that there are 25 columns and 6102 rows in this data set. We can look at the names of the columns as a further check:
attach(marr) names(marr)
In order to set the data up for what follows we need to do some more work. First, we select the expression measurements for all genes and samples from the input table as a matrix.
marrall <- marr[1:nrow(marr),2:ncol(marr)]
matmarr <- as.matrix(marrall)
eset <- new('exprSet',exprs=matmarr)
gStr <- marr[,1]
geneNames(eset) <- gStr
show(eset)
Notice the use of
We would like to filter out the genes which are not involved in
the pathway under study.
To do this we will use the
There are two basic types of filtering which can be used.
The first is non-specific, which applies to genes in terms of
all the samples.
Here we will use a filter which will remove genes whose expression
does not vary much over all the samples.
The particular function used is
fnsp <- function(x) (IQR(x, na.rm = T) > 0.5)
The second type of filter is to select genes which are differentially expressed according to some criterion applied to one or more samples. We will use a very simple version of such a criterion on two of the samples. The details are quite involved, but the basic idea is to select genes which show an inhibition of expression under conditions in which this could indicate involvement in the calcineurin pathway.
fnna15 <- function(x) { (!is.na(x[9])) }
cnca15 <- function(x) { (x[9] < -1.0) }
fnna30 <- function(x) { (!is.na(x[10])) }
cnca30 <- function(x) { (x[10] < -1.0) }
Having defined our filter functions, we put them together and apply them to the data to select a subset of genes.
ff <- filterfun(fnsp,fnna15,cnca15,fnna30,cnca30) selected <- genefilter(exprs(eset),ff) sum(selected) esetSub <- exprs(eset)[selected,]
The
The clustering first requires the construction of a distance matrix
using the R built-in function
d <- dist(esetSub) hcl <- hclust(d, method = "complete", members=NULL) plot(hcl)
For another look at our selected genes we can construct and display a heatmap which shows how genes and samples can be clustered together on the basis of similarity of expression.
heatmap(esetSub,
Rowv=as.dendrogram(hcl),
col = topo.colors(64),
margins = c(10,10),
xlab = "Samples", ylab= "Genes"
)
In this map the dark blue cells show genes which are relatively
down-expressed (negative values) and reddish-pink cells are
relatively up-expressed (positive values). Intermediate levels
from negative to positive pass through green to yellow.
White indicates missing values in the data.
Notice that samples with similar names are grouped together
along the bottom of the heatmap.
Now we shall investigate the properties of some of the genes
clustered togther, to see if we can find any possible similar
functions.
First, we plot the hierarchical clustering again but this
time as a dendrogram to give a slightly more regular display.
Then we will use the
plot(as.dendrogram(hcl)) identify(hcl, function(k) print(k))
Once the clustering is displayed, you can left-click in the
plot window and the list of genes in the dendrogram below the point
clicked on will be written to the terminal.
Try selecting a number of points in the dendrogram and see
how many genes (yeast ORFs, looking something like this:
Now check to see if any relevant annotation categories are enriched,
i.e., over-represented, for the genes in the cluster you selected.
In your web browser go to the
FunSpec web site.
Copy the genes (select only the names) from the R
terminal and paste them into the box at the left hand side of the
FunSpec tool.
Leave all the parameters at their default setting and click
on the
Now you can complete the prac by completing the following and adding the results to your report.
3) [heatmap] Save the heatmap to file and include it in your report.
4) [hierarchical clustering] Find two (2) annotation categories of different types which you can argue are evidence of the following biological properties:
The hand-in for this prac is a written report containing your answers. It is due at 16:59pm Thursday June 7. Reports can be in PDF, PostScript or Word (.doc). Submit the report by give, as follows:
give bi9010 arrayprac myreport
where myreport is a file called either report.pdf
or report.ps or report.doc.
There are 5 marks available. The actual course mark will be scaled out of 4%.