Sample Questions - Part A

The following questions are meant to give you some orientation about the kind of questions and the range of topics you may see in the exam. The total value of all questions on the part taught by Anthony Knittel will be 60 marks (or 70 marks for 9844 students), corresponding to 60 minutes of allocated time.

1) Perceptron (10 marks)

Provide a schematic diagram of a simple perceptron neuron and describe mathematically its function. Give the Perceptron training algorithm in pseudo code. Does the Perceptron algorithm perform gradient descent? Justify your answer.

Answer: A diagram and the mathematical description of its output function are shown on page 4 of the slides for Week 2. Pseudocode for the Perceptron training algorithm is shown on page 7.

No, the Perceptron algorithm does not perform gradient descent. To do that, the error function would need to be differentiable, which it is not, due to the step function used as the output function.
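For illustration, here is a minimal Python sketch of the Perceptron training algorithm (not taken from the slides; it assumes labels in {+1, -1} and a bias absorbed into the weight vector by appending a constant input):

import numpy as np

def perceptron_train(X, y, learning_rate=1.0, max_epochs=100):
    """Train a simple perceptron on inputs X (n x d) with labels y in {+1, -1}."""
    n, d = X.shape
    # Absorb the bias into the weights by appending a constant-1 input.
    Xb = np.hstack([X, np.ones((n, 1))])
    w = np.zeros(d + 1)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            # Step (sign) output function: +1 if w.x >= 0, else -1.
            output = 1 if np.dot(w, xi) >= 0 else -1
            if output != yi:
                # Update rule: move the weights towards the misclassified example.
                w += learning_rate * yi * xi
                errors += 1
        if errors == 0:   # converged: all training examples classified correctly
            break
    return w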

2) Backpropagation (10 marks)

Explain the basic idea of the backpropagation learning algorithm. Explain the role of the learning rate: what effect does increasing the learning rate have? What effect does decreasing it have?

Answer: Backpropagation performs gradient descent on the error function of the output neuron(s), i.e. it modifies the weights over many steps, each step changing the weights in the opposite direction of the gradient so as to move towards a (local) minimum of the error function. While the error can be determined for the output neuron(s), since the target output is known, this is more difficult for the hidden neurons. Backpropagation propagates the error signal back from the output unit(s) to the hidden units, so that the weights of the hidden units can also be changed in order to reduce the error of the output units.

The learning rate determines the step size of each weight update towards the local minimum. If the learning rate is very small, a large number of updates is necessary. If the learning rate is large, the local minimum may be overstepped and possibly never found.
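To make the effect of the learning rate concrete, here is a small gradient-descent sketch on a made-up one-dimensional error function E(w) = (w - 3)^2 (illustrative only; the function, step counts and rates are assumptions):

def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly step against the gradient of the error function."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)   # move opposite to the gradient
    return w

# Error function E(w) = (w - 3)**2 with gradient 2 * (w - 3); the minimum is at w = 3.
grad = lambda w: 2 * (w - 3)

print(gradient_descent(grad, w0=0.0, learning_rate=0.1, steps=50))  # small rate: slow but converges near 3
print(gradient_descent(grad, w0=0.0, learning_rate=1.1, steps=50))  # rate too large: oversteps and diverges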

3) MLPs (10 marks)

What would happen if the transfer functions (at the hidden and output layer) in a multi-layer perceptron were omitted, i.e. if the activation were simply the weighted sum of all input signals? Explain why this (simpler) activation scheme is not normally used in MLPs, although it would simplify and accelerate the calculations for the backpropagation algorithm.

Answer: In this situation the output of the network could also be generated by a single-layer network: each neuron computes a weighted sum of its inputs. If weighted sums of the input signals are fed into a neuron at the next level, that neuron will still just compute a weighted sum of the outputs of the neurons from the previous layer, which is a weighted sum of the input signals to the network.

To be more specific, we can write the weights from the input to the hidden layer as a matrix WHI, the weights from the hidden to the output layer as WOH, and the biases at the hidden and output layer as vectors bH and bO.

Using vector and matrix multiplication, the hidden activations can be written as

H = bH + WHI * I

The output activations can be written as

O = bO + WOH * H
  = bO + WOH * (bH + WHI * I)
  = (bO + WOH * bH) + (WOH * WHI) * I

Because of the distributivity and associativity of matrix multiplication, this can be written as

O = bOI + WOI * I

where

bOI = bO + WOH * bH
WOI = WOH * WHI
Therefore, the same function can be computed with a simpler network, with no hidden layer, using the weights WOI and bias bOI.
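A quick numerical check of this collapse (a sketch with arbitrary dimensions and random weights, using numpy; none of the values come from the question):

import numpy as np

rng = np.random.default_rng(0)

# Random weights and biases for a "linear MLP" with 4 inputs, 5 hidden units, 2 outputs.
W_HI, b_H = rng.normal(size=(5, 4)), rng.normal(size=5)
W_OH, b_O = rng.normal(size=(2, 5)), rng.normal(size=2)

I = rng.normal(size=4)                      # an arbitrary input vector

# Two-layer network with identity transfer functions.
H = b_H + W_HI @ I
O_two_layer = b_O + W_OH @ H

# Equivalent single-layer network.
W_OI = W_OH @ W_HI
b_OI = b_O + W_OH @ b_H
O_single_layer = b_OI + W_OI @ I

print(np.allclose(O_two_layer, O_single_layer))   # True: the hidden layer adds nothing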

4) Computational Learning Theory (10 marks)

Explain the definition of the Vapnik-Chervonenkis dimension. What is the Vapnik-Chervonenkis dimension (VC dimension) of the following set of sets F over the domain X={a,b,c,d,e,f,g}?

F={{},{a},{b,c,d,f},{d,e,f},{a,b,d,e},{a,d,e,f},{b,c,e},{a,b,c,d},{c,d,f},{a,c,e,g}}

I.e. each set corresponds to a function f that can be learned. Such a function f classifies the objects belonging to the set positively and classifies all other objects negatively. Justify your answer.

Answer: The Vapnik-Chervonenkis dimension of a set of sets F over the domain X is the cardinality (number of elements) of the largest subset S of X such that S is shattered by F. S is shattered by F if, for each subset s of S there is at least one set f in F such that s is obtained by intersecting f with S.

For F on X the Vapnik-Chervonenkis dimension is 3. The subset S={a,c,e} is shattered by F, i.e. for each subset s of {a,c,e}, there is a set f in F such that S intersected with f results exactly in s.

I.e.

for s={} there is {} in F.

for s={a} there is {a} in F.

for s={c} there is {b,c,d,f} in F.

for s={e} there is {d,e,f} in F.

for s={a,c} there is {a,b,c,d} in F.

for s={a,e} there is {a,b,d,e} in F.

for s={c,e} there is {b,c,e} in F.

for s={a,c,e} there is {a,c,e,g} in F.

Furthermore, we would need to establish that there is no larger subset than S that is shattered by F. This is easy to see here: a subset of four elements has 16 distinct subsets, each of which would have to arise as the intersection with some set in F, but F contains only 10 sets, so no subset of size 4 (or larger) can be shattered.
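Both claims can also be verified by a brute-force check: enumerate candidate subsets S of X and test whether every subset of S arises as an intersection with some f in F (a sketch in Python; the helper names are my own):

from itertools import combinations

X = set("abcdefg")
F = [set(), {"a"}, {"b","c","d","f"}, {"d","e","f"}, {"a","b","d","e"},
     {"a","d","e","f"}, {"b","c","e"}, {"a","b","c","d"}, {"c","d","f"},
     {"a","c","e","g"}]

def shattered(S):
    """S is shattered by F if every subset of S equals f & S for some f in F."""
    intersections = {frozenset(f & S) for f in F}
    subsets = {frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)}
    return subsets <= intersections

print(shattered({"a", "c", "e"}))                                   # True
print(max(len(S) for r in range(len(X) + 1)
          for S in map(set, combinations(X, r)) if shattered(S)))   # 3 = VC dimension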

5) Support Vector Machines: (10 marks)

What is the optimal hyperplane? What is the input space as opposed to the feature space? What is the purpose of kernel functions in Support Vector Machines? Justify your answers.


Answer: The optimal hyperplane is the hyperplane that classifies the training data points correctly and has maximal distance to the closest training data points. The input space is the space in which the training data is presented. In Support Vector Machines a mapping is usually performed from the input space into a much higher-dimensional feature space. The reason is that a separating hyperplane in the feature space corresponds to a more complex, non-linear separation function in the input space. Kernel functions in Support Vector Machines perform the mapping of two points from the input space into the feature space implicitly: they provide the result of the inner product of the two points after they have been mapped into the feature space. The kernel functions are fast to compute from the two vectors in the input space, which avoids laborious computations over the large number of dimensions of the feature space.
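As a concrete illustration of this "kernel trick" (a sketch using the degree-2 polynomial kernel k(x, y) = (x . y + 1)^2 on two-dimensional inputs, for which the explicit feature map is known; the example values are made up):

import numpy as np

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: cheap to evaluate directly in the input space."""
    return (np.dot(x, y) + 1) ** 2

def phi(x):
    """Explicit feature map into the 6-dimensional feature space for 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, y))          # 4.0, computed in the input space
print(np.dot(phi(x), phi(y)))     # same value, computed via the explicit feature space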

6) Boosting (10 marks)

What is Ada-boost? Explain in intuitive terms what it does. Provide a description of it in pseudo code. Explain in intuitive terms why it often improves the accuracy of the learned function. How does it differ from the original boosting algorithm? With which neural network learning algorithms can it be combined?


Answer: Ada-boost is a boosting technique that, as opposed to boosting by filtering, works with a given limited set of training examples. It trains multiple different classifiers and finally classifies objects by using a voting scheme in which the classifiers that had a smaller training error receive a higher weight when voting. During training of the classifiers, the training data points are assigned weights, which are varied depending on whether the training examples were correctly classified by a previously generated classifier. An incorrectly classified training example receives a higher weight in order to ensure that the training of the next classifier attempts to generate a classifier that classifies the example correctly. Ada-boost thus generates multiple classifiers which focus on correctly classifying different areas of the input space, and a combination of many such classifiers often leads to superior overall performance compared to any individual classifier. Pseudocode: see slides for week 6, pp. 12 & 13. It can be combined with all supervised neural network learning techniques that allow calculation of the total training error taking the different weights of each training example into account.
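A minimal sketch of the scheme described above (assuming decision stumps from scikit-learn as the weak learners; this follows the standard AdaBoost formulation rather than the exact pseudocode on the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_rounds=10):
    """AdaBoost with decision stumps as weak learners; y must be in {+1, -1}."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)            # start with uniform weights on the training examples
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])     # weighted training error of this classifier
        if err >= 0.5:                 # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # voting weight of this classifier
        # Increase the weights of misclassified examples, decrease correct ones, renormalise.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Classify by the weighted vote of the weak classifiers."""
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, classifiers))
    return np.sign(votes)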