Recognising Biological Sounds
Using Machine Learning

Andrew Taylor
School of Computer Science and Engineering,
The University of New South Wales, Sydney 2052, AUSTRALIA.
Email: andrewt@cse.unsw.edu.au

Abstract:

The development of a software system which can detect and identify the flight calls of migrating birds is reported. The system first produces a spectrogram using a Discrete Fourier transform. Calls are detected in the spectrogram using an ad-hoc combination of local peak-finding and a connectedness measure. Attributes are extracted both globally from the call and from a window moved incrementally through the call. Quinlan's C4.5 machine learning system is used to induce a decision tree-based classifier.

The system has been tested on a set of 138 flight calls from 9 species of birds. Some calls are faint and interfering insect noise is present in others. Eight-fold resampling was used to classify the calls unseen. 78% of calls were identified correctly, 4% incorrectly and 18% were left unclassified.

Introduction

There is increasing interest in, and expenditure on, environmental monitoring, both in Australia and globally. Population estimation based on counting of vocalisations is a key technique in the monitoring of a number of important taxa. These include groups such as birds and frogs, which are often used as general indicators of diversity or ecological change.

Manual censuses of animal vocalisations, however, are often time-consuming, expensive and prone to observer biases. In some areas and for some taxa few suitably-skilled observers are available. Automatic methods to identify and count animal vocalisations would allow more extensive, more consistent and cheaper population monitoring for many species. They could also make possible important studies which are not feasible with human observers [Gri95].

In this paper we report a successful method for identifying the flight calls of a number of bird species. We believe that a useful set of other animal vocalisations have sufficiently similar basic characteristics that they could also be identified using this method. These include the echo-location signals of bats and the whistles of dolphins and other small cetaceans.

Data Set

Evans [Eva94] has used recording stations spread over several hundred kilometres to census the extensive nocturnal bird migrations of North America. Manually identifying and counting the flight calls is an extremely time-consuming part of Evans' study. There is considerable concern about a long-term decline in migratory passerines in North America. Automation of call identification would allow Evans' methods to be used for large-scale monitoring on a per-species basis.

The data set used in this work was recorded, digitised and identified by Evans. It contains 138 calls from 9 species. All 9 species are small passerines weighing only 10 to 30 grams. Their flight calls are very different to the songs which are the vocalisations usually associated with these and other birds. The flight calls are very brief, 30 to 100 milliseconds in length. They are high-pitched, 4-6 kHz in frequency, and of narrow bandwidth.

These characteristics make the calls indistinguishable to all but the highly-practised human ear. Fortunately, distinctive frequency modulations make identification of calls from spectrograms much easier. Figure 1 contains examples of call spectrograms of 4 of the 9 species.

As can be seen in Figure 1, automatic identification would not be difficult if the calls were recorded in a controlled environment. Unfortunately this is not the case. The calls are recorded by specially designed microphones mounted on the roofs of buildings. The birds are flying 50 to 500 metres above the microphones. Figure 2 contains examples of calls of the same 4 species as in Figure 1, but demonstrating difficulties arising from the recording environment.

Figure 1: Clear Flight Calls of 4 Species

Figure 2: Unclear Flight Calls of 4 Species

Each call in Figure 2 exhibits a different problem:

There are important differences between this domain and the domains of speech and speaker recognition. The flight calls are much simpler than speech utterances but are recorded under very difficult conditions. Most work on speech and speaker recognition is done with utterances recorded in good to excellent conditions. Work on speech and speaker recognition focuses on the utterances; this work focuses on robustly handling the recording conditions.

Preprocessing

The choice of representation is crucial in any classification problem. The importance of the frequency modulation in the calls made the frequency-time representation of spectrograms attractive.

We produced a spectrogram of each call using a 128-point Discrete Fourier transform (DFT) with a Hann window of 3ms and an increment of 1ms.
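The spectrogram computation can be sketched as follows. This is a minimal illustration, not the code used in this work: the sample rate is not stated in the paper, so the 22050 Hz used in the usage note is purely an assumption, as is zero-padding each 3ms frame up to 128 points. The function names are hypothetical.

```python
import cmath
import math


def hann(n):
    """Return an n-point Hann window (n must be at least 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(frame, n_points):
    """Magnitudes of the first n_points // 2 bins of a zero-padded DFT."""
    padded = list(frame) + [0.0] * (n_points - len(frame))
    magnitudes = []
    for k in range(n_points // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n_points)
                  for t, x in enumerate(padded))
        magnitudes.append(abs(acc))
    return magnitudes


def spectrogram(samples, sample_rate, window_ms=3.0, step_ms=1.0, n_points=128):
    """Return one list of bin magnitudes per frame, stepping by step_ms."""
    win_len = int(round(sample_rate * window_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    if win_len < 2 or win_len > n_points or step < 1:
        raise ValueError("window does not fit the DFT size at this sample rate")
    window = hann(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, step):
        chunk = samples[start:start + win_len]
        frames.append(dft_magnitudes([s * w for s, w in zip(chunk, window)],
                                     n_points))
    return frames
```

At the assumed 22050 Hz sample rate each bin is about 172 Hz wide, so a pure 5 kHz tone should peak at or next to bin 29.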

Figure 3: Frequency track of a Call

The narrow bandwidth of the calls prompted us to simplify the representation by tracking the dominant frequency of the call. This reduces the two-dimensional call spectrogram to a one-dimensional track of the dominant frequency. Figure 3 contains an example of such a frequency track.

Call frequency tracks are found by searching for sequences of local peaks in the spectrogram. Sequences whose length or frequency variation is outside the limits appropriate for calls are rejected. Low-energy peaks forming a spurious suffix or prefix to a frequency track are detected by comparison to the average energy of the peaks of the track. This is done using ad-hoc metrics arrived at by trial-and-error.
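The track-finding steps described above might look roughly like the sketch below. It is a simplification under stated assumptions: it follows only the single strongest bin in each frame, and the thresholds (max_jump, ratio, min_len, max_len) are hypothetical placeholders, not the ad-hoc metrics actually used.

```python
def dominant_peaks(frames):
    """Return (bin, energy) of the strongest bin in each spectrogram frame."""
    peaks = []
    for frame in frames:
        best = max(range(len(frame)), key=lambda k: frame[k])
        peaks.append((best, frame[best]))
    return peaks


def split_into_tracks(peaks, max_jump=3):
    """Group consecutive peaks; start a new track when the bin jumps too far."""
    tracks, current = [], []
    for peak in peaks:
        if current and abs(peak[0] - current[-1][0]) > max_jump:
            tracks.append(current)
            current = []
        current.append(peak)
    if current:
        tracks.append(current)
    return tracks


def trim_weak_ends(track, ratio=0.25):
    """Drop leading/trailing peaks weaker than ratio * mean peak energy."""
    if not track:
        return track
    threshold = ratio * sum(energy for _, energy in track) / len(track)
    start, end = 0, len(track)
    while start < end and track[start][1] < threshold:
        start += 1
    while end > start and track[end - 1][1] < threshold:
        end -= 1
    return track[start:end]


def find_call_tracks(frames, min_len=10, max_len=100, max_jump=3, ratio=0.25):
    """Return frequency tracks (lists of bin indices) of plausible length."""
    tracks = []
    for track in split_into_tracks(dominant_peaks(frames), max_jump):
        track = trim_weak_ends(track, ratio)
        if min_len <= len(track) <= max_len:
            tracks.append([bin_index for bin_index, _ in track])
    return tracks
```

Each returned track is the one-dimensional representation of a call: one dominant frequency bin per spectrogram frame.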

Classification System

We have used Quinlan's C4.5 [Qui93] to build the classifiers in this work. Classifiers of suitable performance could likely have been produced by other methods, for example neural network classification systems, which are popular in similar domains, e.g. [PMC94].

However, the efficiency of classifier construction was crucial, and other classification systems may not have met this demand.

Global Classification

We tried many methods for extracting attributes from frequency tracks and there are many other possibilities we didn't explore. Among the attributes we tried were:

We often found the performance of the classifier induced by C4.5 would decline as we provided C4.5 with extra attributes. This is not unique to C4.5 or decision tree induction systems; the performance of many other classification systems declines as poor attributes are added.

The obvious remedy is to somehow filter out poor attributes. We developed a wrapper shell script for C4.5 which starts with an empty set of attributes and incrementally adds to the set the attribute that produces the largest improvement in classifier performance.

We developed this method independently of [JKP94], who term this forward selection and discuss it and alternative methods in detail. This approach may not be feasible with classification systems in which classifier construction is more expensive. Fortunately, classifier construction time in C4.5 is linear in the number of cases.
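The greedy search described above can be sketched as follows. The original was a shell script wrapped around C4.5; here the classifier is abstracted behind a caller-supplied evaluate function (hypothetical) that returns a performance score for a given attribute subset. Stopping when no attribute improves the score is an assumption, as the stopping rule is not stated.

```python
def forward_selection(attributes, evaluate):
    """Greedily add the attribute that most improves evaluate(subset).

    evaluate takes a list of attribute names and returns a score where
    higher is better, for example cross-validated accuracy.
    """
    selected = []
    best_score = evaluate(selected)
    remaining = list(attributes)
    while remaining:
        # Score every candidate extension of the current subset.
        scored = [(evaluate(selected + [name]), name) for name in remaining]
        score, name = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # assumed stopping rule: no candidate helps
        selected.append(name)
        remaining.remove(name)
        best_score = score
    return selected, best_score
```

Each round builds one classifier per remaining attribute, so the total number of classifier constructions grows quadratically with the number of candidate attributes. This is why cheap classifier construction matters for this approach.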

A set of 14 attributes produced the best performance:

A decision-tree classifier was induced using the above 14 attributes. The performance of the classifier was evaluated on unseen cases using 8-fold resampling. It classified 83% of the 138 calls in our data set correctly.
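The evaluation procedure can be illustrated with the sketch below, assuming that "8-fold resampling" means standard k-fold cross-validation: the cases are split into 8 folds and each fold is classified by a classifier built from the other 7. The train argument is a hypothetical stand-in for classifier induction; it takes training cases and labels and returns a classification function.

```python
import random


def k_fold_indices(n_cases, k=8, seed=0):
    """Shuffle case indices and deal them into k nearly equal folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(cases, labels, train, k=8, seed=0):
    """Return the fraction of cases classified correctly when unseen."""
    correct = 0
    for held_out in k_fold_indices(len(cases), k, seed):
        held = set(held_out)
        train_cases = [c for i, c in enumerate(cases) if i not in held]
        train_labels = [l for i, l in enumerate(labels) if i not in held]
        classify = train(train_cases, train_labels)
        correct += sum(1 for i in held_out if classify(cases[i]) == labels[i])
    return correct / len(cases)
```

With 138 calls and 8 folds, this split gives six folds of 17 calls and two folds of 18.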

This performance was promising but a classifier with a 17% error rate was not useful for our purposes.

Exploratory attempts at inducing separate classifiers for each of the 9 species suggested the data set was too small for this approach to be successful. Otherwise the combination of 9 separate classifiers might produce a significantly lower error rate.

Mills [Mil95] was successful in constructing separate classifiers using a larger version of the same data set which contained roughly 600 calls. Exact comparison with Mills' work is difficult because Mills employed some manual preprocessing. However Mills' neural network classifiers seem no more successful than the classifier described above and less successful than the approach described below.

Windowed Classification

An approach sometimes used in speech analysis is to extract attributes from a fixed-sized window stepped through the signal. This, in combination with machine learning, has been applied very successfully to speaker recognition [SS95]. This seemed a promising approach for handling unclear calls such as those in Figure 2 as, hopefully, some easily recognisable windows should remain in the call.

After some trial-and-error experimentation with the window size and the resolution of the DFT we developed a successful windowed approach to classifying calls.

For windowed classification, a spectrogram of each call was produced using a 64-point DFT with a Hann window of 10ms and an increment of 3ms.

Figure 4: Example Window Extracted from Figure 3

A window (nothing to do with the DFT window mentioned above) 11 peaks long and hence 33ms long is slid along the frequency track. It is moved forward 1 peak at each step. This produces from 10 to 30 overlapping windows depending on the length of the call. Figure 4 contains an example of such a window. It was extracted from the Blackpoll Warbler call in Figure 3.
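The window extraction can be sketched as below, where a frequency track is assumed to be a plain list with one dominant-frequency value per peak. The function name is hypothetical.

```python
def sliding_windows(track, length=11, step=1):
    """Return every run of `length` consecutive peaks, advancing by `step`."""
    return [track[i:i + length]
            for i in range(0, len(track) - length + 1, step)]
```

A track of n peaks yields n - 11 + 1 windows with these defaults, so 10 to 30 windows corresponds to tracks of 20 to 40 peaks. A track shorter than 11 peaks yields no windows.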

There are 13 attributes extracted from each window:

A decision tree classifier for individual windows was induced using the above 13 attributes. The performance of the classifier was evaluated on unseen cases using 8-fold resampling (on windows). It correctly classified 46% of the 2978 windows extracted from the 138 calls in our data set.

A wrapper was constructed which converts the classifications of the 10 to 30 windows extracted from a call into a classification of the call by voting. When evaluated on unseen cases using 8-fold resampling (on calls), it classified 79% of the calls in our data set correctly.

The error rate of 21% from windowed classification is worse than for global classification but the voting procedure yields some information about the certainty of classifications.

A wrapper classifier which left calls unclassified unless one species received at least 30% more votes than any other species was constructed and evaluated on unseen cases using 8-fold resampling (on calls). The figure of 30% for the required majority in the wrapper classifier was arrived at by trial-and-error.
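A voting wrapper of this kind might be sketched as follows. The reading of "at least 30% more votes" as "at least 1.3 times the votes of the runner-up" is an assumption; the text would also admit a margin measured against the total number of votes. Setting margin to 0 gives a simple plurality vote, except that exact ties are then decided by the order in which labels first appear.

```python
from collections import Counter


def vote_with_margin(window_labels, margin=0.3):
    """Return the winning species, or None if the lead is too small.

    window_labels holds one predicted species per window of a single call.
    The winner must have at least (1 + margin) times the runner-up's votes.
    """
    if not window_labels:
        return None
    ranked = Counter(window_labels).most_common()
    top_label, top_votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    if top_votes >= (1 + margin) * runner_up_votes:
        return top_label
    return None  # call left unclassified
```

Raising the margin trades coverage for accuracy: more calls are left unclassified, but those that are classified have clearer support.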

This classifier recognised 61% of the calls correctly and left 34% of the 138 calls unclassified. The 6% error rate approached our required performance but at the cost of leaving a third of the calls unclassified.

Global and Windowed Classification Combined

Our first attempt at combining the global and windowed approaches was to add the 14 attributes extracted from the whole call to the classifications of the 10 to 30 individual windows extracted from the call. This meant each window classification was made on the basis of 27 attributes, 13 extracted from the window and 14 extracted from the entire call.

The classifiers constructed by C4.5 for this approach performed poorly. They focussed almost entirely on the 14 global attributes. This is presumably because of the repetition of these attributes in the cases presented to C4.5.

Instead a more pragmatic approach was employed. A global classification was made separately as described previously. It was included in the voting procedure for the windowed classification by giving it a weight of 3 votes. The weight was chosen by trial-and-error.
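The combined vote can be sketched as below. The weight of 3 for the global classification is from the text; applying the same 30% margin rule, read as a ratio against the runner-up, is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def combined_vote(window_labels, global_label, global_weight=3, margin=0.3):
    """Vote over window classifications plus a weighted global classification.

    Returns the winning species, or None if it does not have at least
    (1 + margin) times the votes of the runner-up.
    """
    votes = Counter(window_labels)
    votes[global_label] += global_weight  # global classifier counts 3 times
    ranked = votes.most_common()
    top_label, top_votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    if top_votes >= (1 + margin) * runner_up_votes:
        return top_label
    return None  # call left unclassified
```

For example, a call whose windows split 6 to 5 would be left unclassified by the windows alone, but is classified if the global classifier agrees with the leading species (9 votes to 5).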

The resulting wrapper classifier, evaluated on unseen cases using 8-fold resampling (on calls), classified 78% of the calls correctly and left 18% of the 138 calls unclassified.

All the calls in Figure 2 were classified correctly.

The 4% error rate is acceptable for our purposes and compares not unfavourably with the estimated rate for a human expert of 1.5%. However, this is not a fair comparison, as the human expert would leave none of the calls in our data set unclassified.

It is important to note that the global classifier could recognise slightly more calls, but its 17% error rate made it useless for our purposes. The combined classifier recognises slightly fewer calls but is useful because it has only a 4% error rate. If complete call identification is necessary, the remaining 18% could be identified manually.

Conclusions and Further Work

The combination of individual windowed classifications is a powerful technique because it converts weak local classifications into a strong overall classification and it provides a good indication of the certainty of this classification.

Simple search techniques can be used to explore a space of possible classification attributes if classifier construction is not expensive.

As far as we are aware, this is the most successful published attempt at classifying biological sounds.

It would not be surprising if more impressive results have been obtained for marine mammals in military research but have not been published in the open literature.

It is interesting that such good results can be obtained with very simple signal processing. It suggests too little emphasis has been placed on classification methods in past work on related problems.

In further work, we hope to extend the attribute space search to include a search in the space of possible windowing and signal processing parameters. We also hope to explore examination of the signal at a hierarchy of time resolutions.

Later this year, we hope to field test our classifier with Evans.

References

Eva94
W. R. Evans. personal communication, 1994.

Gri95
G. Grigg. personal communication, 1995.

JKP94
G.H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Proceedings of the 11th International Conference on Machine Learning, pages 121-129. Morgan Kaufmann, 1994.

Mil95
H. Mills. Automatic detection and classification of nocturnal migrant bird calls. Journal of the Acoustical Society of America, 97(5):3370-3371, May 1995.

PMC94
J.R. Potter, D.K. Mellinger, and C.W. Clark. Marine mammal call discrimination using artificial neural networks. Journal of the Acoustical Society of America, 96(3):1255-1262, September 1994.

Qui93
J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

SS95
B. Squires and C. Sammut. Automatic speaker recognition: An application of machine learning. In Proceedings of the 12th International Conference on Machine Learning, 1995.


Andrew Taylor
Sat May 4 15:13:14 EST 1996