ENGG1811 Assignment 1: Automatic diagnosis

Version: v1.00, at 11am 22 June 2021

Change Log

Automatic diagnosis  

This assignment is inspired by the diagnosis of a medical condition called hypopnea. The word hypopnea is derived from the Greek roots hypo meaning under normal and pnea meaning breathing. Informally, hypopnea is sometimes referred to as overly shallow breathing. Hypopnea can be diagnosed by measuring the air flow rate into and out of the lungs together with other measurements. Here, we will only look at the air flow rate. The figure below, which is taken from [1], shows the air flow rate into and out of the lungs of a subject over a duration of about 120 seconds.

Three episodes of hypopnea, as well as their duration, have been highlighted in the figure above. An observation that can be made from the figure is that during an episode of hypopnea, the air flow rate hovered around zero or was much smaller than normal, which means that the subject was breathing a lot less than normal.

In this assignment, you will write Python programs to perform automatic diagnosis inspired by the above example on hypopnea. The aim of your programs is to process a data sequence (given as a Python list of numbers) to determine the starting time and duration of the episodes within the data. We chose the word inspired because you will not be using the actual medical criteria for diagnosing hypopnea. We have adapted the diagnostic problem so that, in this assignment, you will have to use the various Python constructs that you have learnt, while at the same time getting a taste of how programming can be used to perform diagnosis automatically.

Although the above example comes from biomedical engineering, there are plenty of examples of automatic diagnosis in all other branches of engineering and science, e.g. diagnosing engine performance, quality control in chemical reactors etc. 

Learning objectives

By completing this assignment, you will learn:

  1. To apply basic programming concepts of variable declaration, assignment, conditional, functions, loops and import.
  2. To use the Python data types: list, list of lists and Boolean.
  3. To translate an algorithm described in a natural language to a computer language.
  4. To organize programs into modules by using functions.
  5. To use good program style including comments and documentation.
  6. To get practice in software development, which includes incremental development, testing and debugging.

Prohibition

You are not allowed to use numpy for this assignment. This is an individual assignment, so no group work.

Requirements for automatic diagnosis  

This section describes the requirements on the automatic diagnostic algorithm that you will be programming in this assignment. You should be able to implement these requirements by using only the Python skills that you have learnt in the first four weeks of lectures in this course. This also means that the algorithm is a very simple-minded one compared to those that people really use nowadays.

We begin by describing the data that the algorithm will operate on. We will use the following Python code as an example and refer to it as the sample code. Note that the data and parameter values in the sample code are for illustration only; your code should work with any allowed input data and parameter values.

  # Flow rate
  flow_rate = [-4.5,  0.5,  4.5,  -0.1,  -4.3,
               -4.1,  0.1,  4.1,   0.4,  -4.9,
               -1.3,  0.2,  1.1,   0.4,   1.1,
               -1.7,  0.3,  3.1,   0.8,  -2.6,
               -1.5, -0.2,  1.2,   0.6,  -4.1,
               -4.1,  0.1,  4.1,   0.4,  -4.9,   
               -1.2, -0.1,  1.2,   0.7,  -1.9,
               -3.9,  0.1,  2.9,   0.5,  -2.2,
               -2.0,  0.5,  1.7,   4.6,   4.7,
               -3.4,  0.2]
              
  # Parameters for the diagnostic algorithm (Algorithmic parameters)
  segment_len = 5          # Number of data points in a segment
  interval = [-2.6,3.1]    # For determining whether a segment has the symptom
  threshold = 0.8          # For determining whether a segment has the symptom
  min_segment = 2          # Minimum number of segments to form an episode
  
  # Call the functions (which you will write in the assignment) to determine the episodes
  episodes = diag.run_diagnostic(flow_rate,segment_len,interval,threshold,min_segment)  

In the sample code, the data for the diagnostic algorithm are stored in a list called flow_rate. There are also four algorithmic parameters segment_len, interval, threshold and min_segment; we will explain their meaning later.

The data are shown as the blue line in the following plot.

For this example, there are two episodes where the flow rate is lower than normal and they have been highlighted by the magenta rectangles. The aim of the diagnosis is to determine all the episodes in the given flow rate data. We will now describe the requirements.

(Divide the flow rate data into segments and determine whether each segment has the symptom) We first divide the given flow rate data into a number of non-overlapping segments. The number of data points in each segment is given by the variable segment_len which has the value of 5 in the sample code. Because of this value of segment_len, the first segment will contain the data points:

flow_rate[0], flow_rate[1], flow_rate[2], flow_rate[3], flow_rate[4].

The second segment will contain the data points:

flow_rate[5], flow_rate[6], flow_rate[7], flow_rate[8], flow_rate[9],

and so on. The list flow_rate given in the sample code contains 47 data points, so we will get 9 complete segments. The two remaining data points (flow_rate[-2] and flow_rate[-1]) will be discarded and will not be used. When we typeset the sample code above, we purposely put 5 elements in each row of flow_rate, so that each row (other than the last one) is a complete segment.
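The segmentation step above can be sketched as follows. This is only a minimal illustration, not a required design: the helper name split_into_segments and the stand-in data are our own assumptions and are not part of the assignment templates.

```python
def split_into_segments(data, segment_len):
    """Return the complete, non-overlapping segments of data as a list of lists.

    Hypothetical helper for illustration only; the assignment does not
    require a function with this name.
    """
    num_segments = len(data) // segment_len   # leftover points at the end are dropped
    segments = []
    for k in range(num_segments):
        start = k * segment_len
        segments.append(data[start:start + segment_len])
    return segments


data = list(range(47))                  # stand-in for 47 data points
segments = split_into_segments(data, 5)
print(len(segments))                    # 9
print(segments[1])                      # [5, 6, 7, 8, 9]
```

With 47 data points and segment_len equal to 5, integer division gives 9 complete segments, and the last two data points never appear in any segment.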

The next step is to determine whether each segment has the symptom that we are looking for. Intuitively, we will say that a segment has the symptom if most of the data points in the segment have a smaller amplitude than normal. We will use the algorithmic parameters interval and threshold to determine whether a segment has the symptom. The parameter interval is used to determine whether a data point has a smaller amplitude than normal, and the parameter threshold is used to determine whether most points in a segment are small in amplitude.

The parameter interval is a list with 2 elements, and the parameter threshold is a scalar. In the sample code above, interval is the list [-2.6,3.1] and threshold is 0.8. We will use these values in an example to explain how you should use them. With the given values of interval and threshold, we say that a segment has the symptom if a fraction of 0.8 or more of the data points in the segment are between -2.6 and 3.1, inclusive of the end-points. The following table shows the calculation to determine whether each of the 9 segments in flow_rate has the symptom.


Data segment | Fraction of the data points between -2.6 and 3.1 (inclusive) | Does the segment have the symptom?
-4.5,  0.5,  4.5, -0.1,  -4.3 | 2 / 5 = 0.4 | False
-4.1,  0.1,  4.1,  0.4,  -4.9 | 2 / 5 = 0.4 | False
-1.3,  0.2,  1.1,  0.4,   1.1 | 5 / 5 = 1 | True
-1.7,  0.3,  3.1,  0.8,  -2.6 | 5 / 5 = 1 (Note: Both -2.6 and 3.1 in the data segment are counted.) | True
-1.5, -0.2,  1.2,  0.6,  -4.1 | 4 / 5 = 0.8 | True
-4.1,  0.1,  4.1,  0.4,  -4.9 | 2 / 5 = 0.4 | False
-1.2, -0.1,  1.2,  0.7,  -1.9 | 5 / 5 = 1 | True
-3.9,  0.1,  2.9,  0.5,  -2.2 | 4 / 5 = 0.8 | True
-2.0,  0.5,  1.7,  4.6,   4.7 | 3 / 5 = 0.6 | False

Note that the algorithmic parameters segment_len, interval and threshold may take on different values in different tests.
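The calculation for a single segment can be sketched as below. This is one possible way to express the rule, under the assumption that the segment is non-empty; the function name segment_has_symptom is hypothetical and chosen for illustration only.

```python
def segment_has_symptom(data_segment, interval, threshold):
    """Return True if the fraction of points inside interval is at least threshold.

    Illustrative sketch only; assumes data_segment is a non-empty list of numbers.
    """
    count = 0
    for value in data_segment:
        # Both end-points of the interval are included
        if interval[0] <= value <= interval[1]:
            count += 1
    fraction = count / len(data_segment)
    return fraction >= threshold


interval = [-2.6, 3.1]
threshold = 0.8
print(segment_has_symptom([-4.5, 0.5, 4.5, -0.1, -4.3], interval, threshold))  # False
print(segment_has_symptom([-1.7, 0.3, 3.1, 0.8, -2.6], interval, threshold))   # True
print(segment_has_symptom([-1.5, -0.2, 1.2, 0.6, -4.1], interval, threshold))  # True
```

The three calls correspond to the first, fourth and fifth segments, with fractions 0.4, 1 and 0.8 respectively. The comparison uses "greater than or equal to", so a fraction exactly equal to the threshold counts as having the symptom.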

After computing whether each complete segment has the symptom, we can summarise the results in a Python list of Boolean values. We will refer to this list using the variable name disorder_status where disorder means the symptom is present. For the flow_rate data in the sample code, the variable disorder_status is:

disorder_status = [False, False, True, True, True, False, True, True, False]

Note that there are 9 elements in disorder_status and they correspond to the 9 complete segments in the given flow_rate. Note also that you can obtain disorder_status from the right-most column in the table above.
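As a check on the table, the sketch below goes from the sample flow_rate data directly to disorder_status in a single loop. The function name compute_status is an assumption made for this illustration and is not one of the required functions; the data and parameter values are those of the sample code.

```python
flow_rate = [-4.5,  0.5,  4.5,  -0.1,  -4.3,
             -4.1,  0.1,  4.1,   0.4,  -4.9,
             -1.3,  0.2,  1.1,   0.4,   1.1,
             -1.7,  0.3,  3.1,   0.8,  -2.6,
             -1.5, -0.2,  1.2,   0.6,  -4.1,
             -4.1,  0.1,  4.1,   0.4,  -4.9,
             -1.2, -0.1,  1.2,   0.7,  -1.9,
             -3.9,  0.1,  2.9,   0.5,  -2.2,
             -2.0,  0.5,  1.7,   4.6,   4.7,
             -3.4,  0.2]


def compute_status(data, segment_len, interval, threshold):
    """Return one Boolean per complete segment (illustrative sketch only)."""
    status = []
    num_segments = len(data) // segment_len
    for k in range(num_segments):
        segment = data[k * segment_len:(k + 1) * segment_len]
        in_range = 0
        for value in segment:
            if interval[0] <= value <= interval[1]:
                in_range += 1
        status.append(in_range / segment_len >= threshold)
    return status


disorder_status = compute_status(flow_rate, 5, [-2.6, 3.1], 0.8)
print(disorder_status)
# [False, False, True, True, True, False, True, True, False]
```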

The next part of the computation is to determine the episodes from the variable disorder_status.

(Determining the episodes) 

An episode is formed by consecutive segments that have symptoms and an episode must have a minimum number of segments. The algorithmic parameter min_segment specifies the minimum number of segments an episode must have. The value of min_segment is 2 in the sample code but its value can change from test to test.

The determination of the episodes requires only two variables: disorder_status and min_segment. When min_segment equals 2, the variable disorder_status given above has two episodes, formed by the elements at indices 2 to 4 and at indices 6 to 7:

[False, False, True, True, True, False, True, True, False]

The first episode starts in the third segment (corresponding to a Python list index of 2) and has a duration of 3 segments. The second episode starts in the seventh segment (corresponding to a Python list index of 6) and has a duration of 2 segments. We will summarise the information on the episodes by using a list of lists as follows:

[[2,3],[6,2]]

The first list [2,3] corresponds to the first episode. The first element 2 in [2,3] is the Python list index of the segment at which the episode begins, and the second element 3 is the number of segments in the episode. The second list [6,2] describes the second episode in the same way. The variable episodes in the last line of the sample code above is expected to take on the value of this list of lists.

Let us consider the case where the variable min_segment has the value of 3 instead. In this case, the variable disorder_status given above has only one episode, formed by the elements at indices 2 to 4:

[False, False, True, True, True, False, True, True, False]

This is because each episode is now required to have at least 3 segments. We will summarise the information on the episodes by using a list of lists as follows:

[[2,3]]

If we further increase the variable min_segment to the value of 4, then there are no episodes in the variable disorder_status given above. In this case, we summarise the information on the episodes by using an empty list, i.e. [].
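The three cases above (min_segment equal to 2, 3 and 4) can be reproduced with the following sketch, which scans the list once and records each run of consecutive True values that is long enough. The function name find_runs and the variable names are our own choices for illustration; this is one possible approach, not the required one.

```python
def find_runs(status, min_len):
    """Return [start_index, length] for each run of True values of length >= min_len.

    Illustrative sketch only; status is assumed to be a list of Booleans.
    """
    episodes = []
    run_start = None                      # index where the current run began, if any
    for index, flag in enumerate(status):
        if flag and run_start is None:
            run_start = index             # a new run begins here
        elif not flag and run_start is not None:
            length = index - run_start    # the run ended just before this index
            if length >= min_len:
                episodes.append([run_start, length])
            run_start = None
    # A run that reaches the end of the list is closed here
    if run_start is not None and len(status) - run_start >= min_len:
        episodes.append([run_start, len(status) - run_start])
    return episodes


disorder_status = [False, False, True, True, True, False, True, True, False]
print(find_runs(disorder_status, 2))   # [[2, 3], [6, 2]]
print(find_runs(disorder_status, 3))   # [[2, 3]]
print(find_runs(disorder_status, 4))   # []
```

Because the list is scanned from left to right, the episodes come out in ascending order of their starting index. Note the extra check after the loop: without it, an episode that includes the last segment would be missed.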

Validity checks

The description above shows how the data (flow_rate) and algorithmic parameters (segment_len, interval, threshold, min_segment) are used to compute the episodes. Note that the algorithmic parameters must be valid so that the computation can be carried out. We require that your code performs a number of validity checks before computing the episodes. For example, the algorithmic parameter segment_len is valid only if it is a positive integer greater than or equal to 1. The following table states the requirements for the algorithmic parameters to be valid and what assumptions you can make when testing.

Algorithmic parameter | Requirement for the parameter to be valid | Assumptions you can make when testing
segment_len | A positive integer greater than or equal to 1 | You can assume that, when we test your code, the given segment_len is always a number (int or float). In other words, the given segment_len cannot be of data type str, list etc. For example, when we test your code, we may give segment_len a value from, say, 1, 5, -6, -7.3, 2.7. Out of these, 1 and 5 are valid, while the others are not.
interval | interval[0] must be strictly less than interval[1] | You can assume that the given interval is always a list with 2 numbers (int or float). For example, when we test your code, we may give interval the values of, say, [-10,-5.7], [10, 5.7]. Out of these, [-10,-5.7] is valid, while [10, 5.7] is not.
threshold | A float strictly between 0 and 1, i.e. 0 and 1 not included | You can assume that the given threshold is always a number (int or float).
min_segment | A positive integer greater than or equal to 1 | You can assume that the given min_segment is always a number (int or float).

You can assume that the given flow_rate is always a list. This list can be empty. If the list is not empty, then its elements are either int or float. In order for the computation described above to be carried out, the number of elements in flow_rate must be greater than or equal to the product of the algorithmic parameters segment_len and min_segment; you should only carry out the computation if there are enough data in flow_rate.
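The checks in the table and the data-length requirement can be sketched as two small helpers. The names parameters_are_valid and enough_data are hypothetical. We also assume here that "integer" means a value of type int, so that a float such as 5.0 is treated as invalid; the specification does not state this explicitly, so treat it as an assumption of this sketch.

```python
def parameters_are_valid(segment_len, interval, threshold, min_segment):
    """Return True only if all four algorithmic parameters are valid.

    Illustrative sketch. Assumption: a value counts as an integer only if
    its type is int (so 5.0 would be rejected).
    """
    if not (isinstance(segment_len, int) and segment_len >= 1):
        return False
    if not (isinstance(min_segment, int) and min_segment >= 1):
        return False
    if not interval[0] < interval[1]:     # strictly less than
        return False
    if not 0 < threshold < 1:             # 0 and 1 themselves are excluded
        return False
    return True


def enough_data(flow_rate, segment_len, min_segment):
    """Return True if flow_rate has at least segment_len * min_segment points."""
    return len(flow_rate) >= segment_len * min_segment


print(parameters_are_valid(5, [-2.6, 3.1], 0.8, 2))     # True
print(parameters_are_valid(2.7, [-2.6, 3.1], 0.8, 2))   # False
print(parameters_are_valid(5, [10, 5.7], 0.8, 2))       # False
print(enough_data([0.1] * 9, 5, 2))                     # False
```

An int threshold can never lie strictly between 0 and 1, so the range comparison alone rejects values such as 0 and 1.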

Implementation requirements

You need to implement the following four functions. Together, these four functions implement the automatic diagnosis.

The requirement is that you implement each function in a separate file. This is so that we can test them independently; this point is explained under Testing. We have provided template files; see Getting Started.

1. def has_symptom(data_segment, interval, threshold): 
    • The aim of this function is to determine whether a segment has the symptom.
    • The expected behaviour is described earlier, where we explain how to determine whether a data segment has the symptom given the data segment and the algorithmic parameters interval and threshold.
    • The function has 3 inputs and their names reflect their role in the description earlier.
    • The function should return one output which is a Boolean variable. The output is True if the segment has the symptom, otherwise the output is False.
    • For example, according to the first row of the table of data segments, if data_segment is the list [-4.5,  0.5,  4.5, -0.1,  -4.3], interval is the list [-2.6,3.1] and threshold is 0.8, then this function should return False.
    • This function can be tested using the file test_has_symptom.py
2. def flow_rate_to_disorder_status(flow_rate, segment_len, interval, threshold):
    • The aim of this function is to compute the disorder_status for the given data; see the earlier description on computing the disorder status from flow_rate, segment_len, interval and threshold.
    • The function has 4 inputs and their names reflect their role in the description earlier.  
    • The function should return one output which is a list of Boolean values True and False. The list returned by this function is the disorder status.
    • For example, if we use the sample code to make the function call flow_rate_to_disorder_status(flow_rate, segment_len, interval, threshold) then the function should return the disorder status [False, False, True, True, True, False, True, True, False].
    • This function requires the function has_symptom(). An import line has been included in the template file for you. Please do not change that.
    • This function can be tested using the file test_flow_rate_to_disorder_status.py.
3. def find_episodes(disorder_status, min_segment):
    • The input disorder_status is a list of Boolean values.
    • The input min_segment is a positive integer which specifies the minimum number of segments in an episode.
    • The aim of the function is to compute and return the information on the episodes, see the description under the heading (Determining the episodes).
    • The function should return one output which is either a list of lists or an empty list.
    • If the function returns a list of lists containing the information on the episodes, we require that the lists be sorted in the ascending order of the starting segment index of the episodes. For the example in the sample code, the format [[2,3],[6,2]] is acceptable, but not [[6,2],[2,3]].
    • This function can be tested using the file test_find_episodes.py. Note that there are many examples in the test file.
4. def run_diagnostic(flow_rate, segment_len, interval, threshold, min_segment):
    • This function is called after all the input data have been specified, see the last line in the sample code
    • The function has 5 inputs. The names for the inputs have been chosen to match their roles in the description earlier.
    • The function should return one output which can be a list of lists, an empty list or a string depending on the situation
    • The expected steps within the function run_diagnostic() are:
      • The function should first check whether all algorithmic parameters are valid. If any of the algorithmic parameters is invalid, the function should return the string 'Corrupted input'. It should not proceed to execute the next two steps.
      • If all algorithmic parameters are valid, the function should determine whether there are enough data in flow_rate for the calculations. If there are not enough data in flow_rate, the function should return the string 'Not enough data'. It should not proceed to execute the next step.
      • If all algorithmic parameters are valid and there are enough data, then the function should proceed to determine the episodes. The function should return either an empty list or a list of lists. 
    • You can use the following test files: test_run_diagnostic_1.py, test_run_diagnostic_2.py and test_run_diagnostic_3.py.
      • For both test_run_diagnostic_1.py and test_run_diagnostic_2.py, there are enough data and all algorithmic parameters are valid. Your code should proceed to compute the episodes.
        • For test_run_diagnostic_1.py, the data and algorithmic parameters are the same as those in the sample code.
        • For test_run_diagnostic_2.py, the variable flow_rate is a digitised version of the data from [1].
      • The test file test_run_diagnostic_3.py contains a number of test cases where the algorithmic parameters are invalid and/or there are not enough data in flow_rate. For all the test cases, the function should return a string.
        • We want to point out that in one test case, some algorithmic parameters are invalid and there are also not enough data; in that case, the function is expected to return the string 'Corrupted input'.
    • This function requires the functions flow_rate_to_disorder_status() and find_episodes(). Two import lines have been included in the template file for you. Please do not change them.
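The order of the three steps in run_diagnostic(), including the rule that 'Corrupted input' takes precedence over 'Not enough data', can be sketched as follows. To keep the sketch short, the result of the validity checks is passed in as a Boolean (params_valid) and the episode computation is passed in as a function (compute_episodes). Both of these, and the name run_steps, are assumptions made for illustration and differ from the required signature of run_diagnostic().

```python
def run_steps(flow_rate, segment_len, min_segment, params_valid, compute_episodes):
    """Apply the three steps in order and return a string or the episodes.

    Illustrative sketch only. params_valid is the outcome of the validity
    checks; compute_episodes is a function with no arguments that returns
    the episodes.
    """
    # Step 1: invalid parameters are reported first, whatever the data length
    if not params_valid:
        return 'Corrupted input'
    # Step 2: only reached when segment_len and min_segment are known to be valid
    if len(flow_rate) < segment_len * min_segment:
        return 'Not enough data'
    # Step 3: compute the episodes (a list of lists or an empty list)
    return compute_episodes()


def stub_episodes():
    """Stand-in for the real computation, returning the sample answer."""
    return [[2, 3], [6, 2]]


print(run_steps([], -7.3, 2, False, stub_episodes))        # Corrupted input
print(run_steps([0.1] * 9, 5, 2, True, stub_episodes))     # Not enough data
print(run_steps([0.1] * 47, 5, 2, True, stub_episodes))    # [[2, 3], [6, 2]]
```

In the first call the data list is empty and segment_len is invalid, so both problems are present; the validity check comes first, which is why the result is 'Corrupted input'.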

Additional requirements: In order to facilitate testing, you need to make sure that within each submitted file, you only have the code required for that function. Do not include test code in your submitted file.

Clarification: Since run_diagnostic() will only proceed to determine the episodes if all the parameters are valid and there are enough data, you are allowed to assume that when we test the correctness of has_symptom(), flow_rate_to_disorder_status() and find_episodes(), all the algorithmic parameters are valid.

Getting Started

  1. Download the zip file assign1_prelim.zip, and unzip it. This will create the directory (folder) named 'assign1_prelim'.
  2. Rename/move the directory (folder) you just created named 'assign1_prelim' to 'assign1'. The name is different to avoid possibly overwriting your work if you were to download the 'assign1_prelim.zip' file again later.
  3. First browse through all the files provided including the test files.
  4. (Incremental development) Do not try to implement too much at once, just one function at a time and test that it is working before moving on.
  5. Start implementing the first function, properly test it using the given testing file, and once you are happy, move on to the second function, and so on.
  6. Please do not use 'print' or 'input' statements. We won't be able to assess your program properly if you do. Remember, all the required values are part of the parameters, and your function needs to return the required answer. Do not 'print' your answers.

Testing

Test your functions thoroughly before submission.

You can use the provided Python programs (files like test_has_symptom.py etc.) to test your functions. Please note that each file covers a limited number of test cases. We have purposely not included all the cases because we want you to think about how you should be testing your code.

Note that the file test_2_data.txt contains the flow rate data for the test file test_run_diagnostic_2.py.


We will test each of your files independently. Let us give you an example. Let us assume we are testing three files: prog_a.py, prog_b.py and prog_c.py. These files contain one function each and they are: prog_a(), prog_b() and prog_c(). Let us say prog_b() calls prog_a(); and prog_c() calls both prog_b() and prog_a(). We will test your files as follows:

Submission

You need to submit the following four files. Do not submit any other files. For example, you do not need to submit your modified test files.

Instructions on how to submit your files will be available here in Week-5.

Assessment Criteria

We will test your program thoroughly and objectively. This assignment will be marked out of 25 where 20 marks are for correctness and 5 marks are for style.

Correctness

The 20 marks for correctness are awarded according to these criteria.

Criteria | Nominal marks
Function has_symptom.py | 4
Function flow_rate_to_disorder_status.py | 5
Function find_episodes.py (Case 1: One or more episodes but none of the episodes include the first or last complete segment) | 3
Function find_episodes.py (Case 2: no episodes) | 1
Function find_episodes.py (Case 3: One or more episodes but some of the episodes include the first and/or last complete segment) | 3
Function run_diagnostic.py (Case 1: Expected output is the string 'Corrupted input') | 2
Function run_diagnostic.py (Case 2: Expected output is the string 'Not enough data') | 1
Function run_diagnostic.py (Case 3: Expected output is a list of lists or an empty list) | 1

Style

Five (5) marks are awarded by your tutor for style and complexity of your solution. The style assessment includes the following, in no particular order:

Assignment Originality

You are reminded that work submitted for assessment must be your own. It's OK to discuss approaches to solutions with other students, and to get help from tutors, but you must write the Python code yourself. Sophisticated software is used to identify submissions that are unreasonably similar, and marks will be reduced or removed in such cases.

Help Sessions and ED forum

Reference:

[1] Jennifer Accardo and Jennifer Reesman, "Can you hear me snore?". Journal of Clinical Sleep Medicine, Vol. 9, Number 11.
http://jcsm.aasm.org/ViewAbstract.aspx?pid=29203