Read COMP1021 lecture slides on sorting.

Read the COMP1021 slides on sorting.

Sorting

Sorting is the process of arranging a list of objects into a specific order.

The purpose of sorting is to facilitate the later search for members of the list.

Real-life examples:


Sorting

We are given a sequence of elements

a1, a2, ..., an

Sorting consists of rearranging these elements into an order

a'1, a'2, ..., a'n

such that

a'1 < a'2 < ... < a'n


Sorting Algorithms

This problem has been extensively studied and many algorithms proposed.

Simple algorithms tend to require a number of operations proportional to n2 to sort a sequence of n elements.

Better algorithms require a number of operations proportional to n*log(n).

If n is large n*log(n) is much smaller than n2.

Hence if you have a large number of elements to sort you need a good algorithm.


Selection Sort

Selection sort is a simple O(n2) algorithm.

Probably the one you would think of if asked to sort a sequence by hand.


Selection Sort

ScanSequence
1218424461925
01218424461925
6184244121925
16184244121925
6124244181925
26124244181925
6121844421925
36121844421925
6121819424425
46121819424425
6121819254442
56121819254442
6121819254244


Selection Sort

void
selection_sort(int *sequence, int length) {
    int index, scan, min, temp;
    
    for (index = 0; index < length - 1; index++) {
        min = index;
        for (scan = index+1; scan < length; scan++)
            if (sequence[scan] < sequence[min])
                min = scan;

        /* swap the values */
        temp = sequence[min];
        sequence[min] = sequence[index];
        sequence[index] = temp;
    }
}

Analysis of Selection Sort

Selection sort is simple to analyze.

Best, average and worst cases are all the same.

Not affected by the ``sortedness'' of the input, only by the number of elements.

There are n - 1 scans over the array.

The first scan requires n - 1 comparisons, the second scan requires n - 2, the thirdn - 3, and so on ...

The number of comparisons is thus:

(n-1) + (n-2) + ... 1 = n*(n-1)/2


Quicksort

Quicksort is an efficient O(n*log(n) method for sorting.

Also a classic example of divide-and-conquer problem solving.

Basic idea is as follows:

  1. choose an element x to be pivot

  2. rearrange sequence so that:

  3. sort left part, sort right part (recursive)


Quicksort

Expressed in Haskell:
qsort []     = []
qsort (x:xs)
   = sortedSmall ++ [x] ++ sortedLarge
     where
       sortedSmall = qsort smalls
       sortedLarge = qsort larges
       smalls      = [y | y <- xs; y < x]
       larges      = [y | y <- xs; y > x]

Quicksort

To quicksort a segment of a sequence sequence[lo..hi]:

void
quick_sort(int *sequence, int length) {
	int i, j, pivotValue, temp;
	
    if (length < 2)
        return;

    /* start from left and right ends */

    i = 0;
    j = length - 1;

    /* use middle value as pivot */

    pivotValue = sequence[length/2];
    while (i < j) {

        /* Find two out-of-place elements */

        while (sequence[i] < pivotValue)
            i++;
        while (sequence[j] > pivotValue)
            j--;

        /* and swap them over */

        if (i <= j) {
            temp = sequence[i];
            sequence[i] = sequence[j];
            sequence[j] = temp;
            i++;
            j--;
        }
    }
    quick_sort(sequence, i);
    quick_sort(sequence + i, length - i);
}

Quicksort Partitioning

Example of ``good'' partitioning:

Step Sequence
i j
0 1 7 8 2 5 4 9 6 3
i j
0a 1 7 8 2 5 4 9 6 3
i j
1 1 3 8 2 5 4 9 6 7
i j
1a 1 3 8 2 5 4 9 6 7
i j
2 1 3 4 2 5 8 9 6 7
ij
2a 1 3 4 2 5 8 9 6 7
j i
3 1 > 3 4 2 5 8 9 6 7


Quicksort Partitioning

Example of ``bad'' partitioning:

Step Sequence
i j
0 5 7 8 3 2 4 9 1 6
i j
0a 5 7 8 3 2 4 9 1 6
i j
1 1 7 8 3 2 4 9 5 6
i j
1a 1 7 8 3 2 4 9 5 6
i j
2 1 2 8 3 7 4 9 5 6
j i
2a 1 2 8 3 7 4 9 5 6
3 1 2 8 3 7 4 9 5 6


Analysis of Quicksort

To partition entire array requires n comparisons, giving two sub-partitions.

If lucky, divide sequence roughly in half.

To partition each sub-partition then requires roughly n/2 comparisons.

There are two of them, so total n comparisons.

To partition each sub-sub-partition requires roughly n/4 comparisons.

There are four of them, so total n comparisons ...


Analysis of Quicksort

Can think of levels of partitioning, corresponding to the levels of recursive calls in the program.

At each level there are n comparisons.

If k levels, require total n*k comparisons.


Best-case Performance of Quicksort

Minimum number of levels occurs if sequence partitioned into equal halves at each level.

That is, repeatedly divide the sequence in half until only one element in each partition.

If sequence is length n=2j, takes j steps.

In other words, there are log2(n) levels.

So sorting takes n * log2(n) operations.


Worst-case Performance of Quicksort

Worst-case occurs with unfortunate choice of pivot (i.e. largest or smallest value).

In this case, partition sequence into:

In very worst case, every pivot we choose is like this.

This givesn levels

So sorting takes n 2 operations.


Average-case Performance of Quicksort

Average case performance is tricky to analyse.

Consider all possible permutations of input, and compute cost for each.

Turns out that average case is around 1.4 times as slow as the best case.


Selection Sort versus Quick Sort - 200 items

Selection SortQuick Sort
200*200 = 40000200*log2(200) = 1529

Entire Program

#include <stdio.h>

void
selection_sort(int *sequence, int length) {
    int index, scan, min, temp;
    
    for (index = 0; index < length - 1; index++) {
        min = index;
        for (scan = index+1; scan < length; scan++)
            if (sequence[scan] < sequence[min])
                min = scan;

        /* swap the values */
        temp = sequence[min];
        sequence[min] = sequence[index];
        sequence[index] = temp;
    }
}

void
quick_sort(int *sequence, int length) {
	int i, j, pivotValue, temp;
	
    if (length <= 1)
        return;

    /* start from left and right ends */

    i = 0;
    j = length - 1;

    /* use middle value as pivot */

    pivotValue = sequence[length/2];
    while (i < j) {

        /* Find two out-of-place elements */

        while (sequence[i] < pivotValue)
            i++;
        while (sequence[j] > pivotValue)
            j--;

        /* and swap them over */

        if (i <= j) {
            temp = sequence[i];
            sequence[i] = sequence[j];
            sequence[j] = temp;
            i++;
            j--;
        }
    }
    quick_sort(sequence, j + 1);
    quick_sort(sequence + i, length - i);
}

void
print_array(int *a, int length) {
    int i;
    for (i = 0; i < length; i++)
        printf("%2d ", a[i]);
    printf("\n");
}

int main() {
    int numbers[] = {12,18,42,44,6,19,25};
    int length = sizeof numbers/sizeof numbers[0];
    
    printf("Unsorted: ");
    print_array(numbers, length);
/*    selection_sort(numbers, length); */
    quick_sort(numbers, 0, length - 1);
	printf("Sorted:   ");
    print_array(numbers, length);
    return 0;
}

Problem

We need a list of the words occurring in a piece text

We also need the number of occurrences of each word.

The words must be in dictionary order.

The text will be supplied on standard input.

For example:

% a.out <text
  381 a
    1 abandoned
    1 ability
   10 able
   19 about
    2 above
    1 abreast
    1 absence
    1 absent
    3 absolutely
    1 abyss
    1 acceptance
    2 accord
    1 accumulation
    ....

Problem Decomposition

    while there is input {
        read a word
        if we have seen that word before
            increment the counter for that word
        else
            create a new counter for that word
    }
    put the (unique) words in an array
    sort the  array
    for each word
        print counter and array

Linked Listed used to Count Words

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

typedef struct word_count {
       char                 *word;
       int                  count;
       struct word_count    *next;
} word_count_t;

void *
salloc(size_t size) {
    void *m = malloc(size);
    if (m == NULL) {
        fprintf(stderr, "out of memory - attempting to malloc %d bytes\n", (int)size);
        exit(1);
    }
    return m;
}

void *
srealloc(void *p, size_t size) {
    void *m = realloc(p, size);
    if (m == NULL) {
        fprintf(stderr, "out of memory - attempting to realloc %d bytes\n", (int)size);
        exit(1);
    }
    return m;
}

char *
get_word(void) {
    int i, c;
    char *buffer = NULL;
    int buffer_length = 0;
    
    for (i = 0; ; i++) {
        c = getchar();
        if (c == EOF)
            break;
        if (!isalpha(c)) {
            if (i == 0) {
                i--;
                continue;
            } else
                break;
        }
        if (i >= buffer_length-1) {
            buffer_length += 16;
            buffer = srealloc(buffer, buffer_length*sizeof (char));
        }
        buffer[i] = c;
    }
    if (i == 0)
        return NULL;
    buffer[i] = '\0';
    return srealloc(buffer, strlen(buffer) + 1);
}


word_count_t *find_word(char *word, word_count_t *w) {
    for (; w != NULL; w = w->next)
        if (strcmp(word, w->word) == 0)
            return w;
    return NULL;
}

/*
 * read the words from a stream
 * associate a word_count struct with
 * each new word
 *
 * increment the count field each time the 
 * word is seen
 */
word_count_t *
read_words(void) {
    char *word;
    word_count_t *w;
    word_count_t *word_count_list = NULL;
   
    while (1) {
        word = get_word();
        if (word == NULL)
            return word_count_list;
        w = find_word(word, word_count_list);
        if (w != NULL) {
            w->count++;
            free(word);
            continue;
        }
        w = salloc(sizeof *w);
        w->word = word;
        w->count = 1;
        w->next = word_count_list;
        word_count_list = w;
    }
}

void
sort_words(word_count_t **sequence, int length) {
    int i, j;
    word_count_t *pivotValue;
    word_count_t *temp;
    
    if (length <= 1)
        return;

    /* start from left and right ends */

    i = 0;
    j = length - 1;

    /* use middle value as pivot */

    pivotValue = sequence[length/2];
    while (i < j) {

        /* Find two out-of-place elements */

        while (strcmp(sequence[i]->word, pivotValue->word) < 0)
            i++;
        while (strcmp(sequence[j]->word, pivotValue->word) > 0)
           j--;
        /* and swap them over */

        if (i <= j) {
            temp = sequence[i];
            sequence[i] = sequence[j];
            sequence[j] = temp;
            i++;
            j--;
        }
    }
    sort_words(sequence, j + 1);
    sort_words(sequence+i, length - i);
}

int
main(int argc, char *argv[]) {
    int i, n_unique_words = 0;
    word_count_t *word_count_list, *w;
    word_count_t **word_array;
     
    word_count_list = read_words();
    for (w = word_count_list; w != NULL; w = w->next)
        n_unique_words++;
    word_array = salloc(n_unique_words*sizeof *word_array);
    for (w = word_count_list, i = 0; w != NULL; w = w->next)
        word_array[i++] = w;
    sort_words(word_array, n_unique_words);

    for (i = 0; i < n_unique_words; i++)
        printf("%5d %s\n", word_array[i]->count, word_array[i]->word);

    return 0;
}

Performance

Unfortunately our code is very slow on larger input texts: Input
LinesExecution
Time (s) 250 0.03 500 0.13 1000 0.48 2000 1.66 4000 7.85 8000 47.55 16000189.57


Extrapolating Performance

The execution time of one program increases faster than linearly.

A doubling of the input size, increases execution time by factor of very roughly four.

This suggests execution time is a function of the square of the input size.

Fitting a quadratic to the times (for gcc -O3) yields this equation:

execution_time(lines) = 3.90609e-07*lines*lines + 0.00195474*lines

Here is a graph of this equation:

This extrapolation suggests 100,000 lines of input will take 4000 seconds or over an hour to process.

Just FYI, here is how the fitting and graph were done:

% cat >times <<eof
  250   0.03
  500   0.13
 1000   0.48
 2000   1.66
 4000   7.85
 8000  47.55
16000 189.57
        eof
% gnuplot
gnuplot> plot [1:100000] [1:5000] "times" with lines
gnuplot> f(x) = a*x*x + b*x
gnuplot> fit f(x) "times" via a,b 
gnuplot> execution_time(lines) = 3.90609e-07*lines*lines + 0.00195474*lines
gnuplot> replot execution_time(x)

Improving Performance - Inspection and Guessing

Common to attempt to improve performance by examining program code and guessing where performance problems lie.

Typically >80% of execution time spent in <20% of program.

Hence, blind attempts to improve performance often fruitless.

We need to know which functions are contributing most to execution time. Some tools allow you to examine where execution time is spent.

E.g. gcc -p and gprof.

% gcc -p count_words.c
% sed 2000q dracula.txt |a.out >/dev/null
% gprof a.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 98.92      8.38     8.38    19978     0.00     0.00  find_word
  0.95      8.46     0.08    19979     0.00     0.00  get_word
  0.12      8.47     0.01        1     0.01     8.47  read_words
  0.12      8.48     0.01        1     0.01     0.01  sort_words
  0.12      8.49     0.01                             main
  0.00      8.49     0.00    39965     0.00     0.00  srealloc
  0.00      8.49     0.00     4518     0.00     0.00  salloc
...
Clearly we need to improve performance of the find_word functions.

Counting Words with A Binary Tree

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

typedef struct word_count {
       char                 *word;
       int                  count;
       struct word_count    *smaller;
       struct word_count    *larger;
} word_count_t;

void *
salloc(size_t size) {
    void *m = malloc(size);
    if (m == NULL) {
        fprintf(stderr, "out of memory - attempting to malloc %d bytes\n", (int)size);
        exit(1);
    }
    return m;
}

void *
srealloc(void *p, size_t size) {
    void *m = realloc(p, size);
    if (m == NULL) {
        fprintf(stderr, "out of memory - attempting to realloc %d bytes\n", (int)size);
        exit(1);
    }
    return m;
}

char *
get_word(void) {
    int i, c;
    char *buffer = NULL;
    int buffer_length = 0;
    
    for (i = 0; ; i++) {
        c = getchar();
        if (c == EOF)
            break;
        if (!isalpha(c)) {
            if (i == 0) {
                i--;
                continue;
            } else
                break;
        }
        if (i >= buffer_length-1) {
            buffer_length += 16;
            buffer = srealloc(buffer, buffer_length*sizeof (char));
        }
        buffer[i] = c;
    }
    if (i == 0)
        return NULL;
    buffer[i] = '\0';
    return srealloc(buffer, strlen(buffer) + 1);
}

word_count_t *update_tree(char *word, word_count_t *w) {
    int compare;
    if (w== NULL) {
        w = salloc(sizeof *w);
        w->word = word;
        w->count = 1;
        w->smaller = NULL;
        w->larger = NULL;
        return w;
    }
    compare = strcmp(word, w->word);
    if (compare == 0) {
        w->count++;
        free(word);
    } else if (compare < 0)
        w->smaller = update_tree(word, w->smaller);
    else
        w->larger = update_tree(word, w->larger);
    return w;
}

/*
 * read the words from a stream
 * associate a word_count struct with
 * each new word
 *
 * increment the count field each time the 
 * word is seen
 */
word_count_t *
read_words(void) {
    char *word;
    word_count_t *word_count_tree = NULL;
   
    while (1) {
        word = get_word();
        if (word == NULL)
            return word_count_tree;
        word_count_tree = update_tree(word, word_count_tree);
    }
}

void print_words(word_count_t *w) {
    if (w == NULL)
        return;
    print_words(w->smaller);
    printf("%5d %s\n", w->count, w->word);
    print_words(w->larger);
}

int
main(int argc, char *argv[]) {
    print_words(read_words());
    return 0;
}

Binary Tree - Performance

Input
Lines
Execution
Time (s)
1250.00
2500.00
5000.01
10000.02
20000.04
40000.08
80000.16
160000.32

Execution is now fast and linear, i.e proportional to input size.

An input of 100000 lines will take 2 seconds to process

Great improvement on linked list (1.5 hours).


Solving the Problem with A Shell Command

Many problems can be solved with a single-line Unix shell command.

This can be much much faster than writing a C program to solve a problem.

It is well worth becoming familiar with Unix programs such as: sort, grep, sed, wc, cut, uniq and tr.

This Unix command will count the words in its input:

tr -c A-Za-z '\n'|sed '/^$/d'|sort|uniq -c

tr -c A-Za-z '\n' converts all characters except letters to newline characters.

The leaves lines containing either a single word or nothing.

sed '/^$/d' deletes empty lines.

sort places the lines (words) in sorted order.

uniq -c counts adjacent identical lines (words).

The result is also quite efficient:

% time sh -c "<dracula.txt tr -c A-Za-z '\n'|sed '/^$/d'|sort|uniq -c >/dev/null"
0.99user ...

or in Perl:

#!/usr/bin/perl -w
while (<>) {
    $c{$_}++ foreach split /[^a-zA-Z]+/;
}
printf "%5d %s\n",$c{$_}, $_ foreach sort keys %c;

Cartoon

Copyright Scott Adams


Assessment Summary

The assessable components of the subject are:

Hurdle Requirements


50+% on both exams

Students with < 15/40 for assignments and lab exercises will not be allowed to attempt the practical exam


Check Your Marks

You can check your marks with this command:
/home/cs1721/bin/classrun -sturec
After the practical exam, check that your ass1, ass2 and lab marks are correct.

Written Exam

Two hours.

8:45 Tuesday 16th.

Run by examinations branch.

http://www.infonet.unsw.edu.au/academic/exams/

Multiple choice and short answer questions.

Skeletal sample written exam on class web page.

Note the penalty for answering multiple choice questions incorrectly.

Written exam shared with 1021.

Only topics covered in both subjects can appear.


Written Exam - C Info Sheet

For both written and practical exam you will be give a C info sheet.

Its a summary of useful information about C syntax and library functions.

Copy linked to class web page.


Written Exam - Core C Material

This C material is likely to appear on the written exam:


Written Exam - Core Systems Material

This systems material is likely to appear on the written exam:


Written Exam - Systems Material

The written exam will not cover the MIPS.

No questions about assembler on written exam.

Note, the practical exam will have an assembler question.

The written exam will only cover systems material which has also been covered in 1021.

Some of the 1021 material has been covered only indirectly in 1721.

Read the 1021 lecture slides for Stacks & Queues, Binary Trees, Sorting Algorithms and the System Strand and make sure you understand this material.


Practical Exam - Location and Time

The exam is held Monday 22/11 14:15-17:00

Time on examinations web page is incorrect.

Seat allocations posted on web several days beforehand.


Practical Exam - Format

Exam is conducted on-line.

You will use a special exam account.

You will not be able to access your files or any other files except a few files provided specifically for the exam.

You will not be able to access the network or world wide web.

You will type the answer to each question into a separate file.

Exam is 150 minutes long.

Sample exams on class web page.


Practical Exam - Materials

Exam is closed book.

No notes. No books. No files.

You will be given a C info sheet.

Its a summary of useful information about C syntax and library functions.

You will also be given a summary of useful MIPS instructions and assembler syntax.

Copies linked to class web page.


Practical Exam - Questions

A typical question will specify a task and ask you to write a C program to solve this task.

You may be asked to write a program in MIPS assembler.

Question will usually include examples.

Read question and examples carefully.

Some questions may forbid you using some or all methods from the C API.

Some questions may require a particular data structure - e.g.: a linked list or array.


Practical Exam - Hints

Some questions may give you some already written C.

Most questions will give you no C - you will have to write the program from scratch.

Questions may ask you to write programs which take a fixed or unspecified number of command line arguments.

Questions may ask you to write programs which read a fixed or unspecified number of lines of input.

You may be required to read input until a certain value or string occurs.

You may be required to read input until the end-of-file is reached.

You may be require to convert between types such as: int, double, char and strings.


Sample Practical Exam

Week 14 tutorial questions are from an old practical exam.

A sample practical exam posted on class web page.


Automatic Marking

Your answers will be run through automatic marking software.

Please follow the input/output format shown exactly.

Please make your program behave exactly as specified.

All answers are also hand marked. The automatic marking is to assist these markers.


Practical Exam - Marking

No marks awarded for style or comments.

Answers may be penalised if they are confusing or otherwise difficult for the marker to understand.

Use decent formatting so the marker (and you) can read the program.

Comments only necessary if you want to tell the marker something.

Minor errors will result in only a small penalty.

E.g.: an answer correct except for a missing semi-colon would receive almost full marks.

No marks will given unless an answer contains a substantial part of a solution (> 50%).

No marks just for starting a question and writing some C


Provisional Results

Provisional results will be available via /home/cs1721/bin/classrun -sturec by Friday 3rd December.

Final results will appear via myUNSW.


Supplementary Exams

The supplementary exams will only be available to students offered supplementary assessment by school or faculty examiners meetings.

Most people sitting the supplementary exams missed the original exam due to illness (and hence applied for special consideration).

The examiners meetings sometimes offer supplementary assessment to students with borderline results, e.g. with a final mark of 48 and passes in all other subjects.

Please don't ask me if you can sit the supplementary exams. It is up to the examiners meeting.

If you are disadvantaged in some way you should apply for special consideration at the student centre.

Students who have passed the subject are not normally offered supplementary assessment.

Details posted on class web page the week before.


Illness during Final Exam

By attending the exam, you are saying "I am well enough to sit it".

If you really are sick, stay home and apply for Special Consideration.

Claims for special consideration by student who sit the final exams are normally be ignored.

Sometimes students claim to have become ill during an exam.

If you do become ill during the exam, notify an exam supervisor immediately.

Students do not get two attempts at the exam without excellent reasons.


Supplementary Exam

The supplementary exam will likely sometime in the period be held December 13-17

A list of students offerred supplementary assessment should appear on the class web page by Monday December 13. It will also be available from the CSE school office.

Its your responsibility to discover if you have been offerred supplementary assessment.

The format of the supplementary exam will be the same as the final exam.

There is no alternative to the supplementary exam - if you miss it your grade will be FL.


Good Luck

I hope you learnt something in this subject.

I hope you get the mark you deserve.