The purpose of sorting is to facilitate the later search for members of the list.
Real-life examples:
a1, a2, ..., an
Sorting consists of rearranging these elements into an order
a'1, a'2, ..., a'n
such that
a'1 < a'2 < ... < a'n
Simple algorithms tend to require a number of operations proportional to n2 to sort a sequence of n elements.
Better algorithms require a number of operations proportional to n*log(n).
If n is large n*log(n) is much smaller than n2.
Hence if you have a large number of elements to sort you need a good algorithm.
Probably the one you would think of if asked to sort a sequence by hand.
| Scan | Sequence
| 12 | 18 | 42 | 44 | 6 | 19 | 25
| 0 | 12 | 18 | 42 | 44 | 6 | 19 | 25
| 6 | 18 | 42 | 44 | 12 | 19 | 25
| 1 | 6 | 18 | 42 | 44 | 12 | 19 | 25
| | 6 | 12 | 42 | 44 | 18 | 19 | 25
| 2 | 6 | 12 | 42 | 44 | 18 | 19 | 25
| 6 | 12 | 18 | 44 | 42 | 19 | 25
| 3 | 6 | 12 | 18 | 44 | 42 | 19 | 25
| 6 | 12 | 18 | 19 | 42 | 44 | 25
| 4 | 6 | 12 | 18 | 19 | 42 | 44 | 25
| 6 | 12 | 18 | 19 | 25 | 44 | 42
| 5 | 6 | 12 | 18 | 19 | 25 | 44 | 42
| 6 | 12 | 18 | 19 | 25 | 42 | 44
| |
|---|
void
selection_sort(int *sequence, int length) {
int index, scan, min, temp;
for (index = 0; index < length - 1; index++) {
min = index;
for (scan = index+1; scan < length; scan++)
if (sequence[scan] < sequence[min])
min = scan;
/* swap the values */
temp = sequence[min];
sequence[min] = sequence[index];
sequence[index] = temp;
}
}
Best, average and worst cases are all the same.
Not affected by the ``sortedness'' of the input, only by the number of elements.
There are n - 1 scans over the array.
The first scan requires n - 1 comparisons, the second scan requires n - 2, the thirdn - 3, and so on ...
The number of comparisons is thus:
(n-1) + (n-2) + ... 1 = n*(n-1)/2
Also a classic example of divide-and-conquer problem solving.
Basic idea is as follows:
x to be pivot
qsort [] = []
qsort (x:xs)
= sortedSmall ++ [x] ++ sortedLarge
where
sortedSmall = qsort smalls
sortedLarge = qsort larges
smalls = [y | y <- xs; y < x]
larges = [y | y <- xs; y > x]
sequence[lo..hi]:
void
quick_sort(int *sequence, int length) {
int i, j, pivotValue, temp;
if (length < 2)
return;
/* start from left and right ends */
i = 0;
j = length - 1;
/* use middle value as pivot */
pivotValue = sequence[length/2];
while (i < j) {
/* Find two out-of-place elements */
while (sequence[i] < pivotValue)
i++;
while (sequence[j] > pivotValue)
j--;
/* and swap them over */
if (i <= j) {
temp = sequence[i];
sequence[i] = sequence[j];
sequence[j] = temp;
i++;
j--;
}
}
quick_sort(sequence, i);
quick_sort(sequence + i, length - i);
}
| Step | Sequence | ||||||||
i | j
| ||||||||
| 0 | 1 | 7 | 8 | 2 | 5 | 4 | 9 | 6 | 3 |
i | j
| ||||||||
| 0a | 1 | 7 | 8 | 2 | 5 | 4 | 9 | 6 | 3 |
i | j | ||||||||
| 1 | 1 | 3 | 8 | 2 | 5 | 4 | 9 | 6 | 7 |
i | j | ||||||||
| 1a | 1 | 3 | 8 | 2 | 5 | 4 | 9 | 6 | 7 |
i | j | ||||||||
| 2 | 1 | 3 | 4 | 2 | 5 | 8 | 9 | 6 | 7 |
ij | |||||||||
| 2a | 1 | 3 | 4 | 2 | 5 | 8 | 9 | 6 | 7 |
j | i | ||||||||
| 3 | 1 | > 3 | 4 | 2 | 5 | 8 | 9 | 6 | 7 |
| Step | Sequence | ||||||||
i | j
| ||||||||
| 0 | 5 | 7 | 8 | 3 | 2 | 4 | 9 | 1 | 6 |
i | j | ||||||||
| 0a | 5 | 7 | 8 | 3 | 2 | 4 | 9 | 1 | 6 |
i | j | ||||||||
| 1 | 1 | 7 | 8 | 3 | 2 | 4 | 9 | 5 | 6 |
i | j | ||||||||
| 1a | 1 | 7 | 8 | 3 | 2 | 4 | 9 | 5 | 6 |
i | j | ||||||||
| 2 | 1 | 2 | 8 | 3 | 7 | 4 | 9 | 5 | 6 |
j | i | ||||||||
| 2a | 1 | 2 | 8 | 3 | 7 | 4 | 9 | 5 | 6 |
| 3 | 1 | 2 | 8 | 3 | 7 | 4 | 9 | 5 | 6 |
If lucky, divide sequence roughly in half.
To partition each sub-partition then requires roughly n/2 comparisons.
There are two of them, so total n comparisons.
To partition each sub-sub-partition requires roughly n/4 comparisons.
There are four of them, so total n comparisons ...
At each level there are n comparisons.
If k levels, require total n*k comparisons.
That is, repeatedly divide the sequence in half until only one element in each partition.
If sequence is length n=2j, takes j steps.
In other words, there are log2(n) levels.
So sorting takes n * log2(n) operations.
In this case, partition sequence into:
In very worst case, every pivot we choose is like this.
This givesn levels
So sorting takes n 2 operations.
Consider all possible permutations of input, and compute cost for each.
Turns out that average case is around 1.4 times as slow as the best case.
| Selection Sort | Quick Sort
|
|
| 200*200 = 40000 | 200*log2(200) = 1529
| |
|---|
#include <stdio.h>
void
selection_sort(int *sequence, int length) {
int index, scan, min, temp;
for (index = 0; index < length - 1; index++) {
min = index;
for (scan = index+1; scan < length; scan++)
if (sequence[scan] < sequence[min])
min = scan;
/* swap the values */
temp = sequence[min];
sequence[min] = sequence[index];
sequence[index] = temp;
}
}
void
quick_sort(int *sequence, int length) {
int i, j, pivotValue, temp;
if (length <= 1)
return;
/* start from left and right ends */
i = 0;
j = length - 1;
/* use middle value as pivot */
pivotValue = sequence[length/2];
while (i < j) {
/* Find two out-of-place elements */
while (sequence[i] < pivotValue)
i++;
while (sequence[j] > pivotValue)
j--;
/* and swap them over */
if (i <= j) {
temp = sequence[i];
sequence[i] = sequence[j];
sequence[j] = temp;
i++;
j--;
}
}
quick_sort(sequence, j + 1);
quick_sort(sequence + i, length - i);
}
void
print_array(int *a, int length) {
int i;
for (i = 0; i < length; i++)
printf("%2d ", a[i]);
printf("\n");
}
int main() {
int numbers[] = {12,18,42,44,6,19,25};
int length = sizeof numbers/sizeof numbers[0];
printf("Unsorted: ");
print_array(numbers, length);
/* selection_sort(numbers, length); */
quick_sort(numbers, 0, length - 1);
printf("Sorted: ");
print_array(numbers, length);
return 0;
}
We also need the number of occurrences of each word.
The words must be in dictionary order.
The text will be supplied on standard input.
For example:
% a.out <text
381 a
1 abandoned
1 ability
10 able
19 about
2 above
1 abreast
1 absence
1 absent
3 absolutely
1 abyss
1 acceptance
2 accord
1 accumulation
....
while there is input {
read a word
if we have seen that word before
increment the counter for that word
else
create a new counter for that word
}
put the (unique) words in an array
sort the array
for each word
print counter and array
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
typedef struct word_count {
char *word;
int count;
struct word_count *next;
} word_count_t;
void *
salloc(size_t size) {
void *m = malloc(size);
if (m == NULL) {
fprintf(stderr, "out of memory - attempting to malloc %d bytes\n", (int)size);
exit(1);
}
return m;
}
void *
srealloc(void *p, size_t size) {
void *m = realloc(p, size);
if (m == NULL) {
fprintf(stderr, "out of memory - attempting to realloc %d bytes\n", (int)size);
exit(1);
}
return m;
}
char *
get_word(void) {
int i, c;
char *buffer = NULL;
int buffer_length = 0;
for (i = 0; ; i++) {
c = getchar();
if (c == EOF)
break;
if (!isalpha(c)) {
if (i == 0) {
i--;
continue;
} else
break;
}
if (i >= buffer_length-1) {
buffer_length += 16;
buffer = srealloc(buffer, buffer_length*sizeof (char));
}
buffer[i] = c;
}
if (i == 0)
return NULL;
buffer[i] = '\0';
return srealloc(buffer, strlen(buffer) + 1);
}
word_count_t *find_word(char *word, word_count_t *w) {
for (; w != NULL; w = w->next)
if (strcmp(word, w->word) == 0)
return w;
return NULL;
}
/*
* read the words from a stream
* associate a word_count struct with
* each new word
*
* increment the count field each time the
* word is seen
*/
word_count_t *
read_words(void) {
char *word;
word_count_t *w;
word_count_t *word_count_list = NULL;
while (1) {
word = get_word();
if (word == NULL)
return word_count_list;
w = find_word(word, word_count_list);
if (w != NULL) {
w->count++;
free(word);
continue;
}
w = salloc(sizeof *w);
w->word = word;
w->count = 1;
w->next = word_count_list;
word_count_list = w;
}
}
void
sort_words(word_count_t **sequence, int length) {
int i, j;
word_count_t *pivotValue;
word_count_t *temp;
if (length <= 1)
return;
/* start from left and right ends */
i = 0;
j = length - 1;
/* use middle value as pivot */
pivotValue = sequence[length/2];
while (i < j) {
/* Find two out-of-place elements */
while (strcmp(sequence[i]->word, pivotValue->word) < 0)
i++;
while (strcmp(sequence[j]->word, pivotValue->word) > 0)
j--;
/* and swap them over */
if (i <= j) {
temp = sequence[i];
sequence[i] = sequence[j];
sequence[j] = temp;
i++;
j--;
}
}
sort_words(sequence, j + 1);
sort_words(sequence+i, length - i);
}
int
main(int argc, char *argv[]) {
int i, n_unique_words = 0;
word_count_t *word_count_list, *w;
word_count_t **word_array;
word_count_list = read_words();
for (w = word_count_list; w != NULL; w = w->next)
n_unique_words++;
word_array = salloc(n_unique_words*sizeof *word_array);
for (w = word_count_list, i = 0; w != NULL; w = w->next)
word_array[i++] = w;
sort_words(word_array, n_unique_words);
for (i = 0; i < n_unique_words; i++)
printf("%5d %s\n", word_array[i]->count, word_array[i]->word);
return 0;
}
A doubling of the input size, increases execution time by factor of very roughly four.
This suggests execution time is a function of the square of the input size.
Fitting a quadratic to the times (for gcc -O3)
yields this equation:
execution_time(lines) = 3.90609e-07*lines*lines + 0.00195474*lines
Here is a graph of this equation:
This extrapolation suggests 100,000 lines of input will take 4000 seconds or over an hour to process.
Just FYI, here is how the fitting and graph were done:
% cat >times <<eof
250 0.03
500 0.13
1000 0.48
2000 1.66
4000 7.85
8000 47.55
16000 189.57
eof
% gnuplot
gnuplot> plot [1:100000] [1:5000] "times" with lines
gnuplot> f(x) = a*x*x + b*x
gnuplot> fit f(x) "times" via a,b
gnuplot> execution_time(lines) = 3.90609e-07*lines*lines + 0.00195474*lines
gnuplot> replot execution_time(x)
Typically >80% of execution time spent in <20% of program.
Hence, blind attempts to improve performance often fruitless.
We need to know which functions are contributing most to execution time. Some tools allow you to examine where execution time is spent.
E.g. gcc -p and gprof.
% gcc -p count_words.c
% sed 2000q dracula.txt |a.out >/dev/null
% gprof a.out
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
98.92 8.38 8.38 19978 0.00 0.00 find_word
0.95 8.46 0.08 19979 0.00 0.00 get_word
0.12 8.47 0.01 1 0.01 8.47 read_words
0.12 8.48 0.01 1 0.01 0.01 sort_words
0.12 8.49 0.01 main
0.00 8.49 0.00 39965 0.00 0.00 srealloc
0.00 8.49 0.00 4518 0.00 0.00 salloc
...
Clearly we need to improve performance of the find_word functions.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
typedef struct word_count {
char *word;
int count;
struct word_count *smaller;
struct word_count *larger;
} word_count_t;
void *
salloc(size_t size) {
void *m = malloc(size);
if (m == NULL) {
fprintf(stderr, "out of memory - attempting to malloc %d bytes\n", (int)size);
exit(1);
}
return m;
}
void *
srealloc(void *p, size_t size) {
void *m = realloc(p, size);
if (m == NULL) {
fprintf(stderr, "out of memory - attempting to realloc %d bytes\n", (int)size);
exit(1);
}
return m;
}
char *
get_word(void) {
int i, c;
char *buffer = NULL;
int buffer_length = 0;
for (i = 0; ; i++) {
c = getchar();
if (c == EOF)
break;
if (!isalpha(c)) {
if (i == 0) {
i--;
continue;
} else
break;
}
if (i >= buffer_length-1) {
buffer_length += 16;
buffer = srealloc(buffer, buffer_length*sizeof (char));
}
buffer[i] = c;
}
if (i == 0)
return NULL;
buffer[i] = '\0';
return srealloc(buffer, strlen(buffer) + 1);
}
word_count_t *update_tree(char *word, word_count_t *w) {
int compare;
if (w== NULL) {
w = salloc(sizeof *w);
w->word = word;
w->count = 1;
w->smaller = NULL;
w->larger = NULL;
return w;
}
compare = strcmp(word, w->word);
if (compare == 0) {
w->count++;
free(word);
} else if (compare < 0)
w->smaller = update_tree(word, w->smaller);
else
w->larger = update_tree(word, w->larger);
return w;
}
/*
* read the words from a stream
* associate a word_count struct with
* each new word
*
* increment the count field each time the
* word is seen
*/
word_count_t *
read_words(void) {
char *word;
word_count_t *word_count_tree = NULL;
while (1) {
word = get_word();
if (word == NULL)
return word_count_tree;
word_count_tree = update_tree(word, word_count_tree);
}
}
void print_words(word_count_t *w) {
if (w == NULL)
return;
print_words(w->smaller);
printf("%5d %s\n", w->count, w->word);
print_words(w->larger);
}
int
main(int argc, char *argv[]) {
print_words(read_words());
return 0;
}
| Input Lines | Execution Time (s) |
|---|---|
| 125 | 0.00 |
| 250 | 0.00 |
| 500 | 0.01 |
| 1000 | 0.02 |
| 2000 | 0.04 |
| 4000 | 0.08 |
| 8000 | 0.16 |
| 16000 | 0.32 |
Execution is now fast and linear, i.e proportional to input size.
An input of 100000 lines will take 2 seconds to process
Great improvement on linked list (1.5 hours).
This can be much much faster than writing a C program to solve a problem.
It is well worth becoming familiar with Unix programs such as: sort, grep, sed, wc, cut, uniq and tr.
This Unix command will count the words in its input:
tr -c A-Za-z '\n'|sed '/^$/d'|sort|uniq -c
tr -c A-Za-z '\n' converts all characters except letters to newline characters.
The leaves lines containing either a single word or nothing.
sed '/^$/d' deletes empty lines.
sort places the lines (words) in sorted order.
uniq -c counts adjacent identical lines (words).
The result is also quite efficient:
% time sh -c "<dracula.txt tr -c A-Za-z '\n'|sed '/^$/d'|sort|uniq -c >/dev/null"
0.99user ...
or in Perl:
#!/usr/bin/perl -w
while (<>) {
$c{$_}++ foreach split /[^a-zA-Z]+/;
}
printf "%5d %s\n",$c{$_}, $_ foreach sort keys %c;
Copyright Scott Adams
Students with < 15/40 for assignments and lab exercises will not be allowed to attempt the practical exam
/home/cs1721/bin/classrun -sturecAfter the practical exam, check that your ass1, ass2 and lab marks are correct.
8:45 Tuesday 16th.
Run by examinations branch.
http://www.infonet.unsw.edu.au/academic/exams/
Multiple choice and short answer questions.
Skeletal sample written exam on class web page.
Note the penalty for answering multiple choice questions incorrectly.
Written exam shared with 1021.
Only topics covered in both subjects can appear.
Its a summary of useful information about C syntax and library functions.
Copy linked to class web page.
The written exam will not cover the MIPS.
No questions about assembler on written exam.
Note, the practical exam will have an assembler question.
The written exam will only cover systems material which has also been covered in 1021.
Some of the 1021 material has been covered only indirectly in 1721.
Read the 1021 lecture slides for Stacks & Queues, Binary Trees, Sorting Algorithms and the System Strand and make sure you understand this material.
The exam is held Monday 22/11 14:15-17:00
Time on examinations web page is incorrect.
Seat allocations posted on web several days beforehand.
You will use a special exam account.
You will not be able to access your files or any other files except a few files provided specifically for the exam.
You will not be able to access the network or world wide web.
You will type the answer to each question into a separate file.
Exam is 150 minutes long.
Sample exams on class web page.
No notes. No books. No files.
You will be given a C info sheet.
Its a summary of useful information about C syntax and library functions.
You will also be given a summary of useful MIPS instructions and assembler syntax.
Copies linked to class web page.
You may be asked to write a program in MIPS assembler.
Question will usually include examples.
Read question and examples carefully.
Some questions may forbid you using some or all methods from the C API.
Some questions may require a particular data structure - e.g.: a linked list or array.
Some questions may give you some already written C.
Most questions will give you no C - you will have to write the program from scratch.
Questions may ask you to write programs which take a fixed or unspecified number of command line arguments.
Questions may ask you to write programs which read a fixed or unspecified number of lines of input.
You may be required to read input until a certain value or string occurs.
You may be required to read input until the end-of-file is reached.
You may be require to convert between types such as: int, double, char and strings.
A sample practical exam posted on class web page.
Please follow the input/output format shown exactly.
Please make your program behave exactly as specified.
All answers are also hand marked. The automatic marking is to assist these markers.
No marks awarded for style or comments.
Answers may be penalised if they are confusing or otherwise difficult for the marker to understand.
Use decent formatting so the marker (and you) can read the program.
Comments only necessary if you want to tell the marker something.
Minor errors will result in only a small penalty.
E.g.: an answer correct except for a missing semi-colon would receive almost full marks.
No marks will given unless an answer contains a substantial part of a solution (> 50%).
No marks just for starting a question and writing some C
Final results will appear via myUNSW.
Most people sitting the supplementary exams missed the original exam due to illness (and hence applied for special consideration).
The examiners meetings sometimes offer supplementary assessment to students with borderline results, e.g. with a final mark of 48 and passes in all other subjects.
Please don't ask me if you can sit the supplementary exams. It is up to the examiners meeting.
If you are disadvantaged in some way you should apply for special consideration at the student centre.
Students who have passed the subject are not normally offered supplementary assessment.
Details posted on class web page the week before.
If you really are sick, stay home and apply for Special Consideration.
Claims for special consideration by student who sit the final exams are normally be ignored.
Sometimes students claim to have become ill during an exam.
If you do become ill during the exam, notify an exam supervisor immediately.
Students do not get two attempts at the exam without excellent reasons.
A list of students offerred supplementary assessment should appear on the class web page by Monday December 13. It will also be available from the CSE school office.
Its your responsibility to discover if you have been offerred supplementary assessment.
The format of the supplementary exam will be the same as the final exam.
There is no alternative to the supplementary exam - if you miss it your grade will be FL.
I hope you learnt something in this subject.
I hope you get the mark you deserve.