Wednesday Week 9
Nerds You Should Know #8 | 1/34 |
The next in a series on famous computer scientists ...
They developed one of the most useful Web tools ...
Larry Page
Sergey Brin
- Co-founders of Google
- Page: BSc/BE University of Michigan
- Brin: BSc University of Maryland
- Both moved to Stanford for PhD in mid-1990s
- PhD work led to new ideas on Web searching
- use keywords like "normal" search engines
- augment document ranking by "credibility"
- credibility related to inbound links
- Ideas led to prototype, then to company
- Google Inc. founded in 1998
A file is a sequence of bytes on a storage device.
Files are normally persistent ...
- data written to a file remains in storage
- the same data can be read multiple times
Files are named (e.g. /home/mit/notes.txt) ...
- the name is contained in the file system
- the data is accessed via the file's name
- file system also enforces access permissions on files
Exercise: Unix file permissions | 4/34 |
Each file on Unix has:
- three permission levels: user, group, others
- three permission types: read, write, execute
Investigate how each of these
- can be set/unset
- affects what you can do with the file
What kind of data is held in files?
Text files contain ASCII characters ...
- each byte has value in range 0..127 (7-bits unsigned)
- typically partitioned into lines ("\n" or "\r\n")
Binary files contain arbitrary bytes ...
- each byte has full 8-bits available
- sequences of bytes may be interpreted as e.g. int
Type of data in a file is determined by suffix or by content
(e.g. .c .o .txt .tex .doc .jpg .mp3 .wmv vs Unix file command)
... File Data/Operations | 6/34 |
Standard operations on files:
- open ... get access to a file to use it
- read ... read data from an open file
- write ... write data to an open file
- close ... stop using a file
Other common operations (in Unix):
- tell ... get location within a file
- seek ... move to location within a file
Unix uses a file-like interface for many kinds of objects:
- /dev/tty1 ... terminal devices
- /dev/null ... data sink
- /dev/cdrom ... CD reader
- /dev/nbd1 ... network streams
- /proc/3666/ ... info about one process
Some of these (e.g. /dev/tty1) are not persistent.
Instead, they give an infinite stream of incoming/outgoing data.
Input/output to/from programs occurs via streams
- a stream is a sequence of byte data
- can be read-from or written-to depending on mode
All input/output in programs so far ...
- is via standard input/output streams
- keyboard is default standard input stream (stdin)
- screen is default standard output stream (stdout)
- stdin/stdout can be connected to files (using < and >)
Examples of redirection of standard streams:
- $ ./prog < abc
(reads stdin from file abc, writes stdout to screen, writes stderr to screen)
- $ ./prog > def
(reads stdin from keyboard, writes stdout to def, writes stderr to screen)
- $ ./prog < abc > def
(reads stdin from file abc, writes stdout to def, writes stderr to screen)
- $ ./prog < abc &> def
(reads stdin from file abc, writes both stdout and stderr to def)
- $ ./prog < abc > def 2> ghi
(reads stdin from file abc, writes stdout to def, writes stderr to ghi)
stdio.h gives an interface for manipulating text files
- defines a type (FILE *) to represent streams
- defines three standard streams (stdin/stdout/stderr)
- defines a range of operations on FILE*'s
- operations often come in two versions:
- one operates on specified FILE* (e.g. fgetc())
- other operates on standard stream (e.g. getchar())
- provides buffering of streams
Note: printf(fmt,...) = fprintf(stdout, fmt,...)
... The stdio.h Library | 11/34 |
For C programs using the stdio.h library:
- need to #include <stdio.h>
- need to link library (automatic)
The stdin/stdout/stderr streams
- are opened automatically when the program starts
- closed automatically when the program finishes
Other streams must be opened/closed by the programmer
(C programs have a limit on number of simultaneously open streams (e.g. 1024))
FILE* is the type used to interact with files (a handle)
Conceptually, a FILE* represents a stream
- a sequence of bytes moving to or from a device
FILE* values are created by fopen()
FILE* values are deleted by fclose()
- A FILE* contains data to keep track of the stream's state
- object at end of pointer is like a FileRep struct
- contains: buffer, current location, mode, etc. (but not file name)
Common operations on FILE* objects:
int fgetc(FILE *inf) ... read next character (cast to an int) from inf; returns EOF at end of file or on error
int fputc(int ch, FILE *outf) ... write ch to outf
char *fgets(char *buf, int size, FILE *inf);
- read chars from inf into buf; stop after \n, at end of file, or after size-1 chars
- buf is '\0'-terminated on success; returns NULL if nothing could be read
int fputs(const char *buf, FILE *outf);
- write all chars from buf (up to, but not including, the '\0') to outf
int fclose(FILE *fp);
- flushes any buffered output to fp and then closes fp
FILE *fopen(const char *name, const char *mode);
- attempts to open a stream to a file called name
- typical modes: "r" read, "w" write, "a" append
- returns FILE* if successful, NULL if not
- on open for reading ...
- failure if file does not exist
- failure if user does not have read permission
- on open for writing ...
- file is created if it does not already exist
- file is truncated if it does exist
- failure if no write permission on file or directory
- for append: output added after end of existing file
Iterating over Text Files | 15/34 |
Character-by-character:
FILE *inf, *outf;
int ch;   // int rather than char, so that EOF can be detected
while ((ch = getc(inf)) != EOF) { // EOF is a special int value, not a char in the file
    putc(ch, outf);
}
Line-by-line:
FILE *inf, *outf;
char line[BUFSIZE];
while (fgets(line, BUFSIZE, inf) != NULL) {
    fputs(line, outf);   // fputs(), not puts(): puts() takes no stream argument
}
Assumes inf open for reading, outf open for writing
Exercise: Display Text | 16/34 |
Write a program that emulates what cat does
- if no command line args, read from stdin
- treat each command line arg as a file name
- for each file, copy its contents to stdout
Usage:
$ ./mycat < xyz
$ ./mycat xyz
$ ./mycat abc def ghi
Exercise: Two-way File Merge | 17/34 |
Write a program that
- takes two command-line arguments (file names)
- assumes that each file is sorted
- reads the files and produces a single sorted output
- containing all the lines from each file
Buffering and fflush | 18/34 |
The stdio.h library buffers input/output
- when you do e.g. putc, char is not sent to stream
- rather it is placed in a buffer in memory
- when this buffer fills up, entire contents sent to stream
- also, all buffers are flushed when program finishes
int fflush(FILE *outf);
- allows programmer to control flushing of output buffers
- forces contents of buffer for outf to output
Binary files are different to text files
- individual bytes are not necessarily ASCII chars
- do not contain end-of-line markers (no lines)
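So, text-oriented functions like getc(), fgets() don't work properly without care
- e.g. there are no lines for fgets() to find, and '\0' bytes may occur anywhere
- e.g. if the result of getc() is stored in a char, a 0xFF byte in middle of file might look like EOF
To manipulate binary files, use:
- fopen(), fclose() ... manage file handles
- fread(), fwrite() ... read/write blocks of data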
But why do we need binary files?
We can write all kinds of data as encoded text.
Problems with this approach:
- typically text representation is larger than binary
(consider "1000000000" (11 bytes as a C string) vs 1000000000 as an int (typically 4 bytes))
- encoded text needs to be parsed into binary form
(e.g. scanf("%d",&x), where scanf is relatively expensive)
i.e. binary files are more compact and efficient for binary data
Disadvantages of binary files:
- cannot be examined/modified using a text editor
- contents are machine-architecture-dependent
(a file written on one machine may not read the same on another machine)
- contents are simply bytes; no hints on interpretation
Despite this, binary files are useful for e.g.
- holding large amounts of numeric data
- for subsequent re-use/maintenance on the same machine
The Unix od command provides
- a method for examining the contents of binary files
- the ability to choose a format for interpreting data
Usage: od Format File
Dumps the contents of File in the specified Format
- format -c ... treat each byte as an ASCII character
- format -x ... treat each 2-byte unit as a hexadecimal number
- format -d ... treat each 2-byte unit as a decimal integer
See man od for many more options (e.g. N-byte rather than 2-byte)
Examples of od use:
$ cat text
abcABC123!@#
$ od --format=c text
0000000 a b c A B C 1 2 3 ! @ # \n
0000015
$ od --format=d1 text
0000000 97 98 99 65 66 67 49 50 51 33 64 35 10
0000015
$ od --format=x1 text
0000000 61 62 63 41 42 43 31 32 33 21 40 23 0a
0000015
$ od --format=x4 text
0000000 41636261 32314342 23402133 0000000a
0000015
(default is octal data format, hence the name od = "octal dump")
size_t fwrite(const void *b, size_t z, size_t n, FILE *f);
- uses stream f, which is open for writing
- writes n data items, each of size z bytes
- items are taken from memory buffer at address b
- returns the number of items successfully written (nr)
- if an error occurs, 0 <= nr < n
- void* means that buffer b could hold any type
... The fwrite function | 26/34 |
Examples (dump several data structures):
FILE *outf;
int array[50];
struct { float x; float y; } point;
// ... set values in array[] and point
outf = fopen("myDataFile","w");
// ... write array to file
fwrite(array, sizeof(int), 50, outf);
// ... write struct to file (fwrite needs the struct's address, hence &point)
fwrite(&point, sizeof(point), 1, outf);
fclose(outf);
size_t fread(void *b, size_t z, size_t n, FILE *f);
- uses stream f, which is open for reading
- reads n data items, each of size z bytes
- items are stored in memory buffer at address b
- returns the number of items successfully read (nr)
- at end-of-file or on error, 0 <= nr < n
- void* means that buffer b could hold any type
... The fread function | 28/34 |
Examples (read in data written above):
FILE *inf;
int array[50];
struct { float x; float y; } point;
inf = fopen("myDataFile","r");
// ... read array from file (reading from inf, the stream just opened)
if (fread(array, sizeof(int), 50, inf) != 50)
    fprintf(stderr, "Can't read array\n");
// ... read struct from file (fread needs the struct's address, hence &point)
if (fread(&point, sizeof(point), 1, inf) != 1)
    fprintf(stderr, "Can't read struct\n");
For a more extensive example: testfread.c
Reading/Writing Dynamic Structures | 29/34 |
You cannot write-then-read pointer values.
Pointer values refer to memory configuration in one process.
Subsequent processes may have different configuration.
What you can do for linked structures:
- write individual nodes one after another in file
- read values back into newly-malloc'd node structs
- ignore any pointer values that were written to file
- set links to reflect new locations of node structs
Exercise: Persistent Linked-List | 30/34 |
Write a program to maintain a list of ints in a file
- on first use, asks for all values
- on subsequent uses ...
- list is read, re-created and displayed on startup
- asks user for index of element to change, and a new value
- if user types an index beyond the end of the current list
- appends value to linked list
- repeats until negative index entered
- then frees list and quits
Random-access to Files | 31/34 |
Files are typically sequential data structures.
Most common access pattern:
- open at start of file
- read item-by-item until end of data, OR
- write items one after another
Two operations (fseek() and ftell()) provide random access.
int fseek(FILE *f, long int offset, int whence);
- set the cursor for stream f, open for reading or writing
- puts it at location offset bytes relative to whence
- whence can have the values
- SEEK_SET ... set offset relative to start of file
- SEEK_CUR ... set offset relative to current location
- SEEK_END ... set offset relative to end of file
- returns 0 if successful, -1 on error
- it is legitimate to seek beyond the current end of file
... The fseek function | 33/34 |
Examples of fseek() usage:
FILE *fp; // open for reading and/or writing
// move cursor to start of file (rewind)
fseek(fp, 0L, SEEK_SET);
// move cursor to end of file
fseek(fp, 0L, SEEK_END);
// backup one byte in file
fseek(fp, -1L, SEEK_CUR);
For a more extensive example: testseek.c
long ftell(FILE *f);
- returns current offset for stream f (or -1 on error)
- used to grab current location to return there later
Example of use:
FILE *fp; // open for reading and/or writing
// ... add some data to stream fp ...
long here = ftell(fp); // save current location
// ... add more text to stream fp ...
fseek(fp, here, SEEK_SET); // return to known location
Produced: 5 Oct 2016