# Algorithms and Data Structures

*New Beginnings:* Week 7

*Lecture Notes by Bart Massey*

Copyright (c) 2014 Bart Massey

- 2014-08-11: Complexity; Searching and Sorting
- 2014-08-12: Fancy Sorting; Priority Queues
- 2014-08-13: Stacks and Queues
- 2014-08-14: Hashing and Hash Tables

## 2014-08-11: Complexity; Searching and Sorting

### Goals Of This Section

Learn more about problem specification, common engineering constraints, proof of algorithm correctness, proof of algorithm performance.

Learn techniques and methods of algorithm design.

Familiarize yourself with common algorithm "building blocks".

### Ο, Ω, Θ, ο, ω

Let's revisit Ο:

`f ϵ Ο(g) iff ∃ k1, k2 . ∀ x . f(x) ≤ k1 g(x) + k2`

Note that Ο(g) is a set of functions for which g is an upper bound.

The equivalent book definition:

`f ϵ Ο(g) iff ∃ k1, k2 . ∀ n . n > k2 -> f(n) ≤ k1 g(n)`

We can also look at lower bounds:

`f ϵ Ω(g) iff ∃ k1, k2 . ∀ x . k1 f(x) + k2 ≥ g(x)`

and bounds both above and below:

`f ϵ Θ(g) iff f ϵ Ο(g) and f ϵ Ω(g)`

The "lowercase" bounds are "not tight":

`f ϵ ο(g) iff ∀ k1 . ∃ k2 . ∀ x . f(x) ≤ k1 g(x) + k2`

and similarly for ω.

### Ο and Time/Space Complexity

For a problem description P, select an instance p of size n, and an algorithm A that solves all instances of P. Let T[p] be the "running time" of A on p, and S[p] be the "extra memory space" required by A on p.

By "running time" we mean something like "count of the number of operations needed to execute", so the units of T[p] are "operations".

We are usually just interested in T[p] as a function of n:

Worst-case: Select p such that T[p] is maximal over all p of size n; we write T[n].

Best-case: Select p such that T[p] is minimal over all p of size n; we write T[n].

Average-case: Average T[p] over all p of size n; we write T[n].

We get to choose what operations we count. Normally what we count doesn't matter much. For example:

```
Sum Of Array

Given: An array a of integers, with |a| = n
Find: The sum of elements of a

Algorithm Linear Scan

To find the sum of elements of a:
    t <- 0
    for i in 1..n
        t <- t + a[i]
    return t
```

What should we count here? The initialization? The loop increments? The array indexing? The sum total? The assignment? The return?

Turns out it doesn't matter: worst-case T[n] ϵ O(n) for any combination of these costs.
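The pseudocode translates directly to Python; here is a straightforward sketch of Linear Scan:

```python
def sum_of_array(a):
    # Linear Scan: accumulate the sum of the elements of a.
    t = 0
    for x in a:
        t = t + x
    return t
```

Whichever operations we choose to count here (the assignment, the additions, the loop steps), the total count is proportional to n.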

Similarly, the units of S[n] are bytes, above and beyond the size needed to describe the instance itself.

### Composition Rules For Ο

Some simple rules for estimating worst-case Ο bounds:

- Fundamental operations are Ο(1).
- Sequencing: an Ο(f) step followed by an Ο(g) step is Ο(max(f, g)).
- A basic operation op applied to Ο(f) and Ο(g) operands costs Ο(max(T[op], f, g)).
- A conditional should always be charged the worst way: if Ο(f) then Ο(g) else Ο(h) is Ο(max(f, g, h)).
- An Ο(f) body repeated Ο(n) times is Ο(n f).

Recursion gets complicated. We'll pass on that here.

### Arrays

Arrays are represented as lists in Python.

However, arrays typically:

- Have O(1) access.
- Are fixed size; they do not allow adding or deleting elements.

This will be important when we move to C/C++, which have arrays like this.

### Search

Imagine that we're trying to find whether an integer k occurs in an unordered array a of n integers.

- More generally, perhaps k is a "key" in a dictionary of n integers.

Captain Obvious's Algorithm:

```
for i in 1..n:
    if a[i] = k
        return i
return None
```

By the previous reasoning, worst-case T[n] ϵ O(n)
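In Python (using 0-based indexing), the same scan might look like:

```python
def linear_search(a, k):
    # Return the index of the first occurrence of k in a, or None.
    for i in range(len(a)):
        if a[i] == k:
            return i
    return None
```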

For a classical computer, the best possible algorithm has worst-case T[n] ϵ Θ(n)

For a quantum computer, Grover's Algorithm runs in worst-case time T[n] ϵ O(sqrt(n)) and S[n] ϵ O(lg(n)).

Comparison sorting (below) requires Θ(n lg(n)) time, but can speed search to O(lg(n)) time. If multiple searches are to be done, it may be worth paying.

### Binary Search

If the array a is ordered from smallest to largest key, we can search more efficiently; look at the middle element a[i] of a:

If a[i] = k then done

If a[i] < k then look in a[i+1]..a[n]

If a[i] > k then look in a[1]..a[i-1]

We can do this recursively:

```
To find k in an array a of n elements:
    if n = 0
        return False
    m <- (n + 1) div 2
    if a[m] = k
        return True
    if a[m] < k
        return find k in a[m+1..n]
    return find k in a[1..m-1]
```

With a bit more work, we can return the index or whatever.

We can also search iteratively:

```
To find k in an array a of n elements:
    l <- 1
    u <- n
    while l <= u
        m <- (l + u) div 2
        if a[m] = k
            return m
        if a[m] < k
            l <- m + 1
        else
            u <- m - 1
    return None
```

Worst-case running time T[n] = 1 + T[(n -1) div 2] ϵ O(lg n)
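A 0-based Python sketch of the iterative version:

```python
def binary_search(a, k):
    # a must be sorted. Return an index of k in a, or None.
    l, u = 0, len(a) - 1
    while l <= u:
        m = (l + u) // 2
        if a[m] == k:
            return m
        if a[m] < k:
            l = m + 1
        else:
            u = m - 1
    return None
```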

### Exercise: Exchange

Write a Python function exchange(a, i, j) which exchanges elements i and j of array a.

### Selection Sort

Binary search requires a sorted array.

Here's a simple way to sort an array:

```
To sort an array a of n elements:
    for i <- 1 .. n
        for j <- i+1 .. n
            if a[j] < a[i]
                a[i] <-> a[j]
```

To see the correctness of this, note that at the end of each inner loop a[i] will hold a smallest not-yet-sorted element of a.

Running time is T[n] ϵ Θ(n^2).
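A direct Python rendering of selection sort (0-based, in place):

```python
def selection_sort(a):
    # After each outer iteration, a[i] holds a smallest
    # not-yet-sorted element of a.
    n = len(a)
    for i in range(n):
        for j in range(i + 1, n):
            if a[j] < a[i]:
                a[i], a[j] = a[j], a[i]
```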

### Exercise: Block copy

Write a function copy(a, i1, j1, i2) which copies elements i1..j1-1 of array a to position i2.

There should be sufficient room after i2 for the copy to complete. The source elements should all be preserved, but the destination elements will be clobbered.

Watch out for overlaps!

### Insertion Sort

So we can sort by selecting successive elements. We can also sort by inserting successive elements into a sorted sublist.

```
To sort an array a of n elements:
    for i <- 1 .. n
        v <- a[i]
        for j <- 1 .. i
            if a[j] > v
                break
        copy(a, j, i, j+1)
        a[j] <- v
```

where the copy could just be

```
for k <- i .. j+1 (backward)
    a[k] <- a[k-1]
```

It looks like one ought to be able to make this O(n lg n) by binary searching for j. However:

For an array, the second shifting loop still makes it O(n^2).

For a linked list, the indexing cost is no longer O(1), so binary search doesn't work.
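A Python sketch of insertion sort, with the shifting loop written out:

```python
def insertion_sort(a):
    for i in range(1, len(a)):
        v = a[i]
        j = i
        # Shift larger sorted elements right to open a hole for v.
        while j > 0 and a[j - 1] > v:
            a[j] = a[j - 1]
            j -= 1
        a[j] = v
```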

We will see O(n lg n) sorting algorithms tomorrow.

## 2014-08-12: Fancy Sorting; Priority Queues

In many cases O(n^2) searching and sorting algorithms are
just fine. When they are not, CS has given us a whole pile
of more efficient algorithms. This efficiency comes at a
price: usually complexity and difficulty of analysis.

### Comparison Sorting is Ω(n lg n)

Review

There are n! ways to rearrange n (unequal) items.

To decide which one we are looking at, we can at best split the list of possibilities in half with each binary comparison.

Thus, the worst-case T[n] is

`Ω(lg(n!)) = Ω(n lg n)`

(The step follows from Stirling's approximation: lg(n!) ~ n lg n.)

### Accelerating Sorting

The algorithms we have so far are O(n^2). Let's try to do better.

Basic idea: Use divide-and-conquer, like the lower-bound proof does, to try to reach O(n lg n).

Today:

Quicksort: Partition values to divide-and-conquer.

Merge Sort: Partition and merge to divide-and-conquer.

Heapsort: Build binary trees to divide-and-conquer.

### Array Partitioning

Imagine that we want to organize an array a of length n into two sections: first all the values of a smaller than some integer k, and then those equal to k or larger.

The basic strategy: work from both ends toward the middle, swapping elements that are both out of place. Return the index of the first element in the second partition.

```
l <- 1
u <- n
repeat
    while l <= n and a[l] < k
        increment l
    while u >= 1 and a[u] >= k
        decrement u
    if l >= u
        return l
    a[l] <--> a[u]
```

Running time: indices l and u collectively hit each element of a once, so n comparisons.
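A 0-based Python version of the partition sketch:

```python
def partition(a, k):
    # Rearrange a so values < k precede values >= k.
    # Return the index of the first element of the second part.
    l, u = 0, len(a) - 1
    while True:
        while l < len(a) and a[l] < k:
            l += 1
        while u >= 0 and a[u] >= k:
            u -= 1
        if l >= u:
            return l
        a[l], a[u] = a[u], a[l]
```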

### Quicksort

Given partition(), we can divide-and-conquer to sort a list completely.

Since the stuff in the left partition is all guaranteed smaller than the stuff in the right partition, we can sort each partition separately and the result will be sorted.

```
quicksort(a, l, u):
    if l >= u
        return
    p <- partition a[l..u] around pivot a[l]
    quicksort(a, l, p - 1)
    quicksort(a, p, u)
```
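A runnable Python sketch of the same idea. This version copies sublists rather than partitioning in place, so it is easier to read but uses O(n) extra space:

```python
def quicksort(a):
    if len(a) < 2:
        return a
    pivot = a[0]
    # Divide: values smaller than the pivot, then the rest.
    smaller = [x for x in a[1:] if x < pivot]
    larger = [x for x in a[1:] if x >= pivot]
    # Conquer each side, then concatenate.
    return quicksort(smaller) + [pivot] + quicksort(larger)
```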

### Performance of Quicksort

Start by assuming that the pivot always perfectly splits the array in half.

Then the recursion depth is lg n, and at each level all n elements are processed by partition, so O(n lg n)

If the pivot is always the smallest element, then the recursion depth is n, with n, n - 1, n - 2, ... elements processed per pass, so O(n^2).

Worst case is already-sorted order!

### Picking a Pivot Better

Pick the middle element? Just changes the worst case.

Pick a random pivot? Makes it unlikely that a patterned list will be a worst case.

"Median-of-three?" (or more?). Gives a statistically better chance of finding a median pivot.

Linear-time median? Ugh.

Maybe we should just find a better sorting algorithm.

### Merging

You have two sorted lists, and want to make a single sorted list out of them.

One (obvious) way: concatenate them and then sort them.

Better: keep picking minimum elements off until one list is empty, then tack on the other one.

```
merge(l1, l2):
    r <- []
    while l1 and l2 are both nonempty
        if l1[0] <= l2[0]
            r <- r, l1[0]
            delete l1[0]
        else
            r <- r, l2[0]
            delete l2[0]
    return r + l1 + l2
```

O(n1 + n2) worst case, where n1 and n2 are the sizes of l1 and l2.
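In Python, using index cursors instead of deletions (deleting the head of a Python list is itself O(n)):

```python
def merge(l1, l2):
    # Merge two sorted lists into one sorted list.
    r = []
    i = j = 0
    while i < len(l1) and j < len(l2):
        if l1[i] <= l2[j]:
            r.append(l1[i])
            i += 1
        else:
            r.append(l2[j])
            j += 1
    # One of these slices is empty; tack on the other.
    return r + l1[i:] + l2[j:]
```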

### Merge Sort

Idea: Divide-and-conquer, but the other way round:

- Split the list (exactly) in half.
- Sort each half.
- Merge the halves.

```
mergesort(l):
    n <- len(l)
    if n < 2
        return l
    l1 <- mergesort(l[0 : n div 2])
    l2 <- mergesort(l[n div 2 : n])
    return merge(l1, l2)
```

Achieves O(n lg n) worst-case performance, for the same reason as the quicksort best case.
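A self-contained Python sketch with the merge inlined. (Note that `list.pop(0)` is O(n) in Python, so this version illustrates the structure rather than achieving true O(n lg n); a careful version would merge with index cursors.)

```python
def mergesort(l):
    if len(l) < 2:
        return l
    m = len(l) // 2
    l1, l2 = mergesort(l[:m]), mergesort(l[m:])
    # Merge the two sorted halves.
    r = []
    while l1 and l2:
        r.append(l1.pop(0) if l1[0] <= l2[0] else l2.pop(0))
    return r + l1 + l2
```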

### Priority Queues

Put elements in in any order, always get them out in "priority" order:

- Minqueue: Smallest elements first.
- Maxqueue: Largest elements first.

In this discussion, we will focus on maxqueues for clarity.

### Binary Heap as Priority Queue

Binary Heap: Packed binary tree with invariant "node is >= all those beneath it"

Insert via upheap(), extract via downheap().

Insert and extract-max are O(lg n) operations.

### Array Implementation of Binary Heap

Can treat array elements as binary tree nodes.

Implicit indexing of parent, children:

- Left child of node i is node 2*i + 1
- Right child of node i is node 2*i + 2
- Parent of node i is node (i - 1) div 2
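These index calculations in Python:

```python
def left(i):
    # Index of the left child of node i.
    return 2 * i + 1

def right(i):
    # Index of the right child of node i.
    return 2 * i + 2

def parent(i):
    # Index of the parent of node i (i > 0).
    return (i - 1) // 2
```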

### Extract-Max via Downheap

Idea: Replace the top element from the heap with the last element, then re-establish the heap invariant.

Re-establish by sliding the too-small element downward until the heap invariant is restored.

Two cases:

Source is larger than either of its children: we're done.

Source is smaller than at least one of its children: exchange with larger child and continue.
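A Python sketch of downheap for a max-heap stored in a[0:n]:

```python
def downheap(a, i, n):
    # Slide a[i] down until it is >= both of its children.
    while True:
        largest = i
        for c in (2 * i + 1, 2 * i + 2):
            if c < n and a[c] > a[largest]:
                largest = c
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest
```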

### Insert via Upheap

Idea: Put a new element at the bottom of the heap, then re-establish the heap invariant.

Re-establish by sliding the too-large value upward until the heap invariant is restored.

Two cases:

Source is smaller than target: we're done.

Source is larger than target: Exchange and continue.
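And upheap, for an element newly placed at index i (typically the end of the array):

```python
def upheap(a, i):
    # Slide a[i] up while it is larger than its parent.
    while i > 0:
        p = (i - 1) // 2
        if a[p] >= a[i]:
            return
        a[i], a[p] = a[p], a[i]
        i = p
```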

### Heapsort

To heapsort a list, just push all the elements onto a PQ, then pop all of them back off.

Clever hole-filling tricks:

As you insert each element, it fills the hole from where you picked it up.

As you pop each element off, put it in the hole left by the extraction.

Algorithm is O(n lg n), since each of the 2n heap operations is O(lg n).

Can construct heap in O(n) time by clever application of downheap() to build successively larger heaps.
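Putting the pieces together: an in-place heapsort sketch using the O(n) bottom-up construction, then popping maxima into the hole at the end of the array:

```python
def heapsort(a):
    n = len(a)

    def downheap(i, n):
        # Slide a[i] down until the max-heap invariant holds in a[0:n].
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and a[c] > a[largest]:
                    largest = c
            if largest == i:
                return
            a[i], a[largest] = a[largest], a[i]
            i = largest

    # Build a max-heap in O(n) by downheaping from the last parent.
    for i in range(n // 2 - 1, -1, -1):
        downheap(i, n)
    # Repeatedly extract the max into the hole left at the end.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        downheap(0, end)
```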

## 2014-08-13: Stacks and Queues

### The Humble Stack

Also known as a "LIFO" (Last-In First-Out).

Two basic methods: push(v) and pop().

- Usually also an is_empty() and/or size() method.

Obvious implementation as an array or list. We choose array for portability and because both push() and pop() become O(1).

Need to worry about the "stack full" case.

I will show you a stack example.
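The in-class example itself isn't reproduced in these notes; here is a hypothetical sketch of what an array-based stack might look like (the class name and method details are this sketch's choices, not necessarily the lecture's):

```python
class Stack:
    # Array-based stack with a fixed capacity.
    def __init__(self, capacity):
        self.a = [None] * capacity
        self.n = 0

    def is_empty(self):
        return self.n == 0

    def push(self, v):
        if self.n == len(self.a):
            raise OverflowError("stack full")
        self.a[self.n] = v
        self.n += 1

    def pop(self):
        if self.n == 0:
            raise IndexError("stack empty")
        self.n -= 1
        return self.a[self.n]
```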

### A Note About Python Representation

It's easy to get confused about how Python treats values.

Numbers are stored in memory. Whenever you assign a number variable or pass a number as a parameter, Python behaves as though it makes a copy (numbers are immutable). We call such values "scalar".

Lists and objects are represented by their address in memory. Whenever you assign an object variable or pass an object as a parameter, Python just makes a copy of the address. We call such values "structured", and they can be confusing to work with.

The fact that objects are structured is what makes code like my stack example work.

### Exercise: Write A Stack Class

Write an array-based class Stack that provides the above methods.

Write tests for your stack, including random tests.

Use your stack to implement reverse().

### Queues

"FIFO" (First-In First-Out) variant of the stack, with similar methods.

As with stacks, can be implemented using lists or arrays.

List-based idea: keep hold of the head and tail of the list. Insert by tacking an element onto the tail. Extract by pulling an element off the head.

Array-based idea: Treat the array as "circular" by wrapping indices around when they go off the end.

Keep an index of the extraction point and one past the insertion point. When these indices are equal, the queue is empty. When the insertion point is just before the extraction point, the queue is full.

This is an interesting engineering tradeoff. As described, an array of n elements can hold at most n - 1 enqueued elements.

Alternatives include using an empty/full boolean or setting the insertion or extraction point to None when the queue is empty.
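A sketch of the circular-array scheme described above (names are illustrative). With capacity + 1 slots, head == tail means empty, and tail landing just before head means full:

```python
class Queue:
    # Circular array queue holding at most `capacity` elements.
    def __init__(self, capacity):
        self.a = [None] * (capacity + 1)
        self.head = 0   # extraction point
        self.tail = 0   # next insertion point

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        return (self.tail + 1) % len(self.a) == self.head

    def enqueue(self, v):
        if self.is_full():
            raise OverflowError("queue full")
        self.a[self.tail] = v
        self.tail = (self.tail + 1) % len(self.a)

    def dequeue(self):
        if self.is_empty():
            raise IndexError("queue empty")
        v = self.a[self.head]
        self.head = (self.head + 1) % len(self.a)
        return v
```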

### Exercise: Write An Array-Based Queue Class

### Applications Of Stacks And Queues

In a programming language, the return addresses of function calls are usually stacked.

In an operating system, interrupts are usually stacked. That is, when an interrupt is received, the OS stacks the CPU state and processes the interrupt, then unstacks the state and continues.

Note that stacks are not "fair": whoever is last gets the fastest service. This can be a problem, e.g. DDOS attacks.

A grocery line is obviously a kind of queue. This is where the name came from.

Packets are generally queued at intermediate nodes on the Internet.

Call centers often queue customers: "Your call will be processed in the order received."

### Queueing Theory

"A buffer cannot match rates."

If on average one is putting more into a queue than getting out over time, the queue size must grow perpetually to avoid failure.

If on average one is putting less into a queue than getting out over time, the queue size will eventually reach zero.

Goal: Statistically predict what a queue's state will look like over time from information about how it is being used.

How big will the queue be on average?

How long will an average job spend in the queue?

Wikipedia summarizes a classic result for a simple case of this.

### Deque: Best Of Both Worlds

Either the list implementation or the array implementation can be extended with methods allowing extraction from either end.

This isn't generally useful, but at least it unifies some code.

Python provides a deque in its collections module.

## 2014-08-14: Hashing and Hash Tables

### Why Hash

A hash function H is an integer-valued function of an object designed to be as unique as possible. The hash value of an object is the value of H on the object.

Since hash values are small integers, they take up less storage in data structures than the objects they represent.

Since integers are totally ordered, they induce a total order on the hashed objects.

Since integers can be array indices, they can be used for fast lookup and storage in arrays.

Cardinality Problem: Since there are generally many more objects than possible hash values, many objects must exist that hash to the same value.

- Thus, we have to be careful when using hash values.

### Hashing Algorithms

There's lots of ways to build a hash function:

Large integers might be hashed by simple modulus.

Strings are hashed by combining the characters according to some scrambling function.

Collections are hashed by hashing individual elements and then combining these hashes.

Example: Checksumming

Want to hash an array of integers.

Take the sum of the integers modulo some n.

Different arrays will tend to produce differing hashes.

However, it is easy to find arrays with the same checksum: e.g. all zeros except a 1 in any position.

Generally, the hash of an array element should depend on more than just the value there:

Hash of everything before it.

Position in the array.

Example: An OK string hash function (djb2):

```python
def hash(s):
    h = 5381
    for c in s:
        # equivalently: h = ((h << 5) + h + ord(c)) & 0xffff
        h = (33 * h + ord(c)) % 65536
    return h
```

To hash a list of hashable values, hash the values, then use a hash function to combine the hashes.

### Hash Tables

Python implements its dictionaries as hash tables (I think). We will consider only "bucket hashing" here. See other sources for "open hashing".

Imagine that we want to build a dictionary mapping strings to integers. We could do this with a list, but then either lookups or insertions would be O(n) where n is the total length of the strings.

Take an array h of 65536 elements, where each element is a list of (key, value) tuples. Initialize all the elements to the empty list.

To insert, hash the string key and append the key/value tuple to the list at that position in the array.

```
insert(k, v):
    hk = hash(k)
    h[hk] += [(k, v)]
```

This can be done in constant-ish time. However, don't forget to replace the tuple instead of appending it if you insert a new value at an old key.

To lookup, hash the string key and search through the list at that position in the array.

```
lookup(k):
    hk = hash(k)
    for (kt, vt) in h[hk]:
        if kt == k:
            return vt
    return None
```

We are still technically O(n) in the worst case, but in practice we will have reduced our lookup time by a factor of about 65536!
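A runnable, scaled-down version of the whole scheme (a module-level table for brevity; insert replaces an existing binding, as cautioned above):

```python
NBUCKETS = 65536

# Each bucket is a list of (key, value) tuples.
h = [[] for _ in range(NBUCKETS)]

def hash_string(s):
    # djb2-style string hash, reduced modulo the table size.
    hv = 5381
    for c in s:
        hv = (33 * hv + ord(c)) % NBUCKETS
    return hv

def insert(k, v):
    b = h[hash_string(k)]
    for i, (kt, _) in enumerate(b):
        if kt == k:
            b[i] = (k, v)   # replace an existing binding
            return
    b.append((k, v))

def lookup(k):
    for kt, vt in h[hash_string(k)]:
        if kt == k:
            return vt
    return None
```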

### An Aside: Powers of Two and Ten

Note that 2^10 = 1024 ~ 1000. This is handy to note: it means that 2^20 ~ 1000000 and so forth.

In CS we sometimes use the metric suffix K to mean 1024, M to mean 1024^2, etc. Other times, we use K to mean 1000 and so forth, so be careful.

Other useful powers of two to know, because they come up all the time: the small powers of two, 2^8 = 256, 2^16 = 65536, 2^24 ~ 16M, 2^32 ~ 4G.

### Exercise: String Table Class

- Take the code for insertion, lookup, and hashing above and create a StringTable class that maps strings to arbitrary values. You may hardwire 64K buckets if you wish.

### Shortest Path

Imagine we're looking at a roadmap and want to find the shortest path from one city to another.

- Again, no cycles, so it's a "simple path".

Again, searching the set of possible paths is hopeless.