# Data Structures

## Part II of II

### Intro

```
+-----------------------------------+
| applications |
| databases |
+-----------------------------------+ +------+
| dictionaries (associative arrays) | | |
| priority queues | | |
+-----------------------------------+------------+ ADTs |
| stacks, queues | containers | |
| +------------+ |
| linked lists, trees | linked | |
| +------------+------+
| arrays, hash tables | contiguous |
+-----------------------------------+------------+
| OS memory abstraction |
+-----------------------------------+
```

The horizontal borders represent layers of abstraction on which structures above are built. We're focusing on data structures optimized for storage and retrieval.

Part I notes

- arrays
- linked lists
- stacks & queues
- priority queues
- trees

Part II notes

- hash tables
- dictionaries
- block chain

### Hash Tables

Dream data retrieval scenario: we know the index in an array where our data would be if it were in the array

A hash function is a mathematical function that maps keys to integers.

- For each key:
- map the key to a large integer
- take the modulo of that integer by a constant

- This his provides a fixed-length hash with the following properties:
- every key can be assigned a hash value
- two equal keys will have the same hash value
- with a good hash function, most unequal keys will have different hash values

- For each key:

A hash table then uses that index to store and retrieve the value of a key.

What to do in the event of a collision of hash values from two unequal keys?

Values can be chained together in a bucket:

- Pros and Cons of chaining
- It's simple and flexible.
- But pointers take space
- And we have to follow them while the cache is already warm with adjacent buckets

Is there a way we could use the empty, adjacent spaces to hold collided values?

Yes, open addressing:

- The simplest approach is sequential probing
- insert item into the next open space in the table
- this creates a 'run' of adjacent occupied spaces
- upon deletion, re-insert every item following the gap

- Pros and cons of sequential probing
- traversal is faster if there are collisions
- deletion requires O(n) re-hashing or lost space to a tombstone

Efficiency of a chained hash table with doubly linked list

and n keys in a table of m buckets

expected | worst case | balanced tree | |
---|---|---|---|

search | O(n/m) |
O(n) | O(lg n) |

insert | O(1) |
O(1) | O(lg n) |

delete | O(1) |
O(1) | O(lg n) |

successor | O(n+m) | O(n+m) | O(lg n) |

predecessor | O(n+m) | O(n+m) | O(lg n) |

minimum | O(n+m) | O(n+m) | O(lg n) |

maximum | O(n+m) | O(n+m) | O(lg n) |

- Hashing algorithms can be optimized for different use cases:
- Rabin-Karp
- particularly good at finding any one of a set of sub-strings of a text
- average O(n+m) with text of length n and substrings of combined length m
- worst-case O(nm) with high collision

- Knuth-Morris-Pratt
- faster for single pattern matching, O(n)

- Rabin-Karp

#### Dictionaries

- key-value storage structure like the word and definition in a dictionary
- provide the same operations as the other structures we've seen, with focus on
- search
- insert
- delete

- min and max allow dictionary to serve as a priority queue on the values of keys
predecessor and successor allow iteration through the dictionary

can be implemented with any of data structures we've seen

- especially well suited to bsts and hash tables

#### Block Chain

We want to transmit ordered blocks of data and know if any of them have been modified or reordered. Hashing can help.

- In the case of block n
- append the identifier of block n-1 to your current block data
- compute the hash of this new concatenated string
- if it below a target value (if it has a certain number of leading zeros)
- this is the identifier of block n

- otherwise, calculate a nonce (a value used on one occasion)
- the hash of block n-1's identifier, the block data, and the nonce is below the target value.
- The resulting hash below that value is the block n's identifier.

- Application in cryptocurrency
- blocks can be transmitted openly over a network
- if one is intercepted and altered, the hash of its data, its nonce, plus the identifier of block n-1 will no longer match the identifier of block n recorded in block n+1.
- you could intercept block n+1 but you would need to recompute the nonce for block n to reflect your changes and get a new identifier to place in n+1...
- and do this for every block n in the chain
- because calculating the nonce is very computationally expensive, over a long enough chain, you are provided a degree of confidence that no block has been altered without the need to encrypt any part of the chain

- Applications in networking more generally
- cypher block chaining, invented in 1976, encrypts the block chain with a single key and removes the need for a computation intensive nonce

Outline prepared by Sascha Strand from:

S.S. Skiena,

The Algorithm Design Manual, 2nd ed., chapter 3

Springer-Verlag London Limited 2008

Image credits:

Sandeep Dayananda, Blockchain Enthusiast, 2017

https://www.quora.com/How-do-I-gain-an-in-depth-understanding-of-Blockchain