Data Structures

Part II of II

Intro

   +-----------------------------------+
   | applications                      |
   | databases                         |
   +-----------------------------------+            +------+
   | dictionaries (associative arrays) |            |      |
   | priority queues                   |            |      |
   +-----------------------------------+------------+ ADTs |
   | stacks, queues                    | containers |      |
   |                                   +------------+      |
   | linked lists, trees               | linked     |      |
   |                                   +------------+------+
   | arrays, hash tables               | contiguous |
   +-----------------------------------+------------+
   | OS memory abstraction             |
   +-----------------------------------+

The horizontal borders represent layers of abstraction on which structures above are built. We're focusing on data structures optimized for storage and retrieval.

  • Part I notes

    • arrays
    • linked lists
    • stacks & queues
    • priority queues
    • trees
  • Part II notes

    • hash tables
    • dictionaries
    • block chain

Hash Tables

  • Dream data retrieval scenario: we know the index in an array where our data would be if it were in the array

  • A hash function is a mathematical function that maps keys to integers.

    • For each key:
      • map the key to a large integer
      • take the modulo of that integer by a constant
    • This his provides a fixed-length hash with the following properties:
      • every key can be assigned a hash value
      • two equal keys will have the same hash value
      • with a good hash function, most unequal keys will have different hash values

A hash table then uses that index to store and retrieve the value of a key.
simple_has_table

What to do in the event of a collision of hash values from two unequal keys?
Values can be chained together in a bucket:
chain_hash_table

  • Pros and Cons of chaining
    • It's simple and flexible.
    • But pointers take space
    • And we have to follow them while the cache is already warm with adjacent buckets

Is there a way we could use the empty, adjacent spaces to hold collided values?

Yes, open addressing:
open_addr_hash_table

  • The simplest approach is sequential probing
    • insert item into the next open space in the table
    • this creates a 'run' of adjacent occupied spaces
    • upon deletion, re-insert every item following the gap
  • Pros and cons of sequential probing
    • traversal is faster if there are collisions
    • deletion requires O(n) re-hashing or lost space to a tombstone

Efficiency of a chained hash table with doubly linked list
and n keys in a table of m buckets

expected worst case balanced tree
search O(n/m) O(n) O(lg n)
insert O(1) O(1) O(lg n)
delete O(1) O(1) O(lg n)
successor O(n+m) O(n+m) O(lg n)
predecessor O(n+m) O(n+m) O(lg n)
minimum O(n+m) O(n+m) O(lg n)
maximum O(n+m) O(n+m) O(lg n)
  • Hashing algorithms can be optimized for different use cases:
    • Rabin-Karp
      • particularly good at finding any one of a set of sub-strings of a text
      • average O(n+m) with text of length n and substrings of combined length m
      • worst-case O(nm) with high collision
    • Knuth-Morris-Pratt
      • faster for single pattern matching, O(n)

Dictionaries

  • key-value storage structure like the word and definition in a dictionary
  • provide the same operations as the other structures we've seen, with focus on
    • search
    • insert
    • delete
  • min and max allow dictionary to serve as a priority queue on the values of keys
  • predecessor and successor allow iteration through the dictionary

  • can be implemented with any of data structures we've seen

    • especially well suited to bsts and hash tables

Block Chain

We want to transmit ordered blocks of data and know if any of them have been modified or reordered. Hashing can help.

  • In the case of block n
    • append the identifier of block n-1 to your current block data
    • compute the hash of this new concatenated string
    • if it below a target value (if it has a certain number of leading zeros)
      • this is the identifier of block n
    • otherwise, calculate a nonce (a value used on one occasion)
      • the hash of block n-1's identifier, the block data, and the nonce is below the target value.
      • The resulting hash below that value is the block n's identifier.

block_chain

  • Application in cryptocurrency
    • blocks can be transmitted openly over a network
    • if one is intercepted and altered, the hash of its data, its nonce, plus the identifier of block n-1 will no longer match the identifier of block n recorded in block n+1.
    • you could intercept block n+1 but you would need to recompute the nonce for block n to reflect your changes and get a new identifier to place in n+1...
    • and do this for every block n in the chain
    • because calculating the nonce is very computationally expensive, over a long enough chain, you are provided a degree of confidence that no block has been altered without the need to encrypt any part of the chain
  • Applications in networking more generally
    • cypher block chaining, invented in 1976, encrypts the block chain with a single key and removes the need for a computation intensive nonce

cbc


Outline prepared by Sascha Strand from:

S.S. Skiena, The Algorithm Design Manual, 2nd ed., chapter 3
Springer-Verlag London Limited 2008

Image credits:

Sandeep Dayananda, Blockchain Enthusiast, 2017
https://www.quora.com/How-do-I-gain-an-in-depth-understanding-of-Blockchain