Algorithms, Math and Python

New Beginnings: Week 2
Lecture Notes by Bart Massey
Copyright © 2014 Bart Massey

2014-07-07: Numbers; More Python Programming

Number Representation

  • History: tally, tally groups, positional

  • Positional number systems:

    • Define x0 = 1, xy = x x x x [y times]
    • Each "digit" is the count of groups of b next-smaller groups

    • So: 123

      • base 10: 1 102 + 2 101 + 3 100
      • base 2: 1 22 + 0 21 + 1 20
      • base 16: 1 162 + 2 161 + 3 160
        • Note that we don't have enough digits: use 0-9 and a-f

More About Modulo Arithmetic

  • Define r = a mod b such that q = a div b, q b + r = a

    • Negative numbers are confusing
  • Some laws we can verify:

    • 0 <= a mod c < c
    • (a + b) mod c = (a mod c + b mod c) mod c
    • (a b) mod c = ((a mod c) (b mod c) mod c)
  • Useful for "wraparound" calculations

  • Algebraically "nice" when modulus is prime

Base Conversions

  • 2, 16 <-> 10: "do the math"

  • 2 <-> 16: grouping is sufficient

Why Base 2?

  • Little switches inside computer again

    • on-off / 1-0 / +5V-0V / open-closed
  • Simplest possible positional system

    • Nicely hierarchical
  • Easy to do arithmetic

Looping

  • Conditionals are "forward branching"; looping is "backward branching"

  • Repeat the same computation over and over

    • But with different data
    • This is why computers have to be fast

The while Loop

  • Repeats a sequence of statements until a condition is false

  • Condition is tested at the start of each "iteration"

    • Allows for "looping 0 times"

          i = 0
          while i < 10:
              print(i)
              i = i + 1   # should be i += 1
      

The for Loop

  • The pattern of the previous problem is very common:

    • Repeat a computation with some "index variable" set to successive values (usually increases by 1)
  • Python provides special syntax for this:

          for i in range(10):
              print(i)
    
          i = 0
          while i < len(range(10)):
              print(range(10)[i])
              i += 1
    
  • Makes program easier to read

  • Makes program easier to write:

    • Can't forget to initialize index variable
    • Can't forget to increase index variable

The range() Function

  • The range() function is not a special case

  • Takes an ending value n and returns the sequence 0,1,2...n-1

    • Note the 0 and the n - 1!
    • The sequence is in a weird form in Python 3.3 called a "generator"
    • You can make the sequence into a list with the list() function

          >>> list(range(3))
          [0, 1, 2]
      
  • The for statement can loop over any collection (e.g. sequence, set)

  • You can give more arguments to range() to get different sequences

          >>> list(range(1, 12, 5))
          [1, 6, 11]
    
          >>> list(range(9, -1, -1))
          [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
    

Exercise: Sum of numbers from 1 to n

Early Termination: break

  • Sometimes your program should get out of the loop before its scheduled exit

  • One can let the loop run its course, but this is tedious and inefficient

    • e.g. "Find the first number in the list greater than 10"

          first_big = None
          for i in my_list:
              if first_big == None and i > 10:
                  first_big = i
          print(first_big)
      
    • This has all kinds of drawbacks

  • The break statement cleans this up a bit

          first_big = None
          for i in my_list:
              if i > 10:
                  first_big = i
                  break
          print(first_big)
    
  • The break statement is generally used inside an if. It breaks out of the nearest enclosing for or while.

Early Looping: continue

  • The continue statement is like the break statement except that it starts the next iteration of the nearest enclosing for or while loop

          for i in my_list:
              if i == 0:
                 print("zero in list")
                 continue
              print(1 / i)
    

Function Definitions

  • All those Python built-in functions are very nice, but sometimes you need something they didn't provide

  • You can define your own functions with the def statement.

  • Your function can use the return statement to return a value.

    • The return statement is like break: it immediately exits the nearest enclosing function to return the result
    • Normally return is at the end of the function, but not always

          def fact(n):
              result = 1
              for i in range(2, n + 1):
                  result *= i
              return result
      

Formal and Actual Parameters

  • The n in the previous example is an example of a "formal parameter" or "argument". You can have your function require as many formals as you want. Order matters here

  • When you call your function, you need to pass a value ("actual parameter") for each formal the function requires

          >>> fact(3 + 2)
          120
    
  • Note that each actual will be evaluated to produce a value before the function is called

Local and Global Variables

  • A formal is actually a variable: you can store in it

    • If you store into a formal, it will only change that storage location, not any other in the program
    • Storing into formals is generally considered bad style: hard to read and understand
  • Any new variable you assign to inside your definition is also "local": it has no relationship to any variable outside the function, even if it has the same name

          x = 3
    
          def f():
              x = "hello"
              print(x)
    
          f()
          print(x)
    
  • If you want to refer to a variable defined outside your function while inside, use the global statement

          x = 3
    
          def f():
              global x
              x = "hello"
              print(x)
    
          f()
          print(x)
    

Globals Considered Somewhat Harmful

  • So your function can:

    • Refer to data passed as a parameter or available in globals
    • Return a result or store into a global
  • Generally, it is a good idea to use parameters and results rather than global variables to communicate between a function and its caller

  • If the function is working with data that is part of the well-managed global state of the program, it may be ok to just use global statements

    • Many (most?) instructors tell beginning programmers to never let functions access global variables
    • Some go farther: "never have global variables"

Procedures

  • A function that "doesn't return a result" by using the return statement is sometimes referred to as a "procedure"

    • In Python, if you fall off the end of a function, it is considered to have returned None
  • Procedures are useful for their "side effects": things they do besides compute a value

  • In general, you should be a bit suspicious of procedures

    • Pure output is the most common use of procedures

Exercise: Define Some Functions

  • squaring function
  • Gauss's sum
  • Euclidean distance (four parameters)

2014-07-08: Recursion; Formal Proof

More About Dictionaries

  • Dictionary: think of it as a function from a "key" to a "value"

    • Function: one particular value for each key
  • Creating a dictionary:

          {1 : 2, "hello" : "world", 12 : "goodbye"}
    
          {}
    
  • Indexed and assigned like lists

  • Can loop over keys with for

  • Can test presence of a key with in

More About Lists

  • Slicing:

          [1, 2, 3][0:2] == [1, 2, 3][:2] == [1, 2]
          [1, 2, 3][1:] == [1, 2, 3][1:len([1,2,3])+1] == [2, 3]
          [1, 2, 3][::-1] == [3, 2, 1]
    
  • Append:

          [1, 2] + [3] == [1, 2, 3]
    
          l = []
          for i in range(0, 5):
              l += [i]
    

Recursion

  • Functions are tricky.

  • In particular, suppose I define f(x) = f(x) + 1

    • This definition doesn't look sensible; one cannot find an f that meets this property for any x
  • But recursion is not always a problem: f(x) = x f(x)

    • The function f(x) = 0 satisfies this definition
  • We call functions defined in terms of themselves "recursive"

Base and Recursive Cases

  • Usually, a well-defined recursive function f will have at least one input for which f is not defined in terms of itself

    F(0) = 1 F(n) = n * F(n - 1) [n > 0]

  • We call these the "base case" and "recursive case" of the definition

  • It is typical that the base case will correspond to a small input, and the recursive case will define larger inputs in terms of smaller ones

    G(0) = 1 G(1) = 1 G(n + 2) = G(n + 1) + G(n) [n >= 0]

Exercise: Recursion In Python

  • Both F() and G() of the previous examples are implementable in Python

  • Can you say anything about the functions by studying the behavior of their implementations?

Exercise: Collatz Conjecture

Consider the function

    f(1, n) = n
    f(x, n) = f(x / 2, n + 1)     [x > 1 and even]
            = f(3 x + 1, n + 1)   [x > 1 and odd]
    C(x)    = f(x, 0)

Note that n is used as a counter of the number of steps needed to compute f.

The Collatz Conjecture is that the range of the function C is finite. Implement C and try it for various inputs.

Recursion As A Power Tool

  • The beauty of recursion is that you can think of it as defining the behavior of a function on "more complicated" (larger) inputs in terms of the behavior on "less complicated" (smaller) inputs

  • This is a way to break down problems

Iteration Is A Special Case

  • Integer log base 2 can be defined as a recursion

          lg(1) = 0
          lg(n) = 1 + lg(n div 2)
    
  • This recursive formulation is pretty nice. One can remove the recursion by replacing it with an iteration

          To compute lg(n):
            a <- 1
            while n > 1
                n <- n div 2
                a <- a + 1
            return a
    
  • The iterative version may be faster (next slide) but is more complicated and harder to reason about

How Functions Are Implemented: Activation Records

  • When you call a Python function, Python has to remember the values of all the local variables, so it can put them back

    • "Re-entrancy" is the property that a function can (directly or indirectly) call itself
    • Re-entrancy is a feature of all modern PLs
  • The way that this is implemented is:

    • When any function is called, Python creates an activation record with the new variables in it for parameters and locals
    • When the function call returns, its activation record is discarded
  • Creating an activation record can be kind of expensive

    • Performance of allocation / deallocation
    • Storage for currently active activation records

What Is A Proof?

  • A "social construct" (Eugene Luks)

  • A watertight argument

  • Made by piling up trivial arguments

    • So can be easily checked for correctness
  • Not necessarily "true in the real world"

Elements Of A Proof

  • Axioms: Statements that will be assumed to be true

    • One important kind of axiom is the "definition" of some term
  • Inference rules: Ways to get new true statements from old true statements

"Informal Proof"

  • Informal proofs are common in CS and even mathematics

  • Generally, the axioms and inference rules that you have learned over the years in math classes are fair game

  • Leaves a lot to be desired (c.f. next week)

"Lemma": Divisibility

If a number d greater than 1 divides n, it cannot divide n + 1

  • Proof:

    • Assume that n rem d = 0, that d >= 2 and that (n + 1) rem d = 0
    • (n + 1) rem d =
      n rem d + 1 rem d = 1 rem d = 1
  • This contradicts our original assumptions. The only assumption that we can change is the third

  • Thus, the original statement must be true ("Q.E.D.")

Example: There is no largest prime

(Due to Georgia Reh)

  • First, take as an axiom the definition of prime number

    • 2 is a prime
    • Any natural number that is not divisible by any smaller prime is a prime
  • Next define composite number as any natural number greater than 2 that is not prime

  • Assume that some number p is the largest prime

  • Let q be the product of all the numbers from 2 to p

  • By our Lemma, none of the numbers from 2 to p can divide q + 1

  • Thus q + 1 is prime, and larger than p

That Proof Was Kind Of Borked

  • Yeah, need an additional assumption: that every number can be uniquely factored

  • This is the danger of informal proofs

  • This proof can be made right, but not all borked proofs can

Some Proof Techniques

  • Direct proof: Usually by instantiation

    • e.g. if we know that for all natural numbers n we have 2n >= n, we can conclude 4 >= 2
  • Contradiction: What we just saw. Assume the opposite of what we want to prove, and show something that cannot be true

  • Induction: Use the power of recursion to show that successively smaller and larger cases are true

Inductive Proof

  • Induction looks a lot like recursion
    • Show that the proposition is true for some simple "base case"
    • Assuming that the proposition is true for some generic case, show that it must also be true for a "more complicated" case

Example: Gauss's Sum

  • One of the most common formulae in CS

  • sum(1..n) = n (n + 1) / 2

  • Direct Proof: (Gauss at age 11 or somesuch)

    • Add sum(1..n) to sum(n..1)
    • Result must be (n + 1) repeated n times
    • Thus, this sum must be n (n + 1)
    • But this sum also must be sum(1..n) + sum(1..n) = 2 sum(1..n)
    • Thus sum(1..n) = n (n + 1) / 2
  • Inductive Proof:

    • Base case: When n = 1, we have sum(1..n) = sum(1..1) = 1
    • Inductive case:
      • Assume sum(1..n-1) = (n - 1) ((n - 1) + 1) / 2 = n (n - 1) / 2
      • This means that

        sum(1..n) =
        sum(1..n-1) + n =
        n (n - 1) / 2 + n =
        (n2 - n + 2n) / 2 =
        (n2 + n) / 2 =
        n (n + 1) / 2

Exercise: Sum Of Odd Numbers

  • Prove that sum(1, 3, 5 ... (2n-1)) = n2

2014-07-09: Introduction To Algorithms; Trees and Graphs

What Is An Algorithm?

We've spent a lot of time dancing around this. Let's get to it...

  • An algorithm is instructions for solving any given instance of some particular problem

  • A problem is a named collection of generic givens and a question to be answered from them

  • An instance is a particular instantiation of the givens such that the problem can be solved

Example: Largest Element Of A Set

        Largest Element of a Set
        Given: A set S of integers
        Find: The largest element in S
  • OK, this is a simple problem.

  • A given instance might be something like S = {1, 4, 7}

  • We seek an approach that will yield the largest element of S regardless of how S is instantiated

  • This problem is a bit ill posed: What about the empty set?

    • Change the given to "A finite nonempty set"

Algorithm: Largest Element Of A Set

  • An algorithm must be given in terms of fundamental, implementable operations

  • An algorithm is best described in terms of pseudocode: English-like code that does not refer to a particular programming language

    • By definition, pseudocode is not a programming language: there is no "correct" syntax, and the operations described may take some effort to implement in yours
  • Here's an algorithm

          Algorithm: Linear Scan
    
          To find the largest element of a set S:
            let m be any arbitrary element of S
            for each e in S
                if e > m
                    m ← e
            return m
    

Proving Linear Scan "Correct"

  • Involves the proof techniques discussed yesterday

  • First, verify that our Linear Scan algorithm always returns a result

    • Must, since the for loop is guaranteed to always finish and thus the program must get to the end
    • Algorithms that may not return an answer can still be "partially correct"
  • Next, verify that the answer is always correct

    • First, show that the result is always <= the maximum element

      • Yes, because it is always some element of the set, so it can't be greater than any element in the set
    • Next, show that the result is always >= the maximum element

      • Harder. By contradiction: assume that for some set S, the algorithm returns an element of S smaller than maximum
      • Note that each element examined increased the current maximum m to at least its size
      • Thus, the result must be some unexamined element
      • But the for loop examines all the elements

Performance Of Linear Scan

  • We can say things about how fast this algorithm is without ever looking at an implementation

  • Assume that each step takes a fixed amount of time

  • Then total running time is

          2 + 2 |S| <= T[S] <= 2 + 3 |S|
    
  • Note that this isn't a terribly interesting difference

    • Probably OK to just say that T[S] is linearly proportional to |S|
  • This is probably as good as one can do: It's hard to imagine how to find a maximum element of S without examining every element of S at least once

Big-O Notation As A Performance Description

  • We'll see this again, but here's a preview:

  • It is reasonable to consider

    • Only the worst-case performance of the algorithm on inputs of a given size
    • Only the performance of the algorithm as the inputs get arbitrarily large
    • Only approximate performance: not worrying about "constant factors"
  • If this is the plan, let's zero in on it a bit

    • Define the input size in terms of the amount of memory required to represent the instance
    • Consider only the upper bound on runtime as a function of size
    • Throw away additive and multiplicative constants as somewhat irrelevant
    • We usually call the input size n and write this approximate function as O
  • For our example, we say that T[n] is in O(n) where n = |S|

Algorithm Design

  • Teaching algorithm design is hard

    • Mostly you have to get some experience with ways of thinking about problems
  • Typical techniques:

    • "Brute force": Like our example, just work directly from the definition
    • Recursive: Split off the base case and call yourself for the recursive case
    • Divide-and-Conquer: A kind of recursion

Exercises: Algorithm Design

  • For each of the following

    • Formalize problem
    • Give pseudocode for an algorithm
    • Describe why you think the algorithm is correct
    • Estimate the performance of the algorithm
  • Problems

    • Find the minimum element of a set S of integers
    • Find a maximum element of a sequence S of integers
    • Find the average of a sequence S of integers
    • Given sets S and T, print the cross product set S x T
    • Given sets S and T, answer true if any element of S divides any element of T, or false otherwise

Graphs

  • Unrelated to HS graphs!

  • Fundamental CS / discrete math structures showing relationships between entities

  • Consider an arbitrary set V of "vertices" / "nodes" / "bubbles"

  • It is easy to imagine a set E of "edges" / "arcs" connecting vertices

    • E is a subset of V x V
    • Self-edges?
  • Edges can be directed or undirected

  • The degree of a vertex is the number of edges connected to it

More About Graphs

  • Edges and vertices may be labeled: use a "labeling function" l : E -> L or l : V -> L

  • There's an obvious graphical representation here

    • Directed edges are arrows
    • Undirected edges are lines
  • One can find lots of representations in algorithms / programs

    • Vertex + edge set or list
    • Adjacency set / list: for each vertex, set of edges / vertices connected to it
    • Adjacency matrix: more later
  • Lots of interesting things to say about these structures!

Special Case: Linked List

  • Consider a directed line graph

  • Each node in the line graph points to its successor

  • This is one natural representation for a sequence

  • The other natural representation is as an array: successive values are at successive locations

  • Linked lists vs arrays

    • DA: To get to the n-th element, you have to start at the beginning and take n steps
    • DA: Takes twice as much storage: need to store both the element and the edge information
    • AD: To insert an element e, you can just change the successor of the element before and set e's successor appropriately

Special Case: Tree

  • A tree is a connected undirected graph with no cycles

  • A rooted tree has one special node designated as the root

    • In CS rooted trees are drawn upside-down (why?)
  • A binary tree has degree at most three

Other Graph Properties

  • Connected: As mentioned earlier, there is a path from any vertex to any other vertex

    • More complicated when directed
  • Complete: All vertices are connected

  • A bajillion other special graphs, each with their own stuff

2014-07-10: Implementing Algorithms in Python

Translating Pseudocode To Python

  • This is mostly obvious:
    • Give concrete representations to mathematical objects in the pseudocode
    • Create auxiliary functions as needed to keep the main code high-level
    • Reconsider identifier choices

Exercise: Voting Machine

  • Specification

    • Hardcode 2 or more candidates into program

    • Each voter must be able to select a candidate

    • Vote totals for each candidate must be maintained

    • Vote totals must be reported (but only to authorities)

    • Mechanism for preventing multiple votes by same person

    • Each voter must be anonymous (to other voters and officials)

  • Pseudocode

          set all candidates' vote totals to 0
          note that no one has voted yet
          repeat
              clear the page
              prompt for a username u
              if u is the election official id
                  report vote totals
                  continue
              if u has already voted
                  report an error
                  continue
              present choices
              prompt for a choice
              increase the vote total of the choice
              mark u as having voted
    
  • Python

    • See voting.py
  • Verification

    • Not performed.

The assert Statement

  • Reconsider this algorithm

          To find the earliest most-frequently-occurring
          element of a nonempty sequence S:
            m <- 0
            d <- new empty dictionary
            for k in S
                if k is a key in d
                    d[k] <- d[k] + 1
                else
                    d[k] <- 1
                if d[k] > m
                    m <- d[k]
            for k in the keys of d
                if d[k] = m
                    return k
    
  • Translate into Python

          def earliest_repeated(s):
          """Find the earliest most-frequently-occurring
          element of a nonempty list s."""
            m = 0
            d = {}
            for k in s:
                if k in d:
                    d[k] += 1
                else:
                    d[k] = 1
                if d[k] > m:
                    m = d[k]
            for k in d:
                if d[k] == m:
                    return k
    
  • Problem: What if there's a bug in our algorithm? Or what if someone passes an empty list?

    • Then the last loop will finish without returning
    • Thus the function will return None
    • Uggh
  • Good practice: Have the program crash at that point.

            assert False
    
  • The assert statement causes your program to crash with a reasonable message unless its condition is true

  • Should also add an assertion at the beginning that the list is nonempty

            assert s != []
    
  • In general, use assertions liberally

    • Anywhere you're not sure things will be right
    • Also serve as documentation of how things are supposed to be
    • Make debugging much easier
  • Don't remove assertions gratuitously

Instrumentation and "print Debugging"

  • Your programs are starting to be complicated enough to need "debugging"

    • Not a part of the "standard engineering process"
    • Rather, iteration activity
  • Debugging reasons backward, from evidence to cause

  • A good way to have evidence is to instrument your program

    • Have it print information as it runs so that you can see what is happening
  • Probably want to have those prints controlled by a global variable so that you can turn them on only when wanted

          # Turn on for instrumentation.
          debug = True
    
          def earliest_repeated(s):
          """Find the earliest most-frequently-occurring
          element of a nonempty list s."""
            global debug
            assert s != []
            m = 0
            d = {}
            for k in s:
                if debug:
                    print("Considering", k)
                if k in d:
                    d[k] += 1
                    if debug:
                        print("Increased count to", d[k])
                else:
                    d[k] = 1
                    if debug:
                        print("Initialized count")
                if d[k] > m:
                    if debug:
                        print("Found new maximum", m)
                    m = d[k]
            for k in d:
                if d[k] == m:
                    if debug:
                        print("Returning", k)
                    return k
            assert False
    
  • Obviously, these crud up your program; open question how much you should remove

Running Programs: The Linux Command Line

  • So far, most people have mostly been working in IDLE

  • The Linux command line provides quite a bit of power, so worth getting used to

  • Short version: Run program with

          $ python3 myprog.py
    
  • I/O:

    • Program reads user input from "standard input" (default terminal)
    • Program writes user output to "standard output" (default terminal)
    • Program also has available a "standard error" channel for writing output not intended to be processed
  • To provide input from a file, use the "<" command-line operator

          $ python3 myprog.py <myinput.txt
    
  • To store output in a file, use the ">" operator

          $ python3 myprog.py <myinput.txt >results.txt
    
  • Python's stupid input() prompts are to standard output

  • These things are available from Python via the sys module:

          from sys import stderr
    
          print("oops", file=stderr)
    

Using read() and readline()

  • .readline() reads a line (like input())

    • Includes trailing newline
    • Looping over a file
  • .read() reads a whole file (with newlines)

    • Fairly viable in 2014
  • All of this works with stdin

Stripping Trailing Newlines

  • You can apply .rstrip('\n') to a string to remove any trailing newline

  • A note on dot

    • Think of x.y() as y(x)
    • You can string dots together; they left-associate
    • Style is object-oriented programming; soon

Exercise: Command-Line Programming

  • Write a program that counts the number of characters on its input and prints the count as its output. Include newline characters etc in the count.

Enhancing Your Instrumentation: A note Function

  • Let's implement a function that does a couple of nice instrumentation things

    • Prints your debugging messages to stderr
    • Prints them only when debugging is turned on
  • To do this, we're going to have to explore Python quite a bit

  • This is the kind of process I go through every week

Keyword Arguments

  • Note that print() takes "keyword arguments" where the actual is passed by formal name

  • Keyword arguments can help with getting arguments in the right order

Default Values

  • Arguments can be given default values

          def mystery(x = 5):
              print(x)
    
  • Defaulted arguments must be at the end of the formals

Variable Arguments

  • Let us figure out how to implement Python functions that take a variable number of arguments, like print()

  • Now, let's make the default be stderr

          from sys import stdout, stderr
    
          def note(*args, **kwargs):
              if "file" not in kwargs:
                  kwargs["file"] = stderr
              print(*args, **kwargs)
    
          note("hello", 5)
          note("hello", 6, file=stdout)
    
  • Now, let's make it all controlled by a debug variable

          from sys import stdout, stderr
    
          debug = False
    
          def note(*args, **kwargs):
              if not debug:
                  return
              if "file" not in kwargs:
                  kwargs["file"] = stderr
              print(*args, **kwargs)
    
          note("hello", 5)
          note("hello", 6, file = stdout)
    

Implementing A Python Module

  • We would like to make our new note function available for future programs

  • Just stick it in a file named noting.py and put it where Python can find it

    • Current directory
    • System standard place
    • The PYTHONPATH "environment variable"

          $ PYTHONPATH=/tmp
          $ export PYTHONPATH
      
  • Now can import it just like anything else

          from noting import *
    
          debug = True
    
          note("noting something")
    
  • However, Python module variables don't work right, so we'll have to fix our interface

          from noting import *
          import noting
    
          noting.debug = True
    
          note("noting something")
    

Cleaning Up Our Code

        from noting import *
        noting.debug = True

        def earliest_repeated(s):
        """Find the earliest most-frequently-occurring
        element of a nonempty list s."""
          assert s != []
          m = 0
          d = {}
          for k in s:
              note("Considering", k)
              if k in d:
                  d[k] += 1
                  note("Increased count to", d[k])
              else:
                  d[k] = 1
                  note("Initialized count")
              if d[k] > m:
                  note("Found new maximum", m)
                  m = d[k]
          for k in d:
              if d[k] == m:
                  note("Returning", k)
                  return k
          assert False

Debugging

From: Debugging Your Programs, Bart Massey 2013-01-24

What is debugging?

  • During/after coding, before/during/after testing

  • Bring the program to a state where it appears to be bug-free (but this is a lie)

  • Estimate 20-40% of programming effort

  • "Secret": No good books, no chapter in our book, nothin'

How to debug

  • Given a failure of the software:

    • Find the causes ("faults") leading to that failure
    • Find the root causes of those faults
    • Figure out and apply a repair
    • Check the repair
      • Does it fix the failures?
      • Does it cause new failures?

Key activity: diagnosis

  • Like in medicine or car repair: "It doesn't work; what's happening and what can be done?"

  • Diagnosis is hypothesis formation and testing

    • What possible reasons might there be for observed symptoms?
    • Can those reasons be ruled out by what is known so far?
    • If not, can we do tests to rule each reason out or increase our belief that it is the correct one?
    • Repeat until exactly one possible reason remains, and it looks really likely to be true.

Common bugs

  • Two basic kinds:

    • Bad control flow
    • Just plain calculating the wrong thing
  • Examples

    • Off-by-one "fencepost" errors
    • Copy-and-paste calculation errors
    • Typos/"Thinkos"
    • Failure to design to the spec
    • Failure to understand/implement the design

Root Cause Analysis

  • It's not enough to find the line of code that "causes the bug"

  • You want to find out how that line got there

  • In software, faults are caused by mistakes ("errors") that were made by a human (usually you)

  • With the "root causes" found, you can:

    • Correct all the faults caused by that error
    • Take steps to make that error less likely in the future

Preparing code for debugging

  • "Real programmers don't comment. It was hard to write--it should be hard to read and harder to understand."

  • Code should have a spec, simple tests, and pseudocode

  • Formatting should be as clean as possible

    • Consistent indentation
    • Consistent liberal use of whitespace
    • Good names
    • Idiomatic
  • Code should be instrumented appropriately

Debugging pre-inspection

  • Read the code in question carefully. Look for things that are wrong or unclear

  • Explain the code to someone. Have them look at it too

  • Most bugs are easily found and fixed by inspection alone

Debugging tools: your brain

  • Are you sure the spec and tests are correct?

  • White-box: what kinds of similar inputs might produce the same program misbehavior?

  • Black-box: what properties distinguish misbehaving inputs?

  • Is the timing as expected?

  • Are your current hypotheses consistent with everything you have observed or can observe?

Debugging tools: print() function

  • For a specific hypothesis, stick a print() in that will either disconfirm or confirm the hypothesis

    • Works in a huge variety of situations
    • But don't spam instrumentation everywhere, or you will get confused by it
  • Can use print() for exploring program behavior ("tracing"), but beware: one can waste a lot of time doing this without learning anything.

  • Always best to know what the question is before you start looking for the answer

Debugging tools: "debugger"

  • Idle will happily provide you the ability to

    • Step through your program one statement at a time
    • Run until a given program line is reached
    • Examine/change variable values anytime stopped
  • Except the debugger is really fragile and hard to use

  • In particular, doesn't interact well with input()

  • In general, debugger is tool of last resort

Post-diagnosis

  • Once you've found the immediate source of a bug, do RCA

  • Look for other places where faults may have been inserted due to the same root causes

  • Think hard about how those faults got there. What are you going to do to avoid this in the future?

  • Craft fixes that fix the faults properly

    • This may involve changing the design or revising the specification
  • Apply the fixes, then test everything carefully

    • Did the problem get fixed?
    • Are there new problems?

Backups, versions and source code management

  • It is really easy to get the buggy version and the fixed version and the version you are working on right now mixed up

  • Tool called source code management system helps here

  • It is probably a mistake to have too many backup files around; in any case, use a consistent clear naming scheme for backup files

Parting thoughts

  • Don't get stuck!

    • Interrupt yourself every few minutes and see if you're making real progress
    • If you are stuck, many strategies are available:
      • Try a different approach
      • Take a break
      • Ask for help
  • Don't get discouraged

    • The most experienced programmers still make a lot of bugs and have a hard time fixing them

    • The bugs you will make are all fixable

Source Code Management: Git