sicp-ex-2.72





Exercise 2.72. Consider the encoding procedure that you designed in exercise 2.68. What is the order of growth in the number of steps needed to encode a symbol? Be sure to include the number of steps needed to search the symbol list at each node encountered. To answer this question in general is difficult. Consider the special case where the relative frequencies of the n symbols are as described in exercise 2.71, and give the order of growth (as a function of n) of the number of steps needed to encode the most frequent and least frequent symbols in the alphabet.


jirf

Major thanks to RWitak. I used his encode-symbol to improve my own, which made answering this question require much less math.

encode-symbol

  
 (define (encode-symbol symbol tree)
   (cond ((leaf? tree)
          ;; at a leaf we have either found our symbol,
          ;; or it was never in the tree at all
          (if (eq? (symbol-leaf tree) symbol)
              '()
              (error "invalid symbol -- ENCODE-SYMBOL --" symbol)))
         ;; symbol is in the left branch: emit a 0 and descend left
         ((memq symbol (symbols (left-branch tree)))
          (cons 0 (encode-symbol symbol (left-branch tree))))
         ;; otherwise it must be in the right branch: emit a 1
         (else
          (cons 1 (encode-symbol symbol (right-branch tree))))))

Solution

Best Case (symbol = most frequent): O(1)
Worst Case (symbol = least frequent): O(n)

Explanation

The two operations whose cost can grow with the size of the input are the memq search for the symbol in the left branch's symbol list, and the recursive call that descends the rest of the tree.

Searching for the symbol in the left branch:

        ((memq symbol (symbols (left-branch tree)))

Because, as discussed in ex. 2.71, the left branch always contains exactly one symbol, this search does not grow with the input size. Therefore it is O(1).

Tree Depth

The maximum depth of a Huffman tree with n leaves whose frequencies are:

    (list 2^0 2^1 ... 2^(n-1))

is n-1.

To see this, let's look at how the Huffman tree is merged together. The n=1 case is trivial. For n > 1, observe that the first two leaves end up at the same depth, while each subsequent leaf is merged in one "level" above them (because we only ever merge two leaves together once, at the very start).

n=6

(each merged subtree is written (left right weight))

-> (1 2 4 8 16 32)                        <- max depth = 0
-> ((1 2 3) 4 8 16 32)                    <- max depth = 1
-> ((4 (1 2 3) 7) 8 16 32)                <- max depth = 2
-> ((8 (4 (1 2 3) 7) 15) 16 32)           <- max depth = 3
-> ((16 (8 (4 (1 2 3) 7) 15) 31) 32)      <- max depth = 4
-> (32 (16 (8 (4 (1 2 3) 7) 15) 31) 63)   <- max depth = 5 = n-1
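To double-check the depth claim, here is a small sketch (my addition; power2-tree and max-depth are my own names, and the book's make-leaf, make-code-tree, leaf?, left-branch, and right-branch are assumed to be loaded) that hand-builds the n=6 tree from the trace above and measures its depth:

    ;; the n=6 powers-of-two tree, with the lone heavy leaf on the
    ;; left at every level (matching the trace above)
    (define power2-tree
      (make-code-tree
       (make-leaf 'f 32)
       (make-code-tree
        (make-leaf 'e 16)
        (make-code-tree
         (make-leaf 'd 8)
         (make-code-tree
          (make-leaf 'c 4)
          (make-code-tree
           (make-leaf 'b 2)
           (make-leaf 'a 1)))))))

    ;; depth of the deepest leaf, counting edges from the root
    (define (max-depth tree)
      (if (leaf? tree)
          0
          (+ 1 (max (max-depth (left-branch tree))
                    (max-depth (right-branch tree))))))

    (max-depth power2-tree)  ; => 5, i.e. n-1 for n = 6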

Best Case

The function checks whether the single symbol in the left branch equals our target symbol. It finds it, recurses, sees that it is looking at a leaf with the correct symbol value, and returns. Therefore O(1).

Worst Case

The function has to perform that O(1) leaf/left-branch check once at each of the n-1 levels of the tree. So O(1 * n) = O(n).
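As a quick check (again my addition), running the encode-symbol above against the power2-tree defined earlier exercises both cases; the exact bits assume the lone-leaf-on-the-left layout:

    (encode-symbol 'f power2-tree)  ; => (0)           one O(1) check
    (encode-symbol 'a power2-tree)  ; => (1 1 1 1 1)   n-1 checks, O(n)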


RWitak

I think the answer to this question depends too much on the implementation to be answered generally. Take this version of encode-symbol:

    (define (encode-symbol symbol tree)
      (define (iter-encode symbol tree result)
        (if (leaf? tree)
            result
            (if (memq symbol (symbols (left-branch tree)))
                (iter-encode symbol (left-branch tree) 
                             (append result (list 0)))
                (iter-encode symbol (right-branch tree) 
                             (append result (list 1))))))
      (if (memq symbol (symbols tree))
          (iter-encode symbol tree '())
          (error "bad symbol -- ENCODE-SYMBOL" symbol)))

The symbol lists, just like the tree itself, can be made so that they're ordered by decreasing weight, e.g. when implemented similarly to SICP's "sample-tree" (and remember that we only discuss special cases similar to ex. 2.71!):

    (define power2-huffman
      (make-code-tree (make-leaf 'A 16)
                      (make-code-tree
                       (make-leaf 'B 8)
                       (make-code-tree
                        (make-leaf 'C 4)
                        (make-code-tree
                         (make-leaf 'D 2)
                         (make-leaf 'E 1))))))

Codes:

  A 0
  B 10
  C 110
  D 1110
  E 1111    
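As a quick sketch (my addition), the table can be reproduced by mapping the procedure over the symbol list:

    (map (lambda (s) (encode-symbol s power2-huffman))
         '(A B C D E))
    ;; => ((0) (1 0) (1 1 0) (1 1 1 0) (1 1 1 1))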

In this scenario, encode-symbol only checks the full symbol list once. Then it iterates over the tree, only ever checking the left branch, which in the special case we're supposed to examine always holds just a single leaf with a single symbol, no matter how big the tree is.

This means that in more than half of all cases (16/31), we're done with one iteration: O(1). In more than half of the remaining cases we need 2 iterations, each checking just a single symbol. This continues until the last case, which is the rarest of all and takes n iterations, making it O(n). Also, the initial lookup is very fast in most cases; only the least likely symbol has to traverse the full list.

Now comes the speculative part, as I suck at math: given that in my implementation the number of cases that depend on deeper levels gets cut in half, and each step on its own is O(n), I estimate the overall complexity of that procedure as about O(log n) - please correct me if that's wrong.
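For what it's worth, here is one way to sanity-check that estimate numerically (my sketch, not part of the original answer; expected-iterations is a hypothetical helper). With weights 2^0 ... 2^(n-1), the k-th most frequent symbol is chosen with probability 2^(n-k)/(2^n - 1) and needs k iterations (except the least frequent, which needs only n-1), so we can just compute the expected iteration count:

    (define (expected-iterations n)
      (define total (- (expt 2 n) 1))
      (define (go k acc)
        (if (> k n)
            (/ acc total)
            (go (+ k 1)
                (+ acc (* (expt 2 (- n k))      ; weight of symbol k
                          (min k (- n 1)))))))  ; iterations to encode it
      (go 1 0))

    (exact->inexact (expected-iterations 10))  ; => ~1.99
    (exact->inexact (expected-iterations 20))  ; => ~2.0

The value approaches 2 as n grows, which suggests the average cost of this procedure is actually O(1); O(log n) is a safe upper bound for it.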


Apart from all that, if we know beforehand that only a "powers-of-two" kind of tree will be used (and that it's implemented in the way given above), we don't even need the tree itself for encoding, just the ordered list of all symbols:

    (define (power2-encode-symbol symbol power2-huffman)
      (define (traverse symlist)
        (cond ((null? symlist) 
               (error "bad symbol -- POWER2-ENCODE-SYMBOL" symbol))
              ((eq? symbol (car symlist))
               (if (null? (cdr symlist))
                   '(1)
                   '(0)))
              (else (cons 1 (traverse (cdr symlist))))))
      (traverse (symbols power2-huffman)))

In this very efficient encoding algorithm, the resulting code is already finished by the time we have checked whether the symbol is even contained in the tree! I guess it has the same overall complexity, but fewer actual steps.
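The same sketch as before (again my addition) confirms that this version produces identical codes without ever descending the tree:

    (map (lambda (s) (power2-encode-symbol s power2-huffman))
         '(A B C D E))
    ;; => ((0) (1 0) (1 1 0) (1 1 1 0) (1 1 1 1))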


aos

Here's my take on this problem.

NB: This specifically refers to the encode-symbol procedure, which is a sub-procedure of the actual encode. If we were to extend the analysis to encode, we'd have to multiply by an additional factor of the message length.

For the encode-symbol procedure in 2.68:

For the special case described in 2.71:

1. Encoding the most frequent symbol:

2. Encoding the least frequent symbol:

The above answer is not comprehensive enough.

For the encode-symbol procedure in 2.68:

    Consider the maximum and minimum:
  • The minimum is the case of a full, balanced binary tree: O(n log n)
  • The maximum is the skewed tree of 2.71: O(n^2)
  • The final answer lies between O(n log n) and O(n^2).


Big Old Duck

My encode-symbol has the following growth rates for the special case:

  • Most frequent symbol: O(1)
  • Least frequent symbol: O(n)
  • Any symbol other than the two above: O(n*log(f)), where f is the frequency of that symbol.

Can the growth rate depend on two variables, though? IDK.

Why, you may ask?

So, here is my encode-symbol:

    (define (encode-symbol symbol tree)
      (cond ((leaf? tree) '())
            ((element-of-set?
              symbol
              (symbols (left-branch tree)))
             (cons 0
                   (encode-symbol symbol
                                  (left-branch tree))))
            ((element-of-set?
              symbol
              (symbols (right-branch tree)))
             (cons 1
                   (encode-symbol symbol
                                  (right-branch tree))))
            (else (error "bad symbol: ENCODE-SYMBOL"
                         symbol))))

I assume that the symbols are sorted by their frequencies (or weights) in increasing order, so searching takes roughly log2(f) steps. The number of steps is then:

    T(n) = log(f) + T(n-1)
         = 2*log(f) + T(n-2)
         = ...
         = r*log(f) + T(n-r)

That continues until we reach the point where n-r = log(f), that is, the symbol is at the end of the list and we branch off to a leaf. So:

    T(n) = (n - log(f)) * log(f) + T(log(f))

T(log(f)) will be O(1), since we only have to search a leaf. Thus, T(n) = O(n*log(f)). Hence proved (I wouldn't call this a proof though; a real proof is more sophisticated).


partj

For the general case, for a tree with n symbols, the order of growth in the number of steps for encoding a given symbol depends on the position of the symbol within the tree (which branches lead to its leaf when starting at the root, whether left or right, and how many) and on the structure of the tree (the symbol lists encountered at every node along the way).

We can consider the best and worst cases along two different axes of consideration and extrapolate for the cases in between. The symbol could be located right below the root of the tree as one of its direct branches, or it could be farthest from the root at a maximum depth of n-1. Also, we may get lucky with our searches in the symbol list at each node encountered, so that the symbol is found in Θ(1) steps every time; or we may have to search through all the symbols at every node, and since the number of symbols at every node is some function of n, we may consider the cost of searching to be Θ(n) at every node encountered. Here are the orders of growth from permuting these scenarios:

  • depth 1, Θ(1) search: Θ(1)
  • depth 1, Θ(n) search: Θ(n)
  • depth n-1, Θ(1) search per node: Θ(n)
  • depth n-1, Θ(n) search per node: Θ(n^2)

It is difficult to answer for the average case, since that would require us to know the structure of the average Huffman tree. But if we assume that the average tree distributes its elements fairly evenly between its branches (even if they aren't exact halves at every step), then the average element might well be within a depth of log n from the root of the tree (remember that frequent symbols being closer to the root helps our assumption). So our average complexity might be Θ(n log n). If we can make our search more efficient, so that it takes Θ(1) steps every time, our average complexity improves to Θ(log n).

In the special case described in ex. 2.71, we observe that the depth of the tree is n-1.

For the most frequent symbol, we need only search once, which means Θ(n) in the worst case and Θ(1) in the best.

For the least frequent symbol, we encounter about n-1 nodes along the way, so depending on our search efficiency we'd make searches of cost n, n-1, and so on, which sums to about n(n+1)/2 steps, i.e. Θ(n^2) order of growth in the worst case. With Θ(1) searching at each node, we'd instead have Θ(n) in the best case.
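To make the two extremes concrete, here is a rough counting sketch (my addition, not partj's code; build-power2-tree and comparisons are hypothetical helpers, and the book's constructors and selectors are assumed). It builds an ex. 2.71-style tree and counts the symbol comparisons made while descending the way an encode-symbol that linearly searches both branch symbol lists would (like Big Old Duck's procedure above, but with the plain unordered-set search from section 2.3.3):

    ;; ex. 2.71-style tree for n symbols: sk has weight 2^(k-1),
    ;; so s1 is the least frequent; the lone heavy leaf sits on
    ;; the left at every level
    (define (build-power2-tree n)
      (define (sym k)
        (string->symbol (string-append "s" (number->string k))))
      (define (go k tree)
        (if (> k n)
            tree
            (go (+ k 1)
                (make-code-tree (make-leaf (sym k) (expt 2 (- k 1)))
                                tree))))
      (go 3 (make-code-tree (make-leaf (sym 2) 2)
                            (make-leaf (sym 1) 1))))

    ;; descend the tree the way encode-symbol would, counting
    ;; every symbol comparison made by the linear searches
    (define (comparisons symbol tree)
      (define count 0)
      (define (found? sym syms)
        (cond ((null? syms) #f)
              (else (set! count (+ count 1))
                    (or (eq? sym (car syms))
                        (found? sym (cdr syms))))))
      (define (walk tree)
        (cond ((leaf? tree) 'done)
              ((found? symbol (symbols (left-branch tree)))
               (walk (left-branch tree)))
              ((found? symbol (symbols (right-branch tree)))
               (walk (right-branch tree)))
              (else (error "bad symbol" symbol))))
      (walk tree)
      count)

    (comparisons 's10 (build-power2-tree 10))  ; => 1   most frequent: Θ(1)
    (comparisons 's1  (build-power2-tree 10))  ; => 54  least frequent: ~n^2/2, Θ(n^2)

With jirf's else-branch version (which never searches the right list), the least frequent symbol instead costs one comparison per level, i.e. Θ(n), matching the Θ(1)-search-per-node case above.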