4-1-1 Information

"2 bits, 4 bits, 6 bits a byte!"

  • Representing information as bits
  • Number Representations
  • Other bits

What is "Information"?

information, n. Knowledge communicated or received concerning a particular fact or circumstance.

A Computer Scientist's definition:

Information resolves uncertainty.

Information is simply that which cannot be predicted. The less predictable a message is, the more information it conveys!

Quantifying Information

(Claude Shannon, 1948)

Suppose you're faced with N equally probable choices, and I give you a fact that narrows it down to M choices. Then you've been given:

log2(N/M) bits of information

Examples:

log2(10^7/10^4) = ~9.966 bits
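The formula can be checked with a few lines of Python. This is a minimal sketch; the function name `information_bits` and the coin and card cases are illustrative additions, not part of the slides.

```python
import math

def information_bits(n_choices: int, m_remaining: int) -> float:
    """Bits gained when N equally probable choices are narrowed to M."""
    return math.log2(n_choices / m_remaining)

print(round(information_bits(10**7, 10**4), 3))  # 9.966
print(information_bits(2, 1))                    # one fair coin flip: 1.0
print(information_bits(52, 13))                  # learning a card's suit: 2.0
```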

Another Example: Sum of 2 dice

i2 = log2(36/1) = 5.170 bits
i3 = log2(36/2) = 4.170 bits
i4 = log2(36/3) = 3.585 bits
i5 = log2(36/4) = 3.170 bits
i6 = log2(36/5) = 2.848 bits
i7 = log2(36/6) = 2.585 bits
i8 = log2(36/5) = 2.848 bits
i9 = log2(36/4) = 3.170 bits
i10 = log2(36/3) = 3.585 bits
i11 = log2(36/2) = 4.170 bits
i12 = log2(36/1) = 5.170 bits

The average information provided by the sum of 2 dice is: 3.274 bits.

The average information from the events of a given process is called its Entropy.
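The 3.274-bit figure can be reproduced by weighting each sum's information by its probability. The sketch below assumes two fair six-sided dice; the helper name `entropy_bits` is an illustrative choice.

```python
import math
from collections import Counter
from itertools import product

# Count how many of the 36 equally likely rolls produce each sum.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

def entropy_bits(counts, total):
    """Probability-weighted average of log2(total/count) over all outcomes."""
    return sum((c / total) * math.log2(total / c) for c in counts.values())

dice_entropy = entropy_bits(ways, 36)
print(round(dice_entropy, 3))  # 3.274
```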

Show me the bits!

Variable-Length Encoding

Variable-Length Decoding

 2 = 10011     3 = 0101     4 = 011      5 = 001
 6 = 111       7 = 101      8 = 110      9 = 000
10 = 1000     11 = 0100    12 = 10010
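Because no codeword in this table is a prefix of another, a bit stream can be decoded left to right without separators. The following is a rough sketch of that idea; the function names and the sample message are illustrative assumptions.

```python
# Code table transcribed from the dice example (sum -> bit string).
CODE = {2: "10011", 3: "0101", 4: "011", 5: "001", 6: "111", 7: "101",
        8: "110", 9: "000", 10: "1000", 11: "0100", 12: "10010"}
DECODE = {bits: symbol for symbol, bits in CODE.items()}

def encode(sums):
    """Concatenate the codewords for a sequence of dice sums."""
    return "".join(CODE[s] for s in sums)

def decode(bitstring):
    """Accumulate bits until they match a codeword, emit it, and start over."""
    symbols, current = [], ""
    for bit in bitstring:
        current += bit
        if current in DECODE:
            symbols.append(DECODE[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return symbols

message = encode([7, 2, 12, 4])
print(message)          # 1011001110010011
print(decode(message))  # [7, 2, 12, 4]
```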

Huffman Coding

A simple greedy algorithm for approximating a minimum encoding

  1. Find the 2 items with the smallest probabilities
  2. Join them into a new *meta* item whose probability is their sum
  3. Remove the two items and insert the new meta item
  4. Repeat from step 1 until there is only one item
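The four steps above map naturally onto a priority queue. Here is one possible sketch using Python's `heapq`; it tracks only code lengths rather than the tree itself, and the function name and tie-breaking counter are assumptions made for this example. Ties may be broken differently from the slides, so individual lengths can differ while the average stays the same.

```python
import heapq
from fractions import Fraction
from itertools import count

def huffman_code_lengths(probabilities):
    """Return {symbol: code length} by repeatedly merging the two least likely items."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    # Each heap entry: (probability, tiebreak, symbols contained in this item)
    heap = [(p, next(tiebreak), [s]) for s, p in probabilities.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probabilities}
    while len(heap) > 1:                      # step 4: repeat until one item is left
        p1, _, syms1 = heapq.heappop(heap)    # step 1: the two smallest items
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:               # each merge adds one bit to these symbols
            lengths[s] += 1
        # steps 2 and 3: replace the pair with a meta item whose probability is the sum
        heapq.heappush(heap, (p1 + p2, next(tiebreak), syms1 + syms2))
    return lengths

# Probabilities for the sum of two fair dice.
dice = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}
lengths = huffman_code_lengths(dice)
average = sum(dice[s] * lengths[s] for s in dice)
print(average, round(float(average), 3))  # 119/36 3.306
```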

Converting Tree to an Encoding

Once the *tree* is constructed, label its edges consistently and follow the paths from the largest *meta* item to each of the real items to find the encoding.

 2 = 10011     3 = 0101     4 = 011      5 = 001
 6 = 111       7 = 101      8 = 110      9 = 000
10 = 1000     11 = 0100    12 = 10010
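Reading codes off a tree amounts to a recursive walk that appends one edge label per level. The sketch below uses a small hypothetical four-item tree written as nested tuples (not the dice tree from the slides) and labels left edges 0 and right edges 1.

```python
def assign_codes(node, prefix=""):
    """Walk down from the root; a leaf's code is the edge labels along its path."""
    if not isinstance(node, tuple):           # a leaf is a real item
        return {node: prefix or "0"}          # a lone item still needs one bit
    left, right = node                        # a tuple is a meta item
    codes = assign_codes(left, prefix + "0")  # left edges are labeled 0
    codes.update(assign_codes(right, prefix + "1"))  # right edges are labeled 1
    return codes

# Hypothetical tree: A is most likely, C and D are least likely.
tree = ("A", ("B", ("C", "D")))
print(assign_codes(tree))  # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```

Any consistent labeling works; swapping the 0 and 1 labels gives a different code with the same lengths.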

Coding Efficiency

How does this code compare to the information content?

Pretty close. Recall that the lower bound was 3.274 bits.
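The comparison can be made concrete by computing the expected codeword length of the dice code and setting it beside the entropy. This is a sketch under the assumption of two fair dice; the variable names are illustrative.

```python
import math
from fractions import Fraction

# Code table transcribed from the dice example (sum -> bit string).
code = {2: "10011", 3: "0101", 4: "011", 5: "001", 6: "111", 7: "101",
        8: "110", 9: "000", 10: "1000", 11: "0100", 12: "10010"}
prob = {s: Fraction(6 - abs(s - 7), 36) for s in code}

# Expected number of bits per encoded sum.
average_length = sum(prob[s] * len(bits) for s, bits in code.items())
# Lower bound: the entropy of the sum of two dice.
entropy = sum(float(p) * math.log2(1 / float(p)) for p in prob.values())

print(average_length, round(float(average_length), 3))  # 119/36 3.306
print(round(entropy, 3))                                 # 3.274
```

The code spends about 0.03 bits per symbol more than the lower bound.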

However, an efficient encoding (as defined by having an average code size close to the information content) is not always what we want!

Encoding Considerations

Information Encoding Standards

Fixed-Size Codes

If all choices are equally likely (or we have no reason to expect otherwise), then a fixed-size code is often used. Such a code should use at least enough bits to represent the information content.

BCD
0 - 0000
1 - 0001
2 - 0010
3 - 0011
4 - 0100
5 - 0101
6 - 0110
7 - 0111
8 - 1000
9 - 1001

ex. Decimal digits 10 = {0,1,2,3,4,5,6,7,8,9}
4-bit BCD (binary-coded decimal)
log2(10/1) = 3.322 < 4 bits
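As a small illustration of BCD, each decimal digit is encoded independently as its own 4-bit group. The function name `to_bcd` and the sample number are assumptions for this sketch.

```python
def to_bcd(number: int) -> str:
    """Encode each decimal digit of a non-negative integer as a 4-bit group."""
    return " ".join(format(int(digit), "04b") for digit in str(number))

print(to_bcd(2015))  # 0010 0000 0001 0101
print(to_bcd(9))     # 1001
```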

ex. ~84 English characters = {A-Z (26), a-z (26), 0-9 (10),
                                           punctuation (8), math (9), financial (5)}
7-bit ASCII (American Standard Code for Information Interchange)
log2(84/1) = 6.392 < 7 bits
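The smallest fixed-size code for N choices needs log2(N) bits rounded up to a whole number. A minimal sketch follows; the function name is illustrative, and integer arithmetic is used here to avoid floating-point rounding.

```python
def fixed_code_bits(n_choices: int) -> int:
    """Smallest whole number of bits that can distinguish n_choices values."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    return (n_choices - 1).bit_length()

print(fixed_code_bits(10))  # 4  (BCD)
print(fixed_code_bits(84))  # 7  (ASCII)
```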

ASCII

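(Rows: high 3 bits. Columns: low 4 bits. SP = space.)

     0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
000  NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
001  DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
010  SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
011  0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
100  @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
101  P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
110  `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
111  p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL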
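A character's position in the ASCII table can be recovered by splitting its 7-bit code into the high 3 bits (row) and low 4 bits (column). The helper name `ascii_bits` is an assumption made for this sketch.

```python
def ascii_bits(ch: str) -> str:
    """Return a character's 7-bit ASCII code as 'row column' (3 bits, then 4 bits)."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    bits = format(code, "07b")
    return bits[:3] + " " + bits[3:]

print(ascii_bits("A"))  # 100 0001
print(ascii_bits("a"))  # 110 0001
print(ascii_bits("0"))  # 011 0000
```

Note that upper and lower case letters differ only in one bit of the row.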

Unicode

Encoding Positive Integers

It is straightforward to encode positive integers as a sequence of bits. Each bit is assigned a weight. Ordered from right to left, these weights are increasing powers of 2. The value v of an n-bit number with bits b_(n-1) ... b_1 b_0 encoded in this fashion is given by the following formula:

v = b_(n-1)*2^(n-1) + ... + b_1*2^1 + b_0*2^0

2^11 2^10 2^9  2^8  2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
0    1    1    1    1    1    0    1    1    1    1    1

1024 + 512 + 256 + 128 + 64 + 16 + 8 + 4 + 2 + 1 = 2015
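The weighted sum can be written out directly. This sketch takes the bits as a string with the most significant bit first; the function name is illustrative.

```python
def bits_to_value(bits: str) -> int:
    """Sum bit * 2**position, where position 0 is the rightmost bit."""
    value = 0
    for position, bit in enumerate(reversed(bits)):
        value += int(bit) * 2 ** position
    return value

print(bits_to_value("011111011111"))  # 2015
```

Python's built-in `int("011111011111", 2)` performs the same conversion.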

Favorite Bits

Tricks with Bits

Other Helpful Clusterings

Oftentimes we will find it convenient to cluster groups of bits together for a more compact written representation. Clustering by 3 bits is called Octal, and it is often indicated with a leading zero, 0. Octal is not that common today.
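As an illustration, grouping a bit string into clusters of 3 from the right yields its octal digits. The function name is an assumption; note that the leading-zero convention shown here is the C-style one, while Python's own literals use the prefix `0o`.

```python
def to_octal(bits: str) -> str:
    """Cluster bits in groups of 3 (padding on the left) and prefix a leading zero."""
    padded = bits.zfill(-(-len(bits) // 3) * 3)   # round the length up to a multiple of 3
    digits = [str(int(padded[i:i + 3], 2)) for i in range(0, len(padded), 3)]
    return "0" + "".join(digits)

print(to_octal("011111011111"))  # 03737
print(to_octal("1011"))          # 013
```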

One Last Clustering

Clusters of 4 bits are used most frequently. This representation is called hexadecimal. The hexadecimal digits are 0-9 and A-F, and each digit position represents a power of 16. Hexadecimal is commonly indicated with a leading "0x".
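The same clustering idea with groups of 4 gives hexadecimal. A minimal sketch, with an illustrative function name:

```python
def to_hex(bits: str) -> str:
    """Cluster bits in groups of 4 (padding on the left) and prefix '0x'."""
    padded = bits.zfill(-(-len(bits) // 4) * 4)   # round the length up to a multiple of 4
    digits = ["0123456789ABCDEF"[int(padded[i:i + 4], 2)]
              for i in range(0, len(padded), 4)]
    return "0x" + "".join(digits)

print(to_hex("011111011111"))  # 0x7DF
```

Each hexadecimal digit stands for exactly 4 bits, so 0111 1101 1111 reads off as 7, D, F.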

Summary and Next Time