Jim Mahoney | Computer Science @ Marlboro College | cs.marlboro.edu | Feb 2014 | MIT License
An explanation and illustration of the math behind the Discrete Cosine Transform and the concepts used in lossy JPEG image compression - low pass filtering.
You can think of a vector - a list of numbers - as coefficients times basis vectors.
$$ f_0 \left[ \begin{array}{c} 1 \\ 0 \end{array} \right] + f_1 \left[ \begin{array}{c} 0 \\ 1 \end{array} \right] $$
Using a different basis, different coefficients can describe the same vector.
$$ G_0 \frac{1}{\sqrt{2}} \left[ \begin{array}{c} 1 \\ 1 \end{array} \right] + G_1 \frac{1}{\sqrt{2}} \left[ \begin{array}{c} 1 \\ -1 \end{array} \right] $$
(The factors of $1/\sqrt{2}$ give the basis vectors length 1, i.e. they "normalize" them.)
This transformation from f to G is a DCT (Discrete Cosine Transform). For a vector with 2 components this perhaps isn't all that exciting, but it does still transform the original $(f_0, f_1)$ into low and high frequency components $(G_0, G_1)$.
If $f_0 = 3$ and $f_1 = 5$, what are the G's?
This transform can be written as a matrix multiplication.
$$ f_0 \left[ \begin{array}{c} 1 \\ 0 \end{array} \right] + f_1 \left[ \begin{array}{c} 0 \\ 1 \end{array} \right] = \left[ \begin{array}{c} f_0 \\ f_1 \end{array} \right] = G_0 \frac{1}{\sqrt{2}} \left[ \begin{array}{c} 1 \\ 1 \end{array} \right] + G_1 \frac{1}{\sqrt{2}} \left[ \begin{array}{c} 1 \\ -1 \end{array} \right] = \frac{1}{\sqrt{2}} \left[ \begin{array}{cc} 1 & 1 \\ 1 & -1 \end{array} \right] \left[ \begin{array}{c} G_0 \\ G_1 \end{array} \right] $$
Moreover, this orthonormal matrix has the interesting and useful property that its transpose is its inverse. That makes the equation easy to invert.
Show that the matrix times its transpose is the identity, and use that to find the G's.
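One way to check this exercise numerically is sketched below. It is written for Python 3 with numpy imported explicitly (rather than the pylab namespace the notebook cells use), and the names `M`, `identity_check`, `f` and `G` are my own choices for illustration.

```python
import numpy as np

# The 2-point transform matrix from the equation above.
M = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)

# Orthonormal: the matrix times its transpose should be the identity.
identity_check = M @ M.T

# The equation reads f = M G, so multiplying by the transpose gives G = M^T f.
f = np.array([3.0, 5.0])
G = M.T @ f

print(identity_check)
print(G)   # (8/sqrt(2), -2/sqrt(2)), up to floating point roundoff
```

The low frequency coefficient $G_0$ is proportional to the sum of the two values, and the high frequency coefficient $G_1$ to their difference.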
The same idea can be applied to 2D images rather than 1D vectors, by applying the 1D transform to each row and column of the image.
The 2D basis images for N=2 are then the outer products of the 1D basis vectors. From lowest (0,0) to highest (1,1) spatial frequency these basis images are :
basis = (1/sqrt(2) * array([1, 1]), 1/sqrt(2) * array([1, -1]))
for i in [0,1]:
    for j in [0,1]:
        print("{}, {} :".format(i,j))
        print(outer(basis[i], basis[j]))
        print("")
0, 0 :
[[ 0.5  0.5]
 [ 0.5  0.5]]

0, 1 :
[[ 0.5 -0.5]
 [ 0.5 -0.5]]

1, 0 :
[[ 0.5  0.5]
 [-0.5 -0.5]]

1, 1 :
[[ 0.5 -0.5]
 [-0.5  0.5]]
For an image $ f = \left[ \begin{array}{cc} 5 & 8 \\ 4 & -1 \end{array} \right]$, what are the corresponding four $G$ coefficients?
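A minimal sketch of one way to work this out, assuming Python 3 and an explicit numpy import. It computes each coefficient by projecting the image onto the matching basis image, and then again as a matrix product; the names `G_proj` and `G_mat` are hypothetical.

```python
import numpy as np

# Rows of M are the two normalized 1D basis vectors.
M = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)
f = np.array([[5.0, 8.0],
              [4.0, -1.0]])

# Route 1: G[i, j] is the elementwise product of the image with
# basis image (i, j), summed over all pixels.
G_proj = np.array([[np.sum(f * np.outer(M[i], M[j])) for j in range(2)]
                   for i in range(2)])

# Route 2: the same four numbers as a matrix product.
G_mat = M @ f @ M.T

print(G_proj)
print(G_mat)   # both are [[8, 1], [5, -4]] up to roundoff
```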
JPEG image compression uses the same sort of transform but with 8 coefficients, not 2.
The matrix is defined by this formula :
$$ G_u = \sqrt{\frac{2}{N}} \, c_u \sum_{x=0}^{N-1} f_x \, \cos\left( \frac{\pi}{N} \left(x + \frac{1}{2}\right) u \right), \qquad c_0 = \frac{1}{\sqrt{2}}, \quad c_u = 1 \; \mathrm{for} \; u > 0 $$
with $N = 8$. See http://en.wikipedia.org/wiki/Discrete_cosine_transform and http://www.whydomath.org/node/wavlets/dct.html for the details. In the Wikipedia entry, we're using the JPEG transform (their DCT-II), which corresponds to the "Some authors further multiply the X0 term by 1/sqrt(2) and multiply the resulting matrix by an overall scale factor of sqrt(2/N)" variation, where their $(k, n)$ indices are my $(u, x)$, and their $(X_k, x_n)$ is my $(G_u, f_x)$.
# The 8 x 8 DCT matrix thus looks like this.
N = 8
dct = zeros((N, N))
for x in range(N):
    dct[0,x] = sqrt(2.0/N) / sqrt(2.0)
for u in range(1,N):
    for x in range(N):
        dct[u,x] = sqrt(2.0/N) * cos((pi/N) * u * (x + 0.5) )
np.set_printoptions(precision=3)
dct
array([[ 0.354,  0.354,  0.354,  0.354,  0.354,  0.354,  0.354,  0.354],
       [ 0.49 ,  0.416,  0.278,  0.098, -0.098, -0.278, -0.416, -0.49 ],
       [ 0.462,  0.191, -0.191, -0.462, -0.462, -0.191,  0.191,  0.462],
       [ 0.416, -0.098, -0.49 , -0.278,  0.278,  0.49 ,  0.098, -0.416],
       [ 0.354, -0.354, -0.354,  0.354,  0.354, -0.354, -0.354,  0.354],
       [ 0.278, -0.49 ,  0.098,  0.416, -0.416, -0.098,  0.49 , -0.278],
       [ 0.191, -0.462,  0.462, -0.191, -0.191,  0.462, -0.462,  0.191],
       [ 0.098, -0.278,  0.416, -0.49 ,  0.49 , -0.416,  0.278, -0.098]])
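The same matrix can be built for any N without explicit loops. The sketch below assumes Python 3 and an explicit numpy import; `dct_matrix` is a hypothetical helper name, not something defined elsewhere in this notebook. It also shows that N=2 gives back the small matrix from the start of these notes.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal n x n DCT matrix; row u is the u'th basis vector."""
    u = np.arange(n).reshape(-1, 1)   # frequency index, one per row
    x = np.arange(n).reshape(1, -1)   # sample index, one per column
    m = np.sqrt(2.0 / n) * np.cos(np.pi / n * u * (x + 0.5))
    m[0, :] /= np.sqrt(2.0)           # the extra 1/sqrt(2) on the u = 0 row
    return m

dct2 = dct_matrix(2)
dct8 = dct_matrix(8)

print(np.round(dct2, 3))                       # the 2 x 2 case
print(np.allclose(dct8 @ dct8.T, np.eye(8)))   # orthonormality check for N = 8
```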
The corresponding eight 1D basis vectors (the matrix rows) oscillate with successively higher spatial frequencies.
# Here's what they look like.
figure(figsize=(9,12))
for u in range(N):
    subplot(4, 2, u+1)
    ylim((-1, 1))
    title(str(u))
    plot(dct[u, :])
    plot(dct[u, :],'ro')
Like the N=2 case, the vectors are orthonormal. In other words, the dot product of any two different vectors is 0, and each has length 1. Here are a few illustrative products.
def rowdot(i,j):
    return dot(dct[i, :], dct[j, :])
rowdot(0,0), rowdot(3,3), rowdot(0,3), rowdot(1, 7), rowdot(1,5)
(0.99999999999999978, 0.99999999999999989, 5.5511151231257827e-17, 1.9428902930940239e-16, -2.4980018054066022e-16)
This also implies the inverse of this matrix is just its transpose.
dct_transpose = dct.transpose()
dct_transpose
array([[ 0.354,  0.49 ,  0.462,  0.416,  0.354,  0.278,  0.191,  0.098],
       [ 0.354,  0.416,  0.191, -0.098, -0.354, -0.49 , -0.462, -0.278],
       [ 0.354,  0.278, -0.191, -0.49 , -0.354,  0.098,  0.462,  0.416],
       [ 0.354,  0.098, -0.462, -0.278,  0.354,  0.416, -0.191, -0.49 ],
       [ 0.354, -0.098, -0.462,  0.278,  0.354, -0.416, -0.191,  0.49 ],
       [ 0.354, -0.278, -0.191,  0.49 , -0.354, -0.098,  0.462, -0.416],
       [ 0.354, -0.416,  0.191,  0.098, -0.354,  0.49 , -0.462,  0.278],
       [ 0.354, -0.49 ,  0.462, -0.416,  0.354, -0.278,  0.191, -0.098]])
# Is the dot product of dct and its transpose the identity?
maybe_identity = dot(dct, dct_transpose)
# Since there are many nearly-zero values like 3.2334e-17 in this numerical result,
# the output will look much nicer if we round them all off to (say) 6 places.
roundoff = vectorize(lambda m: round(m, 6))
roundoff(maybe_identity)
array([[ 1.,  0., -0.,  0.,  0.,  0., -0., -0.],
       [ 0.,  1.,  0., -0.,  0., -0.,  0.,  0.],
       [-0.,  0.,  1.,  0., -0.,  0.,  0.,  0.],
       [ 0., -0.,  0.,  1.,  0.,  0., -0.,  0.],
       [ 0.,  0., -0.,  0.,  1.,  0., -0., -0.],
       [ 0., -0.,  0.,  0.,  0.,  1.,  0., -0.],
       [-0.,  0.,  0., -0., -0.,  0.,  1.,  0.],
       [-0.,  0.,  0.,  0., -0., -0.,  0.,  1.]])
To make all this more concrete, let's apply the 2D DCT transform to part of a real image.
Here's one, taken randomly from the web.
# See http://matplotlib.org/users/image_tutorial.html for the image manipulation syntax.
# The image itself is a small piece from http://www.cordwainer-smith.com/virgil_finlay.htm.
import matplotlib.image as mpimg
img = mpimg.imread('stormplanet112.jpg')
p=plt.imshow(img, origin='lower')
# The image itself contains 3 dimensions: rows, columns, and colors
img.shape
(112, 112, 3)
All three of the R,G,B color values in the greyscale image are the same for each pixel.
Let's just look at values from one tiny 8 x 8 block (which is the block size used in JPEG compression) near his nose.
(The next images use a false color spectrum to display pixel intensity.)
tiny = img[40:48, 40:48, 0] # a tiny 8 x 8 block, in the color=0 (Red) channel
def show_image(img):
    plt.imshow(img)
    plt.colorbar()
show_image(tiny)
# And here are the numbers.
tiny
array([[ 24, 147, 212, 216, 209, 223, 156,  74],
       [ 47,  33, 179, 221, 201, 230, 164,  95],
       [ 20,  73, 201, 235, 215, 219, 175, 109],
       [140, 181, 215, 217, 197, 192, 142,  95],
       [204, 235, 206, 195, 204, 208, 192, 159],
       [208, 187, 217, 226, 222, 216, 209, 173],
       [203, 234, 225, 211, 204, 185, 232, 227],
       [155, 143, 150, 193, 204, 177, 178, 195]], dtype=uint8)
Now we define the 2D version of the N=8 DCT described above.
The trick is to apply the 1D DCT to every column, and then also apply it to every row, i.e.
$$ G = {DCT} \cdot f \cdot {DCT}^{T} $$
def doDCT(grid):
    return dot(dot(dct, grid), dct_transpose)
def undoDCT(grid):
    return dot(dot(dct_transpose, grid), dct)
# test : do DCT, then undo DCT; should get back the same image.
tiny_do_undo = undoDCT(doDCT(tiny))
show_image(tiny_do_undo) # Yup, looks the same.
# And the numbers are the same.
tiny_do_undo
array([[  24.,  147.,  212.,  216.,  209.,  223.,  156.,   74.],
       [  47.,   33.,  179.,  221.,  201.,  230.,  164.,   95.],
       [  20.,   73.,  201.,  235.,  215.,  219.,  175.,  109.],
       [ 140.,  181.,  215.,  217.,  197.,  192.,  142.,   95.],
       [ 204.,  235.,  206.,  195.,  204.,  208.,  192.,  159.],
       [ 208.,  187.,  217.,  226.,  222.,  216.,  209.,  173.],
       [ 203.,  234.,  225.,  211.,  204.,  185.,  232.,  227.],
       [ 155.,  143.,  150.,  193.,  204.,  177.,  178.,  195.]])
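To spell out why the matrix formula is the same thing as "the 1D DCT on every column, then on every row", here is a small self-contained check. It assumes Python 3 with numpy imported explicitly, rebuilds the DCT matrix and the 8 x 8 block so it can run on its own, and uses names of my own (`G_matrix`, `cols_done`, `G_steps`, `G_01`).

```python
import numpy as np

N = 8
u = np.arange(N).reshape(-1, 1)   # frequency index (rows)
x = np.arange(N).reshape(1, -1)   # sample index (columns)
dct = np.sqrt(2.0 / N) * np.cos(np.pi / N * u * (x + 0.5))
dct[0, :] /= np.sqrt(2.0)

# The same 8 x 8 block of pixel values shown above.
tiny = np.array([[ 24, 147, 212, 216, 209, 223, 156,  74],
                 [ 47,  33, 179, 221, 201, 230, 164,  95],
                 [ 20,  73, 201, 235, 215, 219, 175, 109],
                 [140, 181, 215, 217, 197, 192, 142,  95],
                 [204, 235, 206, 195, 204, 208, 192, 159],
                 [208, 187, 217, 226, 222, 216, 209, 173],
                 [203, 234, 225, 211, 204, 185, 232, 227],
                 [155, 143, 150, 193, 204, 177, 178, 195]], dtype=float)

# The matrix form.
G_matrix = dct @ tiny @ dct.T

# The step by step form: 1D DCT of each column, then of each row.
cols_done = np.stack([dct @ tiny[:, j] for j in range(N)], axis=1)
G_steps = np.stack([dct @ cols_done[i, :] for i in range(N)], axis=0)

# A single coefficient is also the projection onto one 2D basis image.
G_01 = np.sum(tiny * np.outer(dct[0], dct[1]))

print(np.allclose(G_matrix, G_steps))
print(np.isclose(G_matrix[0, 1], G_01))
```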
The DCT transform looks like this. Note that most of the intensity is at the top left, in the lowest frequencies.
tinyDCT = doDCT(tiny)
show_image(tinyDCT)
set_printoptions(linewidth=100) # output line width (default is 75)
round6 = vectorize(lambda m: '{:6.1f}'.format(m))
round6(tinyDCT)
array([['1429.2', ' -55.9', '-241.7', '  -9.0', ' -54.7', '  31.9', '   9.7', '   0.1'],
       ['-152.3', ' -58.3', '-201.2', '  -4.0', ' -64.9', '  24.0', '  35.8', ' -10.9'],
       [' -54.2', ' -74.9', ' -27.0', ' -15.7', '   8.3', '  -0.2', '   0.1', '   0.3'],
       ['  92.6', '  59.6', '  48.2', '  12.2', ' -30.1', ' -17.3', ' -16.2', '   0.1'],
       [' -19.7', '  64.2', '  21.0', '  10.9', ' -14.3', ' -44.2', ' -21.1', ' -15.0'],
       ['  35.3', '  41.9', '   0.2', ' -39.1', ' -32.3', ' -21.0', ' -23.1', '   0.2'],
       [' -19.8', ' -26.2', ' -47.4', '  -0.7', '   0.4', '   0.3', '   0.5', '  -0.3'],
       ['  27.9', ' -18.2', '  19.1', ' -20.5', ' -22.5', ' -20.0', ' -21.1', '   0.7']],
      dtype='|S8')
<img src="http://cs.marlboro.edu/courses/spring2014/information/images/Dctjpeg.png" width="292" style="float:left; padding:2em"/>
The grid positions in that last image correspond to spatial frequencies, with the lowest DC component at the top left, and the highest vertical and horizontal frequency at the bottom right.
These 2D basis functions can be visualized with this image from wikimedia commons.
The details of what I'm doing here don't really match the JPEG transformations: I haven't done the color space transforms, and I haven't handled the DC offsets as the JPEG spec does (which centers the values around 0 explicitly).
But the concept is visible in the last two pictures: after the DCT, most of the power is in fewer pixels, which are typically near the top left DC part of the grid.
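How concentrated the power is can be put into numbers. Because the transform is orthonormal, the sum of the squared coefficients equals the sum of the squared pixel values, so it makes sense to ask what share of that total sits in the low frequency corner. The sketch below assumes Python 3 with numpy imported explicitly; `energy`, `total` and `low4_share` are names I've made up for the illustration.

```python
import numpy as np

N = 8
u = np.arange(N).reshape(-1, 1)
x = np.arange(N).reshape(1, -1)
dct = np.sqrt(2.0 / N) * np.cos(np.pi / N * u * (x + 0.5))
dct[0, :] /= np.sqrt(2.0)

# The 8 x 8 block used throughout this notebook.
tiny = np.array([[ 24, 147, 212, 216, 209, 223, 156,  74],
                 [ 47,  33, 179, 221, 201, 230, 164,  95],
                 [ 20,  73, 201, 235, 215, 219, 175, 109],
                 [140, 181, 215, 217, 197, 192, 142,  95],
                 [204, 235, 206, 195, 204, 208, 192, 159],
                 [208, 187, 217, 226, 222, 216, 209, 173],
                 [203, 234, 225, 211, 204, 185, 232, 227],
                 [155, 143, 150, 193, 204, 177, 178, 195]], dtype=float)

G = dct @ tiny @ dct.T
energy = G ** 2
total = energy.sum()

# Share of the total in the top left k x k corner of the coefficient grid.
for k in (1, 2, 4, 8):
    print(k, energy[:k, :k].sum() / total)

low4_share = energy[:4, :4].sum() / total
```

Much of the weight in the single DC coefficient simply reflects that these pixel values are not centered around zero, which is one reason the JPEG spec shifts them first.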
So here's a simple lossy "low pass filter" of the data: let's chop some of the high frequency numbers. One (somewhat arbitrary) choice is to set the frequencies higher than the (1,7) to (7,1) line to zero.
This is a lossy transformation since we're throwing away information - it can't be undone. But since there are fewer numbers, it's a form of compression.
# First make a copy to work on.
tinyDCT_chopped = tinyDCT.copy()
# Then zero the entries beyond the x + u = 8 line.
for x in range(N):
    for u in range(N):
        if x + u > 8:
            tinyDCT_chopped[x,u] = 0.0
show_image(tinyDCT_chopped)
round6(tinyDCT_chopped)
# Notice all the zeros at the bottom right - those are the chopped high frequencies.
# We've essentially done a "low pass filter" on the spatial frequencies.
array([['1429.2', ' -55.9', '-241.7', '  -9.0', ' -54.7', '  31.9', '   9.7', '   0.1'],
       ['-152.3', ' -58.3', '-201.2', '  -4.0', ' -64.9', '  24.0', '  35.8', ' -10.9'],
       [' -54.2', ' -74.9', ' -27.0', ' -15.7', '   8.3', '  -0.2', '   0.1', '   0.0'],
       ['  92.6', '  59.6', '  48.2', '  12.2', ' -30.1', ' -17.3', '   0.0', '   0.0'],
       [' -19.7', '  64.2', '  21.0', '  10.9', ' -14.3', '   0.0', '   0.0', '   0.0'],
       ['  35.3', '  41.9', '   0.2', ' -39.1', '   0.0', '   0.0', '   0.0', '   0.0'],
       [' -19.8', ' -26.2', ' -47.4', '   0.0', '   0.0', '   0.0', '   0.0', '   0.0'],
       ['  27.9', ' -18.2', '   0.0', '   0.0', '   0.0', '   0.0', '   0.0', '   0.0']],
      dtype='|S8')
To see what this did to the original, we just transform it back.
tiny_chopped_float = undoDCT(tinyDCT_chopped)
# Also convert the floats back to uint8, which was the original format
tiny_chopped = vectorize(lambda x: uint8(x))(tiny_chopped_float)
show_image(tiny_chopped)
tiny_chopped
array([[ 39, 119, 222, 223, 202, 226, 154,  73],
       [ 25,  71, 171, 206, 204, 226, 167,  96],
       [ 21,  72, 188, 241, 225, 221, 168, 107],
       [146, 173, 222, 219, 186, 181, 149, 100],
       [210, 217, 213, 197, 203, 217, 195, 147],
       [193, 212, 214, 213, 222, 214, 205, 181],
       [212, 220, 218, 220, 210, 188, 216, 232],
       [152, 145, 153, 190, 199, 173, 191, 188]], dtype=uint8)
And we have something close to the original back again - even though 21 of the 64 transformed coefficients (about a third) were set to zero.
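"Close" can be made quantitative. The sketch below repeats the chop with a boolean mask and measures the reconstruction error before any conversion back to uint8. It assumes Python 3 with numpy imported explicitly, and the names `mask`, `recon`, `n_dropped`, `max_err` and `rms_err` are mine. Since the transform is orthonormal, the total squared pixel error equals the sum of the squares of the coefficients that were thrown away.

```python
import numpy as np

N = 8
u = np.arange(N).reshape(-1, 1)
x = np.arange(N).reshape(1, -1)
dct = np.sqrt(2.0 / N) * np.cos(np.pi / N * u * (x + 0.5))
dct[0, :] /= np.sqrt(2.0)

# The 8 x 8 block used throughout this notebook.
tiny = np.array([[ 24, 147, 212, 216, 209, 223, 156,  74],
                 [ 47,  33, 179, 221, 201, 230, 164,  95],
                 [ 20,  73, 201, 235, 215, 219, 175, 109],
                 [140, 181, 215, 217, 197, 192, 142,  95],
                 [204, 235, 206, 195, 204, 208, 192, 159],
                 [208, 187, 217, 226, 222, 216, 209, 173],
                 [203, 234, 225, 211, 204, 185, 232, 227],
                 [155, 143, 150, 193, 204, 177, 178, 195]], dtype=float)

G = dct @ tiny @ dct.T

# True where row + column > 8, i.e. the coefficients that get zeroed.
mask = np.add.outer(np.arange(N), np.arange(N)) > 8
G_chopped = np.where(mask, 0.0, G)
recon = dct.T @ G_chopped @ dct   # inverse transform, kept as floats

n_dropped = int(mask.sum())
max_err = np.abs(recon - tiny).max()
rms_err = np.sqrt(np.mean((recon - tiny) ** 2))

print(n_dropped, "of", N * N, "coefficients dropped")
print("largest pixel error:", max_err)
print("rms pixel error:", rms_err)
```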
The procedure here isn't what happens in JPEG compression, but does illustrate one of the central concepts - throwing away some of the higher spatial frequency information after a DCT transform.
In the real JPEG lossy compression algorithm, the steps are:
1. The color space is transformed from R,G,B to Y,Cb,Cr to take advantage of human visual prejudices.
2. The values are shifted so that they center around zero.
3. The values after the DCT are "quantized" (i.e. rounded off) by different amounts at different spots in the grid. (This is the lossy step, and how lossy it is depends on the JPEG quality.)
4. A zigzag pattern (keeping similar frequencies together) turns this into a 1D stream of 64 values.
5. The stream is then Huffman encoded, typically with a pre-chosen code (part of the JPEG standard).
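The middle steps (shift, DCT, quantize, zigzag) can be sketched on a single 8 x 8 block as follows. This is only an illustration under stated assumptions, not the real codec: it is Python 3 with explicit numpy imports, the color transform and the Huffman coding are left out, the function names are hypothetical, and the quantization table `quant` is a made-up one that simply grows with frequency rather than a table from the JPEG standard.

```python
import numpy as np

N = 8
u = np.arange(N).reshape(-1, 1)
x = np.arange(N).reshape(1, -1)
dct = np.sqrt(2.0 / N) * np.cos(np.pi / N * u * (x + 0.5))
dct[0, :] /= np.sqrt(2.0)

def zigzag_indices(n=N):
    """(row, col) pairs ordered along anti-diagonals, alternating direction."""
    cells = [(i, j) for i in range(n) for j in range(n)]
    return sorted(cells, key=lambda ij: (ij[0] + ij[1],
                                         ij[0] if (ij[0] + ij[1]) % 2 else -ij[0]))

# ASSUMPTION: an invented quantization table, coarser at higher frequencies.
# Real encoders derive their tables from the quality setting.
quant = 8.0 + 4.0 * np.add.outer(np.arange(N), np.arange(N))

def encode_block(block):
    shifted = block.astype(float) - 128.0            # center values around zero
    coeffs = dct @ shifted @ dct.T                   # 2D DCT
    q = np.round(coeffs / quant).astype(int)         # the lossy rounding step
    return [int(q[i, j]) for (i, j) in zigzag_indices()]   # 1D stream of 64 values

def decode_block(stream):
    q = np.zeros((N, N))
    for value, (i, j) in zip(stream, zigzag_indices()):
        q[i, j] = value
    return dct.T @ (q * quant) @ dct + 128.0         # undo quantize, DCT and shift

# The 8 x 8 block used throughout this notebook.
tiny = np.array([[ 24, 147, 212, 216, 209, 223, 156,  74],
                 [ 47,  33, 179, 221, 201, 230, 164,  95],
                 [ 20,  73, 201, 235, 215, 219, 175, 109],
                 [140, 181, 215, 217, 197, 192, 142,  95],
                 [204, 235, 206, 195, 204, 208, 192, 159],
                 [208, 187, 217, 226, 222, 216, 209, 173],
                 [203, 234, 225, 211, 204, 185, 232, 227],
                 [155, 143, 150, 193, 204, 177, 178, 195]], dtype=float)

stream = encode_block(tiny)
decoded = decode_block(stream)
print(stream)
print(np.abs(decoded - tiny).max())
```

The zigzag order puts the low frequencies first, so the small or zero high frequency values tend to end up together at the end of the stream, which is what makes the later entropy coding step effective.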
For all the JPEG details, see http://en.wikipedia.org/wiki/JPEG .