The DCT (Discrete Cosine Transform)

Jim Mahoney | Computer Science @ Marlboro College | cs.marlboro.edu | Feb 2014 | MIT License

An explanation and illustration of the math behind the Discrete Cosine Transform and the concepts used in lossy JPEG image compression - low pass filtering.

The basic linear algebra with N = 2

You can think of a vector - a list of numbers - as coefficients times basis vectors.

$$ f_0 \left[ \begin{array}{c} 1 \\ 0 \end{array} \right] + f_1 \left[ \begin{array}{c} 0 \\ 1 \end{array} \right] $$

Using a different basis, different coefficients can describe the same vector.

$$ G_0 \frac{1}{\sqrt{2}} \left[ \begin{array}{c} 1 \\ 1 \end{array} \right] + G_1 \frac{1}{\sqrt{2}} \left[ \begin{array}{c} 1 \\ -1 \end{array} \right] $$

(The sqrt(2)'s give the basis vectors length 1, i.e. "normalizes" them.)

This transormation f to G is a DCT (Discrete Cosine Transform). For a vector with 2 components, this perhaps isn't all that exciting, but does still transform the original $(f_0, f_1)$ into low and high frequency components $(G_0, G_1)$.

quick quiz 1

If $f_0 = 3$ and $f_1 = 5$, what are the G's?

the matrix math

This transform can be written as a matrix multiplication.

$$ f_0 \left[ \begin{array}{c} 1 \\ 0 \end{array} \right] + f_1 \left[ \begin{array}{c} 0 \\ 1 \end{array} \right] = \left[ \begin{array}{c} f_0 \\ f_1 \end{array} \right] = G_0 \frac{1}{\sqrt{2}} \left[ \begin{array}{c} 1 \\ 1 \end{array} \right] + G_1 \frac{1}{\sqrt{2}} \left[ \begin{array}{c} 1 \\ -1 \end{array} \right] = \frac{1}{\sqrt{2}} \left[ \begin{array}{cc} 1 & 1 \\ 1 & -1 \end{array} \right] \left[ \begin{array}{c} G_0 \\ G_1 \end{array} \right] $$

Moreover, this orthnormal matrix has the interesting and useful property that its transpose is its inverse. That makes the equation easy to invert.

quick quiz 2

Show that the matrix times its transpose is the identity, and use that to find the G's.

two dimensions

The same idea can be applied to 2D images rather than 1D vectors, by applying the 1D transform to each row and column of the image.

The 2D basis images for N=2 are then the outer products of the 1D basis vectors. From lowest (0,0) to highest (1,1) spatial frequency these basis images are :

quick quiz

For an image $ f = \left[ \begin{array}{cc} 5 & 8 \\ 4 & -1 \end{array} \right]$, what are the correspoding four $G$ coefficients?

N = 8

JPEG image compression uses the same sort of transform but with 8 coefficients, not 2.

The matrix is defined by this formula :

$$ G_u = \sqrt{\frac{2}{N}} \frac{1}{\sqrt{2}} f_0 + \sqrt{\frac{2}{N}} \sum_{x=1}^7 f_x \, cos\left( \frac{\pi}{8} (u + \frac{1}{2}) x \right) $$

See http://en.wikipedia.org/wiki/Discrete_cosine_transform and http://www.whydomath.org/node/wavlets/dct.html for the details. In the wikipedia entry, we're using the JPEG transform which corresponds to the "Some authors further multiply the X0 term by 1/sqrt(2) and multiply the resulting matrix by an overall scale factor of sqrt(2/N)" variation, where their $(k, n)$ indices are my $(u, x)$, and their $(X_k, x_n)$ is my $(G_u, f_x)$.

The corresponding eight 1D basis vectors (the matrix rows) oscillate with successively higher spatial frequencies.

Like the N=2 case, the vectors are orthnormal. In other words, their dot products are 0, and each has length 1. Here are a few illustrative products.

This also implies the inverse of this matrix is just its transpose.

playing with a real image

To make all this more concrete, let's apply the 2D DCT transform to part of a real image.

Here's one, takenly randomly from the web.

All three of the R,G,B color values in the greyscale image are the same for each pixel.

Let's just look at values from one tiny 8 x 8 block (which is what's used JPEG compression) near his nose.

(The next images use a false color spectrum to display pixel intensity.)

Now we define the 2D version of the N=8 DCT described above.

The trick is to apply the 1D DCT to every column, and then also apply it to every row, i.e.

$$ G = {DCT} \cdot f \cdot {DCT}^{T} $$

The DCT transform looks like this. Note that most of the intensity is at the top left, in the lowest frequencies.

<img src="http://cs.marlboro.edu/courses/spring2014/information/images/Dctjpeg.png"/ width=292 style="float:left; padding:2em">

The grid positions in that last image correspond to spatial frequencies, with the lowest DC component at the top left, and the highest vertical and horizontal frequency at the bottom right.

These 2D basis functions can be visualized with this image from wikimedia commons.

The details of what I'm doing here don't really match the JPEG transformations: I haven't done the color space transforms, and I haven't handled the DC offsets as the JPEG spec does (which centers the values around 0 explicitly.)

But the concept is visible in the last two pictures: after the DCT, most of the power is in fewer pixels, which are typically near the top left DC part of the grid.

So here's a simple lossy "low pass filter" of the data : let's chop some of the high frequency numbers. One (somewhat arbitrary) choice to to set the frequencies higher than the (1,7) to (7,1) line, to zero.

This is a lossy transormation since we're throwing away information - it can't be undone. But since there are fewer numbers, it's a form of compression.

To see what this did to the original, we just transform it back.

And we have something close to the original back again - even though close to half of the transformed image was set to zero.

conclusions

The procedue here isn't what happens in JPEG compression, but does illustrate one of the central concepts - throwing away some of higher spatial frequency information after a DCT transform.

In the real JPEG lossy compression algorithm, the steps are

For all the JPEG details, see http://en.wikipedia.org/wiki/JPEG .