## Mathematics behind JPEG

Many of you should have this experience: when you save a picture with many words in Paint(小畫家) as .jpg file, some ugly artifacts appear. Then you may think why is such a low quality image format so common? At the same time, when you take a photo through a digital camera, the photo looks cool and does not contain many artifacts, although it is .jpg too. Why is there such a big difference? It is natural to think that it is because Paint is rubbish; more or less this is true, but the main reason is in fact the mathematics behind JPEG (i.e. jpg). (BTW this post is not quite mathematical.)

We use JPEG because in general the size of .jpg file is small. Indeed, the original purpose of JPEG is to store a photographic image (i.e. natural scenery) efficiently. What is the main difference between a natural scenery and a general computer image? If we consider a computer image as a function $f:[a,b]\times[c,d]\rightarrow\mathbb{R}$, (you may consider the range of the function to be the set of colour/light intensity represented by positive numbers), then the main difference between these two kinds of images is that a photographic image locally does not vary too much (that means locally the colour will not change too much).

If you understand the intuitive meaning of Fourier series, then you should be able to see that if a function is close to constant, then the Fourier coefficients of that function will decay very fast, because the Fourier coefficients indicate how a function oscillates. Using this nice property, the sum of the first few terms of the Fourier series of that function will then be very close to the function already.

Combining these two facts, we can already store a photographic image slightly more efficiently: first we calculate the Fourier coefficients of the image, then we throw away the small, high frequency Fourier coefficients, and we just store the first few big Fourier coefficients. Although some information is lost, we can still recover the image with a few Fourier coefficients, and it saves much more memory because we are just storing a few Fourier coefficients (compared to saving each pixel of the original image)! This is in fact the main idea of the compression process of JPEG. (Now we can understand why jpg won’t store the image with words well, because there is a large jump (big change of colour) between the background and letters, which causes big “oscillation” and hence big coefficients with high frequency. If we throw away the coefficients of high frequencies, surely some information is lost and it causes the artifacts.)

In fact, a computer image is discrete, so we should regard it as a function $f:\mathbb{Z}_N\times\mathbb{Z}_M\rightarrow\mathbb{R}$ instead of $f:[a,b]\times[c,d]\rightarrow\mathbb{R}$, and we should use discrete transform instead of continuous one. Also, because of technical reasons, practically one uses discrete cosine transform instead of Fourier transform, but the intuitive meaning is just the same.

More precisely, the process is briefly as follows:

1. First we divide we the image into 8×8 sub-matrices.
2. On each 8×8 matrix block, we apply discrete cosine transform. (example)
3. We apply a quantization matrix $Q$ to each 8×8 block (i.e. replace each $a_{ij}$ in the original block by $a_{ij}/q_{ij}$, so that the values in the block get closer to 0).
4. Now the block has many zeroes on the lower right hand corner. To store the block efficiently we “reorder” the matrix as follows:

If we save the matrix in this way, then all the zeros will be packed in the tail, and clearly you can think of a way to store the zeros wisely (e.g. instead of storing the whole string $(a,b,c,0,0,\ldots,0)$, you can store it as $(a,b,c,0,n)$, where $n$ is the number of zeros.)

To recover the image we simply reverse the above steps.

I was quite impressed by this compression method when I learnt it, as it is smart and simple enough, and through this I could feel that engineers think very differently from mathematicians.

It turns out that many media files such as avi, mp3, etc. are also based on discrete cosine transform, but in much more complicated ways, and I won’t mention them here.