Reverse Engineering: [Encoding] Understanding a Quantization Matrix

Ok Folks, here is an explanation of what a matrix really does.

With Xvid you can not only use different built-in Quantization Matrices, like H.263 and Mpeg, but you can also use custom matrices either made by yourself or someone else.

Since few people actually understand what a matrix does, I will try to give an explanation here that is possible to understand even if you're not a math expert.

You will all have heard about macroblocks by now. These are the 16x16 blocks of which a single MPEG-4 frame is composed.
The brightness information of each macroblock is made up of four 8x8 blocks that are grouped together (the color information gets 8x8 blocks of its own). These 8x8 blocks form the basis of MPEG-4 compression.
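
To make the grouping concrete, here is a minimal sketch of cutting one 16x16 macroblock into its four 8x8 blocks. The function name `split_macroblock` and the test data are made up for illustration; a real encoder does this on raw frame buffers, not on Python lists.

```python
def split_macroblock(mb):
    """Split a 16x16 list-of-lists into four 8x8 blocks.

    Order of the result: top-left, top-right, bottom-left, bottom-right.
    """
    assert len(mb) == 16 and all(len(row) == 16 for row in mb)
    blocks = []
    for top in (0, 8):
        for left in (0, 8):
            blocks.append([row[left:left + 8] for row in mb[top:top + 8]])
    return blocks


# Hypothetical macroblock: every sample is simply row * 16 + column.
macroblock = [[r * 16 + c for c in range(16)] for r in range(16)]
blocks = split_macroblock(macroblock)

print("number of blocks:", len(blocks))
print("size of each block:", len(blocks[0]), "x", len(blocks[0][0]))
```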

Instead of being composed of proper pixels, like a normal bitmap or a film still, a block is more like a representation of a complex formula that tries to mimic the content of the original picture as best as it can.

The human eye is much more sensitive to changes in brightness, or luminance, than it is to changes in color.
So MPEG-4 uses a type of color space in its file structure that assigns fewer bits to color changes than it does to brightness changes.
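
A rough sketch of where the saving comes from, assuming the usual 4:2:0 layout in which the two color planes are stored at half the width and half the height of the brightness plane. The helper names and the sample values are invented for the example, and averaging 2x2 groups is just one simple way to shrink a plane.

```python
def subsample_2x2(plane):
    """Halve a plane in both directions by averaging each 2x2 group of samples."""
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def sample_count(plane):
    """Number of stored values in a plane."""
    return len(plane) * len(plane[0])


# Hypothetical 16x16 area: one brightness plane and two color planes.
luma = [[128] * 16 for _ in range(16)]
chroma_u = [[100] * 16 for _ in range(16)]
chroma_v = [[150] * 16 for _ in range(16)]

small_u = subsample_2x2(chroma_u)
small_v = subsample_2x2(chroma_v)

full = 3 * sample_count(luma)
reduced = sample_count(luma) + sample_count(small_u) + sample_count(small_v)
print("samples at full color resolution:", full)
print("samples with subsampled color:   ", reduced)
```

Brightness keeps all of its samples; only the color planes shrink, so under this assumption the total drops to half.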

An 8x8 block is not made up of pixels. It is made up of a single value which represents the average brightness (or color), and all the remaining values are mathematical representations of the amount of variation from this average value across the whole block.
To put it in other words, you have one basic mean, or average value, and all the variation, or detail of the picture, is represented by the end result of a certain complex formula.
The other places in the 8x8 block, which we invariably but inaccurately call pixels, represent different types of detail, and especially the variation of this detail.

Let's take a look at one block:

X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X
X X X X X X X X

The first place in the upper left corner represents the average, or mean value. So if the whole block is dark brown or light red on average, this place says so.
Going from this place to the right or down, we get representations of the amount of variation from this value.
Now this is hard to grasp, so pay attention.
The further you go from left to right, or from top to bottom, the finer the detail that a place represents.
If you take the original picture of 8x8 pixels, the detail in it is transformed into values according to the frequency at which that detail occurs.
Finer detail is represented by higher frequencies.
So, the further you go to the right (or down) in the block, the higher the frequency (or the finer the detail).
Let's take three examples:
An 8x8 picture with one big iron bar in it has little detail, so it has a very low frequency.
An 8x8 picture with four broomsticks in it has some detail, so it has a medium frequency.
An 8x8 picture with rain, a cornfield or hair in it, has lots of detail in it and so it has a very high frequency.
As you can guess by now, the values in a block represent both the horizontal and vertical frequency in the original picture.
The formula that does the translation or transformation from detail to frequency is called the Discrete Cosine Transform, or DCT.
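
For those who like to see it run, here is a minimal and deliberately slow sketch of an 8x8 DCT in plain Python. It uses the textbook type-II formula with orthonormal scaling, which is an assumption on my part; real encoders use fast integer approximations. The test block (left half dark, right half bright) is made up.

```python
import math


def dct_8x8(block):
    """Forward 2-D DCT (type II, orthonormal scaling) of an 8x8 block."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for v in range(8):          # vertical frequency = row of the result
        for u in range(8):      # horizontal frequency = column of the result
            total = 0.0
            for y in range(8):
                for x in range(8):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[v][u] = 0.25 * c(u) * c(v) * total
    return out


# Hypothetical block: left half dark (50), right half bright (200).
block = [[50] * 4 + [200] * 4 for _ in range(8)]
coeffs = dct_8x8(block)

mean = sum(sum(row) for row in block) / 64
print("upper left value:", round(coeffs[0][0], 3))
print("8 x average     :", 8 * mean)
print("first row       :", [round(value, 1) for value in coeffs[0]])
```

With this scaling the upper left value is simply 8 times the average of the block. Because the test block only changes from left to right, every row below the first comes out as zero (up to rounding noise): there is no vertical detail to describe.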

So when looking at a block in the way we have just come to understand it, we can see the following:

Code:

Average brightness and color of the whole block
                        I
                        I   Low frequency (bigger)detail
                        I   I
                        I   I     Medium frequency (normal)detail
                        I   I      I
                        I   I      I         Higher Frequency (fine)
                        I   I      I         detail
                        I   I      I         I
                        X--X--X--X--X--X--X--X
                        X--\--X--X--X--X--X--X
Low Frequency detail--- X--X--\--X--X--X--X--X
                        X--X--X--\--X--X--X--X
Medium Frequency detail X--X--X--X--\--X--X--X
                        X--X--X--X--X--\--X--X
                        X--X--X--X--X--X--\--X
High Frequency detail---X--X--X--X--X--X--X--A
                                             I
                                             I
                           Mathematical representation of the finest
                           detail, both horizontally and vertically

EDIT: Finally used the code tag...
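
The layout in the diagram can be checked with a small experiment. This is only a sketch: the three test patterns are crude stand-ins for the bar, the broomsticks and the hair, they only vary from left to right, and the DCT is the same slow textbook version (type II, orthonormal scaling) assumed earlier. `dominant_ac` is a made-up helper that finds the strongest value apart from the average.

```python
import math


def dct_8x8(block):
    """Forward 2-D DCT (type II, orthonormal scaling) of an 8x8 block."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for v in range(8):
        for u in range(8):
            total = 0.0
            for y in range(8):
                for x in range(8):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[v][u] = 0.25 * c(u) * c(v) * total
    return out


def dominant_ac(coeffs):
    """Return (row, column) of the largest value, ignoring the average at (0, 0)."""
    best = (0, 1)
    for v in range(8):
        for u in range(8):
            if (v, u) == (0, 0):
                continue
            if abs(coeffs[v][u]) > abs(coeffs[best[0]][best[1]]):
                best = (v, u)
    return best


def columns(pattern):
    """Build an 8x8 block in which every row repeats the same 8 values."""
    return [list(pattern) for _ in range(8)]


# Hypothetical stand-ins for the three examples.
bar = columns([50] * 4 + [200] * 4)          # one broad edge
sticks = columns([200, 200, 50, 50] * 2)     # stripes two pixels wide
fine = columns([200, 50] * 4)                # stripes one pixel wide

results = {}
for name, blk in (("bar", bar), ("sticks", sticks), ("fine", fine)):
    results[name] = dominant_ac(dct_8x8(blk))
    print(name, "-> strongest detail at row", results[name][0],
          "column", results[name][1])
```

The finer the stripes, the further to the right the strongest value lands. Turn the patterns on their side and the same thing happens going down instead.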

Ok, now we understand how a block is built.
So how do Quantization matrices come into play?

A Quantization matrix (QM from now on) looks something like this:

08 16 19 22 26 27 29 34
16 16 22 24 27 29 34 37
19 22 26 27 29 34 34 38
22 22 26 27 29 34 37 40
22 26 27 29 32 35 40 48
26 27 29 32 35 40 48 58
26 27 29 34 38 46 56 69
27 29 35 38 46 56 69 83

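Now there is actually a rather complex process behind this, but I'll try to describe it in a simple manner:
Every value in the QM acts as a threshold for the value in the same place coming out of the DCT detail-to-frequency translation. (Strictly speaking the encoder divides each DCT value by its QM value, scaled further by the quantizer, and rounds the result; whatever ends up as zero is gone. Thinking of it as a threshold is close enough for our purposes.)
All detail below the threshold will not be regarded as detail and will NOT be encoded. It will simply be discarded.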
Now you can understand why Xvid is a so-called lossy codec.
It throws detail away. The amount of detail thrown away is determined by the QM.
As you can see, the farther to the right and the farther to the bottom you go, the higher the threshold gets.
The finer the detail, the more it has to stand out from the rest of the picture to be encoded and not thrown away.
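
Here is a minimal sketch of that threshold idea, using the example QM from above. It is a simplification under stated assumptions: each value is divided by its QM entry and the fraction is cut off, so anything smaller than the entry becomes zero. A real encoder also folds in the frame quantizer and has its own rounding rules. The input block (an average of 1000 and a strength of 30 for every kind of detail) is invented.

```python
QM = [
    [8, 16, 19, 22, 26, 27, 29, 34],
    [16, 16, 22, 24, 27, 29, 34, 37],
    [19, 22, 26, 27, 29, 34, 34, 38],
    [22, 22, 26, 27, 29, 34, 37, 40],
    [22, 26, 27, 29, 32, 35, 40, 48],
    [26, 27, 29, 32, 35, 40, 48, 58],
    [26, 27, 29, 34, 38, 46, 56, 69],
    [27, 29, 35, 38, 46, 56, 69, 83],
]


def quantize(coeffs, qm):
    """Divide each value by its matrix entry and cut off the fraction."""
    return [[int(coeffs[v][u] / qm[v][u]) for u in range(8)] for v in range(8)]


def dequantize(levels, qm):
    """What the decoder gets back: the stored level times the matrix entry."""
    return [[levels[v][u] * qm[v][u] for u in range(8)] for v in range(8)]


# Hypothetical DCT output: a strong average plus an equal, modest amount (30)
# of every kind of detail.
coeffs = [[30.0] * 8 for _ in range(8)]
coeffs[0][0] = 1000.0

levels = quantize(coeffs, QM)
restored = dequantize(levels, QM)
kept = sum(1 for row in levels for level in row if level != 0)

for row in levels:
    print(row)
print("values that survive:", kept, "of 64")
```

The zeros pile up in the lower right corner: the same amount of detail survives at low frequencies and is thrown away at high ones.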

So say you have a picture of a girl with long blond hair standing in front of a very light gray wall and behind two prison bars. (don't we all love to see that now)

Of course it's hard to portray that in an 8x8 picture, but bear with me for a minute.

The blond girl and the wall will have little contrast between them, so the difference between the average values for brightness and color and the extreme values will not be that big.
The values will be lower and will not go over the threshold that often. This means there is less difference that has to be encoded, and the picture will have high compressibility.
If the wall had been black, the contrast would be much higher and the difference between the average value (sort-of-grey) and the extremes (blond and black) would be much bigger. So many more values would have gone over the threshold and would have to be encoded. So higher contrast scenes result in lower compressibility, which we already knew of course.
Now the two black prison bars in front of the girl are detail, albeit not very fine. So they get a low frequency, and they will be encoded if their difference from the average goes over the threshold.
So assuming they're not blond, they will be encoded.

Now the girl has hair, and the texture of hair is of course very fine. So the hair has a lot of detail and gets a very high frequency.
As you can see in our QM, the threshold for high frequencies is much higher than for low frequencies. So the fine detail (the hair) has to differ much more from the average values to be encoded.
So unless the difference is very high, which in this case we assume it isn't, the details of the hair won't be encoded.
The matter would be different if she had something like coloured streaks in her hair, which would increase the contrast.
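
The girl behind bars can be played through with numbers. This is only a sketch: the strengths (40 and 200) and the matrix positions chosen for the bars and the hair are made up, and `survives` uses the same simplified divide-and-cut-off rule as before rather than what a real encoder does.

```python
QM = [
    [8, 16, 19, 22, 26, 27, 29, 34],
    [16, 16, 22, 24, 27, 29, 34, 37],
    [19, 22, 26, 27, 29, 34, 34, 38],
    [22, 22, 26, 27, 29, 34, 37, 40],
    [22, 26, 27, 29, 32, 35, 40, 48],
    [26, 27, 29, 32, 35, 40, 48, 58],
    [26, 27, 29, 34, 38, 46, 56, 69],
    [27, 29, 35, 38, 46, 56, 69, 83],
]


def survives(strength, row, col, qm=QM):
    """True when a value of this size is still non-zero after the division."""
    return int(abs(strength) / qm[row][col]) != 0


# Hypothetical strengths and positions for the three kinds of detail.
bars = survives(40, 0, 1)             # coarse detail, modest contrast
plain_hair = survives(40, 7, 7)       # finest detail, same modest contrast
streaked_hair = survives(200, 7, 7)   # finest detail, strong contrast

print("prison bars encoded:  ", bars)
print("plain hair encoded:   ", plain_hair)
print("streaked hair encoded:", streaked_hair)
```

Same contrast, different fate: the bars make it and the plain hair does not, purely because of where they sit in the matrix.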

So the end result is that the finer the detail, the bigger its contrast with the average values of the picture needs to be for it to be encoded.
This is of course on a per-block basis and not for the picture as a whole, which generally consists of more than one 8x8 block. Let's hope you can see through the simplification.

Now you can understand why some matrices soften the picture, like H.263, while others like MPEG produce a sharper picture.
The values in one QM simply give finer detail a lower threshold than those in another, and that QM is therefore more likely to encode finer detail, at the price of compressibility.
You can also see that the QM that I took as an example isn't a very high compressibility matrix; the values are rather low in general.

Some other points:
-End credits usually have very little detail, so you could design a QM especially for this, with VERY high compressibility.

-A heavy compression matrix simply ups all the finer detail values so less of that detail gets encoded.

-You could design specific matrices for specific types of content.
You could design matrices especially for sci-fi space adventures, Anime and animation movies, and jungle scenes.

-If you know the exact frequency of interlacing artifacts, you could up their threshold to filter them out!

-Same might work for other types of noise and artifacts.

-I don't know if the one QM is meant for both luminance and color information. I assume it is, but separate QMs for luminance and color would allow more tweaking. (I don't know if that would be MPEG-4 compliant though, or if it's possible at all.)
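
The tweaks from the list above can be sketched in a few lines. Everything here is hypothetical: the function names, the factor of 2, and the guess that interlacing artifacts sit mostly in the bottom row (the highest vertical frequency). The cap at 255 assumes matrix entries are stored as single bytes.

```python
QM = [
    [8, 16, 19, 22, 26, 27, 29, 34],
    [16, 16, 22, 24, 27, 29, 34, 37],
    [19, 22, 26, 27, 29, 34, 34, 38],
    [22, 22, 26, 27, 29, 34, 37, 40],
    [22, 26, 27, 29, 32, 35, 40, 48],
    [26, 27, 29, 32, 35, 40, 48, 58],
    [26, 27, 29, 34, 38, 46, 56, 69],
    [27, 29, 35, 38, 46, 56, 69, 83],
]


def boost_fine_detail(qm, factor, start=8):
    """Raise the entries whose row + column index is at least `start`.

    Entries closer to the upper left corner are left alone; results are
    capped at 255 (assumed byte-sized entries).
    """
    return [
        [min(255, round(q * factor)) if r + c >= start else q
         for c, q in enumerate(row)]
        for r, row in enumerate(qm)
    ]


def boost_rows(qm, rows, factor):
    """Raise every entry in the given rows (selected vertical frequencies)."""
    return [
        [min(255, round(q * factor)) if r in rows else q for q in row]
        for r, row in enumerate(qm)
    ]


# A "heavier" matrix: doubled thresholds for the finer detail only.
heavy = boost_fine_detail(QM, 2)

# A guess at an anti-interlacing matrix: doubled thresholds in the bottom row.
deinterlace_guess = boost_rows(QM, {7}, 2)

for row in heavy:
    print(" ".join(f"{q:3d}" for q in row))
```

Whether a matrix like this actually looks good is something only test encodes can tell.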

--------------------------------------------------------------------
Well that's it folks!
Hope I didn't make too many errors, and please correct me if and where I'm wrong.

http://forum.doom9.org/showthread.php?t=54147

11/10/13