Developers Club geek daily blog

1 year, 9 months ago
Unpacking of the data squeezed by algorithm of Deflate with the fixed Huffman codes on the example of the PNG format

Within the next laboratory work we with colleagues faced a problem of analysis of a hexadecimal dump of the PNG file. According to the RFC 2083 standard the PNG format stores the pixel data squeezed by algorithm of Deflate. Therefore at analysis of a dump we needed to unpack compressed data algorithm of Inflate.

Analysis is carried out on the following image of the size of 4х4 pixels:

Unpacking of the data squeezed by algorithm of Deflate with the fixed Huffman codes on the example of the PNG format


1951 data streams squeezed by means of RFC standard Deflate are stored in PNG in the zlib format of the RFC 1950 standard, we also will use this standard at analysis. The heading zlib defines the Deflate settings. It has the following structure:
1 byte
1 byte
4 bytes
CMF
FLG
DICTID
4 bits
4 bits
2 bits
1 bits
5 bits
CINFO
CM
FLEVEL
FDICT
FCHECK

In more detail about fields:

  • CMF (Compression method and flags). This byte is separated into 2 cells on 4 bits in everyone: CM (Compression method), CINFO (Compression info).
  • CM (Compression method). This field defines a method of the compression used in the file. = 8 designates CM that Deflate with a size of window up to 32 kilobyte is used.
  • CINFO (Compression info). When CM = 8, CINFO is the logarithm with the basis 2 setting value of the size of a window minus 8 (CINFO = 7, sets the window size, to equal 32 kilobytes).
  • FLG (Flags). This byte is separated into parts as follows: FCHECK, FDICT, FLEVEL.
  • FCHECK (check bits for CMF and FLG). Let's consider value of expression sum = (CMF * 256 + FLG) as a sixteen-bit integral number without sign. This field supplements FLG so that sum value was multiple 31.
  • FDICT (Preset dictionary). If this flag is set, then the descriptor of the DICT dictionary follows FLG byte at once. Raspakovshchik can use value of the given field for determination of the dictionary which was used at compression.
  • FLEVEL (Compression level). This field is used by special shrinking algorithms. Deflate (CM = 8): 0 — the fastest compression, 1-fast compression, 2 — compression by default, 3 — the maximum compression (most slowly). Value of this field is not considered when unpacking. It is used to specify whether it makes sense, additional compression.

After DICT (if FDICT is set) there is a flow of compressed data which length for PNG depends on the CHUNK_LENGTH field of structure of IDAT. At the end of the portion of zlib there is a checksum of ADLER-32 calculated according to uncompressed data on algorithm of Adler-32.

In our case the heading zlib has the following appearance:
78 DA
0111
1000
11
0
11010
Window size = 32K
Compression method = Deflate
Compression level = fastest
Dictionary is used = false
Check bits

From heading we find out that the dictionary is not used (FDICT = 0).

Compressed data for our picture:

63 F8 3F 93 E1 3F 03 C3 CC FF 20 1A C8 00 22 24 0E 58 12 85 33 D3 F8 3F 03 32 07 44 03 00 AA 05 23 77

Bits are read out further continuously on bytes. In byte we go from the last bit to the first (for example, in a data stream consistently there are 2 bytes: 63 F8 (0b01100011 0b11111000), then have to receive the following bit sequence: 1100011000011111).

BFINAL flag specifying whether this portion of data of the last is goes in the first bit in the read-out sequence. The following 2 bits of BTYPE specify compression type: 00 — without compression, 01 — compression with the fixed Huffman codes, 10 — compression with dynamic Huffman codes, 11 — value is reserved.

The table for the fixed Huffman codes:
The unpacked value
Number of bits
Codes
base
0 — 143
8
From 00110000 to 10111111
00110000
144 — 255
9
From 110010000 to 111111111
110010000
256 — 279
7
From 0000000 to 0010111
0000000
280 — 287
8
From 11000000 to 11000111
11000000

Table of shifts:
Codes
Additional bits
Distance
Codes
Additional bits
Distance
Codes
Additional bits
Distance
0
0
1
10
4
33 — 48
20
9
1025 — 1536
1
0
2
11
4
49 — 64
21
9
1537 2048
2
0
3
12
5
65 — 96
22
10
2049 — 3072
3
0
4
13
5
97 — 128
23
10
3073 — 4096
4
1
5, 6
14
6
129 — 192
24
11
4097 — 6144
5
1
7, 8
15
6
193 — 256
25
11
6145 — 8192
6
2
9 — 12
16
7
257 — 384
26
12
8193 — 12288
7
2
13 — 16
17
7
385 — 512
27
12
12289 — 16384
8
3
17 — 24
18
8
513 — 768
28
13
16385 — 24576
9
3
25 — 32
19
8
769 — 1024
29
13
24577 — 32768

Table of lengths:
Codes
Additional. Bits
Length
Codes
Additional. Bits
Length
Codes
Additional. Bits
Length
257
0
3
267
1
15, 16
277
4
67 — 82
258
0
4
268
1
17, 18
278
4
83 — 98
259
0
5
269
2
19 — 22
279
4
99 — 114
260
0
6
270
2
23 — 26
280
4
115 — 130
261
0
7
271
2
27 — 30
281
5
131 — 162
262
0
8
272
2
31 — 34
282
5
163 — 194
263
0
9
273
3
35 — 42
283
5
195 — 226
264
0
10
274
3
43 — 50
284
5
227 — 257
265
1
11, 12
275
3
51 — 58
285
0
258
266
1
13, 14
276
3
59 — 66
 
 
 

When unpacking data can be submitted by two types: fixed value and length of reverse bias.

Data which we read out have to be transferred to fixed values. The transfer formula is given below:

lit value = data — base + left bound, where
lit value — fixed value,
base — a column from table 1,
left bound — the left number from a column of the unpacked values from table 1.

At data reading, we can unambiguously define to what interval of fixed values from table 1 it corresponds thanks to the fact that prefixes of each interval are different.

For example, we will select 2 bytes of F8 and 3F from our sequence. Sequence of bits which turned out 0001111111111100. Let current position of reading 4, we read out further 7 bits (minimum the number of bits), we receive a prefix 1111111 which corresponds to an interval 2. From this it turns out that length of the read-out code is equal to 9.

0b111111111 = 0d511

Using the formula which is stated above, we will receive that number which was coded equally 255 = 511 — 400 (0b110010000) + 144.

Let's consider a case when at us instead of fixed value the data stream contains information on length and reverse bias, i.e. the read-out value gets to the 3rd interval. In our option it will be a subsequence:

20 1A C8 (00000100 01011000 00010011).

We read out the first 7 bits (0b0000010) which get to an interval 3. On a formula we transfer to number. It will be number 257 = 1 — 0 + 256. Further we use table 3. The code 257 means that the number of additional bits which needs to be considered equally in 0. And length on the third column is equal to 3. If additional bits are, then they are read out upside-down. These bits define number which we add to length.

Further we read out 5 bits (0b00101 = 0d5) which define reverse bias. In our case this number 5. From table 2 clear that we should consider 1 additional bit. We read out it upside-down. It turned out 1 (0d1). We add this number to the minimum distance from a column 3. And it follows from this that our reverse bias is equal 7 + 1 = 8.

What does it mean? For example, we considered 9 values: 0 255 153 0 255 0 0 153 255. The return distance shows, on how many values we should be displaced in this flow back, i.e. we will be on the second value — 255. Length which at us is equal 3 shows that we should take 3 values in from a data stream, since value on which we stand, i.e. 255 153 0. If length is more, than shift, then we return to an initial position and we read out again in an initial order until we consider the amount of values equal to length. Let's say at us length 7 turned out, and reverse bias is equal to 2. We are displaced on 2 values, i.e. our position on penultimate number — 153. The sequence which turned out is equal to 153 255 153 255 153 255 153. Eventually, when the raspakovshchik reads out the next fixed value equal to zero, it completes the work.

Complete analysis of a dump:
01100011 FINAL CHUNK, FIXED HUFFMAN
11111000 48 — 48 + 0 = 0 — FILTER: NONE
00111111 511 — 400 + 144 = 255
10010011 409 — 400 + 144 = 153
11100001 48 — 48 + 0 = 0
00111111 511 — 400 + 144 = 255
00000011 48 — 48 + 0 = 0
11000011 48 — 48 + 0 = 0
11001100 409 — 400 + 144 = 153
11111111 511 — 400 + 144 = 255
00100000 2 — 0 + 256 = 258 LENGTH=4
00011010 5 DISTANCE=7+1=8
11001000 1 — 0 + 256 = 257 LENGTH=3
00000000 6 DISTANCE=9+0=9
00100001 1 — 0 + 256 = 257 LENGTH=3, 2 DISTANCE=3
00100100 9 — 0 + 256 = 265 LENGTH=11+0=11
00001110 7 DISTANCE=13+0=13
01011000 3 — 0 + 256 = 259 LENGTH=5
00010010 9 DISTANCE=25+4=29
10000101 10 — 0 + 256 = 266 LENGTH=13+0=13
00110011 7 DISTANCE=13+0=13
11010011 409 — 400 + 144 = 153
11111000 99 — 48 + 0 = 51
00111111 511 — 400 + 144 = 255
00000011 48 — 48 + 0 = 0
00110010 9 — 0 + 256 = 265 LENGTH=11+1=12
00000111 7 DISTANCE=13+0=13
01000100 2 — 0 + 256 = 258 LENGTH=4
00000011 5 DISTANCE=7+1=8
00000000 0 — 0 + 256 = 256 END
10101010 ADLER
00000101 ADLER
00100011 ADLER
01110111 ADLER
0
255 153 0 255
0 0 153 255
255 153 0 255
255 0 0 255
0
0 0 153 255
255 153 0 255
255 0 0 255
255 153 0 255
0
255 153 0 255
255 0 0 255
255 153 0 255
0 153 51 255
0
255 0 0 255
255 153 0 255
0 153 51 255
255 153 0 255

Finally on laboratory work a test and approval of the teacher was passed. And now according to its sentence this article is also written. Searches of similar materials led eventually to standards. We hope that this article already in mother and clear Russian will help persons in need with their own undertakings.

Source image

This article is a translation of the original post at habrahabr.ru/post/274825/
If you have any questions regarding the material covered in the article above, please, contact the original author of the post.
If you have any complaints about this article or you want this article to be deleted, please, drop an email here: sysmagazine.com@gmail.com.

We believe that the knowledge, which is available at the most popular Russian IT blog habrahabr.ru, should be accessed by everyone, even though it is poorly translated.
Shared knowledge makes the world better.
Best wishes.

comments powered by Disqus