Zipofig. A story of a ZIP archive recovery
Nov 30, 2003 by Ilya Levin
Recently I had a 2.7 GB ZIP archive holding 2857 files stored with the "no compression" method. Every archive is meant to be unzipped eventually, and this one was no exception.
But unzip failed, to my surprise.
All attempts with different versions and different software (WinZip, PowerArchiver, WinRar, PKZIP, etc.) were unsuccessful too.
Some programs reported a broken ZIP structure and crashed with a GPF during repair. Some unzipped the archive only partially, pretending there were no more files inside. Some unzipped the whole archive but produced many zero-sized files instead of normal-sized ones.
However, all the files were indeed in the archive, and I certainly wanted my data back.
A closer look at the archive in binary form revealed some structural abnormalities. The archive was created with PKZIP 2.5 CLI for Windows 98/NT. There were no errors in the structure but, starting from a particular offset (0x803e07f0 to be exact), all files were stored in the so-called 'non-seekable device' or 'bit-3 on' mode, yet with the flags field set to 0.
The generic structure of a regular ZIP file looks as follows:
    [local file header 1]
    [file data 1]
    :
    [local file header n]
    [file data n]
    [central directory entry 1]
    :
    [central directory entry n]
    [end of central directory record]
Two more records, ZIP64 and digital signature, may also exist in the archive, but I'll skip them here to keep things clear. If you are interested in the full ZIP format, refer to [2].
A local file header record begins with the signature 0x04034b50 and contains a bit flag field, a CRC32 field, a compressed file size field, an original size field, a variable-length file name field, plus a few more fields. A central directory entry begins with the signature 0x02014b50 and is again a file header: it contains all the fields of a local file header plus a few extra ones. The end of central directory record begins with the signature 0x06054b50 and contains the total number of entries in the central directory, the start offset of the central directory, plus a few more fields.
If a ZIP archive is written to standard output or a non-seekable device, then an additional data descriptor record is used. A single file entry in this mode looks like this:
    [local file header]
    [file data]
    [data descriptor]
The data descriptor record has no signature and follows right after the file data. It holds the CRC32, the compressed file size, and the original file size. In this case, the same fields in the local file header should be set to 0 and bit 3 of the bit flag field should be set to 1.
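The record layout described above can be illustrated with a short Python sketch. It assumes the standard 30-byte fixed part of a local file header; the names `LocalHeader`, `parse_local_header`, `uses_data_descriptor` and `looks_like_broken_entry` are hypothetical helpers invented for this illustration, not part of Zipofig.

```python
import struct
from typing import NamedTuple

LOCAL_SIG = 0x04034B50
# signature, version, flags, method, time, date, crc32,
# compressed size, original size, name length, extra length
LOCAL_FMT = "<IHHHHHIIIHH"
LOCAL_SIZE = struct.calcsize(LOCAL_FMT)  # 30 bytes


class LocalHeader(NamedTuple):
    flags: int
    method: int
    crc32: int
    compressed_size: int
    original_size: int
    name: bytes
    data_offset: int  # where the file data starts inside the buffer


def parse_local_header(buf: bytes, offset: int = 0) -> LocalHeader:
    """Decode one local file header found at `offset` in `buf`."""
    (sig, _ver, flags, method, _time, _date,
     crc, csize, usize, name_len, extra_len) = struct.unpack_from(LOCAL_FMT, buf, offset)
    if sig != LOCAL_SIG:
        raise ValueError("not a local file header")
    name_start = offset + LOCAL_SIZE
    name = buf[name_start:name_start + name_len]
    return LocalHeader(flags, method, crc, csize, usize, name,
                       name_start + name_len + extra_len)


def uses_data_descriptor(h: LocalHeader) -> bool:
    """True when bit 3 of the flags announces a trailing data descriptor."""
    return bool(h.flags & 0x0008)


def looks_like_broken_entry(h: LocalHeader) -> bool:
    """Zeroed CRC and sizes without bit 3 (also matches empty files and directories)."""
    return (not uses_data_descriptor(h) and h.crc32 == 0
            and h.compressed_size == 0 and h.original_size == 0)
```

A header that passes `looks_like_broken_entry` is only a candidate: a genuinely empty file or a directory entry has the same zeroed fields.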
It is evident enough that PKZIP 2.5 ran into a signed 32-bit overflow during archive creation. At some point it failed to position the file pointer of the output archive properly and switched itself to the 'non-seekable device' mode (without setting bit 3 in the local file headers, for some reason).
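To see why that particular offset is suspicious, here is a small Python check. It only shows that 0x803e07f0 does not fit into a signed 32-bit integer; how PKZIP 2.5 actually handled offsets internally is not known, and `as_signed32` is a hypothetical helper for the illustration.

```python
import struct

OFFSET = 0x803E07F0  # the offset quoted in the text


def as_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of `value` as a signed two's-complement integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


print(OFFSET)               # 2151548912
print(OFFSET > 2**31 - 1)   # True: past the largest signed 32-bit value
print(as_signed32(OFFSET))  # -2143418384
```

A program that keeps file positions in a signed 32-bit variable would see a negative position here, which is consistent with a seek that fails just past the 2 GB mark.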
The most obvious method of reading ZIP archives would be something like
    open archive
    read local file header
    while (signature from local file header == 0x04034b50)
        read (file name size from local file header) bytes of file name
        skip next (extra field size + compressed file size) bytes
        read local file header
    end while
    close archive
As you can see, it will not work correctly on files stored in the 'non-seekable device' format, where the compressed size field in the local file header is 0. It fails at the "skip next" line, jumping to the wrong place instead of the next local file header.
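The failure can be reproduced on a tiny hand-built archive. The Python sketch below is an illustration only: `local_entry` is a hypothetical helper that builds a stored entry either normally or in the zeroed-header style described above (with a 12-byte trailing descriptor assumed for the demo), and `naive_list` follows the naive loop step by step.

```python
import struct
import zlib

LOCAL_SIG = 0x04034B50
LOCAL_FMT = "<IHHHHHIIIHH"  # 30-byte fixed part of a local file header
LOCAL_SIZE = struct.calcsize(LOCAL_FMT)


def local_entry(name: bytes, data: bytes, broken: bool) -> bytes:
    """Build one stored entry; `broken` zeroes the header fields and appends a descriptor."""
    crc = zlib.crc32(data)
    if broken:
        header_crc = header_size = 0
        trailer = struct.pack("<III", crc, len(data), len(data))
    else:
        header_crc, header_size = crc, len(data)
        trailer = b""
    header = struct.pack(LOCAL_FMT, LOCAL_SIG, 20, 0, 0, 0, 0,
                         header_crc, header_size, header_size, len(name), 0)
    return header + name + data + trailer


def naive_list(archive: bytes) -> list:
    """Walk local headers sequentially, trusting the compressed size field."""
    names = []
    pos = 0
    while pos + LOCAL_SIZE <= len(archive):
        fields = struct.unpack_from(LOCAL_FMT, archive, pos)
        if fields[0] != LOCAL_SIG:
            break  # landed on something that is not a header
        csize, name_len, extra_len = fields[7], fields[9], fields[10]
        pos += LOCAL_SIZE
        names.append(archive[pos:pos + name_len])
        pos += name_len + extra_len + csize  # the "skip next" step
    return names


good = local_entry(b"a.txt", b"hello", False) + local_entry(b"b.txt", b"world", False)
bad = local_entry(b"a.txt", b"hello", True) + local_entry(b"b.txt", b"world", True)
print(naive_list(good))  # [b'a.txt', b'b.txt']
print(naive_list(bad))   # [b'a.txt']
```

With zeroed sizes the loop lands on the first byte of the file data, finds no signature there, and stops after the first entry, which matches the "only partially unzipped" behaviour.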
Unfortunately, some tested ZIP processing software uses this approach. Those that partially unzipped the archive were exactly demonstrating the described blunder. Other software somehow combines access to the central directory and local headers. These methods resulted in crashes and hordes of empty files, taken from local file headers.
Facing a sink-or-swim situation, I decided to write my own tool to recover the data.
The only problem was to find both the beginning and the size of each file within the archive.
The most obvious solution was to read the local file headers. If the compressed size, the original size, and the CRC32 fields are all 0, then read data until the next local file header signature appears; the file content is that data minus 20 bytes (the size of a ZIP64 data descriptor record). A clean and elegant solution, but here is the catch: there were nested ZIP archives. In that case, a local file header signature can appear before the actual end of the file being processed. This can be avoided with a different method: read 8 bytes at a time, shifting by 1 byte, until the 8-byte value equals the number of bytes read so far and the 4-byte value after the next 8 bytes is the local file header signature. There is still a chance of an accidental match, but the probability is low. However, this method is rather slow.
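The byte-by-byte scan can be sketched in Python as follows. This is one reading of the description, under stated assumptions: the descriptor is taken to be 20 bytes (a 4-byte CRC32 followed by two 8-byte sizes, no signature), "bytes read so far" is taken to mean the data length before the descriptor, and a central directory signature is also accepted so that the last file in the archive can be found. `find_data_length` is a hypothetical name.

```python
import struct

LOCAL_SIG = b"PK\x03\x04"    # 0x04034b50 as stored on disk (little-endian)
CENTRAL_SIG = b"PK\x01\x02"  # 0x02014b50 as stored on disk


def find_data_length(buf: bytes, data_start: int) -> int:
    """Return the length of the file data starting at `data_start`, or -1 if not found."""
    pos = data_start
    while pos + 24 <= len(buf):
        # Candidate descriptor at `pos`: CRC32 (4), compressed size (8), original size (8).
        (csize,) = struct.unpack_from("<Q", buf, pos + 4)
        followed_by_header = buf[pos + 20:pos + 24] in (LOCAL_SIG, CENTRAL_SIG)
        if csize == pos - data_start and followed_by_header:
            return csize
        pos += 1  # shift by one byte and try again
    return -1
```

Because the size stored in the candidate descriptor has to agree with the distance already scanned, a signature inside a nested archive does not end the search early. The cost is one comparison per byte of data, which is why the approach is slow on a multi-gigabyte file.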
I found it easier to go through the central directory, because all the necessary information is already there. All I needed was to locate the beginning of the central directory, not the beginning of each file. Instead of looking for the first occurrence of a central directory entry signature (remember the nested archives), the necessary value can be taken from the "end of central directory" record. Here is the final method:
    open archive
    from the end of archive to its beginning find a first occurrence of the end of central directory record's signature
    if found
        read the end of central directory record
        jump to the central directory first entry's offset
        read central directory entry
        while (signature from central directory entry == 0x02014b50)
            let X = local file header's offset from central directory entry
            let X = X + local file header size
            let X = X + file name size from central directory entry
            let Y = compressed size from central directory entry
            let S = file name from central directory entry
            if ((Y == 0) and (original size from central directory entry == 0) and (crc32 from central directory entry == 0)) then
                make directory (S)
            else
                save (Y) bytes from (X) offset to the (S) file
            read central directory entry
        end while
    end if found
    close archive
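For readers who prefer running code, here is a minimal Python sketch of the same idea. It is not the Zipofig source: the function names are hypothetical, file names are assumed to be CP437-encoded, and, unlike the pseudocode, the sketch reads the name and extra field lengths from the local header itself so that a non-empty extra field is skipped too.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"         # 0x06054b50 as stored on disk
EOCD_FMT = "<IHHHHIIH"           # 22-byte end of central directory record
CDIR_SIG = 0x02014B50
CDIR_FMT = "<IHHHHHHIIIHHHHHII"  # 46-byte fixed part of a central directory entry
LOCAL_FMT = "<IHHHHHIIIHH"       # 30-byte fixed part of a local file header


def find_central_directory(f) -> int:
    """Search backwards for the end of central directory record; return the directory offset."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    tail_len = min(size, 22 + 0xFFFF)  # the record plus the longest possible comment
    f.seek(size - tail_len)
    tail = f.read(tail_len)
    pos = tail.rfind(EOCD_SIG)
    if pos < 0 or pos + 22 > len(tail):
        raise ValueError("end of central directory record not found")
    return struct.unpack_from(EOCD_FMT, tail, pos)[6]


def iter_entries(f):
    """Yield (name, data_offset, compressed_size, is_directory) for each directory entry."""
    pos = find_central_directory(f)
    while True:
        f.seek(pos)
        raw = f.read(46)
        if len(raw) < 46:
            break
        e = struct.unpack(CDIR_FMT, raw)
        if e[0] != CDIR_SIG:
            break
        crc, csize, usize = e[7], e[8], e[9]
        name_len, extra_len, comment_len, local_off = e[10], e[11], e[12], e[16]
        name = f.read(name_len).decode("cp437")
        pos += 46 + name_len + extra_len + comment_len
        # The data starts after the local header, its file name and its extra field.
        f.seek(local_off)
        local = struct.unpack(LOCAL_FMT, f.read(30))
        data_off = local_off + 30 + local[9] + local[10]
        yield name, data_off, csize, (crc == 0 and csize == 0 and usize == 0)


def read_stored(f, data_off: int, size: int) -> bytes:
    """Read `size` bytes of stored (uncompressed) data."""
    f.seek(data_off)
    return f.read(size)


# Usage on a small in-memory archive built with the standard zipfile module.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", zipfile.ZIP_STORED) as z:
    z.writestr("a.txt", b"hello")
for name, off, size, is_dir in iter_entries(demo):
    print(name, read_stored(demo, off, size))
```

Since the sizes come from the central directory, zeroed fields in the local headers do not matter. For a real recovery the data would be copied to disk in chunks rather than read into memory at once.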
This method works on any ZIP archive, and it is the actual method used in Zipofig, the tool I finally made. No compression was applied to the files in my particular case, which is why no additional effort was required to get the original file content. In other cases, it probably makes sense to unpack these Y bytes with the appropriate decompression method before saving them to a file. The method can be selected based on the compression method field of the central directory entry.
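As a hedged illustration of that last point, the Python sketch below dispatches on the compression method field. Zipofig itself handles only the stored case; the deflate branch and the `unpack` helper are assumptions added for the example, using the standard `zlib` module.

```python
import zlib

STORED = 0    # method 0: no compression
DEFLATED = 8  # method 8: deflate


def unpack(raw: bytes, method: int) -> bytes:
    """Return the original file content for the given compression method."""
    if method == STORED:
        return raw
    if method == DEFLATED:
        # ZIP holds raw deflate streams; negative wbits tells zlib not to expect a header.
        return zlib.decompress(raw, -zlib.MAX_WBITS)
    raise NotImplementedError(f"compression method {method} is not handled")
```

Any other method value is rejected explicitly, so an unsupported entry is reported instead of being written out as garbage.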
Zipofig is a Win32 console utility distributed as source code. It was written in, and can be compiled with, Visual C. Please refer to the source code [1] for compatibility notes and compiling instructions.
Zipofig has two modes: list the contents of an archive and extract files from it. Only one decompression method is supported: store, also known as zero-level compression or no compression. Be my guest to add and implement whatever you like and whatever you need.
Indeed, Zipofig is not a pearl of programming art, but I do not care. It works, and it really helped me when other software disgracefully failed.
You may wonder: wasn't it easier to set bit 3 of the bit flag field in the local file header of each file stored with zeroed size and CRC32 fields? Yes, it was. However, there were reasons not to:
- There were too many files for manual fixing with a hex editor;
- The problem of finding the local file header for each file still needed to be solved anyway;
- There was no guarantee that the software listed above would be able to unpack even a fixed archive, and I had no desire to waste my time on it;
- I wanted to feed my ego by proving that I can do better than the authors of that failing brand-name software.
I should say, the last one is the real reason. In the meantime, I've got my data back. Not a bad bonus indeed.
References
[1] Zipofig C source code.
[2] PKWARE's Application Note: The ZIP file format.