Thursday, November 25, 2010

The cost of History

Well well, another idea proved fruitless.
After a quick note on cuckoo hashing, I tried to apply the concept to a compressor based on Etincelle's principles, betting that the superior occupation rate of cuckoo hash tables would translate into a better compression ratio.
There was indeed a gain, but a small one.

In fact, my initial diagnosis was simply wrong: I assumed that Etincelle's tables were incompletely used. That proved incorrect: they are fully used, fully cranked I should say. And that is necessary for the scheme to be efficient.

Indeed, the real question is: which slot should be evicted when a new entrant gets in? With Etincelle, I'm using pure randomness to free a slot for the new occupant. This obviously means there are some collisions. With cuckoo hashing, the table gets fuller before the first collision occurs. But in the end, both tables are completely filled, so collisions happen at each and every update.
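To make the eviction question concrete, here is a minimal sketch of a bucketed match table where, once a bucket is full, every new insertion frees a randomly chosen slot. The sizes and the hash function are invented for illustration; they are not Etincelle's actual parameters.

```python
import random

BUCKET_COUNT = 1024      # illustrative sizes, not Etincelle's actual parameters
BUCKET_SIZE = 4

def bucket_of(key: int) -> int:
    # simple multiplicative hash; the real function is an assumption here
    return (key * 2654435761) % BUCKET_COUNT

class RandomEvictTable:
    """Once a bucket is full, each insertion evicts a random victim."""
    def __init__(self):
        self.buckets = [[] for _ in range(BUCKET_COUNT)]

    def insert(self, key: int, pos: int) -> None:
        bucket = self.buckets[bucket_of(key)]
        if len(bucket) < BUCKET_SIZE:
            bucket.append((key, pos))   # free slot: no collision yet
        else:
            # table fully cranked: pure randomness frees a slot
            bucket[random.randrange(BUCKET_SIZE)] = (key, pos)

    def lookup(self, key: int):
        for k, pos in self.buckets[bucket_of(key)]:
            if k == key:
                return pos
        return None
```

Note how the scheme only degrades once buckets are full: before that point insertions are collision-free, which matches the observation that filling order, not final occupancy, is where cuckoo hashing differs.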

The only benefit of the new hash function is that I can select which slot will be freed. In my test, betting that the least recently used slot is probably the least worthy did provide a small compression boost.
But not enough to justify its cost.
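The least-recently-used bet can be sketched with a per-slot timestamp; again, the sizes, hash function, and bookkeeping are my own assumptions for illustration, not the tested implementation.

```python
BUCKET_COUNT = 1024      # illustrative sizes, not Etincelle's actual parameters
BUCKET_SIZE = 4

def bucket_of(key: int) -> int:
    # simple multiplicative hash; the real function is an assumption here
    return (key * 2654435761) % BUCKET_COUNT

class LRUEvictTable:
    """When a bucket is full, evict the least recently used slot."""
    def __init__(self):
        self.buckets = [[] for _ in range(BUCKET_COUNT)]
        self.clock = 0  # global logical time, bumped on every access

    def insert(self, key: int, pos: int) -> None:
        self.clock += 1
        bucket = self.buckets[bucket_of(key)]
        if len(bucket) < BUCKET_SIZE:
            bucket.append([key, pos, self.clock])
            return
        # full bucket: bet that the stalest slot is the least worthy
        victim = min(range(BUCKET_SIZE), key=lambda i: bucket[i][2])
        bucket[victim] = [key, pos, self.clock]

    def lookup(self, key: int):
        self.clock += 1
        for entry in self.buckets[bucket_of(key)]:
            if entry[0] == key:
                entry[2] = self.clock  # a successful match refreshes recency
                return entry[1]
        return None
```

The extra timestamp per slot and the scan for the minimum are exactly the kind of cost that has to be weighed against the small compression boost.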

I now remember a pending question I had when designing Etincelle: why keep the whole history in memory? The larger it is, the more it costs to reference into it. If it were possible to keep only the useful part of history, references would become much smaller, and compression would therefore improve.

Alas, you can't guess which part of history is going to become useful. As a consequence, the usual approach for compressors has always been to keep everything in memory. When reference size matters, references are shortened thanks to context filtering.

Trying to improve on this situation, I tried my hand at an "explicit history manager", which would tell the decoder which positions to keep in memory for later use.
My first implementation proved disastrous: although references became much smaller, this was not enough to offset the cost of the instructions passed to the decoder.
Later, I tried a new implementation, much more dynamic and (alas) much more complex. This time, there was a net gain. Not a large one, but definitely a gain.
From a speed/ratio perspective, however, it was really not worth it; so, outside of a pure compression exercise, it proved a wrong path.

So my conclusion, at the time, was that I'd better try to find some heuristic which would, on average, correctly guess which part of history to keep in memory.

I never found the correct formula.

It seems this idea is a distant cousin of a concept used in some vastly different compression algorithms, called SSE (Secondary Symbol Estimation). To summarize the concept: according to additional statistics (such as the latest compression ratio, local guess success, etc.), the current guess probabilities get adjusted. For example, if the algorithm tries to compress some noise, SSE will simply "take over" past statistics and declare the final estimation completely random, providing a flat probability for all symbols in the alphabet.
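As a toy illustration of that "take over" behaviour: the linear blend and the success-rate weight below are my own simplifying assumptions; real SSE implementations typically use learned tables indexed by context and quantized probability, not a fixed formula.

```python
def sse_adjust(p_model: float, recent_hit_rate: float, alphabet_size: int) -> float:
    """Blend the model's probability estimate toward a flat distribution.

    When recent guesses succeed (hit rate near 1), trust the model;
    when they fail (hit rate near 0), fall back to flat probabilities,
    which is what should happen when compressing pure noise.
    """
    flat = 1.0 / alphabet_size
    return recent_hit_rate * p_model + (1.0 - recent_hit_rate) * flat
```

On noise, the hit rate collapses and the output converges to 1/alphabet_size regardless of what the primary model claims; on predictable data, the model's estimate passes through almost untouched.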

This is the kind of thing I would like to mimic: derive, from a bunch of easy statistics, an estimation of the current position's "worthwhileness" to be stored in memory for future reference.

Apparently, this still deserves to be properly implemented.
