30 November 2012

Dev Tip: My Experience with a Compression Library Used by Google

Hi Folks,

While doing system architecture and development for a brand new EduAlert product, "Learn360", I went through a series of challenges. Let's go through one interesting one.

Note: I am not going to explain the product architecture, since the rights to the product belong to EduAlert. Only the key technical challenges I faced during development are covered here.

A little introduction about the product "Learn360":

      'A proprietary learning platform that provides easy, coherent access to networks of people and resources'.

A new-generation LMS (Learning Management System) requires a large amount of metadata to be stored and processed, since most of the content pushed through an LMS is in video or textual format.

Here the problem is: "What's the best way to store redundant data on a file system or in a database without compromising efficiency?"

I first thought of implementing a dedupe file system at the file-system level, built on the FUSE API on Linux. I wrote a simple Python program using the FUSE bindings, and it worked, but it lacked a file-locking mechanism and had severe performance issues. Since a FUSE file system runs in user space, every I/O call hops between the kernel module and the user-space daemon, so I/O times are high. Hence it was not an advisable option for me.
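
For context, here is a minimal sketch of that experiment, assuming the third-party fusepy package (pip install fusepy). The names and the passthrough logic are illustrative, not the original Learn360 code; a real dedupe layer would hash blocks in read/write and store only the unique ones.

    import os
    import sys
    from fuse import FUSE, Operations  # fusepy

    class PassthroughFS(Operations):
        """Mirrors a source directory through a FUSE mount point."""

        def __init__(self, root):
            self.root = root

        def _full(self, path):
            return os.path.join(self.root, path.lstrip('/'))

        def getattr(self, path, fh=None):
            st = os.lstat(self._full(path))
            return dict((key, getattr(st, key)) for key in (
                'st_mode', 'st_nlink', 'st_size', 'st_uid', 'st_gid',
                'st_atime', 'st_mtime', 'st_ctime'))

        def readdir(self, path, fh):
            return ['.', '..'] + os.listdir(self._full(path))

        def read(self, path, size, offset, fh):
            # A dedupe layer would look up block hashes here instead.
            with open(self._full(path), 'rb') as f:
                f.seek(offset)
                return f.read(size)

    if __name__ == '__main__':
        # usage: python passthrough_fs.py <source-dir> <mount-point>
        FUSE(PassthroughFS(sys.argv[1]), sys.argv[2], foreground=True)

Every call into this code crosses from the kernel to the user-space process and back, which is exactly where the I/O overhead comes from.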

My thoughts then went in a different direction: "What about incorporating the zlib library and compressing/decompressing data during I/O?" I finally decided not to use this approach either, since the compression overhead is too high for a system that, like ours, is highly dependent on metadata I/O.
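
What that would have looked like, as a quick sketch using Python's standard zlib module (the payload here is made up for illustration):

    import zlib

    metadata = b'{"course": "CS101", "views": 1042}' * 500

    packed = zlib.compress(metadata, 6)        # 6 is zlib's default level
    assert zlib.decompress(packed) == metadata
    print(len(metadata), '->', len(packed), 'bytes')

The ratio is good, but paying this CPU cost on every metadata read and write adds up quickly.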

I then thought of putting the frequently used decompressed metadata into Memcached. The overhead is still high, but this was far better than the previous approach.
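
A minimal cache-aside sketch of that idea, assuming the third-party python-memcached package and a memcached server on the default port; load_compressed_metadata() is a hypothetical stand-in for the real store:

    import zlib
    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def load_compressed_metadata(key):
        # hypothetical: fetch the zlib-compressed blob from disk or the DB
        return zlib.compress(b'metadata for ' + key.encode())

    def get_metadata(key):
        cached = mc.get(key)
        if cached is not None:
            return cached                     # cache hit: no decompression cost
        value = zlib.decompress(load_compressed_metadata(key))
        mc.set(key, value, time=300)          # keep hot metadata for 5 minutes
        return value

This hides the decompression cost for hot keys, but every cache miss still pays the full zlib price.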

So my primary focus was to find an efficient way to compress and decompress metadata with less overhead. After a few hours of literature survey, I found "Snappy", a library used by Google and other companies, and incorporated it into our core framework.
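
The swap itself is small. A minimal sketch, assuming the third-party python-snappy binding (pip install python-snappy), with the same made-up payload as above:

    import snappy

    metadata = b'{"course": "CS101", "views": 1042}' * 500

    packed = snappy.compress(metadata)
    assert snappy.uncompress(packed) == metadata

The compressed output is larger than zlib's, but both directions are dramatically cheaper in CPU time.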

This library may help someone in the near future who is working on a similar compression-related problem.

More about Snappy:

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.
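
Those numbers are easy to sanity-check on your own hardware. A rough benchmark sketch, assuming python-snappy alongside the standard zlib and timeit modules (the payload is made up):

    import timeit
    import zlib
    import snappy

    payload = b'user=1042;course=CS101;progress=0.42;' * 4096

    for name, fn in [('zlib-1 ', lambda: zlib.compress(payload, 1)),
                     ('snappy ', lambda: snappy.compress(payload))]:
        print(name, '%.3fs for 200 rounds' % timeit.timeit(fn, number=200))

    print('zlib-1 ratio :', float(len(zlib.compress(payload, 1))) / len(payload))
    print('snappy ratio :', float(len(snappy.compress(payload))) / len(payload))

zlib level 1 is its fastest mode, which is the comparison the description above refers to.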

Snappy is widely used inside Google, in everything from BigTable and MapReduce to their internal RPC systems. (Snappy has previously been referred to as "Zippy" in some presentations and the like.)

Language bindings:

Snappy is written in C++, but C bindings are included, and several bindings to other languages are maintained by third parties.

Cheers!


