With one of my projects at Lijit I got the pleasure of learning, and contributing back to, the Flume project. If you're not familiar with Flume, it's basically a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data, with the added ability of moving that data into Hadoop.
So the basics of this project were to implement Flume to move log data from our web servers into our Hadoop cluster for processing. As part of our standard storage practice in Hadoop, we compress everything. This saves space and reduces disk and network I/O wait time; the only downside is an increase in CPU utilization. As many people know, the CPU tends not to be the bottleneck, making the trade-off well worth it. We don't just compress with Gzip or BZip2, though; we prefer LZO. Why, you ask? Because LZO allows a single file to be split into chunks, allowing for a much improved Map phase. You can read more on LZO in this great article from Twitter.
Now, to get Flume to support LZO you're going to need a few things, mainly the key library you most likely already have in your Hadoop install: LZO support itself. I won't go into detail on how to install it, as there are many great articles and a README covering that.
In a basic Flume setup you have three types of machines: an agent, a collector, and a master node. For LZO support we don't really care about any node but the collector, as the collector's job is to actually perform the writes into HDFS/Hadoop. To do this you just need to make a few changes to your Flume config files.
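To make those roles concrete, here's a sketch of an old-style (0.9.x-era) Flume node spec; the host names, port, and HDFS path are placeholders, not from a real setup:

```
# Hypothetical node specs -- host names, port, and paths are placeholders.
# The agent tails the web server log and forwards events to the collector;
# the collector is the node that actually writes into HDFS.
agent01     : tail("/var/log/httpd/access_log") | agentSink("collector01", 35853) ;
collector01 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/weblogs/", "weblog-") ;
```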
Extend Flume's Java library path with the LZO native libs.
```shell
# 64-bit OS
export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64
# 32-bit OS
export JAVA_LIBRARY_PATH=/usr/lib/hadoop-0.20/lib/native/Linux-i386-32
```
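Where you set this depends on your packaging; one common approach (an assumption, so check your install) is to put the export in Flume's environment script so every daemon picks it up:

```shell
# Goes in Flume's environment script (e.g. flume-env.sh); the script location
# and the 64-bit native dir below are assumptions -- match them to your install.
export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64
```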
Next, make Flume aware of the compression codec you want to use, in flume-site.xml.
```xml
<property>
  <name>flume.collector.dfs.compress.codec</name>
  <value>Lzop</value>
  <description>Writes formatted data compressed in specified codec to dfs.
  Value is None, GzipCodec, DefaultCodec (deflate), BZip2Codec, or any other
  Codec Hadoop is aware of</description>
</property>
```
You will also need to symlink or copy the hadoop-lzo.jar file from your Hadoop install's lib folder into Flume's lib folder at /usr/lib/flume/lib. Once both of these changes are complete, you should have LZO support within Flume. Currently the only sinks that support compression when writing out files are customDFS, escapedCustomDFS, and collectorSink. These sinks will automatically compress their output thanks to the change in the flume-site.xml file.
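The symlink step looks something like the sketch below. The jar name/version and the lib paths are assumptions (use whatever hadoop-lzo jar your Hadoop install ships); the demo uses throwaway temp dirs so it runs anywhere, but in production you'd point the variables at the real install dirs:

```shell
# In production these would be the real dirs, e.g.:
#   HADOOP_LIB=/usr/lib/hadoop/lib
#   FLUME_LIB=/usr/lib/flume/lib
HADOOP_LIB=$(mktemp -d)
FLUME_LIB=$(mktemp -d)
touch "$HADOOP_LIB/hadoop-lzo-0.4.15.jar"   # stand-in for the real jar

# The actual step: link (or cp) the jar into Flume's lib dir so Flume's JVM
# can load the LZO codec classes.
ln -s "$HADOOP_LIB"/hadoop-lzo-*.jar "$FLUME_LIB/"
ls "$FLUME_LIB"
```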
Now a note about LZO. The library mentioned above provides access to two versions of the codec, LZO and LZOP. From what I can piece together, LZO is a streaming version, while LZOP is a file-based version of the codec. In most cases the LZOP version is what you want.
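For reference, the two show up as separate codec classes in the hadoop-lzo library, and they're normally registered in Hadoop's core-site.xml. The exact codec list below is a sketch of a typical install, not pulled from a real config:

```xml
<!-- Typical codec registration in core-site.xml (sketch; adjust to taste) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```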
This concludes Part 1. You should now have support for LZO in your Flume flows. In Part 2 I'll go over how to use escapedCustomDFS and collectorSink so your output can be piped right into Hive.