Using Flume with Hive and LZO (Part 1)

With one of my projects at Lijit I got the pleasure of learning about, and getting to contribute back to, the Flume project. If you're not familiar with Flume, it's basically a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data, with the added ability of moving that log data into Hadoop.

The basic goal of this project was to implement Flume to move log data from our web servers into our Hadoop cluster for processing. As part of our standard storage practice in Hadoop, we compress everything. This saves space and reduces disk and network I/O wait time. The only downside is an increase in CPU utilization, but as many people know, the CPU tends not to be the bottleneck, which makes the trade-off well worth it. We don't just compress with Gzip or BZip2, though; we prefer LZO. Why, you ask? Because an LZO file can be split into chunks, each of which can be processed by a separate mapper, allowing for a much improved Map phase. You can read more on LZO in this great article from Twitter.
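As a side note, for an LZO file to actually split you need to index it after it lands in HDFS. A minimal sketch, assuming the hadoop-lzo jar lives in /usr/lib/hadoop/lib and using a made-up file path (adjust both to your install):

# Build a .index file alongside the .lzo file so MapReduce
# can break it into per-block input splits.
hadoop jar /usr/lib/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.LzoIndexer \
  /flume/weblogs/web-00000.lzo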

Now, to get Flume to support LZO you're going to need a few things, mainly the key library you most likely already have in your Hadoop install: LZO support itself. I won't go into detail on how to install it, as there are many great articles and a README covering that.
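For reference, a Hadoop install with LZO wired up usually has the codecs registered in core-site.xml along these lines. The class names come from the hadoop-lzo project; your existing list of codecs may differ, so treat this as a sketch rather than a drop-in:

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>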

In a basic Flume setup you have three types of machines: an agent, a collector, and a master node. For LZO support the only node we really care about is the collector, as the collector's job is to actually perform the writes into HDFS/Hadoop. To enable it you just need to make a few changes to your Flume config files.
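To give a picture of those roles, here is a rough sketch of a flow wired up from the Flume shell. The host names (master, agent1, collector1, namenode) and the log path are placeholders, but tail, agentSink, collectorSource, and collectorSink are standard Flume 0.9.x sources and sinks:

flume shell -c master
# agent1 tails the web server log and forwards events to the collector
exec config agent1 'tail("/var/log/nginx/access.log")' 'agentSink("collector1", 35853)'
# collector1 listens on that port and writes batches into HDFS
exec config collector1 'collectorSource(35853)' 'collectorSink("hdfs://namenode/flume/weblogs/%Y-%m-%d/", "web-")'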

Extend Flume's Java library path with the LZO native libs.
/usr/lib/flume/bin/flume-env.sh

#64 Bit OS
export JAVA_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64

#32 Bit OS
export JAVA_LIBRARY_PATH=/usr/lib/hadoop-0.20/lib/native/Linux-i386-32

Make Flume aware of the compression codec you want to use.
/usr/lib/flume/config/flume-site.xml

<property>
    <name>flume.collector.dfs.compress.codec</name>
    <value>Lzop</value>
    <description>Writes formatted data compressed in specified codec to dfs. Value is None, GzipCodec, DefaultCodec (deflate), BZip2Codec, or any other Codec Hadoop is aware of </description>
</property>

You will also need to symlink or copy the hadoop-lzo.jar file from your Hadoop install's lib folder into Flume's lib folder at /usr/lib/flume/lib. Once both of these changes are complete you should have LZO support within Flume. Currently the only sinks that compress their file output are customDFS, escapedCustomDFS, and collectorSink; these will compress automatically thanks to the change in the flume-site.xml file.
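The symlink mentioned above is a one-liner. The jar name varies by version, so adjust the glob to whatever your Hadoop install actually ships:

# Put the hadoop-lzo jar on Flume's classpath
ln -s /usr/lib/hadoop/lib/hadoop-lzo-*.jar /usr/lib/flume/lib/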

Now a note about LZO. The library mentioned above provides access to two versions of the codec: LZO and LZOP. From what I can piece together, LZO is a streaming version of the codec, while LZOP is a file-based version. In most cases the LZOP version is what you want.
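A quick way to confirm you are getting LZOP output is to pull one of the files Flume wrote and run it through the lzop command line tool, which only understands the file-based format. The HDFS path here is a made-up example:

hadoop fs -get /flume/weblogs/2011-06-01/web-00000.lzo .
# If this prints readable log lines, the files are proper LZOP
lzop -dc web-00000.lzo | head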

This concludes Part 1. You should now have support for LZO in your Flume flows. In Part 2 I'll go over how to use escapedCustomDFS and collectorSink so your output can be piped right into Hive.


3 Responses to Using Flume with Hive and LZO (Part 1)

  1. Nuke says:

    Nick, thanks for the write-up and great job. It's really helpful to me.

    I've also been trying to replicate a Flume master for HA but haven't had any success so far. What do you suggest? It would be wonderful if you could write a post about this solution.

    • NerdyNick says:

      I will tell you this: as of Flume 0.9.4, multi-master does have a few limitations, one of which is the auto* sources and sinks. Standalone Flume masters (that is, when you're using the embedded ZooKeeper) have the same failover requirement as ZooKeeper: a majority of the machines must be running for the quorum to work. This means with 3 ZK machines you can only lose 1; with 4 you can still only lose 1, but with 5 you can lose 2. I haven't confirmed this limitation with standalone ZK machines, though.

      I will, however, look at getting a blog post up about how to set this all up, along with some ideas on different ways to do it, as well as common flows you will need to use.

  2. Nuke says:

    Thanks for the reply; I look forward to your post.

    I was aware of multi-master's limitations, so I did consider an alternative solution: 1 master + 1 standby for HA. No success so far.
