<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~files/atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedpress="https://feed.press/xmlns">
  <feedpress:locale>en</feedpress:locale>
  <link rel="via" href="http://www.michael-noll.com/atom.xml"/>
  <link rel="hub" href="http://feedpress.superfeedr.com/"/>
  <title><![CDATA[Michael G. Noll]]></title>
  <link href="http://feedpress.me/miguno" rel="self"/>
  <link href="http://www.michael-noll.com/"/>
  <updated>2017-10-25T10:21:16+02:00</updated>
  <id>http://www.michael-noll.com/</id>
  <author>
    <name><![CDATA[Michael G. Noll]]></name>
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>
  <entry>
    <title type="html"><![CDATA[Integrating Kafka and Spark Streaming: Code Examples and State of the Game]]></title>
    <link href="http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno"/>
    <updated>2014-10-01T16:51:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial</id>
    <content type="html"><![CDATA[<p><a href="https://spark.apache.org/streaming/">Spark Streaming</a> has been getting some attention lately as a real-time data
processing tool, often mentioned alongside <a href="http://storm.apache.org/">Apache Storm</a>.  If you ask me, no real-time data
processing tool is complete without Kafka integration (smile), hence I added an example Spark Streaming application to
<a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> that demonstrates how to read from Kafka and write
to Kafka, using <a href="http://avro.apache.org/">Avro</a> as the data format and
<a href="https://github.com/twitter/bijection">Twitter Bijection</a> for handling the data serialization.</p>

<p>In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state
of Kafka integration in Spark Streaming.  All this with the disclaimer that this happens to be my first experiment with
Spark Streaming.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="note">
  <strong>
    The Spark Streaming example code is available at
    <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> on GitHub.
    And yes, the project&#8217;s name might now be a bit misleading. :-)
  </strong>
</div>

<h1 id="what-is-spark-streaming">What is Spark Streaming?</h1>

<p><a href="http://spark.apache.org/streaming/">Spark Streaming</a> is a sub-project of <a href="http://spark.apache.org/">Apache Spark</a>.
Spark is a batch processing platform similar to Apache Hadoop, and Spark Streaming is a near-real-time processing tool
that runs on top of the Spark engine and processes incoming data in small batches (so-called micro-batches).</p>

<h2 id="spark-streaming-vs-apache-storm">Spark Streaming vs. Apache Storm</h2>

<p>In terms of use cases Spark Streaming is closely related to <a href="http://storm.apache.org/">Apache Storm</a>, which is
arguably today’s most popular real-time processing platform for Big Data.  Bobby Evans and Tom Graves of Yahoo!
Engineering recently gave a talk on
<a href="http://yahoohadoop.tumblr.com/post/98213421641/storm-and-spark-at-yahoo-why-chose-one-over-the-other">Spark and Storm at Yahoo!</a>,
in which they compare the two platforms and also cover the question of when and why to choose one over the other.
Similarly, P. Taylor Goetz of Hortonworks shared a slide deck titled
<a href="http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming">Apache Storm and Spark Streaming Compared</a>.</p>

<p>Here’s my personal, very brief comparison: Storm has higher industry adoption and better production stability compared
to Spark Streaming.  Spark on the other hand has a more expressive, higher-level API than Storm, which is arguably more
pleasant to use, at least if you write your Spark applications in Scala (I prefer the Spark API, too).  But don’t just
take my word for it – please do check out the talks and slide decks above yourself.</p>

<p>Both Spark and Storm are top-level Apache projects, and vendors have begun to integrate either or both tools into their
commercial offerings, e.g. Hortonworks (<a href="http://hortonworks.com/hadoop/storm/">Storm</a>,
<a href="http://hortonworks.com/hadoop/spark/">Spark</a>) and Cloudera
(<a href="http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html">Spark</a>).</p>

<h1 id="excursus-machines-cores-executors-tasks-and-receivers-in-spark">Excursus: Machines, cores, executors, tasks, and receivers in Spark</h1>

<p>The subsequent sections of this article talk a lot about parallelism in Spark and in Kafka.  You need at least a basic
understanding of some Spark terminology to be able to follow the discussion in those sections.</p>

<ul>
  <li>A Spark <strong>cluster</strong> contains 1+ worker nodes aka slave machines (simplified view; I exclude pieces like cluster
managers here).</li>
  <li>A <strong>worker node</strong> can run 1+ executors.</li>
  <li>An <strong>executor</strong> is a process launched for an application on a worker node, which runs tasks and keeps data in memory
or disk storage across them.  Each application has its own executors.  An executor has a certain number of cores aka
“slots” available to run tasks assigned to it.</li>
  <li>A <strong>task</strong> is a unit of work that will be sent to one executor.  That is, it runs (part of) the actual computation of
your application.  The <code>SparkContext</code> sends those tasks for the executors to run.  Each task occupies one slot aka
core in the parent executor.</li>
  <li>A <strong>receiver</strong>
(<a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.receiver.Receiver">API</a>,
<a href="http://spark.apache.org/docs/latest/streaming-custom-receivers.html">docs</a>)
is run within an executor as a long-running task.  Each receiver is responsible for exactly one so-called
<em>input DStream</em> (e.g. an input stream for reading from Kafka), and each receiver – and thus input DStream – occupies
one core/slot.</li>
  <li>An <strong>input DStream</strong> is a special DStream that connects Spark Streaming to external data sources
for reading input data.  For each external data source (e.g. Kafka) you need one such input DStream implementation.
Once Spark Streaming is “connected” to an external data source via such input DStreams, any subsequent DStream
transformations will create “normal” DStreams.</li>
</ul>
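<p>A quick rule of thumb that follows from the list above (plain Scala arithmetic, not a Spark API; <code>minCores</code> is a made-up helper for illustration): because each receiver permanently occupies one core/slot, an application needs strictly more cores than it has receivers, otherwise no slots remain for the tasks that do the actual processing.</p>

```scala
// Made-up helper (not a Spark API): each receiver permanently occupies one
// executor core/slot, so a streaming application needs strictly more cores
// than receivers to leave at least one slot free for processing tasks.
def minCores(numReceivers: Int, processingSlots: Int = 1): Int =
  numReceivers + processingSlots

// e.g. five input DStreams -> five receivers -> at least six cores in total
val coresNeeded = minCores(5)
```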

<p>In Spark’s execution model, each application gets its own executors, which stay up for the duration of the whole
application and run 1+ tasks in multiple threads.  This isolation approach is similar to Storm’s model of execution.
This architecture becomes more complicated once you introduce cluster managers like YARN or Mesos, which I do not cover
here.  See <a href="http://spark.apache.org/docs/latest/cluster-overview.html">Cluster Overview</a> in the Spark docs for further
details.</p>

<h1 id="integrating-kafka-with-spark-streaming">Integrating Kafka with Spark Streaming</h1>

<h2 id="overview">Overview</h2>

<p>In short, Spark Streaming supports Kafka but there are still some rough edges.</p>

<p>A good starting point for me has been the
<a href="https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala">KafkaWordCount</a>
example in the Spark code base
(<strong>Update 2015-03-31:</strong> see also
<a href="https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala">DirectKafkaWordCount</a>).
When I read this code, however, a couple of questions remained open.</p>

<p>Notably I wanted to understand how to:</p>

<ul>
  <li>Read from Kafka <em>in parallel</em>.  In Kafka, a topic can have <em>N</em> partitions, and ideally we’d like to parallelize
reading from those <em>N</em> partitions.  This is what the
<a href="https://github.com/apache/incubator-storm/tree/master/external/storm-kafka">Kafka spout in Storm</a> does.</li>
  <li>Write to Kafka from a Spark Streaming application, also <em>in parallel</em>.</li>
</ul>

<p>On top of those questions I also ran into several known issues in Spark and/or Spark Streaming, most of which have been
discussed in the Spark mailing list.  I’ll summarize the current state and known issues of the Kafka integration further
down below.</p>

<h2 id="primer-on-topics-partitions-and-parallelism-in-kafka">Primer on topics, partitions, and parallelism in Kafka</h2>

<p><em>For details see my articles</em>
<em><a href="http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/">Apache Kafka 0.8 Training Deck and Tutorial</a></em>
<em>and</em>
<em><a href="http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/">Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node</a>.</em></p>

<p>Kafka stores data in <em>topics</em>, with each topic consisting of a configurable number of <em>partitions</em>.  The number of
partitions of a topic is very important for performance considerations as this number is an <em>upper bound on the</em>
<em>consumer parallelism</em>: if a topic has <em>N</em> partitions, then your application can only consume this topic with a maximum
of <em>N</em> threads in parallel.  (At least this is the case when you use Kafka’s built-in Scala/Java consumer API.)</p>

<p>When I say “application” I should rather say <em>consumer group</em> in Kafka’s terminology.  A consumer group, identified by
a string of your choosing, is the cluster-wide identifier for a logical consumer application.  All consumers that are
part of the same consumer group share the burden of reading from a given Kafka topic, and only a maximum of <em>N</em> (=
number of partitions) threads across all the consumers in the same group will be able to read from the topic.  Any
excess threads will sit idle.</p>

<div class="note">
<strong>Multiple Kafka consumer groups can be run in parallel:</strong> Of course you can run multiple, independent logical consumer applications against the same Kafka topic.  Here, each logical application will run its consumer threads under a unique consumer group id.  Each application can then also use a different read parallelism (see below).  When I talk about the various ways to configure read parallelism in the following sections, I am referring to the settings of a <em>single</em> one of these logical consumer applications.
</div>

<p>Here are some simplified examples.</p>

<ul>
  <li>Your application uses the consumer group id “terran” to read from a Kafka topic “zerg.hydra” that has
<strong>10 partitions</strong>.
If you configure your application to consume the topic with only <strong>1</strong> thread, then this single thread will read data
from all 10 partitions.</li>
  <li>Same as above, but this time you configure <strong>5</strong> consumer threads.  Here, each thread will read from 2 partitions.</li>
  <li>Same as above, but this time you configure <strong>10</strong> consumer threads.  Here, each thread will read from a single
partition.</li>
  <li>Same as above, but this time you configure <strong>14</strong> consumer threads.  Here, 10 of the 14 threads will read from a
single partition each, and the remaining 4 threads will be idle.</li>
</ul>
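<p>The arithmetic behind these examples can be sketched in a few lines of plain Scala; <code>partitionsPerThread</code> is a made-up illustration helper, not part of the Kafka consumer API:</p>

```scala
// How many of a topic's partitions each consumer thread ends up with, assuming
// partitions are spread as evenly as possible across the threads of a single
// consumer group (made-up helper for illustration, not a Kafka API).
def partitionsPerThread(numPartitions: Int, numThreads: Int): Seq[Int] =
  (0 until numThreads).map { thread =>
    (0 until numPartitions).count(_ % numThreads == thread)
  }

partitionsPerThread(10, 1)   // one thread reads all 10 partitions
partitionsPerThread(10, 5)   // two partitions per thread
partitionsPerThread(10, 14)  // ten threads get one partition each, four get none (idle)
```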

<p>Let’s introduce some real-world complexity in this simple picture – the <em>rebalancing</em> event in Kafka.  Rebalancing is
a lifecycle event in Kafka that occurs when consumers join or leave a consumer group (there are more conditions that
trigger rebalancing but these are not important in this context; see my
<a href="http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/">Kafka training deck</a> for details on rebalancing).</p>

<ul>
  <li>Your application uses the consumer group id “terran” and starts consuming with 1 thread.  This thread will read from
all 10 partitions.  During runtime, you’ll increase the number of threads from 1 to 14.  That is, there is suddenly
a change of parallelism for the same consumer group.  This triggers <em>rebalancing</em> in Kafka.  Once rebalancing
completes, you will have 10 of 14 threads consuming from a single partition each, and the 4 remaining threads will be
idle.  And as you might have guessed, the initial thread will now read from only one partition and will no longer see
data from the other nine.</li>
</ul>

<p>We have now a basic understanding of topics, partitions, and the number of partitions as an upper bound for the
parallelism when reading from Kafka.  But what are the resulting implications for an application – such as a Spark
Streaming job or Storm topology – that reads its input data from Kafka?</p>

<ol>
  <li><strong>Read parallelism:</strong> You typically want to read from all <em>N</em> partitions of a Kafka topic in parallel by consuming
with <em>N</em> threads.  And depending on the data volume you want to spread those threads across different NICs, which
typically means across different machines.  In Storm, this is achieved by setting the parallelism of the
<a href="https://github.com/apache/storm/tree/master/external/storm-kafka">Kafka spout</a> to <em>N</em> via
<code>TopologyBuilder#setSpout()</code>.  The Spark equivalent is a bit trickier, and I will describe how to do this in further
detail below.</li>
  <li><strong>Downstream processing parallelism:</strong>  Once retrieved from Kafka you want to process the data in parallel.
Depending on your use case this level of parallelism may need to differ from the read parallelism.  If your use case
is CPU-bound, for instance, you want to have many more processing threads than read threads;  this is achieved by
shuffling or “fanning out” the data via the network from the few read threads to the many processing threads.  Hence
you pay for the access to more cores with increased network communication, serialization overhead, etc.  In Storm,
you perform such a shuffling via a
<a href="https://storm.apache.org/documentation/Concepts.html">shuffle grouping</a> from the Kafka spout to the next downstream
bolt.  The Spark equivalent is the
<a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#transformations-on-dstreams">repartition</a>
transformation on DStreams.</li>
</ol>

<p>The important takeaway is that it is possible – and often desired – to decouple the level of parallelisms for
<em>reading from Kafka</em> and for <em>processing the data once read</em>.  In the next sections I will describe the various options
you have at your disposal to configure read parallelism and downstream processing parallelism in Spark Streaming.</p>
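<p>The “fanning out” described in point 2 above can be modeled in plain Scala as a toy hash-based shuffle – this is not Spark code, just an illustration of how records read by a few threads are spread across many processing buckets:</p>

```scala
// Toy model of "fanning out" via a shuffle: records that arrived through a
// few read threads are redistributed, by hash, across many processing buckets.
val numProcessingTasks = 20
val records = (1 to 100).map(i => s"record-$i")

// hash-partition the records, as a network shuffle/repartition would
val buckets: Map[Int, Seq[String]] =
  records.groupBy(r => math.abs(r.hashCode) % numProcessingTasks)

// every record lands in exactly one of the processing buckets
val total = buckets.values.map(_.size).sum
```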

<h2 id="reading-from-kafka">Reading from Kafka</h2>

<h3 id="read-parallelism-in-spark-streaming">Read parallelism in Spark Streaming</h3>

<p>Like Kafka, Spark Streaming has the concept of <em>partitions</em>.  It is important to understand that Kafka’s per-topic
partitions are not correlated to the partitions of
<a href="http://spark.apache.org/docs/1.1.0/programming-guide.html">RDDs in Spark</a>.</p>

<p>The <a href="https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala">KafkaInputDStream</a>
of Spark Streaming – aka its Kafka “connector” – uses Kafka’s
<a href="http://kafka.apache.org/documentation.html#highlevelconsumerapi">high-level consumer API</a>, which means you have two
control knobs in Spark that determine read parallelism for Kafka:</p>

<ol>
  <li><strong>The number of input DStreams.</strong>  Because Spark will run one receiver (= task) per input DStream, this means using
multiple input DStreams will parallelize the read operations across multiple cores and thus, hopefully, across
multiple machines and thereby NICs.</li>
  <li><strong>The number of consumer threads per input DStream.</strong>  Here, the same receiver (= task) will run multiple threads.
That is, read operations will happen in parallel but on the same core/machine/NIC.</li>
</ol>

<p>For practical purposes, option 1 is the preferred choice.</p>

<p>Why is that?  First and foremost because reading from Kafka is
normally network/NIC limited, i.e.  you typically do not increase read-throughput by running more threads <em>on the same</em>
<em>machine</em>.  In other words, it is rare though possible that reading from Kafka runs into CPU bottlenecks.  Second, if
you go with option 2 then multiple threads will be competing for the lock to push data into so-called <em>blocks</em> (the <code>+=</code>
method of <code>BlockGenerator</code> that is used behind the scenes is <code>synchronized</code> on the block generator instance).</p>

<div class="note">
<strong>Number of partitions of the RDDs created by the input DStreams:</strong>  The <tt>KafkaInputDStream</tt> will store individual messages received from Kafka into so-called <em>blocks</em>.  From what I understand, a new block is generated every <a href="http://spark.apache.org/docs/latest/configuration.html#spark-streaming">spark.streaming.blockInterval</a> milliseconds, and each block is turned into a partition of the RDD that will eventually be created by the DStream.  If this assumption of mine is true, then the number of partitions in the RDDs created by <tt>KafkaInputDStream</tt> is determined by <tt>batchInterval / spark.streaming.blockInterval</tt>, where <tt>batchInterval</tt> is the time interval at which streaming data will be divided into batches (set via a constructor parameter of <tt>StreamingContext</tt>).  For example, if the batch interval is 2 seconds (default) and the block interval is 200ms (default), your RDD will contain 10 partitions.  Please correct me if I&#8217;m mistaken.
</div>
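<p>Here is the arithmetic from the note above, under the same (unverified) assumption that blocks map 1:1 to RDD partitions:</p>

```scala
// Under the assumption stated above: one block is generated per
// spark.streaming.blockInterval, and each block becomes one RDD partition.
val batchIntervalMs = 2000L  // StreamingContext batch interval (default: 2 s)
val blockIntervalMs = 200L   // spark.streaming.blockInterval (default: 200 ms)

val partitionsPerRdd = batchIntervalMs / blockIntervalMs  // 10
```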

<h4 id="option-1-controlling-the-number-of-input-dstreams">Option 1: Controlling the number of input DStreams</h4>

<p>The example below is taken from the
<a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch">Spark Streaming Programming Guide</a>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">ssc</span><span class="k">:</span> <span class="kt">StreamingContext</span> <span class="o">=</span> <span class="o">???</span> <span class="c1">// ignore for now</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaParams</span><span class="k">:</span> <span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;terran&quot;</span><span class="o">,</span> <span class="cm">/* ignore rest */</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">numInputDStreams</span> <span class="k">=</span> <span class="mi">5</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaDStreams</span> <span class="k">=</span> <span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="n">numInputDStreams</span><span class="o">).</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=&gt;</span> <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(...)</span> <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In this example we create five input DStreams, thus spreading the burden of reading from Kafka across five cores and,
hopefully, five machines/NICs.  (I say “hopefully” because I am not certain whether Spark Streaming task placement
policy will try to place receivers on different machines.)  All input DStreams are part of the “terran” consumer group,
and the Kafka API will ensure that these five input DStreams a) will see all available data for the topic because it
assigns each partition of the topic to an input DStream and b) will not see overlapping data because each partition is
assigned to only one input DStream at a time.  In other words, this setup of “collaborating” input DStreams works
because of the consumer group behavior provided by the Kafka API, which is used behind the scenes by
<code>KafkaInputDStream</code>.</p>

<p>What I have not shown in the example is how many threads are created <em>per input DStream</em>, which is done via parameters
to the <code>KafkaUtils.createStream</code> method (the actual input topic(s) are also specified as parameters of this method).
We will do this in the next section.</p>

<p>But before we continue let me highlight several known issues with this setup and with Spark Streaming in particular,
which are caused on the one hand by current limitations of Spark in general and on the other hand by the current
implementation of the Kafka input DStream in particular:</p>

<blockquote><p>[When you use the multi-input-stream approach I described above, then] those consumers operate in one [Kafka] consumer group, and they try to decide which consumer consumes which partitions.  And it may just fail to do syncpartitionrebalance, and then you have only a few consumers really consuming.  To mitigate this problem, you can set rebalance retries very high, and pray it helps.</p><p>Then arises yet another &#8220;feature&#8221; — if your receiver dies (OOM, hardware failure), you just stop receiving from Kafka!</p><footer><strong>spark-user discussion</strong> <cite><a href="http://markmail.org/message/257a5l3oqyftsjxj">markmail.org/message/&hellip;</a></cite></footer></blockquote>

<p>The “stop receiving from Kafka” issue requires
<a href="http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-and-the-spark-shell-tp3347.html">some explanation</a>.
Currently, when you start your streaming application
via <code>ssc.start()</code> the processing starts and continues indefinitely – even if the input data source (e.g. Kafka) becomes
unavailable.  That is, streams are not able to detect if they have lost connection to the upstream data source and
thus cannot react to this event, e.g. by reconnecting or by stopping the execution.  Similarly, if you lose a receiver
that reads from the data source, then
<a href="http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-and-the-spark-shell-tp3347p3375.html">your streaming application will generate empty RDDs</a>.</p>

<p>This is a pretty unfortunate situation.  One crude workaround is to restart your streaming application whenever it runs
into an upstream data source failure or a receiver failure.  This workaround may not help you though if your use case
requires you to set the Kafka configuration option <code>auto.offset.reset</code> to “smallest” – because of a known bug in
Spark Streaming the resulting behavior of your streaming application may not be what you want.  See the section on
<em>Known issues in Spark Streaming</em> below for further details.</p>

<h4 id="option-2-controlling-the-number-of-consumer-threads-per-input-dstream">Option 2: Controlling the number of consumer threads per input DStream</h4>

<p>In this example we create a <em>single</em> input DStream that is configured to run three consumer threads – in the same
receiver/task and thus on the same core/machine/NIC – to read from the Kafka topic “zerg.hydra”.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">ssc</span><span class="k">:</span> <span class="kt">StreamingContext</span> <span class="o">=</span> <span class="o">???</span> <span class="c1">// ignore for now</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaParams</span><span class="k">:</span> <span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;terran&quot;</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">consumerThreadsPerInputDstream</span> <span class="k">=</span> <span class="mi">3</span>
</span><span class="line"><span class="k">val</span> <span class="n">topics</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;zerg.hydra&quot;</span> <span class="o">-&gt;</span> <span class="n">consumerThreadsPerInputDstream</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">stream</span> <span class="k">=</span> <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(</span><span class="n">ssc</span><span class="o">,</span> <span class="n">kafkaParams</span><span class="o">,</span> <span class="n">topics</span><span class="o">,</span> <span class="o">...)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The <code>KafkaUtils.createStream</code> method is overloaded, so there are a few different method signatures.  In this example
we pick the Scala variant that gives us the most control.</p>

<h4 id="combining-options-1-and-2">Combining options 1 and 2</h4>

<p>Here is a more complete example that combines the previous two techniques:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">ssc</span><span class="k">:</span> <span class="kt">StreamingContext</span> <span class="o">=</span> <span class="o">???</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaParams</span><span class="k">:</span> <span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;terran&quot;</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">numDStreams</span> <span class="k">=</span> <span class="mi">5</span>
</span><span class="line"><span class="k">val</span> <span class="n">topics</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;zerg.hydra&quot;</span> <span class="o">-&gt;</span> <span class="mi">1</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaDStreams</span> <span class="k">=</span> <span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="n">numDStreams</span><span class="o">).</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=&gt;</span>
</span><span class="line">    <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(</span><span class="n">ssc</span><span class="o">,</span> <span class="n">kafkaParams</span><span class="o">,</span> <span class="n">topics</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line">  <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>We are creating five input DStreams, each of which will run a single consumer thread.  If the input topic “zerg.hydra”
has five partitions (or less), then this is normally the best way to parallelize read operations if you care primarily
about maximizing throughput.</p>

<h3 id="downstream-processing-parallelism-in-spark-streaming">Downstream processing parallelism in Spark Streaming</h3>

<p>In the previous sections we covered parallelizing reads from Kafka.  Now we can tackle parallelizing the downstream
data processing in Spark.  Here, you must keep in mind how Spark itself parallelizes its processing.  Like Kafka,
Spark ties the parallelism to the number of (RDD) partitions by running
<a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#resilient-distributed-datasets-rdds"><em>one task per RDD partition</em></a>
(sometimes partitions are still called “slices” in the docs).</p>

<div class="note">
<strong>Just like any Spark application:</strong> Once a Spark Streaming application has received its input data, any
further processing is identical to non-streaming Spark applications.  That is, you use exactly the same tools and
patterns to scale your application as you would for &#8220;normal&#8221; Spark data flows.  See <a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#level-of-parallelism-in-data-processing">Level of Parallelism in Data Processing</a>.
</div>

<p>This gives us two control knobs:</p>

<ol>
  <li><strong>The number of input DStreams</strong>, i.e. what we receive as a result of the previous sections on read parallelism.
This is our starting point, which we can either take as-is or modify with the next option.</li>
  <li><strong>The</strong>
<strong><a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#transformations-on-dstreams">repartition</a></strong>
<strong>DStream transformation.</strong>  It returns a new DStream with an increased or decreased level <em>N</em> of parallelism.  Each
RDD in the returned DStream has exactly <em>N</em> partitions.  DStreams are a continuous series of RDDs, and behind the
scenes <code>DStream.repartition</code> calls <code>RDD.repartition</code>.  The latter “reshuffles the data in the RDD randomly to create
either more or fewer partitions and balance it across them. This always shuffles all data over the network.”  In
other words, <code>DStream.repartition</code> is very similar to Storm’s
<a href="https://storm.apache.org/documentation/Concepts.html">shuffle grouping</a>.</li>
</ol>

<p>Hence <code>repartition</code> is our primary means to decouple read parallelism from processing parallelism.  It allows us to
set the number of processing tasks and thus the number of cores that will be used for the processing.  Indirectly, we
also influence the number of machines/NICs that will be involved.</p>
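<p>As a back-of-the-envelope illustration, here is a tiny model in plain Scala of what <code>repartition</code> does to the data layout.  This is <em>not</em> Spark code – the toy <code>repartition</code> function below only mimics the partition bookkeeping (pool all elements, deal them into exactly <em>N</em> new partitions) so you can see why <em>N</em> downstream tasks become possible:</p>

```scala
// A toy model of RDD.repartition, NOT Spark code: "partitions" are just
// nested Vectors.  All elements are pooled (the shuffle) and then dealt
// round-robin into exactly n new partitions.
def repartition[A](partitions: Vector[Vector[A]], n: Int): Vector[Vector[A]] = {
  val all = partitions.flatten  // the "shuffles all data over the network" step
  Vector.tabulate(n)(i => all.zipWithIndex.collect { case (a, j) if j % n == i => a })
}

// 5 Kafka-sized input partitions, bumped up to 20 processing partitions
val fromKafka = Vector.fill(5)(Vector("msg-a", "msg-b"))
val widened   = repartition(fromKafka, 20)

println(widened.size)          // 20 partitions => up to 20 parallel tasks
println(widened.flatten.size)  // still 10 messages, none lost or duplicated
```

The point of the sketch: the element count is unchanged, only the number of partitions – and hence the number of tasks and cores Spark will use – changes.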

<p>A related DStream transformation is
<a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#transformations-on-dstreams">union</a>.
(This method also exists for <code>StreamingContext</code>, where it returns the unified DStream from multiple DStreams of the same
type and same slide duration.  Most likely you would use the <code>StreamingContext</code> variant.)  A <code>union</code> will return a
<code>UnionDStream</code> backed by a <code>UnionRDD</code>.  A <code>UnionRDD</code> comprises all the partitions of the RDDs being unified, i.e.
if you unite 3 RDDs with 10 partitions each, then your union RDD instance will contain 30 partitions.  In other words,
<code>union</code> will squash multiple DStreams into a single DStream/RDD, but it will not change the level of parallelism.
Whether you need to use <code>union</code> depends on whether your use case requires information from all Kafka partitions
“in one place”, so the decision is driven primarily by semantic requirements.  One such example is when you need to
perform a (global) count of distinct elements.</p>
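<p>The partition arithmetic of <code>union</code> can be sketched in a few lines of plain Scala.  Again this is a toy model, not Spark – <code>FakeRDD</code> is a hypothetical stand-in whose only job is to show that a union keeps every input partition, so parallelism is unchanged:</p>

```scala
// A toy model of UnionRDD, NOT Spark code: a union simply keeps every
// partition of every input RDD, so the level of parallelism does not change.
case class FakeRDD[A](partitions: Vector[Vector[A]])

def union[A](rdds: Seq[FakeRDD[A]]): FakeRDD[A] =
  FakeRDD(rdds.flatMap(_.partitions).toVector)

val three   = Seq.fill(3)(FakeRDD(Vector.fill(10)(Vector(1, 2, 3))))
val unified = union(three)

println(unified.partitions.size)  // 30 = 3 RDDs x 10 partitions each
```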

<div class="note">
Note: <a href="http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-2-RDD-s-only-returns-the-first-one-tp766p5089.html">RDDs are not ordered.</a>  So when you <tt>union</tt> RDDs, the resulting RDD will not have a well-defined ordering either.  If you need a specific ordering, <tt>sort</tt> the RDD explicitly.
</div>

<p>Your use case will determine which knobs and which combination thereof you need to use.  Let’s say your use case is
CPU-bound.  Here, you may want to consume the Kafka topic “zerg.hydra” (which has five Kafka partitions) with a read
parallelism of 5 – i.e. 5 receivers with 1 consumer thread each – but bump up the processing parallelism to 20:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">ssc</span><span class="k">:</span> <span class="kt">StreamingContext</span> <span class="o">=</span> <span class="o">???</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaParams</span><span class="k">:</span> <span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;terran&quot;</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line"><span class="k">val</span> <span class="n">readParallelism</span> <span class="k">=</span> <span class="mi">5</span>
</span><span class="line"><span class="k">val</span> <span class="n">topics</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;zerg.hydra&quot;</span> <span class="o">-&gt;</span> <span class="mi">1</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">kafkaDStreams</span> <span class="k">=</span> <span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="n">readParallelism</span><span class="o">).</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=&gt;</span>
</span><span class="line">    <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(</span><span class="n">ssc</span><span class="o">,</span> <span class="n">kafkaParams</span><span class="o">,</span> <span class="n">topics</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line"><span class="c1">//&gt; collection of five *input* DStreams = handled by five receivers/tasks</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">unionDStream</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">union</span><span class="o">(</span><span class="n">kafkaDStreams</span><span class="o">)</span> <span class="c1">// often unnecessary, just showcasing how to do it</span>
</span><span class="line"><span class="c1">//&gt; single DStream</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">processingParallelism</span> <span class="k">=</span> <span class="mi">20</span>
</span><span class="line"><span class="k">val</span> <span class="n">processingDStream</span> <span class="k">=</span> <span class="n">unionDStream</span><span class="o">.</span><span class="n">repartition</span><span class="o">(</span><span class="n">processingParallelism</span><span class="o">)</span>
</span><span class="line"><span class="c1">//&gt; single DStream but now with 20 partitions</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In the next section we tie all the pieces together and also cover the actual data processing.</p>

<h2 id="writing-to-kafka">Writing to Kafka</h2>

<p>Writing to Kafka should be done from the <code>foreachRDD</code> output operation:</p>

<blockquote><p>The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to a external system, like saving the RDD to files, or writing it over the network to a database. Note that the function func is executed at the driver, and will usually have RDD actions in it that will force the computation of the streaming RDDs.</p></blockquote>

<div class="note">
Note: The remark &#8220;the function <tt>func</tt> is executed at the driver&#8221; does not mean that, say, a Kafka producer itself
would be run from the driver.  Rather, read this remark more as &#8220;the function <tt>func</tt> is <em>evaluated</em> at the
driver&#8221;.  The actual behavior will become more clear once you read <em>Design Patterns for using foreachRDD</em>.
</div>

<p>You should read the section
<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams">Design Patterns for using foreachRDD</a>
in the Spark docs, which explains the recommended patterns as well as common pitfalls when using <code>foreachRDD</code> to talk to
external systems.</p>

<p>In my case, I decided to follow the recommendation to re-use Kafka producer instances across multiple RDDs/batches via
a pool of producers.  I implemented such a pool with <a href="http://commons.apache.org/proper/commons-pool/">Apache Commons Pool</a>,
see <a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/PooledKafkaProducerAppFactory.scala">PooledKafkaProducerAppFactory</a>.
Factories are helpful in this context because of Spark’s execution and serialization model.  The pool itself is provided
to the tasks via a <a href="http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables">broadcast variable</a>.</p>
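<p>To show the idea behind the pool without dragging in the full Commons Pool machinery, here is a deliberately simplified stand-in (the real implementation lives in <code>PooledKafkaProducerAppFactory</code>; <code>SimplePool</code> and its string “producers” are hypothetical, though the method names mirror Commons Pool’s <code>borrowObject</code>/<code>returnObject</code>):</p>

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// A simplified sketch of a producer pool: borrowObject() re-uses an idle
// instance when one is available and only creates a new one lazily;
// returnObject() parks the instance for the next batch instead of closing it.
class SimplePool[A](create: () => A) {
  private val idle = new ConcurrentLinkedQueue[A]()
  def borrowObject(): A = Option(idle.poll()).getOrElse(create())
  def returnObject(a: A): Unit = idle.add(a)
}

// "Producers" are plain strings here; we count how many were actually created.
var created = 0
val pool = new SimplePool(() => { created += 1; s"producer-$created" })

val p1 = pool.borrowObject()  // first borrow creates producer-1
pool.returnObject(p1)         // returned to the pool, not closed
val p2 = pool.borrowObject()  // second borrow re-uses producer-1

println(created)              // 1 -- one producer served both "batches"
```

The same pattern is what keeps the number of Kafka producer instances (and TCP connections) low across the many RDDs a streaming application churns through.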

<p>The end result looks as follows:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">producerPool</span> <span class="k">=</span> <span class="o">{</span>
</span><span class="line">  <span class="c1">// See the full code on GitHub for details on how the pool is created</span>
</span><span class="line">  <span class="k">val</span> <span class="n">pool</span> <span class="k">=</span> <span class="n">createKafkaProducerPool</span><span class="o">(</span><span class="n">kafkaZkCluster</span><span class="o">.</span><span class="n">kafka</span><span class="o">.</span><span class="n">brokerList</span><span class="o">,</span> <span class="n">outputTopic</span><span class="o">.</span><span class="n">name</span><span class="o">)</span>
</span><span class="line">  <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">broadcast</span><span class="o">(</span><span class="n">pool</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="n">stream</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}.</span><span class="n">foreachRDD</span><span class="o">(</span><span class="n">rdd</span> <span class="k">=&gt;</span> <span class="o">{</span>
</span><span class="line">  <span class="n">rdd</span><span class="o">.</span><span class="n">foreachPartition</span><span class="o">(</span><span class="n">partitionOfRecords</span> <span class="k">=&gt;</span> <span class="o">{</span>
</span><span class="line">    <span class="c1">// Get a producer from the shared pool</span>
</span><span class="line">    <span class="k">val</span> <span class="n">p</span> <span class="k">=</span> <span class="n">producerPool</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">borrowObject</span><span class="o">()</span>
</span><span class="line">    <span class="n">partitionOfRecords</span><span class="o">.</span><span class="n">foreach</span> <span class="o">{</span> <span class="k">case</span> <span class="n">tweet</span><span class="k">:</span> <span class="kt">Tweet</span> <span class="o">=&gt;</span>
</span><span class="line">      <span class="c1">// Convert pojo back into Avro binary format</span>
</span><span class="line">      <span class="k">val</span> <span class="n">bytes</span> <span class="k">=</span> <span class="n">converter</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">apply</span><span class="o">(</span><span class="n">tweet</span><span class="o">)</span>
</span><span class="line">      <span class="c1">// Send the bytes to Kafka</span>
</span><span class="line">      <span class="n">p</span><span class="o">.</span><span class="n">send</span><span class="o">(</span><span class="n">bytes</span><span class="o">)</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="c1">// Returning the producer to the pool also shuts it down</span>
</span><span class="line">    <span class="n">producerPool</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">returnObject</span><span class="o">(</span><span class="n">p</span><span class="o">)</span>
</span><span class="line">  <span class="o">})</span>
</span><span class="line"><span class="o">})</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Keep in mind that Spark Streaming creates many RDDs per minute (one per DStream per batch interval), each of which
contains multiple partitions, so you should not create new Kafka producers for each partition, let alone for each
Kafka message.  The setup
above minimizes the creation of Kafka producer instances, and also minimizes the number of TCP connections that are
being established with the Kafka cluster.  You can use this pool setup to precisely control the number of Kafka producer
instances that are being made available to your streaming application (if in doubt, use fewer).</p>

<h2 id="complete-example">Complete example</h2>

<p>The code example below is the gist of my example Spark Streaming application
(<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/spark/KafkaSparkStreamingSpec.scala">see the full code for details and explanations</a>).
Here, I demonstrate how to:</p>

<ul>
  <li>Read Avro-encoded data (the <code>Tweet</code> class) from a Kafka topic in parallel.  We use the optimal read parallelism of
one single-threaded input DStream per Kafka partition.</li>
  <li>Deserialize the Avro-encoded data back into pojos, then serialize them back into binary.  The serialization is
performed via <a href="https://github.com/twitter/bijection">Twitter Bijection</a>.</li>
  <li>Write the results back into a different Kafka topic via a Kafka producer pool.</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
<span class="line-number">48</span>
<span class="line-number">49</span>
<span class="line-number">50</span>
<span class="line-number">51</span>
<span class="line-number">52</span>
<span class="line-number">53</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Set up the input DStream to read from Kafka (in parallel)</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaStream</span> <span class="k">=</span> <span class="o">{</span>
</span><span class="line">  <span class="k">val</span> <span class="n">sparkStreamingConsumerGroup</span> <span class="k">=</span> <span class="s">&quot;spark-streaming-consumer-group&quot;</span>
</span><span class="line">  <span class="k">val</span> <span class="n">kafkaParams</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span>
</span><span class="line">    <span class="s">&quot;zookeeper.connect&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;zookeeper1:2181&quot;</span><span class="o">,</span>
</span><span class="line">    <span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;spark-streaming-test&quot;</span><span class="o">,</span>
</span><span class="line">    <span class="s">&quot;zookeeper.connection.timeout.ms&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;1000&quot;</span><span class="o">)</span>
</span><span class="line">  <span class="k">val</span> <span class="n">inputTopic</span> <span class="k">=</span> <span class="s">&quot;input-topic&quot;</span>
</span><span class="line">  <span class="k">val</span> <span class="n">numPartitionsOfInputTopic</span> <span class="k">=</span> <span class="mi">5</span>
</span><span class="line">  <span class="k">val</span> <span class="n">streams</span> <span class="k">=</span> <span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="n">numPartitionsOfInputTopic</span><span class="o">)</span> <span class="n">map</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=&gt;</span>
</span><span class="line">    <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(</span><span class="n">ssc</span><span class="o">,</span> <span class="n">kafkaParams</span><span class="o">,</span> <span class="nc">Map</span><span class="o">(</span><span class="n">inputTopic</span> <span class="o">-&gt;</span> <span class="mi">1</span><span class="o">),</span> <span class="nc">StorageLevel</span><span class="o">.</span><span class="nc">MEMORY_ONLY_SER</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">_2</span><span class="o">)</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">  <span class="k">val</span> <span class="n">unifiedStream</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">union</span><span class="o">(</span><span class="n">streams</span><span class="o">)</span>
</span><span class="line">  <span class="k">val</span> <span class="n">sparkProcessingParallelism</span> <span class="k">=</span> <span class="mi">1</span> <span class="c1">// You&#39;d probably pick a higher value than 1 in production.</span>
</span><span class="line">  <span class="n">unifiedStream</span><span class="o">.</span><span class="n">repartition</span><span class="o">(</span><span class="n">sparkProcessingParallelism</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">// We use accumulators to track global &quot;counters&quot; across the tasks of our streaming app</span>
</span><span class="line"><span class="k">val</span> <span class="n">numInputMessages</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">accumulator</span><span class="o">(</span><span class="mi">0L</span><span class="o">,</span> <span class="s">&quot;Kafka messages consumed&quot;</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">numOutputMessages</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">accumulator</span><span class="o">(</span><span class="mi">0L</span><span class="o">,</span> <span class="s">&quot;Kafka messages produced&quot;</span><span class="o">)</span>
</span><span class="line"><span class="c1">// We use a broadcast variable to share a pool of Kafka producers, which we use to write data from Spark to Kafka.</span>
</span><span class="line"><span class="k">val</span> <span class="n">producerPool</span> <span class="k">=</span> <span class="o">{</span>
</span><span class="line">  <span class="k">val</span> <span class="n">pool</span> <span class="k">=</span> <span class="n">createKafkaProducerPool</span><span class="o">(</span><span class="n">kafkaZkCluster</span><span class="o">.</span><span class="n">kafka</span><span class="o">.</span><span class="n">brokerList</span><span class="o">,</span> <span class="n">outputTopic</span><span class="o">.</span><span class="n">name</span><span class="o">)</span>
</span><span class="line">  <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">broadcast</span><span class="o">(</span><span class="n">pool</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line"><span class="c1">// We also use a broadcast variable for our Avro Injection (Twitter Bijection)</span>
</span><span class="line"><span class="k">val</span> <span class="n">converter</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">broadcast</span><span class="o">(</span><span class="nc">SpecificAvroCodecs</span><span class="o">.</span><span class="n">toBinary</span><span class="o">[</span><span class="kt">Tweet</span><span class="o">])</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Define the actual data flow of the streaming job</span>
</span><span class="line"><span class="n">kafkaStream</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="k">case</span> <span class="n">bytes</span> <span class="k">=&gt;</span>
</span><span class="line">  <span class="n">numInputMessages</span> <span class="o">+=</span> <span class="mi">1</span>
</span><span class="line">  <span class="c1">// Convert Avro binary data to pojo</span>
</span><span class="line">  <span class="n">converter</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">invert</span><span class="o">(</span><span class="n">bytes</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
</span><span class="line">    <span class="k">case</span> <span class="nc">Success</span><span class="o">(</span><span class="n">tweet</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">tweet</span>
</span><span class="line">    <span class="k">case</span> <span class="nc">Failure</span><span class="o">(</span><span class="n">e</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="c1">// ignore if the conversion failed</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line"><span class="o">}.</span><span class="n">foreachRDD</span><span class="o">(</span><span class="n">rdd</span> <span class="k">=&gt;</span> <span class="o">{</span>
</span><span class="line">  <span class="n">rdd</span><span class="o">.</span><span class="n">foreachPartition</span><span class="o">(</span><span class="n">partitionOfRecords</span> <span class="k">=&gt;</span> <span class="o">{</span>
</span><span class="line">    <span class="k">val</span> <span class="n">p</span> <span class="k">=</span> <span class="n">producerPool</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">borrowObject</span><span class="o">()</span>
</span><span class="line">    <span class="n">partitionOfRecords</span><span class="o">.</span><span class="n">foreach</span> <span class="o">{</span> <span class="k">case</span> <span class="n">tweet</span><span class="k">:</span> <span class="kt">Tweet</span> <span class="o">=&gt;</span>
</span><span class="line">      <span class="c1">// Convert pojo back into Avro binary format</span>
</span><span class="line">      <span class="k">val</span> <span class="n">bytes</span> <span class="k">=</span> <span class="n">converter</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">apply</span><span class="o">(</span><span class="n">tweet</span><span class="o">)</span>
</span><span class="line">      <span class="c1">// Send the bytes to Kafka</span>
</span><span class="line">      <span class="n">p</span><span class="o">.</span><span class="n">send</span><span class="o">(</span><span class="n">bytes</span><span class="o">)</span>
</span><span class="line">      <span class="n">numOutputMessages</span> <span class="o">+=</span> <span class="mi">1</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="n">producerPool</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">returnObject</span><span class="o">(</span><span class="n">p</span><span class="o">)</span>
</span><span class="line">  <span class="o">})</span>
</span><span class="line"><span class="o">})</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Run the streaming job</span>
</span><span class="line"><span class="n">ssc</span><span class="o">.</span><span class="n">start</span><span class="o">()</span>
</span><span class="line"><span class="n">ssc</span><span class="o">.</span><span class="n">awaitTermination</span><span class="o">()</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><em><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/spark/KafkaSparkStreamingSpec.scala">See the full source code for further details and explanations.</a></em></p>

<p>Personally, I really like the conciseness and expressiveness of the Spark Streaming code.  As Bobby Evans and Tom Graves
allude to in their talk, the Storm equivalent of this code is more verbose and comparatively lower level:
The <a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaStormSpec.scala">KafkaStormSpec</a>
in <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> wires and runs a Storm topology that performs
the same computations.  Well, the spec file itself is only a few lines of code once you exclude the code comments,
which I only keep for didactic reasons;  however, keep in mind that in Storm’s Java API you cannot use Scala-like
anonymous functions as I show in the Spark Streaming example above (e.g. the <code>map</code> and <code>foreach</code> steps).  Instead you
must write “full” classes – bolts in plain Storm, functions/filters in Storm Trident – to achieve the
same functionality, see e.g.
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/bolts/AvroDecoderBolt.scala">AvroDecoderBolt</a>.
This feels a bit similar to, say, having to code against Spark’s own API using Java, where juggling with anonymous
functions is IMHO just as painful.</p>

<p>Lastly, I also liked the <a href="http://spark.apache.org/documentation.html">Spark documentation</a>.  It was very easy to get
started, and even some more advanced use is covered (e.g.
<a href="http://spark.apache.org/docs/1.1.0/tuning.html">Tuning Spark</a>).  I still had to browse the mailing list and also dive
into the source code, but the general starting experience was ok – only the Kafka integration part was lacking (hence
this blog post).  Good job to everyone involved maintaining the docs!</p>

<h1 id="known-issues-in-spark-streaming">Known issues in Spark Streaming</h1>

<div class="note">
Update Jan 20, 2015:  Spark 1.2+ includes features such as write ahead logs (WAL) that help to minimize some of the
data loss scenarios for Spark Streaming that are described below.  See
<a href="http://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html">Improved Fault-tolerance and Zero Data Loss in Spark Streaming</a>.
</div>

<p>You might have guessed by now that there are indeed a number of unresolved issues in Spark Streaming.  I try to
summarize my findings below.</p>

<p>On the one hand there are issues due to some confusion about how to correctly read from and write to Kafka, which you
can follow in mailing list discussions such as
<a href="http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Kafka-Receivers-and-Union-td14901.html">Multiple Kafka Receivers and Union</a>
and <a href="http://apache-spark-user-list.1001560.n3.nabble.com/How-to-scale-more-consumer-to-Kafka-stream-td13883.html">How to scale more consumer to Kafka stream </a>.</p>

<p>On the other hand there are apparently still some inherent issues in Spark Streaming as well as Spark itself,
notably with regard to data loss in failure scenarios.  In other words, issues that you do not want to run into in
production!</p>

<ul>
  <li>The current (v1.1) driver in Spark does not recover such raw data that has been received but not processed
(<a href="https://www.mail-archive.com/user@spark.apache.org/msg10572.html">source</a>).  Here, your Spark application
may lose data under certain conditions.  Tathagata Das points out that driver recovery should be fixed in
Spark v1.2, which will be released around the end of 2014.</li>
  <li>The current Kafka “connector” of Spark is based on Kafka’s high-level consumer API.  One effect of this is that Spark
Streaming cannot rely on its <code>KafkaInputDStream</code> to properly replay data from Kafka in case of a downstream data loss
(e.g. Spark machines died).
    <ul>
      <li>Some people even advocate that the current
<a href="http://markmail.org/message/2lb776ta5sq6lgtw">Kafka connector of Spark should not be used in production</a>
because it is based on the high-level consumer API of Kafka. Instead Spark should use the simple consumer API
(like Storm’s Kafka spout does), which allows you to control offsets and partition assignment deterministically.</li>
    </ul>
  </li>
  <li>The Spark community has been working on filling the previously mentioned gap with e.g. Dibyendu
Bhattacharya’s <a href="https://github.com/dibbhatt/kafka-spark-consumer">kafka-spark-consumer</a>.  The latter is a port of
Apache Storm’s <a href="https://github.com/apache/storm/tree/master/external/storm-kafka">Kafka spout</a>, which is based on
Kafka’s so-called simple consumer API, which provides better replaying control in case of downstream failures.</li>
  <li>Even given those volunteer efforts, the Spark team would prefer to not special-case data recovery for Kafka, as their
goal is “to provide strong guarantee, exactly-once semantics in all transformations”
(<a href="https://www.mail-archive.com/user@spark.apache.org/msg10572.html">source</a>), which is understandable.
On the flip side it still feels a bit like a wasted opportunity to not leverage Kafka’s built-in replaying
capabilities.  Tough call!</li>
  <li><a href="https://spark-project.atlassian.net/browse/SPARK-1340">SPARK-1340</a>: In the case of Kafka input DStreams, receivers
are not restarted if the worker running the receiver fails.  So if a worker dies in production, you will
simply miss the data that the receiver(s) were responsible for retrieving from Kafka.</li>
  <li>See also
<a href="http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node">Failure of a Worker Node</a>
for further discussions on data loss scenarios (“lost input data!”) as well as data duplication scenarios (“wrote
output data twice!”).  Applies to Kafka, too.</li>
  <li>Spark’s usage of the Kafka consumer parameter
<a href="http://kafka.apache.org/documentation.html#consumerconfigs">auto.offset.reset</a> is different from Kafka’s semantics.
In Kafka, the behavior of setting <code>auto.offset.reset</code> to “smallest” is that the consumer will automatically reset the
offset to the smallest offset when a) there is no existing offset stored in ZooKeeper or b) there is an existing
offset but it is out of range.  Spark however will <em>always</em> remove existing offsets and then start all the way from
zero again.  This means whenever you restart your application with <code>auto.offset.reset = "smallest"</code>, your application
will completely re-process all available Kafka data.  Doh!
See <a href="http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-and-the-spark-shell-tp3347p3387.html">this discussion</a> and
<a href="http://markmail.org/message/257a5l3oqyftsjxj">that discussion</a>.</li>
  <li><a href="https://spark-project.atlassian.net/browse/SPARK-1341">SPARK-1341</a>: Ability to control the data rate in Spark
Streaming.  This is relevant insofar as, if you are already in trouble because of the other Kafka-related issues
above (e.g. the <code>auto.offset.reset</code> misbehavior), your streaming application must, or thinks it must,
re-process a lot of older data.  But since there is no built-in rate limitation, this may cause your
workers to become overwhelmed and run out of memory.</li>
</ul>
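<p>To illustrate the <code>auto.offset.reset</code> caveat above, here is a minimal sketch (against the Spark 1.1-era, receiver-based API) of how such consumer settings are handed to <code>KafkaUtils.createStream()</code>.  The ZooKeeper host, group id, and topic name are illustrative placeholders:</p>

```scala
// Sketch only: Spark 1.1-era receiver-based Kafka stream with explicit
// consumer settings.  All concrete values below are placeholders.
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def createKafkaStream(ssc: StreamingContext) = {
  val kafkaParams = Map(
    "zookeeper.connect" -> "zookeeper1:2181",
    "group.id" -> "my-consumer-group",
    // Caveat: Spark does not honor Kafka's own semantics for this setting;
    // see the discussion above before relying on it.
    "auto.offset.reset" -> "smallest")
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
}
```

<p>Keep in mind that, as described above, restarting the application with this setting will cause Spark to re-process all available Kafka data.</p>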

<p>Apart from those failure handling and Kafka-focused issues there are also scaling and stability concerns.  Again, please
refer to the
<a href="http://yahoohadoop.tumblr.com/post/98213421641/storm-and-spark-at-yahoo-why-chose-one-over-the-other">Spark and Storm</a>
talk by Bobby and Tom for further details.  Both of them have more experience with Spark than I do.</p>

<p>I also came across <a href="https://www.mail-archive.com/user@spark.apache.org/msg11505.html">one comment</a> that there may be
issues with the (awesome!) G1 garbage collector that is available in Java 1.7.0u4+, but I didn’t run into any such issue
so far.</p>

<h1 id="spark-tips-and-tricks">Spark tips and tricks</h1>

<p>I compiled a list of notes while I was implementing the example code.  This list is by no means a comprehensive
guide, but it may serve you as a starting point when implementing your own Spark Streaming jobs.  It contains
references to the
<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html">Spark Streaming programming guide</a> as well as
information compiled from the spark-user mailing list.</p>

<h2 id="general">General</h2>

<ul>
  <li>When creating your Spark context, pay special attention to the configuration that sets the number of cores used by
Spark.  You must configure enough cores to run both all the required <em>receivers</em> (see below) and the
actual data processing part of your job.  In Spark, each receiver is responsible for exactly one input DStream, and each receiver
(and thus each input DStream) occupies one core – the only exception is when reading from a file stream
(<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#input-dstreams">see docs</a>).  So if, for
instance, your job reads from 2 input streams but only has access to 2 cores, then the data will be read but no
processing will happen.
    <ul>
      <li>Note that in a streaming application, you can create multiple input DStreams to receive multiple streams of data
in parallel.  I demonstrate such a setup in the example job where we parallelize reading from Kafka.</li>
    </ul>
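<p>A minimal sketch of such a setup, assuming the Spark 1.1-era API (the master URL, ZooKeeper host, group id, topic, and receiver count are illustrative): two receivers occupy two cores, leaving the remaining cores for the actual processing.</p>

```scala
// Sketch: each receiver occupies one core, so reserve extra cores
// for the data processing itself.  All concrete values are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaReader")
val ssc = new StreamingContext(conf, Seconds(1))

// Two parallel input DStreams = two receivers = two cores, which leaves
// two of the four cores above for processing.
val kafkaStreams = (1 to 2).map { _ =>
  KafkaUtils.createStream(ssc, "zookeeper1:2181", "my-group", Map("my-topic" -> 1))
}
val unifiedStream = ssc.union(kafkaStreams)
```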
  </li>
  <li>You can use <a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#broadcast-variables">broadcast variables</a> to
share common, read-only variables across machines (see also the relevant section in the
<a href="http://spark.apache.org/docs/1.1.0/tuning.html#broadcasting-large-variables">Tuning Guide</a>).  In the example job I
use broadcast variables to share a) a Kafka producer pool (through which the job writes its output to Kafka) and b)
an injection for encoding/decoding Avro data (from Twitter Bijection).  See also
<a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#passing-functions-to-spark">Passing functions to Spark</a>.</li>
  <li>You can use <a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#accumulators">accumulator</a> variables to track
global “counters” across the tasks of your streaming job (think: Hadoop job counters).  In the example job I use
accumulators to track how many messages in total the job has consumed from and produced to Kafka, respectively.
If you give your accumulators a name (see link), they will also be displayed in the Spark UI.</li>
  <li>
    <p>Do not forget to import the relevant implicits of Spark in general and Spark Streaming in particular:</p>

    <pre><code>// Required to gain access to RDD transformations via implicits.
import org.apache.spark.SparkContext._

// Required when working on `PairDStreams` to gain access to e.g. `DStream.reduceByKey`
// (versus `DStream.transform(rddBatch =&gt; rddBatch.reduceByKey())`) via implicits.
//
// See also http://spark.apache.org/docs/1.1.0/programming-guide.html#working-with-key-value-pairs
import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
</code></pre>
  </li>
  <li>If you’re a fan of <a href="https://github.com/twitter/algebird">Twitter Algebird</a>, then you will like how you can leverage
Count-Min Sketch and friends in Spark.  Typically you’ll use operations such as <code>reduce</code> or <code>reduceByWindow</code> (cf.
<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#transformations-on-dstreams">transformations on DStreams</a>).
The Spark project includes examples for
<a href="https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdCMS.scala">Count-Min Sketch</a>
and
<a href="https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdHLL.scala">HyperLogLog</a>.</li>
  <li>If you need to determine the memory consumption of, say, your fancy Algebird data structure – e.g. Count-Min Sketch,
HyperLogLog, or Bloom Filters – as it is being used in your Spark application, then the <code>SparkContext</code> logs might be
an option for you.  See
<a href="http://spark.apache.org/docs/1.1.0/tuning.html#determining-memory-consumption">Determining Memory Consumption</a>.</li>
</ul>
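<p>The broadcast-variable and accumulator patterns above can be sketched as follows.  Note that <code>ProducerPool</code> and its <code>borrowObject</code>/<code>returnObject</code>/<code>send</code> methods are hypothetical placeholders standing in for whatever shared, read-only resource you need; only <code>broadcast</code>, <code>accumulator</code>, and <code>foreachRDD</code> are actual Spark API:</p>

```scala
// Sketch of the broadcast/accumulator patterns described above.
// `ProducerPool` is a hypothetical placeholder class, not a real API.
val producerPool = ssc.sparkContext.broadcast(new ProducerPool())

// A named accumulator shows up in the Spark UI under its name.
val numInputMessages =
  ssc.sparkContext.accumulator(0L, "Kafka messages consumed")

stream.foreachRDD { rdd =>
  numInputMessages += rdd.count()
  rdd.foreachPartition { partition =>
    val producer = producerPool.value.borrowObject()    // hypothetical method
    partition.foreach(record => producer.send(record))  // hypothetical method
    producerPool.value.returnObject(producer)           // hypothetical method
  }
}
```

<p>Whatever you broadcast must of course be serializable, which is one reason the example job broadcasts a pool factory rather than live producer connections.</p>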

<h2 id="kafka-integration">Kafka integration</h2>

<p>Beyond what I already said in the article above:</p>

<ul>
  <li>You may need to tweak the Kafka consumer configuration of Spark Streaming.  For example, if you need to read
large messages from Kafka you must increase the <code>fetch.message.max.bytes</code> consumer setting.  You can pass such custom
Kafka parameters to Spark Streaming when calling <code>KafkaUtils.createStream(...)</code>.</li>
</ul>
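<p>For example, a sketch of raising <code>fetch.message.max.bytes</code> (the surrounding values are illustrative, and the API shown is the Spark 1.1-era receiver-based one):</p>

```scala
// Sketch: custom Kafka consumer settings passed to Spark Streaming.
// Hostnames, group id, topic, and sizes are placeholders.
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zookeeper1:2181",
  "group.id" -> "my-consumer-group",
  // Allow fetching messages of up to ~10 MB (larger than the default).
  "fetch.message.max.bytes" -> "10485760")
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
```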

<h2 id="testing">Testing</h2>

<ul>
  <li>Make sure you stop the <code>StreamingContext</code> and/or <code>SparkContext</code> (via <code>stop()</code>) within a <code>finally</code> block or your test
framework’s <code>tearDown</code> method, as Spark does not support two contexts running concurrently in the same program (or
JVM?).  (<a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#accumulators">source</a>)</li>
  <li>In my experience, when using sbt, you want to configure your build to fork JVMs during testing.  At least in the case
of <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> the tests must run several threads in
parallel, e.g. in-memory instances of ZooKeeper, Kafka, Spark.  See
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/build.sbt">build.sbt</a> for a starting point.</li>
  <li>Also, if you are on Mac OS X, you may want to disable IPv6 in your JVMs to prevent DNS-related timeouts.  This issue
is unrelated to Spark.  See <a href="https://github.com/miguno/kafka-storm-starter/blob/develop/.sbtopts">.sbtopts</a> for how
to disable IPv6.</li>
</ul>
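<p>A minimal <code>build.sbt</code> fragment for the forking setup described above (a sketch only; see the linked build.sbt for the real configuration):</p>

```scala
// build.sbt (sketch, sbt 0.13 syntax): fork a fresh JVM for tests so that
// in-memory instances of ZooKeeper, Kafka, and Spark do not interfere with
// the sbt JVM itself.
fork in Test := true

// Pass JVM options to the forked test JVM; preferring IPv4 avoids the
// Mac OS X DNS-related timeouts mentioned above.
javaOptions in Test += "-Djava.net.preferIPv4Stack=true"
```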

<h2 id="performance-tuning">Performance tuning</h2>

<ul>
  <li>Make sure you understand the runtime implications of your job if it needs to talk to external systems such as Kafka.
You should read the section <em>Design Patterns for using foreachRDD</em> in the
<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams">Spark Streaming programming guide</a>.
For instance, my example application uses a pool of Kafka producers to optimize writing from Spark Streaming to Kafka.
Here, “optimizing” means sharing the same (few) producers across tasks, notably to reduce the number of new TCP
connections being established with the Kafka cluster.</li>
  <li>Use Kryo for serialization instead of the (slow) default Java serialization (see
<a href="http://spark.apache.org/docs/1.1.0/tuning.html#serialized-rdd-storage">Tuning Spark</a>).  My example enables Kryo
and registers e.g. the Avro-generated Java classes with Kryo to speed up serialization.  See
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/spark/serialization/KafkaSparkStreamingRegistrator.scala">KafkaSparkStreamingRegistrator</a>.
By the way, the use of Kryo is recommended in Spark for the very same reason it is recommended in Storm.</li>
  <li>Configure Spark Streaming jobs to clear persistent RDDs by setting <code>spark.streaming.unpersist</code> to <code>true</code>.
This is likely to reduce the RDD memory usage of Spark, potentially improving GC behavior as well.
(<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#memory-tuning">source</a>)</li>
  <li>Start your performance and scalability (P&amp;S) tests with storage level <code>MEMORY_ONLY_SER</code> (here, RDDs are stored as serialized Java objects, one byte
array per partition).  This is generally more space-efficient than storing deserialized objects, especially when using a fast
serializer like Kryo, but more CPU-intensive to read.  This option is often the best for Spark Streaming jobs.
For local testing you may want to avoid the <code>*_2</code> variants (<code>2</code> = replication factor).</li>
</ul>
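<p>Pulling the tuning knobs above together, here is a configuration sketch.  The registrator class name is the one linked above; everything else is illustrative:</p>

```scala
// Sketch: enable Kryo serialization, register custom classes via the
// registrator linked above, and eagerly unpersist generated RDDs.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
    "com.miguno.kafkastorm.spark.serialization.KafkaSparkStreamingRegistrator")
  .set("spark.streaming.unpersist", "true")

// When persisting DStreams, prefer serialized storage and skip the `_2`
// (replicated) variants for local testing:
// stream.persist(StorageLevel.MEMORY_ONLY_SER)
```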

<h1 id="wrapping-up">Wrapping up</h1>

<p>The full Spark Streaming code is available in <a href="https://github.com/miguno/kafka-storm-starter/">kafka-storm-starter</a>.
I’d recommend beginning with the
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/spark/KafkaSparkStreamingSpec.scala">KafkaSparkStreamingSpec</a>.
This spec launches in-memory instances of Kafka, ZooKeeper, and Spark, and then runs the example streaming application I
covered in this post.</p>

<p>In summary I enjoyed my initial Spark Streaming experiment.  While there are still several problems with Spark/Spark
Streaming that need to be sorted out, I am sure the Spark community will eventually be able to address those.  I have
found the Spark community to be positive and willing to help, and I am looking forward to what will be happening over
the next few months.</p>

<p>Given that Spark Streaming still needs some <a href="http://en.wiktionary.org/wiki/tender_loving_care">TLC</a> to reach Storm’s
capabilities in large-scale production settings, would I use it in 24x7 production?  Most likely not, with the addendum
“not yet”.  So where would I use Spark Streaming in its current state right now?  Here are two ideas, and I am sure
there are even more:</p>

<ol>
  <li>It seems a good fit for rapid prototyping of data flows.  If you run into scalability issues because your data
flows are too large, you can e.g. opt to run Spark Streaming against only a sample or subset of the data.</li>
  <li>What about combining Storm and Spark Streaming?  For example, you could use Storm to crunch the raw, large-scale
input data down to manageable levels, and then perform follow-up analysis with Spark Streaming, benefitting from the
latter’s out-of-the-box support for many interesting algorithms and computations.</li>
</ol>

<p>Thanks to the Spark community for all their great work!</p>

<h1 id="references">References</h1>

<ul>
  <li><a href="https://spark.apache.org/docs/latest/streaming-kafka-integration.html">Spark Streaming + Kafka Integration Guide</a></li>
  <li><a href="http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617">Deep Dive with Spark Streaming</a>, by Tathagata Das, Jun 2013</li>
  <li>Mailing list discussions:
    <ul>
      <li><a href="https://www.mail-archive.com/dev@spark.incubator.apache.org/msg00531.html">Spark Streaming threading model</a>
– also contains some information on how Spark Streaming pushes input data into blocks</li>
      <li><a href="http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-td11258.html">Low Level Kafka Consumer for Spark</a>
– lots of information about the current state of Kafka integration in Spark Streaming, known issues, possible
remedies, etc.</li>
      <li><a href="http://apache-spark-user-list.1001560.n3.nabble.com/How-are-the-executors-used-in-Spark-Streaming-in-terms-of-receiver-and-driver-program-td9336.html">How are the executors used in Spark Streaming in terms of receiver and driver program? </a>
– machines vs. cores vs. executors vs. receivers vs. DStreams in Spark</li>
    </ul>
  </li>
</ul>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Apache Storm 0.9 training deck and tutorial]]></title>
    <link href="http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno"/>
    <updated>2014-09-15T12:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial</id>
    <content type="html"><![CDATA[<p>Today I am happy to share an extensive training deck on <a href="http://storm.incubator.apache.org/">Apache Storm</a> version 0.9,
which covers Storm’s core concepts, operating Storm in production, and developing Storm applications.  I also discuss
data serialization with <a href="http://avro.apache.org/">Apache Avro</a> and
<a href="https://github.com/twitter/bijection">Twitter Bijection</a>.</p>

<!-- more -->

<p><br clear="all" /></p>

<p>The training deck (130 slides) is aimed at developers, operations, and architects.</p>

<p><strong>What the training deck covers</strong></p>

<ol>
  <li>Introducing Storm: history, Storm adoption in the industry, why Storm</li>
  <li>Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism</li>
  <li>Operating Storm: architecture, hardware specs, deploying, monitoring</li>
  <li>Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps (with <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a>), performance and scalability tuning</li>
  <li>Playing with Storm using <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a></li>
</ol>

<p>Many thanks to the <a href="https://engineering.twitter.com/">Twitter Engineering team</a> (the creators of Storm) and the Apache
Storm open source community!</p>

<p>See also:</p>

<ul>
  <li><a href="http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/">Apache Kafka 0.8 training deck and tutorial</a>,
which I published a month ago</li>
  <li><a href="http://www.michael-noll.com/blog/categories/storm/">My other articles on Apache Storm</a></li>
</ul>

<iframe src="//www.slideshare.net/slideshow/embed_code/39087523?rel=0" width="597" height="486" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
<div style="margin-bottom:5px"> <strong> <a href="https://www.slideshare.net/miguno/apache-storm-09-basic-training-verisign" title="Apache Storm 0.9 basic training - Verisign" target="_blank">Apache Storm 0.9 basic training - Verisign</a> </strong> from <strong><a href="http://www.slideshare.net/miguno" target="_blank">Michael Noll</a></strong> </div>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Apache Kafka 0.8 training deck and tutorial]]></title>
    <link href="http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno"/>
    <updated>2014-08-18T12:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial</id>
    <content type="html"><![CDATA[<p>Today I am happy to share an extensive training deck on <a href="http://kafka.apache.org/">Apache Kafka</a> version 0.8, which
covers Kafka’s core concepts, operating Kafka in production, and developing Kafka applications.  I also discuss data
serialization with <a href="http://avro.apache.org/">Apache Avro</a> and <a href="https://github.com/twitter/bijection">Twitter Bijection</a>.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="warning">
<strong>Update 2015-08-01:</strong>
Shameless plug!  Since publishing this Kafka training deck I joined <a href="http://confluent.io/">Confluent Inc.</a> as their Developer Evangelist.  Confluent is the US startup founded in 2014 by the creators of Apache Kafka who developed Kafka while at LinkedIn (see this <a href="http://www.forbes.com/sites/alexkonrad/2015/07/08/confluent-raises-24-million-for-data-streams/">Forbes article about Confluent</a>).  Next to building the world&#8217;s best <a href="http://www.confluent.io/product">stream data platform</a> we are also providing <a href="http://www.confluent.io/training">professional Kafka trainings</a>, which go even deeper as well as beyond my extensive training deck below.
<br />
<br />
I can say with confidence that these are the authoritative and most effective Apache Kafka trainings available on the market.  But you don&#8217;t have to take my word for it &#8211; feel free to <a href="http://www.confluent.io/training">take a look yourself</a> and reach out to us if you are interested. <em>&mdash;Michael</em>
</div>

<p>The training deck (120 slides) is aimed at developers, operations, and architects.</p>

<p><strong>What the training deck covers</strong></p>

<ol>
  <li>Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka</li>
  <li>Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers</li>
  <li>Operating Kafka: architecture, hardware specs, deploying, monitoring, performance and scalability tuning</li>
  <li>Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps (with
<a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a>)</li>
  <li>Playing with Kafka using <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a></li>
</ol>

<p>Many thanks to the <a href="https://engineering.linkedin.com/tags/kafka">LinkedIn Engineering team</a> (the creators of Kafka) and
the Apache Kafka open source community!</p>

<p>See also:</p>

<ul>
  <li><a href="http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/">Apache Storm 0.9 training deck and tutorial</a>,
which I published a month after this training on Kafka</li>
  <li><a href="http://www.michael-noll.com/blog/categories/kafka/">My other articles on Apache Kafka</a></li>
</ul>

<iframe src="//www.slideshare.net/slideshow/embed_code/38083024?rel=0" width="597" height="486" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
<div style="margin-bottom:5px"> <strong> <a href="https://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign" title="Apache Kafka 0.8 basic training - Verisign" target="_blank">Apache Kafka 0.8 basic training - Verisign</a> </strong> from <strong><a href="http://www.slideshare.net/miguno" target="_blank">Michael Noll</a></strong> </div>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Integrating Kafka and Storm: Code Examples and State of the Game]]></title>
    <link href="http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno"/>
    <updated>2014-05-27T16:51:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial</id>
    <content type="html"><![CDATA[<p>The only thing that’s even better than <a href="https://kafka.apache.org/">Apache Kafka</a> and
<a href="http://storm.incubator.apache.org/">Apache Storm</a> is to use the two tools in combination.  Unfortunately, their
integration can be, and still is, a pretty challenging task, at least judging by the many discussion threads on the respective
mailing lists.  In this post I am introducing <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a>,
which contains many code examples that show you how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using
<a href="http://avro.apache.org/">Apache Avro</a> as the data serialization format.  I will also briefly summarize the current
state of their integration on a high level to give you additional context of where the two projects are headed in this
regard.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="note">
  <strong>
    kafka-storm-starter is available at <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> on GitHub.
  </strong>
</div>

<h1 id="state-of-the-integration-game">State of the (integration) game</h1>

<p>For the lazy reader here’s the TL;DR version of Kafka and Storm integration:</p>

<ul>
  <li>You can indeed integrate Kafka 0.8.1.1 (latest stable) and Storm 0.9.1-incubating (latest stable).  I mention this
explicitly only to clear up any confusion whatsoever that may have resulted from you reading the mailing lists.</li>
  <li>The Kafka/Storm integration is, at this time, still more complicated and error prone than it should be.  For this
reason I released the code project <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> (more details
below), which should answer most questions you may have when setting out to connect Storm to Kafka for both reading
and writing data.  As such kafka-storm-starter can serve as a bootstrapping template to build your own real-time data
processing pipelines with Kafka and Storm.</li>
  <li>In the Storm project we are actively working on closing this integration gap.  For instance, we have recently
<a href="https://github.com/apache/incubator-storm/tree/master/external/storm-kafka">merged</a> the
<a href="https://github.com/wurstmeister/storm-kafka-0.8-plus">most popular Kafka spout</a> into the core Storm project.
This Kafka spout will be included in the next version of Storm, 0.9.2-incubating, which is just around the corner.
And the spout is now <a href="https://issues.apache.org/jira/browse/STORM-331">compatible with the latest Kafka 0.8.1.1</a>.
Kudos to <a href="https://twitter.com/ptgoetz">P. Taylor Goetz</a> of HortonWorks for acting as the initial sponsor of the
storm-kafka component!  For more information see
<a href="https://github.com/apache/incubator-storm/tree/master/external/storm-kafka">external/storm-kafka</a> in the Storm code
base.</li>
  <li>The Kafka project is working on an improved, consolidated consumer API for Kafka 0.9.  Take a look at the respective
discussions in the <a href="http://grokbase.com/t/kafka/users/142avhm32j/new-consumer-api-discussion">kafka-user</a> and
<a href="http://grokbase.com/t/kafka/dev/142avhm32j/new-consumer-api-discussion">kafka-dev</a> mailing lists.  The
<a href="https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design">Kafka 0.9 Consumer Rewrite Design</a>
document is also worth a read.  Moving forward this API initiative should simplify interaction with Kafka in general
and integration with storm-kafka in particular.</li>
</ul>

<h1 id="kafka-storm-starter">kafka-storm-starter</h1>

<h2 id="overview-and-quick-start">Overview and quick start</h2>

<p>A few days ago I released <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> as a means to jumpstart
developers interested in integrating Kafka 0.8 and Storm 0.9.  Without further ado let’s take a first quick look.</p>

<p>Before we start we must grab the latest version of the code, which is implemented in Scala 2.10:</p>

<pre><code>$ git clone https://github.com/miguno/kafka-storm-starter.git
$ cd kafka-storm-starter
</code></pre>

<p>We begin the tour by running the test suite:</p>

<pre><code>$ ./sbt test
</code></pre>

<p>Notably this command will run end-to-end tests of Kafka, Storm, and Kafka/Storm integration.  See this shortened version
of the test output:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
<span class="line-number">48</span>
<span class="line-number">49</span>
<span class="line-number">50</span>
<span class="line-number">51</span>
<span class="line-number">52</span>
<span class="line-number">53</span>
<span class="line-number">54</span>
<span class="line-number">55</span>
<span class="line-number">56</span>
<span class="line-number">57</span>
<span class="line-number">58</span>
<span class="line-number">59</span>
<span class="line-number">60</span>
<span class="line-number">61</span>
</pre></td><td class="code"><pre><code class=""><span class="line">[...other tests removed...]
</span><span class="line">
</span><span class="line">[info] KafkaSpec:
</span><span class="line">[info] Kafka
</span><span class="line">[info] - should synchronously send and receive a Tweet in Avro format
</span><span class="line">[info]   + Given a ZooKeeper instance
</span><span class="line">[info]   + And a Kafka broker instance
</span><span class="line">[info]   + And some tweets
</span><span class="line">[info]   + And a single-threaded Kafka consumer group
</span><span class="line">[info]   + When I start a synchronous Kafka producer that sends the tweets in Avro binary format
</span><span class="line">[info]   + Then the consumer app should receive the tweets
</span><span class="line">[info] - should asynchronously send and receive a Tweet in Avro format
</span><span class="line">[info]   + Given a ZooKeeper instance
</span><span class="line">[info]   + And a Kafka broker instance
</span><span class="line">[info]   + And some tweets
</span><span class="line">[info]   + And a single-threaded Kafka consumer group
</span><span class="line">[info]   + When I start an asynchronous Kafka producer that sends the tweets in Avro binary format
</span><span class="line">[info]   + Then the consumer app should receive the tweets
</span><span class="line">[info] StormSpec:
</span><span class="line">[info] Storm
</span><span class="line">[info] - should start a local cluster
</span><span class="line">[info]   + Given no cluster
</span><span class="line">[info]   + When I start a LocalCluster instance
</span><span class="line">[info]   + Then the local cluster should start properly
</span><span class="line">[info] - should run a basic topology
</span><span class="line">[info]   + Given a local cluster
</span><span class="line">[info]   + And a wordcount topology
</span><span class="line">[info]   + And the input words alice, bob, joe, alice
</span><span class="line">[info]   + When I submit the topology
</span><span class="line">[info]   + Then the topology should properly count the words
</span><span class="line">[info] KafkaStormSpec:
</span><span class="line">[info] Feature: AvroDecoderBolt[T]
</span><span class="line">[info]   Scenario: User creates a Storm topology that uses AvroDecoderBolt
</span><span class="line">[info]     Given a ZooKeeper instance
</span><span class="line">[info]     And a Kafka broker instance
</span><span class="line">[info]     And a Storm topology that uses AvroDecoderBolt and that reads tweets from topic testing-input and writes them as-is to topic testing-output
</span><span class="line">[info]     And some tweets
</span><span class="line">[info]     And a synchronous Kafka producer app that writes to the topic testing-input
</span><span class="line">[info]     And a single-threaded Kafka consumer app that reads from topic testing-output
</span><span class="line">[info]     And a Storm topology configuration that registers an Avro Kryo decorator for Tweet
</span><span class="line">[info]     When I run the Storm topology
</span><span class="line">[info]     And I use the Kafka producer app to Avro-encode the tweets and sent them to Kafka
</span><span class="line">[info]     Then the Kafka consumer app should receive the decoded, original tweets from the Storm topology
</span><span class="line">[info] Feature: AvroScheme[T] for Kafka spout
</span><span class="line">[info]   Scenario: User creates a Storm topology that uses AvroScheme in Kafka spout
</span><span class="line">[info]     Given a ZooKeeper instance
</span><span class="line">[info]     And a Kafka broker instance
</span><span class="line">[info]     And a Storm topology that uses AvroScheme and that reads tweets from topic testing-input and writes them as-is to topic testing-output
</span><span class="line">[info]     And some tweets
</span><span class="line">[info]     And a synchronous Kafka producer app that writes to the topic testing-input
</span><span class="line">[info]     And a single-threaded Kafka consumer app that reads from topic testing-output
</span><span class="line">[info]     And a Storm topology configuration that registers an Avro Kryo decorator for Tweet
</span><span class="line">[info]     When I run the Storm topology
</span><span class="line">[info]     And I use the Kafka producer app to Avro-encode the tweets and sent them to Kafka
</span><span class="line">[info]     Then the Kafka consumer app should receive the decoded, original tweets from the Storm topology
</span><span class="line">[info] Run completed in 21 seconds, 852 milliseconds.
</span><span class="line">[info] Total number of tests run: 25
</span><span class="line">[info] Suites: completed 8, aborted 0
</span><span class="line">[info] Tests: succeeded 25, failed 0, canceled 0, ignored 0, pending 0
</span><span class="line">[info] All tests passed.
</span><span class="line">[success] Total time: 22 s, completed May 23, 2014 12:31:09 PM</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>We finish the tour by launching the
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/KafkaStormDemo.scala">KafkaStormDemo</a>
application:</p>

<pre><code>$ ./sbt run
</code></pre>

<p>This demo starts in-memory instances of ZooKeeper, Kafka, and Storm.  It then runs a demo Storm topology that connects
to and reads from the Kafka instance.</p>

<p>You will see output similar to the following (some parts removed to improve readability):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class=""><span class="line">7031 [Thread-19] INFO  backtype.storm.daemon.worker - Worker 3f7f1a51-5c9e-43a5-b431-e39a7272215e for storm kafka-storm-starter-1-1400839826 on daa60807-d440-4b45-94fc-8dd7798453d2:1027 has finished loading
</span><span class="line">7033 [Thread-29-kafka-spout] INFO  storm.kafka.DynamicBrokersReader - Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=127.0.0.1:9092}}
</span><span class="line">7050 [Thread-29-kafka-spout] INFO  backtype.storm.daemon.executor - Opened spout kafka-spout:(1)
</span><span class="line">7051 [Thread-29-kafka-spout] INFO  backtype.storm.daemon.executor - Activating spout kafka-spout:(1)
</span><span class="line">7051 [Thread-29-kafka-spout] INFO  storm.kafka.ZkCoordinator - Refreshing partition manager connections
</span><span class="line">7065 [Thread-29-kafka-spout] INFO  storm.kafka.DynamicBrokersReader - Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=127.0.0.1:9092}}
</span><span class="line">7066 [Thread-29-kafka-spout] INFO  storm.kafka.ZkCoordinator - Deleted partition managers: []
</span><span class="line">7066 [Thread-29-kafka-spout] INFO  storm.kafka.ZkCoordinator - New partition managers: [Partition{host=127.0.0.1:9092, partition=0}]
</span><span class="line">7083 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Read partition information from: /kafka-spout/kafka-storm-starter/partition_0  --&gt; null
</span><span class="line">7100 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - No partition information found, using configuration to determine offset
</span><span class="line">7105 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Starting Kafka 127.0.0.1:0 from offset 18
</span><span class="line">7106 [Thread-29-kafka-spout] INFO  storm.kafka.ZkCoordinator - Finished refreshing
</span><span class="line">7126 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Committing offset for Partition{host=127.0.0.1:9092, partition=0}
</span><span class="line">7126 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Committed offset 18 for Partition{host=127.0.0.1:9092, partition=0} for topology: 47e82e34-fb36-427e-bde6-8cd971db2527
</span><span class="line">9128 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Committing offset for Partition{host=127.0.0.1:9092, partition=0}
</span><span class="line">9129 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Committed offset 18 for Partition{host=127.0.0.1:9092, partition=0} for topology: 47e82e34-fb36-427e-bde6-8cd971db2527</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>At this point Storm is connected to Kafka (more precisely: to the <code>testing</code> topic in Kafka).  The last few lines
above – “Committing offset …” – will repeat again and again, because a) this demo Storm topology only reads
from the Kafka topic but does nothing with the data it reads, and b) we are not sending any data to the
Kafka topic.</p>

<div class="note">
<strong>Note:</strong> This example will actually run <em>two</em> in-memory instances of ZooKeeper:  the first (listening at <tt>127.0.0.1:2181/tcp</tt>) is used by the Kafka instance, the second (listening at <tt>127.0.0.1:2000/tcp</tt>) is automatically started and used by the in-memory Storm cluster.  This is because, when running in local aka in-memory mode, Storm does not allow you to reconfigure or disable its own ZooKeeper instance.
</div>

<p><strong>To stop the demo application you must kill or <code>Ctrl-C</code> the process in the terminal.</strong></p>

<p>You can use
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/KafkaStormDemo.scala">KafkaStormDemo</a>
as a starting point to create your own, “real” Storm topologies that read from a “real” Kafka, Storm, and ZooKeeper
infrastructure.  An easy way to get started with such an infrastructure is by deploying Kafka, Storm, and ZooKeeper via
a tool such as <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a>.</p>

<h2 id="features">Features</h2>

<p>I showcase the following features in kafka-storm-starter.  Note that I focus on showcasing, not necessarily on
being “production ready”.</p>

<ul>
  <li>How to integrate Kafka and Storm.</li>
  <li>How to use <a href="http://avro.apache.org/">Avro</a> with Kafka and Storm for serializing and deserializing the data payload.
For this I leverage <a href="https://github.com/twitter/bijection">Twitter Bijection</a> and
<a href="https://github.com/twitter/chill/">Twitter Chill</a>.</li>
  <li>Kafka standalone code examples
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/KafkaProducerApp.scala">KafkaProducerApp</a>:
A simple Kafka producer app for writing Avro-encoded data into Kafka.
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaSpec.scala">KafkaSpec</a>
puts this producer to use and shows how to use Twitter Bijection to Avro-encode the messages being sent to Kafka.</li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/KafkaConsumerApp.scala">KafkaConsumerApp</a>:
A simple Kafka consumer app for reading Avro-encoded data from Kafka.
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaSpec.scala">KafkaSpec</a>
puts this consumer to use and shows how to use Twitter Bijection to Avro-decode the messages being read from
Kafka.</li>
    </ul>
  </li>
  <li>Storm standalone code examples
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/AvroDecoderBolt.scala">AvroDecoderBolt[T]</a>:
An <code>AvroDecoderBolt[T &lt;: org.apache.avro.specific.SpecificRecordBase]</code> that can be parameterized with the type of
the Avro record <code>T</code> it will deserialize its data to (i.e. no need to write another decoder bolt just because the
bolt needs to handle a different Avro schema).</li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/AvroScheme.scala">AvroScheme[T]</a>:
An <code>AvroScheme[T &lt;: org.apache.avro.specific.SpecificRecordBase]</code> scheme, i.e. a custom
<code>backtype.storm.spout.Scheme</code> to auto-deserialize a spout’s incoming data.  The scheme can be parameterized with
the type of the Avro record <code>T</code> it will deserialize its data to (i.e. no need to write another scheme just
because the scheme needs to handle a different Avro schema).
        <ul>
          <li>You can opt to configure a spout (such as the Kafka spout) with <code>AvroScheme</code> if you want to perform the Avro
decoding step directly in the spout instead of placing an <code>AvroDecoderBolt</code> after the Kafka spout.  You may
want to profile your topology to determine which of the two approaches works best for your use case.</li>
        </ul>
      </li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/TweetAvroKryoDecorator.scala">TweetAvroKryoDecorator</a>:
A custom <code>backtype.storm.serialization.IKryoDecorator</code>, i.e. a custom
<a href="http://storm.incubator.apache.org/documentation/Serialization.html">Kryo serializer for Storm</a>.
        <ul>
          <li>Unfortunately we have not figured out a way to implement a parameterized <code>AvroKryoDecorator[T]</code> variant yet.
(A “straightforward” approach we tried – similar to the other parameterized components – compiled fine but
failed at runtime when running the tests).  Code contributions are welcome!</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Kafka and Storm integration
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/AvroKafkaSinkBolt.scala">AvroKafkaSinkBolt[T]</a>:
An <code>AvroKafkaSinkBolt[T &lt;: org.apache.avro.specific.SpecificRecordBase]</code> that can be parameterized with the type
of the Avro record <code>T</code> it will serialize its data to before sending the encoded data to Kafka (i.e. no
need to write another Kafka sink bolt just because the bolt needs to handle a different Avro schema).</li>
      <li>Storm topologies that read Avro-encoded data from Kafka:
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/KafkaStormDemo.scala">KafkaStormDemo</a> and
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaStormSpec.scala">KafkaStormSpec</a></li>
      <li>A Storm topology that writes Avro-encoded data to Kafka:
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaStormSpec.scala">KafkaStormSpec</a></li>
    </ul>
  </li>
  <li>Unit testing
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/storm/AvroDecoderBoltSpec.scala">AvroDecoderBoltSpec</a></li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/storm/AvroSchemeSpec.scala">AvroSchemeSpec</a></li>
      <li>And more under <a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm">src/test/scala</a></li>
    </ul>
  </li>
  <li>Integration testing
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaSpec.scala">KafkaSpec</a>:
Tests for Kafka, which launch and run against in-memory instances of Kafka and ZooKeeper.</li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/StormSpec.scala">StormSpec</a>:
Tests for Storm, which launch and run against in-memory instances of Storm and ZooKeeper.</li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaStormSpec.scala">KafkaStormSpec</a>:
Tests for integrating Storm and Kafka, which launch and run against in-memory instances of Kafka, Storm, and
ZooKeeper.</li>
    </ul>
  </li>
</ul>
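<p>To make the idea behind these parameterized components concrete, here is a minimal, self-contained Scala sketch of a decoder that is generic in the record type <code>T</code>.  Note that this is an illustration only: the <code>Injection</code> trait below merely mimics the shape of Twitter Bijection’s <code>Injection</code>, and the toy <code>Tweet</code> record with its tab-separated encoding stands in for the real Avro machinery.</p>

<pre><code>// Illustrative sketch only: Injection mimics Twitter Bijection's interface,
// and Tweet stands in for an Avro-generated record class.
trait Injection[A, B] {
  def apply(a: A): B          // encode
  def invert(b: B): Option[A] // decode, may fail
}

case class Tweet(username: String, text: String)

// A trivial codec for demonstration; real code would use Avro binary encoding.
object TweetInjection extends Injection[Tweet, Array[Byte]] {
  def apply(t: Tweet): Array[Byte] = (t.username + "\t" + t.text).getBytes("UTF-8")
  def invert(b: Array[Byte]): Option[Tweet] =
    new String(b, "UTF-8").split("\t", 2) match {
      case Array(u, x) => Some(Tweet(u, x))
      case _           => None
    }
}

// Generic in T: supporting a new record type/schema means supplying a new
// Injection, not writing a new decoder -- the idea behind AvroDecoderBolt[T].
class Decoder[T](inj: Injection[T, Array[Byte]]) {
  def decode(bytes: Array[Byte]): Option[T] = inj.invert(bytes)
}

object DecoderDemo extends App {
  val decoder = new Decoder(TweetInjection)
  val bytes   = TweetInjection(Tweet("alice", "Hello, Kafka!"))
  println(decoder.decode(bytes))
}
</code></pre>

<p>The same pattern underlies <code>AvroScheme[T]</code> and <code>AvroKafkaSinkBolt[T]</code>: a type parameter plus an encode/decode pair is what removes the need to write one bolt or scheme per Avro schema.</p>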

<h2 id="interested-in-more">Interested in more?</h2>

<p>All the gory details are available at <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a>.  Apart from
the code and build script (sbt) I provide information about how to create Cobertura code coverage reports, package
the code, create Java “sources” and “javadoc” jars, generate API docs, integrate with
<a href="http://jenkins-ci.org/">Jenkins CI</a> and <a href="http://www.jetbrains.com/teamcity/">TeamCity</a> build servers, and set up
kafka-storm-starter as a project in IntelliJ IDEA and Eclipse.</p>

<p>Moving forward, my plan is to keep kafka-storm-starter up to date with the latest versions of Kafka and Storm.  The
next version of Storm, 0.9.2, will already simplify the current setup quite a lot.  Of course I welcome any code, docs,
or similar <a href="https://github.com/miguno/kafka-storm-starter#Contributing">contributions you may have</a>.</p>

<h1 id="the-quest-to-get-there">The quest to get there</h1>

<p>Just for the historical record here are some of the gotchas that are addressed by kafka-storm-starter, i.e. problems
you do not need to solve yourself anymore:</p>

<ul>
  <li>Figuring out which Kafka spout in Storm 0.9 works with the latest Kafka 0.8 version.  A lot of people tried in vain to
use a Kafka spout built for Kafka 0.7 to read from Kafka 0.8.  Others didn’t know how to use the available Kafka 0.8
spouts in their code, and so on.  In the case of kafka-storm-starter I opted to go with the spout created by
<a href="https://github.com/wurstmeister/storm-kafka-0.8-plus">wurstmeister</a>, primarily because this spout will soon be the
“official” Kafka spout maintained by the Storm project.  Unfortunately the latest version of the spout was/is not
available in a public Maven repository, so I had to take care of that, too, until Storm 0.9.2 provides the official
version.
    <ul>
      <li>Alternatively you can also try the <a href="https://github.com/HolmesNL/kafka-spout">Kafka spout of HolmesNL</a>, developed by
Mattijs Ugen.  I won’t cover the differences from the wurstmeister spout in detail, but essentially
the wurstmeister spout uses the
<a href="https://kafka.apache.org/documentation.html#simpleconsumerapi">Simple Consumer API</a> of Kafka 0.8 whereas
Mattijs’ spout uses the
<a href="https://kafka.apache.org/documentation.html#highlevelconsumerapi">High Level Consumer API</a>.</li>
    </ul>
  </li>
  <li>Resolving version conflicts between the various software packages.  For instance, Storm 0.9.1 has a transitive
dependency on Kryo 2.17 because Storm depends on an old version of <a href="https://github.com/sritchie/carbonite">Carbonite</a>.
This causes problems when trying to use Twitter Bijection or Twitter Chill, because those require a newer version of
Kryo.  (Apart from that Kryo 2.21 also fixes data corruption issues, so you do want the newer version.)  To address
this issue I filed <a href="https://issues.apache.org/jira/browse/STORM-263">STORM-263</a>, which is included in upcoming
Storm 0.9.2.  Thanks to <a href="https://twitter.com/sritchie">Sam Ritchie</a>, the maintainer of Carbonite, and everyone else
involved for getting the patch included.
Another example is that you must exclude <code>javax.jms:jms</code> (and a few others) when adding Kafka to your build
dependencies.  Yet another is handling Netflix (now: Apache) Curator version conflicts.</li>
  <li>Understanding the various conflicting ZooKeeper versions, and picking a version to go with.  Right now Storm and Kafka
still prefer very old 3.3.x versions of ZooKeeper, whereas in practice many people run 3.4.x in their infrastructure
(e.g. because ZooKeeper 3.4.x is already deployed alongside other infrastructure pieces such as Hadoop clusters
when using commercial Hadoop distributions).</li>
  <li>How to write unit tests for Storm topologies.  A lot of people seem to find references to
<a href="https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java">TestingApiDemo.java</a> while
searching the Internet but struggle with extracting these examples out of the Storm code base and merging them into
their own project.</li>
  <li>How to write Storm topologies in a way that lets you parameterize their components (bolts etc.) with the Avro record type
<code>T</code>, so that you don’t need to write a new bolt only because your Avro schema changes.  The goal of this code is to
show how you can improve the developer/user experience by providing ready-to-use functionality, in this case with
regards to (Avro) serialization/deserialization.  To tackle this you must understand
<a href="https://storm.incubator.apache.org/documentation/Serialization.html">Storm’s serialization system</a> as well
as its run-time behavior.
    <ul>
      <li>While doing that I discovered a (known) Scala bug when I tried to use <code>TypeTag</code> instead of the deprecated <code>Manifest</code>
to implement e.g. <code>AvroDecoderBolt[T]</code>, see <a href="https://issues.scala-lang.org/browse/SI-5919">SI-5919</a>.  This
bug is still not fixed in the latest Scala 2.11.1, by the way.</li>
    </ul>
  </li>
  <li>How to write end-to-end Kafka-&gt;Storm-&gt;Kafka tests.</li>
  <li>And so on…</li>
</ul>
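<p>For reference, the kind of Kafka dependency exclusion mentioned above can be expressed in sbt roughly as follows.  This is an illustrative <code>build.sbt</code> fragment – the version number is only an example, and the exact exclusion list may differ from what kafka-storm-starter actually uses:</p>

<pre><code>// Illustrative build.sbt fragment (version and exclusions are examples).
libraryDependencies += ("org.apache.kafka" % "kafka_2.10" % "0.8.1.1")
  .exclude("javax.jms", "jms")
  .exclude("com.sun.jdmk", "jmxtools")
  .exclude("com.sun.jmx", "jmxri")
</code></pre>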

<h1 id="conclusion">Conclusion</h1>

<p>I hope you find <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> useful to bootstrap your own
Kafka/Storm application.  In the Storm community we are actively working on improving and simplifying the Kafka/Storm
integration, so please stay tuned and, above all, thanks for your patience.  The upcoming 0.9.2 version of Storm is
already a first step in the right direction by bundling a Kafka spout that works with the latest stable version of
Kafka (0.8.1.1 at the time of this writing).</p>

<p>Where to go once you have your Kafka and Storm code ready?  At this point you can use a tool such as
<a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a> and its associated Puppet modules to deploy production Kafka and
Storm clusters and run your own real-time data processing pipelines at scale.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Wirbelsturm: 1-Click Deployments of Storm and Kafka clusters with Vagrant and Puppet]]></title>
    <link href="http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2014-03-17T17:58:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet</id>
    <content type="html"><![CDATA[<p>I am happy to announce the first public release of <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a>, a Vagrant and
Puppet based tool to perform 1-click local and remote deployments, with a focus on big data related infrastructure.
Wirbelsturm’s goal is to make tasks such as “I want to deploy a multi-node Storm cluster” <em>simple</em>, <em>easy</em>, and <em>fun</em>.
In this post I will introduce you to Wirbelsturm, talk a bit about its history, and show you how to launch a multi-node
Storm (or Kafka or …) cluster faster than you can brew an espresso.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="note">
  <strong>
    Wirbelsturm is available at <a href="https://github.com/miguno/wirbelsturm">wirbelsturm</a> on GitHub.
  </strong>
</div>

<p><strong>Update May 27, 2014:</strong>  If you want to build real-time data processing pipelines based on Kafka and Storm, you may be
interested in <a href="http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial/">kafka-storm-starter</a>.  It contains code
examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using Apache Avro as the data
serialization format.</p>

<h1 id="wirbelsturm-quick-start">Wirbelsturm quick start</h1>

<p>This section is an appetizer of what you can do with Wirbelsturm.  Do not worry if something is not immediately obvious
to you – the <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm documentation</a> describes everything in full detail.</p>

<p>Assuming you are using a reasonably powerful computer and have already installed <a href="http://www.vagrantup.com/">Vagrant</a>
(1.4.x – note that 1.5.x is not supported yet) and <a href="https://www.virtualbox.org/">VirtualBox</a>, you can launch a multi-node
<a href="http://storm.incubator.apache.org/">Apache Storm</a> cluster on your local machine with the following commands.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git clone https://github.com/miguno/wirbelsturm.git
</span><span class="line"><span class="nv">$ </span><span class="nb">cd </span>wirbelsturm
</span><span class="line"><span class="nv">$ </span>./bootstrap     <span class="c"># &lt;&lt;&lt; May take a while depending on how fast your Internet connection is.</span>
</span><span class="line"><span class="nv">$ </span>vagrant up      <span class="c"># &lt;&lt;&lt; ...and this step also depends on how powerful your computer is.</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Done – you now have a fully functioning Storm cluster up and running on your computer!  The deployment should have
taken you significantly less time and effort than
<a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/">going through long blog posts</a> or
<a href="http://storm.incubator.apache.org/documentation/Documentation.html">working through the official documentation</a>.  On
top of that, you can now re-deploy your setup wherever and whenever you need it, thanks to automation.</p>

<div class="note">
Note: Running a small, local Storm cluster is just the default example.  You can do much more with Wirbelsturm than this.
</div>

<p>Let’s take a look at which virtual machines back this cluster behind the scenes:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>vagrant status
</span><span class="line">Current machine states:
</span><span class="line">
</span><span class="line">zookeeper1                running <span class="o">(</span>virtualbox<span class="o">)</span>
</span><span class="line">nimbus1                   running <span class="o">(</span>virtualbox<span class="o">)</span>
</span><span class="line">supervisor1               running <span class="o">(</span>virtualbox<span class="o">)</span>
</span><span class="line">supervisor2               running <span class="o">(</span>virtualbox<span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Storm also ships with a web UI that shows you the cluster’s state, e.g. how many nodes it has, whether any processing
jobs (topologies) are being executed, etc.  Wait 20-30 seconds after the deployment is done and then open the Storm UI
at <a href="http://localhost:28080/">http://localhost:28080/</a>.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/wirbelsturm-storm-ui-screenshot.png" title="Storm UI showing two slaves nodes" /></p>

<div class="caption">
Figure 1: The default example of Wirbelsturm deploys a multi-node Storm cluster.  In this screenshot of the Storm UI you can see the two slave nodes &#8211; named <em>supervisor1</em> and <em>supervisor2</em> &#8211; running Storm&#8217;s Supervisor daemons.  The third machine acts as the Storm master node and runs the Nimbus daemon and this Storm UI.  The fourth machine runs ZooKeeper.
</div>

<p>What’s more, Wirbelsturm also allows you to use <a href="http://www.ansible.com/">Ansible</a> to interact with the deployed
machines via its <a href="https://github.com/miguno/wirbelsturm/blob/master/ansible">ansible</a> wrapper script:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>./ansible all -m ping
</span><span class="line">zookeeper1 | success &gt;&gt; <span class="o">{</span>
</span><span class="line">    <span class="s2">&quot;changed&quot;</span>: <span class="nb">false</span>,
</span><span class="line">    <span class="s2">&quot;ping&quot;</span>: <span class="s2">&quot;pong&quot;</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line">supervisor1 | success &gt;&gt; <span class="o">{</span>
</span><span class="line">    <span class="s2">&quot;changed&quot;</span>: <span class="nb">false</span>,
</span><span class="line">    <span class="s2">&quot;ping&quot;</span>: <span class="s2">&quot;pong&quot;</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line">nimbus1 | success &gt;&gt; <span class="o">{</span>
</span><span class="line">    <span class="s2">&quot;changed&quot;</span>: <span class="nb">false</span>,
</span><span class="line">    <span class="s2">&quot;ping&quot;</span>: <span class="s2">&quot;pong&quot;</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line">supervisor2 | success &gt;&gt; <span class="o">{</span>
</span><span class="line">    <span class="s2">&quot;changed&quot;</span>: <span class="nb">false</span>,
</span><span class="line">    <span class="s2">&quot;ping&quot;</span>: <span class="s2">&quot;pong&quot;</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Want to run more Storm slaves?  As long as your computer has enough horsepower you only need to change a single number
in <code>wirbelsturm.yaml</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="yaml"><span class="line"><span class="c1"># wirbelsturm.yaml</span>
</span><span class="line"><span class="l-Scalar-Plain">nodes</span><span class="p-Indicator">:</span>
</span><span class="line">  <span class="l-Scalar-Plain">...</span>
</span><span class="line">  <span class="l-Scalar-Plain">storm_slave</span><span class="p-Indicator">:</span>
</span><span class="line">      <span class="l-Scalar-Plain">count</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">2</span>     <span class="c1"># &lt;&lt;&lt; changing 2 to 4 is all it takes</span>
</span><span class="line">  <span class="l-Scalar-Plain">...</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Then run <code>vagrant up</code> again and shortly after <code>supervisor3</code> and <code>supervisor4</code> will be up and running.</p>

<p>Want to run an <a href="http://kafka.apache.org/">Apache Kafka</a> broker?  Just uncomment the <code>kafka_broker</code> section in your
<code>wirbelsturm.yaml</code> so that it looks similar to the following example snippet (only remove the leading <code>#</code> characters; do not remove any whitespace):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
</pre></td><td class="code"><pre><code class="yaml"><span class="line"><span class="c1"># wirbelsturm.yaml</span>
</span><span class="line"><span class="l-Scalar-Plain">nodes</span><span class="p-Indicator">:</span>
</span><span class="line">  <span class="l-Scalar-Plain">...</span>
</span><span class="line">  <span class="l-Scalar-Plain"># Deploys Kafka brokers.</span>
</span><span class="line">  <span class="l-Scalar-Plain">kafka_broker</span><span class="p-Indicator">:</span>
</span><span class="line">    <span class="l-Scalar-Plain">count</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">1</span>
</span><span class="line">    <span class="l-Scalar-Plain">hostname_prefix</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">kafka</span>
</span><span class="line">    <span class="l-Scalar-Plain">ip_range_start</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">10.0.0.20</span>
</span><span class="line">    <span class="l-Scalar-Plain">node_role</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">kafka_broker</span>
</span><span class="line">    <span class="l-Scalar-Plain">providers</span><span class="p-Indicator">:</span>
</span><span class="line">      <span class="l-Scalar-Plain">virtualbox</span><span class="p-Indicator">:</span>
</span><span class="line">        <span class="l-Scalar-Plain">memory</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">1536</span>
</span><span class="line">      <span class="l-Scalar-Plain">aws</span><span class="p-Indicator">:</span>
</span><span class="line">        <span class="l-Scalar-Plain">instance_type</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">t1.micro</span>
</span><span class="line">        <span class="l-Scalar-Plain">ami</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">ami-86cdb3ef</span>
</span><span class="line">        <span class="l-Scalar-Plain">security_groups</span><span class="p-Indicator">:</span>
</span><span class="line">          <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">wirbelsturm</span>
</span><span class="line">  <span class="l-Scalar-Plain">...</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Then run <code>vagrant up kafka1</code>.  Now you have Kafka running alongside Storm.</p>

<p>Once you have finished playing around, you can tear down all the machines in the cluster again by executing
<code>vagrant destroy</code>.</p>

<h1 id="motivation">Motivation</h1>

<p>Let me use an analogy to explain the motivation behind building Wirbelsturm.  While I assume every last one of us would
like to work somewhat like this…</p>

<p><img src="http://www.michael-noll.com/blog/uploads/neo-cool.png" title="The dream." /></p>

<p>…most of our actual time is spent doing something like this:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/scotty-uncool.png" title="The reality." /></p>

<p>Without any automated deployment tools the task of setting up cluster environments with (say) Storm or Kafka is simply a
very time-consuming, complicated, and – let’s face it – mind-numbingly boring experience.  So the motivation for
Wirbelsturm was really simple: first, minimize frustration, and second, help others.</p>

<p>While these were the primary reasons, there were also secondary considerations:  Wirbelsturm should integrate nicely with
existing deployment infrastructures and the associated skills of Operations teams – that’s why it is so heavily based
on Puppet, though e.g. Chef and Ansible would have been good candidates, too.  Also, it should allow you to perform
local deployments (say, your dev laptop) as well as remote deployments (larger-scale environments, production, etc.) –
that’s why Vagrant was added to the picture.  You should also be able to easily transition from a Wirbelsturm/Vagrant
backed setup to a “real” production setup without having to re-architect your deployment, switch tools, etc.</p>

<p>As such, Wirbelsturm is one of the tools that help make the process of going from “Hey, I have this cool idea” to
“It’s live in production!” as simple, easy, and fun as possible.  A developer should be free to completely screw up
“his” test environment; two developers in the same team should always have the same copy of an environment;  the
integration environment of that team should look and feel the same way, too;  and for sure that should apply to
the production environment as well.</p>

<p>I think at this point the motivation should be pretty clear, and in the section <em>Is Wirbelsturm for me?</em> I list further
examples of what you can do with Wirbelsturm.</p>

<h1 id="current-wirbelsturm-features">Current Wirbelsturm features</h1>

<p>In its first public release Wirbelsturm supports the following high-level features:</p>

<ul>
  <li><strong>Launching machines:</strong>  Wirbelsturm uses Vagrant to launch the machines that make up your infrastructure
  as VMs running locally in VirtualBox (default) or remotely in Amazon AWS/EC2 (OpenStack support is in the works).</li>
  <li><strong>Provisioning machines:</strong>  Machines are provisioned via Puppet.
    <ul>
      <li>Wirbelsturm uses a master-less Puppet setup, i.e. provisioning is ultimately performed through <code>puppet apply</code>.</li>
      <li>Puppet modules are managed via <a href="https://github.com/rodjek/librarian-puppet">librarian-puppet</a>.</li>
    </ul>
  </li>
  <li><strong>(Some) batteries included:</strong>  We maintain a number of standard Puppet modules that work well with Wirbelsturm, some
of which are included in the default configuration of Wirbelsturm.  However you can use any Puppet module with
Wirbelsturm, of course.  See <a href="#supported-puppet-modules">Supported Puppet modules</a> for more information.</li>
  <li><strong>Ansible support:</strong> The <a href="http://www.ansible.com/">Ansible</a> aficionados amongst us can use Ansible to interact with
machines once deployed through Wirbelsturm and Puppet.</li>
  <li><strong>Host operating system support:</strong> Wirbelsturm has been tested with Mac OS X 10.8+ and RHEL/CentOS 6 as host machines.
Debian/Ubuntu should work, too.</li>
  <li><strong>Guest operating system support:</strong> The target OS version for deployed machines is RHEL/CentOS 6 (64-bit).  Amazon
Linux is supported, too.
    <ul>
      <li>For local deployments (via VirtualBox) and AWS deployments Wirbelsturm uses a
<a href="http://puppet-vagrant-boxes.puppetlabs.com/">CentOS 6 box created by PuppetLabs</a>.</li>
      <li>Switching to RHEL 6 only requires specifying a different <a href="http://docs.vagrantup.com/v2/boxes.html">Vagrant box</a>
in <a href="bootstrap">bootstrap</a> (for VirtualBox) or a different AMI in <code>wirbelsturm.yaml</code> (for Amazon
AWS).</li>
    </ul>
  </li>
  <li><strong>When using tools other than Vagrant to launch machines:</strong>  Wirbelsturm-compatible Puppet modules are standard Puppet
modules, so of course they can be used standalone, too.  This way you can deploy against bare metal machines even if
you are not able to or do not want to run Wirbelsturm and/or Vagrant directly.</li>
</ul>

<h1 id="is-wirbelsturm-for-me">Is Wirbelsturm for me?</h1>

<p>Here are some ideas for what you can do with Wirbelsturm:</p>

<ul>
  <li>Evaluate new technologies such as Kafka and Storm in a temporary environment that you can set up and tear
down at will, without having to spend hours and stay late figuring out how to install those tools.
Then tell your boss how hard you worked for it.</li>
  <li>Provide your teams with a consistent look and feel of infrastructure environments from initial prototyping
to development &amp; testing and all the way to production.  Banish “But it does work fine on <em>my</em> machine!” remarks
from your daily standups.  Well, hopefully.</li>
  <li>Save money if (at least some of) these environments run locally instead of in an IaaS cloud or on bare-metal
machines that you would need to purchase first.  Make Finance happy for the first time.</li>
  <li>Create production-like environments for training classes.  Use them to get new hires up to speed.  Or unleash a
<a href="http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html">Chaos Monkey</a> and check how well your
applications, DevOps tools, or technical staff can handle the mess.  Bring coke and popcorn.</li>
  <li>Create sandbox environments to demo your product to customers.  If Sales can run it, so can they.</li>
  <li>Develop and test-drive your or other people’s Puppet modules.  But see also
<a href="https://github.com/puppetlabs/beaker">beaker</a> and <a href="http://serverspec.org/">serverspec</a> if your focus is on
testing.</li>
</ul>

<h1 id="wirbelsturm-in-detail">Wirbelsturm in detail</h1>

<p>Actually, I will <em>not</em> talk a whole lot more about Wirbelsturm itself in this blog post.  If I managed to spark your
interest, feel free to head over to the <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm project page</a> and start
reading – and fooling around – there.  There is also a list of
<a href="https://github.com/miguno/wirbelsturm#supported-puppet-modules">supported Puppet modules</a> in case you’re wondering what
kind of software you can deploy with Wirbelsturm (summary: you can use <em>any</em> Puppet module with Wirbelsturm, but some
are easier to use than others).</p>

<p>Instead I want to spend a few minutes in the next sections talking about what tasks and problems had to be solved to
put Wirbelsturm together, and also share some lessons learned along the way.</p>

<h1 id="the-long-road-of-getting-there">The long road of getting there</h1>

<p>What needed to be done to create the first version of Wirbelsturm?  Here’s a non-comprehensive list; I hope my memory
serves me well.</p>

<ul>
  <li>Packaging the relevant software where official packages (here: RPMs for RHEL 6 family) weren’t available.
    <ul>
      <li>The packaging code is also open sourced at e.g.
<a href="https://github.com/miguno/wirbelsturm-rpm-kafka">wirbelsturm-rpm-kafka</a> and
<a href="https://github.com/miguno/wirbelsturm-rpm-storm">wirbelsturm-rpm-storm</a>.</li>
      <li>Of course the packages also need to be digitally signed for security reasons.</li>
      <li>Kudos to Jordan Sissel for creating <a href="https://github.com/jordansissel/fpm">fpm</a>!</li>
    </ul>
  </li>
  <li>Making this build process deterministic, and publishing that code as open source, too.  That is, don’t use an internal
infrastructure for that because a) people may not be easily able to reproduce it, and b) people may not trust what
strangers put together behind closed doors.
Think: <a href="http://cm.bell-labs.com/who/ken/trust.html">Reflections on Trusting Trust</a>.
    <ul>
      <li>The code to deploy a Wirbelsturm build server – which is used to build and sign the RPMs – is available as
open source at <a href="https://github.com/miguno/puppet-wirbelsturm_build">puppet-wirbelsturm_build</a>.</li>
    </ul>
  </li>
  <li>Understanding how to manage and host a public yum repository on Amazon S3.  <em>Please note that the idea has never</em>
<em>been to become a third-party package maintainer or third-party package repository</em>.  Instead the idea was to
provide just enough so that Wirbelsturm beginners can follow a quick start and have <em>something</em> deployed in a matter
of minutes.  And then let the users leverage the provided tools (see above) to run their own show.
    <ul>
      <li>Hosting some pre-built RPMs on a
<a href="https://github.com/miguno/puppet-wirbelsturm_yumrepos/blob/master/manifests/miguno.pp">public yum repo</a> also
meant checking whether the license of the respective software would allow that, and under which conditions.  I am
not a lawyer and made my best effort to comply with all the respective licenses.  If you have some concerns in
this regard please do let me know!</li>
    </ul>
  </li>
  <li>Learning that RHEL/CentOS 6 ships with significantly outdated versions of many packages, notably
<a href="http://www.supervisord.org/">supervisord</a> (but e.g. also nginx).  Supervisord version 2.x turned out to be a problem
in practice because a properly functioning process supervisor is highly recommended for running Storm &amp; Co. in
production.  Hence supervisord version 3.x needed to be packaged because that version is not yet available for the
RHEL 6 OS family in any “official” repository (e.g. EPEL’s version is outdated, too).</li>
  <li>Speaking of outdated or at least different versions:  Ruby on RHEL/CentOS 6 and Amazon Linux is 1.8.x, whereas on
Mac OS X 10.9 it is 1.9.x.  And then we also have different versions of Puppet etc.  While every version discrepancy is likely to
complicate development and testing, Ruby and Puppet versions were particularly annoying to deal with as they are
“bootstrap” packages that we need as the foundation of any Puppet-based deployments.  I eventually created
<a href="https://github.com/miguno/ruby-bootstrap">ruby-bootstrap</a>, which addresses a part of those problems.</li>
  <li>Many Puppet modules needed to be written.  Where possible I tried to use existing modules as-is but in practice that
goal was hard to hit.  Some modules didn’t really work, some used completely different coding styles, some
supported Hiera while others didn’t, and so on.  I ended up creating several modules from scratch – e.g.
<a href="https://github.com/miguno/puppet-kafka">puppet-kafka</a>, <a href="https://github.com/miguno/puppet-storm">puppet-storm</a>, and
<a href="https://github.com/miguno/puppet-zookeeper">puppet-zookeeper</a> – as well as forking others.  In the latter case,
I tried to contribute back changes to the upstream project where possible and feasible (e.g. I contributed a bug fix
to <a href="https://github.com/electrical/puppet-lib-file_concat/pull/3">puppet-lib-file_concat</a>).  But because my plan was
also to come up with a consistent style and feature support across all Puppet modules – notably Hiera support –
the code of many forks stayed in that particular fork.  Also, some bug fixes or features that I contributed back
upstream were never merged, but since Wirbelsturm wouldn’t function properly without those changes I didn’t have an
alternative to maintaining my own fork.</li>
  <li>I ran into many bugs in many places.
<a href="http://stackoverflow.com/questions/17413598">Vagrant couldn’t consistently deploy to AWS</a>, for instance.  Vagrant plugins
broke amidst Vagrant version upgrades.
<a href="https://github.com/mitchellh/vagrant/issues/2087">RHEL support suddenly stopped working in Vagrant</a>, which I fixed
and contributed back.  I learned that Puppet has, for instance, a very weird way of handling boolean values when
defined in Hiera, and that it requires you to resort to a hacky <code>mkdir -p</code> based workaround using
<a href="http://docs.puppetlabs.com/references/latest/type.html#exec">exec</a> to create directories recursively.  Most of those
problems weren’t huge deals, but in combination they turned out to be death by a thousand cuts.</li>
  <li>Separating Puppet code from Wirbelsturm code.  I didn’t know about
<a href="https://github.com/rodjek/librarian-puppet">librarian-puppet</a> during the first early versions of Wirbelsturm, which
made it more difficult than necessary for Wirbelsturm users to keep their installations up to date.  In the beginning
they needed to change Puppet code in place, i.e. files checked into the Wirbelsturm git repo, so they would often run
into merge conflicts when pulling the latest upstream changes.  This unfortunate problem was resolved once I
introduced librarian-puppet.</li>
  <li>Speeding up local deployments.  If I recall correctly Mitchell Hashimoto – the creator of Vagrant – actually tried
parallel VM creation at some point but his (host) machine was completely overwhelmed by this, and the feature was not
introduced officially into Vagrant.  However, what is still possible is to perform the <em>provisioning</em> of booted VMs
in parallel.  But…the Puppet provisioner of Vagrant does not support that.  I therefore created a
<a href="https://github.com/miguno/wirbelsturm/blob/master/deploy">wrapper shell script</a> based on
<a href="https://github.com/joemiller/sensu-tests/blob/master/para-vagrant.sh">para-vagrant.sh</a> so that you can benefit from
faster local deployments when using Wirbelsturm.</li>
  <li>Adding support for Ansible turned out to be quick and easy, once I understood how to create
<a href="http://docs.ansible.com/intro_dynamic_inventory.html">dynamic inventory scripts</a>.  30 mins total.</li>
  <li>Automating the setup steps for Amazon AWS has been tricky.  Apart from so-so Vagrant support for AWS, there were a
couple of additional problems I ran into.  I remember
<a href="https://forums.aws.amazon.com/message.jspa?messageID=449984">issues with Amazon’s implementation of cloud-init</a> when
using custom AMIs, for instance.  Figuring out how to configure DNS in AWS (currently Wirbelsturm uses
<a href="http://aws.amazon.com/route53/">Amazon Route 53</a>) took some time.
Other tasks I remember include automatically creating restricted IAM users and tighter security groups.
I am still not perfectly happy with the Wirbelsturm user experience when deploying to AWS, and for a number of reasons
listed in the AWS-related documentation of Wirbelsturm a code refactoring may happen in the near future.</li>
  <li>After reading through the various issues listed above you may also understand now why at some point I decided to
postpone supporting any other operating system than the RHEL 6 OS family (which includes CentOS and Amazon Linux).
There were simply too many moving parts, and trying to tackle e.g. Debian/Ubuntu as well might have significantly
delayed the progress on Wirbelsturm.</li>
</ul>

<h1 id="lessons-learned-mistakes-made-along-the-way">Lessons learned: mistakes made along the way</h1>

<p>The wall of shame.  But hey, hindsight is 20/20.</p>

<ul>
  <li>Underestimating the amount of work it eventually took.  See the previous section, and even what I wrote there is not
the complete picture.  Now, thanks to good roadmap planning early adopters of Wirbelsturm were productive from
very early on, and a close feedback loop helped a lot to keep the project on track and moving in the right
direction.  Still the amount of work that actually needed to go into Wirbelsturm was significantly more than
anticipated.  It wasn’t as easy as going through
<a href="http://docs.vagrantup.com/v2/provisioning/puppet_apply.html">Vagrant’s Puppet provisioner documentation</a> and writing
a few lines of Puppet code.  In retrospect, knowing what I know today, Wirbelsturm could have been built <em>much</em>
faster though.</li>
  <li>Not realizing quickly enough how valuable it is to separate code from configuration data in Puppet manifests, using
<a href="http://docs.puppetlabs.com/hiera/1/">Hiera</a>.  Particularly because this is so second-nature when coding in “real”
programming languages instead of Puppet (which is a DSL on top of Ruby).  To my defense I can only say that my
hands-on knowledge of Puppet was very limited at the beginning, and I hadn’t even heard about Hiera (and a lot of
people I talked to didn’t use it).  In retrospect I should have spent more time up-front figuring out what the
Puppet ecosystem had in store to address the code-vs-data problem, because it was pretty obvious right from the
beginning that mixing the two would quickly lead to pain.</li>
  <li>Adding tests to Puppet modules <em>too soon</em> and <em>too late</em>.  At the beginning the Puppet modules were refactored a lot
in the quest to find a reasonably good coding style, writing idiomatic Puppet manifests, etc. – and here dragging
unit tests along the way turned out to be a chore and a waste of time.  So I stopped writing tests.  While that
decision was ok, I made the mistake of postponing the re-introduction of proper tests for too long once the code
across the modules became more stable.  Well, at least <a href="https://github.com/miguno/puppet-kafka">puppet-kafka</a> and
<a href="https://github.com/miguno/puppet-storm">puppet-storm</a> have a good base test setup now thanks to
<a href="https://github.com/garethr/puppet-module-skeleton">puppet-module-skeleton</a>, which means there isn’t any excuse left
to postpone adding meaningful tests.</li>
</ul>

<p>Of course there were more mistakes, but the ones above were the most noteworthy ones. :-)</p>

<h1 id="summary">Summary</h1>

<p>I am really happy that <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a> is finally available as free and open source
software.  Hopefully it will help you to quickly get up and running with technologies such as Graphite, Kafka, Storm,
Redis, and ZooKeeper.  Enjoy!</p>

<p><strong>Update May 27, 2014:</strong>  If you want to build real-time data processing pipelines based on Kafka and Storm, you may be
interested in <a href="http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial/">kafka-storm-starter</a>.  It contains code
examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using Apache Avro as the data
serialization format.</p>

<h1 id="related-work">Related work</h1>

<p>The following projects are similar to Wirbelsturm:</p>

<ul>
  <li><a href="https://github.com/nathanmarz/storm-deploy">storm-deploy</a> – Deploys Storm clusters to AWS, by Nathan Marz, the
creator of Storm.  storm-deploy has been around for much longer than Wirbelsturm, so it might be more mature.  It is a
nice example of a deployment tool implemented in Clojure, using <a href="https://github.com/pallet/pallet">pallet</a> and
<a href="http://www.jclouds.org/">jclouds</a>.  Because of jclouds you should also be able to deploy to clouds other than AWS,
though I haven’t found examples or documentation references on how to do so.  (If you have pointers please let me
know.)  Unfortunately, its Clojure roots may make storm-deploy less popular within Operations teams, who typically
are more familiar with tools such as <a href="http://puppetlabs.com/">Puppet</a>, <a href="http://www.getchef.com/">Chef</a>, or
<a href="http://www.ansible.com/">Ansible</a>.  Also, storm-deploy seems to address only Storm deployments, and you require
additional tools to deploy any other infrastructure pieces that you require (or enhance storm-deploy).</li>
  <li><a href="https://github.com/nathanmarz/kafka-deploy">kafka-deploy</a> – Deploys Kafka to AWS, also by Nathan Marz.  It has the
same pros and cons as storm-deploy.  Unfortunately, kafka-deploy has not seen any updates in two years (since Feb 2012),
which is around the time it was originally published.</li>
</ul>

<p>Commercial Hadoop vendors have also begun to integrate Storm into their product offerings:</p>

<ul>
  <li><a href="http://hortonworks.com/hadoop/storm/">Apache Storm at Hortonworks</a> – Hortonworks are working on Storm support for
their product line.  In this context they have added Storm support to their so-called
<a href="http://hortonworks.com/sandbox/">Hortonworks Sandbox</a>, which is a self-contained virtual machine with Hadoop &amp; Co.
pre-configured.</li>
  <li>If I recall correctly <a href="http://www.mapr.com/">MapR</a> were also looking at integrating Storm into their platform, but
I could not find more concrete details apart from a few
<a href="http://www.mapr.com/blog/storm-is-gearing-up-to-join-the-apache-foundation">news articles and blog posts</a>.</li>
</ul>

<p>Another way of deploying Storm is via platforms such as
<a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">Hadoop YARN</a> and
<a href="https://mesos.apache.org/">Apache Mesos</a>:</p>

<ul>
  <li><a href="https://github.com/yahoo/storm-yarn">storm-on-yarn</a> – Enables Storm clusters to be deployed into machines managed
by Hadoop YARN. The project says it is still a work in progress.</li>
  <li><a href="https://github.com/nathanmarz/storm-mesos">storm-mesos</a> – Storm integration with the Mesos cluster resource manager.
The project says storm-mesos runs in production at Twitter.</li>
</ul>

<p>Lastly, there are also a few open source Puppet modules for Hadoop, Kafka, Storm, ZooKeeper &amp; Co.  I don’t want to give
a comprehensive overview of these modules in this post, but you can head over to places such as
<a href="https://forge.puppetlabs.com/">PuppetForge</a> and <a href="https://github.com/">GitHub</a> and take a look yourself.  Feel free to
drop those modules into Wirbelsturm and give them a go!</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Of Algebirds, Monoids, Monads, and other Bestiary for Large-Scale Data Analytics]]></title>
    <link href="http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-12-02T16:45:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics</id>
    <content type="html"><![CDATA[<p>Have you ever asked yourself what monoids and monads are, and particularly why they seem to be so attractive in the
field of large-scale data processing?  Twitter recently open-sourced <a href="https://github.com/twitter/algebird">Algebird</a>,
which provides you with a JVM library to work with such algebraic data structures.  Algebird is already being used in
Big Data tools such as <a href="https://github.com/twitter/scalding">Scalding</a> and
<a href="https://github.com/twitter/summingbird">SummingBird</a>, which means you can use Algebird as a mechanism to plug your
own data structures – e.g. Bloom filters, HyperLogLog – directly into large-scale data processing platforms such as
<a href="http://hadoop.apache.org/">Hadoop</a> and <a href="http://storm-project.net/">Storm</a>.  In this post I will show you how to get
started with Algebird, introduce you to monoids and monads, and address the question why you should get interested in
those in the first place.</p>

<!-- more -->

<h1 id="goal-of-this-article">Goal of this article</h1>

<p>The main goal of this article is to spark your <em>curiosity</em> and <em>motivation</em> for
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">Algebird</a>
and the concepts of monoids, monads, and category theory in general.  In other words, I want to address the questions
<em>“What’s the big deal?  Why should I care?  And how can these theoretical concepts help me in my daily work?”</em></p>

<p>While I will explain a little bit what the various concepts such as monoids are, this is not the focus of this post.
If in doubt I will rather err on the side of grossly oversimplifying a topic to get the point across even at the
expense of correctness.  There are much better resources available online and offline that can teach you the full
details of the various items I will discuss here.  That being said, I compiled a list of references at the end of this
article so that you have a starting point to understand the following concepts in full detail, and with more accurate
and thorough explanations than I could come up with.</p>

<h1 id="motivating-example">Motivating example</h1>

<h2 id="a-first-look-at-algebird">A first look at Algebird</h2>

<p>Here is a simple example of what you can do with monoids and monads, based on the starter example in
<a href="https://github.com/twitter/algebird/">Algebird</a>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">20</span><span class="o">)</span>
</span><span class="line"><span class="n">res1</span><span class="k">:</span> <span class="kt">com.twitter.algebird.Max</span><span class="o">[</span><span class="kt">Int</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Alternative, Java-like (read: ugly) syntax for readers unfamiliar with Scala.</span>
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">10</span><span class="o">).+(</span><span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)).+(</span><span class="nc">Max</span><span class="o">(</span><span class="mi">20</span><span class="o">))</span>
</span><span class="line"><span class="n">res2</span><span class="k">:</span> <span class="kt">com.twitter.algebird.Max</span><span class="o">[</span><span class="kt">Int</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>What is happening here?  Basically, we are boxing three numbers, the <code>Int</code> values <code>10</code>, <code>30</code>, and <code>20</code>, into <code>Max</code>, and then
we are “adding” them.  The behavior of <code>Max[T]</code> turns the <code>+</code> operator into a function that returns the largest boxed
<code>T</code>.</p>
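
<p>To get an intuition for what <code>Max[T]</code> is doing, here is a minimal sketch of the underlying idea in plain
Scala.  Note that <code>MyMax</code> is a hypothetical stand-in I made up for illustration – it is <em>not</em> Algebird’s
actual implementation, which additionally wires the <code>+</code> into a proper semigroup/monoid type class:</p>

```scala
// Hypothetical sketch of the idea behind Algebird's Max -- not the real code.
// "Adding" two boxed values keeps the larger one; + is associative, which is
// what makes this a semigroup (and, with an identity element, a monoid).
case class MyMax[T](get: T)(implicit ord: Ordering[T]) {
  def +(that: MyMax[T]): MyMax[T] =
    if (ord.gteq(get, that.get)) this else that
}

val biggest = MyMax(10) + MyMax(30) + MyMax(20)
println(biggest.get) // prints 30
```

<p>Because <code>+</code> is associative, the three values can be combined in any grouping – which is exactly the property
that lets frameworks like Hadoop or Storm merge partial results computed on different machines.</p>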

<p>Conceptually this is similar to the following native Scala code:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// This is native Scala.</span>
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="mi">10</span> <span class="n">max</span> <span class="mi">30</span> <span class="n">max</span> <span class="mi">20</span>
</span><span class="line"><span class="n">res3</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">30</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Alternative, Java-like syntax.</span>
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="mf">10.</span><span class="n">max</span><span class="o">(</span><span class="mi">30</span><span class="o">).</span><span class="n">max</span><span class="o">(</span><span class="mi">20</span><span class="o">)</span>
</span><span class="line"><span class="n">res4</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">30</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>At this point you may ask, “Alright, what is the big deal?  The native Scala example looks actually better!”</p>

<p>At least, that is what I thought myself at first.  But the simplicity of this example is deceptive.  There is a lot
more to it than meets the eye.</p>

<h2 id="beyond-trivial-examples">Beyond trivial examples</h2>

<p>Admittedly, the first example used a very dull data structure, <code>Int</code>.  Any programming language comes with built-in
functionality to add two integers, right?  So you would hardly be convinced of the value of a tool like Algebird if all
it allowed you to do was <code>4 + 3 = 7</code>, particularly when doing those simple things would require you to understand
sophisticated concepts such as monoids and monads.  Too much effort for too little value, I would say!</p>

<p>So let me use a different example because adding <code>Int</code> values is indeed trivial.  Imagine that you are working on
large-scale data analytics that make heavy use of <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a>.  Your
applications are based on highly-parallel tools such as Hadoop or Storm, and they create and work with many such Bloom
filters in parallel.  Now the money question is: <em>How do you combine or add two Bloom filters in an easy way?</em>
(This is where monoids come into play.)</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">first</span> <span class="k">=</span> <span class="nc">BloomFilter</span><span class="o">(...)</span>
</span><span class="line"><span class="k">val</span> <span class="n">second</span> <span class="k">=</span> <span class="nc">BloomFilter</span><span class="o">(...)</span>
</span><span class="line"><span class="n">first</span> <span class="o">+</span> <span class="n">second</span> <span class="o">==</span> <span class="n">uh</span><span class="o">?</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>And what about performing other operations on those Bloom filter instances, notably <em>data processing pipelines</em> based on
common functions such as <code>map</code>, <code>flatMap</code>, <code>foldLeft</code>, <code>reduceLeft</code>?  (And this is where monads come into play.)</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">filters</span> <span class="k">=</span> <span class="nc">Seq</span><span class="o">[</span><span class="kt">BloomFilter</span><span class="o">](...)</span>
</span><span class="line"><span class="k">val</span> <span class="n">summary</span> <span class="k">=</span> <span class="n">filters</span> <span class="n">flatMap</span> <span class="o">{</span> <span class="cm">/* magic happens here */</span> <span class="o">}</span> <span class="n">reduceLeft</span> <span class="o">{</span> <span class="cm">/* more magic */</span> <span class="o">}</span> <span class="o">...</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>And what about combining two
<a href="http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/">HyperLogLog</a>
instances?</p>

<p>Intuitively we could say that the general idea of “adding” two Bloom filters is quite similar to how we would add two
sets <em>A</em> and <em>B</em>, where adding would mean creating the union set of <em>A</em> and <em>B</em>.</p>
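To make that intuition concrete, here is a plain-Scala sketch of “adding as union”.  Ordinary <code>Set</code>s stand in for Bloom filters here (an assumption for illustration only; real Bloom filters would combine their bit vectors in a similarly associative fashion):

```scala
// "Adding" set-like summaries means taking their union.
// Plain Sets stand in for Bloom filters in this sketch.
val filters = Seq(Set("alice"), Set("bob"), Set("alice", "carol"))

// Combine many summaries pairwise, just like summing a list of Ints.
val combined = filters.reduceLeft(_ union _)

assert(combined == Set("alice", "bob", "carol"))
```

The key point is that <code>union</code> behaves like <code>+</code> does for numbers: it takes two values of the same type and yields another value of that type.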

<p>Now Algebird addresses this problem of abstraction.  In a nutshell, if you can turn a data structure into a monoid
(or semigroup, or …), then Algebird allows you to put it to good use.  You can then work with your data structure
just as nicely as you are so used to when dealing with <code>Int</code>, <code>Double</code> or <code>List</code>.  And you can use it with large-scale
data processing tools such as Hadoop and Storm, too.</p>

<h2 id="wait-a-minute">Wait a minute!</h2>

<p>In case you are asking yourself the following question (which I did):  Is the magic of Algebird simply something like a
custom <code>Max[Int]</code> class that defines a <code>+()</code> method, similar to the following snippet but actually with a bounded
type parameter <code>T : Ordering[T]</code>?  (If you do not understand the latter, take a look at
<a href="http://stackoverflow.com/questions/17597961">this StackOverflow thread</a>.)</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Is Algebird implemented like this? (hint: nope)</span>
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="k">case</span> <span class="k">class</span> <span class="nc">Max</span><span class="o">(</span><span class="k">val</span> <span class="n">i</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="o">{</span> <span class="k">def</span> <span class="o">+(</span><span class="n">that</span><span class="k">:</span> <span class="kt">Max</span><span class="o">)</span> <span class="k">=</span> <span class="k">if</span> <span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">that</span><span class="o">.</span><span class="n">i</span><span class="o">)</span> <span class="k">this</span> <span class="k">else</span> <span class="n">that</span> <span class="o">}</span>
</span><span class="line"><span class="n">defined</span> <span class="k">class</span> <span class="nc">Max</span>
</span><span class="line">
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">20</span><span class="o">)</span>
</span><span class="line"><span class="n">res5</span><span class="k">:</span> <span class="kt">Max</span> <span class="o">=</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The answer is yes and no.  “Yes” because it <em>is</em> similar.  And “no” because the implementation is quite different from
the above analogy, and provides you with significantly more algebra-fu (but again, it has the same spirit).</p>

<h2 id="what-we-want-to-do">What we want to do</h2>

<p>Our goal in this post is to build a data structure <code>TwitterUser</code> accompanied by a <code>Max[TwitterUser]</code> monoid view of it.
We want to use the two to implement the analytics of a fictional popularity contest on Twitter, like so:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Let&#39;s have a popularity contest on Twitter.  The user with the most followers wins!</span>
</span><span class="line"><span class="k">val</span> <span class="n">barackobama</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;BarackObama&quot;</span><span class="o">,</span> <span class="mi">40267391</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">katyperry</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;katyperry&quot;</span><span class="o">,</span> <span class="mi">48013573</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">ladygaga</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;ladygaga&quot;</span><span class="o">,</span> <span class="mi">40756470</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">miguno</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;miguno&quot;</span><span class="o">,</span> <span class="mi">731</span><span class="o">)</span> <span class="c1">// I participate, too.  Olympic spirit!</span>
</span><span class="line"><span class="k">val</span> <span class="n">taylorswift</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;taylorswift13&quot;</span><span class="o">,</span> <span class="mi">37125055</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">winner</span><span class="k">:</span> <span class="kt">Max</span><span class="o">[</span><span class="kt">TwitterUser</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">(</span><span class="n">barackobama</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">katyperry</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">ladygaga</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">miguno</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">taylorswift</span><span class="o">)</span>
</span><span class="line"><span class="n">assert</span><span class="o">(</span><span class="n">winner</span><span class="o">.</span><span class="n">get</span> <span class="o">==</span> <span class="n">katyperry</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Figuring out how to do this with monoids, monads, and Algebird is the objective of this article.</p>

<p>Of course, instead of using Algebird and monoids we could also project the number-of-followers field from each user
and perform any such analytics directly on the <code>Int</code> values.  That’s not the point however.  I intentionally wanted a
very simple example use case because, as you will see, there is so much to understand about what’s going on behind the
scenes that any further distraction should be avoided.  At least, that was my personal experience. :-)</p>
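As a teaser, here is one way such a <code>Max</code> could be sketched in self-contained, plain Scala using an <code>Ordering</code>.  This is explicitly <em>not</em> how Algebird does it, and not how the rest of this article will build it; the names simply mirror the snippet above:

```scala
// A self-contained sketch, NOT Algebird's implementation:
// a Max wrapper over any type that has an Ordering.
case class TwitterUser(screenName: String, numFollowers: Int)

// Users are compared by their follower count.
implicit val byFollowers: Ordering[TwitterUser] = Ordering.by(_.numFollowers)

case class Max[T](get: T) {
  def +(that: Max[T])(implicit ord: Ordering[T]): Max[T] =
    if (ord.gteq(this.get, that.get)) this else that
}

val winner = Max(TwitterUser("katyperry", 48013573)) +
  Max(TwitterUser("miguno", 731)) +
  Max(TwitterUser("BarackObama", 40267391))
assert(winner.get.screenName == "katyperry")
```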

<h1 id="my-journey-down-the-rabbit-hole">My journey down the rabbit hole</h1>

<p><em>This section is more for entertainment.  Feel free to skip it.</em></p>

<h2 id="how-this-post-started">How this post started</h2>

<p>I am following a few Twitter folks on, well, Twitter such as Dmitriy Ryaboy
(<a href="https://twitter.com/squarecog">@squarecog</a>) and Oscar Boykin (<a href="https://twitter.com/posco">@posco</a>).  And lately
they talked a lot about how data analytics at Twitter is powered by “monoids” and “monads”, and how tools such as
<a href="https://github.com/twitter/algebird/">Algebird</a> and <a href="https://github.com/twitter/scalding">Scalding</a> form the
code foundation of their analytics infrastructure.</p>

<p>Here is an example of such a conversation:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/twitter-monad-conversation.png" title="Monads. Monads everywhere!" /></p>

<p>(Link to full image <a href="http://makeameme.org/media/created/Monads-monads-everywhere.jpg">“Monads, Monads Everywhere!”</a>)</p>

<p>A <em>mo</em>-what?  And how come those things are apparently spreading like a contagious disease throughout their data
analytics code?</p>

<p>Another trigger was a discussion involving Ted Dunning of MapR (<a href="https://twitter.com/ted_dunning">@ted_dunning</a>) and
his work on a new data structure called <a href="https://github.com/tdunning/t-digest">t-digest</a>:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/twitter-associative-conversation.png" title="Is t-digest associative?" /></p>

<p>Why was Ted being asked whether <code>t-digest</code> is associative?  And how does all this relate to semigroups and monoids?  And
finally, what the heck are semigroups in the first place?</p>

<p>Now a dangerous series of events began to take place on my side.</p>

<p>First I thought, <em>“Hey, coincidentally I have started to pick up Scala around a month ago.  Given that Algebird is</em>
<em>written in Scala this might turn into an interesting finger exercise.”</em> (Note my focus on “finger exercise”.)
On top of that I knew that the use of Algebird extends to other interesting big data tools such as Storm and Scalding,
so it could turn out that I would not only learn something for learning’s sake but that I could put it to practical use
in my daily work, too.  The combination of these two factors – general interest and practical applicability –
eventually caused me to give in to my curiosity and decide to put “an hour or two” aside to read up on those monoid
thingies and figure out whether and how I could leverage Algebird for my own purposes.</p>

<p>You might notice at this point that it all started quite innocently.  But what I did not realize at that moment was that
I was opening <a href="https://en.wikipedia.org/wiki/Pandora%27s_box">Pandora’s box</a> on an otherwise quiet and peaceful
Swiss weekend…</p>

<h2 id="scala-functors-monoids-monads-category-theory-implicits-type-classes-aaargh">Scala, functors, monoids, monads, category theory, implicits, type classes, aaargh!</h2>

<p>What started as a seemingly innocent journey down a calm park lane quickly turned into the opening of the gates of
functional programming and category theory hell.  Not only did I struggle to understand what functors, semigroups,
monoids, and other algebraic structures that only a mother could love actually are.  No, on top of that I quickly
realized that how these things can be implemented in Scala in general and in Algebird in particular meant I had to take
my beginner Scala-fu to a whole new level.  In the end it took me the full weekend to grasp all those concepts to the
point where I’d say right now that I know enough to be dangerous.</p>

<p>The learning curve reminded me a lot of the following famous picture:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/editor-learning-curve.png" title="Learning curve for some common editors" /></p>

<div class="caption">
Figure 1: Learning curve for some common editors.  Image courtesy of <a href="http://josemdev.com/2012/11/learning-vim-introduction/">Jose M. Gilgado</a>.
</div>

<p>And it did feel like the <code>vi</code> curve – the brick wall experience.  What else could it be, right?  That being said I
still fear that, after having hit and finally made it over that initial brick wall, it may still spiral out of control
again like the Emacs curve. :-)</p>

<p>Picture me sitting in front of my keyboard, frantically interacting with my favorite search engine, StackOverflow,
Wikipedia, the usual suspects of Scala books, and what not:</p>

<p>Me: “What is a monad?”</p>

<p><em>Internet: “A <a href="http://stackoverflow.com/questions/3870088">monad is just a monoid in the category of endofunctors</a>.”</em></p>

<p>Me: “Hmm, ok.  So what is a monoid?”</p>

<p><em>Internet: “A monoid is a semigroup with identity.”</em></p>

<p>Me: “Then what is a semigroup??” (number of question marks increases with anxiety level)</p>

<p><em>Internet: “An algebraic structure consisting of a set together with an associative binary operation.”</em></p>

<p>Me: “Alright, I see the mathematical definition and I do see a soup of Greek letters.  Still, what <em>is</em> it?
<a href="http://codahale.com/downloads/email-to-donald.txt">Where can I get one from, and what can I use it for?</a>”</p>

<p><em>Internet: “Here is an example in the Haskell programming language.”</em></p>

<p>Me: &lt;censored&gt;</p>

<p>On a more serious note, the past few days have really been a tour de force where I felt I would recursively dive from
one new term or concept into yet more new terms and concepts, to the point where my brain would run into a stack
overflow.  <em>“Why am I actually reading about <a href="https://en.wikipedia.org/wiki/Magma_%28algebra%29">magmas</a>, or co- and</em>
<em>contra-variance in Scala, or bounded type parameters?  What was the original question I tried to find an answer for?”</em></p>

<p>To make a long story short I was really deep down the rabbit hole, with no Alice in sight but fully surrounded by
semigroups of monoidal and diabolical <a href="http://en.wikipedia.org/wiki/Jabberwocky">jabberwockies</a> on a big night out.
Given the questions, comments and blog posts of other folks at least I found consolation in the fact that I was
apparently not alone.</p>

<p>And, finally, at the end of the hole there was a bit of light.  In the next sections I want to share what I have learned
so far in the hope that it will prove helpful for you, too.  We start with a brief introduction to monoids and monads,
followed by how to apply what we have learned in Algebird hands-on.</p>

<h1 id="the-tldr-version-of-monoids-and-monads">The TL;DR version of monoids and monads</h1>

<blockquote><p>A monad is a monoid where you blend the &#8220;oi&#8221; into an &#8220;a&#8221;.  Depending on your typesettings (pun intended) this blend will be easier or harder for you to see.  If in doubt, squint more.</p><footer><strong>Michael&#8217;s abridged relation of monoids and monads</strong></footer></blockquote>

<p>As a grossly simplified rule of thumb:</p>

<ol>
  <li><strong>Monoid</strong>: If you want to “attach” <em>operations</em> such as <code>+</code>, <code>-</code>, <code>*</code>, <code>/</code> or <code>&lt;=</code> to <em>data objects</em> – say, adding
two Bloom filters – then you want to provide <em>monoid</em> forms for those data objects (e.g. a monoid for your Bloom
filter data structure).  This way you can combine and juggle your custom data structures just like you would do with
plain integer numbers.</li>
  <li><strong>Monad</strong>: If you want to create <em>data processing pipelines</em> that turn data objects step-by-step into the desired,
final output (e.g. aggregating raw records into summary statistics), then you want to build one or more <em>monads</em> to
model these data pipelines.  Particularly if you want to run those pipelines in large-scale data processing platforms
such as Hadoop or Storm.</li>
</ol>
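The two rules of thumb can be illustrated with built-in types.  This is a hand-wavy analogy in plain Scala, using <code>Set</code> union for the monoid style and <code>List</code>&#8217;s <code>flatMap</code>/<code>map</code> for the monad style:

```scala
// Monoid-style: combining values with an associative binary operation.
val totals = Seq(Set(1), Set(2), Set(2, 3)).reduceLeft(_ union _)
assert(totals == Set(1, 2, 3))

// Monad-style: a step-by-step pipeline that transforms data via flatMap/map.
val pipeline = List(1, 2, 3).flatMap(n => List(n, n * 10)).map(_ + 1)
assert(pipeline == List(2, 11, 3, 21, 4, 31))
```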

<p>The intent of this section is to give you a high-level idea of what those concepts are, and what you can use them for.
That is, this section should help you determine whether you want to venture down the rabbit hole, too.</p>

<p>I did not want to add yet another variant to the pool of “what is a monoid/monad” articles, but at the same time I felt
I needed to explain at least very briefly what the various concepts are (as well as I can) so that you can better
understand how to use a tool such as Algebird.</p>

<p>Of course, if you run across a blatant mistake on my side, please do let me know!</p>

<h2 id="monoids">Monoids</h2>

<h3 id="what-is-a-monoid">What is a monoid?</h3>

<p>A monoid is a structure that consists of:</p>

<ol>
  <li>a set of objects (such as numbers)</li>
  <li>a binary operation as a method of combining them (such as adding those numbers)</li>
</ol>

<p>The small catch is that the way you can combine the objects in your set must adhere to a few rules, which are described
in the next section.</p>

<p>One way to explain a monoid in the context of programming is as a kind of <em>adapter</em> or <em>bounded view</em> of a type <code>T</code>.
Imagine a data structure of type <code>T</code> – say, a <code>List</code>.  If you can use <code>T</code> in a way that conforms to the
monoid laws (see next section), then you can say “type T forms a monoid” <code>Monoid[T]</code>;  for instance, if the binary
operation you picked behaves like the concept of addition, you have an additive monoid view of <code>T</code>.</p>

<div class="note">
Note: What I tried to highlight in the previous paragraph is that a given type <tt>T</tt> can have multiple monoidal
forms.  An additive monoid of <tt>T</tt> is just an example, and <tt>T</tt> might have more monoids than the additive
variant.  Also &#8211; sorry for the forward reference &#8211; a type <tt>T</tt> can form both a monoid <em>and</em> a monad.
One such dual-headed hydra is the well-known <tt>List</tt>.
</div>

<p>So you can read <code>Monoid[T]</code> as <em>“T looks like a monoid and quacks like a monoid, so it must be a monoid”</em>.  This notion
is related to the concept of <a href="http://en.wikipedia.org/wiki/Duck_typing">duck typing</a> in languages such as Python.
Scala, in which Algebird is implemented, has a static type system though, so to get such ad-hoc polymorphism we
typically use
<a href="http://danielwestheide.com/blog/2013/02/06/the-neophytes-guide-to-scala-part-12-type-classes.html">type classes</a>
to achieve a similar effect.  A nifty feature of type classes is that they allow you to retroactively add polymorphism
even to existing types that are not under your own control: examples are <code>Seq</code> or <code>List</code>, which are provided by the
Scala standard library and thus not under your control.</p>
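Here is a tiny sketch of that retroactive aspect, using my own minimal <code>Monoid</code> trait (not Algebird&#8217;s richer one): the standard library&#8217;s <code>Int</code> gains a monoid &#8220;view&#8221; without being modified:

```scala
// A minimal type class (a sketch, not Algebird's actual Monoid).
trait Monoid[T] {
  def e: T                 // identity element
  def op(a: T, b: T): T    // associative binary operation
}

// Retroactively attach a monoid view to the existing, unmodified Int type.
implicit val intAddition: Monoid[Int] = new Monoid[Int] {
  def e = 0
  def op(a: Int, b: Int) = a + b
}

// Generic code that works for ANY type with a Monoid instance in scope.
def combineAll[T](xs: Seq[T])(implicit m: Monoid[T]): T =
  xs.foldLeft(m.e)(m.op)

assert(combineAll(Seq(1, 2, 3)) == 6)
assert(combineAll(Seq.empty[Int]) == 0) // identity handles the empty case
```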

<p><img src="http://www.michael-noll.com/blog/uploads/monoid-illustration.png" title="Monoid illustration" /></p>

<div class="caption">
Figure 2: A monoid seen as a bounded view.  In this analogy we are looking at the original type <tt>T</tt> from a
different, &#8220;monoidal angle&#8221;.  Here, we are combining two values of type <tt>T</tt> under the laws of the pink-colored
monoid view of <tt>T</tt> (whatever this particular monoid might actually be doing).
</div>

<h3 id="monoids-in-more-detail">Monoids in more detail</h3>

<p>A monoid is a set of objects, <code>T</code>, together with a binary operation <tt>⋅</tt> that satisfies the three axioms
listed below.</p>

<p>One way to express a monoid in Scala would be the following trait, used as a type class:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Important: What you see here is only part of the contract.</span>
</span><span class="line"><span class="c1">// The monoid, and thus `e` and `op`, must also adhere to the monoid laws.</span>
</span><span class="line"><span class="k">trait</span> <span class="nc">Monoid</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">def</span> <span class="n">e</span><span class="k">:</span> <span class="kt">T</span>
</span><span class="line">  <span class="k">def</span> <span class="n">op</span><span class="o">(</span><span class="n">a</span><span class="k">:</span> <span class="kt">T</span><span class="o">,</span> <span class="n">b</span><span class="k">:</span> <span class="kt">T</span><span class="o">)</span><span class="k">:</span> <span class="kt">T</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<dl>
  <dt>Closure</dt>
  <dd>
    For all <em>a, b</em> in <em>T</em>, the result of the operation <em>a &sdot; b</em> is also in <em>T</em>:

    $$
    \forall a,b \in T: a \bullet b \in T
    $$

    In Scala, we could express this axiom with the following function signature for <tt>&sdot;</tt>:
    <tt>def op(a: T, b: T): T</tt>
  </dd>
  <dt>Associativity</dt>
  <dd>
    For all <em>a</em>, <em>b</em>, and <em>c</em> in <em>T</em>,
    the equation <em>(a &sdot; b) &sdot; c = a &sdot; (b &sdot; c)</em> holds:

    $$
    \forall a,b,c \in T: (a \bullet b) \bullet c = a \bullet (b \bullet c)
    $$

    In Scala, we could express this axiom with:
    <tt>(a op b) op c == a op (b op c)</tt>
  </dd>
  <dt>Identity element</dt>
  <dd>
    There exists an element <em>e</em> (we could also call it <em>zero</em> to draw a link to addition) in <em>T</em>,
    such that for all elements <em>a</em> in <em>T</em>, the equation <em>e &sdot; a = a &sdot; e = a</em> holds:

    $$
    \exists e \in T: \forall a \in T: e \bullet a = a \bullet e = a
    $$

    In Scala, we could express this axiom with the following, which as you might note captures the idea of a
    <a href="http://en.wikipedia.org/wiki/NOP">no-op</a>: <tt>e op a == a op e == a</tt>
  </dd>
</dl>
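For a concrete instance, the three axioms can be spot-checked with string concatenation (closure is already guaranteed by the function&#8217;s type signature):

```scala
// Spot-checking the monoid laws for string concatenation.
val e = ""                                    // identity element
def op(a: String, b: String): String = a + b  // closure: always returns a String

val (a, b, c) = ("foo", "bar", "baz")
assert(op(op(a, b), c) == op(a, op(b, c)))    // associativity
assert(op(e, a) == a && op(a, e) == a)        // identity element is a no-op
```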

<div class="note">
Note: <em>Any</em> binary operation satisfying the three axioms above qualifies your data structure to be a monoid.
It does not necessarily need to be an addition-like operation.
</div>

<p>Before we move on and look at examples of monads, I want to mention one more thing about the binary function of a
monoid.  We have learned that it must be <a href="http://en.wikipedia.org/wiki/Associative_property">associative</a>.  Wouldn’t it
be helpful if the binary function were <a href="http://en.wikipedia.org/wiki/Commutative_property">commutative</a>, too, even
though this optional feature would not be required to make a monoid?</p>

<p>Here is a transcribed reply from Sam Ritchie’s SummingBird talk at CUFP:</p>

<blockquote><p>Question: Associativity is one nice thing about monoids, but what about commutativity [which] is also important.  Are there examples of non-commutative datastructures</p><p>Answer: It should be baked into the algebra (non-commutativity). This helps with data skew in particular.  An important non-commutative application is Twitter itself!  When you want to build the List monoid, the key is <tt>userid,time</tt> and the value is the list of tweets over that timeline (so ordering matters here).  It&#8217;s not good to get a non-deterministic order when building up these lists in parallel, so that’s a good example of when associativity and commutativity are both important.</p><footer><strong>Transcript of Sam Ritchie&#8217;s SummingBird talk at CUFP 2013</strong> <cite><a href="http://www.syslog.cl.cam.ac.uk/2013/09/22/liveblogging-cufp--2013/">www.syslog.cl.cam.ac.uk/2013/09/&hellip;</a></cite></footer></blockquote>

<h3 id="what-are-example-monoids">What are example monoids?</h3>

<ul>
  <li><em>Numbers</em> (= the set of objects) you can <em>add</em> (= the method of combining them).
    <ul>
      <li>For integer addition, <code>e == 0</code> and <code>op == +</code>.</li>
      <li>For integer multiplication, <code>e == 1</code> and <code>op == *</code>.</li>
    </ul>
  </li>
  <li><em>Lists</em> you can <em>concatenate</em>.
    <ul>
      <li>With <code>e == Nil</code> and <code>op == concat</code>.</li>
    </ul>
  </li>
  <li><em>Sets</em> you can <em>union</em>.
    <ul>
      <li>With <code>e == Set()</code> and <code>op == union</code>.</li>
    </ul>
  </li>
</ul>

<p>There are more and also more sophisticated examples, of course.  <code>Max[Int]</code> at the beginning of this article is a
monoid, too.</p>
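The <code>e</code>/<code>op</code> pairs listed above can be verified directly with built-in types:

```scala
// The identity element e really is a no-op for each example monoid.
assert(0 + 42 == 42)                           // integer addition: e == 0
assert(1 * 42 == 42)                           // integer multiplication: e == 1
assert(List() ++ List(1, 2) == List(1, 2))     // list concatenation: e == Nil
assert(Set.empty[Int].union(Set(1)) == Set(1)) // set union: e == Set()
```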

<p>Here is how Algebird defines an <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monoid.scala">additive monoid for the standard type <code>Seq</code></a>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// A `Seq` concatenation monoid.</span>
</span><span class="line"><span class="c1">// Plus (the `op`) means concatenation,</span>
</span><span class="line"><span class="c1">// zero (the identity element `e`) is the empty Seq.</span>
</span><span class="line"><span class="k">class</span> <span class="nc">SeqMonoid</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="nc">extends</span> <span class="nc">Monoid</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">override</span> <span class="k">def</span> <span class="n">zero</span> <span class="k">=</span> <span class="nc">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">]()</span>
</span><span class="line">  <span class="k">override</span> <span class="k">def</span> <span class="n">plus</span><span class="o">(</span><span class="n">left</span> <span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">],</span> <span class="n">right</span> <span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">=</span> <span class="n">left</span> <span class="o">++</span> <span class="n">right</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Make an instance of `SeqMonoid` available as an implicit value.</span>
</span><span class="line"><span class="c1">// This is a Scala-specific implementation action that needs to be done,</span>
</span><span class="line"><span class="c1">// i.e. it is not related to the abstract concept of monoids.</span>
</span><span class="line"><span class="c1">//</span>
</span><span class="line"><span class="c1">// The effect of this statement is to add the &quot;monoid view&quot; of Seq</span>
</span><span class="line"><span class="c1">// as defined above to all `Seq` instances in the code.  If you</span>
</span><span class="line"><span class="c1">// define your own monoid for a type `T` in Algebird and forget</span>
</span><span class="line"><span class="c1">// this statement, Algebird will complain with the following</span>
</span><span class="line"><span class="c1">// @implicitNotFound error message:</span>
</span><span class="line"><span class="c1">//</span>
</span><span class="line"><span class="c1">//   &quot;Cannot find Monoid type class for T&quot;</span>
</span><span class="line"><span class="c1">//</span>
</span><span class="line"><span class="c1">// Implicits need to be used because this is how the notion of</span>
</span><span class="line"><span class="c1">// type classes is implemented in Scala.</span>
</span><span class="line"><span class="k">implicit</span> <span class="k">def</span> <span class="n">seqMonoid</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SeqMonoid</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Algebird actually includes a few more methods for the <code>Monoid[T]</code> type class – which <code>SeqMonoid[T]</code> extends – but the
key functionality is shown above.</p>

<h3 id="what-can-i-use-a-monoid-for--why-should-i-look-for-one">What can I use a monoid for?  Why should I look for one?</h3>

<p>Whenever you have a data structure (which backs your “set of objects”, e.g. the <code>Int</code> data structure or the <code>List[T]</code>
data structure), you can check whether you can define one or more monoids for that data structure.  To do so, you look
for operations you can perform on any two instances of your data structure that satisfy the three
<a href="https://en.wikipedia.org/wiki/Monoid">monoid axioms</a>: <em>closure</em>, <em>associativity</em>, and <em>identity element</em> (the latter
gives your monoid a no-op function, and is the one thing that turns a semigroup into a monoid).</p>

<p>If you do find any such monoids for your data structure, hooray!  On the practical side this means that you can
now use your data structure in any code that expects a monoid.  As I said above, you can think of a monoid as an adapter,
or shape, for (some monoid-compatible aspects of) your data structure that allows you to fit your data structure peg
into a monoid hole.  Some such holes are Twitter&#8217;s <a href="https://github.com/twitter/algebird/">Algebird</a>,
<a href="https://github.com/twitter/scalding">Scalding</a>, and <a href="https://github.com/twitter/summingbird">Summingbird</a>.
Being supported by those tools also means that you can now plug your data structure into big data analytics tools such
as Hadoop and Storm, which can be a huge selling point and productivity gain for your new data structure.</p>

<p><span class="pullquote-right" data-pullquote="If your data structure has a monoid form, this means you can plug the data structure directly into large-scale data processing platforms such as Hadoop and Storm.">
Secondly, and in more general terms, the <em>associativity</em> of monoid operations means that those
operations on your data structure <a href="http://en.wikipedia.org/wiki/Monoid#Monoids_in_computer_science">can be parallelized</a>
to utilize multiple CPU cores efficiently.  Speaking in code, that means you can run operations such as
<code>foldLeft()</code> and <code>reduceLeft()</code> on them.  And parallelization support is yet another reason why monoids (and monads) are
so attractive for big data tools such as Hadoop and Storm, where your code not only runs on many cores per machine but
on many such machines in a cluster.  In other words:
If your data structure has a monoid form, this means you can plug the data structure directly into large-scale data
processing platforms such as Hadoop and Storm.
Hence monoids enable you to <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> and to
<a href="http://en.wikipedia.org/wiki/Divide_and_conquer_algorithm">divide and conquer</a>.
</span></p>
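<p>To make this concrete, here is a small sketch of how associativity lets you split a fold into independent halves that could run on different cores or machines.  Note that the <code>Monoid</code> trait and the <code>sumAll()</code> helper below are hypothetical illustrations, not Algebird&#8217;s actual API:</p>

```scala
// Minimal, hypothetical monoid contract -- not Algebird's actual API.
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

object IntAddition extends Monoid[Int] {
  val zero = 0
  def plus(l: Int, r: Int) = l + r
}

// Because `plus` is associative, the two halves of the input can be
// combined independently (think: different cores, or different machines
// in a Hadoop/Storm cluster) and merged at the end.
def sumAll[T](xs: Seq[T])(m: Monoid[T]): T =
  if (xs.size <= 1) xs.foldLeft(m.zero)(m.plus)
  else {
    val (left, right) = xs.splitAt(xs.size / 2)
    m.plus(sumAll(left)(m), sumAll(right)(m))
  }

val total = sumAll(1 to 10)(IntAddition)  // 55, no matter how we split the input
```

<p>The key point is that <code>sumAll()</code> never relies on any property of <code>Int</code> beyond the monoid contract, which is exactly what lets frameworks such as Hadoop and Storm combine partial results in any grouping.</p>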

<p>Let me quote Sam Ritchie (<a href="https://twitter.com/sritchie">@sritchie</a>), former Twitter engineer and now founder of
<a href="http://www.paddleguru.com/">PaddleGuru</a> (cool idea, by the way – go sports!) for a very concrete practical application
of monoids at Twitter.  Well, actually I am quoting a transcript of his talk.</p>

<blockquote><p>One cool feature:  When you visit a tweet, you want the reverse feed of things that have embedded the tweet.  The MapReduce graph for this comes from: When you see an impression, find the key of the tweet and emit a tuple of the <tt>tweetId</tt> and <tt>Map[URL, Long]</tt>. Since Maps have a monoid, this can be run in parallel, and it will contain a list of who has viewed it and from where.  The <tt>Map</tt> has a <tt>Long</tt> since popular tweets can be embedded in millions of websites and so they use a &#8220;CountMinSketch&#8221; [Note: Reader Sam Bessalah points out that the transcript is wrong when it said &#8220;accountment sketch&#8221;.] which is an approximate data structure to deal with scale there. The Summingbird layer which the speaker [Sam Ritchie] shows on stage filters events, and generates key-value pairs and emits events.</p><p>Twitter advertising is also built on Summingbird.  Various campaigns can be built by building a backend using a monoid that expresses the needs, and then the majority of the work is on the UI work in the frontend (where it should be — remember, solve systems problems once is part of the vision).</p><footer><strong>Transcript of Sam Ritchie&#8217;s SummingBird talk at CUFP 2013</strong> <cite><a href="http://www.syslog.cl.cam.ac.uk/2013/09/22/liveblogging-cufp--2013/">www.syslog.cl.cam.ac.uk/2013/09/&hellip;</a></cite></footer></blockquote>

<p>See <a href="https://speakerdeck.com/sritchie/summingbird-at-cufp">his CUFP slides on Summingbird</a> for further detail.</p>

<p>Thirdly, you can <em>compose</em> monoids.  For instance, you can form the <em>product</em> of two monoids <code>M1</code> and <code>M2</code>, which is the
tuple type <code>(M1, M2)</code>.  This product is also a monoid.</p>
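<p>A quick sketch of this product construction, again with a hypothetical <code>Monoid</code> trait rather than Algebird&#8217;s actual classes: the zero of the product is the pair of zeros, and <code>plus</code> combines component-wise.</p>

```scala
// Hypothetical monoid contract for illustration.
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

object IntAddition extends Monoid[Int] {
  val zero = 0
  def plus(l: Int, r: Int) = l + r
}

object StringConcat extends Monoid[String] {
  val zero = ""
  def plus(l: String, r: String) = l + r
}

// The product of two monoids M1 and M2 is again a monoid on (M1, M2):
// zero is the pair of zeros, and plus works component-wise.
class ProductMonoid[A, B](ma: Monoid[A], mb: Monoid[B]) extends Monoid[(A, B)] {
  val zero = (ma.zero, mb.zero)
  def plus(l: (A, B), r: (A, B)) = (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
}

val m = new ProductMonoid(IntAddition, StringConcat)
val combined = m.plus((1, "foo"), (2, "bar"))  // (3, "foobar")
```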

<p>Lastly, you can now combine your monoidal data structure with monads (see below) and benefit from all the features that those monads provide.</p>

<div class="note">
At this point you might guess the reason why Ted Dunning was asked whether the <tt>t-digest</tt> data structure he is
working on has an associative merge operation and can thus be turned into a semigroup or monoid.  One of my two mysteries solved!
</div>

<h2 id="monads">Monads</h2>

<h3 id="what-is-a-monad">What is a monad?</h3>

<div class="note">
Update: A few readers pointed out that this section explains rather what monads are <em>used for</em> than what they
really <em>are</em>.  I concur!  And I even skip a discussion of Monad laws etc. intentionally because the post is
already quite long, and the focus and motivation of this article (see above) is not an in-depth introduction to monoids
or monads.  It&#8217;s about the questions &#8220;Why should I be interested in the first place, and what can I use them for?&#8221;.
Of course I can understand the need for further details, so I added a list of references and literature to the end of
this article, which you can read at your leisure.  Of course if you think that some important piece of information
should be mentioned here directly (or something happens to be plain wrong), please let me know.  It&#8217;s difficult to write
an article about such a topic in a way that can be understood by beginners and at the same time also pleases the
experts.
</div>

<p>A monad is a structure that defines a way to combine <em>functions</em>.  It represents computations defined as a <em>sequence</em>
of transformations that turn an original input into a final output, one step at a time.  Think of them like function
chaining similar to <code>y = h(g(f(x)))</code>.</p>
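<p>In Scala, such a chain can be written either with plain function application or, equivalently, by boxing each intermediate result in a monad and letting <code>flatMap()</code> wire the steps together.  A small sketch using Scala&#8217;s built-in <code>Option</code>, which behaves as a monad:</p>

```scala
// Three processing steps, each returning its result boxed in Option.
val f = (x: Int) => Option(x + 1)
val g = (x: Int) => Option(x * 2)
val h = (x: Int) => Option(x.toString)

// Monadic equivalent of y = h(g(f(x))): flatMap chains the steps.
val y = Option(3).flatMap(f).flatMap(g).flatMap(h)
// y == Some("8")
```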

<p>An interesting aspect is that in the case of a monad the <em>type</em> of the value being piped through the function chain may
change along the way.  For instance, you may start with an <code>Int</code> but end up with a <code>Double</code> or <code>BloomFilter</code>.  This is
different from a monoid, which will always retain the original type because of the <em>closure</em> requirement (see monoid
laws above).</p>

<p>One of the best analogies for monads I found is the following, adapted from
<a href="https://en.wikipedia.org/wiki/Monad_%28functional_programming%29">Wikipedia</a>: You can compare monads to physical
assembly lines, where a conveyor belt (the monad) transports a piece of input material (the data) between functional
units (functions on the data) that transform the piece one step at a time.  Think of the skeleton of a car that is
turned into the final car in a sequence of steps.  Or of web server log files with raw data that is turned into business
information such as the increase of ad impressions in the EMEA market for this month.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/monad-function-pipeline.png" title="Monad data processing pipeline" /></p>

<div class="caption">
Figure 3: A monad seen as a data processing pipeline.  The monad <tt>M</tt> is used to turn the original input into the
final output one step at a time.
</div>

<p>Sticking with this analogy, a monad enables you to <em>decorate</em> each processing step in the assembly pipeline with
<em>additional context</em> (or an “environment”).  For instance, your monad could carry state information that is used by
the functions in the pipeline – this would be the example of a
<a href="https://en.wikibooks.org/wiki/Haskell/Understanding_monads/State">state monad</a>.
Alternatively, your monad could log what is going on before, within, or after a function to a file or database – this
would be the example of an <a href="https://en.wikibooks.org/wiki/Haskell/Understanding_monads/IO">I/O monad</a>.
If you are a game developer, you could use a monad to carry the representation and state of the game environment (such
as the current level), and the functions in the pipeline would model how players can interact with the environment.</p>

<p>Before we look at monads in more detail, let us take a brief detour to <a href="https://github.com/nathanmarz/storm">Storm</a>.
When you are implementing bolts in Storm – i.e. Storm’s version of the “functional units” in a data processing
pipeline – you will come across the <code>prepare()</code> and <code>execute()</code> methods
(see the <a href="https://github.com/nathanmarz/storm/wiki/Tutorial">Storm tutorial</a>):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">TripleBolt</span> <span class="kd">extends</span> <span class="n">BaseRichBolt</span> <span class="o">{</span>
</span><span class="line">  <span class="kd">private</span> <span class="n">OutputCollectorBase</span> <span class="n">collector</span><span class="o">;</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// Note how Storm provides &quot;context&quot; -- a literal context value</span>
</span><span class="line">  <span class="c1">// and a collector value -- to the bolt as the functional unit in</span>
</span><span class="line">  <span class="c1">// the data processing pipeline.</span>
</span><span class="line">  <span class="nd">@Override</span>
</span><span class="line">  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">prepare</span><span class="o">(</span><span class="n">Map</span> <span class="n">conf</span><span class="o">,</span> <span class="n">TopologyContext</span> <span class="n">context</span><span class="o">,</span> <span class="n">OutputCollectorBase</span> <span class="n">collector</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="k">this</span><span class="o">.</span><span class="na">collector</span> <span class="o">=</span> <span class="n">collector</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// This is Storm&#39;s version of a monad&#39;s `fn` function,</span>
</span><span class="line">  <span class="c1">// which we will discuss in the next section.</span>
</span><span class="line">  <span class="nd">@Override</span>
</span><span class="line">  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">input</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="kt">int</span> <span class="n">val</span> <span class="o">=</span> <span class="n">input</span><span class="o">.</span><span class="na">getInteger</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
</span><span class="line">    <span class="kt">int</span> <span class="n">tripled</span> <span class="o">=</span> <span class="n">val</span> <span class="o">*</span> <span class="mi">3</span><span class="o">;</span>
</span><span class="line">    <span class="n">collector</span><span class="o">.</span><span class="na">emit</span><span class="o">(</span><span class="n">input</span><span class="o">,</span> <span class="k">new</span> <span class="n">Values</span><span class="o">(</span><span class="n">tripled</span><span class="o">));</span>
</span><span class="line">    <span class="n">collector</span><span class="o">.</span><span class="na">ack</span><span class="o">(</span><span class="n">input</span><span class="o">);</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// ...rest omitted...</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Note how Storm provides environmental information and context to the bolt.  This is one example where you could point
your finger at the code and say, <em>“This would be a good place to use a monad.”</em>  In this specific case I would say it
would primarily be a kind of <em>I/O monad</em> because the <code>collector</code> instance allows the bolt to write its output to
downstream bolts via network communication.</p>

<h3 id="monads-in-more-detail">Monads in more detail</h3>

<p>Here is one way to capture the concept of a monad in Scala.  It is basically the same as the
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">definition of a monad in Algebird</a>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Important: What you see here is only part of the contract.</span>
</span><span class="line"><span class="c1">// The monad, and thus `apply` and `flatMap`, must also adhere to the monad laws.</span>
</span><span class="line"><span class="k">trait</span> <span class="nc">Monad</span><span class="o">[</span><span class="kt">M</span><span class="o">[</span><span class="k">_</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="c1">// Also called `unit` (in papers) or `return` (in Haskell).</span>
</span><span class="line">  <span class="k">def</span> <span class="n">apply</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">v</span><span class="k">:</span> <span class="kt">T</span><span class="o">)</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// Also called `bind` (in papers) or `&gt;&gt;=` (in Haskell).</span>
</span><span class="line">  <span class="k">def</span> <span class="n">flatMap</span><span class="o">[</span><span class="kt">T</span>, <span class="kt">U</span><span class="o">](</span><span class="n">m</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">])(</span><span class="n">fn</span><span class="k">:</span> <span class="o">(</span><span class="kt">T</span><span class="o">)</span> <span class="o">=&gt;</span> <span class="n">M</span><span class="o">[</span><span class="kt">U</span><span class="o">])</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">U</span><span class="o">]</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Alright, what is going on here?</p>

<p><code>apply()</code> boxes a <code>T</code> value into the monad <code>M[T]</code>.  For example, if <code>T</code> is an <code>Int</code>, the monad <code>M[T]</code> could be a <code>List[Int]</code>.
In other words, it is a good-ol&#8217; constructor for the monad.</p>

<p><code>flatMap()</code> turns a <code>T</code> into a potentially different type parameter <code>U</code> (though it can also be a <code>T</code> again) that is boxed
into the same type of monad <code>M</code>, i.e. <code>M[U]</code>.  In plain English, this means that if you have a List monad, all it will
ever produce for you is another List monad, but the type of the elements <em>in the List monad</em> may change.  The way this
happens is controlled by the second parameter of <code>flatMap()</code>, which is a function from <code>T</code> to <code>M[U]</code>.</p>

<p>For example, <code>T</code> is an <code>Int</code>, <code>U</code> is a <code>Double</code>, and <code>M</code> is a <code>List</code> monad;
<code>fn</code> is <code>(i: Int) =&gt; List(i.toDouble / 4, i.toDouble / 2)</code>, i.e. <code>T -&gt; M[U]</code>.
If you ran this combination over the input <code>List[Int](1, 2)</code>, you would get the output:
<code>List[Double](0.25, 0.5, 0.5, 1.0)</code>.</p>
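<p>You can verify this example directly in the Scala REPL with the built-in <code>flatMap()</code> of <code>List</code>:</p>

```scala
// fn maps each Int to a List of two Doubles, i.e. T -> M[U].
val fn = (i: Int) => List(i.toDouble / 4, i.toDouble / 2)

// flatMap applies fn to each element and concatenates the results:
// 1 -> List(0.25, 0.5), 2 -> List(0.5, 1.0)
val output = List(1, 2).flatMap(fn)
// output == List(0.25, 0.5, 0.5, 1.0)
```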

<div class="note">
Note how <tt>flatMap()</tt> provides the boxing <tt>M</tt> instance <tt>m</tt> of the input <tt>T</tt> value to the
function <tt>fn</tt> via currying.  This way <tt>fn</tt> may leverage information or functionality embedded in the
monad, including functions beyond the contractually required <tt>flatMap()</tt>.  One such example is
<tt>Monad[Some]</tt>, i.e. the
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">Some monad in Algebird</a>.
The <tt>flatMap()</tt> function of this monad calls <tt>Some#get()</tt>, which is a function of <tt>Some</tt> but not
of <tt>Monad[Some]</tt>.  As such a monad is also a kind of adapter or view, similar to the
way we described monoids above.  If you still cannot see how similar monoids and monads are, just try squinting harder!
</div>

<p>Similar to the monoid laws we discussed above, monads have their own laws – and these rules are actually very similar
to their monoid brethren!  I decided not to discuss monad laws in this post because I feel it is already very long.
I may update the post at a later point though.  In the meantime take a look at the following references:</p>

<ul>
  <li><a href="http://www.haskell.org/haskellwiki/Monad_laws">Monad laws</a> (in Haskell).  Remember that <code>return</code> in Haskell corresponds to our
constructor <code>apply()</code> in Scala, and <code>&gt;&gt;=</code> in Haskell is our <code>flatMap()</code>.</li>
  <li><a href="http://james-iry.blogspot.ch/2007/09/monads-are-elephants-part-1.html">Monads are elephants</a>, a series of blog posts
by James Iry.  In Scala.</li>
</ul>

<p>I hope you will notice their similarities:</p>

<ul>
  <li>The identity rules of monads are similar to the identity element <code>e</code> of monoids.</li>
  <li>Both monoids and monads have functions that must be associative.</li>
</ul>

<h3 id="what-are-example-monads">What are example monads?</h3>

<p>At the beginning of the section on monads I already mentioned the <em>state monad</em> and the <em>I/O monad</em>.</p>

<p>Well, this may still be a bit vague.  Let us look at a more concrete (and maybe simpler) example.  Any collection type
is typically a monad.  For example, take <code>List[T]</code>:</p>

<ul>
  <li>The constructor of <code>List[T]</code> acts as <code>unit</code> as it gives you a <code>List[T]</code> box for <code>T</code> instances.</li>
  <li><code>List</code> has an appropriate <code>flatMap()</code> function – and <code>map()</code>, which can be built from <code>flatMap()</code> and the
constructor.</li>
</ul>
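<p>The second bullet point can be sketched in a few lines: a hypothetical <code>mapViaFlatMap()</code> helper that rebuilds <code>map()</code> from nothing but <code>flatMap()</code> and the constructor.</p>

```scala
// map() expressed through flatMap() plus the List constructor:
// box each transformed element in a singleton List, then flatten.
def mapViaFlatMap[T, U](xs: List[T])(f: T => U): List[U] =
  xs.flatMap(t => List(f(t)))

val doubled = mapViaFlatMap(List(1, 2, 3))(_ * 2)
// doubled == List(2, 4, 6), same as List(1, 2, 3).map(_ * 2)
```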

<p>Here is the implementation of <code>Monad[List]</code> <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">in Algebird</a>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">implicit</span> <span class="k">val</span> <span class="n">list</span><span class="k">:</span> <span class="kt">Monad</span><span class="o">[</span><span class="kt">List</span><span class="o">]</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Monad</span><span class="o">[</span><span class="kt">List</span><span class="o">]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">def</span> <span class="n">apply</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">v</span><span class="k">:</span> <span class="kt">T</span><span class="o">)</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="n">v</span><span class="o">);</span>
</span><span class="line">  <span class="k">def</span> <span class="n">flatMap</span><span class="o">[</span><span class="kt">T</span>,<span class="kt">U</span><span class="o">](</span><span class="n">m</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">])(</span><span class="n">fn</span><span class="k">:</span> <span class="o">(</span><span class="kt">T</span><span class="o">)</span> <span class="o">=&gt;</span> <span class="nc">List</span><span class="o">[</span><span class="kt">U</span><span class="o">])</span> <span class="k">=</span> <span class="n">m</span><span class="o">.</span><span class="n">flatMap</span><span class="o">(</span><span class="n">fn</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Here you can see that <code>Monad[List]</code> is simply a 1:1 adapter for the existing <code>apply()</code> and <code>flatMap()</code> functions of
<code>List</code>.  And that’s because <code>List</code> in Scala already ships with monad “look and feel”.</p>

<p>Before we move on to the next section there is one more interesting facet:  A monad can have monoid forms, too.
Algebird, for instance, provides a <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">default monoid view for its semigroups and monads</a>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// This is a Semigroup, for all Monads.</span>
</span><span class="line"><span class="k">class</span> <span class="nc">MonadSemigroup</span><span class="o">[</span><span class="kt">T</span>,<span class="kt">M</span><span class="o">[</span><span class="k">_</span><span class="o">]](</span><span class="k">implicit</span> <span class="n">monad</span><span class="k">:</span> <span class="kt">Monad</span><span class="o">[</span><span class="kt">M</span><span class="o">],</span> <span class="n">sg</span><span class="k">:</span> <span class="kt">Semigroup</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span>
</span><span class="line">  <span class="k">extends</span> <span class="nc">Semigroup</span><span class="o">[</span><span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">import</span> <span class="nn">Monad.operators</span>
</span><span class="line">  <span class="k">def</span> <span class="n">plus</span><span class="o">(</span><span class="n">l</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">],</span> <span class="n">r</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">=</span> <span class="k">for</span><span class="o">(</span><span class="n">lv</span> <span class="k">&lt;-</span> <span class="n">l</span><span class="o">;</span> <span class="n">rv</span> <span class="k">&lt;-</span> <span class="n">r</span><span class="o">)</span> <span class="k">yield</span> <span class="n">sg</span><span class="o">.</span><span class="n">plus</span><span class="o">(</span><span class="n">lv</span><span class="o">,</span> <span class="n">rv</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">// This is a Monoid, for all Monads.</span>
</span><span class="line"><span class="k">class</span> <span class="nc">MonadMonoid</span><span class="o">[</span><span class="kt">T</span>,<span class="kt">M</span><span class="o">[</span><span class="k">_</span><span class="o">]](</span><span class="k">implicit</span> <span class="n">monad</span><span class="k">:</span> <span class="kt">Monad</span><span class="o">[</span><span class="kt">M</span><span class="o">],</span> <span class="n">mon</span><span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span>
</span><span class="line">  <span class="k">extends</span> <span class="nc">MonadSemigroup</span><span class="o">[</span><span class="kt">T</span>,<span class="kt">M</span><span class="o">]</span> <span class="k">with</span> <span class="nc">Monoid</span><span class="o">[</span><span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">lazy</span> <span class="k">val</span> <span class="n">zero</span> <span class="k">=</span> <span class="n">monad</span><span class="o">(</span><span class="n">mon</span><span class="o">.</span><span class="n">zero</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Groups, rings, and fields do not have such a default, “automatic” monoid view however.  For those algebraic structures
you must check yourself that the group/ring/field laws hold for your monad.</p>

<h3 id="what-can-i-use-a-monad-for--why-should-i-look-for-one">What can I use a monad for?  Why should I look for one?</h3>

<p>As we have already seen monads can be thought of as
<a href="http://www.haskell.org/haskellwiki/Monad"><em>composable</em> computation descriptions</a>.
This means you can use them to build powerful data processing pipelines.  And these pipelines are not only powerful in
terms of features and functionality, they can also be <em>parallelized</em>, which is one of the reasons why monads are so
attractive in the field of large-scale data processing where your code is run on many cores and on many machines at
the same time.</p>

<div class="note">
Now you might say that almost all we do in coding is to transform one value into another value, and I agree.  And this,
I think, is where the idea of the picture &#8220;Monads.  Monads, everywhere.&#8221; (see beginning of this article) originates
from.  Two of my two mysteries solved, yay!
</div>

<h1 id="algebird">Algebird</h1>

<p>Finally we are getting close to being productive with Algebird.  I figure the previous TL;DR section on monoids and
monads was still maybe a bit too long. :-)</p>

<p>If you recall, our original goal at the beginning of this post was to build a data structure <code>TwitterUser</code> accompanied
with a <code>Max[TwitterUser]</code> monoid view of it, using Algebird.  We wanted to use the two for implementing the analytics of
a simple popularity contest on Twitter:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Let&#39;s have a popularity contest on Twitter.  The user with the most followers wins!</span>
</span><span class="line"><span class="k">val</span> <span class="n">barackobama</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;BarackObama&quot;</span><span class="o">,</span> <span class="mi">40267391</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">katyperry</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;katyperry&quot;</span><span class="o">,</span> <span class="mi">48013573</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">ladygaga</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;ladygaga&quot;</span><span class="o">,</span> <span class="mi">40756470</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">miguno</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;miguno&quot;</span><span class="o">,</span> <span class="mi">731</span><span class="o">)</span> <span class="c1">// I participate, too.  Olympic spirit!</span>
</span><span class="line"><span class="k">val</span> <span class="n">taylorswift</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;taylorswift13&quot;</span><span class="o">,</span> <span class="mi">37125055</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">winner</span><span class="k">:</span> <span class="kt">Max</span><span class="o">[</span><span class="kt">TwitterUser</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">(</span><span class="n">barackobama</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">katyperry</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">ladygaga</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">miguno</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">taylorswift</span><span class="o">)</span>
</span><span class="line"><span class="n">assert</span><span class="o">(</span><span class="n">winner</span><span class="o">.</span><span class="n">get</span> <span class="o">==</span> <span class="n">katyperry</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Let’s start!</p>

<h2 id="creating-a-monoid">Creating a monoid</h2>

<h3 id="the-twitteruser-type">The TwitterUser type</h3>

<p>Our first step is to create the data structure <code>TwitterUser</code> for which we will then create a monoid view.</p>

<p>Because we want to build a <code>Max</code> monoid for <code>TwitterUser</code> eventually, we must come up with a way to <em>order</em>
<code>TwitterUser</code> values.  For this we can either use the
<a href="http://www.scala-lang.org/api/current/#scala.math.Ordering">Ordering</a> or the
<a href="http://www.scala-lang.org/api/current/#scala.math.Ordered">Ordered</a> trait in Scala, either way will work.</p>

<p>Let’s say we go down the <code>Ordered</code> route.  Now we must answer a design question: <em>Do we consider the “ordering”
behavior to be a defining feature of <code>TwitterUser</code> in general, or do we need this behavior only for its
<code>Max[TwitterUser]</code> monoid view?</em>
If it’s a general feature we would add it to <code>TwitterUser</code> directly.  If it’s only needed for the monoid we can also
decide to add it only there.  In our case, we will add the ordering behavior to <code>TwitterUser</code> directly.  I will show
further below how to implement the other option.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Small note: To be future-proof we should make `numFollowers` a `Long`,</span>
</span><span class="line"><span class="c1">// because `Int.MaxValue` (~ 2 billion) is less than the potential number</span>
</span><span class="line"><span class="c1">// of Twitter users on planet earth.  I am happy to let this one slip though.</span>
</span><span class="line"><span class="k">case</span> <span class="k">class</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="k">val</span> <span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="k">val</span> <span class="n">numFollowers</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Ordered</span><span class="o">[</span><span class="kt">TwitterUser</span><span class="o">]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">def</span> <span class="n">compare</span><span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">TwitterUser</span><span class="o">)</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="k">val</span> <span class="n">c</span> <span class="k">=</span> <span class="k">this</span><span class="o">.</span><span class="n">numFollowers</span> <span class="o">-</span> <span class="n">that</span><span class="o">.</span><span class="n">numFollowers</span>
</span><span class="line">    <span class="k">if</span> <span class="o">(</span><span class="n">c</span> <span class="o">==</span> <span class="mi">0</span><span class="o">)</span> <span class="k">this</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">compareTo</span><span class="o">(</span><span class="n">that</span><span class="o">.</span><span class="n">name</span><span class="o">)</span> <span class="k">else</span> <span class="n">c</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The code above means that <code>TwitterUser</code> supports comparison operations like <code>&gt;=</code> as defined by the <code>compare</code> method of
the <code>Ordered</code> trait.</p>
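<p>One subtlety worth noting: the subtraction-based <code>compare()</code> above can overflow <code>Int</code> for extreme inputs (for example when one operand holds <code>Int.MinValue</code>, which we will use as the monoid&#8217;s zero element below), flipping the sign of the result.  A safer sketch, assuming we delegate to Java&#8217;s <code>Integer.compare</code>:</p>

```scala
// A sketch of an overflow-safe compare: Integer.compare returns the
// sign of the comparison without performing a subtraction, so it is
// correct even for Int.MinValue / Int.MaxValue inputs.
case class TwitterUser(name: String, numFollowers: Int) extends Ordered[TwitterUser] {
  def compare(that: TwitterUser): Int = {
    val c = Integer.compare(this.numFollowers, that.numFollowers)
    if (c == 0) this.name.compareTo(that.name) else c
  }
}
```

<p>With the subtraction-based version, <code>TwitterUser(&quot;zero&quot;, Int.MinValue) &lt; TwitterUser(&quot;x&quot;, 1)</code> would wrap around and evaluate to <code>false</code>; with <code>Integer.compare</code> it behaves as expected.</p>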

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;foo&quot;</span><span class="o">,</span> <span class="mi">123</span><span class="o">)</span> <span class="o">&gt;</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;bar&quot;</span><span class="o">,</span> <span class="mi">99999</span><span class="o">)</span>
</span><span class="line"><span class="n">res5</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">false</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In our case this <code>compare()</code> method also underpins the monoidal binary operation of the <code>Max[TwitterUser]</code> monoid we
will build in the next section.  This works because the resulting max operation satisfies all three axioms described in our section
on monoids above.</p>
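<p>As a quick sanity check, here is a plain-Scala sketch (no Algebird required) that spot-checks the three axioms, i.e. closure, associativity, and identity, for the max operation on <code>TwitterUser</code>:</p>

```scala
// Sketch: spot-checking the monoid axioms for "max over TwitterUser".
// Tie-breaking by name is omitted here for brevity.
case class TwitterUser(name: String, numFollowers: Int)

// Closure: the result type is again TwitterUser (enforced by the compiler).
def max(a: TwitterUser, b: TwitterUser): TwitterUser =
  if (a.numFollowers >= b.numFollowers) a else b

val zero = TwitterUser("MinUser", Int.MinValue)
val x = TwitterUser("x", 1)
val y = TwitterUser("y", 2)
val z = TwitterUser("z", 3)

// Associativity: how we group the operations does not matter.
assert(max(max(x, y), z) == max(x, max(y, z)))
// Identity: combining with the zero element is a no-op.
assert(max(zero, x) == x && max(x, zero) == x)
```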

<h3 id="the-maxtwitteruser-monoid">The Max[TwitterUser] monoid</h3>

<p>Creating the <code>Max</code> monoid for <code>TwitterUser</code> is now very simple because we can leverage Algebird&#8217;s
<code>Max.monoid()</code> factory method.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// The &quot;zero&quot; element of the TwitterUser monoid.  Traditionally it is</span>
</span><span class="line"><span class="c1">// also called `mzero` in academic papers.  We use `Int.MinValue` here</span>
</span><span class="line"><span class="c1">// but in practice you would typically constrain `numFollowers` of</span>
</span><span class="line"><span class="c1">// TwitterUser to be &gt;= 0 anyways, so any negative value such as `-1`</span>
</span><span class="line"><span class="c1">// would do.</span>
</span><span class="line"><span class="k">val</span> <span class="n">zero</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;MinUser&quot;</span><span class="o">,</span> <span class="nc">Int</span><span class="o">.</span><span class="nc">MinValue</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Monoid in Algebird is a type class, hence we use implicits</span>
</span><span class="line"><span class="c1">// to make the monoid available to the rest of the code.</span>
</span><span class="line"><span class="k">implicit</span> <span class="k">def</span> <span class="n">twitterUserMonoid</span><span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">Max</span><span class="o">[</span><span class="kt">TwitterUser</span><span class="o">]]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">.</span><span class="n">monoid</span><span class="o">(</span><span class="n">zero</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>That’s it!</p>

<p>Ok, maybe it feels a bit like cheating because the monoid is created behind the scenes by <code>Max.monoid()</code>.
So what does <code>Max.monoid()</code> do?</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="cm">/* This is Algebird code, not ours. */</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Zero should have the property that it &lt;= all T</span>
</span><span class="line"><span class="k">def</span> <span class="n">monoid</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">zero</span><span class="k">:</span> <span class="o">=&gt;</span> <span class="n">T</span><span class="o">)(</span><span class="k">implicit</span> <span class="n">ord</span><span class="k">:</span> <span class="kt">Ordering</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">Max</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span>
</span><span class="line">   <span class="nc">Monoid</span><span class="o">.</span><span class="n">from</span><span class="o">(</span><span class="nc">Max</span><span class="o">(</span><span class="n">zero</span><span class="o">))</span> <span class="o">{</span> <span class="o">(</span><span class="n">l</span><span class="o">,</span><span class="n">r</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="k">if</span><span class="o">(</span><span class="n">ord</span><span class="o">.</span><span class="n">gteq</span><span class="o">(</span><span class="n">l</span><span class="o">.</span><span class="n">get</span><span class="o">,</span> <span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="o">))</span> <span class="n">l</span> <span class="k">else</span> <span class="n">r</span> <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Still, it’s pretty straightforward, I would say.  There is not a lot of magic as long as you know how implicits and type classes
in Scala work.</p>

<div class="note">
Generally, <tt>Max</tt> in Algebird is a semigroup &#8211; not a monoid &#8211; because not all types <tt>T</tt> you could come
up with would have the notion of a zero element when used with <tt>Max</tt>.  And the existence of such a zero element
is the one thing that separates a semigroup from a monoid.  You see this in Algebird&#8217;s
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/OrderedSemigroup.scala">OrderedSemigroup.scala</a>
where <tt>object Max</tt> defines an <tt>implicit def</tt> semigroup, and defines monoid behavior only for a few specific
types such as <tt>Int</tt> or <tt>Long</tt>.  This is because those types have the notion of a zero
element.  In our case we have such a zero element, too, hence we can support not only semigroup but also monoid
behavior.
</div>
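<p>If the semigroup/monoid distinction is still hazy, here is a minimal plain-Scala sketch (hypothetical trait definitions, not Algebird&#8217;s actual code) that shows where the zero element comes in:</p>

```scala
// Sketch: a semigroup only needs an associative plus; a monoid
// additionally needs a zero element that is neutral w.r.t. plus.
trait Semigroup[T] { def plus(l: T, r: T): T }
trait Monoid[T] extends Semigroup[T] { def zero: T }

// Max over Int is a monoid: Int.MinValue is <= every other Int.
val maxIntMonoid: Monoid[Int] = new Monoid[Int] {
  def plus(l: Int, r: Int): Int = math.max(l, r)
  def zero: Int = Int.MinValue
}

// Max over BigInt is only a semigroup: there is no smallest BigInt,
// hence no candidate for a zero element.
val maxBigIntSemigroup: Semigroup[BigInt] = new Semigroup[BigInt] {
  def plus(l: BigInt, r: BigInt): BigInt = l.max(r)
}
```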

<p>What would we do if we only wanted to add <code>compare()</code> to the monoid, but not to the original type?
The Algebird code has examples for this use case.  Here is the definition of the <code>Max[List]</code> monoid, which as you may
notice uses <code>Ordering</code> and not <code>Ordered</code> as in our example above.  You can ignore that small difference.  The key point
is that the <code>compare()</code> method is defined as part of the <code>Max[List]</code> monoid instead of being “duct-taped” to <code>List</code>
directly.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">implicit</span> <span class="k">def</span> <span class="n">listMonoid</span><span class="o">[</span><span class="kt">T:Ordering</span><span class="o">]</span><span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">Max</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]]]</span> <span class="k">=</span> <span class="n">monoid</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]](</span><span class="nc">Nil</span><span class="o">)(</span><span class="k">new</span> <span class="nc">Ordering</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="nd">@tailrec</span>
</span><span class="line">  <span class="k">final</span> <span class="k">override</span> <span class="k">def</span> <span class="n">compare</span><span class="o">(</span><span class="n">left</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">],</span> <span class="n">right</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="o">(</span><span class="n">left</span><span class="o">,</span> <span class="n">right</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
</span><span class="line">      <span class="k">case</span> <span class="o">(</span><span class="nc">Nil</span><span class="o">,</span> <span class="nc">Nil</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="mi">0</span>
</span><span class="line">      <span class="k">case</span> <span class="o">(</span><span class="nc">Nil</span><span class="o">,</span> <span class="k">_</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="o">-</span><span class="mi">1</span>
</span><span class="line">      <span class="k">case</span> <span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="nc">Nil</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="mi">1</span>
</span><span class="line">      <span class="k">case</span> <span class="o">(</span><span class="n">lh</span><span class="o">::</span><span class="n">lt</span><span class="o">,</span> <span class="n">rh</span><span class="o">::</span><span class="n">rt</span><span class="o">)</span> <span class="k">=&gt;</span>
</span><span class="line">        <span class="k">val</span> <span class="n">c</span> <span class="k">=</span> <span class="nc">Ordering</span><span class="o">[</span><span class="kt">T</span><span class="o">].</span><span class="n">compare</span><span class="o">(</span><span class="n">lh</span><span class="o">,</span> <span class="n">rh</span><span class="o">)</span>
</span><span class="line">        <span class="k">if</span><span class="o">(</span><span class="n">c</span> <span class="o">==</span> <span class="mi">0</span><span class="o">)</span> <span class="n">compare</span><span class="o">(</span><span class="n">lt</span><span class="o">,</span> <span class="n">rt</span><span class="o">)</span> <span class="k">else</span> <span class="n">c</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line"><span class="o">})</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h3 id="where-to-go-from-here">Where to go from here?</h3>

<p>Now that we have one monoid view for <code>TwitterUser</code>, what else can we do?  Can we find another monoid form for it?
That’s one of the questions you should ask yourself when working with your own data structures.  If you take a look at
the Algebird code, you will notice that many types such as <code>List</code> will have quite a few algebraic forms.</p>

<p>There is one more thing I want to mention here:  You may consider creating an additive monoid for <code>TwitterUser</code>, i.e.
a monoid that supports a <code>+</code>-like operation.  I couldn’t come up with any good example of how the result of adding two such
values would make sense (e.g., how could you “add” their usernames in a meaningful way?).  That being said, there is one
case where adding two <code>TwitterUser</code> values would make sense: to capture the idea that one follows the other, i.e.
to create a relationship (a link) between the two.  Keep in mind though that monoids and friends must adhere to the
<em>closure</em> principle – if you start out with a <code>TwitterUser</code> value and perform monoid operations on it, the end result
must always be another <code>TwitterUser</code> value.  Of course such a relationship can be modeled in code, but you cannot do
this with a <code>TwitterUser</code> monoid as defined above.</p>
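<p>The closure requirement is visible directly in the type signatures.  In this plain-Scala sketch (with a hypothetical <code>Follows</code> type), the first operation could back a <code>TwitterUser</code> monoid, while the second cannot:</p>

```scala
case class TwitterUser(name: String, numFollowers: Int)

// Closed: maps (TwitterUser, TwitterUser) back to TwitterUser,
// so it can serve as a monoid operation.
def maxByFollowers(l: TwitterUser, r: TwitterUser): TwitterUser =
  if (l.numFollowers >= r.numFollowers) l else r

// Not closed: modeling a "follows" relationship yields a different
// type, so it cannot be the operation of a TwitterUser monoid.
case class Follows(follower: TwitterUser, followee: TwitterUser)
def follow(follower: TwitterUser, followee: TwitterUser): Follows =
  Follows(follower, followee)
```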

<h2 id="creating-a-monad">Creating a monad?</h2>

<p>By now you should have sufficient understanding of monads and Algebird to implement your own monad.  So I leave this
as an exercise for the reader.</p>

<p>A starting point for you is <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">Monad.scala</a> in Algebird.</p>

<p>However if you do have a good idea what kind of monad I could showcase here – perhaps something related to Twitter to
match the <code>TwitterUser</code> monoid example above? – please let me know in the comments.</p>

<h2 id="key-algebraic-structures-in-algebird">Key algebraic structures in Algebird</h2>

<p>The following table is a juxtaposition of a few key algebraic structures, notably those that are implemented in
Algebird.  It should help you to navigate the Algebird code base, and also to figure out which algebraic structure
your own data types might support – i.e., <em>“Can I turn my <code>T</code> into a semigroup, or even a monoid?”</em>.</p>

<table>
  <tr>
    <th>Algebraic structure</th>
    <th>Binary op is associative</th>
    <th>Identity (has a zero element)</th>
    <th>+&nbsp;op</th>
    <th>-&nbsp;op</th>
    <th>*&nbsp;op</th>
    <th>/&nbsp;op</th>
    <th>References</th>
  </tr>
  <tr>
    <td>Semigroup</td>
    <td>YES</td>
    <td>-</td>
    <td>YES</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
    <td><a href="https://en.wikipedia.org/wiki/Semigroup">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Semigroup.scala">Algebird</a></td>
  </tr>
  <tr>
    <td>Monoid</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
    <td><a href="https://en.wikipedia.org/wiki/Monoid">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monoid.scala">Algebird</a></td>
  </tr>
  
  <tr>
    <td>Group</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>-</td>
    <td>-</td>
    <td><a href="https://en.wikipedia.org/wiki/Group_%28mathematics%29">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Group.scala">Algebird</a></td>
  </tr>
  <tr>
    <td>Ring</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>-</td>
    <td><a href="https://en.wikipedia.org/wiki/Ring_%28mathematics%29">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Ring.scala">Algebird</a></td>
  </tr>
  <tr>
    <td>Field</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td><a href="https://en.wikipedia.org/wiki/Field_%28mathematics%29">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Field.scala">Algebird</a></td>
  </tr>
</table>

<p>Think of <code>+</code> as the general notion of “adding one thing to another”, same for the other operations.  For two
<code>List[Int]</code>, for instance, <code>+</code> could be <em>concatenation</em> of the two (instead of, say, trying to add the individual <code>Int</code>
elements of the lists together).  The operators <code>+</code>, <code>-</code>, <code>*</code> and <code>/</code> are
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Operators.scala">as defined in Algebird</a>.</p>

<h2 id="a-small-algebird-faq">A small Algebird FAQ</h2>

<h3 id="error-cannot-find-groupmonoid-type-class-for-a-type-t">Error “Cannot find Group/Monoid/… type class for a type T”?</h3>

<p>If you run into this error it means you are trying to use an operation that is not supported by the algebraic structure
you are working with.  In this specific example, a <code>Set()</code> in Algebird has a monoid form and thus supports an
addition-like operation <code>+</code> but not a multiplication-like operation <code>*</code>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Set</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span><span class="mi">2</span><span class="o">,</span><span class="mi">3</span><span class="o">)</span> <span class="o">*</span> <span class="nc">Set</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span><span class="mi">3</span><span class="o">,</span><span class="mi">4</span><span class="o">)</span>
</span><span class="line"><span class="o">&lt;</span><span class="n">console</span><span class="k">&gt;:</span><span class="mi">2</span><span class="k">:</span> <span class="kt">error:</span> <span class="kt">Cannot</span> <span class="kt">find</span> <span class="kt">Ring</span> <span class="k">type</span> <span class="kt">class</span> <span class="kt">for</span> <span class="kt">scala.collection.immutable.Set</span><span class="o">[</span><span class="kt">Int</span><span class="o">]</span>
</span><span class="line">              <span class="nc">Set</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span><span class="mi">2</span><span class="o">,</span><span class="mi">3</span><span class="o">)</span> <span class="o">*</span> <span class="nc">Set</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span><span class="mi">3</span><span class="o">,</span><span class="mi">4</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h3 id="combine-different-monoids">Combine different monoids?</h3>

<p>In theory you <em>can</em> combine different monoids such as <code>Max[Int]</code> and <code>Min[Int]</code> and form their product, but there must
exist an appropriate algebraic structure for that product.  Right now, for instance, the following code will not work
in Algebird because it does not ship with an algebraic structure for <code>(Max[Int], Min[Int])</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">3</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Min</span><span class="o">(</span><span class="mi">4</span><span class="o">)</span>
</span><span class="line"><span class="o">&lt;</span><span class="n">console</span><span class="k">&gt;:</span><span class="mi">14</span><span class="k">:</span> <span class="kt">error:</span> <span class="kt">Cannot</span> <span class="kt">find</span> <span class="kt">Semigroup</span> <span class="k">type</span> <span class="kt">class</span> <span class="kt">for</span> <span class="kt">Product</span> <span class="kt">with</span> <span class="kt">Serializable</span>
</span><span class="line">              <span class="nc">Max</span><span class="o">(</span><span class="mi">3</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Min</span><span class="o">(</span><span class="mi">4</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
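<p>If you do need such a product, you can define a suitable structure yourself.  Here is a plain-Scala sketch (not Algebird code) that combines a max and a min component-wise, e.g. to track both extremes of a stream in a single pass:</p>

```scala
// Sketch: a component-wise plus for a (max, min) pair.
case class MaxMin(max: Int, min: Int) {
  def +(that: MaxMin): MaxMin =
    MaxMin(math.max(this.max, that.max), math.min(this.min, that.min))
}

val summary = List(3, 4, 7, 1).map(i => MaxMin(i, i)).reduce(_ + _)
assert(summary == MaxMin(7, 1))
```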

<h1 id="are-monads-really-everywhere">Are monads really everywhere?</h1>

<p>One thing that I have not yet investigated in further detail is how using monads compares to other patterns of
abstraction.  For instance, you can also use <a href="http://www.clojure.net/2012/02/02/Monads-in-Clojure/">monads in Clojure</a>
(the author Jim Duey actually wrote a
<a href="http://www.clojure.net/tags.html#monads-ref">whole series of blog posts covering monads</a>), but in a quick initial
search I observed that Clojure developers apparently use different constructs to achieve similar effects.</p>

<p>If you have some insights to share here, please feel free to reply to this post!</p>

<h1 id="summary">Summary</h1>

<p>I hope this post contributes a little bit to the understanding of the rather abstract concepts of monoids and monads,
and how you can put them to good practical use via tools such as <a href="https://github.com/twitter/algebird">Algebird</a>,
<a href="https://github.com/twitter/scalding">Scalding</a> and <a href="https://github.com/twitter/summingbird">SummingBird</a>.</p>

<p>One of my lessons learned was that working with monoids and monads is a nice opportunity to read up on more formal
concepts (category theory), and at the same time realize how they can be put to practical use in engineering, notably
when doing large-scale data analytics.</p>

<p>On my side I want to thank the Twitter engineering team (<a href="https://twitter.com/TwitterEng">@TwitterEng</a>) not only for
making those tools available to the open source community, but also for sparking my interest in the practical
application of algebraic structures and category theory in general.  The same shout-out goes to the various people who
wrote blog posts on the topic, or who shared their insights on places such as StackOverflow (see the reference section
at the end of this article for a few of them).  As I said there was a lot of new information to swallow – and in a
short period of time – but the quest was worth it.</p>

<p>Many thanks!  <em>–Michael</em></p>

<h1 id="references">References</h1>

<h2 id="monads-and-monoids">Monads and monoids</h2>

<p>I tried to categorize the references below into “easy” and “advanced” reads.  Of course this is highly subjective, and
your mileage may vary.</p>

<p>Easy reads:</p>

<ul>
  <li><a href="http://www.manning.com/bjarnason/">Functional Programming in Scala</a> by P. Chiusano and R. Bjarnason, published by
Manning.  Includes chapters on monoids and monads, and how to implement them in Scala.</li>
  <li><a href="http://www.codecommit.com/blog/ruby/monads-are-not-metaphors">Monads are not metaphors</a>, by Daniel Spiewak.</li>
  <li><a href="http://stackoverflow.com/questions/3870088">A monad is just a monoid in the category of endofunctors, what’s the problem?</a>,
question on StackOverflow.  If you are just starting out with monads etc., I’d recommend reading the
<a href="http://stackoverflow.com/a/7829607/1743580">second answer</a> first.</li>
  <li><a href="http://james-iry.blogspot.ch/2007/09/monads-are-elephants-part-1.html">Monads are elephants</a>, a series of blog posts
by James Iry.  In Scala.</li>
  <li><a href="http://adit.io/posts/2013-04-17-functors,_applicatives,_and_monads_in_pictures.html">Functors, Applicatives, And Monads In Pictures</a>,
by Aditya Bhargava.</li>
  <li><a href="http://www.haskell.org/haskellwiki/Monad_laws">Monad laws</a> (in Haskell).  Remember <code>return</code> in Haskell means our
constructor <code>apply()</code> in Scala, and <code>&gt;&gt;=</code> in Haskell is our <code>flatMap()</code>.</li>
  <li>Wikipedia articles on algebraic structures:  I found that selective reading of those did help my understanding (I did
not try to understand all the sections in those articles).  Notably, I liked the juxtaposition of semigroups, monoids,
groups, rings, etc. which highlighted their similarities and differences.  Later on I discovered that the Algebird
code is structured similarly, so if you can tell a semigroup from a monoid you will have an easier time navigating
the code.
    <ul>
      <li><a href="https://en.wikipedia.org/wiki/Semigroup">Semigroup</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Monoid">Monoid</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Group_%28mathematics%29">Group</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Ring_%28mathematics%29">Ring</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Monad_%28functional_programming%29">Monad</a></li>
    </ul>
  </li>
</ul>

<p>Advanced reads:</p>

<ul>
  <li><a href="http://www.stephendiehl.com/posts/monads.html">Monads Made Difficult</a>, by Stephen Diel.  In Haskell.</li>
  <li><a href="http://www.clojure.net/2012/02/02/Monads-in-Clojure/">Monads in Clojure</a>, by Jim Duey.  In Clojure.  Jim actually
wrote a <a href="http://www.clojure.net/tags.html#monads-ref">whole series of blog posts covering monads</a>.</li>
  <li><a href="http://learnyouahaskell.com/functors-applicative-functors-and-monoids">Functors, Applicative Functors and Monoids</a>,
a chapter in <a href="http://learnyouahaskell.com">Learn You a Haskell</a>.</li>
</ul>

<h2 id="summingbird">SummingBird</h2>

<ul>
  <li><a href="https://speakerdeck.com/sritchie/summingbird-at-cufp">SummingBird at CUFP 2013</a>, slides by Sam Ritchie (former
Twitter engineer)</li>
</ul>

<h2 id="category-theory">Category theory</h2>

<p>Speaking from my own experience, I would say you do not need to understand the full details of category theory.  The
links above should contain all the information you need to gain enough understanding of monoids, monads and such to
be productive in a short period of time.  However, I used the references below to fill gaps that remained after
reading through the other sources, and I remember jumping back and forth between the academic references below and
the more hands-on resources above.</p>

<ul>
  <li><a href="http://www.amazon.com/dp/0521283043">An Introduction to Category Theory</a>, by Harold Simmons.  As a novice to category
theory I preferred this text over <em>Category Theory for Computing Science</em> (see below).  Unlike the latter, though,
Simmons&#8217;s book is not available for free.</li>
  <li><em>Category Theory for Computing Science</em>, by Michael Barr and Charles Wells, available as a
<a href="http://www.math.mcgill.ca/triples/Barr-Wells-ctcs.pdf">free PDF</a>.  This seems to be a seminal work on category
theory and worth the read if you are interested in the mathematical foundation of the theory in about 400 pages.</li>
</ul>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Sending Metrics from Storm to Graphite]]></title>
    <link href="http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-11-06T16:00:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite</id>
    <content type="html"><![CDATA[<p>So you got your first <a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/">distributed Storm cluster installed</a> and have
your first topologies up and running.  Great!  Now you want to integrate your Storm applications with your monitoring
systems and begin tracking application-level metrics from your topologies.  In this article I show you how to
integrate Storm with the popular Graphite monitoring system.  This, combined with the Storm UI, will provide you with
actionable information to
<a href="http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/">tune the performance of your topologies</a> and also help
you to track key business as well as technical metrics.</p>

<!-- more -->

<div class="note">
<strong>Update March 13, 2015</strong>: We have open sourced
<a href="https://github.com/verisign/storm-graphite">storm-graphite</a>, a Storm IMetricsConsumer implementation that
forwards Storm&#8217;s <em>built-in metrics</em> to a Graphite server for real-time graphing, visualization, and operational
dashboards.  These built-in metrics greatly augment the application-level metrics that you can send from your Storm
topologies to Graphite (sending application metrics is described in this article).  The built-in metrics include
execution count and latency of your bolts, Java heap space usage and garbage collection statistics, and much more.
So if you are interested in even better metrics and deeper insights into your Storm cluster, I&#8217;d strongly recommend
taking a look at
<a href="https://github.com/verisign/storm-graphite">storm-graphite</a>.  We also describe how to configure Graphite
and Grafana, a dashboard for Graphite, to make use of the built-in metrics provided by storm-graphite.
</div>

<h1 id="background-what-is-graphite">Background: What is Graphite?</h1>

<p>Quoting from <a href="http://graphite.readthedocs.org/en/latest/overview.html">Graphite’s documentation</a>, Graphite does two
things:</p>

<ol>
  <li>Store numeric time-series data</li>
  <li>Generate and render graphs of this data on demand</li>
</ol>

<p>What Graphite does not do is collect the actual input data for you, i.e. your system or application metrics.  The
purpose of this blog post is to show how you can do this for your Storm applications.</p>

<div class="note">
Note: The Graphite project is currently undergoing significant changes.  The project has been moved to GitHub and split
into individual components.  Also, the next version of Graphite will include Ceres, which is a distributed
time-series database, and a major refactor of its Carbon daemon.  If that draws your interest then you can
<a href="http://graphite.wikidot.com/">read about the upcoming changes in further detail</a>.  I mention this just for
completeness &#8211; it should not deter you from jumping on the Graphite bandwagon.
</div>

<h1 id="what-we-want-to-do">What we want to do</h1>

<h2 id="spatial-granularity-of-metrics">Spatial granularity of metrics</h2>

<p>For the context of this post we want to use Graphite to track the number of received tuples of an example bolt
<em>per node</em> in the Storm cluster.  This allows us, say, to pinpoint a potential topology bottleneck to specific machines
in the Storm cluster – and this is particularly powerful if we already track system metrics (CPU load, memory usage,
network traffic and such) in Graphite because then you can correlate system and application level metrics.</p>

<p>Keep in mind that in Storm multiple instances of a bolt may run on a given node, and its instances may also run on many
different nodes.  Our challenge will be to configure Storm and Graphite in a way that we are able to correctly collect
and aggregate all individual values reported by those many instances of the bolt.  Also, the total value of these
per-host tuple counts should ideally match the bolt’s <code>Executed</code> value – which means the number of executed tuples of
a bolt (i.e. across all instances of the bolt in a topology) – in the Storm UI.</p>

<p>We will add Graphite support to our Java-based Storm topology by using Coda Hale/Yammer’s
<a href="http://metrics.codahale.com/">Metrics library for Java</a>, which directly supports
<a href="http://metrics.codahale.com/manual/graphite/">reporting metrics to Graphite</a>.</p>

<p>We will track the number of received tuples of our example bolt through the following metrics, where <em>HOSTNAME</em> is a
placeholder for the hostname of a particular Storm node (e.g. <code>storm-node01</code>):</p>

<ul>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.count</code></li>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.m1_rate</code> – 1-minute rate</li>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.m5_rate</code> – 5-minute rate</li>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.m15_rate</code> – 15-minute rate</li>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.mean_rate</code> – average rate/sec</li>
</ul>

<p>Here, the prefix of the metric namespace <code>production.apps.graphitedemo.HOSTNAME.tuples.received</code> is defined by us.
Splitting up this “high-level” metric into a <code>count</code> metric and four rate metrics – <code>m{1,5,15}_rate</code> and <code>mean_rate</code> –
is automatically done by the Metrics Java library.</p>
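<p>The <em>HOSTNAME</em> component of the namespace can be derived at runtime on each Storm node.  A minimal plain-Java
sketch of assembling the prefix – the helper names here are mine, not from the topology code:</p>

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class MetricNames {
  // Builds e.g. "production.apps.graphitedemo.storm-node01.tuples.received".
  static String metricNamespace(String env, String app, String hostname, String metric) {
    // Graphite uses "." as its path separator, so dots inside a fully
    // qualified hostname (e.g. "storm-node01.example.com") would split one
    // host into several path components -- keep only the short hostname.
    String shortHost = hostname.split("\\.")[0];
    return env + ".apps." + app + "." + shortHost + "." + metric;
  }

  // Convenience variant that looks up the local node's own hostname.
  static String localMetricNamespace(String env, String app, String metric) {
    try {
      return metricNamespace(env, app, InetAddress.getLocalHost().getHostName(), metric);
    } catch (UnknownHostException e) {
      return metricNamespace(env, app, "unknown-host", metric);
    }
  }

  public static void main(String[] args) {
    System.out.println(metricNamespace("production", "graphitedemo",
        "storm-node01.example.com", "tuples.received"));
    // production.apps.graphitedemo.storm-node01.tuples.received
  }
}
```

<p>Note that only this prefix is ours; the <code>count</code> and <code>*_rate</code> suffixes are appended by the
Metrics library.</p>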

<h2 id="temporal-granularity-of-metrics">Temporal granularity of metrics</h2>

<p>Because Storm is a real-time analytics platform we want to use a shorter time window for metrics updates than Graphite’s
default, which is one minute.  In our case we will report metrics data every 10 seconds (the finest granularity that
Graphite supports is one second).</p>

<h2 id="assumptions">Assumptions</h2>

<ul>
  <li>We are using a single Graphite server called <code>your.graphite.server.com</code>.</li>
  <li>The <code>carbon-cache</code> and <code>carbon-aggregator</code> daemons of Graphite are both running on the Graphite server machine, i.e.
<code>carbon-aggregator</code> will send its updates to the <code>carbon-cache</code> daemon running at <code>127.0.0.1</code>.  Also, our Storm
topology will send all its metrics to this Graphite server.</li>
</ul>

<p>Thankfully, the specifics of the Storm cluster, such as the hostnames of its nodes, do not matter.  So the approach
described here should work nicely with your existing Storm cluster.</p>

<h1 id="desired-outcome-graphs-and-dashboards">Desired outcome: graphs and dashboards</h1>

<p>The desired end result is a set of graphs and dashboards similar to the following Graphite screenshot:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/graphitedemo-storm-dashboard.png" title="Example graph in Graphite that displays number of received tuples" /></p>

<div class="caption">
Example graph in Graphite that displays the number of received tuples.  The brown line is the aggregate of all per-host
tuple counts of this 4-node Storm cluster and computed via Graphite&#8217;s
<a href="http://graphite.readthedocs.org/en/latest/functions.html#graphite.render.functions.sumSeries">sumSeries()</a>
function.  Note that only 3 of the 4 nodes are actually running instances of the bolt, hence you only see 3+1 lines in
the graph.
</div>

<h1 id="versions">Versions</h1>

<p>The instructions in this article have been tested on RHEL/CentOS 6 with the following software versions:</p>

<ul>
  <li><a href="http://storm-project.net/">Storm</a> 0.9.0-rc2</li>
  <li><a href="http://graphite.wikidot.com/">Graphite</a> 0.9.12 (stock version available in EPEL for RHEL6)</li>
  <li><a href="http://metrics.codahale.com/">Metrics</a> 3.0.1</li>
  <li>Oracle JDK 6</li>
</ul>

<p>Note that I will not cover the installation of Storm or Graphite in this post.</p>

<div class="note">
Heads up: I am currently working on open sourcing an automated deployment tool called Wirbelsturm that you can use to
install Storm clusters and Graphite servers (and other Big Data related software packages) from scratch.  Wirbelsturm
is based on the popular deployment tools <a href="http://puppetlabs.com/">Puppet</a> and
<a href="http://www.vagrantup.com/">Vagrant</a>.  Please stay tuned!
</div>

<h1 id="a-graphite-primer">A Graphite primer</h1>

<h2 id="understanding-how-graphite-handles-incoming-data">Understanding how Graphite handles incoming data</h2>

<p>One pitfall for Graphite beginners is the default behavior of Graphite to discard all but the last update message
received during a given time slot (the default size of a time slot for metrics in Graphite is 60 seconds).  For example,
if we are sending the metric values <code>5</code> and <code>4</code> during the same time slot then Graphite will first store a value of <code>5</code>,
and as soon as the value <code>4</code> arrives it will overwrite the stored value from <code>5</code> to <code>4</code> (but not sum it up to <code>9</code>).</p>

<p>The following diagram shows what happens if Graphite receives multiple updates during the same time slot when we are NOT
using an aggregator such as <a href="http://graphite.readthedocs.org/en/latest/carbon-daemons.html">carbon-aggregator</a> or
<a href="https://github.com/etsy/statsd">statsd</a> in between.  In this example we use a time slot of 10 seconds for the metric.
Note again that in this scenario you might see, for instance, “flapping” values for the second time slot (the window of
seconds 10 to 20) depending on <em>when</em> you query Graphite:  If you queried Graphite at second 15 for the 10-20 time
slot, you would receive a return value of <code>3</code>; if you queried only a few seconds later, you would start receiving the
final value of <code>7</code> (which would then never change again).</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Graphite-update-behavior-01.png" title="Example Graphite behavior without carbon-aggregator or statsd" /></p>

<p>In most situations losing all but the last update of a given time slot is not what you want.  The next diagram shows how
aggregators solve the “only the last update counts” problem.  A nice property of aggregators is that they are
transparent to the client, which can continue to send updates whenever it sees fit – the aggregators ensure that
Graphite only sees a single, merged update message per time slot.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Graphite-update-behavior-02.png" title="Example Graphite behavior with carbon-aggregator or statsd" /></p>
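<p>The two behaviors shown in the diagrams boil down to a few lines of code.  Here is a toy simulation, using the same
numbers as the example above (two updates, <code>3</code> and <code>7</code>, arriving in the same 10-second time slot):</p>

```java
import java.util.HashMap;
import java.util.Map;

public class TimeSlots {
  static final int SLOT_SECONDS = 10;

  // carbon-cache without an aggregator in front: the last write wins
  // within a time slot, earlier updates for the slot are discarded.
  static void storeLastWins(Map<Long, Long> db, long epochSecond, long value) {
    db.put(epochSecond / SLOT_SECONDS, value);
  }

  // With carbon-aggregator (or statsd) in front: updates within a slot are
  // summed, and only the single merged value reaches carbon-cache.
  static void storeSummed(Map<Long, Long> db, long epochSecond, long value) {
    db.merge(epochSecond / SLOT_SECONDS, value, Long::sum);
  }

  public static void main(String[] args) {
    Map<Long, Long> plain = new HashMap<>();
    Map<Long, Long> aggregated = new HashMap<>();
    // Two updates falling into the same 10s-20s time slot:
    for (long[] update : new long[][] { { 12, 3 }, { 17, 7 } }) {
      storeLastWins(plain, update[0], update[1]);
      storeSummed(aggregated, update[0], update[1]);
    }
    System.out.println(plain.get(1L));      // 7  (the earlier value 3 was overwritten)
    System.out.println(aggregated.get(1L)); // 10 (3 + 7)
  }
}
```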

<h2 id="implications-of-storms-execution-model">Implications of Storm’s execution model</h2>

<p>In the case of Storm you implement a bolt (or spout) as a single class, e.g. by extending <code>BaseBasicBolt</code>.  So following
the <a href="http://metrics.codahale.com/manual/">User Manual</a> of the Metrics library seems to be a straightforward way to add
Graphite support to your Storm bolts.  However you must be aware of how Storm will actually execute your topology
behind the scenes – see my earlier post on
<a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">Understanding the Parallelism of a Storm Topology</a>:</p>

<ol>
  <li>In Storm each bolt typically runs in the form of many bolt instances in a single worker process, and thus you have
many bolt instances in a single JVM.</li>
  <li>In Storm there are typically many such workers (and thus JVMs) per machine, so you end up with many instances of the
same bolt running across many workers/JVMs on a particular machine.</li>
  <li>On top of that a bolt’s instances will also be spread across many different machines in the Storm cluster, so in
total you will typically have many bolt instances running in many JVMs across many Storm nodes.</li>
</ol>

<p>Our challenge to integrate Storm with Graphite can thus be stated as:  How can we ensure that we are reporting metrics
from our Storm topology to Graphite in such a way that a) we are <em>counting</em> tuples correctly across all bolt instances,
and b) the many metric update messages are not canceling each other out?  In other words, how can we keep Storm’s
highly distributed nature in check and make it play nice with Graphite?</p>

<h1 id="high-level-approach">High-level approach</h1>

<h2 id="overview-of-the-approach-described-in-this-post">Overview of the approach described in this post</h2>

<p>Here is an overview of the approach we will be using:</p>

<ul>
  <li><em>Each instance</em> of our example Storm bolt gets its own (Java) instance of
<a href="http://metrics.codahale.com/manual/core/">Meter</a>.  This ensures that each bolt instance tracks its count of received
tuples separately from any other bolt instance.</li>
  <li>Also, each bolt instance will get its own instance of <a href="http://metrics.codahale.com/manual/graphite/">GraphiteReporter</a>
to ensure that each bolt instance sends only a single metrics update every 10 seconds, which is the desired temporal
granularity for our monitoring setup.</li>
  <li>All bolt instances on a given Storm node report their metrics under the node’s <em>hostname</em>.  For instance, bolt
instances on the machine <code>storm-node01.example.com</code> will report their metrics under the namespace
<code>production.apps.graphitedemo.storm-node01.tuples.received.*</code>.</li>
  <li>Metrics are being sent to a <code>carbon-aggregator</code> instance running at <code>your.graphite.server.com:2023/tcp</code>.  The
<code>carbon-aggregator</code> ensures that all the individual metrics updates (from bolt instances) of a particular Storm node
are aggregated into a single, per-host metric update.  These per-host metric updates are then forwarded to the
<code>carbon-cache</code> instance, which will store the metric data in the corresponding Whisper database files.</li>
</ul>
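<p>On the wire, the reporter talks to <code>carbon-aggregator</code>&#8217;s line receiver via Graphite&#8217;s plaintext
protocol: one <code>metric-path value epoch-seconds</code> line per update.  For illustration, here is a dependency-free
sketch of what a single update from a bolt instance looks like – the class and method names are mine, not the Metrics
library&#8217;s:</p>

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class PlaintextGraphite {
  // Graphite's plaintext protocol: "<metric.path> <value> <epoch-seconds>\n"
  static String formatLine(String path, double value, long epochSeconds) {
    return path + " " + value + " " + epochSeconds + "\n";
  }

  // Sends one update to carbon-aggregator (2023/tcp in this article's setup).
  static void send(String host, int port, String path, double value, long epochSeconds)
      throws IOException {
    try (Socket socket = new Socket(host, port);
         Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
      out.write(formatLine(path, value, epochSeconds));
    }
  }

  public static void main(String[] args) {
    System.out.print(formatLine(
        "production.apps.graphitedemo.storm-node01.tuples.received.count", 42.0, 1383750000L));
    // production.apps.graphitedemo.storm-node01.tuples.received.count 42.0 1383750000
  }
}
```

<p>In the actual setup you never hand-roll this – <code>GraphiteReporter</code> produces exactly such lines every
10 seconds for each registered metric – but seeing the raw format makes the aggregation rules below easier to read.</p>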

<h2 id="other-approaches-not-used">Other approaches (not used)</h2>

<p>Another strategy is to install an aggregator intermediary (such as <a href="https://github.com/etsy/statsd">statsd</a>) on each
machine in the Storm cluster.  Instances of a bolt on the same machine would be sending their individual updates to this
per-host aggregator daemon, which in turn would send a single, per-host update message to Graphite.  I am sure this
approach would have worked but I decided not to go down this path.  It would have increased the deployment complexity
because now we’d have one more software package to understand, support and manage per machine.</p>

<p>The final setup described in this post achieves what we want by using <code>GraphiteReporter</code> in our Storm code in a way
that is compatible with Graphite’s built-in daemons without needing any additional software such as <code>statsd</code>.</p>

<p>On a completely different note, Storm 0.9 now also comes with its own metrics system, which I do not cover here.
This new metrics feature of Storm allows you to collect arbitrary custom metrics over fixed time windows.  Those
metrics are exported to a metrics stream that you can consume by implementing
<a href="https://github.com/nathanmarz/storm/blob/master/storm-core/src/jvm/backtype/storm/metric/api/IMetricsConsumer.java">IMetricsConsumer</a>
and configuring it via
<a href="https://github.com/nathanmarz/storm/blob/master/storm-core/src/jvm/backtype/storm/Config.java">Config</a> – see the
various <code>*_METRICS_*</code> settings.  Then you need to use <code>TopologyContext#registerMetric()</code> to register new metrics.</p>

<h1 id="integrating-storm-with-graphite">Integrating Storm with Graphite</h1>

<h2 id="configuring-graphite">Configuring Graphite</h2>

<p>I will only cover the key settings of Graphite for the context of this article, which are the settings related to
<code>carbon-cache</code> and <code>carbon-aggregator</code>.  <strong>Those settings must match the settings in your Storm code.</strong>  Matching
settings between Storm and Graphite is critical – if they don’t match, you will end up with junk metric data.</p>

<h3 id="carbon-cache-configuration">carbon-cache configuration</h3>

<p>First we must add a <code>[production_apps]</code> section (the name itself is arbitrary; it should just be descriptive) to
<code>/etc/carbon/storage-schemas.conf</code>.  This controls at which granularity Graphite will store incoming metrics that we are
sending from our Storm topology.  Notably these storage schema settings control:</p>

<ul>
  <li>The minimum temporal granularity for the “raw” incoming metric updates of a given metric namespace:  In our case, for
instance, we want Graphite to track metrics at a raw granularity of 10 seconds for the first two days.  We configure
this via <code>10s:2d</code>.  This minimum granularity (10 seconds) <strong>must match</strong> the report interval we use in our Storm code.</li>
  <li>How Graphite aggregates older metric values that have already been stored in its Whisper database files:
In our case we tell Graphite to aggregate any values older than two days into 5-minute buckets that we want to keep
for one year, hence <code>5m:1y</code>.  This setting (5 minutes) is independent from our Storm code.</li>
</ul>

<div class="warning">
Caution: Graphite knows two different kinds of aggregation.  First, the aggregation of metrics data that is already
stored in its Whisper database files; this aggregation is performed on aging data to save disk storage space.
Second, the real-time aggregation of incoming metrics performed by <tt>carbon-aggregator</tt>; this aggregation
happens for newly received data as it is flying in over the network, i.e. before that data even hits the Whisper
database files.  Do not confuse these two aggregations!
</div>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/storage-schemas.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="text"><span class="line"># Schema definitions for whisper files. Entries are scanned in order, and first match wins.
</span><span class="line">[carbon]
</span><span class="line">pattern = ^carbon\.
</span><span class="line">retentions = 60:90d
</span><span class="line">
</span><span class="line">[production_apps]
</span><span class="line">pattern = ^production\.apps\.
</span><span class="line">retentions = 10s:2d,5m:1y
</span><span class="line">
</span><span class="line">[default_1min_for_1day]
</span><span class="line">pattern = .*
</span><span class="line">retentions = 60s:1d
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Next we must tell Graphite which aggregation method – e.g. <code>sum</code> or <code>average</code> – it should use to perform storage
aggregation of our metrics.  For count-type metrics, for instance, we want to use <code>sum</code> and for rate-type metrics we
want to use <code>average</code>.  By adding the following lines to <code>/etc/carbon/storage-aggregation.conf</code> we ensure that Graphite
correctly aggregates the default metrics sent by Metrics’ GraphiteReporter – <code>count</code>, <code>m1_rate</code>, <code>m5_rate</code>, <code>m15_rate</code>
and <code>mean_rate</code> – once two days have passed.</p>

<div class="note">
Note: The <tt>[min]</tt> and <tt>[max]</tt> sections are not actually used by the setup described in this article, but I
decided to include them anyway to show how they differ from the other settings.  Also, your production Graphite setup
may actually need such settings, too.
</div>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/storage-aggregation.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
</pre></td><td class="code"><pre><code class="text"><span class="line">[min]
</span><span class="line">pattern = \.min$
</span><span class="line">xFilesFactor = 0.1
</span><span class="line">aggregationMethod = min
</span><span class="line">
</span><span class="line">[max]
</span><span class="line">pattern = \.max$
</span><span class="line">xFilesFactor = 0.1
</span><span class="line">aggregationMethod = max
</span><span class="line">
</span><span class="line">[sum]
</span><span class="line">pattern = \.count$
</span><span class="line">xFilesFactor = 0
</span><span class="line">aggregationMethod = sum
</span><span class="line">
</span><span class="line">[m1_rate]
</span><span class="line">pattern = \.m1_rate$
</span><span class="line">xFilesFactor = 0
</span><span class="line">aggregationMethod = average
</span><span class="line">
</span><span class="line">[m5_rate]
</span><span class="line">pattern = \.m5_rate$
</span><span class="line">xFilesFactor = 0
</span><span class="line">aggregationMethod = average
</span><span class="line">
</span><span class="line">[m15_rate]
</span><span class="line">pattern = \.m15_rate$
</span><span class="line">xFilesFactor = 0
</span><span class="line">aggregationMethod = average
</span><span class="line">
</span><span class="line">[default_average]
</span><span class="line">pattern = .*
</span><span class="line">xFilesFactor = 0.3
</span><span class="line">aggregationMethod = average
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Lastly, make sure that the <code>carbon-cache</code> daemon is actually enabled in your <code>/etc/carbon/carbon.conf</code> and configured to
receive incoming data on its <code>LINE_RECEIVER_PORT</code> at <code>2003/tcp</code> and also (!) on its <code>PICKLE_RECEIVER_PORT</code> at
<code>2004/tcp</code>.  The latter port is used by <code>carbon-aggregator</code>, which we will configure in the next section.</p>

<p>Example configuration snippet:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/carbon.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="text"><span class="line"># ...snipp...
</span><span class="line">
</span><span class="line">[cache]
</span><span class="line">LINE_RECEIVER_INTERFACE = 0.0.0.0
</span><span class="line">LINE_RECEIVER_PORT = 2003
</span><span class="line">PICKLE_RECEIVER_INTERFACE = 0.0.0.0
</span><span class="line">PICKLE_RECEIVER_PORT = 2004
</span><span class="line">
</span><span class="line"># ...snipp...
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Don’t forget to restart <code>carbon-cache</code> after changing its configuration:</p>

<pre><code>$ sudo service carbon-cache restart
</code></pre>

<h3 id="carbon-aggregator-configuration">carbon-aggregator configuration</h3>

<p>The last Graphite configuration step is to ensure that we can pre-aggregate the number of reported
<code>tuples.received</code> values across all bolt instances that run on a particular Storm node.</p>

<p>To perform this per-host aggregation on the fly we must add the following lines to <code>/etc/carbon/aggregation-rules.conf</code>.
With those settings in place, whenever any bolt instance running on <code>storm-node01</code> sends a metric such as
<code>production.apps.graphitedemo.storm-node01.tuples.received.count</code> to Graphite (more precisely, to its
<code>carbon-aggregator</code> daemon), the aggregator will combine (here: <code>sum</code>) all such update messages for
<code>storm-node01</code> into a single, aggregated update message for that server every 10 seconds.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/aggregation-rules.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="text"><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.count (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.count
</span><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.m1_rate (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.m1_rate
</span><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.m5_rate (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.m5_rate
</span><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.m15_rate (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.m15_rate
</span><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.mean_rate (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.mean_rate
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Lastly, make sure that the <code>carbon-aggregator</code> daemon is actually enabled in your <code>/etc/carbon/carbon.conf</code> and
configured to receive incoming data on its <code>LINE_RECEIVER_PORT</code> at <code>2023/tcp</code>.  Also, make sure it sends its aggregates
to the <code>PICKLE_RECEIVER_PORT</code> of <code>carbon-cache</code> (port <code>2004/tcp</code>).  See the <code>[aggregator]</code> section.</p>

<p>Example configuration snippet:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/carbon.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="text"><span class="line"># ...snipp...
</span><span class="line">
</span><span class="line">[aggregator]
</span><span class="line">LINE_RECEIVER_INTERFACE = 0.0.0.0
</span><span class="line">LINE_RECEIVER_PORT = 2023
</span><span class="line">PICKLE_RECEIVER_INTERFACE = 0.0.0.0
</span><span class="line">PICKLE_RECEIVER_PORT = 2024
</span><span class="line">DESTINATIONS = 127.0.0.1:2004  # &lt;&lt;&lt; this points to the carbon-cache pickle port
</span><span class="line">
</span><span class="line"># ...snipp...
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Don’t forget to restart <code>carbon-aggregator</code> after changing its configuration:</p>

<pre><code>$ sudo service carbon-aggregator restart
</code></pre>

<h3 id="other-important-graphite-settings">Other important Graphite settings</h3>

<p>You may also want to check the values of the following Carbon settings in <code>/etc/carbon/carbon.conf</code>, particularly if you
are sending a lot of different metrics (= high number of metrics such as <code>my.foo</code> and <code>my.bar</code>) and/or a lot of metric
update messages per second (= high number of incoming metric updates for <code>my.foo</code>).</p>

<p>Whether or not you need to tune those settings depends on your specific use case.  As a rule of thumb: The more Storm
nodes you have, the higher the topology’s parallelism and the higher your data volume, the more likely you will need to
optimize those settings.  If you are not sure, leave them at their defaults and revisit later.</p>

<div class="note">
Note: I&#8217;d say the most important parameters at the very beginning are <tt>MAX_CREATES_PER_MINUTE</tt> (you might hit
this particularly when your topology starts to submit metrics for the very first time) and
<tt>MAX_UPDATES_PER_SECOND</tt>.
</div>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/carbon.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
</pre></td><td class="code"><pre><code class="text"><span class="line">[cache]
</span><span class="line"># Limit the size of the cache to avoid swapping or becoming CPU bound.
</span><span class="line"># Sorts and serving cache queries gets more expensive as the cache grows.
</span><span class="line"># Use the value &quot;inf&quot; (infinity) for an unlimited cache size.
</span><span class="line">MAX_CACHE_SIZE = inf
</span><span class="line">
</span><span class="line"># Limits the number of whisper update_many() calls per second, which effectively
</span><span class="line"># means the number of write requests sent to the disk. This is intended to
</span><span class="line"># prevent over-utilizing the disk and thus starving the rest of the system.
</span><span class="line"># When the rate of required updates exceeds this, then carbon&#39;s caching will
</span><span class="line"># take effect and increase the overall throughput accordingly.
</span><span class="line">MAX_UPDATES_PER_SECOND = 500
</span><span class="line">
</span><span class="line"># Softly limits the number of whisper files that get created each minute.
</span><span class="line"># Setting this value low (like at 50) is a good way to ensure your graphite
</span><span class="line"># system will not be adversely impacted when a bunch of new metrics are
</span><span class="line"># sent to it. The trade off is that it will take much longer for those metrics&#39;
</span><span class="line"># database files to all get created and thus longer until the data becomes usable.
</span><span class="line"># Setting this value high (like &quot;inf&quot; for infinity) will cause graphite to create
</span><span class="line"># the files quickly but at the risk of slowing I/O down considerably for a while.
</span><span class="line">MAX_CREATES_PER_MINUTE = 50
</span><span class="line">
</span><span class="line">[aggregator]
</span><span class="line"># This is the maximum number of datapoints that can be queued up
</span><span class="line"># for a single destination. Once this limit is hit, we will
</span><span class="line"># stop accepting new data if USE_FLOW_CONTROL is True, otherwise
</span><span class="line"># we will drop any subsequently received datapoints.
</span><span class="line">MAX_QUEUE_SIZE = 10000
</span><span class="line">
</span><span class="line"># Set this to False to drop datapoints when any send queue (sending datapoints
</span><span class="line"># to a downstream carbon daemon) hits MAX_QUEUE_SIZE. If this is True (the
</span><span class="line"># default) then sockets over which metrics are received will temporarily stop accepting
</span><span class="line"># data until the send queues fall below 80% MAX_QUEUE_SIZE.
</span><span class="line">USE_FLOW_CONTROL = True
</span><span class="line">
</span><span class="line"># This defines the maximum &quot;message size&quot; between carbon daemons.
</span><span class="line"># You shouldn&#39;t need to tune this unless you really know what you&#39;re doing.
</span><span class="line">MAX_DATAPOINTS_PER_MESSAGE = 500
</span><span class="line">
</span><span class="line"># This defines how many datapoints the aggregator remembers for
</span><span class="line"># each metric. Aggregation only happens for datapoints that fall in
</span><span class="line"># the past MAX_AGGREGATION_INTERVALS * intervalSize seconds.
</span><span class="line">MAX_AGGREGATION_INTERVALS = 5
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="configuring-your-storm-code">Configuring your Storm code</h2>

<h3 id="add-the-metrics-library-to-your-storm-code-project">Add the Metrics library to your Storm code project</h3>

<p><em>The instructions below are for Gradle, but it is straightforward to adapt them to Maven if that is your tool of choice.</em></p>

<p>Now that we have finished the Graphite setup we can turn our attention to augmenting our Storm code to work with
Graphite.  Make sure <code>build.gradle</code> in your Storm code project looks similar to the following:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>build.gradle  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
</pre></td><td class="code"><pre><code class="text"><span class="line">buildscript {
</span><span class="line">  repositories {
</span><span class="line">    mavenCentral()
</span><span class="line">  }
</span><span class="line">  dependencies {
</span><span class="line">    // see https://github.com/musketyr/gradle-fatjar-plugin
</span><span class="line">    classpath &#39;eu.appsatori:gradle-fatjar-plugin:0.2-rc1&#39;
</span><span class="line">  }
</span><span class="line">}
</span><span class="line">
</span><span class="line">apply plugin: &#39;java&#39;
</span><span class="line">apply plugin: &#39;fatjar&#39;
</span><span class="line">// ...other plugins may follow here...
</span><span class="line">
</span><span class="line">// We use JDK 6.
</span><span class="line">sourceCompatibility = 1.6
</span><span class="line">targetCompatibility = 1.6
</span><span class="line">
</span><span class="line">group = &#39;com.miguno.storm.graphitedemo&#39;
</span><span class="line">version = &#39;0.1.0-SNAPSHOT&#39;
</span><span class="line">
</span><span class="line">repositories {
</span><span class="line">    mavenCentral()
</span><span class="line">    // required for Storm jars
</span><span class="line">    mavenRepo url: &quot;http://clojars.org/repo&quot;
</span><span class="line">}
</span><span class="line">
</span><span class="line">dependencies {
</span><span class="line">  // Metrics library for reporting to Graphite
</span><span class="line">  compile &#39;com.codahale.metrics:metrics-core:3.0.1&#39;
</span><span class="line">  compile &#39;com.codahale.metrics:metrics-annotation:3.0.1&#39;
</span><span class="line">  compile &#39;com.codahale.metrics:metrics-graphite:3.0.1&#39;
</span><span class="line">
</span><span class="line">  // Storm
</span><span class="line">  compile &#39;storm:storm:0.9.0-rc2&#39;, {
</span><span class="line">    ext {
</span><span class="line">      // Storm puts its own jar files on the CLASSPATH of a running topology by itself,
</span><span class="line">      // and therefore does not want you to re-bundle Storm&#39;s class files with your
</span><span class="line">      // topology jar.
</span><span class="line">      fatJarExclude = true
</span><span class="line">    }
</span><span class="line">  }
</span><span class="line">
</span><span class="line">  // ...other dependencies may follow here...
</span><span class="line">}
</span><span class="line">
</span><span class="line">// ...other gradle settings may follow here...
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>You can then run the usual Gradle commands to compile, test, and package your code.  In particular, you can now run:</p>

<pre><code>$ gradle clean fatJar
</code></pre>

<p>This command creates a <em>fat jar</em> (also called an <em>uber jar</em>) of your Storm topology code, which by default is stored under
<code>build/libs/*.jar</code>.  You can use this jar file to submit your topology to Storm via the <code>storm jar</code> command.
See the section on how to
<a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/#build-a-correct-standalone--fat-jar-file-of-my-storm-code">build a correct standalone jar file of your Storm code</a>
in my Storm multi-node cluster tutorial for details.</p>

<h3 id="sending-metrics-from-a-storm-bolt-to-graphite">Sending metrics from a Storm bolt to Graphite</h3>

<p>In this section we will augment a Storm bolt (spouts work just the same) to report our <code>tuples.received</code> metric to
Graphite.</p>

<p>Each instance of our bolt will send this metric under the Graphite namespace
<code>production.apps.graphitedemo.HOSTNAME.tuples.received.*</code> every 10 seconds to the <code>carbon-aggregator</code> daemon running at
<code>your.graphite.server.com:2023/tcp</code>.</p>
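<p>To make the resulting namespace concrete, here is a plain-Java sketch (no Metrics dependency; the class name and the hostname <code>storm-node1.example.com</code> are made up for illustration) of how the full metric path is assembled from the prefix, the short hostname, and the meter name.  The <code>GraphiteReporter</code> prefixes every metric with the value passed to <code>prefixedWith()</code>, and the Metrics library appends per-meter suffixes such as <code>count</code> and <code>m1_rate</code>, hence the trailing <code>.*</code> above:</p>

```java
// Sketch of how the full Graphite path for the meter is composed.
public class MetricsPathExample {

  // Mirrors metricsPath() from the bolt below: prefix + "." + short hostname.
  static String metricsPath(String prefix, String fqhn) {
    String shortHostname = fqhn.contains(".") ? fqhn.split("\\.")[0] : fqhn;
    return prefix + "." + shortHostname;
  }

  public static void main(String[] args) {
    // "storm-node1.example.com" is a hypothetical worker hostname.
    String prefix = metricsPath("production.apps.graphitedemo", "storm-node1.example.com");
    String meterName = "tuples.received"; // what MetricRegistry.name("tuples", "received") yields
    System.out.println(prefix + "." + meterName + ".count");
    // -> production.apps.graphitedemo.storm-node1.tuples.received.count
  }
}
```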

<p>The <strong>key points</strong> of the code below are, firstly, the use of a <code>transient private</code> field for the <code>Meter</code> instance.  If
you do not make the field <code>transient</code>, Storm will throw a <code>NotSerializableException</code> at runtime.  This is because
Storm serializes bolt instances and ships them over the network to the workers that execute them.  For this
reason our bolt initializes the <code>Meter</code> instance during the <code>prepare()</code> phase of a bolt instance, which
ensures that the <code>Meter</code> is set up before the first tuples arrive at the bolt instance.  This part takes care of
properly <em>counting</em> the tuples.</p>

<div class="note">
Note: Do not try to make the field <tt>static</tt> either.  While this prevents the
<tt>NotSerializableException</tt>, it also means that all instances of the bolt running in the same JVM share
the same <tt>Meter</tt> instance (and typically you will have many instances on many JVMs on many Storm nodes),
which causes loss of metrics data.  In this case you would observe in Graphite that the <tt>tuples.received.*</tt>
metrics significantly under-count the actual number of incoming tuples.  Been there, done that. :-)
</div>
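<p>The under-counting pitfall is easy to reproduce without Storm at all.  The following stand-in (plain Java, no Storm or Metrics dependency; class names are made up) mimics what happens when every bolt instance re-initializes a shared <tt>static</tt> counter in its <code>prepare()</code> equivalent: the second initialization wipes out the first instance&#8217;s counts, while per-instance counters keep everything:</p>

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal stand-in for the static-field pitfall described above.
public class StaticCounterPitfall {

  static class StaticBolt {
    static AtomicLong counter;                    // shared across all instances
    StaticBolt() { counter = new AtomicLong(); }  // each init overwrites the shared counter!
    void execute() { counter.incrementAndGet(); }
  }

  static class InstanceBolt {
    final AtomicLong counter = new AtomicLong();  // one counter per instance
    void execute() { counter.incrementAndGet(); }
  }

  public static void main(String[] args) {
    StaticBolt s1 = new StaticBolt();
    for (int i = 0; i < 100; i++) s1.execute();
    StaticBolt s2 = new StaticBolt();             // wipes out the 100 counts above
    for (int i = 0; i < 100; i++) s2.execute();
    System.out.println("static field sees:  " + StaticBolt.counter.get()); // 100, not 200

    InstanceBolt i1 = new InstanceBolt();
    InstanceBolt i2 = new InstanceBolt();
    for (int i = 0; i < 100; i++) { i1.execute(); i2.execute(); }
    long total = i1.counter.get() + i2.counter.get();
    System.out.println("per-instance total: " + total); // 200
  }
}
```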

<p>Secondly, the <code>prepare()</code> method also creates a new, dedicated <code>GraphiteReporter</code> instance for each bolt instance.  This
achieves proper <em>reporting</em> of metric updates to Graphite.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>BoltThatAlsoReportsToGraphite.java  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
<span class="line-number">48</span>
<span class="line-number">49</span>
<span class="line-number">50</span>
<span class="line-number">51</span>
<span class="line-number">52</span>
<span class="line-number">53</span>
<span class="line-number">54</span>
<span class="line-number">55</span>
<span class="line-number">56</span>
<span class="line-number">57</span>
<span class="line-number">58</span>
<span class="line-number">59</span>
<span class="line-number">60</span>
<span class="line-number">61</span>
<span class="line-number">62</span>
<span class="line-number">63</span>
<span class="line-number">64</span>
<span class="line-number">65</span>
<span class="line-number">66</span>
<span class="line-number">67</span>
<span class="line-number">68</span>
<span class="line-number">69</span>
<span class="line-number">70</span>
<span class="line-number">71</span>
<span class="line-number">72</span>
<span class="line-number">73</span>
<span class="line-number">74</span>
<span class="line-number">75</span>
<span class="line-number">76</span>
<span class="line-number">77</span>
<span class="line-number">78</span>
<span class="line-number">79</span>
<span class="line-number">80</span>
<span class="line-number">81</span>
<span class="line-number">82</span>
<span class="line-number">83</span>
<span class="line-number">84</span>
<span class="line-number">85</span>
<span class="line-number">86</span>
<span class="line-number">87</span>
<span class="line-number">88</span>
<span class="line-number">89</span>
<span class="line-number">90</span>
<span class="line-number">91</span>
<span class="line-number">92</span>
<span class="line-number">93</span>
<span class="line-number">94</span>
<span class="line-number">95</span>
<span class="line-number">96</span>
<span class="line-number">97</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="kn">package</span> <span class="n">com</span><span class="o">.</span><span class="na">miguno</span><span class="o">.</span><span class="na">storm</span><span class="o">.</span><span class="na">graphitedemo</span><span class="o">;</span>
</span><span class="line">
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.Meter</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.MetricFilter</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.MetricRegistry</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.graphite.Graphite</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.graphite.GraphiteReporter</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">org.apache.log4j.Logger</span><span class="o">;</span>
</span><span class="line"><span class="c1">// ...other imports such as backtype.storm.*, java.net.InetSocketAddress, java.util.Map and java.util.concurrent.TimeUnit omitted for clarity...</span>
</span><span class="line">
</span><span class="line"><span class="kn">import</span> <span class="nn">java.net.InetAddress</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">java.net.UnknownHostException</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">java.util.regex.Pattern</span><span class="o">;</span>
</span><span class="line">
</span><span class="line"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">BoltThatAlsoReportsToGraphite</span> <span class="kd">extends</span> <span class="n">BaseBasicBolt</span> <span class="o">{</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">Logger</span> <span class="n">LOG</span> <span class="o">=</span> <span class="n">Logger</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">BoltThatAlsoReportsToGraphite</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">String</span> <span class="n">GRAPHITE_HOST</span> <span class="o">=</span> <span class="s">&quot;your.graphite.server.com&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="kt">int</span> <span class="n">CARBON_AGGREGATOR_LINE_RECEIVER_PORT</span> <span class="o">=</span> <span class="mi">2023</span><span class="o">;</span>
</span><span class="line">  <span class="c1">// The following value must match carbon-cache&#39;s storage-schemas.conf!</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="kt">int</span> <span class="n">GRAPHITE_REPORT_INTERVAL_IN_SECONDS</span> <span class="o">=</span> <span class="mi">10</span><span class="o">;</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">String</span> <span class="n">GRAPHITE_METRICS_NAMESPACE_PREFIX</span> <span class="o">=</span>
</span><span class="line">    <span class="s">&quot;production.apps.graphitedemo&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">Pattern</span> <span class="n">hostnamePattern</span> <span class="o">=</span>
</span><span class="line">    <span class="n">Pattern</span><span class="o">.</span><span class="na">compile</span><span class="o">(</span><span class="s">&quot;^[a-zA-Z0-9][a-zA-Z0-9-]*(\\.([a-zA-Z0-9][a-zA-Z0-9-]*))*$&quot;</span><span class="o">);</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kd">transient</span> <span class="n">Meter</span> <span class="n">tuplesReceived</span><span class="o">;</span>
</span><span class="line">
</span><span class="line">  <span class="nd">@Override</span>
</span><span class="line">  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">prepare</span><span class="o">(</span><span class="n">Map</span> <span class="n">stormConf</span><span class="o">,</span> <span class="n">TopologyContext</span> <span class="n">context</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="n">initializeMetricReporting</span><span class="o">();</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kt">void</span> <span class="nf">initializeMetricReporting</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">    <span class="kd">final</span> <span class="n">MetricRegistry</span> <span class="n">registry</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MetricRegistry</span><span class="o">();</span>
</span><span class="line">    <span class="kd">final</span> <span class="n">Graphite</span> <span class="n">graphite</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Graphite</span><span class="o">(</span><span class="k">new</span> <span class="n">InetSocketAddress</span><span class="o">(</span><span class="n">GRAPHITE_HOST</span><span class="o">,</span>
</span><span class="line">        <span class="n">CARBON_AGGREGATOR_LINE_RECEIVER_PORT</span><span class="o">));</span>
</span><span class="line">    <span class="kd">final</span> <span class="n">GraphiteReporter</span> <span class="n">reporter</span> <span class="o">=</span> <span class="n">GraphiteReporter</span><span class="o">.</span><span class="na">forRegistry</span><span class="o">(</span><span class="n">registry</span><span class="o">)</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">prefixedWith</span><span class="o">(</span><span class="n">metricsPath</span><span class="o">())</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">convertRatesTo</span><span class="o">(</span><span class="n">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">)</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">convertDurationsTo</span><span class="o">(</span><span class="n">TimeUnit</span><span class="o">.</span><span class="na">MILLISECONDS</span><span class="o">)</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="n">MetricFilter</span><span class="o">.</span><span class="na">ALL</span><span class="o">)</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">build</span><span class="o">(</span><span class="n">graphite</span><span class="o">);</span>
</span><span class="line">    <span class="n">reporter</span><span class="o">.</span><span class="na">start</span><span class="o">(</span><span class="n">GRAPHITE_REPORT_INTERVAL_IN_SECONDS</span><span class="o">,</span> <span class="n">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">);</span>
</span><span class="line">    <span class="n">tuplesReceived</span> <span class="o">=</span> <span class="n">registry</span><span class="o">.</span><span class="na">meter</span><span class="o">(</span><span class="n">MetricRegistry</span><span class="o">.</span><span class="na">name</span><span class="o">(</span><span class="s">&quot;tuples&quot;</span><span class="o">,</span> <span class="s">&quot;received&quot;</span><span class="o">));</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="n">String</span> <span class="nf">metricsPath</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">    <span class="kd">final</span> <span class="n">String</span> <span class="n">myHostname</span> <span class="o">=</span> <span class="n">extractHostnameFromFQHN</span><span class="o">(</span><span class="n">detectHostname</span><span class="o">());</span>
</span><span class="line">    <span class="k">return</span> <span class="n">GRAPHITE_METRICS_NAMESPACE_PREFIX</span> <span class="o">+</span> <span class="s">&quot;.&quot;</span> <span class="o">+</span> <span class="n">myHostname</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="nd">@Override</span>
</span><span class="line">  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">,</span> <span class="n">BasicOutputCollector</span> <span class="n">collector</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="n">tuplesReceived</span><span class="o">.</span><span class="na">mark</span><span class="o">();</span>
</span><span class="line">
</span><span class="line">    <span class="c1">// FYI: We do not need to explicitly ack() the tuple because we are extending</span>
</span><span class="line">    <span class="c1">// BaseBasicBolt, which will automatically take care of that.</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// ...other bolt code may follow here...</span>
</span><span class="line">
</span><span class="line">  <span class="c1">//</span>
</span><span class="line">  <span class="c1">// Helper methods to detect the hostname of the machine that</span>
</span><span class="line">  <span class="c1">// executes this instance of a bolt.  Normally you&#39;d want to</span>
</span><span class="line">  <span class="c1">// move this functionality into a separate class to adhere</span>
</span><span class="line">  <span class="c1">// to the single responsibility principle.</span>
</span><span class="line">  <span class="c1">//</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">detectHostname</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">    <span class="n">String</span> <span class="n">hostname</span> <span class="o">=</span> <span class="s">&quot;hostname-could-not-be-detected&quot;</span><span class="o">;</span>
</span><span class="line">    <span class="k">try</span> <span class="o">{</span>
</span><span class="line">      <span class="n">hostname</span> <span class="o">=</span> <span class="n">InetAddress</span><span class="o">.</span><span class="na">getLocalHost</span><span class="o">().</span><span class="na">getHostName</span><span class="o">();</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="k">catch</span> <span class="o">(</span><span class="n">UnknownHostException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">      <span class="n">LOG</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">&quot;Could not determine hostname&quot;</span><span class="o">,</span> <span class="n">e</span><span class="o">);</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="k">return</span> <span class="n">hostname</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">extractHostnameFromFQHN</span><span class="o">(</span><span class="n">String</span> <span class="n">fqhn</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="k">if</span> <span class="o">(</span><span class="n">hostnamePattern</span><span class="o">.</span><span class="na">matcher</span><span class="o">(</span><span class="n">fqhn</span><span class="o">).</span><span class="na">matches</span><span class="o">())</span> <span class="o">{</span>
</span><span class="line">      <span class="k">if</span> <span class="o">(</span><span class="n">fqhn</span><span class="o">.</span><span class="na">contains</span><span class="o">(</span><span class="s">&quot;.&quot;</span><span class="o">))</span> <span class="o">{</span>
</span><span class="line">        <span class="k">return</span> <span class="n">fqhn</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">&quot;\\.&quot;</span><span class="o">)[</span><span class="mi">0</span><span class="o">];</span>
</span><span class="line">      <span class="o">}</span>
</span><span class="line">      <span class="k">else</span> <span class="o">{</span>
</span><span class="line">        <span class="k">return</span> <span class="n">fqhn</span><span class="o">;</span>
</span><span class="line">      <span class="o">}</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="k">else</span> <span class="o">{</span>
</span><span class="line">      <span class="c1">// We want to return the input as-is</span>
</span><span class="line">      <span class="c1">// when it is not a valid hostname/FQHN.</span>
</span><span class="line">      <span class="k">return</span> <span class="n">fqhn</span><span class="o">;</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>That’s it!  Your Storm bolt instances will report their respective counts of received tuples to Graphite every 10
seconds.</p>

<h1 id="summary">Summary</h1>

<p>At this point you should have successfully married Storm with Graphite, and also learned a few basics about how
Graphite and Storm work along the way.  Now you can begin creating graphs and dashboards for your Storm applications,
which was the reason to do all this in the first place, right?</p>

<p>Enjoy! <em>–Michael</em></p>

<h1 id="appendix">Appendix</h1>

<h2 id="where-to-go-from-here">Where to go from here</h2>

<ul>
  <li>Want to install and configure Graphite automatically?  Take a look at my
<a href="https://github.com/miguno/puppet-graphite">puppet-graphite</a> module for <a href="http://puppetlabs.com/">Puppet</a>.  See also
my previous post on
<a href="http://www.michael-noll.com/blog/2013/06/06/installing-and-running-graphite-via-rpm-and-supervisord/">Installing and Running Graphite via RPM and Supervisord</a> for an alternative, manual installation approach.</li>
  <li>Storm exposes a plethora of built-in metrics that greatly augment the application-level metrics we described in this
article.  In 2015 we open sourced <a href="https://github.com/verisign/storm-graphite">storm-graphite</a>, which automatically
forwards these built-in metrics from Storm to Graphite.  You can enable storm-graphite globally in your Storm cluster
or selectively for only a subset of your topologies.</li>
  <li>You should start sending <em>system metrics</em> (CPU, memory and such) to Graphite, too.  This allows you to correlate the
performance of your Storm topologies with the health of the machines in the cluster.  Very helpful for detecting and
fixing bottlenecks!  There are a couple of tools that can collect these system metrics for you and forward them to
Graphite.  One of those tools is <a href="https://github.com/BrightcoveOS/Diamond">Diamond</a>.  Take a look at my
<a href="https://github.com/miguno/puppet-diamond">puppet-diamond</a> Puppet module to automatically install and configure
Diamond on your Storm cluster nodes.</li>
  <li>Want to install and configure Storm automatically?  I will soon release an automated deployment tool called
Wirbelsturm, which will allow you to deploy software such as Storm and Kafka.  Wirbelsturm is essentially a
curated collection of <a href="http://puppetlabs.com/">Puppet</a> modules (which can be used standalone, too) plus a ready-to-use
<a href="http://www.vagrantup.com/">Vagrant</a> setup to deploy machines locally or to, say, Amazon AWS.  <code>puppet-graphite</code> and
<code>puppet-diamond</code> above are part of the package, by the way.  Please stay tuned!  In the meantime my tutorial
<a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/">Running a Multi-Node Storm Cluster</a> should get you started.</li>
</ul>

<h2 id="caveat-storm-samples-metrics-for-the-storm-ui">Caveat: Storm samples metrics for the Storm UI</h2>

<p>If you do want to compare values 1:1 between the Storm UI and Graphite, please be aware that Storm samples
incoming tuples when computing its stats.  The default sampling rate is 0.05 (5%), configurable through
<code>Config.TOPOLOGY_STATS_SAMPLE_RATE</code>.</p>

<blockquote><p>The way it works is that if you choose a sampling rate of 0.05, it will pick a random element of the next 20 events in which to increase the count by 20.  So if you have 20 tasks for that bolt, your stats could be off by +-380.</p><footer><strong>Nathan Marz on storm-user</strong> <cite><a href="https://groups.google.com/d/msg/storm-user/q40AQHCV1L4/-XrOmBIAAngJ">groups.google.com/d/msg/&hellip;</a></cite></footer></blockquote>
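<p>In other words, tuples are processed in buckets of 1/rate (here: 20) tuples, and exactly one randomly chosen
tuple per bucket is counted with a weight equal to the bucket size.  A minimal sketch of such a sampled counter
(illustration code only, not Storm’s actual implementation):</p>

```java
import java.util.Random;

// Sketch of bucket-based sampled counting as described in the quote above.
// This is illustration code only, NOT Storm's actual implementation.
public class SampledCounter {
    private final int bucketSize;   // 1 / sampleRate, e.g. 20 for a rate of 0.05
    private final Random random = new Random();
    private long count = 0;
    private int remaining = 0;      // tuples left in the current bucket
    private int sampleAt = 0;       // which tuple within the bucket gets counted

    public SampledCounter(double sampleRate) {
        this.bucketSize = (int) Math.round(1.0 / sampleRate);
    }

    public void onTuple() {
        if (remaining == 0) {                       // start a new bucket
            remaining = bucketSize;
            sampleAt = random.nextInt(bucketSize);  // pick one random tuple in it
        }
        if (sampleAt == remaining - 1) {
            count += bucketSize;                    // count it with the bucket's weight
        }
        remaining--;
    }

    public long getCount() {
        return count;
    }
}
```

<p>After any multiple of 20 tuples the sampled count is exact; only within a partially processed bucket can it be
off by up to ±19 per task, which is where the ±380 for 20 tasks in the quote above comes from.</p>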

<p>To force Storm to count every tuple exactly, at the cost of a significant performance hit to your topology,
you can set the sampling rate to 100%:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_STATS_SAMPLE_RATE</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">);</span> <span class="c1">// default is 0.05</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Replephant: Analyzing Hadoop Cluster Usage with Clojure]]></title>
    <link href="http://www.michael-noll.com/blog/2013/09/17/replephant-analyzing-hadoop-cluster-usage-with-clojure/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-09-17T10:29:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2013/09/17/replephant-analyzing-hadoop-cluster-usage-with-clojure</id>
    <content type="html"><![CDATA[<p>Understanding how a Hadoop cluster is actually used in practice is paramount to managing and operating it properly.
In this article I introduce <a href="https://github.com/miguno/replephant">Replephant</a>, an open source Clojure library to
perform interactive analysis of Hadoop cluster usage via the REPL and to generate usage reports.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="note">
  <strong>
    Replephant is available at <a href="https://github.com/miguno/replephant">replephant</a> on GitHub.
  </strong>
</div>

<h1 id="replephant-in-one-minute">Replephant in one minute</h1>

<p>This section is an appetizer of what you can do with Replephant.  Do not worry if something is not immediately obvious
to you – the <a href="https://github.com/miguno/replephant">Replephant documentation</a> describes everything in full detail.</p>

<p>First, clone the Replephant repository and start the Clojure REPL.  You must have <code>lein</code> (Leiningen) already installed;
if you do not, please follow the
<a href="https://github.com/miguno/replephant#Installation">Replephant installation instructions</a>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git clone https://github.com/miguno/replephant.git
</span><span class="line"><span class="nv">$ </span><span class="nb">cd </span>replephant
</span><span class="line"><span class="nv">$ </span>lein repl
</span><span class="line">
</span><span class="line"><span class="c"># once the REPL is loaded the prompt will change to:</span>
</span><span class="line">replephant.core<span class="o">=</span>&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Then you can begin analyzing the usage of your own cluster:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">; The root directory is usually the one defined by Hadoop&#39;s</span>
</span><span class="line"><span class="c1">; mapred.job.tracker.history.completed.location and/or</span>
</span><span class="line"><span class="c1">; hadoop.job.history.location settings</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs</span> <span class="p">(</span><span class="nf">load-jobs</span> <span class="s">&quot;/local/path/to/hadoop/job-history-root-dir&quot;</span><span class="p">))</span>
</span><span class="line">
</span><span class="line"><span class="c1">; How many jobs are in the log data?</span>
</span><span class="line"><span class="p">(</span><span class="nb">count </span><span class="nv">jobs</span><span class="p">)</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="mi">12</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Show me all the users who ran one or more jobs in the cluster</span>
</span><span class="line"><span class="p">(</span><span class="nb">distinct </span><span class="p">(</span><span class="nb">map </span><span class="ss">:user.name</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">(</span><span class="s">&quot;miguno&quot;</span>, <span class="s">&quot;alice&quot;</span>, <span class="s">&quot;bob&quot;</span>, <span class="s">&quot;daniel&quot;</span>, <span class="s">&quot;carl&quot;</span>, <span class="s">&quot;jim&quot;</span><span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Consumption of computation resources: which Hadoop users</span>
</span><span class="line"><span class="c1">; account for most of the tasks launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-user</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="s">&quot;miguno&quot;</span> <span class="mi">2329</span>, <span class="s">&quot;alice&quot;</span> <span class="mi">2208</span>, <span class="s">&quot;carl&quot;</span> <span class="mi">1440</span>, <span class="s">&quot;daniel&quot;</span> <span class="mi">19</span>, <span class="s">&quot;bob&quot;</span> <span class="mi">2</span>, <span class="s">&quot;jim&quot;</span> <span class="mi">2</span><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Alright, that was a quick start!  The next sections cover Replephant in more depth.</p>

<h1 id="motivation">Motivation</h1>

<p>Understanding how a <a href="http://hadoop.apache.org/">Hadoop</a> cluster is actually used in practice is paramount to
managing and operating it properly.  This includes knowing cluster usage across the following dimensions:</p>

<ul>
  <li>Which <strong>users</strong> account for most of the resource consumption in the cluster (impacts e.g. capacity planning, budgeting
and billing in multi-tenant environments, cluster configuration settings such as scheduler pool/queue settings).</li>
  <li>Which <strong>analysis tools</strong> such as <a href="http://pig.apache.org/">Pig</a> or <a href="http://hive.apache.org/">Hive</a> are preferred by the
users (impacts e.g. cluster roadmap, training, providing custom helper libraries and UDFs).</li>
  <li>Which <strong>data sets</strong> account for most of the analyses being performed (impacts e.g. prolonging or canceling data
subscriptions, data archiving and aging, HDFS replication settings).</li>
  <li>Which <strong>MapReduce jobs</strong> consume most of the resources in the cluster and for how long (impacts e.g. how the jobs are
coded and configured, when and where they are launched; also allows your Ops team to point and shake fingers).</li>
</ul>

<p>Replephant was created to answer those important questions by inspecting production Hadoop logs (here: so-called Hadoop
job configuration and job history files) and allowing you to derive relevant statistics from the data.  Notably, it
enables you to leverage Clojure’s REPL to interactively perform such analyses.  You can even create visualizations and
plots from Replephant’s usage reports by drawing upon the data viz magics of tools such as <a href="http://www.r-project.org/">R</a>
and <a href="http://incanter.org/">Incanter</a> (see <a href="#FAQ">FAQ</a> section).</p>

<p>Apart from its original goals Replephant has also proven to be useful in cluster/job troubleshooting and debugging.
Because Replephant is <a href="https://github.com/miguno/replephant#Requirements">lightweight</a> and
<a href="https://github.com/miguno/replephant#Installation">easy to install</a>, operations teams can conveniently run Replephant
in production environments if needed.</p>

<h2 id="related-work">Related work</h2>

<p>The following projects are similar to Replephant:</p>

<ul>
  <li><a href="https://github.com/harelba/hadoop-job-analyzer">hadoop-job-analyzer</a> – analyzes Hadoop jobs, aggregates the
information according to user-specified cross-sections, and sends the output to a metrics backend (e.g. Graphite) for
visualization and analysis.  Its analysis is based on parsing Hadoop’s job log files, just like Replephant does.</li>
</ul>

<p>If you are interested in more sophisticated cluster usage analysis you may want to take a look at:</p>

<ul>
  <li><a href="http://data.linkedin.com/opensource/white-elephant">White Elephant</a> (by LinkedIn) is an open source Hadoop log
aggregator and dashboard which enables visualization of Hadoop cluster utilization across users and over time.</li>
  <li><a href="https://github.com/twitter/hraven">hRaven</a> (by Twitter) collects run time data and statistics from MapReduce jobs
running on Hadoop clusters and stores the collected job history in an easily queryable format.  A nice feature of
hRaven is that it can group together related MapReduce jobs that are spawned from a single higher-level analysis
job (e.g. a Pig job usually manifests itself as several chained MapReduce jobs).  A current drawback
of hRaven is that it only supports Cloudera CDH3 up to CDH3u4; CDH3u5, Hadoop 1.x, and Hadoop 2.x are not supported
yet.</li>
  <li>Commercial offerings such as
<a href="http://www.cloudera.com/content/cloudera/en/products/cloudera-manager.html">Cloudera Manager (Enterprise Core)</a>,
<a href="http://hortonworks.com/products/hortonworksdataplatform/">Hortonworks Management Center</a> or
<a href="http://www.mapr.com/products/mapr-editions/m5-edition">MapR M5</a> include cluster usage reporting features.</li>
</ul>

<h1 id="features">Features</h1>

<p>Replephant’s main value proposition is to read and parse Hadoop’s raw log files and turn them into ready-to-use
<a href="http://clojure.org/">Clojure</a> data structures – because, as is often the case in such data analyses, preparing
and loading the original raw data is the hardest part.</p>

<p>On top of this <a href="http://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a> functionality Replephant also includes a set
of basic usage reports such as <code>(tasks-by-user jobs)</code> and convenient filter predicates such as <code>pig?</code> (see
<a href="https://github.com/miguno/replephant#Usage">Usage</a> section on GitHub).  But even more interesting is the fact that you
can use the Clojure REPL including all of Clojure’s own powerful features to interactively drill down into the job data
yourself.</p>

<h1 id="getting-started">Getting started</h1>

<h2 id="requirements">Requirements</h2>

<ul>
  <li>Java JDK/JRE &gt;= 6</li>
  <li><a href="http://leiningen.org/">Leiningen version 2</a> – either install manually or use your favorite package manager such as
<a href="http://mxcl.github.io/homebrew/">HomeBrew</a> for Macs</li>
</ul>

<p>That’s it!</p>

<h2 id="installation">Installation</h2>

<p>Apart from meeting Replephant’s requirements (see above), you only need to clone Replephant’s git repository.</p>

<pre><code># Option 1: using HTTPS for data transfer
$ git clone https://github.com/miguno/replephant.git

# Option 2: using SSH for data transfer (requires GitHub user account)
$ git clone git@github.com:miguno/replephant.git
</code></pre>

<p><em>Note: This step requires a working Internet connection and appropriate firewall settings, which you may or may not</em>
<em>have in a production environment.</em></p>

<h1 id="data-structures-and-usage-analysis">Data structures and usage analysis</h1>

<p>When you analyze your Hadoop cluster’s usage with Replephant you will be working with two data structures:</p>

<ol>
  <li><em>Jobs</em>: The main data we are interested in for cluster usage analysis, parsed by Replephant from the raw Hadoop job
logs.</li>
  <li><em>Data sets</em>: Defined by the user, i.e. you!</li>
</ol>

<h2 id="jobs">Jobs</h2>

<p>Jobs are modelled as associative data structures that map Hadoop job parameters as well as Hadoop job history data to
their respective values.  Both the keys in the data structure – the names of job parameters and the names of data fields
in the job history data, which together we simply call <em>fields</em> – and their values are derived straight from the
Hadoop logs.</p>

<p>Replephant converts the keys of the data fields into Clojure keywords according to the following schema:</p>

<ul>
  <li>Job parameters (from job configuration files) are directly converted into keywords.  For instance,
<code>mapred.input.dir</code> becomes <code>:mapred.input.dir</code> (note the leading colon, which denotes a Clojure keyword).</li>
  <li>Job history data including job counters (from job history files) are lowercased and converted into Lisp-style keywords.
For instance, the job counter <code>HDFS_BYTES_WRITTEN</code> becomes <code>:hdfs-bytes-written</code> and a field such as
<code>JOB_PRIORITY</code> becomes <code>:job-priority</code>.</li>
</ul>

<p>Basically, everything that looks like <code>:words.with.dot.separators</code> is normally a job parameter, whereas anything else
is derived from job history data.  The values of the various fields are, where possible, converted into the appropriate
Clojure data types (e.g. a value representing an integer will be correctly turned into an <code>int</code>, and the strings “true”
and “false” are converted into their respective boolean values).</p>
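<p>The conversion rules above amount to two simple string transformations.  Here is a quick sketch in Java (a
hypothetical helper for illustration only; Replephant itself is written in Clojure):</p>

```java
// Illustrates Replephant's field-naming conventions (hypothetical helper,
// not Replephant's actual code).
public class FieldNaming {

    // Job history fields and counters are lowercased and turned into
    // Lisp-style keywords: HDFS_BYTES_WRITTEN becomes :hdfs-bytes-written
    static String historyFieldToKeyword(String field) {
        return ":" + field.toLowerCase().replace('_', '-');
    }

    // Job configuration parameters are converted into keywords as-is:
    // mapred.input.dir becomes :mapred.input.dir
    static String jobParameterToKeyword(String param) {
        return ":" + param;
    }
}
```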

<p>Here is a (shortened) example of a job data structure read from Hadoop log files:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="p">{</span>
</span><span class="line"> <span class="ss">:dfs.access.time.precision</span> <span class="mi">3600000</span>,    <span class="c1">; &lt;&lt;&lt; a job configuration data field</span>
</span><span class="line"> <span class="ss">:dfs.block.access.token.enable</span> <span class="nv">false</span>,
</span><span class="line"> <span class="c1">; *** SNIP ***</span>
</span><span class="line"> <span class="ss">:hdfs-bytes-read</span> <span class="mi">69815515804</span>,          <span class="c1">; &lt;&lt;&lt; a job history data field</span>
</span><span class="line"> <span class="ss">:hdfs-bytes-written</span> <span class="mi">848734873</span>,
</span><span class="line"> <span class="c1">; *** SNIP ***</span>
</span><span class="line"> <span class="ss">:io.sort.mb</span> <span class="mi">200</span>,
</span><span class="line"> <span class="ss">:job-priority</span> <span class="s">&quot;NORMAL&quot;</span>,
</span><span class="line"> <span class="ss">:job-queue</span> <span class="s">&quot;default&quot;</span>,
</span><span class="line"> <span class="ss">:job-status</span> <span class="s">&quot;SUCCESS&quot;</span>,
</span><span class="line"> <span class="ss">:jobid</span> <span class="s">&quot;job_201206011051_137865&quot;</span>,
</span><span class="line"> <span class="ss">:jobname</span> <span class="s">&quot;Facebook Social Graph analysis&quot;</span>,
</span><span class="line"> <span class="c1">; *** SNIP ***</span>
</span><span class="line"> <span class="ss">:user.name</span> <span class="s">&quot;miguno&quot;</span>
</span><span class="line"><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Here are some usage analysis examples:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">; Consumption of computation resources: which Hadoop users account for most of the tasks launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-user</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="s">&quot;miguno&quot;</span> <span class="mi">2329</span>, <span class="s">&quot;alice&quot;</span> <span class="mi">2208</span>, <span class="s">&quot;carl&quot;</span> <span class="mi">1440</span>, <span class="s">&quot;daniel&quot;</span> <span class="mi">19</span>, <span class="s">&quot;bob&quot;</span> <span class="mi">2</span>, <span class="s">&quot;jim&quot;</span> <span class="mi">2</span><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Consumption of computation resources: which Hadoop users account for most of the jobs launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">jobs-by-user</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="s">&quot;daniel&quot;</span> <span class="mi">3</span>, <span class="s">&quot;alice&quot;</span> <span class="mi">3</span>, <span class="s">&quot;carl&quot;</span> <span class="mi">2</span>, <span class="s">&quot;miguno&quot;</span> <span class="mi">2</span>, <span class="s">&quot;bob&quot;</span> <span class="mi">1</span>, <span class="s">&quot;jim&quot;</span> <span class="mi">1</span><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Consumption of computation resources: which MapReduce tools account for most of the tasks launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-tool</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="ss">:hive</span> <span class="mi">2329</span>, <span class="ss">:other</span> <span class="mi">1440</span>, <span class="ss">:streaming</span> <span class="mi">1778</span>, <span class="ss">:mahout</span> <span class="mi">432</span>, <span class="ss">:pig</span> <span class="mi">21</span><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Consumption of computation resources: which MapReduce tools account for most of the jobs launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">jobs-by-tool</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="ss">:pig</span> <span class="mi">4</span>, <span class="ss">:other</span> <span class="mi">2</span>, <span class="ss">:mahout</span> <span class="mi">2</span>, <span class="ss">:streaming</span> <span class="mi">2</span>, <span class="ss">:hive</span> <span class="mi">2</span><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Find jobs that violate data locality -- those are candidates for optimization and tuning.</span>
</span><span class="line"><span class="c1">;</span>
</span><span class="line"><span class="c1">; The example below is pretty basic.  It retrieves all jobs that have 1+ rack-local tasks,</span>
</span><span class="line"><span class="c1">; i.e. tasks where data needs to be transferred over the network (but at least they are from</span>
</span><span class="line"><span class="c1">; the same rack).</span>
</span><span class="line"><span class="c1">; A slightly improved version would also include jobs where data was retrieved from OTHER racks</span>
</span><span class="line"><span class="c1">; during map tasks, which in pseudo-code is (- all-maps rack-local-maps data-local-maps).</span>
</span><span class="line"><span class="c1">;</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">optimization-candidates</span> <span class="p">(</span><span class="nb">filter </span><span class="o">#</span><span class="p">(</span><span class="nb">&gt; </span><span class="p">(</span><span class="ss">:rack-local-maps</span> <span class="nv">%</span> <span class="mi">0</span><span class="p">)</span> <span class="mi">0</span><span class="p">)</span> <span class="nv">jobs</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The following examples demonstrate the predicates built into Replephant:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">; Restrict your analysis to a specific subset of all jobs according to one or more predicates</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">hive-jobs</span> <span class="p">(</span><span class="nb">filter </span><span class="nv">hive?</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs-with-compressed-output</span> <span class="p">(</span><span class="nb">filter </span><span class="nv">compressed-output?</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">failed-jobs</span> <span class="p">(</span><span class="nb">filter </span><span class="nv">failed?</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="c1">; Detect write-only jobs and jobs for which Replephant cannot yet extract input data information.</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs-with-missing-input</span> <span class="p">(</span><span class="nb">filter </span><span class="nv">missing-input-data?</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="c1">; Helpful to complete your data set definitions</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs-with-unknown-input</span> <span class="p">(</span><span class="nb">filter </span><span class="p">(</span><span class="nb">partial </span><span class="nv">unknown-input-data?</span> <span class="nv">data-sets</span><span class="p">)</span> <span class="nv">jobs</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In addition to the data derived from Hadoop log files Replephant also adds some
<a href="http://clojure.org/metadata">Clojure metadata</a> to each job data structure.  At the moment only a <code>:job-id</code> field is
available.  This helps to identify problematic job log files (e.g. those Replephant fails to parse) because Replephant
can at least tell you the job id, which you can then use to find the respective raw log files on disk.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">job</span> <span class="nv">...</span><span class="p">)</span> <span class="c1">;</span>
</span><span class="line"><span class="p">(</span><span class="nb">meta </span><span class="nv">job</span><span class="p">)</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="ss">:job-id</span> <span class="s">&quot;job_201206011051_137865&quot;</span><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Note that even though this metadata follows the same naming conventions as the actual job data it is still metadata and
as such you must access it via <code>(meta ...)</code>.  Accessing the job data structure directly – without <code>meta</code> – only
provides you with the log-derived data.</p>

<h2 id="data-sets">Data sets</h2>

<p><em>You only need to define data sets if you use any of Replephant’s data set related functions such as</em>
<em><code>tasks-by-data-sets</code>.  Otherwise you can safely omit this step.</em></p>

<p>Data sets are used to describe the, well, data sets that are stored in a Hadoop cluster.  They allow you to define,
for example, that the Twitter Firehose data is stored in <em>this</em> particular location in the cluster.  Replephant can then
leverage this information to perform usage analysis related to these data sets; for instance, to answer questions such
as “How many Hadoop jobs were launched against the Twitter Firehose data in our cluster?”.</p>

<p>Thanks to Clojure’s <a href="http://en.wikipedia.org/wiki/Homoiconicity">homoiconicity</a> it is very straightforward to define
data sets so that Replephant can understand which jobs read which data in your Hadoop cluster.  You only need to create
an associative data structure that maps the name of a data set to a regex pattern that is matched against a job’s
input directories (more correctly, input URIs) as configured via <code>mapred.input.dir</code> and <code>mapred.input.dir.mappers</code>.
You then pass this data structure to the appropriate Replephant function.</p>

<p><strong>Important note:</strong> In order to simplify data set definitions Replephant will automatically extract the path component
of input URIs, i.e. it will remove scheme and authority information from <code>mapred.input.dir</code> and
<code>mapred.input.dir.mappers</code> values.  This means you should write regexes that match against strings such as
<code>/path/to/foo/</code> instead of <code>hdfs:///path/to/foo/</code> or <code>hdfs://namenode.your.datacenter/path/to/foo/</code>.</p>
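<p>This normalization behaves like taking the path component of a <code>java.net.URI</code>.  The following sketch
(illustration only; Replephant’s actual implementation may differ) shows what your regexes will be matched against:</p>

```java
import java.net.URI;

// Mirrors how scheme and authority are stripped from input URIs so that
// data set regexes only ever see the path component (illustration only).
public class PathComponent {
    static String pathComponent(String inputDir) {
        return URI.create(inputDir).getPath();
    }
}
```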

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">data-sets</span>
</span><span class="line">  <span class="p">{</span>
</span><span class="line">   <span class="c1">; Will match e.g. &quot;hdfs://namenode/twitter/firehose/*&quot;, &quot;/twitter/firehose&quot;</span>
</span><span class="line">   <span class="c1">; and &quot;/twitter/firehose/*&quot;; see note above</span>
</span><span class="line">   <span class="s">&quot;Twitter Firehose data&quot;</span> <span class="o">#</span><span class="s">&quot;^/twitter/firehose/?&quot;</span>
</span><span class="line">   <span class="p">})</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Here is another example:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">; Consumption of computation resources: which data sets account for most of the tasks launched?</span>
</span><span class="line"><span class="c1">; (data sets are defined in a simple associative data structure; see section &quot;Data sets&quot; below)</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">data-sets</span> <span class="p">{</span><span class="s">&quot;Twitter Firehose data&quot;</span> <span class="o">#</span><span class="s">&quot;^/twitter/firehose/?&quot;</span>, <span class="s">&quot;Facebook Social Graph&quot;</span> <span class="o">#</span><span class="s">&quot;^/facebook/social-graph/?&quot;</span><span class="p">})</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-data-set</span> <span class="nv">jobs</span> <span class="nv">data-sets</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="s">&quot;Facebook Social Graph data&quot;</span> <span class="mi">2329</span>, <span class="s">&quot;UNKNOWN DATA SET&quot;</span> <span class="mi">1872</span>, <span class="s">&quot;Twitter Firehose data&quot;</span> <span class="mi">1799</span><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Replephant uses native <a href="http://clojure.org/other_functions">Clojure regex patterns</a>, which means you have the full
power of <a href="http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">java.util.regex.Pattern</a> at your
disposal.</p>

<p><em>How Replephant matches job input with data set definitions:</em>
Replephant will consider a MapReduce job to be reading a given data set if ANY of the job’s input URIs match the
respective regex of the data set.  In Hadoop the values of <code>mapred.input.dir</code> and <code>mapred.input.dir.mappers</code> may
be a single URI or a comma-separated list of URIs; in the latter case Replephant will automatically explode the
comma-separated string into a Clojure collection of individual URIs so that you don’t have to write complicated regexes
to handle multiple input URIs in your own code (the regex is matched against the individual URIs, one at a time).</p>
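<p>This matching rule can be sketched in a few lines of Python (an illustration of the logic only, with a hypothetical helper name; it is not Replephant’s implementation):</p>

```python
import re

def matching_data_sets(input_dir, data_sets):
    # Explode a comma-separated mapred.input.dir value into individual URIs,
    # then report each data set whose regex matches ANY of them.
    uris = input_dir.split(",")
    return [name for name, pattern in data_sets.items()
            if any(re.search(pattern, uri) for uri in uris)]

data_sets = {"Twitter Firehose data": r"^/twitter/firehose/?"}
# A job reading two input directories, one of which holds Firehose data:
print(matching_data_sets("/logs/raw,/twitter/firehose/2013/05", data_sets))
# -> ['Twitter Firehose data']
```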

<p><em>Analyzing multiple cluster environments:</em>
If you are running, say, a production cluster and a test cluster that host different data sets (or at different
locations), it is convenient to create separate data set definitions such as <code>(def production-data-sets { ... })</code> and
<code>(def test-data-sets { ... })</code>.</p>

<p>See <a href="https://github.com/miguno/replephant/blob/master/src/replephant/data_sets.clj">data_sets.clj</a> for further
information and for an example definition of multiple data sets.</p>

<h2 id="visualization">Visualization</h2>

<p>Replephant itself does not implement any native visualization features.  However, you can leverage existing data
visualization tools such as <a href="http://www.r-project.org/">R</a> or <a href="https://github.com/liebke/incanter">Incanter</a> (the latter
is basically a clone of R written in Clojure).</p>

<p>For your convenience, Incanter has been added as a dependency of Replephant, which is a fancy way of saying that you can
use Incanter from Replephant’s REPL right out of the box.  Here is an example Incanter visualization of cluster usage
reported by <code>tasks-by-user</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">;; Create a bar chart using Incanter</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs</span> <span class="p">(</span><span class="nf">load-jobs</span> <span class="nv">...</span><span class="p">))</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">u-&gt;t</span> <span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-user</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="p">(</span><span class="nf">use</span> <span class="o">&#39;</span><span class="p">(</span><span class="nf">incanter</span> <span class="nv">core</span> <span class="nv">charts</span><span class="p">))</span>
</span><span class="line"><span class="p">(</span><span class="nf">view</span> <span class="p">(</span><span class="nf">bar-chart</span>
</span><span class="line">       <span class="p">(</span><span class="nb">keys </span><span class="nv">u-&gt;t</span><span class="p">)</span>
</span><span class="line">       <span class="p">(</span><span class="nb">vals </span><span class="nv">u-&gt;t</span><span class="p">)</span>
</span><span class="line">       <span class="ss">:title</span> <span class="s">&quot;Computation resources consumed by user&quot;</span>
</span><span class="line">       <span class="ss">:x-label</span> <span class="s">&quot;Users&quot;</span>
</span><span class="line">       <span class="ss">:y-label</span> <span class="s">&quot;Tasks launched&quot;</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><em>Note: This specific example requires a window system such as X11.  In other words it will not work in a text terminal.</em></p>

<p>This produces the following chart:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/replephant-incanter-tasks-by-user.png" title="Visualizing cluster usage reports in Replephant with Incanter" /></p>

<div class="caption">
Figure 1: Visualizing cluster usage reports in Replephant with Incanter
</div>

<h1 id="how-it-works">How it works</h1>

<p>In a nutshell Replephant reads the data in Hadoop job configuration files and job history files into a “job” data
structure, which can then be used for subsequent cluster usage analyses.</p>

<p>Background: Hadoop creates a pair of files for each MapReduce job that is executed in a cluster:</p>

<ul>
  <li>A <strong>job configuration file</strong>, which contains job-related data created at the time when the job was submitted to the
cluster.  For instance, the location of the job’s input data is specified in this file via the parameter
<code>mapred.input.dir</code>.
    <ul>
      <li>Format: XML</li>
      <li>Example filename: <code>job_201206222102_0003_conf.xml</code> for a job with ID <code>job_201206222102_0003</code></li>
    </ul>
  </li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="xml"><span class="line"><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;no&quot;?&gt;</span><span class="nt">&lt;configuration&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>io.bytes.per.checksum<span class="nt">&lt;/name&gt;&lt;value&gt;</span>512<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>mapred.input.dir<span class="nt">&lt;/name&gt;&lt;value&gt;</span>hdfs://namenode/facebook/social-graph/2012/06/22/<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>mapred.job.name<span class="nt">&lt;/name&gt;&lt;value&gt;</span>Facebook Social Graph analysis<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>mapred.task.profile.reduces<span class="nt">&lt;/name&gt;&lt;value&gt;</span>0-2<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>mapred.reduce.tasks.speculative.execution<span class="nt">&lt;/name&gt;&lt;value&gt;</span>true<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line">...
</span><span class="line"><span class="nt">&lt;/configuration&gt;</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<ul>
  <li>An accompanying <strong>job history file</strong>, which captures run-time information on how the job was actually executed in the
cluster.  For instance, Hadoop stores a job’s run-time counters such as <code>HDFS_BYTES_WRITTEN</code> (a built-in counter of
Hadoop which, as a side note, is also shown in the JobTracker web UI when looking at running or completed jobs) as
well as application-level custom counters (provided by user code).
    <ul>
      <li>Format: Custom plain-text encoded format for Hadoop 1.x and 0.20.x, described in the
<a href="http://hadoop.apache.org/docs/r1.1.2/api/org/apache/hadoop/mapred/JobHistory.html">JobHistory</a> class</li>
      <li>Example filename: <code>job_201206222102_0003_1340394471252_miguno_Job2045189006031602801</code></li>
    </ul>
  </li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="xml"><span class="line">Meta VERSION=&quot;1&quot; .
</span><span class="line">Job JOBID=&quot;job_201206011051_137865&quot; JOBNAME=&quot;Facebook Social Graph analysis&quot; USER=&quot;miguno&quot; SUBMIT_TIME=&quot;1367518567144&quot; JOBCONF=&quot;hdfs://namenode/app/hadoop/staging/miguno/\.staging/job_201206011051_137865/job\.xml&quot; VIEW_JOB=&quot; &quot; MODIFY_JOB=&quot; &quot; JOB_QUEUE=&quot;default&quot; .
</span><span class="line">Job JOBID=&quot;job_201206011051_137865&quot; JOB_PRIORITY=&quot;NORMAL&quot; .
</span><span class="line">Job JOBID=&quot;job_201206011051_137865&quot; LAUNCH_TIME=&quot;1367518571729&quot; TOTAL_MAPS=&quot;2316&quot; TOTAL_REDUCES=&quot;12&quot; JOB_STATUS=&quot;PREP&quot; .
</span><span class="line">Task TASKID=&quot;task_201206011051_137865_r_000013&quot; TASK_TYPE=&quot;SETUP&quot; START_TIME=&quot;1367518572156&quot; SPLITS=&quot;&quot; .
</span><span class="line">ReduceAttempt TASK_TYPE=&quot;SETUP&quot; TASKID=&quot;task_201206011051_137865_r_000013&quot; TASK_ATTEMPT_ID=&quot;attempt_201206011051_137865_r_000013_0&quot; START_TIME=&quot;1367518575026&quot; TRACKER_NAME=&quot;slave406:localhost/127\.0\.0\.1:56910&quot; HTTP_PORT=&quot;50060&quot; .
</span><span class="line">...
</span></code></pre></td></tr></table></div></figure></notextile></div>
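<p>The line-oriented structure of this format — an entity type followed by <code>KEY="value"</code> pairs — can be picked apart with a short Python sketch.  (Replephant delegates the real parsing to Hadoop’s <code>DefaultJobHistoryParser</code>; the simplified regex below ignores the format’s backslash escaping.)</p>

```python
import re

def parse_history_line(line):
    # Split one history line into its entity type (Job, Task, ReduceAttempt, ...)
    # and a dict of its KEY="value" pairs.  Simplified: real values may contain
    # backslash-escaped quotes, which this regex does not handle.
    entity, _, rest = line.partition(" ")
    return entity, dict(re.findall(r'(\w+)="([^"]*)"', rest))

entity, fields = parse_history_line(
    'Job JOBID="job_201206011051_137865" JOB_PRIORITY="NORMAL" .')
print(entity, fields)
# -> Job {'JOBID': 'job_201206011051_137865', 'JOB_PRIORITY': 'NORMAL'}
```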

<p>Depending on your Hadoop version and cluster configuration, Hadoop will store those files in directory trees rooted at
<code>mapred.job.tracker.history.completed.location</code> and/or <code>hadoop.job.history.location</code>.</p>

<p>Replephant uses standard XML parsing to read the job configuration files, and relies on the Hadoop 1.x Java API to parse
the job history files via <code>DefaultJobHistoryParser</code>. <strong>At the moment Replephant retrieves only the history data
related to job start, job finish, or job failure (e.g. task attempt data is not retrieved).</strong>
For each job Replephant creates a single associative data structure that contains both the job configuration as well as
the job history data in a Clojure-friendly format.  This job data structure forms the basis for all subsequent cluster
usage analyses as we have seen in the previous section.</p>
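<p>The configuration half of that job data structure is conceptually just the XML file flattened into a map of parameter names to values.  Here is a minimal Python sketch of that step (Replephant builds an analogous Clojure map):</p>

```python
import xml.etree.ElementTree as ET

def parse_job_conf(xml_string):
    # Flatten a Hadoop job configuration file into a map of
    # parameter names to values.
    root = ET.fromstring(xml_string)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.findall("property")}

conf_xml = """<?xml version="1.0" encoding="UTF-8" standalone="no"?><configuration>
<property><name>mapred.input.dir</name><value>hdfs://namenode/facebook/social-graph/2012/06/22/</value></property>
<property><name>mapred.job.name</name><value>Facebook Social Graph analysis</value></property>
</configuration>"""

print(parse_job_conf(conf_xml)["mapred.job.name"])
# -> Facebook Social Graph analysis
```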

<h1 id="summary">Summary</h1>

<p>Replephant is a work in progress but already a pretty valuable addition to our Hadoop toolset.  If you want to give it
a try, head over to the <a href="https://github.com/miguno/replephant">Replephant project homepage</a> and play with it!</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Using Avro in MapReduce jobs with Hadoop, Pig, Hive]]></title>
    <link href="http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-07-04T08:29:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive</id>
    <content type="html"><![CDATA[<p><a href="http://avro.apache.org/">Apache Avro</a> is a very popular data serialization format in the Hadoop technology stack.
In this article I show code examples of MapReduce jobs in Java, Hadoop Streaming, Pig and Hive that read and/or write
data in Avro format.  We will use a small, Twitter-like data set as input for our example MapReduce jobs.</p>

<!-- more -->

<div class="note">
  <strong>
    The latest version of this article and the corresponding code examples are available at
    <a href="https://github.com/miguno/avro-hadoop-starter">avro-hadoop-starter</a> on GitHub.
  </strong>
</div>

<h1 id="requirements">Requirements</h1>

<p>The examples require the following software versions:</p>

<ul>
  <li><a href="http://www.gradle.org/">Gradle</a> 1.3+ (only for the Java examples)</li>
  <li>Java JDK 7 (only for the Java examples)
    <ul>
      <li>It is easy to switch to JDK 6.  Mostly you will need to change the <code>sourceCompatibility</code> and
<code>targetCompatibility</code> parameters in
<a href="https://github.com/miguno/avro-hadoop-starter/blob/master/build.gradle">build.gradle</a> from <code>1.7</code> to <code>1.6</code>.
But since there are a couple of JDK 7 related gotchas (e.g. problems with its new bytecode verifier) that the Java
example code solves, I decided to stick with JDK 7 as the default.</li>
    </ul>
  </li>
  <li><a href="http://hadoop.apache.org/">Hadoop</a> 2.x with MRv1 (not MRv2/YARN)
    <ul>
      <li>Tested with <a href="http://www.cloudera.com/content/cloudera/en/products/cdh.html">Cloudera CDH 4.3</a></li>
    </ul>
  </li>
  <li><a href="http://pig.apache.org/">Pig</a> 0.11
    <ul>
      <li>Tested with Pig 0.11.0-cdh4.3.0</li>
    </ul>
  </li>
  <li><a href="http://hive.apache.org/">Hive</a> 0.10
    <ul>
      <li>Tested with Hive 0.10.0-cdh4.3.0</li>
    </ul>
  </li>
  <li><a href="http://avro.apache.org/">Avro</a> 1.7.4</li>
</ul>

<h1 id="prerequisites">Prerequisites</h1>

<p>First you must clone my <a href="https://github.com/miguno/avro-hadoop-starter">avro-hadoop-starter</a> repository on GitHub.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git clone git@github.com:miguno/avro-hadoop-starter.git
</span><span class="line"><span class="nv">$ </span><span class="nb">cd </span>avro-hadoop-starter
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="example-data">Example data</h1>

<p>We are using a small, Twitter-like data set as input for our example MapReduce jobs.</p>

<h2 id="avro-schema">Avro schema</h2>

<p><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/main/resources/avro/twitter.avsc">twitter.avsc</a> defines
a basic schema for storing tweets:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
</pre></td><td class="code"><pre><code class="json"><span class="line"><span class="p">{</span>
</span><span class="line">  <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;record&quot;</span><span class="p">,</span>
</span><span class="line">  <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;Tweet&quot;</span><span class="p">,</span>
</span><span class="line">  <span class="nt">&quot;namespace&quot;</span> <span class="p">:</span> <span class="s2">&quot;com.miguno.avro&quot;</span><span class="p">,</span>
</span><span class="line">  <span class="nt">&quot;fields&quot;</span> <span class="p">:</span> <span class="p">[</span> <span class="p">{</span>
</span><span class="line">    <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;username&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;string&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;doc&quot;</span>  <span class="p">:</span> <span class="s2">&quot;Name of the user account on Twitter.com&quot;</span>
</span><span class="line">  <span class="p">},</span> <span class="p">{</span>
</span><span class="line">    <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;tweet&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;string&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;doc&quot;</span>  <span class="p">:</span> <span class="s2">&quot;The content of the user&#39;s Twitter message&quot;</span>
</span><span class="line">  <span class="p">},</span> <span class="p">{</span>
</span><span class="line">    <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;timestamp&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;long&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;doc&quot;</span>  <span class="p">:</span> <span class="s2">&quot;Unix epoch time in seconds&quot;</span>
</span><span class="line">  <span class="p">}</span> <span class="p">],</span>
</span><span class="line">  <span class="nt">&quot;doc:&quot;</span> <span class="p">:</span> <span class="s2">&quot;A basic schema for storing Twitter messages&quot;</span>
</span><span class="line"><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>If you want to generate Java classes from this Avro schema follow the instructions described in section
<em>Java &gt; Usage</em>.  Alternatively, you can use the Avro Compiler directly.</p>

<h2 id="avro-data-files">Avro data files</h2>

<p>The actual data is stored in the following files:</p>

<ul>
  <li><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/resources/avro/twitter.avro">twitter.avro</a>
– encoded (serialized) version of the example data in binary Avro format, compressed with Snappy</li>
  <li><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/resources/avro/twitter.json">twitter.json</a>
– JSON representation of the same example data</li>
</ul>

<p>You can convert back and forth between the two encodings (Avro vs. JSON) using Avro Tools.  See
<a href="http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/">Reading and Writing Avro Files From the Command Line</a>
for instructions on how to do that.</p>

<p>Here is a snippet of the example data:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="json"><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;miguno&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;Rock: Nerf paper, scissors is fine.&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366150681</span> <span class="p">}</span>
</span><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;BlizzardCS&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;Works as intended.  Terran is IMBA.&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366154481</span> <span class="p">}</span>
</span><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;DarkTemplar&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;From the shadows I come!&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366154681</span> <span class="p">}</span>
</span><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;VoidRay&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;Prismatic core online!&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366160000</span> <span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="preparing-the-input-data">Preparing the input data</h2>

<p>The example input data we are using is
<a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/resources/avro/twitter.avro">twitter.avro</a>.
Upload <code>twitter.avro</code> to HDFS to make the input data available to our MapReduce jobs.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Upload the input data</span>
</span><span class="line"><span class="nv">$ </span>hadoop fs -mkdir examples/input
</span><span class="line"><span class="nv">$ </span>hadoop fs -copyFromLocal src/test/resources/avro/twitter.avro examples/input
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>We will also upload the Avro schema
<a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/main/resources/avro/twitter.avsc">twitter.avsc</a>
to HDFS because we will use a schema available at an HDFS location in one of the Hive examples.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Upload the Avro schema</span>
</span><span class="line"><span class="nv">$ </span>hadoop fs -mkdir examples/schema
</span><span class="line"><span class="nv">$ </span>hadoop fs -copyFromLocal src/main/resources/avro/twitter.avsc examples/schema
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="java">Java</h1>

<h2 id="usage">Usage</h2>

<p>To prepare your Java IDE:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># IntelliJ IDEA</span>
</span><span class="line"><span class="nv">$ </span>gradle cleanIdea idea   <span class="c"># then File &gt; Open... &gt; avro-hadoop-starter.ipr</span>
</span><span class="line">
</span><span class="line"><span class="c"># Eclipse</span>
</span><span class="line"><span class="nv">$ </span>gradle cleanEclipse eclipse
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>To build the Java code and to compile the Avro-based Java classes from the schemas (<code>*.avsc</code>) in
<code>src/main/resources/avro/</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>gradle clean build
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The generated Avro-based Java classes are written under the directory tree <code>generated-sources/</code>.  The Avro
compiler will generate a Java class <code>Tweet</code> from the <code>twitter.avsc</code> schema.</p>

<p>To run the unit tests (notably <code>TweetCountTest</code>, see section <em>Examples</em> below):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>gradle <span class="nb">test</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Note: <code>gradle test</code> executes any JUnit unit tests.  If you add any TestNG unit tests, you need to run <code>gradle testng</code>
to execute those.</p>

<h2 id="examples">Examples</h2>

<h3 id="tweetcount">TweetCount</h3>

<p><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/main/java/com/miguno/avro/hadoop/TweetCount.java">TweetCount</a>
implements a MapReduce job that counts the number of tweets created by Twitter users.</p>

<pre><code>TweetCount: Usage: TweetCount &lt;input path&gt; &lt;output path&gt;
</code></pre>

<h3 id="tweetcounttest">TweetCountTest</h3>

<p><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/java/com/miguno/avro/hadoop/TweetCountTest.java">TweetCountTest</a>
is very similar to <code>TweetCount</code>.  It uses
<a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/resources/avro/twitter.avro">twitter.avro</a> as its
input and runs a unit test on it with the same MapReduce job as <code>TweetCount</code>.  The unit test includes comparing the
actual MapReduce output (in Snappy-compressed Avro format) with expected output.  <code>TweetCountTest</code> extends
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/ClusterMapReduceTestCase.java">ClusterMapReduceTestCase</a>
(MRv1), which means that the corresponding MapReduce job is launched in-memory via
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRCluster.java">MiniMRCluster</a>.</p>

<h2 id="minimrcluster-and-hadoop-mrv2">MiniMRCluster and Hadoop MRv2</h2>

<p>The MiniMRCluster that is used by <code>ClusterMapReduceTestCase</code> in MRv1 is deprecated in Hadoop MRv2.  When using MRv2
you should switch to
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRClientClusterFactory.java">MiniMRClientClusterFactory</a>,
which provides a wrapper interface called
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRClientCluster.java">MiniMRClientCluster</a>
around the
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/MiniMRYarnCluster.java">MiniMRYarnCluster</a> (MRv2):</p>

<blockquote>
  <p>MiniMRClientClusterFactory:
A MiniMRCluster factory. In MR2, it provides a wrapper MiniMRClientCluster interface around the MiniMRYarnCluster.
While in MR1, it provides such wrapper around MiniMRCluster. This factory should be used in tests to provide an easy
migration of tests across MR1 and MR2.</p>
</blockquote>

<p>See <a href="http://blog.cloudera.com/blog/2012/07/experimenting-with-mapreduce-2-0/">Experimenting with MapReduce 2.0</a> for more
information.</p>

<h2 id="further-readings-on-java">Further readings on Java</h2>

<ul>
  <li><a href="http://avro.apache.org/docs/1.7.4/api/java/index.html?org/apache/avro/mapred/package-summary.html">Package Documentation for org.apache.avro.mapred</a>
– Run Hadoop MapReduce jobs over Avro data, with map and reduce functions written in Java.  This document provides
detailed information on how you should use the Avro Java API to implement MapReduce jobs that read and/or write data
in Avro format.</li>
  <li><a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_26_5.html">Java MapReduce and Avro</a>
– Cloudera CDH4 documentation</li>
</ul>

<h1 id="hadoop-streaming">Hadoop Streaming</h1>

<h2 id="preliminaries">Preliminaries</h2>

<p>Important: The examples below assume you have access to a running Hadoop cluster.</p>

<h2 id="how-streaming-sees-data-when-reading-via-avroastextinputformat">How Streaming sees data when reading via AvroAsTextInputFormat</h2>

<p>When using <a href="http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroAsTextInputFormat.html">AvroAsTextInputFormat</a>
as the input format, your streaming code will receive the data as JSON, one record (“datum” in Avro parlance) per
line.  Note that Avro will also add a trailing TAB (<code>\t</code>) at the end of each line.</p>

<pre><code>&lt;JSON representation of Avro record #1&gt;\t
&lt;JSON representation of Avro record #2&gt;\t
&lt;JSON representation of Avro record #3&gt;\t
...
</code></pre>

<p>Here is the basic data flow from your input data in binary Avro format to your streaming mapper:</p>

<pre><code>input.avro (binary)  ---AvroAsTextInputFormat---&gt; deserialized data (JSON) ---&gt; Mapper
</code></pre>
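<p>To make this concrete, here is a minimal Python streaming mapper sketch that consumes such input.  It assumes the three-field tweet records used throughout this article (<code>username</code>, <code>tweet</code>, <code>timestamp</code>) and emits <code>username&lt;TAB&gt;1</code> pairs in the classic word-count style; note that the trailing TAB added by Avro must be stripped before parsing:</p>

```python
import json
import sys


def parse_record(line):
    """Parse one line as delivered by AvroAsTextInputFormat: a JSON datum
    followed by a trailing TAB, which we strip before parsing."""
    return json.loads(line.rstrip("\t\n"))


def map_usernames(lines):
    """A toy map step: emit 'username<TAB>1' for each tweet record,
    mirroring the classic word-count pattern."""
    for line in lines:
        record = parse_record(line)
        yield "%s\t1" % record["username"]


if __name__ == "__main__":
    for output in map_usernames(sys.stdin):
        print(output)
```

<p>You would pass such a script to Hadoop Streaming via <code>-mapper</code> together with <code>-files</code>; the record fields shown are assumptions based on the example data in this article, not a fixed contract of the input format.</p>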

<h2 id="examples-1">Examples</h2>

<h3 id="prerequisites-1">Prerequisites</h3>

<p>The example commands below use the Hadoop Streaming jar <em>for MRv1</em> shipped with Cloudera CDH4:</p>

<ul>
  <li><a href="https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-streaming/2.0.0-mr1-cdh4.3.0/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar">hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar</a>
(as of July 2013)</li>
</ul>

<p>If you are not using Cloudera CDH4, or are using a newer version of CDH4, just replace the jar file with the one
included in your Hadoop installation.</p>

<p>The Avro jar files are straight from the <a href="https://avro.apache.org/releases.html">Avro project</a>:</p>

<ul>
  <li><a href="http://www.eu.apache.org/dist/avro/avro-1.7.4/java/avro-1.7.4.jar">avro-1.7.4.jar</a></li>
  <li><a href="http://www.eu.apache.org/dist/avro/avro-1.7.4/java/avro-mapred-1.7.4-hadoop1.jar">avro-mapred-1.7.4-hadoop1.jar</a></li>
  <li><a href="http://www.eu.apache.org/dist/avro/avro-1.7.4/java/avro-tools-1.7.4.jar">avro-tools-1.7.4.jar</a></li>
</ul>

<h3 id="reading-avro-writing-plain-text">Reading Avro, writing plain-text</h3>

<p>The following command reads Avro data from the relative HDFS directory <code>examples/input/</code> (which normally resolves
to <code>/user/&lt;your-unix-username&gt;/examples/input/</code>).  It writes the
deserialized version of each data record (see section <em>How Streaming sees data when reading via AvroAsTextInputFormat</em>
above) as is to the output HDFS directory <code>streaming/output/</code>.  For this simple demonstration we are using
the <code>IdentityMapper</code> as a naive map step implementation – it outputs its input data unmodified (equivalently,
we could use the Unix tool <code>cat</code> here).  We do not need a reduce phase here, which is why we disable the reduce
step via the option <code>-D mapred.reduce.tasks=0</code> (see
<a href="http://hadoop.apache.org/docs/r1.1.2/streaming.html#Specifying+Map-Only+Jobs">Specifying Map-Only Jobs</a> in the
Hadoop Streaming documentation).</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Run the streaming job</span>
</span><span class="line"><span class="nv">$ </span>hadoop jar hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar <span class="se">\</span>
</span><span class="line">    -D mapred.job.name<span class="o">=</span><span class="s2">&quot;avro-streaming&quot;</span> <span class="se">\</span>
</span><span class="line">    -D mapred.reduce.tasks<span class="o">=</span>0 <span class="se">\</span>
</span><span class="line">    -files avro-1.7.4.jar,avro-mapred-1.7.4-hadoop1.jar <span class="se">\</span>
</span><span class="line">    -libjars avro-1.7.4.jar,avro-mapred-1.7.4-hadoop1.jar <span class="se">\</span>
</span><span class="line">    -input  examples/input/ <span class="se">\</span>
</span><span class="line">    -output streaming/output/ <span class="se">\</span>
</span><span class="line">    -mapper org.apache.hadoop.mapred.lib.IdentityMapper <span class="se">\</span>
</span><span class="line">    -inputformat org.apache.avro.mapred.AvroAsTextInputFormat
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Once the job completes you can inspect the output data as follows:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop fs -cat streaming/output/part-00000 | head -4
</span><span class="line"><span class="o">{</span><span class="s2">&quot;username&quot;</span>: <span class="s2">&quot;miguno&quot;</span>, <span class="s2">&quot;tweet&quot;</span>: <span class="s2">&quot;Rock: Nerf paper, scissors is fine.&quot;</span>, <span class="s2">&quot;timestamp&quot;</span>: 1366150681<span class="o">}</span>
</span><span class="line"><span class="o">{</span><span class="s2">&quot;username&quot;</span>: <span class="s2">&quot;BlizzardCS&quot;</span>, <span class="s2">&quot;tweet&quot;</span>: <span class="s2">&quot;Works as intended.  Terran is IMBA.&quot;</span>, <span class="s2">&quot;timestamp&quot;</span>: 1366154481<span class="o">}</span>
</span><span class="line"><span class="o">{</span><span class="s2">&quot;username&quot;</span>: <span class="s2">&quot;DarkTemplar&quot;</span>, <span class="s2">&quot;tweet&quot;</span>: <span class="s2">&quot;From the shadows I come!&quot;</span>, <span class="s2">&quot;timestamp&quot;</span>: 1366154681<span class="o">}</span>
</span><span class="line"><span class="o">{</span><span class="s2">&quot;username&quot;</span>: <span class="s2">&quot;VoidRay&quot;</span>, <span class="s2">&quot;tweet&quot;</span>: <span class="s2">&quot;Prismatic core online!&quot;</span>, <span class="s2">&quot;timestamp&quot;</span>: 1366160000<span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Please be aware that the output data just happens to be JSON.  This is because we opted not to modify any of the input
data in our MapReduce job.  Since the input data to our MapReduce job is deserialized by Avro into JSON, the output
turns out to be JSON, too.  With a different MapReduce job you could of course write the output data in TSV or CSV
format, for instance.</p>
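<p>As a rough sketch of that idea (not part of the original examples), the following Python streaming mapper would convert each JSON-encoded record into a TSV line; the field names are assumed to match the tweet schema used throughout this article:</p>

```python
import json
import sys


def to_tsv(line):
    """Convert one JSON-encoded Avro datum (as delivered by
    AvroAsTextInputFormat, including its trailing TAB) into a TSV line."""
    record = json.loads(line.rstrip("\t\n"))
    return "%s\t%s\t%d" % (record["username"], record["tweet"], record["timestamp"])


if __name__ == "__main__":
    for line in sys.stdin:
        print(to_tsv(line))
```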

<h3 id="reading-avro-writing-avro">Reading Avro, writing Avro</h3>

<h4 id="avrotextoutputformat-implies-bytes-schema">AvroTextOutputFormat (implies “bytes” schema)</h4>

<p>To write the output in Avro format instead of plain text, use the same general options as in the previous example but
also add:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar <span class="se">\</span>
</span><span class="line">    <span class="o">[</span>...<span class="o">]</span>
</span><span class="line">    -outputformat org.apache.avro.mapred.AvroTextOutputFormat
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><a href="http://avro.apache.org/docs/1.7.4/api/java/index.html?org/apache/avro/mapred/AvroTextOutputFormat.html">AvroTextOutputFormat</a>
is the equivalent of TextOutputFormat.  It writes Avro data files with a “bytes” schema.</p>

<p>Note that using <code>IdentityMapper</code> as a naive mapper as shown in the previous example will not result in the output file
being identical to the input file.  This is because <code>AvroTextOutputFormat</code> will escape (quote) the input data it
receives.  An illustration might be worth a thousand words:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># After having used IdentityMapper as in the previous example</span>
</span><span class="line"><span class="nv">$ </span>hadoop fs -copyToLocal streaming/output/part-00000.avro .
</span><span class="line">
</span><span class="line"><span class="nv">$ </span>java -jar avro-tools-1.7.4.jar tojson part-00000.avro  | head -4
</span><span class="line"><span class="s2">&quot;{\&quot;username\&quot;: \&quot;miguno\&quot;, \&quot;tweet\&quot;: \&quot;Rock: Nerf paper, scissors is fine.\&quot;, \&quot;timestamp\&quot;: 1366150681}\t&quot;</span>
</span><span class="line"><span class="s2">&quot;{\&quot;username\&quot;: \&quot;BlizzardCS\&quot;, \&quot;tweet\&quot;: \&quot;Works as intended.  Terran is IMBA.\&quot;, \&quot;timestamp\&quot;: 1366154481}\t&quot;</span>
</span><span class="line"><span class="s2">&quot;{\&quot;username\&quot;: \&quot;DarkTemplar\&quot;, \&quot;tweet\&quot;: \&quot;From the shadows I come!\&quot;, \&quot;timestamp\&quot;: 1366154681}\t&quot;</span>
</span><span class="line"><span class="s2">&quot;{\&quot;username\&quot;: \&quot;VoidRay\&quot;, \&quot;tweet\&quot;: \&quot;Prismatic core online!\&quot;, \&quot;timestamp\&quot;: 1366160000}\t&quot;</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h4 id="custom-avro-output-schema">Custom Avro output schema</h4>

<p>This does not appear to be supported by stock Avro at the moment.  A related JIRA ticket
<a href="https://issues.apache.org/jira/browse/AVRO-1067">AVRO-1067</a>, created in April 2012, is still unresolved as of July
2013.</p>

<p>For a workaround take a look at the section <em>Avro output for Hadoop Streaming</em> at
<a href="https://github.com/tomslabs/avro-utils">avro-utils</a>, a third-party library for Avro.</p>

<h4 id="enabling-compression-of-avro-output-data-snappy-or-deflate">Enabling compression of Avro output data (Snappy or Deflate)</h4>

<p>If you want to enable compression for the Avro output data, you must add the following parameters to the streaming job:</p>

<pre><code># For compression with Snappy
-D mapred.output.compress=true -D avro.output.codec=snappy

# For compression with Deflate
-D mapred.output.compress=true -D avro.output.codec=deflate
</code></pre>

<p>Be aware that if you enable compression with <code>mapred.output.compress</code> but do NOT specify an Avro output format
(such as AvroTextOutputFormat), your cluster’s configured default compression codec will determine the final format
of the output data.  For instance, if <code>mapred.output.compression.codec</code> is set to
<code>com.hadoop.compression.lzo.LzopCodec</code> then the job’s output files would be compressed with LZO (e.g. you would
see <code>part-00000.lzo</code> output files instead of uncompressed <code>part-00000</code> files).</p>

<p>See also <a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_26_2.html">Compression and Avro</a>
in the CDH4 documentation.</p>

<h2 id="further-readings-on-hadoop-streaming">Further readings on Hadoop Streaming</h2>

<ul>
  <li><a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_26_6.html">Streaming and Avro</a>
– Cloudera CDH4 documentation</li>
</ul>

<h1 id="hive">Hive</h1>

<h2 id="preliminaries-1">Preliminaries</h2>

<p>Important: The examples below assume you have access to a running Hadoop cluster.</p>

<h2 id="examples-2">Examples</h2>

<p>In this section we demonstrate how to create a Hive table backed by Avro data, followed by running a few simple Hive
queries against that data.</p>

<h3 id="defining-a-hive-table-backed-by-avro-data">Defining a Hive table backed by Avro data</h3>

<h4 id="using-avroschemaurl-to-point-to-remote-a-avro-schema-file">Using avro.schema.url to point to remote a Avro schema file</h4>

<p>The following <code>CREATE TABLE</code> statement creates an external Hive table named <code>tweets</code> for storing Twitter messages
in a very basic data structure that consists of a username, the content of the message, and a timestamp.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="sql"><span class="line"><span class="k">CREATE</span> <span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span class="n">tweets</span>
</span><span class="line">    <span class="k">COMMENT</span> <span class="ss">&quot;A table backed by Avro data with the Avro schema stored in HDFS&quot;</span>
</span><span class="line">    <span class="k">ROW</span> <span class="n">FORMAT</span> <span class="n">SERDE</span> <span class="s1">&#39;org.apache.hadoop.hive.serde2.avro.AvroSerDe&#39;</span>
</span><span class="line">    <span class="n">STORED</span> <span class="k">AS</span>
</span><span class="line">    <span class="n">INPUTFORMAT</span>  <span class="s1">&#39;org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat&#39;</span>
</span><span class="line">    <span class="n">OUTPUTFORMAT</span> <span class="s1">&#39;org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat&#39;</span>
</span><span class="line">    <span class="k">LOCATION</span> <span class="s1">&#39;/user/YOURUSER/examples/input/&#39;</span>
</span><span class="line">    <span class="n">TBLPROPERTIES</span> <span class="p">(</span>
</span><span class="line">        <span class="s1">&#39;avro.schema.url&#39;</span><span class="o">=</span><span class="s1">&#39;hdfs:///user/YOURUSER/examples/schema/twitter.avsc&#39;</span>
</span><span class="line">    <span class="p">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><em>Note: You must replace <code>YOURUSER</code> with your actual username.</em>
<em>See section Preparing the Input Data above.</em></p>

<p>The SerDe parameter <code>avro.schema.url</code> can use URI schemes such as <code>hdfs://</code>, <code>http://</code> and <code>file://</code>.  It is
<a href="https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html">recommended to use HDFS locations</a>, though:</p>

<blockquote>
  <p>[If the avro.schema.url points] to a location on HDFS […], the AvroSerde will then read the file from HDFS, which
should provide resiliency against many reads at once [which can be a problem for HTTP locations].  Note that the serde
will read this file from every mapper, so it is a good idea to turn the replication of the schema file to a high value
to provide good locality for the readers.  The schema file itself should be relatively small, so this does not add a
significant amount of overhead to the process.</p>
</blockquote>

<p>That said, if you host the schemas on a high-performance web server such as <a href="http://nginx.org/">nginx</a>, which is very
efficient at serving static files, then using HTTP locations for Avro schemas should not be a problem either.</p>

<p>If you need to point to a particular HDFS namespace, you can include the hostname and port of the NameNode in
<code>avro.schema.url</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="sql"><span class="line"><span class="k">CREATE</span> <span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span class="p">[...]</span>
</span><span class="line">    <span class="n">TBLPROPERTIES</span> <span class="p">(</span>
</span><span class="line">        <span class="s1">&#39;avro.schema.url&#39;</span><span class="o">=</span><span class="s1">&#39;hdfs://namenode01:8020/path/to/twitter.avsc&#39;</span>
</span><span class="line">    <span class="p">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h4 id="using-avroschemaliteral-to-embed-an-avro-schema">Using avro.schema.literal to embed an Avro schema</h4>

<p>An alternative to setting <code>avro.schema.url</code> and using an external Avro schema is to embed the schema directly within
the <code>CREATE TABLE</code> statement:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
</pre></td><td class="code"><pre><code class="sql"><span class="line"><span class="k">CREATE</span> <span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span class="n">tweets</span>
</span><span class="line">    <span class="k">COMMENT</span> <span class="ss">&quot;A table backed by Avro data with the Avro schema embedded in the CREATE TABLE statement&quot;</span>
</span><span class="line">    <span class="k">ROW</span> <span class="n">FORMAT</span> <span class="n">SERDE</span> <span class="s1">&#39;org.apache.hadoop.hive.serde2.avro.AvroSerDe&#39;</span>
</span><span class="line">    <span class="n">STORED</span> <span class="k">AS</span>
</span><span class="line">    <span class="n">INPUTFORMAT</span>  <span class="s1">&#39;org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat&#39;</span>
</span><span class="line">    <span class="n">OUTPUTFORMAT</span> <span class="s1">&#39;org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat&#39;</span>
</span><span class="line">    <span class="k">LOCATION</span> <span class="s1">&#39;/user/YOURUSER/examples/input/&#39;</span>
</span><span class="line">    <span class="n">TBLPROPERTIES</span> <span class="p">(</span>
</span><span class="line">        <span class="s1">&#39;avro.schema.literal&#39;</span><span class="o">=</span><span class="s1">&#39;{</span>
</span><span class="line"><span class="s1">            &quot;type&quot;: &quot;record&quot;,</span>
</span><span class="line"><span class="s1">            &quot;name&quot;: &quot;Tweet&quot;,</span>
</span><span class="line"><span class="s1">            &quot;namespace&quot;: &quot;com.miguno.avro&quot;,</span>
</span><span class="line"><span class="s1">            &quot;fields&quot;: [</span>
</span><span class="line"><span class="s1">                { &quot;name&quot;:&quot;username&quot;,  &quot;type&quot;:&quot;string&quot;},</span>
</span><span class="line"><span class="s1">                { &quot;name&quot;:&quot;tweet&quot;,     &quot;type&quot;:&quot;string&quot;},</span>
</span><span class="line"><span class="s1">                { &quot;name&quot;:&quot;timestamp&quot;, &quot;type&quot;:&quot;long&quot;}</span>
</span><span class="line"><span class="s1">            ]</span>
</span><span class="line"><span class="s1">        }&#39;</span>
</span><span class="line">    <span class="p">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><em>Note: You must replace <code>YOURUSER</code> with your actual username.</em>
<em>See section Preparing the Input Data above.</em></p>

<p>Hive can also use variable substitution to embed the required Avro schema when a Hive script is run:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="sql"><span class="line"><span class="k">CREATE</span> <span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span class="n">tweets</span> <span class="p">[...]</span>
</span><span class="line">    <span class="n">TBLPROPERTIES</span> <span class="p">(</span><span class="s1">&#39;avro.schema.literal&#39;</span><span class="o">=</span><span class="s1">&#39;${hiveconf:schema}&#39;</span><span class="p">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>To execute the Hive script you would then run:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># SCHEMA must be a properly escaped version of the Avro schema; i.e. carriage returns converted to \n, tabs to \t,</span>
</span><span class="line"><span class="c"># quotes escaped, and so on.</span>
</span><span class="line"><span class="nv">$ </span><span class="nb">export </span><span class="nv">SCHEMA</span><span class="o">=</span><span class="s2">&quot;...&quot;</span>
</span><span class="line"><span class="nv">$ </span>hive -hiveconf <span class="nv">schema</span><span class="o">=</span><span class="s2">&quot;${SCHEMA}&quot;</span> -f hive_script.hql
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h4 id="switching-from-avroschemaurl-to-avroschemaliteral-or-vice-versa">Switching from avro.schema.url to avro.schema.literal or vice versa</h4>

<p>If for a given Hive table you want to change how the Avro schema is specified, you must use a
<a href="https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html">workaround</a>:</p>

<blockquote>
  <p>Hive does not provide an easy way to unset or remove a property.  If you wish to switch from using url or schema to
the other, set the to-be-ignored value to none and the AvroSerde will treat it as if it were not set.</p>
</blockquote>
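<p>For example, to switch the <code>tweets</code> table from an embedded schema to a schema file in HDFS, you could set the old property to <code>none</code> and define the new one in a single statement (a sketch only; the path is hypothetical and <code>YOURUSER</code> must be replaced as before):</p>

```sql
ALTER TABLE tweets SET TBLPROPERTIES (
    'avro.schema.literal'='none',
    'avro.schema.url'='hdfs:///user/YOURUSER/examples/schema/twitter.avsc'
);
```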

<h3 id="analyzing-the-data-with-hive">Analyzing the data with Hive</h3>

<p>After you have created the Hive table <code>tweets</code> with one of the <code>CREATE TABLE</code> statements above (it does not matter which),
you can start analyzing the example data with Hive.  We will demonstrate this via the interactive Hive shell, but you
can also use a Hive script, of course.</p>

<p>First, start the Hive shell:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hive
</span><span class="line">hive&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Let us inspect how Hive interprets the Avro data with <code>DESCRIBE</code>.  You can also use <code>DESCRIBE EXTENDED</code> to see even
more details, including the Avro schema of the table.</p>

<pre><code>hive&gt; DESCRIBE tweets;
OK
username        string  from deserializer
tweet   string  from deserializer
timestamp       bigint  from deserializer
Time taken: 1.786 seconds
</code></pre>

<p>Now we can perform interactive analysis of our example data:</p>

<pre><code>hive&gt; SELECT * FROM tweets LIMIT 5;
OK
miguno        Rock: Nerf paper, scissors is fine.   1366150681
BlizzardCS    Works as intended.  Terran is IMBA.   1366154481
DarkTemplar   From the shadows I come!              1366154681
VoidRay       Prismatic core online!                1366160000
VoidRay       Fire at will, commander.              1366160010
Time taken: 0.126 seconds
</code></pre>

<p>The following query will launch a MapReduce job to compute the result:</p>

<pre><code>hive&gt; SELECT DISTINCT(username) FROM tweets;
Total MapReduce jobs = 1
Launching Job 1 out of 1
[...snip...]
MapReduce Total cumulative CPU time: 4 seconds 290 msec
Ended Job = job_201305070634_0187
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.29 sec   HDFS Read: 1887 HDFS Write: 47 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 290 msec
OK
BlizzardCS          &lt;&lt;&lt; Query results start here
DarkTemplar
Immortal
VoidRay
miguno
Time taken: 16.782 seconds
</code></pre>

<p>As you can see, Hive makes working with Avro data completely transparent once you have defined the Hive table accordingly.</p>

<h3 id="enabling-compression-of-avro-output-data">Enabling compression of Avro output data</h3>

<p>To enable compression add the following statements to your Hive script or enter them into the Hive shell:</p>

<pre><code># For compression with Snappy
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

# For compression with Deflate
SET hive.exec.compress.output=true;
SET avro.output.codec=deflate;
</code></pre>

<p>To disable compression again in the same Hive script/Hive shell:</p>

<pre><code>SET hive.exec.compress.output=false;
</code></pre>

<h2 id="further-readings-on-hive">Further readings on Hive</h2>

<ul>
  <li><a href="https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html">AvroSerDe - working with Avro from Hive</a>
– Hive documentation</li>
</ul>

<h1 id="pig">Pig</h1>

<h2 id="preliminaries-2">Preliminaries</h2>

<p>Important: The examples below assume you have access to a running Hadoop cluster.</p>

<h2 id="examples-3">Examples</h2>

<h3 id="prerequisites-2">Prerequisites</h3>

<p>First, we must register the jar files required to work with Avro.  In this example I am using the jar files
shipped with CDH4; if you are not using CDH4, just adapt the paths to match your Hadoop distribution.</p>

<pre><code>REGISTER /app/cloudera/parcels/CDH/lib/pig/piggybank.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/avro-*.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/jackson-core-asl-*.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/jackson-mapper-asl-*.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/json-simple-*.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/snappy-java-*.jar
</code></pre>

<p>Note: If you also want to work with Python UDFs in PiggyBank you must also register the Jython jar file:</p>

<pre><code>REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/jython-standalone-*.jar
</code></pre>

<h3 id="reading-avro">Reading Avro</h3>

<p>To read input data in Avro format, you must use <code>AvroStorage</code>.  The following statements show various ways to load
Avro data.</p>

<pre><code>-- Easiest case: when the input data contains an embedded Avro schema (our example input data does).
-- Note that all the files under the directory should have the same schema.
records = LOAD 'examples/input/' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

--
-- The next commands show how to specify the data schema manually
--

-- Using external schema file (stored on HDFS), relative path
records = LOAD 'examples/input/'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check',
               'schema_file', 'examples/schema/twitter.avsc');

-- Using external schema file (stored on HDFS), absolute path
records = LOAD 'examples/input/'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'no_schema_check',
            'schema_file', 'hdfs:///user/YOURUSERNAME/examples/schema/twitter.avsc');

-- Using external schema file (stored on HDFS), absolute path with explicit HDFS namespace
records = LOAD 'examples/input/'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'no_schema_check',
            'schema_file', 'hdfs://namenode01:8020/user/YOURUSERNAME/examples/schema/twitter.avsc');
</code></pre>

<p><em>About “no_schema_check”:</em>
<code>AvroStorage</code> assumes that all Avro files in sub-directories of an input directory share the same schema, and by
default <code>AvroStorage</code> performs a schema check.  This process may take some time (seconds) when the input directory
contains many sub-directories and files.  You can set the option “no_schema_check” to disable this schema check.</p>

<p>See <a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">AvroStorage</a> and
<a href="https://github.com/apache/pig/blob/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java">TestAvroStorage.java</a>
for further examples.</p>

<h3 id="analyzing-the-data-with-pig">Analyzing the data with Pig</h3>

<p>The <code>records</code> relation is already in a perfectly usable format – you do not need to manually define a (Pig) schema as
you would usually do via <code>LOAD ... AS (...schema follows...)</code>.</p>

<pre><code>grunt&gt; DESCRIBE records;
records: {username: chararray,tweet: chararray,timestamp: long}
</code></pre>

<p>Let us take a first look at the contents of our input data.  Note that the output you see will vary with each
invocation due to how <a href="http://pig.apache.org/docs/r0.11.1/test.html">ILLUSTRATE</a> works.</p>

<pre><code>grunt&gt; ILLUSTRATE records;
&lt;snip&gt;
--------------------------------------------------------------------------------------------
| records     | username:chararray      | tweet:chararray            | timestamp:long      |
--------------------------------------------------------------------------------------------
|             | DarkTemplar             | I strike from the shadows! | 1366184681          |
--------------------------------------------------------------------------------------------
</code></pre>

<p>Now we can perform interactive analysis of our example data:</p>

<pre><code>grunt&gt; first_five_records = LIMIT records 5;
grunt&gt; DUMP first_five_records;   &lt;&lt;&lt; this will trigger a MapReduce job
[...snip...]
(miguno,Rock: Nerf paper, scissors is fine.,1366150681)
(VoidRay,Prismatic core online!,1366160000)
(VoidRay,Fire at will, commander.,1366160010)
(BlizzardCS,Works as intended.  Terran is IMBA.,1366154481)
(DarkTemplar,From the shadows I come!,1366154681)
</code></pre>

<p>List the (unique) names of users that created tweets:</p>

<pre><code>grunt&gt; usernames = DISTINCT (FOREACH records GENERATE username);
grunt&gt; DUMP usernames;            &lt;&lt;&lt; this will trigger a MapReduce job
[...snip...]
(miguno)
(VoidRay)
(Immortal)
(BlizzardCS)
(DarkTemplar)
</code></pre>
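<p>As a natural next step you could, for example, count the tweets per user.  The following Grunt snippet is a sketch of such a follow-up analysis – the relation names <code>tweets_by_user</code> and <code>tweet_counts</code> are my own, but the <code>GROUP</code>/<code>COUNT</code> pattern is standard Pig:</p>

```pig
grunt> tweets_by_user = GROUP records BY username;
grunt> tweet_counts = FOREACH tweets_by_user GENERATE group AS username, COUNT(records) AS num_tweets;
grunt> DUMP tweet_counts;         <<< this will trigger a MapReduce job
```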

<h3 id="writing-avro">Writing Avro</h3>

<p>To write output data in Avro format you must use <code>AvroStorage</code> – just like for reading Avro data.</p>

<p>It is strongly recommended that you specify an explicit output schema when writing Avro data.  If you don’t, Pig
will try to infer the output Avro schema from the data’s Pig schema – and this may result in undesirable schemas due
to discrepancies between the Pig and Avro data models (or problems in Pig itself).  See
<a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">AvroStorage</a> for details.</p>

<pre><code>-- Use the same output schema as an existing directory of Avro files (files should have the same schema).
-- This is helpful, for instance, when doing simple processing such as filtering the input data without modifying
-- the resulting data layout.
STORE records INTO 'pig/output/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        'no_schema_check',
        'data', 'examples/input/');

-- Use the same output schema as an existing Avro file as opposed to a directory of such files
STORE records INTO 'pig/output/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        'no_schema_check',
        'data', 'examples/input/twitter.avro');

-- Manually define an Avro schema (here, we rename 'username' to 'user' and 'tweet' to 'message')
STORE records INTO 'pig/output/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        '{
            "schema": {
                "type": "record",
                "name": "Tweet",
                "namespace": "com.miguno.avro",
                "fields": [
                    {
                        "name": "user",
                        "type": "string"
                    },
                    {
                        "name": "message",
                        "type": "string"
                    },
                    {
                        "name": "timestamp",
                        "type": "long"
                    }
                ],
                "doc:" : "A slightly modified schema for storing Twitter messages"
            }
        }');
</code></pre>

<p>If you need to store the data in two or more different ways (e.g. you want to rename fields) you must add the parameter
<a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">“index”</a> to the <code>AvroStorage</code> arguments.  Pig uses this
information as a workaround to distinguish schemas specified by different AvroStorage calls until Pig’s StoreFunc
provides access to Pig’s output schema in the backend.</p>

<pre><code>STORE records INTO 'pig/output-variant-A/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        '{
            "index": 1,
            "schema": { ... }
        }');

STORE records INTO 'pig/output-variant-B/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        '{
            "index": 2,
            "schema": { ... }
        }');
</code></pre>

<p>See <a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">AvroStorage</a> and
<a href="https://github.com/apache/pig/blob/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java">TestAvroStorage.java</a>
for further examples.</p>

<h4 id="enabling-compression-of-avro-output-data-1">Enabling compression of Avro output data</h4>

<p>To enable compression add the following statements to your Pig script or enter them into the Pig Grunt shell:</p>

<pre><code>-- We also enable compression of map output (which should be enabled by default anyways) because some Pig jobs
-- skip the reduce phase;  this ensures that we always generate compressed job output.
SET mapred.compress.map.output true;
SET mapred.output.compress true;
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
SET avro.output.codec snappy;
</code></pre>

<p>To disable compression again in the same Pig script/Pig Grunt shell:</p>

<pre><code>SET mapred.output.compress false;
-- Optionally: disable compression of map output (normally you want to leave this enabled)
SET mapred.compress.map.output false;
</code></pre>

<h3 id="further-readings-on-pig">Further readings on Pig</h3>

<ul>
  <li><a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">AvroStorage</a> on the Pig wiki</li>
  <li><a href="https://github.com/apache/pig/blob/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java">AvroStorage.java</a></li>
  <li><a href="https://github.com/apache/pig/blob/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java">TestAvroStorage.java</a>
– many unit test examples that demonstrate how to use <code>AvroStorage</code></li>
</ul>

<h1 id="where-to-go-from-here">Where to go from here</h1>

<p>As I said at the beginning of this article you can always find the latest version of the code examples at
<a href="https://github.com/miguno/avro-hadoop-starter">https://github.com/miguno/avro-hadoop-starter</a>.  I’d welcome any
code contributions, corrections, etc. you might have – just
<a href="https://github.com/miguno/avro-hadoop-starter/issues/new">create an issue ticket</a> or send me a pull request.</p>

<p>If you are interested in reading and writing Avro files in a shell environment – e.g. when you quickly want to
inspect a sample of MapReduce output in Avro format – please take a look at
<a href="http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/">Reading and Writing Avro Files From the Command Line</a>.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Understanding the Internal Message Buffers of Storm]]></title>
    <link href="http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-06-21T22:35:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers</id>
    <content type="html"><![CDATA[<p>When you are optimizing the performance of your Storm topologies it helps to understand how Storm’s internal message
queues are configured and put to use.  In this short article I will explain and illustrate how Storm version 0.8/0.9
implements the intra-worker communication that happens within a worker process and its associated executor threads.</p>

<!-- more -->

<h1 id="internal-messaging-within-storm-worker-processes">Internal messaging within Storm worker processes</h1>

<div class="note">
Terminology: I will use the terms <em>message</em> and (Storm) <em>tuple</em> interchangeably in the following sections.
</div>

<p>When I say “internal messaging” I mean the messaging that happens within a worker process in Storm, i.e. communication
that is restricted to the same Storm machine/node.  For this communication Storm relies on various message
queues backed by <a href="http://lmax-exchange.github.io/disruptor/">LMAX Disruptor</a>, which is a high performance inter-thread
messaging library.</p>

<p>Note that this communication within the threads of a worker process is different from Storm’s <em>inter-worker</em>
communication, which normally happens across machines and thus over the network.  For the latter Storm uses
<a href="http://www.zeromq.org/">ZeroMQ</a> by default (in Storm 0.9 there is experimental support for <a href="http://netty.io/">Netty</a> as
the network messaging backend).  That is, ZeroMQ/Netty are used when a task in one worker process wants to send data to
a task that runs in a worker process on a different machine in the Storm cluster.</p>
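<p>As a side note, if you want to experiment with the Netty backend in Storm 0.9 you can switch the transport via <code>storm.yaml</code>.  The fragment below is a sketch – double-check the setting name against the documentation of your Storm version:</p>

```yaml
# Use Netty instead of the default ZeroMQ for inter-worker messaging (Storm 0.9+)
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
```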

<p>So for your reference:</p>

<ul>
  <li>Intra-worker communication in Storm (inter-thread on the same Storm node): LMAX Disruptor</li>
  <li>Inter-worker communication (node-to-node across the network): ZeroMQ or Netty</li>
  <li>Inter-topology communication: nothing built into Storm, you must take care of this yourself with e.g. a messaging
system such as Kafka/RabbitMQ, a database, etc.</li>
</ul>

<p>If you do not know what the differences are between Storm’s worker processes, executor threads and tasks please take a
look at
<a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">Understanding the Parallelism of a Storm Topology</a>.</p>

<h1 id="illustration">Illustration</h1>

<p>Let us start with a picture before we discuss the nitty-gritty details in the next section.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/storm-internal-message-queues.png" title="Overview of Storm's internal messaging setup" /></p>

<div class="caption">
Figure 1: Overview of a worker&#8217;s internal message queues in Storm.  Queues related to a worker process are colored in
red, queues related to the worker&#8217;s various executor threads are colored in green.  For readability reasons I show only
one worker process (though normally a single Storm node runs multiple such processes) and only one executor thread
within that worker process (of which, again, there are usually many per worker process).
</div>

<h1 id="detailed-description">Detailed description</h1>

<p>Now that you have had a first glimpse of Storm’s intra-worker messaging setup we can discuss the details.</p>

<h2 id="worker-processes">Worker processes</h2>

<p>To manage its incoming and outgoing messages each worker process has a single receive thread that listens on the worker’s
TCP port (as configured via <code>supervisor.slots.ports</code>).  The parameter <code>topology.receiver.buffer.size</code> determines the
batch size that the receive thread uses to place incoming messages into the incoming queues of the worker’s executor
threads.  Similarly, each worker has a single send thread that is responsible for reading messages from the worker’s
transfer queue and sending them over the network to downstream consumers.  The size of the transfer queue is configured
via <code>topology.transfer.buffer.size</code>.</p>

<ul>
  <li>The <code>topology.receiver.buffer.size</code> is the maximum number of messages that are batched together at once for
appending to an executor’s incoming queue by the worker receive thread (which reads the messages from the network).
Setting this parameter too high may cause a lot of problems (“heartbeat thread gets starved, throughput plummets”).
The default value is 8 elements, and the value must be a power of 2 (this requirement comes indirectly from LMAX
Disruptor).</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// Example: configuring via Java API</span>
</span><span class="line"><span class="n">Config</span> <span class="n">conf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Config</span><span class="o">();</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_RECEIVER_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">16</span><span class="o">);</span> <span class="c1">// default is 8</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<div class="note">
Note that <tt>topology.receiver.buffer.size</tt> is in contrast to the other buffer size related parameters described in this article actually not configuring the size of an LMAX Disruptor queue.  Rather it sets the size of a simple <a href="http://docs.oracle.com/javase/6/docs/api/java/util/ArrayList.html">ArrayList</a> that is used to buffer incoming messages because in this specific case the data structure does not need to be shared with other threads, i.e. it is local to the worker&#8217;s receive thread.  But because the content of this buffer is used to fill a Disruptor-backed queue (executor incoming queues) it must still be a power of 2.  See <tt>launch-receive-thread!</tt> in <a href="https://github.com/nathanmarz/storm/blob/master/storm-core/src/clj/backtype/storm/messaging/loader.clj">backtype.storm.messaging.loader</a> for details.
</div>
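<p>Because the executor queues and, indirectly, this receive buffer must all be sized as powers of 2, it can be handy to sanity-check a candidate value before deploying.  The following standalone Java snippet merely illustrates the constraint – the class and method names are my own and are not part of the Storm API:</p>

```java
// Sketch: check whether a candidate buffer size satisfies the
// power-of-2 constraint imposed (directly or indirectly) by LMAX Disruptor.
public class BufferSizeCheck {

    // A positive integer is a power of 2 iff exactly one bit is set.
    public static boolean isPowerOfTwo(int n) {
        return n > 0 && Integer.bitCount(n) == 1;
    }

    public static void main(String[] args) {
        System.out.println(isPowerOfTwo(8));     // true  (the default of 8 is valid)
        System.out.println(isPowerOfTwo(16384)); // true
        System.out.println(isPowerOfTwo(1000));  // false (use 1024 instead)
    }
}
```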

<ul>
  <li>Each element of the transfer queue configured with <code>topology.transfer.buffer.size</code> is actually a <em>list</em> of tuples.
The various executor send threads will batch outgoing tuples off their outgoing queues onto the transfer queue.  The
default value is 1024 elements.</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// Example: configuring via Java API</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_TRANSFER_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">32</span><span class="o">);</span> <span class="c1">// default is 1024</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="executors">Executors</h2>

<p>Each worker process controls one or more <em>executor threads</em>.  Each executor thread has its own <em>incoming queue</em> and
<em>outgoing queue</em>.  As described above, the worker process runs a dedicated worker receive thread that is responsible
for moving incoming messages to the appropriate incoming queue of the worker’s various executor threads.  Similarly,
each executor has its dedicated send thread that moves an executor’s outgoing messages from its outgoing queue to the
“parent” worker’s transfer queue.  The sizes of the executors’ incoming and outgoing queues are configured via
<code>topology.executor.receive.buffer.size</code> and <code>topology.executor.send.buffer.size</code>, respectively.</p>

<p>Each executor runs a single thread that handles the user logic for the spout/bolt (i.e. your application code),
and a single send thread which moves messages from the executor’s outgoing queue to the worker’s transfer queue.</p>

<ul>
  <li>The <code>topology.executor.receive.buffer.size</code> is the size of the incoming queue for an executor.  Each element of
this queue is a <em>list</em> of tuples.  Here, tuples are appended in batch.  The default value is 1024 elements, and
the value must be a power of 2 (this requirement comes from LMAX Disruptor).</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// Example: configuring via Java API</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">16384</span><span class="o">);</span> <span class="c1">// batched; default is 1024</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<ul>
  <li>The <code>topology.executor.send.buffer.size</code> is the size of the outgoing queue for an executor. Each element of this
queue will contain a <em>single</em> tuple.  The default value is 1024 elements, and the value must be a power of 2 (this
requirement comes from LMAX Disruptor).</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// Example: configuring via Java API</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">16384</span><span class="o">);</span> <span class="c1">// individual tuples; default is 1024</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="where-to-go-from-here">Where to go from here</h1>

<h2 id="how-to-configure-storms-internal-message-buffers">How to configure Storm’s internal message buffers</h2>

<p>The various default values mentioned above are defined in
<a href="https://github.com/nathanmarz/storm/blob/master/conf/defaults.yaml">conf/defaults.yaml</a>.  You can override these values
globally in a Storm cluster’s <code>conf/storm.yaml</code>.  You can also configure these parameters per individual Storm
topology via <a href="http://nathanmarz.github.io/storm/doc/backtype/storm/Config.html">backtype.storm.Config</a> in Storm’s Java
API.</p>
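<p>For example, a cluster-wide override in <code>conf/storm.yaml</code> might look as follows.  The values simply mirror the per-topology Java example at the end of this article – they are illustrative, not recommendations:</p>

```yaml
# Cluster-wide overrides of Storm's internal message buffer sizes (conf/storm.yaml)
topology.receiver.buffer.size: 8
topology.transfer.buffer.size: 32
topology.executor.receive.buffer.size: 16384
topology.executor.send.buffer.size: 16384
```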

<h2 id="how-to-configure-storms-parallelism">How to configure Storm’s parallelism</h2>

<p>The correct configuration of Storm’s message buffers is closely tied to the workload pattern of your topology as well
as the configured <em>parallelism</em> of your topologies.  See
<a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">Understanding the Parallelism of a Storm Topology</a>
for more details about the latter.</p>

<h2 id="understand-whats-going-on-in-your-storm-topology">Understand what’s going on in your Storm topology</h2>

<p>The Storm UI is a good starting point for inspecting key metrics of your running Storm topologies.  For instance, it shows you the
so-called “capacity” of a spout/bolt.  The various metrics will help you decide whether your changes to the
buffer-related configuration parameters described in this article had a positive or negative effect on the performance
of your Storm topologies.  See
<a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/">Running a Multi-Node Storm Cluster</a> for details.</p>

<p>Apart from that you can also generate your own application metrics and track them with a tool like Graphite.
See my articles <a href="http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/">Sending Metrics From Storm to Graphite</a> and
<a href="http://www.michael-noll.com/blog/2013/06/06/installing-and-running-graphite-via-rpm-and-supervisord/">Installing and Running Graphite via RPM and Supervisord</a>
for details.  It might also be worth checking out ooyala’s
<a href="https://github.com/ooyala/metrics_storm">metrics_storm</a> project on GitHub (I haven’t used it yet).</p>

<h2 id="advice-on-performance-tuning">Advice on performance tuning</h2>

<p>Watch Nathan Marz’s talk on
<a href="http://demo.ooyala.com/player.html?width=640&amp;height=360&amp;embedCode=Q1eXg5NzpKqUUzBm5WTIb6bXuiWHrRMi&amp;videoPcode=9waHc6zKpbJKt9byfS7l4O4sn7Qn">Tuning and Productionization of Storm</a>.</p>

<p>The TL;DR version is:  Try the following settings as a starting point and see whether they improve the performance of your
Storm topology.</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_RECEIVER_BUFFER_SIZE</span><span class="o">,</span>             <span class="mi">8</span><span class="o">);</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_TRANSFER_BUFFER_SIZE</span><span class="o">,</span>            <span class="mi">32</span><span class="o">);</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">16384</span><span class="o">);</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE</span><span class="o">,</span>    <span class="mi">16384</span><span class="o">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

]]></content>
  </entry>
</feed>
