<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~files/atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedpress="https://feed.press/xmlns">
  <feedpress:locale>en</feedpress:locale>
  <link rel="via" href="http://www.michael-noll.com/atom.xml"/>
  <link rel="hub" href="http://feedpress.superfeedr.com/"/>
  <title><![CDATA[Michael G. Noll]]></title>
  <link href="http://feedpress.me/miguno" rel="self"/>
  <link href="http://www.michael-noll.com/"/>
  <updated>2017-10-25T10:21:16+02:00</updated>
  <id>http://www.michael-noll.com/</id>
  <author>
    <name><![CDATA[Michael G. Noll]]></name>
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>
  <entry>
    <title type="html"><![CDATA[Integrating Kafka and Spark Streaming: Code Examples and State of the Game]]></title>
    <link href="http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno"/>
    <updated>2014-10-01T16:51:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial</id>
    <content type="html"><![CDATA[<p><a href="https://spark.apache.org/streaming/">Spark Streaming</a> has been getting some attention lately as a real-time data
processing tool, often mentioned alongside <a href="http://storm.apache.org/">Apache Storm</a>.  If you ask me, no real-time data
processing tool is complete without Kafka integration (smile), hence I added an example Spark Streaming application to
<a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> that demonstrates how to read from Kafka and write
to Kafka, using <a href="http://avro.apache.org/">Avro</a> as the data format and
<a href="https://github.com/twitter/bijection">Twitter Bijection</a> for handling the data serialization.</p>

<p>In this post I will explain this Spark Streaming example in further detail and also shed some light on the current state
of Kafka integration in Spark Streaming.  All this with the disclaimer that this happens to be my first experiment with
Spark Streaming.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="note">
  <strong>
    The Spark Streaming example code is available at
    <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> on GitHub.
    And yes, the project&#8217;s name might now be a bit misleading. :-)
  </strong>
</div>

<h1 id="what-is-spark-streaming">What is Spark Streaming?</h1>

<p><a href="http://spark.apache.org/streaming/">Spark Streaming</a> is a sub-project of <a href="http://spark.apache.org/">Apache Spark</a>.
Spark is a batch processing platform similar to Apache Hadoop, and Spark Streaming is a near-real-time processing tool
that runs on top of the Spark engine and processes incoming data in small batches (so-called micro-batches).</p>

<h2 id="spark-streaming-vs-apache-storm">Spark Streaming vs. Apache Storm</h2>

<p>In terms of use cases Spark Streaming is closely related to <a href="http://storm.apache.org/">Apache Storm</a>, which is
arguably today’s most popular real-time processing platform for Big Data.  Bobby Evans and Tom Graves of Yahoo!
Engineering recently gave a talk on
<a href="http://yahoohadoop.tumblr.com/post/98213421641/storm-and-spark-at-yahoo-why-chose-one-over-the-other">Spark and Storm at Yahoo!</a>,
in which they compare the two platforms and also cover the question of when and why to choose one over the other.
Similarly, P. Taylor Goetz of Hortonworks shared a slide deck titled
<a href="http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming">Apache Storm and Spark Streaming Compared</a>.</p>

<p>Here’s my personal, very brief comparison: Storm has higher industry adoption and better production stability compared
to Spark Streaming.  Spark on the other hand has a more expressive, higher-level API than Storm, which is arguably more
pleasant to use, at least if you write your Spark applications in Scala (I prefer the Spark API, too).  But don’t just
take my word for it – please do check out the talks and slide decks above yourself.</p>

<p>Both Spark and Storm are top-level Apache projects, and vendors have begun to integrate either or both tools into their
commercial offerings, e.g. Hortonworks (<a href="http://hortonworks.com/hadoop/storm/">Storm</a>,
<a href="http://hortonworks.com/hadoop/spark/">Spark</a>) and Cloudera
(<a href="http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html">Spark</a>).</p>

<h1 id="excursus-machines-cores-executors-tasks-and-receivers-in-spark">Excursus: Machines, cores, executors, tasks, and receivers in Spark</h1>

<p>The subsequent sections of this article talk a lot about parallelism in Spark and in Kafka.  You need at least a basic
understanding of some Spark terminology to be able to follow the discussion in those sections.</p>

<ul>
  <li>A Spark <strong>cluster</strong> contains 1+ worker nodes aka slave machines (simplified view; I exclude pieces like cluster
managers here).</li>
  <li>A <strong>worker node</strong> can run 1+ executors.</li>
  <li>An <strong>executor</strong> is a process launched for an application on a worker node, which runs tasks and keeps data in memory
or disk storage across them.  Each application has its own executors.  An executor has a certain number of cores aka
“slots” available to run tasks assigned to it.</li>
  <li>A <strong>task</strong> is a unit of work that will be sent to one executor.  That is, it runs (part of) the actual computation of
your application.  The <code>SparkContext</code> sends those tasks for the executors to run.  Each task occupies one slot aka
core in the parent executor.</li>
  <li>A <strong>receiver</strong>
(<a href="http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.receiver.Receiver">API</a>,
<a href="http://spark.apache.org/docs/latest/streaming-custom-receivers.html">docs</a>)
is run within an executor as a long-running task.  Each receiver is responsible for exactly one so-called
<em>input DStream</em> (e.g. an input stream for reading from Kafka), and each receiver – and thus input DStream – occupies
one core/slot.</li>
  <li>An <strong>input DStream</strong> is a special DStream that connects Spark Streaming to external data sources
for reading input data.  For each external data source (e.g. Kafka) you need one such input DStream implementation.
Once Spark Streaming is “connected” to an external data source via such input DStreams, any subsequent DStream
transformations will create “normal” DStreams.</li>
</ul>
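<p>A quick rule of thumb that follows from the list above (plain Scala arithmetic, not a Spark API; <code>minCores</code> is a made-up helper for illustration): because each receiver permanently occupies one core/slot, an application needs strictly more cores than it has receivers, otherwise no slots remain for the tasks that do the actual processing.</p>

```scala
// Made-up helper (not a Spark API): each receiver permanently occupies one
// executor core/slot, so a streaming application needs strictly more cores
// than receivers to leave at least one slot free for processing tasks.
def minCores(numReceivers: Int, processingSlots: Int = 1): Int =
  numReceivers + processingSlots

// e.g. five input DStreams -> five receivers -> at least six cores in total
val coresNeeded = minCores(5)
```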

<p>In Spark’s execution model, each application gets its own executors, which stay up for the duration of the whole
application and run 1+ tasks in multiple threads.  This isolation approach is similar to Storm’s model of execution.
This architecture becomes more complicated once you introduce cluster managers like YARN or Mesos, which I do not cover
here.  See <a href="http://spark.apache.org/docs/latest/cluster-overview.html">Cluster Overview</a> in the Spark docs for further
details.</p>

<h1 id="integrating-kafka-with-spark-streaming">Integrating Kafka with Spark Streaming</h1>

<h2 id="overview">Overview</h2>

<p>In short, Spark Streaming supports Kafka but there are still some rough edges.</p>

<p>A good starting point for me has been the
<a href="https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala">KafkaWordCount</a>
example in the Spark code base
(<strong>Update 2015-03-31:</strong> see also
<a href="https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala">DirectKafkaWordCount</a>).
When I read this code, however, a couple of questions remained open.</p>

<p>Notably I wanted to understand how to:</p>

<ul>
  <li>Read from Kafka <em>in parallel</em>.  In Kafka, a topic can have <em>N</em> partitions, and ideally we’d like to parallelize
reading from those <em>N</em> partitions.  This is what the
<a href="https://github.com/apache/incubator-storm/tree/master/external/storm-kafka">Kafka spout in Storm</a> does.</li>
  <li>Write to Kafka from a Spark Streaming application, also <em>in parallel</em>.</li>
</ul>

<p>On top of those questions I also ran into several known issues in Spark and/or Spark Streaming, most of which have been
discussed in the Spark mailing list.  I’ll summarize the current state and known issues of the Kafka integration further
down below.</p>

<h2 id="primer-on-topics-partitions-and-parallelism-in-kafka">Primer on topics, partitions, and parallelism in Kafka</h2>

<p><em>For details see my articles</em>
<em><a href="http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/">Apache Kafka 0.8 Training Deck and Tutorial</a></em>
<em>and</em>
<em><a href="http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/">Running a Multi-Broker Apache Kafka 0.8 Cluster on a Single Node</a>.</em></p>

<p>Kafka stores data in <em>topics</em>, with each topic consisting of a configurable number of <em>partitions</em>.  The number of
partitions of a topic is very important for performance considerations as this number is an <em>upper bound on the</em>
<em>consumer parallelism</em>: if a topic has <em>N</em> partitions, then your application can only consume this topic with a maximum
of <em>N</em> threads in parallel.  (At least this is the case when you use Kafka’s built-in Scala/Java consumer API.)</p>

<p>When I say “application” I should rather say <em>consumer group</em> in Kafka’s terminology.  A consumer group, identified by
a string of your choosing, is the cluster-wide identifier for a logical consumer application.  All consumers that are
part of the same consumer group share the burden of reading from a given Kafka topic, and only a maximum of <em>N</em> (=
number of partitions) threads across all the consumers in the same group will be able to read from the topic.  Any
excess threads will sit idle.</p>

<div class="note">
<strong>Multiple Kafka consumer groups can be run in parallel:</strong> Of course you can run multiple, independent logical consumer applications against the same Kafka topic.  Here, each logical application will run its consumer threads under a unique consumer group id.  Each application can then also use a different read parallelism (see below).  When I talk about the various ways to configure read parallelism in the following sections, I am referring to the settings of a <em>single</em> one of these logical consumer applications.
</div>

<p>Here are some simplified examples.</p>

<ul>
  <li>Your application uses the consumer group id “terran” to read from a Kafka topic “zerg.hydra” that has
<strong>10 partitions</strong>.
If you configure your application to consume the topic with only <strong>1</strong> thread, then this single thread will read data
from all 10 partitions.</li>
  <li>Same as above, but this time you configure <strong>5</strong> consumer threads.  Here, each thread will read from 2 partitions.</li>
  <li>Same as above, but this time you configure <strong>10</strong> consumer threads.  Here, each thread will read from a single
partition.</li>
  <li>Same as above, but this time you configure <strong>14</strong> consumer threads.  Here, 10 of the 14 threads will read from a
single partition each, and the remaining 4 threads will be idle.</li>
</ul>
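<p>The arithmetic behind these examples can be sketched in a few lines of plain Scala; <code>partitionsPerThread</code> is a made-up illustration helper, not part of the Kafka consumer API:</p>

```scala
// How many of a topic's partitions each consumer thread ends up with, assuming
// partitions are spread as evenly as possible across the threads of a single
// consumer group (made-up helper for illustration, not a Kafka API).
def partitionsPerThread(numPartitions: Int, numThreads: Int): Seq[Int] =
  (0 until numThreads).map { thread =>
    (0 until numPartitions).count(_ % numThreads == thread)
  }

partitionsPerThread(10, 1)   // one thread reads all 10 partitions
partitionsPerThread(10, 5)   // two partitions per thread
partitionsPerThread(10, 14)  // ten threads get one partition each, four get none (idle)
```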

<p>Let’s introduce some real-world complexity in this simple picture – the <em>rebalancing</em> event in Kafka.  Rebalancing is
a lifecycle event in Kafka that occurs when consumers join or leave a consumer group (there are more conditions that
trigger rebalancing but these are not important in this context; see my
<a href="http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/">Kafka training deck</a> for details on rebalancing).</p>

<ul>
  <li>Your application uses the consumer group id “terran” and starts consuming with 1 thread.  This thread will read from
all 10 partitions.  During runtime, you’ll increase the number of threads from 1 to 14.  That is, there is suddenly
a change of parallelism for the same consumer group.  This triggers <em>rebalancing</em> in Kafka.  Once rebalancing
completes, you will have 10 of 14 threads consuming from a single partition each, and the 4 remaining threads will be
idle.  And as you might have guessed, the initial thread will now read from only one partition and will no longer see
data from the other nine.</li>
</ul>

<p>We have now a basic understanding of topics, partitions, and the number of partitions as an upper bound for the
parallelism when reading from Kafka.  But what are the resulting implications for an application – such as a Spark
Streaming job or Storm topology – that reads its input data from Kafka?</p>

<ol>
  <li><strong>Read parallelism:</strong> You typically want to read from all <em>N</em> partitions of a Kafka topic in parallel by consuming
with <em>N</em> threads.  And depending on the data volume you want to spread those threads across different NICs, which
typically means across different machines.  In Storm, this is achieved by setting the parallelism of the
<a href="https://github.com/apache/storm/tree/master/external/storm-kafka">Kafka spout</a> to <em>N</em> via
<code>TopologyBuilder#setSpout()</code>.  The Spark equivalent is a bit trickier, and I will describe how to do this in further
detail below.</li>
  <li><strong>Downstream processing parallelism:</strong>  Once retrieved from Kafka you want to process the data in parallel.
Depending on your use case this level of parallelism may need to differ from the read parallelism.  If your use case
is CPU-bound, for instance, you want to have many more processing threads than read threads;  this is achieved by
shuffling or “fanning out” the data via the network from the few read threads to the many processing threads.  Hence
you pay for the access to more cores with increased network communication, serialization overhead, etc.  In Storm,
you perform such a shuffling via a
<a href="https://storm.apache.org/documentation/Concepts.html">shuffle grouping</a> from the Kafka spout to the next downstream
bolt.  The Spark equivalent is the
<a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#transformations-on-dstreams">repartition</a>
transformation on DStreams.</li>
</ol>

<p>The important takeaway is that it is possible – and often desired – to decouple the level of parallelisms for
<em>reading from Kafka</em> and for <em>processing the data once read</em>.  In the next sections I will describe the various options
you have at your disposal to configure read parallelism and downstream processing parallelism in Spark Streaming.</p>
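<p>The “fanning out” described in point 2 above can be modeled in plain Scala as a toy hash-based shuffle – this is not Spark code, just an illustration of how records read by a few threads are spread across many processing buckets:</p>

```scala
// Toy model of "fanning out" via a shuffle: records that arrived through a
// few read threads are redistributed, by hash, across many processing buckets.
val numProcessingTasks = 20
val records = (1 to 100).map(i => s"record-$i")

// hash-partition the records, as a network shuffle/repartition would
val buckets: Map[Int, Seq[String]] =
  records.groupBy(r => math.abs(r.hashCode) % numProcessingTasks)

// every record lands in exactly one of the processing buckets
val total = buckets.values.map(_.size).sum
```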

<h2 id="reading-from-kafka">Reading from Kafka</h2>

<h3 id="read-parallelism-in-spark-streaming">Read parallelism in Spark Streaming</h3>

<p>Like Kafka, Spark Streaming has the concept of <em>partitions</em>.  It is important to understand that Kafka’s per-topic
partitions are not correlated to the partitions of
<a href="http://spark.apache.org/docs/1.1.0/programming-guide.html">RDDs in Spark</a>.</p>

<p>The <a href="https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala">KafkaInputDStream</a>
of Spark Streaming – aka its Kafka “connector” – uses Kafka’s
<a href="http://kafka.apache.org/documentation.html#highlevelconsumerapi">high-level consumer API</a>, which means you have two
control knobs in Spark that determine read parallelism for Kafka:</p>

<ol>
  <li><strong>The number of input DStreams.</strong>  Because Spark will run one receiver (= task) per input DStream, this means using
multiple input DStreams will parallelize the read operations across multiple cores and thus, hopefully, across
multiple machines and thereby NICs.</li>
  <li><strong>The number of consumer threads per input DStream.</strong>  Here, the same receiver (= task) will run multiple threads.
That is, read operations will happen in parallel but on the same core/machine/NIC.</li>
</ol>

<p>For practical purposes, option 1 is the preferred choice.</p>

<p>Why is that?  First and foremost because reading from Kafka is
normally network/NIC limited, i.e.  you typically do not increase read-throughput by running more threads <em>on the same</em>
<em>machine</em>.  In other words, it is rare though possible that reading from Kafka runs into CPU bottlenecks.  Second, if
you go with option 2 then multiple threads will be competing for the lock to push data into so-called <em>blocks</em> (the <code>+=</code>
method of <code>BlockGenerator</code> that is used behind the scenes is <code>synchronized</code> on the block generator instance).</p>

<div class="note">
<strong>Number of partitions of the RDDs created by the input DStreams:</strong>  The <tt>KafkaInputDStream</tt> will store individual messages received from Kafka into so-called <em>blocks</em>.  From what I understand, a new block is generated every <a href="http://spark.apache.org/docs/latest/configuration.html#spark-streaming">spark.streaming.blockInterval</a> milliseconds, and each block is turned into a partition of the RDD that will eventually be created by the DStream.  If this assumption of mine is true, then the number of partitions in the RDDs created by <tt>KafkaInputDStream</tt> is determined by <tt>batchInterval / spark.streaming.blockInterval</tt>, where <tt>batchInterval</tt> is the time interval at which streaming data will be divided into batches (set via a constructor parameter of <tt>StreamingContext</tt>).  For example, if the batch interval is 2 seconds (default) and the block interval is 200ms (default), your RDD will contain 10 partitions.  Please correct me if I&#8217;m mistaken.
</div>
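<p>Here is the arithmetic from the note above, under the same (unverified) assumption that blocks map 1:1 to RDD partitions:</p>

```scala
// Under the assumption stated above: one block is generated per
// spark.streaming.blockInterval, and each block becomes one RDD partition.
val batchIntervalMs = 2000L  // StreamingContext batch interval (default: 2 s)
val blockIntervalMs = 200L   // spark.streaming.blockInterval (default: 200 ms)

val partitionsPerRdd = batchIntervalMs / blockIntervalMs  // 10
```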

<h4 id="option-1-controlling-the-number-of-input-dstreams">Option 1: Controlling the number of input DStreams</h4>

<p>The example below is taken from the
<a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#reducing-the-processing-time-of-each-batch">Spark Streaming Programming Guide</a>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">ssc</span><span class="k">:</span> <span class="kt">StreamingContext</span> <span class="o">=</span> <span class="o">???</span> <span class="c1">// ignore for now</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaParams</span><span class="k">:</span> <span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;terran&quot;</span><span class="o">,</span> <span class="cm">/* ignore rest */</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">numInputDStreams</span> <span class="k">=</span> <span class="mi">5</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaDStreams</span> <span class="k">=</span> <span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="n">numInputDStreams</span><span class="o">).</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=&gt;</span> <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(...)</span> <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In this example we create five input DStreams, thus spreading the burden of reading from Kafka across five cores and,
hopefully, five machines/NICs.  (I say “hopefully” because I am not certain whether Spark Streaming task placement
policy will try to place receivers on different machines.)  All input DStreams are part of the “terran” consumer group,
and the Kafka API will ensure that these five input DStreams a) will see all available data for the topic because it
assigns each partition of the topic to an input DStream and b) will not see overlapping data because each partition is
assigned to only one input DStream at a time.  In other words, this setup of “collaborating” input DStreams works
because of the consumer group behavior provided by the Kafka API, which is used behind the scenes by
<code>KafkaInputDStream</code>.</p>

<p>What I have not shown in the example is how many threads are created <em>per input DStream</em>, which is done via parameters
to the <code>KafkaUtils.createStream</code> method (the actual input topic(s) are also specified as parameters of this method).
We will do this in the next section.</p>

<p>But before we continue let me highlight several known issues with this setup and with Spark Streaming in particular,
which are caused on the one hand by current limitations of Spark in general and on the other hand by the current
implementation of the Kafka input DStream in particular:</p>

<blockquote><p>[When you use the multi-input-stream approach I described above, then] those consumers operate in one [Kafka] consumer group, and they try to decide which consumer consumes which partitions.  And it may just fail to do syncpartitionrebalance, and then you have only a few consumers really consuming.  To mitigate this problem, you can set rebalance retries very high, and pray it helps.</p><p>Then arises yet another &#8220;feature&#8221; — if your receiver dies (OOM, hardware failure), you just stop receiving from Kafka!</p><footer><strong>spark-user discussion</strong> <cite><a href="http://markmail.org/message/257a5l3oqyftsjxj">markmail.org/message/&hellip;</a></cite></footer></blockquote>

<p>The “stop receiving from Kafka” issue requires
<a href="http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-and-the-spark-shell-tp3347.html">some explanation</a>.
Currently, when you start your streaming application
via <code>ssc.start()</code> the processing starts and continues indefinitely – even if the input data source (e.g. Kafka) becomes
unavailable.  That is, streams are not able to detect if they have lost connection to the upstream data source and
thus cannot react to this event, e.g. by reconnecting or by stopping the execution.  Similarly, if you lose a receiver
that reads from the data source, then
<a href="http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-and-the-spark-shell-tp3347p3375.html">your streaming application will generate empty RDDs</a>.</p>

<p>This is a pretty unfortunate situation.  One crude workaround is to restart your streaming application whenever it runs
into an upstream data source failure or a receiver failure.  This workaround may not help you though if your use case
requires you to set the Kafka configuration option <code>auto.offset.reset</code> to “smallest” – because of a known bug in
Spark Streaming the resulting behavior of your streaming application may not be what you want.  See the section on
<em>Known issues in Spark Streaming</em> below for further details.</p>

<h4 id="option-2-controlling-the-number-of-consumer-threads-per-input-dstream">Option 2: Controlling the number of consumer threads per input DStream</h4>

<p>In this example we create a <em>single</em> input DStream that is configured to run three consumer threads – in the same
receiver/task and thus on the same core/machine/NIC – to read from the Kafka topic “zerg.hydra”.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">ssc</span><span class="k">:</span> <span class="kt">StreamingContext</span> <span class="o">=</span> <span class="o">???</span> <span class="c1">// ignore for now</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaParams</span><span class="k">:</span> <span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;terran&quot;</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">consumerThreadsPerInputDstream</span> <span class="k">=</span> <span class="mi">3</span>
</span><span class="line"><span class="k">val</span> <span class="n">topics</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;zerg.hydra&quot;</span> <span class="o">-&gt;</span> <span class="n">consumerThreadsPerInputDstream</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">stream</span> <span class="k">=</span> <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(</span><span class="n">ssc</span><span class="o">,</span> <span class="n">kafkaParams</span><span class="o">,</span> <span class="n">topics</span><span class="o">,</span> <span class="o">...)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The <code>KafkaUtils.createStream</code> method is overloaded, so there are a few different method signatures.  In this example
we pick the Scala variant that gives us the most control.</p>

<h4 id="combining-options-1-and-2">Combining options 1 and 2</h4>

<p>Here is a more complete example that combines the previous two techniques:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">ssc</span><span class="k">:</span> <span class="kt">StreamingContext</span> <span class="o">=</span> <span class="o">???</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaParams</span><span class="k">:</span> <span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;terran&quot;</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">numDStreams</span> <span class="k">=</span> <span class="mi">5</span>
</span><span class="line"><span class="k">val</span> <span class="n">topics</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;zerg.hydra&quot;</span> <span class="o">-&gt;</span> <span class="mi">1</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaDStreams</span> <span class="k">=</span> <span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="n">numDStreams</span><span class="o">).</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=&gt;</span>
</span><span class="line">    <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(</span><span class="n">ssc</span><span class="o">,</span> <span class="n">kafkaParams</span><span class="o">,</span> <span class="n">topics</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line">  <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>We are creating five input DStreams, each of which will run a single consumer thread.  If the input topic “zerg.hydra”
has five partitions (or less), then this is normally the best way to parallelize read operations if you care primarily
about maximizing throughput.</p>

<h3 id="downstream-processing-parallelism-in-spark-streaming">Downstream processing parallelism in Spark Streaming</h3>

<p>In the previous sections we covered parallelizing reads from Kafka.  Now we can tackle parallelizing the downstream
data processing in Spark.  Here, you must keep in mind how Spark itself parallelizes its processing.  Like Kafka,
Spark ties the parallelism to the number of (RDD) partitions by running
<a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#resilient-distributed-datasets-rdds"><em>one task per RDD partition</em></a>
(sometimes partitions are still called “slices” in the docs).</p>

<div class="note">
<strong>Just like any Spark application:</strong> Once a Spark Streaming application has received its input data, any
further processing is identical to non-streaming Spark applications.  That is, you use exactly the same tools and
patterns to scale your application as you would for &#8220;normal&#8221; Spark data flows.  See <a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#level-of-parallelism-in-data-processing">Level of Parallelism in Data Processing</a>.
</div>

<p>This gives us two control knobs:</p>

<ol>
  <li><strong>The number of input DStreams</strong>, i.e. what we receive as a result of the previous sections on read parallelism.
This is our starting point, which we can either take as-is or modify with the next option.</li>
  <li><strong>The</strong>
<strong><a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#transformations-on-dstreams">repartition</a></strong>
<strong>DStream transformation.</strong>  It returns a new DStream with an increased or decreased level <em>N</em> of parallelism.  Each
RDD in the returned DStream has exactly <em>N</em> partitions.  DStreams are a continuous series of RDDs, and behind the
scenes <code>DStream.repartition</code> calls <code>RDD.repartition</code>.  The latter “reshuffles the data in the RDD randomly to create
either more or fewer partitions and balance it across them. This always shuffles all data over the network.”  In
other words, <code>DStream.repartition</code> is very similar to Storm’s
<a href="https://storm.apache.org/documentation/Concepts.html">shuffle grouping</a>.</li>
</ol>

<p>Hence <code>repartition</code> is our primary means to decouple read parallelism from processing parallelism.  It allows us to
set the number of processing tasks and thus the number of cores that will be used for the processing.  Indirectly, we
also influence the number of machines/NICs that will be involved.</p>
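<p>As a back-of-the-envelope illustration, here is a tiny model in plain Scala of what <code>repartition</code> does to the data layout.  This is <em>not</em> Spark code – the toy <code>repartition</code> function below only mimics the partition bookkeeping (pool all elements, deal them into exactly <em>N</em> new partitions) so you can see why <em>N</em> downstream tasks become possible:</p>

```scala
// A toy model of RDD.repartition, NOT Spark code: "partitions" are just
// nested Vectors.  All elements are pooled (the shuffle) and then dealt
// round-robin into exactly n new partitions.
def repartition[A](partitions: Vector[Vector[A]], n: Int): Vector[Vector[A]] = {
  val all = partitions.flatten  // the "shuffles all data over the network" step
  Vector.tabulate(n)(i => all.zipWithIndex.collect { case (a, j) if j % n == i => a })
}

// 5 Kafka-sized input partitions, bumped up to 20 processing partitions
val fromKafka = Vector.fill(5)(Vector("msg-a", "msg-b"))
val widened   = repartition(fromKafka, 20)

println(widened.size)          // 20 partitions => up to 20 parallel tasks
println(widened.flatten.size)  // still 10 messages, none lost or duplicated
```

The point of the sketch: the element count is unchanged, only the number of partitions – and hence the number of tasks and cores Spark will use – changes.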

<p>A related DStream transformation is
<a href="https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#transformations-on-dstreams">union</a>.
(This method also exists for <code>StreamingContext</code>, where it returns the unified DStream from multiple DStreams of the same
type and same slide duration.  Most likely you would use the <code>StreamingContext</code> variant.)  A <code>union</code> will return a
<code>UnionDStream</code> backed by a <code>UnionRDD</code>.  A <code>UnionRDD</code> comprises all the partitions of the RDDs being unified, i.e.
if you unite 3 RDDs with 10 partitions each, then your union RDD instance will contain 30 partitions.  In other words,
<code>union</code> will squash multiple DStreams into a single DStream/RDD, but it will not change the level of parallelism.
Whether you need to use <code>union</code> depends on whether your use case requires information from all Kafka partitions
“in one place”, so the decision is driven primarily by semantic requirements.  One such example is when you need to
perform a (global) count of distinct elements.</p>
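<p>The partition arithmetic of <code>union</code> can be sketched in a few lines of plain Scala.  Again this is a toy model, not Spark – <code>FakeRDD</code> is a hypothetical stand-in whose only job is to show that a union keeps every input partition, so parallelism is unchanged:</p>

```scala
// A toy model of UnionRDD, NOT Spark code: a union simply keeps every
// partition of every input RDD, so the level of parallelism does not change.
case class FakeRDD[A](partitions: Vector[Vector[A]])

def union[A](rdds: Seq[FakeRDD[A]]): FakeRDD[A] =
  FakeRDD(rdds.flatMap(_.partitions).toVector)

val three   = Seq.fill(3)(FakeRDD(Vector.fill(10)(Vector(1, 2, 3))))
val unified = union(three)

println(unified.partitions.size)  // 30 = 3 RDDs x 10 partitions each
```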

<div class="note">
Note: <a href="http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-2-RDD-s-only-returns-the-first-one-tp766p5089.html">RDDs are not ordered.</a>  So when you <tt>union</tt> RDDs, the resulting RDD will not have a well-defined ordering either.  If you need a specific ordering, <tt>sort</tt> the RDD explicitly.
</div>

<p>Your use case will determine which knobs and which combination thereof you need to use.  Let’s say your use case is
CPU-bound.  Here, you may want to consume the Kafka topic “zerg.hydra” (which has five Kafka partitions) with a read
parallelism of 5 – i.e. 5 receivers with 1 consumer thread each – but bump up the processing parallelism to 20:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">ssc</span><span class="k">:</span> <span class="kt">StreamingContext</span> <span class="o">=</span> <span class="o">???</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaParams</span><span class="k">:</span> <span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;terran&quot;</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line"><span class="k">val</span> <span class="n">readParallelism</span> <span class="k">=</span> <span class="mi">5</span>
</span><span class="line"><span class="k">val</span> <span class="n">topics</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span><span class="s">&quot;zerg.hydra&quot;</span> <span class="o">-&gt;</span> <span class="mi">1</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">kafkaDStreams</span> <span class="k">=</span> <span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="n">readParallelism</span><span class="o">).</span><span class="n">map</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=&gt;</span>
</span><span class="line">    <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(</span><span class="n">ssc</span><span class="o">,</span> <span class="n">kafkaParams</span><span class="o">,</span> <span class="n">topics</span><span class="o">,</span> <span class="o">...)</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line"><span class="c1">//&gt; collection of five *input* DStreams = handled by five receivers/tasks</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">unionDStream</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">union</span><span class="o">(</span><span class="n">kafkaDStreams</span><span class="o">)</span> <span class="c1">// often unnecessary, just showcasing how to do it</span>
</span><span class="line"><span class="c1">//&gt; single DStream</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">processingParallelism</span> <span class="k">=</span> <span class="mi">20</span>
</span><span class="line"><span class="k">val</span> <span class="n">processingDStream</span> <span class="k">=</span> <span class="n">unionDStream</span><span class="o">.</span><span class="n">repartition</span><span class="o">(</span><span class="n">processingParallelism</span><span class="o">)</span>
</span><span class="line"><span class="c1">//&gt; single DStream but now with 20 partitions</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In the next section we tie all the pieces together and also cover the actual data processing.</p>

<h2 id="writing-to-kafka">Writing to Kafka</h2>

<p>Writing to Kafka should be done from the <code>foreachRDD</code> output operation:</p>

<blockquote><p>The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to a external system, like saving the RDD to files, or writing it over the network to a database. Note that the function func is executed at the driver, and will usually have RDD actions in it that will force the computation of the streaming RDDs.</p></blockquote>

<div class="note">
Note: The remark &#8220;the function <tt>func</tt> is executed at the driver&#8221; does not mean that, say, a Kafka producer itself
would be run from the driver.  Rather, read this remark more as &#8220;the function <tt>func</tt> is <em>evaluated</em> at the
driver&#8221;.  The actual behavior will become more clear once you read <em>Design Patterns for using foreachRDD</em>.
</div>

<p>You should read the section
<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams">Design Patterns for using foreachRDD</a>
in the Spark docs, which explains the recommended patterns as well as common pitfalls when using <code>foreachRDD</code> to talk to
external systems.</p>

<p>In my case, I decided to follow the recommendation to re-use Kafka producer instances across multiple RDDs/batches via
a pool of producers.  I implemented such a pool with <a href="http://commons.apache.org/proper/commons-pool/">Apache Commons Pool</a>,
see <a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/PooledKafkaProducerAppFactory.scala">PooledKafkaProducerAppFactory</a>.
Factories are helpful in this context because of Spark’s execution and serialization model.  The pool itself is provided
to the tasks via a <a href="http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables">broadcast variable</a>.</p>
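<p>To show the idea behind the pool without dragging in the full Commons Pool machinery, here is a deliberately simplified stand-in (the real implementation lives in <code>PooledKafkaProducerAppFactory</code>; <code>SimplePool</code> and its string “producers” are hypothetical, though the method names mirror Commons Pool’s <code>borrowObject</code>/<code>returnObject</code>):</p>

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// A simplified sketch of a producer pool: borrowObject() re-uses an idle
// instance when one is available and only creates a new one lazily;
// returnObject() parks the instance for the next batch instead of closing it.
class SimplePool[A](create: () => A) {
  private val idle = new ConcurrentLinkedQueue[A]()
  def borrowObject(): A = Option(idle.poll()).getOrElse(create())
  def returnObject(a: A): Unit = idle.add(a)
}

// "Producers" are plain strings here; we count how many were actually created.
var created = 0
val pool = new SimplePool(() => { created += 1; s"producer-$created" })

val p1 = pool.borrowObject()  // first borrow creates producer-1
pool.returnObject(p1)         // returned to the pool, not closed
val p2 = pool.borrowObject()  // second borrow re-uses producer-1

println(created)              // 1 -- one producer served both "batches"
```

The same pattern is what keeps the number of Kafka producer instances (and TCP connections) low across the many RDDs a streaming application churns through.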

<p>The end result looks as follows:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">producerPool</span> <span class="k">=</span> <span class="o">{</span>
</span><span class="line">  <span class="c1">// See the full code on GitHub for details on how the pool is created</span>
</span><span class="line">  <span class="k">val</span> <span class="n">pool</span> <span class="k">=</span> <span class="n">createKafkaProducerPool</span><span class="o">(</span><span class="n">kafkaZkCluster</span><span class="o">.</span><span class="n">kafka</span><span class="o">.</span><span class="n">brokerList</span><span class="o">,</span> <span class="n">outputTopic</span><span class="o">.</span><span class="n">name</span><span class="o">)</span>
</span><span class="line">  <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">broadcast</span><span class="o">(</span><span class="n">pool</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="n">stream</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}.</span><span class="n">foreachRDD</span><span class="o">(</span><span class="n">rdd</span> <span class="k">=&gt;</span> <span class="o">{</span>
</span><span class="line">  <span class="n">rdd</span><span class="o">.</span><span class="n">foreachPartition</span><span class="o">(</span><span class="n">partitionOfRecords</span> <span class="k">=&gt;</span> <span class="o">{</span>
</span><span class="line">    <span class="c1">// Get a producer from the shared pool</span>
</span><span class="line">    <span class="k">val</span> <span class="n">p</span> <span class="k">=</span> <span class="n">producerPool</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">borrowObject</span><span class="o">()</span>
</span><span class="line">    <span class="n">partitionOfRecords</span><span class="o">.</span><span class="n">foreach</span> <span class="o">{</span> <span class="k">case</span> <span class="n">tweet</span><span class="k">:</span> <span class="kt">Tweet</span> <span class="o">=&gt;</span>
</span><span class="line">      <span class="c1">// Convert pojo back into Avro binary format</span>
</span><span class="line">      <span class="k">val</span> <span class="n">bytes</span> <span class="k">=</span> <span class="n">converter</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">apply</span><span class="o">(</span><span class="n">tweet</span><span class="o">)</span>
</span><span class="line">      <span class="c1">// Send the bytes to Kafka</span>
</span><span class="line">      <span class="n">p</span><span class="o">.</span><span class="n">send</span><span class="o">(</span><span class="n">bytes</span><span class="o">)</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="c1">// Returning the producer to the pool also shuts it down</span>
</span><span class="line">    <span class="n">producerPool</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">returnObject</span><span class="o">(</span><span class="n">p</span><span class="o">)</span>
</span><span class="line">  <span class="o">})</span>
</span><span class="line"><span class="o">})</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Keep in mind that Spark Streaming creates many RDDs per minute (one per DStream per batch interval), each of which
contains multiple partitions, so you should not create new Kafka producers for each partition, let alone for each
Kafka message.  The setup
above minimizes the creation of Kafka producer instances, and also minimizes the number of TCP connections that are
being established with the Kafka cluster.  You can use this pool setup to precisely control the number of Kafka producer
instances that are being made available to your streaming application (if in doubt, use fewer).</p>

<h2 id="complete-example">Complete example</h2>

<p>The code example below is the gist of my example Spark Streaming application
(<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/spark/KafkaSparkStreamingSpec.scala">see the full code for details and explanations</a>).
Here, I demonstrate how to:</p>

<ul>
  <li>Read Avro-encoded data (the <code>Tweet</code> class) from a Kafka topic in parallel.  We use the optimal read parallelism of
one single-threaded input DStream per Kafka partition.</li>
  <li>Deserialize the Avro-encoded data back into pojos, then serialize them back into binary.  The serialization is
performed via <a href="https://github.com/twitter/bijection">Twitter Bijection</a>.</li>
  <li>Write the results back into a different Kafka topic via a Kafka producer pool.</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
<span class="line-number">48</span>
<span class="line-number">49</span>
<span class="line-number">50</span>
<span class="line-number">51</span>
<span class="line-number">52</span>
<span class="line-number">53</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Set up the input DStream to read from Kafka (in parallel)</span>
</span><span class="line"><span class="k">val</span> <span class="n">kafkaStream</span> <span class="k">=</span> <span class="o">{</span>
</span><span class="line">  <span class="k">val</span> <span class="n">sparkStreamingConsumerGroup</span> <span class="k">=</span> <span class="s">&quot;spark-streaming-consumer-group&quot;</span>
</span><span class="line">  <span class="k">val</span> <span class="n">kafkaParams</span> <span class="k">=</span> <span class="nc">Map</span><span class="o">(</span>
</span><span class="line">    <span class="s">&quot;zookeeper.connect&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;zookeeper1:2181&quot;</span><span class="o">,</span>
</span><span class="line">    <span class="s">&quot;group.id&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;spark-streaming-test&quot;</span><span class="o">,</span>
</span><span class="line">    <span class="s">&quot;zookeeper.connection.timeout.ms&quot;</span> <span class="o">-&gt;</span> <span class="s">&quot;1000&quot;</span><span class="o">)</span>
</span><span class="line">  <span class="k">val</span> <span class="n">inputTopic</span> <span class="k">=</span> <span class="s">&quot;input-topic&quot;</span>
</span><span class="line">  <span class="k">val</span> <span class="n">numPartitionsOfInputTopic</span> <span class="k">=</span> <span class="mi">5</span>
</span><span class="line">  <span class="k">val</span> <span class="n">streams</span> <span class="k">=</span> <span class="o">(</span><span class="mi">1</span> <span class="n">to</span> <span class="n">numPartitionsOfInputTopic</span><span class="o">)</span> <span class="n">map</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=&gt;</span>
</span><span class="line">    <span class="nc">KafkaUtils</span><span class="o">.</span><span class="n">createStream</span><span class="o">(</span><span class="n">ssc</span><span class="o">,</span> <span class="n">kafkaParams</span><span class="o">,</span> <span class="nc">Map</span><span class="o">(</span><span class="n">inputTopic</span> <span class="o">-&gt;</span> <span class="mi">1</span><span class="o">),</span> <span class="nc">StorageLevel</span><span class="o">.</span><span class="nc">MEMORY_ONLY_SER</span><span class="o">).</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">_2</span><span class="o">)</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">  <span class="k">val</span> <span class="n">unifiedStream</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">union</span><span class="o">(</span><span class="n">streams</span><span class="o">)</span>
</span><span class="line">  <span class="k">val</span> <span class="n">sparkProcessingParallelism</span> <span class="k">=</span> <span class="mi">1</span> <span class="c1">// You&#39;d probably pick a higher value than 1 in production.</span>
</span><span class="line">  <span class="n">unifiedStream</span><span class="o">.</span><span class="n">repartition</span><span class="o">(</span><span class="n">sparkProcessingParallelism</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">// We use accumulators to track global &quot;counters&quot; across the tasks of our streaming app</span>
</span><span class="line"><span class="k">val</span> <span class="n">numInputMessages</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">accumulator</span><span class="o">(</span><span class="mi">0L</span><span class="o">,</span> <span class="s">&quot;Kafka messages consumed&quot;</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">numOutputMessages</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">accumulator</span><span class="o">(</span><span class="mi">0L</span><span class="o">,</span> <span class="s">&quot;Kafka messages produced&quot;</span><span class="o">)</span>
</span><span class="line"><span class="c1">// We use a broadcast variable to share a pool of Kafka producers, which we use to write data from Spark to Kafka.</span>
</span><span class="line"><span class="k">val</span> <span class="n">producerPool</span> <span class="k">=</span> <span class="o">{</span>
</span><span class="line">  <span class="k">val</span> <span class="n">pool</span> <span class="k">=</span> <span class="n">createKafkaProducerPool</span><span class="o">(</span><span class="n">kafkaZkCluster</span><span class="o">.</span><span class="n">kafka</span><span class="o">.</span><span class="n">brokerList</span><span class="o">,</span> <span class="n">outputTopic</span><span class="o">.</span><span class="n">name</span><span class="o">)</span>
</span><span class="line">  <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">broadcast</span><span class="o">(</span><span class="n">pool</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line"><span class="c1">// We also use a broadcast variable for our Avro Injection (Twitter Bijection)</span>
</span><span class="line"><span class="k">val</span> <span class="n">converter</span> <span class="k">=</span> <span class="n">ssc</span><span class="o">.</span><span class="n">sparkContext</span><span class="o">.</span><span class="n">broadcast</span><span class="o">(</span><span class="nc">SpecificAvroCodecs</span><span class="o">.</span><span class="n">toBinary</span><span class="o">[</span><span class="kt">Tweet</span><span class="o">])</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Define the actual data flow of the streaming job</span>
</span><span class="line"><span class="n">kafkaStream</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="k">case</span> <span class="n">bytes</span> <span class="k">=&gt;</span>
</span><span class="line">  <span class="n">numInputMessages</span> <span class="o">+=</span> <span class="mi">1</span>
</span><span class="line">  <span class="c1">// Convert Avro binary data to pojo</span>
</span><span class="line">  <span class="n">converter</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">invert</span><span class="o">(</span><span class="n">bytes</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
</span><span class="line">    <span class="k">case</span> <span class="nc">Success</span><span class="o">(</span><span class="n">tweet</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">tweet</span>
</span><span class="line">    <span class="k">case</span> <span class="nc">Failure</span><span class="o">(</span><span class="n">e</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="c1">// ignore if the conversion failed</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line"><span class="o">}.</span><span class="n">foreachRDD</span><span class="o">(</span><span class="n">rdd</span> <span class="k">=&gt;</span> <span class="o">{</span>
</span><span class="line">  <span class="n">rdd</span><span class="o">.</span><span class="n">foreachPartition</span><span class="o">(</span><span class="n">partitionOfRecords</span> <span class="k">=&gt;</span> <span class="o">{</span>
</span><span class="line">    <span class="k">val</span> <span class="n">p</span> <span class="k">=</span> <span class="n">producerPool</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">borrowObject</span><span class="o">()</span>
</span><span class="line">    <span class="n">partitionOfRecords</span><span class="o">.</span><span class="n">foreach</span> <span class="o">{</span> <span class="k">case</span> <span class="n">tweet</span><span class="k">:</span> <span class="kt">Tweet</span> <span class="o">=&gt;</span>
</span><span class="line">      <span class="c1">// Convert pojo back into Avro binary format</span>
</span><span class="line">      <span class="k">val</span> <span class="n">bytes</span> <span class="k">=</span> <span class="n">converter</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">apply</span><span class="o">(</span><span class="n">tweet</span><span class="o">)</span>
</span><span class="line">      <span class="c1">// Send the bytes to Kafka</span>
</span><span class="line">      <span class="n">p</span><span class="o">.</span><span class="n">send</span><span class="o">(</span><span class="n">bytes</span><span class="o">)</span>
</span><span class="line">      <span class="n">numOutputMessages</span> <span class="o">+=</span> <span class="mi">1</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="n">producerPool</span><span class="o">.</span><span class="n">value</span><span class="o">.</span><span class="n">returnObject</span><span class="o">(</span><span class="n">p</span><span class="o">)</span>
</span><span class="line">  <span class="o">})</span>
</span><span class="line"><span class="o">})</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Run the streaming job</span>
</span><span class="line"><span class="n">ssc</span><span class="o">.</span><span class="n">start</span><span class="o">()</span>
</span><span class="line"><span class="n">ssc</span><span class="o">.</span><span class="n">awaitTermination</span><span class="o">()</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><em><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/spark/KafkaSparkStreamingSpec.scala">See the full source code for further details and explanations.</a></em></p>

<p>Personally, I really like the conciseness and expressiveness of the Spark Streaming code.  As Bobby Evans and Tom Graves
allude to in their talk, the Storm equivalent of this code is more verbose and comparatively lower level:
The <a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaStormSpec.scala">KafkaStormSpec</a>
in <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> wires and runs a Storm topology that performs
the same computations.  Well, the spec file itself is only a few lines of code once you exclude the code comments,
which I only keep for didactic reasons;  however, keep in mind that in Storm’s Java API you cannot use Scala-like
anonymous functions as I show in the Spark Streaming example above (e.g. the <code>map</code> and <code>foreach</code> steps).  Instead you
must write “full” classes – bolts in plain Storm, functions/filters in Storm Trident – to achieve the
same functionality, see e.g.
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/bolts/AvroDecoderBolt.scala">AvroDecoderBolt</a>.
This feels a bit similar to, say, having to code against Spark’s own API using Java, where juggling with anonymous
functions is IMHO just as painful.</p>

<p>Lastly, I also liked the <a href="http://spark.apache.org/documentation.html">Spark documentation</a>.  It was very easy to get
started, and even some more advanced use is covered (e.g.
<a href="http://spark.apache.org/docs/1.1.0/tuning.html">Tuning Spark</a>).  I still had to browse the mailing list and also dive
into the source code, but the general starting experience was ok – only the Kafka integration part was lacking (hence
this blog post).  Good job to everyone involved maintaining the docs!</p>

<h1 id="known-issues-in-spark-streaming">Known issues in Spark Streaming</h1>

<div class="note">
Update Jan 20, 2015:  Spark 1.2+ includes features such as write ahead logs (WAL) that help to minimize some of the
data loss scenarios for Spark Streaming that are described below.  See
<a href="http://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html">Improved Fault-tolerance and Zero Data Loss in Spark Streaming</a>.
</div>

<p>You might have guessed by now that there are indeed a number of unresolved issues in Spark Streaming.  I try to
summarize my findings below.</p>

<p>On the one hand there are issues due to some confusion about how to correctly read from and write to Kafka, which you
can follow in mailing list discussions such as
<a href="http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Kafka-Receivers-and-Union-td14901.html">Multiple Kafka Receivers and Union</a>
and <a href="http://apache-spark-user-list.1001560.n3.nabble.com/How-to-scale-more-consumer-to-Kafka-stream-td13883.html">How to scale more consumer to Kafka stream </a>.</p>

<p>On the other hand there are apparently still some inherent issues in Spark Streaming as well as Spark itself,
notably with regard to data loss in failure scenarios.  In other words, issues that you do not want to run into in
production!</p>

<ul>
  <li>The current (v1.1) driver in Spark does not recover such raw data that has been received but not processed
(<a href="https://www.mail-archive.com/user@spark.apache.org/msg10572.html">source</a>).  Here, your Spark application
may lose data under certain conditions.  Tathagata Das points out that driver recovery should be fixed in
Spark v1.2, which will be released around the end of 2014.</li>
  <li>The current Kafka “connector” of Spark is based on Kafka’s high-level consumer API.  One effect of this is that Spark
Streaming cannot rely on its <code>KafkaInputDStream</code> to properly replay data from Kafka in case of a downstream data loss
(e.g. Spark machines died).
    <ul>
      <li>Some people even advocate that the current
<a href="http://markmail.org/message/2lb776ta5sq6lgtw">Kafka connector of Spark should not be used in production</a>
because it is based on the high-level consumer API of Kafka. Instead Spark should use the simple consumer API
(like Storm’s Kafka spout does), which allows you to control offsets and partition assignment deterministically.</li>
    </ul>
  </li>
  <li>The Spark community has been working on filling the previously mentioned gap with e.g. Dibyendu
Bhattacharya’s <a href="https://github.com/dibbhatt/kafka-spark-consumer">kafka-spark-consumer</a>.  The latter is a port of
Apache Storm’s <a href="https://github.com/apache/storm/tree/master/external/storm-kafka">Kafka spout</a>, which is based on
Kafka’s so-called simple consumer API, which provides better replaying control in case of downstream failures.</li>
  <li>Even given those volunteer efforts, the Spark team would prefer to not special-case data recovery for Kafka, as their
goal is “to provide strong guarantee, exactly-once semantics in all transformations”
(<a href="https://www.mail-archive.com/user@spark.apache.org/msg10572.html">source</a>), which is understandable.
On the flip side it still feels a bit like a wasted opportunity to not leverage Kafka’s built-in replaying
capabilities.  Tough call!</li>
  <li><a href="https://spark-project.atlassian.net/browse/SPARK-1340">SPARK-1340</a>: In the case of Kafka input DStreams, receivers
are not restarted if the worker running the receiver fails.  So if a worker dies in production, you will
simply miss the data that the receiver(s) were responsible for retrieving from Kafka.</li>
  <li>See also
<a href="http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node">Failure of a Worker Node</a>
for further discussions on data loss scenarios (“lost input data!”) as well as data duplication scenarios (“wrote
output data twice!”).  Applies to Kafka, too.</li>
  <li>Spark’s usage of the Kafka consumer parameter
<a href="http://kafka.apache.org/documentation.html#consumerconfigs">auto.offset.reset</a> is different from Kafka’s semantics.
In Kafka, the behavior of setting <code>auto.offset.reset</code> to “smallest” is that the consumer will automatically reset the
offset to the smallest offset when a) there is no existing offset stored in ZooKeeper or b) there is an existing
offset but it is out of range.  Spark however will <em>always</em> remove existing offsets and then start all the way from
zero again.  This means whenever you restart your application with <code>auto.offset.reset = "smallest"</code>, your application
will completely re-process all available Kafka data.  Doh!
See <a href="http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-and-the-spark-shell-tp3347p3387.html">this discussion</a> and
<a href="http://markmail.org/message/257a5l3oqyftsjxj">that discussion</a>.</li>
  <li><a href="https://spark-project.atlassian.net/browse/SPARK-1341">SPARK-1341</a>: Ability to control the data rate in Spark
Streaming.  This is relevant insofar as, if you are already in trouble because of the other Kafka-related issues
above (e.g. the <code>auto.offset.reset</code> misbehavior), your streaming application must, or thinks it must,
re-process a lot of older data.  But since there is no built-in rate limitation, this may cause your
workers to become overwhelmed and run out of memory.</li>
</ul>
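<p>To illustrate the <code>auto.offset.reset</code> caveat above, here is a minimal sketch (against the Spark 1.1-era, receiver-based API) of how such consumer settings are handed to <code>KafkaUtils.createStream()</code>.  The ZooKeeper host, group id, and topic name are illustrative placeholders:</p>

```scala
// Sketch only: Spark 1.1-era receiver-based Kafka stream with explicit
// consumer settings.  All concrete values below are placeholders.
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def createKafkaStream(ssc: StreamingContext) = {
  val kafkaParams = Map(
    "zookeeper.connect" -> "zookeeper1:2181",
    "group.id" -> "my-consumer-group",
    // Caveat: Spark does not honor Kafka's own semantics for this setting;
    // see the discussion above before relying on it.
    "auto.offset.reset" -> "smallest")
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
}
```

<p>Keep in mind that, as described above, restarting the application with this setting will cause Spark to re-process all available Kafka data.</p>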

<p>Apart from those failure handling and Kafka-focused issues there are also scaling and stability concerns.  Again, please
refer to the
<a href="http://yahoohadoop.tumblr.com/post/98213421641/storm-and-spark-at-yahoo-why-chose-one-over-the-other">Spark and Storm</a>
talk by Bobby and Tom for further details.  Both of them have more experience with Spark than I do.</p>

<p>I also came across <a href="https://www.mail-archive.com/user@spark.apache.org/msg11505.html">one comment</a> that there may be
issues with the (awesome!) G1 garbage collector that is available in Java 1.7.0u4+, but I didn’t run into any such issue
so far.</p>

<h1 id="spark-tips-and-tricks">Spark tips and tricks</h1>

<p>I compiled a list of notes while I was implementing the example code.  This list is by no means a comprehensive
guide, but it may serve you as a starting point when implementing your own Spark Streaming jobs.  It contains
references to the
<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html">Spark Streaming programming guide</a> as well as
information compiled from the spark-user mailing list.</p>

<h2 id="general">General</h2>

<ul>
  <li>When creating your Spark context, pay special attention to the configuration that sets the number of cores used by
Spark.  You must configure enough cores to run both all the required <em>receivers</em> (see below) and the
actual data processing part of your job.  In Spark, each receiver is responsible for exactly one input DStream, and each receiver
(and thus each input DStream) occupies one core – the only exception is when reading from a file stream
(<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#input-dstreams">see docs</a>).  So if, for
instance, your job reads from 2 input streams but only has access to 2 cores, then the data will be read but no
processing will happen.
    <ul>
      <li>Note that in a streaming application, you can create multiple input DStreams to receive multiple streams of data
in parallel.  I demonstrate such a setup in the example job where we parallelize reading from Kafka.</li>
    </ul>
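<p>A minimal sketch of such a setup, assuming the Spark 1.1-era API (the master URL, ZooKeeper host, group id, topic, and receiver count are illustrative): two receivers occupy two cores, leaving the remaining cores for the actual processing.</p>

```scala
// Sketch: each receiver occupies one core, so reserve extra cores
// for the data processing itself.  All concrete values are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaReader")
val ssc = new StreamingContext(conf, Seconds(1))

// Two parallel input DStreams = two receivers = two cores, which leaves
// two of the four cores above for processing.
val kafkaStreams = (1 to 2).map { _ =>
  KafkaUtils.createStream(ssc, "zookeeper1:2181", "my-group", Map("my-topic" -> 1))
}
val unifiedStream = ssc.union(kafkaStreams)
```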
  </li>
  <li>You can use <a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#broadcast-variables">broadcast variables</a> to
share common, read-only variables across machines (see also the relevant section in the
<a href="http://spark.apache.org/docs/1.1.0/tuning.html#broadcasting-large-variables">Tuning Guide</a>).  In the example job I
use broadcast variables to share a) a Kafka producer pool (through which the job writes its output to Kafka) and b)
an injection for encoding/decoding Avro data (from Twitter Bijection).  See also
<a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#passing-functions-to-spark">Passing functions to Spark</a>.</li>
  <li>You can use <a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#accumulators">accumulator</a> variables to track
global “counters” across the tasks of your streaming job (think: Hadoop job counters).  In the example job I use
accumulators to track how many messages in total the job has consumed from and produced to Kafka, respectively.
If you give your accumulators a name (see link), they will also be displayed in the Spark UI.</li>
  <li>
    <p>Do not forget to import the relevant implicits of Spark in general and Spark Streaming in particular:</p>

    <pre><code>// Required to gain access to RDD transformations via implicits.
import org.apache.spark.SparkContext._

// Required when working on `PairDStreams` to gain access to e.g. `DStream.reduceByKey`
// (versus `DStream.transform(rddBatch =&gt; rddBatch.reduceByKey())`) via implicits.
//
// See also http://spark.apache.org/docs/1.1.0/programming-guide.html#working-with-key-value-pairs
import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
</code></pre>
  </li>
  <li>If you’re a fan of <a href="https://github.com/twitter/algebird">Twitter Algebird</a>, then you will like how you can leverage
Count-Min Sketch and friends in Spark.  Typically you’ll use operations such as <code>reduce</code> or <code>reduceByWindow</code> (cf.
<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#transformations-on-dstreams">transformations on DStreams</a>).
The Spark project includes examples for
<a href="https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdCMS.scala">Count-Min Sketch</a>
and
<a href="https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterAlgebirdHLL.scala">HyperLogLog</a>.</li>
  <li>If you need to determine the memory consumption of, say, your fancy Algebird data structure – e.g. Count-Min Sketch,
HyperLogLog, or Bloom Filters – as it is being used in your Spark application, then the <code>SparkContext</code> logs might be
an option for you.  See
<a href="http://spark.apache.org/docs/1.1.0/tuning.html#determining-memory-consumption">Determining Memory Consumption</a>.</li>
</ul>
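<p>The broadcast-variable and accumulator patterns above can be sketched as follows.  Note that <code>ProducerPool</code> and its <code>borrowObject</code>/<code>returnObject</code>/<code>send</code> methods are hypothetical placeholders standing in for whatever shared, read-only resource you need; only <code>broadcast</code>, <code>accumulator</code>, and <code>foreachRDD</code> are actual Spark API:</p>

```scala
// Sketch of the broadcast/accumulator patterns described above.
// `ProducerPool` is a hypothetical placeholder class, not a real API.
val producerPool = ssc.sparkContext.broadcast(new ProducerPool())

// A named accumulator shows up in the Spark UI under its name.
val numInputMessages =
  ssc.sparkContext.accumulator(0L, "Kafka messages consumed")

stream.foreachRDD { rdd =>
  numInputMessages += rdd.count()
  rdd.foreachPartition { partition =>
    val producer = producerPool.value.borrowObject()    // hypothetical method
    partition.foreach(record => producer.send(record))  // hypothetical method
    producerPool.value.returnObject(producer)           // hypothetical method
  }
}
```

<p>Whatever you broadcast must of course be serializable, which is one reason the example job broadcasts a pool factory rather than live producer connections.</p>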

<h2 id="kafka-integration">Kafka integration</h2>

<p>Beyond what I already said in the article above:</p>

<ul>
  <li>You may need to tweak the Kafka consumer configuration of Spark Streaming.  For example, if you need to read
large messages from Kafka you must increase the <code>fetch.message.max.bytes</code> consumer setting.  You can pass such custom
Kafka parameters to Spark Streaming when calling <code>KafkaUtils.createStream(...)</code>.</li>
</ul>
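<p>For example, a sketch of raising <code>fetch.message.max.bytes</code> (the surrounding values are illustrative, and the API shown is the Spark 1.1-era receiver-based one):</p>

```scala
// Sketch: custom Kafka consumer settings passed to Spark Streaming.
// Hostnames, group id, topic, and sizes are placeholders.
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect" -> "zookeeper1:2181",
  "group.id" -> "my-consumer-group",
  // Allow fetching messages of up to ~10 MB (larger than the default).
  "fetch.message.max.bytes" -> "10485760")
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
```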

<h2 id="testing">Testing</h2>

<ul>
  <li>Make sure you stop the <code>StreamingContext</code> and/or <code>SparkContext</code> (via <code>stop()</code>) within a <code>finally</code> block or your test
framework’s <code>tearDown</code> method, as Spark does not support two contexts running concurrently in the same program (or
JVM?).  (<a href="http://spark.apache.org/docs/1.1.0/programming-guide.html#accumulators">source</a>)</li>
  <li>In my experience, when using sbt, you want to configure your build to fork JVMs during testing.  At least in the case
of <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> the tests must run several threads in
parallel, e.g. in-memory instances of ZooKeeper, Kafka, Spark.  See
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/build.sbt">build.sbt</a> for a starting point.</li>
  <li>Also, if you are on Mac OS X, you may want to disable IPv6 in your JVMs to prevent DNS-related timeouts.  This issue
is unrelated to Spark.  See <a href="https://github.com/miguno/kafka-storm-starter/blob/develop/.sbtopts">.sbtopts</a> for how
to disable IPv6.</li>
</ul>
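<p>A minimal <code>build.sbt</code> fragment for the forking setup described above (a sketch only; see the linked build.sbt for the real configuration):</p>

```scala
// build.sbt (sketch, sbt 0.13 syntax): fork a fresh JVM for tests so that
// in-memory instances of ZooKeeper, Kafka, and Spark do not interfere with
// the sbt JVM itself.
fork in Test := true

// Pass JVM options to the forked test JVM; preferring IPv4 avoids the
// Mac OS X DNS-related timeouts mentioned above.
javaOptions in Test += "-Djava.net.preferIPv4Stack=true"
```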

<h2 id="performance-tuning">Performance tuning</h2>

<ul>
  <li>Make sure you understand the runtime implications of your job if it needs to talk to external systems such as Kafka.
You should read the section <em>Design Patterns for using foreachRDD</em> in the
<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#output-operations-on-dstreams">Spark Streaming programming guide</a>.
For instance, my example application uses a pool of Kafka producers to optimize writing from Spark Streaming to Kafka.
Here, “optimizing” means sharing the same (few) producers across tasks, notably to reduce the number of new TCP
connections being established with the Kafka cluster.</li>
  <li>Use Kryo for serialization instead of the (slow) default Java serialization (see
<a href="http://spark.apache.org/docs/1.1.0/tuning.html#serialized-rdd-storage">Tuning Spark</a>).  My example enables Kryo
and registers e.g. the Avro-generated Java classes with Kryo to speed up serialization.  See
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/spark/serialization/KafkaSparkStreamingRegistrator.scala">KafkaSparkStreamingRegistrator</a>.
By the way, the use of Kryo is recommended in Spark for the very same reason it is recommended in Storm.</li>
  <li>Configure Spark Streaming jobs to clear persistent RDDs by setting <code>spark.streaming.unpersist</code> to <code>true</code>.
This is likely to reduce the RDD memory usage of Spark, potentially improving GC behavior as well.
(<a href="http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html#memory-tuning">source</a>)</li>
  <li>Start your performance and scalability (P&amp;S) tests with storage level <code>MEMORY_ONLY_SER</code> (here, RDDs are stored as serialized Java objects, one byte
array per partition).  This is generally more space-efficient than storing deserialized objects, especially when using a fast
serializer like Kryo, but more CPU-intensive to read.  This option is often the best for Spark Streaming jobs.
For local testing you may want to avoid the <code>*_2</code> variants (<code>2</code> = replication factor).</li>
</ul>
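<p>Pulling the tuning knobs above together, here is a configuration sketch.  The registrator class name is the one linked above; everything else is illustrative:</p>

```scala
// Sketch: enable Kryo serialization, register custom classes via the
// registrator linked above, and eagerly unpersist generated RDDs.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
    "com.miguno.kafkastorm.spark.serialization.KafkaSparkStreamingRegistrator")
  .set("spark.streaming.unpersist", "true")

// When persisting DStreams, prefer serialized storage and skip the `_2`
// (replicated) variants for local testing:
// stream.persist(StorageLevel.MEMORY_ONLY_SER)
```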

<h1 id="wrapping-up">Wrapping up</h1>

<p>The full Spark Streaming code is available in <a href="https://github.com/miguno/kafka-storm-starter/">kafka-storm-starter</a>.
I’d recommend beginning with the
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/spark/KafkaSparkStreamingSpec.scala">KafkaSparkStreamingSpec</a>.
This spec launches in-memory instances of Kafka, ZooKeeper, and Spark, and then runs the example streaming application I
covered in this post.</p>

<p>In summary I enjoyed my initial Spark Streaming experiment.  While there are still several problems with Spark/Spark
Streaming that need to be sorted out, I am sure the Spark community will eventually be able to address those.  I have
found the Spark community to be positive and willing to help, and I am looking forward to what will be happening over
the next few months.</p>

<p>Given that Spark Streaming still needs some <a href="http://en.wiktionary.org/wiki/tender_loving_care">TLC</a> to reach Storm’s
capabilities in large-scale production settings, would I use it in 24x7 production?  Most likely not, with the addendum
“not yet”.  So where would I use Spark Streaming in its current state right now?  Here are two ideas, and I am sure
there are even more:</p>

<ol>
  <li>It seems a good fit for rapid prototyping of data flows.  If you run into scalability issues because your data
flows are too large, you can e.g. opt to run Spark Streaming against only a sample or subset of the data.</li>
  <li>What about combining Storm and Spark Streaming?  For example, you could use Storm to crunch the raw, large-scale
input data down to manageable levels, and then perform follow-up analysis with Spark Streaming, benefitting from the
latter’s out-of-the-box support for many interesting algorithms and computations.</li>
</ol>

<p>Thanks to the Spark community for all their great work!</p>

<h1 id="references">References</h1>

<ul>
  <li><a href="https://spark.apache.org/docs/latest/streaming-kafka-integration.html">Spark Streaming + Kafka Integration Guide</a></li>
  <li><a href="http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617">Deep Dive with Spark Streaming</a>, by Tathagata Das, Jun 2013</li>
  <li>Mailing list discussions:
    <ul>
      <li><a href="https://www.mail-archive.com/dev@spark.incubator.apache.org/msg00531.html">Spark Streaming threading model</a>
– also contains some information on how Spark Streaming pushes input data into blocks</li>
      <li><a href="http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-td11258.html">Low Level Kafka Consumer for Spark</a>
– lots of information about the current state of Kafka integration in Spark Streaming, known issues, possible
remedies, etc.</li>
      <li><a href="http://apache-spark-user-list.1001560.n3.nabble.com/How-are-the-executors-used-in-Spark-Streaming-in-terms-of-receiver-and-driver-program-td9336.html">How are the executors used in Spark Streaming in terms of receiver and driver program? </a>
– machines vs. cores vs. executors vs. receivers vs. DStreams in Spark</li>
    </ul>
  </li>
</ul>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Apache Storm 0.9 training deck and tutorial]]></title>
    <link href="http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno"/>
    <updated>2014-09-15T12:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial</id>
    <content type="html"><![CDATA[<p>Today I am happy to share an extensive training deck on <a href="http://storm.incubator.apache.org/">Apache Storm</a> version 0.9,
which covers Storm’s core concepts, operating Storm in production, and developing Storm applications.  I also discuss
data serialization with <a href="http://avro.apache.org/">Apache Avro</a> and
<a href="https://github.com/twitter/bijection">Twitter Bijection</a>.</p>

<!-- more -->

<p><br clear="all" /></p>

<p>The training deck (130 slides) is aimed at developers, operations, and architects.</p>

<p><strong>What the training deck covers</strong></p>

<ol>
  <li>Introducing Storm: history, Storm adoption in the industry, why Storm</li>
  <li>Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism</li>
  <li>Operating Storm: architecture, hardware specs, deploying, monitoring</li>
  <li>Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps (with <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a>), performance and scalability tuning</li>
  <li>Playing with Storm using <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a></li>
</ol>

<p>Many thanks to the <a href="https://engineering.twitter.com/">Twitter Engineering team</a> (the creators of Storm) and the Apache
Storm open source community!</p>

<p>See also:</p>

<ul>
  <li><a href="http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/">Apache Kafka 0.8 training deck and tutorial</a>,
which I published a month ago</li>
  <li><a href="http://www.michael-noll.com/blog/categories/storm/">My other articles on Apache Storm</a></li>
</ul>

<iframe src="//www.slideshare.net/slideshow/embed_code/39087523?rel=0" width="597" height="486" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
<div style="margin-bottom:5px"> <strong> <a href="https://www.slideshare.net/miguno/apache-storm-09-basic-training-verisign" title="Apache Storm 0.9 basic training - Verisign" target="_blank">Apache Storm 0.9 basic training - Verisign</a> </strong> from <strong><a href="http://www.slideshare.net/miguno" target="_blank">Michael Noll</a></strong> </div>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Apache Kafka 0.8 training deck and tutorial]]></title>
    <link href="http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno"/>
    <updated>2014-08-18T12:00:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial</id>
    <content type="html"><![CDATA[<p>Today I am happy to share an extensive training deck on <a href="http://kafka.apache.org/">Apache Kafka</a> version 0.8, which
covers Kafka’s core concepts, operating Kafka in production, and developing Kafka applications.  I also discuss data
serialization with <a href="http://avro.apache.org/">Apache Avro</a> and <a href="https://github.com/twitter/bijection">Twitter Bijection</a>.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="warning">
<strong>Update 2015-08-01:</strong>
Shameless plug!  Since publishing this Kafka training deck I joined <a href="http://confluent.io/">Confluent Inc.</a> as their Developer Evangelist.  Confluent is the US startup founded in 2014 by the creators of Apache Kafka who developed Kafka while at LinkedIn (see this <a href="http://www.forbes.com/sites/alexkonrad/2015/07/08/confluent-raises-24-million-for-data-streams/">Forbes article about Confluent</a>).  Next to building the world&#8217;s best <a href="http://www.confluent.io/product">stream data platform</a> we are also providing <a href="http://www.confluent.io/training">professional Kafka trainings</a>, which go even deeper as well as beyond my extensive training deck below.
<br />
<br />
I can say with confidence that these are the authoritative and most effective Apache Kafka trainings available on the market.  But you don&#8217;t have to take my word for it &#8211; feel free to <a href="http://www.confluent.io/training">take a look yourself</a> and reach out to us if you are interested. <em>&mdash;Michael</em>
</div>

<p>The training deck (120 slides) is aimed at developers, operations, and architects.</p>

<p><strong>What the training deck covers</strong></p>

<ol>
  <li>Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka</li>
  <li>Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers</li>
  <li>Operating Kafka: architecture, hardware specs, deploying, monitoring, performance and scalability tuning</li>
  <li>Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps (with
<a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a>)</li>
  <li>Playing with Kafka using <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a></li>
</ol>

<p>Many thanks to the <a href="https://engineering.linkedin.com/tags/kafka">LinkedIn Engineering team</a> (the creators of Kafka) and
the Apache Kafka open source community!</p>

<p>See also:</p>

<ul>
  <li><a href="http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/">Apache Storm 0.9 training deck and tutorial</a>,
which I published a month after this training on Kafka</li>
  <li><a href="http://www.michael-noll.com/blog/categories/kafka/">My other articles on Apache Kafka</a></li>
</ul>

<iframe src="//www.slideshare.net/slideshow/embed_code/38083024?rel=0" width="597" height="486" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe>
<div style="margin-bottom:5px"> <strong> <a href="https://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign" title="Apache Kafka 0.8 basic training - Verisign" target="_blank">Apache Kafka 0.8 basic training - Verisign</a> </strong> from <strong><a href="http://www.slideshare.net/miguno" target="_blank">Michael Noll</a></strong> </div>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Integrating Kafka and Storm: Code Examples and State of the Game]]></title>
    <link href="http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial/?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno"/>
    <updated>2014-05-27T16:51:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial</id>
    <content type="html"><![CDATA[<p>The only thing that’s even better than <a href="https://kafka.apache.org/">Apache Kafka</a> and
<a href="http://storm.incubator.apache.org/">Apache Storm</a> is to use the two tools in combination.  Unfortunately, their
integration can be, and still is, a pretty challenging task, at least judging by the many discussion threads on the respective
mailing lists.  In this post I am introducing <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a>,
which contains many code examples that show you how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using
<a href="http://avro.apache.org/">Apache Avro</a> as the data serialization format.  I will also briefly summarize the current
state of their integration on a high level to give you additional context of where the two projects are headed in this
regard.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="note">
  <strong>
    kafka-storm-starter is available at <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> on GitHub.
  </strong>
</div>

<h1 id="state-of-the-integration-game">State of the (integration) game</h1>

<p>For the lazy reader here’s the TL;DR version of Kafka and Storm integration:</p>

<ul>
  <li>You can indeed integrate Kafka 0.8.1.1 (latest stable) and Storm 0.9.1-incubating (latest stable).  I mention this
explicitly only to clear up any confusion whatsoever that may have resulted from you reading the mailing lists.</li>
  <li>The Kafka/Storm integration is, at this time, still more complicated and error prone than it should be.  For this
reason I released the code project <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> (more details
below), which should answer most questions you may have when setting out to connect Storm to Kafka for both reading
and writing data.  As such kafka-storm-starter can serve as a bootstrapping template to build your own real-time data
processing pipelines with Kafka and Storm.</li>
  <li>In the Storm project we are actively working on closing this integration gap.  For instance, we have recently
<a href="https://github.com/apache/incubator-storm/tree/master/external/storm-kafka">merged</a> the
<a href="https://github.com/wurstmeister/storm-kafka-0.8-plus">most popular Kafka spout</a> into the core Storm project.
This Kafka spout will be included in the next version of Storm, 0.9.2-incubating, which is just around the corner.
And the spout is now <a href="https://issues.apache.org/jira/browse/STORM-331">compatible with the latest Kafka 0.8.1.1</a>.
Kudos to <a href="https://twitter.com/ptgoetz">P. Taylor Goetz</a> of HortonWorks for acting as the initial sponsor of the
storm-kafka component!  For more information see
<a href="https://github.com/apache/incubator-storm/tree/master/external/storm-kafka">external/storm-kafka</a> in the Storm code
base.</li>
  <li>The Kafka project is working on an improved, consolidated consumer API for Kafka 0.9.  Take a look at the respective
discussions in the <a href="http://grokbase.com/t/kafka/users/142avhm32j/new-consumer-api-discussion">kafka-user</a> and
<a href="http://grokbase.com/t/kafka/dev/142avhm32j/new-consumer-api-discussion">kafka-dev</a> mailing lists.  The
<a href="https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design">Kafka 0.9 Consumer Rewrite Design</a>
document is also worth a read.  Moving forward this API initiative should simplify interaction with Kafka in general
and integration with storm-kafka in particular.</li>
</ul>

<h1 id="kafka-storm-starter">kafka-storm-starter</h1>

<h2 id="overview-and-quick-start">Overview and quick start</h2>

<p>A few days ago I released <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> as a means to jumpstart
developers interested in integrating Kafka 0.8 and Storm 0.9.  Without further ado let’s take a first quick look.</p>

<p>Before we start we must grab the latest version of the code, which is implemented in Scala 2.10:</p>

<pre><code>$ git clone https://github.com/miguno/kafka-storm-starter.git
$ cd kafka-storm-starter
</code></pre>

<p>We begin the tour by running the test suite:</p>

<pre><code>$ ./sbt test
</code></pre>

<p>Notably this command will run end-to-end tests of Kafka, Storm, and Kafka/Storm integration.  See this shortened version
of the test output:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
<span class="line-number">48</span>
<span class="line-number">49</span>
<span class="line-number">50</span>
<span class="line-number">51</span>
<span class="line-number">52</span>
<span class="line-number">53</span>
<span class="line-number">54</span>
<span class="line-number">55</span>
<span class="line-number">56</span>
<span class="line-number">57</span>
<span class="line-number">58</span>
<span class="line-number">59</span>
<span class="line-number">60</span>
<span class="line-number">61</span>
</pre></td><td class="code"><pre><code class=""><span class="line">[...other tests removed...]
</span><span class="line">
</span><span class="line">[info] KafkaSpec:
</span><span class="line">[info] Kafka
</span><span class="line">[info] - should synchronously send and receive a Tweet in Avro format
</span><span class="line">[info]   + Given a ZooKeeper instance
</span><span class="line">[info]   + And a Kafka broker instance
</span><span class="line">[info]   + And some tweets
</span><span class="line">[info]   + And a single-threaded Kafka consumer group
</span><span class="line">[info]   + When I start a synchronous Kafka producer that sends the tweets in Avro binary format
</span><span class="line">[info]   + Then the consumer app should receive the tweets
</span><span class="line">[info] - should asynchronously send and receive a Tweet in Avro format
</span><span class="line">[info]   + Given a ZooKeeper instance
</span><span class="line">[info]   + And a Kafka broker instance
</span><span class="line">[info]   + And some tweets
</span><span class="line">[info]   + And a single-threaded Kafka consumer group
</span><span class="line">[info]   + When I start an asynchronous Kafka producer that sends the tweets in Avro binary format
</span><span class="line">[info]   + Then the consumer app should receive the tweets
</span><span class="line">[info] StormSpec:
</span><span class="line">[info] Storm
</span><span class="line">[info] - should start a local cluster
</span><span class="line">[info]   + Given no cluster
</span><span class="line">[info]   + When I start a LocalCluster instance
</span><span class="line">[info]   + Then the local cluster should start properly
</span><span class="line">[info] - should run a basic topology
</span><span class="line">[info]   + Given a local cluster
</span><span class="line">[info]   + And a wordcount topology
</span><span class="line">[info]   + And the input words alice, bob, joe, alice
</span><span class="line">[info]   + When I submit the topology
</span><span class="line">[info]   + Then the topology should properly count the words
</span><span class="line">[info] KafkaStormSpec:
</span><span class="line">[info] Feature: AvroDecoderBolt[T]
</span><span class="line">[info]   Scenario: User creates a Storm topology that uses AvroDecoderBolt
</span><span class="line">[info]     Given a ZooKeeper instance
</span><span class="line">[info]     And a Kafka broker instance
</span><span class="line">[info]     And a Storm topology that uses AvroDecoderBolt and that reads tweets from topic testing-input and writes them as-is to topic testing-output
</span><span class="line">[info]     And some tweets
</span><span class="line">[info]     And a synchronous Kafka producer app that writes to the topic testing-input
</span><span class="line">[info]     And a single-threaded Kafka consumer app that reads from topic testing-output
</span><span class="line">[info]     And a Storm topology configuration that registers an Avro Kryo decorator for Tweet
</span><span class="line">[info]     When I run the Storm topology
</span><span class="line">[info]     And I use the Kafka producer app to Avro-encode the tweets and sent them to Kafka
</span><span class="line">[info]     Then the Kafka consumer app should receive the decoded, original tweets from the Storm topology
</span><span class="line">[info] Feature: AvroScheme[T] for Kafka spout
</span><span class="line">[info]   Scenario: User creates a Storm topology that uses AvroScheme in Kafka spout
</span><span class="line">[info]     Given a ZooKeeper instance
</span><span class="line">[info]     And a Kafka broker instance
</span><span class="line">[info]     And a Storm topology that uses AvroScheme and that reads tweets from topic testing-input and writes them as-is to topic testing-output
</span><span class="line">[info]     And some tweets
</span><span class="line">[info]     And a synchronous Kafka producer app that writes to the topic testing-input
</span><span class="line">[info]     And a single-threaded Kafka consumer app that reads from topic testing-output
</span><span class="line">[info]     And a Storm topology configuration that registers an Avro Kryo decorator for Tweet
</span><span class="line">[info]     When I run the Storm topology
</span><span class="line">[info]     And I use the Kafka producer app to Avro-encode the tweets and sent them to Kafka
</span><span class="line">[info]     Then the Kafka consumer app should receive the decoded, original tweets from the Storm topology
</span><span class="line">[info] Run completed in 21 seconds, 852 milliseconds.
</span><span class="line">[info] Total number of tests run: 25
</span><span class="line">[info] Suites: completed 8, aborted 0
</span><span class="line">[info] Tests: succeeded 25, failed 0, canceled 0, ignored 0, pending 0
</span><span class="line">[info] All tests passed.
</span><span class="line">[success] Total time: 22 s, completed May 23, 2014 12:31:09 PM</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>We finish the tour by launching the
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/KafkaStormDemo.scala">KafkaStormDemo</a>
application:</p>

<pre><code>$ ./sbt run
</code></pre>

<p>This demo starts in-memory instances of ZooKeeper, Kafka, and Storm.  It then runs a demo Storm topology that connects
to and reads from the Kafka instance.</p>

<p>You will see output similar to the following (some parts removed to improve readability):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class=""><span class="line">7031 [Thread-19] INFO  backtype.storm.daemon.worker - Worker 3f7f1a51-5c9e-43a5-b431-e39a7272215e for storm kafka-storm-starter-1-1400839826 on daa60807-d440-4b45-94fc-8dd7798453d2:1027 has finished loading
</span><span class="line">7033 [Thread-29-kafka-spout] INFO  storm.kafka.DynamicBrokersReader - Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=127.0.0.1:9092}}
</span><span class="line">7050 [Thread-29-kafka-spout] INFO  backtype.storm.daemon.executor - Opened spout kafka-spout:(1)
</span><span class="line">7051 [Thread-29-kafka-spout] INFO  backtype.storm.daemon.executor - Activating spout kafka-spout:(1)
</span><span class="line">7051 [Thread-29-kafka-spout] INFO  storm.kafka.ZkCoordinator - Refreshing partition manager connections
</span><span class="line">7065 [Thread-29-kafka-spout] INFO  storm.kafka.DynamicBrokersReader - Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=127.0.0.1:9092}}
</span><span class="line">7066 [Thread-29-kafka-spout] INFO  storm.kafka.ZkCoordinator - Deleted partition managers: []
</span><span class="line">7066 [Thread-29-kafka-spout] INFO  storm.kafka.ZkCoordinator - New partition managers: [Partition{host=127.0.0.1:9092, partition=0}]
</span><span class="line">7083 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Read partition information from: /kafka-spout/kafka-storm-starter/partition_0  --&gt; null
</span><span class="line">7100 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - No partition information found, using configuration to determine offset
</span><span class="line">7105 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Starting Kafka 127.0.0.1:0 from offset 18
</span><span class="line">7106 [Thread-29-kafka-spout] INFO  storm.kafka.ZkCoordinator - Finished refreshing
</span><span class="line">7126 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Committing offset for Partition{host=127.0.0.1:9092, partition=0}
</span><span class="line">7126 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Committed offset 18 for Partition{host=127.0.0.1:9092, partition=0} for topology: 47e82e34-fb36-427e-bde6-8cd971db2527
</span><span class="line">9128 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Committing offset for Partition{host=127.0.0.1:9092, partition=0}
</span><span class="line">9129 [Thread-29-kafka-spout] INFO  storm.kafka.PartitionManager - Committed offset 18 for Partition{host=127.0.0.1:9092, partition=0} for topology: 47e82e34-fb36-427e-bde6-8cd971db2527</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>At this point Storm is connected to Kafka (more precisely: to the <code>testing</code> topic in Kafka).  The last few lines
above – “Committing offset …” – will repeat again and again, because a) this demo Storm topology only reads
from the Kafka topic but does nothing with the data it reads, and b) we are not sending any data to the
Kafka topic.</p>

<div class="note">
<strong>Note:</strong> This example will actually run <em>two</em> in-memory instances of ZooKeeper:  the first (listening at <tt>127.0.0.1:2181/tcp</tt>) is used by the Kafka instance, the second (listening at <tt>127.0.0.1:2000/tcp</tt>) is automatically started and used by the in-memory Storm cluster.  This is because, when running in local aka in-memory mode, Storm does not allow you to reconfigure or disable its own ZooKeeper instance.
</div>

<p><strong>To stop the demo application you must kill or <code>Ctrl-C</code> the process in the terminal.</strong></p>

<p>You can use
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/KafkaStormDemo.scala">KafkaStormDemo</a>
as a starting point to create your own, “real” Storm topologies that read from a “real” Kafka, Storm, and ZooKeeper
infrastructure.  An easy way to get started with such an infrastructure is by deploying Kafka, Storm, and ZooKeeper via
a tool such as <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a>.</p>

<h2 id="features">Features</h2>

<p>I showcase the following features in kafka-storm-starter.  Note that I focus on showcasing, not necessarily on
being “production ready”.</p>

<ul>
  <li>How to integrate Kafka and Storm.</li>
  <li>How to use <a href="http://avro.apache.org/">Avro</a> with Kafka and Storm for serializing and deserializing the data payload.
For this I leverage <a href="https://github.com/twitter/bijection">Twitter Bijection</a> and
<a href="https://github.com/twitter/chill/">Twitter Chill</a>.</li>
  <li>Kafka standalone code examples
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/KafkaProducerApp.scala">KafkaProducerApp</a>:
A simple Kafka producer app for writing Avro-encoded data into Kafka.
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaSpec.scala">KafkaSpec</a>
puts this producer to use and shows how to use Twitter Bijection to Avro-encode the messages being sent to Kafka.</li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/kafka/KafkaConsumerApp.scala">KafkaConsumerApp</a>:
A simple Kafka consumer app for reading Avro-encoded data from Kafka.
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaSpec.scala">KafkaSpec</a>
puts this consumer to use and shows how to use Twitter Bijection to Avro-decode the messages being read from
Kafka.</li>
    </ul>
  </li>
  <li>Storm standalone code examples
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/AvroDecoderBolt.scala">AvroDecoderBolt[T]</a>:
An <code>AvroDecoderBolt[T &lt;: org.apache.avro.specific.SpecificRecordBase]</code> that can be parameterized with the type of
the Avro record <code>T</code> it will deserialize its data to (i.e. no need to write another decoder bolt just because the
bolt needs to handle a different Avro schema).</li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/AvroScheme.scala">AvroScheme[T]</a>:
An <code>AvroScheme[T &lt;: org.apache.avro.specific.SpecificRecordBase]</code> scheme, i.e. a custom
<code>backtype.storm.spout.Scheme</code> to auto-deserialize a spout’s incoming data.  The scheme can be parameterized with
the type of the Avro record <code>T</code> it will deserialize its data to (i.e. no need to write another scheme just
because the scheme needs to handle a different Avro schema).
        <ul>
          <li>You can opt to configure a spout (such as the Kafka spout) with <code>AvroScheme</code> if you want to perform the Avro
decoding step directly in the spout instead of placing an <code>AvroDecoderBolt</code> after the Kafka spout.  You may
want to profile your topology to determine which of the two approaches works best for your use case.</li>
        </ul>
      </li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/TweetAvroKryoDecorator.scala">TweetAvroKryoDecorator</a>:
A custom <code>backtype.storm.serialization.IKryoDecorator</code>, i.e. a custom
<a href="http://storm.incubator.apache.org/documentation/Serialization.html">Kryo serializer for Storm</a>.
        <ul>
          <li>Unfortunately we have not figured out a way to implement a parameterized <code>AvroKryoDecorator[T]</code> variant yet.
(A “straightforward” approach we tried – similar to the other parameterized components – compiled fine but
failed at runtime when running the tests).  Code contributions are welcome!</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Kafka and Storm integration
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/AvroKafkaSinkBolt.scala">AvroKafkaSinkBolt[T]</a>:
An <code>AvroKafkaSinkBolt[T &lt;: org.apache.avro.specific.SpecificRecordBase]</code> that can be parameterized with the type
of the Avro record <code>T</code> it will serialize its data to before sending the encoded data to Kafka (i.e. no
need to write another Kafka sink bolt just because the bolt needs to handle a different Avro schema).</li>
      <li>Storm topologies that read Avro-encoded data from Kafka:
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/main/scala/com/miguno/kafkastorm/storm/KafkaStormDemo.scala">KafkaStormDemo</a> and
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaStormSpec.scala">KafkaStormSpec</a></li>
      <li>A Storm topology that writes Avro-encoded data to Kafka:
<a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaStormSpec.scala">KafkaStormSpec</a></li>
    </ul>
  </li>
  <li>Unit testing
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/storm/AvroDecoderBoltSpec.scala">AvroDecoderBoltSpec</a></li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/storm/AvroSchemeSpec.scala">AvroSchemeSpec</a></li>
      <li>And more under <a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm">src/test/scala</a></li>
    </ul>
  </li>
  <li>Integration testing
    <ul>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaSpec.scala">KafkaSpec</a>:
Tests for Kafka, which launch and run against in-memory instances of Kafka and ZooKeeper.</li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/StormSpec.scala">StormSpec</a>:
Tests for Storm, which launch and run against in-memory instances of Storm and ZooKeeper.</li>
      <li><a href="https://github.com/miguno/kafka-storm-starter/blob/develop/src/test/scala/com/miguno/kafkastorm/integration/KafkaStormSpec.scala">KafkaStormSpec</a>:
Tests for integrating Storm and Kafka, which launch and run against in-memory instances of Kafka, Storm, and
ZooKeeper.</li>
    </ul>
  </li>
</ul>
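<p>To make the idea behind these parameterized components concrete, here is a minimal, self-contained Scala sketch of a decoder that is generic in the record type <code>T</code>.  Note that this is an illustration only: the <code>Injection</code> trait below merely mimics the shape of Twitter Bijection’s <code>Injection</code>, and the toy <code>Tweet</code> record with its tab-separated encoding stands in for the real Avro machinery.</p>

<pre><code>// Illustrative sketch only: Injection mimics Twitter Bijection's interface,
// and Tweet stands in for an Avro-generated record class.
trait Injection[A, B] {
  def apply(a: A): B          // encode
  def invert(b: B): Option[A] // decode, may fail
}

case class Tweet(username: String, text: String)

// A trivial codec for demonstration; real code would use Avro binary encoding.
object TweetInjection extends Injection[Tweet, Array[Byte]] {
  def apply(t: Tweet): Array[Byte] = (t.username + "\t" + t.text).getBytes("UTF-8")
  def invert(b: Array[Byte]): Option[Tweet] =
    new String(b, "UTF-8").split("\t", 2) match {
      case Array(u, x) => Some(Tweet(u, x))
      case _           => None
    }
}

// Generic in T: supporting a new record type/schema means supplying a new
// Injection, not writing a new decoder -- the idea behind AvroDecoderBolt[T].
class Decoder[T](inj: Injection[T, Array[Byte]]) {
  def decode(bytes: Array[Byte]): Option[T] = inj.invert(bytes)
}

object DecoderDemo extends App {
  val decoder = new Decoder(TweetInjection)
  val bytes   = TweetInjection(Tweet("alice", "Hello, Kafka!"))
  println(decoder.decode(bytes))
}
</code></pre>

<p>The same pattern underlies <code>AvroScheme[T]</code> and <code>AvroKafkaSinkBolt[T]</code>: a type parameter plus an encode/decode pair is what removes the need to write one bolt or scheme per Avro schema.</p>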

<h2 id="interested-in-more">Interested in more?</h2>

<p>All the gory details are available at <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a>.  Apart from
the code and build script (sbt) I provide information about how to create Cobertura code coverage reports, package
the code, create Java “sources” and “javadoc” jars, generate API docs, integrate with
<a href="http://jenkins-ci.org/">Jenkins CI</a> and <a href="http://www.jetbrains.com/teamcity/">TeamCity</a> build servers, and set up
kafka-storm-starter as a project in IntelliJ IDEA and Eclipse.</p>

<p>Moving forward, my plan is to keep kafka-storm-starter up to date with the latest versions of Kafka and Storm.  The
next version of Storm, 0.9.2, will already simplify the current setup quite a lot.  Of course I welcome any code, docs,
or similar <a href="https://github.com/miguno/kafka-storm-starter#Contributing">contributions you may have</a>.</p>

<h1 id="the-quest-to-get-there">The quest to get there</h1>

<p>Just for the historical record here are some of the gotchas that are addressed by kafka-storm-starter, i.e. problems
you do not need to solve yourself anymore:</p>

<ul>
  <li>Figuring out which Kafka spout in Storm 0.9 works with the latest Kafka 0.8 version.  A lot of people tried in vain to
use a Kafka spout built for Kafka 0.7 to read from Kafka 0.8.  Others didn’t know how to use the available Kafka 0.8
spouts in their code, and so on.  In the case of kafka-storm-starter I opted to go with the spout created by
<a href="https://github.com/wurstmeister/storm-kafka-0.8-plus">wurstmeister</a>, primarily because this spout will soon be the
“official” Kafka spout maintained by the Storm project.  Unfortunately the latest version of the spout was/is not
available in a public Maven repository, so I had to take care of that, too, until Storm 0.9.2 provides the official
version.
    <ul>
      <li>Alternatively you can also try the <a href="https://github.com/HolmesNL/kafka-spout">Kafka spout of HolmesNL</a>, developed by
Mattijs Ugen.  I won’t cover the differences from the wurstmeister spout in detail, but essentially
the wurstmeister spout uses the
<a href="https://kafka.apache.org/documentation.html#simpleconsumerapi">Simple Consumer API</a> of Kafka 0.8 whereas
Mattijs’ spout uses the
<a href="https://kafka.apache.org/documentation.html#highlevelconsumerapi">High Level Consumer API</a>.</li>
    </ul>
  </li>
  <li>Resolving version conflicts between the various software packages.  For instance, Storm 0.9.1 has a transitive
dependency on Kryo 2.17 because Storm depends on an old version of <a href="https://github.com/sritchie/carbonite">Carbonite</a>.
This causes problems when trying to use Twitter Bijection or Twitter Chill, because those require a newer version of
Kryo.  (Apart from that Kryo 2.21 also fixes data corruption issues, so you do want the newer version.)  To address
this issue I filed <a href="https://issues.apache.org/jira/browse/STORM-263">STORM-263</a>, which is included in upcoming
Storm 0.9.2.  Thanks to <a href="https://twitter.com/sritchie">Sam Ritchie</a>, the maintainer of Carbonite, and everyone else
involved for getting the patch included.
Another example is that you must exclude <code>javax.jms:jms</code> (and a few others) when adding Kafka to your build
dependencies.  Yet another is handling Netflix (now: Apache) Curator version conflicts.</li>
  <li>Understanding the various conflicting ZooKeeper versions, and picking a version to go with.  Right now Storm and Kafka
still prefer very old 3.3.x versions of ZooKeeper, whereas in practice many people run 3.4.x in their infrastructure
(e.g. because ZooKeeper 3.4.x is already deployed alongside other infrastructure pieces such as Hadoop clusters
when using commercial Hadoop distributions).</li>
  <li>How to write unit tests for Storm topologies.  A lot of people seem to find references to
<a href="https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java">TestingApiDemo.java</a> while
searching the Internet but struggle with extracting these examples out of the Storm code base and merging them into
their own project.</li>
  <li>How to write Storm topologies in a way that lets you parameterize their components (bolts etc.) with the Avro record type
<code>T</code>, so that you don’t need to write a new bolt only because your Avro schema changes.  The goal of this code is to
show how you can improve the developer/user experience by providing ready-to-use functionality, in this case with
regards to (Avro) serialization/deserialization.  To tackle this you must understand
<a href="https://storm.incubator.apache.org/documentation/Serialization.html">Storm’s serialization system</a> as well
as its run-time behavior.
    <ul>
      <li>While doing that I discovered a (known) Scala bug when I tried to use <code>TypeTag</code> instead of the deprecated <code>Manifest</code>
to implement e.g. <code>AvroDecoderBolt[T]</code>, see <a href="https://issues.scala-lang.org/browse/SI-5919">SI-5919</a>.  This
bug is still not fixed in the latest Scala 2.11.1, by the way.</li>
    </ul>
  </li>
  <li>How to write end-to-end Kafka-&gt;Storm-&gt;Kafka tests.</li>
  <li>And so on…</li>
</ul>
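<p>For reference, the kind of Kafka dependency exclusion mentioned above can be expressed in sbt roughly as follows.  This is an illustrative <code>build.sbt</code> fragment – the version number is only an example, and the exact exclusion list may differ from what kafka-storm-starter actually uses:</p>

<pre><code>// Illustrative build.sbt fragment (version and exclusions are examples).
libraryDependencies += ("org.apache.kafka" % "kafka_2.10" % "0.8.1.1")
  .exclude("javax.jms", "jms")
  .exclude("com.sun.jdmk", "jmxtools")
  .exclude("com.sun.jmx", "jmxri")
</code></pre>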

<h1 id="conclusion">Conclusion</h1>

<p>I hope you find <a href="https://github.com/miguno/kafka-storm-starter">kafka-storm-starter</a> useful to bootstrap your own
Kafka/Storm application.  In the Storm community we are actively working on improving and simplifying the Kafka/Storm
integration, so please stay tuned and, above all, thanks for your patience.  The upcoming 0.9.2 version of Storm is
already a first step in the right direction by bundling a Kafka spout that works with the latest stable version of
Kafka (0.8.1.1 at the time of this writing).</p>

<p>Where to go once you have your Kafka and Storm code ready?  At this point you can use a tool such as
<a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a> and its associated Puppet modules to deploy production Kafka and
Storm clusters and run your own real-time data processing pipelines at scale.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Wirbelsturm: 1-Click Deployments of Storm and Kafka clusters with Vagrant and Puppet]]></title>
    <link href="http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2014-03-17T17:58:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2014/03/17/wirbelsturm-one-click-deploy-storm-kafka-clusters-with-vagrant-puppet</id>
    <content type="html"><![CDATA[<p>I am happy to announce the first public release of <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a>, a Vagrant and
Puppet based tool to perform 1-click local and remote deployments, with a focus on big data related infrastructure.
Wirbelsturm’s goal is to make tasks such as “I want to deploy a multi-node Storm cluster” <em>simple</em>, <em>easy</em>, and <em>fun</em>.
In this post I will introduce you to Wirbelsturm, talk a bit about its history, and show you how to launch a multi-node
Storm (or Kafka or …) cluster faster than you can brew an espresso.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="note">
  <strong>
    Wirbelsturm is available at <a href="https://github.com/miguno/wirbelsturm">wirbelsturm</a> on GitHub.
  </strong>
</div>

<p><strong>Update May 27, 2014:</strong>  If you want to build real-time data processing pipelines based on Kafka and Storm, you may be
interested in <a href="http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial/">kafka-storm-starter</a>.  It contains code
examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using Apache Avro as the data
serialization format.</p>

<h1 id="wirbelsturm-quick-start">Wirbelsturm quick start</h1>

<p>This section is an appetizer of what you can do with Wirbelsturm.  Do not worry if something is not immediately obvious
to you – the <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm documentation</a> describes everything in full detail.</p>

<p>Assuming you are using a reasonably powerful computer and have already installed <a href="http://www.vagrantup.com/">Vagrant</a>
(1.4.x – note that 1.5.x is not supported yet) and <a href="https://www.virtualbox.org/">VirtualBox</a>, you can launch a multi-node
<a href="http://storm.incubator.apache.org/">Apache Storm</a> cluster on your local machine with the following commands.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git clone https://github.com/miguno/wirbelsturm.git
</span><span class="line"><span class="nv">$ </span><span class="nb">cd </span>wirbelsturm
</span><span class="line"><span class="nv">$ </span>./bootstrap     <span class="c"># &lt;&lt;&lt; May take a while depending on how fast your Internet connection is.</span>
</span><span class="line"><span class="nv">$ </span>vagrant up      <span class="c"># &lt;&lt;&lt; ...and this step also depends on how powerful your computer is.</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Done – you now have a fully functioning Storm cluster up and running on your computer!  The deployment should have
taken you significantly less time and effort than
<a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/">going through long blog posts</a> or
<a href="http://storm.incubator.apache.org/documentation/Documentation.html">working through the official documentation</a>.  On
top of that, you can now re-deploy your setup wherever and whenever you need it, thanks to automation.</p>

<div class="note">
Note: Running a small, local Storm cluster is just the default example.  You can do much more with Wirbelsturm than this.
</div>

<p>Let’s take a look at which virtual machines back this cluster behind the scenes:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>vagrant status
</span><span class="line">Current machine states:
</span><span class="line">
</span><span class="line">zookeeper1                running <span class="o">(</span>virtualbox<span class="o">)</span>
</span><span class="line">nimbus1                   running <span class="o">(</span>virtualbox<span class="o">)</span>
</span><span class="line">supervisor1               running <span class="o">(</span>virtualbox<span class="o">)</span>
</span><span class="line">supervisor2               running <span class="o">(</span>virtualbox<span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Storm also ships with a web UI that shows you the cluster’s state, e.g. how many nodes it has, whether any processing
jobs (topologies) are being executed, etc.  Wait 20-30 seconds after the deployment is done and then open the Storm UI
at <a href="http://localhost:28080/">http://localhost:28080/</a>.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/wirbelsturm-storm-ui-screenshot.png" title="Storm UI showing two slaves nodes" /></p>

<div class="caption">
Figure 1: The default example of Wirbelsturm deploys a multi-node Storm cluster.  In this screenshot of the Storm UI you can see the two slave nodes &#8211; named <em>supervisor1</em> and <em>supervisor2</em> &#8211; running Storm&#8217;s Supervisor daemons.  The third machine acts as the Storm master node and runs the Nimbus daemon and this Storm UI.  The fourth machine runs ZooKeeper.
</div>

<p>What’s more, Wirbelsturm also allows you to use <a href="http://www.ansible.com/">Ansible</a> to interact with the deployed
machines via its <a href="https://github.com/miguno/wirbelsturm/blob/master/ansible">ansible</a> wrapper script:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>./ansible all -m ping
</span><span class="line">zookeeper1 | success &gt;&gt; <span class="o">{</span>
</span><span class="line">    <span class="s2">&quot;changed&quot;</span>: <span class="nb">false</span>,
</span><span class="line">    <span class="s2">&quot;ping&quot;</span>: <span class="s2">&quot;pong&quot;</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line">supervisor1 | success &gt;&gt; <span class="o">{</span>
</span><span class="line">    <span class="s2">&quot;changed&quot;</span>: <span class="nb">false</span>,
</span><span class="line">    <span class="s2">&quot;ping&quot;</span>: <span class="s2">&quot;pong&quot;</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line">nimbus1 | success &gt;&gt; <span class="o">{</span>
</span><span class="line">    <span class="s2">&quot;changed&quot;</span>: <span class="nb">false</span>,
</span><span class="line">    <span class="s2">&quot;ping&quot;</span>: <span class="s2">&quot;pong&quot;</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line">supervisor2 | success &gt;&gt; <span class="o">{</span>
</span><span class="line">    <span class="s2">&quot;changed&quot;</span>: <span class="nb">false</span>,
</span><span class="line">    <span class="s2">&quot;ping&quot;</span>: <span class="s2">&quot;pong&quot;</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Want to run more Storm slaves?  As long as your computer has enough horsepower you only need to change a single number
in <code>wirbelsturm.yaml</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="yaml"><span class="line"><span class="c1"># wirbelsturm.yaml</span>
</span><span class="line"><span class="l-Scalar-Plain">nodes</span><span class="p-Indicator">:</span>
</span><span class="line">  <span class="l-Scalar-Plain">...</span>
</span><span class="line">  <span class="l-Scalar-Plain">storm_slave</span><span class="p-Indicator">:</span>
</span><span class="line">      <span class="l-Scalar-Plain">count</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">2</span>     <span class="c1"># &lt;&lt;&lt; changing 2 to 4 is all it takes</span>
</span><span class="line">  <span class="l-Scalar-Plain">...</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Then run <code>vagrant up</code> again and shortly after <code>supervisor3</code> and <code>supervisor4</code> will be up and running.</p>

<p>Want to run an <a href="http://kafka.apache.org/">Apache Kafka</a> broker?  Just uncomment the <code>kafka_broker</code> section in your
<code>wirbelsturm.yaml</code> so that it looks similar to the following example snippet (only remove the leading <code>#</code> characters; do not remove any whitespace):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
</pre></td><td class="code"><pre><code class="yaml"><span class="line"><span class="c1"># wirbelsturm.yaml</span>
</span><span class="line"><span class="l-Scalar-Plain">nodes</span><span class="p-Indicator">:</span>
</span><span class="line">  <span class="l-Scalar-Plain">...</span>
</span><span class="line">  <span class="l-Scalar-Plain"># Deploys Kafka brokers.</span>
</span><span class="line">  <span class="l-Scalar-Plain">kafka_broker</span><span class="p-Indicator">:</span>
</span><span class="line">    <span class="l-Scalar-Plain">count</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">1</span>
</span><span class="line">    <span class="l-Scalar-Plain">hostname_prefix</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">kafka</span>
</span><span class="line">    <span class="l-Scalar-Plain">ip_range_start</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">10.0.0.20</span>
</span><span class="line">    <span class="l-Scalar-Plain">node_role</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">kafka_broker</span>
</span><span class="line">    <span class="l-Scalar-Plain">providers</span><span class="p-Indicator">:</span>
</span><span class="line">      <span class="l-Scalar-Plain">virtualbox</span><span class="p-Indicator">:</span>
</span><span class="line">        <span class="l-Scalar-Plain">memory</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">1536</span>
</span><span class="line">      <span class="l-Scalar-Plain">aws</span><span class="p-Indicator">:</span>
</span><span class="line">        <span class="l-Scalar-Plain">instance_type</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">t1.micro</span>
</span><span class="line">        <span class="l-Scalar-Plain">ami</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">ami-86cdb3ef</span>
</span><span class="line">        <span class="l-Scalar-Plain">security_groups</span><span class="p-Indicator">:</span>
</span><span class="line">          <span class="p-Indicator">-</span> <span class="l-Scalar-Plain">wirbelsturm</span>
</span><span class="line">  <span class="l-Scalar-Plain">...</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Then run <code>vagrant up kafka1</code>.  Now you have Kafka running alongside Storm.</p>

<p>Once you have finished playing around, you can tear down all the machines in the cluster again by executing
<code>vagrant destroy</code>.</p>

<h1 id="motivation">Motivation</h1>

<p>Let me use an analogy to explain the motivation behind building Wirbelsturm.  While I assume every last one of us would
like to work somewhat like this…</p>

<p><img src="http://www.michael-noll.com/blog/uploads/neo-cool.png" title="The dream." /></p>

<p>…most of our actual time is spent doing something like this:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/scotty-uncool.png" title="The reality." /></p>

<p>Without any automated deployment tools the task of setting up cluster environments with (say) Storm or Kafka is simply a
very time-consuming, complicated, and – let’s face it – mind-numbingly boring experience.  So the motivation for
Wirbelsturm was really simple: first, minimize frustration, and second, help others.</p>

<p>While these were the primary reasons, there were also secondary considerations:  Wirbelsturm should integrate nicely with
existing deployment infrastructures and the associated skills of Operations teams – that’s why it is so heavily based
on Puppet, though e.g. Chef and Ansible would have been good candidates, too.  Also, it should allow you to perform
local deployments (say, your dev laptop) as well as remote deployments (larger-scale environments, production, etc.) –
that’s why Vagrant was added to the picture.  You should also be able to easily transition from a Wirbelsturm/Vagrant
backed setup to a “real” production setup without having to re-architect your deployment, switch tools, etc.</p>

<p>As such, Wirbelsturm is one of the tools that help make the process of going from “Hey, I have this cool idea” to
“It’s live in production!” as simple, easy, and fun as possible.  A developer should be free to completely screw up
“his” test environment; two developers in the same team should always have the same copy of an environment;  the
integration environment of that team should look and feel the same way, too;  and for sure that should apply to
the production environment as well.</p>

<p>I think at this point the motivation should be pretty clear, and in the section <em>Is Wirbelsturm for me?</em> I list further
examples of what you can do with Wirbelsturm.</p>

<h1 id="current-wirbelsturm-features">Current Wirbelsturm features</h1>

<p>In its first public release Wirbelsturm supports the following high-level features:</p>

<ul>
  <li><strong>Launching machines:</strong>  Wirbelsturm uses Vagrant to launch the machines that make up your infrastructure
  as VMs running locally in VirtualBox (default) or remotely in Amazon AWS/EC2 (OpenStack support is in the works).</li>
  <li><strong>Provisioning machines:</strong>  Machines are provisioned via Puppet.
    <ul>
      <li>Wirbelsturm uses a master-less Puppet setup, i.e. provisioning is ultimately performed through <code>puppet apply</code>.</li>
      <li>Puppet modules are managed via <a href="https://github.com/rodjek/librarian-puppet">librarian-puppet</a>.</li>
    </ul>
  </li>
  <li><strong>(Some) batteries included:</strong>  We maintain a number of standard Puppet modules that work well with Wirbelsturm, some
of which are included in the default configuration of Wirbelsturm.  However you can use any Puppet module with
Wirbelsturm, of course.  See <a href="#supported-puppet-modules">Supported Puppet modules</a> for more information.</li>
  <li><strong>Ansible support:</strong> The <a href="http://www.ansible.com/">Ansible</a> aficionados amongst us can use Ansible to interact with
machines once deployed through Wirbelsturm and Puppet.</li>
  <li><strong>Host operating system support:</strong> Wirbelsturm has been tested with Mac OS X 10.8+ and RHEL/CentOS 6 as host machines.
Debian/Ubuntu should work, too.</li>
  <li><strong>Guest operating system support:</strong> The target OS version for deployed machines is RHEL/CentOS 6 (64-bit).  Amazon
Linux is supported, too.
    <ul>
      <li>For local deployments (via VirtualBox) and AWS deployments Wirbelsturm uses a
<a href="http://puppet-vagrant-boxes.puppetlabs.com/">CentOS 6 box created by PuppetLabs</a>.</li>
      <li>Switching to RHEL 6 only requires specifying a different <a href="http://docs.vagrantup.com/v2/boxes.html">Vagrant box</a>
in <a href="bootstrap">bootstrap</a> (for VirtualBox) or a different AMI in <code>wirbelsturm.yaml</code> (for Amazon
AWS).</li>
    </ul>
  </li>
  <li><strong>When using tools other than Vagrant to launch machines:</strong>  Wirbelsturm-compatible Puppet modules are standard Puppet
modules, so of course they can be used standalone, too.  This way you can deploy against bare metal machines even if
you are not able to or do not want to run Wirbelsturm and/or Vagrant directly.</li>
</ul>

<h1 id="is-wirbelsturm-for-me">Is Wirbelsturm for me?</h1>

<p>Here are some ideas for what you can do with Wirbelsturm:</p>

<ul>
  <li>Evaluate new technologies such as Kafka and Storm in a temporary environment that you can set up and tear
down at will, without having to spend hours and stay late figuring out how to install those tools.
Then tell your boss how hard you worked for it.</li>
  <li>Provide your teams with a consistent look and feel of infrastructure environments from initial prototyping
to development &amp; testing and all the way to production.  Banish “But it does work fine on <em>my</em> machine!” remarks
from your daily standups.  Well, hopefully.</li>
  <li>Save money if (at least some of) these environments run locally instead of in an IaaS cloud or on bare-metal
machines that you would need to purchase first.  Make Finance happy for the first time.</li>
  <li>Create production-like environments for training classes.  Use them to get new hires up to speed.  Or unleash a
<a href="http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html">Chaos Monkey</a> and check how well your
applications, DevOps tools, or technical staff can handle the mess.  Bring coke and popcorn.</li>
  <li>Create sandbox environments to demo your product to customers.  If Sales can run it, so can they.</li>
  <li>Develop and test-drive your or other people’s Puppet modules.  But see also
<a href="https://github.com/puppetlabs/beaker">beaker</a> and <a href="http://serverspec.org/">serverspec</a> if your focus is on
testing.</li>
</ul>

<h1 id="wirbelsturm-in-detail">Wirbelsturm in detail</h1>

<p>Actually, I will <em>not</em> talk a whole lot more about Wirbelsturm itself in this blog post.  If I managed to spark your
interest, feel free to head over to the <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm project page</a> and start
reading – and fooling around – there.  There is also a list of
<a href="https://github.com/miguno/wirbelsturm#supported-puppet-modules">supported Puppet modules</a> in case you’re wondering what
kind of software you can deploy with Wirbelsturm (summary: you can use <em>any</em> Puppet module with Wirbelsturm, but some
are easier to use than others).</p>

<p>Instead I want to spend a few minutes in the next sections talking about what tasks and problems had to be solved to
put Wirbelsturm together, and also share some lessons learned along the way.</p>

<h1 id="the-long-road-of-getting-there">The long road of getting there</h1>

<p>What needed to be done to create the first version of Wirbelsturm?  Here’s a non-comprehensive list; I hope my memory
serves me well.</p>

<ul>
  <li>Packaging the relevant software where official packages (here: RPMs for RHEL 6 family) weren’t available.
    <ul>
      <li>The packaging code is also open sourced at e.g.
<a href="https://github.com/miguno/wirbelsturm-rpm-kafka">wirbelsturm-rpm-kafka</a> and
<a href="https://github.com/miguno/wirbelsturm-rpm-storm">wirbelsturm-rpm-storm</a>.</li>
      <li>Of course the packages also need to be digitally signed for security reasons.</li>
      <li>Kudos to Jordan Sissel for creating <a href="https://github.com/jordansissel/fpm">fpm</a>!</li>
    </ul>
  </li>
  <li>Making this build process deterministic, and publishing that code as open source, too.  That is, don’t use an internal
infrastructure for that because a) people may not be easily able to reproduce it, and b) people may not trust what
strangers put together behind closed doors.
Think: <a href="http://cm.bell-labs.com/who/ken/trust.html">Reflections on Trusting Trust</a>.
    <ul>
      <li>The code to deploy a Wirbelsturm build server – which is used to build and sign the RPMs – is available as
open source at <a href="https://github.com/miguno/puppet-wirbelsturm_build">puppet-wirbelsturm_build</a>.</li>
    </ul>
  </li>
  <li>Understanding how to manage and host a public yum repository on Amazon S3.  <em>Please note that the idea has never</em>
<em>been to become a third-party package maintainer or third-party package repository</em>.  Instead the idea was to
provide just enough so that Wirbelsturm beginners can follow a quick start and have <em>something</em> deployed in a matter
of minutes.  And then let the users leverage the provided tools (see above) to run their own show.
    <ul>
      <li>Hosting some pre-built RPMs on a
<a href="https://github.com/miguno/puppet-wirbelsturm_yumrepos/blob/master/manifests/miguno.pp">public yum repo</a> also
meant checking whether the license of the respective software would allow that, and under which conditions.  I am
not a lawyer and made my best effort to comply with all the respective licenses.  If you have some concerns in
this regard please do let me know!</li>
    </ul>
  </li>
  <li>Learning that RHEL/CentOS 6 ships with significantly outdated versions of many packages, notably
<a href="http://www.supervisord.org/">supervisord</a> (but e.g. also nginx).  Supervisord version 2.x turned out to be a problem
in practice because a properly functioning process supervisor is highly recommended for running Storm &amp; Co. in
production.  Hence supervisord version 3.x needed to be packaged because that version is not yet available for the
RHEL 6 OS family in any “official” repository (e.g. EPEL’s version is outdated, too).</li>
  <li>Speaking of outdated or at least different versions:  Ruby on RHEL/CentOS 6 and Amazon Linux is 1.8.x, whereas on
Mac OS X 10.9 it is 1.9.x.  And then we also have different versions of Puppet etc.  While every version discrepancy is likely to
complicate development and testing, Ruby and Puppet versions were particularly annoying to deal with as they are
“bootstrap” packages that we need as the foundation of any Puppet-based deployments.  I eventually created
<a href="https://github.com/miguno/ruby-bootstrap">ruby-bootstrap</a>, which addresses a part of those problems.</li>
  <li>Many Puppet modules needed to be written.  Where possible I tried to use existing modules as-is but in practice that
goal was hard to hit.  Some modules didn’t really work, some used completely different coding styles, some
supported Hiera while others didn’t, and so on.  I ended up creating several modules from scratch – e.g.
<a href="https://github.com/miguno/puppet-kafka">puppet-kafka</a>, <a href="https://github.com/miguno/puppet-storm">puppet-storm</a>, and
<a href="https://github.com/miguno/puppet-zookeeper">puppet-zookeeper</a> – as well as forking others.  In the latter case,
I tried to contribute back changes to the upstream project where possible and feasible (e.g. I contributed a bug fix
to <a href="https://github.com/electrical/puppet-lib-file_concat/pull/3">puppet-lib-file_concat</a>).  But because my plan was
also to come up with a consistent style and feature support across all Puppet modules – notably Hiera support –
the code of many forks stayed in that particular fork.  Also, some bug fixes or features that I contributed back
upstream were never merged, but since Wirbelsturm wouldn’t function properly without those changes I didn’t have an
alternative to maintaining my own fork.</li>
  <li>I ran into many bugs in many places.
<a href="http://stackoverflow.com/questions/17413598">Vagrant couldn’t consistently deploy to AWS</a>, for instance.  Vagrant plugins
broke amidst Vagrant version upgrades.
<a href="https://github.com/mitchellh/vagrant/issues/2087">RHEL support suddenly stopped working in Vagrant</a>, which I fixed
and contributed back.  I learned that Puppet has, for instance, a very weird way of handling boolean values when
defined in Hiera, and that it requires you to resort to a hacky <code>mkdir -p</code> based workaround using
<a href="http://docs.puppetlabs.com/references/latest/type.html#exec">exec</a> to create directories recursively.  Most of those
problems weren’t huge deals, but in combination they turned out to be death by a thousand cuts.</li>
  <li>Separating Puppet code from Wirbelsturm code.  I didn’t know about
<a href="https://github.com/rodjek/librarian-puppet">librarian-puppet</a> during the first early versions of Wirbelsturm, which
made it more difficult than necessary for Wirbelsturm users to keep their installations up to date.  In the beginning
they needed to change Puppet code in place, i.e. files checked into the Wirbelsturm git repo, so they would often run
into merge conflicts when pulling the latest upstream changes.  This unfortunate problem was resolved once I
introduced librarian-puppet.</li>
  <li>Speeding up local deployments.  If I recall correctly Mitchell Hashimoto – the creator of Vagrant – actually tried
parallel VM creation at some point but his (host) machine was completely overwhelmed by this, and the feature was not
introduced officially into Vagrant.  However, what is still possible is to perform the <em>provisioning</em> of booted VMs
in parallel.  But…the Puppet provisioner of Vagrant does not support that.  I therefore created a
<a href="https://github.com/miguno/wirbelsturm/blob/master/deploy">wrapper shell script</a> based on
<a href="https://github.com/joemiller/sensu-tests/blob/master/para-vagrant.sh">para-vagrant.sh</a> so that you can benefit from
faster local deployments when using Wirbelsturm.</li>
  <li>Adding support for Ansible turned out to be quick and easy, once I understood how to create
<a href="http://docs.ansible.com/intro_dynamic_inventory.html">dynamic inventory scripts</a>.  30 mins total.</li>
  <li>Automating the setup steps for Amazon AWS has been tricky.  Apart from so-so Vagrant support for AWS, there were a
couple of additional problems I ran into.  I remember
<a href="https://forums.aws.amazon.com/message.jspa?messageID=449984">issues with Amazon’s implementation of cloud-init</a> when
using custom AMIs, for instance.  Figuring out how to configure DNS in AWS (currently Wirbelsturm uses
<a href="http://aws.amazon.com/route53/">Amazon Route 53</a>) took some time.
Other tasks I remember include automatically creating restricted IAM users and tighter security groups.
I am still not perfectly happy with the Wirbelsturm user experience when deploying to AWS, and for a number of reasons
listed in the AWS-related documentation of Wirbelsturm a code refactoring may happen in the near future.</li>
  <li>After reading through the various issues listed above you may also understand now why at some point I decided to
postpone supporting any other operating system than the RHEL 6 OS family (which includes CentOS and Amazon Linux).
There were simply too many moving parts, and trying to tackle e.g. Debian/Ubuntu as well might have significantly
delayed the progress on Wirbelsturm.</li>
</ul>

<h1 id="lessons-learned-mistakes-made-along-the-way">Lessons learned: mistakes made along the way</h1>

<p>The wall of shame.  But hey, hindsight is 20/20.</p>

<ul>
  <li>Underestimating the amount of work it eventually took.  See the previous section, and even what I wrote there is not
the complete picture.  Now, thanks to good roadmap planning early adopters of Wirbelsturm were productive from
very early on, and a close feedback loop helped a lot to keep the project on track and moving in the right
direction.  Still the amount of work that actually needed to go into Wirbelsturm was significantly more than
anticipated.  It wasn’t as easy as going through
<a href="http://docs.vagrantup.com/v2/provisioning/puppet_apply.html">Vagrant’s Puppet provisioner documentation</a> and writing
a few lines of Puppet code.  In retrospect, knowing what I know today, Wirbelsturm could have been built <em>much</em>
faster though.</li>
  <li>Not realizing quickly enough how valuable it is to separate code from configuration data in Puppet manifests, using
<a href="http://docs.puppetlabs.com/hiera/1/">Hiera</a>.  Particularly because this is so second-nature when coding in “real”
programming languages instead of Puppet (which is a DSL on top of Ruby).  To my defense I can only say that my
hands-on knowledge of Puppet was very limited at the beginning, and I hadn’t even heard about Hiera (and a lot of
people I talked to didn’t use it).  In retrospect I should have spent more time up-front figuring out what the
Puppet ecosystem had in store to address the code-vs-data problem, because it was pretty obvious right from the
beginning that mixing the two would quickly lead to pain.</li>
  <li>Adding tests to Puppet modules <em>too soon</em> and <em>too late</em>.  At the beginning the Puppet modules were refactored a lot
in the quest to find a reasonably good coding style, writing idiomatic Puppet manifests, etc. – and here dragging
unit tests along the way turned out to be a chore and a waste of time.  So I stopped writing tests.  While that
decision was ok, I made the mistake of postponing the re-introduction of proper tests for too long once the code
across the modules became more stable.  Well, at least <a href="https://github.com/miguno/puppet-kafka">puppet-kafka</a> and
<a href="https://github.com/miguno/puppet-storm">puppet-storm</a> have a good base test setup now thanks to
<a href="https://github.com/garethr/puppet-module-skeleton">puppet-module-skeleton</a>, which means there isn’t any excuse left
to postpone adding meaningful tests.</li>
</ul>

<p>Of course there were more mistakes, but the ones above were the most noteworthy ones. :-)</p>

<h1 id="summary">Summary</h1>

<p>I am really happy that <a href="https://github.com/miguno/wirbelsturm">Wirbelsturm</a> is finally available as free and open source
software.  Hopefully it will help you to quickly get up and running with technologies such as Graphite, Kafka, Storm,
Redis, and ZooKeeper.  Enjoy!</p>

<p><strong>Update May 27, 2014:</strong>  If you want to build real-time data processing pipelines based on Kafka and Storm, you may be
interested in <a href="http://www.michael-noll.com/blog/2014/05/27/kafka-storm-integration-example-tutorial/">kafka-storm-starter</a>.  It contains code
examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+, while using Apache Avro as the data
serialization format.</p>

<h1 id="related-work">Related work</h1>

<p>The following projects are similar to Wirbelsturm:</p>

<ul>
  <li><a href="https://github.com/nathanmarz/storm-deploy">storm-deploy</a> – Deploys Storm clusters to AWS, by Nathan Marz, the
creator of Storm.  storm-deploy has been around for much longer than Wirbelsturm, so it might be more mature.  It is a
nice example of a deployment tool implemented in Clojure, using <a href="https://github.com/pallet/pallet">pallet</a> and
<a href="http://www.jclouds.org/">jclouds</a>.  Because of jclouds you should also be able to deploy to clouds other than AWS,
though I haven’t found examples or documentation references on how to do so.  (If you have pointers please let me
know.)  Unfortunately, its Clojure roots may make storm-deploy less popular within Operations teams, who typically
are more familiar with tools such as <a href="http://puppetlabs.com/">Puppet</a>, <a href="http://www.getchef.com/">Chef</a>, or
<a href="http://www.ansible.com/">Ansible</a>.  Also, storm-deploy seems to address only Storm deployments, and you require
additional tools to deploy any other infrastructure pieces that you require (or enhance storm-deploy).</li>
  <li><a href="https://github.com/nathanmarz/kafka-deploy">kafka-deploy</a> – Deploys Kafka to AWS, also by Nathan Marz.  It has the
same pros and cons as storm-deploy.  Unfortunately, kafka-deploy has not seen any updates in two years (since Feb 2012),
which is around the time it was originally published.</li>
</ul>

<p>Commercial Hadoop vendors have also begun to integrate Storm into their product offerings:</p>

<ul>
  <li><a href="http://hortonworks.com/hadoop/storm/">Apache Storm at Hortonworks</a> – Hortonworks are working on Storm support for
their product line.  In this context they have added Storm support to their so-called
<a href="http://hortonworks.com/sandbox/">Hortonworks Sandbox</a>, which is a self-contained virtual machine with Hadoop &amp; Co.
pre-configured.</li>
  <li>If I recall correctly <a href="http://www.mapr.com/">MapR</a> were also looking at integrating Storm into their platform, but
I could not find more concrete details apart from a few
<a href="http://www.mapr.com/blog/storm-is-gearing-up-to-join-the-apache-foundation">news articles and blog posts</a>.</li>
</ul>

<p>Another way of deploying Storm is via platforms such as
<a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">Hadoop YARN</a> and
<a href="https://mesos.apache.org/">Apache Mesos</a>:</p>

<ul>
  <li><a href="https://github.com/yahoo/storm-yarn">storm-on-yarn</a> – Enables Storm clusters to be deployed into machines managed
by Hadoop YARN. The project says it is still a work in progress.</li>
  <li><a href="https://github.com/nathanmarz/storm-mesos">storm-mesos</a> – Storm integration with the Mesos cluster resource manager.
The project says storm-mesos runs in production at Twitter.</li>
</ul>

<p>Lastly, there are also a few open source Puppet modules for Hadoop, Kafka, Storm, ZooKeeper &amp; Co.  I don’t want to give
a comprehensive overview of these modules in this post, but you can head over to places such as
<a href="https://forge.puppetlabs.com/">PuppetForge</a> and <a href="https://github.com/">GitHub</a> and take a look yourself.  Feel free to
drop those modules into Wirbelsturm and give them a go!</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Of Algebirds, Monoids, Monads, and other Bestiary for Large-Scale Data Analytics]]></title>
    <link href="http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-12-02T16:45:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics</id>
    <content type="html"><![CDATA[<p>Have you ever asked yourself what monoids and monads are, and particularly why they seem to be so attractive in the
field of large-scale data processing?  Twitter recently open-sourced <a href="https://github.com/twitter/algebird">Algebird</a>,
which provides you with a JVM library to work with such algebraic data structures.  Algebird is already being used in
Big Data tools such as <a href="https://github.com/twitter/scalding">Scalding</a> and
<a href="https://github.com/twitter/summingbird">SummingBird</a>, which means you can use Algebird as a mechanism to plug your
own data structures – e.g. Bloom filters, HyperLogLog – directly into large-scale data processing platforms such as
<a href="http://hadoop.apache.org/">Hadoop</a> and <a href="http://storm-project.net/">Storm</a>.  In this post I will show you how to get
started with Algebird, introduce you to monoids and monads, and address the question why you should get interested in
those in the first place.</p>

<!-- more -->

<h1 id="goal-of-this-article">Goal of this article</h1>

<p>The main goal of this article is to spark your <em>curiosity</em> and <em>motivation</em> for
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">Algebird</a>
and the concepts of monoids, monads, and category theory in general.  In other words, I want to address the questions
<em>“What’s the big deal?  Why should I care?  And how can these theoretical concepts help me in my daily work?”</em></p>

<p>While I will explain a little bit what the various concepts such as monoids are, this is not the focus of this post.
If in doubt I will rather err on the side of grossly oversimplifying a topic to get the point across even at the
expense of correctness.  There are much better resources available online and offline that can teach you the full
details of the various items I will discuss here.  That being said, I compiled a list of references at the end of this
article so that you have a starting point to understand the following concepts in full detail, and with more accurate
and thorough explanations than I could come up with.</p>

<h1 id="motivating-example">Motivating example</h1>

<h2 id="a-first-look-at-algebird">A first look at Algebird</h2>

<p>Here is a simple example of what you can do with monoids and monads, based on the starter example in
<a href="https://github.com/twitter/algebird/">Algebird</a>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">20</span><span class="o">)</span>
</span><span class="line"><span class="n">res1</span><span class="k">:</span> <span class="kt">com.twitter.algebird.Max</span><span class="o">[</span><span class="kt">Int</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Alternative, Java-like (read: ugly) syntax for readers unfamiliar with Scala.</span>
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">10</span><span class="o">).+(</span><span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)).+(</span><span class="nc">Max</span><span class="o">(</span><span class="mi">20</span><span class="o">))</span>
</span><span class="line"><span class="n">res2</span><span class="k">:</span> <span class="kt">com.twitter.algebird.Max</span><span class="o">[</span><span class="kt">Int</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>What is happening here?  Basically, we are boxing three numbers, the <code>Int</code> values <code>10</code>, <code>30</code>, and <code>20</code>, into <code>Max</code>, and then
we are “adding” them.  The behavior of <code>Max[T]</code> turns the <code>+</code> operator into a function that returns the largest boxed
<code>T</code>.</p>
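
<p>To get an intuition for what <code>Max[T]</code> is doing, here is a minimal sketch of the underlying idea in plain
Scala.  Note that <code>MyMax</code> is a hypothetical stand-in I made up for illustration – it is <em>not</em> Algebird’s
actual implementation, which additionally wires the <code>+</code> into a proper semigroup/monoid type class:</p>

```scala
// Hypothetical sketch of the idea behind Algebird's Max -- not the real code.
// "Adding" two boxed values keeps the larger one; + is associative, which is
// what makes this a semigroup (and, with an identity element, a monoid).
case class MyMax[T](get: T)(implicit ord: Ordering[T]) {
  def +(that: MyMax[T]): MyMax[T] =
    if (ord.gteq(get, that.get)) this else that
}

val biggest = MyMax(10) + MyMax(30) + MyMax(20)
println(biggest.get) // prints 30
```

<p>Because <code>+</code> is associative, the three values can be combined in any grouping – which is exactly the property
that lets frameworks like Hadoop or Storm merge partial results computed on different machines.</p>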

<p>Conceptually this is similar to the following native Scala code:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// This is native Scala.</span>
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="mi">10</span> <span class="n">max</span> <span class="mi">30</span> <span class="n">max</span> <span class="mi">20</span>
</span><span class="line"><span class="n">res3</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">30</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Alternative, Java-like syntax.</span>
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="mf">10.</span><span class="n">max</span><span class="o">(</span><span class="mi">30</span><span class="o">).</span><span class="n">max</span><span class="o">(</span><span class="mi">20</span><span class="o">)</span>
</span><span class="line"><span class="n">res4</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">30</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>At this point you may ask, “Alright, what is the big deal?  The native Scala example looks actually better!”</p>

<p>At least, that is what I thought myself at first.  But the simplicity of this example is deceptive.  There is a lot
more to it than meets the eye.</p>

<h2 id="beyond-trivial-examples">Beyond trivial examples</h2>

<p>Admittedly, the first example used a very dull data structure, <code>Int</code>.  Any programming language comes with built-in
functionality to add two integers, right?  So you would hardly be convinced of the value of a tool like Algebird if all
it allowed you to do was <code>4 + 3 = 7</code>, particularly when doing those simple things would require you to understand
sophisticated concepts such as monoids and monads.  Too much effort for too little value, I would say!</p>

<p>So let me use a different example because adding <code>Int</code> values is indeed trivial.  Imagine that you are working on
large-scale data analytics that make heavy use of <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a>.  Your
applications are based on highly-parallel tools such as Hadoop or Storm, and they create and work with many such Bloom
filters in parallel.  Now the money question is: <em>How do you combine or add two Bloom filters in an easy way?</em>
(This is where monoids come into play.)</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">first</span> <span class="k">=</span> <span class="nc">BloomFilter</span><span class="o">(...)</span>
</span><span class="line"><span class="k">val</span> <span class="n">second</span> <span class="k">=</span> <span class="nc">BloomFilter</span><span class="o">(...)</span>
</span><span class="line"><span class="n">first</span> <span class="o">+</span> <span class="n">second</span> <span class="o">==</span> <span class="n">uh</span><span class="o">?</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>And what about performing other operations on those Bloom filter instances, notably <em>data processing pipelines</em> based on
common functions such as <code>map</code>, <code>flatMap</code>, <code>foldLeft</code>, <code>reduceLeft</code>?  (And this is where monads come into play.)</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">val</span> <span class="n">filters</span> <span class="k">=</span> <span class="nc">Seq</span><span class="o">[</span><span class="kt">BloomFilter</span><span class="o">](...)</span>
</span><span class="line"><span class="k">val</span> <span class="n">summary</span> <span class="k">=</span> <span class="n">filters</span> <span class="n">flatMap</span> <span class="o">{</span> <span class="cm">/* magic happens here */</span> <span class="o">}</span> <span class="n">reduceLeft</span> <span class="o">{</span> <span class="cm">/* more magic */</span> <span class="o">}</span> <span class="o">...</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>And what about combining two
<a href="http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/">HyperLogLog</a>
instances?</p>

<p>Intuitively we could say that the general idea of “adding” two Bloom filters is quite similar to how we would add two
sets <em>A</em> and <em>B</em>, where adding would mean creating the union set of <em>A</em> and <em>B</em>.</p>
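To make that intuition concrete, here is a plain-Scala sketch of “adding as union”.  Ordinary <code>Set</code>s stand in for Bloom filters here (an assumption for illustration only; real Bloom filters would combine their bit vectors in a similarly associative fashion):

```scala
// "Adding" set-like summaries means taking their union.
// Plain Sets stand in for Bloom filters in this sketch.
val filters = Seq(Set("alice"), Set("bob"), Set("alice", "carol"))

// Combine many summaries pairwise, just like summing a list of Ints.
val combined = filters.reduceLeft(_ union _)

assert(combined == Set("alice", "bob", "carol"))
```

The key point is that <code>union</code> behaves like <code>+</code> does for numbers: it takes two values of the same type and yields another value of that type.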

<p>Now Algebird addresses this problem of abstraction.  In a nutshell, if you can turn a data structure into a monoid
(or semigroup, or …), then Algebird allows you to put it to good use.  You can then work with your data structure
just as nicely as you are so used to when dealing with <code>Int</code>, <code>Double</code> or <code>List</code>.  And you can use it with large-scale
data processing tools such as Hadoop and Storm, too.</p>

<h2 id="wait-a-minute">Wait a minute!</h2>

<p>In case you are asking yourself the following question (which I did):  Is the magic of Algebird simply something like a
custom <code>Max[Int]</code> class that defines a <code>+()</code> method, similar to the following snippet but actually with a bounded
type parameter <code>T : Ordering[T]</code>?  (If you do not understand the latter, take a look at
<a href="http://stackoverflow.com/questions/17597961">this StackOverflow thread</a>.)</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Is Algebird implemented like this? (hint: nope)</span>
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="k">case</span> <span class="k">class</span> <span class="nc">Max</span><span class="o">(</span><span class="k">val</span> <span class="n">i</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="o">{</span> <span class="k">def</span> <span class="o">+(</span><span class="n">that</span><span class="k">:</span> <span class="kt">Max</span><span class="o">)</span> <span class="k">=</span> <span class="k">if</span> <span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="n">i</span> <span class="o">&gt;=</span> <span class="n">that</span><span class="o">.</span><span class="n">i</span><span class="o">)</span> <span class="k">this</span> <span class="k">else</span> <span class="n">that</span> <span class="o">}</span>
</span><span class="line"><span class="n">defined</span> <span class="k">class</span> <span class="nc">Max</span>
</span><span class="line">
</span><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">10</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">20</span><span class="o">)</span>
</span><span class="line"><span class="n">res5</span><span class="k">:</span> <span class="kt">Max</span> <span class="o">=</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">30</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The answer is yes and no.  “Yes” because it <em>is</em> similar.  And “no” because the implementation is quite different from
the above analogy, and provides you with significantly more algebra-fu (but again, it has the same spirit).</p>

<h2 id="what-we-want-to-do">What we want to do</h2>

<p>Our goal in this post is to build a data structure <code>TwitterUser</code> accompanied by a <code>Max[TwitterUser]</code> monoid view of it.
We want to use the two to implement the analytics of a fictional popularity contest on Twitter, like so:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Let&#39;s have a popularity contest on Twitter.  The user with the most followers wins!</span>
</span><span class="line"><span class="k">val</span> <span class="n">barackobama</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;BarackObama&quot;</span><span class="o">,</span> <span class="mi">40267391</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">katyperry</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;katyperry&quot;</span><span class="o">,</span> <span class="mi">48013573</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">ladygaga</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;ladygaga&quot;</span><span class="o">,</span> <span class="mi">40756470</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">miguno</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;miguno&quot;</span><span class="o">,</span> <span class="mi">731</span><span class="o">)</span> <span class="c1">// I participate, too.  Olympic spirit!</span>
</span><span class="line"><span class="k">val</span> <span class="n">taylorswift</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;taylorswift13&quot;</span><span class="o">,</span> <span class="mi">37125055</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">winner</span><span class="k">:</span> <span class="kt">Max</span><span class="o">[</span><span class="kt">TwitterUser</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">(</span><span class="n">barackobama</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">katyperry</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">ladygaga</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">miguno</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">taylorswift</span><span class="o">)</span>
</span><span class="line"><span class="n">assert</span><span class="o">(</span><span class="n">winner</span><span class="o">.</span><span class="n">get</span> <span class="o">==</span> <span class="n">katyperry</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Figuring out how to do this with monoids, monads, and Algebird is the objective of this article.</p>

<p>Of course, instead of using Algebird and monoids we could also project the number-of-followers field from each user
and perform any such analytics directly on the <code>Int</code> values.  That’s not the point however.  I intentionally wanted a
very simple example use case because, as you will see, there is so much to understand about what’s going on behind the
scenes that any further distraction should be avoided.  At least, that was my personal experience. :-)</p>
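As a teaser, here is one way such a <code>Max</code> could be sketched in self-contained, plain Scala using an <code>Ordering</code>.  This is explicitly <em>not</em> how Algebird does it, and not how the rest of this article will build it; the names simply mirror the snippet above:

```scala
// A self-contained sketch, NOT Algebird's implementation:
// a Max wrapper over any type that has an Ordering.
case class TwitterUser(screenName: String, numFollowers: Int)

// Users are compared by their follower count.
implicit val byFollowers: Ordering[TwitterUser] = Ordering.by(_.numFollowers)

case class Max[T](get: T) {
  def +(that: Max[T])(implicit ord: Ordering[T]): Max[T] =
    if (ord.gteq(this.get, that.get)) this else that
}

val winner = Max(TwitterUser("katyperry", 48013573)) +
  Max(TwitterUser("miguno", 731)) +
  Max(TwitterUser("BarackObama", 40267391))
assert(winner.get.screenName == "katyperry")
```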

<h1 id="my-journey-down-the-rabbit-hole">My journey down the rabbit hole</h1>

<p><em>This section is more for entertainment.  Feel free to skip it.</em></p>

<h2 id="how-this-post-started">How this post started</h2>

<p>I am following a few Twitter folks on, well, Twitter such as Dmitriy Ryaboy
(<a href="https://twitter.com/squarecog">@squarecog</a>) and Oscar Boykin (<a href="https://twitter.com/posco">@posco</a>).  And lately
they talked a lot about how data analytics at Twitter is powered by “monoids” and “monads”, and how tools such as
<a href="https://github.com/twitter/algebird/">Algebird</a> and <a href="https://github.com/twitter/scalding">Scalding</a> form the
code foundation of their analytics infrastructure.</p>

<p>Here is an example of such a conversation:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/twitter-monad-conversation.png" title="Monads. Monads everywhere!" /></p>

<p>(Link to full image <a href="http://makeameme.org/media/created/Monads-monads-everywhere.jpg">“Monads, Monads Everywhere!”</a>)</p>

<p>A <em>mo</em>-what?  And how come those things are apparently spreading like a contagious disease throughout their data
analytics code?</p>

<p>Another trigger was a discussion involving Ted Dunning of MapR (<a href="https://twitter.com/ted_dunning">@ted_dunning</a>) and
his work on a new data structure called <a href="https://github.com/tdunning/t-digest">t-digest</a>:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/twitter-associative-conversation.png" title="Is t-digest associative?" /></p>

<p>Why was Ted being asked whether <code>t-digest</code> is associative?  And how does all this relate to semigroups and monoids?  And
finally, what the heck are semigroups in the first place?</p>

<p>Now a dangerous series of events began to take place on my side.</p>

<p>First I thought, <em>“Hey, coincidentally I have started to pick up Scala around a month ago.  Given that Algebird is</em>
<em>written in Scala this might turn into an interesting finger exercise.”</em> (Note my focus on “finger exercise”.)
On top of that I knew that the use of Algebird extends to other interesting big data tools such as Storm and Scalding,
so it could turn out that I would not only learn something for learning’s sake but that I could put it to practical use
in my daily work, too.  The combination of these two factors – general interest and practical applicability –
eventually caused me to give in to my curiosity and decide to put “an hour or two” aside to read up on those monoid
thingies and figure out whether and how I could leverage Algebird for my own purposes.</p>

<p>You might notice at this point that it all started quite innocently.  But what I did not realize at that moment was that
I was opening <a href="https://en.wikipedia.org/wiki/Pandora%27s_box">Pandora’s box</a> on an otherwise quiet and peaceful
Swiss weekend…</p>

<h2 id="scala-functors-monoids-monads-category-theory-implicits-type-classes-aaargh">Scala, functors, monoids, monads, category theory, implicits, type classes, aaargh!</h2>

<p>What started as a seemingly innocent journey down a calm park lane quickly turned into the opening of the gates of
functional programming and category theory hell.  Not only did I struggle to understand what functors, semigroups,
monoids, and other algebraic structures that only a mother could love actually are.  No, on top of that I quickly
realized that how these things can be implemented in Scala in general and in Algebird in particular meant I had to take
my beginner Scala-fu to a whole new level.  In the end it took me the full weekend to grasp all those concepts to the
point where I’d say right now that I know enough to be dangerous.</p>

<p>The learning curve reminded me a lot of the following famous picture:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/editor-learning-curve.png" title="Learning curve for some common editors" /></p>

<div class="caption">
Figure 1: Learning curve for some common editors.  Image courtesy of <a href="http://josemdev.com/2012/11/learning-vim-introduction/">Jose M. Gilgado</a>.
</div>

<p>And it did feel like the <code>vi</code> curve – the brick wall experience.  What else could it be, right?  That being said I
still fear that, after having hit and finally made it over that initial brick wall, it may still spiral out of control
again like the Emacs curve. :-)</p>

<p>Picture me sitting in front of my keyboard, frantically interacting with my favorite search engine, StackOverflow,
Wikipedia, the usual suspects of Scala books, and what not:</p>

<p>Me: “What is a monad?”</p>

<p><em>Internet: “A <a href="http://stackoverflow.com/questions/3870088">monad is just a monoid in the category of endofunctors</a>.”</em></p>

<p>Me: “Hmm, ok.  So what is a monoid?”</p>

<p><em>Internet: “A monoid is a semigroup with identity.”</em></p>

<p>Me: “Then what is a semigroup??” (number of question marks increases with anxiety level)</p>

<p><em>Internet: “An algebraic structure consisting of a set together with an associative binary operation.”</em></p>

<p>Me: “Alright, I see the mathematical definition and I do see a soup of Greek letters.  Still, what <em>is</em> it?
<a href="http://codahale.com/downloads/email-to-donald.txt">Where can I get one from, and what can I use it for?</a>”</p>

<p><em>Internet: “Here is an example in the Haskell programming language.”</em></p>

<p>Me: &lt;censored&gt;</p>

<p>On a more serious note, the past few days have really been a tour de force where I felt I would recursively dive from
one new term or concept into yet more new terms and concepts, to the point where my brain would run into a stack
overflow.  <em>“Why am I actually reading about <a href="https://en.wikipedia.org/wiki/Magma_%28algebra%29">magmas</a>, or co- and</em>
<em>contra-variance in Scala, or bounded type parameters?  What was the original question I tried to find an answer for?”</em></p>

<p>To make a long story short I was really deep down the rabbit hole, with no Alice in sight but fully surrounded by
semigroups of monoidal and diabolical <a href="http://en.wikipedia.org/wiki/Jabberwocky">jabberwockies</a> on a big night out.
Given the questions, comments and blog posts of other folks at least I found consolation in the fact that I was
apparently not alone.</p>

<p>And, finally, at the end of the hole there was a bit of light.  In the next sections I want to share what I have learned
so far in the hope that it will prove helpful for you, too.  We start with a brief introduction to monoids and monads,
followed by how to apply what we have learned in Algebird hands-on.</p>

<h1 id="the-tldr-version-of-monoids-and-monads">The TL;DR version of monoids and monads</h1>

<blockquote><p>A monad is a monoid where you blend the &#8220;oi&#8221; into an &#8220;a&#8221;.  Depending on your typesettings (pun intended) this blend will be easier or harder for you to see.  If in doubt, squint more.</p><footer><strong>Michael&#8217;s abridged relation of monoids and monads</strong></footer></blockquote>

<p>As a grossly simplified rule of thumb:</p>

<ol>
  <li><strong>Monoid</strong>: If you want to “attach” <em>operations</em> such as <code>+</code>, <code>-</code>, <code>*</code>, <code>/</code> or <code>&lt;=</code> to <em>data objects</em> – say, adding
two Bloom filters – then you want to provide <em>monoid</em> forms for those data objects (e.g. a monoid for your Bloom
filter data structure).  This way you can combine and juggle your custom data structures just like you would do with
plain integer numbers.</li>
  <li><strong>Monad</strong>: If you want to create <em>data processing pipelines</em> that turn data objects step-by-step into the desired,
final output (e.g. aggregating raw records into summary statistics), then you want to build one or more <em>monads</em> to
model these data pipelines.  Particularly if you want to run those pipelines in large-scale data processing platforms
such as Hadoop or Storm.</li>
</ol>
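The two rules of thumb can be illustrated with built-in types.  This is a hand-wavy analogy in plain Scala, using <code>Set</code> union for the monoid style and <code>List</code>&#8217;s <code>flatMap</code>/<code>map</code> for the monad style:

```scala
// Monoid-style: combining values with an associative binary operation.
val totals = Seq(Set(1), Set(2), Set(2, 3)).reduceLeft(_ union _)
assert(totals == Set(1, 2, 3))

// Monad-style: a step-by-step pipeline that transforms data via flatMap/map.
val pipeline = List(1, 2, 3).flatMap(n => List(n, n * 10)).map(_ + 1)
assert(pipeline == List(2, 11, 3, 21, 4, 31))
```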

<p>The intent of this section is to give you a high-level idea of what those concepts are, and what you can use them for.
That is, this section should help you determine whether you want to venture down the rabbit hole, too.</p>

<p>I did not want to add yet another variant to the pool of “what is a monoid/monad” articles, but at the same time I felt
I needed to explain at least very briefly what the various concepts are (as well as I can) so that you can better
understand how to use a tool such as Algebird.</p>

<p>Of course, if you run across a blatant mistake on my side, please do let me know!</p>

<h2 id="monoids">Monoids</h2>

<h3 id="what-is-a-monoid">What is a monoid?</h3>

<p>A monoid is a structure that consists of:</p>

<ol>
  <li>a set of objects (such as numbers)</li>
  <li>a binary operation as a method of combining them (such as adding those numbers)</li>
</ol>

<p>The small catch is that the way you can combine the objects in your set must adhere to a few rules, which are described
in the next section.</p>

<p>One way to explain a monoid in the context of programming is as a kind of <em>adapter</em> or <em>bounded view</em> of a type <code>T</code>.
Imagine a data structure of type <code>T</code> – say, a <code>List</code>.  If you can use <code>T</code> in a way that conforms to the
monoid laws (see next section), then you can say “type T forms a monoid” <code>Monoid[T]</code>;  for instance, if the binary
operation you picked behaves like the concept of addition, you have an additive monoid view of <code>T</code>.</p>

<div class="note">
Note: What I tried to highlight in the previous paragraph is that a given type <tt>T</tt> can have multiple monoidal
forms.  An additive monoid of <tt>T</tt> is just an example, and <tt>T</tt> might have more monoids than the additive
variant.  Also &#8211; sorry for the forward reference &#8211; a type <tt>T</tt> can form both a monoid <em>and</em> a monad.
One such dual-headed hydra is the well-known <tt>List</tt>.
</div>

<p>So you can read <code>Monoid[T]</code> as <em>“T looks like a monoid and quacks like a monoid, so it must be a monoid”</em>.  This notion
is related to the concept of <a href="http://en.wikipedia.org/wiki/Duck_typing">duck typing</a> in languages such as Python.
Scala, in which Algebird is implemented, has a static type system though, so to get such ad-hoc polymorphism we
typically use
<a href="http://danielwestheide.com/blog/2013/02/06/the-neophytes-guide-to-scala-part-12-type-classes.html">type classes</a>
to achieve a similar effect.  A nifty feature of type classes is that they allow you to retroactively add polymorphism
even to existing types that are not under your own control: examples are <code>Seq</code> or <code>List</code>, which are provided by the
Scala standard library and thus not under your control.</p>
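Here is a tiny sketch of that retroactive aspect, using my own minimal <code>Monoid</code> trait (not Algebird&#8217;s richer one): the standard library&#8217;s <code>Int</code> gains a monoid &#8220;view&#8221; without being modified:

```scala
// A minimal type class (a sketch, not Algebird's actual Monoid).
trait Monoid[T] {
  def e: T                 // identity element
  def op(a: T, b: T): T    // associative binary operation
}

// Retroactively attach a monoid view to the existing, unmodified Int type.
implicit val intAddition: Monoid[Int] = new Monoid[Int] {
  def e = 0
  def op(a: Int, b: Int) = a + b
}

// Generic code that works for ANY type with a Monoid instance in scope.
def combineAll[T](xs: Seq[T])(implicit m: Monoid[T]): T =
  xs.foldLeft(m.e)(m.op)

assert(combineAll(Seq(1, 2, 3)) == 6)
assert(combineAll(Seq.empty[Int]) == 0) // identity handles the empty case
```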

<p><img src="http://www.michael-noll.com/blog/uploads/monoid-illustration.png" title="Monoid illustration" /></p>

<div class="caption">
Figure 2: A monoid seen as a bounded view.  In this analogy we are looking at the original type <tt>T</tt> from a
different, &#8220;monoidal angle&#8221;.  Here, we are combining two values of type <tt>T</tt> under the laws of the pink-colored
monoid view of <tt>T</tt> (whatever this particular monoid might actually be doing).
</div>

<h3 id="monoids-in-more-detail">Monoids in more detail</h3>

<p>A monoid is a set of objects, <code>T</code>, together with a binary operation <tt>⋅</tt> that satisfies the three axioms
listed below.</p>

<p>One way to express a monoid in Scala would be the following trait, used as a type class:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Important: What you see here is only part of the contract.</span>
</span><span class="line"><span class="c1">// The monoid, and thus `e` and `op`, must also adhere to the monoid laws.</span>
</span><span class="line"><span class="k">trait</span> <span class="nc">Monoid</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">def</span> <span class="n">e</span><span class="k">:</span> <span class="kt">T</span>
</span><span class="line">  <span class="k">def</span> <span class="n">op</span><span class="o">(</span><span class="n">a</span><span class="k">:</span> <span class="kt">T</span><span class="o">,</span> <span class="n">b</span><span class="k">:</span> <span class="kt">T</span><span class="o">)</span><span class="k">:</span> <span class="kt">T</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<dl>
  <dt>Closure</dt>
  <dd>
    For all <em>a, b</em> in <em>T</em>, the result of the operation <em>a &sdot; b</em> is also in <em>T</em>:

    $$
    \forall a,b \in T: a \bullet b \in T
    $$

    In Scala, we could express this axiom with the following function signature for <tt>&sdot;</tt>:
    <tt>def op(a: T, b: T): T</tt>
  </dd>
  <dt>Associativity</dt>
  <dd>
    For all <em>a</em>, <em>b</em>, and <em>c</em> in <em>T</em>,
    the equation <em>(a &sdot; b) &sdot; c = a &sdot; (b &sdot; c)</em> holds:

    $$
    \forall a,b,c \in T: (a \bullet b) \bullet c = a \bullet (b \bullet c)
    $$

    In Scala, we could express this axiom with:
    <tt>(a op b) op c == a op (b op c)</tt>
  </dd>
  <dt>Identity element</dt>
  <dd>
    There exists an element <em>e</em> (we could also call it <em>zero</em> to draw a link to addition) in <em>T</em>,
    such that for all elements <em>a</em> in <em>T</em>, the equation <em>e &sdot; a = a &sdot; e = a</em> holds:

    $$
    \exists e \in T: \forall a \in T: e \bullet a = a \bullet e = a
    $$

    In Scala, we could express this axiom with the following, which as you might note captures the idea of a
    <a href="http://en.wikipedia.org/wiki/NOP">no-op</a>: <tt>e op a == a op e == a</tt>
  </dd>
</dl>
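For a concrete instance, the three axioms can be spot-checked with string concatenation (closure is already guaranteed by the function&#8217;s type signature):

```scala
// Spot-checking the monoid laws for string concatenation.
val e = ""                                    // identity element
def op(a: String, b: String): String = a + b  // closure: always returns a String

val (a, b, c) = ("foo", "bar", "baz")
assert(op(op(a, b), c) == op(a, op(b, c)))    // associativity
assert(op(e, a) == a && op(a, e) == a)        // identity element is a no-op
```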

<div class="note">
Note: <em>Any</em> binary operation satisfying the three axioms above qualifies your data structure to be a monoid.
It does not necessarily need to be an addition-like operation.
</div>

<p>Before we move on and look at examples of monads, I want to mention one more thing about the binary function of a
monoid.  We have learned that it must be <a href="http://en.wikipedia.org/wiki/Associative_property">associative</a>.  Wouldn’t it
be helpful if the binary function were <a href="http://en.wikipedia.org/wiki/Commutative_property">commutative</a>, too, even
though this optional feature would not be required to make a monoid?</p>

<p>Here is a transcribed reply from Sam Ritchie’s SummingBird talk at CUFP:</p>

<blockquote><p>Question: Associativity is one nice thing about monoids, but what about commutativity [which] is also important.  Are there examples of non-commutative datastructures</p><p>Answer: It should be baked into the algebra (non-commutativity). This helps with data skew in particular.  An important non-commutative application is Twitter itself!  When you want to build the List monoid, the key is <tt>userid,time</tt> and the value is the list of tweets over that timeline (so ordering matters here).  It&#8217;s not good to get a non-deterministic order when building up these lists in parallel, so that’s a good example of when associativity and commutativity are both important.</p><footer><strong>Transcript of Sam Ritchie&#8217;s SummingBird talk at CUFP 2013</strong> <cite><a href="http://www.syslog.cl.cam.ac.uk/2013/09/22/liveblogging-cufp--2013/">www.syslog.cl.cam.ac.uk/2013/09/&hellip;</a></cite></footer></blockquote>

<h3 id="what-are-example-monoids">What are example monoids?</h3>

<ul>
  <li><em>Numbers</em> (= the set of objects) you can <em>add</em> (= the method of combining them).
    <ul>
      <li>For integer addition, <code>e == 0</code> and <code>op == +</code>.</li>
      <li>For integer multiplication, <code>e == 1</code> and <code>op == *</code>.</li>
    </ul>
  </li>
  <li><em>Lists</em> you can <em>concatenate</em>.
    <ul>
      <li>With <code>e == Nil</code> and <code>op == concat</code>.</li>
    </ul>
  </li>
  <li><em>Sets</em> you can <em>union</em>.
    <ul>
      <li>With <code>e == Set()</code> and <code>op == union</code>.</li>
    </ul>
  </li>
</ul>

<p>There are more and also more sophisticated examples, of course.  <code>Max[Int]</code> at the beginning of this article is a
monoid, too.</p>
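The <code>e</code>/<code>op</code> pairs listed above can be verified directly with built-in types:

```scala
// The identity element e really is a no-op for each example monoid.
assert(0 + 42 == 42)                           // integer addition: e == 0
assert(1 * 42 == 42)                           // integer multiplication: e == 1
assert(List() ++ List(1, 2) == List(1, 2))     // list concatenation: e == Nil
assert(Set.empty[Int].union(Set(1)) == Set(1)) // set union: e == Set()
```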

<p>Here is how Algebird defines an <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monoid.scala">additive monoid for the standard type <code>Seq</code></a>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// A `Seq` concatenation monoid.</span>
</span><span class="line"><span class="c1">// Plus (the `op`) means concatenation,</span>
</span><span class="line"><span class="c1">// zero (the identity element `e`) is the empty Seq.</span>
</span><span class="line"><span class="k">class</span> <span class="nc">SeqMonoid</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="nc">extends</span> <span class="nc">Monoid</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">override</span> <span class="k">def</span> <span class="n">zero</span> <span class="k">=</span> <span class="nc">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">]()</span>
</span><span class="line">  <span class="k">override</span> <span class="k">def</span> <span class="n">plus</span><span class="o">(</span><span class="n">left</span> <span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">],</span> <span class="n">right</span> <span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">=</span> <span class="n">left</span> <span class="o">++</span> <span class="n">right</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Make an instance of `SeqMonoid` available as an implicit value.</span>
</span><span class="line"><span class="c1">// This is a Scala-specific implementation action that needs to be done,</span>
</span><span class="line"><span class="c1">// i.e. it is not related to the abstract concept of monoids.</span>
</span><span class="line"><span class="c1">//</span>
</span><span class="line"><span class="c1">// The effect of this statement is to add the &quot;monoid view&quot; of Seq</span>
</span><span class="line"><span class="c1">// as defined above to all `Seq` instances in the code.  If you</span>
</span><span class="line"><span class="c1">// define your own monoid for a type `T` in Algebird and forget</span>
</span><span class="line"><span class="c1">// this statement, Algebird will complain with the following</span>
</span><span class="line"><span class="c1">// @implicitNotFound error message:</span>
</span><span class="line"><span class="c1">//</span>
</span><span class="line"><span class="c1">//   &quot;Cannot find Monoid type class for T&quot;</span>
</span><span class="line"><span class="c1">//</span>
</span><span class="line"><span class="c1">// Implicits need to be used because this is how the notion of</span>
</span><span class="line"><span class="c1">// type classes is implemented in Scala.</span>
</span><span class="line"><span class="k">implicit</span> <span class="k">def</span> <span class="n">seqMonoid</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">Seq</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SeqMonoid</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Algebird actually includes a few more methods for the <code>Monoid[T]</code> type class – which <code>SeqMonoid[T]</code> extends – but the
key functionality is shown above.</p>

<h3 id="what-can-i-use-a-monoid-for--why-should-i-look-for-one">What can I use a monoid for?  Why should I look for one?</h3>

<p>Whenever you have a data structure (which backs your “set of objects”, e.g. the <code>Int</code> data structure or the <code>List[T]</code>
data structure), you can check whether you can define one or more monoids for that data structure.  To do so, you look
for operations you can perform on any two instances of your data structure that satisfy the three
<a href="https://en.wikipedia.org/wiki/Monoid">monoid axioms</a>: <em>closure</em>, <em>associativity</em>, and <em>identity element</em> (the latter
gives your monoid a no-op function, and is the one thing that turns a semigroup into a monoid).</p>

<p>If you do find any such monoids for your data structure, hooray!  On the practical side this means that you can
now use your data structure in any code that expects a monoid.  As I said above, you can think of a monoid as an adapter,
or shape, for (some monoid-compatible aspects of) your data structure that allows you to fit your data structure peg
into a monoid hole.  Some such holes are Twitter&#8217;s <a href="https://github.com/twitter/algebird/">Algebird</a>,
<a href="https://github.com/twitter/scalding">Scalding</a>, and <a href="https://github.com/twitter/summingbird">Summingbird</a>.
Being supported by those tools also means that you can now plug your data structure into big data analytics tools such
as Hadoop and Storm, which can be a huge selling point and productivity gain for your new data structure.</p>

<p><span class="pullquote-right" data-pullquote="If your data structure has a monoid form, this means you can plug the data structure directly into large-scale data processing platforms such as Hadoop and Storm.">
Secondly, and in more general terms, the <em>associativity</em> of monoid operations means that those
operations on your data structure <a href="http://en.wikipedia.org/wiki/Monoid#Monoids_in_computer_science">can be parallelized</a>
to utilize multiple CPU cores efficiently.  Speaking in code, that means you can run operations such as
<code>foldLeft()</code> and <code>reduceLeft()</code> on them.  And parallelization support is yet another reason why monoids (and monads) are
so attractive for big data tools such as Hadoop and Storm, where your code not only runs on many cores per machine but
on many such machines in a cluster.  In other words:
If your data structure has a monoid form, this means you can plug the data structure directly into large-scale data
processing platforms such as Hadoop and Storm.
Hence monoids enable you to <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> and to
<a href="http://en.wikipedia.org/wiki/Divide_and_conquer_algorithm">divide and conquer</a>.
</span></p>
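<p>To make this concrete, here is a small sketch of how associativity lets you split a fold into independent halves that could run on different cores or machines.  Note that the <code>Monoid</code> trait and the <code>sumAll()</code> helper below are hypothetical illustrations, not Algebird&#8217;s actual API:</p>

```scala
// Minimal, hypothetical monoid contract -- not Algebird's actual API.
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

object IntAddition extends Monoid[Int] {
  val zero = 0
  def plus(l: Int, r: Int) = l + r
}

// Because `plus` is associative, the two halves of the input can be
// combined independently (think: different cores, or different machines
// in a Hadoop/Storm cluster) and merged at the end.
def sumAll[T](xs: Seq[T])(m: Monoid[T]): T =
  if (xs.size <= 1) xs.foldLeft(m.zero)(m.plus)
  else {
    val (left, right) = xs.splitAt(xs.size / 2)
    m.plus(sumAll(left)(m), sumAll(right)(m))
  }

val total = sumAll(1 to 10)(IntAddition)  // 55, no matter how we split the input
```

<p>The key point is that <code>sumAll()</code> never relies on any property of <code>Int</code> beyond the monoid contract, which is exactly what lets frameworks such as Hadoop and Storm combine partial results in any grouping.</p>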

<p>Let me quote Sam Ritchie (<a href="https://twitter.com/sritchie">@sritchie</a>), former Twitter engineer and now founder of
<a href="http://www.paddleguru.com/">PaddleGuru</a> (cool idea, by the way – go sports!) for a very concrete practical application
of monoids at Twitter.  Well, actually I am quoting a transcript of his talk.</p>

<blockquote><p>One cool feature:  When you visit a tweet, you want the reverse feed of things that have embedded the tweet.  The MapReduce graph for this comes from: When you see an impression, find the key of the tweet and emit a tuple of the <tt>tweetId</tt> and <tt>Map[URL, Long]</tt>. Since Maps have a monoid, this can be run in parallel, and it will contain a list of who has viewed it and from where.  The <tt>Map</tt> has a <tt>Long</tt> since popular tweets can be embedded in millions of websites and so they use a &#8220;CountMinSketch&#8221; [Note: Reader Sam Bessalah points out that the transcript is wrong when it said &#8220;accountment sketch&#8221;.] which is an approximate data structure to deal with scale there. The Summingbird layer which the speaker [Sam Ritchie] shows on stage filters events, and generates key-value pairs and emits events.</p><p>Twitter advertising is also built on Summingbird.  Various campaigns can be built by building a backend using a monoid that expresses the needs, and then the majority of the work is on the UI work in the frontend (where it should be — remember, solve systems problems once is part of the vision).</p><footer><strong>Transcript of Sam Ritchie&#8217;s SummingBird talk at CUFP 2013</strong> <cite><a href="http://www.syslog.cl.cam.ac.uk/2013/09/22/liveblogging-cufp--2013/">www.syslog.cl.cam.ac.uk/2013/09/&hellip;</a></cite></footer></blockquote>

<p>See <a href="https://speakerdeck.com/sritchie/summingbird-at-cufp">his CUFP slides on Summingbird</a> for further detail.</p>

<p>Thirdly, you can <em>compose</em> monoids.  For instance, you can form the <em>product</em> of two monoids <code>M1</code> and <code>M2</code>, which is the
tuple type <code>(M1, M2)</code>.  This product is also a monoid.</p>
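<p>A quick sketch of this product construction, again with a hypothetical <code>Monoid</code> trait rather than Algebird&#8217;s actual classes: the zero of the product is the pair of zeros, and <code>plus</code> combines component-wise.</p>

```scala
// Hypothetical monoid contract for illustration.
trait Monoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

object IntAddition extends Monoid[Int] {
  val zero = 0
  def plus(l: Int, r: Int) = l + r
}

object StringConcat extends Monoid[String] {
  val zero = ""
  def plus(l: String, r: String) = l + r
}

// The product of two monoids M1 and M2 is again a monoid on (M1, M2):
// zero is the pair of zeros, and plus works component-wise.
class ProductMonoid[A, B](ma: Monoid[A], mb: Monoid[B]) extends Monoid[(A, B)] {
  val zero = (ma.zero, mb.zero)
  def plus(l: (A, B), r: (A, B)) = (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
}

val m = new ProductMonoid(IntAddition, StringConcat)
val combined = m.plus((1, "foo"), (2, "bar"))  // (3, "foobar")
```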

<p>Lastly, you can now combine your monoidal data structure with monads (see below) and benefit from all the features that those monads provide.</p>

<div class="note">
At this point you might guess the reason why Ted Dunning was asked whether the <tt>t-digest</tt> data structure he is
working on has an associative merge operation and can thus be turned into a semigroup or monoid.  One of my two mysteries solved!
</div>

<h2 id="monads">Monads</h2>

<h3 id="what-is-a-monad">What is a monad?</h3>

<div class="note">
Update: A few readers pointed out that this section explains rather what monads are <em>used for</em> than what they
really <em>are</em>.  I concur!  And I even skip a discussion of Monad laws etc. intentionally because the post is
already quite long, and the focus and motivation of this article (see above) is not an in-depth introduction to monoids
or monads.  It&#8217;s about the questions &#8220;Why should I be interested in the first place, and what can I use them for?&#8221;.
Of course I can understand the need for further details, so I added a list of references and literature to the end of
this article, which you can read at your leisure.  Of course if you think that some important piece of information
should be mentioned here directly (or something happens to be plain wrong), please let me know.  It&#8217;s difficult to write
an article about such a topic in a way that can be understood by beginners and at the same time also pleases the
experts.
</div>

<p>A monad is a structure that defines a way to combine <em>functions</em>.  It represents computations defined as a <em>sequence</em>
of transformations that turn an original input into a final output, one step at a time.  Think of them like function
chaining similar to <code>y = h(g(f(x)))</code>.</p>
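<p>In Scala, such a chain can be written either with plain function application or, equivalently, by boxing each intermediate result in a monad and letting <code>flatMap()</code> wire the steps together.  A small sketch using Scala&#8217;s built-in <code>Option</code>, which behaves as a monad:</p>

```scala
// Three processing steps, each returning its result boxed in Option.
val f = (x: Int) => Option(x + 1)
val g = (x: Int) => Option(x * 2)
val h = (x: Int) => Option(x.toString)

// Monadic equivalent of y = h(g(f(x))): flatMap chains the steps.
val y = Option(3).flatMap(f).flatMap(g).flatMap(h)
// y == Some("8")
```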

<p>An interesting aspect is that in the case of a monad the <em>type</em> of the value being piped through the function chain may
change along the way.  For instance, you may start with an <code>Int</code> but end up with a <code>Double</code> or <code>BloomFilter</code>.  This is
different from a monoid, which will always retain the original type because of the <em>closure</em> requirement (see monoid
laws above).</p>

<p>One of the best analogies for monads I found is the following, adapted from
<a href="https://en.wikipedia.org/wiki/Monad_%28functional_programming%29">Wikipedia</a>: You can compare monads to physical
assembly lines, where a conveyor belt (the monad) transports a piece of input material (the data) between functional
units (functions on the data) that transform the piece one step at a time.  Think of the skeleton of a car that is
turned into the final car in a sequence of steps.  Or of web server log files with raw data that is turned into business
information such as the increase of ad impressions in the EMEA market for this month.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/monad-function-pipeline.png" title="Monad data processing pipeline" /></p>

<div class="caption">
Figure 3: A monad seen as a data processing pipeline.  The monad <tt>M</tt> is used to turn the original input into the
final output one step at a time.
</div>

<p>Sticking with this analogy, a monad enables you to <em>decorate</em> each processing step in the assembly pipeline with
<em>additional context</em> (or an “environment”).  For instance, your monad could carry state information that is used by
the functions in the pipeline – this would be the example of a
<a href="https://en.wikibooks.org/wiki/Haskell/Understanding_monads/State">state monad</a>.
Alternatively, your monad could log what is going on before, within, or after a function to a file or database – this
would be the example of an <a href="https://en.wikibooks.org/wiki/Haskell/Understanding_monads/IO">I/O monad</a>.
If you are a game developer, you could use a monad to carry the representation and state of the game environment (such
as the current level), and the functions in the pipeline would model how players can interact with the environment.</p>

<p>Before we look at monads in more detail, let us take a brief detour to <a href="https://github.com/nathanmarz/storm">Storm</a>.
When you are implementing bolts in Storm – i.e. Storm’s version of the “functional units” in a data processing
pipeline – you will come across the <code>prepare()</code> and <code>execute()</code> methods
(see the <a href="https://github.com/nathanmarz/storm/wiki/Tutorial">Storm tutorial</a>):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">TripleBolt</span> <span class="kd">extends</span> <span class="n">BaseRichBolt</span> <span class="o">{</span>
</span><span class="line">  <span class="kd">private</span> <span class="n">OutputCollectorBase</span> <span class="n">collector</span><span class="o">;</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// Note how Storm provides &quot;context&quot; -- a literal context value</span>
</span><span class="line">  <span class="c1">// and a collector value -- to the bolt as the functional unit in</span>
</span><span class="line">  <span class="c1">// the data processing pipeline.</span>
</span><span class="line">  <span class="nd">@Override</span>
</span><span class="line">  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">prepare</span><span class="o">(</span><span class="n">Map</span> <span class="n">conf</span><span class="o">,</span> <span class="n">TopologyContext</span> <span class="n">context</span><span class="o">,</span> <span class="n">OutputCollectorBase</span> <span class="n">collector</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="k">this</span><span class="o">.</span><span class="na">collector</span> <span class="o">=</span> <span class="n">collector</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// This is Storm&#39;s version of a monad&#39;s `fn` function,</span>
</span><span class="line">  <span class="c1">// which we will discuss in the next section.</span>
</span><span class="line">  <span class="nd">@Override</span>
</span><span class="line">  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">input</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="kt">int</span> <span class="n">val</span> <span class="o">=</span> <span class="n">input</span><span class="o">.</span><span class="na">getInteger</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
</span><span class="line">    <span class="kt">int</span> <span class="n">tripled</span> <span class="o">=</span> <span class="n">val</span> <span class="o">*</span> <span class="mi">3</span><span class="o">;</span>
</span><span class="line">    <span class="n">collector</span><span class="o">.</span><span class="na">emit</span><span class="o">(</span><span class="n">input</span><span class="o">,</span> <span class="k">new</span> <span class="n">Values</span><span class="o">(</span><span class="n">tripled</span><span class="o">));</span>
</span><span class="line">    <span class="n">collector</span><span class="o">.</span><span class="na">ack</span><span class="o">(</span><span class="n">input</span><span class="o">);</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// ...rest omitted...</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Note how Storm provides environmental information and context to the bolt.  This is one example where you could point
your finger at the code and say, <em>“This would be a good place to use a monad.”</em>  In this specific case I would say it
would primarily be a kind of <em>I/O monad</em> because the <code>collector</code> instance allows the bolt to write its output to
downstream bolts via network communication.</p>

<h3 id="monads-in-more-detail">Monads in more detail</h3>

<p>Here is one way to capture the concept of a monad in Scala.  It is basically the same as the
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">definition of a monad in Algebird</a>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Important: What you see here is only part of the contract.</span>
</span><span class="line"><span class="c1">// The monad, and thus `apply` and `flatMap`, must also adhere to the monad laws.</span>
</span><span class="line"><span class="k">trait</span> <span class="nc">Monad</span><span class="o">[</span><span class="kt">M</span><span class="o">[</span><span class="k">_</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="c1">// Also called `unit` (in papers) or `return` (in Haskell).</span>
</span><span class="line">  <span class="k">def</span> <span class="n">apply</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">v</span><span class="k">:</span> <span class="kt">T</span><span class="o">)</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// Also called `bind` (in papers) or `&gt;&gt;=` (in Haskell).</span>
</span><span class="line">  <span class="k">def</span> <span class="n">flatMap</span><span class="o">[</span><span class="kt">T</span>, <span class="kt">U</span><span class="o">](</span><span class="n">m</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">])(</span><span class="n">fn</span><span class="k">:</span> <span class="o">(</span><span class="kt">T</span><span class="o">)</span> <span class="o">=&gt;</span> <span class="n">M</span><span class="o">[</span><span class="kt">U</span><span class="o">])</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">U</span><span class="o">]</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Alright, what is going on here?</p>

<p><code>apply()</code> boxes a <code>T</code> value into the monad <code>M[T]</code>.  For example, if <code>T</code> is an <code>Int</code>, the monad <code>M[T]</code> could be a <code>List[Int]</code>.
In other words, it is a good-ol&#8217; constructor for the monad.</p>

<p><code>flatMap()</code> turns a <code>T</code> into a potentially different type parameter <code>U</code> (though it can also be a <code>T</code> again) that is boxed
into the same type of monad <code>M</code>, i.e. <code>M[U]</code>.  In plain English, this means that if you have a List monad, all it will
ever produce for you is another List monad, but the type of the elements <em>in the List monad</em> may change.  The way this
happens is controlled by the second parameter of <code>flatMap()</code>, which is a function from <code>T</code> to <code>M[U]</code>.</p>

<p>For example, <code>T</code> is an <code>Int</code>, <code>U</code> is a <code>Double</code>, and <code>M</code> is a <code>List</code> monad;
<code>fn</code> is <code>(i: Int) =&gt; List(i.toDouble / 4, i.toDouble / 2)</code>, i.e. <code>T -&gt; M[U]</code>.
If you ran this combination over the input <code>List[Int](1, 2)</code>, you would get the output:
<code>List[Double](0.25, 0.5, 0.5, 1.0)</code>.</p>
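<p>You can verify this example directly in the Scala REPL with the built-in <code>flatMap()</code> of <code>List</code>:</p>

```scala
// fn maps each Int to a List of two Doubles, i.e. T -> M[U].
val fn = (i: Int) => List(i.toDouble / 4, i.toDouble / 2)

// flatMap applies fn to each element and concatenates the results:
// 1 -> List(0.25, 0.5), 2 -> List(0.5, 1.0)
val output = List(1, 2).flatMap(fn)
// output == List(0.25, 0.5, 0.5, 1.0)
```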

<div class="note">
Note how <tt>flatMap()</tt> provides the boxing <tt>M</tt> instance <tt>m</tt> of the input <tt>T</tt> value to the
function <tt>fn</tt> via currying.  This way <tt>fn</tt> may leverage information or functionality embedded in the
monad, including functions beyond the contractually required <tt>flatMap()</tt>.  One such example is
<tt>Monad[Some]</tt>, i.e. the
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">Some monad in Algebird</a>.
The <tt>flatMap()</tt> function of this monad calls <tt>Some#get()</tt>, which is a function of <tt>Some</tt> but not
of <tt>Monad[Some]</tt>.  As such a monad is also a kind of adapter or view, similar to the
way we described monoids above.  If you still cannot see how similar monoids and monads are, just try squinting harder!
</div>

<p>Similar to the monoid laws we discussed above, monads have their own laws – and these rules are actually very similar
to their monoid brethren!  I decided not to discuss monad laws in this post because I feel it is already very long.
I may update the post at a later point though.  In the meantime take a look at the following references:</p>

<ul>
  <li><a href="http://www.haskell.org/haskellwiki/Monad_laws">Monad laws</a> (in Haskell).  Remember that <code>return</code> in Haskell corresponds to our
constructor <code>apply()</code> in Scala, and <code>&gt;&gt;=</code> in Haskell is our <code>flatMap()</code>.</li>
  <li><a href="http://james-iry.blogspot.ch/2007/09/monads-are-elephants-part-1.html">Monads are elephants</a>, a series of blog posts
by James Iry.  In Scala.</li>
</ul>

<p>I hope you will notice their similarities:</p>

<ul>
  <li>The identity rules of monads are similar to the identity element <code>e</code> of monoids.</li>
  <li>Both monoids and monads have functions that must be associative.</li>
</ul>

<h3 id="what-are-example-monads">What are example monads?</h3>

<p>At the beginning of the section on monads I already mentioned the <em>state monad</em> and the <em>I/O monad</em>.</p>

<p>Well, this may still be a bit vague.  Let us look at a more concrete (and maybe simpler) example.  Any collection type
is typically a monad.  For example, take <code>List[T]</code>:</p>

<ul>
  <li>The constructor of <code>List[T]</code> acts as <code>unit</code> as it gives you a <code>List[T]</code> box for <code>T</code> instances.</li>
  <li><code>List</code> has an appropriate <code>flatMap()</code> function – and <code>map()</code>, which can be built from <code>flatMap()</code> and the
constructor.</li>
</ul>
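<p>The second bullet point can be sketched in a few lines: a hypothetical <code>mapViaFlatMap()</code> helper that rebuilds <code>map()</code> from nothing but <code>flatMap()</code> and the constructor.</p>

```scala
// map() expressed through flatMap() plus the List constructor:
// box each transformed element in a singleton List, then flatten.
def mapViaFlatMap[T, U](xs: List[T])(f: T => U): List[U] =
  xs.flatMap(t => List(f(t)))

val doubled = mapViaFlatMap(List(1, 2, 3))(_ * 2)
// doubled == List(2, 4, 6), same as List(1, 2, 3).map(_ * 2)
```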

<p>Here is the implementation of <code>Monad[List]</code> <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">in Algebird</a>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">implicit</span> <span class="k">val</span> <span class="n">list</span><span class="k">:</span> <span class="kt">Monad</span><span class="o">[</span><span class="kt">List</span><span class="o">]</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Monad</span><span class="o">[</span><span class="kt">List</span><span class="o">]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">def</span> <span class="n">apply</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">v</span><span class="k">:</span> <span class="kt">T</span><span class="o">)</span> <span class="k">=</span> <span class="nc">List</span><span class="o">(</span><span class="n">v</span><span class="o">);</span>
</span><span class="line">  <span class="k">def</span> <span class="n">flatMap</span><span class="o">[</span><span class="kt">T</span>,<span class="kt">U</span><span class="o">](</span><span class="n">m</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">])(</span><span class="n">fn</span><span class="k">:</span> <span class="o">(</span><span class="kt">T</span><span class="o">)</span> <span class="o">=&gt;</span> <span class="nc">List</span><span class="o">[</span><span class="kt">U</span><span class="o">])</span> <span class="k">=</span> <span class="n">m</span><span class="o">.</span><span class="n">flatMap</span><span class="o">(</span><span class="n">fn</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Here you can see that <code>Monad[List]</code> is simply a 1:1 adapter for the existing <code>apply()</code> and <code>flatMap()</code> functions of
<code>List</code>.  And that’s because <code>List</code> in Scala already ships with monad “look and feel”.</p>

<p>Before we move on to the next section there is one more interesting facet:  A monad can have monoid forms, too.
Algebird, for instance, provides a <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">default monoid view for its semigroups and monads</a>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// This is a Semigroup, for all Monads.</span>
</span><span class="line"><span class="k">class</span> <span class="nc">MonadSemigroup</span><span class="o">[</span><span class="kt">T</span>,<span class="kt">M</span><span class="o">[</span><span class="k">_</span><span class="o">]](</span><span class="k">implicit</span> <span class="n">monad</span><span class="k">:</span> <span class="kt">Monad</span><span class="o">[</span><span class="kt">M</span><span class="o">],</span> <span class="n">sg</span><span class="k">:</span> <span class="kt">Semigroup</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span>
</span><span class="line">  <span class="k">extends</span> <span class="nc">Semigroup</span><span class="o">[</span><span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">import</span> <span class="nn">Monad.operators</span>
</span><span class="line">  <span class="k">def</span> <span class="n">plus</span><span class="o">(</span><span class="n">l</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">],</span> <span class="n">r</span><span class="k">:</span> <span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">=</span> <span class="k">for</span><span class="o">(</span><span class="n">lv</span> <span class="k">&lt;-</span> <span class="n">l</span><span class="o">;</span> <span class="n">rv</span> <span class="k">&lt;-</span> <span class="n">r</span><span class="o">)</span> <span class="k">yield</span> <span class="n">sg</span><span class="o">.</span><span class="n">plus</span><span class="o">(</span><span class="n">lv</span><span class="o">,</span> <span class="n">rv</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">// This is a Monoid, for all Monads.</span>
</span><span class="line"><span class="k">class</span> <span class="nc">MonadMonoid</span><span class="o">[</span><span class="kt">T</span>,<span class="kt">M</span><span class="o">[</span><span class="k">_</span><span class="o">]](</span><span class="k">implicit</span> <span class="n">monad</span><span class="k">:</span> <span class="kt">Monad</span><span class="o">[</span><span class="kt">M</span><span class="o">],</span> <span class="n">mon</span><span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span>
</span><span class="line">  <span class="k">extends</span> <span class="nc">MonadSemigroup</span><span class="o">[</span><span class="kt">T</span>,<span class="kt">M</span><span class="o">]</span> <span class="k">with</span> <span class="nc">Monoid</span><span class="o">[</span><span class="kt">M</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">lazy</span> <span class="k">val</span> <span class="n">zero</span> <span class="k">=</span> <span class="n">monad</span><span class="o">(</span><span class="n">mon</span><span class="o">.</span><span class="n">zero</span><span class="o">)</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Groups, rings, and fields do not have such a default, “automatic” monoid view however.  For those algebraic structures
you must check yourself that the group/ring/field laws hold for your monad.</p>

<h3 id="what-can-i-use-a-monad-for--why-should-i-look-for-one">What can I use a monad for?  Why should I look for one?</h3>

<p>As we have already seen monads can be thought of as
<a href="http://www.haskell.org/haskellwiki/Monad"><em>composable</em> computation descriptions</a>.
This means you can use them to build powerful data processing pipelines.  And these pipelines are not only powerful in
terms of features and functionality, they can also be <em>parallelized</em>, which is one of the reasons why monads are so
attractive in the field of large-scale data processing where your code is run on many cores and on many machines at
the same time.</p>

<div class="note">
Now you might say that almost all we do in coding is to transform one value into another value, and I agree.  And this,
I think, is where the idea of the picture &#8220;Monads.  Monads, everywhere.&#8221; (see beginning of this article) originates
from.  Two of my two mysteries solved, yay!
</div>

<h1 id="algebird">Algebird</h1>

<p>Finally we are getting close to being productive with Algebird.  I figure the previous TL;DR section on monoids and
monads was still maybe a bit too long. :-)</p>

<p>If you recall, our original goal at the beginning of this post was to build a data structure <code>TwitterUser</code> accompanied
with a <code>Max[TwitterUser]</code> monoid view of it, using Algebird.  We wanted to use the two for implementing the analytics of
a simple popularity contest on Twitter:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Let&#39;s have a popularity contest on Twitter.  The user with the most followers wins!</span>
</span><span class="line"><span class="k">val</span> <span class="n">barackobama</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;BarackObama&quot;</span><span class="o">,</span> <span class="mi">40267391</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">katyperry</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;katyperry&quot;</span><span class="o">,</span> <span class="mi">48013573</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">ladygaga</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;ladygaga&quot;</span><span class="o">,</span> <span class="mi">40756470</span><span class="o">)</span>
</span><span class="line"><span class="k">val</span> <span class="n">miguno</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;miguno&quot;</span><span class="o">,</span> <span class="mi">731</span><span class="o">)</span> <span class="c1">// I participate, too.  Olympic spirit!</span>
</span><span class="line"><span class="k">val</span> <span class="n">taylorswift</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;taylorswift13&quot;</span><span class="o">,</span> <span class="mi">37125055</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="k">val</span> <span class="n">winner</span><span class="k">:</span> <span class="kt">Max</span><span class="o">[</span><span class="kt">TwitterUser</span><span class="o">]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">(</span><span class="n">barackobama</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">katyperry</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">ladygaga</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">miguno</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Max</span><span class="o">(</span><span class="n">taylorswift</span><span class="o">)</span>
</span><span class="line"><span class="n">assert</span><span class="o">(</span><span class="n">winner</span><span class="o">.</span><span class="n">get</span> <span class="o">==</span> <span class="n">katyperry</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Let’s start!</p>

<h2 id="creating-a-monoid">Creating a monoid</h2>

<h3 id="the-twitteruser-type">The TwitterUser type</h3>

<p>Our first step is to create the data structure <code>TwitterUser</code> for which we will then create a monoid view.</p>

<p>Because we want to build a <code>Max</code> monoid for <code>TwitterUser</code> eventually, we must come up with a way to <em>order</em>
<code>TwitterUser</code> values.  For this we can either use the
<a href="http://www.scala-lang.org/api/current/#scala.math.Ordering">Ordering</a> or the
<a href="http://www.scala-lang.org/api/current/#scala.math.Ordered">Ordered</a> trait in Scala, either way will work.</p>

<p>Let’s say we go down the <code>Ordered</code> route.  Now we must answer a design question: <em>Do we consider the “ordering”
behavior to be a defining feature of <code>TwitterUser</code> in general, or do we need this behavior only for its
<code>Max[TwitterUser]</code> monoid view?</em>
If it’s a general feature we would add it to <code>TwitterUser</code> directly.  If it’s only needed for the monoid we can also
decide to add it only there.  In our case, we will add the ordering behavior to <code>TwitterUser</code> directly.  I will show
further below how to implement the other option.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// Small note: To be future-proof we should make `numFollowers` a `Long`,</span>
</span><span class="line"><span class="c1">// because `Int.MaxValue` (~ 2 billion) is less than the potential number</span>
</span><span class="line"><span class="c1">// of Twitter users on planet earth.  I am happy to let this one slip though.</span>
</span><span class="line"><span class="k">case</span> <span class="k">class</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="k">val</span> <span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="k">val</span> <span class="n">numFollowers</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Ordered</span><span class="o">[</span><span class="kt">TwitterUser</span><span class="o">]</span> <span class="o">{</span>
</span><span class="line">  <span class="k">def</span> <span class="n">compare</span><span class="o">(</span><span class="n">that</span><span class="k">:</span> <span class="kt">TwitterUser</span><span class="o">)</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="k">val</span> <span class="n">c</span> <span class="k">=</span> <span class="k">this</span><span class="o">.</span><span class="n">numFollowers</span> <span class="o">-</span> <span class="n">that</span><span class="o">.</span><span class="n">numFollowers</span>
</span><span class="line">    <span class="k">if</span> <span class="o">(</span><span class="n">c</span> <span class="o">==</span> <span class="mi">0</span><span class="o">)</span> <span class="k">this</span><span class="o">.</span><span class="n">name</span><span class="o">.</span><span class="n">compareTo</span><span class="o">(</span><span class="n">that</span><span class="o">.</span><span class="n">name</span><span class="o">)</span> <span class="k">else</span> <span class="n">c</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The code above means that <code>TwitterUser</code> supports comparison operations like <code>&gt;=</code> as defined by the <code>compare</code> method of
the <code>Ordered</code> trait.</p>
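<p>One subtlety worth noting: the subtraction-based <code>compare()</code> above can overflow <code>Int</code> for extreme inputs (for example when one operand holds <code>Int.MinValue</code>, which we will use as the monoid&#8217;s zero element below), flipping the sign of the result.  A safer sketch, assuming we delegate to Java&#8217;s <code>Integer.compare</code>:</p>

```scala
// A sketch of an overflow-safe compare: Integer.compare returns the
// sign of the comparison without performing a subtraction, so it is
// correct even for Int.MinValue / Int.MaxValue inputs.
case class TwitterUser(name: String, numFollowers: Int) extends Ordered[TwitterUser] {
  def compare(that: TwitterUser): Int = {
    val c = Integer.compare(this.numFollowers, that.numFollowers)
    if (c == 0) this.name.compareTo(that.name) else c
  }
}
```

<p>With the subtraction-based version, <code>TwitterUser(&quot;zero&quot;, Int.MinValue) &lt; TwitterUser(&quot;x&quot;, 1)</code> would wrap around and evaluate to <code>false</code>; with <code>Integer.compare</code> it behaves as expected.</p>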

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;foo&quot;</span><span class="o">,</span> <span class="mi">123</span><span class="o">)</span> <span class="o">&gt;</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;bar&quot;</span><span class="o">,</span> <span class="mi">99999</span><span class="o">)</span>
</span><span class="line"><span class="n">res5</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">false</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In our case this <code>compare()</code> method also underpins the monoidal binary operation of the <code>Max[TwitterUser]</code> monoid we
will build in the next section.  This works because the resulting max operation satisfies all three axioms described in our section
on monoids above.</p>
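<p>As a quick sanity check, here is a plain-Scala sketch (no Algebird required) that spot-checks the three axioms, i.e. closure, associativity, and identity, for the max operation on <code>TwitterUser</code>:</p>

```scala
// Sketch: spot-checking the monoid axioms for "max over TwitterUser".
// Tie-breaking by name is omitted here for brevity.
case class TwitterUser(name: String, numFollowers: Int)

// Closure: the result type is again TwitterUser (enforced by the compiler).
def max(a: TwitterUser, b: TwitterUser): TwitterUser =
  if (a.numFollowers >= b.numFollowers) a else b

val zero = TwitterUser("MinUser", Int.MinValue)
val x = TwitterUser("x", 1)
val y = TwitterUser("y", 2)
val z = TwitterUser("z", 3)

// Associativity: how we group the operations does not matter.
assert(max(max(x, y), z) == max(x, max(y, z)))
// Identity: combining with the zero element is a no-op.
assert(max(zero, x) == x && max(x, zero) == x)
```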

<h3 id="the-maxtwitteruser-monoid">The Max[TwitterUser] monoid</h3>

<p>Creating the <code>Max</code> monoid for <code>TwitterUser</code> is now very simple because we can leverage Algebird&#8217;s
<code>Max.monoid()</code> factory method.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="c1">// The &quot;zero&quot; element of the TwitterUser monoid.  Traditionally it is</span>
</span><span class="line"><span class="c1">// also called `mzero` in academic papers.  We use `Int.MinValue` here</span>
</span><span class="line"><span class="c1">// but in practice you would typically constrain `numFollowers` of</span>
</span><span class="line"><span class="c1">// TwitterUser to be &gt;= 0 anyways, so any negative value such as `-1`</span>
</span><span class="line"><span class="c1">// would do.</span>
</span><span class="line"><span class="k">val</span> <span class="n">zero</span> <span class="k">=</span> <span class="nc">TwitterUser</span><span class="o">(</span><span class="s">&quot;MinUser&quot;</span><span class="o">,</span> <span class="nc">Int</span><span class="o">.</span><span class="nc">MinValue</span><span class="o">)</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Monoid in Algebird is a type class, hence we use implicits</span>
</span><span class="line"><span class="c1">// to make the monoid available to the rest of the code.</span>
</span><span class="line"><span class="k">implicit</span> <span class="k">def</span> <span class="n">twitterUserMonoid</span><span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">Max</span><span class="o">[</span><span class="kt">TwitterUser</span><span class="o">]]</span> <span class="k">=</span> <span class="nc">Max</span><span class="o">.</span><span class="n">monoid</span><span class="o">(</span><span class="n">zero</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>That’s it!</p>

<p>Ok, maybe it feels a bit like cheating because the monoid is created behind the scenes by <code>Max.monoid()</code>.
So what does <code>Max.monoid()</code> do?</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="cm">/* This is Algebird code, not ours. */</span>
</span><span class="line">
</span><span class="line"><span class="c1">// Zero should have the property that it &lt;= all T</span>
</span><span class="line"><span class="k">def</span> <span class="n">monoid</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">zero</span><span class="k">:</span> <span class="o">=&gt;</span> <span class="n">T</span><span class="o">)(</span><span class="k">implicit</span> <span class="n">ord</span><span class="k">:</span> <span class="kt">Ordering</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">Max</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span>
</span><span class="line">   <span class="nc">Monoid</span><span class="o">.</span><span class="n">from</span><span class="o">(</span><span class="nc">Max</span><span class="o">(</span><span class="n">zero</span><span class="o">))</span> <span class="o">{</span> <span class="o">(</span><span class="n">l</span><span class="o">,</span><span class="n">r</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="k">if</span><span class="o">(</span><span class="n">ord</span><span class="o">.</span><span class="n">gteq</span><span class="o">(</span><span class="n">l</span><span class="o">.</span><span class="n">get</span><span class="o">,</span> <span class="n">r</span><span class="o">.</span><span class="n">get</span><span class="o">))</span> <span class="n">l</span> <span class="k">else</span> <span class="n">r</span> <span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Still, it’s pretty straightforward, I would say.  There is not a lot of magic as long as you know how implicits and type classes
in Scala work.</p>

<div class="note">
Generally, <tt>Max</tt> in Algebird is a semigroup &#8211; not a monoid &#8211; because not all types <tt>T</tt> you could come
up with would have the notion of a zero element when used with <tt>Max</tt>.  And the existence of such a zero element
is the one thing that separates a semigroup from a monoid.  You see this in Algebird&#8217;s
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/OrderedSemigroup.scala">OrderedSemigroup.scala</a>
where <tt>object Max</tt> defines an <tt>implicit def</tt> semigroup, and defines monoid behavior only for a few specific
types such as <tt>Int</tt> or <tt>Long</tt>.  This is because those types have the notion of a zero
element.  In our case we have such a zero element, too, hence we can support not only semigroup but also monoid
behavior.
</div>
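<p>If the semigroup/monoid distinction is still hazy, here is a minimal plain-Scala sketch (hypothetical trait definitions, not Algebird&#8217;s actual code) that shows where the zero element comes in:</p>

```scala
// Sketch: a semigroup only needs an associative plus; a monoid
// additionally needs a zero element that is neutral w.r.t. plus.
trait Semigroup[T] { def plus(l: T, r: T): T }
trait Monoid[T] extends Semigroup[T] { def zero: T }

// Max over Int is a monoid: Int.MinValue is <= every other Int.
val maxIntMonoid: Monoid[Int] = new Monoid[Int] {
  def plus(l: Int, r: Int): Int = math.max(l, r)
  def zero: Int = Int.MinValue
}

// Max over BigInt is only a semigroup: there is no smallest BigInt,
// hence no candidate for a zero element.
val maxBigIntSemigroup: Semigroup[BigInt] = new Semigroup[BigInt] {
  def plus(l: BigInt, r: BigInt): BigInt = l.max(r)
}
```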

<p>What would we do if we only wanted to add <code>compare()</code> to the monoid, but not to the original type?
The Algebird code has examples for this use case.  Here is the definition of the <code>Max[List]</code> monoid, which as you may
notice uses <code>Ordering</code> and not <code>Ordered</code> as in our example above.  You can ignore that small difference.  The key point
is that the <code>compare()</code> method is defined as part of the <code>Max[List]</code> monoid instead of being “duct-taped” to <code>List</code>
directly.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="k">implicit</span> <span class="k">def</span> <span class="n">listMonoid</span><span class="o">[</span><span class="kt">T:Ordering</span><span class="o">]</span><span class="k">:</span> <span class="kt">Monoid</span><span class="o">[</span><span class="kt">Max</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]]]</span> <span class="k">=</span> <span class="n">monoid</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]](</span><span class="nc">Nil</span><span class="o">)(</span><span class="k">new</span> <span class="nc">Ordering</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="o">{</span>
</span><span class="line">  <span class="nd">@tailrec</span>
</span><span class="line">  <span class="k">final</span> <span class="k">override</span> <span class="k">def</span> <span class="n">compare</span><span class="o">(</span><span class="n">left</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">],</span> <span class="n">right</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="o">{</span>
</span><span class="line">    <span class="o">(</span><span class="n">left</span><span class="o">,</span> <span class="n">right</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
</span><span class="line">      <span class="k">case</span> <span class="o">(</span><span class="nc">Nil</span><span class="o">,</span> <span class="nc">Nil</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="mi">0</span>
</span><span class="line">      <span class="k">case</span> <span class="o">(</span><span class="nc">Nil</span><span class="o">,</span> <span class="k">_</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="o">-</span><span class="mi">1</span>
</span><span class="line">      <span class="k">case</span> <span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="nc">Nil</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="mi">1</span>
</span><span class="line">      <span class="k">case</span> <span class="o">(</span><span class="n">lh</span><span class="o">::</span><span class="n">lt</span><span class="o">,</span> <span class="n">rh</span><span class="o">::</span><span class="n">rt</span><span class="o">)</span> <span class="k">=&gt;</span>
</span><span class="line">        <span class="k">val</span> <span class="n">c</span> <span class="k">=</span> <span class="nc">Ordering</span><span class="o">[</span><span class="kt">T</span><span class="o">].</span><span class="n">compare</span><span class="o">(</span><span class="n">lh</span><span class="o">,</span> <span class="n">rh</span><span class="o">)</span>
</span><span class="line">        <span class="k">if</span><span class="o">(</span><span class="n">c</span> <span class="o">==</span> <span class="mi">0</span><span class="o">)</span> <span class="n">compare</span><span class="o">(</span><span class="n">lt</span><span class="o">,</span> <span class="n">rt</span><span class="o">)</span> <span class="k">else</span> <span class="n">c</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line"><span class="o">})</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h3 id="where-to-go-from-here">Where to go from here?</h3>

<p>Now that we have one monoid view for <code>TwitterUser</code>, what else can we do?  Can we find another monoid form for it?
That’s one of the questions you should ask yourself when working with your own data structures.  If you take a look at
the Algebird code, you will notice that many types such as <code>List</code> will have quite a few algebraic forms.</p>

<p>There is one more thing I want to mention here:  You may consider creating an additive monoid for <code>TwitterUser</code>, i.e.
a monoid that supports a <code>+</code>-like operation.  I couldn’t come up with any good example of how the result of adding two such
values would make sense (e.g., how could you “add” their usernames in a meaningful way?).  That being said, there is one
case where adding two <code>TwitterUser</code> values would make sense: to capture the idea that one follows the other, i.e.
to create a relationship (a link) between the two.  Keep in mind though that monoids and friends must adhere to the
<em>closure</em> principle – if you start out with a <code>TwitterUser</code> value and perform monoid operations on it, the end result
must always be another <code>TwitterUser</code> value.  Of course such a relationship can be modeled in code, but you cannot do
this with a <code>TwitterUser</code> monoid as defined above.</p>
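<p>The closure requirement is visible directly in the type signatures.  In this plain-Scala sketch (with a hypothetical <code>Follows</code> type), the first operation could back a <code>TwitterUser</code> monoid, while the second cannot:</p>

```scala
case class TwitterUser(name: String, numFollowers: Int)

// Closed: maps (TwitterUser, TwitterUser) back to TwitterUser,
// so it can serve as a monoid operation.
def maxByFollowers(l: TwitterUser, r: TwitterUser): TwitterUser =
  if (l.numFollowers >= r.numFollowers) l else r

// Not closed: modeling a "follows" relationship yields a different
// type, so it cannot be the operation of a TwitterUser monoid.
case class Follows(follower: TwitterUser, followee: TwitterUser)
def follow(follower: TwitterUser, followee: TwitterUser): Follows =
  Follows(follower, followee)
```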

<h2 id="creating-a-monad">Creating a monad?</h2>

<p>By now you should have sufficient understanding of monads and Algebird to implement your own monad.  So I leave this
as an exercise for the reader.</p>

<p>A starting point for you is <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monad.scala">Monad.scala</a> in Algebird.</p>

<p>However if you do have a good idea what kind of monad I could showcase here – perhaps something related to Twitter to
match the <code>TwitterUser</code> monoid example above? – please let me know in the comments.</p>

<h2 id="key-algebraic-structures-in-algebird">Key algebraic structures in Algebird</h2>

<p>The following table is a juxtaposition of a few key algebraic structures, notably those that are implemented in
Algebird.  It should help you to navigate the Algebird code base, and also to figure out which algebraic structure
your own data types might support – i.e., <em>“Can I turn my <code>T</code> into a semigroup, or even a monoid?”</em>.</p>

<table>
  <tr>
    <th>Algebraic structure</th>
    <th>Binary op is associative</th>
    <th>Identity (has a zero element)</th>
    <th>+&nbsp;op</th>
    <th>-&nbsp;op</th>
    <th>*&nbsp;op</th>
    <th>/&nbsp;op</th>
    <th>References</th>
  </tr>
  <tr>
    <td>Semigroup</td>
    <td>YES</td>
    <td>-</td>
    <td>YES</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
    <td><a href="https://en.wikipedia.org/wiki/Semigroup">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Semigroup.scala">Algebird</a></td>
  </tr>
  <tr>
    <td>Monoid</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
    <td><a href="https://en.wikipedia.org/wiki/Monoid">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Monoid.scala">Algebird</a></td>
  </tr>
  
  <tr>
    <td>Group</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>-</td>
    <td>-</td>
    <td><a href="https://en.wikipedia.org/wiki/Group_%28mathematics%29">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Group.scala">Algebird</a></td>
  </tr>
  <tr>
    <td>Ring</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>-</td>
    <td><a href="https://en.wikipedia.org/wiki/Ring_%28mathematics%29">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Ring.scala">Algebird</a></td>
  </tr>
  <tr>
    <td>Field</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td>YES</td>
    <td><a href="https://en.wikipedia.org/wiki/Field_%28mathematics%29">Wikipedia</a>, <a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Field.scala">Algebird</a></td>
  </tr>
</table>

<p>Think of <code>+</code> as the general notion of “adding one thing to another”, same for the other operations.  For two
<code>List[Int]</code>, for instance, <code>+</code> could be <em>concatenation</em> of the two (instead of, say, trying to add the individual <code>Int</code>
elements of the lists together).  The operators <code>+</code>, <code>-</code>, <code>*</code> and <code>/</code> are
<a href="https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Operators.scala">as defined in Algebird</a>.</p>

<h2 id="a-small-algebird-faq">A small Algebird FAQ</h2>

<h3 id="error-cannot-find-groupmonoid-type-class-for-a-type-t">Error “Cannot find Group/Monoid/… type class for a type T”?</h3>

<p>If you run into this error it means you are trying to use an operation that is not supported by the algebraic structure
you are working with.  In this specific example, a <code>Set()</code> in Algebird has a monoid form and thus supports an
addition-like operation <code>+</code> but not a multiplication-like operation <code>*</code>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Set</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span><span class="mi">2</span><span class="o">,</span><span class="mi">3</span><span class="o">)</span> <span class="o">*</span> <span class="nc">Set</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span><span class="mi">3</span><span class="o">,</span><span class="mi">4</span><span class="o">)</span>
</span><span class="line"><span class="o">&lt;</span><span class="n">console</span><span class="k">&gt;:</span><span class="mi">2</span><span class="k">:</span> <span class="kt">error:</span> <span class="kt">Cannot</span> <span class="kt">find</span> <span class="kt">Ring</span> <span class="k">type</span> <span class="kt">class</span> <span class="kt">for</span> <span class="kt">scala.collection.immutable.Set</span><span class="o">[</span><span class="kt">Int</span><span class="o">]</span>
</span><span class="line">              <span class="nc">Set</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span><span class="mi">2</span><span class="o">,</span><span class="mi">3</span><span class="o">)</span> <span class="o">*</span> <span class="nc">Set</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span><span class="mi">3</span><span class="o">,</span><span class="mi">4</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h3 id="combine-different-monoids">Combine different monoids?</h3>

<p>In theory you <em>can</em> combine different monoids such as <code>Max[Int]</code> and <code>Min[Int]</code> and form their product, but there must
exist an appropriate algebraic structure for that product.  Right now, for instance, the following code will not work
in Algebird because it does not ship with an algebraic structure for <code>(Max[Int], Min[Int])</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="scala"><span class="line"><span class="n">scala</span><span class="o">&gt;</span> <span class="nc">Max</span><span class="o">(</span><span class="mi">3</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Min</span><span class="o">(</span><span class="mi">4</span><span class="o">)</span>
</span><span class="line"><span class="o">&lt;</span><span class="n">console</span><span class="k">&gt;:</span><span class="mi">14</span><span class="k">:</span> <span class="kt">error:</span> <span class="kt">Cannot</span> <span class="kt">find</span> <span class="kt">Semigroup</span> <span class="k">type</span> <span class="kt">class</span> <span class="kt">for</span> <span class="kt">Product</span> <span class="kt">with</span> <span class="kt">Serializable</span>
</span><span class="line">              <span class="nc">Max</span><span class="o">(</span><span class="mi">3</span><span class="o">)</span> <span class="o">+</span> <span class="nc">Min</span><span class="o">(</span><span class="mi">4</span><span class="o">)</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>
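<p>If you do need such a product, you can define a suitable structure yourself.  Here is a plain-Scala sketch (not Algebird code) that combines a max and a min component-wise, e.g. to track both extremes of a stream in a single pass:</p>

```scala
// Sketch: a component-wise plus for a (max, min) pair.
case class MaxMin(max: Int, min: Int) {
  def +(that: MaxMin): MaxMin =
    MaxMin(math.max(this.max, that.max), math.min(this.min, that.min))
}

val summary = List(3, 4, 7, 1).map(i => MaxMin(i, i)).reduce(_ + _)
assert(summary == MaxMin(7, 1))
```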

<h1 id="are-monads-really-everywhere">Are monads really everywhere?</h1>

<p>One thing that I have not yet investigated in further detail is how using monads compares to other patterns of
abstraction.  For instance, you can also use <a href="http://www.clojure.net/2012/02/02/Monads-in-Clojure/">monads in Clojure</a>
(the author Jim Duey actually wrote a
<a href="http://www.clojure.net/tags.html#monads-ref">whole series of blog posts covering monads</a>), but in a quick initial
search I observed that Clojure developers apparently use different constructs to achieve similar effects.</p>

<p>If you have some insights to share here, please feel free to reply to this post!</p>

<h1 id="summary">Summary</h1>

<p>I hope this post contributes a little bit to the understanding of the rather abstract concepts of monoids and monads,
and how you can put them to good practical use via tools such as <a href="https://github.com/twitter/algebird">Algebird</a>,
<a href="https://github.com/twitter/scalding">Scalding</a> and <a href="https://github.com/twitter/summingbird">SummingBird</a>.</p>

<p>One of my lessons learned was that working with monoids and monads is a nice opportunity to read up on more formal
concepts (category theory), and at the same time realize how they can be put to practical use in engineering, notably
when doing large-scale data analytics.</p>

<p>On my side I want to thank the Twitter engineering team (<a href="https://twitter.com/TwitterEng">@TwitterEng</a>) not only for
making those tools available to the open source community, but also for sparking my interest in the practical
application of algebraic structures and category theory in general.  The same shout-out goes to the various people who
wrote blog posts on the topic, or who shared their insights on places such as StackOverflow (see the reference section
at the end of this article for a few of them).  As I said there was a lot of new information to swallow – and in a
short period of time – but the quest was worth it.</p>

<p>Many thanks!  <em>–Michael</em></p>

<h1 id="references">References</h1>

<h2 id="monads-and-monoids">Monads and monoids</h2>

<p>I tried to categorize the references below into “easy” and “advanced” reads.  Of course this is highly subjective, and
your mileage may vary.</p>

<p>Easy reads:</p>

<ul>
  <li><a href="http://www.manning.com/bjarnason/">Functional Programming in Scala</a> by P. Chiusano and R. Bjarnason, published by
Manning.  Includes chapters on monoids and monads, and how to implement them in Scala.</li>
  <li><a href="http://www.codecommit.com/blog/ruby/monads-are-not-metaphors">Monads are not metaphors</a>, by Daniel Spiewak.</li>
  <li><a href="http://stackoverflow.com/questions/3870088">A monad is just a monoid in the category of endofunctors, what’s the problem?</a>,
question on StackOverflow.  If you are just starting out with monads etc., I’d recommend reading the
<a href="http://stackoverflow.com/a/7829607/1743580">second answer</a> first.</li>
  <li><a href="http://james-iry.blogspot.ch/2007/09/monads-are-elephants-part-1.html">Monads are elephants</a>, a series of blog posts
by James Iry.  In Scala.</li>
  <li><a href="http://adit.io/posts/2013-04-17-functors,_applicatives,_and_monads_in_pictures.html">Functors, Applicatives, And Monads In Pictures</a>,
by Aditya Bhargava.</li>
  <li><a href="http://www.haskell.org/haskellwiki/Monad_laws">Monad laws</a> (in Haskell).  Remember <code>return</code> in Haskell means our
constructor <code>apply()</code> in Scala, and <code>&gt;&gt;=</code> in Haskell is our <code>flatMap()</code>.</li>
  <li>Wikipedia articles on algebraic structures:  I found that selective reading of those did help my understanding (I did
not try to understand all the sections in those articles).  Notably, I liked the juxtaposition of semigroups, monoids,
groups, rings, etc. which highlighted their similarities and differences.  Later on I discovered that the Algebird
code is structured similarly, so if you can tell a semigroup from a monoid you will have an easier time navigating
the code.
    <ul>
      <li><a href="https://en.wikipedia.org/wiki/Semigroup">Semigroup</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Monoid">Monoid</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Group_%28mathematics%29">Group</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Ring_%28mathematics%29">Ring</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Monad_%28functional_programming%29">Monad</a></li>
    </ul>
  </li>
</ul>

<p>Advanced reads:</p>

<ul>
  <li><a href="http://www.stephendiehl.com/posts/monads.html">Monads Made Difficult</a>, by Stephen Diel.  In Haskell.</li>
  <li><a href="http://www.clojure.net/2012/02/02/Monads-in-Clojure/">Monads in Clojure</a>, by Jim Duey.  In Clojure.  Jim actually
wrote a <a href="http://www.clojure.net/tags.html#monads-ref">whole series of blog posts covering monads</a>.</li>
  <li><a href="http://learnyouahaskell.com/functors-applicative-functors-and-monoids">Functors, Applicative Functors and Monoids</a>,
a chapter in <a href="http://learnyouahaskell.com">Learn You a Haskell</a>.</li>
</ul>

<h2 id="summingbird">SummingBird</h2>

<ul>
  <li><a href="https://speakerdeck.com/sritchie/summingbird-at-cufp">SummingBird at CUFP 2013</a>, slides by Sam Ritchie (former
Twitter engineer)</li>
</ul>

<h2 id="category-theory">Category theory</h2>

<p>Speaking from my own experience, I would say you do not need to understand the full details of category theory.  The
links above should contain all the information you need to gain enough understanding of monoids, monads and such to
be productive in a short period of time.  However, I used the references below to fill gaps that remained after
reading through the other sources, and I remember jumping back and forth between the academic references below and
the more hands-on resources above.</p>

<ul>
  <li><a href="http://www.amazon.com/dp/0521283043">An Introduction to Category Theory</a>, by Harold Simmons.  As a novice to category
theory I preferred this text over <em>Category Theory for Computing Science</em> (see below).  Unlike the latter, though,
Simmons&#8217;s book is not available for free.</li>
  <li><em>Category Theory for Computing Science</em>, by Michael Barr and Charles Wells, available as a
<a href="http://www.math.mcgill.ca/triples/Barr-Wells-ctcs.pdf">free PDF</a>.  This seems to be a seminal work on category
theory and worth the read if you are interested in the mathematical foundation of the theory in about 400 pages.</li>
</ul>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Sending Metrics from Storm to Graphite]]></title>
    <link href="http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-11-06T16:00:00+01:00</updated>
    <id>http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite</id>
    <content type="html"><![CDATA[<p>So you got your first <a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/">distributed Storm cluster installed</a> and have
your first topologies up and running.  Great!  Now you want to integrate your Storm applications with your monitoring
systems and begin tracking application-level metrics from your topologies.  In this article I show you how to
integrate Storm with the popular Graphite monitoring system.  This, combined with the Storm UI, will provide you with
actionable information to
<a href="http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/">tune the performance of your topologies</a> and also help
you to track key business as well as technical metrics.</p>

<!-- more -->

<div class="note">
<strong>Update March 13, 2015</strong>: We have open sourced
<a href="https://github.com/verisign/storm-graphite">storm-graphite</a>, a Storm IMetricsConsumer implementation that
forwards Storm&#8217;s <em>built-in metrics</em> to a Graphite server for real-time graphing, visualization, and operational
dashboards.  These built-in metrics greatly augment the application-level metrics that you can send from your Storm
topologies to Graphite (sending application metrics is described in this article).  The built-in metrics include
execution count and latency of your bolts, Java heap space usage and garbage collection statistics, and much more.
So if you are interested in even better metrics and deeper insights into your Storm cluster, I&#8217;d strongly recommend
taking a look at
<a href="https://github.com/verisign/storm-graphite">storm-graphite</a>.  We also describe how to configure Graphite
and Grafana, a dashboard for Graphite, to make use of the built-in metrics provided by storm-graphite.
</div>

<h1 id="background-what-is-graphite">Background: What is Graphite?</h1>

<p>Quoting from <a href="http://graphite.readthedocs.org/en/latest/overview.html">Graphite’s documentation</a>, Graphite does two
things:</p>

<ol>
  <li>Store numeric time-series data</li>
  <li>Generate and render graphs of this data on demand</li>
</ol>

<p>What Graphite does not do is collect the actual input data for you, i.e. your system or application metrics.  The
purpose of this blog post is to show how you can do this for your Storm applications.</p>

<div class="note">
Note: The Graphite project is currently undergoing significant changes.  The project has been moved to GitHub and split
into individual components.  Also, the next version of Graphite will include Ceres, which is a distributed
time-series database, and a major refactor of its Carbon daemon.  If that draws your interest then you can
<a href="http://graphite.wikidot.com/">read about the upcoming changes in further detail</a>.  I mention this just for
completeness &#8211; it should not deter you from jumping on the Graphite bandwagon.
</div>

<h1 id="what-we-want-to-do">What we want to do</h1>

<h2 id="spatial-granularity-of-metrics">Spatial granularity of metrics</h2>

<p>For the context of this post we want to use Graphite to track the number of received tuples of an example bolt
<em>per node</em> in the Storm cluster.  This allows us, say, to pinpoint a potential topology bottleneck to specific machines
in the Storm cluster – and this is particularly powerful if we already track system metrics (CPU load, memory usage,
network traffic and such) in Graphite because then you can correlate system and application level metrics.</p>

<p>Keep in mind that in Storm multiple instances of a bolt may run on a given node, and its instances may also run on many
different nodes.  Our challenge will be to configure Storm and Graphite in a way that we are able to correctly collect
and aggregate all individual values reported by those many instances of the bolt.  Also, the total value of these
per-host tuple counts should ideally match the bolt’s <code>Executed</code> value – which means the number of executed tuples of
a bolt (i.e. across all instances of the bolt in a topology) – in the Storm UI.</p>

<p>We will add Graphite support to our Java-based Storm topology by using Coda Hale/Yammer’s
<a href="http://metrics.codahale.com/">Metrics library for Java</a>, which directly supports
<a href="http://metrics.codahale.com/manual/graphite/">reporting metrics to Graphite</a>.</p>

<p>We will track the number of received tuples of our example bolt through the following metrics, where <em>HOSTNAME</em> is a
placeholder for the hostname of a particular Storm node (e.g. <code>storm-node01</code>):</p>

<ul>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.count</code></li>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.m1_rate</code> – 1-minute rate</li>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.m5_rate</code> – 5-minute rate</li>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.m15_rate</code> – 15-minute rate</li>
  <li><code>production.apps.graphitedemo.HOSTNAME.tuples.received.mean_rate</code> – average rate/sec</li>
</ul>

<p>Here, the prefix of the metric namespace <code>production.apps.graphitedemo.HOSTNAME.tuples.received</code> is defined by us.
Splitting up this “high-level” metric into a <code>count</code> metric and four rate metrics – <code>m{1,5,15}_rate</code> and <code>mean_rate</code> –
is automatically done by the Metrics Java library.</p>
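<p>The <em>HOSTNAME</em> component of the namespace can be derived at runtime on each Storm node.  A minimal plain-Java
sketch of assembling the prefix – the helper names here are mine, not from the topology code:</p>

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class MetricNames {
  // Builds e.g. "production.apps.graphitedemo.storm-node01.tuples.received".
  static String metricNamespace(String env, String app, String hostname, String metric) {
    // Graphite uses "." as its path separator, so dots inside a fully
    // qualified hostname (e.g. "storm-node01.example.com") would split one
    // host into several path components -- keep only the short hostname.
    String shortHost = hostname.split("\\.")[0];
    return env + ".apps." + app + "." + shortHost + "." + metric;
  }

  // Convenience variant that looks up the local node's own hostname.
  static String localMetricNamespace(String env, String app, String metric) {
    try {
      return metricNamespace(env, app, InetAddress.getLocalHost().getHostName(), metric);
    } catch (UnknownHostException e) {
      return metricNamespace(env, app, "unknown-host", metric);
    }
  }

  public static void main(String[] args) {
    System.out.println(metricNamespace("production", "graphitedemo",
        "storm-node01.example.com", "tuples.received"));
    // production.apps.graphitedemo.storm-node01.tuples.received
  }
}
```

<p>Note that only this prefix is ours; the <code>count</code> and <code>*_rate</code> suffixes are appended by the
Metrics library.</p>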

<h2 id="temporal-granularity-of-metrics">Temporal granularity of metrics</h2>

<p>Because Storm is a real-time analytics platform we want to use a shorter time window for metrics updates than Graphite’s
default, which is one minute.  In our case we will report metrics data every 10 seconds (the finest granularity that
Graphite supports is one second).</p>

<h2 id="assumptions">Assumptions</h2>

<ul>
  <li>We are using a single Graphite server called <code>your.graphite.server.com</code>.</li>
  <li>The <code>carbon-cache</code> and <code>carbon-aggregator</code> daemons of Graphite are both running on the Graphite server machine, i.e.
<code>carbon-aggregator</code> will send its updates to the <code>carbon-cache</code> daemon running at <code>127.0.0.1</code>.  Also, our Storm
topology will send all its metrics to this Graphite server.</li>
</ul>

<p>Thankfully, the specifics of the Storm cluster, such as the hostnames of its nodes, do not matter.  So the approach
described here should work nicely with your existing Storm cluster.</p>

<h1 id="desired-outcome-graphs-and-dashboards">Desired outcome: graphs and dashboards</h1>

<p>The desired end result is a set of graphs and dashboards similar to the following Graphite screenshot:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/graphitedemo-storm-dashboard.png" title="Example graph in Graphite that displays number of received tuples" /></p>

<div class="caption">
Example graph in Graphite that displays the number of received tuples.  The brown line is the aggregate of all per-host
tuple counts of this 4-node Storm cluster and computed via Graphite&#8217;s
<a href="http://graphite.readthedocs.org/en/latest/functions.html#graphite.render.functions.sumSeries">sumSeries()</a>
function.  Note that only 3 of the 4 nodes are actually running instances of the bolt, hence you only see 3+1 lines in
the graph.
</div>

<h1 id="versions">Versions</h1>

<p>The instructions in this article have been tested on RHEL/CentOS 6 with the following software versions:</p>

<ul>
  <li><a href="http://storm-project.net/">Storm</a> 0.9.0-rc2</li>
  <li><a href="http://graphite.wikidot.com/">Graphite</a> 0.9.12 (stock version available in EPEL for RHEL6)</li>
  <li><a href="http://metrics.codahale.com/">Metrics</a> 3.0.1</li>
  <li>Oracle JDK 6</li>
</ul>

<p>Note that I will not cover the installation of Storm or Graphite in this post.</p>

<div class="note">
Heads up: I am currently working on open sourcing an automated deployment tool called Wirbelsturm that you can use to
install Storm clusters and Graphite servers (and other Big Data related software packages) from scratch.  Wirbelsturm
is based on the popular deployment tools <a href="http://puppetlabs.com/">Puppet</a> and
<a href="http://www.vagrantup.com/">Vagrant</a>.  Please stay tuned!
</div>

<h1 id="a-graphite-primer">A Graphite primer</h1>

<h2 id="understanding-how-graphite-handles-incoming-data">Understanding how Graphite handles incoming data</h2>

<p>One pitfall for Graphite beginners is the default behavior of Graphite to discard all but the last update message
received during a given time slot (the default size of a time slot for metrics in Graphite is 60 seconds).  For example,
if we are sending the metric values <code>5</code> and <code>4</code> during the same time slot then Graphite will first store a value of <code>5</code>,
and as soon as the value <code>4</code> arrives it will overwrite the stored value from <code>5</code> to <code>4</code> (but not sum it up to <code>9</code>).</p>

<p>The following diagram shows what happens if Graphite receives multiple updates during the same time slot when we are NOT
using an aggregator such as <a href="http://graphite.readthedocs.org/en/latest/carbon-daemons.html">carbon-aggregator</a> or
<a href="https://github.com/etsy/statsd">statsd</a> in between.  In this example we use a time slot of 10 seconds for the metric.
Note again that in this scenario you might see, for instance, “flapping” values for the second time slot (the window of
seconds 10 to 20) depending on <em>when</em> you query Graphite:  If you queried Graphite at second 15 for the 10-20 time
slot, you would receive a return value of <code>3</code>; if you queried only a few seconds later, you would start receiving the
final value of <code>7</code> (which would then never change again).</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Graphite-update-behavior-01.png" title="Example Graphite behavior without carbon-aggregator or statsd" /></p>

<p>In most situations losing all but the last update of a given time slot is not what you want.  The next diagram shows how
aggregators solve the “only the last update counts” problem.  A nice property of aggregators is that they are
transparent to the client, which can continue to send updates whenever it sees fit – the aggregators ensure that
Graphite only sees a single, merged update message per time slot.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/Graphite-update-behavior-02.png" title="Example Graphite behavior with carbon-aggregator or statsd" /></p>
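<p>The two behaviors shown in the diagrams boil down to a few lines of code.  Here is a toy simulation, using the same
numbers as the example above (two updates, <code>3</code> and <code>7</code>, arriving in the same 10-second time slot):</p>

```java
import java.util.HashMap;
import java.util.Map;

public class TimeSlots {
  static final int SLOT_SECONDS = 10;

  // carbon-cache without an aggregator in front: the last write wins
  // within a time slot, earlier updates for the slot are discarded.
  static void storeLastWins(Map<Long, Long> db, long epochSecond, long value) {
    db.put(epochSecond / SLOT_SECONDS, value);
  }

  // With carbon-aggregator (or statsd) in front: updates within a slot are
  // summed, and only the single merged value reaches carbon-cache.
  static void storeSummed(Map<Long, Long> db, long epochSecond, long value) {
    db.merge(epochSecond / SLOT_SECONDS, value, Long::sum);
  }

  public static void main(String[] args) {
    Map<Long, Long> plain = new HashMap<>();
    Map<Long, Long> aggregated = new HashMap<>();
    // Two updates falling into the same 10s-20s time slot:
    for (long[] update : new long[][] { { 12, 3 }, { 17, 7 } }) {
      storeLastWins(plain, update[0], update[1]);
      storeSummed(aggregated, update[0], update[1]);
    }
    System.out.println(plain.get(1L));      // 7  (the earlier value 3 was overwritten)
    System.out.println(aggregated.get(1L)); // 10 (3 + 7)
  }
}
```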

<h2 id="implications-of-storms-execution-model">Implications of Storm’s execution model</h2>

<p>In the case of Storm you implement a bolt (or spout) as a single class, e.g. by extending <code>BaseBasicBolt</code>.  So following
the <a href="http://metrics.codahale.com/manual/">User Manual</a> of the Metrics library seems to be a straightforward way to add
Graphite support to your Storm bolts.  However you must be aware of how Storm will actually execute your topology
behind the scenes – see my earlier post on
<a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">Understanding the Parallelism of a Storm Topology</a>:</p>

<ol>
  <li>In Storm each bolt typically runs in the form of many bolt instances in a single worker process, and thus you have
many bolt instances in a single JVM.</li>
  <li>In Storm there are typically many such workers (and thus JVMs) per machine, so you end up with many instances of the
same bolt running across many workers/JVMs on a particular machine.</li>
  <li>On top of that a bolt’s instances will also be spread across many different machines in the Storm cluster, so in
total you will typically have many bolt instances running in many JVMs across many Storm nodes.</li>
</ol>

<p>Our challenge to integrate Storm with Graphite can thus be stated as:  How can we ensure that we are reporting metrics
from our Storm topology to Graphite in such a way that a) we are <em>counting</em> tuples correctly across all bolt instances,
and b) the many metric update messages are not canceling each other out?  In other words, how can we keep Storm’s
highly distributed nature in check and make it play nice with Graphite?</p>

<h1 id="high-level-approach">High-level approach</h1>

<h2 id="overview-of-the-approach-described-in-this-post">Overview of the approach described in this post</h2>

<p>Here is an overview of the approach we will be using:</p>

<ul>
  <li><em>Each instance</em> of our example Storm bolt gets its own (Java) instance of
<a href="http://metrics.codahale.com/manual/core/">Meter</a>.  This ensures that each bolt instance tracks its count of received
tuples separately from any other bolt instance.</li>
  <li>Also, each bolt instance will get its own instance of <a href="http://metrics.codahale.com/manual/graphite/">GraphiteReporter</a>
to ensure that each bolt instance sends only a single metrics update every 10 seconds, which is the desired temporal
granularity for our monitoring setup.</li>
  <li>All bolt instances on a given Storm node report their metrics under the node’s <em>hostname</em>.  For instance, bolt
instances on the machine <code>storm-node01.example.com</code> will report their metrics under the namespace
<code>production.apps.graphitedemo.storm-node01.tuples.received.*</code>.</li>
  <li>Metrics are being sent to a <code>carbon-aggregator</code> instance running at <code>your.graphite.server.com:2023/tcp</code>.  The
<code>carbon-aggregator</code> ensures that all the individual metrics updates (from bolt instances) of a particular Storm node
are aggregated into a single, per-host metric update.  These per-host metric updates are then forwarded to the
<code>carbon-cache</code> instance, which will store the metric data in the corresponding Whisper database files.</li>
</ul>
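<p>On the wire, the reporter talks to <code>carbon-aggregator</code>&#8217;s line receiver via Graphite&#8217;s plaintext
protocol: one <code>metric-path value epoch-seconds</code> line per update.  For illustration, here is a dependency-free
sketch of what a single update from a bolt instance looks like – the class and method names are mine, not the Metrics
library&#8217;s:</p>

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class PlaintextGraphite {
  // Graphite's plaintext protocol: "<metric.path> <value> <epoch-seconds>\n"
  static String formatLine(String path, double value, long epochSeconds) {
    return path + " " + value + " " + epochSeconds + "\n";
  }

  // Sends one update to carbon-aggregator (2023/tcp in this article's setup).
  static void send(String host, int port, String path, double value, long epochSeconds)
      throws IOException {
    try (Socket socket = new Socket(host, port);
         Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
      out.write(formatLine(path, value, epochSeconds));
    }
  }

  public static void main(String[] args) {
    System.out.print(formatLine(
        "production.apps.graphitedemo.storm-node01.tuples.received.count", 42.0, 1383750000L));
    // production.apps.graphitedemo.storm-node01.tuples.received.count 42.0 1383750000
  }
}
```

<p>In the actual setup you never hand-roll this – <code>GraphiteReporter</code> produces exactly such lines every
10 seconds for each registered metric – but seeing the raw format makes the aggregation rules below easier to read.</p>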

<h2 id="other-approaches-not-used">Other approaches (not used)</h2>

<p>Another strategy is to install an aggregator intermediary (such as <a href="https://github.com/etsy/statsd">statsd</a>) on each
machine in the Storm cluster.  Instances of a bolt on the same machine would be sending their individual updates to this
per-host aggregator daemon, which in turn would send a single, per-host update message to Graphite.  I am sure this
approach would have worked but I decided not to go down this path.  It would have increased the deployment complexity
because now we’d have one more software package to understand, support and manage per machine.</p>

<p>The final setup described in this post achieves what we want by using <code>GraphiteReporter</code> in our Storm code in a way
that is compatible with Graphite’s built-in daemons without needing any additional software such as <code>statsd</code>.</p>

<p>On a completely different note, Storm 0.9 now also comes with its own metrics system, which I do not cover here.
This new metrics feature of Storm allows you to collect arbitrary custom metrics over fixed time windows.  Those
metrics are exported to a metrics stream that you can consume by implementing
<a href="https://github.com/nathanmarz/storm/blob/master/storm-core/src/jvm/backtype/storm/metric/api/IMetricsConsumer.java">IMetricsConsumer</a>
and configuring it via
<a href="https://github.com/nathanmarz/storm/blob/master/storm-core/src/jvm/backtype/storm/Config.java">Config</a> – see the
various <code>*_METRICS_*</code> settings.  Then you need to use <code>TopologyContext#registerMetric()</code> to register new metrics.</p>

<h1 id="integrating-storm-with-graphite">Integrating Storm with Graphite</h1>

<h2 id="configuring-graphite">Configuring Graphite</h2>

<p>I will only cover the key settings of Graphite for the context of this article, which are the settings related to
<code>carbon-cache</code> and <code>carbon-aggregator</code>.  <strong>Those settings must match the settings in your Storm code.</strong>  Matching
settings between Storm and Graphite is critical – if they don’t match, you will end up with junk metric data.</p>

<h3 id="carbon-cache-configuration">carbon-cache configuration</h3>

<p>First we must add a <code>[production_apps]</code> section (the name itself is arbitrary; it should just be descriptive) to
<code>/etc/carbon/storage-schemas.conf</code>.  This controls at which granularity Graphite will store incoming metrics that we are
sending from our Storm topology.  Notably these storage schema settings control:</p>

<ul>
  <li>The minimum temporal granularity for the “raw” incoming metric updates of a given metric namespace:  In our case, for
instance, we want Graphite to track metrics at a raw granularity of 10 seconds for the first two days.  We configure
this via <code>10s:2d</code>.  This minimum granularity (10 seconds) <strong>must match</strong> the report interval we use in our Storm code.</li>
  <li>How Graphite aggregates older metric values that have already been stored in its Whisper database files:
In our case we tell Graphite to aggregate any values older than two days into 5-minute buckets that we want to keep
for one year, hence <code>5m:1y</code>.  This setting (5 minutes) is independent from our Storm code.</li>
</ul>

<div class="warning">
Caution: Graphite knows two different kinds of aggregation.  First, the aggregation of metrics data that is already
stored in its Whisper database files; this aggregation is performed on aging data to save disk storage space.
Second, the real-time aggregation of incoming metrics performed by <tt>carbon-aggregator</tt>; this aggregation
happens for newly received data as it is flying in over the network, i.e. before that data even hits the Whisper
database files.  Do not confuse these two aggregations!
</div>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/storage-schemas.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
</pre></td><td class="code"><pre><code class="text"><span class="line"># Schema definitions for whisper files. Entries are scanned in order, and first match wins.
</span><span class="line">[carbon]
</span><span class="line">pattern = ^carbon\.
</span><span class="line">retentions = 60:90d
</span><span class="line">
</span><span class="line">[production_apps]
</span><span class="line">pattern = ^production\.apps\.
</span><span class="line">retentions = 10s:2d,5m:1y
</span><span class="line">
</span><span class="line">[default_1min_for_1day]
</span><span class="line">pattern = .*
</span><span class="line">retentions = 60s:1d
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Next we must tell Graphite which aggregation method – e.g. <code>sum</code> or <code>average</code> – it should use to perform storage
aggregation of our metrics.  For count-type metrics, for instance, we want to use <code>sum</code> and for rate-type metrics we
want to use <code>average</code>.  By adding the following lines to <code>/etc/carbon/storage-aggregation.conf</code> we ensure that Graphite
correctly aggregates the default metrics sent by Metrics’ GraphiteReporter – <code>count</code>, <code>m1_rate</code>, <code>m5_rate</code>, <code>m15_rate</code>
and <code>mean_rate</code> – once two days have passed.</p>

<div class="note">
Note: The <tt>[min]</tt> and <tt>[max]</tt> sections are not actually used by the setup described in this article, but I
decided to include them anyway to show how they differ from the other settings.  Also, your production Graphite setup
may actually need such settings, too.
</div>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/storage-aggregation.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
</pre></td><td class="code"><pre><code class="text"><span class="line">[min]
</span><span class="line">pattern = \.min$
</span><span class="line">xFilesFactor = 0.1
</span><span class="line">aggregationMethod = min
</span><span class="line">
</span><span class="line">[max]
</span><span class="line">pattern = \.max$
</span><span class="line">xFilesFactor = 0.1
</span><span class="line">aggregationMethod = max
</span><span class="line">
</span><span class="line">[sum]
</span><span class="line">pattern = \.count$
</span><span class="line">xFilesFactor = 0
</span><span class="line">aggregationMethod = sum
</span><span class="line">
</span><span class="line">[m1_rate]
</span><span class="line">pattern = \.m1_rate$
</span><span class="line">xFilesFactor = 0
</span><span class="line">aggregationMethod = average
</span><span class="line">
</span><span class="line">[m5_rate]
</span><span class="line">pattern = \.m5_rate$
</span><span class="line">xFilesFactor = 0
</span><span class="line">aggregationMethod = average
</span><span class="line">
</span><span class="line">[m15_rate]
</span><span class="line">pattern = \.m15_rate$
</span><span class="line">xFilesFactor = 0
</span><span class="line">aggregationMethod = average
</span><span class="line">
</span><span class="line">[default_average]
</span><span class="line">pattern = .*
</span><span class="line">xFilesFactor = 0.3
</span><span class="line">aggregationMethod = average
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Lastly, make sure that the <code>carbon-cache</code> daemon is actually enabled in your <code>/etc/carbon/carbon.conf</code> and configured to
receive incoming data on its <code>LINE_RECEIVER_PORT</code> at <code>2003/tcp</code> and also (!) on its <code>PICKLE_RECEIVER_PORT</code> at
<code>2004/tcp</code>.  The latter port is used by <code>carbon-aggregator</code>, which we will configure in the next section.</p>

<p>Example configuration snippet:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/carbon.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
</pre></td><td class="code"><pre><code class="text"><span class="line"># ...snipp...
</span><span class="line">
</span><span class="line">[cache]
</span><span class="line">LINE_RECEIVER_INTERFACE = 0.0.0.0
</span><span class="line">LINE_RECEIVER_PORT = 2003
</span><span class="line">PICKLE_RECEIVER_INTERFACE = 0.0.0.0
</span><span class="line">PICKLE_RECEIVER_PORT = 2004
</span><span class="line">
</span><span class="line"># ...snipp...
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Don’t forget to restart <code>carbon-cache</code> after changing its configuration:</p>

<pre><code>$ sudo service carbon-cache restart
</code></pre>

<h3 id="carbon-aggregator-configuration">carbon-aggregator configuration</h3>

<p>The last Graphite configuration step is to ensure that we can pre-aggregate the number of reported
<code>tuples.received</code> values across all bolt instances that run on a particular Storm node.</p>

<p>To perform this per-host aggregation on the fly we must add the following lines to <code>/etc/carbon/aggregation-rules.conf</code>.
With those settings in place, whenever any bolt instance running on <code>storm-node01</code> sends a metric such as
<code>production.apps.graphitedemo.storm-node01.tuples.received.count</code> to Graphite (more precisely, to its
<code>carbon-aggregator</code> daemon), the aggregator will combine (here: <code>sum</code>) all such update messages for
<code>storm-node01</code> into a single, aggregated update message for that server every 10 seconds.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/aggregation-rules.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="text"><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.count (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.count
</span><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.m1_rate (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.m1_rate
</span><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.m5_rate (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.m5_rate
</span><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.m15_rate (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.m15_rate
</span><span class="line">&lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.all.tuples.received.mean_rate (10) = sum &lt;env&gt;.apps.&lt;app&gt;.&lt;server&gt;.tuples.received.mean_rate
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Lastly, make sure that the <code>carbon-aggregator</code> daemon is actually enabled in your <code>/etc/carbon/carbon.conf</code> and
configured to receive incoming data on its <code>LINE_RECEIVER_PORT</code> at <code>2023/tcp</code>.  Also, make sure it sends its aggregates
to the <code>PICKLE_RECEIVER_PORT</code> of <code>carbon-cache</code> (port <code>2004/tcp</code>).  See the <code>[aggregator]</code> section.</p>

<p>Example configuration snippet:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/carbon.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="text"><span class="line"># ...snipp...
</span><span class="line">
</span><span class="line">[aggregator]
</span><span class="line">LINE_RECEIVER_INTERFACE = 0.0.0.0
</span><span class="line">LINE_RECEIVER_PORT = 2023
</span><span class="line">PICKLE_RECEIVER_INTERFACE = 0.0.0.0
</span><span class="line">PICKLE_RECEIVER_PORT = 2024
</span><span class="line">DESTINATIONS = 127.0.0.1:2004  # &lt;&lt;&lt; this points to the carbon-cache pickle port
</span><span class="line">
</span><span class="line"># ...snipp...
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Don’t forget to restart <code>carbon-aggregator</code> after changing its configuration:</p>

<pre><code>$ sudo service carbon-aggregator restart
</code></pre>

<h3 id="other-important-graphite-settings">Other important Graphite settings</h3>

<p>You may also want to check the values of the following Carbon settings in <code>/etc/carbon/carbon.conf</code>, particularly if you
are sending a lot of different metrics (= high number of metrics such as <code>my.foo</code> and <code>my.bar</code>) and/or a lot of metric
update messages per second (= high number of incoming metric updates for <code>my.foo</code>).</p>

<p>Whether or not you need to tune those settings depends on your specific use case.  As a rule of thumb: The more Storm
nodes you have, the higher the topology’s parallelism and the higher your data volume, the more likely you will need to
optimize those settings.  If you are not sure, leave them at their defaults and revisit later.</p>

<div class="note">
Note: I&#8217;d say the most important parameters at the very beginning are <tt>MAX_CREATES_PER_MINUTE</tt> (you might hit
this particularly when your topology starts to submit metrics for the very first time) and
<tt>MAX_UPDATES_PER_SECOND</tt>.
</div>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>/etc/carbon/carbon.conf  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
</pre></td><td class="code"><pre><code class="text"><span class="line">[cache]
</span><span class="line"># Limit the size of the cache to avoid swapping or becoming CPU bound.
</span><span class="line"># Sorts and serving cache queries gets more expensive as the cache grows.
</span><span class="line"># Use the value &quot;inf&quot; (infinity) for an unlimited cache size.
</span><span class="line">MAX_CACHE_SIZE = inf
</span><span class="line">
</span><span class="line"># Limits the number of whisper update_many() calls per second, which effectively
</span><span class="line"># means the number of write requests sent to the disk. This is intended to
</span><span class="line"># prevent over-utilizing the disk and thus starving the rest of the system.
</span><span class="line"># When the rate of required updates exceeds this, then carbon&#39;s caching will
</span><span class="line"># take effect and increase the overall throughput accordingly.
</span><span class="line">MAX_UPDATES_PER_SECOND = 500
</span><span class="line">
</span><span class="line"># Softly limits the number of whisper files that get created each minute.
</span><span class="line"># Setting this value low (like at 50) is a good way to ensure your graphite
</span><span class="line"># system will not be adversely impacted when a bunch of new metrics are
</span><span class="line"># sent to it. The trade off is that it will take much longer for those metrics&#39;
</span><span class="line"># database files to all get created and thus longer until the data becomes usable.
</span><span class="line"># Setting this value high (like &quot;inf&quot; for infinity) will cause graphite to create
</span><span class="line"># the files quickly but at the risk of slowing I/O down considerably for a while.
</span><span class="line">MAX_CREATES_PER_MINUTE = 50
</span><span class="line">
</span><span class="line">[aggregator]
</span><span class="line"># This is the maximum number of datapoints that can be queued up
</span><span class="line"># for a single destination. Once this limit is hit, we will
</span><span class="line"># stop accepting new data if USE_FLOW_CONTROL is True, otherwise
</span><span class="line"># we will drop any subsequently received datapoints.
</span><span class="line">MAX_QUEUE_SIZE = 10000
</span><span class="line">
</span><span class="line"># Set this to False to drop datapoints when any send queue (sending datapoints
</span><span class="line"># to a downstream carbon daemon) hits MAX_QUEUE_SIZE. If this is True (the
</span><span class="line"># default) then sockets over which metrics are received will temporarily stop accepting
</span><span class="line"># data until the send queues fall below 80% MAX_QUEUE_SIZE.
</span><span class="line">USE_FLOW_CONTROL = True
</span><span class="line">
</span><span class="line"># This defines the maximum &quot;message size&quot; between carbon daemons.
</span><span class="line"># You shouldn&#39;t need to tune this unless you really know what you&#39;re doing.
</span><span class="line">MAX_DATAPOINTS_PER_MESSAGE = 500
</span><span class="line">
</span><span class="line"># This defines how many datapoints the aggregator remembers for
</span><span class="line"># each metric. Aggregation only happens for datapoints that fall in
</span><span class="line"># the past MAX_AGGREGATION_INTERVALS * intervalSize seconds.
</span><span class="line">MAX_AGGREGATION_INTERVALS = 5
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="configuring-your-storm-code">Configuring your Storm code</h2>

<h3 id="add-the-metrics-library-to-your-storm-code-project">Add the Metrics library to your Storm code project</h3>

<p><em>The instructions below are for Gradle, but it is straightforward to adapt them to Maven if that is your tool of choice.</em></p>

<p>Now that we have finished the Graphite setup we can turn our attention to augmenting our Storm code to work with
Graphite.  Make sure <code>build.gradle</code> in your Storm code project looks similar to the following:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>build.gradle  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
</pre></td><td class="code"><pre><code class="text"><span class="line">buildscript {
</span><span class="line">  repositories {
</span><span class="line">    mavenCentral()
</span><span class="line">  }
</span><span class="line">  dependencies {
</span><span class="line">    // see https://github.com/musketyr/gradle-fatjar-plugin
</span><span class="line">    classpath &#39;eu.appsatori:gradle-fatjar-plugin:0.2-rc1&#39;
</span><span class="line">  }
</span><span class="line">}
</span><span class="line">
</span><span class="line">apply plugin: &#39;java&#39;
</span><span class="line">apply plugin: &#39;fatjar&#39;
</span><span class="line">// ...other plugins may follow here...
</span><span class="line">
</span><span class="line">// We use JDK 6.
</span><span class="line">sourceCompatibility = 1.6
</span><span class="line">targetCompatibility = 1.6
</span><span class="line">
</span><span class="line">group = &#39;com.miguno.storm.graphitedemo&#39;
</span><span class="line">version = &#39;0.1.0-SNAPSHOT&#39;
</span><span class="line">
</span><span class="line">repositories {
</span><span class="line">    mavenCentral()
</span><span class="line">    // required for Storm jars
</span><span class="line">    mavenRepo url: &quot;http://clojars.org/repo&quot;
</span><span class="line">}
</span><span class="line">
</span><span class="line">dependencies {
</span><span class="line">  // Metrics library for reporting to Graphite
</span><span class="line">  compile &#39;com.codahale.metrics:metrics-core:3.0.1&#39;
</span><span class="line">  compile &#39;com.codahale.metrics:metrics-annotation:3.0.1&#39;
</span><span class="line">  compile &#39;com.codahale.metrics:metrics-graphite:3.0.1&#39;
</span><span class="line">
</span><span class="line">  // Storm
</span><span class="line">  compile &#39;storm:storm:0.9.0-rc2&#39;, {
</span><span class="line">    ext {
</span><span class="line">      // Storm puts its own jar files on the CLASSPATH of a running topology by itself,
</span><span class="line">      // and therefore does not want you to re-bundle Storm&#39;s class files with your
</span><span class="line">      // topology jar.
</span><span class="line">      fatJarExclude = true
</span><span class="line">    }
</span><span class="line">  }
</span><span class="line">
</span><span class="line">  // ...other dependencies may follow here...
</span><span class="line">}
</span><span class="line">
</span><span class="line">// ...other gradle settings may follow here...
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>You can then run the usual Gradle commands to compile, test, and package your code.  In particular, you can now run:</p>

<pre><code>$ gradle clean fatJar
</code></pre>

<p>This command creates a <em>fat jar</em> (also called an <em>uber jar</em>) of your Storm topology code, which by default is stored under
<code>build/libs/*.jar</code>.  You can use this jar file to submit your topology to Storm via the <code>storm jar</code> command.
See the section on how to
<a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/#build-a-correct-standalone--fat-jar-file-of-my-storm-code">build a correct standalone jar file of your Storm code</a>
in my Storm multi-node cluster tutorial for details.</p>

<h3 id="sending-metrics-from-a-storm-bolt-to-graphite">Sending metrics from a Storm bolt to Graphite</h3>

<p>In this section we will augment a Storm bolt (spouts work just the same) to report our <code>tuples.received</code> metric to
Graphite.</p>

<p>Each instance of our bolt will send this metric under the Graphite namespace
<code>production.apps.graphitedemo.HOSTNAME.tuples.received.*</code> every 10 seconds to the <code>carbon-aggregator</code> daemon running at
<code>your.graphite.server.com:2023/tcp</code>.</p>
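<p>To make the resulting namespace concrete, here is a plain-Java sketch (no Metrics dependency; the class name and the hostname <code>storm-node1.example.com</code> are made up for illustration) of how the full metric path is assembled from the prefix, the short hostname, and the meter name.  The <code>GraphiteReporter</code> prefixes every metric with the value passed to <code>prefixedWith()</code>, and the Metrics library appends per-meter suffixes such as <code>count</code> and <code>m1_rate</code>, hence the trailing <code>.*</code> above:</p>

```java
// Sketch of how the full Graphite path for the meter is composed.
public class MetricsPathExample {

  // Mirrors metricsPath() from the bolt below: prefix + "." + short hostname.
  static String metricsPath(String prefix, String fqhn) {
    String shortHostname = fqhn.contains(".") ? fqhn.split("\\.")[0] : fqhn;
    return prefix + "." + shortHostname;
  }

  public static void main(String[] args) {
    // "storm-node1.example.com" is a hypothetical worker hostname.
    String prefix = metricsPath("production.apps.graphitedemo", "storm-node1.example.com");
    String meterName = "tuples.received"; // what MetricRegistry.name("tuples", "received") yields
    System.out.println(prefix + "." + meterName + ".count");
    // -> production.apps.graphitedemo.storm-node1.tuples.received.count
  }
}
```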

<p>The <strong>key points</strong> of the code below are, firstly, the use of a <code>transient private</code> field for the <code>Meter</code> instance.  If
you do not make the field <code>transient</code>, Storm will throw a <code>NotSerializableException</code> at runtime.  This is because
Storm serializes bolt instances and ships them over the network to the workers that execute them.  For this
reason our bolt initializes the <code>Meter</code> instance during the <code>prepare()</code> phase of a bolt instance, which
ensures that the <code>Meter</code> is set up before the first tuples arrive at the bolt instance.  This part takes care of
properly <em>counting</em> the tuples.</p>

<div class="note">
Note: Do not try to make the field <tt>static</tt> either.  While this prevents the
<tt>NotSerializableException</tt>, it also means that all instances of the bolt running in the same JVM share
the same <tt>Meter</tt> instance (and typically you will have many instances on many JVMs on many Storm nodes),
which causes loss of metrics data.  In this case you would observe in Graphite that the <tt>tuples.received.*</tt>
metrics significantly under-count the actual number of incoming tuples.  Been there, done that. :-)
</div>
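<p>The under-counting pitfall is easy to reproduce without Storm at all.  The following stand-in (plain Java, no Storm or Metrics dependency; class names are made up) mimics what happens when every bolt instance re-initializes a shared <tt>static</tt> counter in its <code>prepare()</code> equivalent: the second initialization wipes out the first instance&#8217;s counts, while per-instance counters keep everything:</p>

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal stand-in for the static-field pitfall described above.
public class StaticCounterPitfall {

  static class StaticBolt {
    static AtomicLong counter;                    // shared across all instances
    StaticBolt() { counter = new AtomicLong(); }  // each init overwrites the shared counter!
    void execute() { counter.incrementAndGet(); }
  }

  static class InstanceBolt {
    final AtomicLong counter = new AtomicLong();  // one counter per instance
    void execute() { counter.incrementAndGet(); }
  }

  public static void main(String[] args) {
    StaticBolt s1 = new StaticBolt();
    for (int i = 0; i < 100; i++) s1.execute();
    StaticBolt s2 = new StaticBolt();             // wipes out the 100 counts above
    for (int i = 0; i < 100; i++) s2.execute();
    System.out.println("static field sees:  " + StaticBolt.counter.get()); // 100, not 200

    InstanceBolt i1 = new InstanceBolt();
    InstanceBolt i2 = new InstanceBolt();
    for (int i = 0; i < 100; i++) { i1.execute(); i2.execute(); }
    long total = i1.counter.get() + i2.counter.get();
    System.out.println("per-instance total: " + total); // 200
  }
}
```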

<p>Secondly, the <code>prepare()</code> method also creates a new, dedicated <code>GraphiteReporter</code> instance for each bolt instance.  This
achieves proper <em>reporting</em> of metric updates to Graphite.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span>BoltThatAlsoReportsToGraphite.java  </span></figcaption>
 <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
<span class="line-number">26</span>
<span class="line-number">27</span>
<span class="line-number">28</span>
<span class="line-number">29</span>
<span class="line-number">30</span>
<span class="line-number">31</span>
<span class="line-number">32</span>
<span class="line-number">33</span>
<span class="line-number">34</span>
<span class="line-number">35</span>
<span class="line-number">36</span>
<span class="line-number">37</span>
<span class="line-number">38</span>
<span class="line-number">39</span>
<span class="line-number">40</span>
<span class="line-number">41</span>
<span class="line-number">42</span>
<span class="line-number">43</span>
<span class="line-number">44</span>
<span class="line-number">45</span>
<span class="line-number">46</span>
<span class="line-number">47</span>
<span class="line-number">48</span>
<span class="line-number">49</span>
<span class="line-number">50</span>
<span class="line-number">51</span>
<span class="line-number">52</span>
<span class="line-number">53</span>
<span class="line-number">54</span>
<span class="line-number">55</span>
<span class="line-number">56</span>
<span class="line-number">57</span>
<span class="line-number">58</span>
<span class="line-number">59</span>
<span class="line-number">60</span>
<span class="line-number">61</span>
<span class="line-number">62</span>
<span class="line-number">63</span>
<span class="line-number">64</span>
<span class="line-number">65</span>
<span class="line-number">66</span>
<span class="line-number">67</span>
<span class="line-number">68</span>
<span class="line-number">69</span>
<span class="line-number">70</span>
<span class="line-number">71</span>
<span class="line-number">72</span>
<span class="line-number">73</span>
<span class="line-number">74</span>
<span class="line-number">75</span>
<span class="line-number">76</span>
<span class="line-number">77</span>
<span class="line-number">78</span>
<span class="line-number">79</span>
<span class="line-number">80</span>
<span class="line-number">81</span>
<span class="line-number">82</span>
<span class="line-number">83</span>
<span class="line-number">84</span>
<span class="line-number">85</span>
<span class="line-number">86</span>
<span class="line-number">87</span>
<span class="line-number">88</span>
<span class="line-number">89</span>
<span class="line-number">90</span>
<span class="line-number">91</span>
<span class="line-number">92</span>
<span class="line-number">93</span>
<span class="line-number">94</span>
<span class="line-number">95</span>
<span class="line-number">96</span>
<span class="line-number">97</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="kn">package</span> <span class="n">com</span><span class="o">.</span><span class="na">miguno</span><span class="o">.</span><span class="na">storm</span><span class="o">.</span><span class="na">graphitedemo</span><span class="o">;</span>
</span><span class="line">
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.Meter</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.MetricFilter</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.MetricRegistry</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.graphite.Graphite</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">com.codahale.metrics.graphite.GraphiteReporter</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">org.apache.log4j.Logger</span><span class="o">;</span>
</span><span class="line"><span class="c1">// ...other imports such as backtype.storm.*, java.net.InetSocketAddress, java.util.Map and java.util.concurrent.TimeUnit omitted for clarity...</span>
</span><span class="line">
</span><span class="line"><span class="kn">import</span> <span class="nn">java.net.InetAddress</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">java.net.UnknownHostException</span><span class="o">;</span>
</span><span class="line"><span class="kn">import</span> <span class="nn">java.util.regex.Pattern</span><span class="o">;</span>
</span><span class="line">
</span><span class="line"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">BoltThatAlsoReportsToGraphite</span> <span class="kd">extends</span> <span class="n">BaseBasicBolt</span> <span class="o">{</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">Logger</span> <span class="n">LOG</span> <span class="o">=</span> <span class="n">Logger</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">BoltThatAlsoReportsToGraphite</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">String</span> <span class="n">GRAPHITE_HOST</span> <span class="o">=</span> <span class="s">&quot;your.graphite.server.com&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="kt">int</span> <span class="n">CARBON_AGGREGATOR_LINE_RECEIVER_PORT</span> <span class="o">=</span> <span class="mi">2023</span><span class="o">;</span>
</span><span class="line">  <span class="c1">// The following value must match carbon-cache&#39;s storage-schemas.conf!</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="kt">int</span> <span class="n">GRAPHITE_REPORT_INTERVAL_IN_SECONDS</span> <span class="o">=</span> <span class="mi">10</span><span class="o">;</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">String</span> <span class="n">GRAPHITE_METRICS_NAMESPACE_PREFIX</span> <span class="o">=</span>
</span><span class="line">    <span class="s">&quot;production.apps.graphitedemo&quot;</span><span class="o">;</span>
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">Pattern</span> <span class="n">hostnamePattern</span> <span class="o">=</span>
</span><span class="line">    <span class="n">Pattern</span><span class="o">.</span><span class="na">compile</span><span class="o">(</span><span class="s">&quot;^[a-zA-Z0-9][a-zA-Z0-9-]*(\\.([a-zA-Z0-9][a-zA-Z0-9-]*))*$&quot;</span><span class="o">);</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kd">transient</span> <span class="n">Meter</span> <span class="n">tuplesReceived</span><span class="o">;</span>
</span><span class="line">
</span><span class="line">  <span class="nd">@Override</span>
</span><span class="line">  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">prepare</span><span class="o">(</span><span class="n">Map</span> <span class="n">stormConf</span><span class="o">,</span> <span class="n">TopologyContext</span> <span class="n">context</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="n">initializeMetricReporting</span><span class="o">();</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kt">void</span> <span class="nf">initializeMetricReporting</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">    <span class="kd">final</span> <span class="n">MetricRegistry</span> <span class="n">registry</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MetricRegistry</span><span class="o">();</span>
</span><span class="line">    <span class="kd">final</span> <span class="n">Graphite</span> <span class="n">graphite</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Graphite</span><span class="o">(</span><span class="k">new</span> <span class="n">InetSocketAddress</span><span class="o">(</span><span class="n">GRAPHITE_HOST</span><span class="o">,</span>
</span><span class="line">        <span class="n">CARBON_AGGREGATOR_LINE_RECEIVER_PORT</span><span class="o">));</span>
</span><span class="line">    <span class="kd">final</span> <span class="n">GraphiteReporter</span> <span class="n">reporter</span> <span class="o">=</span> <span class="n">GraphiteReporter</span><span class="o">.</span><span class="na">forRegistry</span><span class="o">(</span><span class="n">registry</span><span class="o">)</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">prefixedWith</span><span class="o">(</span><span class="n">metricsPath</span><span class="o">())</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">convertRatesTo</span><span class="o">(</span><span class="n">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">)</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">convertDurationsTo</span><span class="o">(</span><span class="n">TimeUnit</span><span class="o">.</span><span class="na">MILLISECONDS</span><span class="o">)</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="n">MetricFilter</span><span class="o">.</span><span class="na">ALL</span><span class="o">)</span>
</span><span class="line">                                        <span class="o">.</span><span class="na">build</span><span class="o">(</span><span class="n">graphite</span><span class="o">);</span>
</span><span class="line">    <span class="n">reporter</span><span class="o">.</span><span class="na">start</span><span class="o">(</span><span class="n">GRAPHITE_REPORT_INTERVAL_IN_SECONDS</span><span class="o">,</span> <span class="n">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">);</span>
</span><span class="line">    <span class="n">tuplesReceived</span> <span class="o">=</span> <span class="n">registry</span><span class="o">.</span><span class="na">meter</span><span class="o">(</span><span class="n">MetricRegistry</span><span class="o">.</span><span class="na">name</span><span class="o">(</span><span class="s">&quot;tuples&quot;</span><span class="o">,</span> <span class="s">&quot;received&quot;</span><span class="o">));</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="n">String</span> <span class="nf">metricsPath</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">    <span class="kd">final</span> <span class="n">String</span> <span class="n">myHostname</span> <span class="o">=</span> <span class="n">extractHostnameFromFQHN</span><span class="o">(</span><span class="n">detectHostname</span><span class="o">());</span>
</span><span class="line">    <span class="k">return</span> <span class="n">GRAPHITE_METRICS_NAMESPACE_PREFIX</span> <span class="o">+</span> <span class="s">&quot;.&quot;</span> <span class="o">+</span> <span class="n">myHostname</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="nd">@Override</span>
</span><span class="line">  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">execute</span><span class="o">(</span><span class="n">Tuple</span> <span class="n">tuple</span><span class="o">,</span> <span class="n">BasicOutputCollector</span> <span class="n">collector</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="n">tuplesReceived</span><span class="o">.</span><span class="na">mark</span><span class="o">();</span>
</span><span class="line">
</span><span class="line">    <span class="c1">// FYI: We do not need to explicitly ack() the tuple because we are extending</span>
</span><span class="line">    <span class="c1">// BaseBasicBolt, which will automatically take care of that.</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="c1">// ...other bolt code may follow here...</span>
</span><span class="line">
</span><span class="line">  <span class="c1">//</span>
</span><span class="line">  <span class="c1">// Helper methods to detect the hostname of the machine that</span>
</span><span class="line">  <span class="c1">// executes this instance of a bolt.  Normally you&#39;d want to</span>
</span><span class="line">  <span class="c1">// move this functionality into a separate class to adhere</span>
</span><span class="line">  <span class="c1">// to the single responsibility principle.</span>
</span><span class="line">  <span class="c1">//</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">detectHostname</span><span class="o">()</span> <span class="o">{</span>
</span><span class="line">    <span class="n">String</span> <span class="n">hostname</span> <span class="o">=</span> <span class="s">&quot;hostname-could-not-be-detected&quot;</span><span class="o">;</span>
</span><span class="line">    <span class="k">try</span> <span class="o">{</span>
</span><span class="line">      <span class="n">hostname</span> <span class="o">=</span> <span class="n">InetAddress</span><span class="o">.</span><span class="na">getLocalHost</span><span class="o">().</span><span class="na">getHostName</span><span class="o">();</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="k">catch</span> <span class="o">(</span><span class="n">UnknownHostException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">      <span class="n">LOG</span><span class="o">.</span><span class="na">error</span><span class="o">(</span><span class="s">&quot;Could not determine hostname&quot;</span><span class="o">,</span> <span class="n">e</span><span class="o">);</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="k">return</span> <span class="n">hostname</span><span class="o">;</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line">  <span class="kd">private</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">extractHostnameFromFQHN</span><span class="o">(</span><span class="n">String</span> <span class="n">fqhn</span><span class="o">)</span> <span class="o">{</span>
</span><span class="line">    <span class="k">if</span> <span class="o">(</span><span class="n">hostnamePattern</span><span class="o">.</span><span class="na">matcher</span><span class="o">(</span><span class="n">fqhn</span><span class="o">).</span><span class="na">matches</span><span class="o">())</span> <span class="o">{</span>
</span><span class="line">      <span class="k">if</span> <span class="o">(</span><span class="n">fqhn</span><span class="o">.</span><span class="na">contains</span><span class="o">(</span><span class="s">&quot;.&quot;</span><span class="o">))</span> <span class="o">{</span>
</span><span class="line">        <span class="k">return</span> <span class="n">fqhn</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">&quot;\\.&quot;</span><span class="o">)[</span><span class="mi">0</span><span class="o">];</span>
</span><span class="line">      <span class="o">}</span>
</span><span class="line">      <span class="k">else</span> <span class="o">{</span>
</span><span class="line">        <span class="k">return</span> <span class="n">fqhn</span><span class="o">;</span>
</span><span class="line">      <span class="o">}</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">    <span class="k">else</span> <span class="o">{</span>
</span><span class="line">      <span class="c1">// We want to return the input as-is</span>
</span><span class="line">      <span class="c1">// when it is not a valid hostname/FQHN.</span>
</span><span class="line">      <span class="k">return</span> <span class="n">fqhn</span><span class="o">;</span>
</span><span class="line">    <span class="o">}</span>
</span><span class="line">  <span class="o">}</span>
</span><span class="line">
</span><span class="line"><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>That’s it!  Your Storm bolt instances will report their respective counts of received tuples to Graphite every 10
seconds.</p>

<h1 id="summary">Summary</h1>

<p>At this point you should have successfully married Storm with Graphite, and also learned a few basics about how
Graphite and Storm work along the way.  Now you can begin creating graphs and dashboards for your Storm applications,
which was the reason to do all this in the first place, right?</p>

<p>Enjoy! <em>–Michael</em></p>

<h1 id="appendix">Appendix</h1>

<h2 id="where-to-go-from-here">Where to go from here</h2>

<ul>
  <li>Want to install and configure Graphite automatically?  Take a look at my
<a href="https://github.com/miguno/puppet-graphite">puppet-graphite</a> module for <a href="http://puppetlabs.com/">Puppet</a>.  See also
my previous post on
<a href="http://www.michael-noll.com/blog/2013/06/06/installing-and-running-graphite-via-rpm-and-supervisord/">Installing and Running Graphite via RPM and Supervisord</a> for an alternative, manual installation approach.</li>
  <li>Storm exposes a plethora of built-in metrics that greatly augment the application-level metrics we described in this
article.  In 2015 we open sourced <a href="https://github.com/verisign/storm-graphite">storm-graphite</a>, which automatically
forwards these built-in metrics from Storm to Graphite.  You can enable storm-graphite globally in your Storm cluster
or selectively for only a subset of your topologies.</li>
  <li>You should start sending <em>system metrics</em> (CPU, memory and such) to Graphite, too.  This allows you to correlate the
performance of your Storm topologies with the health of the machines in the cluster.  Very helpful for detecting and
fixing bottlenecks!  There are a couple of tools that can collect these system metrics for you and forward them to
Graphite.  One of those tools is <a href="https://github.com/BrightcoveOS/Diamond">Diamond</a>.  Take a look at my
<a href="https://github.com/miguno/puppet-diamond">puppet-diamond</a> Puppet module to automatically install and configure
Diamond on your Storm cluster nodes.</li>
  <li>Want to install and configure Storm automatically?  I will soon release an automated deployment tool called
Wirbelsturm, which will allow you to deploy software such as Storm and Kafka.  Wirbelsturm is essentially a
curated collection of <a href="http://puppetlabs.com/">Puppet</a> modules (which can be used standalone, too) plus a ready-to-use
<a href="http://www.vagrantup.com/">Vagrant</a> setup to deploy machines locally or to, say, Amazon AWS.  <code>puppet-graphite</code> and
<code>puppet-diamond</code> above are part of the package, by the way.  Please stay tuned!  In the meantime my tutorial
<a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/">Running a Multi-Node Storm Cluster</a> should get you started.</li>
</ul>

<h2 id="caveat-storm-samples-metrics-for-the-storm-ui">Caveat: Storm samples metrics for the Storm UI</h2>

<p>If you do want to compare values 1:1 between the Storm UI and Graphite, please be aware that Storm samples
incoming tuples when computing its stats.  The default sampling rate is 0.05 (5%), configurable through
<code>Config.TOPOLOGY_STATS_SAMPLE_RATE</code>.</p>

<blockquote><p>The way it works is that if you choose a sampling rate of 0.05, it will pick a random element of the next 20 events in which to increase the count by 20.  So if you have 20 tasks for that bolt, your stats could be off by +-380.</p><footer><strong>Nathan Marz on storm-user</strong> <cite><a href="https://groups.google.com/d/msg/storm-user/q40AQHCV1L4/-XrOmBIAAngJ">groups.google.com/d/msg/&hellip;</a></cite></footer></blockquote>
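<p>In other words, tuples are processed in buckets of 1/rate (here: 20) tuples, and exactly one randomly chosen
tuple per bucket is counted with a weight equal to the bucket size.  A minimal sketch of such a sampled counter
(illustration code only, not Storm’s actual implementation):</p>

```java
import java.util.Random;

// Sketch of bucket-based sampled counting as described in the quote above.
// This is illustration code only, NOT Storm's actual implementation.
public class SampledCounter {
    private final int bucketSize;   // 1 / sampleRate, e.g. 20 for a rate of 0.05
    private final Random random = new Random();
    private long count = 0;
    private int remaining = 0;      // tuples left in the current bucket
    private int sampleAt = 0;       // which tuple within the bucket gets counted

    public SampledCounter(double sampleRate) {
        this.bucketSize = (int) Math.round(1.0 / sampleRate);
    }

    public void onTuple() {
        if (remaining == 0) {                       // start a new bucket
            remaining = bucketSize;
            sampleAt = random.nextInt(bucketSize);  // pick one random tuple in it
        }
        if (sampleAt == remaining - 1) {
            count += bucketSize;                    // count it with the bucket's weight
        }
        remaining--;
    }

    public long getCount() {
        return count;
    }
}
```

<p>After any multiple of 20 tuples the sampled count is exact; only within a partially processed bucket can it be
off by up to ±19 per task, which is where the ±380 for 20 tasks in the quote above comes from.</p>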

<p>To force Storm to count every tuple exactly, at the cost of a significant performance hit to your topology,
you can set the sampling rate to 100%:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_STATS_SAMPLE_RATE</span><span class="o">,</span> <span class="mf">1.0</span><span class="o">);</span> <span class="c1">// default is 0.05</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Replephant: Analyzing Hadoop Cluster Usage with Clojure]]></title>
    <link href="http://www.michael-noll.com/blog/2013/09/17/replephant-analyzing-hadoop-cluster-usage-with-clojure/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-09-17T10:29:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2013/09/17/replephant-analyzing-hadoop-cluster-usage-with-clojure</id>
    <content type="html"><![CDATA[<p>Understanding how a Hadoop cluster is actually used in practice is paramount to managing and operating it properly.
In this article I introduce <a href="https://github.com/miguno/replephant">Replephant</a>, an open source Clojure library to
perform interactive analysis of Hadoop cluster usage via the REPL and to generate usage reports.</p>

<!-- more -->

<p><br clear="all" /></p>

<div class="note">
  <strong>
    Replephant is available at <a href="https://github.com/miguno/replephant">replephant</a> on GitHub.
  </strong>
</div>

<h1 id="replephant-in-one-minute">Replephant in one minute</h1>

<p>This section is an appetizer of what you can do with Replephant.  Do not worry if something is not immediately obvious
to you – the <a href="https://github.com/miguno/replephant">Replephant documentation</a> describes everything in full detail.</p>

<p>First, clone the Replephant repository and start the Clojure REPL.  You must have <code>lein</code> (Leiningen) already installed;
if you do not, please follow the
<a href="https://github.com/miguno/replephant#Installation">Replephant installation instructions</a>.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git clone https://github.com/miguno/replephant.git
</span><span class="line"><span class="nv">$ </span><span class="nb">cd </span>replephant
</span><span class="line"><span class="nv">$ </span>lein repl
</span><span class="line">
</span><span class="line"><span class="c"># once the REPL is loaded the prompt will change to:</span>
</span><span class="line">replephant.core<span class="o">=</span>&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Then you can begin analyzing the usage of your own cluster:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">; The root directory is usually the one defined by Hadoop&#39;s</span>
</span><span class="line"><span class="c1">; mapred.job.tracker.history.completed.location and/or</span>
</span><span class="line"><span class="c1">; hadoop.job.history.location settings</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs</span> <span class="p">(</span><span class="nf">load-jobs</span> <span class="s">&quot;/local/path/to/hadoop/job-history-root-dir&quot;</span><span class="p">))</span>
</span><span class="line">
</span><span class="line"><span class="c1">; How many jobs are in the log data?</span>
</span><span class="line"><span class="p">(</span><span class="nb">count </span><span class="nv">jobs</span><span class="p">)</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="mi">12</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Show me all the users who ran one or more jobs in the cluster</span>
</span><span class="line"><span class="p">(</span><span class="nb">distinct </span><span class="p">(</span><span class="nb">map </span><span class="ss">:user.name</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">(</span><span class="s">&quot;miguno&quot;</span>, <span class="s">&quot;alice&quot;</span>, <span class="s">&quot;bob&quot;</span>, <span class="s">&quot;daniel&quot;</span>, <span class="s">&quot;carl&quot;</span>, <span class="s">&quot;jim&quot;</span><span class="p">)</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Consumption of computation resources: which Hadoop users</span>
</span><span class="line"><span class="c1">; account for most of the tasks launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-user</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="s">&quot;miguno&quot;</span> <span class="mi">2329</span>, <span class="s">&quot;alice&quot;</span> <span class="mi">2208</span>, <span class="s">&quot;carl&quot;</span> <span class="mi">1440</span>, <span class="s">&quot;daniel&quot;</span> <span class="mi">19</span>, <span class="s">&quot;bob&quot;</span> <span class="mi">2</span>, <span class="s">&quot;jim&quot;</span> <span class="mi">2</span><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Alright, that was a quick start!  The next sections cover Replephant in more depth.</p>

<h1 id="motivation">Motivation</h1>

<p>Understanding how a <a href="http://hadoop.apache.org/">Hadoop</a> cluster is actually used in practice is paramount to
managing and operating it properly.  This includes knowing cluster usage across the following dimensions:</p>

<ul>
  <li>Which <strong>users</strong> account for most of the resource consumption in the cluster (impacts e.g. capacity planning, budgeting
and billing in multi-tenant environments, cluster configuration settings such as scheduler pool/queue settings).</li>
  <li>Which <strong>analysis tools</strong> such as <a href="http://pig.apache.org/">Pig</a> or <a href="http://hive.apache.org/">Hive</a> are preferred by the
users (impacts e.g. cluster roadmap, training, providing custom helper libraries and UDFs).</li>
  <li>Which <strong>data sets</strong> account for most of the analyses being performed (impacts e.g. prolonging or canceling data
subscriptions, data archiving and aging, HDFS replication settings).</li>
  <li>Which <strong>MapReduce jobs</strong> consume most of the resources in the cluster and for how long (impacts e.g. how the jobs are
coded and configured, when and where they are launched; also allows your Ops team to point and shake fingers).</li>
</ul>

<p>Replephant was created to answer those important questions by inspecting production Hadoop logs (here: so-called Hadoop
job configuration and job history files) and allowing you to derive relevant statistics from the data.  Notably, it
enables you to leverage Clojure’s REPL to interactively perform such analyses.  You can even create visualizations and
plots from Replephant’s usage reports by drawing upon the data viz magics of tools such as <a href="http://www.r-project.org/">R</a>
and <a href="http://incanter.org/">Incanter</a> (see <a href="#FAQ">FAQ</a> section).</p>

<p>Apart from its original goals Replephant has also proven to be useful in cluster/job troubleshooting and debugging.
Because Replephant is <a href="https://github.com/miguno/replephant#Requirements">lightweight</a> and
<a href="https://github.com/miguno/replephant#Installation">easy to install</a>, operations teams can conveniently run Replephant
in production environments if needed.</p>

<h2 id="related-work">Related work</h2>

<p>The following projects are similar to Replephant:</p>

<ul>
  <li><a href="https://github.com/harelba/hadoop-job-analyzer">hadoop-job-analyzer</a> – analyzes Hadoop jobs, aggregates the
information according to user-specified cross-sections, and sends the output to a metrics backend (e.g. Graphite) for
visualization and analysis.  Its analysis is based on parsing Hadoop’s job log files, just like Replephant does.</li>
</ul>

<p>If you are interested in more sophisticated cluster usage analysis you may want to take a look at:</p>

<ul>
  <li><a href="http://data.linkedin.com/opensource/white-elephant">White Elephant</a> (by LinkedIn) is an open source Hadoop log
aggregator and dashboard which enables visualization of Hadoop cluster utilization across users and over time.</li>
  <li><a href="https://github.com/twitter/hraven">hRaven</a> (by Twitter) collects run time data and statistics from MapReduce jobs
running on Hadoop clusters and stores the collected job history in an easily queryable format.  A nice feature of
hRaven is that it can group together related MapReduce jobs that are spawned from a single higher-level analysis
job (e.g. a Pig job usually manifests itself as several chained MapReduce jobs).  A current drawback
of hRaven is that it only supports Cloudera CDH3 up to CDH3u4; CDH3u5, Hadoop 1.x, and Hadoop 2.x are not supported
yet.</li>
  <li>Commercial offerings such as
<a href="http://www.cloudera.com/content/cloudera/en/products/cloudera-manager.html">Cloudera Manager (Enterprise Core)</a>,
<a href="http://hortonworks.com/products/hortonworksdataplatform/">Hortonworks Management Center</a> or
<a href="http://www.mapr.com/products/mapr-editions/m5-edition">MapR M5</a> include cluster usage reporting features.</li>
</ul>

<h1 id="features">Features</h1>

<p>Replephant’s main value proposition is to read and parse Hadoop’s raw log files and turn them into ready-to-use
<a href="http://clojure.org/">Clojure</a> data structures – because, as is often the case in such data analyses, preparing
and loading the original raw data is the hardest part.</p>

<p>On top of this <a href="http://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a> functionality Replephant also includes a set
of basic usage reports such as <code>(tasks-by-user jobs)</code> and convenient filter predicates such as <code>pig?</code> (see
<a href="https://github.com/miguno/replephant#Usage">Usage</a> section on GitHub).  But even more interesting is the fact that you
can use the Clojure REPL including all of Clojure’s own powerful features to interactively drill down into the job data
yourself.</p>

<h1 id="getting-started">Getting started</h1>

<h2 id="requirements">Requirements</h2>

<ul>
  <li>Java JDK/JRE &gt;= 6</li>
  <li><a href="http://leiningen.org/">Leiningen version 2</a> – either install manually or use your favorite package manager such as
<a href="http://mxcl.github.io/homebrew/">HomeBrew</a> for Macs</li>
</ul>

<p>That’s it!</p>

<h2 id="installation">Installation</h2>

<p>Apart from meeting Replephant’s requirements (see above), you only need to clone Replephant’s git repository.</p>

<pre><code># Option 1: using HTTPS for data transfer
$ git clone https://github.com/miguno/replephant.git

# Option 2: using SSH for data transfer (requires GitHub user account)
$ git clone git@github.com:miguno/replephant.git
</code></pre>

<p><em>Note: This step requires a working Internet connection and appropriate firewall settings, which you may or may not</em>
<em>have in a production environment.</em></p>

<h1 id="data-structures-and-usage-analysis">Data structures and usage analysis</h1>

<p>When you analyze your Hadoop cluster’s usage with Replephant you will be working with two data structures:</p>

<ol>
  <li><em>Jobs</em>: The main data we are interested in for cluster usage analysis, parsed by Replephant from the raw Hadoop job
logs.</li>
  <li><em>Data sets</em>: Defined by the user, i.e. you!</li>
</ol>

<h2 id="jobs">Jobs</h2>

<p>Jobs are modelled as associative data structures that map Hadoop job parameters as well as Hadoop job history data to
their respective values.  Both the keys in the data structure – the names of job parameters and the names of data fields
in the job history data, which together we simply call <em>fields</em> – and their values are derived straight from the
Hadoop logs.</p>

<p>Replephant converts the keys of the data fields into Clojure keywords according to the following schema:</p>

<ul>
  <li>Job parameters (from job configuration files) are directly converted into keywords.  For instance,
<code>mapred.input.dir</code> becomes <code>:mapred.input.dir</code> (note the leading colon, which denotes a Clojure keyword).</li>
  <li>Job history data including job counters (from job history files) are lowercased and converted into Lisp-style keywords.
For instance, the job counter <code>HDFS_BYTES_WRITTEN</code> becomes <code>:hdfs-bytes-written</code> and a field such as
<code>JOB_PRIORITY</code> becomes <code>:job-priority</code>.</li>
</ul>

<p>Basically, everything that looks like <code>:words.with.dot.separators</code> is normally a job parameter, whereas anything else
is derived from job history data.  The values of the various fields are, where possible, converted into the appropriate
Clojure data types (e.g. a value representing an integer will be correctly turned into an <code>int</code>, and the strings “true”
and “false” are converted into their respective boolean values).</p>
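<p>The conversion rules above amount to two simple string transformations.  Here is a quick sketch in Java (a
hypothetical helper for illustration only; Replephant itself is written in Clojure):</p>

```java
// Illustrates Replephant's field-naming conventions (hypothetical helper,
// not Replephant's actual code).
public class FieldNaming {

    // Job history fields and counters are lowercased and turned into
    // Lisp-style keywords: HDFS_BYTES_WRITTEN becomes :hdfs-bytes-written
    static String historyFieldToKeyword(String field) {
        return ":" + field.toLowerCase().replace('_', '-');
    }

    // Job configuration parameters are converted into keywords as-is:
    // mapred.input.dir becomes :mapred.input.dir
    static String jobParameterToKeyword(String param) {
        return ":" + param;
    }
}
```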

<p>Here is a (shortened) example of a job data structure read from Hadoop log files:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="p">{</span>
</span><span class="line"> <span class="ss">:dfs.access.time.precision</span> <span class="mi">3600000</span>,    <span class="c1">; &lt;&lt;&lt; a job configuration data field</span>
</span><span class="line"> <span class="ss">:dfs.block.access.token.enable</span> <span class="nv">false</span>,
</span><span class="line"> <span class="c1">; *** SNIP ***</span>
</span><span class="line"> <span class="ss">:hdfs-bytes-read</span> <span class="mi">69815515804</span>,          <span class="c1">; &lt;&lt;&lt; a job history data field</span>
</span><span class="line"> <span class="ss">:hdfs-bytes-written</span> <span class="mi">848734873</span>,
</span><span class="line"> <span class="c1">; *** SNIP ***</span>
</span><span class="line"> <span class="ss">:io.sort.mb</span> <span class="mi">200</span>,
</span><span class="line"> <span class="ss">:job-priority</span> <span class="s">&quot;NORMAL&quot;</span>,
</span><span class="line"> <span class="ss">:job-queue</span> <span class="s">&quot;default&quot;</span>,
</span><span class="line"> <span class="ss">:job-status</span> <span class="s">&quot;SUCCESS&quot;</span>,
</span><span class="line"> <span class="ss">:jobid</span> <span class="s">&quot;job_201206011051_137865&quot;</span>,
</span><span class="line"> <span class="ss">:jobname</span> <span class="s">&quot;Facebook Social Graph analysis&quot;</span>,
</span><span class="line"> <span class="c1">; *** SNIP ***</span>
</span><span class="line"> <span class="ss">:user.name</span> <span class="s">&quot;miguno&quot;</span>
</span><span class="line"><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Here are some usage analysis examples:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
<span class="line-number">20</span>
<span class="line-number">21</span>
<span class="line-number">22</span>
<span class="line-number">23</span>
<span class="line-number">24</span>
<span class="line-number">25</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">; Consumption of computation resources: which Hadoop users account for most of the tasks launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-user</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="s">&quot;miguno&quot;</span> <span class="mi">2329</span>, <span class="s">&quot;alice&quot;</span> <span class="mi">2208</span>, <span class="s">&quot;carl&quot;</span> <span class="mi">1440</span>, <span class="s">&quot;daniel&quot;</span> <span class="mi">19</span>, <span class="s">&quot;bob&quot;</span> <span class="mi">2</span>, <span class="s">&quot;jim&quot;</span> <span class="mi">2</span><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Consumption of computation resources: which Hadoop users account for most of the jobs launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">jobs-by-user</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="s">&quot;daniel&quot;</span> <span class="mi">3</span>, <span class="s">&quot;alice&quot;</span> <span class="mi">3</span>, <span class="s">&quot;carl&quot;</span> <span class="mi">2</span>, <span class="s">&quot;miguno&quot;</span> <span class="mi">2</span>, <span class="s">&quot;bob&quot;</span> <span class="mi">1</span>, <span class="s">&quot;jim&quot;</span> <span class="mi">1</span><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Consumption of computation resources: which MapReduce tools account for most of the tasks launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-tool</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="ss">:hive</span> <span class="mi">2329</span>, <span class="ss">:other</span> <span class="mi">1440</span>, <span class="ss">:streaming</span> <span class="mi">1778</span>, <span class="ss">:mahout</span> <span class="mi">432</span>, <span class="ss">:pig</span> <span class="mi">21</span><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Consumption of computation resources: which MapReduce tools account for most of the jobs launched?</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">jobs-by-tool</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="ss">:pig</span> <span class="mi">4</span>, <span class="ss">:other</span> <span class="mi">2</span>, <span class="ss">:mahout</span> <span class="mi">2</span>, <span class="ss">:streaming</span> <span class="mi">2</span>, <span class="ss">:hive</span> <span class="mi">2</span><span class="p">}</span>
</span><span class="line">
</span><span class="line"><span class="c1">; Find jobs that violate data locality -- those are candidates for optimization and tuning.</span>
</span><span class="line"><span class="c1">;</span>
</span><span class="line"><span class="c1">; The example below is pretty basic.  It retrieves all jobs that have 1+ rack-local tasks,</span>
</span><span class="line"><span class="c1">; i.e. tasks where data needs to be transferred over the network (but at least they are from</span>
</span><span class="line"><span class="c1">; the same rack).</span>
</span><span class="line"><span class="c1">; A slightly improved version would also include jobs where data was retrieved from OTHER racks</span>
</span><span class="line"><span class="c1">; during map tasks, which in pseudo-code is (- all-maps rack-local-maps data-local-maps).</span>
</span><span class="line"><span class="c1">;</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">optimization-candidates</span> <span class="p">(</span><span class="nb">filter </span><span class="o">#</span><span class="p">(</span><span class="nb">&gt; </span><span class="p">(</span><span class="ss">:rack-local-maps</span> <span class="nv">%</span> <span class="mi">0</span><span class="p">)</span> <span class="mi">0</span><span class="p">)</span> <span class="nv">jobs</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The following examples demonstrate the predicates built into Replephant:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">; Restrict your analysis to a specific subset of all jobs according to one or more predicates</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">hive-jobs</span> <span class="p">(</span><span class="nb">filter </span><span class="nv">hive?</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs-with-compressed-output</span> <span class="p">(</span><span class="nb">filter </span><span class="nv">compressed-output?</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">failed-jobs</span> <span class="p">(</span><span class="nb">filter </span><span class="nv">failed?</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="c1">; Detect write-only jobs and jobs for which Replephant cannot yet extract input data information.</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs-with-missing-input</span> <span class="p">(</span><span class="nb">filter </span><span class="nv">missing-input-data?</span> <span class="nv">jobs</span><span class="p">))</span>
</span><span class="line"><span class="c1">; Helpful to complete your data set definitions</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs-with-unknown-input</span> <span class="p">(</span><span class="nb">filter </span><span class="p">(</span><span class="nb">partial </span><span class="nv">unknown-input-data?</span> <span class="nv">data-sets</span><span class="p">)</span> <span class="nv">jobs</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>In addition to the data derived from Hadoop log files Replephant also adds some
<a href="http://clojure.org/metadata">Clojure metadata</a> to each job data structure.  At the moment only a <code>:job-id</code> field is
available.  This helps to identify problematic job log files (e.g. those Replephant fails to parse) because Replephant
can at least tell you the job id, which you can then use to find the respective raw log files on disk.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">job</span> <span class="nv">...</span><span class="p">)</span> <span class="c1">;</span>
</span><span class="line"><span class="p">(</span><span class="nb">meta </span><span class="nv">job</span><span class="p">)</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="ss">:job-id</span> <span class="s">&quot;job_201206011051_137865&quot;</span><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Note that even though this metadata follows the same naming conventions as the actual job data it is still metadata and
as such you must access it via <code>(meta ...)</code>.  Accessing the job data structure directly – without <code>meta</code> – only
provides you with the log-derived data.</p>

<h2 id="data-sets">Data sets</h2>

<p><em>You only need to define data sets if you use any of Replephant’s data set related functions such as</em>
<em><code>tasks-by-data-sets</code>.  Otherwise you can safely omit this step.</em></p>

<p>Data sets are used to describe the, well, data sets that are stored in a Hadoop cluster.  They allow you to define,
for example, that the Twitter Firehose data is stored in <em>this</em> particular location in the cluster.  Replephant can then
leverage this information to perform usage analysis related to these data sets; for instance, to answer questions such
as “How many Hadoop jobs were launched against the Twitter Firehose data in our cluster?”.</p>

<p>Thanks to Clojure’s <a href="http://en.wikipedia.org/wiki/Homoiconicity">homoiconicity</a> it is very straightforward to define
data sets so that Replephant can understand which jobs read which data in your Hadoop cluster.  You only need to create
an associative data structure that maps the name of a data set to a regex pattern that is matched against a job’s
input directories (more correctly, input URIs) as configured via <code>mapred.input.dir</code> and <code>mapred.input.dir.mappers</code>.
You then pass this data structure to the appropriate Replephant function.</p>

<p><strong>Important note:</strong> In order to simplify data set definitions Replephant will automatically extract the path component
of input URIs, i.e. it will remove scheme and authority information from <code>mapred.input.dir</code> and
<code>mapred.input.dir.mappers</code> values.  This means you should write regexes that match against strings such as
<code>/path/to/foo/</code> instead of <code>hdfs:///path/to/foo/</code> or <code>hdfs://namenode.your.datacenter/path/to/foo/</code>.</p>
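<p>This normalization behaves like taking the path component of a <code>java.net.URI</code>.  The following sketch
(illustration only; Replephant’s actual implementation may differ) shows what your regexes will be matched against:</p>

```java
import java.net.URI;

// Mirrors how scheme and authority are stripped from input URIs so that
// data set regexes only ever see the path component (illustration only).
public class PathComponent {
    static String pathComponent(String inputDir) {
        return URI.create(inputDir).getPath();
    }
}
```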

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">data-sets</span>
</span><span class="line">  <span class="p">{</span>
</span><span class="line">   <span class="c1">; Will match e.g. &quot;hdfs://namenode/twitter/firehose/*&quot;, &quot;/twitter/firehose&quot;</span>
</span><span class="line">   <span class="c1">; and &quot;/twitter/firehose/*&quot;; see note above</span>
</span><span class="line">   <span class="s">&quot;Twitter Firehose data&quot;</span> <span class="o">#</span><span class="s">&quot;^/twitter/firehose/?&quot;</span>
</span><span class="line">   <span class="p">})</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Here is another example:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">; Consumption of computation resources: which data sets account for most of the tasks launched?</span>
</span><span class="line"><span class="c1">; (data sets are defined in a simple associative data structure; see section &quot;Data sets&quot; below)</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">data-sets</span> <span class="p">{</span><span class="s">&quot;Twitter Firehose data&quot;</span> <span class="o">#</span><span class="s">&quot;^/twitter/firehose/?&quot;</span>, <span class="s">&quot;Facebook Social Graph&quot;</span> <span class="o">#</span><span class="s">&quot;^/facebook/social-graph/?&quot;</span><span class="p">})</span>
</span><span class="line"><span class="p">(</span><span class="nb">println </span><span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-data-set</span> <span class="nv">jobs</span> <span class="nv">data-sets</span><span class="p">)))</span>
</span><span class="line"><span class="nv">=&gt;</span> <span class="p">{</span><span class="s">&quot;Facebook Social Graph data&quot;</span> <span class="mi">2329</span>, <span class="s">&quot;UNKNOWN DATA SET&quot;</span> <span class="mi">1872</span>, <span class="s">&quot;Twitter Firehose data&quot;</span> <span class="mi">1799</span><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Replephant uses native <a href="http://clojure.org/other_functions">Clojure regex patterns</a>, which means you have the full
power of <a href="http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">java.util.regex.Pattern</a> at your
disposal.</p>

<p><em>How Replephant matches job input with data set definitions:</em>
Replephant will consider a MapReduce job to be reading a given data set if ANY of the job’s input URIs match the
respective regex of the data set.  In Hadoop the values of <code>mapred.input.dir</code> and <code>mapred.input.dir.mappers</code> may
be a single URI or a comma-separated list of URIs; in the latter case Replephant will automatically explode the
comma-separated string into a Clojure collection of individual URIs so that you don’t have to write complicated regexes
to handle multiple input URIs in your own code (the regex is matched against the individual URIs, one at a time).</p>
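<p>This matching rule can be sketched in a few lines of Python (an illustration of the logic only, with a hypothetical helper name; it is not Replephant’s implementation):</p>

```python
import re

def matching_data_sets(input_dir, data_sets):
    # Explode a comma-separated mapred.input.dir value into individual URIs,
    # then report each data set whose regex matches ANY of them.
    uris = input_dir.split(",")
    return [name for name, pattern in data_sets.items()
            if any(re.search(pattern, uri) for uri in uris)]

data_sets = {"Twitter Firehose data": r"^/twitter/firehose/?"}
# A job reading two input directories, one of which holds Firehose data:
print(matching_data_sets("/logs/raw,/twitter/firehose/2013/05", data_sets))
# -> ['Twitter Firehose data']
```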

<p><em>Analyzing multiple cluster environments:</em>
If you are running, say, a production cluster and a test cluster that host different data sets (or at different
locations), it is convenient to create separate data set definitions such as <code>(def production-data-sets { ... })</code> and
<code>(def test-data-sets { ... })</code>.</p>

<p>See <a href="https://github.com/miguno/replephant/blob/master/src/replephant/data_sets.clj">data_sets.clj</a> for further
information and for an example definition of multiple data sets.</p>

<h2 id="visualization">Visualization</h2>

<p>Replephant itself does not implement any native visualization features.  However, you can leverage existing data
visualization tools such as <a href="http://www.r-project.org/">R</a> or <a href="https://github.com/liebke/incanter">Incanter</a> (the latter
is basically a clone of R written in Clojure).</p>

<p>For your convenience, Incanter has been added as a dependency of Replephant, which is a fancy way of saying that you can
use Incanter from Replephant’s REPL right out of the box.  Here is an example Incanter visualization of cluster usage
reported by <code>tasks-by-user</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="clojure"><span class="line"><span class="c1">;; Create a bar chart using Incanter</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">jobs</span> <span class="p">(</span><span class="nf">load-jobs</span> <span class="nv">...</span><span class="p">))</span>
</span><span class="line"><span class="p">(</span><span class="k">def </span><span class="nv">u-&gt;t</span> <span class="p">(</span><span class="nf">utils/sort-by-value-desc</span> <span class="p">(</span><span class="nf">tasks-by-user</span> <span class="nv">jobs</span><span class="p">)))</span>
</span><span class="line"><span class="p">(</span><span class="nf">use</span> <span class="o">&#39;</span><span class="p">(</span><span class="nf">incanter</span> <span class="nv">core</span> <span class="nv">charts</span><span class="p">))</span>
</span><span class="line"><span class="p">(</span><span class="nf">view</span> <span class="p">(</span><span class="nf">bar-chart</span>
</span><span class="line">       <span class="p">(</span><span class="nb">keys </span><span class="nv">u-&gt;t</span><span class="p">)</span>
</span><span class="line">       <span class="p">(</span><span class="nb">vals </span><span class="nv">u-&gt;t</span><span class="p">)</span>
</span><span class="line">       <span class="ss">:title</span> <span class="s">&quot;Computation resources consumed by user&quot;</span>
</span><span class="line">       <span class="ss">:x-label</span> <span class="s">&quot;Users&quot;</span>
</span><span class="line">       <span class="ss">:y-label</span> <span class="s">&quot;Tasks launched&quot;</span><span class="p">))</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><em>Note: This specific example requires a window system such as X11.  In other words it will not work in a text terminal.</em></p>

<p>This produces the following chart:</p>

<p><img src="http://www.michael-noll.com/blog/uploads/replephant-incanter-tasks-by-user.png" title="Visualizing cluster usage reports in Replephant with Incanter" /></p>

<div class="caption">
Figure 1: Visualizing cluster usage reports in Replephant with Incanter
</div>

<h1 id="how-it-works">How it works</h1>

<p>In a nutshell Replephant reads the data in Hadoop job configuration files and job history files into a “job” data
structure, which can then be used for subsequent cluster usage analyses.</p>

<p>Background: Hadoop creates a pair of files for each MapReduce job that is executed in a cluster:</p>

<ul>
  <li>A <strong>job configuration file</strong>, which contains job-related data created at the time when the job was submitted to the
cluster.  For instance, the location of the job’s input data is specified in this file via the parameter
<code>mapred.input.dir</code>.
    <ul>
      <li>Format: XML</li>
      <li>Example filename: <code>job_201206222102_0003_conf.xml</code> for a job with ID <code>job_201206222102_0003</code></li>
    </ul>
  </li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="xml"><span class="line"><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;no&quot;?&gt;</span><span class="nt">&lt;configuration&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>io.bytes.per.checksum<span class="nt">&lt;/name&gt;&lt;value&gt;</span>512<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>mapred.input.dir<span class="nt">&lt;/name&gt;&lt;value&gt;</span>hdfs://namenode/facebook/social-graph/2012/06/22/<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>mapred.job.name<span class="nt">&lt;/name&gt;&lt;value&gt;</span>Facebook Social Graph analysis<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>mapred.task.profile.reduces<span class="nt">&lt;/name&gt;&lt;value&gt;</span>0-2<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line"><span class="nt">&lt;property&gt;&lt;name&gt;</span>mapred.reduce.tasks.speculative.execution<span class="nt">&lt;/name&gt;&lt;value&gt;</span>true<span class="nt">&lt;/value&gt;&lt;/property&gt;</span>
</span><span class="line">...
</span><span class="line"><span class="nt">&lt;/configuration&gt;</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<ul>
  <li>An accompanying <strong>job history file</strong>, which captures run-time information on how the job was actually executed in the
cluster.  For instance, Hadoop stores a job’s run-time counters such as <code>HDFS_BYTES_WRITTEN</code> (a built-in counter of
Hadoop which, as a side note, is also shown in the JobTracker web UI when looking at running or completed jobs) as
well as application-level custom counters (provided by user code).
    <ul>
      <li>Format: Custom plain-text encoded format for Hadoop 1.x and 0.20.x, described in the
<a href="http://hadoop.apache.org/docs/r1.1.2/api/org/apache/hadoop/mapred/JobHistory.html">JobHistory</a> class</li>
      <li>Example filename: <code>job_201206222102_0003_1340394471252_miguno_Job2045189006031602801</code></li>
    </ul>
  </li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
</pre></td><td class="code"><pre><code class="xml"><span class="line">Meta VERSION=&quot;1&quot; .
</span><span class="line">Job JOBID=&quot;job_201206011051_137865&quot; JOBNAME=&quot;Facebook Social Graph analysis&quot; USER=&quot;miguno&quot; SUBMIT_TIME=&quot;1367518567144&quot; JOBCONF=&quot;hdfs://namenode/app/hadoop/staging/miguno/\.staging/job_201206011051_137865/job\.xml&quot; VIEW_JOB=&quot; &quot; MODIFY_JOB=&quot; &quot; JOB_QUEUE=&quot;default&quot; .
</span><span class="line">Job JOBID=&quot;job_201206011051_137865&quot; JOB_PRIORITY=&quot;NORMAL&quot; .
</span><span class="line">Job JOBID=&quot;job_201206011051_137865&quot; LAUNCH_TIME=&quot;1367518571729&quot; TOTAL_MAPS=&quot;2316&quot; TOTAL_REDUCES=&quot;12&quot; JOB_STATUS=&quot;PREP&quot; .
</span><span class="line">Task TASKID=&quot;task_201206011051_137865_r_000013&quot; TASK_TYPE=&quot;SETUP&quot; START_TIME=&quot;1367518572156&quot; SPLITS=&quot;&quot; .
</span><span class="line">ReduceAttempt TASK_TYPE=&quot;SETUP&quot; TASKID=&quot;task_201206011051_137865_r_000013&quot; TASK_ATTEMPT_ID=&quot;attempt_201206011051_137865_r_000013_0&quot; START_TIME=&quot;1367518575026&quot; TRACKER_NAME=&quot;slave406:localhost/127\.0\.0\.1:56910&quot; HTTP_PORT=&quot;50060&quot; .
</span><span class="line">...
</span></code></pre></td></tr></table></div></figure></notextile></div>
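<p>The line-oriented structure of this format — an entity type followed by <code>KEY="value"</code> pairs — can be picked apart with a short Python sketch.  (Replephant delegates the real parsing to Hadoop’s <code>DefaultJobHistoryParser</code>; the simplified regex below ignores the format’s backslash escaping.)</p>

```python
import re

def parse_history_line(line):
    # Split one history line into its entity type (Job, Task, ReduceAttempt, ...)
    # and a dict of its KEY="value" pairs.  Simplified: real values may contain
    # backslash-escaped quotes, which this regex does not handle.
    entity, _, rest = line.partition(" ")
    return entity, dict(re.findall(r'(\w+)="([^"]*)"', rest))

entity, fields = parse_history_line(
    'Job JOBID="job_201206011051_137865" JOB_PRIORITY="NORMAL" .')
print(entity, fields)
# -> Job {'JOBID': 'job_201206011051_137865', 'JOB_PRIORITY': 'NORMAL'}
```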

<p>Depending on your Hadoop version and cluster configuration, Hadoop will store those files in directory trees rooted at
<code>mapred.job.tracker.history.completed.location</code> and/or <code>hadoop.job.history.location</code>.</p>

<p>Replephant uses standard XML parsing to read the job configuration files, and relies on the Hadoop 1.x Java API to parse
the job history files via <code>DefaultJobHistoryParser</code>. <strong>At the moment Replephant retrieves only the history data
related to job start, job finish, or job failure (e.g. task attempt data is not retrieved).</strong>
For each job Replephant creates a single associative data structure that contains both the job configuration as well as
the job history data in a Clojure-friendly format.  This job data structure forms the basis for all subsequent cluster
usage analyses as we have seen in the previous section.</p>
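<p>The configuration half of that job data structure is conceptually just the XML file flattened into a map of parameter names to values.  Here is a minimal Python sketch of that step (Replephant builds an analogous Clojure map):</p>

```python
import xml.etree.ElementTree as ET

def parse_job_conf(xml_string):
    # Flatten a Hadoop job configuration file into a map of
    # parameter names to values.
    root = ET.fromstring(xml_string)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.findall("property")}

conf_xml = """<?xml version="1.0" encoding="UTF-8" standalone="no"?><configuration>
<property><name>mapred.input.dir</name><value>hdfs://namenode/facebook/social-graph/2012/06/22/</value></property>
<property><name>mapred.job.name</name><value>Facebook Social Graph analysis</value></property>
</configuration>"""

print(parse_job_conf(conf_xml)["mapred.job.name"])
# -> Facebook Social Graph analysis
```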

<h1 id="summary">Summary</h1>

<p>Replephant is a work in progress but already a pretty valuable addition to our Hadoop toolset.  If you want to give it
a try, head over to the <a href="https://github.com/miguno/replephant">Replephant project homepage</a> and play with it!</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Using Avro in MapReduce jobs with Hadoop, Pig, Hive]]></title>
    <link href="http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-07-04T08:29:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive</id>
    <content type="html"><![CDATA[<p><a href="http://avro.apache.org/">Apache Avro</a> is a very popular data serialization format in the Hadoop technology stack.
In this article I show code examples of MapReduce jobs in Java, Hadoop Streaming, Pig and Hive that read and/or write
data in Avro format.  We will use a small, Twitter-like data set as input for our example MapReduce jobs.</p>

<!-- more -->

<div class="note">
  <strong>
    The latest version of this article and the corresponding code examples are available at
    <a href="https://github.com/miguno/avro-hadoop-starter">avro-hadoop-starter</a> on GitHub.
  </strong>
</div>

<h1 id="requirements">Requirements</h1>

<p>The examples require the following software versions:</p>

<ul>
  <li><a href="http://www.gradle.org/">Gradle</a> 1.3+ (only for the Java examples)</li>
  <li>Java JDK 7 (only for the Java examples)
    <ul>
      <li>It is easy to switch to JDK 6.  Mostly you will need to change the <code>sourceCompatibility</code> and
<code>targetCompatibility</code> parameters in
<a href="https://github.com/miguno/avro-hadoop-starter/blob/master/build.gradle">build.gradle</a> from <code>1.7</code> to <code>1.6</code>.
But since there are a couple of JDK 7 related gotchas (e.g. problems with its new bytecode verifier) that the Java
example code solves, I decided to stick with JDK 7 as the default.</li>
    </ul>
  </li>
  <li><a href="http://hadoop.apache.org/">Hadoop</a> 2.x with MRv1 (not MRv2/YARN)
    <ul>
      <li>Tested with <a href="http://www.cloudera.com/content/cloudera/en/products/cdh.html">Cloudera CDH 4.3</a></li>
    </ul>
  </li>
  <li><a href="http://pig.apache.org/">Pig</a> 0.11
    <ul>
      <li>Tested with Pig 0.11.0-cdh4.3.0</li>
    </ul>
  </li>
  <li><a href="http://hive.apache.org/">Hive</a> 0.10
    <ul>
      <li>Tested with Hive 0.10.0-cdh4.3.0</li>
    </ul>
  </li>
  <li><a href="http://avro.apache.org/">Avro</a> 1.7.4</li>
</ul>

<h1 id="prerequisites">Prerequisites</h1>

<p>First you must clone my <a href="https://github.com/miguno/avro-hadoop-starter">avro-hadoop-starter</a> repository on GitHub.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>git clone git@github.com:miguno/avro-hadoop-starter.git
</span><span class="line"><span class="nv">$ </span><span class="nb">cd </span>avro-hadoop-starter
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="example-data">Example data</h1>

<p>We are using a small, Twitter-like data set as input for our example MapReduce jobs.</p>

<h2 id="avro-schema">Avro schema</h2>

<p><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/main/resources/avro/twitter.avsc">twitter.avsc</a> defines
a basic schema for storing tweets:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
</pre></td><td class="code"><pre><code class="json"><span class="line"><span class="p">{</span>
</span><span class="line">  <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;record&quot;</span><span class="p">,</span>
</span><span class="line">  <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;Tweet&quot;</span><span class="p">,</span>
</span><span class="line">  <span class="nt">&quot;namespace&quot;</span> <span class="p">:</span> <span class="s2">&quot;com.miguno.avro&quot;</span><span class="p">,</span>
</span><span class="line">  <span class="nt">&quot;fields&quot;</span> <span class="p">:</span> <span class="p">[</span> <span class="p">{</span>
</span><span class="line">    <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;username&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;string&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;doc&quot;</span>  <span class="p">:</span> <span class="s2">&quot;Name of the user account on Twitter.com&quot;</span>
</span><span class="line">  <span class="p">},</span> <span class="p">{</span>
</span><span class="line">    <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;tweet&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;string&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;doc&quot;</span>  <span class="p">:</span> <span class="s2">&quot;The content of the user&#39;s Twitter message&quot;</span>
</span><span class="line">  <span class="p">},</span> <span class="p">{</span>
</span><span class="line">    <span class="nt">&quot;name&quot;</span> <span class="p">:</span> <span class="s2">&quot;timestamp&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;type&quot;</span> <span class="p">:</span> <span class="s2">&quot;long&quot;</span><span class="p">,</span>
</span><span class="line">    <span class="nt">&quot;doc&quot;</span>  <span class="p">:</span> <span class="s2">&quot;Unix epoch time in seconds&quot;</span>
</span><span class="line">  <span class="p">}</span> <span class="p">],</span>
</span><span class="line">  <span class="nt">&quot;doc:&quot;</span> <span class="p">:</span> <span class="s2">&quot;A basic schema for storing Twitter messages&quot;</span>
</span><span class="line"><span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>If you want to generate Java classes from this Avro schema follow the instructions described in section
<em>Java &gt; Usage</em>.  Alternatively, you can use the Avro Compiler directly.</p>

<h2 id="avro-data-files">Avro data files</h2>

<p>The actual data is stored in the following files:</p>

<ul>
  <li><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/resources/avro/twitter.avro">twitter.avro</a>
– encoded (serialized) version of the example data in binary Avro format, compressed with Snappy</li>
  <li><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/resources/avro/twitter.json">twitter.json</a>
– JSON representation of the same example data</li>
</ul>

<p>You can convert back and forth between the two encodings (Avro vs. JSON) using Avro Tools.  See
<a href="http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/">Reading and Writing Avro Files From the Command Line</a>
for instructions on how to do that.</p>

<p>Here is a snippet of the example data:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="json"><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;miguno&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;Rock: Nerf paper, scissors is fine.&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366150681</span> <span class="p">}</span>
</span><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;BlizzardCS&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;Works as intended.  Terran is IMBA.&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366154481</span> <span class="p">}</span>
</span><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;DarkTemplar&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;From the shadows I come!&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366154681</span> <span class="p">}</span>
</span><span class="line"><span class="p">{</span><span class="nt">&quot;username&quot;</span><span class="p">:</span><span class="s2">&quot;VoidRay&quot;</span><span class="p">,</span><span class="nt">&quot;tweet&quot;</span><span class="p">:</span><span class="s2">&quot;Prismatic core online!&quot;</span><span class="p">,</span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="mi">1366160000</span> <span class="p">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="preparing-the-input-data">Preparing the input data</h2>

<p>The example input data we are using is
<a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/resources/avro/twitter.avro">twitter.avro</a>.
Upload <code>twitter.avro</code> to HDFS to make the input data available to our MapReduce jobs.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Upload the input data</span>
</span><span class="line"><span class="nv">$ </span>hadoop fs -mkdir examples/input
</span><span class="line"><span class="nv">$ </span>hadoop fs -copyFromLocal src/test/resources/avro/twitter.avro examples/input
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>We will also upload the Avro schema
<a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/main/resources/avro/twitter.avsc">twitter.avsc</a>
to HDFS because we will use a schema available at an HDFS location in one of the Hive examples.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Upload the Avro schema</span>
</span><span class="line"><span class="nv">$ </span>hadoop fs -mkdir examples/schema
</span><span class="line"><span class="nv">$ </span>hadoop fs -copyFromLocal src/main/resources/avro/twitter.avsc examples/schema
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="java">Java</h1>

<h2 id="usage">Usage</h2>

<p>To prepare your Java IDE:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># IntelliJ IDEA</span>
</span><span class="line"><span class="nv">$ </span>gradle cleanIdea idea   <span class="c"># then File &gt; Open... &gt; avro-hadoop-starter.ipr</span>
</span><span class="line">
</span><span class="line"><span class="c"># Eclipse</span>
</span><span class="line"><span class="nv">$ </span>gradle cleanEclipse eclipse
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>To build the Java code and to compile the Avro-based Java classes from the schemas (<code>*.avsc</code>) in
<code>src/main/resources/avro/</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>gradle clean build
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>The generated Avro-based Java classes are written under the directory tree <code>generated-sources/</code>.  The Avro
compiler will generate a Java class <code>Tweet</code> from the <code>twitter.avsc</code> schema.</p>

<p>To run the unit tests (notably <code>TweetCountTest</code>, see section <em>Examples</em> below):</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>gradle <span class="nb">test</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Note: <code>gradle test</code> executes any JUnit unit tests.  If you add any TestNG unit tests, you need to run <code>gradle testng</code>
to execute those.</p>

<h2 id="examples">Examples</h2>

<h3 id="tweetcount">TweetCount</h3>

<p><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/main/java/com/miguno/avro/hadoop/TweetCount.java">TweetCount</a>
implements a MapReduce job that counts the number of tweets created by Twitter users.</p>

<pre><code>TweetCount: Usage: TweetCount &lt;input path&gt; &lt;output path&gt;
</code></pre>

<h3 id="tweetcounttest">TweetCountTest</h3>

<p><a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/java/com/miguno/avro/hadoop/TweetCountTest.java">TweetCountTest</a>
is very similar to <code>TweetCount</code>.  It uses
<a href="https://github.com/miguno/avro-hadoop-starter/tree/master/src/test/resources/avro/twitter.avro">twitter.avro</a> as its
input and runs a unit test on it with the same MapReduce job as <code>TweetCount</code>.  The unit test includes comparing the
actual MapReduce output (in Snappy-compressed Avro format) with expected output.  <code>TweetCountTest</code> extends
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/ClusterMapReduceTestCase.java">ClusterMapReduceTestCase</a>
(MRv1), which means that the corresponding MapReduce job is launched in-memory via
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRCluster.java">MiniMRCluster</a>.</p>

<h2 id="minimrcluster-and-hadoop-mrv2">MiniMRCluster and Hadoop MRv2</h2>

<p>The MiniMRCluster that is used by <code>ClusterMapReduceTestCase</code> in MRv1 is deprecated in Hadoop MRv2.  When using MRv2
you should switch to
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRClientClusterFactory.java">MiniMRClientClusterFactory</a>,
which provides a wrapper interface called
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRClientCluster.java">MiniMRClientCluster</a>
around the
<a href="https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/MiniMRYarnCluster.java">MiniMRYarnCluster</a> (MRv2):</p>

<blockquote>
  <p>MiniMRClientClusterFactory:
A MiniMRCluster factory. In MR2, it provides a wrapper MiniMRClientCluster interface around the MiniMRYarnCluster.
While in MR1, it provides such wrapper around MiniMRCluster. This factory should be used in tests to provide an easy
migration of tests across MR1 and MR2.</p>
</blockquote>

<p>See <a href="http://blog.cloudera.com/blog/2012/07/experimenting-with-mapreduce-2-0/">Experimenting with MapReduce 2.0</a> for more
information.</p>

<h2 id="further-readings-on-java">Further readings on Java</h2>

<ul>
  <li><a href="http://avro.apache.org/docs/1.7.4/api/java/index.html?org/apache/avro/mapred/package-summary.html">Package Documentation for org.apache.avro.mapred</a>
– Run Hadoop MapReduce jobs over Avro data, with map and reduce functions written in Java.  This document provides
detailed information on how you should use the Avro Java API to implement MapReduce jobs that read and/or write data
in Avro format.</li>
  <li><a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_26_5.html">Java MapReduce and Avro</a>
– Cloudera CDH4 documentation</li>
</ul>

<h1 id="hadoop-streaming">Hadoop Streaming</h1>

<h2 id="preliminaries">Preliminaries</h2>

<p>Important: The examples below assume you have access to a running Hadoop cluster.</p>

<h2 id="how-streaming-sees-data-when-reading-via-avroastextinputformat">How Streaming sees data when reading via AvroAsTextInputFormat</h2>

<p>When using <a href="http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroAsTextInputFormat.html">AvroAsTextInputFormat</a>
as the input format, your streaming code will receive the data as JSON, one record (“datum” in Avro parlance) per
line.  Note that Avro will also add a trailing TAB (<code>\t</code>) at the end of each line.</p>

<pre><code>&lt;JSON representation of Avro record #1&gt;\t
&lt;JSON representation of Avro record #2&gt;\t
&lt;JSON representation of Avro record #3&gt;\t
...
</code></pre>

<p>Here is the basic data flow from your input data in binary Avro format to your streaming mapper:</p>

<pre><code>input.avro (binary)  ---AvroAsTextInputFormat---&gt; deserialized data (JSON) ---&gt; Mapper
</code></pre>
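<p>To make this concrete, here is a minimal Python streaming mapper sketch that consumes such input.  It assumes the three-field tweet records used throughout this article (<code>username</code>, <code>tweet</code>, <code>timestamp</code>) and emits <code>username&lt;TAB&gt;1</code> pairs in the classic word-count style; note that the trailing TAB added by Avro must be stripped before parsing:</p>

```python
import json
import sys


def parse_record(line):
    """Parse one line as delivered by AvroAsTextInputFormat: a JSON datum
    followed by a trailing TAB, which we strip before parsing."""
    return json.loads(line.rstrip("\t\n"))


def map_usernames(lines):
    """A toy map step: emit 'username<TAB>1' for each tweet record,
    mirroring the classic word-count pattern."""
    for line in lines:
        record = parse_record(line)
        yield "%s\t1" % record["username"]


if __name__ == "__main__":
    for output in map_usernames(sys.stdin):
        print(output)
```

<p>You would pass such a script to Hadoop Streaming via <code>-mapper</code> together with <code>-files</code>; the record fields shown are assumptions based on the example data in this article, not a fixed contract of the input format.</p>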

<h2 id="examples-1">Examples</h2>

<h3 id="prerequisites-1">Prerequisites</h3>

<p>The example commands below use the Hadoop Streaming jar <em>for MRv1</em> shipped with Cloudera CDH4:</p>

<ul>
  <li><a href="https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hadoop/hadoop-streaming/2.0.0-mr1-cdh4.3.0/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar">hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar</a>
(as of July 2013)</li>
</ul>

<p>If you are not using Cloudera CDH4, or are using a newer version of CDH4, just replace the jar file with the one
included in your Hadoop installation.</p>

<p>The Avro jar files are straight from the <a href="https://avro.apache.org/releases.html">Avro project</a>:</p>

<ul>
  <li><a href="http://www.eu.apache.org/dist/avro/avro-1.7.4/java/avro-1.7.4.jar">avro-1.7.4.jar</a></li>
  <li><a href="http://www.eu.apache.org/dist/avro/avro-1.7.4/java/avro-mapred-1.7.4-hadoop1.jar">avro-mapred-1.7.4-hadoop1.jar</a></li>
  <li><a href="http://www.eu.apache.org/dist/avro/avro-1.7.4/java/avro-tools-1.7.4.jar">avro-tools-1.7.4.jar</a></li>
</ul>

<h3 id="reading-avro-writing-plain-text">Reading Avro, writing plain-text</h3>

<p>The following command reads Avro data from the relative HDFS directory <code>examples/input/</code> (which normally resolves
to <code>/user/&lt;your-unix-username&gt;/examples/input/</code>).  It writes the
deserialized version of each data record (see section <em>How Streaming sees data when reading via AvroAsTextInputFormat</em>
above) as is to the output HDFS directory <code>streaming/output/</code>.  For this simple demonstration we are using
the <code>IdentityMapper</code> as a naive map step implementation – it outputs its input data unmodified (equivalently,
we could use the Unix tool <code>cat</code> here).  We do not need a reduce phase here, which is why we disable the reduce
step via the option <code>-D mapred.reduce.tasks=0</code> (see
<a href="http://hadoop.apache.org/docs/r1.1.2/streaming.html#Specifying+Map-Only+Jobs">Specifying Map-Only Jobs</a> in the
Hadoop Streaming documentation).</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># Run the streaming job</span>
</span><span class="line"><span class="nv">$ </span>hadoop jar hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar <span class="se">\</span>
</span><span class="line">    -D mapred.job.name<span class="o">=</span><span class="s2">&quot;avro-streaming&quot;</span> <span class="se">\</span>
</span><span class="line">    -D mapred.reduce.tasks<span class="o">=</span>0 <span class="se">\</span>
</span><span class="line">    -files avro-1.7.4.jar,avro-mapred-1.7.4-hadoop1.jar <span class="se">\</span>
</span><span class="line">    -libjars avro-1.7.4.jar,avro-mapred-1.7.4-hadoop1.jar <span class="se">\</span>
</span><span class="line">    -input  examples/input/ <span class="se">\</span>
</span><span class="line">    -output streaming/output/ <span class="se">\</span>
</span><span class="line">    -mapper org.apache.hadoop.mapred.lib.IdentityMapper <span class="se">\</span>
</span><span class="line">    -inputformat org.apache.avro.mapred.AvroAsTextInputFormat
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Once the job completes you can inspect the output data as follows:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop fs -cat streaming/output/part-00000 | head -4
</span><span class="line"><span class="o">{</span><span class="s2">&quot;username&quot;</span>: <span class="s2">&quot;miguno&quot;</span>, <span class="s2">&quot;tweet&quot;</span>: <span class="s2">&quot;Rock: Nerf paper, scissors is fine.&quot;</span>, <span class="s2">&quot;timestamp&quot;</span>: 1366150681<span class="o">}</span>
</span><span class="line"><span class="o">{</span><span class="s2">&quot;username&quot;</span>: <span class="s2">&quot;BlizzardCS&quot;</span>, <span class="s2">&quot;tweet&quot;</span>: <span class="s2">&quot;Works as intended.  Terran is IMBA.&quot;</span>, <span class="s2">&quot;timestamp&quot;</span>: 1366154481<span class="o">}</span>
</span><span class="line"><span class="o">{</span><span class="s2">&quot;username&quot;</span>: <span class="s2">&quot;DarkTemplar&quot;</span>, <span class="s2">&quot;tweet&quot;</span>: <span class="s2">&quot;From the shadows I come!&quot;</span>, <span class="s2">&quot;timestamp&quot;</span>: 1366154681<span class="o">}</span>
</span><span class="line"><span class="o">{</span><span class="s2">&quot;username&quot;</span>: <span class="s2">&quot;VoidRay&quot;</span>, <span class="s2">&quot;tweet&quot;</span>: <span class="s2">&quot;Prismatic core online!&quot;</span>, <span class="s2">&quot;timestamp&quot;</span>: 1366160000<span class="o">}</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Please be aware that the output data just happens to be JSON.  This is because we opted not to modify any of the input
data in our MapReduce job.  Since the input data to our MapReduce job is deserialized by Avro into JSON, the output
turns out to be JSON, too.  With a different MapReduce job you could of course write the output data in TSV or CSV
format, for instance.</p>
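<p>As a rough sketch of that idea (not part of the original examples), the following Python streaming mapper would convert each JSON-encoded record into a TSV line; the field names are assumed to match the tweet schema used throughout this article:</p>

```python
import json
import sys


def to_tsv(line):
    """Convert one JSON-encoded Avro datum (as delivered by
    AvroAsTextInputFormat, including its trailing TAB) into a TSV line."""
    record = json.loads(line.rstrip("\t\n"))
    return "%s\t%s\t%d" % (record["username"], record["tweet"], record["timestamp"])


if __name__ == "__main__":
    for line in sys.stdin:
        print(to_tsv(line))
```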

<h3 id="reading-avro-writing-avro">Reading Avro, writing Avro</h3>

<h4 id="avrotextoutputformat-implies-bytes-schema">AvroTextOutputFormat (implies “bytes” schema)</h4>

<p>To write the output in Avro format instead of plain text, use the same general options as in the previous example but
also add:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hadoop jar hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar <span class="se">\</span>
</span><span class="line">    <span class="o">[</span>...<span class="o">]</span>
</span><span class="line">    -outputformat org.apache.avro.mapred.AvroTextOutputFormat
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><a href="http://avro.apache.org/docs/1.7.4/api/java/index.html?org/apache/avro/mapred/AvroTextOutputFormat.html">AvroTextOutputFormat</a>
is the equivalent of TextOutputFormat.  It writes Avro data files with a “bytes” schema.</p>

<p>Note that using <code>IdentityMapper</code> as a naive mapper as shown in the previous example will not result in the output file
being identical to the input file.  This is because <code>AvroTextOutputFormat</code> will escape (quote) the input data it
receives.  An illustration might be worth a thousand words:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># After having used IdentityMapper as in the previous example</span>
</span><span class="line"><span class="nv">$ </span>hadoop fs -copyToLocal streaming/output/part-00000.avro .
</span><span class="line">
</span><span class="line"><span class="nv">$ </span>java -jar avro-tools-1.7.4.jar tojson part-00000.avro  | head -4
</span><span class="line"><span class="s2">&quot;{\&quot;username\&quot;: \&quot;miguno\&quot;, \&quot;tweet\&quot;: \&quot;Rock: Nerf paper, scissors is fine.\&quot;, \&quot;timestamp\&quot;: 1366150681}\t&quot;</span>
</span><span class="line"><span class="s2">&quot;{\&quot;username\&quot;: \&quot;BlizzardCS\&quot;, \&quot;tweet\&quot;: \&quot;Works as intended.  Terran is IMBA.\&quot;, \&quot;timestamp\&quot;: 1366154481}\t&quot;</span>
</span><span class="line"><span class="s2">&quot;{\&quot;username\&quot;: \&quot;DarkTemplar\&quot;, \&quot;tweet\&quot;: \&quot;From the shadows I come!\&quot;, \&quot;timestamp\&quot;: 1366154681}\t&quot;</span>
</span><span class="line"><span class="s2">&quot;{\&quot;username\&quot;: \&quot;VoidRay\&quot;, \&quot;tweet\&quot;: \&quot;Prismatic core online!\&quot;, \&quot;timestamp\&quot;: 1366160000}\t&quot;</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h4 id="custom-avro-output-schema">Custom Avro output schema</h4>

<p>This does not appear to be supported by stock Avro at the moment.  A related JIRA ticket
<a href="https://issues.apache.org/jira/browse/AVRO-1067">AVRO-1067</a>, created in April 2012, is still unresolved as of July
2013.</p>

<p>For a workaround take a look at the section <em>Avro output for Hadoop Streaming</em> at
<a href="https://github.com/tomslabs/avro-utils">avro-utils</a>, a third-party library for Avro.</p>

<h4 id="enabling-compression-of-avro-output-data-snappy-or-deflate">Enabling compression of Avro output data (Snappy or Deflate)</h4>

<p>If you want to enable compression for the Avro output data, you must add the following parameters to the streaming job:</p>

<pre><code># For compression with Snappy
-D mapred.output.compress=true -D avro.output.codec=snappy

# For compression with Deflate
-D mapred.output.compress=true -D avro.output.codec=deflate
</code></pre>

<p>Be aware that if you enable compression with <code>mapred.output.compress</code> but do NOT specify an Avro output format
(such as AvroTextOutputFormat), your cluster’s configured default compression codec will determine the final format
of the output data.  For instance, if <code>mapred.output.compression.codec</code> is set to
<code>com.hadoop.compression.lzo.LzopCodec</code> then the job’s output files would be compressed with LZO (e.g. you would
see <code>part-00000.lzo</code> output files instead of uncompressed <code>part-00000</code> files).</p>

<p>See also <a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_26_2.html">Compression and Avro</a>
in the CDH4 documentation.</p>

<h2 id="further-readings-on-hadoop-streaming">Further readings on Hadoop Streaming</h2>

<ul>
  <li><a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_26_6.html">Streaming and Avro</a>
– Cloudera CDH4 documentation</li>
</ul>

<h1 id="hive">Hive</h1>

<h2 id="preliminaries-1">Preliminaries</h2>

<p>Important: The examples below assume you have access to a running Hadoop cluster.</p>

<h2 id="examples-2">Examples</h2>

<p>In this section we demonstrate how to create a Hive table backed by Avro data, followed by running a few simple Hive
queries against that data.</p>

<h3 id="defining-a-hive-table-backed-by-avro-data">Defining a Hive table backed by Avro data</h3>

<h4 id="using-avroschemaurl-to-point-to-remote-a-avro-schema-file">Using avro.schema.url to point to remote a Avro schema file</h4>

<p>The following <code>CREATE TABLE</code> statement creates an external Hive table named <code>tweets</code> for storing Twitter messages
in a very basic data structure that consists of a username, the content of the message, and a timestamp.</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
</pre></td><td class="code"><pre><code class="sql"><span class="line"><span class="k">CREATE</span> <span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span class="n">tweets</span>
</span><span class="line">    <span class="k">COMMENT</span> <span class="ss">&quot;A table backed by Avro data with the Avro schema stored in HDFS&quot;</span>
</span><span class="line">    <span class="k">ROW</span> <span class="n">FORMAT</span> <span class="n">SERDE</span> <span class="s1">&#39;org.apache.hadoop.hive.serde2.avro.AvroSerDe&#39;</span>
</span><span class="line">    <span class="n">STORED</span> <span class="k">AS</span>
</span><span class="line">    <span class="n">INPUTFORMAT</span>  <span class="s1">&#39;org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat&#39;</span>
</span><span class="line">    <span class="n">OUTPUTFORMAT</span> <span class="s1">&#39;org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat&#39;</span>
</span><span class="line">    <span class="k">LOCATION</span> <span class="s1">&#39;/user/YOURUSER/examples/input/&#39;</span>
</span><span class="line">    <span class="n">TBLPROPERTIES</span> <span class="p">(</span>
</span><span class="line">        <span class="s1">&#39;avro.schema.url&#39;</span><span class="o">=</span><span class="s1">&#39;hdfs:///user/YOURUSER/examples/schema/twitter.avsc&#39;</span>
</span><span class="line">    <span class="p">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><em>Note: You must replace <code>YOURUSER</code> with your actual username.</em>
<em>See section Preparing the Input Data above.</em></p>

<p>The SerDe parameter <code>avro.schema.url</code> can use URI schemes such as <code>hdfs://</code>, <code>http://</code> and <code>file://</code>.  It is
<a href="https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html">recommended to use HDFS locations</a>, though:</p>

<blockquote>
  <p>[If the avro.schema.url points] to a location on HDFS […], the AvroSerde will then read the file from HDFS, which
should provide resiliency against many reads at once [which can be a problem for HTTP locations].  Note that the serde
will read this file from every mapper, so it is a good idea to turn the replication of the schema file to a high value
to provide good locality for the readers.  The schema file itself should be relatively small, so this does not add a
significant amount of overhead to the process.</p>
</blockquote>

<p>That said, if you host the schemas on a high-performance web server such as <a href="http://nginx.org/">nginx</a>, which is very
efficient at serving static files, then using HTTP locations for Avro schemas should not be a problem either.</p>

<p>If you need to point to a particular HDFS namespace, you can include the hostname and port of the NameNode in
<code>avro.schema.url</code>:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="sql"><span class="line"><span class="k">CREATE</span> <span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span class="p">[...]</span>
</span><span class="line">    <span class="n">TBLPROPERTIES</span> <span class="p">(</span>
</span><span class="line">        <span class="s1">&#39;avro.schema.url&#39;</span><span class="o">=</span><span class="s1">&#39;hdfs://namenode01:8020/path/to/twitter.avsc&#39;</span>
</span><span class="line">    <span class="p">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h4 id="using-avroschemaliteral-to-embed-an-avro-schema">Using avro.schema.literal to embed an Avro schema</h4>

<p>An alternative to setting <code>avro.schema.url</code> and using an external Avro schema is to embed the schema directly within
the <code>CREATE TABLE</code> statement:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
<span class="line-number">5</span>
<span class="line-number">6</span>
<span class="line-number">7</span>
<span class="line-number">8</span>
<span class="line-number">9</span>
<span class="line-number">10</span>
<span class="line-number">11</span>
<span class="line-number">12</span>
<span class="line-number">13</span>
<span class="line-number">14</span>
<span class="line-number">15</span>
<span class="line-number">16</span>
<span class="line-number">17</span>
<span class="line-number">18</span>
<span class="line-number">19</span>
</pre></td><td class="code"><pre><code class="sql"><span class="line"><span class="k">CREATE</span> <span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span class="n">tweets</span>
</span><span class="line">    <span class="k">COMMENT</span> <span class="ss">&quot;A table backed by Avro data with the Avro schema embedded in the CREATE TABLE statement&quot;</span>
</span><span class="line">    <span class="k">ROW</span> <span class="n">FORMAT</span> <span class="n">SERDE</span> <span class="s1">&#39;org.apache.hadoop.hive.serde2.avro.AvroSerDe&#39;</span>
</span><span class="line">    <span class="n">STORED</span> <span class="k">AS</span>
</span><span class="line">    <span class="n">INPUTFORMAT</span>  <span class="s1">&#39;org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat&#39;</span>
</span><span class="line">    <span class="n">OUTPUTFORMAT</span> <span class="s1">&#39;org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat&#39;</span>
</span><span class="line">    <span class="k">LOCATION</span> <span class="s1">&#39;/user/YOURUSER/examples/input/&#39;</span>
</span><span class="line">    <span class="n">TBLPROPERTIES</span> <span class="p">(</span>
</span><span class="line">        <span class="s1">&#39;avro.schema.literal&#39;</span><span class="o">=</span><span class="s1">&#39;{</span>
</span><span class="line"><span class="s1">            &quot;type&quot;: &quot;record&quot;,</span>
</span><span class="line"><span class="s1">            &quot;name&quot;: &quot;Tweet&quot;,</span>
</span><span class="line"><span class="s1">            &quot;namespace&quot;: &quot;com.miguno.avro&quot;,</span>
</span><span class="line"><span class="s1">            &quot;fields&quot;: [</span>
</span><span class="line"><span class="s1">                { &quot;name&quot;:&quot;username&quot;,  &quot;type&quot;:&quot;string&quot;},</span>
</span><span class="line"><span class="s1">                { &quot;name&quot;:&quot;tweet&quot;,     &quot;type&quot;:&quot;string&quot;},</span>
</span><span class="line"><span class="s1">                { &quot;name&quot;:&quot;timestamp&quot;, &quot;type&quot;:&quot;long&quot;}</span>
</span><span class="line"><span class="s1">            ]</span>
</span><span class="line"><span class="s1">        }&#39;</span>
</span><span class="line">    <span class="p">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p><em>Note: You must replace <code>YOURUSER</code> with your actual username.</em>
<em>See section Preparing the Input Data above.</em></p>

<p>Hive can also use variable substitution to embed the required Avro schema when a Hive script is run:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="sql"><span class="line"><span class="k">CREATE</span> <span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span class="n">tweets</span> <span class="p">[...]</span>
</span><span class="line">    <span class="n">TBLPROPERTIES</span> <span class="p">(</span><span class="s1">&#39;avro.schema.literal&#39;</span><span class="o">=</span><span class="s1">&#39;${hiveconf:schema}&#39;</span><span class="p">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>To execute the Hive script you would then run:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="c"># SCHEMA must be a properly escaped version of the Avro schema; i.e. carriage returns converted to \n, tabs to \t,</span>
</span><span class="line"><span class="c"># quotes escaped, and so on.</span>
</span><span class="line"><span class="nv">$ </span><span class="nb">export </span><span class="nv">SCHEMA</span><span class="o">=</span><span class="s2">&quot;...&quot;</span>
</span><span class="line"><span class="nv">$ </span>hive -hiveconf <span class="nv">schema</span><span class="o">=</span><span class="s2">&quot;${SCHEMA}&quot;</span> -f hive_script.hql
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h4 id="switching-from-avroschemaurl-to-avroschemaliteral-or-vice-versa">Switching from avro.schema.url to avro.schema.literal or vice versa</h4>

<p>If for a given Hive table you want to change how the Avro schema is specified, you must use a
<a href="https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html">workaround</a>:</p>

<blockquote>
  <p>Hive does not provide an easy way to unset or remove a property.  If you wish to switch from using url or schema to
the other, set the to-be-ignored value to none and the AvroSerde will treat it as if it were not set.</p>
</blockquote>
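<p>For example, to switch the <code>tweets</code> table from an embedded schema to a schema file in HDFS, you could set the old property to <code>none</code> and define the new one in a single statement (a sketch only; the path is hypothetical and <code>YOURUSER</code> must be replaced as before):</p>

```sql
ALTER TABLE tweets SET TBLPROPERTIES (
    'avro.schema.literal'='none',
    'avro.schema.url'='hdfs:///user/YOURUSER/examples/schema/twitter.avsc'
);
```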

<h3 id="analyzing-the-data-with-hive">Analyzing the data with Hive</h3>

<p>After you have created the Hive table <code>tweets</code> with one of the <code>CREATE TABLE</code> statements above (it does not matter which),
you can start analyzing the example data with Hive.  We will demonstrate this via the interactive Hive shell, but you
can also use a Hive script, of course.</p>

<p>First, start the Hive shell:</p>

<div class="bogus-wrapper"><notextile><figure class="code"><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="bash"><span class="line"><span class="nv">$ </span>hive
</span><span class="line">hive&gt;
</span></code></pre></td></tr></table></div></figure></notextile></div>

<p>Let us inspect how Hive interprets the Avro data with <code>DESCRIBE</code>.  You can also use <code>DESCRIBE EXTENDED</code> to see even
more details, including the Avro schema of the table.</p>

<pre><code>hive&gt; DESCRIBE tweets;
OK
username        string  from deserializer
tweet   string  from deserializer
timestamp       bigint  from deserializer
Time taken: 1.786 seconds
</code></pre>

<p>Now we can perform interactive analysis of our example data:</p>

<pre><code>hive&gt; SELECT * FROM tweets LIMIT 5;
OK
miguno        Rock: Nerf paper, scissors is fine.   1366150681
BlizzardCS    Works as intended.  Terran is IMBA.   1366154481
DarkTemplar   From the shadows I come!              1366154681
VoidRay       Prismatic core online!                1366160000
VoidRay       Fire at will, commander.              1366160010
Time taken: 0.126 seconds
</code></pre>

<p>The following query will launch a MapReduce job to compute the result:</p>

<pre><code>hive&gt; SELECT DISTINCT(username) FROM tweets;
Total MapReduce jobs = 1
Launching Job 1 out of 1
[...snip...]
MapReduce Total cumulative CPU time: 4 seconds 290 msec
Ended Job = job_201305070634_0187
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.29 sec   HDFS Read: 1887 HDFS Write: 47 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 290 msec
OK
BlizzardCS          &lt;&lt;&lt; Query results start here
DarkTemplar
Immortal
VoidRay
miguno
Time taken: 16.782 seconds
</code></pre>

<p>As you can see, Hive makes working with Avro data completely transparent once you have defined the Hive table accordingly.</p>

<h3 id="enabling-compression-of-avro-output-data">Enabling compression of Avro output data</h3>

<p>To enable compression add the following statements to your Hive script or enter them into the Hive shell:</p>

<pre><code># For compression with Snappy
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

# For compression with Deflate
SET hive.exec.compress.output=true;
SET avro.output.codec=deflate;
</code></pre>

<p>To disable compression again in the same Hive script/Hive shell:</p>

<pre><code>SET hive.exec.compress.output=false;
</code></pre>

<h2 id="further-readings-on-hive">Further readings on Hive</h2>

<ul>
  <li><a href="https://cwiki.apache.org/Hive/avroserde-working-with-avro-from-hive.html">AvroSerDe - working with Avro from Hive</a>
– Hive documentation</li>
</ul>

<h1 id="pig">Pig</h1>

<h2 id="preliminaries-2">Preliminaries</h2>

<p>Important: The examples below assume you have access to a running Hadoop cluster.</p>

<h2 id="examples-3">Examples</h2>

<h3 id="prerequisites-2">Prerequisites</h3>

<p>First, we must register the jar files required to work with Avro.  In this example I am using the jar files
shipped with CDH4; if you are not using CDH4, just adapt the paths to match your Hadoop distribution.</p>

<pre><code>REGISTER /app/cloudera/parcels/CDH/lib/pig/piggybank.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/avro-*.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/jackson-core-asl-*.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/jackson-mapper-asl-*.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/json-simple-*.jar
REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/snappy-java-*.jar
</code></pre>

<p>Note: If you also want to work with Python UDFs in PiggyBank you must also register the Jython jar file:</p>

<pre><code>REGISTER /app/cloudera/parcels/CDH/lib/pig/lib/jython-standalone-*.jar
</code></pre>

<h3 id="reading-avro">Reading Avro</h3>

<p>To read input data in Avro format, you must use <code>AvroStorage</code>.  The following statements show various ways to load
Avro data.</p>

<pre><code>-- Easiest case: when the input data contains an embedded Avro schema (our example input data does).
-- Note that all the files under the directory should have the same schema.
records = LOAD 'examples/input/' USING org.apache.pig.piggybank.storage.avro.AvroStorage();

--
-- The next commands show how to specify the data schema manually
--

-- Using external schema file (stored on HDFS), relative path
records = LOAD 'examples/input/'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check',
               'schema_file', 'examples/schema/twitter.avsc');

-- Using external schema file (stored on HDFS), absolute path
records = LOAD 'examples/input/'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'no_schema_check',
            'schema_file', 'hdfs:///user/YOURUSERNAME/examples/schema/twitter.avsc');

-- Using external schema file (stored on HDFS), absolute path with explicit HDFS namespace
records = LOAD 'examples/input/'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            'no_schema_check',
            'schema_file', 'hdfs://namenode01:8020/user/YOURUSERNAME/examples/schema/twitter.avsc');
</code></pre>

<p><em>About “no_schema_check”:</em>
<code>AvroStorage</code> assumes that all Avro files in sub-directories of an input directory share the same schema, and by
default <code>AvroStorage</code> performs a schema check.  This process may take some time (seconds) when the input directory
contains many sub-directories and files.  You can set the option “no_schema_check” to disable this schema check.</p>

<p>See <a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">AvroStorage</a> and
<a href="https://github.com/apache/pig/blob/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java">TestAvroStorage.java</a>
for further examples.</p>

<h3 id="analyzing-the-data-with-pig">Analyzing the data with Pig</h3>

<p>The <code>records</code> relation is already in a perfectly usable format – you do not need to manually define a (Pig) schema as
you would usually do via <code>LOAD ... AS (...schema follows...)</code>.</p>

<pre><code>grunt&gt; DESCRIBE records;
records: {username: chararray,tweet: chararray,timestamp: long}
</code></pre>

<p>Let us take a first look at the contents of our input data.  Note that the output you see will vary with each
invocation due to how <a href="http://pig.apache.org/docs/r0.11.1/test.html">ILLUSTRATE</a> works.</p>

<pre><code>grunt&gt; ILLUSTRATE records;
&lt;snip&gt;
--------------------------------------------------------------------------------------------
| records     | username:chararray      | tweet:chararray            | timestamp:long      |
--------------------------------------------------------------------------------------------
|             | DarkTemplar             | I strike from the shadows! | 1366184681          |
--------------------------------------------------------------------------------------------
</code></pre>

<p>Now we can perform interactive analysis of our example data:</p>

<pre><code>grunt&gt; first_five_records = LIMIT records 5;
grunt&gt; DUMP first_five_records;   &lt;&lt;&lt; this will trigger a MapReduce job
[...snip...]
(miguno,Rock: Nerf paper, scissors is fine.,1366150681)
(VoidRay,Prismatic core online!,1366160000)
(VoidRay,Fire at will, commander.,1366160010)
(BlizzardCS,Works as intended.  Terran is IMBA.,1366154481)
(DarkTemplar,From the shadows I come!,1366154681)
</code></pre>

<p>List the (unique) names of users that created tweets:</p>

<pre><code>grunt&gt; usernames = DISTINCT (FOREACH records GENERATE username);
grunt&gt; DUMP usernames;            &lt;&lt;&lt; this will trigger a MapReduce job
[...snip...]
(miguno)
(VoidRay)
(Immortal)
(BlizzardCS)
(DarkTemplar)
</code></pre>
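<p>As a natural next step you could, for example, count the tweets per user.  The following Grunt snippet is a sketch of such a follow-up analysis – the relation names <code>tweets_by_user</code> and <code>tweet_counts</code> are my own, but the <code>GROUP</code>/<code>COUNT</code> pattern is standard Pig:</p>

```pig
grunt> tweets_by_user = GROUP records BY username;
grunt> tweet_counts = FOREACH tweets_by_user GENERATE group AS username, COUNT(records) AS num_tweets;
grunt> DUMP tweet_counts;         <<< this will trigger a MapReduce job
```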

<h3 id="writing-avro">Writing Avro</h3>

<p>To write output data in Avro format you must use <code>AvroStorage</code> – just like for reading Avro data.</p>

<p>It is strongly recommended that you specify an explicit output schema when writing Avro data.  If you don’t, Pig
will try to infer the output Avro schema from the data’s Pig schema – and this may result in undesirable schemas due
to discrepancies between the Pig and Avro data models (or problems in Pig itself).  See
<a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">AvroStorage</a> for details.</p>

<pre><code>-- Use the same output schema as an existing directory of Avro files (files should have the same schema).
-- This is helpful, for instance, when doing simple processing such as filtering the input data without modifying
-- the resulting data layout.
STORE records INTO 'pig/output/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        'no_schema_check',
        'data', 'examples/input/');

-- Use the same output schema as an existing Avro file as opposed to a directory of such files
STORE records INTO 'pig/output/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        'no_schema_check',
        'data', 'examples/input/twitter.avro');

-- Manually define an Avro schema (here, we rename 'username' to 'user' and 'tweet' to 'message')
STORE records INTO 'pig/output/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        '{
            "schema": {
                "type": "record",
                "name": "Tweet",
                "namespace": "com.miguno.avro",
                "fields": [
                    {
                        "name": "user",
                        "type": "string"
                    },
                    {
                        "name": "message",
                        "type": "string"
                    },
                    {
                        "name": "timestamp",
                        "type": "long"
                    }
                ],
                "doc:" : "A slightly modified schema for storing Twitter messages"
            }
        }');
</code></pre>

<p>If you need to store the data in two or more different ways (e.g. you want to rename fields) you must add the parameter
<a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">“index”</a> to the <code>AvroStorage</code> arguments.  Pig uses this
information as a workaround to distinguish schemas specified by different AvroStorage calls until Pig’s StoreFunc
provides access to Pig’s output schema in the backend.</p>

<pre><code>STORE records INTO 'pig/output-variant-A/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        '{
            "index": 1,
            "schema": { ... }
        }');

STORE records INTO 'pig/output-variant-B/'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        '{
            "index": 2,
            "schema": { ... }
        }');
</code></pre>

<p>See <a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">AvroStorage</a> and
<a href="https://github.com/apache/pig/blob/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java">TestAvroStorage.java</a>
for further examples.</p>

<h4 id="enabling-compression-of-avro-output-data-1">Enabling compression of Avro output data</h4>

<p>To enable compression add the following statements to your Pig script or enter them into the Pig Grunt shell:</p>

<pre><code>-- We also enable compression of map output (which should be enabled by default anyways) because some Pig jobs
-- skip the reduce phase;  this ensures that we always generate compressed job output.
SET mapred.compress.map.output true;
SET mapred.output.compress true;
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
SET avro.output.codec snappy;
</code></pre>

<p>To disable compression again in the same Pig script/Pig Grunt shell:</p>

<pre><code>SET mapred.output.compress false;
-- Optionally: disable compression of map output (normally you want to leave this enabled)
SET mapred.compress.map.output false;
</code></pre>

<h3 id="further-readings-on-pig">Further readings on Pig</h3>

<ul>
  <li><a href="https://cwiki.apache.org/confluence/display/PIG/AvroStorage">AvroStorage</a> on the Pig wiki</li>
  <li><a href="https://github.com/apache/pig/blob/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java">AvroStorage.java</a></li>
  <li><a href="https://github.com/apache/pig/blob/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java">TestAvroStorage.java</a>
– many unit test examples that demonstrate how to use <code>AvroStorage</code></li>
</ul>

<h1 id="where-to-go-from-here">Where to go from here</h1>

<p>As I said at the beginning of this article you can always find the latest version of the code examples at
<a href="https://github.com/miguno/avro-hadoop-starter">https://github.com/miguno/avro-hadoop-starter</a>.  I’d welcome any
code contributions, corrections, etc. you might have – just
<a href="https://github.com/miguno/avro-hadoop-starter/issues/new">create an issue ticket</a> or send me a pull request.</p>

<p>If you are interested in reading and writing Avro files in a shell environment – e.g. when you quickly want to
inspect a sample of MapReduce output in Avro format – please take a look at
<a href="http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/">Reading and Writing Avro Files From the Command Line</a>.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Understanding the Internal Message Buffers of Storm]]></title>
    <link href="http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/">?utm_medium=feed&amp;utm_source=feedpress.me&amp;utm_campaign=Feed%3A+miguno</link>
    <updated>2013-06-21T22:35:00+02:00</updated>
    <id>http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers</id>
    <content type="html"><![CDATA[<p>When you are optimizing the performance of your Storm topologies it helps to understand how Storm’s internal message
queues are configured and put to use.  In this short article I will explain and illustrate how Storm version 0.8/0.9
implements the intra-worker communication that happens within a worker process and its associated executor threads.</p>

<!-- more -->

<h1 id="internal-messaging-within-storm-worker-processes">Internal messaging within Storm worker processes</h1>

<div class="note">
Terminology: I will use the terms <em>message</em> and (Storm) <em>tuple</em> interchangeably in the following sections.
</div>

<p>When I say “internal messaging” I mean the messaging that happens within a worker process in Storm, i.e. communication
that is restricted to the same Storm machine/node.  For this communication Storm relies on various message
queues backed by <a href="http://lmax-exchange.github.io/disruptor/">LMAX Disruptor</a>, which is a high performance inter-thread
messaging library.</p>

<p>Note that this communication within the threads of a worker process is different from Storm’s <em>inter-worker</em>
communication, which normally happens across machines and thus over the network.  For the latter Storm uses
<a href="http://www.zeromq.org/">ZeroMQ</a> by default (in Storm 0.9 there is experimental support for <a href="http://netty.io/">Netty</a> as
the network messaging backend).  That is, ZeroMQ/Netty are used when a task in one worker process wants to send data to
a task that runs in a worker process on a different machine in the Storm cluster.</p>
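<p>As a side note, if you want to experiment with the Netty backend in Storm 0.9 you can switch the transport via <code>storm.yaml</code>.  The fragment below is a sketch – double-check the setting name against the documentation of your Storm version:</p>

```yaml
# Use Netty instead of the default ZeroMQ for inter-worker messaging (Storm 0.9+)
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
```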

<p>So for your reference:</p>

<ul>
  <li>Intra-worker communication in Storm (inter-thread on the same Storm node): LMAX Disruptor</li>
  <li>Inter-worker communication (node-to-node across the network): ZeroMQ or Netty</li>
  <li>Inter-topology communication: nothing built into Storm, you must take care of this yourself with e.g. a messaging
system such as Kafka/RabbitMQ, a database, etc.</li>
</ul>

<p>If you do not know what the differences are between Storm’s worker processes, executor threads and tasks please take a
look at
<a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">Understanding the Parallelism of a Storm Topology</a>.</p>

<h1 id="illustration">Illustration</h1>

<p>Let us start with a picture before we discuss the nitty-gritty details in the next section.</p>

<p><img src="http://www.michael-noll.com/blog/uploads/storm-internal-message-queues.png" title="Overview of Storm's internal messaging setup" /></p>

<div class="caption">
Figure 1: Overview of a worker&#8217;s internal message queues in Storm.  Queues related to a worker process are colored in
red, queues related to the worker&#8217;s various executor threads are colored in green.  For readability reasons I show only
one worker process (though normally a single Storm node runs multiple such processes) and only one executor thread
within that worker process (of which, again, there are usually many per worker process).
</div>

<h1 id="detailed-description">Detailed description</h1>

<p>Now that you have had a first glimpse of Storm’s intra-worker messaging setup we can discuss the details.</p>

<h2 id="worker-processes">Worker processes</h2>

<p>To manage its incoming and outgoing messages each worker process has a single receive thread that listens on the worker’s
TCP port (as configured via <code>supervisor.slots.ports</code>).  The parameter <code>topology.receiver.buffer.size</code> determines the
batch size that the receive thread uses to place incoming messages into the incoming queues of the worker’s executor
threads.  Similarly, each worker has a single send thread that is responsible for reading messages from the worker’s
transfer queue and sending them over the network to downstream consumers.  The size of the transfer queue is configured
via <code>topology.transfer.buffer.size</code>.</p>

<ul>
  <li>The <code>topology.receiver.buffer.size</code> is the maximum number of messages that are batched together at once for
appending to an executor’s incoming queue by the worker receive thread (which reads the messages from the network).
Setting this parameter too high may cause a lot of problems (“heartbeat thread gets starved, throughput plummets”).
The default value is 8 elements, and the value must be a power of 2 (this requirement comes indirectly from LMAX
Disruptor).</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// Example: configuring via Java API</span>
</span><span class="line"><span class="n">Config</span> <span class="n">conf</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Config</span><span class="o">();</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_RECEIVER_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">16</span><span class="o">);</span> <span class="c1">// default is 8</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<div class="note">
Note that <tt>topology.receiver.buffer.size</tt> is in contrast to the other buffer size related parameters described in this article actually not configuring the size of an LMAX Disruptor queue.  Rather it sets the size of a simple <a href="http://docs.oracle.com/javase/6/docs/api/java/util/ArrayList.html">ArrayList</a> that is used to buffer incoming messages because in this specific case the data structure does not need to be shared with other threads, i.e. it is local to the worker&#8217;s receive thread.  But because the content of this buffer is used to fill a Disruptor-backed queue (executor incoming queues) it must still be a power of 2.  See <tt>launch-receive-thread!</tt> in <a href="https://github.com/nathanmarz/storm/blob/master/storm-core/src/clj/backtype/storm/messaging/loader.clj">backtype.storm.messaging.loader</a> for details.
</div>
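<p>Because the executor queues and, indirectly, this receive buffer must all be sized as powers of 2, it can be handy to sanity-check a candidate value before deploying.  The following standalone Java snippet merely illustrates the constraint – the class and method names are my own and are not part of the Storm API:</p>

```java
// Sketch: check whether a candidate buffer size satisfies the
// power-of-2 constraint imposed (directly or indirectly) by LMAX Disruptor.
public class BufferSizeCheck {

    // A positive integer is a power of 2 iff exactly one bit is set.
    public static boolean isPowerOfTwo(int n) {
        return n > 0 && Integer.bitCount(n) == 1;
    }

    public static void main(String[] args) {
        System.out.println(isPowerOfTwo(8));     // true  (the default of 8 is valid)
        System.out.println(isPowerOfTwo(16384)); // true
        System.out.println(isPowerOfTwo(1000));  // false (use 1024 instead)
    }
}
```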

<ul>
  <li>Each element of the transfer queue configured with <code>topology.transfer.buffer.size</code> is actually a <em>list</em> of tuples.
The various executor send threads will batch outgoing tuples off their outgoing queues onto the transfer queue.  The
default value is 1024 elements.</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// Example: configuring via Java API</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_TRANSFER_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">32</span><span class="o">);</span> <span class="c1">// default is 1024</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h2 id="executors">Executors</h2>

<p>Each worker process controls one or more <em>executor threads</em>.  Each executor thread has its own <em>incoming queue</em> and
<em>outgoing queue</em>.  As described above, the worker process runs a dedicated worker receive thread that is responsible
for moving incoming messages to the appropriate incoming queue of the worker’s various executor threads.  Similarly,
each executor has its dedicated send thread that moves an executor’s outgoing messages from its outgoing queue to the
“parent” worker’s transfer queue.  The sizes of the executors’ incoming and outgoing queues are configured via
<code>topology.executor.receive.buffer.size</code> and <code>topology.executor.send.buffer.size</code>, respectively.</p>

<p>Each executor runs a single thread that handles the user logic for the spout/bolt (i.e. your application code),
and a single send thread which moves messages from the executor’s outgoing queue to the worker’s transfer queue.</p>

<ul>
  <li>The <code>topology.executor.receive.buffer.size</code> is the size of the incoming queue for an executor.  Each element of
this queue is a <em>list</em> of tuples.  Here, tuples are appended in batch.  The default value is 1024 elements, and
the value must be a power of 2 (this requirement comes from LMAX Disruptor).</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// Example: configuring via Java API</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">16384</span><span class="o">);</span> <span class="c1">// batched; default is 1024</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<ul>
  <li>The <code>topology.executor.send.buffer.size</code> is the size of the outgoing queue for an executor. Each element of this
queue will contain a <em>single</em> tuple.  The default value is 1024 elements, and the value must be a power of 2 (this
requirement comes from LMAX Disruptor).</li>
</ul>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="c1">// Example: configuring via Java API</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">16384</span><span class="o">);</span> <span class="c1">// individual tuples; default is 1024</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

<h1 id="where-to-go-from-here">Where to go from here</h1>

<h2 id="how-to-configure-storms-internal-message-buffers">How to configure Storm’s internal message buffers</h2>

<p>The various default values mentioned above are defined in
<a href="https://github.com/nathanmarz/storm/blob/master/conf/defaults.yaml">conf/defaults.yaml</a>.  You can override these values
globally in a Storm cluster’s <code>conf/storm.yaml</code>.  You can also configure these parameters per individual Storm
topology via <a href="http://nathanmarz.github.io/storm/doc/backtype/storm/Config.html">backtype.storm.Config</a> in Storm’s Java
API.</p>
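<p>For example, a cluster-wide override in <code>conf/storm.yaml</code> might look as follows.  The values simply mirror the per-topology Java example at the end of this article – they are illustrative, not recommendations:</p>

```yaml
# Cluster-wide overrides of Storm's internal message buffer sizes (conf/storm.yaml)
topology.receiver.buffer.size: 8
topology.transfer.buffer.size: 32
topology.executor.receive.buffer.size: 16384
topology.executor.send.buffer.size: 16384
```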

<h2 id="how-to-configure-storms-parallelism">How to configure Storm’s parallelism</h2>

<p>The correct configuration of Storm’s message buffers is closely tied to the workload pattern of your topology as well
as the configured <em>parallelism</em> of your topologies.  See
<a href="http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/">Understanding the Parallelism of a Storm Topology</a>
for more details about the latter.</p>

<h2 id="understand-whats-going-on-in-your-storm-topology">Understand what’s going on in your Storm topology</h2>

<p>The Storm UI is a good starting point for inspecting key metrics of your running Storm topologies.  For instance, it shows you the
so-called “capacity” of a spout/bolt.  The various metrics will help you decide whether your changes to the
buffer-related configuration parameters described in this article had a positive or negative effect on the performance
of your Storm topologies.  See
<a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/">Running a Multi-Node Storm Cluster</a> for details.</p>

<p>Apart from that you can also generate your own application metrics and track them with a tool like Graphite.
See my articles <a href="http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/">Sending Metrics From Storm to Graphite</a> and
<a href="http://www.michael-noll.com/blog/2013/06/06/installing-and-running-graphite-via-rpm-and-supervisord/">Installing and Running Graphite via RPM and Supervisord</a>
for details.  It might also be worth checking out ooyala’s
<a href="https://github.com/ooyala/metrics_storm">metrics_storm</a> project on GitHub (I haven’t used it yet).</p>

<h2 id="advice-on-performance-tuning">Advice on performance tuning</h2>

<p>Watch Nathan Marz’s talk on
<a href="http://demo.ooyala.com/player.html?width=640&amp;height=360&amp;embedCode=Q1eXg5NzpKqUUzBm5WTIb6bXuiWHrRMi&amp;videoPcode=9waHc6zKpbJKt9byfS7l4O4sn7Qn">Tuning and Productionization of Storm</a>.</p>

<p>The TL;DR version is:  Try the following settings as a starting point and see whether they improve the performance of your
Storm topology.</p>

<div class="bogus-wrapper"><notextile><figure class="code"> <div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class="line-number">1</span>
<span class="line-number">2</span>
<span class="line-number">3</span>
<span class="line-number">4</span>
</pre></td><td class="code"><pre><code class="java"><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_RECEIVER_BUFFER_SIZE</span><span class="o">,</span>             <span class="mi">8</span><span class="o">);</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_TRANSFER_BUFFER_SIZE</span><span class="o">,</span>            <span class="mi">32</span><span class="o">);</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE</span><span class="o">,</span> <span class="mi">16384</span><span class="o">);</span>
</span><span class="line"><span class="n">conf</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="n">Config</span><span class="o">.</span><span class="na">TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE</span><span class="o">,</span>    <span class="mi">16384</span><span class="o">);</span>
</span></code></pre></td></tr></table></div></figure></notextile></div>

]]></content>
  </entry>
</feed>
