Spark it Up

In this guide, we show how to use Spark and Jupyter notebooks to store, process, and visualize events in Kafka on Hopsworks. This is part two in a series of tutorials on working with streaming events on the Hopsworks platform. The examples in this guide build on the previous ones, so make sure to read part one first.

This tutorial was tested using Hopsworks version 2.2.

Store Events Permanently

We’ll start by preparing the schema, creating a Kafka topic, and downloading security credentials that we’ll need in this tutorial.

Read the Kafka Spark Integration docs
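As a sketch of what reading from Kafka and storing the events permanently can look like in a PySpark notebook: the topic name `events`, the broker address, the credential paths, and the project paths below are all illustrative placeholders, not values from this tutorial, so adjust them to match your own Hopsworks project and the credentials you downloaded above.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("StoreKafkaEvents")
         .getOrCreate())

# Read the stream from Kafka over SSL; the keystore/truststore paths point at
# the security credentials downloaded earlier (placeholders here).
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker.kafka.service.consul:9091")
      .option("subscribe", "events")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/path/to/trustStore.jks")
      .option("kafka.ssl.truststore.password", "<password>")
      .option("kafka.ssl.keystore.location", "/path/to/keyStore.jks")
      .option("kafka.ssl.keystore.password", "<password>")
      .option("kafka.ssl.key.password", "<password>")
      .load())

# Kafka delivers keys and values as binary; cast them to strings before storing.
events = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

# Persist the events permanently as Parquet files, with a checkpoint location
# so the query can resume from where it left off after a restart.
query = (events.writeStream
         .format("parquet")
         .option("path", "/Projects/<project>/Resources/events-parquet")
         .option("checkpointLocation", "/Projects/<project>/Resources/events-checkpoint")
         .start())
```

In a notebook the query runs in the background once started; call `query.awaitTermination()` if you want the cell to block while the stream runs.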

Note

For streaming queries, startingOffsets only applies when a new query is started; resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
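A minimal sketch of this behavior, reusing the assumed topic and broker from the example above (SSL options omitted for brevity): the `startingOffsets` option is honored only on the query's first run; once a checkpoint exists, restarts resume from the checkpointed offsets instead.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OffsetsDemo").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker.kafka.service.consul:9091")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")  # used only when no checkpoint exists yet
      .load())

# On the first run this reads the topic from the beginning; on later runs it
# resumes from the offsets recorded in the checkpoint directory.
query = (df.writeStream
         .format("console")
         .option("checkpointLocation", "/Projects/<project>/Resources/offsets-demo-checkpoint")
         .start())
```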

Source Code

All source code is available in the Kafka HopsWorks Examples repository on GitHub.
