In this guide, we show how to use Spark and Jupyter notebooks to store, process, and visualize Kafka events on Hopsworks. This is part two in a series of tutorials on working with streaming events using the Hopsworks platform. The examples in this guide build on the previous examples, so make sure to read part one first.
This tutorial was tested using Hopsworks version 2.2.
Store Events Permanently
We’ll start by preparing the schema, creating a Kafka topic, and downloading security credentials that we’ll need in this tutorial.
Read the Kafka Spark Integration docs
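To make the wiring concrete, here is a minimal PySpark sketch of how the downloaded credentials and the topic could be plugged into Spark's Kafka source. The broker address, topic name, credential paths, and password below are placeholders for illustration; on Hopsworks they come from your project's Kafka settings and the certificates downloaded earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaIngest").getOrCreate()

# Placeholder values -- substitute your project's broker endpoint,
# topic name, and the paths of the downloaded key/trust stores.
KAFKA_BROKERS = "broker.example.com:9092"
TOPIC = "events"
KEYSTORE = "/path/to/keyStore.jks"
TRUSTSTORE = "/path/to/trustStore.jks"
STORE_PASSWORD = "changeit"

# Subscribe to the topic over TLS; options prefixed with "kafka."
# are passed straight through to the underlying Kafka consumer.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", KAFKA_BROKERS)
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.keystore.location", KEYSTORE)
      .option("kafka.ssl.keystore.password", STORE_PASSWORD)
      .option("kafka.ssl.truststore.location", TRUSTSTORE)
      .option("kafka.ssl.truststore.password", STORE_PASSWORD)
      .option("subscribe", TOPIC)
      .option("startingOffsets", "earliest")
      .load())

# Kafka records arrive as binary key/value columns; cast them to
# strings before parsing them against the schema prepared earlier.
events = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```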
Note
For streaming queries, startingOffsets applies only when a new query is started; a resumed query always picks up from where it left off. Partitions newly discovered during a query start at earliest.
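To store the events permanently, the stream can be written out with a checkpoint location, as in the sketch below, which continues from the events DataFrame above. The output and checkpoint paths are placeholders; on Hopsworks they would typically point into your project's datasets. Note how startingOffsets only matters for the first run: on restart, Spark resumes from the offsets recorded in the checkpoint.

```python
# Write the stream to Parquet files for permanent storage.
# The paths are placeholders; on Hopsworks they would typically
# live under your project's dataset directories.
query = (events.writeStream
         .format("parquet")
         .option("path", "/Projects/demo/Resources/events")
         # The checkpoint tracks consumed offsets, so a restarted query
         # resumes where it left off; startingOffsets is then ignored.
         .option("checkpointLocation", "/Projects/demo/Resources/events-checkpoint")
         .outputMode("append")
         .start())

query.awaitTermination()
```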
Source Code
All source code is available in the Kafka HopsWorks Examples repository on GitHub.