
Apache Spark streaming analytics engine key to integrating big data

Learn how a range of improvements to the open source Apache Spark streaming analytics engine promises to ease the development of big data applications.

The Spark streaming analytics engine, with over 800 contributors from 200 organizations, is one of the most popular open source tools for weaving big data into modern application architectures. It makes it easy to develop new algorithms that can be integrated into cloud-based applications in production.

The Apache Spark project is getting ready to release Spark 2.0, the second major version of the platform, this summer. An unstable preview release is already available, and the final release will include more than 2,000 patches. At the Spark Summit in San Francisco, Matei Zaharia, CTO and co-founder of Databricks and the original creator of Spark, gave a rundown of some of the technology and community improvements coming to the platform.

Part of the success of the Spark project has been due to its focus on the underlying architecture for crafting big data applications. This makes it well suited for implementing a lifecycle that involves:

1) Exploring and crafting big data algorithms

2) Integrating these with cloud applications

3) Optimizing these for production applications at scale

The unified engine underlying Spark makes it easy to build applications. Zaharia said that most users combine different processing models, and Spark's high-level APIs let developers build systems while the engine performs optimizations for different kinds of analytics under the covers. Broad integration is important because data is often spread across many systems.

New improvements in Spark 2.0 include structured API improvements, structured Spark streaming, machine learning model export, SQL 2003 support and Scala 2.12 support. There are also new language bindings for C# and JavaScript. The structured API improvements give the engine more information about application data and facilitate richer optimizations, and SparkSession provides a new entry point for these APIs. Structured Spark streaming is a high-level streaming API built on the structured API engine.
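
As a rough sketch of how the new entry point might look in Scala (the application name, input file and column names here are illustrative assumptions, not from the talk):

import org.apache.spark.sql.SparkSession

// SparkSession is the unified Spark 2.0 entry point, replacing the separate
// SQLContext and HiveContext from Spark 1.x.
val spark = SparkSession.builder()
  .appName("StructuredExample")  // hypothetical application name
  .master("local[*]")            // local mode, for illustration only
  .getOrCreate()

import spark.implicits._

// Reading a structured source attaches a schema, which gives the engine the
// information it needs for richer optimizations.
val people = spark.read.json("people.json")  // hypothetical input file
people.filter($"age" > 21).groupBy("city").count().show()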

The MLlib machine learning model export makes it easy to push machine learning models into production. There are also four major efforts underway to integrate deep learning systems with Apache Spark, from Baidu, Yahoo!, UC Berkeley and Databricks.
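
A minimal sketch of what pushing a model into production could look like with ML pipeline persistence in Spark 2.0; the feature columns, paths and choice of logistic regression are assumptions for illustration:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical training data with a "label" column and two feature columns.
val training = spark.read.parquet("training.parquet")

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Fit the whole pipeline, then save it so a production job can reload it
// without retraining.
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
model.write.overwrite().save("/models/lr-v1")  // hypothetical path
val production = PipelineModel.load("/models/lr-v1")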

Structured APIs for new use cases

The structured APIs are designed to give more information about data and the computation attached to it. Zaharia said, "The really cool thing about structured APIs is that the engine optimizes them in a way a traditional database would optimize a query plan."
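
To make the idea concrete, here is a sketch reusing the session created above; the input file and column names are assumptions:

// Because the API carries schema information, the optimizer can rearrange
// operators much as a database rewrites a query plan.
val events = spark.read.json("events.json")  // hypothetical input

// explain() prints the optimized plan: the filter is pushed toward the
// source rather than executed where it appears in the code.
events.select("action", "ts").filter("action = 'click'").explain()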

This makes it easier to move operations like filtering, joining and processing around within the data processing pipeline. Structured streaming can aggregate data in a stream on the server, which makes it possible to change queries at runtime and have the changes reflected in applications and new machine learning models. Structured streaming also stands apart because many streaming APIs are complex; structured Spark streaming makes it easier to adapt algorithms to manage corner cases or process data that arrives late.
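
A sketch of what such a server-side streaming aggregation might look like, again assuming the session from earlier; the schema fields and S3 location are illustrative:

import org.apache.spark.sql.types._

// Streaming file sources require an explicit schema; these fields are assumed.
val eventSchema = new StructType()
  .add("action", StringType)
  .add("time", TimestampType)

// Treat a directory of JSON files as a stream (hypothetical location).
val events = spark.readStream.schema(eventSchema).json("s3://bucket/events/")

// A running count per action; the engine maintains the aggregate
// incrementally, including when records arrive late.
val counts = events.groupBy("action").count()

val query = counts.writeStream
  .outputMode("complete")  // emit the full updated aggregate on each trigger
  .format("console")
  .start()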

The new Spark 2.0 engine also does a better job of generating optimized code at runtime. Whole-stage code generation can optimize across multiple operators at once to provide a 10-fold performance increase. The engine also gains a roughly nine-fold improvement by better leveraging caching mechanisms when generating output.
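
One way to see the fused operators is to inspect a physical plan, as in this toy sketch, which continues from the session above:

// In Spark 2.0, explain() marks operators fused by whole-stage code
// generation with an asterisk in the physical plan.
val df = spark.range(1000000)
  .filter("id % 2 = 0")
  .selectExpr("sum(id)")

df.explain()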

Take advantage of ongoing queries

Spark processes queries in batch mode. These batches can be very short, which gives the Spark engine streaming-like performance. Spark's traditional programming model executed a query over a defined window. A new feature in Spark 2.0 lets programmers use an effectively unbounded window, which supports more interactive programming constructs.

Spark 1.x had a notion of data frames instantiated to handle a finite amount of data. Spark 2.0 introduces the notion of infinite data frames, which makes it easier to implement algorithms for tasks like computing a running average. As new data arrives, the engine transforms it using an incremental execution plan. For example, a developer might have a batch job that loads data from Amazon S3 into a SQL table. As new data arrives, the job could automatically transform it and update the table. Zaharia said, "This is cool because you don't have to worry too much about things like late data since the engine will handle it for you."
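
A sketch of that batch-to-streaming symmetry, reusing eventSchema from the earlier sketch; the bucket, table name and output paths are assumptions:

// Batch version: load JSON from S3 once and save it as a SQL table.
val batch = spark.read.schema(eventSchema).json("s3://bucket/events/")
batch.write.saveAsTable("events")  // hypothetical table name

// Streaming version: the same directory viewed as an infinite data frame.
// The engine picks up new files and extends the output incrementally.
spark.readStream.schema(eventSchema).json("s3://bucket/events/")
  .writeStream
  .format("parquet")                                  // file-based sink
  .option("path", "/warehouse/events")                // hypothetical output
  .option("checkpointLocation", "/tmp/events-chkpt")  // needed for recovery
  .start()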

Sharing Spark skills

Databricks has also launched a new initiative to help developers and enterprises cultivate Spark development skills. A key part of this is the release of Databricks Community Edition, which includes the Apache Spark engine, data science tools, libraries, visualizations and debugging tools. It is built around interactive notebooks, which make it easier to share information about data models, algorithms and use cases.

Developers can use Databricks Community Edition to create private notebooks within the enterprise or public notebooks shared with the wider developer community. The platform is also being used in five massive open online courses offered in conjunction with partners that include UC Berkeley and UCLA.

How have you integrated Apache Spark into your applications? Let us know.

Next Steps:
Apache Spark shows independence
How Apache Spark architecture speeds data
How well do you know Spark?
