
Apache Arrow to revolutionize big data analytics performance


Apache Arrow will not only improve the performance of your big data analytics engines, but it will also make system interoperability more efficient.

Modern computer hardware offers significant performance optimizations for systems that consume and process primarily analytical data, yet software applications are seldom written in a way that can take advantage of them. Apache Arrow is focused on addressing this shortcoming in modern enterprise applications by changing the way data is organized and shared between applications.


Modern processors are designed to apply a single operation across multiple data points at once. With data structured in a consumable format, a computer can easily brighten a photo or clip all of the high frequencies in an audio file simply by applying one operation to every data point the JPEG or WAV file contains. Unfortunately, application data is rarely structured in a way that allows these single instruction, multiple data (SIMD) operations to be performed effectively. This is where Apache Arrow comes in.
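As a concrete illustration, here is a minimal sketch using pyarrow, the Arrow project's Python library (the brightness values are invented for illustration). A single compute call operates on an entire column at once, and because the values sit in one contiguous buffer, the kernel underneath is free to use vectorized instructions:

    import pyarrow as pa
    import pyarrow.compute as pc

    # A column of pixel brightness values stored contiguously in memory.
    pixels = pa.array([100, 150, 200, 230], type=pa.uint8())

    # One kernel call applies the "brighten" operation to every value;
    # the contiguous layout is what lets the kernel vectorize the work.
    brightened = pc.add(pixels, 20)

    print(brightened.to_pylist())  # [120, 170, 220, 250]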


By storing information in a structured, columnar layout, Apache Arrow not only simplifies the process of caching data, but also allows effective SIMD operations to take place. The result is better performance, a nonfunctional requirement that is notoriously difficult to achieve in the world of big data analytics. "The first goal of Apache Arrow is to make each system independently faster," says Jacques Nadeau, CTO at Dremio Corp and committer to the Apache Arrow project. "That is the primary benefit to people using Arrow in the execution environment."
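To make that layout concrete, here is a minimal sketch with pyarrow; the table and column names are invented for illustration. Each column of an Arrow table lives in its own dense buffer, which is what cache-friendly scans and SIMD kernels depend on:

    import pyarrow as pa

    # Arrow stores each column in its own contiguous buffer rather than
    # interleaving fields row by row.
    table = pa.table({
        "user_id": pa.array([1, 2, 3], type=pa.int64()),
        "score": pa.array([9.5, 7.2, 8.8], type=pa.float64()),
    })

    # A column is backed by a validity bitmap (None when there are no
    # nulls) plus one dense buffer holding all of its values.
    scores = table.column("score").chunk(0)
    print(scores.buffers())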

Low-cost data interoperability

What's the other benefit for users of Apache Arrow? Interprocess and intersystem communication becomes more efficient. When other projects use the same format, the overhead of serialization and deserialization goes away, and fewer steps are required to share and consume data generated by another program. "Apache Arrow provides a common data and a common memory layer, allowing you to move between systems with low to no cost," Nadeau says. And because developers from various other Apache projects, including Drill, Hadoop, Parquet and Spark, back the effort, there will be plenty of optimization opportunities for big data tools and analytical systems.
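For example, Arrow's IPC stream format lets a producer hand a table to a consumer without a serialize/deserialize round trip, because the bytes on the wire mirror the in-memory layout. Here is a minimal sketch with pyarrow, with both sides simulated in one process for brevity:

    import pyarrow as pa
    import pyarrow.ipc as ipc

    table = pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

    # Producer: write the table as an Arrow IPC stream. No row-by-row
    # serialization happens; the buffers are written out as they are.
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    payload = sink.getvalue()

    # Consumer: map the received bytes straight back into Arrow arrays,
    # again without a deserialization step.
    received = ipc.open_stream(payload).read_all()
    assert received.equals(table)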

To learn more about Apache Arrow and how it is changing the way big data is shared, analyzed and consumed, listen to TheServerSide's podcast with Jacques Nadeau of the Apache Arrow project.

Next Steps

Big data isn't going to go mainstream, it's already there

How big data can help your enterprise

Breaking down Google's big data offerings