Choosing the right big data analysis tools

To choose the right big data analysis tools, it's important to understand the transactional and analytical data processing requirements of your systems.

Big data keeps getting bigger, but not every activity that uses this data is created equal. Sometimes, using data is like running a small but critical errand at the corner store. At other times, it's like taking a leisurely stroll through a warehouse and having a good hard look at the inventory. The objectives of transactional data handling and analytical data processing are quite different, and so are the technologies each one requires. To choose the right big data analysis tools for the job, it's important to understand both the big differences and the subtle nuances that separate operational data from analytical data.

Operational workloads are about getting things done right, right now

Operational, or transactional, data handling focuses on low-latency responses and on serving many concurrent requests. Some real-time analytics may be involved, but they are typically limited to a small set of variables that are relevant to the end user's immediate decisions. Such information might be displayed on a simple dashboard that allows business users to run standard or custom reports based on their own needs and experience level.

One of the most important features of a data transaction is reliability. "In a bank transfer, you need the from account and the to account to maintain transactional consistency so the money doesn't fall on the floor if something breaks in the middle. You're interested in updating a very small amount of records and making a rigid concept around transactionality," Jacques Nadeau, vice president of Apache Arrow and Apache Drill, said.
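The bank transfer Nadeau describes can be sketched in a few lines. This is a minimal illustration using Python's built-in `sqlite3` module, not a description of how any particular bank or product works; the `accounts` table, the account names and the `transfer` and `balances` helpers are all hypothetical. The point is that both updates either succeed together or are rolled back together.

```python
import sqlite3

def transfer(conn, from_id, to_id, amount):
    """Move `amount` between two accounts as one atomic unit of work."""
    # Using the connection as a context manager commits if the block
    # finishes normally and rolls back if any statement inside raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, from_id),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (from_id,)
        ).fetchone()
        if balance < 0:
            # Raising here undoes the debit above, so no money is lost.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, to_id),
        )

def balances(conn):
    """Return the current balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

transfer(conn, "alice", "bob", 30)
```

Note how little data is touched: two rows, identified by key. A failed transfer, such as one that would overdraw the sending account, leaves both balances exactly as they were.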

Analytics and the right answers

In contrast, analytics typically involves processing large volumes of data using complex queries, with an emphasis on throughput. While streaming analytics may be a feature for specific use cases, analysis for many enterprises is still focused primarily on reviewing historical data for longer-range planning and prediction. As an example, a business might want to analyze sales in the last quarter or use machine learning to see what customers buy in a given situation. In the most challenging cases, businesses may not know exactly what they are looking for -- or they may be intentionally experimenting with different ways to derive value from their existing data stores. Data scientists may be called upon to craft the right queries that deliver relevant business insights.

Julien Le Dem, principal architect at Dremio and Apache Parquet co-founder, offered a simple way of thinking about the difference: Moving data around is transactional, processing it is analytical. "You are working with a lot of records at the same time versus working with only one or a few records at one time. Analytics is about extracting the parts you are interested in very efficiently and producing results based on that data."
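Le Dem's distinction between one record and many records can be made concrete with a small sketch. The example below again assumes Python's built-in `sqlite3` module; the `sales` table, its sample rows and the two helper functions are hypothetical and exist only to contrast the two access patterns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, region TEXT, sold_on TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "east", "2024-01-15", 120),
        (2, "west", "2024-02-03", 80),
        (3, "east", "2024-03-20", 200),
        (4, "west", "2024-04-02", 50),  # falls outside the first quarter
    ],
)
conn.commit()

def lookup_order(conn, order_id):
    """Operational pattern: fetch one record by its key."""
    return conn.execute(
        "SELECT region, sold_on, amount FROM sales WHERE order_id = ?",
        (order_id,),
    ).fetchone()

def quarterly_totals(conn, start, end):
    """Analytical pattern: scan many records and summarize them."""
    # ISO-8601 date strings compare correctly as plain text.
    return dict(conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "WHERE sold_on >= ? AND sold_on < ? "
        "GROUP BY region ORDER BY region",
        (start, end),
    ))

one_order = lookup_order(conn, 2)
q1_totals = quarterly_totals(conn, "2024-01-01", "2024-04-01")
```

The lookup reads a single row and returns it unchanged. The quarterly query reads every row in the date range, keeps only the columns it needs and returns a summary, which is the "extracting the parts you are interested in" that Le Dem describes.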

Choosing the right solution for your data

Big data analysis tools have emerged for real-time, interactive workloads and retrospective, complex analysis of larger data sets. MongoDB and IBM, both major players in the big data analysis tools space, offer some key insights into the differences between the two. Here's a brief overview.

According to IBM, NoSQL systems such as document databases and key-value stores are common solutions for fast and scalable operational databases. With an appropriate NoSQL database, transactions can be processed more quickly, and the system can handle many small transactions at the same time during periods of peak activity. Transactions per second are viewed as a more relevant indicator of performance than response time.

Massively parallel processing (MPP) databases and MapReduce -- including implementations such as Hadoop -- are key solutions in the analytical space. There are even emerging solutions designed to help enterprises analyze data across both SQL and NoSQL stores, presenting graph, R and MapReduce capabilities within a single analytics platform.
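For readers unfamiliar with MapReduce, the following single-process Python sketch shows the shape of the programming model: a map step emits key-value pairs, a shuffle step groups them by key and a reduce step summarizes each group. It is not the Hadoop API, and it does none of the distribution or fault handling a real framework provides; the function names and the sample records are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    region, amount = record
    yield region, amount

def shuffle(pairs):
    """Group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    """Run map, shuffle and reduce in sequence over the records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

# Hypothetical input: (region, sale amount) records.
totals = map_reduce([("east", 120), ("west", 80), ("east", 200)])
```

In a real cluster, the map and reduce steps run in parallel on many machines, each working on its own slice of the data, which is what makes the model suitable for very large analytical jobs.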

Distinguishing features for operational vs. analytical data processing systems

Experts at MongoDB offer additional detail about the technical distinctions between analytics and online transaction processing systems.

Transactional systems are optimized for short, atomic, repetitive, select-oriented operations and transactions -- these systems can be very finely tuned for frequently used operations. They feature heavy reliance on caching, lots of resource sharing and prescribed code paths.

Analytical systems provide functional richness; processing speed, or fast response time; and ease of use. They typically feature lots of capacity within an MPP architecture. Such systems can move data quickly when needed but are designed to reduce data movement overall. They rely on few shared structures. Their functions may be built into the server and extensible to meet evolving end-user requirements.

Relying on a single database system to handle both types of activity is labor-intensive for IT, since conventional database systems show a great deal of variability in performance when asked to handle both analytic and transactional workloads. No single big data analysis tool suits every possible need, which means that at the enterprise level, most organizations end up using complementary systems to meet all their data workload needs.
