How should you test your big data apps?

Several years ago, IT organizations of all stripes were looking for the perfect combo of hardware, software and experienced personnel (e.g., data scientists) that could help them harness the massive quantities of information being gathered by their systems. Big data was one of the most talked-about issues in software development and release.
In 2016, big data is still an important focus area, although it is now frequently woven into other initiatives such as cloud computing adoption as well as app development for the Internet of Things. Dealing with big data indeed requires a multi-pronged strategy, due to the sheer size of the challenge.
According to IBM, more than 2.5 quintillion bytes of data are generated each day, and 90 percent of all electronic data in history has been created in just the last two years. Applications developed to handle information at such scale must be flexible and adaptable, which can only be ensured through rigorous testing. How do you effectively test a big data app?

Outlining a big data testing strategy: Functional and nonfunctional testing

The first thing to know about big data is that it is not solely about quantity. Quality assurance teams also have to account for:

  • Velocity: Sometimes data is produced at an abnormally quick pace, such as in the wake of a breaking news story or during periods of high activity on streaming video services. Applications that handle this data become highly sensitive to performance in these circumstances.
  • Variety: Not all data is created equal. For example, each web browser takes its own distinctive approach to how it formats and withholds certain types of information. Moreover, data like video and VoIP is more bandwidth-intensive than file transfers.
  • Volume: The rollout of tens of billions of connected devices and sensors as part of the IoT means that a sharply increasing amount of data will require disk storage, likely via infrastructure-as-a-service.

Addressing all of these requirements – the so-called “Three V’s” – will likely entail a mix of functional and nonfunctional testing, such as failover testing in order to thoroughly vet a system’s ability to allocate resources and draw upon backups during an interruption. Overall, a sound process is essential for ensuring that any big data application can validate its structured and unstructured sets under realistic conditions.
“Performing functional testing can identify data quality issues that originate in errors with coding or node configuration; effective test data and test environment management ensures that data from a variety of sources is of sufficient quality for accurate analysis and can be processed without error,” explained a 2015 paper on the topic. “Nonfunctional testing (specifically performance and failover testing) plays a key role in ensuring the scalability of the process.”

Hadoop, MapReduce and other specific considerations for big data apps

Just as certain utilities like Docker and Puppet are synonymous with the DevOps movement, solutions such as the Hadoop framework and the MapReduce programming model are closely associated with big data. Hadoop is open source and is designed to take in large amounts of data using cost-effective general purpose servers. It has its own file system, the Hadoop Distributed File System (HDFS), that extends the capability of the Windows or Linux machine that is feeding data to it.
Effective use of Hadoop requires deploying a collection of specific testing tools, including BeeTest for Hive and PigUnit for Pig. Prior to performing Hadoop processing, a handful of tests must be run, including ones that validate data types, ranges, constraints and cross-references. These tests improve results once the data goes into HDFS, which can support up to 200 petabytes of data in a cluster. The pre-Hadoop step will require validating original data against what is going into Hadoop, checking HDFS locations and making sure that sufficient data has been collected from a wide range of possible sources.
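As a rough illustration of those pre-Hadoop checks, here is a minimal Python sketch that validates data types, ranges, constraints and cross-references on a batch of records before they are loaded. The schema, the field names (user_id, country, age), the age range and the country reference set are all hypothetical, chosen only for the example; a real pipeline would take them from its own data contracts.

```python
# Hypothetical schema and reference data, assumed purely for illustration.
SCHEMA = {"user_id": int, "country": str, "age": int}
AGE_RANGE = (0, 130)
KNOWN_COUNTRIES = {"US", "DE", "IN"}  # stand-in for a real reference table


def validate_record(record):
    """Return a list of problems found in one record (empty if it is clean)."""
    problems = []

    # Data type checks (these also catch missing fields).
    for field, expected_type in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    if problems:
        # Later checks assume correct types, so stop here.
        return problems

    # Range check.
    low, high = AGE_RANGE
    if not low <= record["age"] <= high:
        problems.append("age out of range")

    # Constraint check.
    if record["user_id"] <= 0:
        problems.append("user_id must be positive")

    # Cross-reference check against the reference set.
    if record["country"] not in KNOWN_COUNTRIES:
        problems.append("unknown country")

    return problems


def validate_batch(records):
    """Split records into (valid, rejected); rejected holds (record, problems)."""
    valid, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected
```

Calling validate_batch on a sample of the source data gives a quick count of clean versus rejected records, which can then be compared with the number of records that actually land in Hadoop.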
If Hadoop is the “where” for big data processing, MapReduce is often the “how.” MapReduce provides a framework for crafting applications that can handle massive amounts of data across parallel clusters of commodity hardware (i.e., the types of infrastructures essential to operating a Hadoop cluster). There is a variety of utilities for making MapReduce easier to use, such as the MRUnit tool originally developed by Cloudera.
In general, MapReduce validation will involve checking that the proper data aggregation and segmentation rules are applied to your data, that key-value pairs are generated correctly and that the data is complete and uncorrupted after the MapReduce process has finished. After all of this is done, the next step is to validate the output of your data before it is moved to a data warehouse or incorporated into one of your applications.
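To make those MapReduce checks concrete, the following Python sketch runs a tiny in-memory imitation of the map, shuffle and reduce phases and shows the kind of assertions a test might make. It is not the MRUnit API (MRUnit is a Java library), and the "user,page" log format, the function names and the rule for skipping malformed lines are assumptions made for the example.

```python
from collections import defaultdict


def map_page_view(line):
    """Mapper: turn one 'user,page' log line into a list of (page, 1) pairs."""
    parts = line.strip().split(",")
    if len(parts) != 2 or not parts[1]:
        return []  # assumed segmentation rule: malformed lines are skipped
    return [(parts[1], 1)]


def reduce_counts(key, values):
    """Reducer: aggregate all counts emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """In-memory stand-in for the map, shuffle and reduce phases."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_page_view(line):
            grouped[key].append(value)  # shuffle: group values by key
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["alice,/home", "bob,/home", "carol,/pricing", "broken-line"]
    result = run_job(sample)

    # Key-value pairs are generated as expected.
    assert map_page_view("alice,/home") == [("/home", 1)]
    # Aggregation rule: counts are summed per page.
    assert result == {"/home": 2, "/pricing": 1}
    # Reconciliation: output total equals the number of well-formed inputs.
    assert sum(result.values()) == 3
```

The final assertion is a simple reconciliation check: if the totals coming out of the job do not match the number of well-formed records going in, data was lost or duplicated somewhere along the way.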

Enterprise test management and other big data testing tips

Ultimately, you will likely rely upon multiple testing frameworks, suites and tools to move your big data app from conception all the way through to production. Getting the most out of data-intensive applications isn’t always easy, but it doesn’t hurt to have a trusted and well-defined set of tools at your disposal.