Big Data Testing







Big Data is a term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy protection.

“Big data creates a new layer in the economy which is all about information, turning information, or data, into revenue. This will accelerate growth in the global economy and create jobs. In 2013, big data is forecast to drive $34 billion of IT spending” – Gartner

Testing Big Data applications requires a specific mindset and skill set, along with a deep understanding of the underlying technologies and a pragmatic approach to data science. For a tester, it is fundamentally important to understand how Big Data evolved, what it is meant for, and why Big Data applications need to be tested.


Increasing Need for Live Integration of Information

With information arriving from multiple, disparate data sources, it has become imperative to facilitate live integration of that information. This forces enterprises to keep their data constantly clean and reliable, which can only be ensured through end-to-end testing of the data sources and integrators.
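One simple form of end-to-end check is a reconciliation between what a source emitted and what the integrator delivered. The following is a minimal sketch, not a prescribed method: the record layout, the `id` key, and the function names `fingerprint` and `reconcile` are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, List

Record = Dict[str, object]


def fingerprint(record: Record) -> str:
    """Hash a record's fields in sorted key order, so field order does not matter."""
    canonical = "|".join(f"{k}={record[k]!r}" for k in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: Iterable[Record], target: Iterable[Record], key: str) -> Dict[str, List]:
    """Compare source and integrated target records by key and content."""
    src = {r[key]: fingerprint(r) for r in source}
    tgt = {r[key]: fingerprint(r) for r in target}
    return {
        "missing_in_target": sorted(k for k in src if k not in tgt),
        "unexpected_in_target": sorted(k for k in tgt if k not in src),
        "mismatched": sorted(k for k in src if k in tgt and src[k] != tgt[k]),
    }


# Hypothetical data: record 2 was altered in transit, 3 was dropped, 4 appeared from nowhere.
source_rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}, {"id": 3, "amount": 30.0}]
target_rows = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}, {"id": 4, "amount": 40.0}]

report = reconcile(source_rows, target_rows, key="id")
print(report)
```

In a real pipeline the two inputs would be read from the source system and the integrated store; comparing fingerprints rather than full records keeps the check cheap when records are wide.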


Fig 2: Big Data


Instant Data Collection and Deployment

The power of predictive analytics and the ability to take decisive action have pushed enterprises to adopt instant data collection solutions. The resulting decisions carry significant business impact because they draw on insights from minute patterns in large data sets. Add to that the CIO’s mandate, which demands instant deployment of solutions to keep pace with changing business dynamics. Unless the applications and data feeds are tested and certified for live deployment, these challenges cannot be met with the assurance that every critical operation requires.

Real-time Scalability Challenges

Big Data applications are built to match the scale and volume of data processing that a given scenario involves. Critical errors in the architectural elements governing their design can lead to catastrophic failures. Rigorous testing that combines smarter data sampling and cataloguing techniques with high-end performance testing capabilities is essential to address the scalability problems that Big Data applications pose.
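As one illustration of what "smarter" sampling can mean, a stratified sample keeps rare record types represented in a small test set, where a uniform random sample would likely miss them. This is a sketch under assumptions: the event records, the `type` field, and the `stratified_sample` helper are hypothetical and not taken from any particular tool.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, per_stratum, seed=0):
    """Pick up to `per_stratum` records from each stratum, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the test data set is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[stratum_of(record)].append(record)

    sample = []
    for name in sorted(groups):
        rows = groups[name]
        # Small strata are taken whole rather than skipped.
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample


# Hypothetical data: 1000 common events and only 3 rare ones.
events = [{"type": "click", "id": i} for i in range(1000)]
events += [{"type": "refund", "id": i} for i in range(1000, 1003)]

sample = stratified_sample(events, lambda r: r["type"], per_stratum=5)
print(len(sample), sorted({r["type"] for r in sample}))
```

Here the sample holds five clicks and all three refunds, so tests that exercise the rare path still have data to run against.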

Big Data tools, by their very design, incorporate indexing and layers of abstraction over the data itself so that they can process massive volumes of data in usable timescales. To test these applications, our testing must also look at these same indexes and abstraction layers and apply corresponding tests at the appropriate layers. In this way we can test the scalability of the system components without necessarily having to process the full data load that would normally accompany that scale of operation.

As testers, we have to work to understand the metadata held in these indexes and abstraction layers and the relationships between it and the data. This knowledge allows us to create test archives in which each layer in the system behaves exactly as it would in a massive production system, but with a much lower setup and storage overhead (Figure 2). By essentially “mocking” out very small partitions for all but a target range of dates or imports, we create a metadata layer that is representative of a much larger system. Our knowledge of the query mechanism allows us to seed the remaining partitions (Figure 2, Partitions 3 and 4) with realistic data, so that a customer query across that date range is functionally equivalent to the same query on a much larger system.
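The partition-mocking idea can be sketched in a few lines. The example below is illustrative only: the `Partition` class, the date-based partitioning, and the `build_test_archive` and `query` helpers are assumptions standing in for whatever metadata layer and query mechanism the system under test actually uses.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class Partition:
    start: date
    end: date  # inclusive
    rows: List[dict] = field(default_factory=list)


def build_test_archive(first_day, n_partitions, days_per_partition,
                       seeded_indexes, rows_per_seeded_day):
    """Create every partition's metadata, but fill only the seeded ones with data."""
    partitions = []
    for i in range(n_partitions):
        start = first_day + timedelta(days=i * days_per_partition)
        end = start + timedelta(days=days_per_partition - 1)
        part = Partition(start, end)
        if i in seeded_indexes:
            # Target range: realistic volume for every day in the partition.
            for d in range(days_per_partition):
                day = start + timedelta(days=d)
                part.rows.extend({"day": day, "value": n} for n in range(rows_per_seeded_day))
        else:
            # Mocked partition: one stub row, so the metadata layer still sees it.
            part.rows.append({"day": start, "value": 0})
        partitions.append(part)
    return partitions


def query(partitions, lo, hi):
    """Use partition date ranges (the metadata) to prune, then filter rows."""
    touched = [p for p in partitions if p.start <= hi and p.end >= lo]
    rows = [r for p in touched for r in p.rows if lo <= r["day"] <= hi]
    return touched, rows


# Four weekly partitions; only the third and fourth (zero-based 2 and 3) are seeded.
archive = build_test_archive(date(2014, 1, 1), 4, 7, {2, 3}, rows_per_seeded_day=10)
touched, rows = query(archive, date(2014, 1, 15), date(2014, 1, 28))
print(len(archive), len(touched), len(rows))
```

The query over the seeded date range touches only the two seeded partitions and returns their full data, while the archive as a whole still exposes four partitions to the metadata layer at a fraction of the storage cost.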