How do you do Big Data testing? Which testing tools should be used?

Some of the widely used testing tools for Big Data testing are:

1. TestingWhiz

2. QuerySurge

3. Tricentis

Among these tools, TestingWhiz provides an automated Big Data testing solution that helps you verify structured and unstructured data sets, schemas, approaches and inherent processes residing at different sources in your application, with support for technologies such as Hive, Map-Reduce, Sqoop and Pig. With its out-of-the-box connectors, you can validate the volume, variety and velocity of data, identify differences and bad data after implementation, migration and integration processes, and ensure that the functional and non-functional requirements of the data are met so that processes and analytics run without errors. I would suggest trying its 30-day free trial, which you can download from the website.

Most tools for testing Big Data are commercial. If you can adapt the APIs of open source tools, you might be able to get your work done with them. Again, it depends on your project requirements. QuerySurge is one commercial tool that I found to be good.

010 Editor is an editor that can handle 2 GB text files without any glitches.

Big Data Testing Scenarios

Let us look at the scenarios in which Big Data testing can be applied to the Big Data components:

Data Ingestion:
This step is considered the pre-Hadoop stage, where data is generated from multiple sources and flows into HDFS. In this step the tester verifies that data is extracted properly and loaded into HDFS.

  • Ensure that the right data from multiple data sources is ingested, i.e. all required data is ingested as per its defined schema, and data that does not match the schema is not ingested. Data that does not match the schema should be stored for stats reporting purposes. Also ensure there is no data corruption.
  • Compare source data with the ingested data to validate that the correct data is pushed.
  • Verify that the correct data files are generated and loaded into HDFS at the desired location.
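
The ingestion checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the schema (user_id, event, amount) and all function names are hypothetical, and the records are plain dictionaries rather than anything read from HDFS. It separates records that match the schema from those that do not (kept for reporting), and compares source and ingested data with an order-independent fingerprint.

```python
import hashlib

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"user_id": int, "event": str, "amount": float}


def matches_schema(record, schema=SCHEMA):
    """Return True when the record has exactly the schema's fields with the expected types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[name], expected) for name, expected in schema.items())


def split_by_schema(records, schema=SCHEMA):
    """Separate records into (accepted, rejected); rejected ones are kept for stats reporting."""
    accepted, rejected = [], []
    for record in records:
        (accepted if matches_schema(record, schema) else rejected).append(record)
    return accepted, rejected


def fingerprint(records):
    """Order-independent digest of a record set, for comparing source data with ingested data."""
    digests = sorted(
        hashlib.sha256(repr(sorted(r.items())).encode("utf-8")).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
```

In a test, the accepted source records and the records read back from the ingestion target would be passed to fingerprint(); equal digests suggest that nothing was lost, added or corrupted, regardless of record order.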

Data Processing:

This step is used for validating Map-Reduce jobs. Map-Reduce is a programming model for condensing large amounts of data into aggregated data. The ingested data is processed by executing Map-Reduce jobs, which produce the desired results. In this step the tester verifies that the ingested data is processed by the Map-Reduce jobs and validates whether the business logic is implemented correctly.

  • Ensure Map-Reduce jobs run properly without any exceptions.
  • Ensure key-value pairs are generated correctly after the MR jobs.
  • Validate that business rules are implemented on the data.
  • Validate that data aggregation is implemented and that data is consolidated after the reduce operations.
  • Validate that data is processed correctly after the Map-Reduce jobs by comparing output files with input files.

Note: For validation at the data ingestion or data processing layers, we should use a small set of sample data (in KBs or MBs). With a small sample we can easily verify that the correct data is ingested by comparing source data with output data at the ingestion layer. It also becomes easier to verify that MR jobs run without errors, that business rules are correctly implemented on the ingested data, and that data aggregation is done correctly, by comparing the output file with the input file.

If we initially use large data (in GBs) for testing at the data ingestion or data processing layers, it becomes very difficult to verify each input record against its output record and to validate whether business rules are implemented correctly.
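
One way to use a small sample, sketched here in Python, is to recompute the expected result with a tiny in-memory reference implementation and compare it with the job's output. The business rule below (a word count) is only a stand-in, the function names are hypothetical, and the job output is assumed to be tab-separated "key, value" text lines; a real job's rule and output layout may differ.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit one (word, 1) pair per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_phase(pairs):
    """Sum the values for each key, as a reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def parse_job_output(lines):
    """Parse 'key<TAB>count' lines (assumed output layout) into a dict."""
    result = {}
    for line in lines:
        key, count = line.rstrip("\n").split("\t")
        result[key] = int(count)
    return result


def diff_outputs(expected, actual):
    """Return the keys whose values differ, mapped to (expected, actual)."""
    keys = set(expected) | set(actual)
    return {
        k: (expected.get(k), actual.get(k))
        for k in keys
        if expected.get(k) != actual.get(k)
    }
```

An empty result from diff_outputs() means the job output agrees with the reference on the sample; any entry points at a specific key to investigate, which is practical only because the sample is small.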

Data Storage:

This step is used for storing output data in HDFS or any other storage system (such as a data warehouse). In this step the tester verifies that the output data is correctly generated and loaded into the storage system.

  • Validate that data is aggregated after the Map-Reduce jobs.
  • Verify that the correct data is loaded into the storage system, and discard any intermediate data that is present.
  • Verify that there is no data corruption by comparing the output data with the data in HDFS (or any other storage system).
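
The corruption check can be sketched as a checksum comparison. The Python below assumes that both the job's output files and the stored copies are available as local directories (for example after copying them out of HDFS with hdfs dfs -get); the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(output_dir, stored_dir):
    """Return names of output files that are missing from stored_dir or differ in content."""
    problems = []
    for produced in sorted(Path(output_dir).iterdir()):
        if not produced.is_file():
            continue
        stored = Path(stored_dir) / produced.name
        if not stored.is_file() or file_checksum(produced) != file_checksum(stored):
            problems.append(produced.name)
    return problems
```

An empty list from find_corrupted() indicates that every output file has a byte-identical stored copy. It does not detect extra files in the storage directory, which would need a check in the other direction.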

Other types of testing scenarios a Big Data tester can cover are:

  • Check whether proper alert mechanisms are implemented, such as email alerts and sending metrics to Amazon CloudWatch.
  • Check that exceptions or errors are displayed properly, with an appropriate message, so that solving an error becomes easy.
  • Performance testing: process a random chunk of large data and monitor parameters such as the time taken to complete Map-Reduce jobs, memory utilization, disk utilization and other metrics as required.
  • Integration testing: test the complete workflow, from data ingestion through to data storage/visualization.
  • Architecture testing: test that Hadoop is highly available at all times and that failover services are properly implemented, so that data is processed even if nodes fail.
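
For the performance scenario, the idea of recording time and memory and checking them against agreed limits can be sketched as follows. This Python example measures a function running in the local process only; for real Map-Reduce jobs the numbers would come from the cluster's own monitoring, so treat the names and the approach here as a hypothetical stand-in.

```python
import time
import tracemalloc


def measure(job, *args, **kwargs):
    """Run job(*args, **kwargs) and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = job(*args, **kwargs)
        elapsed = time.perf_counter() - started
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def within_budget(elapsed, peak, max_seconds, max_bytes):
    """Check measured values against thresholds agreed for the test run."""
    return elapsed <= max_seconds and peak <= max_bytes
```

A performance test would call measure() on the processing step for a chosen chunk of data and fail when within_budget() returns False.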

A few tools used in Big Data:

  • Data Ingestion - Kafka, ZooKeeper, Sqoop, Flume, Storm, Amazon Kinesis.
  • Data Processing - Hadoop (Map-Reduce), Cascading, Oozie, Hive, Pig.
  • Data Storage - HDFS (Hadoop Distributed File System), Amazon S3, HBase.

Testing should be performed at each of the three phases of Big Data processing to ensure that data is processed without any errors.

Functional testing includes:

  • Validation of pre-Hadoop processing -- data from various sources, such as social networks and call logs, is extracted and loaded into HDFS.
  • Validation of the Hadoop Map-Reduce process data output -- once data is loaded into HDFS, the Hadoop Map-Reduce process is run to process the data coming from the different sources.
  • Validation of data extraction and load into the EDW -- once the Map-Reduce process is completed and the data output files are generated, this data is moved to the EDW (enterprise data warehouse) or any other transactional system, based on requirements.

Non-functional testing includes:

  • Performance Testing 
  • Failover Testing

Note: For testing, it is very important to generate test data that covers various test scenarios (positive and negative). Positive test scenarios cover cases that are directly related to the functionality. Negative test scenarios cover cases that are not directly related to the desired functionality, such as invalid or unexpected input.
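
Generating such data can be sketched in Python as below. The record layout (user_id, event, amount), the kinds of defects and the function names are all hypothetical, and "negative" is interpreted here as a deliberately malformed record that the pipeline is expected to reject, which is only one possible reading.

```python
import random


def make_positive(rng, index):
    """A well-formed record that the pipeline is expected to accept."""
    return {
        "user_id": index,
        "event": rng.choice(["click", "view", "purchase"]),
        "amount": round(rng.uniform(0.0, 500.0), 2),
    }


def make_negative(rng, index):
    """A deliberately malformed record that the pipeline is expected to reject."""
    record = make_positive(rng, index)
    defect = rng.choice(["missing_field", "wrong_type", "out_of_range"])
    if defect == "missing_field":
        del record["event"]
    elif defect == "wrong_type":
        record["user_id"] = "not-a-number"
    else:
        record["amount"] = -1.0
    return record


def generate(count, negative_ratio=0.2, seed=42):
    """Return (record, expected_valid) pairs; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    cases = []
    for index in range(count):
        if rng.random() < negative_ratio:
            cases.append((make_negative(rng, index), False))
        else:
            cases.append((make_positive(rng, index), True))
    return cases
```

Each pair carries the expected outcome alongside the record, so a test can feed the records through ingestion and check that exactly the ones marked as valid arrive in the target.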