Flexible Data Generation With Datafaker Gen

In this article, explore Datafaker Gen, a command-line tool designed to generate realistic data in various formats and sink it to different destinations.

Introduction to Datafaker

Datafaker is a modern framework that enables JVM programmers to efficiently generate fake data for their projects using over 200 data providers, allowing for quick setup and usage. Custom providers can be written when you need domain-specific data. In addition to providers, the generated data can be exported to popular formats like CSV, JSON, SQL, XML, and YAML.

For a good introduction to the basic features, see “Datafaker: An Alternative to Using Production Data.”

Datafaker offers many features, such as working with sequences and collections and generating custom objects based on schemas (see “Datafaker 2.0”).

Bulk Data Generation

In software development and testing, the need to frequently generate data for various purposes arises, whether it’s to conduct non-functional tests or to simulate burst loads. Let’s consider a straightforward scenario in which we have the task of generating 10,000 messages in JSON format to be sent to RabbitMQ.

From my perspective, these options are worth considering:

  1. Developing your own tool: One option is to write a custom application from scratch to generate these records (messages). If the generated data needs to be more realistic, it makes sense to use Datafaker or JavaFaker.
  2. Using specific tools: Alternatively, we could select tools designed for particular databases or message brokers. For example, voluble provides specialized functionality for generating and publishing messages to Kafka topics. A more modern option is ShadowTraffic, which is currently under development and directed towards a container-based approach, which may not always be necessary.
  3. Datafaker Gen: Finally, we have the option to use Datafaker Gen, which I want to consider in the current article.

Datafaker Gen Overview

Datafaker Gen offers a command-line generator based on the Datafaker library, which allows for the continuous generation of data in various formats and integration with different storage systems, message brokers, and backend services. Since this tool is based on Datafaker, the generated data can be quite realistic. The schema, format type, and sink can all be configured without rebuilding the project.

Datafake Gen consists of the following main components that can be configured:

1. Schema Definition

Users can define the schema for their records in the config.yaml file. The schema specifies the field definitions of the record based on the Datafaker provider. It also allows for the definition of embedded fields.

YAML

default_locale: en-EN
fields:
  - name: lastname
    generators: [ Name#lastName ]
  - name: firstname
    generators: [ Name#firstName ]
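For instance, with this schema and the JSON format selected, a generated batch might look like the following (the names are random; this output is purely illustrative):

```json
[
  { "lastname": "Smith", "firstname": "John" },
  { "lastname": "Doe", "firstname": "Jane" }
]
```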

2. Format

Datafake Gen allows users to specify the format in which records will be generated. Currently, there are basic implementations for CSV, JSON, SQL, XML, and YAML formats. Additionally, formats can be extended with custom implementations. The configuration for formats is specified in the output.yaml file.

YAML

formats:
  csv:
    quote: "@"
    separator: $$$$$$$
  json:
    formattedAs: "[]"
  yaml:
  xml:
    pretty: true
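To make the quote and separator settings concrete, a purely illustrative CSV record under the configuration above could look like this, with `@` as the quote character and the `$$$$$$$` string as the separator (actual output may differ):

```csv
@lastname@$$$$$$$@firstname@
@Smith@$$$$$$$@John@
```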

3. Sink

The sink component determines where the generated data will be stored or published. The basic implementation includes command-line output and text file sinks. Additionally, sinks can be extended with custom implementations such as RabbitMQ, as demonstrated in this article. The configuration for sinks is specified in the output.yaml file.

YAML
sinks:
  rabbitmq:
    batchsize: 1 # when 1, a message contains 1 document; when >1, a message contains a batch of documents
    host: localhost
    port: 5672
    username: guest
    password: guest
    exchange: test.direct.exchange
    routingkey: products.key

Extensibility via Java SPI

Datafake Gen uses the Java SPI (Service Provider Interface) to make it easy to add new formats or sinks. This extensibility allows for customization of Datafake Gen according to specific requirements.

How To Add a New Sink in Datafake Gen

Before adding a new sink, you may want to check if it already exists in the datafaker-gen-examples repository. If it does not exist, you can refer to examples on how to add a new sink.

When it comes to extending Datafake Gen with new sink implementations, developers have two primary options to consider:

  1. Using the parent project: developers can implement sink interfaces for their sink extensions directly within it, similar to those available in the datafaker-gen-examples repository.
  2. Including dependencies from the Maven repository to access the required interfaces. For this approach, Datafake Gen must first be built and installed into the local Maven repository. This approach provides flexibility in project structure and requirements.

1. Implementing RabbitMQ Sink

To add a new RabbitMQ sink, one simply needs to implement the net.datafaker.datafaker_gen.sink.Sink interface.

This interface contains two methods:

  1. getName – This method defines the sink name.
  2. run – This method triggers the generation of records and then sends or saves all the generated records to the specified destination. The method parameters include the configuration specific to this sink retrieved from the output.yaml file as well as the data generation function and the desired number of lines to be generated.
Java
import net.datafaker.datafaker_gen.sink.Sink;

public class RabbitMqSink implements Sink {

    @Override
    public String getName() {
        return "rabbitmq";
    }

    @Override
    public void run(Map<String, ?> config, Function<Integer, ?> function, int numberOfLines) {
        // Read output configuration ...
        int numberOfLinesToPrint = numberOfLines;
        String host = (String) config.get("host");

        // Generate lines
        String lines = (String) function.apply(numberOfLinesToPrint);

        // Sending or saving results to the expected resource
        // In this case, this is connecting to RabbitMQ and sending messages.
        ConnectionFactory factory = getConnectionFactory(host, port, username, password);
        try (Connection connection = factory.newConnection()) {
            Channel channel = connection.createChannel();
            JsonArray jsonArray = JsonParser.parseString(lines).getAsJsonArray();
            jsonArray.forEach(jsonElement -> {
                try {
                    channel.basicPublish(exchange, routingKey, null, jsonElement.toString().getBytes());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

2. Adding Configuration for the New RabbitMQ Sink

As previously mentioned, the configuration for sinks or formats can be added to the output.yaml file. The specific fields may vary depending on your custom sink. Below is an example configuration for a RabbitMQ sink:

YAML
sinks:
  rabbitmq:
    batchsize: 1 # when 1, a message contains 1 document; when >1, a message contains a batch of documents
    host: localhost
    port: 5672
    username: guest
    password: guest
    exchange: test.direct.exchange
    routingkey: products.key

3. Adding Custom Sink via SPI

Adding a custom sink via SPI (Service Provider Interface) involves including the provider configuration in the ./resources/META-INF/services/net.datafaker.datafaker_gen.sink.Sink file. This file contains paths to the sink implementations:

Properties files
net.datafaker.datafaker_gen.sink.RabbitMqSink
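The lookup itself relies on the standard java.util.ServiceLoader mechanism. The sketch below illustrates it with a hypothetical stand-in interface, purely for illustration (the real one is net.datafaker.datafaker_gen.sink.Sink); with no provider file on the classpath, the loader simply finds nothing:

```java
import java.util.ServiceLoader;

// Hypothetical stand-in for the real Sink interface, for illustration only
interface Sink {
    String getName();
}

public class SpiLookup {
    public static void main(String[] args) {
        // At startup, implementations listed in
        // META-INF/services/<fully-qualified-interface-name> are discovered like this:
        ServiceLoader<Sink> sinks = ServiceLoader.load(Sink.class);
        for (Sink sink : sinks) {
            System.out.println("Found sink: " + sink.getName());
        }
    }
}
```

Because the provider file maps the interface name to concrete classes, a new sink becomes visible to the generator without any changes to its code.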

Those are the three simple steps for extending Datafake Gen. In this example, we do not provide the complete implementation of the sink, nor show how to use additional libraries. To see the complete implementation, you can refer to the datafaker-gen-rabbitmq module in the example repository.

How To Run

Step 1

Build a JAR file based on the new implementation:

Shell
./mvnw clean verify

Step 2

Define the schema for records in the config.yaml file and place this file in the appropriate location where the generator should run. Additionally, define the sinks and formats in the output.yaml file, as demonstrated previously.

Step 3

Datafake Gen can be executed in two ways:

  1. Use the bash script from the bin folder in the parent project:

    Shell

    # Format json, number of lines 10000 and new RabbitMq Sink
    bin/datafaker_gen -f json -n 10000 -sink rabbitmq

  2. Execute the JAR directly, like this:

    Shell

    java -cp [path_to_jar] net.datafaker.datafaker_gen.DatafakerGen -f json -n 10000 -sink rabbitmq

How Fast Is It?

The test was based on the schema described above, in which one document consists of two fields. Documents are written one by one to the RabbitMQ queue in JSON format. The table below shows the speed for 10,000, 100,000, and 1M records on my local machine:

Records     Time
10,000      401 ms
100,000     11,613 ms
1,000,000   121,601 ms
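For a rough sense of throughput, the figures above can be converted to records per second (the exact numbers will of course vary by machine):

```java
public class Throughput {
    public static void main(String[] args) {
        // Measurements from the table above
        long[] records = {10_000, 100_000, 1_000_000};
        long[] millis = {401, 11_613, 121_601};
        for (int i = 0; i < records.length; i++) {
            double perSecond = records[i] * 1000.0 / millis[i];
            System.out.printf("%,d records -> %,.0f records/sec%n", records[i], perSecond);
        }
    }
}
```

This works out to roughly 25,000 records/sec for the smallest run, settling at around 8,000–8,500 records/sec for the larger runs.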

Conclusion

The Datafake Gen tool enables the creation of flexible and fast data generators for various types of destinations. Built on Datafaker, it facilitates realistic data generation. Developers can easily configure the content of records, formats, and sinks to suit their needs. As a simple Java application, it can be deployed anywhere you want, whether it’s in Docker or on-premise machines.

  • The full source code is available here.
  • I would like to thank Sergey Nuyanzin for reviewing this article.

Thank you for reading, and I hope this was helpful.
