Probability Distribution and Control

Choosing data at random among reference values is the foundation of test data generation engines. But a purely random choice often produces data that is not representative. To make generated data realistic, the underlying engine should let the designer control the probability distribution.

What is Probability Distribution

The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values.

For example, consider the content of a shopping cart on an eBusiness site such as Amazon.com. Each item in a cart is a product that a customer has selected from the catalog of products.

In that case, our list of values will be the list of products in the catalog.

Now, if you consider all the carts built by customers of the web site in a day, you can associate each product with a probability of being selected by a customer, just by counting the number of times it has been selected and dividing by the total number of selections.

This gives us the probability distribution of the products in a cart for that day.
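
The counting described above can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_distribution`, the product names, and the representation of a cart as a list of product names are assumptions made for the example, not part of any generation engine.

```python
from collections import Counter


def empirical_distribution(carts):
    """Estimate {product: probability} from a list of carts.

    Each cart is assumed to be a list of product names (hypothetical format).
    """
    # Count how many times each product was selected across all carts.
    counts = Counter(product for cart in carts for product in cart)
    total = sum(counts.values())
    # Divide each count by the total number of selections.
    return {product: n / total for product, n in counts.items()}


# Hypothetical carts collected during one day.
carts = [
    ["book", "lamp"],
    ["book"],
    ["book", "pen", "lamp"],
]
distribution = empirical_distribution(carts)
```

With these three carts there are six selections in total, so "book" (selected three times) gets a probability of 0.5, and all the probabilities add up to 1.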

Adding control

Probability distribution control over a generated list of values is a mechanism that lets the designer of a generator control the number of occurrences of each possible value in the list by associating it with a weight.

Principles of Distribution Control

The probability of a value is computed by dividing its own weight by the sum of the weights of all the possible values in the list, its own included.

The basis of this mechanism is the Weighted List generation rule. In this rule, the designer defines a list of values and chooses a weight for each of them. The generation engine then makes a random choice among those values while ensuring that the number of times it chooses a value is in proportion to the weight of that value.

Take for example the following weighted list of colors:

Value   Weight   Probability
blue    1        0.17
red     2        0.33
black   3        0.50

The probability of occurrence of the value "red" is 2 / (1 + 2 + 3) = 0.33 (~33%).
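
As an illustration of the principle, here is a minimal Python sketch of a weighted list, using the standard library function `random.choices`. It is not the engine's actual implementation: the names `weights_to_probabilities` and `generate` are hypothetical, and because each value is drawn independently, the observed counts only approximate the weights rather than match them exactly.

```python
import random


def weights_to_probabilities(weighted_values):
    """Convert {value: weight} into {value: probability}."""
    # The divisor is the sum of ALL weights, the value's own included.
    total = sum(weighted_values.values())
    return {value: weight / total for value, weight in weighted_values.items()}


def generate(weighted_values, count, seed=None):
    """Draw `count` values at random, in proportion to their weights."""
    rng = random.Random(seed)  # a seed makes the run reproducible
    values = list(weighted_values)
    weights = list(weighted_values.values())
    return rng.choices(values, weights=weights, k=count)


# The weighted list of colors from the example above.
colors = {"blue": 1, "red": 2, "black": 3}
probabilities = weights_to_probabilities(colors)
sample = generate(colors, count=6000, seed=42)
```

Here `probabilities["red"]` is 2 / 6, as computed above, and over 6000 draws "black" should appear roughly 3000 times. An engine that must hit the expected counts exactly would need a different strategy, for example shuffling a list that already contains each value the required number of times.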

Basic Application of Distribution Control

The basic application of distribution control is to a single field: a Weighted List generation rule defines the list of all the values allowed for that field, each with an appropriate weight.