MDCGen: Generator of Multidimensional Datasets for Clustering

Last updated:

Dec 2017

Publications:

If you are using any of the material below, please cite the corresponding publication:

Description:

MDCGen offers highly flexible parameterization and can generate clusters of varied shapes drawn from diverse underlying distributions. The tool can create clusters based on multivariate distributions as well as clusters in which a distribution directly determines the intra-cluster distances (i.e., the distances of objects to their cluster centroid). Additionally, MDCGen implements classic functionalities, e.g., customization of cluster separation, overlap control, addition of outliers and noisy features, correlated variables, rotations, and dataset quality evaluations, among others.
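To illustrate the idea of a distribution that directly determines the distance of objects to their centroid, here is a minimal sketch in plain Python. It is not MDCGen code: the function `sample_cluster` and its `radius_sampler` argument are hypothetical names chosen for this example. Each point gets a random direction, and the chosen distribution only sets how far from the centroid the point lies.

```python
import math
import random

def sample_cluster(centroid, n_points, radius_sampler, rng):
    """Draw points whose distance to the centroid follows radius_sampler.

    Hypothetical helper for illustration; not part of MDCGen.
    """
    dim = len(centroid)
    points = []
    for _ in range(n_points):
        # A normalized Gaussian vector gives a direction uniform on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in direction))
        # The chosen distribution fixes the distance to the centroid.
        radius = radius_sampler(rng)
        points.append([c + radius * x / norm for c, x in zip(centroid, direction)])
    return points

rng = random.Random(0)
centroid = [0.0, 0.0, 0.0]
# Intra-cluster distances drawn uniformly from [0, 1].
cluster = sample_cluster(centroid, 200, lambda r: r.uniform(0.0, 1.0), rng)
distances = [math.dist(p, centroid) for p in cluster]
```

Swapping the lambda for another sampler (for example one based on `rng.gauss` with the absolute value taken) changes the radial profile of the cluster without touching the rest of the code.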

Because MDCGen allows such a broad variety of configurations, some parameter combinations can produce meaningless or useless datasets. Some experience with the parameters is therefore advisable (they are explained in detail in the documentation). To validate a dataset, Silhouette evaluations provide performance indices that indicate whether the generated data follows a clear cluster-like structure.
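As a rough illustration of what a Silhouette evaluation measures, the sketch below computes the standard mean silhouette coefficient with Euclidean distances in plain Python. It is an assumption that this matches the textbook definition only; the exact variant and implementation used inside MDCGen may differ, and the function name `silhouette_score` is chosen for this example.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute s = 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the two groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

Values near 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or mislabeled structure. The pairwise distance computation grows quadratically with the number of objects, which is why the evaluation becomes slow on large datasets.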

*This online version generates datasets with fewer than 100 thousand objects (m) and fewer than 50 dimensions (n). Silhouette evaluations slow the process down considerably, so they are only available for datasets with fewer than 10 thousand objects (m). For larger datasets (or for Silhouette evaluation of datasets above that limit), please use either the MATLAB or the Python version.
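The size limits above can be checked before preparing a configuration. The following is a small sketch, not part of MDCGen: the constant and function names are hypothetical, and it assumes "below" means strictly less than the stated values.

```python
# Limits taken from the note above; names are hypothetical.
MAX_OBJECTS = 100_000             # online limit on the number of objects (m)
MAX_DIMENSIONS = 50               # online limit on the number of dimensions (n)
MAX_OBJECTS_SILHOUETTE = 10_000   # limit on m when Silhouette evaluation is requested

def fits_online_version(m, n, with_silhouette=False):
    """Return True if a dataset of m objects and n dimensions is within the online limits."""
    limit_m = MAX_OBJECTS_SILHOUETTE if with_silhouette else MAX_OBJECTS
    return m < limit_m and n < MAX_DIMENSIONS

small_ok = fits_online_version(5_000, 10, with_silhouette=True)
too_big_for_eval = fits_online_version(50_000, 10, with_silhouette=True)
```

When the check fails, the MATLAB or Python version is the appropriate choice.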

Software, scripts, tools:

The MATLAB version can be downloaded from the mdcgen-matlab GitHub repository.

The Python version can be downloaded from the mdcgenpy Python GitHub repository.

MDCGen online version:

Upload JSON file

Warning: The online generator and the online dataset visualization are temporarily deactivated due to changes in our server configuration. We apologize for any inconvenience. (Dec 2017)

Form for generating a basic parameterization (JSON file)

This form creates a basic parameterization in JSON format for configuring the online MDCGen. For more complex parameterizations, please edit the JSON files manually, following the syntax explained in the provided documentation.