The sim_df() function simulates a new data frame with approximately the same means, SDs, and correlations as an existing data frame. For now, it returns only numeric columns and simulates every numeric variable from a continuous normal distribution.

For example, here is the relationship between speed and distance in the built-in dataset cars.

Original cars dataset


You can create a new sample with the same parameters and 500 rows with the code sim_df(cars, 500).
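As a quick check, the sketch below simulates 500 rows and prints the same summary statistics for the original and the simulated data. It assumes sim_df() is provided by the faux package; the object name new_cars is only illustrative. Simulated values change from run to run, so expect the statistics to be close but not identical.

```r
# Assumption: sim_df() comes from the faux package
library(faux)

new_cars <- sim_df(cars, 500)

# Means, SDs, and correlation in the original data
sapply(cars, mean)
sapply(cars, sd)
cor(cars$speed, cars$dist)

# The same statistics in the simulated data
# (columns are selected by name in case extra columns are returned)
sapply(new_cars[names(cars)], mean)
sapply(new_cars[names(cars)], sd)
cor(new_cars$speed, new_cars$dist)
```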

Simulated cars dataset


Between-subject variables

You can also specify between-subject variables. For example, here is the relationship between sepal length and width in the built-in dataset iris.

Original iris dataset


And here is a new sample with 50 observations of each species, made with the code sim_df(iris, 50, between = "Species").

Simulated iris dataset


Empirical

Set empirical = TRUE to return a data frame with exactly the same means, SDs, and correlations as the original dataset.

exact_iris <- sim_df(iris, 50, between = "Species", empirical = TRUE)
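To see what "exactly the same" means here, the idea can be illustrated with MASS::mvrnorm(), which has an argument of the same name. This is a sketch of the concept for a single species under that assumption, not necessarily how sim_df() is implemented; the object names are illustrative.

```r
library(MASS)

# Numeric columns for one species
setosa <- iris[iris$Species == "setosa", 1:4]
mu <- colMeans(setosa)
sigma <- cov(setosa)

# With empirical = TRUE, mu and Sigma describe the sample itself
# rather than the population it is drawn from
exact <- mvrnorm(n = 50, mu = mu, Sigma = sigma, empirical = TRUE)

# Sample means and covariances reproduce the originals
# (up to floating-point tolerance)
all.equal(colMeans(exact), mu)
all.equal(cov(exact), sigma)
```

With empirical = FALSE, the same comparison would generally show small differences caused by sampling error.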

Rounding

For now, the function only creates new variables sampled from a continuous normal distribution, so you need to do any rounding or truncating yourself. I hope to add other sampling distributions in the future.
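One way to do this in base R is sketched below. It assumes sim_df() comes from the faux package; rounding to one decimal place matches how iris is recorded, and the floor of 0.1 is an illustrative choice to keep measurements positive, not something the function requires.

```r
# Assumption: sim_df() comes from the faux package
library(faux)

new_iris <- sim_df(iris, 50, between = "Species")

num_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Round to 1 decimal place, then truncate at a minimum of 0.1
# (the 0.1 floor is an illustrative choice)
new_iris[num_cols] <- lapply(
  new_iris[num_cols],
  function(x) pmax(round(x, 1), 0.1)
)
```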

Simulated iris dataset (rounded)
