The simdf() function produces a new data frame whose numeric columns have the same means, standard deviations, and correlations as those of an existing data frame. It only returns numeric columns and, for now, simulates every numeric variable from a continuous normal distribution.

For example, here is the relationship between speed and distance in the built-in dataset cars.

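library(dplyr)    # provides the %>% pipe
library(ggplot2)  # provides ggplot(), aes(), geom_point(), geom_smooth()

cars %>%
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = "lm")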
Original cars dataset

You can create a new sample with the same parameters and 500 rows with the code simdf(cars, 500).

simdf(cars, 500) %>%
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = "lm")
Simulated cars dataset
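
To make concrete what "the same parameters" means, here is a minimal sketch of the general idea. It is not the function's actual source: it assumes the MASS package is installed and that the data has no missing values, and sim_numeric() is a hypothetical helper introduced only for illustration.

```r
# Sketch only: draw n rows from a multivariate normal distribution that
# matches the means, SDs and correlations of the numeric columns of `data`.
# sim_numeric() is a hypothetical name, not part of any package.
library(MASS)  # for mvrnorm()

sim_numeric <- function(data, n = 100) {
  # keep the numeric columns only (assumes no missing values, n > 1)
  num <- data[vapply(data, is.numeric, logical(1))]
  mu <- colMeans(num)  # one mean per column
  sigma <- cov(num)    # covariance matrix: encodes both SDs and correlations
  as.data.frame(MASS::mvrnorm(n, mu = mu, Sigma = sigma))
}

set.seed(1)
sim_cars <- sim_numeric(cars, 500)

# The two correlations should be close, but not identical,
# because the new sample is random.
cor(cars$speed, cars$dist)
cor(sim_cars$speed, sim_cars$dist)
```

Because every column is drawn from a normal distribution, the simulated values are continuous and unbounded, so a variable such as dist can come out negative even though the original never is.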

Grouping Variables

You can optionally add grouping variables, in which case the sample size applies to each group. For example, here is the relationship between sepal length and width in the built-in dataset iris.

iris %>%
  ggplot(aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
Original iris dataset

And here is a new sample with 50 observations of each species, made with the code simdf(iris, 50, "Species").

simdf(iris, 50, "Species") %>%
  ggplot(aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
Simulated iris dataset
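
One way to picture the grouped version is as the same simulation run separately inside each group. The sketch below illustrates that reading and is not the function's actual implementation: it assumes the MASS package is installed and that there are no missing values, and sim_by_group() is a hypothetical helper.

```r
# Sketch only: simulate n rows per level of a grouping column, using the
# means, SDs and correlations observed within that group.
# sim_by_group() is a hypothetical name, not part of any package.
library(MASS)  # for mvrnorm()

sim_by_group <- function(data, n, group) {
  pieces <- lapply(split(data, data[[group]]), function(d) {
    vars <- d[setdiff(names(d), group)]                # drop the grouping column
    num <- vars[vapply(vars, is.numeric, logical(1))]  # keep numeric columns
    sim <- as.data.frame(
      MASS::mvrnorm(n, mu = colMeans(num), Sigma = cov(num))
    )
    sim[[group]] <- d[[group]][1]  # label the simulated rows with their group
    sim
  })
  do.call(rbind, pieces)  # stack the per-group samples
}

set.seed(1)
sim_iris <- sim_by_group(iris, 50, "Species")
table(sim_iris$Species)  # number of simulated rows per species
```

Simulating within groups matters here because the species differ in their means, so pooling all the rows would blur three separate clusters into one.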

For now, the function only samples new variables from a continuous normal distribution, so you need to do any rounding or truncating yourself. I hope to add other sampling distributions in the future.

simdf(iris, 50, "Species") %>%
  mutate_if(is.numeric, round, 1) %>%
  ggplot(aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
Simulated iris dataset (rounded)
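
Truncating works the same way as rounding. The sketch below uses base R only and a stand-in data frame drawn with rnorm() in place of real simulated output, so the column values are purely illustrative and ignore the correlation between the variables.

```r
# Stand-in for a simulated data frame: independent normal draws using the
# means and SDs of the built-in cars dataset (illustrative only).
set.seed(1)
sim <- data.frame(
  speed = rnorm(500, mean = mean(cars$speed), sd = sd(cars$speed)),
  dist  = rnorm(500, mean = mean(cars$dist),  sd = sd(cars$dist))
)

any(sim$dist < 0)  # normal draws can fall below zero

# Truncate at zero with pmax(), then round to whole numbers
sim$speed <- round(pmax(sim$speed, 0))
sim$dist  <- round(pmax(sim$dist, 0))

range(sim$dist)  # the lower bound is now 0
```

Keep in mind that truncating shifts the mean and shrinks the SD slightly, so the cleaned sample no longer matches the original parameters exactly.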
