The sim_df() function produces a data frame with the same distributions and correlations as an existing data frame. For now, it returns only numeric columns and simulates every numeric variable from a continuous normal distribution.
For example, here is the relationship between speed and distance in the built-in dataset cars.

Original cars dataset
You can create a new sample with the same parameters and 500 rows with the code sim_df(cars, 500).
sim_df(cars, 500) %>%
  ggplot(aes(speed, dist)) +
  geom_point() +
  geom_smooth(method = "lm")
#> Warning: `...` must not be empty for ungrouped data frames.
#> Did you want `data = everything()`?
Simulated cars dataset
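To make "same distributions and correlations" concrete, here is a minimal sketch of the general idea using MASS::mvrnorm() instead of sim_df(). It assumes the approach is to estimate the means and covariance matrix of the numeric columns and then draw from a multivariate normal; it is not faux's actual implementation, and simulate_like() is a hypothetical helper name.

```r
# Hypothetical helper (not part of faux): draw n rows from a multivariate
# normal whose means and covariance are estimated from the numeric columns.
simulate_like <- function(data, n) {
  num <- data[vapply(data, is.numeric, logical(1))]  # keep numeric columns only
  sim <- MASS::mvrnorm(n, mu = colMeans(num), Sigma = cov(num))
  as.data.frame(sim)
}

set.seed(1)  # make the random draw reproducible
sim_cars <- simulate_like(cars, 500)

# Sample statistics should be close to the originals, but not identical
rbind(original = colMeans(cars), simulated = colMeans(sim_cars))
c(original  = cor(cars$speed, cars$dist),
  simulated = cor(sim_cars$speed, sim_cars$dist))
```

Calling MASS::mvrnorm() with the namespace prefix avoids attaching MASS, which would otherwise mask dplyr::select().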
You can also specify between-subject variables. For example, here is the relationship between sepal length and sepal width in the built-in dataset iris.
iris %>%
  ggplot(aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
Original iris dataset
And here is a new sample with 50 observations of each species, made with the code sim_df(iris, 50, between = "Species").
sim_df(iris, 50, between = "Species") %>%
  ggplot(aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
Simulated iris dataset
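One way to think about a between-subject variable is that each group gets its own means and covariance matrix, and the simulation is run once per group. The sketch below illustrates that reading with MASS::mvrnorm(); it is an assumption about the general technique rather than a description of how sim_df() works internally, and simulate_by_group() is a hypothetical helper name.

```r
# Hypothetical helper (not part of faux): simulate n rows for each level of
# a grouping column, using that group's own means and covariance matrix.
simulate_by_group <- function(data, n, group) {
  pieces <- lapply(split(data, data[[group]]), function(d) {
    num <- d[vapply(d, is.numeric, logical(1))]  # numeric columns of this group
    sim <- as.data.frame(
      MASS::mvrnorm(n, mu = colMeans(num), Sigma = cov(num))
    )
    sim[[group]] <- d[[group]][1]  # carry the group label over
    sim
  })
  do.call(rbind, pieces)  # stack the per-group samples
}

set.seed(1)
sim_iris <- simulate_by_group(iris, 50, "Species")
table(sim_iris$Species)  # number of simulated rows per species
```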
Set empirical = TRUE to return a data frame with exactly the same means, SDs, and correlations as the original dataset.
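The difference between an ordinary draw and an empirical one can be illustrated with MASS::mvrnorm(), which has its own empirical argument. This is a sketch of the concept under the assumption that empirical = TRUE in sim_df() means the same thing (the sample statistics, not just the population parameters, match the targets); it does not show sim_df() itself.

```r
set.seed(1)
mu    <- colMeans(cars)  # target means
sigma <- cov(cars)       # target covariance matrix

# Ordinary draw: matches the targets only on average
random_draw <- MASS::mvrnorm(500, mu, sigma)

# Empirical draw: rescaled so the sample matches the targets
exact_draw <- MASS::mvrnorm(500, mu, sigma, empirical = TRUE)

# Both checks should hold up to floating-point error
isTRUE(all.equal(colMeans(exact_draw), mu))
isTRUE(all.equal(cov(exact_draw), sigma))

# The ordinary draw is off by sampling error
colMeans(random_draw) - mu
```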
For now, the function only creates new variables sampled from a continuous normal distribution, so you need to do any rounding or truncating yourself. I hope to add other sampling distributions in the future.
sim_df(iris, 50, between = "Species") %>%
  mutate_if(is.numeric, round, 1) %>%
  ggplot(aes(Sepal.Width, Sepal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm")
Simulated iris dataset (rounded)
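Truncating can be handled in a similar way. The sketch below uses plain rnorm() draws as a stand-in for simulated data (an assumption for illustration, not sim_df() output) and clamps impossible negative values to zero with pmax(), which is one simple reading of "truncating".

```r
set.seed(1)
# Stand-in for simulated data: normal draws based on cars$dist can go
# negative, which makes no sense for a stopping distance.
sim <- data.frame(
  dist = rnorm(500, mean = mean(cars$dist), sd = sd(cars$dist))
)

sim$dist_rounded   <- round(sim$dist)             # whole numbers, like the original
sim$dist_truncated <- pmax(sim$dist_rounded, 0)   # clamp negatives to zero

range(sim$dist)            # raw draws, may include negative values
range(sim$dist_truncated)  # lower bound is now at least zero
```

Note that clamping shifts the mean up and shrinks the SD slightly, so the adjusted data no longer match the original parameters as closely.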