# Data wrangling

## Quick summary

Use the Julia code in `data_wrangling.jl` to replicate. The data goes through four stages, which we refer to as `df_single`, `df_pairs`, `df_cor`, and `df_sum` respectively. The initial input is the output of the `save_vocab` procedure in the `broad_examination` experiment of the Netlogo model's BehaviorSpace experiments. The raw Netlogo output is stored in the `netlogo_output/df_single` directory. The naming convention for the files is `vocab_popsize_hostility_memsize_numskills.csv`.

The corresponding code can be found in `data_wrangling.jl`. Use `make_df_pairs()` to create a DataFrame of agent pairs, then use `make_df_cor()` to compute the correlation between the agent pairs' vocabulary distance and group distance. Finally, use `make_df_sum()` to create a single data frame containing the correlation values for all parameter configurations. To create distance matrices for the clustering, use `create_distance_matrix()`. All these functions come with wrapper functions that facilitate iteration over the correct files. The remainder of this file explains the columns of the different kinds of data frames in detail.

_Note:_ For some parameter configurations, notably some of those that only have a single item in their global vocabulary (`numskills = 1`), these functions will fail. We did not include these parameter configurations in our analysis.

## Raw Netlogo output

The raw output of Netlogo looks as follows:

| who | intimate | effective | extended | knowns | needs |
| --- | --- | --- | --- | --- | --- |
| 1765 | 1682 | 5793 | 6731 | [2 95 6 4 2 35] | [3 5 67 5 3] |
| 7654 | 16873 | 9157 | 4615 | [6 45 9 1 3 5 5] | [3] |
| 198 | 6457 | 2467 | 648 | [6 5 74 132 56] | [6 5 31 5] |

We have 30 of these DataFrames for every parameter configuration, resulting in 22500 DataFrames in total. Every row in the DataFrame represents an agent.
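When reading these CSVs (e.g. with CSV.jl), the bracketed `knowns` and `needs` columns typically arrive as plain strings. A minimal sketch of turning them into integer vectors; `parse_netlogo_list` is our own illustrative name, not a function from `data_wrangling.jl`:

```julia
# Hypothetical helper: parse Netlogo's bracketed list format,
# e.g. "[2 95 6 4 2 35]", into a Vector{Int}.
function parse_netlogo_list(s::AbstractString)
    inner = strip(s, ['[', ']'])        # drop the surrounding brackets
    isempty(inner) && return Int[]      # "[]" becomes an empty vector
    return parse.(Int, split(inner))    # split on whitespace, parse each entry
end

parse_netlogo_list("[2 95 6 4 2 35]")   # == [2, 95, 6, 4, 2, 35]
```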
- `who::Int`: a unique number, like an agent ID, not used any further
- `intimate::Int`: the agent's intimate (smallest) group
- `effective::Int`: the agent's effective (intermediate) group
- `extended::Int`: the agent's extended (largest) group
- `knowns::Int[]`: the agent's known skills
- `needs::Int[]`: the agent's needed skills

## Agent pairs

Next, for each one of these DataFrames we compute a DataFrame of agent pairs, which looks as follows:

| who | group_dist | vocab_dist_norm |
| --- | --- | --- |
| 1 | 0 | 2.1 |
| 2 | 3 | 4.23 |
| 3 | 2 | 0.96 |

A row represents a unique pair of two agents. Again, we have 22500 of these DataFrames; 30 for each global parameter configuration. The columns are as follows:

- `who::Int`: an agent-pair ID, not used
- `group_dist::Int`: the group relation between the pair, indicated as follows:
  - 0 = same intimate group
  - 1 = same effective group
  - 2 = same extended group
  - 3 = same population (= strangers)
- `vocab_dist_norm::Float`: the Levenshtein/edit distance between the agents' `knowns`, which we refer to as their vocabulary. Normalised as follows: for a given global parameter configuration, compute the edit distance of two strings of length `memsize` randomly generated from a set of `numskills` characters, and take the mean distance of 1000 such pairs. Divide the non-normalised value by this mean.

`group_dist` is not normalised for two reasons:

1. Normalisation is trivial: just divide by 3.
2. Saving 1/3 or 2/3 as floats would introduce inexactness, so values that should be equal would no longer compare as equal.
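The normalisation described above can be sketched as follows; `edit_distance` and `norm_constant` are our own illustrative names, not functions from `data_wrangling.jl`:

```julia
using Statistics

# Plain Levenshtein/edit distance between two skill vectors,
# via the standard dynamic-programming table.
function edit_distance(a, b)
    m, n = length(a), length(b)
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] = 0:m
    d[1, :] = 0:n
    for i in 1:m, j in 1:n
        cost = a[i] == b[j] ? 0 : 1
        d[i+1, j+1] = min(d[i, j+1] + 1,    # deletion
                          d[i+1, j] + 1,    # insertion
                          d[i, j] + cost)   # substitution / match
    end
    return d[m+1, n+1]
end

# Normalisation constant for one parameter configuration: the mean edit
# distance of 1000 random pairs of length-`memsize` strings drawn from an
# alphabet of `numskills` symbols.
function norm_constant(memsize, numskills; npairs = 1000)
    mean(edit_distance(rand(1:numskills, memsize), rand(1:numskills, memsize))
         for _ in 1:npairs)
end

vocab_dist_norm(a, b, c) = edit_distance(a, b) / c   # c = norm_constant(...)
```

In practice the constant would be computed once per parameter configuration and reused. Note that for `numskills = 1`, two random strings of equal length are always identical, so the constant is 0 and the division fails; this is one way the `numskills = 1` configurations noted above can break.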
## Further wrangling

Taking all 30 `df_pairs`'s of a given parameter configuration as input, we compute the following DataFrame `df_cor`:

| row | measure | group_level | value |
| --- | --- | --- | --- |
| 1 | mean | intimate | 0.971 |
| 2 | mean | effective | 0.581 |
| 3 | mean | extended | 0.350 |
| 4 | mean | population | 0.348 |
| 5 | stddev | intimate | 0.100 |
| 6 | stddev | effective | 0.918 |
| 7 | stddev | extended | 0.606 |
| 8 | stddev | population | 0.929 |
| 9 | r | population | 0.987 |
| 10 | p | population | 0.253 |

Rows 1–8 show the mean and standard deviation of the agent pairs' `vocab_dist` and were not used any further. Row 9 shows Pearson's _r_ for the correlation between an agent pair's `vocab_dist_norm` and `group_dist_norm` (normalised by the `make_df_cor()` function). Row 10 shows the correlation's _p_-value.

Finally, a summary data frame with columns `[popsize, hostility, memsize, numskills, dist_cor, p]` can be created, where `dist_cor` is the Pearson _r_ described in the previous paragraph and `p` is its _p_-value.
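The correlation step (rows 9 and 10 of `df_cor`) can be sketched as follows, assuming the two input vectors are the pooled `group_dist` and `vocab_dist_norm` columns of all 30 `df_pairs` data frames for one parameter configuration; `dist_correlation` is our own illustrative name:

```julia
using Statistics

# Pearson's r between normalised group distance and normalised vocabulary
# distance, plus the t statistic from which the p-value follows.
function dist_correlation(group_dist, vocab_dist_norm)
    g = group_dist ./ 3                  # normalise group distance to [0, 1];
                                         # scaling does not change r, since
                                         # Pearson's r is scale-invariant
    r = cor(g, vocab_dist_norm)          # Pearson's r (row 9)
    n = length(g)
    t = r * sqrt((n - 2) / (1 - r^2))    # t statistic with n - 2 dof
    # The two-sided p-value (row 10) would be 2 * ccdf(TDist(n - 2), abs(t)),
    # e.g. via Distributions.jl; omitted here to keep the sketch stdlib-only.
    return (r = r, t = t)
end
```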