The game provides API endpoints for downloading the data; a simple curl suffices. More information is available here: https://forum.die-staemme.de/index.php?threads/weltdaten-und-configs.183996/
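For example, fetching one world-data file could look as follows (the world subdomain de178 and the file name are placeholders; the forum thread lists the actual endpoints):
curl -sO https://de178.die-staemme.de/map/village.txt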
Note: From our observations, we could not determine a specific update frequency. Our crawl therefore acquires the data hourly, which is well below the actual update rate. For many analyses, we drop this fine granularity in favor of daily aggregates anyway.
Our paper at times leans heavily on data engineering. While most plots can be created without overly complex queries, gathering domain-specific features/metrics requires deeper lookups.
The dataset contains various data sources/tables derived from the raw crawling data (should you require the raw data, please send an inquiry via email):
Further, we derived tables that gather related information or serve as architectural helpers for more efficient queries.
We use ClickHouse (a powerful and highly efficient analytical columnar DBMS). We suggest following the Quick Start Tutorial (https://clickhouse.com/docs/en/getting-started/install) to get things running.
As soon as your database is up and running, i.e.,
clickhouse-client --query="select 'Hello World :-)'"
succeeds, please execute the setup_databse.sql script within its directory as follows:
clickhouse-client --queries-file="setup_databse.sql"
This process will take a while. Time to grab some coffee.
It will create a new database ds
and load all data tables.
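You can verify the setup by listing the created tables:
clickhouse-client --query="show tables from ds"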
If you are interested in learning more about ClickHouse, we suggest following the Tutorial (https://clickhouse.com/docs/en/tutorial/#3-analyze-the-data).
If you want to go beyond rather simple aggregates and focus on time sequences, ClickHouse arrays come in handy: they add another dimension within aggregations via higher-order functions and map-reduce-style approaches.
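As a minimal sketch (assuming the ds.player table loaded above): collecting each player's point series into an array enables element-wise operations within a single aggregation:
select
    player_id,
    groupArray(player_points) as series, --the time sequence as an array
    arrayDifference(series) as deltas --higher-order: change between consecutive snapshots
from (
    select player_id, player_points
    from ds.player
    order by timestamp --enforce chronological order before collecting
)
group by player_id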
Note: The database queries and this script were executed on a machine with 1 TB of memory; you might run into memory issues on lightweight hardware. If you hit memory limits, you may want to use our pre-calculated aggregates. Other options are sharding the execution by a suitable primary key into materialized pre-aggregates, or applying sampling (much as you would shard the data); neither is much fun, though.
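For one-off queries, ClickHouse's memory settings can also help, e.g. letting GROUP BY spill to disk (the values are illustrative, and yourquery.sql is a placeholder):
clickhouse-client --max_memory_usage=100000000000 --max_bytes_before_external_group_by=50000000000 --queries-file="yourquery.sql"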
You may also work with pandas dataframes directly, e.g.:
import pandas as pd
villages = pd.read_parquet('villages.parquet.gz')
unique_village_ids = villages.village_id.unique()
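Building on this, quick aggregates work directly on the dataframe, e.g. (assuming the parquet mirrors ds.village and includes a timestamp column):
villages_per_day = villages.groupby(villages['timestamp'].dt.date)['village_id'].nunique()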
Note: The libraries used should work across Python 3.6 to 3.9.
# install dependencies first
import sys
!{sys.executable} -m pip install -U numpy pandas clickhouse_driver matplotlib seaborn fastplot scipy
# common libs
import os
import pandas as pd
import numpy as np
import math
import random
from datetime import datetime
import operator
import itertools
import scipy.stats
# database
from clickhouse_driver import Client
DB_username='default'
DB_password=''
DB_database='ds'
CH = clickhouse = Client('localhost', user=DB_username, password=DB_password, database=DB_database)
# DB Test
CH.execute('select 1')
# plotting
import matplotlib
import matplotlib.pyplot as plt
import fastplot
%matplotlib inline
import seaborn as sns
plt.rc('text', usetex=True)
plt.rc('font', family='sans-serif')
plt.rcParams.update({'font.size': 14})
QUERY = '''
select
d,
avg(neighbors_count_bb_unique),
quantile(.25)(neighbors_count_bb_unique),
quantile(.50)(neighbors_count_bb_unique),
quantile(.75)(neighbors_count_bb_unique),
avg(player_points_potential),
quantile(.25)(player_points_potential),
quantile(.50)(player_points_potential),
quantile(.75)(player_points_potential),
quantile(.25)(local_bb_pressure_avg),
quantile(.50)(local_bb_pressure_avg),
quantile(.75)(local_bb_pressure_avg),
any(global_bb_pressure),
count(distinct player_id) as players
from
ds.bb_pressure
where
1
and player_id > 0 --players' perspective does not include barbarians
group by
d
order by
d
'''
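# note: quantile(.25)(col) is ClickHouse's parameterized-aggregate syntax; the quantile level is a parameter, the column an argument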
data = pd.DataFrame(CH.execute(QUERY), columns=['timestamp', 'neighbors_count_bb_unique__avg', 'neighbors_count_bb_unique__q25', 'neighbors_count_bb_unique__q50', 'neighbors_count_bb_unique__q75', 'player_points_potential__avg', 'player_points_potential__q25', 'player_points_potential__q50', 'player_points_potential__q75', 'local_bb_pressure_avg__q25', 'local_bb_pressure_avg__q50', 'local_bb_pressure_avg__q75', 'global_bb_pressure', 'active_players'])
view = data
plt.figure(figsize=(6,4))
plt.plot(view['timestamp'], view['local_bb_pressure_avg__q25'], linestyle='dashed', color='black')
plt.plot(view['timestamp'], view['local_bb_pressure_avg__q50'], linestyle='solid', color='black', label='pressure local Qs')
plt.plot(view['timestamp'], view['local_bb_pressure_avg__q75'], linestyle='dashed', color='black')
plt.plot(view['timestamp'], view['global_bb_pressure'], label='pressure global')
plt.plot(view['timestamp'], view['active_players'] / max(view['active_players']), label='active players')
plt.plot(view['timestamp'], view['player_points_potential__avg'], label='points potential')
plt.grid(True)
plt.legend(loc='lower left', ncol=1, prop={'size': 15})
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.show()
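The tick-thinning loop above reappears in most plots below; a small helper (a sketch, not part of the original notebook) would avoid the repetition:
def thin_xticks(ax, n=2):
    # hide all but every n-th x-axis tick label
    for index, label in enumerate(ax.xaxis.get_ticklabels()):
        if index % n != 0:
            label.set_visible(False)
# usage: thin_xticks(plt.gca(), n=2) before plt.show()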
last_buckets = None
heatmapdata = {}
for i, date in enumerate(sorted(['2020-06-01', '2020-07-01', '2020-08-01', '2020-10-01', '2020-12-01'])):
resolution = 5
QUERY = f'''
with {resolution} as resolution
select
toDate(timestamp) as d,
round(village_x / resolution, 0) * resolution as bucket_x,
round(village_y / resolution, 0) * resolution as bucket_y,
1 - countIf(village_player = 0) / count() as bb_pressure
from
ds.village
where
d = toDate('{date}')
group by
d,
bucket_x,
bucket_y
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'x', 'y', 'pressure'])
data['x'] = data['x'].astype(int) - 500
data['y'] = data['y'].astype(int) - 500
view = pd.pivot_table(data, values='pressure', index='y', columns='x', aggfunc='mean', fill_value=None)
buckets = len(view)
plt.figure()
sns.heatmap(view).set_title(date)
if i > 0:
c = plt.Circle((buckets//2, buckets//2), last_buckets//2, color='red', fill=False, linewidth=2)
plt.gca().add_patch(c)
plt.tight_layout()
last_buckets = buckets
heatmapdata[date] = data
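For intuition on the bucketing: with resolution 5, a village at x = 513 falls into bucket round(513 / 5) * 5 = 515, which the subsequent shift by the map-center offset of 500 turns into 15.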
# plot for the paper:
date = ['2020-06-01', '2020-07-01', '2020-08-01']
radius = [
(
max(heatmapdata.get(date[0]).x.max(), heatmapdata.get(date[0]).y.max()),
min(heatmapdata.get(date[0]).x.min(), heatmapdata.get(date[0]).y.min()),
),
(
max(heatmapdata.get(date[1]).x.max(), heatmapdata.get(date[1]).y.max()),
min(heatmapdata.get(date[1]).x.min(), heatmapdata.get(date[1]).y.min()),
),
(
max(heatmapdata.get(date[2]).x.max(), heatmapdata.get(date[2]).y.max()),
min(heatmapdata.get(date[2]).x.min(), heatmapdata.get(date[2]).y.min()),
)
]
data = heatmapdata.get(date[1]).drop(columns=['d'])
# DataFrame.append is deprecated in newer pandas; pd.concat is the portable equivalent
view = pd.concat([data, pd.DataFrame([
    {'x': radius[2][0], 'y': radius[2][1], 'pressure': 1},  # pad with the later month's corner coordinates
    {'x': radius[2][1], 'y': radius[2][0], 'pressure': 1},  # so the pivot grid covers that extent
])], ignore_index=True)
pivot = pd.pivot_table(view, values='pressure', index='y', columns='x', aggfunc='mean', fill_value=None)
buckets = len(pivot)
plt.figure(figsize=(6,4))
sns.heatmap(pivot.sort_index(ascending=False), cmap='magma').set_title(f'{date[1]} +/-1 month')
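# 'resolution' (= 5) carries over from the heatmap loop above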
c0 = plt.Circle((buckets//2, buckets//2), (radius[0][0])/resolution, color='seagreen', fill=False, linewidth=2)
c2 = plt.Circle((buckets//2, buckets//2), (radius[2][0])/resolution, color='seagreen', fill=False, linewidth=2)
plt.gca().add_patch(c0)
plt.gca().add_patch(c2)
n = 3
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
for index, label in enumerate(plt.gca().yaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
# plt.grid()
plt.tight_layout()
QUERY = '''
select
d,
countIf(player_id_old = 0) as cnt_bb,
countIf(player_id_new = player_id_old) as cnt_self,
countIf(
(player_ally_new = player_ally_old)
and (player_ally_new > 0)
and (player_id_new <> player_id_old)
) as cnt_ally_only,
count() as total,
countIf(
(player_id_old > 0)
and (player_points_new > player_points_old)
) as cnt_victim_smaller_nonbb,
countIf(
(player_id_old > 0)
and (player_id_old <> player_id_new)
and (player_ally_old <> player_ally_new)
and (player_points_new > player_points_old)
) as cnt_hostile_victim_smaller_nonbb,
total - cnt_bb - cnt_self - cnt_ally_only as cnt_hostile
from (
select
d,
player_id_new,
player_id_old,
player_ally_new,
player_ally_old,
player_points_new,
player_points_old,
1
from (
select
toDate(toStartOfWeek(conquer_timestamp)) as d, --(!)
player_id_new,
player_id_old,
player_ally_new,
player_points_new,
1
from
ds.conquer
any left join (
select
toDate(toStartOfWeek(timestamp)) as d,
player_id as player_id_new,
player_ally as player_ally_new,
player_points as player_points_new,
1
from
ds.player
) as X1
using(d, player_id_new)
where
1
-- and d = toStartOfWeek(toDate('2021-11-13'))
) as X2
any left join (
select
toDate(toStartOfWeek(timestamp)) as d,
player_id as player_id_old,
player_ally as player_ally_old,
player_points as player_points_old,
1
from
ds.player
) as X3
using(d, player_id_old)
) as X4
group by
d
order by
d
'''
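# note: ClickHouse's 'any left join' keeps at most one matching right-hand row per key, avoiding fan-out when joining the weekly player snapshots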
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'cnt_bb', 'cnt_self', 'cnt_ally_only', 'total', 'cnt_victim_smaller_nonbb', 'cnt_hostile_victim_smaller_nonbb', 'cnt_hostile'])
view = data
plt.figure(figsize=(6,4))
plt.plot(view['d'], view['total'] - view['cnt_self'], label='total') # wo self
plt.plot(view['d'], view['cnt_bb'], label='barbarian')
plt.plot(view['d'], view['cnt_hostile'], label='hostile', linestyle='dashdot')
plt.plot(view['d'], view['cnt_victim_smaller_nonbb'], label='onto smaller', linestyle='dashed')
plt.grid(True)
plt.legend(ncol=2, prop={'size': 15})
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.show()
QUERY = '''
-- STATISTICAL TEST raw data
select
week,
count(distinct player_id) as players_num,
churns_soon,
groupArray(player_points) as arr__player_points,
groupArray(player_villages) as arr__player_villages,
groupArray(player_rank) as arr__player_rank,
groupArray(player_has_ally) as arr__player_has_ally,
groupArray(num_allies) as arr__num_allies,
groupArray(villages_gain) as arr__villages_gain,
groupArray(villages_gain_bb) as arr__villages_gain_bb,
groupArray(villages_gain_self) as arr__villages_gain_self,
groupArray(villages_gain_ally_count) as arr__villages_gain_ally_count,
groupArray(villages_loss) as arr__villages_loss,
groupArray(villages_loss_self) as arr__villages_loss_self,
groupArray(villages_loss_ally_count) as arr__villages_loss_ally_count,
groupArray(kill_att) as arr__kill_att,
groupArray(kill_def) as arr__kill_def,
groupArray(kill_all) as arr__kill_all,
groupArray(kill_other) as arr__kill_other,
groupArray(has_ally_fraction) as arr__has_ally_fraction,
groupArray(global_bb_pressure) as arr__global_bb_pressure,
groupArray(local_bb_pressure_avg) as arr__local_bb_pressure_avg,
groupArray(player_points_potential) as arr__player_points_potential
from (
select
d,
player_id,
round(active_time_real_virtual / 60 / 60 / 24) as active_time_real_virtual_days,
round(t_relative_virtual / 60 / 60 / 24) as t_relative_virtual_days,
(select max(d) from ds.playerstats) - (select min(t_min_virtual) from ds.playerstats) as observation_period_days,
round(t_relative_virtual_days / 7) as week,
(t_max - d) as remaining_days,
remaining_days < 14 as churns_soon,
*,
1
from
ds.playerstats
any inner join (--discretization into weeks
select distinct
toStartOfWeek(d) as d,
player_id
from
ds.playerstats
) as X1
using(player_id, d)
where
1
-- and player_name = 'derVernichter'
-- and player_id in [1577194388, 9201456]
-- and player_id in [1577194388]
order by
d
)
group by
churns_soon,
week,
1
--limit 1
--format Vertical
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['week', 'players_num', 'churns_soon', 'player_points', 'player_villages', 'player_rank', 'player_has_ally', 'num_allies', 'villages_gain', 'villages_gain_bb', 'villages_gain_self', 'villages_gain_ally_count', 'villages_loss', 'villages_loss_self', 'villages_loss_ally_count', 'kill_att', 'kill_def', 'kill_all', 'kill_other', 'has_ally_fraction', 'global_bb_pressure', 'local_bb_pressure_avg', 'player_points_potential'])
view = data.set_index(['week', 'churns_soon', 'players_num']).sort_index()
for col in view.columns:
view[col] = view[col].apply(list)
from scipy.stats import mannwhitneyu
results = []
for week, row in view.groupby(by=['week']).agg(list).iterrows():
for col in view.columns:
try:
#difference in medians is significant? low p-values say yes, rejecting H0 -- ie populations are likely not the same.
U, p = mannwhitneyu(row[col][0], row[col][1])
results.append(
(week, col, p)
)
except ValueError as e:
# print(f'{week} {col}: {str(e)}')
# values are identical
results.append(
(week, col, 1)
)
except IndexError as e:
#print(f'{week} {col}: {str(e)}')
continue
results = pd.DataFrame(results, columns=['week', 'metric', 'p']).set_index(['metric'])
p=0.05
themeans = results[['p']].groupby(by=['metric']).mean()
thequantiles = results[['p']].groupby(by=['metric']).quantile(.9)
print('Overall p-values 90% quantile\n', thequantiles)
first_n_weeks = 4
themeans = results[results['week'] < first_n_weeks][['p']].groupby(by=['metric']).mean()
thequantiles = results[results['week'] < first_n_weeks][['p']].groupby(by=['metric']).quantile(.9)
print('First 4 Weeks p-values 90% quantile\n', thequantiles)
Overall p-values, 90% quantile:

metric | p
---|---
global_bb_pressure | 0.230208
has_ally_fraction | 0.007655
kill_all | 0.073920
kill_att | 0.070090
kill_def | 0.144612
kill_other | 0.067183
local_bb_pressure_avg | 0.414349
num_allies | 0.084200
player_has_ally | 0.030915
player_points | 0.000018
player_points_potential | 0.388789
player_rank | 0.004248
player_villages | 0.000003
villages_gain | 0.067301
villages_gain_ally_count | 0.339895
villages_gain_bb | 0.219401
villages_gain_self | 0.304913
villages_loss | 0.435828
villages_loss_ally_count | 0.402696
villages_loss_self | 0.304913

First 4 weeks p-values, 90% quantile:

metric | p
---|---
global_bb_pressure | 1.441534e-11
has_ally_fraction | 4.208970e-117
kill_all | 2.371890e-12
kill_att | 2.381867e-09
kill_def | 7.095659e-06
kill_other | 4.408445e-01
local_bb_pressure_avg | 1.783070e-04
num_allies | 2.304371e-124
player_has_ally | 1.681814e-21
player_points | 2.865796e-02
player_points_potential | 1.861066e-01
player_rank | 1.625595e-07
player_villages | 2.103746e-07
villages_gain | 1.568761e-08
villages_gain_ally_count | 7.871758e-01
villages_gain_bb | 5.365761e-08
villages_gain_self | 1.134528e-01
villages_loss | 2.355080e-02
villages_loss_ally_count | 8.178269e-01
villages_loss_self | 1.134528e-01
QUERY = '''
select
avgOrNull(rel__avg__player_points),
avgOrNull(rel__avg__player_villages),
avgOrNull(rel__avg__player_rank),
avgOrNull(rel__avg__player_has_ally),
avgOrNull(rel__avg__num_allies),
avgOrNull(rel__avg__villages_gain),
avgOrNull(rel__avg__villages_gain_bb),
avgOrNull(rel__avg__villages_gain_self),
avgOrNull(rel__avg__villages_gain_ally_count),
avgOrNull(rel__avg__villages_loss),
avgOrNull(rel__avg__villages_loss_self),
avgOrNull(rel__avg__villages_loss_ally_count),
avgOrNull(rel__avg__kill_att),
avgOrNull(rel__avg__kill_def),
avgOrNull(rel__avg__kill_all),
avgOrNull(rel__avg__kill_other),
avgOrNull(rel__avg__has_ally_fraction),
avgOrNull(rel__avg__global_bb_pressure),
avgOrNull(rel__avg__local_bb_pressure_avg),
avgOrNull(rel__avg__player_points_potential),
1
from (
select
week,
is_first_4_weeks,
groupArray(players_num) as arr__players_num,
groupArray(churns_soon) as arr__churns_soon,
groupArray(avg__player_points) as arr__avg__player_points,
groupArray(avg__player_villages) as arr__avg__player_villages,
groupArray(avg__player_rank) as arr__avg__player_rank,
groupArray(avg__player_has_ally) as arr__avg__player_has_ally,
groupArray(avg__num_allies) as arr__avg__num_allies,
groupArray(avg__villages_gain) as arr__avg__villages_gain,
groupArray(avg__villages_gain_bb) as arr__avg__villages_gain_bb,
groupArray(avg__villages_gain_self) as arr__avg__villages_gain_self,
groupArray(avg__villages_gain_ally_count) as arr__avg__villages_gain_ally_count,
groupArray(avg__villages_loss) as arr__avg__villages_loss,
groupArray(avg__villages_loss_self) as arr__avg__villages_loss_self,
groupArray(avg__villages_loss_ally_count) as arr__avg__villages_loss_ally_count,
groupArray(avg__kill_att) as arr__avg__kill_att,
groupArray(avg__kill_def) as arr__avg__kill_def,
groupArray(avg__kill_all) as arr__avg__kill_all,
groupArray(avg__kill_other) as arr__avg__kill_other,
groupArray(avg__has_ally_fraction) as arr__avg__has_ally_fraction,
groupArray(avg__global_bb_pressure) as arr__avg__global_bb_pressure,
groupArray(avg__local_bb_pressure_avg) as arr__avg__local_bb_pressure_avg,
groupArray(avg__player_points_potential) as arr__avg__player_points_potential,
if(arr__players_num[2] <> 0, arr__players_num[1] / arr__players_num[2], NULL) as rel__players_num,
if(arr__avg__player_points[2] <> 0, arr__avg__player_points[1] / arr__avg__player_points[2], NULL) as rel__avg__player_points,
if(arr__avg__player_villages[2] <> 0, arr__avg__player_villages[1] / arr__avg__player_villages[2], NULL) as rel__avg__player_villages,
if(arr__avg__player_rank[2] <> 0, arr__avg__player_rank[1] / arr__avg__player_rank[2], NULL) as rel__avg__player_rank,
if(arr__avg__player_has_ally[2] <> 0, arr__avg__player_has_ally[1] / arr__avg__player_has_ally[2], NULL) as rel__avg__player_has_ally,
if(arr__avg__num_allies[2] <> 0, arr__avg__num_allies[1] / arr__avg__num_allies[2], NULL) as rel__avg__num_allies,
if(arr__avg__villages_gain[2] <> 0, arr__avg__villages_gain[1] / arr__avg__villages_gain[2], NULL) as rel__avg__villages_gain,
if(arr__avg__villages_gain_bb[2] <> 0, arr__avg__villages_gain_bb[1] / arr__avg__villages_gain_bb[2], NULL) as rel__avg__villages_gain_bb,
if(arr__avg__villages_gain_self[2] <> 0, arr__avg__villages_gain_self[1] / arr__avg__villages_gain_self[2], NULL) as rel__avg__villages_gain_self,
if(arr__avg__villages_gain_ally_count[2] <> 0, arr__avg__villages_gain_ally_count[1] / arr__avg__villages_gain_ally_count[2], NULL) as rel__avg__villages_gain_ally_count,
if(arr__avg__villages_loss[2] <> 0, arr__avg__villages_loss[1] / arr__avg__villages_loss[2], NULL) as rel__avg__villages_loss,
if(arr__avg__villages_loss_self[2] <> 0, arr__avg__villages_loss_self[1] / arr__avg__villages_loss_self[2], NULL) as rel__avg__villages_loss_self,
if(arr__avg__villages_loss_ally_count[2] <> 0, arr__avg__villages_loss_ally_count[1] / arr__avg__villages_loss_ally_count[2], NULL) as rel__avg__villages_loss_ally_count,
if(arr__avg__kill_att[2] <> 0, arr__avg__kill_att[1] / arr__avg__kill_att[2], NULL) as rel__avg__kill_att,
if(arr__avg__kill_def[2] <> 0, arr__avg__kill_def[1] / arr__avg__kill_def[2], NULL) as rel__avg__kill_def,
if(arr__avg__kill_all[2] <> 0, arr__avg__kill_all[1] / arr__avg__kill_all[2], NULL) as rel__avg__kill_all,
if(arr__avg__kill_other[2] <> 0, arr__avg__kill_other[1] / arr__avg__kill_other[2], NULL) as rel__avg__kill_other,
if(arr__avg__has_ally_fraction[2] <> 0, arr__avg__has_ally_fraction[1] / arr__avg__has_ally_fraction[2], NULL) as rel__avg__has_ally_fraction,
if(arr__avg__global_bb_pressure[2] <> 0, arr__avg__global_bb_pressure[1] / arr__avg__global_bb_pressure[2], NULL) as rel__avg__global_bb_pressure,
if(arr__avg__local_bb_pressure_avg[2] <> 0, arr__avg__local_bb_pressure_avg[1] / arr__avg__local_bb_pressure_avg[2], NULL) as rel__avg__local_bb_pressure_avg,
if(arr__avg__player_points_potential[2] <> 0, arr__avg__player_points_potential[1] / arr__avg__player_points_potential[2], NULL) as rel__avg__player_points_potential
from (
select
week,
week < 4 as is_first_4_weeks,
count(distinct player_id) as players_num,
churns_soon,
avg(player_points) as avg__player_points,
avg(player_villages) as avg__player_villages,
avg(player_rank) as avg__player_rank,
avg(player_has_ally) as avg__player_has_ally,
avg(num_allies) as avg__num_allies,
avg(villages_gain) as avg__villages_gain,
avg(villages_gain_bb) as avg__villages_gain_bb,
avg(villages_gain_self) as avg__villages_gain_self,
avg(villages_gain_ally_count) as avg__villages_gain_ally_count,
avg(villages_loss) as avg__villages_loss,
avg(villages_loss_self) as avg__villages_loss_self,
avg(villages_loss_ally_count) as avg__villages_loss_ally_count,
avg(kill_att) as avg__kill_att,
avg(kill_def) as avg__kill_def,
avg(kill_all) as avg__kill_all,
avg(kill_other) as avg__kill_other,
avg(has_ally_fraction) as avg__has_ally_fraction,
avg(global_bb_pressure) as avg__global_bb_pressure,
avg(local_bb_pressure_avg) as avg__local_bb_pressure_avg,
avg(player_points_potential) as avg__player_points_potential
from (
select
d,
player_id,
round(active_time_real_virtual / 60 / 60 / 24) as active_time_real_virtual_days,
round(t_relative_virtual / 60 / 60 / 24) as t_relative_virtual_days,
(select max(d) from ds.playerstats) - (select min(t_min_virtual) from ds.playerstats) as observation_period_days,
round(t_relative_virtual_days / 7) as week,
(t_max - d) as remaining_days,
remaining_days < 14 as churns_soon,
*,
1
from
ds.playerstats
any inner join (--discretization into weeks
select distinct
toStartOfWeek(d) as d,
player_id
from
ds.playerstats
) as X1
using(player_id, d)
order by
d
)
group by
churns_soon,
week,
1
order by
churns_soon
)
group by
week,
is_first_4_weeks
)
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['rel__avg__player_points', 'rel__avg__player_villages', 'rel__avg__player_rank', 'rel__avg__player_has_ally', 'rel__avg__num_allies', 'rel__avg__villages_gain', 'rel__avg__villages_gain_bb', 'rel__avg__villages_gain_self', 'rel__avg__villages_gain_ally_count', 'rel__avg__villages_loss', 'rel__avg__villages_loss_self', 'rel__avg__villages_loss_ally_count', 'rel__avg__kill_att', 'rel__avg__kill_def', 'rel__avg__kill_all', 'rel__avg__kill_other', 'rel__avg__has_ally_fraction', 'rel__avg__global_bb_pressure', 'rel__avg__local_bb_pressure_avg', 'rel__avg__player_points_potential', '1'])
data.T
metric | value
---|---
rel__avg__player_points | 3.629031 |
rel__avg__player_villages | 3.567750 |
rel__avg__player_rank | 0.762076 |
rel__avg__player_has_ally | 1.583887 |
rel__avg__num_allies | 2.015811 |
rel__avg__villages_gain | 16.820940 |
rel__avg__villages_gain_bb | 16.915526 |
rel__avg__villages_gain_self | 6.033490 |
rel__avg__villages_gain_ally_count | 3.207348 |
rel__avg__villages_loss | 1.762352 |
rel__avg__villages_loss_self | 6.033490 |
rel__avg__villages_loss_ally_count | 0.965153 |
rel__avg__kill_att | 2.989966 |
rel__avg__kill_def | 2.223863 |
rel__avg__kill_all | 2.697096 |
rel__avg__kill_other | 343.643317 |
rel__avg__has_ally_fraction | 1.574397 |
rel__avg__global_bb_pressure | 0.995592 |
rel__avg__local_bb_pressure_avg | 1.004379 |
rel__avg__player_points_potential | 0.846107 |
1 | 1.000000 |
# ingress/egress by date
QUERY = '''
select
toStartOfWeek(timestamp) as d,
sum(num_quit),
sum(num_new),
sum(delta_net_plus),
sum(points_loss),
sum(points_gain),
sum(delta_points_net_plus),
sum(villages_loss),
sum(villages_gain),
sum(delta_villages_net_plus),
sum(world_villages),
sum(world_villages_delta),
min(active_players)
from (
select
--toStartOfWeek(toDate(date_add(SECOND, active_time, t_min))) as timestamp,
--toStartOfMonth(toDate(date_add(SECOND, active_time, t_min))) as timestamp,
toDate(date_add(SECOND, active_time, t_min)) as timestamp,
count() as num_quit,
num_new,
num_new - num_quit as delta_net_plus,
--add points net minus/net plus
sum(points_loss) as points_loss,
points_gain,
points_gain - points_loss as delta_points_net_plus,
sum(villages_loss) as villages_loss,
villages_gain,
villages_gain - villages_loss as delta_villages_net_plus,
(
select
arrayZip(
groupArray(d),
groupArray(world_villages_count)
)
from (
select
toDate(timestamp) as d,
count(distinct village_id) as world_villages_count
from
ds.village
group by
d
order by
d
)
) as date__to__world_villages_count,
arrayElement(
arrayFilter(
x, y -> y == timestamp,
date__to__world_villages_count.2,
date__to__world_villages_count.1
),
1
) as world_villages,
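--i.e. pick the world-wide village count whose date matches this row's timestamp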
arrayElement(
arrayFilter(
x, y -> y == timestamp,
arrayDifference(date__to__world_villages_count.2),
date__to__world_villages_count.1
),
1
) as world_villages_delta
from (
select
player_id,
max(active_time_real_virtual) as active_time,
min(t_min_virtual) as t_min,
arrayElement(
arrayReverse(
arraySort(
x, y -> y,
groupArray(player_points),
groupArray(d)
)
),
1
) as points_loss,
arrayElement(
arrayReverse(
arraySort(
x, y -> y,
groupArray(player_villages),
groupArray(d)
)
),
1
) as villages_loss
from
ds.playerstats
group by
player_id
) as X1
full outer join (
select
t_min as timestamp,
count() as num_new,
sum(points_gain) as points_gain,
sum(villages_gain) as villages_gain
from (
select
player_id,
min(t_min_virtual) as t_min,
arrayElement(
arraySort(
x, y -> y,
groupArray(player_points),
groupArray(d)
),
1
) as points_gain,
arrayElement(
arraySort(
x, y -> y,
groupArray(player_villages),
groupArray(d)
),
1
) as villages_gain
from
ds.playerstats
group by
player_id
) as X3
group by
timestamp
) as X2
using(timestamp)
group by
timestamp,
--join
num_new,
points_gain,
villages_gain
order by
timestamp
) as X5
any left join (
select
toDate(d) as timestamp,
count(distinct player_id) as active_players
from
ds.playerstats
group by
timestamp
) as X4
using(timestamp)
group by
d
order by
d
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['timestamp', 'num_quit', 'num_new', 'delta_net_plus', 'points_loss', 'points_gain', 'delta_points_net_plus', 'villages_loss', 'villages_gain', 'delta_villages_net_plus', 'world_villages', 'world_villages_delta', 'active_players'])
view_other = data.drop(data.tail(1).index)
view_other.drop(view_other.head(1).index,inplace=True)
plt.figure(figsize=(4,4))
plt.plot(view_other['timestamp'], view_other['num_new'], label='p new')
plt.plot(view_other['timestamp'], view_other['num_quit'], label='p quit')
plt.plot(view_other['timestamp'], view_other['delta_net_plus'], label='p net')
plt.grid(True)
plt.legend(loc='upper right', ncol=1, prop={'size': 12})
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.show()
plt.figure(figsize=(4,4))
plt.plot(view_other['timestamp'], view_other['villages_gain'], label='v new')
plt.plot(view_other['timestamp'], view_other['villages_loss'], label='v quit')
#plt.plot(view_other['timestamp'], view_other['world_villages_delta'], label='v net') # decreasing rate of new villages
plt.grid(True)
plt.legend(loc='upper right', ncol=1, prop={'size': 12})
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.ylim(-200, 5000)
plt.show()
#heatmap start date vs lifetime TODO!
QUERY = '''
select
--toUnixTimestamp(toDateTime(toStartOfWeek(t_min_virtual))) as d,
toStartOfMonth(t_min_virtual) as d,
round(active_time_real_virtual / 60 / 60 / 24 / 7 / 4) as active_time,
1 as dummy
from (
select distinct
player_id,
t_min_virtual,
active_time_real_virtual
from
ds.playerstats
)
order by
d
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'active_time', 'dummy'])
view = data
view['active_time'] = view['active_time'].astype(int)
view['m'] = view['d'].apply(lambda x: str(x)[0:7])
pivot = pd.pivot_table(view, values='dummy', index='m', columns='active_time', aggfunc='count', fill_value=None).sort_index(ascending=False)
plt.rc('text', usetex=False)
from matplotlib.colors import LogNorm
plt.figure(figsize=(4,3))
sns.heatmap(pivot, cmap='magma', norm=LogNorm())
plt.xlabel('player lifetime [months]')
plt.ylabel('player start [month]')
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
for index, label in enumerate(plt.gca().yaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.rc('text', usetex=True)
# plt.grid()
plt.tight_layout()
QUERY = '''
select
totals,
active_time_real,
num,
num_cumsum,
cdf,
pdf
from (
select
groupArray(theactive_time) as arr__active_time_real,
groupArray(num) as arr__num,
arrayCumSum(arr__num) as arr__num_cumsum,
arrayReduce('sum', arr__num) as totals,
arrayMap(
x -> x / totals,
arr__num_cumsum
) as arr__cdf,
arrayMap(
x -> x / totals,
arr__num
) as arr__pdf,
1
from (
select
theactive_time,
count() as num
from (
select
max(active_time_real_virtual) as theactive_time
from
ds.playerstats
group by
player_id
) as X1
group by
theactive_time
order by
theactive_time
) as X2
) as X3
array join
arr__active_time_real as active_time_real,
arr__num as num,
arr__num_cumsum as num_cumsum,
arr__cdf as cdf,
arr__pdf as pdf
'''
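# e.g. arrayCumSum([3, 1, 2]) = [3, 4, 6]; dividing by the total (arrayReduce('sum', ...)) yields the empirical CDF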
data = pd.DataFrame(CH.execute(QUERY), columns=['totals', 'active_time_real', 'num', 'num_cumsum', 'cdf', 'pdf'])
view = data
view['active_time_real_months'] = view['active_time_real'] / 60 / 60 / 24 / 30.5  # seconds to months (~30.5 days each)
plt.rc('text', usetex=False)
plt.figure(figsize=(3,4))
plt.plot(view['active_time_real_months'], view['cdf'], label='CDF')
plt.grid(True)
plt.xlabel('player lifetime [months]')
plt.legend()
plt.show()
QUERY = '''
select
w,
kill_att__avg,
kill_def__avg,
kill_all__avg,
kill_other__avg,
kill_att__sum,
kill_def__sum,
kill_all__sum,
kill_other__sum,
kill_att__q25,
kill_def__q25,
kill_all__q25,
kill_other__q25,
kill_att__q50,
kill_def__q50,
kill_all__q50,
kill_other__q50,
kill_att__q75,
kill_def__q75,
kill_all__q75,
kill_other__q75,
1
from (
select
--aggregation key
toStartOfWeek(d) as w,
--aggregates
avg(kill_att) as kill_att__avg,
avg(kill_def) as kill_def__avg,
avg(kill_all) as kill_all__avg,
avg(kill_other) as kill_other__avg,
stddevPop(kill_att) as kill_att__std,
stddevPop(kill_def) as kill_def__std,
stddevPop(kill_all) as kill_all__std,
stddevPop(kill_other) as kill_other__std,
sum(kill_att) as kill_att__sum,
sum(kill_def) as kill_def__sum,
sum(kill_all) as kill_all__sum,
sum(kill_other) as kill_other__sum,
quantile(.25)(kill_att) as kill_att__q25,
quantile(.25)(kill_def) as kill_def__q25,
quantile(.25)(kill_all) as kill_all__q25,
quantile(.25)(kill_other) as kill_other__q25,
quantile(.50)(kill_att) as kill_att__q50,
quantile(.50)(kill_def) as kill_def__q50,
quantile(.50)(kill_all) as kill_all__q50,
quantile(.50)(kill_other) as kill_other__q50,
quantile(.75)(kill_att) as kill_att__q75,
quantile(.75)(kill_def) as kill_def__q75,
quantile(.75)(kill_all) as kill_all__q75,
quantile(.75)(kill_other) as kill_other__q75,
1
from
ds.playerstats
where
1
-- and d = toStartOfWeek(toDate('2020-07-01'))
group by
d
order by
d
-- limit 1
-- format Vertical
)
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'kill_att__avg', 'kill_def__avg', 'kill_all__avg', 'kill_other__avg', 'kill_att__sum', 'kill_def__sum', 'kill_all__sum', 'kill_other__sum', 'kill_att__q25','kill_def__q25','kill_all__q25','kill_other__q25','kill_att__q50','kill_def__q50','kill_all__q50','kill_other__q50','kill_att__q75','kill_def__q75','kill_all__q75','kill_other__q75','1'])
view = data
plt.figure(figsize=(3,4))
plt.plot(view['d'], view['kill_def__avg'], label='def', linestyle='dashdot', color='green')
plt.plot(view['d'], view['kill_att__avg'], label='att', linestyle='dashed', color='red')
plt.plot(view['d'], view['kill_other__avg'], label='other', linestyle='dotted', color='blue')
n=4
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
QUERY = '''
select
toStartOfWeek(last_day_seen) as d,
avg(total_villages) as total_villages__avg,
stddevPop(total_villages) as total_villages__std,
avg(delay_days_avg) as delay_days_avg__avg,
stddevPop(delay_days_avg) as delay_days_avg__std,
quantile(.25)(delay_days_avg) as delay_days_avg__q25,
quantile(.50)(delay_days_avg) as delay_days_avg__q50,
quantile(.75)(delay_days_avg) as delay_days_avg__q75,
avg(top1_percentage) as top1_percentage__avg,
quantile(.25)(top1_percentage) as top1_percentage__q25,
quantile(.50)(top1_percentage) as top1_percentage__q50,
quantile(.75)(top1_percentage) as top1_percentage__q75
from (
select
player_id,
last_day_seen,
total_villages,
arrayReduce('avg', delay_days) as delay_days_avg,
sumMap(
arr__new_owner,
arrayResize([1], length(arr__new_owner), 1)
) as counts_per_new_owner,
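--sumMap sums the 1s per new_owner key, i.e. counts conquered villages per new owner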
arrayReverse(arraySort(counts_per_new_owner.2)) as counts_per_new_owner__distribution,
arrayElement(counts_per_new_owner__distribution, 1) / total_villages as top1_percentage,
1
from (
select
player_id,
last_day_seen,
arrayFilter(x -> x > 0, groupArray(new_owner)) as arr__new_owner,
length(arr__new_owner) as total_villages,
arrayFilter(
x, y -> y > 0,
arrayMap(
x, y -> x - y,
groupArray(toUnixTimestamp(the_conquer_timestamp) / 60 / 60 / 24),
groupArray(toUnixTimestamp(toDateTime(last_day_seen)) / 60 / 60 / 24)
),
groupArray(new_owner)
) as delay_days,
1
from (
select
player_id,
last_day_seen,
village_id,
new_owner,
the_conquer_timestamp,
arrayElement(
arrayFilter(
_, y -> y >= last_day_seen,
arraySort(
x, y -> y,
arr__player_id_new,
arr__timestamps
),
arr__timestamps
),
1
) as new_owner,
arrayElement(
arrayFilter(
x -> x >= last_day_seen,
arraySort(arr__timestamps)
),
1
) as the_conquer_timestamp,
1
from (
select
player_id,
last_day_seen,
arr__villages
from (
select distinct
player_id,
t_max as last_day_seen,
1
from
ds.playerstats
where
t_max < toDate('2022-01-05')
) as X1
any left join (
select
d as last_day_seen,
player_id,
groupArray(village_id) as arr__villages
from (
--normalize villages first
select
toDate(timestamp) as d,
village_id,
arrayElement(groupArray(village_player), 1) as player_id,
1
from
ds.village
group by
d,
village_id
) as X1
group by
last_day_seen,
player_id,
1
) as X2
using(last_day_seen, player_id)
) as X3
array join
arr__villages as village_id
any left join (
select
--toDate('2021-06-01') as last_day_seen,
village_id,
groupArray(conquer_timestamp) as arr__timestamps,
groupArray(player_id_new) as arr__player_id_new
from
ds.conquer
group by
village_id
) as X4
using(village_id)
--where
-- village_id = 1983
)
group by
player_id,
last_day_seen,
1
)
where
total_villages > 0
group by
player_id,
last_day_seen,
total_villages,
delay_days,
1
)
group by
d
order by
d
--limit 10
--format Vertical
--village_id = 1983
--┌──────────t─┬─village_player─┐
--│ 2021-05-19 │ 1576939597 │
--│ 2021-05-19 │ 3372451 │
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'total_villages__avg', 'total_villages__std', 'delay_days_avg__avg', 'delay_days_avg__std', 'delay_days_avg__q25', 'delay_days_avg__q50', 'delay_days_avg__q75', 'top1_percentage__avg', 'top1_percentage__q25', 'top1_percentage__q50', 'top1_percentage__q75'])
view = data
plt.figure(figsize=(16,8))
plt.plot(view['d'], view['total_villages__avg'], label='villages')
# plt.yscale('log')
plt.ylabel('bb villages')
plt.grid(True)
plt.legend()
plt.show()
plt.figure(figsize=(16,8))
plt.plot(view['d'], view['delay_days_avg__avg'], label='delay avg')
plt.plot(view['d'], view['delay_days_avg__q25'], label='delay q25', color='black')
plt.plot(view['d'], view['delay_days_avg__q50'], label='delay q50', color='black')
plt.plot(view['d'], view['delay_days_avg__q75'], label='delay q75', color='black')
plt.ylabel('delay [d]')
plt.grid(True)
plt.legend()
plt.show()
plt.figure(figsize=(16,8))
plt.plot(view['d'], view['top1_percentage__avg'], label='top1 frac avg')
plt.plot(view['d'], view['top1_percentage__q25'], label='top1 frac q25', color='black')
plt.plot(view['d'], view['top1_percentage__q50'], label='top1 frac q50', color='black')
plt.plot(view['d'], view['top1_percentage__q75'], label='top1 frac q75', color='black')
plt.grid(True)
plt.legend()
plt.show()
plt.rc('text', usetex=False)
sns.jointplot(x=view['delay_days_avg__avg'], y=view['top1_percentage__avg'], color="darkred", kind="hex", joint_kws=dict(gridsize=20))