The game provides API endpoints for downloading the data; a simple curl suffices. More information is available here: https://forum.die-staemme.de/index.php?threads/weltdaten-und-configs.183996/
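For example, fetching one world-data file could look as follows (the world subdomain de178 and the file name are placeholders; the forum thread lists the actual endpoints):
curl -sO https://de178.die-staemme.de/map/village.txt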
Note: From our observations, we could not determine a specific update frequency. Our crawl therefore acquires the data hourly, which is well below the actual update rate. For many analyses, we drop this fine granularity in favor of daily aggregates anyway.
Our paper at times leans heavily on data engineering. While most plots can be created without overly complex queries, gathering domain-specific features/metrics requires deeper lookups.
The dataset contains various data sources/tables derived from the raw crawling data (should you require the raw data, please send an inquiry via email):
Further, we derived tables that gather related information or serve as architectural helpers for more efficient queries.
We use ClickHouse (a powerful and highly efficient analytical columnar DBMS). We suggest following the Quick Start Tutorial (https://clickhouse.com/docs/en/getting-started/install) to get things running.
As soon as your database is up and running, i.e.,
clickhouse-client --query="select 'Hello World :-)'"
succeeds, please execute the setup_databse.sql script within its directory as follows:
clickhouse-client --queries-file="setup_databse.sql"
This process will take a while. Time to grab some coffee.
It will create a new database ds
and load all data tables.
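You can verify the setup by listing the created tables:
clickhouse-client --query="show tables from ds"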
If you are interested in learning more about ClickHouse, we suggest following the Tutorial (https://clickhouse.com/docs/en/tutorial/#3-analyze-the-data).
If you want to go beyond rather simple aggregates and focus on time sequences, ClickHouse arrays come in handy: they add another dimension within aggregations via higher-order functions and map-reduce-style approaches.
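As a minimal sketch (assuming the ds.player table loaded above): collecting each player's point series into an array enables element-wise operations within a single aggregation:
select
    player_id,
    groupArray(player_points) as series, --the time sequence as an array
    arrayDifference(series) as deltas --higher-order: change between consecutive snapshots
from (
    select player_id, player_points
    from ds.player
    order by timestamp --enforce chronological order before collecting
)
group by player_id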
Note: The database queries and this script were executed on a machine with 1 TB of memory; you might run into memory issues on lightweight hardware. If you hit memory limits, you may want to use our pre-calculated aggregates. Other options are sharding the execution by a suitable primary key into materialized pre-aggregates, or applying sampling (much as you would shard the data); neither is much fun, though.
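For one-off queries, ClickHouse's memory settings can also help, e.g. letting GROUP BY spill to disk (the values are illustrative, and yourquery.sql is a placeholder):
clickhouse-client --max_memory_usage=100000000000 --max_bytes_before_external_group_by=50000000000 --queries-file="yourquery.sql"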
You may also work with pandas dataframes directly, e.g.:
import pandas as pd
villages = pd.read_parquet('villages.parquet.gz')
unique_village_ids = villages.village_id.unique()
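Building on this, quick aggregates work directly on the dataframe, e.g. (assuming the parquet mirrors ds.village and includes a timestamp column):
villages_per_day = villages.groupby(villages['timestamp'].dt.date)['village_id'].nunique()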
Note: The libraries used should work across Python 3.6 to 3.9.
# install dependencies first
import sys
!{sys.executable} -m pip install -U numpy pandas clickhouse_driver matplotlib seaborn fastplot scipy
# common libs
import os
import pandas as pd
import numpy as np
import math
import random
from datetime import datetime
import operator
import itertools
import scipy.stats
# database
from clickhouse_driver import Client
DB_username='default'
DB_password=''
DB_database='ds'
CH = clickhouse = Client('localhost', user=DB_username, password=DB_password, database=DB_database)
# DB Test
CH.execute('select 1')
# plotting
import matplotlib
import matplotlib.pyplot as plt
import fastplot
%matplotlib inline
import seaborn as sns
plt.rc('text', usetex=True)
plt.rc('font', family='sans-serif')
plt.rcParams.update({'font.size': 14})
QUERY = '''
select
d,
avg(neighbors_count_bb_unique),
quantile(.25)(neighbors_count_bb_unique),
quantile(.50)(neighbors_count_bb_unique),
quantile(.75)(neighbors_count_bb_unique),
avg(player_points_potential),
quantile(.25)(player_points_potential),
quantile(.50)(player_points_potential),
quantile(.75)(player_points_potential),
quantile(.25)(local_bb_pressure_avg),
quantile(.50)(local_bb_pressure_avg),
quantile(.75)(local_bb_pressure_avg),
any(global_bb_pressure),
count(distinct player_id) as players
from
ds.bb_pressure
where
1
and player_id > 0 --players' perspective does not include barbarians
group by
d
order by
d
'''
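# note: quantile(.25)(col) is ClickHouse's parameterized-aggregate syntax; the quantile level is a parameter, the column an argument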
data = pd.DataFrame(CH.execute(QUERY), columns=['timestamp', 'neighbors_count_bb_unique__avg', 'neighbors_count_bb_unique__q25', 'neighbors_count_bb_unique__q50', 'neighbors_count_bb_unique__q75', 'player_points_potential__avg', 'player_points_potential__q25', 'player_points_potential__q50', 'player_points_potential__q75', 'local_bb_pressure_avg__q25', 'local_bb_pressure_avg__q50', 'local_bb_pressure_avg__q75', 'global_bb_pressure', 'active_players'])
view = data
plt.figure(figsize=(6,4))
plt.plot(view['timestamp'], view['local_bb_pressure_avg__q25'], linestyle='dashed', color='black')
plt.plot(view['timestamp'], view['local_bb_pressure_avg__q50'], linestyle='solid', color='black', label='pressure local Qs')
plt.plot(view['timestamp'], view['local_bb_pressure_avg__q75'], linestyle='dashed', color='black')
plt.plot(view['timestamp'], view['global_bb_pressure'], label='pressure global')
plt.plot(view['timestamp'], view['active_players'] / max(view['active_players']), label='active players')
plt.plot(view['timestamp'], view['player_points_potential__avg'], label='points potential')
plt.grid(True)
plt.legend(loc='lower left', ncol=1, prop={'size': 15})
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.show()
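The tick-thinning loop above reappears in most plots below; a small helper (a sketch, not part of the original notebook) would avoid the repetition:
def thin_xticks(ax, n=2):
    # hide all but every n-th x-axis tick label
    for index, label in enumerate(ax.xaxis.get_ticklabels()):
        if index % n != 0:
            label.set_visible(False)
# usage: thin_xticks(plt.gca(), n=2) before plt.show()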
last_buckets = None
heatmapdata = {}
for i, date in enumerate(sorted(['2020-06-01', '2020-07-01', '2020-08-01', '2020-10-01', '2020-12-01'])):
resolution = 5
QUERY = f'''
with {resolution} as resolution
select
toDate(timestamp) as d,
round(village_x / resolution, 0) * resolution as bucket_x,
round(village_y / resolution, 0) * resolution as bucket_y,
1 - countIf(village_player = 0) / count() as bb_pressure
from
ds.village
where
d = toDate('{date}')
group by
d,
bucket_x,
bucket_y
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'x', 'y', 'pressure'])
data['x'] = data['x'].astype(int) - 500
data['y'] = data['y'].astype(int) - 500
view = pd.pivot_table(data, values='pressure', index='y', columns='x', aggfunc='mean', fill_value=None)
buckets = len(view)
plt.figure()
sns.heatmap(view).set_title(date)
if i > 0:
c = plt.Circle((buckets//2, buckets//2), last_buckets//2, color='red', fill=False, linewidth=2)
plt.gca().add_patch(c)
plt.tight_layout()
last_buckets = buckets
heatmapdata[date] = data
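For intuition on the bucketing: with resolution 5, a village at x = 513 falls into bucket round(513 / 5) * 5 = 515, which the subsequent shift by the map-center offset of 500 turns into 15.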
# plot for the paper:
date = ['2020-06-01', '2020-07-01', '2020-08-01']
radius = [
(
max(heatmapdata.get(date[0]).x.max(), heatmapdata.get(date[0]).y.max()),
min(heatmapdata.get(date[0]).x.min(), heatmapdata.get(date[0]).y.min()),
),
(
max(heatmapdata.get(date[1]).x.max(), heatmapdata.get(date[1]).y.max()),
min(heatmapdata.get(date[1]).x.min(), heatmapdata.get(date[1]).y.min()),
),
(
max(heatmapdata.get(date[2]).x.max(), heatmapdata.get(date[2]).y.max()),
min(heatmapdata.get(date[2]).x.min(), heatmapdata.get(date[2]).y.min()),
)
]
data = heatmapdata.get(date[1]).drop(columns=['d'])
# DataFrame.append is deprecated in newer pandas; pd.concat is the portable equivalent
view = pd.concat([data, pd.DataFrame([
    {'x': radius[2][0], 'y': radius[2][1], 'pressure': 1},  # pad with the later month's corner coordinates
    {'x': radius[2][1], 'y': radius[2][0], 'pressure': 1},  # so the pivot grid covers that extent
])], ignore_index=True)
pivot = pd.pivot_table(view, values='pressure', index='y', columns='x', aggfunc='mean', fill_value=None)
buckets = len(pivot)
plt.figure(figsize=(6,4))
sns.heatmap(pivot.sort_index(ascending=False), cmap='magma').set_title(f'{date[1]} +/-1 month')
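# 'resolution' (= 5) carries over from the heatmap loop above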
c0 = plt.Circle((buckets//2, buckets//2), (radius[0][0])/resolution, color='seagreen', fill=False, linewidth=2)
c2 = plt.Circle((buckets//2, buckets//2), (radius[2][0])/resolution, color='seagreen', fill=False, linewidth=2)
plt.gca().add_patch(c0)
plt.gca().add_patch(c2)
n = 3
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
for index, label in enumerate(plt.gca().yaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
# plt.grid()
plt.tight_layout()
QUERY = '''
select
d,
countIf(player_id_old = 0) as cnt_bb,
countIf(player_id_new = player_id_old) as cnt_self,
countIf(
(player_ally_new = player_ally_old)
and (player_ally_new > 0)
and (player_id_new <> player_id_old)
) as cnt_ally_only,
count() as total,
countIf(
(player_id_old > 0)
and (player_points_new > player_points_old)
) as cnt_victim_smaller_nonbb,
countIf(
(player_id_old > 0)
and (player_id_old <> player_id_new)
and (player_ally_old <> player_ally_new)
and (player_points_new > player_points_old)
) as cnt_hostile_victim_smaller_nonbb,
total - cnt_bb - cnt_self - cnt_ally_only as cnt_hostile
from (
select
d,
player_id_new,
player_id_old,
player_ally_new,
player_ally_old,
player_points_new,
player_points_old,
1
from (
select
toDate(toStartOfWeek(conquer_timestamp)) as d, --(!)
player_id_new,
player_id_old,
player_ally_new,
player_points_new,
1
from
ds.conquer
any left join (
select
toDate(toStartOfWeek(timestamp)) as d,
player_id as player_id_new,
player_ally as player_ally_new,
player_points as player_points_new,
1
from
ds.player
) as X1
using(d, player_id_new)
where
1
-- and d = toStartOfWeek(toDate('2021-11-13'))
) as X2
any left join (
select
toDate(toStartOfWeek(timestamp)) as d,
player_id as player_id_old,
player_ally as player_ally_old,
player_points as player_points_old,
1
from
ds.player
) as X3
using(d, player_id_old)
) as X4
group by
d
order by
d
'''
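# note: ClickHouse's 'any left join' keeps at most one matching right-hand row per key, avoiding fan-out when joining the weekly player snapshots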
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'cnt_bb', 'cnt_self', 'cnt_ally_only', 'total', 'cnt_victim_smaller_nonbb', 'cnt_hostile_victim_smaller_nonbb', 'cnt_hostile'])
view = data
plt.figure(figsize=(6,4))
plt.plot(view['d'], view['total'] - view['cnt_self'], label='total') # wo self
plt.plot(view['d'], view['cnt_bb'], label='barbarian')
plt.plot(view['d'], view['cnt_hostile'], label='hostile', linestyle='dashdot')
plt.plot(view['d'], view['cnt_victim_smaller_nonbb'], label='onto smaller', linestyle='dashed')
plt.grid(True)
plt.legend(ncol=2, prop={'size': 15})
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.show()
QUERY = '''
-- STATISTICAL TEST raw data
select
week,
count(distinct player_id) as players_num,
churns_soon,
groupArray(player_points) as arr__player_points,
groupArray(player_villages) as arr__player_villages,
groupArray(player_rank) as arr__player_rank,
groupArray(player_has_ally) as arr__player_has_ally,
groupArray(num_allies) as arr__num_allies,
groupArray(villages_gain) as arr__villages_gain,
groupArray(villages_gain_bb) as arr__villages_gain_bb,
groupArray(villages_gain_self) as arr__villages_gain_self,
groupArray(villages_gain_ally_count) as arr__villages_gain_ally_count,
groupArray(villages_loss) as arr__villages_loss,
groupArray(villages_loss_self) as arr__villages_loss_self,
groupArray(villages_loss_ally_count) as arr__villages_loss_ally_count,
groupArray(kill_att) as arr__kill_att,
groupArray(kill_def) as arr__kill_def,
groupArray(kill_all) as arr__kill_all,
groupArray(kill_other) as arr__kill_other,
groupArray(has_ally_fraction) as arr__has_ally_fraction,
groupArray(global_bb_pressure) as arr__global_bb_pressure,
groupArray(local_bb_pressure_avg) as arr__local_bb_pressure_avg,
groupArray(player_points_potential) as arr__player_points_potential
from (
select
d,
player_id,
round(active_time_real_virtual / 60 / 60 / 24) as active_time_real_virtual_days,
round(t_relative_virtual / 60 / 60 / 24) as t_relative_virtual_days,
(select max(d) from ds.playerstats) - (select min(t_min_virtual) from ds.playerstats) as observation_period_days,
round(t_relative_virtual_days / 7) as week,
(t_max - d) as remaining_days,
remaining_days < 14 as churns_soon,
*,
1
from
ds.playerstats
any inner join (--discretization into weeks
select distinct
toStartOfWeek(d) as d,
player_id
from
ds.playerstats
) as X1
using(player_id, d)
where
1
-- and player_name = 'derVernichter'
-- and player_id in [1577194388, 9201456]
-- and player_id in [1577194388]
order by
d
)
group by
churns_soon,
week,
1
--limit 1
--format Vertical
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['week', 'players_num', 'churns_soon', 'player_points', 'player_villages', 'player_rank', 'player_has_ally', 'num_allies', 'villages_gain', 'villages_gain_bb', 'villages_gain_self', 'villages_gain_ally_count', 'villages_loss', 'villages_loss_self', 'villages_loss_ally_count', 'kill_att', 'kill_def', 'kill_all', 'kill_other', 'has_ally_fraction', 'global_bb_pressure', 'local_bb_pressure_avg', 'player_points_potential'])
view = data.set_index(['week', 'churns_soon', 'players_num']).sort_index()
for col in view.columns:
view[col] = view[col].apply(list)
from scipy.stats import mannwhitneyu
results = []
for week, row in view.groupby(by=['week']).agg(list).iterrows():
for col in view.columns:
try:
#difference in medians is significant? low p-values say yes, rejecting H0 -- ie populations are likely not the same.
U, p = mannwhitneyu(row[col][0], row[col][1])
results.append(
(week, col, p)
)
except ValueError as e:
# print(f'{week} {col}: {str(e)}')
# values are identical
results.append(
(week, col, 1)
)
except IndexError as e:
#print(f'{week} {col}: {str(e)}')
continue
results = pd.DataFrame(results, columns=['week', 'metric', 'p']).set_index(['metric'])
p=0.05
themeans = results[['p']].groupby(by=['metric']).mean()
thequantiles = results[['p']].groupby(by=['metric']).quantile(.9)
print('Overall p-values 90% quantile\n', thequantiles)
first_n_weeks = 4
themeans = results[results['week'] < first_n_weeks][['p']].groupby(by=['metric']).mean()
thequantiles = results[results['week'] < first_n_weeks][['p']].groupby(by=['metric']).quantile(.9)
print('First 4 Weeks p-values 90% quantile\n', thequantiles)
Overall p-values, 90% quantile:

metric | p
---|---
global_bb_pressure | 0.230208
has_ally_fraction | 0.007655
kill_all | 0.073920
kill_att | 0.070090
kill_def | 0.144612
kill_other | 0.067183
local_bb_pressure_avg | 0.414349
num_allies | 0.084200
player_has_ally | 0.030915
player_points | 0.000018
player_points_potential | 0.388789
player_rank | 0.004248
player_villages | 0.000003
villages_gain | 0.067301
villages_gain_ally_count | 0.339895
villages_gain_bb | 0.219401
villages_gain_self | 0.304913
villages_loss | 0.435828
villages_loss_ally_count | 0.402696
villages_loss_self | 0.304913

First 4 weeks p-values, 90% quantile:

metric | p
---|---
global_bb_pressure | 1.441534e-11
has_ally_fraction | 4.208970e-117
kill_all | 2.371890e-12
kill_att | 2.381867e-09
kill_def | 7.095659e-06
kill_other | 4.408445e-01
local_bb_pressure_avg | 1.783070e-04
num_allies | 2.304371e-124
player_has_ally | 1.681814e-21
player_points | 2.865796e-02
player_points_potential | 1.861066e-01
player_rank | 1.625595e-07
player_villages | 2.103746e-07
villages_gain | 1.568761e-08
villages_gain_ally_count | 7.871758e-01
villages_gain_bb | 5.365761e-08
villages_gain_self | 1.134528e-01
villages_loss | 2.355080e-02
villages_loss_ally_count | 8.178269e-01
villages_loss_self | 1.134528e-01
QUERY = '''
select
avgOrNull(rel__avg__player_points),
avgOrNull(rel__avg__player_villages),
avgOrNull(rel__avg__player_rank),
avgOrNull(rel__avg__player_has_ally),
avgOrNull(rel__avg__num_allies),
avgOrNull(rel__avg__villages_gain),
avgOrNull(rel__avg__villages_gain_bb),
avgOrNull(rel__avg__villages_gain_self),
avgOrNull(rel__avg__villages_gain_ally_count),
avgOrNull(rel__avg__villages_loss),
avgOrNull(rel__avg__villages_loss_self),
avgOrNull(rel__avg__villages_loss_ally_count),
avgOrNull(rel__avg__kill_att),
avgOrNull(rel__avg__kill_def),
avgOrNull(rel__avg__kill_all),
avgOrNull(rel__avg__kill_other),
avgOrNull(rel__avg__has_ally_fraction),
avgOrNull(rel__avg__global_bb_pressure),
avgOrNull(rel__avg__local_bb_pressure_avg),
avgOrNull(rel__avg__player_points_potential),
1
from (
select
week,
is_first_4_weeks,
groupArray(players_num) as arr__players_num,
groupArray(churns_soon) as arr__churns_soon,
groupArray(avg__player_points) as arr__avg__player_points,
groupArray(avg__player_villages) as arr__avg__player_villages,
groupArray(avg__player_rank) as arr__avg__player_rank,
groupArray(avg__player_has_ally) as arr__avg__player_has_ally,
groupArray(avg__num_allies) as arr__avg__num_allies,
groupArray(avg__villages_gain) as arr__avg__villages_gain,
groupArray(avg__villages_gain_bb) as arr__avg__villages_gain_bb,
groupArray(avg__villages_gain_self) as arr__avg__villages_gain_self,
groupArray(avg__villages_gain_ally_count) as arr__avg__villages_gain_ally_count,
groupArray(avg__villages_loss) as arr__avg__villages_loss,
groupArray(avg__villages_loss_self) as arr__avg__villages_loss_self,
groupArray(avg__villages_loss_ally_count) as arr__avg__villages_loss_ally_count,
groupArray(avg__kill_att) as arr__avg__kill_att,
groupArray(avg__kill_def) as arr__avg__kill_def,
groupArray(avg__kill_all) as arr__avg__kill_all,
groupArray(avg__kill_other) as arr__avg__kill_other,
groupArray(avg__has_ally_fraction) as arr__avg__has_ally_fraction,
groupArray(avg__global_bb_pressure) as arr__avg__global_bb_pressure,
groupArray(avg__local_bb_pressure_avg) as arr__avg__local_bb_pressure_avg,
groupArray(avg__player_points_potential) as arr__avg__player_points_potential,
if(arr__players_num[2] <> 0, arr__players_num[1] / arr__players_num[2], NULL) as rel__players_num,
if(arr__avg__player_points[2] <> 0, arr__avg__player_points[1] / arr__avg__player_points[2], NULL) as rel__avg__player_points,
if(arr__avg__player_villages[2] <> 0, arr__avg__player_villages[1] / arr__avg__player_villages[2], NULL) as rel__avg__player_villages,
if(arr__avg__player_rank[2] <> 0, arr__avg__player_rank[1] / arr__avg__player_rank[2], NULL) as rel__avg__player_rank,
if(arr__avg__player_has_ally[2] <> 0, arr__avg__player_has_ally[1] / arr__avg__player_has_ally[2], NULL) as rel__avg__player_has_ally,
if(arr__avg__num_allies[2] <> 0, arr__avg__num_allies[1] / arr__avg__num_allies[2], NULL) as rel__avg__num_allies,
if(arr__avg__villages_gain[2] <> 0, arr__avg__villages_gain[1] / arr__avg__villages_gain[2], NULL) as rel__avg__villages_gain,
if(arr__avg__villages_gain_bb[2] <> 0, arr__avg__villages_gain_bb[1] / arr__avg__villages_gain_bb[2], NULL) as rel__avg__villages_gain_bb,
if(arr__avg__villages_gain_self[2] <> 0, arr__avg__villages_gain_self[1] / arr__avg__villages_gain_self[2], NULL) as rel__avg__villages_gain_self,
if(arr__avg__villages_gain_ally_count[2] <> 0, arr__avg__villages_gain_ally_count[1] / arr__avg__villages_gain_ally_count[2], NULL) as rel__avg__villages_gain_ally_count,
if(arr__avg__villages_loss[2] <> 0, arr__avg__villages_loss[1] / arr__avg__villages_loss[2], NULL) as rel__avg__villages_loss,
if(arr__avg__villages_loss_self[2] <> 0, arr__avg__villages_loss_self[1] / arr__avg__villages_loss_self[2], NULL) as rel__avg__villages_loss_self,
if(arr__avg__villages_loss_ally_count[2] <> 0, arr__avg__villages_loss_ally_count[1] / arr__avg__villages_loss_ally_count[2], NULL) as rel__avg__villages_loss_ally_count,
if(arr__avg__kill_att[2] <> 0, arr__avg__kill_att[1] / arr__avg__kill_att[2], NULL) as rel__avg__kill_att,
if(arr__avg__kill_def[2] <> 0, arr__avg__kill_def[1] / arr__avg__kill_def[2], NULL) as rel__avg__kill_def,
if(arr__avg__kill_all[2] <> 0, arr__avg__kill_all[1] / arr__avg__kill_all[2], NULL) as rel__avg__kill_all,
if(arr__avg__kill_other[2] <> 0, arr__avg__kill_other[1] / arr__avg__kill_other[2], NULL) as rel__avg__kill_other,
if(arr__avg__has_ally_fraction[2] <> 0, arr__avg__has_ally_fraction[1] / arr__avg__has_ally_fraction[2], NULL) as rel__avg__has_ally_fraction,
if(arr__avg__global_bb_pressure[2] <> 0, arr__avg__global_bb_pressure[1] / arr__avg__global_bb_pressure[2], NULL) as rel__avg__global_bb_pressure,
if(arr__avg__local_bb_pressure_avg[2] <> 0, arr__avg__local_bb_pressure_avg[1] / arr__avg__local_bb_pressure_avg[2], NULL) as rel__avg__local_bb_pressure_avg,
if(arr__avg__player_points_potential[2] <> 0, arr__avg__player_points_potential[1] / arr__avg__player_points_potential[2], NULL) as rel__avg__player_points_potential
from (
select
week,
week < 4 as is_first_4_weeks,
count(distinct player_id) as players_num,
churns_soon,
avg(player_points) as avg__player_points,
avg(player_villages) as avg__player_villages,
avg(player_rank) as avg__player_rank,
avg(player_has_ally) as avg__player_has_ally,
avg(num_allies) as avg__num_allies,
avg(villages_gain) as avg__villages_gain,
avg(villages_gain_bb) as avg__villages_gain_bb,
avg(villages_gain_self) as avg__villages_gain_self,
avg(villages_gain_ally_count) as avg__villages_gain_ally_count,
avg(villages_loss) as avg__villages_loss,
avg(villages_loss_self) as avg__villages_loss_self,
avg(villages_loss_ally_count) as avg__villages_loss_ally_count,
avg(kill_att) as avg__kill_att,
avg(kill_def) as avg__kill_def,
avg(kill_all) as avg__kill_all,
avg(kill_other) as avg__kill_other,
avg(has_ally_fraction) as avg__has_ally_fraction,
avg(global_bb_pressure) as avg__global_bb_pressure,
avg(local_bb_pressure_avg) as avg__local_bb_pressure_avg,
avg(player_points_potential) as avg__player_points_potential
from (
select
d,
player_id,
round(active_time_real_virtual / 60 / 60 / 24) as active_time_real_virtual_days,
round(t_relative_virtual / 60 / 60 / 24) as t_relative_virtual_days,
(select max(d) from ds.playerstats) - (select min(t_min_virtual) from ds.playerstats) as observation_period_days,
round(t_relative_virtual_days / 7) as week,
(t_max - d) as remaining_days,
remaining_days < 14 as churns_soon,
*,
1
from
ds.playerstats
any inner join (--discretization into weeks
select distinct
toStartOfWeek(d) as d,
player_id
from
ds.playerstats
) as X1
using(player_id, d)
order by
d
)
group by
churns_soon,
week,
1
order by
churns_soon
)
group by
week,
is_first_4_weeks
)
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['rel__avg__player_points', 'rel__avg__player_villages', 'rel__avg__player_rank', 'rel__avg__player_has_ally', 'rel__avg__num_allies', 'rel__avg__villages_gain', 'rel__avg__villages_gain_bb', 'rel__avg__villages_gain_self', 'rel__avg__villages_gain_ally_count', 'rel__avg__villages_loss', 'rel__avg__villages_loss_self', 'rel__avg__villages_loss_ally_count', 'rel__avg__kill_att', 'rel__avg__kill_def', 'rel__avg__kill_all', 'rel__avg__kill_other', 'rel__avg__has_ally_fraction', 'rel__avg__global_bb_pressure', 'rel__avg__local_bb_pressure_avg', 'rel__avg__player_points_potential', '1'])
data.T
metric | value
---|---
rel__avg__player_points | 3.629031 |
rel__avg__player_villages | 3.567750 |
rel__avg__player_rank | 0.762076 |
rel__avg__player_has_ally | 1.583887 |
rel__avg__num_allies | 2.015811 |
rel__avg__villages_gain | 16.820940 |
rel__avg__villages_gain_bb | 16.915526 |
rel__avg__villages_gain_self | 6.033490 |
rel__avg__villages_gain_ally_count | 3.207348 |
rel__avg__villages_loss | 1.762352 |
rel__avg__villages_loss_self | 6.033490 |
rel__avg__villages_loss_ally_count | 0.965153 |
rel__avg__kill_att | 2.989966 |
rel__avg__kill_def | 2.223863 |
rel__avg__kill_all | 2.697096 |
rel__avg__kill_other | 343.643317 |
rel__avg__has_ally_fraction | 1.574397 |
rel__avg__global_bb_pressure | 0.995592 |
rel__avg__local_bb_pressure_avg | 1.004379 |
rel__avg__player_points_potential | 0.846107 |
1 | 1.000000 |
# ingress/egress by date
QUERY = '''
select
toStartOfWeek(timestamp) as d,
sum(num_quit),
sum(num_new),
sum(delta_net_plus),
sum(points_loss),
sum(points_gain),
sum(delta_points_net_plus),
sum(villages_loss),
sum(villages_gain),
sum(delta_villages_net_plus),
sum(world_villages),
sum(world_villages_delta),
min(active_players)
from (
select
--toStartOfWeek(toDate(date_add(SECOND, active_time, t_min))) as timestamp,
--toStartOfMonth(toDate(date_add(SECOND, active_time, t_min))) as timestamp,
toDate(date_add(SECOND, active_time, t_min)) as timestamp,
count() as num_quit,
num_new,
num_new - num_quit as delta_net_plus,
--add points net minus/net plus
sum(points_loss) as points_loss,
points_gain,
points_gain - points_loss as delta_points_net_plus,
sum(villages_loss) as villages_loss,
villages_gain,
villages_gain - villages_loss as delta_villages_net_plus,
(
select
arrayZip(
groupArray(d),
groupArray(world_villages_count)
)
from (
select
toDate(timestamp) as d,
count(distinct village_id) as world_villages_count
from
ds.village
group by
d
order by
d
)
) as date__to__world_villages_count,
arrayElement(
arrayFilter(
x, y -> y == timestamp,
date__to__world_villages_count.2,
date__to__world_villages_count.1
),
1
) as world_villages,
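--i.e. pick the world-wide village count whose date matches this row's timestamp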
arrayElement(
arrayFilter(
x, y -> y == timestamp,
arrayDifference(date__to__world_villages_count.2),
date__to__world_villages_count.1
),
1
) as world_villages_delta
from (
select
player_id,
max(active_time_real_virtual) as active_time,
min(t_min_virtual) as t_min,
arrayElement(
arrayReverse(
arraySort(
x, y -> y,
groupArray(player_points),
groupArray(d)
)
),
1
) as points_loss,
arrayElement(
arrayReverse(
arraySort(
x, y -> y,
groupArray(player_villages),
groupArray(d)
)
),
1
) as villages_loss
from
ds.playerstats
group by
player_id
) as X1
full outer join (
select
t_min as timestamp,
count() as num_new,
sum(points_gain) as points_gain,
sum(villages_gain) as villages_gain
from (
select
player_id,
min(t_min_virtual) as t_min,
arrayElement(
arraySort(
x, y -> y,
groupArray(player_points),
groupArray(d)
),
1
) as points_gain,
arrayElement(
arraySort(
x, y -> y,
groupArray(player_villages),
groupArray(d)
),
1
) as villages_gain
from
ds.playerstats
group by
player_id
) as X3
group by
timestamp
) as X2
using(timestamp)
group by
timestamp,
--join
num_new,
points_gain,
villages_gain
order by
timestamp
) as X5
any left join (
select
toDate(d) as timestamp,
count(distinct player_id) as active_players
from
ds.playerstats
group by
timestamp
) as X4
using(timestamp)
group by
d
order by
d
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['timestamp', 'num_quit', 'num_new', 'delta_net_plus', 'points_loss', 'points_gain', 'delta_points_net_plus', 'villages_loss', 'villages_gain', 'delta_villages_net_plus', 'world_villages', 'world_villages_delta', 'active_players'])
view_other = data.drop(data.tail(1).index)
view_other.drop(view_other.head(1).index,inplace=True)
plt.figure(figsize=(4,4))
plt.plot(view_other['timestamp'], view_other['num_new'], label='p new')
plt.plot(view_other['timestamp'], view_other['num_quit'], label='p quit')
plt.plot(view_other['timestamp'], view_other['delta_net_plus'], label='p net')
plt.grid(True)
plt.legend(loc='upper right', ncol=1, prop={'size': 12})
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.show()
plt.figure(figsize=(4,4))
plt.plot(view_other['timestamp'], view_other['villages_gain'], label='v new')
plt.plot(view_other['timestamp'], view_other['villages_loss'], label='v quit')
#plt.plot(view_other['timestamp'], view_other['world_villages_delta'], label='v net') # decreasing rate of new villages
plt.grid(True)
plt.legend(loc='upper right', ncol=1, prop={'size': 12})
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.ylim(-200, 5000)
plt.show()
#heatmap start date vs lifetime TODO!
QUERY = '''
select
--toUnixTimestamp(toDateTime(toStartOfWeek(t_min_virtual))) as d,
toStartOfMonth(t_min_virtual) as d,
round(active_time_real_virtual / 60 / 60 / 24 / 7 / 4) as active_time,
1 as dummy
from (
select distinct
player_id,
t_min_virtual,
active_time_real_virtual
from
ds.playerstats
)
order by
d
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'active_time', 'dummy'])
view = data
view['active_time'] = view['active_time'].astype(int)
view['m'] = view['d'].apply(lambda x: str(x)[0:7])
pivot = pd.pivot_table(view, values='dummy', index='m', columns='active_time', aggfunc='count', fill_value=None).sort_index(ascending=False)
plt.rc('text', usetex=False)
from matplotlib.colors import LogNorm
plt.figure(figsize=(4,3))
sns.heatmap(pivot, cmap='magma', norm=LogNorm())
plt.xlabel('player lifetime [months]')
plt.ylabel('player start [month]')
n=2
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
for index, label in enumerate(plt.gca().yaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.rc('text', usetex=True)
# plt.grid()
plt.tight_layout()
QUERY = '''
select
totals,
active_time_real,
num,
num_cumsum,
cdf,
pdf
from (
select
groupArray(theactive_time) as arr__active_time_real,
groupArray(num) as arr__num,
arrayCumSum(arr__num) as arr__num_cumsum,
arrayReduce('sum', arr__num) as totals,
arrayMap(
x -> x / totals,
arr__num_cumsum
) as arr__cdf,
arrayMap(
x -> x / totals,
arr__num
) as arr__pdf,
1
from (
select
theactive_time,
count() as num
from (
select
max(active_time_real_virtual) as theactive_time
from
ds.playerstats
group by
player_id
) as X1
group by
theactive_time
order by
theactive_time
) as X2
) as X3
array join
arr__active_time_real as active_time_real,
arr__num as num,
arr__num_cumsum as num_cumsum,
arr__cdf as cdf,
arr__pdf as pdf
'''
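# e.g. arrayCumSum([3, 1, 2]) = [3, 4, 6]; dividing by the total (arrayReduce('sum', ...)) yields the empirical CDF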
data = pd.DataFrame(CH.execute(QUERY), columns=['totals', 'active_time_real', 'num', 'num_cumsum', 'cdf', 'pdf'])
view = data
view['active_time_real_months'] = view['active_time_real'] / 60 / 60 / 24 / 30.5  # seconds to months (~30.5 days each)
plt.rc('text', usetex=False)
plt.figure(figsize=(3,4))
plt.plot(view['active_time_real_months'], view['cdf'], label='CDF')
plt.grid(True)
plt.xlabel('player lifetime [months]')
plt.legend()
plt.show()
QUERY = '''
select
w,
kill_att__avg,
kill_def__avg,
kill_all__avg,
kill_other__avg,
kill_att__sum,
kill_def__sum,
kill_all__sum,
kill_other__sum,
kill_att__q25,
kill_def__q25,
kill_all__q25,
kill_other__q25,
kill_att__q50,
kill_def__q50,
kill_all__q50,
kill_other__q50,
kill_att__q75,
kill_def__q75,
kill_all__q75,
kill_other__q75,
1
from (
select
--aggregation key
toStartOfWeek(d) as w,
--aggregates
avg(kill_att) as kill_att__avg,
avg(kill_def) as kill_def__avg,
avg(kill_all) as kill_all__avg,
avg(kill_other) as kill_other__avg,
stddevPop(kill_att) as kill_att__std,
stddevPop(kill_def) as kill_def__std,
stddevPop(kill_all) as kill_all__std,
stddevPop(kill_other) as kill_other__std,
sum(kill_att) as kill_att__sum,
sum(kill_def) as kill_def__sum,
sum(kill_all) as kill_all__sum,
sum(kill_other) as kill_other__sum,
quantile(.25)(kill_att) as kill_att__q25,
quantile(.25)(kill_def) as kill_def__q25,
quantile(.25)(kill_all) as kill_all__q25,
quantile(.25)(kill_other) as kill_other__q25,
quantile(.50)(kill_att) as kill_att__q50,
quantile(.50)(kill_def) as kill_def__q50,
quantile(.50)(kill_all) as kill_all__q50,
quantile(.50)(kill_other) as kill_other__q50,
quantile(.75)(kill_att) as kill_att__q75,
quantile(.75)(kill_def) as kill_def__q75,
quantile(.75)(kill_all) as kill_all__q75,
quantile(.75)(kill_other) as kill_other__q75,
1
from
ds.playerstats
where
1
-- and d = toStartOfWeek(toDate('2020-07-01'))
group by
d
order by
d
-- limit 1
-- format Vertical
)
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'kill_att__avg', 'kill_def__avg', 'kill_all__avg', 'kill_other__avg', 'kill_att__sum', 'kill_def__sum', 'kill_all__sum', 'kill_other__sum', 'kill_att__q25','kill_def__q25','kill_all__q25','kill_other__q25','kill_att__q50','kill_def__q50','kill_all__q50','kill_other__q50','kill_att__q75','kill_def__q75','kill_all__q75','kill_other__q75','1'])
view = data
plt.figure(figsize=(3,4))
plt.plot(view['d'], view['kill_def__avg'], label='def', linestyle='dashdot', color='green')
plt.plot(view['d'], view['kill_att__avg'], label='att', linestyle='dashed', color='red')
plt.plot(view['d'], view['kill_other__avg'], label='other', linestyle='dotted', color='blue')
n=4
for index, label in enumerate(plt.gca().xaxis.get_ticklabels()):
if index % n != 0:
label.set_visible(False)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
QUERY = '''
select
toStartOfWeek(last_day_seen) as d,
avg(total_villages) as total_villages__avg,
stddevPop(total_villages) as total_villages__std,
avg(delay_days_avg) as delay_days_avg__avg,
stddevPop(delay_days_avg) as delay_days_avg__std,
quantile(.25)(delay_days_avg) as delay_days_avg__q25,
quantile(.50)(delay_days_avg) as delay_days_avg__q50,
quantile(.75)(delay_days_avg) as delay_days_avg__q75,
avg(top1_percentage) as top1_percentage__avg,
quantile(.25)(top1_percentage) as top1_percentage__q25,
quantile(.50)(top1_percentage) as top1_percentage__q50,
quantile(.75)(top1_percentage) as top1_percentage__q75
from (
select
player_id,
last_day_seen,
total_villages,
arrayReduce('avg', delay_days) as delay_days_avg,
sumMap(
arr__new_owner,
arrayResize([1], length(arr__new_owner), 1)
) as counts_per_new_owner,
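--sumMap sums the 1s per new_owner key, i.e. counts conquered villages per new owner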
arrayReverse(arraySort(counts_per_new_owner.2)) as counts_per_new_owner__distribution,
arrayElement(counts_per_new_owner__distribution, 1) / total_villages as top1_percentage,
1
from (
select
player_id,
last_day_seen,
arrayFilter(x -> x > 0, groupArray(new_owner)) as arr__new_owner,
length(arr__new_owner) as total_villages,
arrayFilter(
x, y -> y > 0,
arrayMap(
x, y -> x - y,
groupArray(toUnixTimestamp(the_conquer_timestamp) / 60 / 60 / 24),
groupArray(toUnixTimestamp(toDateTime(last_day_seen)) / 60 / 60 / 24)
),
groupArray(new_owner)
) as delay_days,
1
from (
select
player_id,
last_day_seen,
village_id,
new_owner,
the_conquer_timestamp,
arrayElement(
arrayFilter(
_, y -> y >= last_day_seen,
arraySort(
x, y -> y,
arr__player_id_new,
arr__timestamps
),
arr__timestamps
),
1
) as new_owner,
arrayElement(
arrayFilter(
x -> x >= last_day_seen,
arraySort(arr__timestamps)
),
1
) as the_conquer_timestamp,
1
from (
select
player_id,
last_day_seen,
arr__villages
from (
select distinct
player_id,
t_max as last_day_seen,
1
from
ds.playerstats
where
t_max < toDate('2022-01-05')
) as X1
any left join (
select
d as last_day_seen,
player_id,
groupArray(village_id) as arr__villages
from (
--normalize villages first
select
toDate(timestamp) as d,
village_id,
arrayElement(groupArray(village_player), 1) as player_id,
1
from
ds.village
group by
d,
village_id
) as X1
group by
last_day_seen,
player_id,
1
) as X2
using(last_day_seen, player_id)
) as X3
array join
arr__villages as village_id
any left join (
select
--toDate('2021-06-01') as last_day_seen,
village_id,
groupArray(conquer_timestamp) as arr__timestamps,
groupArray(player_id_new) as arr__player_id_new
from
ds.conquer
group by
village_id
) as X4
using(village_id)
--where
-- village_id = 1983
)
group by
player_id,
last_day_seen,
1
)
where
total_villages > 0
group by
player_id,
last_day_seen,
total_villages,
delay_days,
1
)
group by
d
order by
d
--limit 10
--format Vertical
--village_id = 1983
--┌──────────t─┬─village_player─┐
--│ 2021-05-19 │ 1576939597 │
--│ 2021-05-19 │ 3372451 │
'''
data = pd.DataFrame(CH.execute(QUERY), columns=['d', 'total_villages__avg', 'total_villages__std', 'delay_days_avg__avg', 'delay_days_avg__std', 'delay_days_avg__q25', 'delay_days_avg__q50', 'delay_days_avg__q75', 'top1_percentage__avg', 'top1_percentage__q25', 'top1_percentage__q50', 'top1_percentage__q75'])
view = data
plt.figure(figsize=(16,8))
plt.plot(view['d'], view['total_villages__avg'], label='villages')
# plt.yscale('log')
plt.ylabel('bb villages')
plt.grid(True)
plt.legend()
plt.show()
plt.figure(figsize=(16,8))
plt.plot(view['d'], view['delay_days_avg__avg'], label='delay avg')
plt.plot(view['d'], view['delay_days_avg__q25'], label='delay q25', color='black')
plt.plot(view['d'], view['delay_days_avg__q50'], label='delay q50', color='black')
plt.plot(view['d'], view['delay_days_avg__q75'], label='delay q75', color='black')
plt.ylabel('delay [d]')
plt.grid(True)
plt.legend()
plt.show()
plt.figure(figsize=(16,8))
plt.plot(view['d'], view['top1_percentage__avg'], label='top1 frac avg')
plt.plot(view['d'], view['top1_percentage__q25'], label='top1 frac q25', color='black')
plt.plot(view['d'], view['top1_percentage__q50'], label='top1 frac q50', color='black')
plt.plot(view['d'], view['top1_percentage__q75'], label='top1 frac q75', color='black')
plt.grid(True)
plt.legend()
plt.show()
plt.rc('text', usetex=False)
sns.jointplot(x=view['delay_days_avg__avg'], y=view['top1_percentage__avg'], color="darkred", kind="hex", joint_kws=dict(gridsize=20))