In Round 1b of the IDG-DREAM Challenge, we used the BayesBootLadderBoot approach to allow participants to submit repeatedly while (hopefully) avoiding overfitting. In brief, this approach reports back a bootstrapped score of submitted predictions instead of the exact score. In addition, subsequent submissions from a submitter are only given a new score if they are substantially better than the previous best submission (Bayes factor > 3). This approach allows participants to see when they have made large improvements to their model, but doesn’t reward small tweaks that incrementally improve scores. We ran this round for 2 months and participants could submit once per day.
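For readers unfamiliar with the mechanics, here is a minimal sketch of the idea in R. It is not the actual challenge scoring code, and the function, argument, and data names are hypothetical; it only illustrates bootstrapping the Spearman correlation and approximating a Bayes factor K from the bootstrap samples to decide whether to report a new score.

```r
# Minimal sketch of the ladder idea (NOT the actual challenge code).
# `new_pred` and `prev_pred` are numeric prediction vectors; `gold` is the
# vector of measured values. All names here are hypothetical.
ladder_step <- function(new_pred, prev_pred, gold, n_boot = 1000, k_cutoff = 3) {
  n <- length(gold)
  boot_new  <- numeric(n_boot)
  boot_prev <- numeric(n_boot)
  for (i in seq_len(n_boot)) {
    idx <- sample.int(n, n, replace = TRUE)        # resample the test set
    boot_new[i]  <- cor(new_pred[idx],  gold[idx], method = "spearman")
    boot_prev[i] <- cor(prev_pred[idx], gold[idx], method = "spearman")
  }
  # Approximate a Bayes factor K as the ratio of bootstrap iterations in
  # which the new submission beats the previous best vs. the reverse.
  delta <- boot_new - boot_prev
  K <- sum(delta > 0) / max(sum(delta <= 0), 1)
  if (K > k_cutoff) {
    # Substantial improvement: report a bootstrapped score of the new submission.
    list(reported = sample(boot_new, 1), met_cutoff = TRUE, K = K)
  } else {
    # Otherwise: report a bootstrapped score of the previous best submission.
    list(reported = sample(boot_prev, 1), met_cutoff = FALSE, K = K)
  }
}
```

The real implementation differs in its details, but the gate is the important part: a participant only sees a new score when K exceeds the cutoff; otherwise they stay on the same ladder rung.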

The Spearman correlation was used to track performance for determining a participant’s “best” submission, so I’ve focused on that metric in this analysis, but we also scored RMSE, Pearson correlation, F1, concordance index, and average AUC for each submission.
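As a rough illustration (with made-up example vectors), the continuous metrics can be computed directly in R; the F1, concordance index, and average AUC additionally depend on an activity threshold and other choices not shown here.

```r
# Toy illustration of the continuous metrics; `pred` and `gold` stand in for
# predicted and measured values (hypothetical example data).
gold <- c(7.1, 5.3, 6.8, 8.0, 4.9)
pred <- c(6.5, 5.0, 7.2, 7.4, 5.6)

spearman <- cor(pred, gold, method = "spearman")
pearson  <- cor(pred, gold, method = "pearson")
rmse     <- sqrt(mean((pred - gold)^2))
```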

Submission metrics

Here are some basic numbers about Round 1b.

| metric | value |
|---|---|
| Round Length | 59 days |
| Number of Valid Submissions | 216 |
| Number of Improved Submissions (K > 3) | 72 |
| Number of Not-Improved Submissions (K < 3 or worse score) | 144 |
| Best Spearman Correlation | 0.552 |
| Best Average AUC | 0.777 |
| Best F1 | 0.504 |
| Best Concordance Index | 0.712 |
| Best Pearson Correlation | 0.555 |
| Best Root Mean Squared Error | 1.019 |

Overall Performance

Many participants submitted more than once, though some only submitted once during this round.

Here is the same plot, but only looking at the final submission (blue) and the best submission (orange) for each team. When they are one and the same, the point is orange.

Reported vs actual scores

As a reminder, the participants see a bootstrapped score instead of a precise score. So, how does the reported score track with the actual score? We can plot the actual score vs the reported score (what the participants saw on the leaderboard) to get a sense of how close they were. If a submission didn’t meet the cutoff (met_cutoff = FALSE), the participant was shown a bootstrap of their previous best score instead, so those points will not be correlated.

As another reminder, we used the Spearman correlation to determine whether a submission met the cutoff, so for the non-Spearman metrics there will be some submissions where the actual vs reported scores do not fall along the diagonal even for met_cutoff = TRUE.
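A hedged sketch of how such a plot could be drawn, assuming a long-format data frame `scores` with hypothetical columns `actual`, `reported`, `metric`, and `met_cutoff`:

```r
library(ggplot2)

# Actual vs reported score, one panel per metric, colored by whether the
# submission met the Bayes factor cutoff. Column names are hypothetical.
ggplot(scores, aes(x = actual, y = reported, color = met_cutoff)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~metric, scales = "free") +
  labs(x = "Actual score", y = "Reported (bootstrapped) score")
```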

Participant performance over Time

So using this strategy, do participants ascend the ladder? Here I am plotting the submission date vs the reported (bootstrapped) Spearman for all submitters that made more than one Round 1b submission. The reported score should never go (substantially) down for a given participant (one line); scores that dip slightly are due to the bootstrapping of returned scores. A few participants started out worse than the “baseline” good model (dashed line) and ended up better.
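A plot like this can be sketched as follows, assuming a data frame `spearman_scores` with hypothetical columns `submitter`, `created_on` (submission date), and `reported_spearman`, plus a `baseline` value for the dashed line:

```r
library(dplyr)
library(ggplot2)

# One line per submitter, reported (bootstrapped) Spearman over time,
# restricted to submitters with more than one submission.
spearman_scores %>%
  group_by(submitter) %>%
  filter(n() > 1) %>%
  ggplot(aes(x = created_on, y = reported_spearman, group = submitter)) +
  geom_line(alpha = 0.5) +
  geom_hline(yintercept = baseline, linetype = "dashed") +
  labs(x = "Submission date", y = "Reported (bootstrapped) Spearman")
```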

If we look at the actual Spearman for each submission, we can see that it is highly variable, and that participants frequently do much worse than previous submissions (they cannot tell this though, because they remain on the same ladder rung).

Let’s look at individual participants. Here, each facet is an individual participant. The blue line shows what the participant saw, while the red line shows actual performance. The dashed line again reflects the PLOS ONE baseline method.
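A sketch of the faceted version, reusing the same hypothetical `spearman_scores` data frame plus an `actual_spearman` column:

```r
library(ggplot2)

# Reported (blue) vs actual (red) Spearman over time, one facet per submitter.
# Column names and the `baseline` value are hypothetical.
ggplot(spearman_scores, aes(x = created_on)) +
  geom_line(aes(y = reported_spearman), color = "blue") +
  geom_line(aes(y = actual_spearman), color = "red") +
  geom_hline(yintercept = baseline, linetype = "dashed") +
  facet_wrap(~submitter) +
  labs(x = "Submission date", y = "Spearman correlation")
```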

Example participants

Here is an example of a single participant who improved on most submissions. Again, the blue line is the reported Spearman, while the red is the true score. Each submission is marked with a point if it met the cutoff and was considered a new “best”, and with an X if it did not meet the cutoff.

Here’s another example.

Submission bins

I also looked at performance based on the number of submissions a group made. Each submitter is grouped by their total number of submissions (x axis). In this first plot, we are looking at the difference between the worst and best submission for a given submitter.

In this second plot, we are looking at the difference between the first and last submission for a given submitter. In the final plot, we are looking at the score of their final submission.
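These summaries are straightforward to compute; here is a sketch under the same hypothetical column names, with the first of the three plots shown:

```r
library(dplyr)
library(ggplot2)

# Per-submitter summaries from the hypothetical `spearman_scores` data frame
# (columns `submitter`, `created_on`, `actual_spearman`).
per_submitter <- spearman_scores %>%
  arrange(submitter, created_on) %>%
  group_by(submitter) %>%
  summarise(
    n_submissions    = n(),
    best_minus_worst = max(actual_spearman) - min(actual_spearman),
    last_minus_first = last(actual_spearman) - first(actual_spearman),
    final_score      = last(actual_spearman)
  )

# Spread between worst and best submission, by total submission count.
ggplot(per_submitter, aes(x = factor(n_submissions), y = best_minus_worst)) +
  geom_boxplot() +
  labs(x = "Number of submissions", y = "Best minus worst Spearman")
```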


Heatmap of all scores

Scores are scaled relative to the other scores of the same type.
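One way to produce such a heatmap, assuming a long-format data frame `all_scores` with hypothetical columns `submitter`, `metric`, and `score`, is to z-scale within each metric and plot tiles:

```r
library(dplyr)
library(ggplot2)

# Scale each metric relative to the other scores of the same type, then
# draw a submitter-by-metric heatmap. Column names are hypothetical.
all_scores %>%
  group_by(metric) %>%
  mutate(scaled = as.numeric(scale(score))) %>%
  ggplot(aes(x = metric, y = submitter, fill = scaled)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red")
```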