Published January 30, 2026 | Version v2
Software | Open Access

ProfOlaf: Semi-Automated Tool for Systematic Literature Reviews

  • 1. Instituto Superior Técnico
  • 2. Instituto Superior Técnico, University of Lisbon

Description

Setup with Dockerfile:

  Build the image:

docker build -t profolaf .

  Run the container:

docker run -it profolaf

  Alternatively, load the prebuilt image from the archive:

docker load < profolaf.tar.gz

  Before running the experiments, update to the most recent version of the tool:

git pull
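
  To keep the collected results on the host across container runs, the relevant directory can be bind-mounted. This is a sketch only; the in-container path below is an assumption and should be checked against the Dockerfile:

docker run -it -v "$(pwd)/results:/profolaf/results" profolaf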

ProfOlaf Walkthrough

This appendix provides a walkthrough of ProfOlaf, demonstrating how the tool supports automated and semi-automated snowballing for literature reviews. The tool is available both as a web application and as a command-line interface. Here, we describe the typical usage of the command-line version, which exposes the full pipeline.

Prerequisites and Input

Before running ProfOlaf, the user must prepare an initial seed file: a plain-text (.txt) file containing the titles of the seed articles. These articles represent the starting point of the snowballing process.
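
For example, a minimal seed file (here hypothetically named seeds.txt; one title per line is an assumption, and the titles below are placeholders) could look like this:

Title of the First Seed Article
Title of the Second Seed Article
Title of the Third Seed Article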

Main Snowballing Pipeline

The snowballing workflow consists of the following steps, which are executed sequentially; a command-line sketch of the full pipeline follows the list:

  1. generate_search_conf.py
    Generates the search configuration used to query the supported scholarly databases, including metadata filtering criteria and important file paths.

  2. 0_generate_snowball_start.py
    Initializes the snowballing process using the provided seed file and stores the initial set of articles in the database.

  3. 1_start_iteration.py
    Starts a new snowballing iteration, collecting forward and backward citations from the articles of the previous iteration (or from the initial set when starting a new search). Currently, only Semantic Scholar supports both backward and forward snowballing; from Google Scholar, only citations (forward snowballing) can be fetched.

  4. 2_remove_duplicates.py
    Identifies and removes duplicate entries across databases.

  5. 3_get_bibtex.py
    Retrieves BibTeX metadata for the collected articles. Without a web-scraping proxy, Semantic Scholar is the recommended search method, since too many requests to Google Scholar may result in a block.

  6. (Optional) 4_generate_conf_rank.py
    Filters articles based on venue ranking, if the user wishes to restrict the corpus to specific publication venues.

  7. 5_filter_by_metadata.py
    Filters articles according to metadata attributes (e.g., year, venue, online availability, and language). The user can configure which metadata fields are considered.

  8. 6_filter_by_title.py
    Performs a title-based screening. The user is interactively prompted to decide whether to keep or discard each article, along with a brief justification. At this stage, users are encouraged to be conservative and only discard articles that are clearly irrelevant. Optionally, an LLM-based screening can be enabled to assist this process.

  9. 7_solve_title_disagreements.py
    Resolves disagreements between multiple raters. The script presents articles for which raters disagreed, along with their reasoning from the previous step, and prompts them to reach a consensus decision.

  10. 8_filter_by_content.py
    Performs content-based screening using the full text of the articles, following the same interaction model as the title-based filtering.

  11. 9_solve_content_disagreements.py
    Resolves rater disagreements arising during content-based screening, following the same consensus process as the title-based step. The resulting set of articles marks the end of an iteration.

  12. Iteration
    Steps 3 through 11 (scripts 1_start_iteration.py through 9_solve_content_disagreements.py) are repeated until no new articles are discovered.

  13. 10_generate_csv.py
    Produces the final CSV file containing the selected articles and their associated metadata.
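
As a minimal sketch, the pipeline can be driven from the command line as follows, assuming each script is invoked directly with python and reads its settings from the generated configuration (exact invocations may differ, and the optional venue-ranking step is omitted):

python generate_search_conf.py
python 0_generate_snowball_start.py
python 1_start_iteration.py
python 2_remove_duplicates.py
python 3_get_bibtex.py
python 5_filter_by_metadata.py
python 6_filter_by_title.py
python 7_solve_title_disagreements.py
python 8_filter_by_content.py
python 9_solve_content_disagreements.py

The scripts from 1_start_iteration.py onward are re-run for each subsequent iteration; once no new articles are discovered, the final export is produced:

python 10_generate_csv.py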

Additional Analysis Scripts

In addition to the main snowballing pipeline, ProfOlaf provides auxiliary scripts for post hoc analysis of the final article set.

For the additional setup, the user must run the following scripts (see the sketch after this list):

  • generate_analysis_conf.py
    Stores important paths and filenames in a JSON file for the following steps.

  • 11_download_pdfs.py
    Downloads all article PDFs to a folder for subsequent analysis.
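
Assuming the same direct invocation style as the main pipeline:

python generate_analysis_conf.py
python 11_download_pdfs.py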

Topic Modeling

Topic modeling is run through five different scripts, executed in order (see the sketch after this list):

  1. 11_topic_modeling_lvl1.py
    Reads the set of articles and generates a set of general topics.

  2. 11_topic_modeling_lvl2.py
    Generates more specific sub-topics from the initial set of topics.

  3. 11_topic_modeling_refine.py
    Merges similar topics and removes overly specific or redundant topics that occur in fewer than 1% of the articles.

  4. 11_topic_modeling_assign.py
    Assigns the generated topics to each article.

  5. 11_topic_modeling_correct.py
    Corrects hallucinated or otherwise erroneous topic assignments.
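
Assuming the same direct invocation style (exact invocations may differ):

python 11_topic_modeling_lvl1.py
python 11_topic_modeling_lvl2.py
python 11_topic_modeling_refine.py
python 11_topic_modeling_assign.py
python 11_topic_modeling_correct.py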

Task Assistant

The task assistant module is run using 11_task_assistant.py. The user can add new prompts as text files under the folder specified in the analysis configuration and run the script to have an LLM execute each task for every article. A hypothetical example follows.
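
For illustration only (the folder name prompts/ and the prompt file below are hypothetical; use the folder specified in the analysis configuration):

echo "Summarize the main contributions of this article in at most three bullet points." > prompts/summarize_contributions.txt
python 11_task_assistant.py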

Files (5.8 GB)

  • md5:420b3e0e422d44190ac6517c8e66d785 (2.1 kB)
  • md5:593bf2ea1d3dac165eaa33a1071732db (5.8 GB)

Additional details

Software

Repository URL
https://github.com/sr-lab/ProfOlaf
Development Status
Active