# Artifact "Question Answering over Linked Data with Vague Temporal Adverbials"

## Setup

We suggest using Python 3.11 to run this artifact; older versions have not been tested. The following instructions should work on most major Linux distributions.

Start by setting up a virtual environment and installing the required packages. For this, run the following at the top level of a cloned version of this repository:

```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
bash setupSpacy.sh
```

To execute a single Python file manually, activate the virtual environment and set `PYTHONPATH` to the `src/` subdirectory:

```bash
source venv/bin/activate
export PYTHONPATH=/path/to/src/
python some/python/file.py
```
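The same setup can also be scripted from Python, for instance when several files should be run in sequence. The following is a minimal sketch and not part of the artifact: the helper names `build_env` and `run_script` are hypothetical, and it assumes the sketch itself is started with the virtual environment's interpreter so that `sys.executable` points into `venv/`. Unlike the `export` above, it prepends to an existing `PYTHONPATH` instead of overwriting it.

```python
import os
import subprocess
import sys
from typing import Dict, Optional


def build_env(src_dir: str, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with src_dir prepended to PYTHONPATH."""
    env = dict(os.environ if base is None else base)
    existing = env.get("PYTHONPATH")
    # Prepend so that modules under src/ take precedence over other entries.
    env["PYTHONPATH"] = src_dir if not existing else src_dir + os.pathsep + existing
    return env


def run_script(script: str, src_dir: str, *args: str) -> subprocess.CompletedProcess:
    """Run a script with the current interpreter; raises CalledProcessError on failure."""
    return subprocess.run(
        [sys.executable, script, *args],
        env=build_env(src_dir),
        check=True,
    )
```

With these helpers, `run_script("some/python/file.py", "/path/to/src/")` would correspond to the shell commands above.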

## Artifact Structure

This artifact is an extension of [https://github.com/ag-sc/neodudes](https://github.com/ag-sc/neodudes) to support vague temporal adverbials. In the following, we will only highlight the most important changes and additions.

* src/lemon/resources/ - Resources, results, datasets, and other files needed for the experiments
  * comp_query_score_models/ - Pre-trained model for the query score prediction task
    * recombine.sh - Script to recombine and decompress the model files
  * lexicon_fuzzy/ - Lexicon for vague temporal adverbials and corresponding events
  * vaguetemp/ - Evaluation results, dataset files (both for the fuzzy question answering task and training the query selection model), the household knowledge graph, FuzzyLLI model parameters, and other relevant files
    * 2025-05-07-full-final-evaluation_data_results.csv.zst - All candidate queries generated by the extended NeoDUDES pipeline for the presented benchmark
    * 2025-05-07-full-final-evaluation_data_results_llm_selected.jsonl.zst - Queries chosen by the query selection LLM for each query selection strategy and benchmark item
    * entity_predicates_*fuzzy.sqlite - Databases containing valid entity-predicate combinations
    * evaluation_data_*.json - The benchmark dataset files with ground truth results
    * evaluation_data_*.json-dataset.cpkl - Dataset files for query selection model training
    * household_events.ttl - The household knowledge graph used for the experiments
    * household_events_example.ttl - Small excerpt from the household knowledge graph
    * optimized_parameters_*.pkl - (extended) FuzzyLLI model parameters
* src/dudes/qa/dudes_creation/dudes_creation_strategy.py - Separate strategy for the vague temporal adverbials and the corresponding events
* src/dudes/qa/sparql/ - SPARQL query generation has been adapted to support vague temporal adverbials
  * sparql_generator.py - Refactored SPARQL generator to be modular and extensible
  * sparql_modules.py - Modules implementing parts of the SPARQL query generation; adds a module that evaluates the special property `vaguetemp`, which triggers a call to the FuzzyLLI model and generates FILTER statements for the results
* src/llm/query_scoring/dataset.py - Added specialized dataset for further fine-tuning the query selection model with SPARQL queries representing interpretations of vague temporal adverbials
* src/fuzzylli/ - Standalone scripts and data for the (extended) FuzzyLLI model
* src/fuzzy_dataset/ - Scripts and data for knowledge graph and evaluation dataset generation
  * twor.2010/ - Dataset from [https://casas.wsu.edu/datasets/](https://casas.wsu.edu/datasets/)
  * preprocess_data.py - Script to preprocess and clean the household dataset
  * create_kg.py - Script to create the knowledge graph from the household dataset
  * create_evaluation_dataset.py - Script to create the evaluation dataset based on the household dataset, i.e., the corresponding knowledge graph
* src/llm/query_scoring/training_fuzzy.py - Adapted training script for the fuzzy query selection model
* src/tests/test_fuzzy.py - Benchmark script running the pipeline as well as the query selection, also includes other tests and examples that might be interesting
* requirements.txt - Python requirements of this project

## Replication Steps

### Dataset Generation

First, go to the `src/fuzzy_dataset/` directory:

```bash
cd src/fuzzy_dataset/
```

Then, decompress the original household dataset file:

```bash
zstd -d ./twor.2010/data.zst
```

Next, clean and preprocess the original household dataset:

```bash
python ./preprocess_data.py
```

Argument:

* `--filepath` - Path to the household dataset. Default: `twor.2010/data`

Output: `R1_activities.json`, `R2_activities.json`

Both files contain the cleaned activities split by resident. Then, we can create the corresponding knowledge graph:

```bash
python ./create_kg.py
```

Output: `household_events.ttl`

Based on these files, we can then generate the evaluation dataset:

```bash
python ./create_evaluation_dataset.py
```

Additionally, you can change the parameters of the dataset creation:

Arguments:

* `--num_what` - Number of WHAT questions per adverbial. Default: 100
* `--num_did` - Number of DID questions per event. Default: 5
* `--num_who` - Number of WHO questions per adverbial. Default: 200
* `--num_what_happened` - Number of WHAT HAPPENED questions per adverbial. Default: 100

For example:

```bash
python ./create_evaluation_dataset.py --num_what 10 --num_did 5 --num_who 1 --num_what_happened 5
```

The evaluation data for the four question categories is placed in evaluation_data/...
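To illustrate how the documented flags and defaults fit together, they can be mirrored in a small `argparse` parser. This is a sketch under assumptions, not the parser used in `create_evaluation_dataset.py`: the function name `build_parser` is hypothetical, and treating all four values as integers is an assumption based on the defaults listed above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Sketch of the evaluation dataset creation options"
    )
    # Integer types are an assumption; the defaults are taken from this README.
    parser.add_argument("--num_what", type=int, default=100,
                        help="Number of WHAT questions per adverbial")
    parser.add_argument("--num_did", type=int, default=5,
                        help="Number of DID questions per event")
    parser.add_argument("--num_who", type=int, default=200,
                        help="Number of WHO questions per adverbial")
    parser.add_argument("--num_what_happened", type=int, default=100,
                        help="Number of WHAT HAPPENED questions per adverbial")
    return parser


# Parse an explicit argument list; omitted flags fall back to their defaults.
args = build_parser().parse_args(["--num_what", "10", "--num_who", "1"])
print(vars(args))
```

Passing an explicit list to `parse_args` keeps the sketch independent of the actual command line, which makes it easy to try out in an interactive session.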

### Query Selection Model Training

The base model can be obtained from [https://zenodo.org/records/12610838/files/query_score_models.tar.zst?download=1](https://zenodo.org/records/12610838/files/query_score_models.tar.zst?download=1). We use the `query_score_llm_clampfp_1.3902932441715008e-05_0.9013707813420198_64_2_2024-06-21_20-33-02-776942.ckpt` model as a basis. The model is expected to be present at `src/lemon/resources/query_score_llm_clampfp_1.3902932441715008e-05_0.9013707813420198_64_2_2024-06-21_20-33-02-776942_best_val_loss.ckpt`.

The training can then be started by simply running the following command:

```bash
python src/llm/query_scoring/training_fuzzy.py
```

Depending on the available hardware, some parameters, such as the batch size, might need to be adjusted using the command line options.

### Experiments

Running the experiments consists of two steps: First, the pipeline is run to generate the candidate queries for the benchmark dataset. This is done by running the following command:

```bash
python src/tests/test_fuzzy.py --eval
```

By default, this command places the generated queries in a file named `evaluation_data_results.csv` in the `src/lemon/resources/vaguetemp/` directory. The query selection model then selects the most promising queries. This is done by running the following command:

```bash
python src/tests/test_fuzzy.py --llmselect --resultspath src/lemon/resources/vaguetemp/evaluation_data_results.csv
```

To run the query selection with the pre-computed candidate queries from our experiments, you can simply run:

```bash
python src/tests/test_fuzzy.py --llmselect
```

By default, the queries selected per strategy are written to `src/lemon/resources/vaguetemp/2025-05-07-full-final-evaluation_data_results_llm_selected.jsonl`.
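Since the output is a JSON Lines file, it can be inspected with the standard library alone. The following is a minimal sketch that makes no assumptions about the field names inside each record; the helper name `iter_jsonl` is hypothetical, and the file is assumed to be uncompressed.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at the end
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_jsonl(path))` counts the records without loading the whole file into memory.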

This two-step approach was chosen for easier replicability of single steps. In a real-world scenario, one would rather directly forward the candidate queries to the model as they are generated and stop once no more candidates are available or some timeout is reached.
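The streaming variant described above could look roughly like the following sketch. It is not part of the artifact: `select_best_streaming` and its `score` callback are hypothetical stand-ins for the query selection model, and reducing selection to a single numeric score per query is a simplifying assumption.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def select_best_streaming(
    candidates: Iterable[str],
    score: Callable[[str], float],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Tuple[str, float]]:
    """Score candidates as they arrive; stop on exhaustion or timeout.

    Returns the best (query, score) pair seen so far, or None if no
    candidate was scored before the deadline.
    """
    deadline = clock() + timeout_s
    best: Optional[Tuple[str, float]] = None
    for query in candidates:
        # The deadline is only checked between candidates, so a slow
        # generator or a slow scoring call can still overrun it.
        if clock() >= deadline:
            break
        current = score(query)
        if best is None or current > best[1]:
            best = (query, current)
    return best
```

Passing a generator as `candidates` means that queries are only produced on demand, so generation stops as soon as the loop ends.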
