# Training and testing a model
The library supports training and testing custom models for three tasks: reference classification, classification of
the relation between quotations and (page) references, and linking, that is, identifying the source of a quotation.
`ProQuo` uses the reference and relation classification models. `ProQuoLM` uses the linking model.

## Training data

All training data that can be made available is in [Proquo-Data](https://scm.cms.hu-berlin.de/schluesselstellen/proquodata).

## Reference model
The following command can be used to train a reference model:

~~~
proquo train reference
path_to_train_set.txt
path_to_val_set.txt
path_to_the_output_folder
~~~

`path_to_train_set.txt` and `path_to_val_set.txt` contain one example per line. Each example consists of two strings
and a class label, separated by tabs, for example:

~~~
S. 47	S. 35	1
63	DKV III, 17	0
~~~
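
Files in this format can be produced with any tool that writes tab-separated text. The following is a minimal Python sketch, assuming the examples are available as `(first, second, label)` tuples and that the files are UTF-8 encoded. The function names are hypothetical and not part of the library.

```python
def format_reference_example(first, second, label):
    """Return one tab-separated line (without newline) for a reference example.

    Hypothetical helper, not part of ProQuo.
    """
    for field in (first, second):
        # A tab or newline inside a field would break the line layout.
        if "\t" in field or "\n" in field:
            raise ValueError(f"field contains a tab or newline: {field!r}")
    return f"{first}\t{second}\t{label}"


def write_reference_examples(examples, path):
    """Write (first, second, label) tuples to path, one example per line."""
    # UTF-8 is an assumption; the expected encoding is not stated in this section.
    with open(path, "w", encoding="utf-8") as handle:
        for first, second, label in examples:
            handle.write(format_reference_example(first, second, label) + "\n")
```

For instance, `format_reference_example("S. 47", "S. 35", 1)` yields the first line of the example above.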

To test the model, run:

~~~
proquo test reference
path_to_test_set.txt
path_to_the_reference_vocab_file
path_to_the_reference_model_file
~~~

## Relation model
The following command can be used to train a BERT-based relation model:

~~~
proquo train relation
path_to_train_set.txt
path_to_val_set.txt
path_to_the_output_folder
--arch
"bert"
~~~

`path_to_train_set.txt` and `path_to_val_set.txt` contain one example per line. Each example consists of a string
and a class label, separated by a tab, for example:

~~~
some context, some text <Q> some quote </Q> ( <OREF> ). some more text ( <REF> )	0
~~~
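
Before training, it can be worth checking that every line follows this layout. The sketch below is a hypothetical helper, not part of the library. It assumes that each example has exactly one tab, an integer class label, and a text containing the `<Q>`, `</Q>` and `<REF>` tags shown in the example above. Whether every example must contain all of these tags is an assumption.

```python
# Tags assumed to be required in every example (taken from the sample line).
REQUIRED_TAGS = ("<Q>", "</Q>", "<REF>")


def check_relation_line(line):
    """Return a list of problems found in one relation example line.

    Hypothetical helper, not part of ProQuo. An empty list means the
    line matches the assumed layout.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return [f"expected 2 tab-separated fields, found {len(fields)}"]

    text, label = fields
    problems = []
    try:
        int(label)
    except ValueError:
        problems.append(f"class label is not an integer: {label!r}")
    for tag in REQUIRED_TAGS:
        if tag not in text:
            problems.append(f"missing tag {tag}")
    return problems
```

A common failure this catches is a separator made of spaces instead of a tab, which leaves the line with a single field.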

To test the model, run:

~~~
proquo test relation bert
path_to_test_set.txt
path_to_the_tokenizer_folder
path_to_the_model_folder
~~~

## ProQuoLM model
The following command can be used to train a linking model:

~~~
proquolm train
path_to_train_set.txt
path_to_val_set.txt
path_to_the_output_folder
~~~

`path_to_train_set.txt` and `path_to_val_set.txt` contain one example per line. Each example consists of two strings
and a class label, separated by tabs, for example:

~~~
some text for context, <S> candidate </S> some more text	start of second text, some context <T> candidate </T> text text text	0
~~~
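
A line like this can be assembled by wrapping the candidate span of each text in its marker tags. The following is a minimal sketch, assuming the candidates are known as character offsets and that `<S>`/`</S>` mark the candidate in the first text and `<T>`/`</T>` the candidate in the second, as in the example above. The helper names are hypothetical and not part of the library.

```python
def mark_span(text, start, end, open_tag, close_tag):
    """Wrap text[start:end] in the given tags, separated by single spaces."""
    if not 0 <= start < end <= len(text):
        raise ValueError(f"invalid span ({start}, {end}) for text of length {len(text)}")
    before = text[:start].rstrip()
    candidate = text[start:end].strip()
    after = text[end:].lstrip()
    # Skip empty parts so a candidate at the very start or end adds no stray spaces.
    parts = [before, open_tag, candidate, close_tag, after]
    return " ".join(part for part in parts if part)


def format_linking_example(first, first_span, second, second_span, label):
    """Return one tab-separated line (without newline) for a linking example.

    Hypothetical helper, not part of ProQuo. Each span is a (start, end)
    pair of character offsets into the corresponding text.
    """
    marked_first = mark_span(first, first_span[0], first_span[1], "<S>", "</S>")
    marked_second = mark_span(second, second_span[0], second_span[1], "<T>", "</T>")
    return f"{marked_first}\t{marked_second}\t{label}"
```

The texts themselves must not contain tabs or line breaks, since these would be read as field or example separators.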

By default, training uses [dbmdz/bert-base-german-uncased](https://huggingface.co/dbmdz/bert-base-german-uncased)
as the base model. A different base model can be selected with the command line option `--base-model-name`.

<details>
<summary>All command line options</summary>

~~~
usage: proquolm train [-h]
                      [--create-dated-subfolder | --no-create-dated-subfolder]
                      [--base-model-name BASE_MODEL_NAME]
                      [--lower-case | --no-lower-case]
                      [--batch-size BATCH_SIZE] [--num-epochs NUM_EPOCHS]
                      train-file-path val-file-path output-folder_path

ProQuoLm train allows the user to train their own models.

positional arguments:
  train-file-path       Path to the txt file containing the training examples
  val-file-path         Path to the txt file containing the validation
                        examples
  output-folder_path    Path to the folder for storing the output model and
                        vocabulary

options:
  -h, --help            show this help message and exit
  --create-dated-subfolder, --no-create-dated-subfolder
                        Create a subfolder named with the current date to
                        store the results (default: False)
  --base-model-name BASE_MODEL_NAME
                        The model name (default: dbmdz/bert-base-german-
                        uncased)
  --lower-case, --no-lower-case
                        Train model on lower case text (default: True)
  --batch-size BATCH_SIZE
                        The batch size (default: 4)
  --num-epochs NUM_EPOCHS
                        The number of epochs to train for (default: 3)
~~~

</details>
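
For scripted experiments, the options listed above can be assembled programmatically. The following sketch builds the argument list for `proquolm train` and leaves running it to the caller. The function name is hypothetical and not part of the library; the flags are taken from the usage text above.

```python
def build_train_command(train_path, val_path, output_folder,
                        base_model_name=None, batch_size=None, num_epochs=None):
    """Build the argument list for a `proquolm train` call.

    Hypothetical helper, not part of ProQuo. Options left as None are
    omitted, so the command line defaults apply.
    """
    command = ["proquolm", "train", train_path, val_path, output_folder]
    if base_model_name is not None:
        command += ["--base-model-name", base_model_name]
    if batch_size is not None:
        command += ["--batch-size", str(batch_size)]
    if num_epochs is not None:
        command += ["--num-epochs", str(num_epochs)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, provided the `proquolm` command is installed and on the path.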

To test the model, run:

~~~
proquolm test
path_to_test_set.txt
path_to_the_tokenizer_folder
path_to_the_model_folder
~~~

<details>
<summary>All command line options</summary>

~~~
usage: proquolm test [-h] [--lower-case | --no-lower-case]
                     test-file-path tokenizer-folder-path model-folder-path

ProQuoLm test allows the user to test their trained model.

positional arguments:
  test-file-path        Path to the txt file containing the testing examples
  tokenizer-folder-path
                        Path to the vocab file
  model-folder-path     Path to the model file

options:
  -h, --help            show this help message and exit
  --lower-case, --no-lower-case
                        Test model on lower case text (default: True)
~~~

</details>