# The Pathogen--Host Interactions Database, version 5.3

**Release date:** 31 January 2026

The [Pathogen--Host Interactions Database](https://phi5.phi-base.org)
(PHI-base) is an online knowledge base that catalogues experimentally
verified pathogenicity, virulence, and effector genes from fungal,
oomycete, and bacterial pathogens, which infect animal, plant, fungal,
and insect hosts. PHI-base is a valuable resource for discovering genes
in medically and agronomically important pathogens that may be targets
for chemical intervention.

Information in PHI-base is manually curated by domain experts and is
supported by strong experimental evidence (for example, gene disruption
and gene complementation experiments), as well as references to the
literature in which the original experiments are described. Annotations
are made using terms from ontologies and controlled vocabularies,
including the Gene Ontology (GO), Brenda Tissue Ontology (BTO), and the
Pathogen--Host Interaction Phenotype Ontology (PHIPO).

PHI-base 5 includes information that was curated using a new curation
process, described in [Cuzick et
al.](https://doi.org/10.7554/eLife.84658) (2023). PHI-base 5 uses a
different data schema from PHI-base 4, but the majority of information
from PHI-base 4 has been migrated to PHI-base 5. Information that has
not yet been migrated can still be viewed in the PHI-base 4 dataset that
is available on Zenodo at <https://doi.org/10.5281/zenodo.5356870>.

For more information about the planned transition from PHI-base 4 to
PHI-base 5, see the [Help](https://phi5.phi-base.org/#/help) and
[Announcements](https://phi5.phi-base.org/#/announcements) pages on the
PHI-base 5 website.

## Release statistics

This version of the PHI-base 5 dataset contains the following types of
information:

| Data type                           |   Count |
|:------------------------------------|--------:|
| Genes                               |   10475 |
| Interactions                        |   33876 |
| Pathogen species                    |     309 |
| Host species                        |     237 |
| Diseases                            |     344 |
| References (publications)           |    5273 |
| **Annotations**                     |         |
| Pathogen-host interaction phenotype |   19898 |
| Gene-for-gene phenotype             |     575 |
| Pathogen phenotype                  |   12421 |
| Host phenotype                      |      21 |
| GO biological process               |    1540 |
| GO cellular component               |     139 |
| GO molecular function               |     215 |
| Post-translational modification     |       8 |
| Physical interaction                |     118 |
| WT RNA expression                   |      76 |
| WT protein expression               |       2 |

## File contents

-   **phi-base_v5.3.xlsx**: the PHI-base dataset as an Excel
    spreadsheet. This format follows the layout of the PHI-base 5
    website, with sheets corresponding to the sections of gene pages on
    the website. This format is designed for use by non-technical users.

-   **phi-base_v5.3.json**: the PHI-base dataset in JSON format. This is
    modelled on the export format used by PHI-Canto, the curation tool
    used by PHI-base. This format is primarily intended for programmatic
    usage and has additional information (e.g. metadata for curation
    sessions) that is not included in the spreadsheet format.

-   **phi-base.schema.json**: a [JSON Schema](https://json-schema.org/)
    file for the JSON format of the dataset. This is included as
    documentation for the fields in the JSON file, but can also be used
    to validate the dataset.

## How to cite

To cite this version of the dataset (version 5.3), use the following
citation:

> Chang, H., Seager, J., Urban, M., & Hammond-Kosack, K. (2026).
> PHI-base: the Pathogen-Host Interactions Database, version 5.3 \[Data
> set\]. Zenodo. <https://doi.org/10.5281/zenodo.18449986>

## Conditions of use

**Rights holder**: Rothamsted Research

**Licence**: Creative Commons Attribution 4.0 International
(<https://creativecommons.org/licenses/by/4.0/>)

**Citation**: Chang, H., Seager, J., Urban, M., & Hammond-Kosack, K.
(2026). PHI-base: the Pathogen-Host Interactions Database, version 5.3
\[Data set\]. Zenodo. <https://doi.org/10.5281/zenodo.18449986>

Rothamsted Research relies on the integrity of our users to ensure that
we receive suitable acknowledgment as being the originator of this
dataset. This enables us to monitor the use of this dataset and to
demonstrate its value. Please send us a link to any publication that
uses this dataset.

## Authors

The current members of the PHI-base team are the authors of this
dataset.

-   Hsin-Yu Chang
    -   **Role**: Curation, curation review
    -   **ORCID**:
        [0000-0001-5577-2356](https://orcid.org/0000-0001-5577-2356)
    -   **Affiliation**: Protecting Crops and the Environment,
        Rothamsted Research
-   James Seager
    -   **Role**: Software development, data engineering
    -   **ORCID**:
        [0000-0001-7487-610X](https://orcid.org/0000-0001-7487-610X)
    -   **Affiliation**: Protecting Crops and the Environment,
        Rothamsted Research
-   Martin Urban
    -   **Role**: Database management, curation review
    -   **ORCID**:
        [0000-0003-2440-4352](https://orcid.org/0000-0003-2440-4352)
    -   **Affiliation**: Protecting Crops and the Environment,
        Rothamsted Research
-   Kim Hammond-Kosack
    -   **Role**: Principal Investigator
    -   **ORCID**:
        [0000-0002-9699-485X](https://orcid.org/0000-0002-9699-485X)
    -   **Affiliation**: Protecting Crops and the Environment,
        Rothamsted Research

## Contributors

### Professional curators

The following professional curators provided curation for this release:

-   Nagashree Nonavinakere
    -   **Role**: Data curator
    -   **ORCID**:
        [0009-0005-6705-9722](https://orcid.org/0009-0005-6705-9722)
    -   **Affiliation**: Molecular Connections Pvt Ltd.

### Community curators

No members of the research community were recorded as providing curation
or review to this version of the dataset.

Some contributors may not have consented to having their personal
details shown here. For a complete list of contributors, see the record
page on Zenodo for this dataset:
<https://doi.org/10.5281/zenodo.18449986>

## Understanding the dataset

The following sections provide guidance on how to interpret and use the
information in the dataset.

### Metagenotypes

In PHI-base, a 'metagenotype' is a concept that represents the combined
genotype of a pathogen and host during a pathogen--host interaction. The
metagenotype is annotated with phenotypes that describe the outcome of
the interaction and its effects on the pathogen and host, which may be
closely interconnected.

The concept of a metagenotype is broadly synonymous with 'interaction',
although we view an interaction as a dynamic process, whereas a
metagenotype is more static.

### Biological feature display names

PHI-base applies annotations to biological features, which include
genes, genotypes, and metagenotypes. These features are displayed on the
website and in the spreadsheet using concise display names. The syntax
of the display names is based on the syntax used by the PHI-Canto
curation tool.

#### Gene names

The display name for a gene is simply the gene name provided by
UniProtKB, where available. The spreadsheet typically uses the UniProtKB
accession number instead of the gene name to reduce ambiguity.

#### Genotype names

Genotype names consist of a list of alleles, where each allele is
followed by the species name, then the strain name in parentheses. Each
allele contains: an allele name, which may immediately be followed by a
description of the allele in parentheses; the allele type, in
parentheses; and changes in expression level, if applicable, in square
brackets.

The following template illustrates the syntax:

> allele_name(description) (allele_type)\[expression\] species_name
> (strain)

An example of a deletion allele, where the expression level does not
apply:

> lsrKΔ (deletion) Escherichia coli (APEC94)

An example of an allele with a description:

> AvrSen1deltaSP(1-30) (partial_amino_acid_deletion)\[Not assayed\]
> Synchytrium endobioticum (MB42)
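
The template above can be sketched as a regular expression. This is a
minimal illustration, not an official PHI-base parser: the pattern and
the function name `parse_genotype_name` are assumptions made for this
example. It handles only genotype names with a single allele, and it
assumes that allele names contain no spaces and that strain names
contain no parentheses.

```python
import re

# Hypothetical pattern for a single-allele genotype display name:
#   allele_name(description) (allele_type)[expression] species_name (strain)
GENOTYPE_NAME = re.compile(
    r"^(?P<allele_name>\S+?)"           # allele name (assumed to have no spaces)
    r"(?:\((?P<description>[^)]*)\))?"  # optional description, directly after the name
    r" \((?P<allele_type>[^)]+)\)"      # allele type in parentheses
    r"(?:\[(?P<expression>[^\]]+)\])?"  # optional expression level in square brackets
    r" (?P<species>.+)"                 # species name
    r" \((?P<strain>[^)]+)\)$"          # strain name in parentheses
)


def parse_genotype_name(name: str) -> dict:
    """Split a single-allele genotype display name into its parts.

    Parts that are absent from the name (description, expression)
    are returned as None.
    """
    match = GENOTYPE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a single-allele genotype name: {name!r}")
    return match.groupdict()


deletion = parse_genotype_name("lsrKΔ (deletion) Escherichia coli (APEC94)")
described = parse_genotype_name(
    "AvrSen1deltaSP(1-30) (partial_amino_acid_deletion)[Not assayed] "
    "Synchytrium endobioticum (MB42)"
)
```

Display names are intended for human readers, so for programmatic use
the JSON format is likely to be a more reliable source of the same
information.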

#### Metagenotype names

Metagenotype names consist of two genotype names, the first being the
pathogen genotype name, and the second being the host genotype name.

Shown below is an example of a metagenotype for an interaction between
Trypanosoma cruzi (the pathogen) and Mus musculus (the host).

> CUBΔ (deletion) Trypanosoma cruzi (Tulahuen) wild type Mus musculus
> (BALB/c)

### Annotation extension syntax

PHI-base uses annotation extensions to provide additional information
about the annotations made to biological features. Examples include the
tissue site infected during a pathogen--host interaction, the change in
infective ability of the pathogen during an interaction, and
interactions used as a control case for an experiment.

Each annotation extension can be thought of as a named relation between
the annotation (the domain) and a value (the range), where values can be
an ontology term, a gene, a metagenotype, and so on.

The syntax of an annotation extension is the relation name, followed by
a display name for the value, followed by a unique identifier for the
value in parentheses.

In the following example, 'infects_tissue' is the relation name, 'leaf'
is the display name for an ontology term, and BTO:0000713 is an
identifier for the ontology term.

> infects_tissue leaf (BTO:0000713)
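
As a rough sketch, an annotation extension in this display form can be
split into its three parts as follows. The names `AnnotationExtension`
and `parse_extension` are hypothetical, and the pattern assumes that
the relation name contains no spaces and that the identifier is the
final parenthesised group in the text.

```python
import re
from typing import NamedTuple


class AnnotationExtension(NamedTuple):
    relation: str      # e.g. 'infects_tissue'
    display_name: str  # human-readable name for the value
    identifier: str    # unique identifier for the value


# Relation name, then the display name, then the identifier in parentheses.
EXTENSION = re.compile(
    r"^(?P<relation>\S+) (?P<display_name>.+) \((?P<identifier>[^()]+)\)$"
)


def parse_extension(text: str) -> AnnotationExtension:
    """Parse one annotation extension written in the display syntax."""
    match = EXTENSION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised annotation extension: {text!r}")
    return AnnotationExtension(**match.groupdict())


extension = parse_extension("infects_tissue leaf (BTO:0000713)")
```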

### Ontology term labels and IDs

PHI-base uses ontology terms in annotation. Ontology terms have a
human-readable term label in addition to a machine-readable unique
identifier. For example, the label 'pathogen host interaction phenotype'
has the identifier PHIPO:0000001.

Most of the ontologies used by PHI-base follow the principles set out by
the [OBO Foundry](https://obofoundry.org/principles/fp-003-uris.html),
which mandates the use of a unique ontology prefix (e.g. PHIPO, GO, BTO)
followed by a zero-padded number, usually seven digits in length.
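
The identifier format can be checked with a short sketch like the one
below. The function name `split_term_id` is hypothetical, and the
pattern is deliberately lenient about the number of digits, since seven
digits is the usual length rather than a guarantee.

```python
import re

# Ontology prefix, a colon, then a zero-padded number of any length.
TERM_ID = re.compile(r"^(?P<prefix>[A-Za-z][A-Za-z0-9_]*):(?P<local_id>\d+)$")


def split_term_id(term_id: str) -> tuple:
    """Return (prefix, local_id) for an identifier such as 'PHIPO:0000001'.

    The local ID is kept as a string so that the zero padding is preserved.
    """
    match = TERM_ID.match(term_id)
    if match is None:
        raise ValueError(f"not an ontology term identifier: {term_id!r}")
    return match["prefix"], match["local_id"]


prefix, local_id = split_term_id("PHIPO:0000001")
```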

Ontology term labels should not be regarded as stable or unique
identifiers for an ontology term: only the ontology term identifier has
these properties. Fortunately, the term identifier can be used to
resolve the current term label. This can be done using lookup services,
such as the [Ontology Lookup Service](https://www.ebi.ac.uk/ols4)
provided by the EBI.

Note that the ontology term labels in the PHI-base spreadsheet are
provided as a convenience. If you are planning to load information from
PHI-base into your own application, we recommend using an ontology term
lookup service to retrieve the latest term labels.

### Resolving biological feature identifiers

This guidance is for users of the PHI-base JSON format.

The PHI-base JSON export is organised into a collection of curation
sessions (the `curation_sessions` object). Each curation session
corresponds to a single publication that has been curated by PHI-base.
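
A minimal sketch of loading the export and summarising its curation
sessions is shown below. The function names are hypothetical, and the
sketch assumes that `curation_sessions` is a top-level key whose value
maps session identifiers to session objects. Consult the JSON Schema
file for the authoritative structure.

```python
import json
from typing import Any, Dict


def load_curation_sessions(path: str) -> Dict[str, Any]:
    """Return the `curation_sessions` object from a PHI-base JSON export."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return data["curation_sessions"]


def summarise_sessions(sessions: Dict[str, Any]) -> Dict[str, Dict[str, int]]:
    """Count the entries in each object or array of every curation session."""
    summary = {}
    for session_id, session in sessions.items():
        summary[session_id] = {
            key: len(value)
            for key, value in session.items()
            if isinstance(value, (dict, list))
        }
    return summary


# Example usage (requires the dataset file in the working directory):
#   sessions = load_curation_sessions("phi-base_v5.3.json")
#   print(len(sessions), "curation sessions")
```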

Annotations and biological features can reference the unique identifier
of other biological features: for example, a genotype can reference one
or more alleles. To retrieve the information for the referenced
biological feature, the identifier must be looked up elsewhere in the
JSON file.

The hierarchy of biological features is shown below. Essentially, each
type of biological feature 'contains' the feature below it:

1.  Metagenotype
2.  Genotype
3.  Allele
4.  Gene

Allele objects contain a gene identifier as the value of the `gene` key.
This identifier will match a key of the `genes` object of the curation
session.

Genotype objects contain allele identifiers in the `loci` array. These
identifiers match the keys of the `alleles` object of the curation
session object.

Metagenotype objects contain genotype identifiers as the values of the
`pathogen_genotype` and `host_genotype` keys. These identifiers match
the keys of the `genotypes` object of the curation session object.

Annotation objects in the `annotations` array can contain one of the
following identifiers: a gene identifier (the `gene` key), a genotype
identifier (the `genotype` key), or a metagenotype identifier (the
`metagenotype` key). The metagenotype identifiers match the keys of the
`metagenotypes` object of the curation session object; the other
identifiers can be resolved using the process described above.
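
The lookups described above can be sketched as follows, operating on a
single curation session that has already been loaded into a Python
dictionary. The function names and the sample identifiers (such as
`allele-1`) are hypothetical. The exact shape of the entries in the
`loci` array is also an assumption: the helper accepts plain identifier
strings, nested arrays, or objects that store the identifier under an
`id` key, so check the JSON Schema file before relying on it.

```python
from typing import Any, Dict, Iterator, List

Session = Dict[str, Any]


def iter_allele_ids(loci: Any) -> Iterator[str]:
    """Yield allele identifiers from a genotype's `loci` value.

    Assumption: entries are identifier strings, nested lists of entries,
    or objects holding the identifier under an `id` key.
    """
    if isinstance(loci, str):
        yield loci
    elif isinstance(loci, dict):
        yield loci["id"]
    elif isinstance(loci, list):
        for entry in loci:
            yield from iter_allele_ids(entry)


def resolve_genotype(session: Session, genotype_id: str) -> List[Dict[str, Any]]:
    """Resolve a genotype to its alleles, and each allele to its gene."""
    genotype = session["genotypes"][genotype_id]
    resolved = []
    for allele_id in iter_allele_ids(genotype.get("loci", [])):
        allele = session["alleles"][allele_id]
        gene_id = allele["gene"]
        resolved.append({
            "allele_id": allele_id,
            "allele": allele,
            "gene_id": gene_id,
            "gene": session["genes"][gene_id],
        })
    return resolved


def resolve_metagenotype(session: Session, metagenotype_id: str) -> Dict[str, Any]:
    """Resolve the pathogen and host genotypes of a metagenotype."""
    metagenotype = session["metagenotypes"][metagenotype_id]
    return {
        "pathogen": resolve_genotype(session, metagenotype["pathogen_genotype"]),
        "host": resolve_genotype(session, metagenotype["host_genotype"]),
    }


def resolve_annotation_feature(session: Session, annotation: Dict[str, Any]) -> Any:
    """Resolve whichever biological feature an annotation refers to."""
    if annotation.get("metagenotype"):
        return resolve_metagenotype(session, annotation["metagenotype"])
    if annotation.get("genotype"):
        return resolve_genotype(session, annotation["genotype"])
    if annotation.get("gene"):
        return session["genes"][annotation["gene"]]
    return None


# Hypothetical session fragment; identifiers and values are illustrative only.
example_session = {
    "genes": {"gene-1": {"organism": "Trypanosoma cruzi"}},
    "alleles": {"allele-1": {"gene": "gene-1"}},
    "genotypes": {
        "genotype-1": {"loci": ["allele-1"]},
        "genotype-2": {"loci": []},
    },
    "metagenotypes": {
        "metagenotype-1": {
            "pathogen_genotype": "genotype-1",
            "host_genotype": "genotype-2",
        }
    },
}
resolved = resolve_metagenotype(example_session, "metagenotype-1")
```

The functions raise `KeyError` when an identifier cannot be found, which
makes a dangling reference visible instead of silently skipping it.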

Organism objects in the `organisms` object are a special case: they are
identified by their NCBI Taxonomy ID (as a string type) but are
referenced in two different ways:

1.  Gene objects reference organisms by their scientific name, in the
    value of the `organism` key, which corresponds to the value of the
    `full_name` key of the organism object.

2.  Genotype objects reference organisms in the value of the
    `organism_taxonid` key, which is the NCBI Taxonomy ID as an integer
    type (instead of a string type).
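
The two ways of referencing an organism can be handled with a pair of
small helpers, sketched below. The function names and the sample data
are hypothetical; only the key names (`organisms`, `full_name`,
`organism`, and `organism_taxonid`) come from the description above.

```python
from typing import Any, Dict, Optional

Session = Dict[str, Any]


def organism_for_gene(session: Session, gene: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Find the organism whose `full_name` matches the gene's `organism` value."""
    for organism in session["organisms"].values():
        if organism.get("full_name") == gene["organism"]:
            return organism
    return None


def organism_for_genotype(session: Session, genotype: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Find the organism for a genotype's integer `organism_taxonid`."""
    # Keys of the `organisms` object are strings, so convert the integer first.
    return session["organisms"].get(str(genotype["organism_taxonid"]))


# Hypothetical session fragment for illustration.
example_session = {
    "organisms": {"10090": {"full_name": "Mus musculus"}},
}
from_gene = organism_for_gene(example_session, {"organism": "Mus musculus"})
from_genotype = organism_for_genotype(example_session, {"organism_taxonid": 10090})
```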
