Matcher (Version 1) for Automated Task Alignment in the Genomic API for Model Evaluation (GAME)
Authors/Creators
Description
This container packages the `gemma3:12b` model and all necessary Python dependencies to map fuzzy, free-text user inputs to canonical terms from a controlled vocabulary. It operates as a standalone TCP server, accepting JSON-formatted requests and returning the best-matched term.
Matcher V1: Recursive Tournament
This version is a direct evolution of V0 (deprecated V1), which used raw TCP sockets. It introduces significant algorithmic and accuracy improvements and adopts a REST API framework (FastAPI) for greater scalability. It retains the core architecture of V0, a robust framework for LLM-based entity matching in three key genomic domains: cell types, species, and binding molecules (e.g., transcription factors and histone modifications).
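As an illustration of how a client might send a JSON request to the REST service, here is a minimal sketch using only the Python standard library. The endpoint path `/match`, the JSON field names `category` and `term`, and the host and port values are hypothetical placeholders; the actual routes and request schema are defined by the Matcher itself and are not specified in this description.

```python
import json
import urllib.request


def build_match_request(host, port, category, term, path="/match"):
    """Build a POST request carrying a JSON body.

    NOTE: `path`, `category`, and `term` are hypothetical names used for
    illustration only; they are not taken from the Matcher's real API.
    """
    payload = json.dumps({"category": category, "term": term}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}{path}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_match_request(request, timeout=60.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would build the request with the address the Matcher was started on, for example `build_match_request("127.0.0.1", 8080, "cell_type", "mammary epithelial cell")`, and pass the result to `send_match_request`.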
Core Functionality and Improvements from V0
- LLM-Powered Matching: Utilizes the `gemma3:12b` model via the Ollama framework to understand the semantic content of a user's input term.
- Prompting: Employs sophisticated, few-shot prompt engineering to guide the LLM's reasoning.
- Recursive Tournament Algorithm: The "Chunk-and-Compete" method of V0 is upgraded to a more scalable, multi-stage recursive tournament.
  - Chunking: The extensive list of potential choices is broken down into smaller, manageable chunks (e.g., 20 items each). The LLM then finds the best candidate ("champion") within each chunk.
  - Recursive Chunking: After the initial chunking round, the algorithm checks the number of resulting champions. If it exceeds the chunk size, it treats the champions as a new list to be chunked and runs another elimination round. This process repeats recursively, like a tournament bracket, until a small group of finalists remains for the final decision. This lets the matcher handle very large choice lists without failure.
- Enhanced Granularity Matching: The prompt for `cell_type` matching has been refined with new instructions and examples.
  - V1 is now better able to discern the required level of detail. For instance, given the input `mammary epithelial cell`, it can correctly choose "mammary epithelial cell female" over the more specific "mammary epithelial cell female adult (23 years)" from a list of choices, and vice versa if the input is more specific. This leads to more contextually appropriate matches.
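The chunking and recursive elimination steps described above can be sketched as follows. This is a minimal illustration, not the Matcher's actual code: the function names are hypothetical, and the LLM call is replaced by a simple string-similarity picker (`difflib`) so that the example runs without a model.

```python
import difflib


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def similarity_picker(query, chunk):
    """Stand-in for the LLM: pick the choice most similar to the query.

    ASSUMPTION: the real Matcher prompts gemma3:12b here; difflib is used
    only so that this sketch is runnable on its own.
    """
    return max(chunk, key=lambda c: difflib.SequenceMatcher(None, query, c).ratio())


def tournament(query, choices, pick_best=similarity_picker, chunk_size=20):
    """Recursively reduce `choices` to a single best match."""
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 so each round shrinks the list")
    candidates = list(choices)
    if not candidates:
        raise ValueError("choices must not be empty")
    # Base case: few enough finalists for one final decision.
    if len(candidates) <= chunk_size:
        return pick_best(query, candidates)
    # Elimination round: one champion per chunk, then recurse on the champions.
    champions = [pick_best(query, chunk) for chunk in chunked(candidates, chunk_size)]
    return tournament(query, champions, pick_best, chunk_size)
```

With a chunk size of 20, a list of 101 choices yields 6 champions after the first round, which fit into a single final round. Each round reduces the list by roughly a factor of the chunk size, so the number of rounds grows only logarithmically with the size of the vocabulary.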
Running the container
Ensure Apptainer is installed on the system where the container will run. Always start the Matcher first, so that it can listen for incoming connections from Predictors:
apptainer run --containall --nv matcher.sif MATCHER_IP MATCHER_PORT
Note on Flags:
- `--nv`: This flag enables NVIDIA GPU support inside the container. It is essential for performance, as the LLM requires GPU acceleration for timely inference.
- `--containall`: This flag ensures the container is fully self-contained. It prevents the container from accessing the user's home directory or other host system files, guaranteeing that the service runs with only the software and libraries packaged within it for maximum reproducibility.
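For scripted deployments, the launch step could be wrapped in a small helper like the one below. This is a sketch under stated assumptions: the helper names are hypothetical, and the IP and port values passed in are placeholders chosen by the operator. It only assembles the command shown above and checks that the `apptainer` executable is on the `PATH` before starting it.

```python
import shutil
import subprocess


def build_matcher_command(sif_path, ip, port, use_gpu=True):
    """Assemble the `apptainer run` command line as a list of arguments."""
    command = ["apptainer", "run", "--containall"]
    if use_gpu:
        # --nv enables NVIDIA GPU support inside the container.
        command.append("--nv")
    command += [sif_path, ip, str(port)]
    return command


def launch_matcher(sif_path, ip, port, use_gpu=True):
    """Start the Matcher as a background process (hypothetical helper)."""
    if shutil.which("apptainer") is None:
        raise RuntimeError("Apptainer was not found on PATH; install it first.")
    return subprocess.Popen(build_matcher_command(sif_path, ip, port, use_gpu))
```

Passing the arguments as a list avoids shell quoting issues if the image path contains spaces.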
Files
(9.8 GB)
| Checksum | Size |
|---|---|
| md5:d074f117f22cf2f826b7e126603abbed | 9.8 GB |