Published June 7, 2025 | Version v1
Software Open

Cross-Modal Prompt Inversion

  • University of Technology Sydney

Description

This project contains all the Python code for our reverse prompt engineering experiments across three modalities: text, image, and video. For clarity, the code is organized into three folders, one per modality. The implementation follows the two-step inference approach proposed in the paper: Direct Inversion as the first step and Fine-tuning as the second. The project also provides the datasets and evaluation frameworks used for all three modalities. The sections below describe each modality folder, the two-step implementation, and the available datasets.
1. Text Prompt Inversion (text_prompt_inversion/):
This folder implements the reverse prompt engineering approach for the text modality, targeting text-to-text models:

Step 1 - Default Direct Inversion: Default_DI_for_text.ipynb implements the first step of the proposed approach, performing direct inversion on text prompts using pre-trained models without additional training. This notebook includes both implementation and evaluation components.
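As a minimal sketch of this first step (the template and helper names below are our own illustration, not the exact code in Default_DI_for_text.ipynb), direct inversion can be framed as asking a pre-trained LLM to reconstruct the prompt that produced an observed response:

```python
# Sketch of the direct-inversion step: ask a pre-trained LLM to
# reconstruct the original prompt from an observed response.
# The template and helpers are illustrative, not the notebook's code.

INVERSION_TEMPLATE = (
    "Below is a response produced by a language model.\n"
    "Infer the most likely prompt that produced it.\n\n"
    "Response:\n{response}\n\n"
    "Inferred prompt:"
)

def build_inversion_prompt(response: str) -> str:
    """Wrap an observed model response in an inversion query."""
    return INVERSION_TEMPLATE.format(response=response.strip())

def invert(response: str, generate) -> str:
    """Run direct inversion with any text-generation callable.

    `generate` maps a prompt string to a completion string, e.g. a
    wrapper around a Hugging Face pipeline or an API client.
    """
    return generate(build_inversion_prompt(response)).strip()
```

Because no weights are updated, this step only requires inference access to a pre-trained model.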

Step 2 - Fine-tuning: The Fine-tuning/ directory contains the implementation for the second step, where models are fine-tuned using reinforcement learning (RL) methods. The fine-tuning process uses the direct inversion (DI) model as the initial checkpoint, with customizable training parameters through configuration files in scripts/training/task_configs/.
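The RL reward used in the fine-tuning step is configured in the task configs rather than spelled out here; one plausible reward, sketched below entirely under our own assumptions, scores a candidate inverted prompt by how closely the output it regenerates matches the original output (token-level F1 as a stand-in):

```python
# Illustrative reward for RL fine-tuning of the inverter: regenerate
# with a candidate prompt and compare against the original output.
# Token-level F1 is our stand-in metric; the actual reward may differ.
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (0.0 .. 1.0)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    overlap = sum((Counter(c) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(c)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def reward(inverted_prompt: str, original_output: str, generate) -> float:
    """Regenerate with the inverted prompt and score against the original.

    `generate` is any prompt -> output callable (the target model).
    """
    return token_f1(generate(inverted_prompt), original_output)
```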

Environment Setup: Provides complete environment configuration (txt2txt.yml) and documentation.

Datasets: The text modality experiments utilize two comprehensive datasets:
•    Alpaca-GPT4 Dataset: source dataset Alpaca-GPT4, with a processed version on Hugging Face at cyprivlab/Alpaca-GPT4 (https://huggingface.co/datasets/cyprivlab/Alpaca-GPT4/)
•    RetrievalQA Dataset: source dataset RetrievalQA, with a processed version on Hugging Face at cyprivlab/GPT4RQA (https://huggingface.co/datasets/cyprivlab/GPT4RQA)
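To illustrate how these sets feed the experiments, a record can be mapped to an (output, ground-truth prompt) inversion pair; note the field names ("instruction", "output") follow the usual Alpaca-style schema and are an assumption about the processed datasets:

```python
# Sketch of turning dataset records into (output, prompt) inversion
# pairs. The field names ("instruction", "output") follow the usual
# Alpaca-style schema and are an assumption about the processed sets.

def to_inversion_pair(record: dict) -> tuple[str, str]:
    """Map one record to (model_output, ground_truth_prompt)."""
    return record["output"], record["instruction"]

def load_pairs(dataset_name: str = "cyprivlab/Alpaca-GPT4"):
    """Download the processed set and yield inversion pairs.

    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # lazy import; optional dependency
    for record in load_dataset(dataset_name, split="train"):
        yield to_inversion_pair(record)
```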

2. Image Prompt Inversion (image_prompt_inversion/):
This folder focuses on reverse prompt engineering for the image modality, targeting text-to-image models:

Step 1 - Default Direct Inversion: Default_DI_for_image.ipynb implements the first step, applying direct inversion methods to extract prompts from images using pre-trained vision-language models. 

Step 2 - Fine-tuning: The Fine-tuning/ directory contains RL-based fine-tuning implementation, using the DI model as the initial checkpoint with customizable configuration files.

Evaluation: Evaluation.ipynb provides comprehensive evaluation metrics and testing frameworks for comparing results and references across both steps of the image prompt inversion approach.
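The notebook's exact metrics are not reproduced here; as an illustration of the kind of comparison involved, a simple lexical similarity between an inverted prompt and its reference (our stand-in, not necessarily what Evaluation.ipynb computes) might look like:

```python
# Illustrative prompt-vs-reference comparison for image prompt
# inversion. Word-set Jaccard overlap is our stand-in metric; the
# notebook's actual metrics may differ (e.g. embedding-based scores).

def jaccard(inverted: str, reference: str) -> float:
    """Word-set Jaccard similarity between two prompts (0.0 .. 1.0)."""
    a, b = set(inverted.lower().split()), set(reference.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def evaluate(pairs: list[tuple[str, str]]) -> float:
    """Mean similarity over (inverted_prompt, reference_prompt) pairs."""
    return sum(jaccard(i, r) for i, r in pairs) / len(pairs)
```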

Environment Setup: Provides complete environment configuration (img2text.yml) and documentation.

Datasets: The image modality experiments are based on two major datasets:
•    COCO Prompts Dataset: Available on Hugging Face at cyprivlab/Regen_COCO_prompts (https://huggingface.co/datasets/cyprivlab/Regen_COCO_prompts)
•    Stable Diffusion Dataset: Available on Hugging Face at cyprivlab/Regen_SatableDiffusion (https://huggingface.co/datasets/cyprivlab/Regen_SatableDiffusion)

3. Video Prompt Inversion (video_prompt_inversion/):
This is the most comprehensive component, implementing reverse prompt engineering for the video modality, including a web interface and dataset-preprocessing tooling:

Core Implementation: Built on the LongVU framework, supporting both Qwen2_7B and Llama3_2_3B model variants. The app.py provides a web interface for interactive video prompt inversion, while process_video_dataset.py handles video dataset preprocessing.
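The internals of process_video_dataset.py are not detailed here; a common preprocessing step for video-language models like LongVU is uniform frame sampling, sketched below as a generic example under our own assumptions:

```python
# Illustrative uniform frame sampling for video preprocessing.
# This is a generic sketch, not the actual logic of
# process_video_dataset.py.

def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread evenly over the video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the midpoint of each of the num_samples equal segments.
    return [int(step * (i + 0.5)) for i in range(num_samples)]
```

The selected frames would then be decoded and passed to the vision encoder in place of the full video.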

Two-Step Approach: Integrates both direct inversion and model optimization through comprehensive training scripts in scripts/, following the same two-step philosophy.

Training and Evaluation: Includes training scripts for both image and video fine-tuning stages, and evaluation tools (evaluation_forbert_xclip.ipynb) for performance assessment.
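The evaluation notebook's model pipelines are not reproduced here; the core operation in embedding-based scores of this kind (BERT- or X-CLIP-style comparisons) is cosine similarity between embedding vectors, shown generically:

```python
# Generic cosine similarity between two embedding vectors, the core of
# embedding-based scores such as BERT- or X-CLIP-style comparisons.
# The notebook's actual models and pipelines are not reproduced here.
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (-1.0 .. 1.0)."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)
```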

Environment Setup: Provides complete environment configuration (longvu.yml, requirements.txt) and documentation.

Datasets: The video modality extends experiments to text-to-video models using:
•    Video Dataset (CogVideo): Source at VidProM CogVideo (https://huggingface.co/datasets/WenhaoWang/VidProM/resolve/main/example/cog_videos_example.tar)
•    Processed Video Prompts Dataset: Available at cyprivlab/processed_video_prompts (https://huggingface.co/datasets/cyprivlab/processed_video_prompts)
Results: Video inversion results are available for inspection in the video_prompt_inversion/Examples directory.

Files

reverse-prompt-engineering.zip (48.1 MB)
md5:28e0282d7873b867969928017adf8a10
