Cross-Modal Prompt Inversion
Description
This project contains the Python code for our reverse prompt engineering experiments across three modalities: text, image, and video. For clarity, the code is organized into one folder per modality. The implementation follows the two-step inference approach proposed in the paper: Direct Inversion as the first step and Fine-tuning as the second. The project also provides datasets and evaluation frameworks for all three modalities. The sections below describe each modality folder, the two-step implementation, and the available datasets.
1. Text Prompt Inversion (text_prompt_inversion/):
This folder implements the reverse prompt engineering approach for the text modality, targeting text-to-text models:
Step 1 - Default Direct Inversion: Default_DI_for_text.ipynb implements the first step of the proposed approach, recovering text prompts by direct inversion with pre-trained models and no additional training. This notebook includes both implementation and evaluation components.
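In essence, direct inversion asks a pre-trained LLM to guess the prompt behind an observed output. A minimal sketch of that idea follows; the model name and prompt template are illustrative assumptions, not the notebook's exact choices:

```python
# Direct inversion sketch: ask a pre-trained LLM which instruction
# produced a given output. Model name and template are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical inversion model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def direct_invert(response_text: str, max_new_tokens: int = 128) -> str:
    """Reconstruct a candidate prompt for `response_text`."""
    inversion_prompt = (
        "Below is the output of a language model. "
        "Guess the instruction that produced it.\n\n"
        f"Output:\n{response_text}\n\nInstruction:"
    )
    inputs = tokenizer(inversion_prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated continuation, not the prompt tokens.
    generated = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()
```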
Step 2 - Fine-tuning: The Fine-tuning/ directory contains the implementation for the second step, where models are fine-tuned using reinforcement learning (RL) methods. The fine-tuning process uses the direct inversion (DI) model as the initial checkpoint, with customizable training parameters through configuration files in scripts/training/task_configs/.
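For orientation, RL fine-tuning of an inversion model typically looks like the sketch below, written against TRL's classic PPO API (TRL ≤ 0.11). The checkpoint path and reward function are placeholders, not the repository's actual configuration, which lives in scripts/training/task_configs/:

```python
# PPO fine-tuning sketch starting from the direct inversion (DI) checkpoint.
# Paths, hyperparameters, and the reward are illustrative assumptions.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

DI_CHECKPOINT = "path/to/di_checkpoint"  # hypothetical DI model path

tokenizer = AutoTokenizer.from_pretrained(DI_CHECKPOINT)
model = AutoModelForCausalLMWithValueHead.from_pretrained(DI_CHECKPOINT)
ppo_trainer = PPOTrainer(PPOConfig(batch_size=8, mini_batch_size=4),
                         model, tokenizer=tokenizer)

def reward_fn(inverted_prompt: str, reference_prompt: str) -> torch.Tensor:
    # Placeholder reward: word overlap with the reference prompt. The actual
    # reward is task-specific (e.g., similarity of regenerated outputs).
    ref_words = reference_prompt.split()
    overlap = len(set(inverted_prompt.split()) & set(ref_words))
    return torch.tensor(overlap / max(len(ref_words), 1))

# Each PPO step then consumes tokenized (output, generated prompt) pairs:
#   ppo_trainer.step(query_tensors, response_tensors, rewards)
```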
Environment Setup: Provides complete environment configuration (txt2txt.yml) and documentation.
Datasets: The text modality experiments use two datasets; a loading sketch follows the list:
• Alpaca-GPT4 Dataset: source available at Alpaca-GPT4, with a processed version on Hugging Face at cyprivlab/Alpaca-GPT4 (https://huggingface.co/datasets/cyprivlab/Alpaca-GPT4/)
• RetrievalQA Dataset: source available at RetrievalQA, with a processed version on Hugging Face at cyprivlab/GPT4RQA (https://huggingface.co/datasets/cyprivlab/GPT4RQA)
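Both processed datasets can be pulled directly from the Hub (split and column names should be checked against the dataset cards):

```python
# Load the processed text datasets from the Hugging Face Hub.
from datasets import load_dataset

alpaca = load_dataset("cyprivlab/Alpaca-GPT4")
rqa = load_dataset("cyprivlab/GPT4RQA")
print(alpaca)  # inspect the available splits and columns
```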
2. Image Prompt Inversion (image_prompt_inversion/):
This folder focuses on reverse prompt engineering for the image modality, targeting text-to-image models:
Step 1 - Default Direct Inversion: Default_DI_for_image.ipynb implements the first step, applying direct inversion methods to extract prompts from images using pre-trained vision-language models.
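Conceptually, this step prompts a vision-language model to describe an image as a generation prompt. A minimal sketch follows; the BLIP-2 checkpoint is an illustrative assumption, not necessarily the notebook's model:

```python
# Image direct inversion sketch: caption an image with a pre-trained
# vision-language model to obtain a candidate text-to-image prompt.
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def invert_image(image_path: str) -> str:
    """Return a candidate generation prompt for the given image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```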
Step 2 - Fine-tuning: The Fine-tuning/ directory contains the RL-based fine-tuning implementation, again using the DI model as the initial checkpoint with customizable configuration files.
Evaluation: Evaluation.ipynb provides evaluation metrics and testing code for comparing inverted prompts against references across both steps of the image prompt inversion approach.
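One common metric of this kind is embedding similarity between an inverted prompt and its reference; the sketch below uses Sentence-Transformers, which may differ from the notebook's exact metric suite:

```python
# Prompt-similarity sketch: cosine similarity of sentence embeddings.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def prompt_similarity(inverted: str, reference: str) -> float:
    emb = encoder.encode([inverted, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(prompt_similarity("a cat on a sofa", "a tabby cat sleeping on a couch"))
```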
Environment Setup: Provides complete environment configuration (img2text.yml) and documentation.
Datasets: The image modality experiments are based on two datasets; a loading sketch follows the list:
• COCO Prompts Dataset: available on Hugging Face at cyprivlab/Regen_COCO_prompts (https://huggingface.co/datasets/cyprivlab/Regen_COCO_prompts)
• Stable Diffusion Dataset: available on Hugging Face at cyprivlab/Regen_SatableDiffusion (https://huggingface.co/datasets/cyprivlab/Regen_SatableDiffusion)
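As with the text datasets, both can be loaded from the Hub (the split name here is an assumption; check the dataset cards for the actual schema):

```python
# Load the image datasets and verify their schema before use.
from datasets import load_dataset

coco = load_dataset("cyprivlab/Regen_COCO_prompts", split="train")
sd = load_dataset("cyprivlab/Regen_SatableDiffusion", split="train")
print(coco.column_names)  # confirm the image/prompt field names
```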
3. Video Prompt Inversion (video_prompt_inversion/):
This is the most extensive component, implementing reverse prompt engineering for the video modality, complete with a web demo and dataset-processing tools:
Core Implementation: Built on the LongVU framework, supporting both Qwen2_7B and Llama3_2_3B model variants. The app.py provides a web interface for interactive video prompt inversion, while process_video_dataset.py handles video dataset preprocessing.
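Video preprocessing for models like LongVU typically reduces each clip to a fixed number of frames. The sketch below shows uniform frame sampling with OpenCV; the actual strategy in process_video_dataset.py may differ:

```python
# Uniform frame sampling sketch (OpenCV); sampling strategy is an assumption.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 8) -> list:
    """Return `num_frames` uniformly spaced RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```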
Two-Step Approach: Integrates direct inversion and model fine-tuning through the training scripts in scripts/, following the same two-step philosophy as the other modalities.
Training and Evaluation: Includes training scripts for both the image and video fine-tuning stages, plus an evaluation notebook (evaluation_forbert_xclip.ipynb) for performance assessment.
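The notebook's name suggests BERTScore on the text side and X-CLIP similarity on the video side. A BERTScore sketch is shown below; the prompts are made-up examples, and X-CLIP scoring follows the same candidates-versus-references pattern:

```python
# BERTScore sketch: compare inverted prompts against reference prompts.
from bert_score import score

candidates = ["a drone shot of a waterfall at sunset"]          # inverted
references = ["aerial view of a waterfall during golden hour"]  # ground truth

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```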
Environment Setup: Provides complete environment configuration (longvu.yml, requirements.txt) and documentation.
Datasets: The video modality extends experiments to text-to-video models using:
• Video Dataset (CogVideo): Source at VidProM CogVideo (https://huggingface.co/datasets/WenhaoWang/VidProM/resolve/main/example/cog_videos_example.tar)
• Processed Video Prompts Dataset: available on Hugging Face at cyprivlab/processed_video_prompts (https://huggingface.co/datasets/cyprivlab/processed_video_prompts)
Results: Video inversion results are available for inspection in the video_prompt_inversion/Examples directory.
Files
reverse-prompt-engineering.zip (48.1 MB)
md5:28e0282d7873b867969928017adf8a10
Additional details
Software
- Repository URL: https://github.com/cyprivlab/reverse-prompt-engineering