styloformer-artfilm-scene-classification
# Styloformer: Automatic Classification of Art Film Scenes
This repository contains the implementation of **Styloformer**, a multimodal transformer framework for **automatic classification of art film scenes** based on **image and audio deep features**.
The project integrates **visual, auditory, textual, and curatorial signals** into a unified representation space, enabling both predictive performance and art-historical interpretability.
---
## ✨ Key Features
- **Multimodal Fusion**
  A cross-modal attention mechanism dynamically aligns visual and auditory features for robust scene understanding.
- **Styloformer Architecture**
  A transformer-based framework integrating:
  - Stylistic clustering
  - Canonicality estimation
  - Influence prediction
  - Historiographic navigation
- **Historiographic Navigation**
  A novel interpretive module that embeds ontological priors and temporal logic for reasoning about artistic influence.
- **State-of-the-Art Performance**
  - **MovieNet dataset**: 91.85% accuracy, 94.31% AUC
  - Outperforms baselines like **CLIP**, **ViT**, and **PANDA**
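
To make the cross-modal attention idea above more concrete, here is a minimal, dependency-free sketch of scaled dot-product attention in which visual tokens act as queries over audio tokens. It is an illustration only, not the repository's implementation: the function name `cross_modal_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the helper `rand_matrix`, and all dimensions are assumptions chosen for the example, and a real model would likely use a tensor library with learned, multi-head projections.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def cross_modal_attention(visual, audio, w_q, w_k, w_v):
    """Let visual tokens (queries) attend over audio tokens (keys and values).

    visual: (n_visual x d_model) features, audio: (n_audio x d_model) features.
    w_q, w_k, w_v: hypothetical (d_model x d_head) projection matrices.
    Returns (attended, weights) with shapes
    (n_visual x d_head) and (n_visual x n_audio).
    """
    q = matmul(visual, w_q)
    k = matmul(audio, w_k)
    v = matmul(audio, w_v)
    scale = math.sqrt(len(q[0]))  # scale scores by sqrt(d_head)
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy inputs: 4 visual tokens and 6 audio tokens, each with 8 features.
rng = random.Random(0)


def rand_matrix(rows, cols):
    """Hypothetical helper that fills a matrix with Gaussian noise."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


visual = rand_matrix(4, 8)
audio = rand_matrix(6, 8)
w_q, w_k, w_v = (rand_matrix(8, 4) for _ in range(3))

attended, weights = cross_modal_attention(visual, audio, w_q, w_k, w_v)
print(len(attended), len(attended[0]))  # 4 4
print(len(weights), len(weights[0]))    # 4 6
```

Each row of `weights` sums to 1 and describes how strongly one visual token draws on each audio token. Swapping the roles of `visual` and `audio` gives the audio-to-visual direction.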
---
## 📂 Datasets
Experiments were conducted on several benchmarks:
- **MovieNet** – narrative and stylistic structure in cinema
- **Hollywood2** – action and scene classification
- **MovieGraphs** – graph-based social interaction semantics
- **TACoS** – fine-grained visual-text alignment
- **CineArtSet (new)** – curated art film dataset (1,920 clips, 54 films, 9,458 labeled scenes)
---
## ⚙️ Installation
```bash
# Clone this repo
git clone https://github.com/<your-username>/styloformer.git
cd styloformer
# Create environment
conda create -n styloformer python=3.9
conda activate styloformer
# Install dependencies
pip install -r requirements.txt
```