CLAP: Learning Audio Concepts From Natural Language Supervision (Pretrained Model)
Description
CLAP (Contrastive Language-Audio Pretraining) is a model that learns acoustic concepts from natural language supervision and enables zero-shot inference. The model has been extensively evaluated on 26 audio downstream tasks, achieving state-of-the-art results on several of them, including classification, retrieval, and captioning.
This record provides weights for the Microsoft CLAP models published in 2022 and 2023. clapcap is the audio captioning model that uses the 2023 encoders.
Refer to the GitHub repository for the code.
microsoft/CLAP: Learning audio concepts from natural language supervision (github.com)
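The zero-shot inference mentioned above works by embedding an audio clip and a set of candidate label prompts into a shared space, then ranking labels by similarity. A minimal sketch of that scoring step, using hypothetical placeholder embeddings (in practice the model's audio and text encoders produce these vectors):

```python
import numpy as np

def zero_shot_scores(audio_emb, text_embs):
    """Rank candidate labels by cosine similarity to an audio embedding,
    then convert the similarities to probabilities with a softmax,
    mirroring the contrastive setup used in CLAP-style models."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ a                       # cosine similarity per label
    exp = np.exp(sims - sims.max())    # numerically stable softmax
    return exp / exp.sum()

# Hypothetical 4-dimensional embeddings, for illustration only.
audio = np.array([0.9, 0.1, 0.0, 0.2])
labels = np.array([
    [0.8, 0.2, 0.1, 0.1],   # e.g. "a dog barking"
    [0.0, 0.9, 0.3, 0.0],   # e.g. "rain falling"
])
probs = zero_shot_scores(audio, labels)
```

The key point is that no labeled audio is needed at inference time: any set of text prompts can serve as the candidate classes.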
Files (4.7 GB)
Name | Size | MD5 |
---|---|---|
| 2.3 GB | md5:0731ffb09d8567ba5610be34aa577a62 |
| 690.0 MB | md5:1006a9206ccb48982dfb3b46581b8a27 |
| 1.7 GB | md5:521913b023dcc38853d3c27ad177a997 |
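After downloading a checkpoint, it is worth verifying that the file matches its published MD5 checksum before loading it. This helper is a generic sketch (not part of the CLAP codebase) that streams the file so multi-gigabyte checkpoints are not read into memory at once:

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest by streaming it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage: compare against the table above, e.g. (filename hypothetical)
# md5_of("checkpoint.pth") == "0731ffb09d8567ba5610be34aa577a62"
```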