Semantic Diversity in Pretraining Corpora for Cross-Domain Multimodal Embedding Generalization
Description
Sensing human motion through Inertial Measurement Units (IMUs) embedded in personal devices has enabled significant applications in health and wellness. Labeled IMU data is scarce; however, unlabeled or weakly labeled IMU data can be used to model human motion. For video or text modalities, the "pretrain and adapt" approach uses large volumes of unlabeled or weakly labeled data to build a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. However, pretraining methods are poorly understood for IMU data, and pipelines are rarely evaluated on out-
Research goal: Does increasing the semantic diversity of pretraining corpora for embedding models improve cross-domain generalization on multimodal structured data tasks more effectively than scaling corpus size?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 7.7/10.
Notes
Files
| Name | Size | Checksum |
|---|---|---|
| paper.pdf | 87.0 kB | md5:76e89d97a17918d48efb82a89fe2a962 |