Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

2605.20674

Authors

Herman Bergström, Aditya Mehrotra, Rahul G. Krishnan

Abstract

We introduce CoMET (Composing Modality Encoders with Tabular foundation models), a simple yet highly competitive method for multimodal classification: pass each modality through a frozen pre-trained backbone, compress the resulting embeddings with PCA, and concatenate them as input to a Tabular Foundation Model (TFM) for prediction. We show that PCA alone suffices as an adaptor, yielding strong, robust performance across modalities.
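
The three steps named above (frozen embeddings, per-modality PCA, concatenation into a tabular predictor) can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions: random arrays stand in for the outputs of frozen backbones, the number of PCA components `k` is chosen arbitrarily, and a simple nearest-centroid classifier is a hypothetical stand-in for the Tabular Foundation Model, used only so that the example runs with NumPy alone.

```python
import numpy as np

def fit_pca(X, k):
    """Fit a k-component PCA on X; return the mean and the top-k directions."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def apply_pca(X, mean, components):
    """Project X onto previously fitted principal directions."""
    return (X - mean) @ components.T

def compose_features(train_mods, test_mods, k):
    """Compress each modality separately (PCA fitted on train only), then concatenate."""
    train_parts, test_parts = [], []
    for X_tr, X_te in zip(train_mods, test_mods):
        mean, comps = fit_pca(X_tr, k)
        train_parts.append(apply_pca(X_tr, mean, comps))
        test_parts.append(apply_pca(X_te, mean, comps))
    return np.hstack(train_parts), np.hstack(test_parts)

class NearestCentroid:
    """Hypothetical stand-in for a TFM: any classifier with fit/predict fits here."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared Euclidean distance from every sample to every class centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=-1)
        return self.classes_[d.argmin(axis=1)]

# Synthetic placeholders for frozen-backbone embeddings (assumption, not real data).
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=200)
y_test = rng.integers(0, 2, size=50)

def fake_embeddings(y, dim):
    # Class 1 is shifted away from class 0 so the toy task is learnable.
    return rng.normal(size=(len(y), dim)) + 2.0 * y[:, None]

image_train, image_test = fake_embeddings(y_train, 64), fake_embeddings(y_test, 64)
text_train, text_test = fake_embeddings(y_train, 32), fake_embeddings(y_test, 32)

Z_train, Z_test = compose_features(
    [image_train, text_train], [image_test, text_test], k=8
)
pred = NearestCentroid().fit(Z_train, y_train).predict(Z_test)
accuracy = float((pred == y_test).mean())
```

In a real setting, the placeholder arrays would be replaced by embeddings from the chosen frozen encoders and the stand-in classifier by a tabular foundation model exposing a similar `fit`/`predict` interface. Fitting PCA on the training split only, as done here, avoids leaking test statistics into the features.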

When the CLS tokens of the foundation model align poorly with downstream tasks, we propose PALPooling, a lightweight adaptive token pooler that consistently improves representation quality. By composing strong frozen representation learning backbones with TFMs, our approach achieves state-of-the-art results across diverse multimodal benchmarks without any training.

On hierarchical tasks with large fine-grained class spaces, our approach enables fast and scalable classification, handling datasets with over 500,000 samples and 2,000 classes without any fine-tuning. Overall, our results show that the composition of foundation models is a simple, yet powerful, out-of-the-box solution for multimodal learning, challenging the necessity of complex, end-to-end training pipelines for new problems.
