Efficient Code Embeddings from Code Generation Models

2508.21290

Authors

Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, Han Xiao

Abstract

jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling.

We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
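
To illustrate the last-token pooling mentioned in the abstract, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names (`last_token_pool`, `l2_normalize`, `cosine_similarity`), the toy hidden states, and the use of cosine similarity for comparison are all assumptions made for illustration. The idea is that an autoregressive model's final non-padding token has attended to the whole input, so its hidden state can serve as the embedding of the sequence.

```python
import math
from typing import List

Vector = List[float]


def last_token_pool(hidden_states: List[List[Vector]],
                    attention_mask: List[List[int]]) -> List[Vector]:
    """Return, for each sequence, the hidden state of its last real token.

    hidden_states: batch of sequences, each a list of per-token vectors.
    attention_mask: 1 for real tokens, 0 for padding (same shape as the batch).
    Assumes every sequence contains at least one real token.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Index of the last position whose mask is 1 (skips any padding).
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(states[last])
    return pooled


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity, computed as the dot product of unit vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy batch: two sequences of three positions with 2-dimensional hidden states.
# These numbers are made up; a real model would produce them.
hidden_states = [
    [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]],  # third position is padding
    [[0.5, 0.5], [1.0, 1.0], [0.0, 3.0]],
]
attention_mask = [[1, 1, 0], [1, 1, 1]]

embeddings = last_token_pool(hidden_states, attention_mask)
similarity = cosine_similarity(embeddings[0], embeddings[1])
```

In this toy batch the first sequence is pooled at its second position and the second at its third, so the padding vector never reaches the embedding. In a retrieval setting, a natural language query and candidate code snippets would each be embedded this way and ranked by a similarity score such as the one above.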

