To Write or to Automate Linguistic Prompts, That Is the Question


March 26, 2026
arXiv:2603.25169

Authors

Marina Sánchez-Torrón, Daria Akselrod, Jason Rauchwerk

Abstract

LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment (LQA), evaluating five model configurations.

Results are task-dependent. In terminology insertion, optimized and manual prompts yield quality that is statistically indistinguishable in most comparisons.

In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization.

Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts require no labeled data in principle, relying instead on domain expertise and iterative refinement.
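To make the contrast between the two manual starting points concrete, the following is a hypothetical sketch of a hand-crafted zero-shot expert prompt versus the kind of minimal instruction a base signature expands to before an optimizer such as GEPA rewrites it against scored gold-standard splits. Neither template is taken from the paper; both function names and prompt wordings are illustrative assumptions.

```python
def expert_prompt(source: str, src_lang: str, tgt_lang: str) -> str:
    """Hand-crafted zero-shot prompt: encodes domain expertise directly
    in the instruction, with no labeled data required."""
    return (
        f"You are a professional {src_lang}-to-{tgt_lang} translator.\n"
        f"Translate the text faithfully, preserving register, terminology,\n"
        f"and formatting. Output only the translation.\n\n"
        f"Source ({src_lang}): {source}"
    )


def base_signature_prompt(source: str, src_lang: str, tgt_lang: str) -> str:
    """Minimal signature-style instruction: the bare starting point an
    optimizer would iteratively rewrite using a gold-standard split."""
    return f"Translate from {src_lang} to {tgt_lang}.\n\nSource: {source}"


if __name__ == "__main__":
    s = "La réunion est reportée à lundi."
    print(expert_prompt(s, "French", "English"))
    print(base_signature_prompt(s, "French", "English"))
```

The expert template front-loads the domain knowledge that optimization would otherwise have to discover from labeled examples, which is the asymmetry the abstract notes.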

Details

  • © 2026 takara.ai Ltd
  • Content is sourced from third-party publications.