This is my wishlist for prompt engineering in 2025.

My current prompting workflow in 2024

If I am tasked with writing a classification prompt, it takes me at least six full engineering days.

  • Two full days are needed to understand the problem and to source real input examples.
    • I need to think about what the average user experience is (the impression sample)
    • I need to think about what the average negative case is (the negative action sample)
    • I need to think of edge cases - cases where failure will cause a disproportionate impact (for example, users who ask how many rocks they should eat in a day)
    • The product owner may give me some examples to start with, but I still need to find all these sets of examples.
    • In this period, I will not be writing any prompts.
  • Four full days are needed to search for a prompt that works.
    • I will start with the cheapest model available (currently Gemini-1.5-Flash if I am not concerned with their uptime), and choose a more expensive model if necessary.
    • I will set up a notebook to illustrate the performance on the test set (a minimal sketch of this evaluation loop follows this list).
    • I will also start with the simplest prompt possible.
    • Prompt improvement is usually done by adding few-shot examples where the model has issues with the input. Whenever I add examples from the dataset into the prompt, I need to add more examples to my dataset, so that I am not just testing on the training set.
    • From studying the outputs the model gives, I can think of more edge cases where the model could fail, and I need to find and add them.
    • Sometimes, the classification from the model disagrees with what I labelled in my dataset. I might update the dataset.
    • Sometimes, I find examples that are very borderline, where I am okay with either classification. I remove these examples from the performance evaluation.
  • Additional time is needed to analyze the product impact
    • How would the user experience change because of my prompts?
    • What is the expected rate of regrettable false positives and regrettable false negatives?
    • Are we happy enough with the performance to ship the prompts, or should I continue trying to fix the regrettable mistakes?
    • I commit these analysis scripts to the monorepo.
  • Additional time is needed if I need to redo my prompts
    • It is not always the case that the product owner or leadership agrees with my prompt and dataset.
    • Sometimes they find a sample in the dataset whose intended classification should be positive instead of negative.
    • Sometimes they find a sample in the dataset that must be classified as positive, rather than one where either classification is acceptable.
    • To avoid having to change a prompt that I have spent a lot of time on, I share constant updates on my progress with the dataset and the prompt.
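
Here is the minimal sketch of the evaluation loop mentioned above. The classify() helper is a stand-in for one call to the model provider's API with the current prompt, and the dataframe columns are my own convention, not anything the tooling requires.

import pandas as pd

def evaluate(eval_df: pd.DataFrame, classify) -> pd.DataFrame:
    # classify() is hypothetical: it sends one input to the model with the current
    # prompt and returns "positive" or "negative".
    results = eval_df.copy()
    results["prediction"] = results["text"].apply(classify)
    results["correct"] = results["prediction"] == results["label"]
    # The counts I care about when estimating regrettable mistakes.
    fp = ((results["prediction"] == "positive") & (results["label"] == "negative")).sum()
    fn = ((results["prediction"] == "negative") & (results["label"] == "positive")).sum()
    print(f"accuracy={results['correct'].mean():.2%}  false positives={fp}  false negatives={fn}")
    return results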

If I am tasked with writing a generation prompt, it also takes me at least six full engineering days.

  • Two full days are needed to understand the problem and to source real input examples.
    • I need to think about what the average user experience is (the impression sample)
    • I need to think about what the average negative case is (the negative action sample)
    • I need to think of edge cases - cases where failure will cause a disproportionate impact (for example, users who ask how many rocks they should eat in a day)
    • The product owner may give me some examples to start with, but I still need to find all these sets of examples.
    • In this period, I will not be writing any prompts.
  • Four full days are needed to search for a prompt that works.
    • I will start with the cheapest model available (currently Gemini-1.5-Flash if I am not concerned with their uptime), and choose a more expensive model if necessary.
    • I will set up a notebook to illustrate the performance on the test set.
    • I will also start with the simplest prompt possible.
    • Prompt improvement is usually done by adding few-shot examples where the model has issues with the input. Whenever I add examples from the dataset into the prompt, I need to add more examples to my dataset so that I am not just testing on the training set.
    • From studying the outputs the model gives, I can think of more edge cases where the model could produce a bad generation, and I need to find and add them.
  • Additional time is needed to analyze the product impact
    • How would the user experience change because of my prompts?
    • What is the expected rate of regrettable bad generations?
    • Are we happy enough with the performance to ship it, or should I continue to attempt to fix the regrettable mistakes?
    • I commit these analysis scripts to the monorepo.
  • Additional time is needed if I need to redo my prompts
    • It is not always the case that the product owner or leadership agrees with my prompt and dataset.
    • Sometimes they find new inputs that my prompt must get correct.
    • Sometimes they disagree with the expected output in my few shot examples.
    • To avoid having to change a prompt that I have spent a lot of time on, I share constant updates on my progress with the dataset and the prompt.
  • I usually need to write classification prompts for the generation prompts
    • Classification prompts could be used to evaluate my generation prompt (more commonly known as LLM as a judge; a minimal sketch follows this list). Maybe we can ship only if the generation prompt achieves a certain performance as measured by the classification prompt.
    • Classification prompts could be used to decide whether you should even do generation (for example, do not respond if the user asks how many rocks a human should eat)
    • Classification prompts could be used to decide whether one output is better than another. (However, in practice, writing the classifier could be much harder than writing the generation prompt.)
    • Time is needed for me to align each classification prompt, although now I spend less time finding input samples.
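
Here is the minimal sketch of the LLM-as-a-judge idea mentioned above. The call_model() and generate() helpers are hypothetical wrappers around the model provider's API, and the judging criteria and labels are placeholders; the point is that the judge is just another classification prompt whose verdicts gate the release of the generation prompt.

JUDGE_PROMPT = """You are grading a generated reply.
User input: {user_input}
Generated reply: {generation}
Answer with exactly one word: GOOD if the reply is helpful and safe, BAD otherwise."""

def judge(user_input: str, generation: str, call_model) -> bool:
    # call_model() is a hypothetical helper that sends the filled-in judge prompt to the model.
    verdict = call_model(JUDGE_PROMPT.format(user_input=user_input, generation=generation))
    return verdict.strip().upper().startswith("GOOD")

def pass_rate(inputs, generate, call_model) -> float:
    # Ship the generation prompt only if this rate clears an agreed bar.
    verdicts = [judge(x, generate(x), call_model) for x in inputs]
    return sum(verdicts) / len(verdicts)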

Note the following patterns in my workflow

  • I am open to including data from the test set in the few-shot examples. Of course, some samples in my test set should not appear in the few-shot prompt (see the sketch after this list).
  • The test set I compiled initially is negotiable. I can add / remove / change samples in the test set.
  • My prompt iterations are manual, for now.
  • I work on notebooks, calling model provider APIs directly.
  • I do not share metrics without context; I share the performance samples.
  • I keep my product owner aligned so I am not prompting for the wrong objective.
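
The first two patterns boil down to tracking which samples have been promoted into the few-shot prompt, so that the evaluation only scores the held-out remainder. A minimal sketch, where the in_prompt column is my own convention:

# df is the labelled dataset; in_prompt marks samples copied into the few-shot prompt.
few_shot_examples = df[df["in_prompt"]]
eval_set = df[~df["in_prompt"]]
assert not eval_set.empty, "every sample was promoted into the prompt; source more examples"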

The best prompting trick I have learnt is to print the results to HTML. This way, the product owner can see what the predictions actually look like. If we want to annotate the predictions, it is easy to copy from the HTML table to Google Sheets.

import os
from IPython.display import display, HTML

# Write the predictions (a pandas DataFrame df) to an HTML file,
# converting newlines to <br> so that multi-line text renders.
output_table_file_name = "output.html"
with open(output_table_file_name, 'w') as f:
    f.write('<meta charset="UTF-8">\n' + df.replace({r'\n': '<br>'}, regex=True).to_html(index=False, escape=False))

# Map the notebook's working directory to its public URL and display a clickable link to the table.
filepath = os.getcwd().replace("/efs/notebooks", "https://redacted.company.net/") + "/" + output_table_file_name
link = f'<a href="{filepath}" target="_blank">{filepath}</a>'
display(HTML(link))

The second best prompting trick I have learnt is to download from Google Sheets (this requires the sheet to be accessible by anyone with the URL). This way, my notebook can quickly receive updates from the product owner.

%%bash
curl -L -o llm_topic_eval_set.csv \
"https://docs.google.com/spreadsheets/d/redacted_RandomSetOfAlphaNumericCharactersWithUnderscores/export?exportFormat=csv"

This summarizes my prompting workflow in 2024.

Model providers should automate prompting

In 2025, I hope to pass the model providers my classification dataset so that they can write me a prompt that I can just use for their model.

Currently, few-shot prompting achieves this effect, but there is no guarantee that my few-shot prompt is the best among all possible prompts. I need to communicate to my product owner, with confidence and conscience and with my engineering reputation at stake, that I have tried my best.
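
Concretely, a few-shot prompt is just a mechanical assembly of labelled examples, which is exactly the part I would like the provider to search over for me. A minimal sketch, with all names illustrative:

def build_few_shot_prompt(task_description: str, examples: list[dict], query: str) -> str:
    # Each example is a dict with "text" and "label" keys; selecting and ordering the
    # examples is the part there is no guarantee I have done optimally.
    shots = "\n\n".join(f"Input: {ex['text']}\nLabel: {ex['label']}" for ex in examples)
    return f"{task_description}\n\n{shots}\n\nInput: {query}\nLabel:"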

Since my company has spent enough money with Anthropic, I can ask one of their friendly solution engineers to come up with a better prompt than mine. This way less of my engineering reputation is at stake. Another strategy is to get the product owner to try prompting to see if they can produce a better prompt. Ultimately, my job as a prompt engineer is not to write prompts, but to measure how bad the prompts are.

I hope Anthropic can provide an automated prompt engineering tool that is better than their solution engineers.

While Anthropic has a prompt optimization tool, it is still more convenient for me to write my own prompts. I have tried to build an automated prompt engineer for one of my datasets, but the resulting performance was worse than random. I think OpenAI already has plans to provide an automated prompting solution; their Evals tool asks the developer for permission to share their datasets. Cohere has a prompt tuner feature, but you have to use their models, and it seems to be aimed at generation.

I think the models could be improved too. The post-training dataset could include more prompting scenarios so that the model is more prompt-able. Models could also be trained with reinforcement fine-tuning so that the model is better at writing prompts.

The aligned prompt optimizer should be provider-agnostic

While the Anthropic solution engineers are friendly, their interests do not exactly align with mine.

If I want to switch from using OpenAI models to Anthropic models for a particular use case, the Anthropic engineers will be very interested to help me. However, if I want to switch from Claude-3.5-Sonnet to Claude-3.5-Haiku for cost reasons, our interests might not align. I am interested in reducing costs, whereas Anthropic is interested in revenue.

The prompt optimizer should be aligned to me. Instead of spending an hour of my salaried time on a prompt iteration, I think the company is happy to pay a prompt optimizer ten dollars for every prompt iteration. I do not need to put my engineering reputation at stake. The prompt optimizer could be incentivized by only charging if they manage to find a prompt that is much better than mine.

When the job is classification, the job is classification. It should not matter which model provider I use, or what techniques I employ. I should only care about the cost-performance tradeoff. Therefore, the prompt optimizer that aligns with my interest should be provider-agnostic.

Solutions on the cost-performance frontier may not be just prompt engineering. Maybe with enough data, we can distill a closed-source model into another closed-source model. Maybe with even more data, we can fine-tune a BERT model for classification. Maybe, after all, the solution with the best tradeoff is still a small generic model served at scale.
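
As a rough sketch of the BERT option: collect labels from the closed-source model on a large pool of inputs, then fine-tune a small classifier on them. The snippet below uses Hugging Face Transformers as one plausible way to do this; load_distillation_data() is a hypothetical helper and every hyperparameter is a placeholder.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# texts: raw inputs; labels: 0/1 labels produced by the closed-source model.
texts, labels = load_distillation_data()  # hypothetical helper

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"), batched=True)
split = dataset.train_test_split(test_size=0.2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilled-classifier", num_train_epochs=3),
    train_dataset=split["train"],
    eval_dataset=split["test"],
)
trainer.train()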

There is still a role for model providers here. Model providers are incentivized to work with the provider-agnostic prompt optimizers to make sure that prompts for their models are correctly optimized.

In 2025, I hope to do less prompt engineering. It is a pain. AI should write prompts for me; I will just tell the AI what we are prompting for.