In the rapidly evolving world of artificial intelligence, large language models (LLMs) like Gemini, GPT, Claude, and open-source models such as Gemma or LLaMA have become powerful tools for tasks ranging from text generation to code writing. However, the key to unlocking their full potential lies in prompt engineering—the art and science of designing high-quality prompts to guide LLMs toward accurate and relevant outputs. The whitepaper Prompt Engineering by Lee Boonstra, published in February 2025, provides an in-depth exploration of this iterative process, detailing techniques, configurations, and best practices to help anyone—data scientist or not—craft effective prompts. This blog post summarizes the key insights from the whitepaper, offering a practical guide to mastering prompt engineering.
What is Prompt Engineering?
At its core, prompt engineering is about designing inputs (prompts) that steer LLMs to produce desired outputs. LLMs function as prediction engines, generating the next token in a sequence based on patterns learned during training. A well-crafted prompt sets the stage for the model to predict the correct sequence of tokens, whether for text summarization, code generation, or answering questions. However, crafting effective prompts is not straightforward. Factors like word choice, tone, structure, context, and model configuration all influence the output. Inadequate prompts can lead to ambiguous or inaccurate responses, making prompt engineering an iterative process of trial, refinement, and evaluation.
LLM Output Configuration
Before diving into prompting techniques, it’s crucial to understand how to configure an LLM’s output. The whitepaper emphasizes several key settings:
- Output Length: The number of tokens generated affects computation time, energy consumption, and costs. Limiting output length is essential for techniques like ReAct, where excessive tokens can lead to irrelevant text. However, simply reducing token limits doesn’t make the output more concise—it stops generation once the limit is reached, requiring careful prompt design to ensure succinctness.
- Sampling Controls: LLMs predict probabilities for the next token and use sampling to select one. Three key settings control this process:
- Temperature: This controls randomness. A low temperature (e.g., 0) produces deterministic outputs, ideal for tasks with a single correct answer, like math problems. Higher temperatures (e.g., 0.9) yield more creative, diverse responses but may reduce coherence.
- Top-K Sampling: This selects the top K most likely tokens, balancing creativity and coherence. A low K (e.g., 1) mimics greedy decoding, while a higher K increases diversity.
- Top-P Sampling (Nucleus Sampling): This selects tokens whose cumulative probability exceeds a threshold P (0 to 1). A lower P focuses on high-probability tokens, while a higher P allows more variety.
The interplay of these settings is critical. For example, a temperature of 0 makes top-K and top-P irrelevant because the highest-probability token is always selected, and setting top-K to 1 or top-P close to 0 similarly overrides the other controls. The whitepaper suggests starting points of temperature 0.2, top-P 0.95, and top-K 30 for coherent yet moderately creative outputs, or temperature 0 for tasks with a single correct answer.
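To make these settings concrete, here is a minimal sketch of how they might be wired into a generation call. The google-genai Python client, the parameter names, and the model name are assumptions for illustration; most LLM SDKs expose equivalent knobs.

```python
# A minimal sketch of passing output-length and sampling settings to a
# generation call (assumes the google-genai client and an API key in the
# environment; parameter and model names may differ in your SDK).
from google import genai
from google.genai import types

client = genai.Client()

# Whitepaper-style starting point: relatively coherent, mildly creative output.
config = types.GenerateContentConfig(
    temperature=0.2,        # low randomness
    top_p=0.95,             # nucleus sampling threshold
    top_k=30,               # consider only the 30 most likely tokens
    max_output_tokens=256,  # hard cap on length, not a guarantee of brevity
)

response = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed model name; substitute your own
    contents="Summarize prompt engineering in three sentences.",
    config=config,
)
print(response.text)
```

For a task with a single correct answer, the same call with temperature set to 0 makes the output effectively deterministic.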
Prompting Techniques
The whitepaper outlines several prompting techniques, each suited to different tasks:
- Zero-Shot Prompting: The simplest approach, providing a task description without examples. For instance, classifying a movie review as positive, neutral, or negative (e.g., Table 1 in the whitepaper) can be done with a single instruction. However, zero-shot prompts may struggle with complex tasks due to their lack of guidance.
- One-Shot and Few-Shot Prompting: These provide one or multiple examples to guide the model. Few-shot prompting (e.g., parsing pizza orders into JSON, as in Table 2) is particularly effective for establishing patterns, especially for structured outputs. The whitepaper recommends 3–5 examples for most tasks, with diverse, high-quality examples to handle edge cases; a sketch of such a prompt follows this list.
- System, Contextual, and Role Prompting:
- System Prompting: Defines the model’s overarching purpose, such as returning output in a specific format (e.g., JSON, as in Table 4). It’s useful for enforcing structure and reducing hallucinations.
- Contextual Prompting: Provides task-specific details to improve relevance, like suggesting blog topics for retro arcade games (Table 7).
- Role Prompting: Assigns a persona (e.g., travel guide, as in Table 5) to tailor tone and style. For example, a humorous travel guide prompt for Manhattan (Table 6) yields playful suggestions.
- Step-Back Prompting: This technique prompts the model to consider a general question before tackling a specific task, activating broader knowledge. For example, identifying key settings for a first-person shooter game (Table 9) before writing a storyline (Table 10) results in richer, more accurate outputs.
- Chain of Thought (CoT): CoT encourages LLMs to generate intermediate reasoning steps, improving accuracy for complex tasks like math problems (Tables 12 and 13). It’s low-effort, interpretable, and robust across model versions but increases token usage.
- Self-Consistency: This enhances CoT by sampling multiple reasoning paths at a high temperature and selecting the most common answer (Table 14). It improves accuracy but is computationally expensive; a sketch of the voting loop follows this list.
- Tree of Thoughts (ToT) and ReAct: ToT generalizes CoT by exploring multiple reasoning paths at once, which suits complex, exploratory tasks, while ReAct (reason and act) interleaves reasoning with actions such as calling external tools or search, making it a building block for agent-style workflows.
- Code Prompting: LLMs can generate, explain, translate, or debug code. For example, a Bash script that renames files can be translated to Python (Table 18) or debugged and reviewed. Always test generated code, as LLMs may reproduce errors from their training data; a sketch of such a translated script follows this list.
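To ground the few-shot technique, here is a sketch of a prompt in the spirit of the whitepaper's pizza-order example (Table 2). The wording and JSON fields are illustrative rather than the whitepaper's verbatim text; the point is that worked examples establish the output pattern before the real input.

```python
# A few-shot prompt sketch: two worked examples set the JSON pattern,
# then the real order follows. Send this to your model with a low
# temperature (e.g., 0.1) so the structure stays stable across runs.
FEW_SHOT_PROMPT = """Parse a customer's pizza order into valid JSON.

EXAMPLE:
I want a small pizza with cheese, tomato sauce, and pepperoni.
JSON Response:
{"size": "small", "type": "normal", "ingredients": ["cheese", "tomato sauce", "pepperoni"]}

EXAMPLE:
Can I get a large pizza, half cheese and half mushroom?
JSON Response:
{"size": "large", "type": "half-half", "ingredients": [["cheese"], ["mushroom"]]}

Now, I would like a large pizza with tomato sauce, basil and mozzarella.
JSON Response:
"""

print(FEW_SHOT_PROMPT)
```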
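For chain of thought with self-consistency, the recipe is to sample the same reasoning prompt several times at a higher temperature and take a majority vote over the final answers. The sketch below assumes a hypothetical ask_model() helper standing in for whatever client you use; the age puzzle mirrors the whitepaper's CoT example (Tables 12 and 13).

```python
# Self-consistency over chain of thought: sample several reasoning paths,
# extract each final answer, and keep the most common one.
import re
from collections import Counter

COT_PROMPT = (
    "When I was 3 years old, my partner was 3 times my age. Now I am 20 "
    "years old. How old is my partner? Let's think step by step, then end "
    "with 'Answer: <number>'."
)

def ask_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in: call your LLM client and return its text reply."""
    raise NotImplementedError("Plug in your LLM client here.")

def self_consistent_answer(prompt: str, samples: int = 5) -> str:
    answers = []
    for _ in range(samples):
        reply = ask_model(prompt, temperature=0.9)  # encourage diverse paths
        match = re.search(r"Answer:\s*(\S+)", reply)
        if match:
            answers.append(match.group(1))
    # Majority vote over the extracted final answers.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```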
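And for code prompting, the whitepaper's translation example turns a Bash file-renaming script into Python (Table 18). The snippet below is a sketch of what such a translated script typically looks like, not the whitepaper's exact output; as the whitepaper stresses, read and test generated code before running it on real data.

```python
# Sketch of a translated file-renaming script: ask for a folder and prepend
# "draft_" to every file name inside it. Illustrative only; test on a copy
# of your data first.
import os
import shutil

folder = input("Enter the folder name: ")

if not os.path.isdir(folder):
    print(f"Folder '{folder}' does not exist.")
else:
    for name in os.listdir(folder):
        old_path = os.path.join(folder, name)
        if os.path.isfile(old_path):
            new_path = os.path.join(folder, f"draft_{name}")
            shutil.move(old_path, new_path)  # rename in place
    print("Files renamed successfully.")
```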
Best Practices for Prompt Engineering
The whitepaper offers actionable best practices to refine prompts:
- Provide Examples: One-shot or few-shot examples act as teaching tools, improving output accuracy and style.
- Design with Simplicity: Use clear, concise language and action-oriented verbs (e.g., “Generate,” “Classify”) to avoid confusion.
- Be Specific About Output: Specify format, style, or content (e.g., JSON) to focus the model’s response.
- Use Instructions Over Constraints: Positive instructions (e.g., “Generate a blog post about consoles, including company and sales”) are more effective than constraints (e.g., “Don’t list game names”).
- Control Token Length: Set max token limits or request specific lengths (e.g., “in a tweet-length message”).
- Use Variables: Make prompts reusable with variables (e.g., city names in Table 20); see the template sketch after this list.
- Experiment with Formats and Styles: Test different prompt structures (e.g., questions vs. instructions) and styles (e.g., humorous vs. formal).
- Mix Classes in Few-Shot Classification: Mix up the response classes across your examples so the model learns the features of each class rather than the order of the examples, which helps it generalize to unseen inputs.
- Adapt to Model Updates: Stay updated on model changes and test prompts in tools like Vertex AI Studio.
- Use Structured Outputs: JSON or XML reduces hallucinations and ensures consistency (Table 4).
- JSON Repair and Schemas: Use libraries like json-repair to fix truncated JSON output, and JSON Schemas to structure inputs (Snippets 5 and 6); a repair example follows this list.
- Collaborate and Document: Work with other prompt engineers and document attempts (Table 21) to track performance and refine iteratively.
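The variables practice amounts to keeping the prompt as a template and substituting values at call time. Here is a minimal sketch in the spirit of Table 20; the travel-guide wording is illustrative.

```python
# A reusable prompt template: one prompt, many inputs (in the spirit of
# Table 20's {city} variable).
PROMPT_TEMPLATE = "You are a travel guide. Tell me a fact about the city: {city}"

for city in ["Amsterdam", "Manhattan", "Tokyo"]:
    prompt = PROMPT_TEMPLATE.format(city=city)
    # send `prompt` to your model here
    print(prompt)
```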
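For the JSON repair practice, here is a sketch assuming the json-repair package from PyPI (imported as json_repair). It is handy when generation stops at the token limit mid-object and the raw output no longer parses.

```python
# Repairing truncated JSON output, assuming the json-repair package
# (pip install json-repair).
import json
from json_repair import repair_json

truncated = '{"size": "large", "ingredients": ["cheese", "tomato sauce"'
fixed = repair_json(truncated)   # closes the unterminated array and object
order = json.loads(fixed)
print(order["ingredients"])
```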
Challenges and Considerations
Prompt engineering isn't without challenges. Inadequate prompts can lead to ambiguous or incorrect outputs, and the "repetition loop bug" (where the model gets stuck repeating the same words or phrases) can appear at both very low and very high temperature settings. Multimodal prompting, which combines text with images, audio, or other modalities, is also emerging but requires models with those capabilities. Additionally, JSON output, while structured, is token-heavy and prone to truncation when the token limit is reached, which is where tools like json-repair come in.
Conclusion
Prompt engineering is a critical skill for leveraging LLMs effectively. By understanding model configurations, mastering techniques like zero-shot, few-shot, CoT, and role prompting, and following best practices, anyone can craft prompts that deliver accurate, relevant outputs. The iterative nature of prompt engineering—testing, refining, and documenting—ensures continuous improvement. Whether you’re generating text, code, or structured data, the principles outlined in Boonstra’s whitepaper provide a roadmap to becoming a proficient prompt engineer. Experiment, document, and iterate to unlock the full power of LLMs in your projects.