Learn proven techniques for writing effective prompts that consistently produce high-quality results from LLMs. Includes practical examples, templates, and testing strategies for production applications.
Prompt engineering is the most accessible yet most impactful skill for working with large language models. While model architectures and training are the domain of AI researchers, anyone can learn to write prompts that unlock 10x better performance from the same model.
The difference between a mediocre prompt and an excellent one can be dramatic: a poorly worded prompt might produce inconsistent, unreliable results 50% of the time, while a well-engineered prompt can achieve 95%+ accuracy on the same task. This isn't about luck or trial-and-error—it's about understanding how LLMs process instructions and applying proven techniques.
In this comprehensive guide, you'll learn the fundamental principles of effective prompts, master advanced techniques like few-shot learning and chain-of-thought reasoning, and build a systematic approach to testing and refining prompts for production use. Whether you're building chatbots, content generators, data extractors, or any AI application, these skills are essential.
Before diving into advanced techniques, let's establish the core principles that make prompts effective.
A well-engineered prompt has six essential components: a clear role or context, a specific task, explicit constraints, a defined output format, representative examples, and clearly delimited input.
Most production prompts follow this structure:
[ROLE/CONTEXT]
You are an expert customer service assistant for TechCorp,
a B2B software company specializing in project management tools.
[TASK]
Analyze the customer's message and provide a helpful, professional response.
[CONSTRAINTS]
- Keep responses under 150 words
- Always acknowledge the customer's concern
- Never promise features that don't exist
- Escalate to human if you detect anger or complex technical issues
[FORMAT]
Response format:
- Acknowledgment: [Brief empathy statement]
- Solution: [Steps or information]
- Next steps: [What the customer should do]
[EXAMPLES]
Example 1:
Customer: "I can't export my data to CSV"
Response:
Acknowledgment: I understand how frustrating export issues can be.
Solution: To export to CSV: 1) Click Reports, 2) Select your data, 3) Choose "Export" → "CSV"
Next steps: Try these steps and let me know if you need further assistance.
[INPUT]
Customer message: {{user_message}}
[OUTPUT]

This structure ensures consistency and quality across diverse inputs.
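If you assemble prompts programmatically, this structure maps naturally onto a small builder function. The sketch below is purely illustrative; the section constants and the build_prompt helper are not part of any library:

```python
# Illustrative sketch: assembling the standard sections into one prompt string.
ROLE = (
    "You are an expert customer service assistant for TechCorp, "
    "a B2B software company specializing in project management tools."
)
TASK = "Analyze the customer's message and provide a helpful, professional response."
CONSTRAINTS = (
    "- Keep responses under 150 words\n"
    "- Always acknowledge the customer's concern\n"
    "- Never promise features that don't exist\n"
    "- Escalate to human if you detect anger or complex technical issues"
)
FORMAT = (
    "Response format:\n"
    "- Acknowledgment: [Brief empathy statement]\n"
    "- Solution: [Steps or information]\n"
    "- Next steps: [What the customer should do]"
)

def build_prompt(user_message: str, examples: str = "") -> str:
    """Join the role, task, constraints, format, optional examples, and input."""
    sections = [ROLE, TASK, CONSTRAINTS, FORMAT]
    if examples:
        sections.append(examples)
    sections.append(f"Customer message: {user_message}")
    return "\n\n".join(sections)
```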
Avoid these mistakes that plague poorly-engineered prompts:
❌ Vague instructions:
Summarize this article.

✅ Specific instructions:

Summarize this article in 3 bullet points, each under 20 words, focusing on key business implications for Australian SMEs.

❌ Implicit expectations:

Extract the important information from this email.

✅ Explicit format:
Extract from this email:
- Sender name
- Main request or question
- Deadline (if mentioned)
- Priority level (High/Medium/Low based on urgency)
Format as JSON.

❌ Assuming context:

Is this a good idea?

✅ Providing context:

As a cybersecurity expert reviewing this proposed authentication system for a fintech app handling sensitive financial data, evaluate whether this approach meets industry security standards. Consider: data encryption, multi-factor authentication, compliance requirements (APRA CPS 234), and potential vulnerabilities.

Few-shot learning is one of the most powerful prompt engineering techniques. Instead of just describing what you want, you show the model examples of correct outputs.
| Method | Examples | Best For | Effectiveness |
|---|---|---|---|
| Zero-shot | 0 examples, just instructions | Simple, well-defined tasks | Baseline performance |
| Few-shot | 1-5 examples | Most tasks - optimal cost/performance balance | 80% of benefit with minimal examples |
| Many-shot | 10+ examples | Complex, nuanced tasks with subtle distinctions | Marginal gains, higher cost |
For most tasks, 2-5 well-chosen examples provide 80% of the benefit.
Your examples should be representative of real inputs, cover tricky cases such as missing information and informal language, and demonstrate the exact output format you expect.
Let's extract structured data from unstructured text:
Extract meeting details from messages. Output as JSON.
Example 1:
Input: "Let's meet next Tuesday at 2pm in Conference Room B to discuss Q4 planning. Sarah and Mike should join."
Output:
{
"date": "next Tuesday",
"time": "2pm",
"location": "Conference Room B",
"topic": "Q4 planning",
"attendees": ["Sarah", "Mike"]
}
Example 2:
Input: "Quick sync tomorrow morning? 9am works for me."
Output:
{
"date": "tomorrow",
"time": "9am",
"location": null,
"topic": "quick sync",
"attendees": []
}
Example 3:
Input: "Can we reschedule our Friday budget review? Something urgent came up."
Output:
{
"date": "Friday (reschedule requested)",
"time": null,
"location": null,
"topic": "budget review",
"attendees": []
}
Now extract from this message:
"Team standup Monday 10:30am via Zoom. John, Lisa, and Tom please join to discuss the client deliverables."
Output:

The examples show the model how to handle missing information, informal language, and different formats.
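In code, a few-shot prompt is simply the instruction, the worked examples, and the new input concatenated in a consistent layout. Here is a minimal sketch of such a builder (the build_few_shot_prompt helper is illustrative; the same name is reused in the dynamic-selection snippet below):

```python
import json

def build_few_shot_prompt(examples: list[dict], new_input: str) -> str:
    """Concatenate the instruction, worked examples, and the new input.

    Each example is a dict with an "input" string and an "output" dict.
    """
    parts = ["Extract meeting details from messages. Output as JSON.\n"]
    for i, example in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\n"
            f'Input: "{example["input"]}"\n'
            f"Output:\n{json.dumps(example['output'], indent=2)}\n"
        )
    parts.append(f'Now extract from this message:\n"{new_input}"\nOutput:')
    return "\n".join(parts)
```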
For advanced applications, select examples dynamically based on the input:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def get_relevant_examples(query, example_pool, embeddings, top_k=3):
    """Retrieve most similar examples for few-shot prompting."""
    # Get query embedding
    query_embedding = embed_text(query)
    # Calculate similarity to all examples
    similarities = cosine_similarity([query_embedding], embeddings)[0]
    # Get top-k most similar
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [example_pool[i] for i in top_indices]

# Build prompt with dynamically selected examples
relevant_examples = get_relevant_examples(user_query, example_database, example_embeddings)
prompt = build_few_shot_prompt(relevant_examples, user_query)

This ensures examples are always relevant to the current input, improving performance on diverse tasks.
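The embed_text call above is left undefined; it stands in for whatever embedding service you use, and example_database / example_embeddings are assumed to be precomputed the same way. One possible implementation, assuming the OpenAI embeddings API and the text-embedding-3-small model:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_text(text: str) -> list[float]:
    """Return an embedding vector for the given text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
```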
Chain-of-thought (CoT) prompting dramatically improves performance on reasoning tasks by asking the model to show its work.
Instead of asking for just the final answer, you prompt the model to explain its reasoning step-by-step. This:
Without CoT:
Q: A store has 15 apples. They sell 6 apples and then receive a delivery of 8 apples. How many apples do they have?
A: 17 apples

With CoT:
Q: A store has 15 apples. They sell 6 apples and then receive a delivery of 8 apples. How many apples do they have?
Let me think through this step by step:
1. Starting amount: 15 apples
2. After selling 6: 15 - 6 = 9 apples
3. After receiving 8: 9 + 8 = 17 apples
A: The store has 17 apples.

The second approach is more reliable and verifiable.
Add "think step by step" instructions to your prompts:
Analyze this customer support ticket and determine:
1. Issue category (Billing / Technical / Account / Other)
2. Priority level (Critical / High / Medium / Low)
3. Suggested assignment (Support / Engineering / Finance)
Ticket: "I was charged twice for my subscription this month. This is the third billing error in two months. I need a refund immediately and I'm considering canceling."
Think through this step by step:
Step 1: Identify the core issue
- Customer was double-charged
- This is a recurring problem (third error)
- Customer is frustrated and considering leaving
Step 2: Categorize
- Primary category: Billing (double charge)
- Secondary aspect: Account (retention risk)
Step 3: Assess priority
- Financial impact: Moderate (double charge on one account)
- Customer sentiment: High frustration, retention risk
- Recurrence: Pattern of errors suggests systemic issue
- Priority: HIGH (not critical as service still works, but urgent due to retention risk)
Step 4: Determine assignment
- Requires immediate billing investigation
- May need Finance for refund processing
- Support should handle customer communication
- Suggested assignment: Support (primary) + Finance (secondary)
Final Answer:
Category: Billing
Priority: High
Assignment: Support + Finance escalation
Reasoning: Double charge with retention risk requires immediate attention from both Support (customer relationship) and Finance (refund processing).

Zero-Shot CoT: Simply add "Let's think step by step" or "Let's approach this systematically"
Question: If a company's revenue grew by 20% in Q1 and 15% in Q2, what is the total growth for H1?
Let's think step by step:

Few-Shot CoT: Provide examples with reasoning shown
Question: If a company's revenue was $100M and grew by 10% in Q1, what's the Q1 revenue?
Let's think step by step:
1. Starting revenue: $100M
2. Growth amount: 10% of $100M = $10M
3. Q1 revenue: $100M + $10M = $110M
Answer: $110M
Question: If a company's revenue grew by 20% in Q1 and 15% in Q2, what is the total growth for H1?
Let's think step by step:

| Use Case | Recommend CoT? | Reason |
|---|---|---|
| Mathematical reasoning and calculations | ✅ Yes | Forces step-by-step logic, catches errors |
| Multi-step decision making | ✅ Yes | Makes reasoning transparent and auditable |
| Complex classifications with multiple criteria | ✅ Yes | Helps model consider all factors systematically |
| Tasks requiring audit trails | ✅ Yes | Provides verifiable reasoning path |
| Accuracy-critical tasks (speed less important) | ✅ Yes | Improves accuracy at cost of latency |
| Simple classifications (sentiment, spam detection) | ❌ No | Overkill - adds latency without benefit |
| Direct information retrieval | ❌ No | No reasoning required |
| Template filling or formatting | ❌ No | Mechanical task, no complex logic |
| Latency-critical applications | ❌ No | CoT adds 30-50% latency overhead |
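Wiring chain-of-thought into an application is mostly prompt plumbing: append the step-by-step instruction, then parse the final answer out of the reasoning. A minimal sketch, assuming an llm object with a complete(prompt) method and the "Final Answer:" convention used in the examples above:

```python
import re

def complete_with_cot(llm, question: str) -> tuple[str, str]:
    """Zero-shot CoT: request step-by-step reasoning, then extract the final answer."""
    prompt = (
        f"{question}\n\n"
        "Let's think step by step, then finish with a line that starts with 'Final Answer:'."
    )
    reasoning = llm.complete(prompt)
    match = re.search(r"Final Answer:\s*(.+)", reasoning)
    answer = match.group(1).strip() if match else reasoning.strip()
    return answer, reasoning  # keep the full reasoning for audit trails
```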
Beyond few-shot and chain-of-thought, several advanced techniques can further improve prompt effectiveness.
Assign the model a specific role or persona to bias outputs toward desired expertise:
You are a senior cybersecurity analyst with 15 years of experience in financial services. You specialize in threat detection and incident response.
Analyze this security log entry and determine if it represents a potential threat:
[log entry]
Consider: attack patterns, normal vs. anomalous behavior, and severity.

Role prompting works because models are trained on diverse internet text, including domain-specific content. Asking the model to adopt an expert role biases its responses toward that domain's knowledge and reasoning patterns.
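In chat-style APIs, the role usually belongs in the system message so the persona persists across every turn. A sketch using the OpenAI chat API (the model name and wrapper function are illustrative choices):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def analyze_log_entry(log_entry: str) -> str:
    """Role prompting via the system message so the expert persona persists."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a senior cybersecurity analyst with 15 years of experience "
                    "in financial services. You specialize in threat detection and "
                    "incident response."
                ),
            },
            {
                "role": "user",
                "content": (
                    "Analyze this security log entry and determine if it represents "
                    f"a potential threat:\n\n{log_entry}\n\n"
                    "Consider: attack patterns, normal vs. anomalous behavior, and severity."
                ),
            },
        ],
    )
    return response.choices[0].message.content
```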
Explicitly state what the model should NOT do:
Generate a professional email response to this customer complaint.
MUST include:
- Acknowledgment of their concern
- Clear next steps
- Timeline for resolution
MUST NOT include:
- Apologies that admit fault or liability
- Promises of specific outcomes we can't guarantee
- Technical jargon the customer won't understand
- Generic "we value your feedback" statements

For extracting structured data, specify the exact JSON schema:
Extract product information from this description.
Output must be valid JSON matching this schema:
{
"name": string,
"price": number,
"currency": string,
"features": string[],
"inStock": boolean,
"category": string
}
Description: "The ErgoChair Pro is our premium office chair, featuring lumbar support, adjustable armrests, and breathable mesh. Currently available for $499 AUD. Perfect for home offices."
JSON:

Many modern models support JSON mode natively; enable it for guaranteed valid JSON:
# OpenAI
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"}
)

# Anthropic Claude
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": prompt}],
    # Include "output valid JSON" in your prompt for best results
)

Generate multiple responses and use majority voting or consistency checks:
async def get_consistent_answer(prompt, num_samples=5):
    """Generate multiple responses and return most consistent answer."""
    responses = []
    # Generate multiple completions
    for _ in range(num_samples):
        response = await llm.complete(prompt, temperature=0.7)
        responses.append(response)
    # Extract final answers
    answers = [extract_answer(r) for r in responses]
    # Return most common answer
    from collections import Counter
    return Counter(answers).most_common(1)[0][0]

This is particularly effective for tasks where you need high reliability and can afford extra API calls.
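The extract_answer call above is a placeholder; for majority voting to work, each completion has to be reduced to a comparable answer string. One possible implementation, assuming responses end with an "Answer:" line as in the CoT examples:

```python
import re

def extract_answer(response: str) -> str:
    """Normalize a completion to a short answer string for majority voting."""
    match = re.search(r"(?:Final\s+)?Answer:\s*(.+)", response, re.IGNORECASE)
    if match:
        answer = match.group(1)
    else:
        # Fall back to the last non-empty line if no explicit answer marker is found
        lines = [line for line in response.splitlines() if line.strip()]
        answer = lines[-1] if lines else response
    return answer.strip().lower()
```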
Combine reasoning with action-taking for complex workflows:
You can use these tools:
- SEARCH[query]: Search the knowledge base
- CALCULATE[expression]: Perform calculations
- EMAIL[recipient, message]: Send an email
Solve this task by alternating between Thought, Action, and Observation.
Task: A customer is asking about their refund status for order #12345.
Thought: I need to look up the order status first.
Action: SEARCH[order #12345 status]
Observation: Order #12345 was refunded on 2025-01-15, $156.99 to card ending in 4532.
Thought: I have the refund details. The refund was processed 5 days ago. I should explain refund processing times.
Action: SEARCH[credit card refund processing time]
Observation: Credit card refunds typically take 5-10 business days to appear.
Thought: I now have all information needed to respond.
Action: EMAIL[customer, "Your refund of $156.99 for order #12345 was processed on Jan 15. It typically takes 5-10 business days to appear on your card ending in 4532. If you don't see it by Jan 27, please contact us."]
Observation: Email sent successfully.
Task complete.

Break complex tasks into multiple sequential prompts:
# Step 1: Extract key information
prompt1 = "Extract all dates, people, and action items from this meeting transcript: {transcript}"
extraction = llm.complete(prompt1)
# Step 2: Summarize decisions
prompt2 = f"Based on these extracted details: {extraction}, summarize the key decisions made."
summary = llm.complete(prompt2)
# Step 3: Generate action items
prompt3 = f"Based on this summary: {summary}, create a formatted list of action items with owners and deadlines."
action_items = llm.complete(prompt3)
# Step 4: Draft follow-up email
prompt4 = f"Write a professional follow-up email summarizing these action items: {action_items}"
email = llm.complete(prompt4)

Each step produces cleaner outputs than trying to do everything in one complex prompt.
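A useful refinement is to check each intermediate output before passing it to the next step, so a failed extraction doesn't silently poison the rest of the chain. A sketch of that pattern (the specific checks are illustrative):

```python
def summarize_meeting(llm, transcript: str) -> str:
    """Prompt chain with a lightweight check between steps."""
    extraction = llm.complete(
        f"Extract all dates, people, and action items from this meeting transcript: {transcript}"
    )
    if not extraction.strip():
        raise ValueError("Extraction step returned nothing; stopping the chain early.")

    summary = llm.complete(
        f"Based on these extracted details: {extraction}, summarize the key decisions made."
    )
    action_items = llm.complete(
        f"Based on this summary: {summary}, create a formatted list of action items "
        "with owners and deadlines."
    )
    # Only draft the follow-up email once the earlier steps have produced usable output
    return llm.complete(
        f"Write a professional follow-up email summarizing these action items: {action_items}"
    )
```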
Writing prompts is iterative. Systematic testing and refinement separates amateur prompt engineering from production-ready systems.
Create a diverse set of test cases covering:
test_cases = [
    # Happy path
    {
        "input": "I need to return a product I bought last week.",
        "expected_category": "Returns",
        "expected_priority": "Medium",
        "expected_sentiment": "Neutral"
    },
    # Edge case
    {
        "input": "Can I return something I bought 6 months ago but never opened?",
        "expected_category": "Returns",
        "expected_priority": "Low",
        "expected_sentiment": "Neutral",
        "notes": "Outside normal return window"
    },
    # Adversarial
    {
        "input": "Your product broke and now my business is losing $10,000 a day!!! I want a full refund AND compensation!!!",
        "expected_category": "Returns",
        "expected_priority": "Critical",
        "expected_sentiment": "Very Negative",
        "notes": "High emotion, potential legal threat"
    }
]

Build a testing harness to compare prompt versions:
def evaluate_prompt(prompt_template, test_cases):
    """Evaluate a prompt against test cases."""
    results = {
        "correct": 0,
        "total": len(test_cases),
        "failures": []
    }
    for test in test_cases:
        # Generate prompt from template
        prompt = prompt_template.format(input=test["input"])
        # Get model response
        response = llm.complete(prompt)
        # Parse response
        parsed = parse_response(response)
        # Check correctness
        is_correct = (
            parsed["category"] == test["expected_category"] and
            parsed["priority"] == test["expected_priority"]
        )
        if is_correct:
            results["correct"] += 1
        else:
            results["failures"].append({
                "input": test["input"],
                "expected": test,
                "actual": parsed
            })
    results["accuracy"] = results["correct"] / results["total"]
    return results

# Compare prompt versions
version_a_results = evaluate_prompt(prompt_v1, test_cases)
version_b_results = evaluate_prompt(prompt_v2, test_cases)
print(f"Version A accuracy: {version_a_results['accuracy']:.1%}")
print(f"Version B accuracy: {version_b_results['accuracy']:.1%}")

Gradually roll out new prompt versions:
import hashlib

def get_prompt_version(user_id):
    """Route users to prompt versions for A/B testing."""
    # Deterministic assignment based on user_id (use a stable hash rather than
    # Python's built-in hash(), which varies between processes)
    bucket = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 100
    if bucket < 10:  # 10% of users
        return "prompt_v2_experimental"
    else:
        return "prompt_v1_stable"

def process_request(user_id, user_input):
    """Process request with A/B tested prompts."""
    prompt_version = get_prompt_version(user_id)
    # Load appropriate prompt
    prompt_template = load_prompt(prompt_version)
    prompt = prompt_template.format(input=user_input)
    # Get response
    response = llm.complete(prompt)
    # Log for analysis
    log_interaction(
        user_id=user_id,
        prompt_version=prompt_version,
        input=user_input,
        output=response,
        timestamp=now()
    )
    return response

Track metrics like user satisfaction, task completion rate, and response accuracy to determine winners.
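To pick a winner, aggregate the logged interactions by prompt version. A sketch, assuming each log record carries the prompt_version and a boolean success signal (for example, task completed or positive user feedback):

```python
from collections import defaultdict

def summarize_ab_test(logs: list[dict]) -> dict:
    """Compute a per-version success rate from logged interactions."""
    counts = defaultdict(lambda: {"total": 0, "successes": 0})
    for record in logs:
        stats = counts[record["prompt_version"]]
        stats["total"] += 1
        stats["successes"] += int(record["success"])
    return {
        version: {
            "total": stats["total"],
            "success_rate": stats["successes"] / stats["total"],
        }
        for version, stats in counts.items()
    }
```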
Treat prompts like code—use version control:
prompts/
├── customer_classification/
│   ├── v1.0_baseline.txt
│   ├── v1.1_added_few_shot.txt
│   ├── v1.2_constrained_output.txt
│   ├── v2.0_cot_reasoning.txt
│   ├── changelog.md
│   └── test_cases.json
├── content_generation/
│   ├── v1.0_baseline.txt
│   └── ...
└── README.md

Document each version with what changed, why it changed, and how it performed against your test cases.
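The load_prompt helper referenced in the A/B testing snippet above can then read templates straight from this layout. A minimal sketch (the file-naming convention is an assumption):

```python
from pathlib import Path

PROMPTS_DIR = Path("prompts")

def load_prompt(version_name: str, task: str = "customer_classification") -> str:
    """Load a versioned prompt template, e.g. prompts/customer_classification/v1.1_added_few_shot.txt."""
    path = PROMPTS_DIR / task / f"{version_name}.txt"
    return path.read_text(encoding="utf-8")
```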
Here are battle-tested templates for common business applications.
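One way to use these templates at runtime is to substitute the {placeholders}, call the model, and parse the structured part of the response. The helpers below are an illustrative sketch; note that plain str.format doesn't work here because the templates also contain literal JSON braces:

```python
import json

def fill_template(template: str, values: dict[str, str]) -> str:
    """Substitute {placeholder} fields without touching the literal braces
    in the templates' JSON examples (str.format would choke on those)."""
    filled = template
    for key, value in values.items():
        filled = filled.replace("{" + key + "}", value)
    return filled

def classify_ticket(llm, template: str, company_name: str, customer_message: str) -> dict:
    """Fill the support-classification template below and parse its JSON output."""
    prompt = fill_template(template, {
        "company_name": company_name,
        "customer_message": customer_message,
    })
    response = llm.complete(prompt)
    # Assumes the model returns bare JSON; add more defensive parsing in production
    return json.loads(response)
```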
You are a customer support routing assistant for {company_name}.
Your task: Analyze customer messages and classify them for routing.
Classification criteria:
- Category: Technical / Billing / Account / Product / General
- Priority: Critical / High / Medium / Low
- Sentiment: Positive / Neutral / Negative / Very Negative
- Requires human: Yes / No
Priority guidelines:
- Critical: System down, data loss, security issue, legal threat
- High: Blocking work, payment issues, angry customer
- Medium: Feature questions, minor bugs, general inquiries
- Low: Feature requests, compliments, simple questions
Output format (JSON):
{
"category": "...",
"priority": "...",
"sentiment": "...",
"requires_human": true/false,
"reasoning": "Brief explanation",
"suggested_response": "Draft response if requires_human=false"
}
Message: {customer_message}
Analysis:

You are a content writer for {company_name}.
Brand voice guidelines:
- Tone: {tone} (e.g., "Professional but approachable")
- Audience: {audience} (e.g., "B2B decision makers in Australia")
- Avoid: {avoid} (e.g., "Jargon, hype, excessive exclamation marks")
- Style: {style} (e.g., "Clear, benefit-focused, data-driven")
Task: Write a {content_type} about {topic}.
Requirements:
- Length: {word_count} words
- Include: {must_include} (e.g., "Customer success story, ROI statistics")
- SEO keywords: {keywords} (use naturally, not stuffed)
- Call-to-action: {cta}
Examples of our brand voice:
{example_1}
{example_2}
Now write the content:

Extract structured information from the following unstructured text.
Output must be valid JSON matching this exact schema:
{
"entities": {
"people": [{"name": string, "role": string}],
"organizations": [{"name": string, "type": string}],
"dates": [{"date": string, "context": string}],
"amounts": [{"value": number, "currency": string, "context": string}]
},
"key_facts": [string],
"action_items": [{"task": string, "owner": string, "deadline": string}],
"sentiment": "positive" | "neutral" | "negative"
}
Rules:
- Only extract information explicitly stated in the text
- For missing fields, use null
- Dates should be in ISO format (YYYY-MM-DD) if specific, otherwise keep original
- For action items without explicit owner, use "unassigned"
Text:
{input_text}
JSON:

You are a senior software engineer reviewing code for security, performance, and best practices.
Review criteria:
1. Security vulnerabilities (SQL injection, XSS, authentication issues)
2. Performance problems (N+1 queries, unnecessary loops, memory leaks)
3. Code quality (readability, maintainability, error handling)
4. Best practices for {language} and {framework}
For each issue found, provide:
- Severity: Critical / High / Medium / Low
- Category: Security / Performance / Quality / Style
- Line number(s): Where the issue occurs
- Description: What the problem is
- Recommendation: How to fix it
- Example: Code snippet showing the fix
Code to review:
```{language}
{code}
```
Review:

Summarize this meeting transcript following this structure:
## Meeting Summary
**Date:** {date}
**Attendees:** {attendees}
**Duration:** {duration}
## Key Decisions
[Bullet list of decisions made, with owner if applicable]
## Action Items
| Task | Owner | Deadline | Status |
|------|-------|----------|--------|
| ... | ... | ... | Not Started |
## Discussion Points
[Brief summary of main topics discussed]
## Next Steps
[What happens next, next meeting date if mentioned]
## Risks and Blockers
[Any mentioned risks, blockers, or concerns]
Transcript:
{transcript}
Summary:

Analyze the sentiment of this text with detailed reasoning.
Step 1: Identify sentiment-bearing phrases
List all phrases that indicate positive, negative, or neutral sentiment.
Step 2: Assess overall sentiment
Consider:
- Balance of positive vs. negative language
- Intensity of emotion (mild frustration vs. extreme anger)
- Context and sarcasm
- Mixed sentiments (positive about product, negative about support)
Step 3: Determine final sentiment
Overall: Positive / Neutral / Negative / Very Negative / Very Positive
Confidence: High / Medium / Low
Step 4: Explain reasoning
Why this sentiment? What are the key indicators?
Text: {input_text}
Analysis:

Prompt engineering is both an art and a science. The techniques in this guide (clear structure, few-shot learning, chain-of-thought reasoning, and systematic testing) will dramatically improve your results. But remember that every application is unique, and the best prompt for your use case will come from iterative refinement based on real-world data.
Start simple. Test thoroughly. Refine based on failures. Document your learnings. And treat prompts like production code: version controlled, tested, and continuously improved.
The most important skill in prompt engineering isn't knowing every technique—it's developing a systematic approach to understanding what works and why. Build feedback loops, analyze failures, and always be testing. The difference between a 70% accuracy prompt and a 95% accuracy prompt isn't magic; it's systematic, data-driven iteration.
Your prompts are the interface between your application and the LLM. Invest time in making them excellent, and you'll see that investment repaid many times over in better, more reliable AI systems.