Training Data Best Practices: What Makes a Good Knowledge Base
The quality of your chatbot’s answers depends entirely on the quality of your training data. Poorly structured training data produces vague, inaccurate, or unhelpful responses — even from a powerful AI model. This guide shows you what good training data looks like, and how to structure your content for maximum retrieval accuracy.
Before You Start
You’ll need:
- At least one embeddings dataset created (Embeddings in 10 Minutes)
- Understanding of embeddings vs fine-tuning (Fine-Tuning vs Embeddings)
How the AI Uses Your Training Data
When a visitor asks a question, AIWU doesn’t feed your entire knowledge base to the AI. Instead:
1. The question is converted into a vector (a numeric representation of its meaning).
2. The system searches your training data for the 3–5 most similar chunks.
3. Those chunks are sent to the AI along with the question.
4. The AI writes an answer using only those chunks as context.
This means: if the right information isn’t retrieved in step 2, the AI never sees it — no matter how good your answer document is. Good training data structure directly impacts what gets retrieved.
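The retrieval flow described above can be sketched in a few lines. This is only an illustration, not AIWU's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Embed the question, then return the top_k most similar chunks."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Only the retrieved chunks travel to the model with the question."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "Return policy: you can return any item within 30 days of purchase.",
    "Business hours: we are open Monday to Friday, 9am to 5pm.",
    "International shipping: orders ship worldwide within 10 business days.",
]
top = retrieve("How do I return an item?", chunks, top_k=1)
prompt = build_prompt("How do I return an item?", top)
print(prompt)
```

Anything that `retrieve` leaves out never reaches the prompt, which is why document structure matters more than how much content you upload.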
The Core Principle: One Topic Per Document
The single most impactful best practice: one topic per training document.
| ❌ Bad — one document, multiple topics | ✅ Good — separate documents per topic |
|---|---|
| Document: “Company FAQ”<br>Q: What are your hours? | Document 1: “Business Hours”<br>Document 2: “Return Policy”<br>Document 3: “International Shipping”<br>Document 4: “Account Password Reset” |
Why it matters: When a visitor asks “how do I return an item?”, the semantic search finds documents about returns. A document mixing returns, hours, shipping, and passwords is less focused — it may not rank as the top result for any single topic.
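A rough way to see the dilution effect is to score the same question against a focused document and against a mixed one. The sketch below uses word-count cosine similarity as a toy proxy; real embedding models behave differently in detail, so the exact scores are not meaningful, only the direction. The `similarity` helper and the sample texts are made up for illustration.

```python
import math
import re
from collections import Counter


def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (a toy proxy for embedding similarity)."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


question = "How do I return an item?"
focused = "Return Policy. You can return any item within 30 days of purchase."
# Same return text, plus unrelated topics in the same document.
mixed = (
    focused
    + " We are open Monday to Friday. We ship worldwide."
    + " Reset your password from the login page."
)

focused_score = similarity(question, focused)
mixed_score = similarity(question, mixed)
print(f"focused: {focused_score:.2f}, mixed: {mixed_score:.2f}")
```

The mixed document contains the same return text, yet it scores lower because the unrelated sentences add weight that the question does not match.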
Optimal Document Length
Training documents are split into chunks before embedding. The optimal size per document is 150–400 words.
| Document length | Effect |
|---|---|
| Under 50 words | Not enough context — the AI may not have enough information to write a complete answer |
| 150–400 words | ✅ Sweet spot — enough context, focused enough to rank well for specific questions |
| Over 600 words | Gets split into multiple chunks — the split point may separate related information. Better to split manually into logical sections. |
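If you prepare documents in bulk, a quick word-count check can flag the ones outside the recommended range before you upload them. The sketch below uses the thresholds from the table; the function name `length_advice` is hypothetical, and the "borderline" label for lengths the table does not cover (50–149 and 401–600 words) is an assumption, not AIWU guidance.

```python
def length_advice(text: str) -> str:
    """Classify a training document by word count, using the table's thresholds."""
    words = len(text.split())
    if words < 50:
        return "too short"       # not enough context for a complete answer
    if 150 <= words <= 400:
        return "sweet spot"      # focused, with enough context
    if words > 600:
        return "split manually"  # will be chunked; choose the split points yourself
    return "borderline"          # assumption: ranges the table does not cover


documents = {
    "Business Hours": "We are open Monday to Friday.",
    "Return Policy": "word " * 200,
}
for title, body in documents.items():
    print(f"{title}: {length_advice(body)}")
```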
Write in Q&A Format for FAQs
For frequently asked questions, Q&A format dramatically improves retrieval accuracy:
| ❌ Paragraph format | ✅ Q&A format |
|---|---|
| Our return policy allows customers to return items within 30 days of purchase. Items must be in original condition with tags attached. Refunds are processed within 5–7 business days. | Q: What is your return policy?<br>A: You can return any item within 30 days of purchase.<br>Q: What condition do items need to be in?<br>A: Items must be in original condition with tags attached.<br>Q: How long do refunds take?<br>A: Refunds are processed within 5–7 business days. |
Q&A format works because the visitor’s question often mirrors the Q in your document — semantic similarity is very high, so retrieval is reliable.
Include Synonyms and Natural Language Variations
Customers don’t all use the same words. If your policy document says “return” but a visitor asks “refund” or “exchange” or “send back”, the semantic search may miss it.
Add a synonyms line at the bottom of key documents:
Return Policy
Q: What is your return policy?
A: You can return any item within 30 days...
[Also covers: refund, exchange, send back, money back, return request]
This isn’t shown to visitors — it’s there to improve semantic retrieval for related search terms.
What NOT to Include in Training Data
- Navigation instructions: “Click the menu → go to Account → click Returns” — UI changes and this becomes outdated and wrong
- Promotional content: “Our amazing products offer incredible value” — adds noise, no retrieval benefit
- Dates that will expire: “Sale ends January 31st” — outdated training data is worse than no training data
- Duplicate content: Don’t add the same information in multiple documents — it wastes space and can confuse retrieval
- Very long unstructured text: Full blog posts, complete terms of service — extract the relevant sections instead
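Some of these problems can be caught with a simple pre-upload check. The sketch below is a heuristic, not an AIWU feature: the patterns, the word list for promotional language, and the names `lint` and `find_duplicates` are all assumptions, and the duplicate check only catches text that is identical after normalizing case and whitespace.

```python
import re

MONTHS = (
    r"(?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December)"
)

# Each check maps a label to a pattern that suggests the problem.
CHECKS = {
    "navigation instructions": re.compile(r"\bclick\b|→", re.IGNORECASE),
    "expiring date": re.compile(rf"\b{MONTHS}\s+\d{{1,2}}(?:st|nd|rd|th)?\b"),
    "promotional wording": re.compile(r"\b(?:amazing|incredible)\b", re.IGNORECASE),
}


def lint(text: str) -> list[str]:
    """Return the labels of every check that matches the text."""
    return [name for name, pattern in CHECKS.items() if pattern.search(text)]


def find_duplicates(docs: dict[str, str]) -> list[tuple[str, str]]:
    """Pair up documents whose text is identical after normalization."""
    seen: dict[str, str] = {}
    dupes = []
    for title, body in docs.items():
        key = " ".join(body.lower().split())
        if key in seen:
            dupes.append((seen[key], title))
        else:
            seen[key] = title
    return dupes


print(lint("Sale ends January 31st"))
print(lint("Click the menu → go to Account → click Returns"))
```

Treat matches as prompts for a manual review rather than automatic rejections; a word like "click" can appear in perfectly good content.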
A Practical Document Template
[Topic Name]
Q: [Most likely question a visitor would ask about this topic]
A: [Direct, complete answer in 2–5 sentences]
Q: [Second likely question]
A: [Answer]
Q: [Third likely question]
A: [Answer]
[Related terms: synonym1, synonym2, synonym3]
This template works for: return policy, shipping FAQ, product care, size guides, account management, payment questions, and most other support topics.
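If your FAQ content already lives in a spreadsheet or CMS, you could generate documents in this shape with a small helper. The sketch below is one possible approach; `render_document` and its parameters are hypothetical names, and the output is plain text that you would still paste or import into your dataset yourself.

```python
def render_document(
    topic: str,
    qa_pairs: list[tuple[str, str]],
    related_terms: list[str],
) -> str:
    """Render one single-topic training document in the template's layout."""
    lines = [topic]
    for question, answer in qa_pairs:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    if related_terms:
        # The synonyms line goes last, as in the template.
        lines.append(f"[Related terms: {', '.join(related_terms)}]")
    return "\n".join(lines)


doc = render_document(
    "Return Policy",
    [
        ("What is your return policy?",
         "You can return any item within 30 days of purchase."),
        ("How long do refunds take?",
         "Refunds are processed within 5–7 business days."),
    ],
    ["refund", "exchange", "send back"],
)
print(doc)
```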
Updating Training Data
Training data goes stale. Common triggers to update:
- Policy change (new return window, new shipping carrier)
- New products with unique FAQs
- Analytics showing a high rate of unanswered questions on a topic
- Chatbot giving outdated information (a customer corrects the bot)
After updating, go to your dataset and click Re-generate Embeddings for the changed documents. Changes take effect immediately.
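If you keep your source documents outside AIWU, one way to know which ones need re-generating is to compare a content fingerprint against the one you recorded at the last embedding run. This is a local bookkeeping sketch under that assumption; AIWU may track changes on its own, and `fingerprint` and `changed_documents` are hypothetical names.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable SHA-256 fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_documents(
    current: dict[str, str],
    last_embedded: dict[str, str],
) -> list[str]:
    """Titles that are new or whose text differs from the stored fingerprint."""
    return sorted(
        title
        for title, body in current.items()
        if last_embedded.get(title) != fingerprint(body)
    )


# Fingerprints saved when embeddings were last generated.
last_embedded = {
    "Return Policy": fingerprint("You can return any item within 30 days."),
    "Business Hours": fingerprint("We are open Monday to Friday."),
}
current = {
    "Return Policy": "You can return any item within 60 days.",  # policy changed
    "Business Hours": "We are open Monday to Friday.",           # unchanged
    "International Shipping": "We ship worldwide.",              # new document
}
print(changed_documents(current, last_embedded))
```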
What’s Next
- 🔗 Connect your knowledge base to the chatbot: Train Your Chatbot with Embeddings
- 📊 Find gaps in your training data: Chatbot Analytics — the unanswered questions list is your next training task
- ⚙️ Fine-tuning vs embeddings — which to use: Fine-Tuning vs Embeddings: Which Training Method?
Last verified: AIWU v.4.9.2 · Updated: 2026-02-25
