Training Data Best Practices: What Makes a Good Knowledge Base

Posted February 25, 2026

By aiadmin

The quality of your chatbot’s answers depends entirely on the quality of your training data. Poorly structured training data produces vague, inaccurate, or unhelpful responses — even from a powerful AI model. This guide shows you what good training data looks like, and how to structure your content for maximum retrieval accuracy.

Before You Start

You’ll need:

At least one embeddings dataset created (Embeddings in 10 Minutes)
Understanding of embeddings vs fine-tuning (Fine-Tuning vs Embeddings)

How the AI Uses Your Training Data

When a visitor asks a question, AIWU doesn’t feed your entire knowledge base to the AI. Instead:

1. The question is converted into a vector (a numeric representation of its meaning)
2. The system searches your training data for the 3–5 most similar chunks
3. Those chunks are sent to the AI along with the question
4. The AI writes an answer using only those chunks as context

This means: if the right information isn’t retrieved in step 2, the AI never sees it — no matter how good your answer document is. Good training data structure directly impacts what gets retrieved.
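The retrieval flow described above can be sketched in a few lines of Python. This is a minimal illustration, not AIWU's actual implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the function names, sample chunks, and prompt wording are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Embed the question and return the top_k most similar chunks."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Only the retrieved chunks travel to the model with the question."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up sample chunks, for illustration only.
chunks = [
    "Return Policy. You can return any item within 30 days of purchase.",
    "Business Hours. We are open Monday to Friday, 9am to 5pm.",
    "International Shipping. We ship to most countries worldwide.",
    "Account Password Reset. Use the reset link to change your password.",
]
question = "Can I return an item?"
top = retrieve(question, chunks, top_k=2)
prompt = build_prompt(question, top)
print(prompt)
```

Anything that does not make it into `top` never reaches the prompt, which is why the structure of each chunk matters more than the size of the knowledge base.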

The Core Principle: One Topic Per Document

The single most impactful best practice: one topic per training document.

❌ Bad: one document, multiple topics

Document: “Company FAQ”
Q: What are your hours?
Q: How do I return an item?
Q: Do you ship internationally?
Q: How do I change my password?

✅ Good: separate documents per topic

Document 1: “Business Hours”
Document 2: “Return Policy”
Document 3: “International Shipping”
Document 4: “Account Password Reset”

Why it matters: When a visitor asks “how do I return an item?”, the semantic search finds documents about returns. A document mixing returns, hours, shipping, and passwords is less focused — it may not rank as the top result for any single topic.
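To see the dilution effect in numbers, here is a small sketch that scores one question against a focused document and against a mixed FAQ. It assumes a toy bag-of-words cosine similarity in place of a real embedding model, so the exact scores mean nothing outside this example; only the direction of the comparison is the point.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


question = "How do I return an item?"

focused = (
    "Return Policy. Q: How do I return an item? "
    "A: You can return any item within 30 days of purchase."
)
mixed = (
    "Company FAQ. Q: What are your hours? Q: How do I return an item? "
    "Q: Do you ship internationally? Q: How do I change my password?"
)

focused_score = similarity(question, focused)
mixed_score = similarity(question, mixed)
print(f"focused: {focused_score:.2f}  mixed: {mixed_score:.2f}")
```

The mixed document contains the exact question, yet the unrelated topics around it pull its score below the focused document's.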

Optimal Document Length

Training documents are split into chunks before embedding. The optimal size per document is 150–400 words.

Document length	Result
Under 50 words	Too little context: the AI may not have enough information to write a complete answer
150–400 words	✅ Sweet spot: enough context, yet focused enough to rank well for specific questions
Over 600 words	Gets split into multiple chunks, and the split point may separate related information. Better to split it manually into logical sections.
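If you keep your documents as plain text, a quick word-count check can flag the ones that fall outside these ranges before you upload them. The sketch below is a hypothetical helper, not part of AIWU. The thresholds come from the table above; how to treat lengths the table does not mention (50–149 and 401–600 words) is an assumption.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def length_advice(text: str) -> str:
    """Classify a document against the length guidance in the table above."""
    n = word_count(text)
    if n < 50:
        return f"{n} words: too short, add context"
    if n > 600:
        return f"{n} words: too long, split into logical sections"
    if 150 <= n <= 400:
        return f"{n} words: within the recommended range"
    # Assumption: lengths the table does not cover are usable but not ideal.
    return f"{n} words: usable, but outside the 150-400 word target"


print(length_advice("Refunds are processed within 5-7 business days."))
```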

Write in Q&A Format for FAQs

For frequently asked questions, Q&A format dramatically improves retrieval accuracy:

❌ Paragraph format

Our return policy allows customers to return items within 30 days of purchase. Items must be in original condition with tags attached. Refunds are processed within 5–7 business days.

✅ Q&A format

Q: What is your return policy?
A: You can return any item within 30 days of purchase.

Q: What condition do items need to be in?
A: Items must be in original condition with tags attached.

Q: How long do refunds take?
A: Refunds are processed within 5–7 business days after we receive the item.

Q&A format works because the visitor’s question often mirrors the Q in your document — semantic similarity is very high, so retrieval is reliable.

Include Synonyms and Natural Language Variations

Customers don’t all use the same words. If your policy document says “return” but a visitor asks “refund” or “exchange” or “send back”, the semantic search may miss it.

Add a synonyms line at the bottom of key documents:

Return Policy
Q: What is your return policy?
A: You can return any item within 30 days...
[Also covers: refund, exchange, send back, money back, return request]

This isn’t shown to visitors — it’s there to improve semantic retrieval for related search terms.

What NOT to Include in Training Data

Navigation instructions: “Click the menu → go to Account → click Returns” — UI changes and this becomes outdated and wrong
Promotional content: “Our amazing products offer incredible value” — adds noise, no retrieval benefit
Dates that will expire: “Sale ends January 31st” — outdated training data is worse than no training data
Duplicate content: Don’t add the same information in multiple documents — it wastes space and can confuse retrieval
Very long unstructured text: Full blog posts, complete terms of service — extract the relevant sections instead
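Some of these problems can be caught with a simple pre-upload check. The sketch below is one possible approach and only covers the first three items in the list. The patterns are hypothetical heuristics chosen for this example, so expect false positives and tune them to your own content.

```python
import re

# Hypothetical heuristics, not an AIWU feature. Adjust to your own content.
CHECKS = {
    "navigation instructions": re.compile(r"→|\bclick\b", re.IGNORECASE),
    "expiring date": re.compile(
        r"\b(?:ends|until|expires)\s+(?:on\s+)?"
        r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?\s+\d{1,2}",
        re.IGNORECASE,
    ),
    "promotional language": re.compile(
        r"\b(?:amazing|incredible|unbeatable)\b", re.IGNORECASE
    ),
}


def lint_document(text: str) -> list[str]:
    """Return the names of the checks this document trips."""
    return [name for name, pattern in CHECKS.items() if pattern.search(text)]


print(lint_document("Sale ends January 31st"))
```

Duplicate content and overlong text need different checks (for example, comparing documents against each other and counting words), so they are left out here.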

A Practical Document Template

[Topic Name]
Q: [Most likely question a visitor would ask about this topic]
A: [Direct, complete answer in 2–5 sentences]
Q: [Second likely question]
A: [Answer]
Q: [Third likely question]
A: [Answer]
[Related terms: synonym1, synonym2, synonym3]

This template works for: return policy, shipping FAQ, product care, size guides, account management, payment questions, and most other support topics.
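If you maintain many documents, you may prefer to generate them from structured data rather than typing the template by hand. Here is a minimal sketch under that assumption. The `render_document` function and its parameters are hypothetical names for this example, and the sample content reuses the return policy wording from earlier in this guide.

```python
def render_document(
    topic: str,
    qa_pairs: list[tuple[str, str]],
    related_terms: list[str],
) -> str:
    """Render one training document in the template's layout."""
    lines = [topic]
    for question, answer in qa_pairs:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    if related_terms:
        lines.append(f"[Related terms: {', '.join(related_terms)}]")
    return "\n".join(lines)


doc = render_document(
    "Return Policy",
    [
        ("What is your return policy?",
         "You can return any item within 30 days of purchase."),
        ("What condition do items need to be in?",
         "Items must be in original condition with tags attached."),
        ("How long do refunds take?",
         "Refunds are processed within 5–7 business days after we receive the item."),
    ],
    ["refund", "exchange", "send back", "money back", "return request"],
)
print(doc)
```

Keeping the questions, answers, and related terms as data also makes it easier to update a single answer later without touching the rest of the document.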

Updating Training Data

Training data goes stale. Common triggers to update:

Policy change (new return window, new shipping carrier)
New products with unique FAQs
Analytics showing a high rate of unanswered questions on a topic
Chatbot giving outdated information (a customer corrects the bot)

After updating, go to your dataset and click Re-generate Embeddings for the changed documents. Changes take effect immediately.
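If you keep source copies of your documents outside AIWU, a content fingerprint is one way to work out which documents changed and therefore need re-generating. The sketch below is a standalone helper and does not call AIWU in any way. The function names are hypothetical, and the sample texts (including the 60-day window) are made up for the example.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable SHA-256 hash of a document's text, ignoring outer whitespace."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def changed_documents(
    current: dict[str, str],
    saved_fingerprints: dict[str, str],
) -> list[str]:
    """Return titles whose text is new or differs from the saved fingerprint."""
    return sorted(
        title
        for title, text in current.items()
        if saved_fingerprints.get(title) != fingerprint(text)
    )


# Fingerprints recorded the last time embeddings were generated.
saved = {
    "Return Policy": fingerprint("You can return any item within 30 days of purchase."),
}

# Current source text: one edited document and one new document.
current = {
    "Return Policy": "You can return any item within 60 days of purchase.",
    "Business Hours": "Placeholder text for a newly added document.",
}

to_regenerate = changed_documents(current, saved)
print(to_regenerate)
```

After re-generating, store the new fingerprints so the next comparison starts from the updated state.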

What’s Next

🔗 Connect your knowledge base to the chatbot: Train Your Chatbot with Embeddings
📊 Find gaps in your training data: Chatbot Analytics — the unanswered questions list is your next training task
⚙️ Fine-tuning vs embeddings — which to use: Fine-Tuning vs Embeddings: Which Training Method?

Last verified: AIWU v.4.9.2 · Updated: 2026-02-25
