Training Data Best Practices: What Makes a Good Knowledge Base

The quality of your chatbot’s answers depends heavily on the quality of your training data. Poorly structured training data produces vague, inaccurate, or unhelpful responses, even from a powerful AI model. This guide shows what good training data looks like and how to structure your content for maximum retrieval accuracy.


Before You Start

You’ll need:


How the AI Uses Your Training Data

When a visitor asks a question, AIWU doesn’t feed your entire knowledge base to the AI. Instead:

  1. The question is converted into a vector (a numeric representation of its meaning)
  2. The system searches your training data for the 3–5 most similar chunks
  3. Those chunks are sent to the AI along with the question
  4. The AI writes an answer using only those chunks as context

This means: if the right information isn’t retrieved in step 2, the AI never sees it — no matter how good your answer document is. Good training data structure directly impacts what gets retrieved.
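The retrieval flow above can be sketched in a few lines. This is only an illustration, not AIWU's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `retrieve` and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Steps 1-2: embed the question and return the top_k most similar chunks."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question: str, chunks: list[str], top_k: int = 3) -> str:
    """Step 3: only the retrieved chunks are sent along with the question."""
    context = "\n---\n".join(retrieve(question, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Return policy: you can return any item within 30 days of purchase.",
    "Business hours: we are open Monday to Friday, 9am to 5pm.",
    "International shipping: we ship to most countries within 10 days.",
]
print(build_prompt("How do I return an item?", chunks, top_k=1))
```

With `top_k=1`, only the return-policy chunk reaches the prompt. The other two chunks are never seen by the model, which is the point made above.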


The Core Principle: One Topic Per Document

The single most impactful best practice: one topic per training document.

❌ Bad: one document, multiple topics

Document: “Company FAQ”

Q: What are your hours?
Q: How do I return an item?
Q: Do you ship internationally?
Q: How do I change my password?

✅ Good: separate documents per topic

Document 1: “Business Hours”
Document 2: “Return Policy”
Document 3: “International Shipping”
Document 4: “Account Password Reset”

Why it matters: When a visitor asks “how do I return an item?”, the semantic search finds documents about returns. A document mixing returns, hours, shipping, and passwords is less focused — it may not rank as the top result for any single topic.
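The dilution effect can be illustrated with a rough experiment. The sketch below uses word-count cosine similarity as a crude proxy for embedding similarity (an assumption; real embedding models behave differently in detail), and the sample texts are invented for the example.

```python
import math
import re
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Cosine similarity of word counts; a crude proxy for embedding similarity."""
    va, vb = (Counter(re.findall(r"[a-z]+", t.lower())) for t in (a, b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values()) * sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

question = "How do I return an item?"

# One topic per document
focused = "Return policy. You can return an item within 30 days of purchase."

# Same return sentence, buried among unrelated topics
mixed = (
    "We are open Monday to Friday. You can return an item within 30 days. "
    "We ship internationally. Reset your password from the login page."
)

print(f"focused: {similarity(question, focused):.2f}")
print(f"mixed:   {similarity(question, mixed):.2f}")
```

The mixed document contains the same return information, but the unrelated sentences lower its score against the question, so it is less likely to land among the top results.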


Optimal Document Length

Training documents are split into chunks before embedding. The optimal size per document is 150–400 words.

Document length and what to expect:

  • Under 50 words: Not enough context. The AI may not have enough information to write a complete answer.
  • 150–400 words: ✅ Sweet spot. Enough context, and focused enough to rank well for specific questions.
  • Over 600 words: Gets split into multiple chunks, and the split point may separate related information. Better to split manually into logical sections.
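A quick way to audit existing documents against these ranges is a word-count check. This is a minimal sketch: the function name `length_band` is hypothetical, and counting whitespace-separated words is an assumption about how length is measured. The guide gives no verdict for 50–149 or 401–600 words, so the sketch reports those as unrated.

```python
def length_band(text: str) -> str:
    """Classify a document by word count using the ranges from this guide."""
    words = len(text.split())  # assumption: whitespace-separated tokens count as words
    if words < 50:
        return "too short"
    if 150 <= words <= 400:
        return "sweet spot"
    if words > 600:
        return "split manually"
    return "unrated"  # 50-149 or 401-600 words: the guide states no verdict

documents = {
    "Business Hours": "We are open Monday to Friday.",
    "Return Policy": "word " * 200,  # placeholder text, 200 words
}
for title, body in documents.items():
    print(f"{title}: {length_band(body)}")
```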

Write in Q&A Format for FAQs

For frequently asked questions, Q&A format dramatically improves retrieval accuracy:

❌ Paragraph format

Our return policy allows customers to return items within 30 days of purchase. Items must be in original condition with tags attached. Refunds are processed within 5–7 business days.

✅ Q&A format

Q: What is your return policy?
A: You can return any item within 30 days of purchase.

Q: What condition do items need to be in?
A: Items must be in original condition with tags attached.

Q: How long do refunds take?
A: Refunds are processed within 5–7 business days after we receive the item.

Q&A format works because the visitor’s question often mirrors the Q in your document — semantic similarity is very high, so retrieval is reliable.


Include Synonyms and Natural Language Variations

Customers don’t all use the same words. If your policy document says “return” but a visitor asks “refund” or “exchange” or “send back”, the semantic search may miss it.

Add a synonyms line at the bottom of key documents:

Return Policy

Q: What is your return policy?
A: You can return any item within 30 days...

[Also covers: refund, exchange, send back, money back, return request]

This isn’t shown to visitors — it’s there to improve semantic retrieval for related search terms.
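If you prepare documents in bulk, the synonyms line can be appended with a small helper. This is a sketch only: `add_synonyms` is a hypothetical name, and skipping terms that already appear in the text is a design choice of this example, not an AIWU requirement.

```python
def add_synonyms(document: str, terms: list[str]) -> str:
    """Append an '[Also covers: ...]' line, skipping terms already in the text."""
    lowered = document.lower()
    missing = [t for t in terms if t.lower() not in lowered]
    if not missing:
        return document  # nothing new to add
    return f"{document.rstrip()}\n\n[Also covers: {', '.join(missing)}]"

doc = (
    "Return Policy\n\n"
    "Q: What is your return policy?\n"
    "A: You can return any item within 30 days."
)
print(add_synonyms(doc, ["refund", "exchange", "send back", "return"]))
```

Here "return" is dropped from the appended line because the document already uses that word.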


What NOT to Include in Training Data

  • Navigation instructions: “Click the menu → go to Account → click Returns” (these steps become wrong as soon as the UI changes)
  • Promotional content: “Our amazing products offer incredible value” — adds noise, no retrieval benefit
  • Dates that will expire: “Sale ends January 31st” — outdated training data is worse than no training data
  • Duplicate content: Don’t add the same information in multiple documents — it wastes space and can confuse retrieval
  • Very long unstructured text: Full blog posts, complete terms of service — extract the relevant sections instead
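Some of the items above can be flagged automatically before you upload. The sketch below is a set of rough heuristics under stated assumptions: the function name `lint_documents`, the regular expressions, and the 600-word threshold are all choices made for this example, and promotional wording is not checked because it is hard to detect with patterns. Expect false positives.

```python
import re

# Heuristic: arrows or the word "click" suggest step-by-step UI navigation
NAVIGATION = re.compile(r"→|\bclick\b", re.IGNORECASE)

# Heuristic: a month name followed by a day number, e.g. "January 31st"
CALENDAR_DATE = re.compile(
    r"\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
    r"\.?\s+\d{1,2}(?:st|nd|rd|th)?\b",
    re.IGNORECASE,
)

def lint_documents(docs: dict[str, str], max_words: int = 600) -> dict[str, list[str]]:
    """Return {title: [warnings]} for documents that trip a heuristic."""
    warnings: dict[str, list[str]] = {}
    seen: dict[str, str] = {}  # normalized body -> first title that used it
    for title, body in docs.items():
        found = []
        if NAVIGATION.search(body):
            found.append("navigation instructions")
        if CALENDAR_DATE.search(body):
            found.append("calendar date that may expire")
        if len(body.split()) > max_words:
            found.append("very long text")
        key = " ".join(body.lower().split())  # ignore case and spacing differences
        if key in seen:
            found.append(f"duplicate of {seen[key]!r}")
        else:
            seen[key] = title
        if found:
            warnings[title] = found
    return warnings

docs = {
    "Returns": "You can return any item within 30 days.",
    "Promo": "Sale ends January 31st",
    "Nav": "Click the menu → go to Account",
    "Returns copy": "You can return any  item within 30 days.",
}
for title, problems in lint_documents(docs).items():
    print(f"{title}: {', '.join(problems)}")
```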

A Practical Document Template

[Topic Name]

Q: [Most likely question a visitor would ask about this topic]
A: [Direct, complete answer — 2–5 sentences]

Q: [Second likely question]
A: [Answer]

Q: [Third likely question]
A: [Answer]

[Related terms: synonym1, synonym2, synonym3]

This template works for: return policy, shipping FAQ, product care, size guides, account management, payment questions, and most other support topics.
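If your questions and answers already live in a spreadsheet or database, the template can be filled in programmatically. A minimal sketch, assuming plain-text documents; `render_document` is a hypothetical helper, not part of AIWU.

```python
def render_document(
    topic: str,
    qa_pairs: list[tuple[str, str]],
    related_terms: list[str],
) -> str:
    """Render one training document in the template's layout."""
    lines = [topic, ""]
    for question, answer in qa_pairs:
        lines += [f"Q: {question}", f"A: {answer}", ""]
    if related_terms:
        lines.append(f"[Related terms: {', '.join(related_terms)}]")
    return "\n".join(lines).rstrip() + "\n"

print(
    render_document(
        "Return Policy",
        [
            ("What is your return policy?", "You can return any item within 30 days of purchase."),
            ("How long do refunds take?", "Refunds are processed within 5–7 business days after we receive the item."),
        ],
        ["refund", "exchange", "send back"],
    )
)
```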


Updating Training Data

Training data goes stale. Common triggers to update:

  • Policy change (new return window, new shipping carrier)
  • New products with unique FAQs
  • Analytics showing a high rate of unanswered questions on a topic
  • Chatbot giving outdated information (a customer corrects the bot)

After updating, go to your dataset and click Re-generate Embeddings for the changed documents. Changes take effect immediately.
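If you keep your source documents outside WordPress, a content fingerprint helps you work out which ones changed and therefore need their embeddings regenerated. This is a sketch of local bookkeeping only: `fingerprint`, `changed_documents`, and the `last_embedded` record are hypothetical and do not interact with AIWU.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable hash of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_documents(docs: dict[str, str], last_embedded: dict[str, str]) -> list[str]:
    """Titles that are new or whose text differs from the last embedded version."""
    return [
        title
        for title, body in docs.items()
        if last_embedded.get(title) != fingerprint(body)
    ]

docs = {
    "Return Policy": "You can return any item within 30 days of purchase.",
    "Business Hours": "We are open Monday to Friday.",
}
# Record fingerprints at the time embeddings were last generated
last_embedded = {title: fingerprint(body) for title, body in docs.items()}

# A policy change: the return window is extended
docs["Return Policy"] = "You can return any item within 60 days of purchase."
print(changed_documents(docs, last_embedded))
```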


What’s Next


Last verified: AIWU v.4.9.2 · Updated: 2026-02-25
