How to Create and Manage Datasets in AIWU AI Training - AIWU – AI Plugin for WordPress

Datasets are the foundation of all AI training in AIWU. Whether you’re building a smart chatbot, training a custom fine-tuned model, or creating an embedding-powered knowledge base — it all starts with a dataset.

This guide covers everything you need to know: how to navigate the Datasets tab, create datasets from all available sources, choose the right format, and manage your training data.

Path: WordPress Admin → AI Copilot → Training → Datasets

[Image: AIWU AI Training, Datasets tab showing the main table with dataset entries]


Quick Decision Guide: Which Source and Type Should I Use?

Before diving into the interface, use this table to find the right approach for your goal:

| Your Goal | Source | Dataset Type |
| --- | --- | --- |
| Quickly test a chatbot with a few Q&A pairs | Text Input | Prompt → Completion |
| Import existing FAQ from a spreadsheet | Upload File (CSV) | Prompt → Completion |
| Import structured Q&A from an API export | Upload File (JSON) | Prompt → Completion |
| Add internal docs or brand guidelines for RAG search | Upload File (TXT/PDF/DOCX) | Raw Text |
| Build a knowledge base from your blog posts | Site Content | Raw Text |
| Auto-generate Q&A pairs from product pages | Site Content | Prompt → Completion |
| Paste a short text block for embedding | Text Input | Raw Text |

Key rule:

  • Raw Text → used for Embeddings (semantic search, RAG)
  • Prompt → Completion → used for Fine-tuning (custom AI models)

Datasets Table Overview

The main Datasets screen shows all your datasets in a table with the following columns:

| Column | Description |
| --- | --- |
| ☑ | Checkbox for bulk selection |
| ID | Unique dataset identifier |
| Name | Dataset name; click to open and edit |
| Source | How the data was created: Text Input, Upload File, or Site Content |
| Type | Format: Raw Text or Prompt → Completion |
| Tokens | Total token count in the dataset |
| Status | Current processing status (see below) |
| Used In | Which Embeddings or Fine-tuned Models reference this dataset |

Toolbar actions:

  • Create New — opens the dataset creation form
  • Delete — removes selected datasets and all their data records

Dataset Statuses

| Status | Meaning |
| --- | --- |
| New | Just created, not yet processed |
| Waiting | Queued for processing |
| Processing | Currently being generated (Site Content source) |
| Ready | Complete and available for use in Embeddings or Fine-tuning |
| In Use | Already linked to an Embedding or Fine-tuned Model |
| Error | Processing failed; check content and retry |
| Pause | Processing paused by user |
| Canceled | Processing was canceled |

Dataset Lifecycle

[Image: AIWU Dataset lifecycle diagram showing status flow from New to Ready to In Use]

A dataset moves through statuses automatically:

New → Waiting → Processing → Ready → In Use

If processing fails, the status changes to Error. Datasets reach In Use automatically when linked to an Embedding or Fine-tuned Model. Deleting a dataset also removes all associated data records and generation tasks.
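The lifecycle above can be read as a small state machine. The following is only an illustrative Python sketch, not the plugin's implementation: the `Status` enum and `next_status` function are hypothetical names, and transitions for Pause and Canceled are left out because this guide does not describe them.

```python
from enum import Enum


class Status(Enum):
    NEW = "New"
    WAITING = "Waiting"
    PROCESSING = "Processing"
    READY = "Ready"
    IN_USE = "In Use"
    ERROR = "Error"
    PAUSE = "Pause"
    CANCELED = "Canceled"


# Happy path as documented: New -> Waiting -> Processing -> Ready -> In Use.
# Ready -> In Use happens when the dataset is linked to an Embedding
# or Fine-tuned Model.
HAPPY_PATH = [
    Status.NEW,
    Status.WAITING,
    Status.PROCESSING,
    Status.READY,
    Status.IN_USE,
]


def next_status(current: Status, processing_failed: bool = False) -> Status:
    """Return the next status on the documented path (hypothetical helper)."""
    if current is Status.PROCESSING and processing_failed:
        return Status.ERROR
    if current not in HAPPY_PATH or current is Status.IN_USE:
        # No automatic transition is documented from this status.
        return current
    return HAPPY_PATH[HAPPY_PATH.index(current) + 1]
```

For example, `next_status(Status.PROCESSING, processing_failed=True)` yields `Status.ERROR`, while `next_status(Status.NEW)` yields `Status.WAITING`.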


Creating a New Dataset

Click Create New in the toolbar to open the creation form.

URL: admin.php?page=waic-workspace&tab=training-ds

The first choice you make is the Source — this determines where the training data comes from and what options appear next.

[Image: AIWU Create Dataset form, Source selection dropdown]

| Source | Description |
| --- | --- |
| Text Input | Manually enter text or Q&A pairs in the admin |
| Upload File | Upload a .txt, .md, .pdf, .docx, .csv, or .json file |
| Site Content | Auto-extract data from WordPress posts, pages, or WooCommerce products |

Source 1: Text Input

Best for quick tests, small datasets, or when you want full control over every entry.

After selecting Text Input, choose the Dataset Type:

Raw Text Mode

A single textarea where you paste or type unstructured text. This content is used as-is for embedding generation.

Example: knowledge base for an online store (RAG):

Our company offers free shipping on all orders over $50 within the continental United States. 
International shipping is available to 45+ countries with delivery times ranging from 7-21 business days.

Returns are accepted within 30 days of purchase. Items must be unused and in original packaging. 
Refunds are processed within 5-7 business days after we receive the returned item.

Our customer support team is available Monday through Friday, 9 AM to 6 PM EST. 
You can reach us via email at [email protected] or through the live chat on our website.

When to use: Building a knowledge base for RAG-powered chatbots, adding brand guidelines, feeding internal documentation into embeddings.

Prompt → Completion Mode

A table with two columns: Prompt and Completion. Each row is one training pair.

[Image: AIWU Dataset, Prompt → Completion pairs table]

Example — customer support bot:

| Prompt | Completion |
| --- | --- |
| What is your return policy? | We accept returns within 30 days of purchase. Items must be unused and in original packaging. Contact [email protected] to start a return. |
| Do you offer international shipping? | Yes! We ship to 45+ countries. International delivery takes 7-21 business days depending on destination. |
| How can I track my order? | Visit our Order Tracking page and enter your order number. You’ll also receive tracking updates via email. |
| Can I cancel my order? | Orders can be canceled within 2 hours of placement. After that, please wait for delivery and use our return process. |
| What payment methods do you accept? | We accept Visa, Mastercard, American Express, PayPal, and Apple Pay. |

Adding multiple pairs at once: Click the Add button to open a dialog. Enter each pair on a new line, separating Prompt and Completion with a colon (:):

What are your business hours?: We're open Monday-Friday, 9 AM to 6 PM EST.
Do you have a loyalty program?: Yes! Sign up for free and earn 1 point per dollar spent.
How do I reset my password?: Click "Forgot Password" on the login page and follow the email instructions.
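If you prepare such lines outside WordPress, it can help to check them before pasting. The sketch below is a rough Python check, not the plugin's parser: `parse_pairs` is a hypothetical helper, and it assumes each line is split at the first colon, which the guide does not confirm. Under that assumption, a prompt that itself contains a colon would be split too early.

```python
def parse_pairs(text: str) -> list[tuple[str, str]]:
    """Split 'Prompt: Completion' lines into (prompt, completion) tuples.

    Assumption: the first colon on a line separates prompt and completion.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between pairs
        prompt, sep, completion = line.partition(":")
        if not sep or not prompt.strip() or not completion.strip():
            raise ValueError(f"Expected 'Prompt: Completion', got: {line!r}")
        pairs.append((prompt.strip(), completion.strip()))
    return pairs


sample = """\
What are your business hours?: We're open Monday-Friday, 9 AM to 6 PM EST.
Do you have a loyalty program?: Yes! Sign up for free and earn 1 point per dollar spent.
"""
for prompt, completion in parse_pairs(sample):
    print(prompt, "->", completion)
```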

When to use: Training a fine-tuned model to respond in a specific style, building FAQ bots, teaching AI your brand voice.


Source 2: Upload File

Best when you already have training data prepared outside WordPress.

Click Upload to select a file. After uploading, the system detects the file type and shows appropriate settings.

Supported File Formats

| Format | Default Dataset Type | Best For |
| --- | --- | --- |
| .txt | Raw Text | Plain text documents, notes |
| .md | Raw Text | Markdown documentation |
| .pdf | Raw Text | Manuals, reports, whitepapers |
| .docx | Raw Text | Word documents, policies |
| .csv | Prompt → Completion | Structured Q&A from spreadsheets |
| .json | Prompt → Completion | API exports, structured data |

The default type can be overridden manually if needed.
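As a quick reference, the defaults in the table can be expressed as a lookup. This is just an illustrative Python sketch for planning uploads; `default_dataset_type` is a hypothetical name and the plugin's own detection logic may differ (for example, it may inspect file contents rather than the extension).

```python
from pathlib import Path

# Defaults copied from the Supported File Formats table.
DEFAULT_TYPE_BY_EXTENSION = {
    ".txt": "Raw Text",
    ".md": "Raw Text",
    ".pdf": "Raw Text",
    ".docx": "Raw Text",
    ".csv": "Prompt → Completion",
    ".json": "Prompt → Completion",
}


def default_dataset_type(filename: str) -> str:
    """Return the default dataset type for a file name (hypothetical helper)."""
    extension = Path(filename).suffix.lower()
    if extension not in DEFAULT_TYPE_BY_EXTENSION:
        raise ValueError(f"Unsupported file format: {extension or filename!r}")
    return DEFAULT_TYPE_BY_EXTENSION[extension]
```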

Uploading a CSV File

After uploading a CSV, the following settings appear:

Separator — choose the delimiter used in your file:

  • Comma ,
  • Semicolon ;
  • Colon :
  • Tab \t

Mapping “Prompt” — select the CSV column containing the input/question.

Mapping “Completion” — select the CSV column containing the expected output/answer.

A live Preview section shows how the data will be parsed. Use the Refresh button (↻) to re-parse after changing settings.

Example CSV file:

question,answer
What is your return policy?,We accept returns within 30 days of purchase.
Do you ship internationally?,Yes we ship to over 45 countries worldwide.
How long does delivery take?,Standard delivery takes 3-5 business days domestically.
What if my item arrives damaged?,Contact us within 48 hours with photos and we will send a replacement.
Can I change my shipping address?,Yes if the order has not been shipped yet. Contact support immediately.

After uploading this file, set Separator to Comma (,), map “Prompt” to the question column, and map “Completion” to the answer column.
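To check a CSV locally before uploading, you can mimic the separator and column mapping with Python's standard `csv` module. This is a minimal sketch of the same idea, not the plugin's parser; `read_pairs` and its parameter names are hypothetical.

```python
import csv
import io


def read_pairs(csv_text, separator=",", prompt_column="question",
               completion_column="answer"):
    """Parse CSV text into (prompt, completion) tuples using a column mapping."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=separator)
    missing = {prompt_column, completion_column} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Columns not found in header: {sorted(missing)}")
    pairs = []
    for row in reader:
        prompt, completion = row[prompt_column], row[completion_column]
        if prompt and completion:  # skip rows with an empty or missing cell
            pairs.append((prompt.strip(), completion.strip()))
    return pairs


sample = """question,answer
What is your return policy?,We accept returns within 30 days of purchase.
Do you ship internationally?,Yes we ship to over 45 countries worldwide.
"""
for prompt, completion in read_pairs(sample):
    print(f"{prompt} -> {completion}")
```

Note that the sample answers avoid commas. If a cell needs to contain the separator character, wrap the cell in double quotes, which is standard CSV quoting and is handled by `csv.DictReader`.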

Uploading a JSON File

JSON files are parsed automatically — no separator or column mapping needed.

Example JSON file:

[
  {
    "prompt": "What sizes do you carry?",
    "completion": "We carry sizes XS through 3XL across all product lines."
  },
  {
    "prompt": "Are your products eco-friendly?",
    "completion": "Yes, all our packaging is recyclable and we use sustainable materials wherever possible."
  },
  {
    "prompt": "Do you offer gift wrapping?",
    "completion": "Yes! Select the gift wrap option at checkout for $3.99."
  }
]

Use consistent key names — "prompt" and "completion" are recommended.
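A JSON file can be sanity-checked before upload with a few lines of Python. The sketch below assumes the structure shown in the example (an array of objects with "prompt" and "completion" keys); `load_pairs` is a hypothetical helper, and the plugin may accept other shapes that this check would reject.

```python
import json


def load_pairs(json_text: str) -> list[tuple[str, str]]:
    """Validate a JSON array of {"prompt": ..., "completion": ...} objects."""
    data = json.loads(json_text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array of objects")
    pairs = []
    for index, item in enumerate(data):
        if (not isinstance(item, dict)
                or "prompt" not in item
                or "completion" not in item):
            raise ValueError(
                f"Entry {index} needs 'prompt' and 'completion' keys")
        pairs.append((str(item["prompt"]), str(item["completion"])))
    return pairs


sample = '[{"prompt": "Do you offer gift wrapping?", "completion": "Yes!"}]'
print(load_pairs(sample))
```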

Uploading TXT, MD, PDF, or DOCX Files

These formats default to Raw Text and display a text preview instead of a table. No separator or mapping configuration is needed.

Tips:

  • Remove unnecessary formatting, HTML tags, or special characters before upload
  • Use UTF-8 encoding, especially with non-English text
  • Large files increase token count, which affects training time and API cost
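The first two tips can be partly automated for plain-text files. The following Python sketch is one rough way to do it, under stated assumptions: `clean_text` is a hypothetical helper, and the regular expression used to drop tags is deliberately crude (it would also remove literal text such as `a < b > c`), so review the output before uploading.

```python
import html
import re


def clean_text(raw: bytes) -> str:
    """Decode as UTF-8, strip HTML tags, and tidy whitespace."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
    text = re.sub(r"<[^>]+>", " ", text)    # crude removal of HTML tags
    text = html.unescape(text)              # &amp; -> &, &nbsp; -> nbsp, etc.
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


print(clean_text(b"<p>Free  shipping</p>\n\n\n<b>Returns</b> &amp; refunds"))
```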

Source 3: Site Content

Best when your WordPress site already contains the content you want to train on — blog posts, pages, or WooCommerce products.

The plugin extracts content, optionally processes it with OpenAI, and creates a ready-to-use dataset automatically.

[Image: AIWU Dataset creation from Site Content]

Step 1: Choose Dataset Type

| Type | What Happens |
| --- | --- |
| Raw Text | Extracts and cleans content as plain text |
| Prompt → Completion | AI generates Q&A pairs from the content |

Step 2: Select Content

Choose the content type from the dropdown:

  • WordPress Post — blog posts
  • WordPress Page — static pages
  • WooCommerce Product — product listings (visible only if WooCommerce is active)

Click Add to open a picker dialog where you can search and filter items:

  • Posts: filter by Title, Categories, Tags
  • Products: filter by Product Name, Categories, Tags
  • Pages: filter by Title

Selected items appear in a table with columns: ☑, Title, Type, and Additional prompt (an optional per-item instruction for the AI).

Step 3: Choose Fields to Include

Post Fields:
Post Title, Post Content (Body), Post Excerpt, Post Tags, Post Categories, Author Name, Publish Date.

Product Fields:
Product Name, Product Description, Short Description, Product Tags, Product Categories, SKU, Attributes, Price, Stock Status.

Select multiple fields to give the AI richer context.

Step 4: Configure Generation Settings

OpenAI Model — select which model processes your content. Affects output quality and cost.

Additional prompt (global) — an optional instruction applied to all items. Enable the checkbox, then enter your instruction.

Example global prompts:

  • “Summarize each product for a technical audience. Focus on specifications and use cases.”
  • “Generate FAQ-style questions a customer might ask before purchasing.”
  • “Write in a friendly, conversational tone. Avoid jargon.”

Raw Text mode settings:

  • Max Length per Item (default: 300) — token limit per content item

Prompt → Completion mode settings:

  • QA per Item (default: 3) — how many Q&A pairs to generate per item
  • Max Length Prompt (default: 100) — max tokens for the generated question
  • Max Length Completion (default: 300) — max tokens for the generated answer

How Generation Works

After clicking Create dataset, the plugin:

  1. Collects all selected posts/pages/products
  2. Assembles a prompt for each item using selected fields + instructions
  3. Queues items for AI processing (status → Processing)
  4. Generates dataset entries — either clean text or Q&A pairs
  5. Marks the dataset as Ready when complete

⚠️ Important: Site Content generation uses the OpenAI API, which incurs costs based on token usage. Monitor your API usage in the OpenAI dashboard.
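To make step 2 of the generation process more concrete, here is a loose Python sketch of how a per-item prompt might be assembled from the selected fields and instructions. Everything in it is an assumption for illustration: the function name, the wording of the task line, and the layout of the prompt are invented, and the plugin's real prompt template is not documented in this guide.

```python
def build_item_prompt(item, fields, dataset_type="Prompt → Completion",
                      qa_per_item=3, max_length=300,
                      global_prompt="", item_prompt=""):
    """Assemble one prompt from selected fields plus optional instructions.

    Hypothetical illustration only; not the plugin's actual template.
    """
    if dataset_type == "Raw Text":
        task = f"Extract clean plain text (at most {max_length} tokens)."
    else:
        task = f"Generate {qa_per_item} question-and-answer pairs."
    # Only the fields the user selected are included, in the chosen order.
    content = "\n".join(
        f"{name}: {item[name]}" for name in fields if item.get(name))
    # Global instruction first, then the optional per-item instruction.
    instructions = [text for text in (global_prompt, item_prompt) if text]
    return "\n\n".join([task, *instructions, content])


product = {"Product Name": "Trail Boots", "Price": "$120", "SKU": "TB-1"}
print(build_item_prompt(
    product,
    fields=["Product Name", "Price"],
    qa_per_item=5,
    global_prompt="Write in a friendly, conversational tone. Avoid jargon.",
))
```

Selecting fewer fields produces a shorter prompt, which is one reason field selection affects both output quality and API cost.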

Example: Building a Product FAQ Bot

Scenario: You have 30 WooCommerce products and want a chatbot that answers customer questions about them.

Configuration:

  • Source: Site Content
  • Dataset Type: Prompt → Completion
  • Content: Select your 30 products
  • Product Fields: Product Name, Description, Price, Categories, Stock Status
  • QA per Item: 5
  • Additional prompt: “Generate realistic customer questions and helpful answers. Include pricing and availability details when relevant.”

Result: Up to 150 Q&A pairs (30 products × 5 pairs) ready for fine-tuning.
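The arithmetic behind this estimate can be extended to a rough upper bound on generated tokens, which is useful for budgeting API cost. This Python sketch assumes the Max Length settings act as hard per-pair caps and counts only generated output, not the tokens sent to the model as input; both function names are hypothetical.

```python
def max_pairs(items: int, qa_per_item: int = 3) -> int:
    """Upper bound on the number of Q&A pairs generated."""
    return items * qa_per_item


def max_output_tokens(items: int, qa_per_item: int = 3,
                      max_prompt: int = 100, max_completion: int = 300) -> int:
    """Upper bound on generated tokens, assuming per-pair caps are enforced."""
    return max_pairs(items, qa_per_item) * (max_prompt + max_completion)


# 30 products, 5 pairs each, default length limits (100 + 300 tokens per pair).
print(max_pairs(30, 5))          # 150
print(max_output_tokens(30, 5))  # 60000
```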

Example: Building a Knowledge Base for RAG Search

Scenario: You have 50 blog posts about hiking gear and want a smart search that understands natural language queries.

Configuration:

  • Source: Site Content
  • Dataset Type: Raw Text
  • Content: Select your 50 posts
  • Post Fields: Post Title, Post Content, Post Categories, Post Tags
  • Max Length per Item: 500
  • Additional prompt: “Extract the key facts, product recommendations, and practical advice. Remove filler content.”

Result: Clean, condensed text from each post, ready for embedding generation.


Editing a Dataset

Click a dataset Name in the table to open the Edit view.

What you can do:

  • Change the Dataset Type (Raw Text ↔ Prompt → Completion)
  • Edit text content or Q&A pairs
  • Add new Prompt → Completion pairs via the Add button
  • Delete selected pairs via the Delete button
  • Save changes with the Save dataset button

Note: Only datasets created via Text Input or Upload File are directly editable. Site Content datasets are managed through the generation process — to change them, create a new dataset with updated settings.


What’s Next?

Once your dataset has Ready status, you can use it in Embeddings (Raw Text datasets) or Fine-tuning (Prompt → Completion datasets).

For best practices on dataset preparation, see: Best Practices for Datasets
