Knowledge Base - Definable.ai

A Knowledge Base is your AI agent’s memory and information repository. Think of it as a smart library where your agents can instantly find and retrieve relevant information to answer questions and complete tasks.

What is a Knowledge Base?

In simple terms, a Knowledge Base is a collection of documents, data, and information that your AI agents can search through and reference when helping users. Instead of just relying on what the AI model learned during training, agents can access up-to-date, specific information from your knowledge base.

Real-World Example

Imagine you’re building a customer support agent for your company:

Without Knowledge Base: The agent knows only what the model learned in training, so it tends to give generic answers.
With Knowledge Base: The agent can draw on your product manuals, FAQ documents, company policies, and recent updates to give accurate, specific answers.

How Knowledge Bases Work

Step-by-Step Process

📄 Document Upload: You upload files (PDFs, Word docs, web pages, etc.)
🔍 Processing: The system extracts and cleans the text content
✂️ Chunking: Long documents are split into smaller, manageable pieces
🧮 Embeddings: Each chunk is converted into a numeric vector (an embedding) that captures its meaning
🗄️ Storage: These representations are stored in a searchable database
❓ Query Time: When a user asks a question, the system finds the most relevant chunks
🤖 Response: The AI agent uses this information to generate accurate answers
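
The steps above can be sketched end to end in a few lines of Python. This is a toy illustration, not how the platform is implemented: `embed` is a stand-in that counts words instead of calling a trained embedding model, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text, size=80):
    """Chunking: split text into pieces of at most `size` characters on word boundaries."""
    pieces, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > size:
            pieces.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        pieces.append(current)
    return pieces

def embed(text):
    """Embeddings (stand-in): a sparse word-count vector; real systems use a trained model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_store(documents):
    """Storage: keep each chunk next to its vector."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(store, question, top_k=1):
    """Query time: rank stored chunks by similarity to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

store = build_store([
    "Refunds are accepted within 30 days of purchase.",
    "Shipping takes three to five business days.",
])
print(retrieve(store, "Are refunds accepted after purchase?"))
```

The retrieved chunk is what the agent would receive as context before it writes its answer. Swapping `embed` for a real embedding model changes the quality of the match, not the shape of the pipeline.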

Supported Content Types

Text Documents

PDF, Word, TXT, Markdown files

Web Content

Websites, articles, documentation sites

Structured Data

CSV files, spreadsheets, JSON data

Rich Media

Images with text (OCR), presentations

For Business Users

Why You Need a Knowledge Base

Before Knowledge Base:

User: "What's your refund policy?"
Agent: "I don't have specific information about refund policies. Please contact support."

After Knowledge Base:

User: "What's your refund policy?"
Agent: "Our refund policy allows returns within 30 days of purchase. Items must be in original condition. Here are the steps to process a refund... [detailed, accurate information from your actual policy documents]"

Business Benefits

🎯 Accurate Information: Agents give precise answers based on your actual documents
⚡ Instant Updates: Update a document once, and every agent that uses the knowledge base picks up the change
📈 Better Customer Experience: Faster, more helpful responses
💰 Cost Savings: Reduce human support workload
🔄 Consistency: Same accurate information across all interactions

Getting Started (Business User)

Identify Your Content: Gather FAQs, manuals, policies, product information
Upload Documents: Drag and drop files into the knowledge base
Test and Refine: Ask your agent questions to see how it performs
Keep Updated: Regularly add new information and remove outdated content

For Developers

Architecture Overview

Document Processing Pipeline

1. Text Extraction

Different file types require different extraction methods:

# PDF Extraction Example
import fitz  # PyMuPDF

def extract_pdf_text(file_path):
    # The context manager closes the file even if extraction fails
    with fitz.open(file_path) as doc:
        # Concatenate the plain text of every page
        return "".join(page.get_text() for page in doc)

# Web Scraping Example
import requests
from bs4 import BeautifulSoup

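def extract_web_content(url):
    # A timeout prevents the request from hanging indefinitely
    response = requests.get(url, timeout=10)
    # Fail loudly on 4xx/5xx instead of indexing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    # Remove script and style elements
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join text nodes with spaces and trim surrounding whitespace
    return soup.get_text(separator=" ", strip=True)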

2. Chunking Strategies

Different chunking approaches for different content types:

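# Semantic Chunking (Recommended)
# Groups whole sentences so chunks never cut a sentence in half.
# split_into_sentences is a helper you supply (for example, a wrapper
# around nltk.sent_tokenize).
def semantic_chunk(text, max_chunk_size=1000, overlap=200):
    sentences = split_into_sentences(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Flush only a non-empty chunk; a single oversized sentence is kept whole
        if current_chunk and len(current_chunk) + len(sentence) > max_chunk_size:
            chunks.append(current_chunk)
            # Keep some overlap for context (note that s[-0:] would be the whole string)
            tail = current_chunk[-overlap:] if overlap > 0 else ""
            current_chunk = tail + sentence
        else:
            current_chunk += sentence

    if current_chunk:
        chunks.append(current_chunk)

    return chunks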

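# Fixed-Size Chunking (Simple)
def fixed_chunk(text, chunk_size=1000, overlap=200):
    # An overlap >= chunk_size would never advance and loop forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoids emitting a redundant tail
        start = end - overlap

    return chunks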
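
The semantic chunking example calls `split_into_sentences`, which this guide does not define. A minimal sketch of such a helper follows, assuming a naive rule (split on whitespace after `.`, `!`, or `?`). It will mis-split abbreviations such as "e.g." or "Dr.", so a proper sentence tokenizer is the better choice for production content.

```python
import re

# Hypothetical helper: a naive rule-based splitter, not a linguistic tokenizer.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_into_sentences(text):
    """Split on whitespace that follows '.', '!' or '?', keeping the punctuation.

    Every sentence except the last keeps one trailing space, so chunks built
    by concatenating sentences remain readable.
    """
    parts = [p for p in _SENTENCE_END.split(text.strip()) if p]
    return [p + " " for p in parts[:-1]] + parts[-1:]

print(split_into_sentences("Returns take 30 days. Is shipping free? Yes!"))
```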

3. Embedding Generation

Transform text chunks into vector representations:

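# Using OpenAI Embeddings (client interface of openai>=1.0)
from openai import OpenAI

client = OpenAI()  # Reads the OPENAI_API_KEY environment variable

def generate_embeddings(text_chunks):
    # The endpoint accepts a list of inputs, so one request embeds a whole batch;
    # split very large lists into several requests
    response = client.embeddings.create(
        input=text_chunks,
        model="text-embedding-ada-002"
    )
    return [item.embedding for item in response.data]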

# Using Sentence Transformers (Local)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_local_embeddings(text_chunks):
    embeddings = model.encode(text_chunks)
    return embeddings.tolist()

Retrieval Patterns

Semantic Search

Basic similarity search using vector embeddings:

from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query, knowledge_base, top_k=5):
    # Embed the query with the same model that embedded the stored chunks
    query_embedding = generate_local_embeddings([query])[0]

    # cosine_similarity expects 2D inputs and returns shape (1, n_chunks)
    similarities = cosine_similarity([query_embedding], knowledge_base.embeddings)[0]

    # Indices of the top-k most similar chunks, best match first
    top_indices = similarities.argsort()[-top_k:][::-1]

    return [knowledge_base.chunks[i] for i in top_indices]

Hybrid Search

Combine semantic search with keyword search:

def hybrid_search(query, knowledge_base, top_k=5, alpha=0.7):
    # Semantic search results
    semantic_results = semantic_search(query, knowledge_base, top_k * 2)
    
    # Keyword search results (BM25)
    keyword_results = bm25_search(query, knowledge_base, top_k * 2)
    
    # Combine and rerank
    combined_results = combine_results(
        semantic_results, 
        keyword_results, 
        alpha=alpha
    )
    
    return combined_results[:top_k]
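
The hybrid search example relies on a `combine_results` helper that is not shown. One possible way to write it is weighted reciprocal rank fusion, sketched below. This is an assumption about how the merge could work, not the platform's method: it treats both inputs as ranked lists of hashable chunk identifiers (best first), and the `k` parameter is a conventional damping constant introduced here for illustration.

```python
def combine_results(semantic_results, keyword_results, alpha=0.7, k=60):
    """Merge two ranked lists with weighted reciprocal rank fusion.

    Hypothetical helper: alpha weights the semantic ranking and (1 - alpha)
    the keyword ranking; k dampens the advantage of the very top ranks.
    """
    scores = {}
    for weight, results in ((alpha, semantic_results), (1 - alpha, keyword_results)):
        for rank, item in enumerate(results, start=1):
            scores[item] = scores.get(item, 0.0) + weight / (k + rank)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["chunk_a", "chunk_b", "chunk_c"]
keyword = ["chunk_c", "chunk_a", "chunk_d"]
print(combine_results(semantic, keyword))
```

Rank-based fusion avoids having to put cosine similarities and BM25 scores on a common scale, which is why it is a common default when the two searches return only ordered results.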

Metadata Filtering

Filter results based on document metadata:

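from types import SimpleNamespace

def filtered_search(query, knowledge_base, filters=None, top_k=5):
    if not filters:
        return semantic_search(query, knowledge_base, top_k)

    # Apply filters first. apply_filters is a helper you supply; here it is
    # assumed to return the indices of the chunks whose metadata match.
    keep = apply_filters(knowledge_base.chunks, filters)
    if not keep:
        return []

    # semantic_search needs chunks and embeddings together, so build a
    # filtered view instead of passing a bare list of chunks
    subset = SimpleNamespace(
        chunks=[knowledge_base.chunks[i] for i in keep],
        embeddings=[knowledge_base.embeddings[i] for i in keep],
    )
    return semantic_search(query, subset, top_k)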

# Example usage
results = filtered_search(
    query="refund policy",
    knowledge_base=kb,
    filters={
        "category": "customer_service",
        "date_range": ("2024-01-01", "2024-12-31")
    }
)
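
The filter dictionary in the usage example mixes an exact-match field with a date range. A sketch of how such filters might be evaluated is below. The metadata layout (a `date` field holding an ISO date string), the inclusive range, and both function names are assumptions made for illustration.

```python
from datetime import date

def match_filters(metadata, filters):
    """Return True if one chunk's metadata satisfies every filter.

    Hypothetical helper. "date_range" is an inclusive (start, end) pair of
    ISO dates checked against metadata["date"]; any other key must equal
    the metadata value exactly.
    """
    for key, expected in filters.items():
        if key == "date_range":
            raw = metadata.get("date")
            if raw is None:
                return False
            start, end = (date.fromisoformat(d) for d in expected)
            if not start <= date.fromisoformat(raw) <= end:
                return False
        elif metadata.get(key) != expected:
            return False
    return True

def matching_indices(metadata_list, filters):
    """Indices of the chunks whose metadata pass the filters."""
    return [i for i, meta in enumerate(metadata_list) if match_filters(meta, filters)]

metadata_list = [
    {"category": "customer_service", "date": "2024-03-15"},
    {"category": "customer_service", "date": "2023-11-02"},
    {"category": "engineering", "date": "2024-06-01"},
]
filters = {"category": "customer_service", "date_range": ("2024-01-01", "2024-12-31")}
print(matching_indices(metadata_list, filters))
```

Filtering before the similarity search keeps out-of-scope chunks from occupying top-k slots, at the cost of needing metadata stored alongside every chunk.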

Advanced Features

Query Expansion

Improve search results by expanding the query:

def expand_query(original_query, expansion_model):
    # Generate related terms
    related_terms = expansion_model.get_related_terms(original_query)
    
    # Create expanded query; join avoids a trailing space when no terms are found
    expanded_query = " ".join([original_query, *related_terms])
    
    return expanded_query
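
The `expansion_model` above only needs a `get_related_terms` method. To make the idea concrete, here is a sketch that backs it with a hand-written synonym table. The `SynonymExpander` class and its table are hypothetical; a real deployment might use a thesaurus or a language model instead.

```python
class SynonymExpander:
    """Hypothetical expansion model backed by a hand-written synonym table."""

    def __init__(self, synonyms):
        self.synonyms = {key.lower(): value for key, value in synonyms.items()}

    def get_related_terms(self, query):
        terms = []
        for word in query.lower().split():
            for related in self.synonyms.get(word, []):
                if related not in terms:  # Keep order, drop repeats
                    terms.append(related)
        return terms

def expand_query(original_query, expansion_model):
    related_terms = expansion_model.get_related_terms(original_query)
    return " ".join([original_query, *related_terms])

expander = SynonymExpander({"refund": ["return", "reimbursement"], "policy": ["terms"]})
print(expand_query("refund policy", expander))
```

Expansion helps most with keyword search, where a document that says "reimbursement" would otherwise never match a query that says "refund".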

Result Reranking

Improve result relevance with cross-encoder models:

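from sentence_transformers import CrossEncoder

# reranker_model is expected to be a CrossEncoder instance, for example
# CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank_results(query, search_results, reranker_model):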
    # Create query-document pairs
    pairs = [(query, result.text) for result in search_results]
    
    # Score pairs
    scores = reranker_model.predict(pairs)
    
    # Sort by relevance
    ranked_results = sorted(
        zip(search_results, scores), 
        key=lambda x: x[1], 
        reverse=True
    )
    
    return [result for result, score in ranked_results]
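
To try the reranking flow without downloading a model, the sketch below substitutes a stand-in scorer that exposes the same `predict(pairs)` call. `OverlapReranker` and the `Result` class are hypothetical and exist only for this demonstration; word overlap is far cruder than a cross-encoder.

```python
from dataclasses import dataclass

@dataclass
class Result:
    text: str

class OverlapReranker:
    """Hypothetical stand-in with the same predict(pairs) call as a cross-encoder."""

    def predict(self, pairs):
        scores = []
        for query, text in pairs:
            q, t = set(query.lower().split()), set(text.lower().split())
            # Share of query words that also appear in the candidate text
            scores.append(len(q & t) / len(q) if q else 0.0)
        return scores

def rerank_results(query, search_results, reranker_model):
    pairs = [(query, result.text) for result in search_results]
    scores = reranker_model.predict(pairs)
    ranked = sorted(zip(search_results, scores), key=lambda x: x[1], reverse=True)
    return [result for result, _ in ranked]

candidates = [Result("shipping times and costs"), Result("refund policy for damaged items")]
top = rerank_results("refund policy", candidates, OverlapReranker())
print(top[0].text)
```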

Performance Optimization

Indexing Strategies

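# Vector index optimization
import faiss
import numpy as np

def create_optimized_index(embeddings):
    # FAISS requires a contiguous float32 array of shape (n_vectors, dimension)
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    dimension = embeddings.shape[1]

    # For large datasets, use IVF (Inverted File) index
    if len(embeddings) > 100000:
        quantizer = faiss.IndexFlatL2(dimension)
        index = faiss.IndexIVFFlat(quantizer, dimension, 1000)  # 1000 clusters
        # IVF must learn its clusters before vectors can be added
        index.train(embeddings)
        # Clusters scanned per query (default 1); higher improves recall, costs speed
        index.nprobe = 10
    else:
        # For smaller datasets, use flat (exact) index
        index = faiss.IndexFlatL2(dimension)

    # Both indexes rank by Euclidean distance; normalize vectors first
    # if you want rankings that match cosine similarity
    index.add(embeddings)
    return index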

Caching Strategies

import hashlib
import json
from functools import lru_cache
import redis

# In-memory caching (lru_cache keys on the arguments, so pass the query itself)
@lru_cache(maxsize=1000)
def cached_search(query, knowledge_base_id):
    return perform_search(query, knowledge_base_id)

# Redis caching
redis_client = redis.Redis(host='localhost', port=6379, db=0)

def cached_search_redis(query, knowledge_base_id, ttl=3600):
    # Built-in hash() is randomized per process for strings, so use a stable digest
    query_digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    cache_key = f"search:{query_digest}:{knowledge_base_id}"

    # Try to get from cache
    cached_result = redis_client.get(cache_key)
    if cached_result:
        return json.loads(cached_result)

    # Perform search and cache result for ttl seconds
    result = perform_search(query, knowledge_base_id)
    redis_client.setex(
        cache_key,
        ttl,
        json.dumps(result, default=str)
    )

    return result
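
Cached results go stale when the knowledge base changes. One way to handle that, sketched below, is to fold a version marker into the cache key so that updated content never reads old entries. The `make_cache_key` helper and its `kb_version` parameter are hypothetical: the sketch assumes you track a counter or timestamp that changes whenever documents are added, edited, or removed.

```python
import hashlib

def make_cache_key(query, knowledge_base_id, kb_version):
    """Build a deterministic cache key that changes when the content changes.

    Hypothetical helper. The query is case-folded and its whitespace collapsed
    so trivially different spellings of the same question share one entry.
    """
    normalized = " ".join(query.casefold().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"search:{knowledge_base_id}:v{kb_version}:{digest}"

print(make_cache_key("  Refund   Policy ", "kb1", 3))
```

Entries written under an old version simply stop being requested and expire through their TTL, so no explicit purge is needed.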

Configuration Options

Document Processing Settings

{
  "processing": {
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "chunking_strategy": "semantic",
    "embedding_model": "text-embedding-ada-002",
    "enable_ocr": true,
    "supported_formats": ["pdf", "docx", "txt", "md", "html"]
  },
  "retrieval": {
    "search_type": "hybrid",
    "top_k": 5,
    "similarity_threshold": 0.7,
    "enable_reranking": true,
    "metadata_boost": {
      "title": 1.5,
      "headers": 1.2
    }
  }
}
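
Some combinations of these settings cannot work, such as an overlap as large as the chunk itself. A sketch of a sanity check that could run before a configuration is applied follows. The rules it enforces are assumptions chosen for illustration, not documented platform constraints, and `validate_config` is a hypothetical name.

```python
import json

def validate_config(config):
    """Return a list of problems found in a knowledge base config dictionary."""
    problems = []
    processing = config.get("processing", {})
    retrieval = config.get("retrieval", {})

    chunk_size = processing.get("chunk_size", 0)
    overlap = processing.get("chunk_overlap", 0)
    if chunk_size <= 0:
        problems.append("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        problems.append("chunk_overlap must be >= 0 and smaller than chunk_size")
    if retrieval.get("top_k", 1) < 1:
        problems.append("top_k must be at least 1")
    if not 0.0 <= retrieval.get("similarity_threshold", 0.0) <= 1.0:
        problems.append("similarity_threshold must be between 0 and 1")
    return problems

config = json.loads("""
{
  "processing": {"chunk_size": 1000, "chunk_overlap": 200},
  "retrieval": {"top_k": 5, "similarity_threshold": 0.7}
}
""")
print(validate_config(config))
```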

Quality Control

# Content quality checks
def validate_chunk_quality(chunk):
    # Empty chunks fail outright, which also avoids dividing by zero below
    if not chunk:
        return False
    checks = {
        "min_length": len(chunk) >= 50,
        "max_length": len(chunk) <= 2000,
        "not_mostly_numbers": sum(c.isdigit() for c in chunk) / len(chunk) < 0.8,
        "has_meaningful_content": has_meaningful_words(chunk)
    }
    return all(checks.values())

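# Duplicate detection
from sklearn.metrics.pairwise import cosine_similarity

def detect_duplicates(chunks, threshold=0.9):
    if len(chunks) < 2:
        return []

    embeddings = generate_embeddings(chunks)
    # One pairwise matrix is far cheaper than one cosine_similarity call per pair
    similarity_matrix = cosine_similarity(embeddings)
    duplicates = []

    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            similarity = float(similarity_matrix[i][j])
            if similarity > threshold:
                duplicates.append((i, j, similarity))

    return duplicates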
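
The quality check calls `has_meaningful_words`, which is left undefined above. A possible heuristic is sketched here: count distinct words of three or more letters that are not stop words. The stop-word list, the `min_words` threshold, and the rule itself are assumptions for illustration.

```python
import re

# Hypothetical stop-word list; a real deployment would use a fuller one
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "or", "on", "for"}

def has_meaningful_words(chunk, min_words=5):
    """Heuristic: at least min_words distinct non-stop-words of 3+ letters."""
    # [^\W\d_] matches any Unicode letter, so non-English text is counted too
    words = re.findall(r"[^\W\d_]{3,}", chunk.lower())
    meaningful = {word for word in words if word not in STOP_WORDS}
    return len(meaningful) >= min_words

print(has_meaningful_words("Refunds are accepted within thirty days of purchase"))
print(has_meaningful_words("1234 5678 --- 90"))
```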

Best Practices

For Content Creators

Structure Your Documents: Use clear headings and sections
Keep Information Current: Regularly update outdated content
Use Consistent Terminology: Maintain consistent language across documents
Include Context: Provide enough context in each section

For Developers

Chunk Strategically: Balance between context and specificity
Monitor Performance: Track search quality and response times
Implement Feedback Loops: Use user interactions to improve search
Version Control: Track changes to knowledge base content
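
Tracking search quality needs a number to watch. Two common retrieval metrics are sketched below, assuming you keep a small test set of questions with the chunk ids a good answer should draw on. The chunk ids and function names are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(set(relevant))

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["c7", "c2", "c9", "c4", "c1"]  # What the search returned, best first
relevant = ["c2", "c4", "c8"]               # What a human marked as relevant
print(recall_at_k(retrieved, relevant, k=5))
print(reciprocal_rank(retrieved, relevant))
```

Averaging these values over the test set before and after a change to chunk size, embedding model, or reranking gives a concrete basis for deciding whether the change helped.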

Security Considerations

Access Control: Implement proper permissions for sensitive documents
Data Privacy: Ensure compliance with privacy regulations
Audit Trails: Log access and modifications
Encryption: Encrypt sensitive data at rest and in transit

Common Use Cases

Customer Support Knowledge Base

# Example: Creating a support KB
kb_config = {
    "name": "Customer Support KB",
    "categories": ["faqs", "troubleshooting", "policies"],
    "auto_categorization": True,
    "search_boost": {
        "faqs": 1.3,
        "recent_updates": 1.2
    }
}

Product Documentation

# Example: Technical documentation KB
kb_config = {
    "name": "API Documentation",
    "chunking_strategy": "code_aware",
    "preserve_code_blocks": True,
    "metadata_extraction": {
        "api_endpoints": True,
        "code_examples": True,
        "version_info": True
    }
}

Research and Analysis

# Example: Research paper KB
kb_config = {
    "name": "Research Papers",
    "citation_extraction": True,
    "academic_parsing": True,
    "metadata_fields": ["authors", "publication_date", "journal", "doi"]
}

Troubleshooting

Common Issues and Solutions

Issue: Search results are not relevant

Solutions:

Adjust chunk size and overlap
Try different embedding models
Implement query expansion
Add result reranking

Issue: Slow search performance

Solutions:

Use vector indexing (FAISS, Pinecone)
Implement caching
Optimize chunk size
Use approximate search methods

Issue: Memory usage too high

Solutions:

Use smaller embedding models
Implement batch processing
Use external vector databases
Compress embeddings
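
As an illustration of the last suggestion, the sketch below compresses an embedding with scalar quantization: each float is mapped to a signed 8-bit integer, which takes 1 byte per value instead of 4 for float32. This is one simple technique among several, the function names are hypothetical, and the precision lost may or may not be acceptable for your search quality.

```python
from array import array

def quantize_int8(vector):
    """Scale floats into the int8 range [-127, 127]; returns (codes, scale)."""
    peak = max((abs(x) for x in vector), default=0.0)
    scale = peak / 127 if peak else 1.0
    codes = array("b", (round(x / scale) for x in vector))
    return codes, scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [code * scale for code in codes]

vector = [0.12, -0.5, 0.33, 0.0]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)
print(list(codes))
print(restored)
```

Each restored value differs from the original by at most half of `scale`, so vectors with a few very large components lose the most detail in their small components.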

Next Steps

Now that you understand Knowledge Bases, explore related concepts:

Vector Database - Deep dive into the search engine that powers knowledge retrieval
AI Agents - Learn how agents use knowledge bases to provide better responses
Tools - Discover how to create tools that can search and retrieve information

Ready to implement? Check out the Knowledge Base API Reference or start with our Getting Started Guide.
