Document Chat Interface

Build intelligent chat interfaces that allow users to query and interact with documents using natural language, transforming static documents into interactive knowledge sources.

Overview

A document chat interface combines three capabilities:

  1. Document Processing - Extract and prepare documents
  2. Semantic Understanding - Understand questions and find relevant content
  3. Conversational Interface - Maintain context and provide natural responses

Common Applications

  • PDF Q&A: Answer questions about research papers, reports, books
  • Email Search: Find information in email archives conversationally
  • GitHub Explorer: Ask questions about code repositories
  • Knowledge Base: Interactive access to company documentation
  • Contract Review: Query legal documents with natural language
  • Research Assistant: Explore academic papers interactively

Architecture

Document Source
    ↓
Document Processor
    ├→ Extract text
    ├→ Process content
    └→ Generate embeddings
    ↓
Vector Database
    ↓
Chat Interface ← User Question
    ├→ Retrieve relevant content
    ├→ Maintain conversation history
    └→ Generate response

Core Components

1. Document Sources

See examples/document_processors.py for full implementations; a minimal sketch follows each source type below:

PDF Documents

  • Extract text from PDF pages
  • Preserve document structure and metadata
  • Handle scanned PDFs with OCR (pytesseract)
  • Extract tables (pdfplumber)

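As a minimal sketch of the extraction step using pdfplumber (the function name and metadata fields here are illustrative, not taken from examples/document_processors.py):

import pdfplumber

def extract_pdf_pages(path: str) -> list[dict]:
    """Extract text page by page, keeping page numbers for later citation."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, 1):
            text = page.extract_text() or ""  # image-only pages return None
            pages.append({"page": number, "text": text, "source": path})
    return pages
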
GitHub Repositories

  • Extract code files from repositories
  • Parse repository structure
  • Process multiple file types

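A sketch of the repository side, assuming the repo is already cloned locally (the extension filter is an assumption; adjust it to the languages you care about):

from pathlib import Path

TEXT_EXTENSIONS = {".py", ".js", ".md", ".java", ".go", ".rs"}

def load_repo_files(repo_dir: str) -> list[dict]:
    """Collect readable source files from a locally cloned repository."""
    files = []
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix in TEXT_EXTENSIONS:
            files.append({
                "path": str(path.relative_to(repo_dir)),  # keeps repo structure
                "text": path.read_text(errors="ignore"),
            })
    return files
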
Email Archives

  • Extract email metadata (from, to, subject, date)
  • Parse email body content
  • Handle multiple mailbox formats

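For mbox archives, the standard library covers both metadata and body extraction; a sketch (mbox only, other mailbox formats need their own readers):

import mailbox

def load_mbox(path: str) -> list[dict]:
    """Extract headers and the first plain-text body from each mbox message."""
    messages = []
    for msg in mailbox.mbox(path):
        body = b""
        for part in msg.walk():  # handles both flat and multipart messages
            if part.get_content_type() == "text/plain":
                body = part.get_payload(decode=True) or b""
                break
        messages.append({
            "from": msg["from"], "to": msg["to"],
            "subject": msg["subject"], "date": msg["date"],
            "body": body.decode(errors="ignore"),
        })
    return messages
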
Web Pages

  • Extract page text and structure
  • Preserve heading hierarchy
  • Extract links and navigation

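A sketch with requests and BeautifulSoup, keeping headings and links separately so hierarchy can be preserved downstream:

import requests
from bs4 import BeautifulSoup

def load_web_page(url: str) -> dict:
    """Extract visible text, heading hierarchy, and links from a page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # strip non-content elements before extracting text
    headings = [(h.name, h.get_text(strip=True))
                for h in soup.find_all(["h1", "h2", "h3"])]
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return {"url": url, "text": soup.get_text(" ", strip=True),
            "headings": headings, "links": links}
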
YouTube/Audio

  • Get transcripts from YouTube videos
  • Transcribe audio files
  • Handle multiple formats

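For transcripts, youtube-transcript-api is usually enough; this sketch uses the pre-1.0 static API (newer releases switched to an instance-based interface):

from youtube_transcript_api import YouTubeTranscriptApi

def load_youtube_transcript(video_id: str) -> str:
    """Fetch a video's transcript and join the segments into one string."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(segment["text"] for segment in segments)
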
2. Document Processing

See examples/text_processor.py for full implementations; sketches of the key steps follow below:

Text Extraction & Cleaning

  • Remove extra whitespace and special characters
  • Smart text chunking with overlap
  • Intelligent sentence boundary detection

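A sketch of chunking with overlap and a simple sentence-boundary heuristic (sizes are in characters here; token-based splitting is a common refinement):

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, preferring sentence boundaries."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last sentence end inside the window, if any
            boundary = text.rfind(". ", start, end)
            if boundary > start:
                end = boundary + 1
        chunks.append(text[start:end].strip())
        start = max(end - overlap, start + 1)  # overlap preserves context
    return chunks
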
Metadata Extraction

  • Extract title, author, date, language
  • Calculate word count and document statistics
  • Track document source and format

Structure Preservation

  • Keep heading hierarchy in chunks
  • Preserve section context
  • Enable hierarchical retrieval

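One way to keep section context, building on chunk_text above (the sections input shape is an assumption):

def chunk_with_headings(sections: list[dict], chunk_size: int = 1000) -> list[dict]:
    """Prefix each chunk with its section heading so retrieval keeps context."""
    chunks = []
    for section in sections:  # e.g. {"heading": "2. Methods", "text": "..."}
        for piece in chunk_text(section["text"], chunk_size):
            chunks.append({
                "text": f"{section['heading']}\n\n{piece}",
                "heading": section["heading"],  # enables hierarchical retrieval
            })
    return chunks
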
3. Chat Interface Design

See examples/conversation_manager.py for full implementations; sketches follow below:

Conversation Management

  • Maintain conversation history with size limits
  • Track message metadata (timestamps, roles)
  • Provide context for LLM integration
  • Clear history as needed

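A minimal sketch of the manager's shape (the real implementation lives in examples/conversation_manager.py; this only illustrates the idea):

from collections import deque
from datetime import datetime, timezone

class ConversationManager:
    """Bounded conversation history with per-message metadata."""

    def __init__(self, max_messages: int = 20):
        self.history = deque(maxlen=max_messages)  # oldest turns fall off

    def add(self, role: str, content: str) -> None:
        self.history.append({
            "role": role,
            "content": content,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def as_llm_messages(self) -> list[dict]:
        """History in the role/content shape most chat APIs expect."""
        return [{"role": m["role"], "content": m["content"]} for m in self.history]

    def clear(self) -> None:
        self.history.clear()
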
Question Refinement

  • Expand implicit references in questions
  • Handle pronouns and context references
  • Improve question clarity with previous context

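A sketch of refinement via the LLM itself (llm is a placeholder client here, as in the snippets below):

def refine_question(question: str, manager: ConversationManager) -> str:
    """Rewrite a follow-up question as a standalone query, resolving pronouns."""
    history = manager.as_llm_messages()
    if not history:
        return question
    recent = "\n".join(f"{m['role']}: {m['content']}" for m in history[-4:])
    prompt = (
        "Rewrite the user's question so it is understandable without the "
        "conversation, resolving references like 'it' or 'that section'.\n\n"
        f"Conversation:\n{recent}\n\nQuestion: {question}\n\nStandalone question:"
    )
    return llm.generate(prompt).strip()  # llm: placeholder LLM client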
Response Generation

  • Use document context for answering
  • Maintain conversation history in prompts
  • Provide source citations
  • Handle out-of-scope questions

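Putting those pieces together: a sketch of grounded answering with numbered citations and an explicit out-of-scope instruction (retrieved chunks and llm are placeholders):

def answer(question: str, retrieved: list[dict], manager: ConversationManager) -> str:
    """Answer from document context only, with [n] citations in the reply."""
    context = "\n\n".join(f"[{i}] {c['text']}" for i, c in enumerate(retrieved, 1))
    prompt = (
        "Answer using only the context below and cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"History:\n{manager.as_llm_messages()}\n\n"
        f"Question: {question}"
    )
    manager.add("user", question)
    reply = llm.generate(prompt)  # llm: placeholder LLM client
    manager.add("assistant", reply)
    return reply
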
4. User Experience Features

Citation & Sources

from typing import Dict, List

def format_response_with_citations(response: str, sources: List[Dict]) -> str:
    """Append a numbered source list (page, document, excerpt) to a response."""

    formatted = response + "\n\n**Sources:**\n"
    for i, source in enumerate(sources, 1):
        formatted += f"[{i}] Page {source['page']} of {source['source']}\n"
        if 'excerpt' in source:
            formatted += f"    \"{source['excerpt'][:100]}...\"\n"

    return formatted

Clarifying Questions

def generate_follow_up_questions(context: str, response: str) -> List[str]:
    """Suggest follow-up questions to the user (llm is a placeholder LLM client)."""

    prompt = f"""
    Based on this Q&A, generate 3 relevant follow-up questions, one per line:
    Context: {context[:500]}
    Response: {response[:500]}
    """

    # The model returns one block of text; split it into individual questions
    raw = llm.generate(prompt)
    return [line.strip() for line in raw.splitlines() if line.strip()]

Error Handling

def handle_query_failure(question: str, error: Exception) -> str:
    """Turn retrieval failures into helpful replies.

    NoRelevantDocuments, ContextTooLarge, and get_main_topics() are
    application-defined, not library types.
    """

    if isinstance(error, NoRelevantDocuments):
        return (
            "I couldn't find information about that in the documents. "
            "Try asking about different topics like: "
            + ", ".join(get_main_topics())
        )
    elif isinstance(error, ContextTooLarge):
        return (
            "The answer requires too much context. "
            "Can you be more specific about what you'd like to know?"
        )
    else:
        return f"I encountered an issue: {str(error)[:100]}"

Implementation Frameworks

Using LangChain

# Import paths below are for legacy LangChain (pre-0.1); newer releases move
# these into langchain_community and langchain_openai.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

# Load document
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# Split into chunks
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Create embeddings
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# Create chat chain
llm = ChatOpenAI(model="gpt-4")
qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Chat interface
chat_history = []
while True:
    question = input("You: ")
    result = qa({"question": question, "chat_history": chat_history})
    print(f"Assistant: {result['answer']}")
    chat_history.append((question, result['answer']))

Using LlamaIndex

# API shown is for llama_index 0.9.x; module paths differ in newer releases.
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms import OpenAI
from llama_index.memory import ChatMemoryBuffer

# Load documents
documents = SimpleDirectoryReader("./docs").load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Create chat engine with memory
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=ChatMemoryBuffer.from_defaults(token_limit=3900),
    llm=OpenAI(model="gpt-4"),
)

# Chat loop
while True:
    question = input("You: ")
    response = chat_engine.chat(question)
    print(f"Assistant: {response}")

Using a Custom RAG Pipeline

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load and embed documents (load_documents is an app-specific loader that
# returns a list of text chunks)
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = load_documents("document.pdf")
embeddings = model.encode(documents)

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings).astype('float32'))

# Chat function
def chat(question):
    # Embed question
    q_embedding = model.encode(question)

    # Retrieve documents
    k = 5
    distances, indices = index.search(
        np.array([q_embedding]).astype('float32'), k
    )

    # Get relevant documents
    context = " ".join([documents[i] for i in indices[0]])

    # Generate response (llm is a placeholder LLM client)
    response = llm.generate(
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )
    return response

Best Practices

Document Handling

  • ✓ Support multiple formats (PDF, TXT, docx, etc.)
  • ✓ Handle large documents efficiently
  • ✓ Preserve document structure
  • ✓ Extract metadata
  • ✓ Handle multiple languages
  • ✓ Implement OCR for scanned PDFs

Conversation Quality

  • ✓ Maintain conversation context
  • ✓ Ask clarifying questions
  • ✓ Cite sources
  • ✓ Handle ambiguity
  • ✓ Suggest follow-up questions
  • ✓ Handle out-of-scope questions

Performance

  • ✓ Optimize retrieval speed
  • ✓ Implement caching
  • ✓ Handle large document sets
  • ✓ Batch process documents
  • ✓ Monitor latency
  • ✓ Implement pagination

User Experience

  • ✓ Clear response formatting
  • ✓ Ability to cite sources
  • ✓ Document browser/explorer
  • ✓ Search suggestions
  • ✓ Query history
  • ✓ Export conversations

Common Challenges & Solutions

Challenge: Irrelevant Answers

Solutions:

  • Improve retrieval (more context, better embeddings)
  • Validate answer against context
  • Ask clarifying questions
  • Implement confidence scoring
  • Use hybrid search

Challenge: Lost Context Across Turns

Solutions:

  • Maintain conversation memory
  • Update retrieval based on history
  • Summarize long conversations
  • Re-weight previous queries

Challenge: Handling Long Documents

Solutions:

  • Hierarchical chunking
  • Summarize first
  • Question refinement
  • Multi-hop retrieval
  • Document navigation

Challenge: Limited Context Window

Solutions:

  • Compress retrieved context
  • Use document summarization
  • Hierarchical retrieval
  • Focus on most relevant sections
  • Iterative refinement

Advanced Features

Multi-Document Analysis

from typing import Dict, List

def compare_documents(question: str, documents: List[Dict]) -> str:
    """Ask the same question of several documents and synthesize the answers.

    query_document and llm are placeholders for your retrieval and LLM calls.
    """
    results = []

    for doc in documents:  # each doc: {"name": ..., "content": ...}
        response = query_document(doc, question)
        results.append({
            "document": doc["name"],
            "answer": response
        })

    # Compare and synthesize
    comparison = llm.generate(
        f"Compare these answers to '{question}': {results}"
    )
    return comparison

Interactive Document Exploration

class DocumentExplorer:
    def __init__(self, documents):
        self.documents = documents

    def browse_by_topic(self, topic):
        """Find documents by topic"""
        pass

    def get_related_documents(self, doc_id):
        """Find similar documents"""
        pass

    def get_key_terms(self, document):
        """Extract key terms and concepts"""
        pass

Resources

Document Processing Libraries

  • PyPDF: PDF handling
  • python-docx: Word document handling
  • BeautifulSoup: Web scraping
  • youtube-transcript-api: YouTube transcripts

Chat Frameworks

  • LangChain: Comprehensive framework
  • LlamaIndex: Document-focused
  • Vector stores (FAISS, Chroma, etc.): retrieval backends for RAG pipelines

Implementation Checklist

  • Choose document source(s) to support
  • Implement document loading and processing
  • Set up vector database/embeddings
  • Build chat interface
  • Implement conversation management
  • Add source citation
  • Handle edge cases (large docs, OCR, etc.)
  • Implement error handling
  • Add performance monitoring
  • Test with real documents
  • Deploy and monitor

Getting Started

  1. Start Simple: Single PDF, basic chat
  2. Add Features: Multi-document, conversation history
  3. Improve Quality: Better chunking, retrieval
  4. Scale: Support more formats, larger documents
  5. Polish: UX improvements, error handling