tarachat

PDF Upload Feature

TaraChat now supports uploading PDF documents to enhance the chatbot’s knowledge base.

Features

How to Upload a PDF

Via Web Interface

  1. Click the “Upload Document” button in the header
  2. Select the “PDF File” tab
  3. Click “Choose File” and select your PDF
  4. Click “Upload”
  5. Wait for the success message

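The PDF is then processed automatically: its text is extracted, split into chunks, converted to embeddings, and stored in the vector database, after which it is available for RAG queries.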

Via API

curl -X POST http://localhost:8000/documents/upload-pdf \
  -F "file=@/path/to/your/document.pdf"

Response:

{
  "message": "PDF uploaded and processed successfully",
  "filename": "document.pdf",
  "metadata": {
    "num_pages": 10,
    "file_type": "pdf",
    "title": "Document Title",
    "author": "Author Name",
    "filename": "document.pdf"
  },
  "text_length": 15432
}
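A client may want to read this response into a typed structure. The following is a minimal sketch, assuming the response carries the fields shown above; `UploadResult` and `parse_upload_response` are hypothetical names used for illustration and are not part of TaraChat.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class UploadResult:
    """Hypothetical container for the upload response shown above."""
    message: str
    filename: str
    num_pages: int
    text_length: int
    metadata: Dict[str, Any]


def parse_upload_response(payload: Dict[str, Any]) -> UploadResult:
    # Assumption: metadata fields such as title or author may be missing
    # for some PDFs, so the whole dictionary is kept next to num_pages.
    metadata = payload.get("metadata", {})
    return UploadResult(
        message=payload["message"],
        filename=payload["filename"],
        num_pages=int(metadata.get("num_pages", 0)),
        text_length=int(payload["text_length"]),
        metadata=metadata,
    )
```

Called with the JSON body above, this would give an object whose `num_pages` and `text_length` can be shown to the user after a successful upload.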

API Endpoint

POST /documents/upload-pdf

Error Responses

400 Bad Request

{
  "detail": "Only PDF files are supported"
}

400 Bad Request

{
  "detail": "Invalid PDF file"
}

400 Bad Request

{
  "detail": "No text content could be extracted from the PDF"
}

503 Service Unavailable

{
  "detail": "RAG system is not ready yet. Please try again later."
}
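The error responses above fall into two groups: 400 errors, which will recur until the file itself changes, and the 503 error, which may succeed on a later attempt. The sketch below shows one way a client could tell them apart. The function names are hypothetical, and treating only 503 as retryable is an assumption based on the messages listed above.

```python
from typing import Any, Dict

# Status codes taken from the error responses listed above.
RETRYABLE_STATUS = {503}


def should_retry(status_code: int) -> bool:
    # Only the 503 "RAG system is not ready yet" case is worth retrying;
    # a 400 will fail again until a different file is uploaded.
    return status_code in RETRYABLE_STATUS


def explain_upload_error(status_code: int, body: Dict[str, Any]) -> str:
    """Turn an error response into a short message for the caller."""
    detail = body.get("detail", "Unknown error")
    if should_retry(status_code):
        return f"Temporary failure, retry later: {detail}"
    if status_code == 400:
        return f"Upload rejected: {detail}"
    return f"Unexpected status {status_code}: {detail}"
```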

Technical Details

Backend Implementation

PDF Processing (backend/app/pdf_processor.py):

API Endpoint (backend/app/main.py):

Frontend Implementation

Upload Component (frontend/src/components/DocumentUpload.tsx):

API Client (frontend/src/api.ts):

Supported PDF Features

Supported:

Not Supported:

Best Practices

  1. Use Text-Based PDFs: Ensure your PDF contains selectable text, not just images
  2. File Size: Keep PDFs under 10 MB for optimal processing
  3. Page Count: Large PDFs (100+ pages) may take longer to process
  4. Format: Use standard PDF formats for best results
  5. Content: Clear, well-formatted text extracts better than complex layouts
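Some of these checks can be run on the client before uploading. The sketch below is a minimal, hypothetical pre-flight check (`preflight_check` is not part of TaraChat): it tests the file extension, the 10 MB guideline, and the standard `%PDF-` signature at the start of the file. It cannot tell whether the PDF contains selectable text.

```python
import os

MAX_BYTES = 10 * 1024 * 1024  # 10 MB guideline from the list above
PDF_MAGIC = b"%PDF-"          # signature that starts a PDF file


def preflight_check(path: str, max_bytes: int = MAX_BYTES) -> list:
    """Return a list of warnings; an empty list means nothing was flagged."""
    warnings = []
    if not path.lower().endswith(".pdf"):
        warnings.append("file name does not end with .pdf")
    if os.path.getsize(path) > max_bytes:
        warnings.append("file is larger than the recommended size")
    with open(path, "rb") as f:
        if f.read(len(PDF_MAGIC)) != PDF_MAGIC:
            warnings.append("file does not start with the %PDF- signature")
    return warnings
```

A file that passes these checks can still be rejected by the server, for example when no text can be extracted from it.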

Troubleshooting

“No text content could be extracted from the PDF”

This usually means the PDF contains only images (for example, a scanned document) rather than selectable text.

Solution: Try converting the PDF to text first, or use OCR software if it’s a scanned document.

“Invalid PDF file”

This means the file could not be read as a PDF, for example because it is corrupted or is not actually a PDF.

Solution: Open the PDF in a PDF reader to verify it’s valid, or try re-saving it.

Upload is very slow

Large PDFs (100+ pages) may take longer to process. Keeping files under 10 MB helps.

Example Usage

Python Script

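import requests

# Upload a PDF as multipart/form-data in the "file" field
with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/documents/upload-pdf',
        files={'file': ('document.pdf', f, 'application/pdf')},
    )

# Print the JSON body whether the upload succeeded or failed;
# error responses carry the reason in their "detail" field
print(response.status_code, response.json())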

JavaScript/TypeScript

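const uploadPDF = async (file: File) => {
  // The backend expects the PDF in a multipart field named "file"
  const formData = new FormData();
  formData.append('file', file);

  const response = await fetch('http://localhost:8000/documents/upload-pdf', {
    method: 'POST',
    body: formData,
  });

  // Error responses carry the reason in the "detail" field of the JSON body
  const data = await response.json();
  if (!response.ok) {
    throw new Error(data.detail ?? `Upload failed with status ${response.status}`);
  }
  return data;
};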

cURL

# Upload a PDF
curl -X POST http://localhost:8000/documents/upload-pdf \
  -F "file=@document.pdf"

# With verbose output
curl -v -X POST http://localhost:8000/documents/upload-pdf \
  -F "file=@document.pdf"

Processing Flow

  1. Upload: User selects PDF file
  2. Validation: File type and PDF structure validated
  3. Extraction: Text extracted from all pages with PyPDF2
  4. Metadata: PDF metadata captured (title, author, etc.)
  5. Chunking: Text split into manageable chunks
  6. Embedding: Each chunk converted to vector embedding
  7. Storage: Embeddings stored in FAISS vector database
  8. Ready: Document available for RAG queries
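The chunking step in the flow above can be sketched as follows. This is an illustration only: the chunk size, the overlap, and the character-based strategy are assumptions, and `chunk_text` is a hypothetical helper rather than the function TaraChat uses.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and less than chunk_size")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text, so the tail
        # is not repeated as an extra, shorter chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary partly visible in both neighbouring chunks, which can help retrieval. Each resulting chunk would then be embedded and stored as described in steps 6 and 7.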

Security Considerations

Future Enhancements

Potential improvements: