tarachat

PDF Upload Feature

TaraChat now supports uploading PDF documents to enhance the chatbot’s knowledge base.

Features

PDF Text Extraction: Automatically extracts text from all pages
Metadata Extraction: Captures title, author, subject, and creator from PDF metadata
Page Markers: Extracted text is labeled with its page number to preserve context
Validation: Ensures only valid PDF files are processed
Error Handling: Clear error messages for invalid or unreadable PDFs
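
As a rough illustration of the page-marker idea, the sketch below joins per-page text and prefixes each page with a marker line. The function name join_pages_with_markers and the "--- Page N ---" marker format are assumptions made for this example; the format actually produced by backend/app/pdf_processor.py may differ.

```python
from typing import List


def join_pages_with_markers(pages: List[str]) -> str:
    """Join per-page text, prefixing each non-empty page with a marker line.

    The marker format is hypothetical; adjust it to match the real backend.
    """
    parts = []
    for number, text in enumerate(pages, start=1):
        cleaned = text.strip()
        if not cleaned:
            continue  # skip pages that yielded no extractable text
        parts.append(f"--- Page {number} ---\n{cleaned}")
    return "\n\n".join(parts)
```

Page numbers are kept from the original document even when an empty page is skipped, so a marker still points at the right page.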

How to Upload a PDF

Via Web Interface

Click the “Upload Document” button in the header
Select the “PDF File” tab
Click “Choose File” and select your PDF
Click “Upload”
Wait for the success message

The PDF will be automatically:

Validated
Parsed to extract text from all pages
Split into chunks
Embedded and stored in the vector database

Via API

curl -X POST http://localhost:8000/documents/upload-pdf \
  -F "file=@/path/to/your/document.pdf"

Response:

{
  "message": "PDF uploaded and processed successfully",
  "filename": "document.pdf",
  "metadata": {
    "num_pages": 10,
    "file_type": "pdf",
    "title": "Document Title",
    "author": "Author Name",
    "filename": "document.pdf"
  },
  "text_length": 15432
}
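
A client may want to read this response into a typed record rather than pass a raw dictionary around. The sketch below is one way to do that with the standard library only; the UploadResult class and parse_upload_response function are hypothetical names, and the field list is taken from the sample response above rather than from a published schema.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class UploadResult:
    message: str
    filename: str
    num_pages: int
    text_length: int
    metadata: Dict[str, Any]


def parse_upload_response(body: str) -> UploadResult:
    """Parse the JSON body of a successful upload into a typed record."""
    data = json.loads(body)
    # Metadata fields such as title or author may be absent for some PDFs.
    metadata = data.get("metadata", {})
    return UploadResult(
        message=data["message"],
        filename=data["filename"],
        num_pages=int(metadata.get("num_pages", 0)),
        text_length=int(data["text_length"]),
        metadata=metadata,
    )
```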

API Endpoint

POST /documents/upload-pdf

Content-Type: multipart/form-data
Parameter: file (PDF file)
Returns: Upload status and metadata

Error Responses

400 Bad Request

{
  "detail": "Only PDF files are supported"
}

400 Bad Request

{
  "detail": "Invalid PDF file"
}

400 Bad Request

{
  "detail": "No text content could be extracted from the PDF"
}

503 Service Unavailable

{
  "detail": "RAG system is not ready yet. Please try again later."
}
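
Because three different failures share the 400 status code, a client has to inspect the "detail" string to tell them apart. The sketch below shows one possible way to turn a status code and response body into a user-facing message. The function name describe_upload_error and the hint texts are assumptions for illustration; only the "detail" strings come from the responses listed above.

```python
import json

# Hypothetical hints keyed by the "detail" strings documented above.
_HINTS = {
    "Only PDF files are supported": "Upload a PDF file instead.",
    "Invalid PDF file": "Open the file in a PDF reader to verify it, or re-save it.",
    "No text content could be extracted from the PDF": "The PDF may be scanned; run OCR first.",
}


def describe_upload_error(status_code: int, body: str) -> str:
    """Return a readable message for a failed upload response."""
    try:
        detail = json.loads(body).get("detail", "")
    except (ValueError, AttributeError):
        detail = ""  # body was not a JSON object
    if status_code == 503:
        return "The RAG system is still starting; retry shortly."
    if status_code == 400:
        hint = _HINTS.get(detail)
        return f"{detail} ({hint})" if hint else (detail or "Bad request")
    return f"Unexpected response: HTTP {status_code}"
```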

Technical Details

Backend Implementation

PDF Processing (backend/app/pdf_processor.py):

Uses PyPDF2 for PDF parsing
Extracts text page by page
Captures PDF metadata (title, author, etc.)
Validates PDF structure

API Endpoint (backend/app/main.py):

File upload handling with FastAPI
Validation of file type and content
Integration with RAG system
Error handling and logging

Frontend Implementation

Upload Component (frontend/src/components/DocumentUpload.tsx):

Tab-based interface (Text / PDF File)
File input with PDF validation
Progress indicators
Success/error feedback

API Client (frontend/src/api.ts):

uploadPDF(file: File) function
FormData for file upload
Type-safe response handling

Supported PDF Features

✅ Supported:

Standard PDF format (PDF 1.0 - 1.7)
Text-based PDFs
Multi-page documents
PDF metadata
Encrypted PDFs (if no password is required to open them)

❌ Not Supported or Limited:

Password-protected PDFs
Scanned images (OCR not included)
Complex layouts (text may be extracted incompletely or out of order)
Embedded fonts with special encodings (characters may extract incorrectly)

Best Practices

Use Text-Based PDFs: Ensure your PDF contains selectable text, not just images
File Size: Keep PDFs under 10 MB for optimal processing
Page Count: Large PDFs (100+ pages) may take longer to process
Format: Use standard PDF formats for best results
Content: Clear, well-formatted text extracts better than complex layouts

Troubleshooting

“No text content could be extracted from the PDF”

This usually means:

The PDF contains only images (scanned document)
Text is embedded in a non-standard way
The PDF is corrupted

Solution: Try converting the PDF to text first, or use OCR software if it’s a scanned document.

“Invalid PDF file”

This means:

The file is not a valid PDF
The PDF structure is corrupted
The file extension is .pdf but the content is not

Solution: Open the PDF in a PDF reader to verify it’s valid, or try re-saving it.
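
A quick local check can catch the case where a file has a .pdf extension but different content. The sketch below looks for the "%PDF-" header that PDF files begin with. The function name looks_like_pdf is hypothetical, and this heuristic is an assumption about how to pre-check a file, not a description of the backend's validation; a file can pass this check and still be corrupted.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data carries a PDF header such as b'%PDF-1.7'.

    The header normally sits at the very start of the file. Searching the
    first 1024 bytes is a lenient heuristic for files with leading junk.
    """
    return b"%PDF-" in data[:1024]
```

For example, calling looks_like_pdf on the first kilobyte read from a renamed ZIP or DOCX file returns False, which explains an "Invalid PDF file" response before any upload is attempted.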

Upload is very slow

For large PDFs:

Processing time increases with page count
Check backend logs for progress
Be patient while the system extracts and processes all pages

Example Usage

Python Script

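import requests

# Upload a PDF as multipart/form-data; the form field must be named "file"
with open('document.pdf', 'rb') as f:
    files = {'file': f}
    response = requests.post(
        'http://localhost:8000/documents/upload-pdf',
        files=files
    )

# Error responses carry a "detail" field; successful ones carry the metadata
print(response.status_code, response.json())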

JavaScript/TypeScript

const uploadPDF = async (file: File) => {
  const formData = new FormData();
  formData.append('file', file);

  const response = await fetch('http://localhost:8000/documents/upload-pdf', {
    method: 'POST',
    body: formData,
  });

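  // fetch does not reject on HTTP errors, so surface the API's "detail" message
  const data = await response.json();
  if (!response.ok) {
    throw new Error(data.detail ?? `Upload failed with status ${response.status}`);
  }
  return data;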
};

cURL

# Upload a PDF
curl -X POST http://localhost:8000/documents/upload-pdf \
  -F "file=@document.pdf"

# With verbose output
curl -v -X POST http://localhost:8000/documents/upload-pdf \
  -F "file=@document.pdf"

Processing Flow

Upload: User selects PDF file
Validation: File type and PDF structure validated
Extraction: Text extracted from all pages with PyPDF2
Metadata: PDF metadata captured (title, author, etc.)
Chunking: Text split into manageable chunks
Embedding: Each chunk converted to vector embedding
Storage: Embeddings stored in FAISS vector database
Ready: Document available for RAG queries
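
The chunking step can be pictured with a small sketch. The version below splits text into fixed-size character windows that overlap slightly, so a sentence cut at a boundary still appears whole in one chunk. The function name split_into_chunks, the default sizes, and the character-based strategy are all assumptions for illustration; the splitter TaraChat actually uses is not described here.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    chunks = []
    step = chunk_size - overlap  # how far each window advances
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded and stored separately, which is why smaller chunks give more precise retrieval at the cost of more vectors.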

Security Considerations

File size limits should be enforced in production
PDF validation rejects files that are not well-formed PDFs, reducing the risk of malicious uploads
No file is permanently stored on disk (processed in memory)
Only text content is extracted and stored
Metadata is sanitized before storage
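
Since uploads are processed in memory, an enforced size limit matters more than usual. The sketch below reads an upload stream in blocks and stops as soon as a limit is exceeded, so an oversized file is never fully buffered. The names read_limited, UploadTooLarge, and MAX_UPLOAD_BYTES are hypothetical, and the 10 MB value simply mirrors the guideline under Best Practices; it is not a limit the backend is known to enforce.

```python
import io
from typing import BinaryIO

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed limit, mirroring the 10 MB guideline


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured size limit."""


def read_limited(stream: BinaryIO, limit: int = MAX_UPLOAD_BYTES,
                 block_size: int = 64 * 1024) -> bytes:
    """Read a stream into memory, aborting once more than `limit` bytes arrive."""
    buffer = io.BytesIO()
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break  # end of stream
        total += len(block)
        if total > limit:
            raise UploadTooLarge(f"upload exceeds {limit} bytes")
        buffer.write(block)
    return buffer.getvalue()
```

In a web handler, catching UploadTooLarge and returning an error response would keep the failure visible to the client instead of exhausting server memory.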

Future Enhancements

Potential improvements:

OCR support for scanned PDFs
Multiple file upload at once
Progress bar for large PDFs
PDF preview before upload
Support for other formats (DOCX, TXT, HTML)
