This guide explains how to populate and manage documents in the TaraChat vector database using the ingestion script instead of the web interface.
The ingest_documents.py script provides a command-line interface for managing documents in the FAISS vector store. It supports:

- Adding documents from individual files or entire directories
- Updating a document's content and metadata by ID
- Listing all documents currently in the store
- Deleting a single document or clearing the store entirely
Make sure the backend environment is set up:
cd backend
poetry install
Add all .txt files from a directory:
poetry run python scripts/ingest_documents.py add --dir data/documents/
Add files with a different pattern:
poetry run python scripts/ingest_documents.py add --dir data/documents/ --pattern "*.md"
Add a document with a specific ID:
poetry run python scripts/ingest_documents.py add --file data/documents/paris.txt --id paris
Add with custom metadata:
poetry run python scripts/ingest_documents.py add \
--file data/documents/paris.txt \
--id paris \
--metadata '{"author": "Tourism Board", "category": "geography", "language": "fr"}'
Update a document’s content:
poetry run python scripts/ingest_documents.py update --id paris --file data/documents/paris_updated.txt
Update with new metadata:
poetry run python scripts/ingest_documents.py update \
--id paris \
--file data/documents/paris_updated.txt \
--metadata '{"author": "Updated Author", "version": "2.0"}'
View all documents in the vector store:
poetry run python scripts/ingest_documents.py list
This shows each document's ID along with its stored metadata.
Delete a specific document:
poetry run python scripts/ingest_documents.py delete --id paris
Note: FAISS doesn’t support direct deletion, so this rebuilds the vector store without the deleted document.
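Because FAISS lacks in-place deletion, the delete command keeps everything except the target document and rebuilds the index from what remains. A minimal sketch of that idea, using plain Python lists to stand in for the FAISS index (the script's actual internals are not shown here):

```python
# Illustrative only: plain lists stand in for the FAISS index,
# which the real script persists as index files on disk.
def rebuild_without(embeddings: dict, deleted_id: str):
    """Return (ids, vectors) for every document except deleted_id,
    mirroring the rebuild the delete command performs."""
    kept = {doc_id: vec for doc_id, vec in embeddings.items()
            if doc_id != deleted_id}
    ids = list(kept)
    return ids, [kept[doc_id] for doc_id in ids]

store = {"paris": [0.1, 0.2], "louvre": [0.3, 0.4]}
ids, vectors = rebuild_without(store, "paris")
```

The original store is left untouched; only the rebuilt copy omits the deleted document, which is why deletes are more expensive than adds here.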
Remove all documents from the vector store:
poetry run python scripts/ingest_documents.py clear
You’ll be prompted to confirm before deletion.
backend/data/
├── documents/ # Your document collection
│ ├── geography/
│ │ ├── paris.txt
│ │ └── france.txt
│ ├── culture/
│ │ ├── french_cuisine.txt
│ │ └── literature.txt
│ └── history/
│ ├── louvre.txt
│ └── french_revolution.txt
└── sample_documents.txt # Original sample file
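Given the nested layout above, directory ingestion presumably walks subdirectories as well. A pathlib sketch of that file discovery (whether the real script recurses, and how it matches --pattern, is an assumption here):

```python
from pathlib import Path

def find_documents(root: str, pattern: str = "*.txt"):
    """Recursively collect files under root matching pattern,
    e.g. data/documents/geography/paris.txt for the default *.txt."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())
```

With the tree above, `find_documents("data/documents")` would pick up all six .txt files, while `--pattern "*.md"` would match none of them.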
Choose meaningful document IDs that help you identify and update documents:
- Simple: paris.txt → ID: paris
- Category-prefixed: geography_paris, culture_cuisine
- Versioned: paris_v1, paris_v2

Metadata helps organize and filter documents. Common metadata fields:
{
"author": "Author Name",
"category": "geography",
"language": "fr",
"date": "2024-01-15",
"version": "1.0",
"source": "path/to/file.txt",
"tags": ["paris", "tourism", "france"]
}
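Hand-writing JSON inside shell quotes is error-prone (single vs. double quotes, trailing commas). One way to generate the --metadata argument programmatically, assuming only that the flag accepts a JSON object:

```python
import json
import shlex

metadata = {
    "author": "Tourism Board",
    "category": "geography",
    "language": "fr",
    "tags": ["paris", "tourism", "france"],
}

# json.dumps guarantees valid JSON; shlex.quote escapes it for the shell.
cmd = (
    "poetry run python scripts/ingest_documents.py add "
    "--file data/documents/paris.txt --id paris "
    "--metadata " + shlex.quote(json.dumps(metadata))
)
print(cmd)
```

Printing the command and pasting it into a shell avoids quoting mistakes entirely.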
The script automatically adds:
- doc_id: the document identifier
- source: the file path (when using --file)
- filename: the original filename

For an initial bulk load, place your files in the data/documents/ directory, then ingest and verify:

poetry run python scripts/ingest_documents.py add --dir data/documents/
poetry run python scripts/ingest_documents.py list
When you need to update a document:
1. Edit the file, e.g. data/documents/paris.txt
2. Run the update command:

poetry run python scripts/ingest_documents.py update --id paris --file data/documents/paris.txt
Keep different versions of documents:
# Add version 1
poetry run python scripts/ingest_documents.py add \
--file docs/guide_v1.txt \
--id guide_v1 \
--metadata '{"version": "1.0"}'
# Add version 2
poetry run python scripts/ingest_documents.py add \
--file docs/guide_v2.txt \
--id guide_v2 \
--metadata '{"version": "2.0"}'
# Remove old version
poetry run python scripts/ingest_documents.py delete --id guide_v1
To use the ingestion script inside the Docker container:
# Enter the backend container
docker-compose exec backend bash
# Run ingestion commands
python scripts/ingest_documents.py add --dir data/documents/
python scripts/ingest_documents.py list
Or run directly:
docker-compose exec backend python scripts/ingest_documents.py list
Add these commands to your Makefile for convenience:
.PHONY: ingest-docs list-docs update-doc

ingest-docs:
	cd backend && poetry run python scripts/ingest_documents.py add --dir data/documents/

list-docs:
	cd backend && poetry run python scripts/ingest_documents.py list

update-doc:
	@test -n "$(ID)" && test -n "$(FILE)" || { echo "Usage: make update-doc ID=doc_id FILE=path/to/file.txt"; exit 1; }
	cd backend && poetry run python scripts/ingest_documents.py update --id $(ID) --file $(FILE)
Then use:
make ingest-docs
make list-docs
make update-doc ID=paris FILE=data/documents/paris.txt
The vector store is persisted as plain files (index.faiss, index.pkl), not a database. Document metadata is stored in vector_store/documents_metadata.json, which tracks each document's ID and its associated metadata.
Important: Back up this file along with the vector store files.
For production use with frequent updates, consider the following maintenance operations:
Use update instead of add:
poetry run python scripts/ingest_documents.py update --id mydoc --file mydoc.txt
List documents to check the ID:
poetry run python scripts/ingest_documents.py list
Clear and re-ingest:
poetry run python scripts/ingest_documents.py clear
poetry run python scripts/ingest_documents.py add --dir data/documents/
Adjust chunk size in backend/app/config.py:
chunk_size: int = 256 # Reduce from 512
chunk_overlap: int = 25 # Reduce from 50
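The interaction between these two settings can be illustrated with a simple character-based splitter (the application's real splitter may work on tokens instead, so treat the numbers as illustrative):

```python
# Character-based sketch of chunking; each window starts
# chunk_overlap characters before the previous one ended.
def chunk_text(text: str, chunk_size: int = 256, chunk_overlap: int = 25):
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 600)
```

Smaller chunks mean more, shorter entries in the index: a 600-character document at the defaults above yields three chunks, with the last 25 characters of each chunk repeated at the start of the next.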
# 1. Prepare your documents
mkdir -p backend/data/documents
cp your_docs/*.txt backend/data/documents/
# 2. Enter backend directory
cd backend
# 3. Ingest all documents
poetry run python scripts/ingest_documents.py add --dir data/documents/
# 4. Verify ingestion
poetry run python scripts/ingest_documents.py list
# 5. Update a specific document
echo "Updated content" > data/documents/paris.txt
poetry run python scripts/ingest_documents.py update --id paris --file data/documents/paris.txt
# 6. Delete an outdated document
poetry run python scripts/ingest_documents.py delete --id old_doc
# 7. Start the application
cd ..
docker-compose up -d