100% Local • Zero Cloud • Your Data Stays Yours

Your Personal AI.
No Filters. No Limits.

A private chatbot that runs entirely on your computer. Fine-tune it with your own data. Get uncensored, unrestricted responses. No subscriptions. No data harvesting.

Scroll to explore

What You Get with
EideticRAG

Built for people who value privacy, freedom, and control over their AI.

🔓

Uncensored Responses

Fine-tune the model on your own data without corporate restrictions. Get direct, unfiltered answers to your questions. No more "I can't help with that" messages.

🔒

Complete Privacy

Your conversations never leave your computer. No data sent to OpenAI, Google, or anyone else. Perfect for sensitive documents, personal journals, or confidential work.

💰

Zero Subscription Costs

No $20/month ChatGPT Plus. No API bills that spike unexpectedly. One-time setup, unlimited usage forever. Your electricity, your AI.

📚

Your Documents, Searchable

Upload your PDFs, notes, research papers, or personal files. Ask questions about them in natural language. The AI actually reads YOUR documents.

🌐

Works Offline

No internet required after setup. Use it on airplanes, in remote locations, or when your WiFi dies. Your AI is always available.

🎯

Train It Your Way

Fine-tune the model to speak like you, understand your industry jargon, or focus on topics you care about. Make it truly personal.

EideticRAG vs Cloud Chatbots

Feature | ChatGPT / Claude | EideticRAG
Your data sent to servers | Yes ❌ | Never ✓
Monthly cost | $20+/month | $0
Content restrictions | Heavy filtering | You control
Search your private docs | Limited/No | Unlimited ✓
Fine-tune on your data | Not possible | Full control ✓
Works offline | No | Yes ✓

Technical Deep Dive

Complete technical breakdown for developers. All values extracted from actual source code.

System Architecture

📥 Document Ingestion Layer
File Parsers pypdf, python-docx, txt
Text Chunker 512 tokens, 50 overlap*
Metadata Extractor Filename, page, position
↓
🧮 Embedding Layer
Embedding Cache Disk-based, LRU eviction
Batch Processing 32 chunks per batch*
↓
💾 Vector Storage Layer
SQLite Metadata SQLAlchemy ORM
Persistent Storage ./data/chroma_db/
↓
🔍 Retrieval Layer
Intent Classifier 7 types: factual, comparative, causal, procedural, opinion, code, unknown
Query Expansion Entity extraction + synonyms
MMR Diversification λ=0.7 (from code)
Multi-Hop Retrieval Iterative refinement
↓
🧠 Generation Layer
Prompt Templates System + Context + Query
Reflection Agent max_iterations=3, threshold=0.3
↓
✅ Output
Verified Answer With source citations
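
The "Embedding Cache (disk-based, LRU eviction)" box in the diagram above can be illustrated with a minimal in-memory sketch. The class and method names here are hypothetical, not the project's actual API:

```python
from collections import OrderedDict

class EmbeddingCache:
    """Minimal in-memory LRU sketch of the disk-based embedding cache."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, text):
        if text in self._store:
            self._store.move_to_end(text)  # mark as recently used
            return self._store[text]
        return None  # cache miss: caller computes the embedding

    def put(self, text, vector):
        self._store[text] = vector
        self._store.move_to_end(text)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used

cache = EmbeddingCache(max_entries=2)
cache.put("a", [0.1])
cache.put("b", [0.2])
cache.get("a")         # touching "a" makes "b" the LRU entry
cache.put("c", [0.3])  # over capacity: "b" is evicted
```

The real cache persists to disk so embeddings survive restarts; the eviction logic is the same idea.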

Data Flow Diagrams

🔍 Query Processing Pipeline

💬
User Query "What is RAG?"
→
🎯
Intent Classification type: FACTUAL, confidence: 0.85
→
🧮
Query Embedding 384-dim vector (MiniLM)
→
🔮
Vector Search ChromaDB HNSW, k=5
→
📊
MMR Diversification λ=0.7, dedup results
→
📝
Context Assembly System + Chunks + Query
→
🧠
LLM Generation Ollama → Llama 3.2 1B
→
🛡️
Reflection Verify claims, threshold=0.3
→
✅
Response Answer + Citations
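
The intent-classification step above uses keyword and regex matching per the stack table. A sketch of how that could work; the patterns are invented for illustration, and only the seven type names come from the source:

```python
import re

# Hypothetical keyword patterns per intent; the real classifier's rules differ
INTENT_PATTERNS = {
    "FACTUAL": r"\b(what|who|when|where|define)\b",
    "COMPARATIVE": r"\b(vs|versus|compare|better|difference)\b",
    "CAUSAL": r"\b(why|cause|because|reason)\b",
    "PROCEDURAL": r"\b(how|steps|install|configure)\b",
    "OPINION": r"\b(should|best|recommend|opinion)\b",
    "CODE": r"\b(code|function|error|traceback|snippet)\b",
}

def classify_intent(query: str):
    """Return (intent, confidence); UNKNOWN when no pattern matches."""
    query = query.lower()
    scores = {intent: len(re.findall(pat, query))
              for intent, pat in INTENT_PATTERNS.items()}
    best, hits = max(scores.items(), key=lambda kv: kv[1])
    if hits == 0:
        return "UNKNOWN", 0.0
    confidence = hits / max(1, sum(scores.values()))
    return best, round(confidence, 2)

print(classify_intent("What is RAG?"))  # → ('FACTUAL', 1.0)
```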

📄 Document Ingestion Pipeline

📁
Raw Documents PDF, TXT, DOCX files
→
📖
Text Extraction pypdf, python-docx
→
✂️
Chunking 512 tokens, 50 overlap
→
🏷️
Metadata filename, page, position
→
🧮
Embedding SentenceTransformers
→
💾
Vector Store ChromaDB persistent
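
The chunking stage (512 tokens, 50-token overlap) reduces to a sliding window. This sketch uses whitespace tokens as a stand-in for the real tokenizer and ignores the paragraph-aware splitting the stack table mentions:

```python
def chunk_text(text, max_tokens=512, overlap=50):
    """Sliding-window chunker; whitespace split approximates real tokenization."""
    tokens = text.split()
    step = max_tokens - overlap  # advance 462 tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + max_tokens]
        chunks.append({
            "text": " ".join(chunk),
            "position": start,  # token-offset metadata, as in the diagram
        })
        if start + max_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks

doc = " ".join(f"tok{i}" for i in range(1000))
print(len(chunk_text(doc)))  # → 3
```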

🎯 Fine-Tuning Pipeline (ModelOps)

📊
Training Data JSON/Parquet dataset
→
🔄
Tokenization max_length=512*
→
📦
Load Model 4-bit NF4 quantized
→
🔧
Apply LoRA r=8, α=16, 0.1% params
→
⚡
Training HF Trainer, MLflow
→
📈
Evaluation loss, perplexity
→
💎
Adapter LoRA weights saved
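
The evaluation step reports loss and perplexity. For a causal LM trained with mean cross-entropy loss, perplexity is simply the exponential of that loss:

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity = exp(mean cross-entropy loss in nats per token)."""
    return math.exp(mean_ce_loss)

print(round(perplexity(2.0), 2))  # → 7.39
```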

Technology Stack (Complete)

🤖 LLM & Fine-Tuning (ModelOps)

Meta Llama 3.2 1B 1 billion parameters, optimized for consumer hardware
QLoRA Configuration r=8, alpha=16, dropout=0.1, target_modules=[q_proj, k_proj, v_proj, o_proj]
4-bit NF4 Quantization BitsAndBytes, bnb_4bit_use_double_quant=True, compute_dtype=float16
Optimizer paged_adamw_8bit (memory efficient), lr=2e-4, weight_decay=0.01
Training Args batch=2, grad_accum=8 (effective=16), warmup=3%, max_grad_norm=0.3
Gradient Checkpointing Enabled for VRAM reduction (trades compute for memory)
MLflow Integration Experiment tracking, param/metric logging, artifact registration
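
The values above map directly onto the standard transformers + peft + bitsandbytes configuration objects. A sketch of that mapping (not executed here; `task_type` is an assumption, everything else is from the table):

```python
# Sketch of the QLoRA setup described above, using the standard
# transformers + peft + bitsandbytes APIs; values taken from the table
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",  # assumption: standard causal-LM fine-tuning
)
```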

🔍 RAG Pipeline (EideticRAG)

Vector Database ChromaDB with HNSW indexing, persistent storage mode
Embedding Model all-MiniLM-L6-v2 (384 dimensions, 80ms per query*)
Chunking Strategy 512 tokens max, 50 token overlap, paragraph-aware splitting
Intent Classification Keyword + regex matching, 7 intent types, confidence scoring
Retrieval Controller Policy-based retrieval, default_k=5, adaptive depth
MMR (Diversity) Maximal Marginal Relevance, diversity_factor=0.7
Multi-Hop Retrieval Iterative query refinement for complex questions
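
The MMR step (diversity_factor=0.7) can be sketched as a greedy loop: each pick trades query relevance against similarity to documents already selected. This is a pure-Python illustration, not the project's code:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, doc_vecs, k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance: lam weights query relevance,
    (1 - lam) penalizes redundancy with already-selected documents."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Two near-duplicate docs plus one orthogonal doc
docs = [[1, 0], [0.99, 0.01], [0, 1]]
print(mmr([1, 0], docs, k=2))  # first pick is the most relevant doc
```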

🛡️ Verification & Safety

Reflection Agent max_iterations=3, hallucination_threshold=0.3
Action Types ACCEPT, REGENERATE, BROADEN, ESCALATE, REFUSE
Verification Engine Claim extraction + source document matching
Safe Refusal Graceful decline when evidence insufficient
Answer Annotation Highlights unsupported claims in output
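
A toy decision rule shows how the action types and thresholds could interact. The real agent's policy is certainly richer (ESCALATE is omitted here); only the action names, max_iterations=3, and threshold=0.3 come from the source:

```python
# Hypothetical decision rule; only the action names and constants are from
# the source, the branching logic is invented for illustration
def decide(hallucination_score: float, iteration: int,
           max_iterations: int = 3, threshold: float = 0.3) -> str:
    if hallucination_score <= threshold:
        return "ACCEPT"
    if iteration < max_iterations:
        # alternate between rewriting the answer and widening retrieval
        return "REGENERATE" if iteration % 2 == 0 else "BROADEN"
    return "REFUSE"  # evidence still insufficient after all retries

print(decide(0.1, 0))  # → ACCEPT
print(decide(0.8, 3))  # → REFUSE
```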

⚙️ Infrastructure & APIs

FastAPI Async REST API, OpenAPI docs at /docs
Ollama Local LLM server, REST API at localhost:11434
SQLite + SQLAlchemy Metadata storage, ORM for data access
Temporal.io Workflow orchestration for training jobs (RetryPolicy with backoff)
HuggingFace Trainer Training loop with checkpointing, evaluation, metrics logging

📦 Data Engineering

Dataset Pipeline Ingestion → Preprocessing → Validation → Registration
Auto-Labeling Rule-based labeling for instruction datasets
Data Quality Schema validation, duplicate detection, text cleaning
Privacy Features PII anonymization, GDPR-compliant by design*

Performance Benchmarks

~200ms* Retrieval Latency 5K docs, SSD, warm cache
~2-5s* Full Query Response Retrieval + LLM generation
4GB VRAM (Inference) Llama 3.2 1B via Ollama
8GB* VRAM (Fine-Tuning) QLoRA on T4 GPU
2-4 hrs* Fine-Tuning Time 3 epochs, 1K samples, T4
384 dim Embedding Size MiniLM-L6-v2
~10GB Disk Space Model + deps + data
0.1% Trainable Params LoRA adapters only

⚠️ Trade-Offs of Local Deployment

* What you sacrifice for complete privacy and zero recurring costs.

  • You Need Your Own Hardware: Unlike cloud APIs, you control the compute. Smaller models (1B-3B) run on laptops; larger models need better GPUs. Trade-off: No usage limits or rate throttling.
  • Fine-Tuning Requires a GPU: Training requires NVIDIA GPUs (or free Colab/Kaggle). Trade-off: Complete control over model behaviorโ€”no content filters, no censorship.
  • Setup Takes Time (~15 min): Cloud models need just an API key. Local requires downloading models and dependencies. Trade-off: One-time setup for permanent data privacy.
  • Smaller Models = Less General Knowledge: A 1B/3B model won't match GPT-4's broad knowledge. Trade-off: After fine-tuning on YOUR data, it outperforms GPT-4 in your specific domainโ€”at $0/month.
  • Manual Updates Required: Cloud models auto-update; local models don't. Trade-off: No sudden API changes breaking your app. Stability and version control.

Skills Demonstrated

LLMs RAG PEFT LoRA QLoRA Fine-Tuning Meta Llama Hugging Face Transformers PyTorch ChromaDB Vector Databases SentenceTransformers Semantic Search Embeddings FastAPI MLflow Ollama 4-bit Quantization Gradient Checkpointing BitsAndBytes Temporal.io Workflow Orchestration ETL Pipelines Data Quality GDPR Compliance Hallucination Detection Prompt Engineering SQLAlchemy Python REST APIs

Get Running in 10-15 Minutes

No coding experience required. Just follow the steps.

📋 Before You Start

Make sure you have these installed: Python, Git, and the Ollama app (the steps below use all three).

1

Download the Project

~2 min

Open your terminal (Command Prompt on Windows, Terminal on Mac/Linux) and run:

git clone https://github.com/Akshar-Guha/Chat_Bot.git
cd Chat_Bot
2

Set Up Python Environment

~1 min

Create an isolated environment so it doesn't mess with other Python projects:

# Windows
python -m venv .venv
.venv\Scripts\activate

# Mac/Linux
python3 -m venv .venv
source .venv/bin/activate

You'll see (.venv) appear in your terminal when it's active.

3

Install Dependencies

~5 min

This downloads all the required libraries:

pip install -r requirements.txt

This may take a few minutes. You'll see a lot of text scrolling; that's normal.

4

Download the AI Model

~5 min (depends on internet)

First, make sure Ollama is running (open the Ollama app). Then download the Llama model:

ollama pull llama3.2:1b

This downloads ~1.3GB. The model will be stored locally and works offline after this.

5

Add Your Documents (Optional)

~1 min

Put your PDF, TXT, or DOCX files in the data/ folder, then index them:

python -m src.core.cli ingest ./data/
6

Start the Application!

Ready to use

Launch the API server:

python -m src.api.main

Open http://localhost:8000/docs in your browser to see the API.

💻 System Requirements

RAM 8GB minimum, 16GB recommended
Storage 10GB free space
GPU Not required for inference (but helps)
OS Windows 10+, macOS 12+, or Linux

Train Your Own Custom Model

For developers who want to fine-tune Llama on their own data. Requires Google Colab (free tier works).

⚠️ Important Notes

  • GPU Required: Fine-tuning needs a GPU. Use Google Colab (free T4 GPU) or your own NVIDIA GPU with 8GB+ VRAM.
  • Time Required: Expect 2-4 hours for training on a typical dataset (1-5K samples).
  • Not beginner-friendly: This section assumes you're comfortable with Python, Jupyter notebooks, and command line.
1

Prepare Your Training Data

Create a JSON file with your training examples in this format:

[
  {
    "instruction": "Summarize the following document",
    "input": "Your document text here...",
    "output": "The expected summary..."
  },
  {
    "instruction": "Answer this question about the document",
    "input": "What is the main topic?",
    "output": "The main topic is..."
  }
]

You'll need 500–5,000 examples for good results. Quality matters more than quantity.
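
Before training, it's worth validating that every record has the three keys shown above. A small stdlib check (the function name and error format are ours, not part of the project):

```python
REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_records(records):
    """Report records missing required keys or with empty outputs."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append(f"record {i}: missing {sorted(missing)}")
        elif not str(rec["output"]).strip():
            problems.append(f"record {i}: empty output")
    return problems

sample = [
    {"instruction": "Summarize", "input": "text", "output": "summary"},
    {"instruction": "Answer", "input": "Q?"},  # missing "output"
]
print(validate_records(sample))  # → ["record 1: missing ['output']"]
```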

2

Open in Google Colab

We provide a ready-to-use notebook. Click the badge to open:

Open in Colab

Make sure to select Runtime → Change runtime type → T4 GPU (free tier).

3

Configure Training Parameters

Key settings in the notebook (already optimized for T4 GPU):

# LoRA Configuration
lora_r = 8                    # Rank (higher = more capacity, more VRAM)
lora_alpha = 16               # Scaling factor
lora_dropout = 0.1            # Regularization

# Quantization (saves VRAM)
load_in_4bit = True           # 4-bit NF4 quantization
bnb_4bit_compute_dtype = "float16"

# Training
num_epochs = 3
batch_size = 2                # Increase if you have more VRAM
gradient_accumulation = 8     # Effective batch = 2 × 8 = 16
learning_rate = 2e-4
4

Run Training

Execute all cells in the notebook. Training will:

  • Load Llama 3.2 1B with 4-bit quantization
  • Apply LoRA adapters to attention layers
  • Train on your data with checkpointing
  • Save the adapter to Google Drive

Training on ~1K samples takes about 2 hours on a T4. Watch the loss curve; it should decrease.

5

Export and Use Locally

After training, download the adapter and merge with the base model:

# In the notebook, after training completes:
# The adapter is saved to Google Drive automatically

# On your local machine:
# 1. Download the adapter folder from Drive
# 2. Merge with Ollama:
ollama create my-custom-model -f Modelfile
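
The `ollama create` step expects a Modelfile, which this section doesn't show. A plausible minimal layout, assuming your adapter was exported in a format Ollama's `ADAPTER` directive accepts (the adapter path is hypothetical):

```shell
# Hypothetical Modelfile: FROM and ADAPTER are real Ollama directives,
# but the adapter path and format depend on how you exported it
cat > Modelfile <<'EOF'
FROM llama3.2:1b
ADAPTER ./my-adapter
EOF
ollama create my-custom-model -f Modelfile
ollama run my-custom-model
```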

💡 Pro Tips

Data Quality > Quantity

500 high-quality examples beat 5,000 noisy ones. Clean your data carefully.

Monitor the Loss

Training loss should decrease. If it plateaus, try a lower learning rate.

Save Checkpoints

Colab sessions disconnect. Enable Drive auto-save in the notebook.

Test Iteratively

Train for 1 epoch, test, then continue. Don't train blindly for 10 epochs.

Ready to Own Your AI?

No more sending your data to the cloud. No more content filters. Your AI, your rules.