100% Local • Zero Cloud • Your Data Stays Yours

Your Personal AI.
No Filters. No Limits.

A private chatbot that runs entirely on your computer. Fine-tune it with your own data. Get uncensored, unrestricted responses. No subscriptions. No data harvesting.

Scroll to explore

What You Get with
EideticRAG

Built for people who value privacy, freedom, and control over their AI.

🔓

Uncensored Responses

Fine-tune the model on your own data without corporate restrictions. Get direct, unfiltered answers to your questions. No more "I can't help with that" messages.

🔒

Complete Privacy

Your conversations never leave your computer. No data sent to OpenAI, Google, or anyone else. Perfect for sensitive documents, personal journals, or confidential work.

💰

Zero Subscription Costs

No $20/month ChatGPT Plus. No API bills that spike unexpectedly. One-time setup, unlimited usage forever. Your electricity, your AI.

📚

Your Documents, Searchable

Upload your PDFs, notes, research papers, or personal files. Ask questions about them in natural language. The AI actually reads YOUR documents.

🌐

Works Offline

No internet required after setup. Use it on airplanes, in remote locations, or when your WiFi dies. Your AI is always available.

🎯

Train It Your Way

Fine-tune the model to speak like you, understand your industry jargon, or focus on topics you care about. Make it truly personal.

EideticRAG vs Cloud Chatbots

Feature | ChatGPT / Claude | EideticRAG
Your data sent to servers | Yes ❌ | Never ✓
Monthly cost | $20+/month | $0
Content restrictions | Heavy filtering | You control
Search your private docs | Limited/No | Unlimited ✓
Fine-tune on your data | Not possible | Full control ✓
Works offline | No | Yes ✓

Technical Deep Dive

Complete technical breakdown for developers. All values extracted from actual source code.

System Architecture

📥 Document Ingestion Layer
File Parsers pypdf, python-docx, txt
Text Chunker 512 tokens, 50 overlap*
Metadata Extractor Filename, page, position
↓
🧮 Embedding Layer
Embedding Cache Disk-based, LRU eviction
Batch Processing 32 chunks per batch*
↓
💾 Vector Storage Layer
SQLite Metadata SQLAlchemy ORM
Persistent Storage ./data/chroma_db/
↓
🔍 Retrieval Layer
Intent Classifier 7 types: factual, comparative, causal, procedural, opinion, code, unknown
Query Expansion Entity extraction + synonyms
MMR Diversification λ=0.7 (from code)
Multi-Hop Retrieval Iterative refinement
↓
🧠 Generation Layer
Prompt Templates System + Context + Query
Reflection Agent max_iterations=3, threshold=0.3
↓
✅ Output
Verified Answer With source citations
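
The "Embedding Cache (disk-based, LRU eviction)" box in the diagram above can be illustrated with a minimal in-memory sketch. The class and method names here are hypothetical, not the project's actual API:

```python
from collections import OrderedDict

class EmbeddingCache:
    """Minimal in-memory LRU sketch of the disk-based embedding cache."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get(self, text):
        if text in self._store:
            self._store.move_to_end(text)  # mark as recently used
            return self._store[text]
        return None  # cache miss: caller computes the embedding

    def put(self, text, vector):
        self._store[text] = vector
        self._store.move_to_end(text)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used

cache = EmbeddingCache(max_entries=2)
cache.put("a", [0.1])
cache.put("b", [0.2])
cache.get("a")         # touching "a" makes "b" the LRU entry
cache.put("c", [0.3])  # over capacity: "b" is evicted
```

The real cache persists to disk so embeddings survive restarts; the eviction logic is the same idea.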

Data Flow Diagrams

🔍 Query Processing Pipeline

💬
User Query "What is RAG?"
→
🎯
Intent Classification type: FACTUAL, confidence: 0.85
→
🧮
Query Embedding 384-dim vector (MiniLM)
→
🔮
Vector Search ChromaDB HNSW, k=5
→
📊
MMR Diversification λ=0.7, dedup results
→
📝
Context Assembly System + Chunks + Query
→
🧠
LLM Generation Ollama → Llama 3.2 1B
→
🛡️
Reflection Verify claims, threshold=0.3
→
✅
Response Answer + Citations
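
The intent-classification step above uses keyword and regex matching per the stack table. A sketch of how that could work; the patterns are invented for illustration, and only the seven type names come from the source:

```python
import re

# Hypothetical keyword patterns per intent; the real classifier's rules differ
INTENT_PATTERNS = {
    "FACTUAL": r"\b(what|who|when|where|define)\b",
    "COMPARATIVE": r"\b(vs|versus|compare|better|difference)\b",
    "CAUSAL": r"\b(why|cause|because|reason)\b",
    "PROCEDURAL": r"\b(how|steps|install|configure)\b",
    "OPINION": r"\b(should|best|recommend|opinion)\b",
    "CODE": r"\b(code|function|error|traceback|snippet)\b",
}

def classify_intent(query: str):
    """Return (intent, confidence); UNKNOWN when no pattern matches."""
    query = query.lower()
    scores = {intent: len(re.findall(pat, query))
              for intent, pat in INTENT_PATTERNS.items()}
    best, hits = max(scores.items(), key=lambda kv: kv[1])
    if hits == 0:
        return "UNKNOWN", 0.0
    confidence = hits / max(1, sum(scores.values()))
    return best, round(confidence, 2)

print(classify_intent("What is RAG?"))  # → ('FACTUAL', 1.0)
```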

📄 Document Ingestion Pipeline

📁
Raw Documents PDF, TXT, DOCX files
→
📖
Text Extraction pypdf, python-docx
→
✂️
Chunking 512 tokens, 50 overlap
→
🏷️
Metadata filename, page, position
→
🧮
Embedding SentenceTransformers
→
💾
Vector Store ChromaDB persistent
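
The chunking stage (512 tokens, 50-token overlap) reduces to a sliding window. This sketch uses whitespace tokens as a stand-in for the real tokenizer and ignores the paragraph-aware splitting the stack table mentions:

```python
def chunk_text(text, max_tokens=512, overlap=50):
    """Sliding-window chunker; whitespace split approximates real tokenization."""
    tokens = text.split()
    step = max_tokens - overlap  # advance 462 tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + max_tokens]
        chunks.append({
            "text": " ".join(chunk),
            "position": start,  # token-offset metadata, as in the diagram
        })
        if start + max_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks

doc = " ".join(f"tok{i}" for i in range(1000))
print(len(chunk_text(doc)))  # → 3
```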

🎯 Fine-Tuning Pipeline (ModelOps)

📊
Training Data JSON/Parquet dataset
→
🔄
Tokenization max_length=512*
→
📦
Load Model 4-bit NF4 quantized
→
🔧
Apply LoRA r=8, α=16, 0.1% params
→
⚡
Training HF Trainer, MLflow
→
📈
Evaluation loss, perplexity
→
💎
Adapter LoRA weights saved
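
The evaluation step reports loss and perplexity. For a causal LM trained with mean cross-entropy loss, perplexity is simply the exponential of that loss:

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity = exp(mean cross-entropy loss in nats per token)."""
    return math.exp(mean_ce_loss)

print(round(perplexity(2.0), 2))  # → 7.39
```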

Technology Stack (Complete)

🤖 LLM & Fine-Tuning (ModelOps)

Meta Llama 3.2 1B 1 billion parameters, optimized for consumer hardware
QLoRA Configuration r=8, alpha=16, dropout=0.1, target_modules=[q_proj, k_proj, v_proj, o_proj]
4-bit NF4 Quantization BitsAndBytes, bnb_4bit_use_double_quant=True, compute_dtype=float16
Optimizer paged_adamw_8bit (memory efficient), lr=2e-4, weight_decay=0.01
Training Args batch=2, grad_accum=8 (effective=16), warmup=3%, max_grad_norm=0.3
Gradient Checkpointing Enabled for VRAM reduction (trades compute for memory)
MLflow Integration Experiment tracking, param/metric logging, artifact registration
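
The values above map directly onto the standard transformers + peft + bitsandbytes configuration objects. A sketch of that mapping (not executed here; `task_type` is an assumption, everything else is from the table):

```python
# Sketch of the QLoRA setup described above, using the standard
# transformers + peft + bitsandbytes APIs; values taken from the table
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",  # assumption: standard causal-LM fine-tuning
)
```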

🔍 RAG Pipeline (EideticRAG)

Vector Database ChromaDB with HNSW indexing, persistent storage mode
Embedding Model all-MiniLM-L6-v2 (384 dimensions, 80ms per query*)
Chunking Strategy 512 tokens max, 50 token overlap, paragraph-aware splitting
Intent Classification Keyword + regex matching, 7 intent types, confidence scoring
Retrieval Controller Policy-based retrieval, default_k=5, adaptive depth
MMR (Diversity) Maximal Marginal Relevance, diversity_factor=0.7
Multi-Hop Retrieval Iterative query refinement for complex questions
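
The MMR step (diversity_factor=0.7) can be sketched as a greedy loop: each pick trades query relevance against similarity to documents already selected. This is a pure-Python illustration, not the project's code:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, doc_vecs, k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance: lam weights query relevance,
    (1 - lam) penalizes redundancy with already-selected documents."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Two near-duplicate docs plus one orthogonal doc
docs = [[1, 0], [0.99, 0.01], [0, 1]]
print(mmr([1, 0], docs, k=2))  # first pick is the most relevant doc
```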

🛡️ Verification & Safety

Reflection Agent max_iterations=3, hallucination_threshold=0.3
Action Types ACCEPT, REGENERATE, BROADEN, ESCALATE, REFUSE
Verification Engine Claim extraction + source document matching
Safe Refusal Graceful decline when evidence insufficient
Answer Annotation Highlights unsupported claims in output
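
A toy decision rule shows how the action types and thresholds could interact. The real agent's policy is certainly richer (ESCALATE is omitted here); only the action names, max_iterations=3, and threshold=0.3 come from the source:

```python
# Hypothetical decision rule; only the action names and constants are from
# the source, the branching logic is invented for illustration
def decide(hallucination_score: float, iteration: int,
           max_iterations: int = 3, threshold: float = 0.3) -> str:
    if hallucination_score <= threshold:
        return "ACCEPT"
    if iteration < max_iterations:
        # alternate between rewriting the answer and widening retrieval
        return "REGENERATE" if iteration % 2 == 0 else "BROADEN"
    return "REFUSE"  # evidence still insufficient after all retries

print(decide(0.1, 0))  # → ACCEPT
print(decide(0.8, 3))  # → REFUSE
```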

⚙️ Infrastructure & APIs

FastAPI Async REST API, OpenAPI docs at /docs
Ollama Local LLM server, REST API at localhost:11434
SQLite + SQLAlchemy Metadata storage, ORM for data access
Temporal.io Workflow orchestration for training jobs (RetryPolicy with backoff)
HuggingFace Trainer Training loop with checkpointing, evaluation, metrics logging

📦 Data Engineering

Dataset Pipeline Ingestion → Preprocessing → Validation → Registration
Auto-Labeling Rule-based labeling for instruction datasets
Data Quality Schema validation, duplicate detection, text cleaning
Privacy Features PII anonymization, GDPR-compliant by design*

Performance Benchmarks

~200ms* Retrieval Latency 5K docs, SSD, warm cache
~2-5s* Full Query Response Retrieval + LLM generation
4GB VRAM (Inference) Llama 3.2 1B via Ollama
8GB* VRAM (Fine-Tuning) QLoRA on T4 GPU
2-4 hrs* Fine-Tuning Time 3 epochs, 1K samples, T4
384 dim Embedding Size MiniLM-L6-v2
~10GB Disk Space Model + deps + data
0.1% Trainable Params LoRA adapters only

⚠️ Trade-Offs of Local Deployment

* What you sacrifice for complete privacy and zero recurring costs.

  • You Need Your Own Hardware: Unlike cloud APIs, you control the compute. Smaller models (1B-3B) run on laptops; larger models need better GPUs. Trade-off: No usage limits or rate throttling.
  • Fine-Tuning Requires a GPU: Training requires NVIDIA GPUs (or free Colab/Kaggle). Trade-off: Complete control over model behaviorโ€”no content filters, no censorship.
  • Setup Takes Time (~15 min): Cloud models need just an API key. Local requires downloading models and dependencies. Trade-off: One-time setup for permanent data privacy.
  • Smaller Models = Less General Knowledge: A 1B/3B model won't match GPT-4's broad knowledge. Trade-off: After fine-tuning on YOUR data, it outperforms GPT-4 in your specific domainโ€”at $0/month.
  • Manual Updates Required: Cloud models auto-update; local models don't. Trade-off: No sudden API changes breaking your app. Stability and version control.

Skills Demonstrated

LLMs RAG PEFT LoRA QLoRA Fine-Tuning Meta Llama Hugging Face Transformers PyTorch ChromaDB Vector Databases SentenceTransformers Semantic Search Embeddings FastAPI MLflow Ollama 4-bit Quantization Gradient Checkpointing BitsAndBytes Temporal.io Workflow Orchestration ETL Pipelines Data Quality GDPR Compliance Hallucination Detection Prompt Engineering SQLAlchemy Python REST APIs

Get Running in 10-15 Minutes

No coding experience required. Just follow the steps.

📋 Before You Start

Make sure you have these installed: Python, Git, and the Ollama app (the steps below use all three).

1

Download the Project

~2 min

Open your terminal (Command Prompt on Windows, Terminal on Mac/Linux) and run:

git clone https://github.com/Akshar-Guha/Chat_Bot.git
cd Chat_Bot
2

Set Up Python Environment

~1 min

Create an isolated environment so it doesn't mess with other Python projects:

# Windows
python -m venv .venv
.venv\Scripts\activate

# Mac/Linux
python3 -m venv .venv
source .venv/bin/activate

You'll see (.venv) appear in your terminal when it's active.

3

Install Dependencies

~5 min

This downloads all the required libraries:

pip install -r requirements.txt

This may take a few minutes. You'll see a lot of text scrolling; that's normal.

4

Download the AI Model

~5 min (depends on internet)

First, make sure Ollama is running (open the Ollama app). Then download the Llama model:

ollama pull llama3.2:1b

This downloads ~1.3GB. The model will be stored locally and works offline after this.

5

Add Your Documents (Optional)

~1 min

Put your PDF, TXT, or DOCX files in the data/ folder, then index them:

python -m src.core.cli ingest ./data/
6

Start the Application!

Ready to use

Launch the API server:

python -m src.api.main

Open http://localhost:8000/docs in your browser to see the API.

💻 System Requirements

RAM 8GB minimum, 16GB recommended
Storage 10GB free space
GPU Not required for inference (but helps)
OS Windows 10+, macOS 12+, or Linux

Train Your Own Custom Model

For developers who want to fine-tune Llama on their own data. Requires Google Colab (free tier works).

⚠️ Important Notes

  • GPU Required: Fine-tuning needs a GPU. Use Google Colab (free T4 GPU) or your own NVIDIA GPU with 8GB+ VRAM.
  • Time Required: Expect 2-4 hours for training on a typical dataset (1-5K samples).
  • Not beginner-friendly: This section assumes you're comfortable with Python, Jupyter notebooks, and command line.
1

Prepare Your Training Data

Create a JSON file with your training examples in this format:

[
  {
    "instruction": "Summarize the following document",
    "input": "Your document text here...",
    "output": "The expected summary..."
  },
  {
    "instruction": "Answer this question about the document",
    "input": "What is the main topic?",
    "output": "The main topic is..."
  }
]

You'll need 500–5,000 examples for good results. Quality matters more than quantity.
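
Before training, it's worth validating that every record has the three keys shown above. A small stdlib check (the function name and error format are ours, not part of the project):

```python
REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_records(records):
    """Report records missing required keys or with empty outputs."""
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append(f"record {i}: missing {sorted(missing)}")
        elif not str(rec["output"]).strip():
            problems.append(f"record {i}: empty output")
    return problems

sample = [
    {"instruction": "Summarize", "input": "text", "output": "summary"},
    {"instruction": "Answer", "input": "Q?"},  # missing "output"
]
print(validate_records(sample))  # → ["record 1: missing ['output']"]
```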

2

Open in Google Colab

We provide a ready-to-use notebook. Click the badge to open:

Open in Colab

Make sure to select Runtime → Change runtime type → T4 GPU (free tier).

3

Configure Training Parameters

Key settings in the notebook (already optimized for T4 GPU):

# LoRA Configuration
lora_r = 8                    # Rank (higher = more capacity, more VRAM)
lora_alpha = 16               # Scaling factor
lora_dropout = 0.1            # Regularization

# Quantization (saves VRAM)
load_in_4bit = True           # 4-bit NF4 quantization
bnb_4bit_compute_dtype = "float16"

# Training
num_epochs = 3
batch_size = 2                # Increase if you have more VRAM
gradient_accumulation = 8     # Effective batch = 2 × 8 = 16
learning_rate = 2e-4
4

Run Training

Execute all cells in the notebook. Training will:

  • Load Llama 3.2 1B with 4-bit quantization
  • Apply LoRA adapters to attention layers
  • Train on your data with checkpointing
  • Save the adapter to Google Drive

Training on ~1K samples takes about 2 hours on a T4. Watch the loss curve; it should decrease.

5

Export and Use Locally

After training, download the adapter and merge with the base model:

# In the notebook, after training completes:
# The adapter is saved to Google Drive automatically

# On your local machine:
# 1. Download the adapter folder from Drive
# 2. Merge with Ollama:
ollama create my-custom-model -f Modelfile
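
The `ollama create` step expects a Modelfile, which this section doesn't show. A plausible minimal layout, assuming your adapter was exported in a format Ollama's `ADAPTER` directive accepts (the adapter path is hypothetical):

```shell
# Hypothetical Modelfile: FROM and ADAPTER are real Ollama directives,
# but the adapter path and format depend on how you exported it
cat > Modelfile <<'EOF'
FROM llama3.2:1b
ADAPTER ./my-adapter
EOF
ollama create my-custom-model -f Modelfile
ollama run my-custom-model
```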

💡 Pro Tips

Data Quality > Quantity

500 high-quality examples beat 5,000 noisy ones. Clean your data carefully.

Monitor the Loss

Training loss should decrease. If it plateaus, try a lower learning rate.

Save Checkpoints

Colab sessions disconnect. Enable Drive auto-save in the notebook.

Test Iteratively

Train for 1 epoch, test, then continue. Don't train blindly for 10 epochs.

Ready to Own Your AI?

No more sending your data to the cloud. No more content filters. Your AI, your rules.