RAG

Implementation-accurate, engineering-grade documentation of a retrieval-augmented generation system for document Q&A over custom knowledge bases: PDF/web/markdown indexing, dual vector stores (Pinecone and FAISS), and optional LLM generation via LangGraph.

Role: Software Engineer
Year:
Python · LangChain (community, text_splitters, google_genai, openai, pinecone, core, chat_models) · LangGraph (StateGraph) · Pinecone · FAISS · Google Gemini (text-embedding-004) · OpenAI (gpt-4.1-nano) · PyPDFLoader / WebBaseLoader · RecursiveCharacterTextSplitter · MarkdownHeaderTextSplitter · Jupyter

Problem

The Challenge

Context

The goal: query specific documents (PDFs, web pages, or markdown) in natural language and get answers grounded in that content. There is no web UI or API; the system is designed for local use via a CLI script and Jupyter notebooks.

User Pain Points

1

Documents must be indexable and searchable by semantic similarity.

2

Dual workflows: retrieval-only (snippets) vs full RAG (retrieve then generate with LLM).

Why Existing Solutions Failed

Generic search or static docs do not support natural-language Q&A grounded in custom content; retrieval-augmented generation with vector stores and optional LLM meets the need.

Goals & Metrics

What We Set Out to Achieve

Objectives

  • 01 Index documents (PDF, web, or markdown) into vector stores (Pinecone or FAISS).
  • 02 Answer user questions via similarity search over indexed chunks.
  • 03 Optionally generate LLM answers from retrieved context (007_rag.ipynb only).

Success Metrics

  • 01 rag_doc.py: PDF indexed to Pinecone; CLI prints top-3 snippets per question.
  • 02 rag.ipynb: Markdown indexed to Pinecone; similarity search returns top-k chunks.
  • 03 007_rag.ipynb: Web/PDF → FAISS; LangGraph retrieve→generator produces state["answer"] with gpt-4.1-nano.

User Flow

User Journey

Indexing: document source → load → split → embed → vector store. Query: user question → similarity search → snippets (rag_doc.py, rag.ipynb) or retrieve→generator→answer (007_rag.ipynb).

Start
1. Load documents (PDF path, URL, or markdown)
2. Split and embed (Gemini)
3. Store in Pinecone or FAISS
4. User asks question (CLI or graph.invoke)
5. Similarity search returns top-k chunks
6. Print snippets, or generator → answer
End

Architecture

System Design

Three entrypoints: rag_doc.py (CLI, Pinecone), rag.ipynb (markdown, Pinecone), 007_rag.ipynb (web/PDF, FAISS, LangGraph RAG). Services: Pinecone, Google Gemini, OpenAI. No frontend; no relational DB.

Backend

rag_doc.py: CLI loop, PyPDFLoader, RecursiveCharacterTextSplitter, Pinecone upsert, similarity_search_with_score
rag.ipynb: MarkdownHeaderTextSplitter, Pinecone add_documents, similarity search
007_rag.ipynb: WebBaseLoader/PyPDFLoader, FAISS, StateGraph retrieve→generator with gpt-4.1-nano

Services

Pinecone (vector index, serverless AWS us-east-1)
Google Gemini (text-embedding-004, 768 dim)
OpenAI (gpt-4.1-nano in 007_rag.ipynb)

Databases

Pinecone vector index (luxdit-paper-index, testing-pinecone-gemini)
FAISS local vector store (in-memory or load_local)

Data Flow

How Data Moves

User/PDF/URL → loaders → splitters → embedding (Gemini) → Pinecone/FAISS. User question → vector store → top-k chunks → (optional) generator node → state["answer"].

1
User/PDF/URL → Document loaders
File path or URL; trigger: script run or notebook execution
2
Document loaders → Text splitters
List of Document objects; trigger: after load
3
Text splitters → Embedding model
Chunk text; trigger: before vector store write
4
Embedding model → Pinecone or FAISS
Vectors and metadata; trigger: from_documents / add_documents or FAISS build
5
User question → Vector store
Query string; trigger: user input in CLI or graph.invoke
6
Vector store → Retrieve node / caller
Top-k Document chunks (and optional scores)
7
Retrieve node → Generator node
State with question and context (doc list); trigger: LangGraph edge
8
Generator node → User
state['answer'] (LLM response); trigger: graph.invoke in 007_rag.ipynb

Core Features

Key Functionality

01

PDF load and chunk

What it does

Loads a PDF and splits it into text chunks with overlap

Why it matters

rag_doc.py, 007_rag.ipynb

Implementation

PyPDFLoader + RecursiveCharacterTextSplitter (chunk_size 1000, overlap 150 or 200)
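
A minimal sketch of this step, with an illustrative file path and the chunk sizes listed above:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF (one Document per page), then split into overlapping chunks.
loader = PyPDFLoader("paper.pdf")  # illustrative path
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)
print(f"{len(docs)} pages -> {len(chunks)} chunks")
```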

02

Web page load

What it does

Fetches and parses a URL into document chunks

Why it matters

007_rag.ipynb

Implementation

WebBaseLoader with bs4 SoupStrainer (post-title, post-header, post-content)
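
A sketch of the loader configuration described above; the URL is illustrative, and the class filter matches the one named here:

```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Fetch the page and keep only the post title, header, and content blocks.
loader = WebBaseLoader(
    web_paths=("https://example.com/post",),  # illustrative URL
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("post-title", "post-header", "post-content")
        )
    },
)
docs = loader.load()
```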

03

Markdown header split

What it does

Splits markdown by headers (#, ##, ###) with metadata

Why it matters

rag.ipynb

Implementation

MarkdownHeaderTextSplitter with headers_to_split_on
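
A short sketch with illustrative markdown content; the metadata key names ("Header 1", etc.) are the usual convention, not necessarily the exact ones used:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = "# Title\nIntro.\n## Section\nDetails."  # illustrative content

# Split on #, ##, ### headers; each chunk keeps its header path as metadata.
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_docs = splitter.split_text(markdown_text)
```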

04

Google Gemini embeddings

What it does

Produces 768-dim embeddings for chunks and queries

Why it matters

rag_doc.py, rag.ipynb, 007_rag.ipynb

Implementation

GoogleGenerativeAIEmbeddings(model='models/text-embedding-004')
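
A two-line sketch of the embedding setup (assumes GOOGLE_API_KEY is set in the environment):

```python
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# One embedding model for both documents and queries (768-dim vectors).
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
query_vector = embeddings.embed_query("What does the paper conclude?")
print(len(query_vector))  # 768
```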

05

Pinecone index and upsert

What it does

Creates or reuses Pinecone index and stores vectors

Why it matters

rag_doc.py, rag.ipynb

Implementation

Pinecone client, create_index (768 dim, cosine, ServerlessSpec), PineconeVectorStore.from_documents or add_documents
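
A sketch of index creation and upsert, assuming `chunks` and `embeddings` from the earlier steps; the index name and region follow the values listed here:

```python
import os

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = "luxdit-paper-index"

# Create the index only if it does not already exist (768-dim, cosine, serverless AWS us-east-1).
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

# Embed the chunks and upsert them into the index.
# `chunks` and `embeddings` come from the load/split and embedding steps above.
vector_store = PineconeVectorStore.from_documents(
    chunks, embedding=embeddings, index_name=index_name
)
```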

06

FAISS vector store

What it does

Local vector index for similarity search; save/load to disk

Why it matters

007_rag.ipynb

Implementation

FAISS from langchain_community.vectorstores, save_local/load_local with allow_dangerous_deserialization
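
A sketch of the FAISS build, save, and reload, again assuming `chunks` and `embeddings` from earlier; the directory name is illustrative:

```python
from langchain_community.vectorstores import FAISS

# Build an in-memory FAISS index from the chunks, then persist and reload it.
vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("faiss_index")

# Only load indices you created yourself; pickle deserialization is unsafe for untrusted files.
vector_store = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)
```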

07

Similarity search

What it does

Returns top-k document chunks for a query

Why it matters

rag_doc.py, rag.ipynb, 007_rag.ipynb

Implementation

vector_store.similarity_search or similarity_search_with_score(query, k=2..4)
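
A short sketch of both search variants against an already-populated `vector_store` (query text illustrative):

```python
# Top-k retrieval, with or without relevance scores attached.
docs = vector_store.similarity_search("What is the main contribution?", k=4)
scored = vector_store.similarity_search_with_score("What is the main contribution?", k=3)
for doc, score in scored:
    print(score, doc.metadata.get("page"), doc.page_content[:100])
```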

08

RAG agent (retrieve + generate)

What it does

Retrieves context then generates answer with LLM

Why it matters

007_rag.ipynb

Implementation

StateGraph(State) with nodes retrieve and generator; retrieve similarity_search k=4; generator invokes init_chat_model('gpt-4.1-nano', 'openai'), returns answer
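
A compact sketch of the two-node graph, assuming an already-populated `vector_store` and OPENAI_API_KEY in the environment; the prompt wording is illustrative, not the notebook's exact prompt:

```python
from typing import List, TypedDict

from langchain.chat_models import init_chat_model
from langchain_core.documents import Document
from langgraph.graph import StateGraph, START, END

llm = init_chat_model("gpt-4.1-nano", model_provider="openai")

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State) -> dict:
    # Top-4 chunks for the question become the generation context.
    return {"context": vector_store.similarity_search(state["question"], k=4)}

def generator(state: State) -> dict:
    context = "\n\n".join(doc.page_content for doc in state["context"])
    prompt = f"Answer using only the context below.\n\n{context}\n\nQuestion: {state['question']}"
    return {"answer": llm.invoke(prompt).content}

builder = StateGraph(State)
builder.add_node("retrieve", retrieve)
builder.add_node("generator", generator)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generator")
builder.add_edge("generator", END)
graph = builder.compile()

result = graph.invoke({"question": "What problem does the paper address?"})
print(result["answer"])
```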

09

CLI Q&A loop

What it does

Interactive terminal: user types question, system prints top snippets (no LLM answer)

Why it matters

rag_doc.py

Implementation

while True input loop, similarity_search_with_score(user_query, k=3), print score/page/content; exit on exit/quit/q
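
A sketch of the loop as described, assuming the already-populated Pinecone `vector_store`; the printed line format is illustrative:

```python
# Interactive retrieval-only loop: print scored snippets, no LLM call.
while True:
    user_query = input("User: ").strip()
    if user_query.lower() in {"exit", "quit", "q"}:
        break
    results = vector_store.similarity_search_with_score(user_query, k=3)
    for doc, score in results:
        page = doc.metadata.get("page", "?")
        print(f"[score={score:.4f}] page {page}: {doc.page_content[:250]}")
```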

Technical Challenges

Problems We Solved

Why This Was Hard

A missing or invalid PDF path, API errors, empty chunk lists, and missing env vars all surface as unhandled exceptions.

Our Solution

The only defensive logic in the analyzed code is the index-existence check before the Pinecone create call (rag_doc.py).

Why This Was Hard

Re-running rag_doc.py against the same index re-upserts every chunk; there is no "index only if empty" check or idempotent upsert.

Our Solution

Not addressed in code; future improvement: optional conditional upsert or idempotent strategy.

Why This Was Hard

requirements.txt omits pinecone, python-dotenv, and langchain_pinecone even though they are used, which complicates reproducible installs.

Our Solution

Not addressed; add missing dependencies with version pins.

Engineering Excellence

Performance, Security & Resilience

Performance

  • Chunk size and overlap tuned (1000/150 or 1000/200); top-k limited (2–4) to bound context size.
  • Pinecone serverless for managed scale; FAISS for local fast ANN.
  • No caching of embeddings or LLM responses; single-threaded.
🛡️

Error Handling

  • Index existence check before Pinecone create (rag_doc.py).
  • No try/except or explicit error handling in analyzed code.
🔒

Security

  • API keys from environment (load_dotenv); no secrets in repo.
  • No input validation or sanitization on user questions; no rate limiting or auth for CLI/notebook.
  • FAISS load_local with allow_dangerous_deserialization (documented risk).

Design Decisions

Visual & UX Choices

CLI

Rationale

rag_doc.py: prompt "User: ", print snippets with score and page, 250-char content preview.

Details

Sequential loop: question → print snippets → repeat; exit on exit/quit/q.

Notebook

Rationale

Cell-by-cell execution; output of documents and search results in notebook.

Details

007_rag.ipynb: graph.invoke({"question": "..."}) returns final state with answer.

Impact

The Result

What We Achieved

Three entrypoints: (1) rag_doc.py indexes a PDF into Pinecone with Google Gemini embeddings and runs an interactive CLI that prints the top-3 snippets per question (no LLM answer). (2) rag.ipynb indexes markdown into Pinecone and demonstrates similarity search. (3) 007_rag.ipynb loads web or PDF content into FAISS and runs a LangGraph RAG agent (retrieve → generator) with OpenAI gpt-4.1-nano to produce answers. When the environment variables and external APIs are valid, indexing and retrieval work as designed; LLM-generated answers are produced only in 007_rag.ipynb.

👥

Who It Helped

Solo project; Pinecone, Google Gemini, and OpenAI provide the vector index, embeddings, and LLM, respectively.

Why It Matters

Implementation-accurate RAG pipelines for PDF, web, and markdown with dual vector-store support (Pinecone, FAISS), single embedding model (Gemini), and optional LLM generation via a two-node LangGraph in one notebook. Design favors clarity and local/exploratory use over production hardening.

Verification

Measurable Outcomes

Each outcome is verified by running the corresponding script or notebook end to end.

01

rag_doc.py: PDF → Pinecone, CLI prints top-3 snippets per question

02

rag.ipynb: Markdown → Pinecone, similarity search

03

007_rag.ipynb: Web/PDF → FAISS, LangGraph retrieve→generator with gpt-4.1-nano

Reflections

Key Learnings

Technical Learnings

  • Load → split → embed → store → retrieve → (optional) generate pipeline is explicit across scripts and notebook.
  • Dual vector stores (Pinecone vs FAISS) allow cloud persistence vs local fast ANN per workflow.

Architectural Insights

  • Two-node LangGraph (retrieve → generator) in 007_rag.ipynb makes RAG flow explicit; state carries question, context, answer.
  • No web framework or API; limiting the interface to CLI and notebooks keeps the scope local and exploratory.

What I'd Improve

  • Add error handling, fill the requirements.txt gaps (pinecone, python-dotenv, langchain_pinecone), add an optional "index only if empty" upsert, and add input validation if the system is ever exposed beyond local use.

Roadmap

Future Enhancements

01

Add try/except handling for missing files, network/API errors, empty chunk lists, and missing env vars; surface clear error messages or exit codes.
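
One possible shape for this, sketched as a fail-fast preamble for the CLI entrypoint; the helper name, path, and messages are illustrative, not existing code:

```python
import os
import sys

from langchain_community.document_loaders import PyPDFLoader

def require_env(*names: str) -> None:
    # Fail fast with a readable message and non-zero exit code if a key is missing.
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        sys.exit(f"Missing environment variables: {', '.join(missing)}")

require_env("PINECONE_API_KEY", "GOOGLE_API_KEY")

pdf_path = "paper.pdf"  # illustrative path
if not os.path.isfile(pdf_path):
    sys.exit(f"PDF not found: {pdf_path}")

docs = PyPDFLoader(pdf_path).load()
if not docs:
    sys.exit("No content extracted from the PDF.")
```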

02

Add pinecone, python-dotenv, langchain_pinecone to requirements.txt with version pins.

03

Optionally "index only if empty" or idempotent upsert to avoid redundant re-indexing.

04

Validate or sanitize user questions before embedding and LLM call; consider rate limiting if exposed beyond local use.

05

Consider moving LangGraph retrieve→generator flow into rag_doc.py or a shared module so CLI can optionally return an LLM answer.

06

Document or enforce loading FAISS only from trusted paths; consider alternatives to allow_dangerous_deserialization if loading untrusted indices is ever required.

07

If the system is ever served (API or web UI), add Dockerfile, env documentation, and deployment config.