Document Semantic Search

Overview

An n8n workflow for building semantic search over your documents. It processes PDFs, DOCX files, and text files, creates embeddings with OpenAI, stores them in Qdrant, and answers search queries with Gemini.

How It Works

Ingestion Pipeline

Upload Documents - PDFs, DOCX, TXT, or URLs
Text Extraction - Convert to plain text
Chunking - Split into semantic chunks
Embeddings - Generate vectors with OpenAI
Storage - Store in Qdrant vector database
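
As a rough illustration of the chunking step, here is a minimal sketch of a fixed-size sliding window with overlap. It is not the workflow's actual chunker: the function name `chunk_text` is hypothetical, whitespace-separated words stand in for model tokens, and a plain sliding window does not by itself respect semantic boundaries.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    # Assumption: whitespace-separated words approximate model tokens.
    tokens = text.split()
    step = chunk_size - overlap

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Each chunk repeats the last `overlap` tokens of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.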

Search Pipeline

Query - Receive search query
Embed Query - Convert to vector
Vector Search - Find similar chunks in Qdrant
Context Assembly - Gather relevant chunks
AI Answer - Gemini synthesises response
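
The vector search step can be pictured with a small in-memory sketch. In the workflow this lookup is done by Qdrant; the list of points below is only a stand-in, the names `cosine_similarity` and `top_k` are hypothetical, and cosine similarity is assumed as the distance metric since the source does not say which one the collection uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], points: list[dict], k: int = 3) -> list[dict]:
    """Return the k stored points most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, p["vector"]), p) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"score": score, "payload": p["payload"]} for score, p in scored[:k]]


# Toy 2-dimensional vectors; real embeddings have far more dimensions.
points = [
    {"vector": [1.0, 0.0], "payload": {"text": "a"}},
    {"vector": [0.0, 1.0], "payload": {"text": "b"}},
    {"vector": [1.0, 1.0], "payload": {"text": "c"}},
]
results = top_k([1.0, 0.0], points, k=2)
```

The returned payloads are what the context assembly step would pass on to the model.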

Workflow Components

Ingestion (3) - Doc processing pipeline
Embeddings (1) - OpenAI ada-002
Vector DB (1) - Qdrant storage
Search (1) - Semantic matching
AI (1) - Gemini answers
Output (1) - Formatted response

Features

Multi-Format Support - PDF, DOCX, TXT, Markdown
Smart Chunking - Preserves semantic boundaries
Hybrid Search - Combines vector + keyword search
Source Attribution - Links back to original documents
Incremental Updates - Add new docs without full reindex
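
The source does not say how the hybrid search combines vector and keyword results. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the workflow's actual method. The function name, the chunk ids, and the constant `k = 60` are illustrative choices.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of ids into one list, best first.

    Each id earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["c1", "c2", "c3"]   # hypothetical ids from the vector search
keyword_hits = ["c3", "c1", "c4"]  # hypothetical ids from a keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because RRF uses only ranks, it avoids having to put cosine scores and keyword scores on a common scale.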

Architecture

Documents → Text Extraction → Chunking → OpenAI Embeddings
                                              ↓
                                          Qdrant DB
                                              ↓
Query → Embed Query → Vector Search → Gemini Answer

Example Query

User: “What’s our refund policy for software subscriptions?”

System:

Based on the company policies document:

**Software Subscription Refunds**

- Full refund available within 14 days of purchase
- Pro-rata refund for annual plans cancelled after 14 days
- No refund for monthly plans (cancel before renewal)

Special cases:
- Technical issues preventing use: Full refund at any time
- Billing errors: Immediate correction + refund

📄 Source: Company-Policies-2024.pdf (Page 12)
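
A response like the one above depends on the retrieved chunks carrying their source metadata into the prompt. The following is a minimal sketch of that context assembly, assuming each chunk has hypothetical `source`, `page`, and `text` fields; the prompt wording is illustrative and not taken from the workflow.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble numbered, source-labelled context followed by the question."""
    context_blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] ({chunk['source']}, page {chunk['page']})"
        context_blocks.append(f"{header}\n{chunk['text']}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def format_sources(chunks: list[dict]) -> list[str]:
    """Build de-duplicated source lines, preserving first-seen order."""
    labels: list[str] = []
    for chunk in chunks:
        label = f"{chunk['source']} (Page {chunk['page']})"
        if label not in labels:
            labels.append(label)
    return [f"📄 Source: {label}" for label in labels]


chunks = [
    {"source": "Company-Policies-2024.pdf", "page": 12,
     "text": "Full refund available within 14 days of purchase."},
    {"source": "Company-Policies-2024.pdf", "page": 12,
     "text": "No refund for monthly plans."},
]
prompt = build_prompt("What is the refund policy?", chunks)
sources = format_sources(chunks)
```

Keeping the source labels in a separate list means the attribution line can be appended to the answer without relying on the model to reproduce file names correctly.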

Use Cases

Knowledge Base - Search internal documentation
Legal/Compliance - Find relevant policy sections
Research - Search academic papers and reports
Customer Support - Find answers in product docs

Configuration

Component	Options
Embedding Model	text-embedding-ada-002, text-embedding-3-small
Chunk Size	500-2000 tokens
Overlap	50-200 tokens
Vector DB	Qdrant (self-hosted or cloud)
LLM	Gemini Pro, GPT-4
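
The options in the table could be captured in a small validated settings object. This is only a sketch: the class name `SearchConfig`, the field names, and the default values are assumptions, while the allowed models and numeric ranges are taken from the table.

```python
from dataclasses import dataclass

ALLOWED_EMBEDDING_MODELS = {"text-embedding-ada-002", "text-embedding-3-small"}
ALLOWED_LLMS = {"Gemini Pro", "GPT-4"}


@dataclass(frozen=True)
class SearchConfig:
    # Defaults are illustrative picks from within the documented ranges.
    embedding_model: str = "text-embedding-3-small"
    chunk_size: int = 1000  # tokens, documented range 500-2000
    overlap: int = 100      # tokens, documented range 50-200
    llm: str = "Gemini Pro"

    def __post_init__(self) -> None:
        if self.embedding_model not in ALLOWED_EMBEDDING_MODELS:
            raise ValueError(f"unsupported embedding model: {self.embedding_model}")
        if not 500 <= self.chunk_size <= 2000:
            raise ValueError("chunk_size must be between 500 and 2000 tokens")
        if not 50 <= self.overlap <= 200:
            raise ValueError("overlap must be between 50 and 200 tokens")
        if self.llm not in ALLOWED_LLMS:
            raise ValueError(f"unsupported LLM: {self.llm}")


config = SearchConfig(chunk_size=800, overlap=80)
```

Validating at construction time surfaces a bad setting before any documents are embedded, which matters because changing the embedding model or chunk size afterwards generally means re-indexing.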