Proof of ConceptSearch Pipeline • Python

Aggregator

A production-grade proof-of-concept for an intelligent search and content extraction pipeline that leverages Large Language Models to gather, analyze, and rank web content based on user queries. Demonstrates the potential of AI-driven web aggregation for research assistants, content curation, and automated knowledge discovery.

What is Aggregator?

Aggregator enhances information retrieval by combining traditional web search with AI-powered content analysis. Instead of just returning links, it intelligently extracts, analyzes, and ranks the actual content to find what's most relevant to your query.

The system generates multiple semantic variations of your query, searches the web concurrently, extracts relevant content using LLMs, and grades each result on a 1-10 scale with detailed reasoning.

LLM Compatibility

Aggregator uses the OpenAI API standard, making it compatible with any OpenAI-compatible API provider or service. This includes popular options like:

Ollama

Run LLMs locally for complete privacy and offline capabilities.

OpenAI

Access GPT models through the official OpenAI API.

PromptShield

Access potentially 500 models across 14 providers through a unified API.

Flexibility: To use a different provider, simply set the appropriate API endpoint and key in your configuration.

Architecture

The pipeline orchestrates a sophisticated workflow that combines traditional search with AI analysis:

1.
Query Variation Generation: LLM generates semantic variations of the user query
2.
Concurrent Search: Multiple variations searched simultaneously via SearXNG
3.
Async Web Scraping: Concurrent fetching and processing of web content
4.
LLM Content Extraction: AI extracts relevant content from HTML
5.
Relevance Grading: Each result scored 1-10 with detailed reasoning
6.
Structured Results: Clean JSON output for downstream use

Quick Start

Prerequisites

  • Python 3.8 or later
  • An LLM server compatible with the OpenAI API (such as Ollama or any OpenAI-compatible provider)
  • SearXNG instance (or compatible search API)

Installation

git clone https://github.com/PromptShieldLabs/aggregator.git
cd aggregator
pip install -r requirements.txt

Configuration

Set up environment variables for your search and LLM endpoints:

export SEARXNG_URL="http://your-searxng-instance:port"
export OLLAMA_URL="http://localhost:11434/v1"
export OLLAMA_MODEL="your-preferred-model"
export OLLAMA_API_KEY="your-api-key"

Usage

# Basic usage
python aggregator.py "What are the key features of FastAPI?"

# Output to file
python aggregator.py "How does Python asyncio work?" --output results.json

# Configuration validation
python aggregator.py --config-check

Relevance Grading Rubric

Each result is scored objectively on a 1-10 scale:

10Official documentation that directly and completely answers the question
9Comprehensive tutorial or guide that thoroughly addresses the question
8Detailed article with substantial relevant information
7Good explanation with most key points covered
5-6Partial answer with some relevant details
3-4Tangentially related or minimal relevance
1-2Barely related or completely irrelevant

Configuration Options

Highly configurable via environment variables with sensible defaults:

VariableDescriptionDefault
NUM_VARIATIONSQuery variations to generate3
URLS_PER_VARIATIONURLs to fetch per variation3
MAX_CONCURRENT_REQUESTSConcurrent request limit5
MAX_HTML_CHARSMax HTML characters to process120000
HTTP_TIMEOUTRequest timeout in seconds15

Use Cases

Research Assistants

Automatically gather and rank relevant sources for academic or technical research.

Content Curation

Find and organize high-quality content on specific topics with intelligent filtering.

Knowledge Discovery

Explore topics comprehensively by capturing diverse perspectives and sources.

Technical Highlights

Robust Error Handling: Comprehensive exception handling for network timeouts, invalid URLs, LLM API errors, and HTML parsing failures.

Performance Optimized: Asynchronous processing with configurable concurrency, HTML size limits, and efficient token usage.

Privacy & Security: Input validation, rate limiting, SearXNG for privacy-preserving search, and secure API credential management.

MIT Licensed - Aggregator is free and open source software. Use, modify, and distribute it for any purpose.

Ready to Try Aggregator?

Check out the repository for full documentation, examples, and installation instructions.