A production-grade proof-of-concept for an intelligent search and content extraction pipeline that leverages Large Language Models to gather, analyze, and rank web content based on user queries. Demonstrates the potential of AI-driven web aggregation for research assistants, content curation, and automated knowledge discovery.
Aggregator enhances information retrieval by combining traditional web search with AI-powered content analysis. Instead of just returning links, it intelligently extracts, analyzes, and ranks the actual content to find what's most relevant to your query.
The system generates multiple semantic variations of your query, searches the web concurrently, extracts relevant content using LLMs, and grades each result on a 1-10 scale with detailed reasoning.
Aggregator uses the OpenAI API standard, making it compatible with any OpenAI-compatible API provider or service. This includes popular options like:
Run LLMs locally for complete privacy and offline capabilities.
Access GPT models through the official OpenAI API.
Access potentially 500 models across 14 providers through a unified API.
Flexibility: To use a different provider, simply set the appropriate API endpoint and key in your configuration.
The pipeline orchestrates a sophisticated workflow that combines traditional search with AI analysis:
git clone https://github.com/PromptShieldLabs/aggregator.git
cd aggregator
pip install -r requirements.txtSet up environment variables for your search and LLM endpoints:
export SEARXNG_URL="http://your-searxng-instance:port"
export OLLAMA_URL="http://localhost:11434/v1"
export OLLAMA_MODEL="your-preferred-model"
export OLLAMA_API_KEY="your-api-key"# Basic usage
python aggregator.py "What are the key features of FastAPI?"
# Output to file
python aggregator.py "How does Python asyncio work?" --output results.json
# Configuration validation
python aggregator.py --config-checkEach result is scored objectively on a 1-10 scale:
Highly configurable via environment variables with sensible defaults:
| Variable | Description | Default |
|---|---|---|
NUM_VARIATIONS | Query variations to generate | 3 |
URLS_PER_VARIATION | URLs to fetch per variation | 3 |
MAX_CONCURRENT_REQUESTS | Concurrent request limit | 5 |
MAX_HTML_CHARS | Max HTML characters to process | 120000 |
HTTP_TIMEOUT | Request timeout in seconds | 15 |
Automatically gather and rank relevant sources for academic or technical research.
Find and organize high-quality content on specific topics with intelligent filtering.
Explore topics comprehensively by capturing diverse perspectives and sources.
Robust Error Handling: Comprehensive exception handling for network timeouts, invalid URLs, LLM API errors, and HTML parsing failures.
Performance Optimized: Asynchronous processing with configurable concurrency, HTML size limits, and efficient token usage.
Privacy & Security: Input validation, rate limiting, SearXNG for privacy-preserving search, and secure API credential management.
MIT Licensed - Aggregator is free and open source software. Use, modify, and distribute it for any purpose.