CLI — Production Data Pipelines From Your Terminal
The open-source CLI that turns Anysite's web-to-API engine into production data pipelines. Describe what you need — your AI agent builds and runs the pipeline using CLI commands. Local storage. Zero token waste. From idea to structured dataset in minutes.
$ pip install anysite-cli
Why Traditional Web Scraping Approaches Fall Short
Every existing method for web data extraction has the same problem: it wasn't built for production data pipelines.
Traditional approaches — browser automation, workflow tools, custom scripts, API aggregators — each solve part of the problem. Anysite CLI solves the whole thing: declarative pipelines that handle extraction, transformation, storage, analysis, and scheduling from a single YAML file. Your AI agent builds the pipeline from a description using CLI tools. No selectors to break, no tokens to burn, no infrastructure to manage.
Declarative YAML Pipelines
Chain multiple data sources with dependencies. Define filters, output format, storage destination, and schedule — in one file. Six pre-built templates for common patterns. Incremental collection with cursor tracking.
Agent-Ready Protocol
Your AI agent discovers endpoints (anysite schema search), builds the YAML, estimates costs (--dry-run), and executes autonomously. Structured JSON responses, _hints metadata, exit codes 0-5.
anysite schema search "linkedin company employees"
anysite dataset collect pipeline.yaml --dry-run
anysite dataset collect pipeline.yaml
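A sketch of how an agent or script can chain those commands, assuming only the conventional meaning of exit code 0 as success (the specific meanings of codes 1-5 aren't spelled out here):

# estimate cost first; collect only if the dry run exits cleanly (exit code 0)
if anysite dataset collect pipeline.yaml --dry-run; then
    anysite dataset collect pipeline.yaml
fi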
Full Data Stack Built In
Batch processing with parallel execution and error strategies. Database loading into SQLite, PostgreSQL, ClickHouse. LLM enrichment: classify, summarize, enrich, deduplicate. SQL queries via DuckDB. Cron scheduling with webhooks.
Describe It or Define It. Collect. Store. Query.
Two paths to the same result: let your AI agent build the pipeline from natural language, or write the YAML yourself for full control.
Define Pipeline
YAML config or natural language via your AI agent
Preview & Collect
Dry-run to estimate, then execute
Store Locally
Parquet, DuckDB, PostgreSQL, SQLite
Query & Analyze
SQL queries + LLM classification
name: prospect-pipeline
sources:
  target_companies:
    endpoint: /api/linkedin/search/companies
    input:
      industry: "SaaS"
      employee_count: "51-200"
    parallel: 3
  decision_makers:
    endpoint: /api/linkedin/company/employees
    depends_on: target_companies
    input:
      company: ${target_companies.urn}
      keywords: "VP Sales, Director Sales"
      count: 5
    on_error: skip
  recent_posts:
    endpoint: /api/linkedin/user/posts
    depends_on: decision_makers
    input:
      urn: ${decision_makers.internal_id.value}
      count: 5
storage:
  format: parquet
  path: ./data/prospects
# Preview costs before running
anysite dataset collect pipeline.yaml --dry-run

# Execute the full pipeline
anysite dataset collect pipeline.yaml

# Run incremental updates
anysite dataset collect pipeline.yaml --incremental

# Query results with SQL
anysite dataset query pipeline.yaml \
  --sql "SELECT * FROM decision_makers WHERE title LIKE '%CTO%'"

# Classify posts with LLM
anysite llm classify pipeline.yaml --source recent_posts \
  --categories "product_update,hiring,thought_leadership"
Any Website Is an Endpoint. Major Platforms Are Ready Out of the Box.
The Anysite engine turns any web page into structured data via AI parsing. Major platforms come with dedicated, optimized endpoints.
| Platform | Coverage |
|---|---|
| LinkedIn | Profiles, companies, posts, jobs, search, email finder |
| Instagram | Profiles, posts, reels, comments, search |
| Twitter/X | Profiles, tweets, search, followers |
| Reddit | Posts, comments, subreddits, search, user history |
| YouTube | Videos, channels, subtitles, comments, search |
| DuckDuckGo | Web search results |
| SEC EDGAR | Company filings (10-K, 10-Q, 8-K) |
| Y Combinator | Companies, founders, batches |
| Any URL | AI-powered structured extraction from any webpage |
Built for Real Workflows
From lead gen to research, the CLI handles production-grade data collection.
| Use Case | What Happens |
|---|---|
| Sales Intelligence | YAML chains company search → employee lookup → activity. Runs on cron. Outputs to PostgreSQL. |
| Competitive Intelligence | Multi-source collection across LinkedIn, Twitter, Reddit, web. dataset diff detects changes between runs. |
| Research at Scale | Batch 10K+ records with parallel execution. Incremental resume after interruption. DuckDB SQL for analysis. |
| Brand Monitoring | Scheduled pipeline collects mentions across platforms. LLM sentiment classification. Webhook on completion. |
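As a concrete sketch of the sales-intelligence row, using the commands shown elsewhere on this page plus a standard crontab entry (the CLI's built-in scheduler and the PostgreSQL loading step are omitted here because their exact syntax isn't documented on this page):

# crontab: collect incrementally every Monday at 06:00
0 6 * * 1 anysite dataset collect /path/to/prospect-pipeline.yaml --incremental

# then inspect newly collected decision makers with SQL
anysite dataset query pipeline.yaml \
  --sql "SELECT * FROM decision_makers WHERE title LIKE '%VP%'"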
Your Data Stays Local. Your Tokens Stay Unburned.
Unlike workflow tools that pass every record through LLM context, Anysite CLI processes data locally. Only config enters the context window.
| Approach | 1,000 Records | 10,000 Records | 100,000 Records |
|---|---|---|---|
| Workflow Tool (context-based) | ~500K tokens | ~5M tokens | ~50M tokens |
| Anysite CLI | ~1K tokens | ~1K tokens | ~1K tokens |
| Efficiency gain | 500x | 5,000x | 50,000x |
Get Running in 5 Minutes
1. Install the CLI
pip install anysite-cli
2. Configure your API key
anysite config set api_key YOUR_API_KEY
3. Update the schema
anysite schema update
4. Make your first request
anysite api /api/linkedin/user user=satyanadella
5. Create your first pipeline
anysite dataset init my-first-pipeline
anysite dataset collect my-first-pipeline/dataset.yaml --dry-run
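The scaffolded dataset.yaml is the file you edit before collecting. A minimal sketch of what such a file can contain, reusing the fields from the prospect example above and the quick-start endpoint (the actual scaffold written by init may differ):

name: my-first-pipeline
sources:
  profile:
    endpoint: /api/linkedin/user
    input:
      user: satyanadella
storage:
  format: parquet
  path: ./data/my-first-pipeline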
Resources
No YAML required: The agent-ready CLI means your AI assistant can build pipelines without you writing YAML. Describe the data you need in plain English — your agent discovers endpoints, builds the YAML, and runs the pipeline. Works with Claude Code, Cursor, and any MCP-compatible agent.
Simple Credit-Based Pricing
7-day free trial on Starter. Scale as you grow.
PAYG top-ups at $2.90/1K credits (min $20, 12-month rollover). Active subscription required.
The entire web is your database. The agent is your data engineer.
Open source. MIT license. Start with pip install anysite-cli.
$ pip install anysite-cli