CLI — Production Data Pipelines From Your Terminal
The open-source CLI that turns Anysite's web-to-API engine into production data pipelines. Describe what you need — the Data Agent builds and runs the pipeline. Local storage. Zero token waste. From idea to structured dataset in minutes.
$ pip install anysite-cli
The Entire Web Is Your Database. The Agent Is Your Data Engineer.
Every website has structured data inside it. Anysite's AI extracts it — from any URL, any platform, any page. The CLI gives you a production runtime to build data pipelines against any web resource. And the Data Agent lets you skip the manual work entirely: describe what data you need in plain English, and the agent discovers endpoints, builds the YAML pipeline, estimates costs, and executes.
You're not choosing from a catalog. You're pointing at the web and getting data back.
Why Traditional Web Scraping Alternatives Fall Short
Every existing method for web data extraction shares the same problem: none was built for production data pipelines.
Traditional approaches — browser automation, workflow tools, custom scripts, API aggregators — each solve part of the problem. Anysite CLI solves the whole thing: declarative pipelines that handle extraction, transformation, storage, analysis, and scheduling from a single YAML file. The Data Agent builds the pipeline from a description. No selectors to break, no tokens to burn, no infrastructure to manage.
One Data Pipeline CLI. Seven Capabilities. Any Web Source.
Single API Calls
Instant requests with flexible output formats (JSON, CSV, JSONL, table) and field filtering. Dot-notation for nested data. Built-in presets.
anysite api /api/linkedin/user \
  user=satyanadella --fields "name,headline"

anysite api /api/instagram/user \
  user=natgeo --format table

anysite api /api/twitter/user \
  user=elonmusk --preset minimal
Batch Processing
Process thousands of inputs in parallel. Three error strategies: stop, skip, retry with backoff.
anysite api /api/linkedin/user \
  --from-file users.txt --input-key user \
  --parallel 5 --on-error skip --progress
Dataset Pipelines
Declarative YAML workflows with chained dependencies and scheduling. Six pre-built templates.
anysite dataset init prospect-pipe

anysite dataset collect pipe.yaml \
  --dry-run

anysite dataset collect pipe.yaml \
  --incremental
Database Integration
Load into SQLite or PostgreSQL with auto-schema and diff-sync. Upsert with conflict handling.
anysite api /api/linkedin/user \
  user=satyanadella \
  | anysite db insert mydb \
    --table profiles

anysite db upsert mydb \
  --table leads --conflict-key email
LLM Analysis
Classify, summarize, enrich, deduplicate using OpenAI or Anthropic. Four enrichment types. Built-in SQLite cache.
anysite llm classify dataset.yaml --source posts \
  --categories "positive,negative,neutral"

anysite llm enrich dataset.yaml --source companies \
  --extract "industry_category,funding_stage"

anysite llm dedupe dataset.yaml --source leads \
  --threshold 0.85
SQL Querying
DuckDB SQL on collected datasets. Run analytics without external databases.
anysite dataset query pipeline.yaml \
  --sql "SELECT * FROM employees WHERE title LIKE '%CTO%'"
Data Agent
The hero capability. Describe what data you need in natural language. The agent discovers endpoints, builds the pipeline YAML, estimates costs, and executes. Idea to structured dataset — zero config.
# Just describe what you need
anysite agent "Find Series B SaaS companies, get their decision makers, and pull their recent LinkedIn posts"
Describe It or Define It. Collect. Store. Query.
Two paths to the same result: let the Data Agent build your pipeline from natural language, or write the YAML yourself for full control.
Define Pipeline
YAML config or natural language via Agent
Preview & Collect
Dry-run to estimate, then execute
Store Locally
Parquet, DuckDB, PostgreSQL, SQLite
Query & Analyze
SQL queries + LLM classification
name: prospect-pipeline
sources:
  target_companies:
    endpoint: /api/linkedin/search/companies
    input:
      industry: "SaaS"
      employee_count: "51-200"
    parallel: 3
  decision_makers:
    endpoint: /api/linkedin/company/employees
    depends_on: target_companies
    input:
      company: ${target_companies.urn}
      keywords: "VP Sales, Director Sales"
      count: 5
    on_error: skip
  recent_posts:
    endpoint: /api/linkedin/user/posts
    depends_on: decision_makers
    input:
      urn: ${decision_makers.internal_id.value}
      count: 5
storage:
  format: parquet
  path: ./data/prospects
# Preview costs before running
anysite dataset collect pipeline.yaml --dry-run

# Execute the full pipeline
anysite dataset collect pipeline.yaml

# Run incremental updates
anysite dataset collect pipeline.yaml --incremental

# Query results with SQL
anysite dataset query pipeline.yaml \
  --sql "SELECT * FROM decision_makers WHERE title LIKE '%CTO%'"

# Classify posts with LLM
anysite llm classify pipeline.yaml --source recent_posts \
  --categories "product_update,hiring,thought_leadership"
Any Website Is an Endpoint. Major Platforms Are Ready Out of the Box.
The Anysite engine turns any web page into structured data via AI parsing. Major platforms come with dedicated, optimized endpoints.
| Platform | What You Get | Example |
|---|---|---|
| LinkedIn | Profiles, companies, posts, jobs, search, messaging, employees | anysite api /api/linkedin/user user=satyanadella |
| Twitter/X | Posts, threads, users, search, followers | anysite api /api/twitter/user user=elonmusk |
| Instagram | Posts, reels, profiles, comments, likes | anysite api /api/instagram/user user=natgeo |
| Reddit | Discussions, subreddits, comments, user history | anysite api /api/reddit/search/posts query="AI agents" |
| YouTube | Videos, channels, comments, subtitles | anysite api /api/youtube/video video_id=dQw4w9WgXcQ |
| SEC EDGAR | 10-K, 10-Q, 8-K filings | anysite api /api/sec/search/companies |
| Y Combinator | Companies, founders, batch data | anysite api /api/yc/search/companies |
| Google | Search, Maps, News | anysite api /api/search/google |
| Capability | What It Does | Example |
|---|---|---|
| Web Parser | Any URL to structured JSON | anysite api /api/webparser/parse url="https://..." |
| AI Parsers | Specialized extraction for GitHub, Amazon, Glassdoor, G2, Trustpilot, Crunchbase, Pinterest, AngelList | anysite api /api/ai-parser/glassdoor url="..." |
| Data Agent | Describe a data need — agent discovers or creates the right endpoint | "Get pricing data from competitor websites" |
The endpoint library grows continuously. But you're never limited to it — the AI parser and Data Agent can extract structured data from any web resource.
Explore Available Endpoints
# Browse all ready-made endpoints
anysite describe

# Filter by platform
anysite describe --search linkedin

# Get parameter details for a specific endpoint
anysite describe /api/linkedin/user
Built for Real Workflows
From lead gen to research, the CLI handles production-grade data collection.
Sales Intelligence
Define target criteria once. Pipeline refreshes on schedule via cron. Always-fresh prospect data flowing into your CRM.
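As a sketch, a scheduled refresh needs nothing more than a crontab entry. The schedule, working directory, and log path below are placeholders; `--incremental` is the same flag shown in the pipeline examples above:

```shell
# crontab entry: refresh the pipeline every night at 02:00 (paths are placeholders)
0 2 * * * cd /srv/pipelines && anysite dataset collect pipeline.yaml --incremental >> collect.log 2>&1
```

Because the pipeline is declarative and incremental, the same YAML file serves both the initial backfill and every scheduled refresh.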
Competitive Intelligence
Multi-source collection with anysite dataset diff for change detection across competitor websites.
Research at Scale
Batch processing 10K+ records with parallel execution and incremental tracking. Academic and market research workflows.
Brand Monitoring
Scheduled pipeline with LLM sentiment classification and webhook alerts. Know what's being said, automatically.
Prefer AI-native access? Try the MCP Server for Claude & Cursor. Need direct HTTP calls? Use the REST API. Compare all plans →
Your Data Stays Local. Your Tokens Stay Unburned.
Unlike workflow tools that pass every record through LLM context, Anysite CLI processes data locally. Only config enters the context window.
| Approach | 1,000 Records | 10,000 Records | 100,000 Records |
|---|---|---|---|
| Workflow Tool (context-based) | ~500K tokens | ~5M tokens | ~50M tokens |
| Anysite CLI | ~1K tokens | ~1K tokens | ~1K tokens |
| Efficiency gain | 500x | 5,000x | 50,000x |
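The efficiency column above is simple arithmetic. As an illustrative sketch, assume roughly 500 tokens of context per record for a workflow tool, versus a roughly constant ~1K tokens of pipeline config for the CLI:

```shell
# Assumed: ~500 tokens per record through a workflow tool's context window,
# vs. a constant ~1K tokens of config for the CLI, regardless of volume.
records=1000
workflow_tokens=$(( records * 500 ))          # ~500K tokens at 1,000 records
cli_tokens=1000                               # constant
echo "$(( workflow_tokens / cli_tokens ))x"   # prints: 500x
```

The gap widens linearly: multiply `records` by 10 and the workflow-tool cost grows tenfold while the CLI's stays flat.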
Technical Specifications
Output Formats
JSON (default), JSONL, CSV, Rich table
Field Control
--fields, --exclude, --preset (minimal, contact, recruiting)
Error Handling
stop (default), skip, retry with backoff
LLM Support
OpenAI + Anthropic, 6 operations, SQLite response cache
Scheduling
Cron, systemd, webhooks
Install Extras
[data], [postgres], [llm], [all]
Unix Piping
# Pipe API output directly into database
anysite api /api/linkedin/user user=satyanadella \
  | anysite db insert mydb --table profiles

# Pipe into jq for quick extraction
anysite api /api/linkedin/company company=anthropic \
  | jq '.employees[] | .title'
Simple Credit-Based Pricing
7-day free trial on Starter. Scale as you grow.
PAYG top-ups at $2.90/1K credits (min $20, 12-month rollover). Active subscription required.
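A quick sanity check on what the minimum top-up buys, using the figures from the line above (integer truncation, so treat the result as approximate):

```shell
# $20 minimum at $2.90 per 1,000 credits → roughly 6,896 credits
awk 'BEGIN { printf "%d credits\n", 20 / 2.90 * 1000 }'
```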
Get Running in 5 Minutes
1. Install the CLI
pip install anysite-cli
2. Configure your API key
anysite config set api_key YOUR_API_KEY
3. Update the schema
anysite schema update
4. Make your first request
anysite api /api/linkedin/user user=satyanadella
5. Create your first pipeline
anysite dataset init my-first-pipeline
anysite dataset collect my-first-pipeline/dataset.yaml --dry-run
Resources
For vibecoders: The Data Agent means you don't need to know the API or write YAML. Describe the data you need in plain English — "Find Series B SaaS companies and their decision makers" — and the agent builds and runs the pipeline. Works with Claude Code as a skill.
Trusted by Data Teams
"Replaced 3 weeks of scraper code with one YAML file."
"Finally stopped burning tokens on data shuffling."
"The pipeline just runs. Haven't touched it in 2 months."
The entire web is your database. The agent is your data engineer.
Open source. MIT license. Start with pip install anysite-cli
$ pip install anysite-cli