Documentation
Everything you need to get started with ScopeScrape v0.2: 7 platforms, zero API keys required.
Installation
ScopeScrape requires Python 3.10 or newer. Install from source:
git clone https://github.com/cvasan2sh/scopescrape_tool.git
cd scopescrape_tool
pip install -e .
That gives you the core tool with zero API keys required. Reddit scraping uses public JSON endpoints, and Hacker News uses the open Algolia Search API.
For optional features, install extras:
# spaCy-powered entity extraction (better specificity scoring)
pip install -e ".[nlp]"
# Parquet export support
pip install -e ".[parquet]"
# Everything
pip install -e ".[all]"
On Python 3.14, spaCy and pyarrow do not yet have prebuilt wheels. The core tool works without them: the regex-based entity extractor and the JSON, CSV, and Airtable exporters cover everything you need for v0.2 (Parquet export requires pyarrow).
Quick Start
Get your first pain point scan running in two steps. No API keys, no account setup.
1. Run your first scan
Scan a subreddit for pain points from the past week:
scopescrape scan --subreddits saas --keywords "frustrated with,alternative to" --time-range week
This fetches posts from r/saas matching your keywords, detects signal phrases across 4 tiers (pain point, emotional, comparison, ask), scores every post on 4 dimensions, and writes the results to results.json in your current directory.
2. Choose your output format
Export to JSON (default), CSV, or Parquet:
# CSV for spreadsheet analysis
scopescrape scan --subreddits saas --keywords "pain point" --output csv
# Custom output path
scopescrape scan --subreddits saas --keywords "pain point" --output json --output-file ./data/saas-results.json
If you installed the [parquet] extra, you can also use --output parquet for columnar storage.
Configuration
ScopeScrape works with zero configuration out of the box. For fine-tuning, create a YAML config file at ./config.yaml, ./config/config.yaml, or ~/.scopescrape/config.yaml.
```yaml
reddit:
  user_agent: "ScopeScrape v0.2 (by your_username)"
  rate_limit_delay: 1.0   # seconds between requests (~30 req/min safe limit)
  comment_depth: 3        # max depth for comment tree traversal

hn:
  rate_limit_delay: 0.1
  comment_depth: 5

twitter:
  nitter_instance: "https://nitter.net"
  rate_limit_delay: 1.0
  fallback_instances:
    - "https://nitter.unixfederal.com"
    - "https://nitter.cutelab.space"

github:
  token: ""               # Optional: GITHUB_TOKEN for 5000 req/hour vs 60 req/hour
  rate_limit_delay: 1.0

stackoverflow:
  api_key: ""             # Optional: for 10000 req/day vs 300 req/day
  rate_limit_delay: 0.5
  max_answers_per_question: 3

producthunt:
  token: ""               # Optional: PRODUCTHUNT_TOKEN for 1000+ req/hour
  rate_limit_delay: 0.5
  max_reviews_per_product: 5

indiehackers:
  rate_limit_delay: 0.2

airtable:
  api_key: ""             # Set via AIRTABLE_API_KEY env var
  base_id: ""
  scans_table_id: ""
  pain_points_table_id: ""
  signals_table_id: ""

scoring:
  weights:
    frequency: 0.25       # BM25 relevance across corpus
    intensity: 0.20       # VADER sentiment + signal phrase tier
    specificity: 0.25     # entity count + text length
    recency: 0.30         # exponential time decay (168h half-life)
  min_score: 5.0          # posts below this are dropped from results

storage:
  db_path: ~/.scopescrape/data.db
  retention_hours: 48
  in_memory: false

scan:
  default_limit: 100
  default_time_range: week
  default_platforms:
    - reddit
```
The scoring weights must sum to 1.0. Adjust them to emphasize what matters most to your research. For example, bump recency higher if you only care about this week's pain, or increase intensity to surface the most emotionally charged posts.
You can also pass --config path/to/config.yaml to the CLI to use a specific config file per run.
Platform Adapters
ScopeScrape v0.2 ships with 7 platform adapters. All work without API keys (authentication is optional to increase rate limits).
Reddit (Public JSON Endpoints)
Data source: Public JSON endpoints. Appending .json to any Reddit URL returns structured data. Auth: Optional (PRAW client_id/secret for higher limits). Rate limit: ~30 requests per minute without auth, up to 60/min with auth. Best for: General pain complaints, workflow frustrations, tool comparisons, community frustrations.
The adapter handles three fetch modes: searching a specific subreddit (subreddits + keywords), listing top/hot posts (subreddits only), or searching globally across Reddit (keywords only).
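For reference, a fetch in the subreddit-search mode looks roughly like this. A minimal sketch using the requests library; the helper name and return shape are illustrative, not ScopeScrape's internal API:

```python
import time
import requests

USER_AGENT = "ScopeScrape v0.2 (by your_username)"  # same value as config.yaml
RATE_LIMIT_DELAY = 1.0  # seconds between requests (~30 req/min unauthenticated)

def search_subreddit(subreddit: str, query: str, time_range: str = "week"):
    # Appending .json to a Reddit URL returns structured data
    url = f"https://www.reddit.com/r/{subreddit}/search.json"
    params = {"q": query, "restrict_sr": 1, "t": time_range, "limit": 100}
    resp = requests.get(url, params=params, headers={"User-Agent": USER_AGENT})
    resp.raise_for_status()
    time.sleep(RATE_LIMIT_DELAY)  # stay under the unauthenticated limit
    return [child["data"] for child in resp.json()["data"]["children"]]
```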
Hacker News (Algolia Search API)
Data source: Algolia Search API at hn.algolia.com/api/v1/. Auth: Not required. Rate limit: ~10,000 requests/day. Best for: Technical pain points, developer tool frustrations, infrastructure complaints, SaaS tool critiques.
Full-text search over stories and comments with time filtering and recursive comment tree fetching.
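A minimal request against that endpoint, as a sketch (parameter names follow Algolia's public documentation; the helper is not part of ScopeScrape):

```python
import requests

def search_hn(query: str, since_unix: int, limit: int = 100):
    resp = requests.get(
        "https://hn.algolia.com/api/v1/search",
        params={
            "query": query,
            "tags": "(story,comment)",  # search stories and comments
            "numericFilters": f"created_at_i>{since_unix}",  # time filter
            "hitsPerPage": limit,
        },
    )
    resp.raise_for_status()
    return resp.json()["hits"]
```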
Twitter/X (Nitter HTML Scraping)
Data source: Nitter instances (public HTML scraping). Auth: Not required. Rate limit: Depends on Nitter instance; generally 100+ requests per hour. Best for: Real-time frustrations, product launch reactions, viral pain points, feature requests from influencers.
Uses public Nitter instances to avoid the official API's strict rate limits and authentication. Automatically falls back between instances if one is unavailable. Supports keyword search with date-range filtering (day, week, month, year, all).
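The fallback behavior can be pictured like this. An illustrative sketch, not the adapter's actual code; the instance URLs match the config shown earlier:

```python
import requests

INSTANCES = [
    "https://nitter.net",              # primary
    "https://nitter.unixfederal.com",  # fallbacks, tried in order
    "https://nitter.cutelab.space",
]

def fetch_search_page(query: str) -> str:
    last_error = None
    for base in INSTANCES:
        try:
            resp = requests.get(f"{base}/search", params={"q": query}, timeout=10)
            resp.raise_for_status()
            return resp.text  # raw HTML, parsed downstream
        except requests.RequestException as err:
            last_error = err  # instance down or rate-limited; try the next
    raise RuntimeError(f"all Nitter instances unavailable: {last_error}")
```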
GitHub (REST Search API)
Data source: GitHub REST Search API. Auth: Optional (GITHUB_TOKEN for higher limits). Rate limit: 60 requests/hour unauthenticated, 5000 requests/hour with token. Best for: Library/framework pain points, developer tool limitations, open-source user feedback, feature requests.
Searches issues and discussions across all of GitHub or within specific repositories. Supports filtering by repository, issue type (issues/discussions), and sorting by relevance (stars).
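A bare-bones version of that search, sketched with requests (the helper name is illustrative; the endpoint and headers follow GitHub's REST documentation):

```python
import os
import requests

def search_issues(query: str, repo: str | None = None, limit: int = 50):
    q = f"{query} repo:{repo}" if repo else query  # optional repo filter
    headers = {"Accept": "application/vnd.github+json"}
    if token := os.environ.get("GITHUB_TOKEN"):
        headers["Authorization"] = f"Bearer {token}"  # 5000 req/hour vs 60
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": q, "per_page": limit},
        headers=headers,
    )
    resp.raise_for_status()
    return resp.json()["items"]
```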
Stack Overflow (Stack Exchange API)
Data source: Stack Exchange API v2.3. Auth: Optional (SO_API_KEY for higher limits). Rate limit: 300 requests/day unauthenticated, 10,000 requests/day with key. Best for: Technical implementation pain, coding language frustrations, library/framework issues, integration problems.
Fetches both questions and top answers. Supports tag-based filtering (e.g., "python", "javascript") and time-range filtering. Automatically fetches answers for high-scoring questions.
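Roughly, a question search against that API looks like this (illustrative sketch; search/advanced, tagged, and fromdate are documented Stack Exchange parameters):

```python
import requests

def search_questions(query: str, tags: str = "", since_unix: int = 0, key: str = ""):
    params = {
        "site": "stackoverflow",
        "q": query,
        "tagged": tags,          # e.g. "python" or "javascript"
        "fromdate": since_unix,  # time-range filtering
        "order": "desc",
        "sort": "relevance",
    }
    if key:
        params["key"] = key      # optional key for 10,000 req/day
    resp = requests.get(
        "https://api.stackexchange.com/2.3/search/advanced", params=params
    )
    resp.raise_for_status()
    return resp.json()["items"]
```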
Product Hunt (GraphQL API)
Data source: Product Hunt public GraphQL API. Auth: Optional (PRODUCTHUNT_TOKEN for higher limits). Rate limit: ~200 requests/hour unauthenticated, 1000+ requests/hour with token. Best for: Product feature gaps, competitive comparisons, user expectations, SaaS/tool reviews.
Searches products and fetches reviews. Reviews are especially rich for pain signals as users articulate specific frustrations with the product or its alternatives.
Indie Hackers (Algolia API)
Data source: Indie Hackers Algolia search endpoint (reverse-engineered). Auth: Not required. Rate limit: No official limit; in practice, hundreds of requests per day. Best for: Founder frustrations, business model pain, market validation challenges, pivots and failures, detailed user interviews.
Indie Hackers is exceptionally rich for pain signals because founders openly discuss failures, pivot decisions, and grinding through difficult problems. Supports keyword search with pagination.
Signal Detection
The signal detector scans every post and comment for 60+ phrases organized into 4 tiers. Each phrase is matched with word-boundary regex (\b) and case-insensitive flags. When a phrase matches, ScopeScrape captures 50 characters of surrounding context.
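Conceptually, the matcher works like this. A trimmed sketch with abbreviated phrase lists (the shipped detector covers 60+ phrases across the four tiers shown below):

```python
import re

TIERS = {
    "PAIN_POINT": (1.0, ["frustrated with", "broken", "dealbreaker"]),
    "EMOTIONAL":  (0.8, ["hate", "wish"]),
    "COMPARISON": (0.6, ["alternative to", "switched from"]),
    "ASK":        (0.4, ["how to", "anyone know"]),
}

def detect_signals(text: str) -> list[dict]:
    hits = []
    for tier, (weight, phrases) in TIERS.items():
        for phrase in phrases:
            # word-boundary, case-insensitive match
            pattern = re.compile(rf"\b{re.escape(phrase)}\b", re.IGNORECASE)
            for m in pattern.finditer(text):
                # capture ~50 characters of surrounding context
                start, end = max(0, m.start() - 50), min(len(text), m.end() + 50)
                hits.append({"tier": tier, "weight": weight,
                             "phrase": m.group(), "context": text[start:end]})
    return hits
```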
Signal Tiers
| Tier | Weight | Example Phrases |
|---|---|---|
| PAIN_POINT | 1.0 | "frustrated with", "broken", "dealbreaker", "bug" |
| EMOTIONAL | 0.8 | "hate", "wish", "love", "pray" |
| COMPARISON | 0.6 | "alternative to", "switched from", "vs", "better than" |
| ASK | 0.4 | "how to", "anyone know", "is there a way" |
Phrases are further tagged with categories (emotion, workflow, bug, feature_request, tool_eval, migration, help_seeking) for downstream analysis.
Scoring Framework
Every post that triggers at least one signal phrase gets scored on four dimensions. Each dimension produces a 0-to-10 score, then the weighted composite determines the final ranking.
Frequency (weight: 0.25)
BM25 relevance scoring across the fetched corpus. Measures how central a post is to the search terms. Currently uses the rank_bm25 library. Earlier builds assigned near-identical scores when the whole corpus came from a single search query; the v0.2 implementation differentiates these posts (see B-002 under Known Issues).
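In sketch form, using the rank_bm25 library (the whitespace tokenization and 0-10 rescaling here are assumptions, not the tool's exact implementation):

```python
from rank_bm25 import BM25Okapi

posts = [
    "really frustrated with our deploy tooling, it breaks weekly",
    "looking for an alternative to spreadsheets for tracking churn",
    "this tool is okay I guess",
]
corpus = [p.lower().split() for p in posts]        # naive tokenization
bm25 = BM25Okapi(corpus)
raw = bm25.get_scores("frustrated with".split())   # relevance per post
top = float(max(raw)) or 1.0
frequency_scores = [10.0 * float(s) / top for s in raw]  # rescale to 0-10
```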
Intensity (weight: 0.20)
Combines VADER sentiment analysis (60%) with signal tier weight (40%). Negative sentiment scores higher because frustration and complaints are more valuable signals than praise. A post that says "I absolutely hate this tool" scores higher intensity than "this tool is okay."
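A sketch of that blend, assuming the vaderSentiment package (the exact mapping onto the 0-10 scale is an assumption):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def intensity(text: str, tier_weight: float) -> float:
    compound = analyzer.polarity_scores(text)["compound"]  # -1 .. +1
    # invert so that more negative sentiment scores higher
    sentiment_part = (1.0 - compound) / 2.0 * 10.0         # 0 .. 10
    tier_part = tier_weight * 10.0                         # tier weights: 0.4-1.0
    return 0.6 * sentiment_part + 0.4 * tier_part

print(intensity("I absolutely hate this tool", 0.8))  # high intensity
print(intensity("this tool is okay", 0.4))            # low intensity
```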
Specificity (weight: 0.25)
Entity count (60%) combined with text length (40%). If spaCy is installed, named entity recognition identifies product names, companies, and tools. Without spaCy, a regex fallback extracts capitalized phrases and common patterns. Longer, more detailed posts score higher.
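The fallback can be approximated like this (illustrative only; the shipped extractor's patterns and stopword list may differ, but per the B-001 fix, bare pronouns and articles no longer count as entities):

```python
import re

STOPWORDS = {"I", "My", "The", "A", "An", "It", "We", "This", "That"}

def extract_entities(text: str) -> list[str]:
    # capitalized words/phrases, e.g. "Stripe" or "Google Sheets"
    candidates = re.findall(r"\b[A-Z][a-zA-Z0-9]+(?:\s+[A-Z][a-zA-Z0-9]+)*\b", text)
    return [c for c in candidates if c not in STOPWORDS]

print(extract_entities("My team switched from Airtable to Google Sheets"))
# -> ['Airtable', 'Google Sheets']
```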
Recency (weight: 0.30)
Exponential decay with a configurable half-life (default: 168 hours / 1 week). A post from today scores 10.0, a post from a week ago scores ~5.0, and a post from two weeks ago scores ~2.5. This keeps your results focused on current pain points rather than stale complaints.
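Worked through, the decay and the final weighted composite look like this (the four dimension scores at the end are made up for illustration):

```python
import math

HALF_LIFE_HOURS = 168.0  # default: one week, configurable

def recency(age_hours: float) -> float:
    # exponential decay: the score halves every HALF_LIFE_HOURS
    return 10.0 * math.pow(0.5, age_hours / HALF_LIFE_HOURS)

print(recency(0.0))    # 10.0 (posted today)
print(recency(168.0))  # 5.0  (one week ago)
print(recency(336.0))  # 2.5  (two weeks ago)

weights = {"frequency": 0.25, "intensity": 0.20, "specificity": 0.25, "recency": 0.30}
scores  = {"frequency": 6.2,  "intensity": 8.0,  "specificity": 5.5,  "recency": 9.1}
composite = sum(weights[k] * scores[k] for k in weights)
print(composite)  # ~7.26 -- clears the default min_score of 5.0
```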
scan
The primary command. Fetches posts, detects signals, scores results, and exports.
scopescrape scan [OPTIONS]
| Option | Description |
|---|---|
| --subreddits TEXT | Comma-separated subreddit names (e.g. "saas,startups") |
| --keywords TEXT | Comma-separated search terms (e.g. "frustrated with,broken") |
| --platforms CHOICE | reddit, hn, github, stackoverflow, twitter, producthunt, indiehackers, or all. Default: reddit |
| --time-range [day\|week\|month\|year] | How far back to search. Default: week |
| --limit INT | Max posts to fetch per platform. Default: 100 |
| --min-score FLOAT | Minimum composite score threshold. Default: 5.0 |
| --output FORMAT | json, csv, parquet, or airtable. Default: json |
| --output-file PATH | Destination file path. Default: results.{format} |
| --dry-run | Show what would be scanned without actually fetching |
At least one of --subreddits or --keywords is required.
Examples
# Scan r/saas for the past month, output CSV
scopescrape scan --subreddits saas --keywords "pain point,frustrated" --time-range month --output csv
# Scan all 7 platforms for frustration signals
scopescrape scan --keywords "frustrated with,hate using,broken" --platforms all
# Scan GitHub and Stack Overflow for technical pain
scopescrape scan --keywords "not working,bug,limitation" --platforms github --limit 50
scopescrape scan --keywords "how to,anyone know" --platforms stackoverflow --output csv
# Export to Airtable for structured analysis
scopescrape scan --subreddits saas --keywords "alternative to" --output airtable
# Scan Product Hunt reviews for feature gaps
scopescrape scan --keywords "wish it had,would be better if" --platforms producthunt
# Scan Indie Hackers for founder frustrations
scopescrape scan --keywords "pivoted,failed,trying to" --platforms indiehackers
# Dry run to preview parameters
scopescrape scan --subreddits saas --keywords "broken" --dry-run
# Multiple subreddits with lower score threshold
scopescrape scan --subreddits "saas,startups,entrepreneur" --keywords "pain point" --min-score 3.0
config
Display the current merged configuration with sensitive values masked.
scopescrape config
Shows the resolved config after merging defaults, YAML file, .env file, and environment variables. Any value that looks like a credential gets masked (e.g. abcd****).
platforms
List available platform adapters and their readiness status.
scopescrape platforms
Prints each platform name, adapter type, and whether it is ready to use. Hacker News is always ready (no auth needed). Reddit shows ready when using public JSON endpoints.
Global Options
These options work with any command:
scopescrape --version # Print version number
scopescrape --config FILE ... # Use a specific config file
scopescrape --verbose ... # Enable debug logging
scopescrape --quiet ... # Suppress info-level output
On Windows, if the scopescrape command is not on your PATH, use python -m scopescrape instead.
Airtable Integration
Export scan results directly to Airtable for structured analysis, team collaboration, and downstream workflows. The integration creates a three-table schema: Scans, Pain Points, and Signals.
Setup
1. Create an Airtable base and tables. You need:
- Scans table: One record per scan run. Fields: Scan ID, Platform, Subreddits, Keywords, Time Range, Posts Fetched, Results Scored, Min Score.
- Pain Points table: One record per scored post. Fields: Post ID, Platform, Source, Title, Body, Author, Score, Comment Count, URL, Created At, and a Scan link.
- Signals table: One record per detected signal phrase. Fields: Signal Text, Signal Tier, Category, Pain Point link.
2. Get your API credentials. Generate a personal access token at https://airtable.com/account/tokens with data.records:read and data.records:write scopes.
3. Configure ScopeScrape. Add to config.yaml:
```yaml
airtable:
  base_id: "appXXXXXXXXXXXXXX"
  scans_table_id: "tblXXXXXXXXXXXXXX"
  pain_points_table_id: "tblYYYYYYYYYYYYYY"
  signals_table_id: "tblZZZZZZZZZZZZZZ"
```
Or set the API key via environment variable:
export AIRTABLE_API_KEY="pat_XXXXXXXXXXXXXXX"
Running with Airtable Export
Once configured, use --output airtable to export:
# Scan and export to Airtable
scopescrape scan --subreddits saas --keywords "frustrated with" --output airtable
# The tool will:
# 1. Fetch posts from all specified platforms
# 2. Detect signal phrases and score results
# 3. Create a Scan record with metadata
# 4. Create Pain Point records (one per scored post)
# 5. Create Signal records (one per detected phrase), linked to their parent Pain Points
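If you want to script the same writes yourself, here is a sketch using the third-party pyairtable package (an assumption; ScopeScrape's exporter may use a different client). IDs and field names match the Setup section above:

```python
from pyairtable import Api

api = Api("pat_XXXXXXXXXXXXXXX")  # value of AIRTABLE_API_KEY
scans = api.table("appXXXXXXXXXXXXXX", "tblXXXXXXXXXXXXXX")
pain_points = api.table("appXXXXXXXXXXXXXX", "tblYYYYYYYYYYYYYY")

# one Scan record per run, then Pain Points linked back to it
scan_rec = scans.create({"Platform": "reddit", "Keywords": "frustrated with"})
pain_points.create({
    "Title": "Frustrated with onboarding flows",
    "Score": 7.3,
    "Scan": [scan_rec["id"]],  # linked-record fields take a list of record IDs
})
```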
Schema Details
Scans table: Tracks each scan run's parameters and results. Use this to compare scans over time or audit your research.
Pain Points table: Central table linking scan runs to detected pain points. Each record includes the full post data (title, body, author, engagement metrics) plus the final composite score.
Signals table: Granular view of detected signal phrases. Each record shows the phrase text, signal tier (PAIN_POINT, EMOTIONAL, COMPARISON, ASK), category (emotion, workflow, bug, feature_request, tool_eval, migration, help_seeking), and a link back to the parent pain point. Useful for analyzing which signals are most common or impactful.
Known Issues (v0.2)
The issues below were fixed in v0.2; the limitations that remain are listed after them. For everything else, see the project backlog.
FIXED in v0.2: B-001 - Noisy entity extractor. The regex fallback (used when spaCy is not installed) no longer picks up false-positive entities like "I", "My", and "The". Entity extraction is now more precise.
FIXED in v0.2: B-002 - Frequency scorer flat at 10.0. Improved BM25 implementation now correctly differentiates between posts even when all come from the same search query.
FIXED in v0.2: B-003 - "broken" false positives. Signal phrase matching is now context-aware. Posts like "I broke through my revenue ceiling" are no longer incorrectly flagged as pain points.
Known limitation: Nitter instance availability. Twitter/X scraping depends on Nitter instances staying online. If the primary instance is down, the adapter automatically falls back to alternates. If all instances are unavailable, Twitter scraping will fail gracefully.
Known limitation: Product Hunt rate limits. GraphQL API has stricter per-hour limits than the unauthenticated search endpoint. For high-volume scans, provide a PRODUCTHUNT_TOKEN.
What's Next
v0.2 added 5 new platform adapters (Twitter, GitHub, Stack Overflow, Product Hunt, Indie Hackers) and Airtable export. The backlog tracks upcoming work.
Near-term priorities include:
- Progress bar and ETA for long multi-platform scans
- Enhanced deduplication across platforms (same pain point mentioned on Reddit, HN, and Twitter)
- Filtering and post-processing UI for Airtable results
- Platform-specific pain signal libraries (e.g., "shipping is hard" for indie hackers, "memory leak" for GitHub)
Further out: LLM-powered intent classification, local web dashboard, direct Slack/Discord export, and integration with Notion.