Installation

ScopeScrape requires Python 3.10 or newer. Install from source:

git clone https://github.com/cvasan2sh/scopescrape_tool.git
cd scopescrape_tool
pip install -e .

That gives you the core tool with zero API keys required. Reddit scraping uses public JSON endpoints, and Hacker News uses the open Algolia Search API.

For optional features, install extras:

# spaCy-powered entity extraction (better specificity scoring)
pip install -e ".[nlp]"

# Parquet export support
pip install -e ".[parquet]"

# Everything
pip install -e ".[all]"

On Python 3.14, spaCy and pyarrow do not yet have prebuilt wheels. The core tool works without them. The regex-based entity extractor and JSON/CSV exporters handle everything you need for v0.1.
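If you want to check which optional extras are available at runtime, a minimal sketch using only the standard library (the module names probed here are the upstream packages spacy and pyarrow, not ScopeScrape internals):

```python
import importlib.util

# Probe for the optional dependencies behind the [nlp] and [parquet] extras.
# ScopeScrape itself falls back to regex extraction and JSON/CSV export
# when these packages are absent.
def has_module(name: str) -> bool:
    """Return True if the named package is importable."""
    return importlib.util.find_spec(name) is not None

optional = {"spacy": "[nlp] extra", "pyarrow": "[parquet] extra"}
for module, extra in optional.items():
    status = "installed" if has_module(module) else f"missing (install via {extra})"
    print(f"{module}: {status}")
```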

Quick Start

Get your first pain point scan running in two steps. No API keys, no account setup.

1. Run your first scan

Scan a subreddit for pain points from the past week:

scopescrape scan --subreddits saas --keywords "frustrated with,alternative to" --time-range week

This fetches posts from r/saas matching your keywords, detects signal phrases across four tiers (pain point, emotional, comparison, ask), scores every post on four dimensions, and writes the results to results.json in your current directory.
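For a quick first look at the output, you can load results.json with the standard library. The field name composite_score below is illustrative; inspect your own results.json for the exact schema:

```python
import json

# Illustrative only: assumes results.json holds a list of scored posts,
# each with a "composite_score" field. Check your output for the real schema.
def top_posts(path: str, n: int = 5) -> list:
    with open(path) as f:
        posts = json.load(f)
    return sorted(posts, key=lambda p: p["composite_score"], reverse=True)[:n]
```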

2. Choose your output format

Export to JSON (default), CSV, or Parquet:

# CSV for spreadsheet analysis
scopescrape scan --subreddits saas --keywords "pain point" --output csv

# Custom output path
scopescrape scan --subreddits saas --keywords "pain point" --output json --output-file ./data/saas-results.json

If you installed the [parquet] extra, you can also use --output parquet for columnar storage.

Configuration

ScopeScrape works with zero configuration out of the box. For fine-tuning, create a YAML config file at ./config.yaml, ./config/config.yaml, or ~/.scopescrape/config.yaml.

reddit:
  user_agent: "ScopeScrape v0.1 (by your_username)"
  rate_limit_delay: 1.0    # seconds between requests (~30 req/min safe limit)
  comment_depth: 3          # max depth for comment tree traversal

hn:
  rate_limit_delay: 0.1
  comment_depth: 5

scoring:
  weights:
    frequency: 0.25    # BM25 relevance across corpus
    intensity: 0.20    # VADER sentiment + signal phrase tier
    specificity: 0.25  # entity count + text length
    recency: 0.30      # exponential time decay (168h half-life)
  min_score: 5.0       # posts below this are dropped from results

storage:
  db_path: ~/.scopescrape/data.db
  retention_hours: 48
  in_memory: false

scan:
  default_limit: 100
  default_time_range: week
  default_platforms:
    - reddit

The scoring weights must sum to 1.0. Adjust them to emphasize what matters most to your research. For example, bump recency higher if you only care about this week's pain, or increase intensity to surface the most emotionally charged posts.
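As a sketch of how the composite works, assuming a plain weighted sum of the four 0-to-10 dimension scores (the exact formula lives in the scoring module and may differ):

```python
# Assumed shape of the composite: a weighted sum of four 0-10 dimension
# scores using the config weights above, with the sum-to-1.0 check enforced.
WEIGHTS = {"frequency": 0.25, "intensity": 0.20, "specificity": 0.25, "recency": 0.30}

def composite(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"scoring weights must sum to 1.0, got {total}")
    return sum(weights[dim] * scores[dim] for dim in weights)

# A post scoring 8 on every dimension gets a composite of 8.0.
print(composite({"frequency": 8, "intensity": 8, "specificity": 8, "recency": 8}))
```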

You can also pass --config path/to/config.yaml to the CLI to use a specific config file per run.

Platform Adapters

ScopeScrape v0.1 ships with two platform adapters. Both work without API keys.

Reddit (Public JSON Endpoints)

Appending .json to any Reddit URL returns structured data. ScopeScrape uses this to search subreddits, list posts, and fetch full comment threads. Rate limiting is built in at ~30 requests per minute, with automatic retry and exponential backoff on 429 responses.

The adapter handles three fetch modes depending on your inputs: searching a specific subreddit (subreddits + keywords), listing top/hot posts (subreddits only), or searching globally across Reddit (keywords only).

Hacker News (Algolia Search API)

HN data is fetched through the Algolia Search API at hn.algolia.com/api/v1/. The adapter supports full-text search over stories and comments, time filtering via numericFilters, and recursive comment tree fetching. No authentication is required.
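A sketch of such a query, filtering stories to the past week via numericFilters on created_at_i (the Unix timestamp field Algolia exposes for HN items):

```python
import time
from urllib.parse import urlencode

# Illustrative Algolia query: full-text search over HN stories from the
# past week, using a numericFilters constraint on created_at_i.
def hn_search_url(query: str, hours_back: int = 168) -> str:
    cutoff = int(time.time()) - hours_back * 3600
    params = urlencode({
        "query": query,
        "tags": "story",
        "numericFilters": f"created_at_i>{cutoff}",
    })
    return f"https://hn.algolia.com/api/v1/search?{params}"

print(hn_search_url("frustrated with"))
```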

Signal Detection

The signal detector scans every post and comment for 60+ phrases organized into four tiers. Each phrase is matched with a case-insensitive, word-boundary (\b) regex. When a phrase matches, ScopeScrape captures 50 characters of surrounding context.
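The matching described above can be sketched like this (the phrase list here is a small sample, not the full 60+ phrase inventory):

```python
import re

# Sketch of tiered signal matching with word boundaries and context capture.
SIGNALS = {
    "PAIN_POINT": ["frustrated with", "broken", "dealbreaker"],
    "COMPARISON": ["alternative to", "switched from"],
}

def detect_signals(text: str, context_chars: int = 50) -> list[dict]:
    hits = []
    for tier, phrases in SIGNALS.items():
        for phrase in phrases:
            # \b on both sides so "broken" does not match inside "unbroken"
            for m in re.finditer(rf"\b{re.escape(phrase)}\b", text, re.IGNORECASE):
                start = max(0, m.start() - context_chars)
                end = min(len(text), m.end() + context_chars)
                hits.append({"tier": tier, "phrase": phrase, "context": text[start:end]})
    return hits

print(detect_signals("I'm so Frustrated With my current CRM, looking for an alternative to it."))
```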

Signal Tiers

Tier         Weight   Example Phrases
PAIN_POINT   1.0      "frustrated with", "broken", "dealbreaker", "bug"
EMOTIONAL    0.8      "hate", "wish", "love", "pray"
COMPARISON   0.6      "alternative to", "switched from", "vs", "better than"
ASK          0.4      "how to", "anyone know", "is there a way"

Phrases are further tagged with categories (emotion, workflow, bug, feature_request, tool_eval, migration, help_seeking) for downstream analysis.

Scoring Framework

Every post that triggers at least one signal phrase gets scored on four dimensions. Each dimension produces a 0-to-10 score, then the weighted composite determines the final ranking.

Frequency (weight: 0.25)

BM25 relevance scoring across the fetched corpus. Measures how central a post is to the search terms. Currently uses the rank_bm25 library. Known issue: when the entire corpus comes from the same search query, BM25 tends to assign similar scores to all posts.

Intensity (weight: 0.20)

Combines VADER sentiment analysis (60%) with signal tier weight (40%). Negative sentiment scores higher because frustration and complaints are more valuable signals than praise. A post that says "I absolutely hate this tool" scores higher intensity than "this tool is okay."

Specificity (weight: 0.25)

Entity count (60%) combined with text length (40%). If spaCy is installed, named entity recognition identifies product names, companies, and tools. Without spaCy, a regex fallback extracts capitalized phrases and common patterns. Longer, more detailed posts score higher.
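To illustrate the shape of the regex fallback (a simplified sketch, not the shipped extractor), a capitalized-phrase pattern might look like the following; note the noise on sentence-initial words, which is the behavior tracked as known issue B-001:

```python
import re

# Simplified sketch of a capitalized-phrase fallback extractor.
# Sentence-initial words like "I" and "The" also match, inflating
# specificity scores -- the noise documented as B-001.
def extract_entities(text: str) -> list[str]:
    # One or more adjacent capitalized words, e.g. "Google Sheets"
    return re.findall(r"\b[A-Z][a-zA-Z0-9]*(?:\s+[A-Z][a-zA-Z0-9]*)*\b", text)

print(extract_entities("I switched from Google Sheets to Airtable."))
```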

Recency (weight: 0.30)

Exponential decay with a configurable half-life (default: 168 hours / 1 week). A post from today scores 10.0, a post from a week ago scores ~5.0, and a post from two weeks ago scores ~2.5. This keeps your results focused on current pain points rather than stale complaints.
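The decay is simple enough to show directly; this sketch reproduces the worked numbers above, assuming the score is scaled so a brand-new post gets 10.0:

```python
# Recency: exponential decay with a 168-hour (1 week) half-life, scaled
# so a brand-new post scores 10.0.
def recency_score(age_hours: float, half_life_hours: float = 168.0) -> float:
    return 10.0 * 0.5 ** (age_hours / half_life_hours)

print(recency_score(0))    # 10.0  (today)
print(recency_score(168))  # 5.0   (one week old)
print(recency_score(336))  # 2.5   (two weeks old)
```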

scan

The primary command. Fetches posts, detects signals, scores results, and exports.

scopescrape scan [OPTIONS]
Option                              Description
--subreddits TEXT                   Comma-separated subreddit names (e.g. "saas,startups")
--keywords TEXT                     Comma-separated search terms (e.g. "frustrated with,broken")
--platforms [reddit|hn|all]         Which platform(s) to scan. Default: reddit
--time-range [day|week|month|year]  How far back to search. Default: week
--limit INT                         Max posts to fetch per platform. Default: 100
--min-score FLOAT                   Minimum composite score threshold. Default: 5.0
--output [json|csv|parquet]         Export format. Default: json
--output-file PATH                  Destination file path. Default: results.{format}
--dry-run                           Show what would be scanned without actually fetching

At least one of --subreddits or --keywords is required.

Examples

# Scan r/saas for the past month, output CSV
scopescrape scan --subreddits saas --keywords "pain point,frustrated" --time-range month --output csv

# Scan multiple subreddits with a lower score threshold
scopescrape scan --subreddits "saas,startups,entrepreneur" --keywords "alternative to" --min-score 3.0

# Dry run to preview parameters
scopescrape scan --subreddits saas --keywords "broken" --dry-run

# Scan Hacker News instead of Reddit
scopescrape scan --keywords "frustrated with,hate using" --platforms hn

# Scan both platforms
scopescrape scan --subreddits saas --keywords "broken" --platforms all

config

Display the current merged configuration with sensitive values masked.

scopescrape config

Shows the resolved config after merging defaults, YAML file, .env file, and environment variables. Any value that looks like a credential gets masked (e.g. abcd****).
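A sketch of masking in the style of that output (the real tool's heuristics for spotting credential-like values may differ):

```python
# Sketch of credential masking in the style of the config command's
# "abcd****" output; shows the first few characters, hides the rest.
def mask(value: str, visible: int = 4) -> str:
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "****"

print(mask("abcd1234secret"))  # abcd****
```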

platforms

List available platform adapters and their readiness status.

scopescrape platforms

Prints each platform name, adapter type, and whether it is ready to use. Both adapters ship ready: Hacker News needs no auth, and Reddit's public JSON endpoints need no credentials.

Global Options

These options work with any command:

scopescrape --version          # Print version number
scopescrape --config FILE ...  # Use a specific config file
scopescrape --verbose ...      # Enable debug logging
scopescrape --quiet ...        # Suppress info-level output

On Windows, if the scopescrape command is not on your PATH, use python -m scopescrape instead.

Known Issues (v0.1)

These are documented in the project backlog and will be addressed in upcoming releases:

B-001: Noisy entity extractor. The regex fallback (used when spaCy is not installed) picks up false-positive entities like "I", "My", and "The". This inflates specificity scores for posts that do not actually mention product names.

B-002: Frequency scorer flat at 10.0. When all posts come from the same search query, BM25 treats every post as maximally relevant to the query terms, producing identical frequency scores.

B-003: "broken" false positives. Promotional posts like "I broke through my revenue ceiling" trigger the PAIN_POINT tier because "broke" matches the signal phrase list.

What's Next

The backlog tracks everything planned. Near-term priorities include comment-level fetching and signal detection, a progress bar for long scans, content deduplication, and a subreddits-only browsing mode that works without keywords. Further out: LLM-powered classification, GitHub Discussions adapter, and a local dashboard for exploring results.