# Documentation
Everything you need to get started with ScopeScrape v0.1.
## Installation
ScopeScrape requires Python 3.10 or newer. Install from source:
```bash
git clone https://github.com/cvasan2sh/scopescrape_tool.git
cd scopescrape_tool
pip install -e .
```
That gives you the core tool with zero API keys required. Reddit scraping uses public JSON endpoints, and Hacker News uses the open Algolia Search API.
For optional features, install extras:
```bash
# spaCy-powered entity extraction (better specificity scoring)
pip install -e ".[nlp]"

# Parquet export support
pip install -e ".[parquet]"

# Everything
pip install -e ".[all]"
```
On Python 3.14, spaCy and pyarrow do not yet have prebuilt wheels. The core tool works without them. The regex-based entity extractor and JSON/CSV exporters handle everything you need for v0.1.
## Quick Start
Get your first pain point scan running in two steps. No API keys, no account setup.
### 1. Run your first scan
Scan a subreddit for pain points from the past week:
```bash
scopescrape scan --subreddits saas --keywords "frustrated with,alternative to" --time-range week
```
This fetches posts from r/saas matching your keywords, detects signal phrases across 4 tiers (pain point, emotional, comparison, ask), scores every post on 4 dimensions, and writes the results to `results.json` in your current directory.
### 2. Choose your output format
Export to JSON (default), CSV, or Parquet:
```bash
# CSV for spreadsheet analysis
scopescrape scan --subreddits saas --keywords "pain point" --output csv

# Custom output path
scopescrape scan --subreddits saas --keywords "pain point" --output json --output-file ./data/saas-results.json
```
If you installed the `[parquet]` extra, you can also use `--output parquet` for columnar storage.
## Configuration
ScopeScrape works with zero configuration out of the box. For fine-tuning, create a YAML config file at ./config.yaml, ./config/config.yaml, or ~/.scopescrape/config.yaml.
```yaml
reddit:
  user_agent: "ScopeScrape v0.1 (by your_username)"
  rate_limit_delay: 1.0   # seconds between requests (~30 req/min safe limit)
  comment_depth: 3        # max depth for comment tree traversal

hn:
  rate_limit_delay: 0.1
  comment_depth: 5

scoring:
  weights:
    frequency: 0.25    # BM25 relevance across corpus
    intensity: 0.20    # VADER sentiment + signal phrase tier
    specificity: 0.25  # entity count + text length
    recency: 0.30      # exponential time decay (168h half-life)
  min_score: 5.0       # posts below this are dropped from results

storage:
  db_path: ~/.scopescrape/data.db
  retention_hours: 48
  in_memory: false

scan:
  default_limit: 100
  default_time_range: week
  default_platforms:
    - reddit
```
The scoring weights must sum to 1.0. Adjust them to emphasize what matters most to your research. For example, bump recency higher if you only care about this week's pain, or increase intensity to surface the most emotionally charged posts.
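The composite calculation can be sketched in a few lines. This is an illustration of the weighted sum described here, not ScopeScrape's actual implementation; the function name is hypothetical, and the 1.0-sum check mirrors the rule above.

```python
# Default weights from the config above; each dimension scores 0-10.
WEIGHTS = {"frequency": 0.25, "intensity": 0.20, "specificity": 0.25, "recency": 0.30}

def composite_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-dimension 0-10 scores into one weighted composite."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"scoring weights must sum to 1.0, got {total}")
    return sum(scores[dim] * w for dim, w in weights.items())

# A very recent, emotionally charged but vague post:
score = composite_score({"frequency": 6.0, "intensity": 8.0, "specificity": 4.0, "recency": 10.0})
```

Because recency carries the largest default weight (0.30), a fresh post can outrank an older one even with weaker intensity or specificity.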
You can also pass `--config path/to/config.yaml` to the CLI to use a specific config file per run.
## Platform Adapters
ScopeScrape v0.1 ships with two platform adapters. Both work without API keys.
### Reddit (Public JSON Endpoints)
Appending .json to any Reddit URL returns structured data. ScopeScrape uses this to search subreddits, list posts, and fetch full comment threads. Rate limiting is built in at ~30 requests per minute with automatic retry on 429 responses and exponential backoff.
The adapter handles three fetch modes depending on your inputs: searching a specific subreddit (subreddits + keywords), listing top/hot posts (subreddits only), or searching globally across Reddit (keywords only).
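The subreddit-search mode boils down to building a `search.json` URL with the right query parameters. A minimal sketch of that convention, assuming the standard Reddit search parameters (`q`, `restrict_sr`, `t`, `limit`); the helper name is illustrative, not the adapter's real API:

```python
from urllib.parse import urlencode

def reddit_search_url(subreddit: str, query: str, time_range: str = "week", limit: int = 100) -> str:
    """Build a public-JSON search URL for one subreddit (first fetch mode above)."""
    params = urlencode({
        "q": query,
        "restrict_sr": 1,   # limit results to this subreddit
        "t": time_range,    # day | week | month | year
        "limit": limit,
        "sort": "new",
    })
    return f"https://www.reddit.com/r/{subreddit}/search.json?{params}"

url = reddit_search_url("saas", "frustrated with")
```

Fetching that URL (with a descriptive `User-Agent` header and the configured delay between requests) returns the same listing JSON the Reddit site itself renders.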
### Hacker News (Algolia Search API)
HN data is fetched through the Algolia Search API at hn.algolia.com/api/v1/. It supports full-text search over stories and comments, time filtering via numericFilters, and recursive comment-tree fetching. No authentication is required.
## Signal Detection
The signal detector scans every post and comment for 60+ phrases organized into 4 tiers. Each phrase is matched with word-boundary regex (\b) and case-insensitive flags. When a phrase matches, ScopeScrape captures 50 characters of surrounding context.
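The matching mechanics described above can be sketched as follows. The phrase table here is a tiny subset of the real 60+ list, and the function and field names are illustrative:

```python
import re

# Word-boundary, case-insensitive matching with ~50 characters of context,
# as described above. SIGNALS is a small illustrative subset of the phrase list.
SIGNALS = {"frustrated with": "PAIN_POINT", "alternative to": "COMPARISON", "how to": "ASK"}

def detect_signals(text: str, window: int = 50) -> list[dict]:
    hits = []
    for phrase, tier in SIGNALS.items():
        for m in re.finditer(rf"\b{re.escape(phrase)}\b", text, re.IGNORECASE):
            start = max(0, m.start() - window)
            hits.append({
                "phrase": phrase,
                "tier": tier,
                "context": text[start : m.end() + window],
            })
    return hits

hits = detect_signals("So Frustrated With my invoicing tool, looking for an alternative to it.")
```

Note the `re.IGNORECASE` flag: "Frustrated With" matches even though the phrase list is lowercase, and `re.escape` keeps phrases containing regex metacharacters safe.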
### Signal Tiers
| Tier | Weight | Example Phrases |
|---|---|---|
| PAIN_POINT | 1.0 | "frustrated with", "broken", "dealbreaker", "bug" |
| EMOTIONAL | 0.8 | "hate", "wish", "love", "pray" |
| COMPARISON | 0.6 | "alternative to", "switched from", "vs", "better than" |
| ASK | 0.4 | "how to", "anyone know", "is there a way" |
Phrases are further tagged with categories (emotion, workflow, bug, feature_request, tool_eval, migration, help_seeking) for downstream analysis.
## Scoring Framework
Every post that triggers at least one signal phrase gets scored on four dimensions. Each dimension produces a 0-to-10 score, then the weighted composite determines the final ranking.
### Frequency (weight: 0.25)
BM25 relevance scoring across the fetched corpus. Measures how central a post is to the search terms. Currently uses the rank_bm25 library. Known issue: when the entire corpus comes from the same search query, BM25 tends to assign similar scores to all posts.
### Intensity (weight: 0.20)
Combines VADER sentiment analysis (60%) with signal tier weight (40%). Negative sentiment scores higher because frustration and complaints are more valuable signals than praise. A post that says "I absolutely hate this tool" scores higher intensity than "this tool is okay."
### Specificity (weight: 0.25)
Entity count (60%) combined with text length (40%). If spaCy is installed, named entity recognition identifies product names, companies, and tools. Without spaCy, a regex fallback extracts capitalized phrases and common patterns. Longer, more detailed posts score higher.
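A sketch of what a capitalized-phrase fallback like the one described here might look like. The pattern and stopword list are illustrative, not the shipped code, and deliberately show why such a fallback needs filtering (see B-001 under Known Issues):

```python
import re

# Naive fallback: runs of Capitalized words, e.g. "Google Sheets".
# Without the stopword filter, sentence-initial words like "My" leak through.
STOPWORDS = {"I", "My", "The", "A", "It", "This"}

def extract_entities(text: str) -> list[str]:
    candidates = re.findall(r"\b[A-Z][\w']*(?:\s+[A-Z][\w']*)*", text)
    return [c for c in candidates if c not in STOPWORDS]

entities = extract_entities("My team switched from Notion to Google Sheets last month.")
```

spaCy's statistical NER avoids most of this noise, which is why the `[nlp]` extra improves specificity scoring.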
### Recency (weight: 0.30)
Exponential decay with a configurable half-life (default: 168 hours / 1 week). A post from today scores 10.0, a post from a week ago scores ~5.0, and a post from two weeks ago scores ~2.5. This keeps your results focused on current pain points rather than stale complaints.
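The decay curve is the standard exponential half-life formula, scaled to the 0-10 range. The exact implementation may differ; this reproduces the numbers quoted above:

```python
def recency_score(age_hours: float, half_life_hours: float = 168.0) -> float:
    """Score halves every half_life_hours: 10.0 now, 5.0 after one week."""
    return 10.0 * 0.5 ** (age_hours / half_life_hours)
```

With the default 168-hour half-life, `recency_score(0)` is 10.0, `recency_score(168)` is 5.0, and `recency_score(336)` is 2.5, matching the examples in the text.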
## scan
The primary command. Fetches posts, detects signals, scores results, and exports.
```bash
scopescrape scan [OPTIONS]
```
| Option | Description |
|---|---|
| `--subreddits TEXT` | Comma-separated subreddit names (e.g. "saas,startups") |
| `--keywords TEXT` | Comma-separated search terms (e.g. "frustrated with,broken") |
| `--platforms [reddit\|hn\|all]` | Which platform(s) to scan. Default: reddit |
| `--time-range [day\|week\|month\|year]` | How far back to search. Default: week |
| `--limit INT` | Max posts to fetch per platform. Default: 100 |
| `--min-score FLOAT` | Minimum composite score threshold. Default: 5.0 |
| `--output [json\|csv\|parquet]` | Export format. Default: json |
| `--output-file PATH` | Destination file path. Default: results.{format} |
| `--dry-run` | Show what would be scanned without actually fetching |
At least one of `--subreddits` or `--keywords` is required.
### Examples
```bash
# Scan r/saas for the past month, output CSV
scopescrape scan --subreddits saas --keywords "pain point,frustrated" --time-range month --output csv

# Scan multiple subreddits with a lower score threshold
scopescrape scan --subreddits "saas,startups,entrepreneur" --keywords "alternative to" --min-score 3.0

# Dry run to preview parameters
scopescrape scan --subreddits saas --keywords "broken" --dry-run

# Scan Hacker News instead of Reddit
scopescrape scan --keywords "frustrated with,hate using" --platforms hn

# Scan both platforms
scopescrape scan --subreddits saas --keywords "broken" --platforms all
```
## config
Display the current merged configuration with sensitive values masked.
```bash
scopescrape config
```
Shows the resolved config after merging defaults, YAML file, .env file, and environment variables. Any value that looks like a credential gets masked (e.g. abcd****).
## platforms
List available platform adapters and their readiness status.
```bash
scopescrape platforms
```
Prints each platform name, adapter type, and whether it is ready to use. Hacker News is always ready (no auth needed). Reddit shows ready when using public JSON endpoints.
## Global Options
These options work with any command:
```bash
scopescrape --version         # Print version number
scopescrape --config FILE ... # Use a specific config file
scopescrape --verbose ...     # Enable debug logging
scopescrape --quiet ...       # Suppress info-level output
```
On Windows, if the `scopescrape` command is not on your PATH, use `python -m scopescrape` instead.
## Known Issues (v0.1)
These are documented in the project backlog and will be addressed in upcoming releases:
B-001: Noisy entity extractor. The regex fallback (used when spaCy is not installed) picks up false-positive entities like "I", "My", and "The". This inflates specificity scores for posts that do not actually mention product names.
B-002: Frequency scorer flat at 10.0. When all posts come from the same search query, BM25 treats every post as maximally relevant to the query terms, producing identical frequency scores.
B-003: "broken" false positives. Promotional posts like "I broke through my revenue ceiling" trigger the PAIN_POINT tier because "broke" matches the signal phrase list.
## What's Next
The backlog tracks everything planned. Near-term priorities include comment-level fetching and signal detection, a progress bar for long scans, content deduplication, and a subreddits-only browsing mode that works without keywords. Further out: LLM-powered classification, GitHub Discussions adapter, and a local dashboard for exploring results.