I spent two weeks building a Reddit data pipeline for ScopeScrape and found myself answering the same questions over and over: what fields does a Submission actually return? How do I get all comments without hitting rate limits? Can I still access historical data now that Pushshift is gone? This post is the reference I wish existed when I started.

How do I get Reddit API access in 2026?

This changed significantly in late 2024. Reddit ended self-service API key generation under their new Responsible Builder Policy. You can no longer just go to reddit.com/prefs/apps and create credentials immediately.

The current process:

  1. Read Reddit's Responsible Builder Policy
  2. Submit an application via Reddit's Developer Support form describing your use case, which subreddits you'll access, and expected request volume
  3. Wait for approval (typically 2-4 weeks)
  4. Once approved, generate your OAuth2 credentials at reddit.com/prefs/apps

Having an established Reddit account with positive karma improves your chances. Accounts that look like they were created just for API access tend to get rejected.

What are the different app types and which should I pick?

| App Type | Use Case | OAuth Flow | Has client_secret? |
|---|---|---|---|
| Script | Personal scripts, bots, data pipelines running on your own machine | Password grant (requires your Reddit username and password) | Yes |
| Web app | Web services that act on behalf of other Reddit users | Authorization code grant (redirect-based) | Yes |
| Installed app | Mobile apps, desktop apps (runs on user's device) | Implicit grant | No (cannot keep a secret) |

For data collection and analysis, pick script. It's the simplest: you provide your client_id (the 14+ character string under "personal use script"), client_secret (the 27+ character string next to "secret"), your Reddit username, and your password. PRAW handles the rest.

```python
import praw

reddit = praw.Reddit(
    client_id="your_14char_client_id",
    client_secret="your_27char_client_secret",
    user_agent="scopescrape/0.1 by u/your_username",
    username="your_username",
    password="your_password",
)
```

Script apps only have access to accounts registered as "developers" of the app. If you need to act on behalf of arbitrary users, you need a web app.

What fields does a Submission object return?

PRAW dynamically provides attributes based on what Reddit's API returns. Reddit can add or remove fields without notice, and PRAW does not document specific attributes for this reason. That said, in practice these are the fields you can rely on as of March 2026:

| Field | Type | Notes |
|---|---|---|
| id | string (base36) | Unique identifier, e.g. "abc123de" |
| title | string | Post title. Plain text, max 300 characters. |
| selftext | string | Post body in Markdown. Empty string for link posts. |
| score | int | Net upvotes (upvotes minus downvotes). Not actual vote counts. |
| upvote_ratio | float | Proportion of upvotes to total votes. 0.5 means 50/50 split. |
| num_comments | int | Total comment count. May not match PRAW's extracted count exactly. |
| created_utc | float | Unix timestamp in seconds since epoch. |
| subreddit | Subreddit | Subreddit object. Use str(submission.subreddit) for the name. |
| author | Redditor or None | None if the account was deleted. Check before accessing. |
| permalink | string | Relative URL path, e.g. "/r/saas/comments/abc123de/..." |
| url | string | For link posts, the external URL. For self posts, the full Reddit URL. |
| is_self | bool | True for text posts, False for link posts. |
| link_flair_text | string or None | Post flair label. Subreddit-specific. None if no flair. |
| over_18 | bool | NSFW flag. |
| locked | bool | True if moderators locked comments. |
| stickied | bool | True if pinned by moderators. |
| distinguished | string or None | "moderator", "admin", or None. |
| is_original_content | bool | OC flag. Set by the post author. |
| spoiler | bool | True if marked as spoiler. |
| name | string | Fullname with type prefix, e.g. "t3_abc123de". |

Reddit can remove or rename fields at any time, and because PRAW fetches attributes lazily (it does not pull the full object until you access a field), a missing field only surfaces at access time. Use defensive access in production code:

```python
# Safe: returns 0 if the field is missing or was removed
score = getattr(submission, 'score', 0)

# Unsafe: raises AttributeError if the field is removed
score = submission.score
```
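One pattern that follows from this: extract everything you need in a single defensive pass into a plain dict. This is a hypothetical sketch (submission_to_dict and SUBMISSION_FIELDS are not PRAW APIs; the field list mirrors the table above):

```python
from types import SimpleNamespace  # used below to build a stub for testing

# Fields from the Submission table above; assumed stable as of this writing.
SUBMISSION_FIELDS = [
    "id", "title", "selftext", "score", "upvote_ratio", "num_comments",
    "created_utc", "permalink", "url", "is_self", "link_flair_text",
    "over_18", "locked", "stickied", "distinguished",
    "is_original_content", "spoiler", "name",
]

def submission_to_dict(submission):
    """Extract a plain dict, tolerating missing or removed fields."""
    record = {f: getattr(submission, f, None) for f in SUBMISSION_FIELDS}
    # author and subreddit are objects (or None); store their names as strings
    author = getattr(submission, "author", None)
    record["author"] = str(author) if author is not None else None
    record["subreddit"] = str(getattr(submission, "subreddit", ""))
    return record

# Works with any object exposing the same attributes, e.g. a test stub:
record = submission_to_dict(
    SimpleNamespace(id="abc123de", score=5, author=None, subreddit="saas")
)
```

Missing fields come back as None rather than raising, so a schema change in Reddit's API degrades a column instead of crashing the pipeline.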

What fields does a Comment object return?

| Field | Type | Notes |
|---|---|---|
| id | string (base36) | Unique identifier. |
| body | string | Comment text in Markdown. "[deleted]" if removed. |
| score | int | Net votes. Reddit fuzzes scores slightly, so small fluctuations are normal. |
| created_utc | float | Unix timestamp. |
| author | Redditor or None | None if account was deleted. |
| parent_id | string | Fullname of parent. "t3_xyz" = parent is a post. "t1_xyz" = parent is a comment. |
| replies | CommentForest | Child comments. May contain MoreComments stubs. |
| is_submitter | bool | True if comment author is the original post author (OP). |
| stickied | bool | True if pinned by moderators (usually automod or mod comments). |
| distinguished | string or None | "moderator", "admin", or None. |
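The parent_id prefix convention is easy to get wrong, so here is a minimal sketch of two helpers (is_top_level and parent_comment_id are illustrative names, not PRAW APIs):

```python
def is_top_level(parent_id: str) -> bool:
    """A comment is top-level when its parent fullname has the t3_ (post) prefix."""
    return parent_id.startswith("t3_")

def parent_comment_id(parent_id: str):
    """Return the parent comment's base36 id, or None for top-level comments."""
    return parent_id[3:] if parent_id.startswith("t1_") else None
```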

How do I get all comments from a post?

This is the single most common source of bugs I've seen. When you access submission.comments, you get a CommentForest object, not a flat list. The forest contains a mix of Comment objects and MoreComments stubs. If you iterate directly, you will hit AttributeError: 'MoreComments' object has no attribute 'body'.

The correct pattern:

```python
# Replace MoreComments stubs with actual comments.
# limit=None replaces ALL stubs (most thorough, most API calls).
# limit=32 is the default (replaces up to 32, drops the rest).
# limit=0 removes all stubs without fetching (fastest, loses deep comments).
submission.comments.replace_more(limit=32)

# Flatten the tree into a single list
all_comments = submission.comments.list()

for comment in all_comments:
    print(comment.body)
```

Each MoreComments stub replacement costs one API call that returns roughly 30-40 comments. On a popular post with 500+ comments, replacing all stubs can consume 15-20 API calls just for one post.

| replace_more() Parameter | Behavior | API Cost | When to Use |
|---|---|---|---|
| limit=None | Replace every stub | High (1 call per stub) | Deep analysis of a single post |
| limit=32 (default) | Replace up to 32, discard the rest | Moderate | Balanced data collection |
| limit=0 | Remove all stubs, keep only already-loaded comments | Zero | Speed-critical bulk scanning |

There is also a threshold parameter (default 0) that controls the minimum number of comments a stub must represent before it gets replaced. Setting threshold=10 skips small stubs that only contain a few replies.

Comment sort order

You can control how comments are sorted before fetching:

| Sort Value | Behavior |
|---|---|
| confidence | Reddit's default ranking algorithm (surfaces relevant, high-quality comments) |
| top | Highest score first |
| new | Most recent first |
| controversial | Highest disagreement (lots of both upvotes and downvotes) |
| old | Oldest first |
| qa | Q&A mode: original poster's replies surface first |

```python
submission.comment_sort = "new"
submission.comments.replace_more(limit=8)
comments = submission.comments.list()
```

What are the rate limits?

| Access Level | Rate Limit | Notes |
|---|---|---|
| Unauthenticated | 10 requests/minute | Essentially unusable for data collection |
| OAuth authenticated (free tier) | 100 requests/minute | Sufficient for small projects, tight for bulk scanning |
| Premium tier ($12,000+/year) | Higher limits (negotiated) | Contact Reddit's enterprise sales |

PRAW handles rate limiting automatically. It reads Reddit's response headers (X-Ratelimit-Remaining, X-Ratelimit-Reset) and sleeps when necessary. You do not need to add your own sleep calls.

If you exceed the limit, PRAW raises prawcore.exceptions.TooManyRequests. In practice this rarely happens because PRAW self-throttles, but it can occur if you're running multiple PRAW instances or if Reddit's headers are inconsistent.
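If you do want a guard, a generic retry wrapper is enough. This is a hedged sketch (with_backoff is a hypothetical helper, not a PRAW feature); in real use you would pass prawcore.exceptions.TooManyRequests as the retryable exception:

```python
import time

def with_backoff(fn, retries=3, base_delay=1.0, retryable=(Exception,)):
    """Call fn(), retrying with exponential backoff on retryable exceptions."""
    for attempt in range(retries):
        try:
            return fn()
        except retryable:
            if attempt == retries - 1:
                raise  # out of retries: surface the original exception
            time.sleep(base_delay * (2 ** attempt))

# Example usage with a deliberately flaky callable standing in for an API fetch:
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("simulated rate limit")
    return "ok"

result = with_backoff(flaky_fetch, retries=5, base_delay=0.001,
                      retryable=(ValueError,))
```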

Practical math for a scanning pipeline: at 100 req/min, you can fetch roughly 100 submissions per minute (one API call each). But if each submission has comments you want to expand, the budget shrinks fast. A post with 8 MoreComments stubs costs 9 total calls (1 for the post + 8 for stubs). Scanning 10 such posts would consume 90 of your 100 per-minute budget.
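That budget arithmetic can be captured in two small helpers (hypothetical names, same assumptions as the paragraph above: one call per post, one per replaced stub):

```python
def calls_for_posts(num_posts: int, stubs_per_post: int) -> int:
    """One call to fetch each post, plus one per MoreComments stub replaced."""
    return num_posts * (1 + stubs_per_post)

def posts_per_minute(stubs_per_post: int, budget: int = 100) -> int:
    """How many fully-expanded posts fit in a per-minute request budget."""
    return budget // (1 + stubs_per_post)
```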

What data does the API not expose?

These are the fields people commonly expect to exist but do not.

| Missing Data | What You Get Instead | Workaround |
|---|---|---|
| Individual upvote and downvote counts | score (net) and upvote_ratio (proportion) | Estimate: upvotes ≈ score × upvote_ratio / (2 × upvote_ratio − 1). Undefined when upvote_ratio is 0.5. |
| View counts | Nothing. Removed years ago. | Use num_comments as a rough engagement proxy. |
| Edit history | edited (timestamp of last edit, or False) | None. You only see the current version. |
| Deleted content | "[deleted]" for both body and author | None via the API. Pushshift archives may have the original. |
| Private subreddit posts | 403 error | Must be an approved member of the subreddit. |
| Quarantined subreddit posts | 403 unless opted in | reddit.subreddit("name").quaran.opt_in() |

The vote count limitation is the most consequential for analysis. You're working with a survivorship-biased dataset (only non-deleted content) and approximate engagement metrics (net score, not absolute vote volume). Any analysis should acknowledge these constraints.
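The vote estimate comes from solving the two definitions simultaneously: score = ups − downs and upvote_ratio = ups / (ups + downs) give ups = score × ratio / (2 × ratio − 1). A sketch (estimate_votes is a hypothetical helper):

```python
def estimate_votes(score: int, upvote_ratio: float):
    """Recover approximate up/down vote counts from score and upvote_ratio.

    From score = ups - downs and ratio = ups / (ups + downs):
        ups = score * ratio / (2 * ratio - 1)
    Undefined at ratio == 0.5 (a 50/50 split carries no magnitude information).
    """
    denominator = 2 * upvote_ratio - 1
    if abs(denominator) < 1e-9:
        return None  # ups and downs are unrecoverable
    ups = round(score * upvote_ratio / denominator)
    return ups, ups - score
```

Keep in mind these are approximations layered on already-fuzzed scores, so treat the output as an order-of-magnitude signal, not a measurement.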

PRAW vs Async PRAW vs raw HTTP: which should I use?

| Approach | Best For | Rate Limiting | Thread Safety |
|---|---|---|---|
| PRAW (praw 7.7.1) | Scripts, data pipelines, CLIs | Automatic (reads response headers) | Not thread-safe. Do not share instances across threads. |
| Async PRAW (asyncpraw) | Discord bots, async web apps, asyncio pipelines | Automatic | Async-safe. Uses aiohttp internally. |
| Raw HTTP (requests/httpx) | Maximum control, non-Python languages | Manual (read headers yourself, implement backoff) | Depends on your HTTP client. |

PRAW is not thread-safe because it relies on requests.Session, which is not thread-safe. If you need parallelism, either use Async PRAW or create a separate PRAW instance per thread; never share one instance across threads.

For ScopeScrape, I chose synchronous PRAW because the CLI tool runs single-threaded and PRAW's automatic rate limiting means I never have to think about backoff logic.

Can I still access historical Reddit data?

This is the question that comes up most in the post-2023 landscape. The short answer: it's complicated.

| Source | Status (March 2026) | Data Range | Cost |
|---|---|---|---|
| Reddit's official API | Active, requires approval | Real-time and recent (search endpoint is limited) | Free tier: 100 req/min |
| Pushshift (pushshift.io) | Real-time ingestion stopped in 2023. Historical archives remain. | 2005 to mid-2023 | Free (archives/dumps) |
| PullPush (pullpush.io) | Community-maintained Pushshift alternative | Varies. Check their endpoints. | Free API |
| Academic Torrents / BigQuery | Static dumps available | Pushshift monthly dumps through 2023 | Free (bulk download) |

Reddit's own search endpoint (subreddit.search()) is limited. It returns at most 250 results per query, does not support date range filtering natively, and results are sorted by relevance rather than time. For any serious historical analysis, you need Pushshift archives or PullPush.

For a tool like ScopeScrape that needs recent data (past week, past month), Reddit's API is sufficient. The search limitations only matter if you're trying to build a comprehensive historical dataset.

What does a real Submission look like as JSON?

```json
{
  "id": "abc123de",
  "title": "I wish there was a way to track customer churn reasons automatically",
  "selftext": "We lose ~15% of customers annually and nobody knows why...",
  "score": 847,
  "upvote_ratio": 0.92,
  "num_comments": 124,
  "created_utc": 1742726400.0,
  "subreddit": "saas",
  "author": "founder_jane",
  "permalink": "/r/saas/comments/abc123de/i_wish_there_was_a_way/",
  "url": "https://www.reddit.com/r/saas/comments/abc123de/i_wish_there_was_a_way/",
  "is_self": true,
  "link_flair_text": "Question",
  "over_18": false,
  "locked": false,
  "stickied": false,
  "distinguished": null,
  "is_original_content": false,
  "spoiler": false,
  "name": "t3_abc123de"
}
```

Note: this is a representative example, not a live API response. The actual JSON from Reddit includes additional internal fields (subreddit_id, domain, media, gilded, etc.) that are less commonly used in data analysis.
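Since created_utc is Unix seconds, converting it with a timezone-aware datetime avoids the local-time bugs that naive fromtimestamp() invites:

```python
from datetime import datetime, timezone

def created_at(created_utc: float) -> datetime:
    """Convert Reddit's created_utc (Unix seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)

# The timestamp from the example JSON above:
dt = created_at(1742726400.0)
```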

Common PRAW errors and how to fix them

| Error | Cause | Fix |
|---|---|---|
| AttributeError: 'MoreComments' object has no attribute 'body' | Iterating over CommentForest without calling replace_more() first | Call submission.comments.replace_more(limit=N) then .list() |
| prawcore.exceptions.OAuthException | Invalid client_id, client_secret, or Reddit credentials | Double-check all four values. client_id is the short string, client_secret is the long one. |
| prawcore.exceptions.ResponseException: 403 | Accessing a private or quarantined subreddit without permission | For quarantined: reddit.subreddit("name").quaran.opt_in(). For private: must be approved member. |
| prawcore.exceptions.TooManyRequests | Exceeded 100 req/min (rare with PRAW, common with raw HTTP) | PRAW auto-throttles. If this fires, you may have multiple instances running. |
| praw.exceptions.RedditAPIException | Various Reddit-side errors (post not found, subreddit banned, etc.) | Catch and inspect exception.items for the specific error code. |
| AttributeError on a submission field | Reddit removed or renamed the field; PRAW fetches lazily | Use getattr(obj, 'field', default) for all field access in production. |

How I use this data for pain point detection

Given these API constraints, here is the approach ScopeScrape takes. I scan three text fields per post: title, selftext, and each comment's body. I match against 60+ signal phrases organized into four tiers:

| Tier | Signal Type | Example Phrases |
|---|---|---|
| 1 (strongest) | Explicit pain | "I'm frustrated with", "this is broken", "driving me crazy", "we struggle with" |
| 2 | Seeking solutions | "is there a tool for", "anyone know how to", "looking for a way to" |
| 3 | Workarounds in use | "what I do instead", "my hack for this", "I ended up building" |
| 4 (weakest) | Implicit signals | "would be nice if", "thinking about switching", "looking for alternatives" |
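A minimal sketch of the tiered matching (SIGNAL_TIERS and strongest_signal are hypothetical names, the phrase lists are pared down for illustration, and real matching would want word-boundary and negation handling):

```python
# Tiny excerpt of the phrase lists; the real config has 60+ phrases.
SIGNAL_TIERS = {
    1: ["i'm frustrated with", "this is broken", "driving me crazy", "we struggle with"],
    2: ["is there a tool for", "anyone know how to", "looking for a way to"],
    3: ["what i do instead", "my hack for this", "i ended up building"],
    4: ["would be nice if", "thinking about switching", "looking for alternatives"],
}

def strongest_signal(text: str):
    """Return the strongest (lowest-numbered) tier matched in text, or None."""
    lowered = text.lower()
    for tier in sorted(SIGNAL_TIERS):
        if any(phrase in lowered for phrase in SIGNAL_TIERS[tier]):
            return tier
    return None
```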

Each detected signal gets scored across four dimensions (frequency, intensity, specificity via NER, recency via time decay) with configurable YAML weights. The scoring framework is documented in a separate post.

Since the API does not expose individual vote counts, I use comment score distribution as a proxy for community consensus: if multiple comments expressing the same pain point have high scores, the community is validating that pain. High variance in scores suggests disagreement rather than consensus.

```python
import statistics

def estimate_consensus(pain_comments):
    scores = [c.score for c in pain_comments]
    if not scores:
        return {"mean": 0, "median": 0, "agreement": "none"}

    mean_score = statistics.mean(scores)
    median_score = statistics.median(scores)  # correct for even-length lists too
    variance = statistics.pvariance(scores)

    # Low variance + high mean = strong agreement
    # High variance = contentious topic
    agreement = "strong" if variance < 100 and mean_score > 10 else "weak"
    return {"mean": mean_score, "median": median_score, "agreement": agreement}
```

The full source code for ScopeScrape's Reddit adapter is on GitHub.