I spent two weeks building a Reddit data pipeline for ScopeScrape and found myself answering the same questions over and over: what fields does a Submission actually return? How do I get all comments without hitting rate limits? Can I still access historical data now that Pushshift is gone? This post is the reference I wish existed when I started.
How do I get Reddit API access in 2026?
This changed significantly in late 2024. Reddit ended self-service API key generation under their new Responsible Builder Policy. You can no longer just go to reddit.com/prefs/apps and create credentials immediately.
The current process:
- Read Reddit's Responsible Builder Policy
- Submit an application via Reddit's Developer Support form describing your use case, which subreddits you'll access, and expected request volume
- Wait for approval (typically 2-4 weeks)
- Once approved, generate your OAuth2 credentials at reddit.com/prefs/apps
Having an established Reddit account with positive karma improves your chances. Accounts that look like they were created just for API access tend to get rejected.
What are the different app types and which should I pick?
| App Type | Use Case | OAuth Flow | Has client_secret? |
|---|---|---|---|
| Script | Personal scripts, bots, data pipelines running on your own machine | Password grant (requires your Reddit username and password) | Yes |
| Web app | Web services that act on behalf of other Reddit users | Authorization code grant (redirect-based) | Yes |
| Installed app | Mobile apps, desktop apps (runs on user's device) | Implicit grant | No (cannot keep a secret) |
For data collection and analysis, pick script. It's the simplest: you provide your client_id (the 14+ character string under "personal use script"), client_secret (the 27+ character string next to "secret"), your Reddit username, and your password. PRAW handles the rest.
```python
import praw

reddit = praw.Reddit(
    client_id="your_14char_client_id",
    client_secret="your_27char_client_secret",
    user_agent="scopescrape/0.1 by u/your_username",
    username="your_username",
    password="your_password",
)
```
Script apps only have access to accounts registered as "developers" of the app. If you need to act on behalf of arbitrary users, you need a web app.
What fields does a Submission object return?
PRAW dynamically provides attributes based on what Reddit's API returns. Reddit can add or remove fields without notice, and PRAW does not document specific attributes for this reason. That said, in practice these are the fields you can rely on as of March 2026:
| Field | Type | Notes |
|---|---|---|
| id | string (base36) | Unique identifier, e.g. "abc123de" |
| title | string | Post title. Plain text, max 300 characters. |
| selftext | string | Post body in Markdown. Empty string for link posts. |
| score | int | Net upvotes (upvotes minus downvotes). Not actual vote counts. |
| upvote_ratio | float | Proportion of upvotes to total votes. 0.5 means 50/50 split. |
| num_comments | int | Total comment count. May not match PRAW's extracted count exactly. |
| created_utc | float | Unix timestamp in seconds since epoch. |
| subreddit | Subreddit | Subreddit object. Use str(submission.subreddit) for the name. |
| author | Redditor or None | None if the account was deleted. Check before accessing. |
| permalink | string | Relative URL path, e.g. "/r/saas/comments/abc123de/..." |
| url | string | For link posts, the external URL. For self posts, the full Reddit URL. |
| is_self | bool | True for text posts, False for link posts. |
| link_flair_text | string or None | Post flair label. Subreddit-specific. None if no flair. |
| over_18 | bool | NSFW flag. |
| locked | bool | True if moderators locked comments. |
| stickied | bool | True if pinned by moderators. |
| distinguished | string or None | "moderator", "admin", or None. |
| is_original_content | bool | OC flag. Set by the post author. |
| spoiler | bool | True if marked as spoiler. |
| name | string | Fullname with type prefix, e.g. "t3_abc123de". |
PRAW objects are lazy: attributes are populated from whatever JSON Reddit returns (the full object isn't fetched until you first access a field), and fields can disappear without notice. Always use defensive access in production code:

```python
# Safe: returns 0 if the field is missing or deprecated
score = getattr(submission, "score", 0)

# Unsafe: raises AttributeError if the field is removed
score = submission.score
```
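Putting the field table and defensive access together, here's a sketch of an extraction helper. The function name and field selection are mine, not part of PRAW; it works on any object exposing these attributes:

```python
def submission_to_record(submission):
    """Pull the analysis-relevant fields into a plain dict, defensively."""
    fields = ["id", "title", "selftext", "score", "upvote_ratio",
              "num_comments", "created_utc", "is_self", "over_18"]
    record = {f: getattr(submission, f, None) for f in fields}
    # Normalize object-valued fields to strings; author can be None (deleted).
    record["subreddit"] = str(getattr(submission, "subreddit", "") or "")
    author = getattr(submission, "author", None)
    record["author"] = str(author) if author is not None else None
    return record
```

A dict like this serializes cleanly to JSON or a database row, which PRAW's lazy objects do not.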
What fields does a Comment object return?
| Field | Type | Notes |
|---|---|---|
| id | string (base36) | Unique identifier. |
| body | string | Comment text in Markdown. "[deleted]" if removed. |
| score | int | Net votes. Reddit rounds negative scores. |
| created_utc | float | Unix timestamp. |
| author | Redditor or None | None if account was deleted. |
| parent_id | string | Fullname of parent. "t3_xyz" = parent is a post. "t1_xyz" = parent is a comment. |
| replies | CommentForest | Child comments. May contain MoreComments stubs. |
| is_submitter | bool | True if comment author is the original post author (OP). |
| stickied | bool | True if pinned by moderators (usually automod or mod comments). |
| distinguished | string or None | "moderator", "admin", or None. |
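The parent_id prefix makes it easy to separate top-level comments from replies once you're working with a flattened comment list. A small helper (the names are mine):

```python
def is_top_level(parent_id):
    """True if the comment's parent is the submission itself (t3_ prefix)."""
    return parent_id.startswith("t3_")

def parent_short_id(parent_id):
    """Strip the type prefix: 't3_abc123de' -> 'abc123de'."""
    return parent_id.split("_", 1)[1]
```

For example, `[c for c in all_comments if is_top_level(c.parent_id)]` keeps only direct replies to the post.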
How do I get all comments from a post?
This is the single most common source of bugs I've seen. When you access submission.comments, you get a CommentForest object, not a flat list. The forest contains a mix of Comment objects and MoreComments stubs. If you iterate directly, you will hit AttributeError: 'MoreComments' object has no attribute 'body'.
The correct pattern:
```python
# Replace MoreComments stubs with actual comments.
# limit=None replaces ALL stubs (most thorough, most API calls).
# limit=32 is the default (replaces up to 32, drops the rest).
# limit=0 removes all stubs without fetching (fastest, loses deep comments).
submission.comments.replace_more(limit=32)

# Flatten the tree into a single list
all_comments = submission.comments.list()
for comment in all_comments:
    print(comment.body)
```
Each MoreComments stub replacement costs one API call that returns roughly 30-40 comments. On a popular post with 500+ comments, replacing all stubs can consume 15-20 API calls just for one post.
| replace_more() Parameter | Behavior | API Cost | When to Use |
|---|---|---|---|
| limit=None | Replace every stub | High (1 call per stub) | Deep analysis of a single post |
| limit=32 (default) | Replace up to 32, discard the rest | Moderate | Balanced data collection |
| limit=0 | Remove all stubs, keep only already-loaded comments | Zero | Speed-critical bulk scanning |
There is also a threshold parameter (default 0) that controls the minimum number of comments a stub must represent before it gets replaced. Setting threshold=10 skips small stubs that only contain a few replies.
Comment sort order
You can control how comments are sorted before fetching:
| Sort Value | Behavior |
|---|---|
| confidence | Reddit's default ranking algorithm (surfaces relevant, high-quality comments) |
| top | Highest score first |
| new | Most recent first |
| controversial | Highest disagreement (lots of both upvotes and downvotes) |
| old | Oldest first |
| qa | Q&A mode: original poster's replies surface first |
```python
# Set the sort BEFORE first accessing submission.comments --
# it determines how the comment tree is fetched.
submission.comment_sort = "new"
submission.comments.replace_more(limit=8)
comments = submission.comments.list()
```
What are the rate limits?
| Access Level | Rate Limit | Notes |
|---|---|---|
| Unauthenticated | 10 requests/minute | Essentially unusable for data collection |
| OAuth authenticated (free tier) | 100 requests/minute | Sufficient for small projects, tight for bulk scanning |
| Premium tier ($12,000+/year) | Higher limits (negotiated) | Contact Reddit's enterprise sales |
PRAW handles rate limiting automatically. It reads Reddit's response headers (X-Ratelimit-Remaining, X-Ratelimit-Reset) and sleeps when necessary. You do not need to add your own sleep calls.
If you exceed the limit, PRAW raises prawcore.exceptions.TooManyRequests. In practice this rarely happens because PRAW self-throttles, but it can occur if you're running multiple PRAW instances or if Reddit's headers are inconsistent.
Practical math for a scanning pipeline: at 100 req/min, you can fetch roughly 100 submissions per minute (one API call each). But if each submission has comments you want to expand, the budget shrinks fast. A post with 8 MoreComments stubs costs 9 total calls (1 for the post + 8 for stubs). Scanning 10 such posts would consume 90 of your 100 per-minute budget.
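That arithmetic generalizes into a quick budget check before launching a scan. A back-of-the-envelope helper (mine, not part of PRAW):

```python
def scan_minutes(num_posts, stubs_per_post, rate_limit=100):
    """Estimate minutes of API budget: each post costs 1 call plus 1 per stub."""
    total_calls = num_posts * (1 + stubs_per_post)
    return total_calls / rate_limit

# 10 posts with 8 stubs each = 90 calls, i.e. 0.9 of one minute's budget
minutes = scan_minutes(10, 8)
```

Running the estimate up front tells you whether a scan finishes in minutes or hours before you burn the quota finding out.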
What data does the API not expose?
These are the fields people commonly expect to exist but do not.
| Missing Data | What You Get Instead | Workaround |
|---|---|---|
| Individual upvote and downvote counts | score (net) and upvote_ratio (proportion) | Estimate: upvotes = score × upvote_ratio / (2 × upvote_ratio − 1). Breaks when upvote_ratio is 0.5. |
| View counts | Nothing. Removed years ago. | Use num_comments as a rough engagement proxy. |
| Edit history | edited (timestamp of last edit, or False) | None. You only see the current version. |
| Deleted content | "[deleted]" for both body and author | None via the API. Pushshift archives may have the original. |
| Private subreddit posts | 403 error | Must be an approved member of the subreddit. |
| Quarantined subreddit posts | 403 unless opted in | reddit.subreddit("name").quaran.opt_in() |
The vote count limitation is the most consequential for analysis. You're working with a survivorship-biased dataset (only non-deleted content) and approximate engagement metrics (net score, not absolute vote volume). Any analysis should acknowledge these constraints.
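If you do need rough absolute counts, they can be backed out from the two fields the API does give you. From score = up − down and ratio = up / (up + down), it follows that up = score × ratio / (2 × ratio − 1). A sketch (the helper is mine; these are estimates, and the system is underdetermined at a 50/50 ratio):

```python
def estimate_votes(score, upvote_ratio):
    """Estimate (upvotes, downvotes) from net score and upvote ratio.

    Derivation: score = up - down and ratio = up / (up + down)
    imply up = score * ratio / (2 * ratio - 1).
    Returns None when ratio == 0.5 (division by zero).
    """
    if abs(2 * upvote_ratio - 1) < 1e-9:
        return None
    up = round(score * upvote_ratio / (2 * upvote_ratio - 1))
    return up, up - score
```

For the example post below (score 847, ratio 0.92), this yields roughly 928 upvotes and 81 downvotes; treat those numbers as order-of-magnitude, since Reddit fuzzes scores.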
PRAW vs Async PRAW vs raw HTTP: which should I use?
| Approach | Best For | Rate Limiting | Thread Safety |
|---|---|---|---|
| PRAW (praw 7.7.1) | Scripts, data pipelines, CLIs | Automatic (reads response headers) | Not thread-safe. Do not share instances across threads. |
| Async PRAW (asyncpraw) | Discord bots, async web apps, asyncio pipelines | Automatic | Async-safe. Uses aiohttp internally. |
| Raw HTTP (requests/httpx) | Maximum control, non-Python languages | Manual (read headers yourself, implement backoff) | Depends on your HTTP client. |
PRAW is not thread-safe because it is built on requests.Session, which is not thread-safe. If you need parallelism, either use Async PRAW or create a separate PRAW instance per thread. Note that instances authenticated with the same credentials share the same rate limit, so parallelism does not buy you more throughput.
For ScopeScrape, I chose synchronous PRAW because the CLI tool runs single-threaded and PRAW's automatic rate limiting means I never have to think about backoff logic.
Can I still access historical Reddit data?
This is the question that comes up most in the post-2023 landscape. The short answer: it's complicated.
| Source | Status (March 2026) | Data Range | Cost |
|---|---|---|---|
| Reddit's official API | Active, requires approval | Real-time and recent (search endpoint is limited) | Free tier: 100 req/min |
| Pushshift (pushshift.io) | Real-time ingestion stopped in 2023. Historical archives remain. | 2005 to mid-2023 | Free (archives/dumps) |
| PullPush (pullpush.io) | Community-maintained Pushshift alternative | Varies. Check their endpoints. | Free API |
| Academic Torrents / BigQuery | Static dumps available | Pushshift monthly dumps through 2023 | Free (bulk download) |
Reddit's own search endpoint (subreddit.search()) is limited. It returns at most 250 results per query, does not support date range filtering natively, and results are sorted by relevance rather than time. For any serious historical analysis, you need Pushshift archives or PullPush.
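Since the endpoint won't filter by date natively, the practical workaround for recent windows is fetching a listing and filtering client-side on created_utc. A sketch (the helper name is mine; it works on anything exposing a created_utc attribute):

```python
import time

def within_days(items, days):
    """Keep only items whose created_utc falls inside the last `days` days."""
    cutoff = time.time() - days * 86400
    return [item for item in items if getattr(item, "created_utc", 0) >= cutoff]
```

Usage would look like `within_days(reddit.subreddit("saas").new(limit=100), 7)`, assuming an authenticated reddit instance; the listing endpoints return newest-first, so you can stop paginating once you cross the cutoff.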
For a tool like ScopeScrape that needs recent data (past week, past month), Reddit's API is sufficient. The search limitations only matter if you're trying to build a comprehensive historical dataset.
What does a real Submission look like as JSON?
```json
{
  "id": "abc123de",
  "title": "I wish there was a way to track customer churn reasons automatically",
  "selftext": "We lose ~15% of customers annually and nobody knows why...",
  "score": 847,
  "upvote_ratio": 0.92,
  "num_comments": 124,
  "created_utc": 1742726400.0,
  "subreddit": "saas",
  "author": "founder_jane",
  "permalink": "/r/saas/comments/abc123de/i_wish_there_was_a_way/",
  "url": "https://www.reddit.com/r/saas/comments/abc123de/i_wish_there_was_a_way/",
  "is_self": true,
  "link_flair_text": "Question",
  "over_18": false,
  "locked": false,
  "stickied": false,
  "distinguished": null,
  "is_original_content": false,
  "spoiler": false,
  "name": "t3_abc123de"
}
```
Note: this is a representative example, not a live API response. The actual JSON from Reddit includes additional internal fields (subreddit_id, domain, media, gilded, etc.) that are less commonly used in data analysis.
Common PRAW errors and how to fix them
| Error | Cause | Fix |
|---|---|---|
| AttributeError: 'MoreComments' object has no attribute 'body' | Iterating over CommentForest without calling replace_more() first | Call submission.comments.replace_more(limit=N), then .list() |
| prawcore.exceptions.OAuthException | Invalid client_id, client_secret, or Reddit credentials | Double-check all four values. client_id is the short string, client_secret is the long one. |
| prawcore.exceptions.ResponseException: 403 | Accessing a private or quarantined subreddit without permission | For quarantined: reddit.subreddit("name").quaran.opt_in(). For private: must be an approved member. |
| prawcore.exceptions.TooManyRequests | Exceeded 100 req/min (rare with PRAW, common with raw HTTP) | PRAW auto-throttles. If this fires, you may have multiple instances running. |
| praw.exceptions.RedditAPIException | Various Reddit-side errors (post not found, subreddit banned, etc.) | Catch and inspect exception.items for the specific error code. |
| AttributeError on a submission field | Reddit removed or renamed the field; PRAW fetches lazily | Use getattr(obj, 'field', default) for all field access in production. |
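For the RedditAPIException row, the catch-and-inspect pattern can be factored into a small helper. The function is mine; it only assumes the exception exposes an items list (as PRAW's RedditAPIException does) and falls back to the string form for anything else:

```python
def describe_api_error(exc):
    """Return human-readable error strings from a Reddit API exception."""
    items = getattr(exc, "items", None)
    if items:
        # Each RedditAPIException item carries error_type, message, and field.
        return [f"{getattr(i, 'error_type', '?')}: {getattr(i, 'message', '')}"
                for i in items]
    return [str(exc)]
```

Wrap your fetch calls in try/except, pass the caught exception through this, and log the result; that turns opaque failures into actionable error codes like SUBREDDIT_NOEXIST.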
How I use this data for pain point detection
Given these API constraints, here is the approach ScopeScrape takes. I scan three text fields per post: title, selftext, and each comment's body. I match against 60+ signal phrases organized into four tiers:
| Tier | Signal Type | Example Phrases |
|---|---|---|
| 1 (strongest) | Explicit pain | "I'm frustrated with", "this is broken", "driving me crazy", "we struggle with" |
| 2 | Seeking solutions | "is there a tool for", "anyone know how to", "looking for a way to" |
| 3 | Workarounds in use | "what I do instead", "my hack for this", "I ended up building" |
| 4 (weakest) | Implicit signals | "would be nice if", "thinking about switching", "looking for alternatives" |
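As a sketch of the matching step: the phrase dictionary below is a tiny illustrative subset of the 60+ signals, and the function name is mine, not ScopeScrape's actual API.

```python
TIER_PHRASES = {
    1: ["i'm frustrated with", "this is broken", "driving me crazy"],
    2: ["is there a tool for", "anyone know how to"],
    3: ["what i do instead", "i ended up building"],
    4: ["would be nice if", "looking for alternatives"],
}

def strongest_tier(text):
    """Return the strongest (lowest-numbered) tier matched in text, or None."""
    lowered = text.lower()
    for tier in sorted(TIER_PHRASES):
        if any(phrase in lowered for phrase in TIER_PHRASES[tier]):
            return tier
    return None
```

In the pipeline this runs over all three text fields per post: title, selftext, and each comment body.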
Each detected signal gets scored across four dimensions (frequency, intensity, specificity via NER, recency via time decay) with configurable YAML weights. The scoring framework is documented in a separate post.
Since the API does not expose individual vote counts, I use comment score distribution as a proxy for community consensus: if multiple comments expressing the same pain point have high scores, the community is validating that pain. High variance in scores suggests disagreement rather than consensus.
```python
def estimate_consensus(pain_comments):
    scores = [c.score for c in pain_comments]
    if not scores:
        return {"mean": 0, "median": 0, "agreement": "none"}
    mean_score = sum(scores) / len(scores)
    sorted_scores = sorted(scores)
    median_score = sorted_scores[len(scores) // 2]
    variance = sum((s - mean_score) ** 2 for s in scores) / len(scores)
    # Low variance + high mean = strong agreement
    # High variance = contentious topic
    agreement = "strong" if variance < 100 and mean_score > 10 else "weak"
    return {"mean": mean_score, "median": median_score, "agreement": agreement}
```
The full source code for ScopeScrape's Reddit adapter is on GitHub.