GitHub issues are where developers talk about problems they actually face. Not the cleaned-up version from a marketing landing page. The real thing. Someone opens an issue when their workflow breaks, when a library does not do what they need, when they find a bug. The issue thread is where they and the maintainer collaborate on a solution.
This makes GitHub a goldmine for pain point discovery. And GitHub gives you a free API to search this data. No cost, no approval process. ScopeScrape uses it to monitor pain signals across open-source ecosystems.
What makes GitHub data different from Reddit or Twitter?
GitHub issues come from developers, often highly technical ones. The context matters. A complaint about performance on GitHub is usually backed by benchmarks or profiling. A feature request on GitHub comes with a use case. The person writing it understands the tool well enough to know what is missing.
This is different from Reddit, where anyone can post. GitHub issues self-select for signal strength. The person writing an issue has already tried the tool, encountered a real problem, and decided it was worth documenting publicly.
Issues also get responses from maintainers. You can see how the maintainer reacts to the complaint. Do they fix it? Do they explain why it is by design? Do they ignore it? This adds context to the pain signal.
Discussions (a newer GitHub feature) are more like forums. They allow longer-form conversations about problems, best practices, and roadmapping. A discussion might be titled "How do people handle X?" and attract hundreds of responses with different approaches. This is a signal of a systemic pain point that people are actively working around.
How does the GitHub REST API work?
GitHub's REST API is straightforward. You make HTTP requests to https://api.github.com and get back JSON. No GraphQL required (though GraphQL is available if you prefer).
For issues search, the endpoint is:
GET https://api.github.com/search/issues?q=repo:owner/repo+label:bug&per_page=30&sort=created&order=desc
The q parameter accepts a query string. You can search by repo, label, state (open or closed), author, creation date, and keyword. A simple example:
repo:nodejs/node is:issue label:performance created:>=2025-01-01
This returns all performance-related issues in Node.js (open and closed, since no state filter is given) created on or after January 1, 2025. The Search API returns at most 1000 results per query, paginated up to 100 per page.
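A minimal sketch of assembling that kind of search URL with Python's standard library. The `build_search_url` helper is my own naming, not part of any GitHub client; the endpoint and parameters are the ones shown above.

```python
from urllib.parse import urlencode

API = "https://api.github.com/search/issues"

def build_search_url(query, per_page=30, sort="created", order="desc"):
    """Assemble a GitHub issue-search URL; spaces and qualifiers in the
    query string are URL-encoded by urlencode."""
    params = urlencode({"q": query, "per_page": per_page,
                        "sort": sort, "order": order})
    return f"{API}?{params}"

url = build_search_url("repo:nodejs/node is:issue label:performance created:>=2025-01-01")
print(url)
```

Fetching the URL (with `urllib.request` or any HTTP client) returns a JSON body whose `items` array holds the matching issues.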
Rate limits: 60 req/hr unauthenticated, 5000 with a token
Unauthenticated requests are rate-limited to 60 per hour. For exploratory work, this is fine. You can pull 60 queries worth of data, then wait. But if you are building a tool that runs continuously, 60 per hour is tight.
Authenticate with a GitHub personal access token and you get 5000 requests per hour. A token takes almost no setup: go to GitHub Settings, then Developer settings, create a Personal Access Token with the public_repo scope, and pass it in the Authorization header. You are done.
curl -H "Authorization: token YOUR_TOKEN" "https://api.github.com/search/issues?q=repo:nodejs/node"
5000 requests per hour means you can monitor dozens of repositories continuously. ScopeScrape uses an authenticated token to avoid rate limit surprises.
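A sketch of how a client might attach the token and back off when the limit runs out. The `GITHUB_TOKEN` environment variable and both helper names are my own conventions; the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers are the ones GitHub actually sends.

```python
import os
import time

def auth_headers(token=None):
    """Build request headers, attaching a personal access token if one is
    available (here assumed to live in the GITHUB_TOKEN env var)."""
    token = token or os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers

def seconds_until_reset(remaining, reset_epoch, now=None):
    """How long to sleep before retrying, given the X-RateLimit-Remaining
    and X-RateLimit-Reset values from the last response. Zero means the
    quota is not exhausted and the next request can go out immediately."""
    if remaining > 0:
        return 0.0
    return max(0.0, reset_epoch - (now if now is not None else time.time()))
```

A polling loop would call `seconds_until_reset` after each response and `time.sleep` for that long, which keeps a continuous monitor inside the 5000/hour budget.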
REST API vs GraphQL
GitHub offers both. The REST API is simpler. You point at an endpoint, pass parameters, get JSON back. GraphQL is more flexible. You specify exactly which fields you want, and you can fetch multiple resources in one request.
For ScopeScrape, I chose REST. The reasons are practical. REST searches are simpler to express. The rate limits are the same either way. And REST responses are smaller when you do not need every field. For issues, I only care about title, body, creation date, and comment count. REST gives me those without extra noise.
If you need nested data (e.g., getting issues and their comments and the comments' reactions in a single request), GraphQL wins. But for issue search, REST is straightforward.
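To make the contrast concrete, here is roughly what that nested fetch looks like as a GraphQL query, held in a Python string. The field names follow GitHub's GraphQL schema as I understand it, so treat this as a sketch rather than a verified query.

```python
# One GraphQL round trip fetches issues, their comments, and reaction
# counts together -- something that takes several separate REST calls.
NESTED_QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    issues(first: 20, states: OPEN) {
      nodes {
        title
        comments(first: 10) {
          nodes {
            body
            reactions { totalCount }
          }
        }
      }
    }
  }
}
"""
# POST this to https://api.github.com/graphql with an Authorization header
# and a JSON body of {"query": NESTED_QUERY, "variables": {...}}.
```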
What data do you actually get from issues?
An issue object includes:
- Title and body (the original problem statement)
- Author, creation date, last update date
- State (open or closed)
- Labels (if the repo uses them)
- Number of comments
- Number of reactions (likes, thumbs down, etc.)
- Linked pull requests (if someone opened a PR to fix it)
Comments are not returned by the search endpoint. You need a separate request to get them. This matters for pain point analysis. The issue title might be vague, but the first comment could be detailed. To get the full picture, you need to fetch comments separately.
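A small sketch of the extra request implied here. The helper name is illustrative, but `/repos/{owner}/{repo}/issues/{number}/comments` is the standard REST path for issue comments.

```python
def comments_url(owner, repo, issue_number):
    """The search endpoint returns the issue itself; its comments live
    behind a separate REST endpoint and need their own request."""
    return (f"https://api.github.com/repos/{owner}/{repo}"
            f"/issues/{issue_number}/comments")

print(comments_url("nodejs", "node", 12345))
```

Each comments fetch costs one request against the rate limit, so a monitor typically fetches comments only for issues that already score well on title and reactions.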
Discussions are accessed differently: GitHub exposes repository discussions through its GraphQL API rather than a REST search endpoint, though the returned structure resembles issues. Discussions are newer, so some repos do not use them. But where they exist, they expose different pain signals than issues.
Searching issues effectively
The key is good query syntax. Bad queries get noise. Good queries surface real pain.
A weak query: repo:rails/rails error. This returns thousands of issues mentioning the word "error" somewhere. Most are noise.
A stronger query: repo:rails/rails is:issue state:open label:bug -label:wontfix created:>=2025-01-01. This returns only open bug reports filed since January 1, 2025, excluding issues labeled wontfix.
Even better: combine keywords that signal pain. Search for "performance" and "slow", or "broke" and "upgrade", or "confusing" in the issue body. These are more specific signals than generic keywords.
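One way to sketch this in code: expand a single repository into several narrow, pain-focused queries. The keyword list here is illustrative, not ScopeScrape's actual list; GitHub's search syntax ANDs terms together, and the `in:body` qualifier restricts a match to the issue body.

```python
# Illustrative pain-signal patterns -- swap in whatever your domain needs.
PAIN_KEYWORDS = [
    'performance slow',    # both terms must appear somewhere in the issue
    'broke upgrade',       # breakage reports tied to version upgrades
    'confusing in:body',   # "confusing" specifically in the body text
]

def pain_queries(repo, since):
    """Expand one repo into several narrow queries, one per pain signal,
    each scoped to issues created on or after `since` (YYYY-MM-DD)."""
    base = f"repo:{repo} is:issue created:>={since}"
    return [f"{base} {kw}" for kw in PAIN_KEYWORDS]

for q in pain_queries("rails/rails", "2025-01-01"):
    print(q)
```

Running each narrow query separately also keeps every result set under the Search API's 1000-result cap.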
Discussions reveal pain patterns you do not see in issues
Issues are problem reports. Discussions are conversations. Someone opens a discussion titled "How do you handle X in production?" and gets 50 responses with different approaches. This tells you people are not confident in the standard way of doing things. That is a pain signal.
Discussions also surface problems people work around. If a discussion is titled "Workarounds for Y", the fact that it exists means the problem is real enough that people have banded together to share solutions. That is a stronger signal than a single issue.
ScopeScrape treats issues and discussions as separate adapters (or sub-adapters). The scoring is the same, but the interpretation is different. A discussion with 100 comments signals persistent confusion. An issue with 100 comments signals a bug many people hit.
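An illustrative sketch of that interpretation step. The threshold and labels are invented for the example, not ScopeScrape's real scoring; the point is only that the same engagement number reads differently by source.

```python
def interpret(source, comment_count):
    """Map raw engagement to an interpretation. Illustrative only: the
    cutoff of 20 comments is an arbitrary choice for the example."""
    if comment_count < 20:
        return "low engagement"
    if source == "discussion":
        return "persistent confusion: many people comparing workarounds"
    return "widespread bug: many people hitting the same problem"
```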
Tradeoffs and limits
GitHub data is not complete. You only see public issues; private repositories are invisible. And not all projects use GitHub: if your market research depends on a tool that lives on GitLab or Gitea, GitHub data misses it.
Also, issues come from users who have a GitHub account and took the time to report. If a problem is bad enough that users give up rather than file an issue, you will not see it on GitHub. You are sampling from the motivated subset of the user base.
But what you do see is high-confidence signal. An issue with a clear reproduction case and a comment from the maintainer saying "yes, this is a bug" is actionable. That is not true of a Reddit post that might be a complaint, a misunderstanding, or both.