Simon Willison's Weblog: hacker-news

Profiling Hacker News users based on their comments

2026-03-21T23:59:47+00:00

Here's a mildly dystopian prompt I've been experimenting with recently: "Profile this user", accompanied by a copy of their last 1,000 comments on Hacker News.

Obtaining those comments is easy. The Algolia Hacker News API supports listing comments sorted by date that have a specific tag, and the author of a comment is tagged there as author_username. Here's a JSON feed of my (simonw) most recent comments, for example:

https://hn.algolia.com/api/v1/search_by_date?tags=comment,author_simonw&hitsPerPage=1000

The Algolia API is served with open CORS headers, which means you can access the API from JavaScript running on any web page.
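Here's a minimal sketch of fetching the same feed from Python instead of browser JavaScript. The function names are hypothetical, and it assumes each hit stores the comment body in a comment_text field, so check that against a real response before relying on it.

```python
import json
import urllib.parse
import urllib.request

API = "https://hn.algolia.com/api/v1/search_by_date"


def comments_url(username, limit=1000):
    # Tags are comma-separated; "author_<username>" filters by comment author.
    query = urllib.parse.urlencode(
        {"tags": f"comment,author_{username}", "hitsPerPage": limit},
        safe=",",  # keep the comma readable instead of %2C
    )
    return f"{API}?{query}"


def extract_comments(response):
    # Assumption: each hit carries the comment body in "comment_text".
    hits = response.get("hits", [])
    return [hit["comment_text"] for hit in hits if hit.get("comment_text")]


def fetch_comments(username):
    # Performs a real network request, so it is defined here but not called.
    with urllib.request.urlopen(comments_url(username)) as resp:
        return extract_comments(json.load(resp))
```

Calling fetch_comments("simonw") should return a list of comment bodies ready to paste into a prompt.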

Last August I had ChatGPT build me a simple tool for hitting that API for any user which fetches their comments and gives me a mobile-friendly "copy to clipboard" button. I've since tweaked it a few times with Claude.

I can then paste the whole lot into any LLM - these days I mostly use Claude Opus 4.6 for this - and prompt "profile this user" to see what it can figure out.

It's startlingly effective! It feels invasive to quote the profile of another user here, so I'll show you what it produces for me:

This is Simon Willison — a prolific, independent software developer, blogger, and one of the most visible voices in the AI-assisted coding space. Here's a profile drawn from his comments:

Professional identity: Independent developer and writer. Co-creator of Django, creator of Datasette and many other open source tools. On the board of the Python Software Foundation. Runs a heavily-trafficked personal blog at simonwillison.net with multiple content types (entries, link blog, notes, quotations). Monetizes through GitHub sponsors, ethical ads, and occasional consulting — not through any AI company payroll, a point he defends frequently.

Core thesis on AI coding: He's the leading evangelist for what he calls "agentic engineering" — using coding agents (Claude Code, Codex CLI) as productivity multipliers for skilled developers. His central argument: LLMs don't replace programmers, they amplify existing expertise. He's writing a guide/book on this at simonwillison.net/guides. He insists these tools are deceptively difficult to use well, and that most people who report bad results simply haven't learned the craft yet.

Working style: He does most of his programming from his iPhone via Claude Code for web, often while riding BART, walking the dog, or sitting in the garden. He embraces "YOLO mode" (auto-approving agent actions) and frequently runs 2-3 agent sessions in parallel. He starts every session with "run uv run pytest" to anchor the agent in TDD. He's on the $200/month Claude Max plan.

Key technical interests: Sandboxing and security (WebAssembly, Pyodide, sandbox-exec, Firecracker), SQLite (deeply), Python packaging and tooling (uv, PyPI distribution tricks), browser-in-a-browser experiments (v86, WASM Linux), and local LLM inference. He's been learning Go "by osmosis" through coding agents.

Security consciousness: He coined the term "prompt injection" and the "lethal trifecta" (access to private data + exposure to untrusted input + ability to take actions). He's genuinely alarmed about OpenClaw/Clawdbot security risks and has been predicting a headline-grabbing prompt injection attack "in the next six months" for over two years — acknowledging it keeps not happening while insisting the risk is real. He references the "normalization of deviance" pattern frequently.

Personality and debate style: Energetic, combative when challenged, but generally good-natured. He engages heavily in HN comment threads — sometimes dozens of replies in a single discussion. He's transparent about his biases and maintains a public disclosures page. He pushes back hard against what he sees as motivated reasoning from AI skeptics, but also against AGI hype ("I refer to that as the science fiction version of AI"). He gets visibly frustrated by sock-puppet accounts and bad-faith criticism. He has a self-deprecating streak ("my stupid pelican benchmark").

Recurring themes across hundreds of comments:

"Two things can be true at the same time" — he holds nuanced positions

Tests are for productivity, not just quality

The November 2025 model releases (Opus 4.5, GPT-5.2) were a genuine inflection point

Code review is the biggest bottleneck in agent-assisted workflows

"Cognitive debt" is a real and unsolved problem

The best engineering practices (tests, docs, CI/CD, clean code) also make agents work better

He's deliberately trying to "teach people good software engineering while tricking them into thinking the book is about AI"

Personal interests mentioned: Niche museums (runs niche-museums.com), New Zealand kākāpō parrots, cooking food from countries he's never visited, chickens in his garden. Lives in Half Moon Bay, CA near a major pelican roost. The pelican-on-a-bicycle SVG test is his signature LLM benchmark.

In a sentence: A deeply experienced, independently-minded developer who's genuinely excited about AI coding tools, fights hard against both uncritical hype and reflexive dismissal, and is trying to professionalize how the industry uses these tools — while also worrying loudly about the security implications almost nobody else takes seriously enough.

This all checks out! I ran this in Claude incognito mode to hopefully prevent Claude from guessing that I was evaluating myself and sycophantically glazing me - the tone of the response it gave here is similar to the tone I've seen against other accounts.

I expect it guessed my real name due to my habit of linking to my own writing from some of my comments, which provides plenty of simonwillison.net URLs for it to associate with my public persona. I haven't seen it take a guess at a real name for any of the other profiles I've generated.

It's a little creepy to be able to derive this much information about someone so easily, even when they've shared that freely in a public (and API-available) place.

I mainly use this to check that I'm not getting embroiled in an extensive argument with someone who has a history of arguing in bad faith. Thankfully that's rarely the case - Hacker News continues to be a responsibly moderated online space.

Tags: hacker-news, ai, generative-ai, llms, ai-ethics

Tips for getting coding agents to write good Python tests

2026-01-26T23:55:29+00:00

Someone asked on Hacker News if I had any tips for getting coding agents to write decent quality tests. Here's what I said:

I work in Python which helps a lot because there are a TON of good examples of pytest tests floating around in the training data, including things like usage of fixture libraries for mocking external HTTP APIs and snapshot testing and other neat patterns.

Or I can say "use pytest-httpx to mock the endpoints" and Claude knows what I mean.

Keeping an eye on the tests is important. The most common anti-pattern I see is large amounts of duplicated test setup code - which isn't a huge deal, I'm much more tolerant of duplicated logic in tests than I am in implementation, but it's still worth pushing back on.

"Refactor those tests to use pytest.mark.parametrize" and "extract the common setup into a pytest fixture" work really well there.

Generally though the best way to get good tests out of a coding agent is to make sure it's working in a project with an existing test suite that uses good patterns. Coding agents pick the existing patterns up without needing any extra prompting at all.

I find that once a project has clean basic tests the new tests added by the agents tend to match them in quality. It's similar to how working on large projects with a team of other developers works - keeping the code clean means when people look for examples of how to write a test they'll be pointed in the right direction.
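For anyone who hasn't seen them, here's roughly what the parametrize and fixture refactors mentioned earlier produce. This is an illustrative sketch: slugify and the sample titles are made up, and it assumes pytest is installed.

```python
import pytest


def slugify(title):
    # Hypothetical function under test.
    return "-".join(title.lower().split())


@pytest.fixture
def titles():
    # Shared setup lives in one fixture instead of being repeated per test.
    return ["Hello World", "Profile This User"]


@pytest.mark.parametrize(
    "title,expected",
    [
        ("Hello World", "hello-world"),
        ("  Extra   Spaces ", "extra-spaces"),
        ("single", "single"),
    ],
)
def test_slugify(title, expected):
    # One test body, run once per (title, expected) pair.
    assert slugify(title) == expected


def test_slugs_are_lowercase(titles):
    # pytest injects the fixture's return value by argument name.
    assert all(slugify(t) == slugify(t).lower() for t in titles)
```

Three near-identical test functions collapse into one parametrized test, and the shared input list is defined once.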

One last tip I use a lot is this:

Clone datasette/datasette-enrichments
from GitHub to /tmp and imitate the
testing patterns it uses

I do this all the time with different existing projects I've written - the quickest way to show an agent how you like something to be done is to have it look at an example.

Tags: hacker-news, python, testing, ai, pytest, generative-ai, llms, coding-agents

The most popular blogs of Hacker News in 2025

2026-01-02T19:10:43+00:00

Michael Lynch maintains HN Popularity Contest, a site that tracks personal blogs on Hacker News and scores them based on how well they perform on that platform.

The engine behind the project is the domain-meta.csv CSV on GitHub, a hand-curated list of known personal blogs with author and bio and tag metadata, which Michael uses to separate out personal blog posts from other types of content.

I came top of the rankings in 2023, 2024 and 2025 but I'm listed in third place for all time behind Paul Graham and Brian Krebs.

I dug around in the browser inspector and was delighted to find that the data powering the site is served with open CORS headers, which means you can easily explore it with external services like Datasette Lite.

Here's a convoluted window function query Claude Opus 4.5 wrote for me which, for a given domain, shows where that domain ranked for each year since it first appeared in the dataset:

with yearly_scores as (
  select 
    domain,
    strftime('%Y', date) as year,
    sum(score) as total_score,
    count(distinct date) as days_mentioned
  from "hn-data"
  group by domain, strftime('%Y', date)
),
ranked as (
  select 
    domain,
    year,
    total_score,
    days_mentioned,
    rank() over (partition by year order by total_score desc) as rank
  from yearly_scores
)
select 
  r.year,
  r.total_score,
  r.rank,
  r.days_mentioned
from ranked r
where r.domain = :domain
  and r.year >= (
    select min(strftime('%Y', date)) 
    from "hn-data"
    where domain = :domain
  )
order by r.year desc

(I just noticed that the last and r.year >= ( clause isn't actually needed here.)
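Here's a sketch of the simplified query, without that clause, run against an in-memory SQLite database from Python. The rows are invented for illustration and the table only has the three columns the query needs. The clause is redundant because the ranked CTE can only contain years in which the domain has rows, and those are never earlier than its first appearance. Window functions need SQLite 3.25 or later.

```python
import sqlite3

SQL = """
with yearly_scores as (
  select domain, strftime('%Y', date) as year, sum(score) as total_score
  from "hn-data"
  group by domain, strftime('%Y', date)
),
ranked as (
  select domain, year, total_score,
    rank() over (partition by year order by total_score desc) as rank
  from yearly_scores
)
select r.year, r.total_score, r.rank
from ranked r
where r.domain = :domain
order by r.year desc
"""

db = sqlite3.connect(":memory:")
db.execute('create table "hn-data" (domain text, date text, score integer)')
db.executemany(
    'insert into "hn-data" values (?, ?, ?)',
    [
        # Made-up rows, purely for illustration.
        ("a.example", "2024-01-05", 100),
        ("a.example", "2024-06-01", 50),
        ("b.example", "2024-03-10", 400),
        ("a.example", "2025-02-02", 300),
        ("b.example", "2025-02-03", 20),
    ],
)
rows = db.execute(SQL, {"domain": "a.example"}).fetchall()
print(rows)  # [('2025', 300, 1), ('2024', 150, 2)]
```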

My simonwillison.net results show me ranked 3rd in 2022, 30th in 2021 and 85th back in 2007 - though I expect there are many personal blogs from that year which haven't yet been manually added to Michael's list.

Also useful is that every domain gets its own CORS-enabled CSV file with details of the actual Hacker News stories submitted from that domain, e.g. https://hn-popularity.cdn.refactoringenglish.com/domains/simonwillison.net.csv. Here's that one in Datasette Lite.

Via Hacker News

Tags: hacker-news, sql, sqlite, datasette, datasette-lite, cors

Could LLMs encourage new programming languages?

2025-11-07T16:00:42+00:00

My hunch is that existing LLMs make it easier to build a new programming language in a way that captures new developers.

Most programming languages are similar enough to existing languages that you only need to know a small number of details to use them: what's the core syntax for variables, loops, conditionals and functions? How does memory management work? What's the concurrency model?

For many languages you can fit all of that, including illustrative examples, in a few thousand tokens of text.

So ship your new programming language with a Claude Skills style document and give your early adopters the ability to write it with LLMs. The LLMs should handle that very well, especially if they get to run an agentic loop against a compiler or even a linter that you provide.

This post started as a comment.

Tags: hacker-news, programming-languages, ai, generative-ai, llms, ai-assisted-programming, coding-agents, skills

Setting up a codebase for working with coding agents

2025-10-25T18:42:24+00:00

Someone on Hacker News asked for tips on setting up a codebase to be more productive with AI coding tools. Here's my reply:

Good automated tests which the coding agent can run. I love pytest for this - one of my projects has 1500 tests and Claude Code is really good at selectively executing just tests relevant to the change it is making, and then running the whole suite at the end.

Give them the ability to interactively test the code they are writing too. Notes on how to start a development server (for web projects) are useful, then you can have them use Playwright or curl to try things out.

I'm having great results from maintaining a GitHub issues collection for projects and pasting URLs to issues directly into Claude Code.

I actually don't think documentation is too important: LLMs can read the code a lot faster than you to figure out how to use it. I have comprehensive documentation across all of my projects but I don't think it's that helpful for the coding agents, though they are good at helping me spot if it needs updating.

Linters, type checkers, auto-formatters - give coding agents helpful tools to run and they'll use them.

For the most part anything that makes a codebase easier for humans to maintain turns out to help agents as well.

Update: Thought of another one: detailed error messages! If a manual or automated test fails the more information you can return back to the model the better, and stuffing extra data in the error message or assertion is a very inexpensive way to do that.
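A small sketch of what stuffing extra data into an assertion can look like. The check_rows helper and its data are hypothetical.

```python
def check_rows(rows):
    # Hypothetical validation helper: collect every offending row first so
    # the failure message can describe the problem, not just report "False".
    bad = [row for row in rows if row.get("score", 0) < 0]
    assert not bad, (
        f"{len(bad)} of {len(rows)} rows have a negative score; "
        f"first offender: {bad[0]!r}"
    )
```

A bare "assert not bad" would tell the agent only that something failed. This version says how many rows are wrong and shows one of them, which is often enough for the model to fix the problem on the next iteration.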

Tags: hacker-news, ai, pytest, generative-ai, llms, ai-assisted-programming, coding-agents

Quoting IanCal

2025-09-06T06:41:49+00:00

RDF has the same problems as the SQL schemas with information scattered. What fields mean requires documentation.

There - they have a name on a person. What name? Given? Legal? Chosen? Preferred for this use case?

You only have one ID for Apple eh? Companies are complex to model, do you mean Apple just as someone would talk about it? The legal structure of entities that underpins all major companies, what part of it is referred to?

I spent a long time building identifiers for universities and companies (which was taken for ROR later) and it was a nightmare to say what a university even was. What’s the name of Cambridge? It’s not “Cambridge University” or “The university of Cambridge” legally. But it also is the actual name as people use it. [It's The Chancellor, Masters, and Scholars of the University of Cambridge]

The university of Paris went from something like 13 institutes to maybe one to then a bunch more. Are companies locations at their headquarters? Which headquarters?

Someone will suggest modelling to solve this but here lies the biggest problem:

The correct modelling depends on the questions you want to answer.

— IanCal, on Hacker News, discussing RDF

Tags: hacker-news, metadata, rdf, sql

Quoting potatolicious

2025-08-21T21:44:19+00:00

Most classical engineering fields deal with probabilistic system components all of the time. In fact I'd go as far as to say that inability to deal with probabilistic components is disqualifying from many engineering endeavors.

Process engineers for example have to account for human error rates. On a given production line with humans in a loop, the operators will sometimes screw up. Designing systems to detect these errors (which are highly probabilistic!), mitigate them, and reduce the occurrence rates of such errors is a huge part of the job. [...]

Software engineering is unlike traditional engineering disciplines in that for most of its lifetime it's had the luxury of purely deterministic expectations. This is not true in nearly every other type of engineering.

— potatolicious, in a conversation about AI engineering

Tags: hacker-news, software-engineering, ai, generative-ai

Quoting mrmincent

2025-07-01T17:07:27+00:00

To misuse a woodworking metaphor, I think we’re experiencing a shift from hand tools to power tools.

You still need someone who understands the basics to get the good results out of the tools, but they’re not chiseling fine furniture by hand anymore, they’re throwing heaps of wood through the tablesaw instead. More productive, but more likely to lose a finger if you’re not careful.

— mrmincent, Hacker News comment on Claude Code

Tags: hacker-news, ai, generative-ai, llms, ai-assisted-programming, claude-code

My AI Skeptic Friends Are All Nuts

2025-06-02T23:56:49+00:00

Thomas Ptacek's frustrated tone throughout this piece perfectly captures how it feels sometimes to be an experienced programmer trying to argue that "LLMs are actually really useful" in many corners of the internet.

Some of the smartest people I know share a bone-deep belief that AI is a fad — the next iteration of NFT mania. I’ve been reluctant to push back on them, because, well, they’re smarter than me. But their arguments are unserious, and worth confronting. Extraordinarily talented people are doing work that LLMs already do better, out of spite. [...]

You’ve always been responsible for what you merge to main. You were five years go. And you are tomorrow, whether or not you use an LLM. [...]

Reading other people’s code is part of the job. If you can’t metabolize the boring, repetitive code an LLM generates: skills issue! How are you handling the chaos human developers turn out on a deadline?

And on the threat of AI taking jobs from engineers (with a link to an old comment of mine):

So does open source. We used to pay good money for databases.

We're a field premised on automating other people's jobs away. "Productivity gains," say the economists. You get what that means, right? Fewer people doing the same stuff. Talked to a travel agent lately? Or a floor broker? Or a record store clerk? Or a darkroom tech?

The post has already attracted 695 comments on Hacker News in just two hours, which feels like some kind of record even by the usual standards of fights about AI on the internet.

Update: Thomas, another hundred or so comments later:

A lot of people are misunderstanding the goal of the post, which is not necessarily to persuade them, but rather to disrupt a static, unproductive equilibrium of uninformed arguments about how this stuff works. The commentary I've read today has to my mind vindicated that premise.

Via @sockpuppet.org

Tags: hacker-news, thomas-ptacek, ai, generative-ai, llms, ai-assisted-programming

llm-hacker-news 0.1.1

2025-05-05T17:10:45+00:00

Release: llm-hacker-news 0.1.1

Tags: hacker-news, llm

The GeoGuessr StreetView meta-game

2025-04-26T16:56:59+00:00

My post on o3 guessing locations from photos made it to Hacker News and by far the most interesting comments are from SamPatt, a self-described competitive GeoGuessr player.

In a thread about meta-knowledge of the StreetView cars used in different regions:

The photography matters a great deal - they're categorized into "Generations" of coverage. Gen 2 is low resolution, Gen 3 is pretty good but has a distinct car blur, Gen 4 is highest quality. Each country tends to have only one or two categories of coverage, and some are so distinct you can immediately know a location based solely on that (India is the best example here). [...]

Nigeria and Tunisia have follow cars. Senegal, Montenegro and Albania have large rifts in the sky where the panorama stitching software did a poor job. Some parts of Russia had recent forest fires and are very smokey. One road in Turkey is in absurdly thick fog. The list is endless, which is why it's so fun!

Sam also has his own custom Obsidian flashcard deck "with hundreds of entries to help me remember road lines, power poles, bollards, architecture, license plates, etc".

I asked Sam how closely the GeoGuessr community track updates to street view imagery, and unsurprisingly those are a big deal. Sam pointed me to this 10 minute video review by zi8gzag of the latest big update from three weeks ago:

This is one of the biggest updates in years in my opinion. It could be the biggest update since the 2022 update that gave Gen 4 to Nigeria, Senegal, and Rwanda. It's definitely on the same level as the Kazakhstan update or the Germany update in my opinion.

Tags: geospatial, hacker-news, streetview, geoguessing

llm-hacker-news

2025-04-08T00:11:30+00:00

I built this new plugin to exercise the new register_fragment_loaders() plugin hook I added to LLM 0.24. It's the plugin equivalent of the Bash script I've been using to summarize Hacker News conversations for the past 18 months.

You can use it like this:

llm install llm-hacker-news
llm -f hn:43615912 'summary with illustrative direct quotes'

You can see the output in this issue.

The plugin registers a hn: prefix - combine that with the ID of a Hacker News conversation to pull that conversation into the context.

It uses the Algolia Hacker News API which returns JSON like this. Rather than feed the JSON directly to the LLM it instead converts it to a hopefully more LLM-friendly format that looks like this example from the plugin's test:

[1] BeakMaster: Fish Spotting Techniques

[1.1] CoastalFlyer: The dive technique works best when hunting in shallow waters.

[1.1.1] PouchBill: Agreed. Have you tried the hover method near the pier?

[1.1.2] WingSpan22: My bill gets too wet with that approach.

[1.1.2.1] CoastalFlyer: Try tilting at a 40° angle like our Australian cousins.

[1.2] BrownFeathers: Anyone spotted those "silver fish" near the rocks?

[1.2.1] GulfGlider: Yes! They're best caught at dawn.
Just remember: swoop > grab > lift

That format was suggested by Claude, which then wrote most of the plugin implementation for me. Here's that Claude transcript.
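For illustration, here's a sketch of how a nested conversation could be flattened into that numbered format. This is not the plugin's actual code. It assumes each node is a dict with author and children keys plus either a title (for the story) or text (for a comment), and it skips details like stripping HTML or handling deleted comments.

```python
def render(node, prefix="1"):
    # Assumption: the story node has a "title", comment nodes have "text".
    body = node.get("title") or node.get("text") or ""
    lines = [f"[{prefix}] {node.get('author')}: {body}"]
    # Number children 1, 2, 3... beneath the parent's own prefix.
    for i, child in enumerate(node.get("children", []), start=1):
        lines.extend(render(child, f"{prefix}.{i}"))
    return lines


# Made-up conversation mirroring the start of the example above.
story = {
    "author": "BeakMaster",
    "title": "Fish Spotting Techniques",
    "children": [
        {
            "author": "CoastalFlyer",
            "text": "The dive technique works best when hunting in shallow waters.",
            "children": [
                {"author": "PouchBill", "text": "Agreed.", "children": []},
            ],
        },
        {"author": "BrownFeathers", "text": "Anyone spotted those?", "children": []},
    ],
}

print("\n\n".join(render(story)))
```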

Tags: hacker-news, plugins, projects, ai, generative-ai, llms, ai-assisted-programming, llm, anthropic, claude

llm-hacker-news 0.1

2025-04-07T23:59:49+00:00

Release: llm-hacker-news 0.1

Tags: hacker-news, llm

A professional workflow for translation using LLMs

2025-02-02T04:23:19+00:00

Tom Gally is a professional translator who has been exploring the use of LLMs since the release of GPT-4. In this Hacker News comment he shares a detailed workflow for how he uses them to assist in that process.

Tom starts with the source text and custom instructions, including context for how the translation will be used. Here's an imaginary example prompt, which starts:

The text below in Japanese is a product launch presentation for Sony's new gaming console, to be delivered by the CEO at Tokyo Game Show 2025. Please translate it into English. Your translation will be used in the official press kit and live interpretation feed. When translating this presentation, please follow these guidelines to create an accurate and engaging English version that preserves both the meaning and energy of the original: [...]

It then lists some tone, style and content guidelines custom to that text.

Tom runs that prompt through several different LLMs and starts by picking sentences and paragraphs from those that form a good basis for the translation.

As he works on the full translation he uses Claude to help brainstorm alternatives for tricky sentences:

When I am unable to think of a good English version for a particular sentence, I give the Japanese and English versions of the paragraph it is contained in to an LLM (usually, these days, Claude) and ask for ten suggestions for translations of the problematic sentence. Usually one or two of the suggestions work fine; if not, I ask for ten more. (Using an LLM as a sentence-level thesaurus on steroids is particularly wonderful.)

He uses another LLM and prompt to check his translation against the original and provide further suggestions, which he occasionally acts on. Then as a final step he runs the finished document through a text-to-speech engine to try and catch any "minor awkwardnesses" in the result.

I love this as an example of an expert using LLMs as tools to help further elevate their work. I'd love to read more examples like this one from experts in other fields.

Tags: hacker-news, translation, ai, generative-ai, llms, tom-gally

Hacker News conversation on feature flags

2025-02-02T01:18:44+00:00

I posted the following comment in a thread on Hacker News about feature flags, in response to this article It’s OK to hardcode feature flags. This kicked off a very high quality conversation on build-vs-buy and running feature flags at scale involving a bunch of very experienced and knowledgeable people. I recommend reading the comments.

Here's what I said:

The single biggest value add of feature flags is that they de-risk deployment. They make it less frightening and difficult to turn features on and off, which means you'll do it more often. This means you can build more confidently and learn faster from what you build. That's worth a lot.

I think there's a reasonable middle ground between having feature flags in a JSON file that you have to redeploy to change and using an (often expensive) feature-flags-as-a-service platform: roll your own simple system.

A relational database lookup against primary keys in a table with a dozen records is effectively free. Heck, load the entire collection at the start of each request - through a short lived cache if your profiling says that would help.

Once you start getting more complicated (flags enabled for specific users etc) you should consider build-vs-buy more seriously, but for the most basic version you really can have no-deploy-changes at minimal cost with minimal effort.

There are probably good open source libraries you can use here too, though I haven't gone looking for any in the last five years.
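As a rough sketch of the roll-your-own approach described in that comment: one small table, loaded in full, behind a short-lived cache. The table name, column names and five-second TTL are all assumptions made for illustration.

```python
import sqlite3
import time

CACHE_SECONDS = 5  # hypothetical TTL; pick a value based on profiling
_cache = {"loaded_at": None, "flags": {}}


def load_flags(db, now=None):
    # Load every flag in one query, re-reading at most once per CACHE_SECONDS.
    now = time.monotonic() if now is None else now
    if _cache["loaded_at"] is None or now - _cache["loaded_at"] > CACHE_SECONDS:
        rows = db.execute("select name, enabled from feature_flags").fetchall()
        _cache["flags"] = {name: bool(enabled) for name, enabled in rows}
        _cache["loaded_at"] = now
    return _cache["flags"]


def flag_enabled(db, name, now=None):
    # Unknown flags default to off.
    return load_flags(db, now).get(name, False)


db = sqlite3.connect(":memory:")
db.execute("create table feature_flags (name text primary key, enabled integer)")
db.execute("insert into feature_flags values ('new_checkout', 1)")
```

Flipping a flag is then a single UPDATE against that table, which takes effect once the cache expires, with no deploy needed.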

Tags: hacker-news, feature-flags

Ask HN: What happens to ".io" TLD after UK gives back the Chagos Islands?

2024-10-03T17:25:21+00:00

This morning on the BBC: UK will give sovereignty of Chagos Islands to Mauritius. The Chagos Islands include the area that the UK calls the British Indian Ocean Territory. The .io ccTLD uses the ISO-3166 two-letter country code for that designation.

As the owner of datasette.io the question of what happens to that ccTLD is suddenly very relevant to me.

This Hacker News conversation has some useful information. It sounds like there's a very real possibility that .io could be deleted after a few years notice - it's happened before, for ccTLDs such as .zr for Zaire (which was renamed the Democratic Republic of the Congo in 1997, with .zr withdrawn in 2001) and .cs for Czechoslovakia, withdrawn in 1995.

Could .io change status to the same kind of TLD as .museum, unaffiliated with any particular geography? The convention is for two letter TLDs to exactly match ISO country codes, so that may not be an option.

Tags: dns, domains, hacker-news

New improved commit messages for scrape-hacker-news-by-domain

2024-09-06T05:40:01+00:00

My simonw/scrape-hacker-news-by-domain repo has a very specific purpose. Once an hour it scrapes the Hacker News /from?site=simonwillison.net page (and the equivalent for datasette.io) using my shot-scraper tool and stashes the parsed links, scores and comment counts in JSON files in that repo.

It does this mainly so I can subscribe to GitHub's Atom feed of the commit log - visit simonw/scrape-hacker-news-by-domain/commits/main and add .atom to the URL to get that.

NetNewsWire will inform me within about an hour if any of my content has made it to Hacker News, and the repo will track the score and comment count for me over time. I wrote more about how this works in Scraping web pages from the command line with shot-scraper back in March 2022.

Prior to the latest improvement, the commit messages themselves were pretty uninformative. The message had the date, and to actually see which Hacker News post it was referring to, I had to click through to the commit and look at the diff.

I built my csv-diff tool a while back to help address this problem: it can produce a slightly more human-readable version of a diff between two CSV or JSON files, ideally suited for including in a commit message attached to a git scraping repo like this one.

I got that working, but there was still room for improvement. I recently learned that any Hacker News thread has an undocumented URL at /latest?id=x which displays the most recently added comments at the top.

I wanted that in my commit messages, so I could quickly click a link to see the most recent comments on a thread.

So... I added one more feature to csv-diff: a new --extra option lets you specify a Python format string to be used to add extra fields to the displayed difference.

My GitHub Actions workflow now runs this command:

csv-diff simonwillison-net.json simonwillison-net-new.json \
  --key id --format json \
  --extra latest 'https://news.ycombinator.com/latest?id={id}' \
  >> /tmp/commit.txt

This generates the diff between the two versions, using the id property in the JSON to tie records together. It adds a latest field linking to that URL.
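To illustrate what a Python format string does with a record's fields, here's a tiny sketch. It shows the general mechanism rather than csv-diff's internals, and the record is made up.

```python
template = "https://news.ycombinator.com/latest?id={id}"

# A made-up record standing in for one row of the scraped JSON.
row = {"id": "12345678", "title": "Example post", "points": 12}

# str.format() fills {id} from the row and ignores keys the template doesn't use.
extra = {"latest": template.format(**row)}
print(extra["latest"])  # https://news.ycombinator.com/latest?id=12345678
```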

The commits now look like this:

Tags: hacker-news, json, projects, github-actions, git-scraping, shot-scraper

Quoting dang

2024-08-12T22:04:18+00:00

We had to exclude [dead] and eventually even just [flagged] posts from the public API because many third-party clients and sites were displaying them as if they were regular posts. […]

IMO this issue is existential for HN. We've spent years and so much energy trying to find a balance between openness and human decency, a task which oscillates between barely-possible and simply-doomed, so the idea that anybody anywhere sees anything labeled "Hacker News" that pours all the toxic waste back into the ecosystem is physically painful to me.

— dang

Tags: hacker-news, moderation

Hacker News homepage with links to comments ordered by most recent first

2024-07-15T17:48:07+00:00

Conversations on Hacker News are displayed as a tree, which can make it difficult to spot new comments added since the last time you viewed the thread.

There's a workaround for this using the Hacker News Algolia Search interface: search for story:STORYID, select "comments" and the result will be a list of comments sorted by most recent first.

I got fed up of doing this manually so I built a quick tool in an Observable Notebook that documents the hack, provides a UI for pasting in a Hacker News URL to get back that search interface link and also shows the most recent items on the homepage with links to their most recently added comments.

See also my How to read Hacker News threads with most recent comments first TIL from last year.

Via Show HN

Tags: hacker-news, projects, observable

Exploring Hacker News by mapping and analyzing 40 million posts and comments for fun

2024-05-10T16:42:55+00:00

A real tour de force of data engineering. Wilson Lin fetched 40 million posts and comments from the Hacker News API (using Node.js with a custom multi-process worker pool) and then ran them all through the BGE-M3 embedding model using RunPod, which let him fire up ~150 GPU instances to get the whole run done in a few hours, using a custom RocksDB and Rust queue he built to save on Amazon SQS costs.

Then he crawled 4 million linked pages, embedded that content using the faster and cheaper jina-embeddings-v2-small-en model, ran UMAP dimensionality reduction to render a 2D map and did a whole lot of follow-on work to identify topic areas and make the map look good.

That's not even half the project - Wilson built several interactive features on top of the resulting data, and experimented with custom rendering techniques on top of canvas to get everything to render quickly.

There's so much in here, and both the code and data (multiple GBs of arrow files) are available if you want to dig in and try some of this out for yourself.

In the Hacker News comments Wilson shares that the total cost of the project was a couple of hundred dollars.

One tiny detail I particularly enjoyed - unrelated to the embeddings - was this trick for testing which edge location is closest to a user using JavaScript:

const edge = await Promise.race(
  EDGES.map(async (edge) => {
    // Run a few times to avoid potential cold start biases.
    for (let i = 0; i < 3; i++) {
      await fetch(`https://${edge}.edge-hndr.wilsonl.in/healthz`);
    }
    return edge;
  }),
);

Via Show HN

Tags: hacker-news, embeddings, jina

Everything Google's Python team were responsible for

2024-04-27T18:52:32+00:00

In a questionable strategic move, Google laid off the majority of their internal Python team a few days ago. Someone on Hacker News asked what the team had been responsible for, and team member zem replied with this fascinating comment providing detailed insight into how the team worked and indirectly how Python is used within Google.

Tags: google, hacker-news, python

Quoting wkirby on Hacker News

2024-04-16T19:49:16+00:00

Permissions have three moving parts, who wants to do it, what do they want to do, and on what object. Any good permission system has to be able to efficiently answer any permutation of those variables. Given this person and this object, what can they do? Given this object and this action, who can do it? Given this person and this action, which objects can they act upon?

— wkirby on Hacker News
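To make those three questions concrete, here is a minimal sketch of my own, not anything from the comment. The Permissions class and its method names are hypothetical, and it scans a flat list of grants, so it answers the questions correctly but not efficiently. A real system would index the grants.

```javascript
// Hypothetical in-memory store: each grant is a (who, action, object) triple.
class Permissions {
  constructor() {
    this.grants = [];
  }

  allow(who, action, object) {
    this.grants.push({who, action, object});
  }

  // Given this person and this object, what can they do?
  actionsFor(who, object) {
    return this._select('action', {who, object});
  }

  // Given this object and this action, who can do it?
  actorsFor(object, action) {
    return this._select('who', {object, action});
  }

  // Given this person and this action, which objects can they act upon?
  objectsFor(who, action) {
    return this._select('object', {who, action});
  }

  // Fix two of the three variables and collect distinct values of the third.
  _select(field, fixed) {
    const matches = this.grants.filter(grant =>
      Object.entries(fixed).every(([key, value]) => grant[key] === value)
    );
    return [...new Set(matches.map(grant => grant[field]))];
  }
}

const perms = new Permissions();
perms.allow('alice', 'edit', 'doc1');
perms.allow('alice', 'view', 'doc1');
perms.allow('bob', 'view', 'doc1');
console.log(perms.actionsFor('alice', 'doc1')); // [ 'edit', 'view' ]
```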

Tags: hacker-news, permissions

Quoting dang

2024-02-19T15:57:50+00:00

Spam, and its cousins like content marketing, could kill HN if it became orders of magnitude greater—but from my perspective, it isn't the hardest problem on HN. [...]

By far the harder problem, from my perspective, is low-quality comments, and I don't mean by bad actors—the community is pretty good about flagging and reporting those; I mean lame and/or mean comments by otherwise good users who don't intend to and don't realize they're doing that.

— dang

Tags: hacker-news, moderation, social-software, spam

Analytics: Hacker News v.s. a tweet from Elon Musk

2023-02-17T22:11:44+00:00

My post Bing: “I will not harm you unless you harm me first” really took off.

It sat at the top of Hacker News for a full day, and is currently the 18th most popular post of all time on that site.

And then this happened:

Might need a bit more polish …https://t.co/rGYCxoBVeA
- Elon Musk (@elonmusk) February 15, 2023

Given recent changes made to the Twitter algorithm, a lot of people saw that. Twitter currently reports 30.4M views of that tweet.

A bunch of people asked me how much of that converted into page views. So let's dive in!

Headline figures

Here's my Plausible dashboard for that post over the past few days:

Overall numbers: 959k unique visitors, 1.1M page views.

Top sources of traffic:

Twitter: 721k
Direct / None: 132k (this includes traffic from Mastodon)
Hacker News: 49.5k
Facebook: 13.4k
Reddit: 8.3k
Google: 7.8k
tldrnewsletter: 6k
LinkedIn: 5.4k

If we assume the vast majority of the Twitter traffic was from Elon (which seems reasonable) that's 721k / 30.4M = roughly a 2.37% click through rate.

Notable that sticking at the top of Hacker News for a day really does drive an enormous amount of traffic - roughly 7% of the traffic you get from the second most followed account on Twitter (looks like Barack Obama is still number one).
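Here is the arithmetic behind those two ratios as a quick sketch. It uses only the figures quoted above, and the variable and helper names are made up for illustration.

```javascript
// Figures quoted above: tweet views from Twitter, visitors from Plausible.
const tweetViews = 30.4e6;
const twitterVisitors = 721e3;
const hackerNewsVisitors = 49.5e3;

// Express part/whole as a percentage string with two decimal places.
const percent = (part, whole) => `${((100 * part) / whole).toFixed(2)}%`;

// Click-through rate from the tweet
console.log(percent(twitterVisitors, tweetViews)); // 2.37%
// Hacker News traffic relative to Twitter traffic
console.log(percent(hackerNewsVisitors, twitterVisitors)); // 6.87%
```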

More detailed analytics via Plausible and Cloudflare

I mainly use Plausible for my site's analytics. I really like them: they're privacy-focused, open source (though I use their hosted version) and show me exactly the subset of data I want to see. Most importantly, they don't set cookies.

My site also runs behind Cloudflare, which also provides analytics. I don't pay for the upgraded analytics, but it turns out you can still get some pretty detailed numbers out of them - especially if you're willing to dig around in the browser DevTools.

Plausible offers an "export" button, so I used that... and got a zip file with a bunch of CSVs in it. Here they are in a GitHub repo.

Cloudflare - at least for the free tier - doesn't have a detailed export. But... under the hood the Cloudflare web application uses their GraphQL API to retrieve stats for display, and with a bit of digging you can get numbers out that way.

I extracted this 3.2MB JSON file using the Cloudflare API.

Loading it into Datasette

I wrote this script to load the data I had extracted into SQLite database files, and then deployed them to Vercel using Datasette.

You can explore the result here: https://i-will-not-harm-you-unless-you-harm-me-first.vercel.app/

Here's page views according to Plausible over the time period in question:

It looks to me like the timezone for that data is Pacific Time.

This page shows page views count according to Cloudflare, by hour.

This data is in UTC, where 7pm UTC corresponds to 11am Pacific.
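To line the two charts up, the UTC hour can be converted with the built-in Intl.DateTimeFormat. This is a small sketch with a hypothetical helper name, hourIn. The date is the day of the tweet, and it assumes a JavaScript runtime that ships time zone data (current browsers and Node.js do).

```javascript
// Return the hour of day (0-23) for a UTC timestamp as seen in an IANA time zone.
function hourIn(isoUtc, timeZone) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: 'numeric',
    hourCycle: 'h23',
  }).formatToParts(new Date(isoUtc));
  return Number(parts.find(part => part.type === 'hour').value);
}

console.log(hourIn('2023-02-15T19:00:00Z', 'America/Los_Angeles')); // 11
```

The eight hour offset only holds outside daylight saving time. In summer the same UTC hour would be noon Pacific.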

These numbers should differ, because Plausible uses JavaScript to track analytics while Cloudflare is server-side, plus Plausible is filtered to just hits to the specific page while Cloudflare is showing all hits to any page on my site.

There are plenty more ways to slice and dice the data in Datasette:

Unique visitors over time according to Plausible
Uniques over time according to Cloudflare
Full data for those traffic sources from Plausible
Plausible device breakdown - 778,678 mobile, 101,216 desktop, 47,781 laptop (not sure how it distinguishes between desktop and laptop though), 16,967 tablet.
Percentage of cached requests over time according to Cloudflare using a custom SQL query - this was around 40% before the Elon tweet, then jumped up to over 90% and stayed there, thankfully!
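The cached percentage is cached requests divided by total requests for each hour. I did that calculation in SQL. Here is the same idea as a JavaScript sketch, where the field names and the sample rows are invented for illustration and are not real values from the export.

```javascript
// Hypothetical hourly rows: field names and numbers are illustrative only.
const hourlyRows = [
  {hour: '2023-02-15T17:00:00Z', requests: 1000, cachedRequests: 400},
  {hour: '2023-02-15T20:00:00Z', requests: 50000, cachedRequests: 46000},
];

// Percentage of requests served from cache, rounded to one decimal place.
function cachedPercentages(rows) {
  return rows.map(row => ({
    hour: row.hour,
    cachedPercent: row.requests
      ? Math.round((1000 * row.cachedRequests) / row.requests) / 10
      : null, // avoid dividing by zero for hours with no traffic
  }));
}

console.log(cachedPercentages(hourlyRows));
```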

I've long been a fan of full-page HTTP caching as protection against surprise traffic events - it's a pattern I've implemented in the past using Varnish and Fastly, and I've been using it on my blog via Cloudflare for several years.

It definitely paid off this time!

Tags: analytics, bing, hacker-news, twitter, datasette, cloudflare

Scraping web pages from the command line with shot-scraper

2022-03-14T01:29:56+00:00

I've added a powerful new capability to my shot-scraper command line browser automation tool: you can now use it to load a web page in a headless browser, execute JavaScript to extract information and return that information back to the terminal as JSON.

Among other things, this means you can construct Unix pipelines that incorporate a full headless web browser as part of their processing.

It's also a really neat web scraping tool.

shot-scraper

I introduced shot-scraper last Thursday. It's a Python utility that wraps Playwright, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages.

% pip install shot-scraper
% shot-scraper https://simonwillison.net/ --height 800
Screenshot of 'https://simonwillison.net/' written to 'simonwillison-net.png'

Since Thursday shot-scraper has had a flurry of releases, adding features like PDF exports, the ability to dump the Chromium accessibility tree and the ability to take screenshots of authenticated web pages. But the most exciting new feature landed today.

Executing JavaScript and returning the result

Release 0.9 takes the tool in a new direction. The following command will execute JavaScript on the page and return the resulting value:

% shot-scraper javascript simonwillison.net document.title
"Simon Willison\u2019s Weblog"

Or you can return a JSON object:

% shot-scraper javascript https://datasette.io/ "({
  title: document.title,
  tagline: document.querySelector('.tagline').innerText
})"
{
  "title": "Datasette: An open source multi-tool for exploring and publishing data",
  "tagline": "An open source multi-tool for exploring and publishing data"
}

Or if you want to use functions like setTimeout() - for example, if you want to insert a delay to allow an animation to finish before running the rest of your code - you can return a promise:

% shot-scraper javascript datasette.io "
new Promise(done => setTimeout(
  () => {
    done({
      title: document.title,
      tagline: document.querySelector('.tagline').innerText
    });
  }, 1000
));"

Errors that occur in the JavaScript turn into an exit code of 1 returned by the tool - which means you can also use this to execute simple tests in a CI flow. This example will fail a GitHub Actions workflow if the extracted page title is not the expected value:

- name: Test page title
  run: |-
    shot-scraper javascript datasette.io "
      if (document.title != 'Datasette') {
        throw 'Wrong title detected';
      }"

Using this to scrape a web page

The most exciting use case for this new feature is web scraping. I'll illustrate that with an example.

Posts from my blog occasionally show up on Hacker News - sometimes I spot them, sometimes I don't.

https://news.ycombinator.com/from?site=simonwillison.net is a Hacker News page showing content from the specified domain. It's really useful, but it sadly isn't included in the official Hacker News API.

So... let's write a scraper for it.

I started out running the Firefox developer console against that page, trying to figure out the right JavaScript to extract the data I was interested in. I came up with this:

Array.from(document.querySelectorAll('.athing'), el => {
  const title = el.querySelector('.titleline a').innerText;
  const points = parseInt(el.nextSibling.querySelector('.score').innerText);
  const url = el.querySelector('.titleline a').href;
  const dt = el.nextSibling.querySelector('.age').title;
  const submitter = el.nextSibling.querySelector('.hnuser').innerText;
  const commentsUrl = el.nextSibling.querySelector('.age a').href;
  const id = commentsUrl.split('?id=')[1];
  // Only posts with comments have a comments link
  const commentsLink = Array.from(
    el.nextSibling.querySelectorAll('a')
  ).filter(el => el && el.innerText.includes('comment'))[0];
  let numComments = 0;
  if (commentsLink) {
    numComments = parseInt(commentsLink.innerText.split()[0]);
  }
  return {id, title, url, dt, points, submitter, commentsUrl, numComments};
})

The great thing about modern JavaScript is that everything you could need to write a scraper is already there in the default environment.

I'm using document.querySelectorAll('.athing') to select every element that matches that selector.

I pass the result to Array.from(...) along with a mapping function, which converts the NodeList to an array and transforms each element in one step (equivalent to calling .map() on the converted array). For each element I extract the details that I need.
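Here is Array.from() with a mapping function outside the browser. It is a minimal sketch that uses a plain array-like object as a stand-in for the NodeList that querySelectorAll() returns.

```javascript
// An array-like object standing in for a NodeList (it has length and indexes, but no .map()).
const arrayLike = {length: 3, 0: 'a', 1: 'b', 2: 'c'};

// Array.from(items, fn) converts and maps in a single step...
const direct = Array.from(arrayLike, item => item.toUpperCase());

// ...which gives the same result as converting first and then calling .map().
const twoStep = Array.from(arrayLike).map(item => item.toUpperCase());

console.log(direct); // [ 'A', 'B', 'C' ]
```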

The resulting array contains 30 items that look like this:

[
  {
    "id": "30658310",
    "title": "Track changes to CLI tools by recording their help output",
    "url": "https://simonwillison.net/2022/Feb/2/help-scraping/",
    "dt": "2022-03-13T05:36:13",
    "submitter": "appwiz",
    "commentsUrl": "https://news.ycombinator.com/item?id=30658310",
    "numComments": 19
  }
]

Running it with shot-scraper

Now that I have a recipe for a scraper, I can run it in the terminal like this:

shot-scraper javascript 'https://news.ycombinator.com/from?site=simonwillison.net' "
Array.from(document.querySelectorAll('.athing'), el => {
  const title = el.querySelector('.titleline a').innerText;
  const points = parseInt(el.nextSibling.querySelector('.score').innerText);
  const url = el.querySelector('.titleline a').href;
  const dt = el.nextSibling.querySelector('.age').title;
  const submitter = el.nextSibling.querySelector('.hnuser').innerText;
  const commentsUrl = el.nextSibling.querySelector('.age a').href;
  const id = commentsUrl.split('?id=')[1];
  // Only posts with comments have a comments link
  const commentsLink = Array.from(
    el.nextSibling.querySelectorAll('a')
  ).filter(el => el && el.innerText.includes('comment'))[0];
  let numComments = 0;
  if (commentsLink) {
    numComments = parseInt(commentsLink.innerText.split()[0]);
  }
  return {id, title, url, dt, points, submitter, commentsUrl, numComments};
})" > simonwillison-net.json

simonwillison-net.json is now a JSON file containing the scraped data.

Running the scraper in GitHub Actions

I want to keep track of changes to this data structure over time. My preferred technique for that is something I call Git scraping - the core idea is to keep the data in a Git repository and commit an update any time it updates. This provides a cheap and robust history of changes over time.

Running the scraper in GitHub Actions means I don't need to administrate my own server to keep this running.

So I built exactly that, in the simonw/scrape-hacker-news-by-domain repository.

The GitHub Actions workflow is in .github/workflows/scrape.yml. It runs the above command once an hour, then pushes a commit back to the repository should the file have any changes since last time it ran.

The commit history of simonwillison-net.json will show me any time a new link from my site appears on Hacker News, or a comment is added.

(Fun GitHub trick: add .atom to the end of that URL to get an Atom feed of those commits.)
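Git's own diff is all the workflow needs, but the comparison each commit captures can be sketched in a few lines. The function name diffScrapes is hypothetical, and it assumes both inputs are arrays shaped like the scraped JSON shown earlier.

```javascript
// Compare two scrapes and report stories that are new or whose
// comment count changed between the previous and current run.
function diffScrapes(previous, current) {
  const before = new Map(previous.map(item => [item.id, item]));
  const added = [];
  const commentChanges = [];
  for (const item of current) {
    const old = before.get(item.id);
    if (!old) {
      added.push(item.id);
    } else if (old.numComments !== item.numComments) {
      commentChanges.push({id: item.id, from: old.numComments, to: item.numComments});
    }
  }
  return {added, commentChanges};
}

console.log(diffScrapes(
  [{id: '30658310', numComments: 19}],
  [{id: '30658310', numComments: 21}]
));
```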

The whole scraper, from idea to finished implementation, took less than fifteen minutes to build and deploy.

I can see myself using this technique a lot in the future.

Tags: cli, github, hacker-news, scraping, github-actions, git-scraping, shot-scraper

Launch HN Instructions

2021-07-19T01:05:37+00:00

The instructions for YC companies that are posting their launch announcement on Hacker News are really interesting to read. “As founders, you’re used to talking to users, customers, and investors. HN readers are not any of those—what they are is peers, and using any of those styles with peers feels clueless and entitled. [...] To interest HN, write in a factual, personal, and modest way about what problem you solve, why it matters, how you solve it, and how you got there.”

Via dang

Tags: hacker-news, marketing, y-combinator