<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: internet-archive</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/internet-archive.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2026-03-08T14:59:48+00:00</updated><author><name>Simon Willison</name></author><entry><title>Quoting Joseph Weizenbaum</title><link href="https://simonwillison.net/2026/Mar/8/joseph-weizenbaum/#atom-tag" rel="alternate"/><published>2026-03-08T14:59:48+00:00</published><updated>2026-03-08T14:59:48+00:00</updated><id>https://simonwillison.net/2026/Mar/8/joseph-weizenbaum/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://archive.org/details/computerpowerhum0000weiz_v0i3?q=realized"&gt;&lt;p&gt;What I had not realized is that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://archive.org/details/computerpowerhum0000weiz_v0i3?q=realized"&gt;Joseph Weizenbaum&lt;/a&gt;, creator of ELIZA, in 1976 (&lt;a href="https://www.tiktok.com/@professorcasey/video/7614890527711825183"&gt;via&lt;/a&gt;)&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/computer-history"&gt;computer-history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;



</summary><category term="computer-history"/><category term="internet-archive"/><category term="ai"/><category term="ai-ethics"/></entry><entry><title>Spotlighting The World Factbook as We Bid a Fond Farewell</title><link href="https://simonwillison.net/2026/Feb/5/the-world-factbook/#atom-tag" rel="alternate"/><published>2026-02-05T00:23:38+00:00</published><updated>2026-02-05T00:23:38+00:00</updated><id>https://simonwillison.net/2026/Feb/5/the-world-factbook/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.cia.gov/stories/story/spotlighting-the-world-factbook-as-we-bid-a-fond-farewell/"&gt;Spotlighting The World Factbook as We Bid a Fond Farewell&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Somewhat devastating news today from the CIA:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One of CIA’s oldest and most recognizable intelligence publications, The World Factbook, has sunset.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There's not even a hint as to &lt;em&gt;why&lt;/em&gt; they decided to stop maintaining this publication, which has been their most useful public-facing initiative since 1971 and a cornerstone of the public internet since 1997.&lt;/p&gt;
&lt;p&gt;In a bizarre act of cultural vandalism they've not just removed the entire site (including the archives of previous versions) but they've also set every single page to be a 302 redirect to their closure announcement.&lt;/p&gt;
&lt;p&gt;The Factbook has been in the public domain since the start. There's no reason not to continue to serve archived versions - a banner at the top of the page saying it's no longer maintained would be much better than removing all of that valuable content entirely.&lt;/p&gt;
&lt;p&gt;Up until 2020 the CIA published annual zip file archives of the entire site. Those are available (along with the rest of the Factbook) &lt;a href="https://web.archive.org/web/20260203124934/https://www.cia.gov/the-world-factbook/about/archives/"&gt;on the Internet Archive&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I downloaded the 384MB &lt;code&gt;.zip&lt;/code&gt; file for the year 2020 and extracted it into a new GitHub repository, &lt;a href="https://github.com/simonw/cia-world-factbook-2020/"&gt;simonw/cia-world-factbook-2020&lt;/a&gt;. I've enabled GitHub Pages for that repository so you can browse the archived copy at &lt;a href="https://simonw.github.io/cia-world-factbook-2020"&gt;simonw.github.io/cia-world-factbook-2020/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of the CIA World Factbook website homepage. Header reads &amp;quot;THE WORLD FACTBOOK&amp;quot; with a dropdown labeled &amp;quot;Please select a country to view.&amp;quot; Navigation tabs: ABOUT, REFERENCES, APPENDICES, FAQs. Section heading &amp;quot;WELCOME TO THE WORLD FACTBOOK&amp;quot; followed by descriptive text: &amp;quot;The World Factbook provides information on the history, people and society, government, economy, energy, geography, communications, transportation, military, and transnational issues for 267 world entities. The Reference tab includes: a variety of world, regional, country, ocean, and time zone maps; Flags of the World; and a Country Comparison function that ranks the country information and data in more than 75 Factbook fields.&amp;quot; A satellite image of Earth is displayed on the right. Below it: &amp;quot;WHAT'S NEW :: Today is: Wednesday, February 4.&amp;quot; Left sidebar links with icons: WORLD TRAVEL FACTS, ONE-PAGE COUNTRY SUMMARIES, REGIONAL AND WORLD MAPS, FLAGS OF THE WORLD, GUIDE TO COUNTRY COMPARISONS. Right side shows news updates dated December 17, 2020 about Electricity access and new Economy fields, and December 10, 2020 about Nepal and China agreeing on the height of Mount Everest at 8,848.86 meters. A &amp;quot;VIEW ALL UPDATES&amp;quot; button appears at the bottom." src="https://static.simonwillison.net/static/2025/factbook-2020.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;Here's a neat example of the editorial voice of the Factbook from the &lt;a href="https://simonw.github.io/cia-world-factbook-2020/docs/whatsnew.html"&gt;What's New page&lt;/a&gt;, dated December 10th 2020:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Years of wrangling were brought to a close this week when officials from Nepal and China announced that they have agreed on the height of Mount Everest. The mountain sits on the border between Nepal and Tibet (in western China), and its height changed slightly following an earthquake in 2015. The new height of 8,848.86 meters is just under a meter higher than the old figure of 8,848 meters. &lt;em&gt;The World Factbook&lt;/em&gt; rounds the new measurement to 8,849 meters and this new height has been entered throughout the &lt;em&gt;Factbook&lt;/em&gt; database.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=46891794"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cia"&gt;cia&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;&lt;/p&gt;



</summary><category term="cia"/><category term="github"/><category term="internet-archive"/></entry><entry><title>Reddit will block the Internet Archive</title><link href="https://simonwillison.net/2025/Aug/11/reddit-will-block-the-internet-archive/#atom-tag" rel="alternate"/><published>2025-08-11T18:11:49+00:00</published><updated>2025-08-11T18:11:49+00:00</updated><id>https://simonwillison.net/2025/Aug/11/reddit-will-block-the-internet-archive/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.theverge.com/news/757538/reddit-internet-archive-wayback-machine-block-limit"&gt;Reddit will block the Internet Archive&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Well this &lt;em&gt;sucks&lt;/em&gt;. Jay Peters for The Verge:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.&lt;/p&gt;
&lt;/blockquote&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/reddit"&gt;reddit&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/scraping"&gt;scraping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;



</summary><category term="internet-archive"/><category term="reddit"/><category term="scraping"/><category term="ai"/><category term="training-data"/><category term="ai-ethics"/></entry><entry><title>TextSynth Server</title><link href="https://simonwillison.net/2024/Nov/21/textsynth-server/#atom-tag" rel="alternate"/><published>2024-11-21T05:16:55+00:00</published><updated>2024-11-21T05:16:55+00:00</updated><id>https://simonwillison.net/2024/Nov/21/textsynth-server/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://bellard.org/ts_server/"&gt;TextSynth Server&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I'd missed this: Fabrice Bellard (yes, &lt;a href="https://en.wikipedia.org/wiki/Fabrice_Bellard"&gt;&lt;em&gt;that&lt;/em&gt; Fabrice Bellard&lt;/a&gt;) has a project called TextSynth Server which he describes like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ts_server&lt;/strong&gt; is a web server proposing a REST API to large language models. They can be used for example for text completion, question answering, classification, chat, translation, image generation, ...&lt;/p&gt;
&lt;p&gt;It has the following characteristics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All is included in a single binary. Very few external dependencies (Python is not needed) so installation is easy.&lt;/li&gt;
&lt;li&gt;Supports many Transformer variants (&lt;a href="https://github.com/kingoflolz/mesh-transformer-jax"&gt;GPT-J&lt;/a&gt;, &lt;a href="https://github.com/EleutherAI/gpt-neox"&gt;GPT-NeoX&lt;/a&gt;, &lt;a href="https://github.com/EleutherAI/gpt-neo"&gt;GPT-Neo&lt;/a&gt;, &lt;a href="https://github.com/facebookresearch/metaseq"&gt;OPT&lt;/a&gt;, &lt;a href="https://github.com/pytorch/fairseq/tree/main/examples/moe_lm"&gt;Fairseq GPT&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2010.11125"&gt;M2M100&lt;/a&gt;, &lt;a href="https://github.com/salesforce/CodeGen"&gt;CodeGen&lt;/a&gt;, &lt;a href="https://github.com/openai/gpt-2"&gt;GPT2&lt;/a&gt;, &lt;a href="https://arxiv.org/abs/2210.11416"&gt;T5&lt;/a&gt;, &lt;a href="https://github.com/BlinkDL/RWKV-LM"&gt;RWKV&lt;/a&gt;, &lt;a href="https://github.com/facebookresearch/llama"&gt;LLAMA&lt;/a&gt;, &lt;a href="https://falconllm.tii.ae/"&gt;Falcon&lt;/a&gt;, &lt;a href="https://github.com/mosaicml/llm-foundry"&gt;MPT&lt;/a&gt;, Llama 3.2, Mistral, Mixtral, Qwen2, Phi3, Whisper) and &lt;a href="https://github.com/CompVis/stable-diffusion"&gt;Stable Diffusion&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;[...]&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Unlike many of his other notable projects (such as FFmpeg, QEMU and QuickJS) this isn't open source. In fact it's not even source available: instead you can download compiled binaries for Linux or Windows, which are available for non-commercial use only.&lt;/p&gt;
&lt;p&gt;Commercial terms are available, or you can visit &lt;a href="https://textsynth.com/"&gt;textsynth.com&lt;/a&gt; and pre-pay for API credits which can then be used with the hosted REST API there.&lt;/p&gt;
&lt;p&gt;This is not a new project: the earliest evidence I could find of it was &lt;a href="https://web.archive.org/web/20190704131718/http://textsynth.org/tech.html"&gt;this July 2019 page&lt;/a&gt; in the Internet Archive, which said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Text Synth is build using the &lt;a href="https://openai.com/blog/better-language-models/"&gt;GPT-2 language model&lt;/a&gt; released by OpenAI. [...] This implementation is original because instead of using a GPU, it runs using only 4 cores of a Xeon E5-2640 v3 CPU at 2.60GHz. With a single user, it generates 40 words per second. It is programmed in plain C using the &lt;a href="https://bellard.org/nncp/"&gt;LibNC library&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://registerspill.thorstenball.com/p/they-all-use-it"&gt;They all use it - Thorsten Ball&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-2"&gt;gpt-2&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fabrice-bellard"&gt;fabrice-bellard&lt;/a&gt;&lt;/p&gt;



</summary><category term="internet-archive"/><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="gpt-2"/><category term="fabrice-bellard"/></entry><entry><title>Wayback Machine: Models - Anthropic (8th October 2024)</title><link href="https://simonwillison.net/2024/Oct/22/opus/#atom-tag" rel="alternate"/><published>2024-10-22T22:42:17+00:00</published><updated>2024-10-22T22:42:17+00:00</updated><id>https://simonwillison.net/2024/Oct/22/opus/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://web.archive.org/web/20241008222204/https://docs.anthropic.com/en/docs/about-claude/models"&gt;Wayback Machine: Models - Anthropic (8th October 2024)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Internet Archive is only &lt;a href="https://blog.archive.org/2024/10/21/internet-archive-services-update-2024-10-21/"&gt;intermittently available&lt;/a&gt; at the moment, but the Wayback Machine just came back long enough for me to confirm that the &lt;a href="https://docs.anthropic.com/en/docs/about-claude/models"&gt;Anthropic Models&lt;/a&gt; documentation page listed Claude 3.5 Opus as coming “Later this year” at least as recently as the 8th of October, but today makes no mention of that model at all.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;October 8th 2024&lt;/strong&gt;&lt;/p&gt;
&lt;div style="text-align: center; margin-bottom: 1em"&gt;&lt;a style="border-bottom: none" href="https://static.simonwillison.net/static/2024/anthropic-models-8-oct-2024.png"&gt;&lt;img alt="Internet Archive capture of the Claude models page - shows both Claude 3.5 Haiku and Claude 3.5 Opus as Later this year" src="https://static.simonwillison.net/static/2024/anthropic-models-8-oct-2024-thumb2.png" width="500"&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;October 22nd 2024&lt;/strong&gt;&lt;/p&gt;
&lt;div style="text-align: center; margin-bottom: 1em"&gt;&lt;a style="border-bottom: none" href="https://static.simonwillison.net/static/2024/anthropic-models-22-oct-2024.png"&gt;&lt;img alt="That same page today shows Claude 3.5 Haiku as later this year but no longer mentions Claude 3.5 Opus at all" src="https://static.simonwillison.net/static/2024/anthropic-models-22-oct-2024-thumb2.png" width="500"&gt;&lt;/a&gt;&lt;/div&gt;

&lt;p&gt;Claude 3 came in three flavors: Haiku (fast and cheap), Sonnet (mid-range) and Opus (best). We were expecting 3.5 to have the same three levels, and both 3.5 Haiku and 3.5 Sonnet fitted those expectations, matching their prices to the Claude 3 equivalents.&lt;/p&gt;
&lt;p&gt;It looks like 3.5 Opus may have been entirely cancelled, or at least delayed for an unpredictable amount of time. I guess that means &lt;a href="https://simonwillison.net/2024/Oct/22/computer-use/#bad-names"&gt;the new 3.5 Sonnet&lt;/a&gt; will be Anthropic's best overall model for a while, maybe until Claude 4.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/anthropic"&gt;anthropic&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;&lt;/p&gt;



</summary><category term="internet-archive"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="anthropic"/><category term="claude"/></entry><entry><title>Today's research challenge: why is August 1st "World Wide Web Day"?</title><link href="https://simonwillison.net/2024/Aug/1/august-1st-world-wide-web-day/#atom-tag" rel="alternate"/><published>2024-08-01T17:34:29+00:00</published><updated>2024-08-01T17:34:29+00:00</updated><id>https://simonwillison.net/2024/Aug/1/august-1st-world-wide-web-day/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://fedi.simonwillison.net/@simon/112887537705995720"&gt;Today&amp;#x27;s research challenge: why is August 1st &amp;quot;World Wide Web Day&amp;quot;?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here's a fun mystery. A bunch of publications will tell you that today, August 1st, is "World Wide Web Day"... but where did that idea come from?&lt;/p&gt;
&lt;p&gt;It's not an official day marked by any national or international organization. It's not celebrated by CERN or the W3C.&lt;/p&gt;
&lt;p&gt;The date August 1st doesn't appear to hold any specific significance in the history of the web. The first website &lt;a href="https://www.npr.org/2021/08/06/1025554426/a-look-back-at-the-very-first-website-ever-launched-30-years-later"&gt;was launched on August 6th 1991&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I posed the following three questions this morning on Mastodon:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Who first decided that August 1st should be "World Wide Web Day"?&lt;/li&gt;
&lt;li&gt;Why did they pick that date?&lt;/li&gt;
&lt;li&gt;When was the first World Wide Web Day celebrated?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Finding answers to these questions has proven stubbornly difficult. Searches on Google have proven futile, and illustrate the growing impact of LLM-generated slop on the web: they turn up dozens of articles celebrating the day, many from news publications playing the "write about what people might search for" game and many others that have distinctive ChatGPT vibes to them.&lt;/p&gt;
&lt;p&gt;One early hint we've found is in the "Bylines 2010 Writer's Desk Calendar" by Snowflake Press, published in January 2009. Jessamyn West &lt;a href="https://glammr.us/@jessamyn/112887883859701567"&gt;spotted that&lt;/a&gt; on the &lt;a href="https://archive.org/details/isbn_9781933509068/mode/2up?q=%22World+Wide+Web+Day%22"&gt;book's page in the Internet Archive&lt;/a&gt;, but it merely lists "World Wide Web Day" at the bottom of the July calendar page (clearly a printing mistake, the heading is meant to align with August 1st on the next page) without any hint as to the origin:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Screenshot of a section of the calendar showing July 30 (Friday) and 31st (Saturday) - at the very bottom of the Saturday block is the text World Wide Web Day" src="https://static.simonwillison.net/static/2024/www-day-calendar.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;I found two earlier mentions from August 1st 2008 on Twitter, from &lt;a href="https://twitter.com/GabeMcCauley/status/874683727"&gt;@GabeMcCauley&lt;/a&gt; and from &lt;a href="https://twitter.com/iJess/status/874964457"&gt;@iJess&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Our earliest news media reference, spotted &lt;a href="https://mastodon.social/@hugovk/112888079773787541"&gt;by Hugo van Kemenade&lt;/a&gt;, is also from August 1st 2008: &lt;a href="https://www.thesunchronicle.com/opinion/unseen-eclipse-opens-summer-countdown/article_7ee3234d-f1e2-54c6-a688-a29bd542e3e3.html"&gt;this opinion piece in the Attleboro Massachusetts Sun Chronicle&lt;/a&gt;, which has no byline so presumably was written by the paper's editorial board:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Today is World Wide Web Day, but who cares? We'd rather nap than surf. How about you? Better relax while you can: August presages the start of school, a new season of public meetings, worries about fuel costs, the rundown to the presidential election and local races.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So the mystery remains! Who decided that August 1st should be "World Wide Web Day", why that date and how did it spread so widely without leaving a clear origin story?&lt;/p&gt;
&lt;p&gt;If your research skills are up to it, &lt;a href="https://fedi.simonwillison.net/@simon/112887537705995720"&gt;join the challenge&lt;/a&gt;!&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/history"&gt;history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/w3c"&gt;w3c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web"&gt;web&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mastodon"&gt;mastodon&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slop"&gt;slop&lt;/a&gt;&lt;/p&gt;



</summary><category term="history"/><category term="internet-archive"/><category term="w3c"/><category term="web"/><category term="mastodon"/><category term="slop"/></entry><entry><title>Quoting quora.com/robots.txt</title><link href="https://simonwillison.net/2024/Mar/19/quora-robots/#atom-tag" rel="alternate"/><published>2024-03-19T23:09:31+00:00</published><updated>2024-03-19T23:09:31+00:00</updated><id>https://simonwillison.net/2024/Mar/19/quora-robots/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://www.quora.com/robots.txt"&gt;&lt;p&gt;People share a lot of sensitive material on Quora - controversial political views, workplace gossip and compensation, and negative opinions held of companies. Over many years, as they change jobs or change their views, it is important that they can delete or anonymize their previously-written answers.&lt;/p&gt;
&lt;p&gt;We opt out of the wayback machine because inclusion would allow people to discover the identity of authors who had written sensitive answers publicly and later had made them anonymous, and because it would prevent authors from being able to remove their content from the internet if they change their mind about publishing it.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://www.quora.com/robots.txt"&gt;quora.com/robots.txt&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;



</summary><category term="internet-archive"/><category term="robots-txt"/><category term="quora"/></entry><entry><title>Internet Archive Software Library: Flash</title><link href="https://simonwillison.net/2020/Nov/19/internet-archive-flash/#atom-tag" rel="alternate"/><published>2020-11-19T21:19:51+00:00</published><updated>2020-11-19T21:19:51+00:00</updated><id>https://simonwillison.net/2020/Nov/19/internet-archive-flash/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://archive.org/details/softwarelibrary_flash"&gt;Internet Archive Software Library: Flash&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A fantastic new initiative from the Internet Archive: they’re now archiving Flash (.swf) files and serving them for modern browsers using Ruffle, a Flash Player emulator written in Rust and compiled to WebAssembly. They are fully interactive and audio works too. Considering the enormous quantity of creative material released in Flash over the decades this helps fill a big hole in the Internet’s cultural memory.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/textfiles/status/1329525300846276608"&gt;Jason Scott&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/flash"&gt;flash&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jason-scott"&gt;jason-scott&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rust"&gt;rust&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webassembly"&gt;webassembly&lt;/a&gt;&lt;/p&gt;



</summary><category term="flash"/><category term="internet-archive"/><category term="jason-scott"/><category term="rust"/><category term="webassembly"/></entry><entry><title>Usage of ARIA attributes via HTTP Archive</title><link href="https://simonwillison.net/2018/Jul/12/usage-aria-attributes-http-archive/#atom-tag" rel="alternate"/><published>2018-07-12T03:16:26+00:00</published><updated>2018-07-12T03:16:26+00:00</updated><id>https://simonwillison.net/2018/Jul/12/usage-aria-attributes-http-archive/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://discuss.httparchive.org/t/usage-of-aria-attributes/778"&gt;Usage of ARIA attributes via HTTP Archive&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A neat example of a Google BigQuery query you can run against the HTTP Archive public dataset (a crawl of the “top” websites run periodically by the Internet Archive, which captures the full details of every resource fetched) to see which ARIA attributes are used the most often. Linking to this because I used it successfully today as the basis for my own custom query—I love that it’s possible to analyze a huge representative sample of the modern web in this way.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/aria"&gt;aria&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/big-data"&gt;big-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="aria"/><category term="http"/><category term="internet-archive"/><category term="big-data"/></entry><entry><title>Elaborate Halloween Costume Tips from a 19th-Century Guide to Fancy Dress</title><link href="https://simonwillison.net/2017/Oct/26/gilded-age/#atom-tag" rel="alternate"/><published>2017-10-26T14:01:40+00:00</published><updated>2017-10-26T14:01:40+00:00</updated><id>https://simonwillison.net/2017/Oct/26/gilded-age/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://hyperallergic.com/406549/halloween-costume-tips/"&gt;Elaborate Halloween Costume Tips from a 19th-Century Guide to Fancy Dress&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The gilded age had some ridiculous parties. Here are highlights of the most popular costume guide of the era, now available on the Internet Archive.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://www.metafilter.com/170211/But-what-are-we-to-wear"&gt;But, what are we to wear? | MetaFilter&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/history"&gt;history&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;&lt;/p&gt;



</summary><category term="history"/><category term="internet-archive"/></entry><entry><title>Recovering missing content from the Internet Archive</title><link href="https://simonwillison.net/2017/Oct/8/missing-content/#atom-tag" rel="alternate"/><published>2017-10-08T19:08:57+00:00</published><updated>2017-10-08T19:08:57+00:00</updated><id>https://simonwillison.net/2017/Oct/8/missing-content/#atom-tag</id><summary type="html">
    &lt;p&gt;When &lt;a href="https://simonwillison.net/2017/Oct/1/ship/"&gt;I restored my blog last weekend&lt;/a&gt; I used the most recent SQL backup of my blog’s database from back in 2010. I thought it had all of my content from before I started my 7 year hiatus, but in watching the 404 logs I started seeing the occasional hit to something that really should have been there but wasn’t. Turns out the SQL backup I was working from was missing some content.&lt;/p&gt;
&lt;p&gt;Thank goodness then for &lt;a href="https://archive.org/web/"&gt;the Wayback Machine&lt;/a&gt; at the Internet Archive! I tried some of the missing URLs there and found they had been captured and preserved. But how to get them back?&lt;/p&gt;
&lt;p&gt;A quick search turned up &lt;a href="https://github.com/hartator/wayback-machine-downloader"&gt;wayback-machine-downloader&lt;/a&gt;, an open-source Ruby script that claims to be able to &lt;em&gt;Download an entire website from the Internet Archive Wayback Machine&lt;/em&gt;. I gem installed it and tried it out (after some cargo cult incantations to work around &lt;a href="https://rvm.io/support/fixing-broken-ssl-certificates"&gt;some weird certificate errors&lt;/a&gt; I was seeing):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rvm osx-ssl-certs update all
gem update --system
gem install wayback_machine_downloader

wayback_machine_downloader http://simonwillison.net/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And it worked! I left it running overnight and came back to a folder containing 18,952 HTML files, neatly arranged in a directory structure that matched my site:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ find . | more
.
./simonwillison.net
./simonwillison.net/2002
./simonwillison.net/2002/Aug
./simonwillison.net/2002/Aug/1
./simonwillison.net/2002/Aug/1/cetis
./simonwillison.net/2002/Aug/1/cetis/index.html
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial
./simonwillison.net/2002/Aug/1/cssSelectorsTutorial/index.html
...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I tarred them up into an archive and backed them up to Dropbox.&lt;/p&gt;
&lt;p&gt;Next challenge: how to restore the missing content?&lt;/p&gt;
&lt;p&gt;I’m a recent and enthusiastic adopter of &lt;a href="https://jupyter-notebook.readthedocs.io/en/latest/notebook.html"&gt;Jupyter notebooks&lt;/a&gt;. As a huge fan of development in a REPL I’m shocked I was so late to this particular party. So I fired up Jupyter and used it to start playing with the data.&lt;/p&gt;
&lt;p&gt;Here’s &lt;a href="https://github.com/simonw/simonwillisonblog/blob/0d233afbf3bf4fbe8778b8e6e022616d73e11568/jupyter-notebooks/Recover%20content%20from%20the%20wayback%20machine.ipynb"&gt;the final version of my notebook&lt;/a&gt;. I ended up with a script that did the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Load in the full list of paths from the tar archive, and filter for just the ones matching the /YYYY/Mon/DD/slug/ format used for my blog content&lt;/li&gt;
&lt;li&gt;Talk to my local Django development environment and load in the full list of actual content URLs represented in that database.&lt;/li&gt;
&lt;li&gt;Calculate the difference between the two - those are the 213 items that need to be recovered.&lt;/li&gt;
&lt;li&gt;For each of those 213 items, load the full HTML that had been saved by the Internet Archive and feed it into the &lt;a href="https://www.crummy.com/software/BeautifulSoup/"&gt;BeautifulSoup&lt;/a&gt; HTML parsing library.&lt;/li&gt;
&lt;li&gt;Detect if each one is an entry, a blogmark or a quotation. Scrape the key content out of each one based on the type.&lt;/li&gt;
&lt;li&gt;Scrape the tags for each item, using this delightful one-liner: &lt;code&gt;[a.text for a in soup.findAll('a', {'rel': 'tag'})]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Scrape the comments for each item separately. These were mostly spam, so I haven’t yet recovered these for publication (I need to do some aggressive spam filtering first). I have however stashed them in the database for later processing.&lt;/li&gt;
&lt;li&gt;Write all of the scraped data out to a giant JSON file and upload it to a gist (a nice cheap way of giving it a URL).&lt;/li&gt;
&lt;/ul&gt;
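&lt;p&gt;The first three steps boil down to a regular expression filter followed by a set difference. Here is a minimal sketch of that idea, not the code from the notebook: the pattern, the &lt;code&gt;missing_paths&lt;/code&gt; helper and the sample paths are all hypothetical, and it assumes each path starts with a leading slash.&lt;/p&gt;

```python
import re

# Hypothetical pattern for the /YYYY/Mon/DD/slug/ layout described above.
# Assumption: every path begins and ends with a slash.
CONTENT_PATH = re.compile(r"^/\d{4}/[A-Z][a-z]{2}/\d{1,2}/[^/]+/$")


def missing_paths(archived_paths, database_paths):
    """Return archived content paths that are absent from the database."""
    # Keep only the paths that look like blog content.
    archived = {p for p in archived_paths if CONTENT_PATH.match(p)}
    # Set difference: in the archive but not in the database.
    return sorted(archived - set(database_paths))


# Made-up sample data, for illustration only.
archived = [
    "/2010/Jun/21/married/",
    "/2009/Aug/10/trim/",
    "/static/css/all.css",
    "/tags/python/",
]
in_database = ["/2009/Aug/10/trim/"]

print(missing_paths(archived, in_database))
```

&lt;p&gt;Anything that does not look like a content URL (static assets, tag pages) is dropped before the comparison, so only genuine blog items can end up in the list of things to recover.&lt;/p&gt;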
&lt;p&gt;Having executed the above script, I now had a JSON file containing the parsed content for all of the missing items found in the Wayback Machine. All I needed then was a script which could take that JSON and turn it into records in the database. I implemented that as &lt;a href="https://github.com/simonw/simonwillisonblog/blob/0d233afbf3bf4fbe8778b8e6e022616d73e11568/blog/management/commands/import_blog_json.py"&gt;a custom Django management command&lt;/a&gt; and deployed it to Heroku.&lt;/p&gt;
&lt;p&gt;Last step: shell into a Heroku dyno (using &lt;code&gt;heroku run bash&lt;/code&gt;) and run the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;./manage.py import_blog_json \
    --url_to_json=https://gist.github.com/simonw/5a5bc1f58297d2c7d68dd7448a4d6614/raw/28d5d564ae3fe7165802967b0f9c4eff6091caf0/recovered-blog-content.json \
    --tag_with=recovered
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result: &lt;a href="https://simonwillison.net/tags/recovered/"&gt;213 recovered items&lt;/a&gt; (which I tagged with &lt;code&gt;recovered&lt;/code&gt; so I could easily browse them). Including the most important entry on my whole site, &lt;a href="https://simonwillison.net/2010/Jun/21/married/"&gt;my write-up of my wedding&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;So thank you very much to the &lt;a href="https://archive.org/"&gt;Internet Archive&lt;/a&gt; team, and thank you &lt;a href="https://twitter.com/Hartator"&gt;Hartator&lt;/a&gt; for your extremely useful wayback-machine-downloader tool.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/beautifulsoup"&gt;beautifulsoup&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urls"&gt;urls&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jupyter"&gt;jupyter&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="beautifulsoup"/><category term="internet-archive"/><category term="urls"/><category term="jupyter"/></entry><entry><title>tr.im is "discontinuing service"</title><link href="https://simonwillison.net/2009/Aug/10/trim/#atom-tag" rel="alternate"/><published>2009-08-10T11:06:41+00:00</published><updated>2009-08-10T11:06:41+00:00</updated><id>https://simonwillison.net/2009/Aug/10/trim/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blog.tr.im/post/159369789/tr-im-r-i-p"&gt;tr.im is &amp;quot;discontinuing service&amp;quot;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“However, all tr.im links will continue to redirect, and will do so until at least December 31, 2009. Your tweets with tr.im URLs in them will not be affected.”—these statements seem to contradict each other. Will tr.im URLs in tweets stop working after December 31st or not? Any chance they could hand the domain over to the Internet Archive? At any rate, this is exactly why centralised URL shorteners are a harmful trend.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redirects"&gt;redirects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/trim"&gt;trim&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twitter"&gt;twitter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urls"&gt;urls&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urlshorteners"&gt;urlshorteners&lt;/a&gt;&lt;/p&gt;



</summary><category term="internet-archive"/><category term="redirects"/><category term="trim"/><category term="twitter"/><category term="urls"/><category term="urlshorteners"/></entry><entry><title>A new leaf.</title><link href="https://simonwillison.net/2009/Apr/28/kewlchops/#atom-tag" rel="alternate"/><published>2009-04-28T00:55:18+00:00</published><updated>2009-04-28T00:55:18+00:00</updated><id>https://simonwillison.net/2009/Apr/28/kewlchops/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://george08.blogspot.com/2009/04/new-leaf.html"&gt;A new leaf.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
George Oates is now heading up the Open Library project at the Internet Archive. Sounds like a perfect match.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/george-oates"&gt;george-oates&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openlibrary"&gt;openlibrary&lt;/a&gt;&lt;/p&gt;



</summary><category term="george-oates"/><category term="internet-archive"/><category term="openlibrary"/></entry><entry><title>TinyURL - Archiveteam</title><link href="https://simonwillison.net/2009/Apr/3/tinyurl/#atom-tag" rel="alternate"/><published>2009-04-03T23:11:39+00:00</published><updated>2009-04-03T23:11:39+00:00</updated><id>https://simonwillison.net/2009/Apr/3/tinyurl/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://archiveteam.org/index.php?title=TinyURL"&gt;TinyURL - Archiveteam&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Excellent: the Internet Archive are crawling TinyURL (and hopefully other URL shortening services as well). The wiki page was created back in January. UPDATE from comments: Archiveteam are a separate organisation from the Internet Archive.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/archive"&gt;archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/archiveteam"&gt;archiveteam&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tinyurl"&gt;tinyurl&lt;/a&gt;&lt;/p&gt;



</summary><category term="archive"/><category term="archiveteam"/><category term="internet-archive"/><category term="tinyurl"/></entry><entry><title>Quoting Me</title><link href="https://simonwillison.net/2009/Mar/8/twitter/#atom-tag" rel="alternate"/><published>2009-03-08T14:59:34+00:00</published><updated>2009-03-08T14:59:34+00:00</updated><id>https://simonwillison.net/2009/Mar/8/twitter/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://twitter.com/simonw/status/1296514801"&gt;&lt;p&gt;The Internet Archive should actively partner with bit.ly / tinyurl.com / icanhaz.com etc. and maintain a mirror database of their redirects&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://twitter.com/simonw/status/1296514801"&gt;Me&lt;/a&gt;, on Twitter&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bitly"&gt;bitly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/icanhaz"&gt;icanhaz&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/me"&gt;me&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tinyurl"&gt;tinyurl&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twitter"&gt;twitter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urlshorteners"&gt;urlshorteners&lt;/a&gt;&lt;/p&gt;



</summary><category term="bitly"/><category term="icanhaz"/><category term="internet-archive"/><category term="me"/><category term="tinyurl"/><category term="twitter"/><category term="urlshorteners"/></entry><entry><title>My Future of Web Apps talk as a slidecast</title><link href="https://simonwillison.net/2007/Mar/12/slidecast/#atom-tag" rel="alternate"/><published>2007-03-12T23:57:25+00:00</published><updated>2007-03-12T23:57:25+00:00</updated><id>https://simonwillison.net/2007/Mar/12/slidecast/#atom-tag</id><summary type="html">
    &lt;p&gt;The team at Carson Systems have a pretty quick turnaround on their podcasts; they've had full recordings of every speaker up &lt;a href="http://www.futureofwebapps.com/" title="The Future of Web Apps"&gt;for a few days now&lt;/a&gt;. I spent a bunch of time over the weekend splicing the recording of my talk together with my slides, and the result is now available at &lt;a href="http://simonwillison.net/2007/openid-fowa/"&gt;The Future of OpenID (a slidecast)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I managed to crunch it down to a 41.2 MB H.264 MPEG file; there is also a Flash video version available on &lt;a href="http://www.archive.org/details/thefutureofopenid" title="Internet Archive: Details: The Future of OpenID"&gt;the Internet Archive&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A quick aside: I'm hosting the main video file in the Internet Archive's &lt;a href="http://www.archive.org/details/opensource_movies"&gt;Open Source Movies&lt;/a&gt; collection. They actively encourage people to &lt;a href="http://www.archive.org/create/"&gt;submit their own digital artifacts&lt;/a&gt;, and once you've uploaded something they'll automatically create thumbnails, derive an FLV version, mirror it to a bunch of places and &lt;a href="http://www.us.archive.org/log_show.php?task_id=13552773" title="Log file of tasks performed on my video"&gt;much more besides&lt;/a&gt;. If you've got a large video to distribute this is a great way to share it.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/future-of-web-apps"&gt;future-of-web-apps&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/internet-archive"&gt;internet-archive&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openid"&gt;openid&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/presenting"&gt;presenting&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slidecast"&gt;slidecast&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/my-talks"&gt;my-talks&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="future-of-web-apps"/><category term="internet-archive"/><category term="openid"/><category term="presenting"/><category term="slidecast"/><category term="my-talks"/></entry></feed>