<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: dalle</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/dalle.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-03-25T21:11:23+00:00</updated><author><name>Simon Willison</name></author><entry><title>Introducing 4o Image Generation</title><link href="https://simonwillison.net/2025/Mar/25/introducing-4o-image-generation/#atom-tag" rel="alternate"/><published>2025-03-25T21:11:23+00:00</published><updated>2025-03-25T21:11:23+00:00</updated><id>https://simonwillison.net/2025/Mar/25/introducing-4o-image-generation/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/index/introducing-4o-image-generation/"&gt;Introducing 4o Image Generation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When OpenAI first announced GPT-4o &lt;a href="https://simonwillison.net/2024/May/13/gpt-4o/"&gt;back in May 2024&lt;/a&gt;, one of the most exciting features was true multi-modality: it could both input &lt;em&gt;and&lt;/em&gt; output audio and images. The "o" stood for "omni", and the image output examples &lt;a href="https://openai.com/index/hello-gpt-4o/"&gt;in that launch post&lt;/a&gt; looked really impressive.&lt;/p&gt;
&lt;p&gt;It's taken them over ten months (and Gemini &lt;a href="https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation/"&gt;beat them to it&lt;/a&gt;) but today they're finally making those image generation abilities available, live right now in ChatGPT for paying customers.&lt;/p&gt;
&lt;p&gt;My test prompt for any model that can manipulate incoming images is "Turn this into a selfie with a bear", because you should never take a selfie with a bear! I fed ChatGPT &lt;a href="https://static.simonwillison.net/static/2025/selfie.jpg"&gt;this selfie&lt;/a&gt; and got back this result:&lt;/p&gt;
&lt;p&gt;&lt;img alt="It's a selfie, there's a grizzly bear over my shoulder smiling." src="https://static.simonwillison.net/static/2025/selfie-with-a-bear.jpg" /&gt;&lt;/p&gt;
&lt;p&gt;That's pretty great! It mangled the text on my T-Shirt (which says "LAWRENCE.COM" in a creative font) and added a second visible AirPod. It's very clearly me though, and that's definitely a bear.&lt;/p&gt;
&lt;p&gt;There are plenty more examples in &lt;a href="https://openai.com/index/introducing-4o-image-generation/"&gt;OpenAI's launch post&lt;/a&gt;, but as usual the most interesting details are tucked away in &lt;a href="https://openai.com/index/gpt-4o-image-generation-system-card-addendum/"&gt;the updates to the system card&lt;/a&gt;. There's lots in there about their approach to safety and bias, including a section on "Ahistorical and Unrealistic Bias" which feels inspired by Gemini's &lt;a href="https://blog.google/products/gemini/gemini-image-generation-issue/"&gt;embarrassing early missteps&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One section that stood out to me is their approach to images of public figures. The new policy is much more permissive than for DALL-E - highlights mine:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;4o image generation is capable, in many instances, of generating a depiction of a public figure based solely on a text prompt.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At launch, we are not blocking the capability to generate adult public figures&lt;/strong&gt; but are instead implementing the same safeguards that we have implemented for editing images of photorealistic uploads of people. For instance, this includes seeking to block the generation of photorealistic images of public figures who are minors and of material that violates our policies related to violence, hateful imagery, instructions for illicit activities, erotic content, and other areas. &lt;strong&gt;Public figures who wish for their depiction not to be generated can opt out&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This approach is more fine-grained than the way we dealt with public figures in our DALL·E series of models, where we used technical mitigations intended to prevent any images of a public figure from being generated. &lt;strong&gt;This change opens the possibility of helpful and beneficial uses in areas like educational, historical, satirical and political speech&lt;/strong&gt;. After launch, we will continue to monitor usage of this capability, evaluating our policies, and will adjust them if needed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Given that "public figures who wish for their depiction not to be generated can opt out" I wonder if we'll see a stampede of public figures to do exactly that!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: There's significant confusion right now over this new feature because it is being rolled out gradually: accounts that don't have it yet still generate images using the older DALL-E model instead... and there is no visual indication in the ChatGPT UI of which image generation method was used!&lt;/p&gt;
&lt;p&gt;OpenAI made the same mistake last year &lt;a href="https://simonwillison.net/2024/May/15/chatgpt-in-4o-mode/"&gt;when they announced ChatGPT advanced voice mode&lt;/a&gt; but failed to clarify that ChatGPT was still running the previous, less impressive voice implementation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update 2&lt;/strong&gt;: Images created with DALL-E through the ChatGPT web interface now show a note with a warning:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Created with DALL-E with an information icon. Clicking it reveals DALL-E is OpenAI's legacy image generation model. A new model is rolling out in ChatGPT soon." src="https://static.simonwillison.net/static/2025/dall-e-warning.jpg" /&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/multi-modal-output"&gt;multi-modal-output&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-bias"&gt;ai-bias&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="dalle"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="gemini"/><category term="multi-modal-output"/><category term="ai-ethics"/><category term="llm-release"/><category term="ai-bias"/></entry><entry><title>Now add a walrus: Prompt engineering in DALL‑E 3</title><link href="https://simonwillison.net/2023/Oct/26/add-a-walrus/#atom-tag" rel="alternate"/><published>2023-10-26T21:11:39+00:00</published><updated>2023-10-26T21:11:39+00:00</updated><id>https://simonwillison.net/2023/Oct/26/add-a-walrus/#atom-tag</id><summary type="html">
    &lt;p&gt;Last year I wrote about &lt;a href="https://simonwillison.net/2022/Jun/23/dall-e/"&gt;my initial experiments with DALL-E 2&lt;/a&gt;, OpenAI's image generation model. I've been having an &lt;em&gt;absurd&lt;/em&gt; amount of fun playing with its sequel, DALL-E 3, recently. Here are some notes, including &lt;a href="https://simonwillison.net/2023/Oct/26/add-a-walrus/#peeking-under-the-hood"&gt;a peek under the hood&lt;/a&gt; and some notes on the &lt;a href="https://simonwillison.net/2023/Oct/26/add-a-walrus/#the-leaked-dall-e-prompt"&gt;leaked system prompt&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;DALL-E 3 is available via ChatGPT Plus. For $20/month you get GPT-4 access, DALL-E 3, &lt;a href="https://simonwillison.net/2023/Oct/14/multi-modal-prompt-injection/"&gt;GPT Vision&lt;/a&gt;, Code Interpreter (aka &lt;a href="https://simonwillison.net/2023/Oct/17/open-questions/#open-questions.036.jpeg"&gt;Coding Intern&lt;/a&gt;), &lt;a href="https://simonwillison.net/2023/Mar/24/datasette-chatgpt-plugin/"&gt;ChatGPT Plugins&lt;/a&gt; and the new, weird and slightly dystopian &lt;a href="https://fedi.simonwillison.net/@simon/111292801843094073"&gt;voice chat feature&lt;/a&gt; in the ChatGPT mobile apps. It's absolutely worth the money, for the combined entertainment value alone.&lt;/p&gt;
&lt;p&gt;A fascinating thing about DALL-E 3 is that you don't prompt it directly: you instead access it via ChatGPT, which prompts the model for you.&lt;/p&gt;
&lt;p&gt;ChatGPT then turns your query into several prompts, each producing a different image. The prompts it generates are available if you click on the images (on mobile you have to flip your phone horizontal to see that option).&lt;/p&gt;
&lt;p&gt;I started by typing this into ChatGPT with DALL-E enabled:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A super posh pelican with a monocle watching the Monaco F1&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It gave me these four images. I've included the prompt it used for each one:&lt;/p&gt;
&lt;div style="
    display: grid;
    grid-template-columns: 50% 50%;
    gap: 10px;
    margin-bottom: 1em;"&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/pelican-1-1.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/pelican-1-1.png" alt="Pelican 1" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of a refined pelican wearing a shiny gold monocle, perched on a railing with the Monaco Grand Prix F1 race in the background. The race cars blur past on the circuit, while yachts are anchored in the nearby harbor.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/pelican-1-2.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/pelican-1-2.png" alt="Pelican 2" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Illustration of a sophisticated pelican donning a crystal clear monocle, observing the Monaco F1 race with keen interest. The race track is filled with speeding F1 cars, and the Mediterranean Sea sparkles in the distance with luxurious yachts.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/pelican-1-3.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/pelican-1-3.png" alt="Pelican 3" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Watercolor painting of a posh pelican, its feathers groomed to perfection, equipped with a delicate monocle. It stands tall with the iconic Monaco F1 race happening behind it, cars racing at breakneck speeds and the shimmering sea beyond.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/pelican-1-4.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/pelican-1-4.png" alt="Pelican 4" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Vector image of an elegant pelican with a classy monocle, standing gracefully against the backdrop of the Monaco Grand Prix. The F1 cars create a colorful streak as they race, and lavish yachts float in the nearby waters.&lt;/figcaption&gt;
    &lt;/figure&gt;
&lt;/div&gt;
&lt;p&gt;Note how the prompts it generates cover four different styles - photo, illustration, watercolor painting and vector image. They also use pleasing and varied language to describe different aspects of the image.&lt;/p&gt;
&lt;p&gt;ChatGPT is prompting DALL-E for us!&lt;/p&gt;
&lt;p&gt;The really fun thing though is that you can ask it follow-ups.&lt;/p&gt;
&lt;p&gt;I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;More like the first one please&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And got back:&lt;/p&gt;
&lt;div style="
    display: grid;
    grid-template-columns: 50% 50%;
    gap: 10px;
    margin-bottom: 1em;"&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/pelican-2-1.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/pelican-2-1.png" alt="Pelican 1" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of an aristocratic pelican adorned with a golden monocle, perched confidently with the Monaco Grand Prix F1 race unfolding behind it. The Formula 1 cars zip by on the circuit, while opulent yachts are moored in the nearby marina.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/pelican-2-2.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/pelican-2-2.png" alt="Pelican 2" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of a regal pelican with a polished monocle, standing prominently against the backdrop of the Monaco F1 race. The roar of the engines can almost be heard as the cars speed past, and the glinting sea is dotted with luxurious boats.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/pelican-2-3.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/pelican-2-3.png" alt="Pelican 3" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of a distinguished pelican sporting a gleaming monocle, observing the Monaco Grand Prix from a vantage point. Racing cars leave a trail of dust as they navigate the circuit, with the blue sea and anchored yachts completing the panorama.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/pelican-2-4.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/pelican-2-4.png" alt="Pelican 4" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of a chic pelican, its eye magnified by a pristine monocle, captivated by the Monaco F1 race happening just meters away. The racetrack buzzes with activity as F1 cars race, and the horizon showcases the grandeur of the harbor with its yachts.&lt;/figcaption&gt;
    &lt;/figure&gt;
&lt;/div&gt;

&lt;p&gt;Next I added what is becoming my universal DALL-E 3 follow-up prompt, always guaranteed to produce better results:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a walrus&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The result:&lt;/p&gt;

&lt;div style="
    display: grid;
    grid-template-columns: 50% 50%;
    gap: 10px;
    margin-bottom: 1em;"&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-1.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-1.png" alt="Pelican and a walrus (mostly off screen)" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of a posh pelican with a golden monocle and a distinguished walrus, both observing the Monaco Grand Prix F1 race. The pelican is perched on a railing while the walrus lounges nearby. Formula 1 cars race by on the circuit, with lavish yachts in the marina behind them.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-2.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-2.png" alt="Pelican and a walrus, excellent" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of an aristocratic pelican and a stately walrus side by side, watching the Monaco F1 with keen interest. The pelican has a monocle, and the walrus's tusks shine brightly. The race track is alive with the sound of roaring engines, and the harbor is filled with luxurious boats.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-3.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-3.png" alt="Weird pelican walrus hybrid" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of a regal pelican, its eye magnified by a monocle, and a majestic walrus with gleaming tusks, both captivated by the Monaco Grand Prix. The racetrack is a blur of speed, and the horizon showcases the grandeur of the harbor and its anchored yachts.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-4.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-4.png" alt="Pelican and a walrus" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of a chic pelican with a pristine monocle and a sophisticated walrus, both engrossed in the Monaco F1 race. The pelican stands on a vantage point, and the walrus sits comfortably beside it. Racing cars zoom past, and the sea is dotted with opulent boats.&lt;/figcaption&gt;
    &lt;/figure&gt;
&lt;/div&gt;

&lt;p&gt;That &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-2.png"&gt;second one&lt;/a&gt; is &lt;em&gt;amazing&lt;/em&gt; - exactly what I want from a photo of a posh pelican and a walrus watching the Monaco Grand Prix.&lt;/p&gt;
&lt;p&gt;I also enjoy how the &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/add-walrus-3.png"&gt;third image&lt;/a&gt; ended up as a weird Pelican-Walrus hybrid, with tusks, a monocle and pelican feathers for the hair and whiskers.&lt;/p&gt;
&lt;p&gt;Then &lt;a href="https://twitter.com/BrockMcBreadcat/status/1717581286733127701"&gt;@BrockMcBreadcat suggested&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Now make them muppets&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div style="
    display: grid;
    grid-template-columns: 50% 50%;
    gap: 10px;
    margin-bottom: 1em;"&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/muppets-1.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/muppets-1.png" alt="Pelican and a walrus muppets 1" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of Muppet versions of a pelican and a walrus, both with exaggerated, comical features. The pelican Muppet wears a monocle and the walrus Muppet sports oversized tusks. They are animatedly discussing the Monaco Grand Prix F1 race, with caricatured F1 cars and yachts in the background.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/muppets-2.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/muppets-2.png" alt="Pelican and a walrus muppets 2" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Illustration of Muppet-inspired pelican and walrus characters. The pelican has a large, beady eye accentuated by a monocle, while the walrus has a playful grin and prominent tusks. They are engrossed in the Monaco F1 race, with whimsical racing cars and yachts creating a lively backdrop.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/muppets-3.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/muppets-3.png" alt="Pelican and a walrus muppets 3" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Photo of two Muppet characters: a pelican with a monocle and a walrus with big, goofy tusks. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.&lt;/figcaption&gt;
    &lt;/figure&gt;
    &lt;figure style="margin: 0"&gt;
        &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/muppets-4.png" style="border: none"&gt;&lt;img loading="lazy" style="width: 100%" src="https://static.simonwillison.net/static/2023/dalle-3/muppets-4.png" alt="Pelican and a walrus muppets 4" /&gt;&lt;/a&gt;
        &lt;figcaption&gt;Illustration of a pelican and walrus as Muppet characters. The pelican Muppet looks sophisticated with its monocle, and the walrus Muppet chuckles with its oversized tusks. They are enjoying the Monaco F1 race, with the track filled with Muppet-style racing cars and quirky yachts.&lt;/figcaption&gt;
    &lt;/figure&gt;
&lt;/div&gt;
&lt;p&gt;LOVE that &lt;a href="https://static.simonwillison.net/static/2023/dalle-3/muppets-3.png"&gt;third one&lt;/a&gt;!&lt;/p&gt;
&lt;h4 id="reusing-seeds"&gt;Reusing seeds&lt;/h4&gt;
&lt;p&gt;I &lt;a href="https://twitter.com/simonw/status/1717547946424897764"&gt;mentioned on Twitter&lt;/a&gt; that it was a shame you couldn't reuse elements of an image for further visual consistency... and was pointed to &lt;a href="https://twitter.com/itsPaulAi/status/1717197004651044914"&gt;a thread by Paul Couvert&lt;/a&gt; explaining how to do exactly that using seeds.&lt;/p&gt;
&lt;p&gt;I didn't believe this would work - my mental model of how random number seeds and image generation models work doesn't fit what happened next.&lt;/p&gt;
&lt;p&gt;I wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is the seed for image three?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;ChatGPT replied:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The seed for the third image is 1379049893. If you have any other questions or requests, please let me know!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Then I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add bow ties to both characters in seed 1379049893&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href="https://static.simonwillison.net/static/2023/dalle-3/pelican-walrus-muppet-bowtie.png" style="border: none"&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2023/dalle-3/pelican-walrus-muppet-bowtie.png" alt="A muppet pelican and a muppet walrus with bowties in front of the Monaco Grand Prix" style="max-width: 100%;" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Photo of two Muppet characters: a pelican with a monocle and a bow tie, and a walrus with big, goofy tusks and a dapper bow tie. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'm pretty stunned by this. It's not an exact recreation - the boats in the background have changed slightly, and it added a top hat that I wasn't expecting. But it's pretty close!&lt;/p&gt;
&lt;p&gt;People have been trying to figure out hacks to get Midjourney to create consistent characters for the past year, and DALL-E 3 apparently has that ability as an undocumented feature!&lt;/p&gt;
&lt;h4 id="peeking-under-the-hood"&gt;Peeking under the hood&lt;/h4&gt;
&lt;p&gt;Let's see if we can find out a bit more about how this works.&lt;/p&gt;
&lt;p&gt;I used the Firefox DevTools network inspector to grab the JSON representation of my conversation with ChatGPT. Here's &lt;a href="https://gist.github.com/simonw/832d141d776b0752503b1748224921c0"&gt;a pretty-printed copy&lt;/a&gt;.&lt;/p&gt;
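&lt;p&gt;Captured network responses arrive as compact JSON; producing a pretty-printed copy like that gist is a one-liner round-trip through Python's &lt;code&gt;json&lt;/code&gt; module (the string here is just an illustrative fragment, not the real transcript):&lt;/p&gt;

```python
import json

# A captured network response is compact JSON; decoding and re-encoding
# it with indent=2 yields the pretty-printed copy.
raw = '{"id": "5300d51c", "author": {"role": "assistant"}}'  # illustrative fragment
pretty = json.dumps(json.loads(raw), indent=2)
print(pretty)
```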
&lt;p&gt;At &lt;a href="https://gist.github.com/simonw/832d141d776b0752503b1748224921c0#file-dalle-3-json-L1087"&gt;line 1087&lt;/a&gt; you can see the point where I asked it to "Add bow ties to both characters in seed 1379049893".&lt;/p&gt;
&lt;p&gt;The next message in the transcript looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre style="white-space: pre-wrap"&gt;{
    &lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;5300d51c-794d-4d5f-bf75-6ea2dad4027c&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"author"&lt;/span&gt;: {
        &lt;span class="pl-ent"&gt;"role"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;assistant&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: {}
    },
    &lt;span class="pl-ent"&gt;"create_time"&lt;/span&gt;: &lt;span class="pl-c1"&gt;1698340593.174454&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: {
        &lt;span class="pl-ent"&gt;"content_type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;text&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"parts"&lt;/span&gt;: [
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;{&lt;span class="pl-cce"&gt;\n&lt;/span&gt;  &lt;span class="pl-cce"&gt;\"&lt;/span&gt;prompts&lt;span class="pl-cce"&gt;\"&lt;/span&gt;: [&lt;span class="pl-cce"&gt;\n&lt;/span&gt;    &lt;span class="pl-cce"&gt;\"&lt;/span&gt;Photo of two Muppet characters: a pelican with a monocle and a bow tie, and a walrus with big, goofy tusks and a dapper bow tie. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.&lt;span class="pl-cce"&gt;\"\n&lt;/span&gt;  ],&lt;span class="pl-cce"&gt;\n&lt;/span&gt;  &lt;span class="pl-cce"&gt;\"&lt;/span&gt;size&lt;span class="pl-cce"&gt;\"&lt;/span&gt;: &lt;span class="pl-cce"&gt;\"&lt;/span&gt;1024x1024&lt;span class="pl-cce"&gt;\"&lt;/span&gt;,&lt;span class="pl-cce"&gt;\n&lt;/span&gt;  &lt;span class="pl-cce"&gt;\"&lt;/span&gt;seeds&lt;span class="pl-cce"&gt;\"&lt;/span&gt;: [1379049893]&lt;span class="pl-cce"&gt;\n&lt;/span&gt;}&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
        ]
    },
    &lt;span class="pl-ent"&gt;"status"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;finished_successfully&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"end_turn"&lt;/span&gt;: &lt;span class="pl-c1"&gt;false&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"weight"&lt;/span&gt;: &lt;span class="pl-c1"&gt;1&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: {
        &lt;span class="pl-ent"&gt;"finish_details"&lt;/span&gt;: {
            &lt;span class="pl-ent"&gt;"type"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;stop&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
            &lt;span class="pl-ent"&gt;"stop_tokens"&lt;/span&gt;: [
                &lt;span class="pl-c1"&gt;100265&lt;/span&gt;
            ]
        },
        &lt;span class="pl-ent"&gt;"is_complete"&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"model_slug"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;gpt-4-dalle&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"parent_id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;aaa2c3bc-b670-4a3a-bf9f-46a1f3304266&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
        &lt;span class="pl-ent"&gt;"timestamp_"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;absolute&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
    },
    &lt;span class="pl-ent"&gt;"recipient"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;dalle.text2im&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note the author is "assistant" - that's ChatGPT - and the recipient is "dalle.text2im".&lt;/p&gt;
&lt;p&gt;The "parts" key includes some encoded JSON. Decoded and pretty-printed that looks like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre style="white-space: pre-wrap"&gt;{
  &lt;span class="pl-ent"&gt;"prompts"&lt;/span&gt;: [
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;Photo of two Muppet characters: a pelican with a monocle and a bow tie, and a walrus with big, goofy tusks and a dapper bow tie. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  ],
  &lt;span class="pl-ent"&gt;"size"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1024x1024&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-ent"&gt;"seeds"&lt;/span&gt;: [&lt;span class="pl-c1"&gt;1379049893&lt;/span&gt;]
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There's that &lt;code&gt;1379049893&lt;/code&gt; seed from earlier!&lt;/p&gt;
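&lt;p&gt;The decoding step can be sketched in a few lines of Python, assuming the message structure shown above (the prompt text is abbreviated here): the &lt;code&gt;"parts"&lt;/code&gt; entry is itself a JSON-encoded string, so a second &lt;code&gt;json.loads&lt;/code&gt; recovers the payload:&lt;/p&gt;

```python
import json

# The assistant message embeds a JSON string inside the "parts" list,
# so it takes a second json.loads to reach the prompts/size/seeds payload.
message = {
    "content": {
        "content_type": "text",
        "parts": [
            '{"prompts": ["Photo of two Muppet characters..."], '
            '"size": "1024x1024", "seeds": [1379049893]}'
        ],
    }
}
payload = json.loads(message["content"]["parts"][0])
print(payload["seeds"])  # [1379049893]
```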
&lt;p&gt;The prompt has changed very slightly. The original prompt for the image was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Photo of two Muppet characters: a pelican with a monocle and a walrus with big, goofy tusks. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The new prompt (differences highlighted in bold) is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Photo of two Muppet characters: a pelican with a monocle &lt;strong&gt;and a bow tie,&lt;/strong&gt; and a walrus with big, goofy tusks &lt;strong&gt;and a dapper bow tie&lt;/strong&gt;. They're seated in a Muppet-style commentary booth, providing humorous commentary on the Monaco Grand Prix. Cartoonish F1 cars race by, and colorful yachts are seen in the distance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So this really did work by adding a couple of extra details to the prompt and re-running it with the same seed as before.&lt;/p&gt;
&lt;p&gt;I'm really surprised that this works. I would expect even a single character difference in the prompt to produce wildly different results, no matter what seed was being fed to the image generator. DALL-E 3 doesn't fit my mental model of how these things work at all.&lt;/p&gt;
&lt;p&gt;Midjourney and Stable Diffusion both have a "seed" concept, but as far as I know they don't have anything like this capability to maintain consistency between images given the same seed and a slightly altered prompt.&lt;/p&gt;
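&lt;p&gt;For what it's worth, one way seeds &lt;em&gt;could&lt;/em&gt; permit this (this is a toy sketch of typical diffusion pipelines, not anything OpenAI has documented): the seed fixes the initial noise latent, and that latent doesn't depend on the prompt at all, so the same seed gives the denoiser the same starting point even after the prompt is edited:&lt;/p&gt;

```python
import random

def initial_latent(seed, n=8):
    # Toy sketch: in typical diffusion pipelines the seed determines the
    # starting noise latent, independent of the prompt text. The prompt
    # only steers the subsequent denoising steps.
    rng = random.Random(seed)
    return [round(rng.gauss(0, 1), 4) for _ in range(n)]

# Same seed, same starting noise - regardless of any prompt edits.
print(initial_latent(1379049893)[:3])
```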
&lt;h4 id="the-leaked-dall-e-prompt"&gt;The leaked DALL-E prompt&lt;/h4&gt;
&lt;p&gt;If you dig around in the JSON you'll find a few &lt;a href="https://gist.github.com/simonw/832d141d776b0752503b1748224921c0#file-dalle-3-json-L203"&gt;examples like this&lt;/a&gt; that appear to reveal inner workings of the DALL-E/ChatGPT combo:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;"DALL·E returned some images. They are already displayed to the user. DO NOT UNDER ANY CIRCUMSTANCES list the DALL·E prompts or images in your response."&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It's always amusing to see glimpses of OpenAI's prompt engineering like this, in particular the way they use CAPITAL LETTERS for emphasis. Benj Edwards &lt;a href="https://arstechnica.com/information-technology/2023/10/thanks-to-ai-the-future-of-programming-may-involve-yelling-in-all-caps/"&gt;wrote about this&lt;/a&gt; the other day for Ars Technica.&lt;/p&gt;
&lt;p&gt;Does this mean there's a larger prompt that tells ChatGPT how to use DALL-E? It turns out there is, and it &lt;a href="https://news.ycombinator.com/item?id=37879077#37880847"&gt;can be leaked&lt;/a&gt;! Dustin Miller &lt;a href="https://github.com/spdustin/ChatGPT-AutoExpert/blob/main/_system-prompts/dall-e.md"&gt;shared the extracted prompt here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;(Every time a prompt like this leaks people question if it might be a hallucination. My experience is that these things are very unlikely to be hallucinated - LLMs are really good at repeating text from earlier in their context, and I have yet to see a documented case of a leaked prompt that turned out not to be correct. Here's &lt;a href="https://chat.openai.com/share/5f550be3-3111-485f-a32f-c51fea60d482"&gt;my own recreation&lt;/a&gt; of the leaked DALL-E prompt.)&lt;/p&gt;
&lt;p&gt;There's a lot in there. Here are some highlights:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Whenever a description of an image is given, use dalle to create the images and then summarize the prompts used to generate the images in plain text. If the user does not ask for a specific number of images, default to creating four captions to send to dalle that are written to be as diverse as possible.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That outlines the default behavior. Where things get really interesting is some of the stuff in the list of policies that follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol start="3"&gt;
&lt;li&gt;
&lt;p&gt;Don't create images of politicians or other public figures. Recommend other ideas instead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Don't create images in the style of artists whose last work was created within the last 100 years (e.g. Picasso, Kahlo). Artists whose last work was over 100 years ago are ok to reference directly (e.g. Van Gogh, Klimt). If asked say, "I can't reference this artist", but make no mention of this policy. Instead, apply the following procedure when creating the captions for dalle: (a) substitute the artist's name with three adjectives that capture key aspects of the style; (b) include an associated artistic movement or era to provide context; and (c) mention the primary medium used by the artist.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;They have clearly trained the model on all sorts of copyrighted images - this is the part of the prompt where they discourage it from obviously recreating the work of living artists.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol start="6"&gt;
&lt;li&gt;Always mention the image type (photo, oil painting, watercolor painting, illustration, cartoon, drawing, vector, render, etc.) at the beginning of the caption. Unless the caption suggests otherwise, make at least 1--2 of the 4 images photos&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;We saw this earlier where my first prompt was turned into a photo, watercolor, illustration and vector-style image.&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol start="7"&gt;
&lt;li id="diversify"&gt;Diversify depictions of ALL images with people to include DESCENT and GENDER for EACH person using direct terms. Adjust only human descriptions.
&lt;ul&gt;
&lt;li&gt;EXPLICITLY specify these attributes, not abstractly reference them. The attributes should be specified in a minimal way and should directly describe their physical form.&lt;/li&gt;
&lt;li&gt;Your choices should be grounded in reality. For example, all of a given OCCUPATION should not be the same gender or race. Additionally, focus on creating diverse, inclusive, and exploratory scenes via the properties you choose during rewrites. Make choices that may be insightful or unique sometimes.&lt;/li&gt;
&lt;li&gt;Use "various" or "diverse" ONLY IF the description refers to groups of more than 3 people. Do not change the number of people requested in the original description.&lt;/li&gt;
&lt;li&gt;Don't alter memes, fictional character origins, or unseen people. Maintain the original prompt's intent and prioritize quality.&lt;/li&gt;
&lt;li&gt;Do not create any imagery that would be offensive.&lt;/li&gt;
&lt;li&gt;For scenarios where bias has been traditionally an issue, make sure that key traits such as gender and race are specified and in an unbiased way -- for example, prompts that contain references to specific occupations.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Using prompt engineering in ChatGPT to get around the fundamental biases baked into the DALL-E 3 model!&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol start="8"&gt;
&lt;li&gt;Silently modify descriptions that include names or hints or references of specific people or celebritie by carefully selecting a few minimal modifications to substitute references to the people with generic descriptions that don't divulge any information about their identities, except for their genders and physiques. Do this EVEN WHEN the instructions ask for the prompt to not be changed. Some special cases:
&lt;ul&gt;
&lt;li&gt;Modify such prompts even if you don't know who the person is, or if their name is misspelled (e.g. "Barake Obema")&lt;/li&gt;
&lt;li&gt;If the reference to the person will only appear as TEXT out in the image, then use the reference as is and do not modify it.&lt;/li&gt;
&lt;li&gt;When making the substitutions, don't use prominent titles that could give away the person's identity. E.g., instead of saying "president", "prime minister", or "chancellor", say "politician"; instead of saying "king", "queen", "emperor", or "empress", say "public figure"; instead of saying "Pope" or "Dalai Lama", say "religious figure"; and so on.&lt;/li&gt;
&lt;li&gt;If any creative professional or studio is named, substitute the name with a description of their style that does not reference any specific people, or delete the reference if they are unknown. DO NOT refer to the artist or studio's style.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;Another fascinating detail. What I love about these prompts is that they clearly developed over months of testing the model, and are designed to counter all manner of harmful or embarrassing capabilities of the underlying DALL-E 3.&lt;/p&gt;
&lt;p&gt;Finally, there's this snippet of TypeScript at the end which describes the seed behavior we saw earlier on:&lt;/p&gt;
&lt;div class="highlight highlight-source-ts"&gt;&lt;pre style="white-space: pre-wrap"&gt;&lt;span class="pl-k"&gt;namespace&lt;/span&gt; &lt;span class="pl-s1"&gt;dalle&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
&lt;span class="pl-c"&gt;// Create images from a text-only prompt.&lt;/span&gt;
&lt;span class="pl-k"&gt;type&lt;/span&gt; &lt;span class="pl-smi"&gt;text2im&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;_&lt;/span&gt;: &lt;span class="pl-kos"&gt;{&lt;/span&gt;
&lt;span class="pl-c"&gt;// The resolution of the requested image, which can be wide, square, or tall. Use 1024x1024 (square) as the default unless the prompt suggests a wide image, 1792x1024, or a full-body portrait, in which case 1024x1792 (tall) should be used instead. Always include this parameter in the request.&lt;/span&gt;
&lt;span class="pl-c1"&gt;size&lt;/span&gt;?: &lt;span class="pl-s"&gt;"1792x1024"&lt;/span&gt; &lt;span class="pl-c1"&gt;|&lt;/span&gt; &lt;span class="pl-s"&gt;"1024x1024"&lt;/span&gt; &lt;span class="pl-c1"&gt;|&lt;/span&gt; &lt;span class="pl-s"&gt;"1024x1792"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
&lt;span class="pl-c"&gt;// The user's original image description, potentially modified to abide by the dalle policies. If the user does not suggest a number of captions to create, create four of them. If creating multiple captions, make them as diverse as possible. If the user requested modifications to previous images, the captions should not simply be longer, but rather it should be refactored to integrate the suggestions into each of the captions. Generate no more than 4 images, even if the user requests more.&lt;/span&gt;
&lt;span class="pl-c1"&gt;prompts&lt;/span&gt;: &lt;span class="pl-smi"&gt;string&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
&lt;span class="pl-c"&gt;// A list of seeds to use for each prompt. If the user asks to modify a previous image, populate this field with the seed used to generate that image from the image dalle metadata.&lt;/span&gt;
&lt;span class="pl-c1"&gt;seeds&lt;/span&gt;?: &lt;span class="pl-smi"&gt;number&lt;/span&gt;&lt;span class="pl-kos"&gt;[&lt;/span&gt;&lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-smi"&gt;any&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt; &lt;span class="pl-c"&gt;// namespace dalle&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Key extract from that:&lt;/p&gt;
&lt;blockquote&gt;&lt;p&gt;If the user requested
modifications to previous images, the captions should not simply
be longer, but rather it should be refactored to integrate the
suggestions into each of the captions.&lt;/p&gt;&lt;/blockquote&gt;
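Based on that leaked schema, the tool call ChatGPT assembles for a modification request might look something like this sketch. This is hypothetical: the real wire format between ChatGPT and DALL-E isn't part of the leak, and the seed value below is made up for illustration.

```python
# Hypothetical sketch of a text2im tool call, based on the leaked schema.
# The actual wire format between ChatGPT and DALL-E is not part of the leak.

def build_text2im_call(prompts, seeds=None, size="1024x1024"):
    """Assemble a payload shaped like the leaked text2im signature."""
    call = {"size": size, "prompts": prompts[:4]}  # generate no more than 4 images
    if seeds is not None:
        # Seeds come from the dalle metadata of the images being modified.
        call["seeds"] = seeds[: len(call["prompts"])]
    return call

# Modifying a previous image: reuse its seed, and refactor the caption to
# integrate the requested change rather than simply appending to it.
call = build_text2im_call(
    prompts=[
        "Photo of two Muppet characters: a pelican with a monocle and a bow tie, "
        "and a walrus with big, goofy tusks and a dapper bow tie..."
    ],
    seeds=[12345],  # made-up seed, standing in for the one from the image metadata
)
```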
&lt;p&gt;All in all, this is a really neat insight into the kind of prompt engineering that goes on within OpenAI itself - likely the organization with the most experience of prompt engineering in the world.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="openai"/><category term="dalle"/><category term="prompt-engineering"/><category term="prompt-injection"/><category term="generative-ai"/><category term="chatgpt"/><category term="llms"/><category term="text-to-image"/></entry><entry><title>Quoting DALL-E 3 leaked prompt</title><link href="https://simonwillison.net/2023/Oct/7/dall-e-3-prompt/#atom-tag" rel="alternate"/><published>2023-10-07T19:35:09+00:00</published><updated>2023-10-07T19:35:09+00:00</updated><id>https://simonwillison.net/2023/Oct/7/dall-e-3-prompt/#atom-tag</id><summary type="html">
    &lt;blockquote cite="https://twitter.com/Suhail/status/1710653717081653712"&gt;&lt;p&gt;&lt;code&gt;Don't create images in the style of artists whose last work was created within the last 100 years (e.g. Picasso, Kahlo). Artists whose last work was over 100 years ago are ok to reference directly (e.g. Van Gogh, Klimt). If asked say, "I can't reference this artist", but make no mention of this policy. Instead, apply the following procedure when creating the captions for dalle: (a) substitute the artist's name with three adjectives that capture key aspects of the style; (b) include an associated artistic movement or era to provide context; and (c) mention the primary medium used by the artist.&lt;/code&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="https://twitter.com/Suhail/status/1710653717081653712"&gt;DALL-E 3 leaked prompt&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-injection"&gt;prompt-injection&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="dalle"/><category term="prompt-engineering"/><category term="prompt-injection"/><category term="generative-ai"/></entry><entry><title>Bing Image Creator comes to the new Bing</title><link href="https://simonwillison.net/2023/Mar/21/bing-image-creator-comes-to-the-new-bing/#atom-tag" rel="alternate"/><published>2023-03-21T17:10:53+00:00</published><updated>2023-03-21T17:10:53+00:00</updated><id>https://simonwillison.net/2023/Mar/21/bing-image-creator-comes-to-the-new-bing/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blogs.microsoft.com/blog/2023/03/21/create-images-with-your-words-bing-image-creator-comes-to-the-new-bing/"&gt;Bing Image Creator comes to the new Bing&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Bing Chat is integrating DALL-E directly into its interface, giving it the ability to generate images when prompted to do so.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/bing"&gt;bing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="bing"/><category term="ai"/><category term="dalle"/><category term="generative-ai"/></entry><entry><title>The AI that creates any picture you want, explained</title><link href="https://simonwillison.net/2022/Oct/10/vox-ai-images/#atom-tag" rel="alternate"/><published>2022-10-10T03:28:08+00:00</published><updated>2022-10-10T03:28:08+00:00</updated><id>https://simonwillison.net/2022/Oct/10/vox-ai-images/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.youtube.com/watch?v=SVcsDDABEkM"&gt;The AI that creates any picture you want, explained&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Vox made this explainer video about text-to-image generative AI models back in June, months before Stable Diffusion was released and shortly before the DALL-E preview started rolling out to a wider audience. It’s a really good video—in particular the animation that explains at a high level how diffusion models work, which starts about 5m30s in.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="dalle"/><category term="stable-diffusion"/><category term="generative-ai"/><category term="text-to-image"/></entry><entry><title>Stable Diffusion is a really big deal</title><link href="https://simonwillison.net/2022/Aug/29/stable-diffusion/#atom-tag" rel="alternate"/><published>2022-08-29T01:09:04+00:00</published><updated>2022-08-29T01:09:04+00:00</updated><id>https://simonwillison.net/2022/Aug/29/stable-diffusion/#atom-tag</id><summary type="html">
    &lt;p&gt;If you haven't been paying attention to what's going on with Stable Diffusion, you really should be.&lt;/p&gt;
&lt;p&gt;Stable Diffusion is a new "text-to-image diffusion model" that was &lt;a href="https://stability.ai/blog/stable-diffusion-public-release"&gt;released to the public&lt;/a&gt; by &lt;a href="https://stability.ai/"&gt;Stability.ai&lt;/a&gt; six days ago, on August 22nd.&lt;/p&gt;
&lt;p&gt;It's similar to models like OpenAI's DALL-E, but with one crucial difference: they released the whole thing.&lt;/p&gt;
&lt;p&gt;You can try it out online at &lt;a href="https://beta.dreamstudio.ai/"&gt;beta.dreamstudio.ai&lt;/a&gt; (currently for free). Type in a text prompt and the model will generate an image.&lt;/p&gt;
&lt;p&gt;You can download and run the model on your own computer (if you have a powerful enough graphics card). Here's &lt;a href="https://www.reddit.com/r/StableDiffusion/comments/wuyu2u/how_do_i_run_stable_diffusion_and_sharing_faqs/"&gt;an FAQ&lt;/a&gt; on how to do that.&lt;/p&gt;
&lt;p&gt;You can use it for commercial and non-commercial purposes, under the terms of the &lt;a href="https://huggingface.co/spaces/CompVis/stable-diffusion-license"&gt;Creative ML OpenRAIL-M license&lt;/a&gt; - which lists some usage restrictions that include avoiding using it to break applicable laws, generate false information, discriminate against individuals or provide medical advice.&lt;/p&gt;
&lt;p&gt;In just a few days, there has been an &lt;strong&gt;explosion&lt;/strong&gt; of innovation around it. The things people are building are absolutely astonishing.&lt;/p&gt;
&lt;p&gt;I've been tracking the &lt;a href="https://www.reddit.com/r/StableDiffusion/"&gt;r/StableDiffusion&lt;/a&gt; subreddit and following Stability.ai founder &lt;a href="https://twitter.com/EMostaque"&gt;Emad Mostaque&lt;/a&gt; on Twitter.&lt;/p&gt;
&lt;h4&gt;img2img&lt;/h4&gt;
&lt;p&gt;Generating images from text is one thing, but generating images from other images is a whole new ballgame.&lt;/p&gt;
&lt;p&gt;My favourite example so far comes &lt;a href="https://www.reddit.com/r/StableDiffusion/comments/wzlmty/its_some_kind_of_black_magic_i_swear/"&gt;from Reddit user argaman123&lt;/a&gt;. They created this image:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-img2img.png" alt="A simple looking Microsoft Paint style image made of flat colours: a sky blue background, a rough yellow desert in the foreground, a semi-circle black line representing a half dome over five shapes in two shades of grey representing buildings inside the dome. A yellow circle represents the sun in the top right of the image, above the dome." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;And added this prompt (or "&lt;a href="https://www.reddit.com/r/StableDiffusion/comments/wzlmty/comment/im35vwf/?utm_source=reddit&amp;amp;utm_medium=web2x&amp;amp;context=3"&gt;something along those lines&lt;/a&gt;"):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A distant futuristic city full of tall buildings inside a huge transparent glass dome, In the middle of a barren desert full of large dunes, Sun rays, Artstation, Dark sky full of stars with a shiny sun, Massive scale, Fog, Highly detailed, Cinematic, Colorful&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The model produced the following two images:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-1.jpg" alt="A gorgeous image of a futuristic city under a glass domb, in front of a wind-swept desert. The composition matches the Microsoft Paint input, but everything is rendered in great detail." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-2.jpg" alt="A second image, similar to the first but this time the domb is composed of triangle sheets of glass. The composition remains the same." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;These are amazing. In my &lt;a href="https://simonwillison.net/2022/Jun/23/dall-e/"&gt;previous experiments with DALL-E&lt;/a&gt; I've tried to recreate photographs I have taken, but getting the exact composition I wanted has always proved impossible using just text. With this new capability I feel like I could get the AI to do pretty much exactly what I have in my mind.&lt;/p&gt;
&lt;p&gt;Imagine having an on-demand concept artist that can generate anything you can imagine, and can iterate with you towards your ideal result. For free (or at least for very-cheap).&lt;/p&gt;
&lt;p&gt;You can run this today on your own computer, if you can figure out how to set it up. You can &lt;a href="https://replicate.com/stability-ai/stable-diffusion"&gt;try it in your browser&lt;/a&gt; using Replicate, &lt;a href="https://huggingface.co/spaces/huggingface/diffuse-the-rest"&gt;or Hugging Face&lt;/a&gt;. This capability is apparently coming to the DreamStudio interface &lt;a href="https://twitter.com/EMostaque/status/1563632874091421697"&gt;next week&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's so much more going on.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/hlky/stable-diffusion-webui"&gt;stable-diffusion-webui&lt;/a&gt; is an open source UI you can run on your own machine providing a powerful interface to the model. &lt;a href="https://twitter.com/altryne/status/1563452692399214594"&gt;Here's a Twitter thread&lt;/a&gt; showing what it can do.&lt;/p&gt;
&lt;p&gt;Reddit user alpacaAI shared &lt;a href="https://old.reddit.com/r/StableDiffusion/comments/wyduk1/show_rstablediffusion_integrating_sd_in_photoshop/"&gt;a video demo&lt;/a&gt; of a Photoshop plugin they are developing which has to be seen to be believed. They have a registration form up on &lt;a href="https://www.getalpaca.io/"&gt;getalpaca.io&lt;/a&gt; for people who want to try it out once it's ready.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-alpaca.jpg" alt="A screenshot of Photoshop - a complex image on multiple layers is shown in the background. The user has open a dialog where they have entered the prompt &amp;quot;a dog seating on a path going up in a hill&amp;quot; - with modifiers of &amp;quot;studio ghibli::3&amp;quot;, &amp;quot;highly detailed::1&amp;quot;, &amp;quot;mang anime::1&amp;quot;, &amp;quot;cel-shading::1&amp;quot; and &amp;quot;game characters::1&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Reddit user Hoppss ran a 2D animated clip from Disney's Aladdin through &lt;code&gt;img2img&lt;/code&gt; frame-by-frame, using the following parameters:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--prompt "3D render" --strength 0.15 --seed 82345912 --n_samples 1 --ddim_steps 100 --n_iter 1 --scale 30.0 --skip_grid&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The result was &lt;a href="https://www.reddit.com/r/StableDiffusion/comments/wys3w5/applying_img2img_to_video_3d_aladdin/"&gt;a 3D animated video&lt;/a&gt;. Not a great quality one, but pretty stunning for a shell script and a two-word prompt!&lt;/p&gt;
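A frame-by-frame pipeline like that can be sketched as a loop that applies the same fixed parameters to every extracted frame. This is a hypothetical sketch: `img2img.py` and the `--init-img` flag stand in for the actual Stable Diffusion script, while the other flags are the ones quoted above.

```python
# Sketch of a frame-by-frame img2img pipeline. Hypothetical: "img2img.py"
# and "--init-img" stand in for the real Stable Diffusion script. A fixed
# seed and low strength keep the output consistent from frame to frame.
FLAGS = [
    "--prompt", "3D render",
    "--strength", "0.15",
    "--seed", "82345912",
    "--n_samples", "1",
    "--ddim_steps", "100",
    "--n_iter", "1",
    "--scale", "30.0",
    "--skip_grid",
]

def build_cmd(frame_path):
    """Build the command line for a single extracted video frame."""
    return ["python", "img2img.py", "--init-img", frame_path, *FLAGS]

# One command per frame; run these in order, then reassemble the frames.
cmd = build_cmd("frames/frame_0001.png")
```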
&lt;p&gt;The best description I've seen so far of an iterative process to build up an image using Stable Diffusion comes from Andy Salerno: &lt;a href="https://andys.page/posts/how-to-draw/"&gt;4.2 Gigabytes, or: How to Draw Anything&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Ben Firshman has published detailed instructions on how to &lt;a href="https://replicate.com/blog/run-stable-diffusion-on-m1-mac"&gt;Run Stable Diffusion on your M1 Mac’s GPU&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;And there's so much more to come&lt;/h4&gt;
&lt;p&gt;All of this happened in just six days since the model release. Emad Mostaque &lt;a href="https://twitter.com/EMostaque/status/1564011883572633600"&gt;on Twitter&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We use as much compute as stable diffusion used every 36 hours for our upcoming open source models&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This made me think of &lt;a href="https://parti.research.google/"&gt;Google's Parti paper&lt;/a&gt;, which included a demonstration that showed that once the model was trained to 20bn parameters it could generate images with correctly spelled text!&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/stable-diffusion-parti.jpg" alt="Four images of a kangaroo holding a sign generated by Parti. In the 350M and 750M parameter images the text on the sign is garbage symbols. At 3B parameters it does at least look like words, but is still not correct. At 20B parametecs the sign reads &amp;quot;Welcome friends&amp;quot;." style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;h4 id="ai-vegan"&gt;Ethics: will you be an AI vegan?&lt;/h4&gt;
&lt;p&gt;I'm finding the ethics of all of this extremely difficult.&lt;/p&gt;
&lt;p&gt;Stable Diffusion has been trained on millions of copyrighted images scraped from the web.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a0f6ee1af8bb9a5d0016ccb27e36dc/Stable_Diffusion_v1_Model_Card.md"&gt;Stable Diffusion v1 Model Card&lt;/a&gt; has the full details, but the short version is that it uses &lt;a href="https://laion.ai/blog/laion-5b/"&gt;LAION-5B&lt;/a&gt; (5.85 billion image-text pairs) and its &lt;a href="https://laion.ai/blog/laion-aesthetics/"&gt;laion-aesthetics v2 5+&lt;/a&gt; subset (which I think is ~600M pairs filtered for aesthetics). These images were scraped from the web.&lt;/p&gt;
&lt;p&gt;I'm not qualified to speak to the legality of this. I'm personally more concerned with the morality.&lt;/p&gt;
&lt;p&gt;The final model is I believe around 4.2GB of data - a binary blob of floating point numbers. The fact that it can compress such an enormous quantity of visual information into such a small space is itself a fascinating detail.&lt;/p&gt;
&lt;p&gt;As such, each image in the training set contributes only a tiny amount of information - a few tweaks to some numeric weights spread across the entire network.&lt;/p&gt;
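That claim survives a back-of-envelope check: divide the model size by the headline LAION-5B image count. (The later training stages used smaller filtered subsets, so the true per-image share is somewhat higher than this.)

```python
# Back-of-envelope: how much do the final weights "contain" per training
# image? Uses the headline LAION-5B count; the actual training runs used
# smaller filtered subsets, so this understates the real per-image figure.
model_bytes = 4.2 * 1024**3   # ~4.2 GB of weights
images = 5.85e9               # LAION-5B image-text pairs

bytes_per_image = model_bytes / images
print(f"{bytes_per_image:.2f} bytes per image")  # well under one byte each
```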
&lt;p&gt;But... the people who created these images did not give their consent. And the model can be seen as a direct threat to their livelihoods. No-one expected creative AIs to come for the artist jobs first, but here we are!&lt;/p&gt;
&lt;p&gt;I'm still thinking through this, and I'm eager to consume more commentary about it. But my current mental model is to think about this in terms of veganism, as an analogy for people making their own personal ethical decisions.&lt;/p&gt;
&lt;p&gt;I know many vegans. They have access to the same information as I do about the treatment of animals, and they have made informed decisions about their lifestyle, which I fully respect.&lt;/p&gt;
&lt;p&gt;I myself remain a meat-eater.&lt;/p&gt;
&lt;p&gt;There will be many people who will decide that the AI models trained on copyrighted images are incompatible with their values. I understand and respect that decision.&lt;/p&gt;
&lt;p&gt;But when I look at that img2img example of the futuristic city in the dome, I can't resist imagining what I could do with that capability.&lt;/p&gt;
&lt;p&gt;If someone were to create &lt;a href="https://twitter.com/simonw/status/1563201333821288452"&gt;a vegan model&lt;/a&gt;, trained entirely on out-of-copyright images, I would be delighted to promote it and try it out. If its results were good enough, I might even switch to it entirely.&lt;/p&gt;
&lt;h4&gt;Understanding the training data&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: 30th August 2022. Andy Baio and I worked together on a deep dive into the training data behind Stable Diffusion. Andy wrote up some of our findings in &lt;a href="https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/"&gt;Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Indistinguishable from magic&lt;/h4&gt;
&lt;p&gt;Just a few months ago, if I'd seen someone on a fictional TV show using an interface like that Photoshop plugin I'd have grumbled about how that was a step too far even by the standards of American network TV dramas.&lt;/p&gt;
&lt;p&gt;Science fiction is real now. Machine learning generative models are here, and the rate with which they are improving is unreal. It's worth paying real attention to what they can do and how they are developing.&lt;/p&gt;
&lt;p&gt;I'm tweeting about this stuff a lot these days. &lt;a href="https://twitter.com/simonw"&gt;Follow @simonw on Twitter&lt;/a&gt; for more.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ethics"&gt;ethics&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/laion"&gt;laion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-ethics"&gt;ai-ethics&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ethics"/><category term="machine-learning"/><category term="ai"/><category term="dalle"/><category term="stable-diffusion"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="laion"/><category term="text-to-image"/><category term="ai-ethics"/></entry><entry><title>Stable Diffusion Public Release</title><link href="https://simonwillison.net/2022/Aug/22/stable-diffusion-public-release/#atom-tag" rel="alternate"/><published>2022-08-22T19:12:43+00:00</published><updated>2022-08-22T19:12:43+00:00</updated><id>https://simonwillison.net/2022/Aug/22/stable-diffusion-public-release/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://stability.ai/blog/stable-diffusion-public-release"&gt;Stable Diffusion Public Release&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
New AI just dropped. Stable Diffusion is similar to DALL-E, but completely open source and with a CC0 license applied to everything it generates. I have a Twitter thread (the via link) of comparisons I’ve made between its output and my previous DALL-E experiments. The announcement buries the lede somewhat: to try it out, visit beta.dreamstudio.ai—which you can use for free at the moment, but it’s unclear to me how billing is supposed to work.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/simonw/status/1561780550825783296"&gt;@simonw&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stable-diffusion"&gt;stable-diffusion&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;&lt;/p&gt;



</summary><category term="machine-learning"/><category term="dalle"/><category term="stable-diffusion"/><category term="generative-ai"/><category term="text-to-image"/></entry><entry><title>Litestream backups for Datasette Cloud (and weeknotes)</title><link href="https://simonwillison.net/2022/Aug/11/litestream-backups/#atom-tag" rel="alternate"/><published>2022-08-11T17:19:52+00:00</published><updated>2022-08-11T17:19:52+00:00</updated><id>https://simonwillison.net/2022/Aug/11/litestream-backups/#atom-tag</id><summary type="html">
    &lt;p&gt;My main focus this week has been adding robust backups to the forthcoming Datasette Cloud.&lt;/p&gt;
&lt;p&gt;Datasette Cloud is a SaaS service for &lt;a href="https://datasette.io"&gt;Datasette&lt;/a&gt;. It allows people to create a private Datasette instance where they can upload data, visualize and transform it and share it with other members of their team. You can join the waiting list to try it out using &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLSeMRYHBHXlWGDkjCqhAOinDrKEVwqNR5GfPs3iEXn8LzDT2Qg/viewform?embedded=true"&gt;this form&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I'm building Datastte Cloud on &lt;a href="https://fly.io/"&gt;Fly&lt;/a&gt;, specifically on &lt;a href="https://fly.io/blog/fly-machines/"&gt;Fly Machines&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Security is a big concern for Datasette Cloud. Teams should only be able to access their own data - bugs where users accidentally (or maliciously) access data for another team should be protected against as much as possible.&lt;/p&gt;
&lt;p&gt;To help guarantee that, I've designed Datasette Cloud so that each team gets their own, dedicated instance, running in a &lt;a href="https://firecracker-microvm.github.io/"&gt;Firecracker VM&lt;/a&gt; managed by Fly. Their data lives in a dedicated volume.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fly.io/docs/reference/volumes/"&gt;Fly volumes&lt;/a&gt; already implement snapshot backups, but I'm interested in defence in depth. This is where &lt;a href="https://litestream.io/"&gt;Litestream&lt;/a&gt; comes in (coincidentally now &lt;a href="https://fly.io/blog/all-in-on-sqlite-litestream/"&gt;part of Fly&lt;/a&gt;, although it wasn't when I first selected it as my backup strategy).&lt;/p&gt;
&lt;p&gt;I'm using Litestream to constantly backup the data for each Datasette Cloud team to an S3 bucket. In the case of a complete failure of a volume, I can restore data from a backup that should be at most a few seconds out of date. Litestream also gives me point-in-time backups, such that I can recover a previous version of the data within a configurable retention window.&lt;/p&gt;
&lt;h4&gt;Keeping backups isolated&lt;/h4&gt;
&lt;p&gt;Litestream &lt;a href="https://litestream.io/how-it-works/"&gt;works&lt;/a&gt; by writing a constant stream of pages from SQLite's WAL (Write-Ahead Log) up to an S3 bucket. It needs the ability to both read and write from S3.&lt;/p&gt;
&lt;p&gt;This requires making S3 credentials available within the containers that run Datasette and Litestream for each team account.&lt;/p&gt;
&lt;p&gt;Credentials in those containers are not visible to the users of the software, but I still wanted to be confident that if the credentials leaked in some way the isolation between teams would be maintained.&lt;/p&gt;
&lt;p&gt;Initially I thought about having a separate S3 bucket for each team, but it turns out AWS has a default limit of 100 buckets per account, and a hard limit of 1,000. I aspire to have more than 1,000 customers, so this limit makes a bucket-per-team seem like the wrong solution.&lt;/p&gt;
&lt;p&gt;I've learned an absolute ton about S3 and AWS permissions building my &lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt; tool for creating credentials for accessing S3.&lt;/p&gt;
&lt;p&gt;One of the tricks I've learned is that it's possible to create temporary, time-limited credentials that &lt;a href="https://simonwillison.net/2022/Jan/18/weeknotes/"&gt;only work for a prefix&lt;/a&gt; (effectively a folder) within an S3 bucket.&lt;/p&gt;
&lt;p&gt;This means I can run Litestream with credentials that are specific to the team - that can read and write only from the &lt;code&gt;team-ID/&lt;/code&gt; prefix in the S3 bucket I am using to store the backups.&lt;/p&gt;
&lt;h4&gt;Obtaining temporary credentials&lt;/h4&gt;
&lt;p&gt;My &lt;code&gt;s3-credentials&lt;/code&gt; tool can create credentials for a prefix within an S3 bucket like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3-credentials create my-bucket-for-backus \
  --duration 12h \
  --prefix team-56/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command uses the &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts.html#STS.Client.assume_role"&gt;sts.assume_role()&lt;/a&gt; AWS method to create credentials that allow access to that bucket, attaching &lt;a href="https://github.com/simonw/s3-credentials/blob/0.12.1/README.md#--prefix-my-prefix"&gt;this generated JSON policy&lt;/a&gt; to it in order to restrict access to the provided prefix.&lt;/p&gt;
&lt;p&gt;I extracted the relevant Python code from  &lt;code&gt;s3-credentials&lt;/code&gt; and used it to create a private API endpoint in my Datasette Cloud management server which could return the temporary credentials needed by the team container.&lt;/p&gt;
&lt;p&gt;With the endpoint in place, my code for launching a team container can do this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create the volume and machine for that team (if they do not yet exist)&lt;/li&gt;
&lt;li&gt;Generate a signed secret token that the machine container can exchange for its S3 credentials&lt;/li&gt;
&lt;li&gt;Launch the machine container, passing it the secret token&lt;/li&gt;
&lt;li&gt;On launch, the container runs a script which exchanges that secret token for its 12 hour S3 credentials, using the private API endpoint I created&lt;/li&gt;
&lt;li&gt;Those credentials are used to populate the &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt;, &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; and &lt;code&gt;AWS_SESSION_TOKEN&lt;/code&gt; environment variables used by Litestream&lt;/li&gt;
&lt;li&gt;Start Litestream, which then starts Datasette&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Restarting every 12 hours&lt;/h4&gt;
&lt;p&gt;You may be wondering why I bothered with that initial secret token - why not just pass the temporary AWS credentials to the container when I launch it?&lt;/p&gt;
&lt;p&gt;The reason for this is that I need to be able to obtain fresh credentials every 12 hours.&lt;/p&gt;
&lt;p&gt;A really neat feature of Fly Machines is that they support scale-to-zero. You can stop them, and Fly will automatically restart them the next time they recieve traffic.&lt;/p&gt;
&lt;p&gt;All you need to do is call &lt;code&gt;sys.exit(0)&lt;/code&gt; in your Python code (or the equivalent in any other language) and Fly will stop your container... and then restart it again with a couple of seconds of cold start time the next time an HTTP request for your container hits the Fly router.&lt;/p&gt;
&lt;p&gt;So far I'm mainly using this to avoid the cost of running containers when they aren't actually in- use. But there's a neat benefit when it comes to Litestream too.&lt;/p&gt;
&lt;p&gt;I'm using S3 credentials which expire after 12 hours. This means I need to periodically refresh the credentials and restart Litestream or it will stop being able to write to the S3 bucket.&lt;/p&gt;
&lt;p&gt;After considering a few ways of doing this, I selected the simplest to implement: have Datasette call &lt;code&gt;sys.exit(0)&lt;/code&gt; after ten hours, and let Fly restart the container causing my startup script to fetch freshly generated 12 hour credentials and pass them to Litestream.&lt;/p&gt;
&lt;p&gt;I implemented this by adding it as a new setting to my existing &lt;a href="https://datasette.io/plugins/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt; plugin. You can now configure that with &lt;code&gt; "max-age": "10h"&lt;/code&gt; and it will shut down Datasette once the server has been running for that long.&lt;/p&gt;
&lt;p&gt;Why does this require my own secret token system? Because when the container is restarted, it needs to make an authenticated call to my endpoint to retrieve those fresh S3 credentials. Fly persists environment variable secrets between restarts to the container, so that secret can be long-lived even while it is exchanged for short-term S3 credentials.&lt;/p&gt;
&lt;p&gt;I only just put the new backup system in place, so I'm exercising it a bit before I open things up to trial users - but so far it's looking like a very robust solution to the problem.&lt;/p&gt;
&lt;h4&gt;s3-ocr improvements&lt;/h4&gt;
&lt;p&gt;I released a &lt;a href="https://datasette.io/tools/s3-ocr"&gt;few new versions&lt;/a&gt; of &lt;a href="https://simonwillison.net/2022/Jun/30/s3-ocr/"&gt;s3-ocr&lt;/a&gt; this week, as part of my ongoing project working with the San Francisco Microscopical Society team to release a searchable version of their scanned document archives.&lt;/p&gt;
&lt;p&gt;The two main improvements are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A new &lt;code&gt;--dry-run&lt;/code&gt; option to &lt;code&gt;s3-ocr start&lt;/code&gt; which shows you what the tool will do without making any changes to your S3 bucket, or triggering any OCR jobs. &lt;a href="https://github.com/simonw/s3-ocr/issues/22"&gt;#22&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3-ocr start&lt;/code&gt; used to fail with an error if running it would create more than 100 (or 600 depending on your region) concurrent OCR jobs. The tool now knows how to identify that error and pause and retry starting the jobs instead. &lt;a href="https://github.com/simonw/s3-ocr/issues/21"&gt;#21&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The fix that took the most time is this: installations of the tool no longer arbitrarily fail to work depending on the environment you install them into!&lt;/p&gt;
&lt;p&gt;Solving this took me the best part of a day. The short version is this: Click 8.1.0 introduced a new feature that lets you use &lt;code&gt;@cli.command&lt;/code&gt; as a decorator instead of &lt;code&gt;@cli.command()&lt;/code&gt;. This meant that installing &lt;code&gt;s3-ocr&lt;/code&gt; in an environment that already had a previous version of Click would result in silent errors.&lt;/p&gt;
&lt;p&gt;The solution is simple: pin to &lt;code&gt;click&amp;gt;=8.1.0&lt;/code&gt; in the project dependencies if you plan to use this new syntax.&lt;/p&gt;
&lt;p&gt;If I'd read the Click &lt;a href="https://click.palletsprojects.com/en/8.1.x/changes/#version-8-1-0"&gt;changelog more closely&lt;/a&gt; I would have saved myself a whole lot of time.&lt;/p&gt;
&lt;p&gt;Issues &lt;a href="https://github.com/simonw/s3-ocr/issues/25"&gt;#25&lt;/a&gt; and &lt;a href="https://github.com/simonw/s3-ocr/issues/26"&gt;#26&lt;/a&gt; detail the many false turns I took trying to figure this out.&lt;/p&gt;
&lt;h4&gt;More fun with GPT-3 and DALL-E&lt;/h4&gt;
&lt;p&gt;This tweet scored over a million impressions on Twitter:&lt;/p&gt;

&lt;blockquote class="twitter-tweet"&gt;&lt;p lang="en" dir="ltr"&gt;New hobby: prototyping video games in 60 seconds using a combination of GPT-3 and DALL-E&lt;br /&gt;&lt;br /&gt;Here&amp;#39;s &amp;quot;Raccoon Heist&amp;quot; &lt;a href="https://t.co/xQ3Vm8p2XW"&gt;pic.twitter.com/xQ3Vm8p2XW&lt;/a&gt;&lt;/p&gt;- Simon Willison (@simonw) &lt;a href="https://twitter.com/simonw/status/1555626060384911360?ref_src=twsrc%5Etfw"&gt;August 5, 2022&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;As this got retweeted outside of my usual circles it started confusing people who thought the "prototype" was a working game, as opposed to a fake screenshot and a paragraph of descriptive text! I wasn't kidding when I said I spent 60 seconds on this.&lt;/p&gt;
&lt;p&gt;I also figured out how to use GPT-3 to write &lt;code&gt;jq&lt;/code&gt; one-liners. I love &lt;code&gt;jq&lt;/code&gt; but I have to look up how to use it every time, so having GPT-3 do the work for me is a pretty neat time saver. More on that in this TIL: &lt;a href="https://til.simonwillison.net/gpt3/jq"&gt;Using GPT-3 to figure out jq recipes&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-ocr"&gt;s3-ocr&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-ocr/releases/tag/0.6.3"&gt;0.6.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-ocr/releases"&gt;9 releases total&lt;/a&gt;) - 2022-08-10
&lt;br /&gt;Tools for running OCR against files stored in S3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-scale-to-zero"&gt;datasette-scale-to-zero&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases/tag/0.2"&gt;0.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-scale-to-zero/releases"&gt;4 releases total&lt;/a&gt;) - 2022-08-05
&lt;br /&gt;Quit Datasette if it has not received traffic for a specified time period&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/shot-scraper"&gt;shot-scraper&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/shot-scraper/releases/tag/0.14.3"&gt;0.14.3&lt;/a&gt; - (&lt;a href="https://github.com/simonw/shot-scraper/releases"&gt;18 releases total&lt;/a&gt;) - 2022-08-02
&lt;br /&gt;A command-line utility for taking automated screenshots of websites&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.12.1"&gt;0.12.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;13 releases total&lt;/a&gt;) - 2022-08-01
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-sqlite-fts4"&gt;datasette-sqlite-fts4&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette-sqlite-fts4/releases/tag/0.3.2"&gt;0.3.2&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette-sqlite-fts4/releases"&gt;2 releases total&lt;/a&gt;) - 2022-07-31&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/related-content"&gt;Related content with SQLite FTS and a Datasette template function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/boto-command-line"&gt;Using boto3 from the command line&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/trying-macos-extensions"&gt;Trying out SQLite extensions on macOS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pytest/mocking-boto"&gt;Mocking a Textract LimitExceededException with boto&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/gpt3/jq"&gt;Using GPT-3 to figure out jq recipes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ocr"&gt;ocr&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/s3"&gt;s3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-cloud"&gt;datasette-cloud&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/litestream"&gt;litestream&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpt-3"&gt;gpt-3&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ocr"/><category term="s3"/><category term="datasette"/><category term="weeknotes"/><category term="datasette-cloud"/><category term="fly"/><category term="litestream"/><category term="gpt-3"/><category term="dalle"/></entry><entry><title>How I Used DALL·E 2 to Generate The Logo for OctoSQL</title><link href="https://simonwillison.net/2022/Aug/2/how-i-used-dalle-2-to-generate-the-logo-for-octosql/#atom-tag" rel="alternate"/><published>2022-08-02T21:12:46+00:00</published><updated>2022-08-02T21:12:46+00:00</updated><id>https://simonwillison.net/2022/Aug/2/how-i-used-dalle-2-to-generate-the-logo-for-octosql/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jacobmartins.com/posts/how-i-used-dalle2-to-generate-the-logo-for-octosql/"&gt;How I Used DALL·E 2 to Generate The Logo for OctoSQL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Jacob Martin gives a blow-by-blow account of his attempts at creating a logo for his OctoSQL project using DALL-E, spending $30 of credits and making extensive use of both the “variations” feature and the tool that lets you request modifications to existing images by painting over parts you want to regenerate. Really interesting to read as an example of a “real world” DALL-E project.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=32320491"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="openai"/><category term="dalle"/><category term="generative-ai"/></entry><entry><title>The DALL·E 2 Prompt Book</title><link href="https://simonwillison.net/2022/Jul/14/the-dalle-2-prompt-book/#atom-tag" rel="alternate"/><published>2022-07-14T23:26:55+00:00</published><updated>2022-07-14T23:26:55+00:00</updated><id>https://simonwillison.net/2022/Jul/14/the-dalle-2-prompt-book/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://dallery.gallery/the-dalle-2-prompt-book/"&gt;The DALL·E 2 Prompt Book&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
This is effectively DALL-E: The Missing Manual: an 81 page PDF book that goes into exhaustive detail about how to get the most out of DALL-E through creative prompt design.

    &lt;p&gt;&lt;small&gt;&lt;/small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=32088718"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="openai"/><category term="dalle"/><category term="prompt-engineering"/><category term="generative-ai"/></entry><entry><title>First impressions of DALL-E, generating images from text</title><link href="https://simonwillison.net/2022/Jun/23/dall-e/#atom-tag" rel="alternate"/><published>2022-06-23T23:05:56+00:00</published><updated>2022-06-23T23:05:56+00:00</updated><id>https://simonwillison.net/2022/Jun/23/dall-e/#atom-tag</id><summary type="html">
    &lt;p&gt;I made it off the &lt;a href="https://labs.openai.com/waitlist"&gt;DALL-E waiting list&lt;/a&gt; a few days ago and I've been having an enormous amount of fun experimenting with it. Here are some notes on what I've learned so far (and a bunch of example images too).&lt;/p&gt;
&lt;p&gt;(For those not familiar with it, &lt;a href="https://openai.com/blog/dall-e/"&gt;DALL-E&lt;/a&gt; is OpenAI's advanced text-to-image generator: you feed it a prompt, it generates images. It's extraordinarily good at it.)&lt;/p&gt;
&lt;p&gt;First, a warning: DALL-E only allows you to generate up to 50 images a day. I found this out only when I tried to generate image number 51. So there's a budget to watch out for.&lt;/p&gt;
&lt;p&gt;I've usually run out by lunch time!&lt;/p&gt;
&lt;h4&gt;How to use DALL-E&lt;/h4&gt;
&lt;p&gt;DALL-E is even simpler to use &lt;a href="https://simonwillison.net/2022/Jun/5/play-with-gpt3/"&gt;than GPT-3&lt;/a&gt;: you get a text box to type in, and that's it. There are no advanced settings to tweak.&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-interface-2.png" alt="A label says &amp;quot;Start with a detailed description&amp;quot; - there is also a &amp;quot;Surprise me&amp;quot; button. The text box has the grayed out suggestion text &amp;quot;An impressionist oil painting of sunflowers in a puple vase.&amp;quot; There is also a Generate button, and the text &amp;quot;Or upload an image to edit&amp;quot;" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;It does have one other mode: you can upload your own photo, crop it to a square and then erase portions of it and ask DALL-E to fill them in with a prompt. This feature is clearly still in the early stages - I've not had great results with it yet.&lt;/p&gt;
&lt;p&gt;DALL-E always returns six resulting images, which I believe it has selected as the "best" from hundreds of potential results.&lt;/p&gt;
&lt;h4&gt;Tips on prompts&lt;/h4&gt;
&lt;p&gt;DALL-E's initial label suggests to "Start with a detailed description". This is very good advice!&lt;/p&gt;
&lt;p&gt;The more detail you provide, the more interesting DALL-E gets.&lt;/p&gt;
&lt;p&gt;If you type "Pelican", you'll get an image that is indistinguishable from what you might get from something like Google Image search. But the more details you ask for, the more interesting and fun the result.&lt;/p&gt;
&lt;h4&gt;Fun with pelicans&lt;/h4&gt;
&lt;p&gt;Here's "A ceramic pelican in a Mexican folk art style with a big cactus growing out of it":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pelican-cactus.jpg" alt="A ceramic pelican in a Mexican folk art style with a big cactus growing out of it - the image looks exactly like that, it's very impressive" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Some of the most fun results you can have come from providing hints as to a medium or art style you would like. Here's "A heavy metal album cover where the band members are all pelicans... made of lightning":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pelican-album-cover.jpg" alt="A heavy metal album cover where the band members are all pelicans... made of lightning - except none of the pelicans are made of lightning. The text at the top reads PLENY HLAN" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This illustrates a few interesting points. Firstly, DALL-E is hilariously bad at any images involving text. It can make things that look like letters and words but it has no concept of actual writing.&lt;/p&gt;
&lt;p&gt;My initial prompt was for "A death metal album cover..." - but DALL-E refused to generate that. It has a filter to prevent people from generating images that go outside its &lt;a href="https://labs.openai.com/policies/content-policy"&gt;content policy&lt;/a&gt;, and the word "death" triggered it.&lt;/p&gt;
&lt;p&gt;(I'm confident that the filter can be easily avoided, but I don't want to have my access revoked so I haven't spent any time pushing its limits.)&lt;/p&gt;
&lt;p&gt;It's also not a great result - those pelicans are not made of lightning! I tried a tweaked prompt:&lt;/p&gt;
&lt;p&gt;"A heavy metal album cover where the band members are all pelicans that are made of lightning":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pelican-album-cover-six.jpg" alt="A heavy metal album cover where the band members are all pelicans that are made of lightning - six images, all very heavy metal but none of them where the birds are made of lightning, though two have lightning in the background now" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Still not made of lightning. One more try:&lt;/p&gt;
&lt;p&gt;"pelican made of lightning":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pelicans-made-of-lightning.jpg" alt="Six images of pelicans - they are all made of lightning this time, but they don't look great." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Let's try the universal DALL-E cheat code, adding "digital art" to the prompt.&lt;/p&gt;
&lt;p&gt;"a pelican made of lightning, digital art"&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pelicans-made-of-lightning-digital-art.jpg" alt="Six images of pelicans - they are all made of lightning this time, and they look pretty cool" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;OK, those look a lot better!&lt;/p&gt;
&lt;p&gt;One last try - the earlier prompt but with "digital art" added.&lt;/p&gt;
&lt;p&gt;"A heavy metal album cover where the band members are all pelicans that are made of lightning, digital art":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pelicans-album-digital-art.jpg" alt="These are really cool images of pelicans with lightning - though again, they aren't really made of lightning. Also there's no album text any more." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;OK, these are cool. The text is gone - maybe the "digital art" influence over-rode the "album cover" a tiny bit there.&lt;/p&gt;
&lt;p&gt;This process is a good example of "prompt engineering" - feeding in altered prompts to try to iterate towards a better result. This is a very deep topic, and I'm confident I've only just scratched the surface of it.&lt;/p&gt;
&lt;p&gt;Breaking away from album art, here's "A squadron of pelicans having a tea party in a forest with a raccoon, digital art". Often when you specify "digital art" it picks some other additional medium:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pelican-tea-party.jpg" alt="A beautiful painting. A racoon sits in the foreground at a little table in the forest. He is surrounded by pelicans, one of which is pouring a drink from a half-bucket-half-teapot." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Recreating things you see&lt;/h4&gt;
&lt;p&gt;A fun game I started to play with DALL-E was to see if I could get it to recreate things I saw in real life.&lt;/p&gt;
&lt;p&gt;My dog, Cleo, was woofing at me for breakfast. I took this photo of her:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/cleo-sitting.jpg" alt="A medium sized black pitbull mix sitting on a hardwood floor" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Then I tried this prompt: "A medium sized black dog who is a pit bull mix sitting on the ground wagging her tail and woofing at me on a hardwood floor"&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-cleo-sitting.jpg" alt="A medium sized black pitbull mix sitting on a hardwood floor" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;OK, wow.&lt;/p&gt;
&lt;p&gt;Later, I caught her napping on the bed:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/cleo-napping.jpg" alt="A medium sized black pitbull mix curled up asleep on a green duvet" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Here's DALL-E for "A medium sized black pit bull mix curled up asleep on a dark green duvet cover":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-cleo-napping-bed.jpg" alt="A medium sized black pit bull mix curled up asleep on a dark green duvet cover - a very good image" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;One more go at that. Our chicken Cardi snuck into the house and snuggled up on the sofa. Before I evicted her back into the garden I took this photo:&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/chicken-sofa.jpg" alt="a black and white speckled chicken with a red comb snuggled on a blue sofa next to a cushion with a blue seal pattern and a blue and white knitted blanket" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;"a black and white speckled chicken with a red comb snuggled on a blue sofa next to a cushion with a blue seal pattern and a blue and white knitted blanket":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-chicken-sofa.jpg" alt="Six images that fit the brief, though the cushions don't have the pattern and the camera zoomed in much closer on the chicken than in the original" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Clearly I didn't provide a detailed enough prompt here! I would need to iterate on this one a lot.&lt;/p&gt;
&lt;h4&gt;Stained glass&lt;/h4&gt;
&lt;p&gt;DALL-E is great at stained glass windows.&lt;/p&gt;
&lt;p&gt;"Pelican in a waistcoat as a stained glass window":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-stained-glass-pelican.jpg" alt="A really cool stained glass window design of a pelican, though it is not wearing a waistcoat" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;"A stained glass window depicting 5 different &lt;a href="https://en.wikipedia.org/wiki/Nudibranch"&gt;nudibranchs&lt;/a&gt;"&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-stained-glass-nudibranchs.jpg" alt="5 different nudibranchs in stained glass - really good" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;People&lt;/h4&gt;
&lt;p&gt;DALL-E is (understandably) quite careful about depictions of people. It won't let you upload images with recognisable faces in them, and when you ask for a prompt with a famous person it will sometimes pull off tricks like showing them from behind.&lt;/p&gt;
&lt;p&gt;Here's "The pope on a bicycle leading a bicycle race through Paris":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pope.jpg" alt="A photo of the pope on a bicycle, taken from behind, with a blurred out Paris street in the background" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Though maybe it was the "leading a bicycle race" part that inspired it to draw the image from this direction? I'm not sure.&lt;/p&gt;
&lt;p&gt;It will sometimes generate made-up people with visible faces, but they &lt;a href="https://openai.com/blog/dall-e-2-update/"&gt;ask users not to share those images&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Assorted images&lt;/h4&gt;
&lt;p&gt;Here are a bunch of images that I liked, with their prompts.&lt;/p&gt;
&lt;p&gt;Inspired by one of our chickens:&lt;/p&gt;
&lt;p&gt;"A blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree"&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-chicken.jpg" alt="A blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I asked it for the same thing, painted by Salvador Dali:&lt;/p&gt;
&lt;p&gt;"A blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree, painted by Salvador Dali":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-dali-chickens.jpg" alt="Three paintings of a blue-grey fluffy chicken puffed up and looking angry perched under a lemon tree, in the style of Salvador Dali" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;"Bats having a quinceañera, digital art":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-bats.jpg" alt="Three bats with pink ears, one is wearing a pink dress" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;"The scene in an Agatha Christie mystery where the e detective reveals who did it, but everyone is a raccoon. Digital art.":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-agatha-christie.jpg" alt="This one is in more of a cartoon style. The raccoon stands in front, and four people in period clothes stand in the background, one of them with a magnifying glass." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;(It didn't make everyone a raccoon. It also refused my initial prompt where I asked for an Agatha Christie murder mystery, presumably because of the word "murder".)&lt;/p&gt;
&lt;p&gt;"An acoustic guitar decorated with capybaras in Mexican folk art style, sigma 85mm":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-guitar.jpg" alt="A close-up shot of an acoustic guitar with some capybaras painted on it." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Adding "sigma 85mm" (and various other mm lengths) is a trick I picked up which gives you realistic images that tend to be cropped well.&lt;/p&gt;
&lt;p&gt;"A raccoon wearing glasses and reading a poem at a poetry evening, sigma 35mm":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-racoon-poetry.jpg" alt="A very convincing photograph of a racoon wearing glasses reading from a book, with a blurry background" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;"Pencil sketch of a Squirrel reading a book":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pencil-squirrel.jpg" alt="A just gorgeous pencil sketch of a squirrel reading a book" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Pencil sketches come out fantastically well.&lt;/p&gt;
&lt;p&gt;"The royal pavilion in brighton covered in snow"&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-pavillion-snow.jpg" alt="The royal pavilion in brighton covered in snow - the windows look a bit weird" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;I experienced this once, many years ago when I lived in Brighton - but forgot to take a photo of it. It looked exactly like this.&lt;/p&gt;
&lt;h4 id="fantasy-breakfast-tacos"&gt;And a game: fantasy breakfast tacos&lt;/h4&gt;
&lt;p&gt;It's difficult to overstate how much fun playing with this stuff is. Here's a game I came up with: fantasy breakfast tacos. See how tasty a taco you can invent!&lt;/p&gt;
&lt;p&gt;Mine was "breakfast tacos with lobster, steak, salmon, sausages and three different sauces":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-meat-taco.jpg" alt="A really delicious assortment of tacos" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Natalie is a vegetarian, which I think puts her at a disadvantage in this game. "breakfast taco containing cauliflower, cheesecake, tomatoes, eggs, flowers":&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://static.simonwillison.net/static/2022/dall-e-veggie-taco.jpg" alt="A really delicious assortment of tacos" style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;h4&gt;Closing thoughts&lt;/h4&gt;
&lt;p&gt;As you can see, I have been enjoying playing with this a LOT. I could easily share twice as much - the above are just the highlights from my experiments so far.&lt;/p&gt;
&lt;p&gt;The obvious question raised by this is how it will affect people who generate art and design for a living. I don't have anything useful to say about that, other than recommending that they make themselves familiar with the capabilities of these kinds of tools - which have taken an astonishing leap forward in the past few years.&lt;/p&gt;
&lt;p&gt;My current mental model of DALL-E is that it's a fascinating tool for enhancing my imagination. Being able to imagine something and see it visualized a few seconds later is an extraordinary new ability.&lt;/p&gt;
&lt;p&gt;I haven't yet figured out how to apply this to real world problems that I face - my attempts at getting DALL-E to generate website wireframes or explanatory illustrations have been unusable so far - but I'll keep on experimenting with it. Especially since feeding it prompts is just so much fun.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/prompt-engineering"&gt;prompt-engineering&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/text-to-image"&gt;text-to-image&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="machine-learning"/><category term="ai"/><category term="openai"/><category term="dalle"/><category term="prompt-engineering"/><category term="generative-ai"/><category term="text-to-image"/></entry><entry><title>How Imagen Actually Works</title><link href="https://simonwillison.net/2022/Jun/23/how-imagen-actually-works/#atom-tag" rel="alternate"/><published>2022-06-23T18:05:58+00:00</published><updated>2022-06-23T18:05:58+00:00</updated><id>https://simonwillison.net/2022/Jun/23/how-imagen-actually-works/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.assemblyai.com/blog/how-imagen-actually-works/"&gt;How Imagen Actually Works&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Imagen is Google’s new text-to-image model, similar to (but possibly even more effective than) DALL-E. This article is the clearest explanation I’ve seen of how Imagen works: it uses Google’s existing T5 text encoder to convert the input sentence into an encoding that captures the semantic meaning of the sentence (including things like items being described as being on top of other items), then uses a trained diffusion model to generate a 64x64 image. That image is passed through two super-res models to increase the resolution to the final 1024x1024 output.
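&lt;p&gt;The cascade described above can be sketched schematically. This is purely illustrative: every function here is a stand-in placeholder for a large trained network (the frozen T5 encoder, the base diffusion model, and the two diffusion upsamplers), and only the shapes flowing between stages are meant to be accurate.&lt;/p&gt;

```python
import numpy as np

def encode_text(prompt: str) -> np.ndarray:
    # Placeholder for the frozen T5 text encoder: in Imagen this produces
    # embeddings capturing the semantic meaning of the sentence.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(768)

def base_diffusion(embedding: np.ndarray) -> np.ndarray:
    # Placeholder for the text-conditioned diffusion model that generates
    # the initial low-resolution 64x64 RGB image.
    rng = np.random.default_rng(abs(int(embedding[0] * 1e6)) % (2**32))
    return rng.random((64, 64, 3))

def super_res(image: np.ndarray, factor: int) -> np.ndarray:
    # Placeholder for one super-resolution stage: Imagen uses two diffusion
    # upsamplers; here we just repeat pixels to show the shape changes.
    return image.repeat(factor, axis=0).repeat(factor, axis=1)

def generate(prompt: str) -> np.ndarray:
    emb = encode_text(prompt)   # text -> semantic embedding
    img = base_diffusion(emb)   # embedding -> 64x64 image
    img = super_res(img, 4)     # 64x64 -> 256x256
    img = super_res(img, 4)     # 256x256 -> final 1024x1024 output
    return img

print(generate("a corgi playing a flute").shape)  # (1024, 1024, 3)
```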


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="machine-learning"/><category term="ai"/><category term="dalle"/><category term="generative-ai"/></entry><entry><title>DALL·E: Creating Images from Text</title><link href="https://simonwillison.net/2021/Jan/5/dalle-creating-images-from-text/#atom-tag" rel="alternate"/><published>2021-01-05T20:31:49+00:00</published><updated>2021-01-05T20:31:49+00:00</updated><id>https://simonwillison.net/2021/Jan/5/dalle-creating-images-from-text/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.com/blog/dall-e/"&gt;DALL·E: Creating Images from Text&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs.” The examples in this paper are astonishing—“an illustration of a baby daikon radish in a tutu walking a dog” generates exactly that.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=25649557"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/dalle"&gt;dalle&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;&lt;/p&gt;



</summary><category term="machine-learning"/><category term="ai"/><category term="openai"/><category term="dalle"/><category term="generative-ai"/></entry></feed>