Simon Willison's Weblog: bard

Google's Gemini Advanced: Tasting Notes and Implications

2024-02-08T15:10:47+00:00

Google's Gemini Advanced: Tasting Notes and Implications

Ethan Mollick reviews the new Google Gemini Advanced—a rebranded Bard, released today, that runs on the GPT-4 competitive Gemini Ultra model.

“GPT-4 [...] has been the dominant AI for well over a year, and no other model has come particularly close. Prior to Gemini, we only had one advanced AI model to look at, and it is hard drawing conclusions with a dataset of one. Now there are two, and we can learn a few things.”

I like Ethan’s use of the term “tasting notes” here. Reminds me of how Matt Webb talks about being a language model sommelier.

Tags: google, ai, generative-ai, gpt-4, bard, llms, ethan-mollick, gemini

Hacking Google Bard - From Prompt Injection to Data Exfiltration

2023-11-04T16:46:51+00:00

Hacking Google Bard - From Prompt Injection to Data Exfiltration

Bard recently grew extension support, allowing it access to a user’s personal documents. Here’s the first reported prompt injection attack against that.

This kind of attack against LLM systems is inevitable any time you combine access to private data with exposure to untrusted inputs. In this case the attack vector is a Google Doc shared with the user, containing prompt injection instructions that instruct the model to encode previous data into an URL and exfiltrate it via a markdown image.

Google’s CSP headers restrict those images to *.google.com—but it turns out you can use Google AppScript to run your own custom data exfiltration endpoint on script.google.com.

Google claim to have fixed the reported issue—I’d be interested to learn more about how that mitigation works, and how robust it is against variations of this attack.

Via @wunderwuzzi23

Tags: google, security, ai, prompt-injection, bard, llms, exfiltration-attacks, content-security-policy

Google was accidentally leaking its Bard AI chats into public search results

2023-09-27T19:35:23+00:00

Google was accidentally leaking its Bard AI chats into public search results

I’m quoted in this piece about yesterday’s Bard privacy bug: it turned out the share URL and “Let anyone with the link see what you’ve selected” feature wasn’t correctly setting a noindex parameter, and so some shared conversations were being swept up by the Google search crawlers. Thankfully this was a mistake, not a deliberate design decision, and it should be fixed by now.

Tags: crawling, google, privacy, bard, llms, press-quotes

Quoting PaLM 2 Technical Report

2023-05-10T18:43:12+00:00

The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. Our evaluation results show that PaLM 2 models significantly outperform PaLM on a variety of tasks, including natural language generation, translation, and reasoning. These results suggest that model scaling is not the only way to improve performance. Instead, performance can be unlocked by meticulous data selection and efficient architecture/objectives. Moreover, a smaller but higher quality model significantly improves inference efficiency, reduces serving cost, and enables the model’s downstream application for more applications and users.

— PaLM 2 Technical Report, PDF

Tags: google, ai, generative-ai, bard, llms

Bard now helps you code

2023-04-21T15:32:49+00:00

Bard now helps you code

Google have enabled Bard’s code generation abilities—these were previously only available through jailbreaking. It’s pretty good—I got it to write me code to download a CSV file and insert it into a SQLite database—though when I challenged it to protect against SQL injection it hallucinated a non-existent “cursor.prepare()” method. Generated code can be exported to a Colab notebook with a click.

Via @dynamicwebpaige

Tags: google, sql-injection, ai, generative-ai, bard, llms

The Changelog podcast: LLMs break the internet

2023-04-08T00:46:43+00:00

I'm the guest on the latest episode of The Changelog podcast: LLMs break the internet. It's a follow-up to the episode we recorded six months ago about Stable Diffusion.

This time, we spent the whole episode talking about large language models: ChatGPT, GPT-4, Bing, Bard, Claude, LLaMA and more.

I listened to this again today while walking the dog. It's good! It's the best representation of my current thinking about this wild AI-enhanced world we are rapidly entering.

We start the episode by reviewing my predictions from six months ago. I said that search engines like Google would have LLM features within two years - Bing and Bard are live already, so I over-shot on that one. I also said that there would be LLM tools for creating 3D worlds within six months. When we recorded the episode last week I hadn't seen any that quite matched my prediction... and then yesterday Pete Huang posted a Twitter thread listing six of them!

There's a lot of other stuff in there: the full episode is 1 hour and 40 minutes long.

I'll quote one section in particular, from part way through my answer to the question Where should someone start with this? (direct link to audio).

This is the thing I worry that people are sleeping on. People who think “these language models lie to you all the time” (which they do) and “they will produce buggy code with security holes” - every single complaint about these things is true, and yet, despite all of that, the productivity benefits you get if you lean into them and say OK, how do I work with something that’s completely unreliable, that invents things, that comes up with APIs that don’t exist… how do I use that to enhance my workflow anyway?

And the answer is that you can get enormous leaps ahead in productivity and in the ambition of the kinds of projects that you take on, if you can accept both things are true at once at once: it can be flawed, and lying, and have all of these problems… and it can also be a massive productivity boost.

Here are four illustrative examples of things I've used LLMs for as a huge productivity booster in just the past few weeks.

I also gave my review of Google Bard at 1:14:46 which I think deserves a listen.

Tips for getting started with LLMs

Here's a three minute YouTube clip from the podcast recording where I talk about tips for getting started with ChatGPT:

Tags: podcasts, speaking, ai, openai, generative-ai, chatgpt, llama, bard, llms, ai-assisted-programming, podcast-appearances

Think of language models like ChatGPT as a "calculator for words"

2023-04-02T16:20:21+00:00

One of the most pervasive mistakes I see people using with large language model tools like ChatGPT is trying to use them as a search engine.

As with other LLM misconceptions, it's easy to understand why people do this.

If you ask an LLM a question, it will answer it - no matter what the question! Using them as an alternative to a search engine such as Google is one of the most obvious applications - and for a lot of queries this works just fine.

It's also going to quickly get you into trouble.

Ted Chiang's classic essay ChatGPT Is a Blurry JPEG of the Web helps explain why:

Think of ChatGPT as a blurry jpeg of all the text on the Web. It retains much of the information on the Web, in the same way that a jpeg retains much of the information of a higher-resolution image, but, if you’re looking for an exact sequence of bits, you won’t find it; all you will ever get is an approximation. But, because the approximation is presented in the form of grammatical text, which ChatGPT excels at creating, it’s usually acceptable.

The ChatGPT model is huge, but it's not huge enough to retain every exact fact it's encountered in its training set.

It can produce a convincing answer to anything, but that doesn't mean it's reflecting actual facts in its answers. You always have to stay skeptical and fact check what it tells you.

Language models are also famous for "hallucinating" - for inventing new facts that fit the sentence structure despite having no basis in the underlying data.

There are plenty of "facts" about the world which humans disagree on. Regular search lets you compare those versions and consider their sources. A language model might instead attempt to calculate some kind of average of every opinion it's been trained on - which is sometimes what you want, but often is not.

This becomes even more obvious when you consider smaller language models. LLaMA 7B can be represented as a 3.9 GB file - it contains an astonishing amount of information, but evidently that's not enough storage space to accurately answer every question you might have.

So if they're not reliable for use as a search engines, what are LLMs even good for?

A calculator for words

I like to think of language models like ChatGPT as a calculator for words.

This is reflected in their name: a "language model" implies that they are tools for working with language. That's what they've been trained to do, and it's language manipulation where they truly excel.

Want them to work with specific facts? Paste those into the language model as part of your original prompt!

There are so many applications of language models that fit into this calculator for words category:

Summarization. Give them an essay and ask for a summary.
Question answering: given these paragraphs of text, answer this specific question about the information they represent.
Fact extraction: ask for bullet points showing the facts presented by an article.
Rewrites: reword things to be more "punchy" or "professional" or "sassy" or "sardonic" - part of the fun here is using increasingly varied adjectives and seeing what happens. They're very good with language after all!
Suggesting titles - actually a form of summarization.
World's most effective thesaurus. "I need a word that hints at X", "I'm very Y about this situation, what could I use for Y?" - that kind of thing.
Fun, creative, wild stuff. Rewrite this in the voice of a 17th century pirate. What would a sentient cheesecake think of this? How would Alexander Hamilton rebut this argument? Turn this into a rap battle. Illustrate this business advice with an anecdote about sea otters running a kayak rental shop. Write the script for kickstarter fundraising video about this idea.

A calculator for words is an incredibly powerful thing.

They can be hooked up to search engines

Here's where things get a bit complicated: some language models CAN work as search engines. The two most obvious are Microsoft Bing and Google Bard, but there are plenty of other examples of this pattern too - there's even an alpha feature of ChatGPT called "browsing mode" that can do this.

You can think of these search tools as augmented language models.

The way these work is the language model identifies when a search might help answer a question... and then runs that search through an attached search engine, via an API.

It then copies data from the search results back into itself as part of an invisible prompt, and uses that new context to help it answer the original question.

It's effectively the same thing as if you ran a search, then copied and pasted information back into the language model and asked it a question about that data.

I wrote about how to implement this pattern against your own data in How to implement Q&A against your documentation with GPT3, embeddings and Datasette. It's an increasingly common pattern.

It's important to note that there is still a risk of hallucination here, even when you feed it the facts you want it to use. I've caught both Bing and Bard adding made-up things in the middle of text that should have been entirely derived from their search results!

Using language models effectively is deceptively difficult

So many of the challenges involving language models come down to this: they look much, much easier to use than they actually are.

To get the most value out of them - and to avoid the many traps that they set for the unwary user - you need to spend time with them, and work to build an accurate mental model of how they work, what they are capable of and where they are most likely to go wrong.

I hope this "calculator for words" framing can help.

A flaw in this analogy: calculators are repeatable

Andy Baio pointed out a flaw in this particular analogy: calculators always give you the same answer for a given input. Language models don't - if you run the same prompt through a LLM several times you'll get a slightly different reply every time.

This is a very good point! You should definitely keep this in mind.

All analogies are imperfect, but some are more imperfect that others.

Update: December 5th 2023

Anthony Bucci, in Word calculators don't add up, responds to this post with further notes on why this analogy doesn't work for him, including:

[...] a calculator has a well-defined, well-scoped set of use cases, a well-defined, well-scoped user interface, and a set of well-understood and expected behaviors that occur in response to manipulations of that interface.

Large language models, when used to drive chatbots or similar interactive text-generation systems, have none of those qualities. They have an open-ended set of unspecified use cases.

Tags: bing, ai, generative-ai, chatgpt, bard, llms, hallucinations

What AI can do for you on the Theory of Change podcast

2023-04-02T00:17:59+00:00

Matthew Sheffield invited me on his show Theory of Change to talk about how AI models like ChatGPT, Bing and Bard work and practical applications of things you can do with them.

The episode is available on SoundCloud and various podcast platforms (here's Apple Podcasts), or you can watch it on YouTube. I've also embedded the video below.

Our full conversation is nearly an hour and twenty minutes long! There's a transcript on the site which includes additional links.

I'll quote one portion from towards the end of the interview, about ways to learn more about how to use these models:

WILLISON: Websites pop up every day that claim to help you with AI, to be honest, at a rate that’s too far to even evaluate them and figure out which ones are good and which ones are snake oil. The thing that matters is actually interacting with these systems. You should be playing with Google Bard, and ChatGPT, and Microsoft Bing, and trying things out with a very skeptical approach.

Always assume that anything that it does, it could be making things up. It could be tricking you into thinking that it’s capable of something that it’s not. But that’s where you have to learn to experiment. You have to try different things, give it a URL, and then give it a broken URL and see how it differs between them.

Because that really is the most reliable way to get stuff done here. To sort of build that crucial mental model of what these things can do, and what they can’t. And it’s full of pitfalls. It’s so easy to fall into traps. So you do need to read around this stuff and find communities of people who are experimenting in it with, with you and, and so on.

Unfortunately, I don’t think there’s an easy answer to the question yet of how to learn to use these effectively, partly because ChatGPT isn’t even four months old yet. It’s four-month birthday’s on the 30th of March. All of this stuff is so new, we’re all figuring it out together. The key thing is, because it’s all so new, you need to hang out with other people.

You need to get involved with communities who are figuring this out. Share what you learn, see what other people learn, and basically try and help society as a whole come to terms with what these things even are and what we can do with them.

[...]

So that’s, I think, one of my sort of big personal ethical concerns is you should share your prompts. There are websites where you can sell prompts to people. No, no, no, no. Don’t do that. Share your prompts with other people. You get them to share the prompts back. We are all in this together. And sharing the prompts that work for you and the prompts that don’t is the fastest way that you can learn, and the fastest way that you can help other people learn as well.

A shorter version of the above: share your prompts! We're all in this together. We have so much that we still need to figure out.

Tags: bing, media, podcasts, ai, generative-ai, chatgpt, bard, llms, podcast-appearances

How to use AI to do practical stuff: A new guide

2023-03-31T06:17:23+00:00

How to use AI to do practical stuff: A new guide

Ethan Mollick’s guide to practical usage of large language model chatbot like ChatGPT 3.5 and 4, Bing, Claude and Bard is the best I’ve seen so far. He includes useful warnings about common traps and things that these models are both useful for and useless at.

Via @emollick

Tags: bing, ai, chatgpt, bard, llms, ethan-mollick, claude

Quoting James Vincent

2023-03-23T00:10:41+00:00

If you ask Microsoft’s Bing chatbot if Google’s Bard chatbot has been shut down, it says yes, citing as evidence a news article that discusses a tweet in which a user asked Bard when it would be shut down and Bard said it already had, itself citing a comment from Hacker News in which someone joked about this happening, and someone else used ChatGPT to write fake news coverage about the event.

— James Vincent

Tags: bing, google, ai, chatgpt, bard, llms

Don't trust AI to talk accurately about itself: Bard wasn't trained on Gmail

2023-03-22T03:13:02+00:00

Earlier this month I wrote about how ChatGPT can’t access the internet, even though it really looks like it can. Consider this part two in the series. Here's another common and non-intuitive mistake people make when interacting with large language model AI systems: asking them questions about themselves.

(When I say "large language model AI systems" I mean tools like ChatGPT, Bing and the brand new Google Bard.)

Asking questions like this feels like a natural thing to do: these bots use "I" pronouns (I really wish they wouldn't) and will very happily answer questions about themselves - what they can do, how they work, even their own opinions (I really wish they wouldn't do that).

These questions are likely to produce realistic but misleading answers. They come at a surprisingly high risk for hallucination.

What harm could possibly come from asking them how they work?

No, Bard was not trained on text from private Gmail accounts

The following screenshot started circulating on Twitter earlier today. I'm not linking to it directly because I don't want to encourage it to spread further without extra context.

The user asked:

where does Bard's dataset come from?

Bard replied like this - emphasis mine:

Bard's dataset comes from a variety of sources, including:

Publicly available datasets: These include datasets of text and code from the web, such as Wikipedia, GitHub, and Stack Overflow.

Google's internal data: This includes data from Google Search, Gmail, and other products.

Data from third-party companies: This includes data from companies that have partnered with Google to provide data for Bard's training.

The specific sources of Bard's dataset are not publicly available, but Google Al has said that the dataset is "as comprehensive as possible".

Unsurprisingly, the idea that Bard might have been trained on internal data from Gmail sparked a lot of concern!

Here's an official denial:

Bard is an early experiment based on Large Language Models and will make mistakes. It is not trained on Gmail data. -JQ
- Google Workspace (@GoogleWorkspace) March 21, 2023

(I have some supporting arguments below for if the official denial isn't convincing enough for you.)

Bard was not trained on Gmail. So why on earth did Bard say that it was?

Language models have no concept of "self"

As always with language models, the trick to understanding why they sometimes produce wildly inappropriate output like this is to think about how they work.

A large language model is a statistical next-word / next-sentence predictor. Given the previous sequence of words (including the user's prompt), it uses patterns from the vast amount of data it has been trained on to find a statistically satisfying way to continue that text.

As such, there's no mechanism inside a language model to help it identify that questions of the form "how do you work?" should be treated any differently than any other question.

We can give it hints: many chatbot models are pre-seeded with a short prompt that says something along the lines of "You are Assistant, a large language model trained by OpenAI" (seen via a prompt leak).

And given those hints, it can at least start a conversation about itself when encouraged to do so.

But as with everything else language model, it's an illusion. It's not talking about itself, it's completing a sentence that starts with "I am a large language model trained by ...".

So when it outputs "Google's internal data:", the obvious next words might turn out to be "This includes data from Google Search, Gmail, and other products" - they're statistically likely to follow, even though they don't represent the actual truth.

This is one of the most unintuitive things about these models. The obvious question here is why: why would Bard lie and say it had been trained on Gmail when it hadn't?

It has no motivations to lie or tell the truth. It's just trying to complete a sentence in a satisfactory way.

What does "satisfactory" mean? It's likely been guided by RLHF - Reinforcement Learning from Human Feedback - which the ChatGPT development process has excelled at. Human annotators help train the model by labelling responses as satisfactory or not. Google apparently recruited the entire company to help with this back in February.

I'm beginning to suspect that the perceived difference in quality between different language model AIs is influenced much more heavily by this fine-tuning level of training than it is by the underlying model size and quality itself. The enormous improvements the Alpaca fine-tuning brought to the tiny LLaMA 7B model has reinforced my thinking around this.

I think Bard's fine-tuning still has a long way to go.

Current information about itself couldn't have been in the training data

By definition, the model's training data must have existed before the model itself was trained. Most models have a documented cut-off date on their training data - for OpenAI's models that's currently September 2021, I don't believe Google have shared the cut-off date for the LaMDA model used by Bard.

If it was trained on content written prior to its creation, it clearly can't understand details about its own specific "self".

ChatGPT can answer pretty detailed questions about GPT-3, because that model had been iterated on and written about publicly for several years prior to its training cut-off. But questions about its most recent model, by definition, cannot be answered just using data that existed in its training set.

But Bard can consult data beyond its training!

Here's where things get a bit tricky.

ChatGPT is a "pure" interface to a model: when you interact with it, you're interacting with the underlying language model directly.

Google Bard and Microsoft Bing are different: they both include the ability to consult additional sources of information, in the form of the Google and Bing search indexes.

Effectively, they're allowed to augment their training data with additional information fetched from a search.

This sounds more complex than it actually is: effectively they can run an external search, get back some results, paste them invisibly into the ongoing conversation and use that new text to help answer questions.

(I've built a very simple version of this pattern myself a couple of times, described in How to implement Q&A against your documentation with GPT3, embeddings and Datasette and A simple Python implementation of the ReAct pattern for LLMs.)

As such, one would hope that Bard could offer a perfect answer to any question about itself. It should be able to do something this:

User: Where does Bard's dataset come from?

Bard: (invisible): search Google for "Bard dataset"

Bard: (invisible): search results said: ... big chunk of text from the Google indexed documents ...

Bard: My underlying model LaMDA was trained on public dialog data and other public web documents.

Clearly it didn't do that in this case! Or if it did, it summarized the information it got back in a misleading way.

I expect Bard will have a much better answer for this question within a day or two - a great thing about running models with augmented data in this way is that you can improve their answers without having to train the underlying model again from scratch every time.

More reasons that LaMDA wouldn't be trained on Gmail

When I first saw the claim from that original screenshot, I was instantly suspicious.

Taking good care of the training data that goes into a language model is one of the most important and challenging tasks in all of modern AI research.

Using the right mix of content, with the right mix of perspectives, and languages, and exposure to vocabulary, is absolutely key.

If you train a model on bad sources of training data, you'll get a really badly behaved model.

The problem is that these models require far more text than any team of humans could ever manually review.

The LaMDA paper describes the training process like so:

LaMDA was pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data alone, we pre-trained LaMDA on a dataset created from public dialog data and other public web documents. Therefore, LaMDA can be used as a general language model prior to fine-tuning.

The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words

1.56 trillion words!

Appendix E has more details:

The composition of the data is as follows: 50% dialogs data from public forums; 12.5% C4 data t5; 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc; 12.5% Wikipedia (English); 6.25% English web documents; and 6.25% Non-English web documents.

"C4 data t5" I believe relates to Common Crawl.

So why not mix in Gmail too?

First, in order to analyze the training data you need to be able to have your research team view it - they need to run spot checks, and build and test filtering algorithms to keep the really vile stuff to a minimum.

At large tech companies like Google, the ability for members of staff to view private data held in trust for their users is very tightly controlled. It's not the kind of thing you want your machine learning training team to be poking around in... and if you work on those teams, even having the ability to access that kind of private data represents a substantial personal legal and moral risk.

Secondly, think about what could go wrong. What if a language model leaked details of someone's private lives in response to a prompt from some other user?

This would be a PR catastrophe. Would people continue to trust Gmail or other Google products if they thought their personal secrets were being exposed to anyone who asked Bard a question? Would Google ever want to risk finding out the answer to that question?

The temptations of conspiratorial thinking

Are you still not convinced? Are you still suspicious that Google trained Bard on Gmail, despite both their denials and my logic as to why they wouldn't ever want to do this?

Ask yourself how much you want to believe that this story is true.

This modern AI stuff is deeply weird, and more than a little frightening.

The companies involved are huge, secretive and are working on technology which serious people have grave concerns about.

It's so easy to fall into the trap of conspiratorial thinking around this stuff. Especially since some of the conspiracies might turn out to be true!

I don't know how to best counter this most human of reactions. My best recommendation is to keep in mind that humans, like language models, are pattern matching machines: we jump to conclusions, especially if they might reinforce our previous opinions and biases.

If we're going to figure this stuff out together, we have to learn when to trust our initial instincts and when to read deeper and think harder about what's going on.

Tags: bing, ethics, gmail, google, privacy, ai, generative-ai, chatgpt, bard, llms, training-data, ai-ethics, hallucinations

Quoting Google Bard

2023-03-21T18:27:06+00:00

Here are some absurdly expensive things you can do on a trip to Tokyo: Buy a golden toilet. There is a toilet in Tokyo that is made of gold and costs around 10 million yen. If you are looking for a truly absurd experience, you can buy this toilet and use it for your next bowel movement. [...]

— Google Bard

Tags: google, ai, generative-ai, bard, llms

Google Bard is now live

2023-03-21T18:25:53+00:00

Google Bard is now live

Google Bard launched today. There’s a waiting list, but I made it through within a few hours of signing up, as did other people I’ve talked to. It’s similar to ChatGPT and Bing—it’s the same chat interface, and it can clearly run searches under the hood (though unlike Bing it doesn’t tell you what it’s looking for).

Tags: google, ai, generative-ai, bard, llms