Simon Willison's Weblog: trust

Thoughts on the WWDC 2024 keynote on Apple Intelligence

2024-06-10T20:19:13+00:00

Today's WWDC keynote finally revealed Apple's new set of AI features. The AI section (Apple are calling it Apple Intelligence) started over an hour into the keynote - this link jumps straight to that point in the archived YouTube livestream, or you can watch it embedded here:

There's also a detailed Apple newsroom post: Introducing Apple Intelligence, the personal intelligence system that puts powerful generative models at the core of iPhone, iPad, and Mac.

There are a lot of interesting things here. Apple have a strong focus on privacy, finally taking advantage of the Neural Engine accelerator chips in the A17 Pro chip on iPhone 15 Pro and higher and the M1/M2/M3 Apple Silicon chips in Macs. They're using these to run on-device models - I've not yet seen any information on which models they are running and how they were trained.

On-device models that can outsource to Apple's servers

Most notable is their approach to features that don't work with an on-device model. At 1h14m43s:

When you make a request, Apple Intelligence analyses whether it can be processed on device. If it needs greater computational capacity, it can draw on Private Cloud Compute, and send only the data that's relevant to your task to be processed on Apple Silicon servers.

Your data is never stored or made accessible to Apple. It's used exclusively to fulfill your request.

And just like your iPhone, independent experts can inspect the code that runs on the servers to verify this privacy promise.

In fact, Private Cloud Compute cryptographically ensures your iPhone, iPad, and Mac will refuse to talk to a server unless its software has been publicly logged for inspection.

There's some fascinating computer science going on here! I'm looking forward to learning more about this - it sounds like the details will be public by design, since that's key to the promise they are making here.

Update: Here are the details, and they are indeed extremely impressive - more of my notes here.

An ethical approach to AI generated images?

Their approach to generative images is notable in that they're shipping an on-device model in a feature called Image Playground, with a very important limitation: it can only output images in one of three styles: sketch, illustration and animation.

This feels like a clever way to address some of the ethical objections people have to this specific category of AI tool:

If you can't create photorealistic images, you can't generate deepfakes or offensive photos of people
By having obvious visual styles you ensure that AI generated images are instantly recognizable as such, without watermarks or similar
Avoiding the ability to clone specific artist's styles further helps sidestep ethical issues about plagiarism and copyright infringement

The social implications of this are interesting too. Will people be more likely to share AI-generated images if there are no awkward questions or doubts about how they were created, and will that help it more become socially acceptable to use them?

I've not seen anything on how these image models were trained. Given their limited styles it seems possible Apple used entirely ethically licensed training data, but I'd like to see more details on this.

App Intents and prompt injection

Siri will be able to both access data on your device and trigger actions based on your instructions.

This is the exact feature combination that's most at risk from prompt injection attacks: what happens if someone sends you a text message that tricks Siri into forwarding a password reset email to them, and you ask for a summary of that message?

Security researchers will no doubt jump straight onto this as soon as the beta becomes available. I'm fascinated to learn what Apple have done to mitigate this risk.

Integration with ChatGPT

Rumors broke last week that Apple had signed a deal with OpenAI to use ChatGPT. That's now been confirmed: here's OpenAI's partnership announcement:

Apple is integrating ChatGPT into experiences within iOS, iPadOS, and macOS, allowing users to access ChatGPT’s capabilities—including image and document understanding—without needing to jump between tools.

Siri can also tap into ChatGPT’s intelligence when helpful. Apple users are asked before any questions are sent to ChatGPT, along with any documents or photos, and Siri then presents the answer directly.

The keynote talks about that at 1h36m21s. Those prompts to confirm that the user wanted to share data with ChatGPT are very prominent in the demo!

This integration (with GPT-4o) will be free - and Apple don't appear to be charging for their other server-side AI features either. I guess they expect the supporting hardware sales to more than cover the costs of running these models.

Tags: apple, ethics, privacy, security, trust, ai, openai, prompt-injection, generative-ai, chatgpt, llms, apple-intelligence, ai-ethics

Update on the Recall preview feature for Copilot+ PCs

2024-06-07T17:30:40+00:00

Update on the Recall preview feature for Copilot+ PCs

This feels like a very good call to me: in response to widespread criticism Microsoft are making Recall an opt-in feature (during system onboarding), adding encryption to the database and search index beyond just disk encryption and requiring Windows Hello face scanning to access the search feature.

Via Wired: Microsoft Will Switch Off Recall by Default After Security Backlash

Tags: microsoft, privacy, security, trust, windows, ai, recall

Quoting Zac Bowden

2024-06-07T17:23:54+00:00

In fact, Microsoft goes so far as to promise that it cannot see the data collected by Windows Recall, that it can't train any of its AI models on your data, and that it definitely can't sell that data to advertisers. All of this is true, but that doesn't mean people believe Microsoft when it says these things. In fact, many have jumped to the conclusion that even if it's true today, it won't be true in the future.

— Zac Bowden

Tags: microsoft, privacy, trust, windows, ai, recall

The AI trust crisis

2023-12-14T16:14:11+00:00

Dropbox added some new AI features. In the past couple of days these have attracted a firestorm of criticism. Benj Edwards rounds it up in Dropbox spooks users with new AI features that send data to OpenAI when used.

The key issue here is that people are worried that their private files on Dropbox are being passed to OpenAI to use as training data for their models - a claim that is strenuously denied by Dropbox.

As far as I can tell, Dropbox built some sensible features - summarize on demand, "chat with your data" via Retrieval Augmented Generation - and did a moderately OK job of communicating how they work... but when it comes to data privacy and AI, a "moderately OK job" is a failing grade. Especially if you hold as much of people's private data as Dropbox does!

Two details in particular seem really important. Dropbox have an AI principles document which includes this:

Customer trust and the privacy of their data are our foundation. We will not use customer data to train AI models without consent.

They also have a checkbox in their settings that looks like this:

Update: Some time between me publishing this article and four hours later, that link stopped working.

I took that screenshot on my own account. It's toggled "on" - but I never turned it on myself.

Does that mean I'm marked as "consenting" to having my data used to train AI models?

I don't think so: I think this is a combination of confusing wording and the eternal vagueness of what the term "consent" means in a world where everyone agrees to the terms and conditions of everything without reading them.

But a LOT of people have come to the conclusion that this means their private data - which they pay Dropbox to protect - is now being funneled into the OpenAI training abyss.

People don't believe OpenAI

Here's copy from that Dropbox preference box, talking about their "third-party partners" - in this case OpenAI:

Your data is never used to train their internal models, and is deleted from third-party servers within 30 days.

It's increasing clear to me like people simply don't believe OpenAI when they're told that data won't be used for training.

What's really going on here is something deeper then: AI is facing a crisis of trust.

I quipped on Twitter:

"OpenAI are training on every piece of data they see, even when they say they aren't" is the new "Facebook are showing you ads based on overhearing everything you say through your phone's microphone"

Here's what I meant by that.

Facebook don't spy on you through your microphone

Have you heard the one about Facebook spying on you through your phone's microphone and showing you ads based on what you're talking about?

This theory has been floating around for years. From a technical perspective it should be easy to disprove:

Mobile phone operating systems don't allow apps to invisibly access the microphone.
Privacy researchers can audit communications between devices and Facebook to confirm if this is happening.
Running high quality voice recognition like this at scale is extremely expensive - I had a conversation with a friend who works on server-based machine learning at Apple a few years ago who found the entire idea laughable.

The non-technical reasons are even stronger:

Facebook say they aren't doing this. The risk to their reputation if they are caught in a lie is astronomical.
As with many conspiracy theories, too many people would have to be "in the loop" and not blow the whistle.
Facebook don't need to do this: there are much, much cheaper and more effective ways to target ads at you than spying through your microphone. These methods have been working incredibly well for years.
Facebook gets to show us thousands of ads a year. 99% of those don't correlate in the slightest to anything we have said out loud. If you keep rolling the dice long enough, eventually a coincidence will strike.

Here's the thing though: none of these arguments matter.

If you've ever experienced Facebook showing you an ad for something that you were talking about out-loud about moments earlier, you've already dismissed everything I just said. You have personally experienced anecdotal evidence which overrides all of my arguments here.

Here's a Reply All podcast episode from Novemember 2017 that explores this issue: 109 Is Facebook Spying on You?. Their conclusion: Facebook are not spying through your microphone. But if someone already believes that there is no argument that can possibly convince them otherwise.

I've experienced this effect myself - over the past few years I've tried talking people out of this, as part of my own personal fascination with how sticky this conspiracy theory is.

The key issue here is the same as the OpenAI training issue: people don't believe these companies when they say that they aren't doing something.

One interesting difference here is that in the Facebook example people have personal evidence that makes them believe they understand what's going on.

With AI we have almost the complete opposite: AI models are weird black boxes, built in secret and with no way of understanding what the training data was or how it influences the model.

As with so much in AI, people are left with nothing more than "vibes" to go on. And the vibes are bad.

This really matters

Trust is really important. Companies lying about what they do with your privacy is a very serious allegation.

A society where big companies tell blatant lies about how they are handling our data - and get away with it without consequences - is a very unhealthy society.

A key role of government is to prevent this from happening. If OpenAI are training on data that they said they wouldn't train on, or if Facebook are spying on us through our phone's microphones, they should be hauled in front of regulators and/or sued into the ground.

If we believe that they are doing this without consequence, and have been getting away with it for years, our intolerance for corporate misbehavior becomes a victim as well. We risk letting companies get away with real misconduct because we incorrectly believed in conspiracy theories.

Privacy is important, and very easily misunderstood. People both overestimate and underestimate what companies are doing, and what's possible. This isn't helped by the fact that AI technology means the scope of what's possible is changing at a rate that's hard to appreciate even if you're deeply aware of the space.

If we want to protect our privacy, we need to understand what's going on. More importantly, we need to be able to trust companies to honestly and clearly explain what they are doing with our data.

On a personal level we risk losing out on useful tools. How many people cancelled their Dropbox accounts in the last 48 hours? How many more turned off that AI toggle, ruling out ever evaluating if those features were useful for them or not?

What can we do about it?

There is something that the big AI labs could be doing to help here: tell us how you are training!

The fundamental question here is about training data: what are OpenAI using to train their models?

And the answer is: we have no idea! The entire process could not be more opaque.

Given that, is it any wonder that when OpenAI say "we don't train on data submitted via our API" people have trouble believing them?

The situation with ChatGPT itself is even more messy. OpenAI say that they DO use ChatGPT interactions to improve their models - even those from paying customers, with the exception of the "call us" priced ChatGPT Enterprise.

If I paste a private document into ChatGPT to ask for a summary, will snippets of that document be leaked to future users after the next model update? Without more details on HOW they are using ChatGPT to improve their models I can't come close to answering that question.

Clear explanations of how this stuff works could go a long way to improving the trust relationship OpenAI have with their users, and the world at large.

Maybe take a leaf from large scale platform companies. They publish public post-mortem incident reports on outages, to regain trust with their customers through transparency about exactly what happened and the steps they are taking to prevent it from happening again. Dan Luu has collected a great list of examples.

An opportunity for local models

One consistent theme I've seen in conversations about this issue is that people are much more comfortable trusting their data to local models that run on their own devices than models hosted in the cloud.

The good news is that local models are consistently both increasing in quality and shrinking in size.

I figured out how to run Mixtral-8x7b-Instruct on my laptop last night - the first local model I've tried which really does seem to be equivalent in quality to ChatGPT 3.5.

Microsoft's Phi-2 is a fascinating new model in that it's only 2.7 billion parameters (most useful local models start at 7 billion) but claims state-of-the-art performance against some of those larger models. And it looks like they trained it for around $35,000.

While I'm excited about the potential of local models, I'd hate to see us lose out on the power and convenience of the larger hosted models over privacy concerns which turn out to be incorrect.

The intersection of AI and privacy is a critical issue. We need to be able to have the highest quality conversations about it, with maximum transparency and understanding of what's actually going on.

This is hard already, and it's made even harder if we straight up disbelieve anything that companies tell us. Those companies need to earn our trust. How can we help them understand how to do that?

Tags: trust, dropbox, ai, openai, local-llms, llms, training-data, microphone-ads-conspiracy, digital-literacy

AI and Trust

2023-12-05T21:43:03+00:00

AI and Trust

Barnstormer of an essay by Bruce Schneier about AI and trust. It’s worth spending some time with this—it’s hard to extract the highlights since there are so many of them.

A key idea is that we are predisposed to trust AI chat interfaces because they imitate humans, which means we are highly susceptible to profit-seeking biases baked into them.

Bruce suggests that what’s needed is public models, backed by government funds: “A public model is a model built by the public for the public. It requires political accountability, not just market accountability.”

Tags: bruce-schneier, trust, ai, generative-ai, llms

Can We Trust Search Engines with Generative AI? A Closer Look at Bing’s Accuracy for News Queries

2023-02-18T18:09:19+00:00

Can We Trust Search Engines with Generative AI? A Closer Look at Bing’s Accuracy for News Queries

Computational journalism professor Nick Diakopoulos takes a deeper dive into the quality of the summarizations provided by AI-assisted Bing. His findings are troubling: for news queries, which are a great test for AI summarization since they include recent information that may have sparse or conflicting stories, Bing confidently produces answers with important errors: claiming the Ohio train derailment happened on February 9th when it actually happened on February 3rd for example.

Via @ndiakopoulos

Tags: bing, search, trust, generative-ai, llms, ai-assisted-search, digital-literacy

Wikipedia trust colouring (with demo)

2007-09-01T01:42:59+00:00

Wikipedia trust colouring (with demo)

“The text background of Wikipedia articles is colored according to a value of trust, computed from the reputation of the authors who contributed the text, as well as those who edited the text.”

Via Kevin Gamble

Tags: kevin-gamble, trust, ucsc, wikipedia

An OpenID is not an account!

2007-01-10T10:53:35+00:00

I'm excited to see that OpenID has finally started to gain serious traction outside of the Identity community. Understandably, misconceptions about OpenID continue to crop-up. The one I want to address in this entry is the idea that an OpenID can be used as a replacement for a regular user account.

Update at 23:55pm: I originally tried to illustrate this misconception with a quote from Don Park; unfortunately I misunderstood the quote and consequently misrepresented his position, for which I apologise unreservedly.

The old OpenID homepage (which I miss; the new one uses jargon-heavy terms like "a free framework for user-centric digital identity") included this in nice big letters:

What about trust?

This is not a trust system. Trust requires identity first.

OpenID solves the identity problem, not the trust problem. When a user authenticates with OpenID, what they are doing is stating "I have the ability to prove my ownership of this URL".

I used the phrase "have the ability" deliberately. Just because the OpenID authentication was successful doesn't mean that the user is the only person who can authenticate against that OpenID. It would be trivial to create the OpenID equivalent of Mailinator: an identity provider that says "Yes, that's them!" to any identity request. I'm tempted to build it just to emphasize this point! Update: Jayant Gandhi has built one.

The key thing here is that you should never trust an OpenID on its own. It could be a real person, or it could be a spammer, psycopath or general undesirable.

Does this mean OpenID is completely useless? Absolutely not! You just have to think of it in the same way that you think of username and password combinations: as the "key" to an account.

Most web application signup processes work something like this:

Bob selects a username
Bob enters a password, twice
Bob enters his e-mail address
Bob clicks a validation link in an e-mail sent to that address

Some sites throw a CAPTCHA in there for good measure.

OpenID replaces at most the first two steps of that registration process. Instead of having a user set up a new password you get them to authenticate with their OpenID at the start of the process. After that you might still want them to pick a username (especially if you are integrating OpenID in to an existing account system) and you'll almost certainly want them to jump through the e-mail and/or CAPTCHA steps.

In the future, they can sign in to your site using their OpenID rather than having to dig around for whichever username and password they used.

A nice thing about the above flow is that it demonstrates how easy it should be to add OpenID support to an existing Web application. If you've already got a user account system that's fine - just give your users a mechanism to associate an OpenID with their existing account so they can log in without using their password. You could even require new users to set up a full account (complete with password that they never intend to use) and then associate it with their OpenID, although doing so eliminates the lower barrier to entry advantage for OpenID users.

The trust issue is now null and void. An OpenID is just as trustworthy as a regular username and password (which could have been posted to bugmenot and shared with thousands of people).

One last time: an OpenID is not an account. Just treat it as an alternative to a traditional username and password and you can't go wrong.

Tags: identity, openid, trust