Simon Willison's Weblog: gmail

Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent

2024-10-17T12:32:47+00:00

The other day I found myself needing to add up some numeric values that were scattered across twelve different emails.

I didn't particularly feel like copying and pasting all of the numbers out one at a time, so I decided to try something different: could I record a screen capture while browsing around my Gmail account and then extract the numbers from that video using Google Gemini?

This turned out to work incredibly well.

AI Studio and QuickTime

I recorded the video using QuickTime Player on my Mac: File -> New Screen Recording. I dragged a box around a portion of my screen containing my Gmail account, then clicked on each of the emails in turn, pausing for a couple of seconds on each one.

I uploaded the resulting file directly into Google's AI Studio tool and prompted the following:

Turn this into a JSON array where each item has a yyyy-mm-dd date and a floating point dollar amount for that date

... and it worked. It spat out a JSON array like this:

[
  {
    "date": "2023-01-01",
    "amount": 2...
  },
  ...
]

I wanted to paste that into Numbers, so I followed up with:

turn that into copy-pastable csv

Which gave me back the same data formatted as CSV.

You should never trust these things not to make mistakes, so I re-watched the 35 second video and manually checked the numbers. It got everything right.

I had intended to use Gemini 1.5 Pro, aka Google's best model... but it turns out I forgot to select the model and I'd actually run the entire process using the much less expensive Gemini 1.5 Flash 002.

How much did it cost?

According to AI Studio I used 11,018 tokens, of which 10,326 were for the video.

Gemini 1.5 Flash charges $0.075/1 million tokens (the price dropped in August).

11018/1000000 = 0.011018
0.011018 * $0.075 = $0.00082635

So this entire exercise should have cost me just under 1/10th of a cent!

And in fact, it was free. Google AI Studio currently "remains free of charge regardless of if you set up billing across all supported regions". I believe that means they can train on your data though, which is not the case for their paid APIs.

The alternatives aren't actually that great

Let's consider the alternatives here.

I could have clicked through the emails and copied out the data manually one at a time. This is error prone and kind of boring. For twelve emails it would have been OK, but for a hundred it would have been a real pain.
Accessing my Gmail data programatically. This seems to get harder every year - it's still possible to access it via IMAP right now if you set up a dedicated app password but that's a whole lot of work for a one-off scraping task. The official API is no fun at all.
Some kind of browser automation (Playwright or similar) that can click through my Gmail account for me. Even with an LLM to help write the code this is still a lot more work, and it doesn't help deal with formatting differences in emails either - I'd have to solve the email parsing step separately.
Using some kind of much more sophisticated pre-existing AI tool that has access to my email. A separate Google product also called Gemini can do this if you grant it access, but my results with that so far haven't been particularly great. AI tools are inherently unpredictable. I'm also nervous about giving any tool full access to my email account due to the risk from things like prompt injection.

Video scraping is really powerful

The great thing about this video scraping technique is that it works with anything that you can see on your screen... and it puts you in total control of what you end up exposing to the AI model.

There's no level of website authentication or anti-scraping technology that can stop me from recording a video of my screen while I manually click around inside a web application.

The results I get depend entirely on how thoughtful I was about how I positioned my screen capture area and how I clicked around.

There is no setup cost for this at all - sign into a site, hit record, browse around a bit and then dump the video into Gemini.

And the cost is so low that I had to re-run my calculations three times to make sure I hadn't made a mistake.

I expect I'll be using this technique a whole lot more in the future. It also has applications in the data journalism world, which frequently involves the need to scrape data from sources that really don't want to be scraped.

A note on reliability

Added 22nd December 2024. As with anything involving LLMs, its worth noting that you cannot trust these models to return exactly correct results with 100% reliability. I verified the results here manually through eyeball comparison of the JSON to the underlying video, but in a larger project this may not be feasible. Consider spot-checks or other strategies for double-checking the results, especially if mistakes could have meaningful real-world impact.

Bonus: An LLM pricing calculator

In writing up this experiment I got fed up of having to manually calculate token prices. I actually usually outsource that to ChatGPT Code Interpreter, but I've caught it messing up the conversion from dollars to cents once or twice so I always have to double-check its work.

So I got Claude 3.5 Sonnet with Claude Artifacts to build me this pricing calculator tool (source code here):

You can set the input/output token prices by hand, or click one of the preset buttons to pre-fill it with the prices for different existing models (as-of 16th October 2024 - I won't promise that I'll promptly update them in the future!)

The entire thing was written by Claude. Here's the full conversation transcript - we spent 19 minutes iterating on it through 10 different versions.

Rather than hunt down all of those prices myself, I took screenshots of the pricing pages for each of the model providers and dumped those directly into the Claude conversation:

Tags: data-journalism, gmail, google, scraping, ai, generative-ai, llms, ai-assisted-programming, claude, gemini, vision-llms, claude-artifacts, claude-3-5-sonnet, prompt-to-app

Don't trust AI to talk accurately about itself: Bard wasn't trained on Gmail

2023-03-22T03:13:02+00:00

Earlier this month I wrote about how ChatGPT can’t access the internet, even though it really looks like it can. Consider this part two in the series. Here's another common and non-intuitive mistake people make when interacting with large language model AI systems: asking them questions about themselves.

(When I say "large language model AI systems" I mean tools like ChatGPT, Bing and the brand new Google Bard.)

Asking questions like this feels like a natural thing to do: these bots use "I" pronouns (I really wish they wouldn't) and will very happily answer questions about themselves - what they can do, how they work, even their own opinions (I really wish they wouldn't do that).

These questions are likely to produce realistic but misleading answers. They come at a surprisingly high risk for hallucination.

What harm could possibly come from asking them how they work?

No, Bard was not trained on text from private Gmail accounts

The following screenshot started circulating on Twitter earlier today. I'm not linking to it directly because I don't want to encourage it to spread further without extra context.

The user asked:

where does Bard's dataset come from?

Bard replied like this - emphasis mine:

Bard's dataset comes from a variety of sources, including:

Publicly available datasets: These include datasets of text and code from the web, such as Wikipedia, GitHub, and Stack Overflow.

Google's internal data: This includes data from Google Search, Gmail, and other products.

Data from third-party companies: This includes data from companies that have partnered with Google to provide data for Bard's training.

The specific sources of Bard's dataset are not publicly available, but Google Al has said that the dataset is "as comprehensive as possible".

Unsurprisingly, the idea that Bard might have been trained on internal data from Gmail sparked a lot of concern!

Here's an official denial:

Bard is an early experiment based on Large Language Models and will make mistakes. It is not trained on Gmail data. -JQ
- Google Workspace (@GoogleWorkspace) March 21, 2023

(I have some supporting arguments below for if the official denial isn't convincing enough for you.)

Bard was not trained on Gmail. So why on earth did Bard say that it was?

Language models have no concept of "self"

As always with language models, the trick to understanding why they sometimes produce wildly inappropriate output like this is to think about how they work.

A large language model is a statistical next-word / next-sentence predictor. Given the previous sequence of words (including the user's prompt), it uses patterns from the vast amount of data it has been trained on to find a statistically satisfying way to continue that text.

As such, there's no mechanism inside a language model to help it identify that questions of the form "how do you work?" should be treated any differently than any other question.

We can give it hints: many chatbot models are pre-seeded with a short prompt that says something along the lines of "You are Assistant, a large language model trained by OpenAI" (seen via a prompt leak).

And given those hints, it can at least start a conversation about itself when encouraged to do so.

But as with everything else language model, it's an illusion. It's not talking about itself, it's completing a sentence that starts with "I am a large language model trained by ...".

So when it outputs "Google's internal data:", the obvious next words might turn out to be "This includes data from Google Search, Gmail, and other products" - they're statistically likely to follow, even though they don't represent the actual truth.

This is one of the most unintuitive things about these models. The obvious question here is why: why would Bard lie and say it had been trained on Gmail when it hadn't?

It has no motivations to lie or tell the truth. It's just trying to complete a sentence in a satisfactory way.

What does "satisfactory" mean? It's likely been guided by RLHF - Reinforcement Learning from Human Feedback - which the ChatGPT development process has excelled at. Human annotators help train the model by labelling responses as satisfactory or not. Google apparently recruited the entire company to help with this back in February.

I'm beginning to suspect that the perceived difference in quality between different language model AIs is influenced much more heavily by this fine-tuning level of training than it is by the underlying model size and quality itself. The enormous improvements the Alpaca fine-tuning brought to the tiny LLaMA 7B model has reinforced my thinking around this.

I think Bard's fine-tuning still has a long way to go.

Current information about itself couldn't have been in the training data

By definition, the model's training data must have existed before the model itself was trained. Most models have a documented cut-off date on their training data - for OpenAI's models that's currently September 2021, I don't believe Google have shared the cut-off date for the LaMDA model used by Bard.

If it was trained on content written prior to its creation, it clearly can't understand details about its own specific "self".

ChatGPT can answer pretty detailed questions about GPT-3, because that model had been iterated on and written about publicly for several years prior to its training cut-off. But questions about its most recent model, by definition, cannot be answered just using data that existed in its training set.

But Bard can consult data beyond its training!

Here's where things get a bit tricky.

ChatGPT is a "pure" interface to a model: when you interact with it, you're interacting with the underlying language model directly.

Google Bard and Microsoft Bing are different: they both include the ability to consult additional sources of information, in the form of the Google and Bing search indexes.

Effectively, they're allowed to augment their training data with additional information fetched from a search.

This sounds more complex than it actually is: effectively they can run an external search, get back some results, paste them invisibly into the ongoing conversation and use that new text to help answer questions.

(I've built a very simple version of this pattern myself a couple of times, described in How to implement Q&A against your documentation with GPT3, embeddings and Datasette and A simple Python implementation of the ReAct pattern for LLMs.)

As such, one would hope that Bard could offer a perfect answer to any question about itself. It should be able to do something this:

User: Where does Bard's dataset come from?

Bard: (invisible): search Google for "Bard dataset"

Bard: (invisible): search results said: ... big chunk of text from the Google indexed documents ...

Bard: My underlying model LaMDA was trained on public dialog data and other public web documents.

Clearly it didn't do that in this case! Or if it did, it summarized the information it got back in a misleading way.

I expect Bard will have a much better answer for this question within a day or two - a great thing about running models with augmented data in this way is that you can improve their answers without having to train the underlying model again from scratch every time.

More reasons that LaMDA wouldn't be trained on Gmail

When I first saw the claim from that original screenshot, I was instantly suspicious.

Taking good care of the training data that goes into a language model is one of the most important and challenging tasks in all of modern AI research.

Using the right mix of content, with the right mix of perspectives, and languages, and exposure to vocabulary, is absolutely key.

If you train a model on bad sources of training data, you'll get a really badly behaved model.

The problem is that these models require far more text than any team of humans could ever manually review.

The LaMDA paper describes the training process like so:

LaMDA was pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data alone, we pre-trained LaMDA on a dataset created from public dialog data and other public web documents. Therefore, LaMDA can be used as a general language model prior to fine-tuning.

The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words

1.56 trillion words!

Appendix E has more details:

The composition of the data is as follows: 50% dialogs data from public forums; 12.5% C4 data t5; 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc; 12.5% Wikipedia (English); 6.25% English web documents; and 6.25% Non-English web documents.

"C4 data t5" I believe relates to Common Crawl.

So why not mix in Gmail too?

First, in order to analyze the training data you need to be able to have your research team view it - they need to run spot checks, and build and test filtering algorithms to keep the really vile stuff to a minimum.

At large tech companies like Google, the ability for members of staff to view private data held in trust for their users is very tightly controlled. It's not the kind of thing you want your machine learning training team to be poking around in... and if you work on those teams, even having the ability to access that kind of private data represents a substantial personal legal and moral risk.

Secondly, think about what could go wrong. What if a language model leaked details of someone's private lives in response to a prompt from some other user?

This would be a PR catastrophe. Would people continue to trust Gmail or other Google products if they thought their personal secrets were being exposed to anyone who asked Bard a question? Would Google ever want to risk finding out the answer to that question?

The temptations of conspiratorial thinking

Are you still not convinced? Are you still suspicious that Google trained Bard on Gmail, despite both their denials and my logic as to why they wouldn't ever want to do this?

Ask yourself how much you want to believe that this story is true.

This modern AI stuff is deeply weird, and more than a little frightening.

The companies involved are huge, secretive and are working on technology which serious people have grave concerns about.

It's so easy to fall into the trap of conspiratorial thinking around this stuff. Especially since some of the conspiracies might turn out to be true!

I don't know how to best counter this most human of reactions. My best recommendation is to keep in mind that humans, like language models, are pattern matching machines: we jump to conclusions, especially if they might reinforce our previous opinions and biases.

If we're going to figure this stuff out together, we have to learn when to trust our initial instincts and when to read deeper and think harder about what's going on.

Tags: bing, ethics, gmail, google, privacy, ai, generative-ai, chatgpt, bard, llms, training-data, ai-ethics, hallucinations

WARNING: Google Buzz Has A Huge Privacy Flaw

2010-02-11T11:30:27+00:00

WARNING: Google Buzz Has A Huge Privacy Flaw

Interesting one this: by default, Buzz creates a public profile for you that lists the people you follow—but your default set of followers is derived from the people you contact most frequently using Gmail. This means users of Buzz may inadvertently reveal their most frequent contacts, which is an issue for people like journalists with anonymous sources, unhappy employees seeking new work or even people having e-mail based affairs.

Tags: buzz, followers, gmail, google, privacy

Fixing the Google Account problem

2010-01-25T11:21:34+00:00

Fixing the Google Account problem

3,000+ words explaining how to open a Google Doc invitation sent to an e-mail address that isn’t associated with your Google account. Worth reading just to get an idea for the enormous complexity involved in running a large scale identity system and designing an interface for managing aliases and multiple profiles. Google haven’t got it right yet—has anyone else?

Tags: accounts, drummondreed, gmail, google, identity, usability

Introducing Closure Tools

2009-11-06T07:33:56+00:00

Introducing Closure Tools

Google have released the pure-JavaScript library, apparently used for Gmail, Google Docs and Google Maps. It comes with a powerful JavaScript optimiser tool with linting built in and an accompanying Firebug extension to ensure the obfuscated code it produces can still be debugged. There’s also a template system which precompiles down to JavaScript and can also be called from Java.

Tags: closure, firebug, gmail, google, google-docs, javascript, libraries

How to avoid ads in gmail

2009-07-31T01:40:38+00:00

How to avoid ads in gmail

“After extensive testing I’ve discovered you need 1 catastrophic event or tragedy for every 167 words in the rest of the email.”

Tags: ads, gmail

The Anatomy Of The Twitter Attack

2009-07-20T00:55:35+00:00

The Anatomy Of The Twitter Attack

Long-winded explanation of the recent Twitter break-in, but you can scroll to the bottom for a numbered list summary. The attacker first broke in to a Twitter employee’s personal Gmail account by “recovering” it against an expired Hotmail account (which the attacker could hence register themselves). They gained access to more passwords by searching for e-mails from badly implemented sites that send you your password in the clear.

Tags: gmail, hotmail, passwords, security, twitter

Quoting Ed Felten

2008-09-22T16:21:41+00:00

Yahoo could also have followed Gmail's lead, and disabled the security-question mechanism unless no logged-in user had accessed the account for five days. This clever trick prevents password "recovery" when there is evidence that somebody who knows the password is actively using the account.

— Ed Felten

Tags: forgottenpasswords, gmail, security, webmail, yahoo

Find Your Friends

2008-04-01T01:01:38+00:00

Find Your Friends

Flickr have added a characteristically classy friend import feature, pulling from Gmail, Yahoo! and Hotmail address books without any unhygienic password sharing. It’s a crying shame that the Yahoo! contacts API they are using isn’t available outside the company.

Tags: flickr, gmail, hotmail, passwordantipattern, portablesocialnetworks, yahoo

David Airey: Google's Gmail security failure leaves my business sabotaged

2007-12-26T12:16:48+00:00

David Airey: Google's Gmail security failure leaves my business sabotaged

Gmail had a CSRF hole a while ago that allowed attackers to add forwarding filter rules to your account. David Airey’s domain name was hijacked by an extortionist who forwarded the transfer confirmation e-mail on to themselves.

Via XSS News

Tags: csrf, david-airey, gmail, google, security

Gmail Greasemonkey API

2007-11-07T10:38:38+00:00

Gmail Greasemonkey API

The new version of Gmail includes API hooks for Greasemonkey script authors. The documentation is by Mark Pilgrim, author of Greasemonkey Hacks.

Via Mihai Parparita

Tags: gmail, google, greasemonkey, javascript, mark-pilgrim, mihai-parparita

Mailplane

2007-10-25T07:57:48+00:00

Mailplane

A commercial OS X Gmail client built around a site-specific browser.

Via Chris Messina

Tags: chris-messina, gmail, google, macos, mail, mailplane, sitespecificbrowsers

The password anti-pattern

2007-10-12T09:25:25+00:00

The password anti-pattern

What I don’t understand is why Google / Yahoo! / other webmail providers haven’t just deployed a simple OAuth-style API for accessing the address book. Sites have been scraping them for years anyway; surely it’s better to offer an official API than continue to see users hand out their passwords?

Tags: gmail, google, jeremy-keith, oauth, passwords, phishing, yahoo

identity-matcher

2007-10-04T14:53:12+00:00

identity-matcher

Dopplr’s social network importing code (for Gmail, Twitter, Facebook and sites supporting Microformats), implemented as a Rails ActiveRecord plugin.

Tags: dopplr, facebook, fowa, fowa2007, gmail, identitymatcher, matt-biddulph, microformats, openid, plugins, portablesocialnetwork, rails, ruby, social-graph, twitter

WebRunner 0.7 - New and Improved

2007-09-27T13:55:57+00:00

WebRunner 0.7 - New and Improved

A simple application for running a site-specific browser for a service (e.g. Twitter, Gmail etc). This is a great idea: it isolates your other browser windows from crashes and also isolates your cookies, helping guard against CSRF attacks.

Tags: browsers, csrf, gmail, security, sitespecificbrowsers, twitter, webrunner, xulrunner

Google GMail E-mail Hijack Technique

2007-09-27T10:29:50+00:00

Google GMail E-mail Hijack Technique

Apparently Gmail has a CSRF vulnerability that lets malicious sites add new filters to your filter list—meaning an attacker could add a rule that forwards all messages to them without your knowledge.

Tags: csrf, gmail, google, security, vulnerability

Google To "Out Open" Facebook On November 5

2007-09-21T23:23:30+00:00

Google To "Out Open" Facebook On November 5

“Google will announce a new set of APIs on November 5 that will allow developers to leverage Google’s social graph data. They’ll start with Orkut and iGoogle (Google’s personalized home page), and expand from there to include Gmail, Google Talk and other Google services over time.”

Tags: gmail, google, google-talk, igoogle, michael-arrington, openid, orkut, social-graph, techcrunch

Never use a warning when you mean undo

2007-07-17T11:00:24+00:00

Never use a warning when you mean undo

The abundance of “undo” is one of my favourite things about Gmail. I wonder if there’s anything Django could do to make implementing undo functionality easier...

Tags: a-list-apart, django, gmail, undo, usability, warning

Gmail and Django

2007-07-02T21:46:57+00:00

Gmail and Django

I’d never considered using Gmail to send e-mail from applications, but it could be a useful way of avoiding having outbound e-mail falsely flagged as spam.

Tags: django, email, gmail, nathan-ostgard