<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: jina</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/jina.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-03-04T17:25:16+00:00</updated><author><name>Simon Willison</name></author><entry><title>A Practical Guide to Implementing DeepSearch / DeepResearch</title><link href="https://simonwillison.net/2025/Mar/4/deepsearch-deepresearch/#atom-tag" rel="alternate"/><published>2025-03-04T17:25:16+00:00</published><updated>2025-03-04T17:25:16+00:00</updated><id>https://simonwillison.net/2025/Mar/4/deepsearch-deepresearch/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jina.ai/news/a-practical-guide-to-implementing-deepsearch-deepresearch/"&gt;A Practical Guide to Implementing DeepSearch / DeepResearch&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I really like the definitions Han Xiao from Jina AI proposes for the terms DeepSearch and DeepResearch in this piece:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;DeepSearch&lt;/strong&gt; runs through an iterative loop of searching, reading, and reasoning until it finds the optimal answer.  [...]&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DeepResearch&lt;/strong&gt; builds upon DeepSearch by adding a structured framework for generating long research reports.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I've recently found myself cooling a little on the classic RAG pattern of finding relevant documents and dumping them into the context for a single call to an LLM.&lt;/p&gt;
&lt;p&gt;I think this definition of DeepSearch helps explain why. RAG is about answering questions that fall outside of the knowledge baked into a model. The DeepSearch pattern offers a tools-based alternative to classic RAG: we give the model extra tools for running multiple searches (which could be vector-based, or FTS, or even systems like ripgrep) and run it for several steps in a loop to try to find an answer.&lt;/p&gt;
&lt;p&gt;I think DeepSearch is a lot more interesting than DeepResearch, which feels to me more like a presentation layer thing. Pulling together the results from multiple searches into a "report" looks more impressive, but I &lt;a href="https://simonwillison.net/2025/Feb/25/deep-research-system-card/"&gt;still worry&lt;/a&gt; that the report format provides a misleading impression of the quality of the "research" that took place.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/search"&gt;search&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rag"&gt;rag&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-tool-use"&gt;llm-tool-use&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jina"&gt;jina&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-search"&gt;ai-assisted-search&lt;/a&gt;&lt;/p&gt;



</summary><category term="search"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="rag"/><category term="llm-tool-use"/><category term="jina"/><category term="ai-assisted-search"/></entry><entry><title>q and qv zsh functions for asking questions of websites and YouTube videos with LLM</title><link href="https://simonwillison.net/2024/Dec/19/q-and-qv-zsh-functions/#atom-tag" rel="alternate"/><published>2024-12-19T15:42:34+00:00</published><updated>2024-12-19T15:42:34+00:00</updated><id>https://simonwillison.net/2024/Dec/19/q-and-qv-zsh-functions/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/davidgasquez/dotfiles/blob/bb9df4a369dbaef95ca0c35642de491c7dd41269/shell/zshrc#L50-L99"&gt;q and qv zsh functions for asking questions of websites and YouTube videos with LLM&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Spotted these in David Gasquez's &lt;code&gt;zshrc&lt;/code&gt; dotfiles: two shell functions that use my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool to answer questions about a website or YouTube video.&lt;/p&gt;
&lt;p&gt;Here's how to ask a question of a website:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;q https://simonwillison.net/ 'What has Simon written about recently?'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I got back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Recently, Simon Willison has written about various topics including:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Building Python Tools&lt;/strong&gt; - Exploring one-shot applications using Claude and dependency management with &lt;code&gt;uv&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Modern Java Usage&lt;/strong&gt; - Discussing recent developments in Java that simplify coding.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GitHub Copilot Updates&lt;/strong&gt; - New free tier and features in GitHub Copilot for Vue and VS Code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Engagement on Bluesky&lt;/strong&gt; - Investigating the use of bots to create artificially polite disagreements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenAI WebRTC Audio&lt;/strong&gt; - Demonstrating a new API for real-time audio conversation with models.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;It works by constructing a &lt;a href="https://simonwillison.net/2024/Jun/16/jina-ai-reader/"&gt;Jina Reader URL&lt;/a&gt; to convert that URL to Markdown, then piping that content into LLM along with the question.&lt;/p&gt;
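&lt;p&gt;A minimal version of that pattern can be sketched as a shell function - a simplification of David's actual code, assuming &lt;code&gt;curl&lt;/code&gt; and the &lt;code&gt;llm&lt;/code&gt; CLI are installed:&lt;/p&gt;

```shell
# Sketch of a q-style helper: fetch a page as LLM-friendly Markdown
# via the Jina Reader URL prefix, then pipe that content into llm
# with the question as the system prompt.
q() {
  local url="$1"
  local question="$2"
  curl -s "https://r.jina.ai/$url" | llm -s "$question"
}
```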
&lt;p&gt;The YouTube one is even more fun:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;qv 'https://www.youtube.com/watch?v=uRuLgar5XZw' 'what does Simon say about open source?'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It said (about &lt;a href="https://www.youtube.com/watch?v=uRuLgar5XZw"&gt;this 72-minute video&lt;/a&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Simon emphasizes that open source has significantly increased productivity in software development. He points out that before open source, developers often had to recreate existing solutions or purchase proprietary software, which often limited customization. The availability of open source projects has made it easier to find and utilize existing code, which he believes is one of the primary reasons for more efficient software development today.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The secret sauce behind that one is the way it uses &lt;code&gt;yt-dlp&lt;/code&gt; to extract just the subtitles for the video:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;local subtitle_url=$(yt-dlp -q --skip-download --convert-subs srt --write-sub --sub-langs "en" --write-auto-sub --print "requested_subtitles.en.url" "$url")
local content=$(curl -s "$subtitle_url" | sed '/^$/d' | grep -v '^[0-9]*$' | grep -v '\--&amp;gt;' | sed 's/&amp;lt;[^&amp;gt;]*&amp;gt;//g' | tr '\n' ' ')
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That first line retrieves a URL to the subtitles in WEBVTT format - I &lt;a href="https://gist.github.com/simonw/7f07837cf8adcee23fd5cd5394170f27"&gt;saved a copy of that here&lt;/a&gt;. The second line then uses &lt;code&gt;curl&lt;/code&gt; to fetch them, then &lt;code&gt;sed&lt;/code&gt; and &lt;code&gt;grep&lt;/code&gt; to remove the timestamp information, producing &lt;a href="https://gist.github.com/simonw/7f07837cf8adcee23fd5cd5394170f27?permalink_comment_id=5350044#gistcomment-5350044"&gt;this&lt;/a&gt;.&lt;/p&gt;
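&lt;p&gt;Putting those pieces together, a simplified sketch of the &lt;code&gt;qv&lt;/code&gt; function (assuming &lt;code&gt;yt-dlp&lt;/code&gt;, &lt;code&gt;curl&lt;/code&gt; and the &lt;code&gt;llm&lt;/code&gt; CLI are installed) looks like this:&lt;/p&gt;

```shell
# Sketch of a qv-style helper: grab the English subtitles for a
# YouTube video, strip the blank lines, cue numbers, timestamp
# arrows and inline tags, then feed the flattened transcript to
# llm with the question as the system prompt.
qv() {
  local url="$1"
  local question="$2"
  local subtitle_url=$(yt-dlp -q --skip-download --convert-subs srt \
    --write-sub --sub-langs "en" --write-auto-sub \
    --print "requested_subtitles.en.url" "$url")
  local content=$(curl -s "$subtitle_url" | sed '/^$/d' \
    | grep -v '^[0-9]*$' | grep -v '\-->' \
    | sed 's/<[^>]*>//g' | tr '\n' ' ')
  echo "$content" | llm -s "$question"
}
```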

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://davidgasquez.com/useful-llm-tools-2024/"&gt;Useful LLM tools (2024 Edition)&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/youtube"&gt;youtube&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/zsh"&gt;zsh&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jina"&gt;jina&lt;/a&gt;&lt;/p&gt;



</summary><category term="youtube"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="zsh"/><category term="jina"/></entry><entry><title>docs.jina.ai - the Jina meta-prompt</title><link href="https://simonwillison.net/2024/Oct/30/jina-meta-prompt/#atom-tag" rel="alternate"/><published>2024-10-30T17:07:42+00:00</published><updated>2024-10-30T17:07:42+00:00</updated><id>https://simonwillison.net/2024/Oct/30/jina-meta-prompt/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.jina.ai/"&gt;docs.jina.ai - the Jina meta-prompt&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;From &lt;a href="https://twitter.com/jinaai_/status/1851651702635847729"&gt;Jina AI on Twitter&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;curl docs.jina.ai&lt;/code&gt; - This is our &lt;strong&gt;Meta-Prompt&lt;/strong&gt;. It allows LLMs to understand our Reader, Embeddings, Reranker, and Classifier APIs for improved codegen. Using the meta-prompt is straightforward. Just copy the prompt into your preferred LLM interface like ChatGPT, Claude, or whatever works for you, add your instructions, and you're set.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The page is served using content negotiation. If you hit it with &lt;code&gt;curl&lt;/code&gt; you get plain text, but a browser with &lt;code&gt;text/html&lt;/code&gt; in the &lt;code&gt;accept:&lt;/code&gt; header gets an explanation along with a convenient copy-to-clipboard button.&lt;/p&gt;
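&lt;p&gt;You can see the negotiation in action from the command line - a quick illustration, and the exact response bodies may change over time:&lt;/p&gt;

```shell
# curl sends "Accept: */*" by default, so the server responds with
# the plain-text meta-prompt:
curl -s https://docs.jina.ai | head -n 3

# Asking for text/html instead returns the human-facing HTML page:
curl -s -H 'Accept: text/html' https://docs.jina.ai | head -n 3
```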
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/jina-docs.jpg" alt="Screenshot of an API documentation page for Jina AI with warning message, access instructions, and code sample. Contains text: Note: This content is specifically designed for LLMs and not intended for human reading. For human-readable content, please visit Jina AI. For LLMs/programmatic access, you can fetch this content directly: curl docs.jina.ai/v2 # or wget docs.jina.ai/v2 # or fetch docs.jina.ai/v2 You only see this as a HTML when you access docs.jina.ai via browser. If you access it via code/program, you will get a text/plain response as below. You are an AI engineer designed to help users use Jina AI Search Foundation API's for their specific use case. # Core principles..." style="max-width:90%;" class="blogmark-image"&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/documentation"&gt;documentation&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jina"&gt;jina&lt;/a&gt;&lt;/p&gt;



</summary><category term="documentation"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="llm"/><category term="jina"/></entry><entry><title>My Jina Reader tool</title><link href="https://simonwillison.net/2024/Oct/14/my-jina-reader-tool/#atom-tag" rel="alternate"/><published>2024-10-14T16:47:56+00:00</published><updated>2024-10-14T16:47:56+00:00</updated><id>https://simonwillison.net/2024/Oct/14/my-jina-reader-tool/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://tools.simonwillison.net/jina-reader"&gt;My Jina Reader tool&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I wanted to feed the &lt;a href="https://developers.cloudflare.com/durable-objects/api/storage-api/"&gt;Cloudflare Durable Objects SQLite&lt;/a&gt; documentation into Claude, but I was on my iPhone so copying and pasting was inconvenient. Jina offer a &lt;a href="https://jina.ai/reader/"&gt;Reader API&lt;/a&gt; which can turn any URL into LLM-friendly Markdown and it turns out it supports CORS, so I &lt;a href="https://gist.github.com/simonw/053b271e023ed1b834529e2fbd0efc3b"&gt;got Claude to build me this tool&lt;/a&gt; (&lt;a href="https://gist.github.com/simonw/e56d55e6a87a547faac7070eb912b32d"&gt;second iteration&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/e0a841a580038d15c7bf22bd7d104ce3"&gt;third iteration&lt;/a&gt;, &lt;a href="https://github.com/simonw/tools/blob/main/jina-reader.html"&gt;final source code&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Paste in a URL to get the Jina Markdown version, along with an all important "Copy to clipboard" button.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2024/jina-reader.jpg" class="blogmark-image" style="max-width: 90%"&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude-3-5-sonnet"&gt;claude-3-5-sonnet&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cors"&gt;cors&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jina"&gt;jina&lt;/a&gt;&lt;/p&gt;



</summary><category term="projects"/><category term="markdown"/><category term="ai"/><category term="generative-ai"/><category term="llms"/><category term="ai-assisted-programming"/><category term="claude"/><category term="claude-3-5-sonnet"/><category term="cors"/><category term="jina"/></entry><entry><title>Bridging Language Gaps in Multilingual Embeddings via Contrastive Learning</title><link href="https://simonwillison.net/2024/Oct/10/bridging-language-gaps-in-multilingual-embeddings-via-contrastiv/#atom-tag" rel="alternate"/><published>2024-10-10T16:00:35+00:00</published><updated>2024-10-10T16:00:35+00:00</updated><id>https://simonwillison.net/2024/Oct/10/bridging-language-gaps-in-multilingual-embeddings-via-contrastiv/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jina.ai/news/bridging-language-gaps-in-multilingual-embeddings-via-contrastive-learning/"&gt;Bridging Language Gaps in Multilingual Embeddings via Contrastive Learning&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Most text embedding models suffer from a "language gap", where phrases in different languages with the same semantic meaning end up with embedding vectors that aren't clustered together.&lt;/p&gt;
&lt;p&gt;Jina claim their new &lt;a href="https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model"&gt;jina-embeddings-v3&lt;/a&gt; (CC BY-NC 4.0, which means you need to license it for commercial use if you're not using &lt;a href="https://jina.ai/embeddings/"&gt;their API&lt;/a&gt;) is much better on this front, thanks to a training technique called "contrastive learning".&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There are 30 languages represented in our contrastive learning dataset, but 97% of pairs and triplets are in just one language, with only 3% involving cross-language pairs or triplets. But this 3% is enough to produce a dramatic result: Embeddings show very little language clustering and semantically similar texts produce close embeddings regardless of their language&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img alt="Scatter plot diagram, titled Desired Outcome: Clustering by Meaning. My dog is blue and Mein Hund ist blau are located near to each other, and so are Meine Katze ist rot and My cat is red" src="https://static.simonwillison.net/static/2024/jina-multi-language.png" /&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://twitter.com/JinaAI_/status/1844401388878762209"&gt;@JinaAI_&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/machine-learning"&gt;machine-learning&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jina"&gt;jina&lt;/a&gt;&lt;/p&gt;



</summary><category term="machine-learning"/><category term="ai"/><category term="embeddings"/><category term="jina"/></entry><entry><title>Jina AI Reader</title><link href="https://simonwillison.net/2024/Jun/16/jina-ai-reader/#atom-tag" rel="alternate"/><published>2024-06-16T19:33:58+00:00</published><updated>2024-06-16T19:33:58+00:00</updated><id>https://simonwillison.net/2024/Jun/16/jina-ai-reader/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://jina.ai/reader/"&gt;Jina AI Reader&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Jina AI provide a number of different AI-related platform products, including an excellent &lt;a href="https://huggingface.co/collections/jinaai/jina-embeddings-v2-65708e3ec4993b8fb968e744"&gt;family of embedding models&lt;/a&gt;, but one of their most instantly useful products is Jina Reader, an API for turning any URL into Markdown content suitable for piping into an LLM.&lt;/p&gt;
&lt;p&gt;Add &lt;code&gt;r.jina.ai&lt;/code&gt; to the front of a URL to get back Markdown of that page, for example &lt;a href="https://r.jina.ai/https://simonwillison.net/2024/Jun/16/jina-ai-reader/"&gt;https://r.jina.ai/https://simonwillison.net/2024/Jun/16/jina-ai-reader/&lt;/a&gt; - in addition to converting the content to Markdown it also does a decent job of extracting just the content and ignoring the surrounding navigation.&lt;/p&gt;
&lt;p&gt;The API is free but rate-limited (presumably by IP) to 20 requests per minute without an API key or 200 requests per minute with a free API key, and you can pay to increase your allowance beyond that.&lt;/p&gt;
&lt;p&gt;The Apache 2 licensed source code for the hosted service is &lt;a href="https://github.com/jina-ai/reader"&gt;on GitHub&lt;/a&gt; - it's written in TypeScript and &lt;a href="https://github.com/jina-ai/reader/blob/main/backend/functions/src/services/puppeteer.ts"&gt;uses Puppeteer&lt;/a&gt; to run &lt;a href="https://github.com/mozilla/readability"&gt;Readability.js&lt;/a&gt; and &lt;a href="https://github.com/mixmark-io/turndown"&gt;Turndown&lt;/a&gt; against the scraped page.&lt;/p&gt;
&lt;p&gt;It can also handle PDFs, which have their contents extracted &lt;a href="https://github.com/jina-ai/reader/blob/main/backend/functions/src/services/pdf-extract.ts"&gt;using PDF.js&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There's also a search feature, &lt;code&gt;s.jina.ai/search+term+goes+here&lt;/code&gt;, which &lt;a href="https://github.com/jina-ai/reader/blob/ed80c9a4a2c340fb7c874347d3f25501e42ca251/backend/functions/src/services/brave-search.ts"&gt;uses the Brave Search API&lt;/a&gt;.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/apis"&gt;apis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/markdown"&gt;markdown&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/puppeteer"&gt;puppeteer&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jina"&gt;jina&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/brave"&gt;brave&lt;/a&gt;&lt;/p&gt;



</summary><category term="apis"/><category term="markdown"/><category term="ai"/><category term="puppeteer"/><category term="llms"/><category term="jina"/><category term="brave"/></entry><entry><title>Exploring Hacker News by mapping and analyzing 40 million posts and comments for fun</title><link href="https://simonwillison.net/2024/May/10/exploring-hacker-news-by-mapping-and-analyzing-40-million-posts/#atom-tag" rel="alternate"/><published>2024-05-10T16:42:55+00:00</published><updated>2024-05-10T16:42:55+00:00</updated><id>https://simonwillison.net/2024/May/10/exploring-hacker-news-by-mapping-and-analyzing-40-million-posts/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.wilsonl.in/hackerverse/"&gt;Exploring Hacker News by mapping and analyzing 40 million posts and comments for fun&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A real tour de force of data engineering. Wilson Lin fetched 40 million posts and comments from the Hacker News API (using Node.js with a custom multi-process worker pool) and then ran them all through the &lt;code&gt;BGE-M3&lt;/code&gt; embedding model using RunPod, which let him fire up ~150 GPU instances to get the whole run done in a few hours, using a custom RocksDB and Rust queue he built to save on Amazon SQS costs.&lt;/p&gt;
&lt;p&gt;Then he crawled 4 million linked pages, embedded &lt;em&gt;that&lt;/em&gt; content using the faster and cheaper &lt;code&gt;jina-embeddings-v2-small-en&lt;/code&gt; model, ran UMAP dimensionality reduction to render a 2D map and did a whole lot of follow-on work to identify topic areas and make the map look good.&lt;/p&gt;
&lt;p&gt;That's not even half the project - Wilson built several interactive features on top of the resulting data, and experimented with custom rendering techniques on top of canvas to get everything to render quickly.&lt;/p&gt;
&lt;p&gt;There's so much in here, and both the code and data (multiple GBs of arrow files) are available if you want to dig in and try some of this out for yourself.&lt;/p&gt;
&lt;p&gt;In the Hacker News comments Wilson shares that the total cost of the project was a couple of hundred dollars.&lt;/p&gt;
&lt;p&gt;One tiny detail I particularly enjoyed - unrelated to the embeddings - was this trick for testing which edge location is closest to a user using JavaScript:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;const edge = await Promise.race(
  EDGES.map(async (edge) =&amp;gt; {
    // Run a few times to avoid potential cold start biases.
    for (let i = 0; i &amp;lt; 3; i++) {
      await fetch(`https://${edge}.edge-hndr.wilsonl.in/healthz`);
    }
    return edge;
  }),
);
&lt;/code&gt;&lt;/pre&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=40307519"&gt;Show HN&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/hacker-news"&gt;hacker-news&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jina"&gt;jina&lt;/a&gt;&lt;/p&gt;



</summary><category term="hacker-news"/><category term="embeddings"/><category term="jina"/></entry><entry><title>Execute Jina embeddings with a CLI using llm-embed-jina</title><link href="https://simonwillison.net/2023/Oct/26/llm-embed-jina/#atom-tag" rel="alternate"/><published>2023-10-26T03:47:08+00:00</published><updated>2023-10-26T03:47:08+00:00</updated><id>https://simonwillison.net/2023/Oct/26/llm-embed-jina/#atom-tag</id><summary type="html">
    &lt;p&gt;Berlin-based Jina AI &lt;a href="https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/"&gt;just released a new family of embedding models&lt;/a&gt;, boasting that they are the "world's first open-source 8K text embedding model" and that they rival OpenAI's &lt;code&gt;text-embedding-ada-002&lt;/code&gt; in quality.&lt;/p&gt;
&lt;p&gt;I wrote about embeddings extensively &lt;a href="https://simonwillison.net/2023/Oct/23/embeddings/"&gt;the other day&lt;/a&gt; - if you're not familiar with what they are and what you can do with them I suggest reading that first.&lt;/p&gt;
&lt;p&gt;This evening I built and released a new plugin for my &lt;a href="https://llm.datasette.io/"&gt;LLM&lt;/a&gt; tool which adds support for Jina's new embedding models.&lt;/p&gt;
&lt;h4 id="trying-out-llm-embed-jina"&gt;Trying out llm-embed-jina&lt;/h4&gt;
&lt;p&gt;The plugin is called &lt;a href="https://github.com/simonw/llm-embed-jina"&gt;llm-embed-jina&lt;/a&gt;. Here's the quickest way to get started with it:&lt;/p&gt;
&lt;p&gt;First, &lt;a href="https://llm.datasette.io/en/stable/setup.html"&gt;install LLM&lt;/a&gt; if you haven't already. You can use &lt;a href="https://pypa.github.io/pipx/"&gt;pipx&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pipx install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Or &lt;code&gt;pip&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;pip install llm&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Unfortunately installing LLM using Homebrew doesn't currently work with this plugin as PyTorch has not yet been released for Python 3.12 - details &lt;a href="https://github.com/simonw/llm/issues/315#issuecomment-1783661583"&gt;in this issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Now you can install the &lt;code&gt;llm-embed-jina&lt;/code&gt; plugin:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm install llm-embed-jina&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;llm install&lt;/code&gt; command ensures it gets installed in the correct virtual environment, no matter how you installed LLM itself.&lt;/p&gt;
&lt;p&gt;Run this command to check that it added the models:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed-models&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You should see output like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ada-002 (aliases: ada, oai)
jina-embeddings-v2-small-en
jina-embeddings-v2-base-en
jina-embeddings-v2-large-en
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;jina-embeddings-v2-large-en&lt;/code&gt; model isn't available yet, but should work as soon as Jina release it. I expect it will show up at &lt;a href="https://huggingface.co/jinaai/jina-embeddings-v2-large-en"&gt;huggingface.co/jinaai/jina-embeddings-v2-large-en&lt;/a&gt; (currently a 404).&lt;/p&gt;
&lt;p&gt;Now you can run one of the models. The &lt;code&gt;-small-en&lt;/code&gt; model is a good starting point: it's only a 65MB download - the &lt;code&gt;-base-en&lt;/code&gt; model is 275MB.&lt;/p&gt;
&lt;p&gt;The model will download the first time you try to use it. Run this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed -m jina-embeddings-v2-small-en -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;Hello world&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will return a JSON array of 512 floating point numbers - the embedding vector for the string "Hello world".&lt;/p&gt;
&lt;p&gt;Embeddings are much more interesting if you store them somewhere and then use them to run comparisons. The &lt;a href="https://llm.datasette.io/en/stable/embeddings/cli.html#llm-embed-multi"&gt;llm embed-multi&lt;/a&gt; command can do that.&lt;/p&gt;
&lt;p&gt;Change directory to a folder that you know contains &lt;code&gt;README.md&lt;/code&gt; files (anything with a &lt;code&gt;node_modules&lt;/code&gt; folder will do) and run this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm embed-multi readmes \
    -m jina-embeddings-v2-small-en \
    --files &lt;span class="pl-c1"&gt;.&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;**/README.md&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    --database readmes.db&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This will create a SQLite database called &lt;code&gt;readmes.db&lt;/code&gt;, then search for every &lt;code&gt;README.md&lt;/code&gt; file in the current directory and all subdirectories, embed the content of each one and store the results in that database.&lt;/p&gt;
&lt;p&gt;Those embeddings will live in a collection called &lt;code&gt;readmes&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you leave off the &lt;code&gt;--database readmes.db&lt;/code&gt; option the collections will be stored in a default SQLite database tucked away somewhere on your system.&lt;/p&gt;
&lt;p&gt;Having done this, you can run semantic similarity searches against the new collection like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm similar readmes -d readmes.db -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;utility functions&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When I ran that in my &lt;a href="https://github.com/simonw/hmb-map"&gt;hmb-map&lt;/a&gt; directory I got these:&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/@maplibre/maplibre-gl-style-spec/src/feature_filter/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.7802185991017785&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/kind-of/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.7725600920927725&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/which/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.7645426557095619&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/@mapbox/point-geometry/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.7636548563018607&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/esbuild/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.7633325127194481&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/maplibre-gl/src/shaders/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.7614428292518743&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/minimist/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.7581314986768929&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/split-string/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.7563253351715924&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/assign-symbols/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.7555915219064293&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}
{&lt;span class="pl-ent"&gt;"id"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;node_modules/maplibre-gl/build/README.md&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-ent"&gt;"score"&lt;/span&gt;: &lt;span class="pl-c1"&gt;0.754027372081506&lt;/span&gt;, &lt;span class="pl-ent"&gt;"content"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;, &lt;span class="pl-ent"&gt;"metadata"&lt;/span&gt;: &lt;span class="pl-c1"&gt;null&lt;/span&gt;}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;These are the top ten results by similarity to the string I entered.&lt;/p&gt;
&lt;p&gt;You can also pass in the ID of an item in the collection to see other similar items:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;llm similar readmes -d readmes.db node_modules/esbuild/README.md &lt;span class="pl-k"&gt;|&lt;/span&gt; jq .id&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;I piped it through &lt;code&gt;| jq .id&lt;/code&gt; to get back just the IDs. I got this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;"node_modules/@esbuild/darwin-arm64/README.md"
"node_modules/rollup/README.md"
"node_modules/assign-symbols/README.md"
"node_modules/split-string/node_modules/extend-shallow/README.md"
"node_modules/isobject/README.md"
"node_modules/maplibre-gl/build/README.md"
"node_modules/vite/README.md"
"node_modules/nanoid/README.md"
"node_modules/@mapbox/tiny-sdf/README.md"
"node_modules/split-string/node_modules/is-extendable/README.md"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See the &lt;a href="https://llm.datasette.io/en/stable/embeddings/index.html"&gt;LLM embeddings documentation&lt;/a&gt; for more details on things you can do with this tool.&lt;/p&gt;
&lt;h4 id="how-i-built-the-plugin"&gt;How I built the plugin&lt;/h4&gt;
&lt;p&gt;I built the first version of this plugin in about 15 minutes. It took another hour to iron out a couple of bugs.&lt;/p&gt;
&lt;p&gt;I started with &lt;a href="https://github.com/simonw/llm-plugin"&gt;this cookiecutter template&lt;/a&gt;, followed by pasting in the recipe in the LLM documentation on &lt;a href="https://llm.datasette.io/en/stable/embeddings/writing-plugins.html"&gt;writing embedding model plugins&lt;/a&gt; combined with some example code that Jina provided &lt;a href="https://huggingface.co/jinaai/jina-embeddings-v2-small-en#usage"&gt;in their model release&lt;/a&gt;. Here's their code:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;transformers&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;AutoModel&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;numpy&lt;/span&gt;.&lt;span class="pl-s1"&gt;linalg&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;norm&lt;/span&gt;

&lt;span class="pl-s1"&gt;cos_sim&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;lambda&lt;/span&gt; &lt;span class="pl-s1"&gt;a&lt;/span&gt;,&lt;span class="pl-s1"&gt;b&lt;/span&gt;: (&lt;span class="pl-s1"&gt;a&lt;/span&gt; @ &lt;span class="pl-s1"&gt;b&lt;/span&gt;.&lt;span class="pl-v"&gt;T&lt;/span&gt;) &lt;span class="pl-c1"&gt;/&lt;/span&gt; (&lt;span class="pl-en"&gt;norm&lt;/span&gt;(&lt;span class="pl-s1"&gt;a&lt;/span&gt;)&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-en"&gt;norm&lt;/span&gt;(&lt;span class="pl-s1"&gt;b&lt;/span&gt;))
&lt;span class="pl-s1"&gt;model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;AutoModel&lt;/span&gt;.&lt;span class="pl-en"&gt;from_pretrained&lt;/span&gt;(&lt;span class="pl-s"&gt;'jinaai/jina-embeddings-v2-small-en'&lt;/span&gt;, &lt;span class="pl-s1"&gt;trust_remote_code&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;) &lt;span class="pl-c"&gt;# trust_remote_code is needed to use the encode method&lt;/span&gt;
&lt;span class="pl-s1"&gt;embeddings&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model&lt;/span&gt;.&lt;span class="pl-en"&gt;encode&lt;/span&gt;([&lt;span class="pl-s"&gt;'How is the weather today?'&lt;/span&gt;, &lt;span class="pl-s"&gt;'What is the current weather like today?'&lt;/span&gt;])
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-en"&gt;cos_sim&lt;/span&gt;(&lt;span class="pl-s1"&gt;embeddings&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;], &lt;span class="pl-s1"&gt;embeddings&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;]))&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;numpy&lt;/code&gt; and &lt;code&gt;cos_sim&lt;/code&gt; bits aren't needed for the plugin, so I ignored them.&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://github.com/simonw/llm-embed-jina/blob/9cbeff3f72318ea5972310235efc1262cc72f960/llm_embed_jina.py"&gt;first working version&lt;/a&gt; of the plugin was a file called &lt;code&gt;llm_embed_jina.py&lt;/code&gt; that looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;transformers&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;AutoModel&lt;/span&gt;


&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;register_embedding_models&lt;/span&gt;(&lt;span class="pl-s1"&gt;register&lt;/span&gt;):
    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;model_id&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; (
        &lt;span class="pl-s"&gt;"jina-embeddings-v2-small-en"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"jina-embeddings-v2-base-en"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"jina-embeddings-v2-large-en"&lt;/span&gt;,
    ):
        &lt;span class="pl-en"&gt;register&lt;/span&gt;(&lt;span class="pl-v"&gt;JinaEmbeddingModel&lt;/span&gt;(&lt;span class="pl-s1"&gt;model_id&lt;/span&gt;))


&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;JinaEmbeddingModel&lt;/span&gt;(&lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-v"&gt;EmbeddingModel&lt;/span&gt;):
    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;__init__&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;model_id&lt;/span&gt;):
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;model_id&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model_id&lt;/span&gt;
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;_model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;embed_batch&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;texts&lt;/span&gt;):
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;_model&lt;/span&gt; &lt;span class="pl-c1"&gt;is&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;_model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;AutoModel&lt;/span&gt;.&lt;span class="pl-en"&gt;from_pretrained&lt;/span&gt;(
                &lt;span class="pl-s"&gt;"jinaai/{}"&lt;/span&gt;.&lt;span class="pl-en"&gt;format&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;model_id&lt;/span&gt;), &lt;span class="pl-s1"&gt;trust_remote_code&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;
            )
        &lt;span class="pl-s1"&gt;results&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;_model&lt;/span&gt;.&lt;span class="pl-en"&gt;encode&lt;/span&gt;(&lt;span class="pl-s1"&gt;texts&lt;/span&gt;)
        &lt;span class="pl-k"&gt;return&lt;/span&gt; (&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;float&lt;/span&gt;, &lt;span class="pl-s1"&gt;result&lt;/span&gt;)) &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;results&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;There's really not a lot to it.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;register_embedding_models()&lt;/code&gt; function is a &lt;a href="https://llm.datasette.io/en/stable/plugins/plugin-hooks.html"&gt;plugin hook&lt;/a&gt; that LLM calls to register all of the embedding models.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;JinaEmbeddingModel&lt;/code&gt; is a subclass of &lt;code&gt;llm.EmbeddingModel&lt;/code&gt;. It just needs to implement two things: a constructor and that &lt;code&gt;embed_batch(self, texts)&lt;/code&gt; method.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;AutoModel.from_pretrained()&lt;/code&gt; is provided by &lt;a href="https://huggingface.co/docs/transformers/index"&gt;Hugging Face Transformers&lt;/a&gt;. It downloads and caches the model the first time you call it.&lt;/p&gt;
&lt;p&gt;The model returns numpy arrays, but LLM wants a regular Python list of floats - that's what that last &lt;code&gt;return&lt;/code&gt; line is doing.&lt;/p&gt;
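&lt;p&gt;As a minimal sketch of that conversion, using made-up values rather than real embeddings:&lt;/p&gt;

```python
import numpy as np

# model.encode() returns a 2D numpy array of float32 values,
# one row per input text
results = np.array([[0.5, 0.25], [0.75, 1.0]], dtype=np.float32)

# LLM wants a regular Python list of floats for each text,
# so convert every numpy scalar with float()
as_lists = [list(map(float, row)) for row in results]
```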
&lt;p&gt;I found a couple of bugs with this. The model &lt;a href="https://github.com/simonw/llm-embed-jina/issues/3"&gt;didn't like&lt;/a&gt; having &lt;code&gt;.encode(texts)&lt;/code&gt; called with a generator, so I needed to convert that into a list. Then later I found that text longer than 8192 characters could &lt;a href="https://github.com/simonw/llm-embed-jina/issues/4"&gt;cause the model to hang&lt;/a&gt; in some situations, so I added my own truncation.&lt;/p&gt;
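&lt;p&gt;Both fixes amount to normalizing the input before calling &lt;code&gt;.encode()&lt;/code&gt;: materializing any generator into a list and clipping each string to the model's limit. Sketched in isolation (&lt;code&gt;prepare&lt;/code&gt; is just an illustrative name, not part of the plugin):&lt;/p&gt;

```python
MAX_LENGTH = 8192

def prepare(texts):
    # The list comprehension materializes generators into a list,
    # and the slice clips each string to the model's length limit
    return [text[:MAX_LENGTH] for text in texts]

# Works even when the caller passes a generator of long strings
clipped = prepare(c * 10000 for c in "ab")
```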
&lt;p&gt;The current version (0.1.2) of the plugin, with fixes for both of those issues, looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;llm&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;transformers&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-v"&gt;AutoModel&lt;/span&gt;

&lt;span class="pl-v"&gt;MAX_LENGTH&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;8192&lt;/span&gt;


&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-s1"&gt;hookimpl&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;register_embedding_models&lt;/span&gt;(&lt;span class="pl-s1"&gt;register&lt;/span&gt;):
    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;model_id&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; (
        &lt;span class="pl-s"&gt;"jina-embeddings-v2-small-en"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"jina-embeddings-v2-base-en"&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"jina-embeddings-v2-large-en"&lt;/span&gt;,
    ):
        &lt;span class="pl-en"&gt;register&lt;/span&gt;(&lt;span class="pl-v"&gt;JinaEmbeddingModel&lt;/span&gt;(&lt;span class="pl-s1"&gt;model_id&lt;/span&gt;))


&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;JinaEmbeddingModel&lt;/span&gt;(&lt;span class="pl-s1"&gt;llm&lt;/span&gt;.&lt;span class="pl-v"&gt;EmbeddingModel&lt;/span&gt;):
    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;__init__&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;model_id&lt;/span&gt;):
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;model_id&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;model_id&lt;/span&gt;
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;_model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;embed_batch&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;texts&lt;/span&gt;):
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;_model&lt;/span&gt; &lt;span class="pl-c1"&gt;is&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;_model&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-v"&gt;AutoModel&lt;/span&gt;.&lt;span class="pl-en"&gt;from_pretrained&lt;/span&gt;(
                &lt;span class="pl-s"&gt;"jinaai/{}"&lt;/span&gt;.&lt;span class="pl-en"&gt;format&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;model_id&lt;/span&gt;), &lt;span class="pl-s1"&gt;trust_remote_code&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;
            )
        &lt;span class="pl-s1"&gt;results&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-s1"&gt;_model&lt;/span&gt;.&lt;span class="pl-en"&gt;encode&lt;/span&gt;([&lt;span class="pl-s1"&gt;text&lt;/span&gt;[:&lt;span class="pl-v"&gt;MAX_LENGTH&lt;/span&gt;] &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;text&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;texts&lt;/span&gt;])
        &lt;span class="pl-k"&gt;return&lt;/span&gt; (&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;float&lt;/span&gt;, &lt;span class="pl-s1"&gt;result&lt;/span&gt;)) &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;results&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;I'm really pleased with how quickly this came together - I think it's a strong signal that the &lt;a href="https://simonwillison.net/2023/Sep/4/llm-embeddings/"&gt;LLM embeddings plugin design&lt;/a&gt; is working well.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/cli"&gt;cli&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/embeddings"&gt;embeddings&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm"&gt;llm&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jina"&gt;jina&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="cli"/><category term="plugins"/><category term="projects"/><category term="ai"/><category term="embeddings"/><category term="llm"/><category term="jina"/></entry></feed>