<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: redpajama</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/redpajama.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2023-05-22T19:25:13+00:00</updated><author><name>Simon Willison</name></author><entry><title>MLC: Bringing Open Large Language Models to Consumer Devices</title><link href="https://simonwillison.net/2023/May/22/mlc-redpajama/#atom-tag" rel="alternate"/><published>2023-05-22T19:25:13+00:00</published><updated>2023-05-22T19:25:13+00:00</updated><id>https://simonwillison.net/2023/May/22/mlc-redpajama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mlc.ai/blog/2023/05/22/bringing-open-large-language-models-to-consumer-devices"&gt;MLC: Bringing Open Large Language Models to Consumer Devices&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“We bring RedPajama, a permissive open language model to WebGPU, iOS, GPUs, and various other platforms.” I managed to get this running on my Mac (see via link) with a few tweaks to their official instructions.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://til.simonwillison.net/llms/mlc-chat-redpajama"&gt;mlc-chat - RedPajama-INCITE-Chat-3B on macOS&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="mlc"/><category term="redpajama"/><category term="webgpu"/><category term="gpus"/></entry><entry><title>OpenLLaMA</title><link href="https://simonwillison.net/2023/May/3/openllama/#atom-tag" rel="alternate"/><published>2023-05-03T20:58:19+00:00</published><updated>2023-05-03T20:58:19+00:00</updated><id>https://simonwillison.net/2023/May/3/openllama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/openlm-research/open_llama"&gt;OpenLLaMA&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The first openly licensed model I’ve seen trained on the RedPajama dataset. This initial release is a 7B model trained on 200 billion tokens, but the team behind it are promising a full 1 trillion token model in the near future. I haven’t found a live demo of this one running anywhere yet.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="redpajama"/><category term="llm-release"/></entry><entry><title>What's in the RedPajama-Data-1T LLM training set</title><link href="https://simonwillison.net/2023/Apr/17/redpajama-data/#atom-tag" rel="alternate"/><published>2023-04-17T18:57:42+00:00</published><updated>2023-04-17T18:57:42+00:00</updated><id>https://simonwillison.net/2023/Apr/17/redpajama-data/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://www.together.xyz/blog/redpajama"&gt;RedPajama&lt;/a&gt; is "a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens". It's a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute.&lt;/p&gt;
&lt;p&gt;They just announced their first release: &lt;a href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T"&gt;RedPajama-Data-1T&lt;/a&gt;, a 1.2 trillion token dataset modelled on the training data described in &lt;a href="https://www.arxiv-vanity.com/papers/2302.13971/"&gt;the original LLaMA paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The full dataset is 2.67TB, so I decided not to try and download the whole thing! Here's what I've figured out about it so far.&lt;/p&gt;
&lt;h4&gt;How to download it&lt;/h4&gt;
&lt;p&gt;The data is split across 2,084 different files. These are listed in a plain text file here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt"&gt;https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The dataset card suggests you could download them all like this - assuming you have 2.67TB of disk space and bandwidth to spare:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget -i https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I prompted GPT-4 a few times to write a quick Python script to run a &lt;code&gt;HEAD&lt;/code&gt; request against each URL in that file instead, in order to collect the &lt;code&gt;Content-Length&lt;/code&gt; and calculate the total size of the data. My script is at the bottom of this post.&lt;/p&gt;
&lt;p&gt;I then processed the size data into &lt;a href="https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464"&gt;a format&lt;/a&gt; suitable for loading into &lt;a href="https://github.com/simonw/datasette-lite"&gt;Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Exploring the size data&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464#/data/raw?_facet=top_folder&amp;amp;_facet=top_folders&amp;amp;_sort_desc=size_gb"&gt;Here's a link&lt;/a&gt; to a Datasette Lite page showing all 2,084 files, sorted by size and with some useful facets.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/redpajama-sizes.jpg" alt="Datasette showing the rows, faceted by top_folder and top_folders. The largest file is wikipedia/wiki.jsonl at 111GB, then book/book.jsonl at 100GB, then stackexchange/stackexchange.jsonl at 74GB, then various filtered GitHub files" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is already revealing a lot about the data.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;top_folders&lt;/code&gt; facet inspired me to &lt;a href="https://lite.datasette.io/?install=datasette-copyable&amp;amp;json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464#/data?sql=select%0A++top_folders%2C%0A++cast+%28sum%28size_gb%29+as+integer%29+as+total_gb%2C%0A++count%28*%29+as+num_files%0Afrom+raw%0Agroup+by+top_folders%0Aorder+by+sum%28size_gb%29+desc"&gt;run this SQL query&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
  top_folders,
  cast (&lt;span class="pl-c1"&gt;sum&lt;/span&gt;(size_gb) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;integer&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; total_gb,
  &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; num_files
&lt;span class="pl-k"&gt;from&lt;/span&gt; raw
&lt;span class="pl-k"&gt;group by&lt;/span&gt; top_folders
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;sum&lt;/span&gt;(size_gb) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here are the results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;top_folders&lt;/th&gt;
&lt;th&gt;total_gb&lt;/th&gt;
&lt;th&gt;num_files&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;c4&lt;/td&gt;
&lt;td&gt;806&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2023-06&lt;/td&gt;
&lt;td&gt;288&lt;/td&gt;
&lt;td&gt;175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2020-05&lt;/td&gt;
&lt;td&gt;286&lt;/td&gt;
&lt;td&gt;198&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2021-04&lt;/td&gt;
&lt;td&gt;276&lt;/td&gt;
&lt;td&gt;176&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2022-05&lt;/td&gt;
&lt;td&gt;251&lt;/td&gt;
&lt;td&gt;157&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2019-30&lt;/td&gt;
&lt;td&gt;237&lt;/td&gt;
&lt;td&gt;153&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;github&lt;/td&gt;
&lt;td&gt;212&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wikipedia&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;book&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arxiv&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stackexchange&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There's a lot of Common Crawl data in there!&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.together.xyz/blog/redpajama"&gt;RedPajama announcement&lt;/a&gt; says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;CommonCrawl: Five dumps of CommonCrawl, processed using the CCNet pipeline, and filtered via several quality filters including a linear classifier that selects for Wikipedia-like pages.&lt;/li&gt;
&lt;li&gt;C4: Standard C4 dataset&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;It looks like they used &lt;a href="https://commoncrawl.org/"&gt;CommonCrawl&lt;/a&gt; from 5 different dates, from 2019-30 (30? That's not a valid month - looks like &lt;a href="https://hachyderm.io/@xek/110215763306634784"&gt;it's a week number&lt;/a&gt;) to 2022-05. I wonder if they de-duplicated content within those different crawls?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://paperswithcode.com/dataset/c4"&gt;C4&lt;/a&gt; is "a colossal, cleaned version of Common Crawl's web crawl corpus" - so yet another copy of Common Crawl, cleaned in a different way.&lt;/p&gt;
&lt;p&gt;I downloaded the first 100MB of that 100GB &lt;code&gt;book.jsonl&lt;/code&gt; file - the first 300 rows in it are all full-text books from Project Gutenberg, starting with &lt;a href="https://www.gutenberg.org/ebooks/10"&gt;The Bible Both Testaments King James Version&lt;/a&gt; from 1611.&lt;/p&gt;
&lt;p&gt;The data all appears to be in JSONL format - newline-delimited JSON. Different files I looked at had different shapes, though a common pattern was a &lt;code&gt;"text"&lt;/code&gt; key containing the text and a &lt;code&gt;"meta"&lt;/code&gt; key containing a dictionary of metadata.&lt;/p&gt;
&lt;p&gt;For example, the first line of &lt;code&gt;book.jsonl&lt;/code&gt; looks like this (after pretty-printing using &lt;code&gt;jq&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"meta"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"short_book_title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;The Bible Both Testaments King James Version&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"publication_date"&lt;/span&gt;: &lt;span class="pl-c1"&gt;1611&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://www.gutenberg.org/ebooks/10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-cce"&gt;\n\n&lt;/span&gt;The Old Testament of the King James Version of the Bible&lt;span class="pl-cce"&gt;\n&lt;/span&gt;...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
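&lt;p&gt;If you want to peek at that file yourself, something like this &lt;code&gt;httpx&lt;/code&gt; snippet fetches just the first 100MB with an HTTP &lt;code&gt;Range&lt;/code&gt; request and parses the first few records - a rough sketch, assuming the server honours &lt;code&gt;Range&lt;/code&gt; headers and that the file lives at the &lt;code&gt;book/book.jsonl&lt;/code&gt; path listed in &lt;code&gt;urls.txt&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import httpx

url = "https://data.together.xyz/redpajama-data-1T/v1.0.0/book/book.jsonl"
# Ask the server for only the first 100MB (bytes 0 through 104,857,599)
response = httpx.get(url, headers={"Range": "bytes=0-104857599"}, timeout=None)

# Drop the final line, which is almost certainly cut off mid-record
lines = response.text.splitlines()[:-1]
for line in lines[:3]:
    record = json.loads(line)
    print(record["meta"]["short_book_title"], len(record["text"]))
&lt;/code&gt;&lt;/pre&gt;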
&lt;p&gt;There are more details on the composition of the dataset in &lt;a href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T#dataset-creation"&gt;the dataset card&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;My Python script&lt;/h4&gt;
&lt;p&gt;I wrote a quick Python script to do the next best thing: run a &lt;code&gt;HEAD&lt;/code&gt; request against each URL to figure out the total size of the data.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/38246d2f230bd1d5cf8b4907e8871ed1"&gt;prompted GPT-4 a few times&lt;/a&gt;, and came up with this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;httpx&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;tqdm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;tqdm&lt;/span&gt;

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;get_sizes&lt;/span&gt;(&lt;span class="pl-s1"&gt;urls&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;sizes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; {}
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;fetch_size&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;):
        &lt;span class="pl-k"&gt;try&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;client&lt;/span&gt;.&lt;span class="pl-en"&gt;head&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;)
            &lt;span class="pl-s1"&gt;content_length&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-s1"&gt;headers&lt;/span&gt;.&lt;span class="pl-en"&gt;get&lt;/span&gt;(&lt;span class="pl-s"&gt;'Content-Length'&lt;/span&gt;)
            &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;content_length&lt;/span&gt; &lt;span class="pl-c1"&gt;is&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;:
                &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;content_length&lt;/span&gt;)
        &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-v"&gt;Exception&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;e&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"Error while processing URL '&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;': &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;e&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-s1"&gt;httpx&lt;/span&gt;.&lt;span class="pl-v"&gt;AsyncClient&lt;/span&gt;() &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;client&lt;/span&gt;:
        &lt;span class="pl-c"&gt;# Create a progress bar using tqdm&lt;/span&gt;
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;tqdm&lt;/span&gt;(&lt;span class="pl-s1"&gt;total&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-s1"&gt;urls&lt;/span&gt;), &lt;span class="pl-s1"&gt;desc&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Fetching sizes"&lt;/span&gt;, &lt;span class="pl-s1"&gt;unit&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"url"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;pbar&lt;/span&gt;:
            &lt;span class="pl-c"&gt;# Use asyncio.as_completed to process results as they arrive&lt;/span&gt;
            &lt;span class="pl-s1"&gt;coros&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-en"&gt;fetch_size&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;) &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;urls&lt;/span&gt;]
            &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;coro&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;.&lt;span class="pl-en"&gt;as_completed&lt;/span&gt;(&lt;span class="pl-s1"&gt;coros&lt;/span&gt;):
                &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;coro&lt;/span&gt;
                &lt;span class="pl-s1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-s1"&gt;url&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;size&lt;/span&gt;
                &lt;span class="pl-c"&gt;# Update the progress bar&lt;/span&gt;
                &lt;span class="pl-s1"&gt;pbar&lt;/span&gt;.&lt;span class="pl-en"&gt;update&lt;/span&gt;(&lt;span class="pl-c1"&gt;1&lt;/span&gt;)
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;sizes&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;I pasted this into &lt;code&gt;python3 -m asyncio&lt;/code&gt; - the &lt;code&gt;-m asyncio&lt;/code&gt; flag ensures the &lt;code&gt;await&lt;/code&gt; statement can be used in the interactive interpreter - and ran the following:&lt;/p&gt;
&lt;div class="highlight highlight-text-python-console"&gt;&lt;pre&gt;&amp;gt;&amp;gt;&amp;gt; urls &lt;span class="pl-k"&gt;=&lt;/span&gt; httpx.get(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;).text.splitlines()
&amp;gt;&amp;gt;&amp;gt; sizes &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; get_sizes(urls)
Fetching sizes: 100%|██████████████████████████████████████| 2084/2084 [00:08&amp;lt;00:00, 256.60url/s]
&amp;gt;&amp;gt;&amp;gt; &lt;span class="pl-c1"&gt;sum&lt;/span&gt;(sizes.values())
2936454998167&lt;/pre&gt;&lt;/div&gt;
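&lt;p&gt;That byte total is consistent with the 2.67TB figure - dividing by 1024 four times converts bytes to TiB (a quick sanity check you can run in the same session):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; round(sum(sizes.values()) / 1024 ** 4, 2)
2.67
&lt;/code&gt;&lt;/pre&gt;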
&lt;p&gt;Then I added the following to turn the data into something that would work with Datasette Lite:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;output&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; []
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;sizes&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-s1"&gt;path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;'/redpajama-data-1T/v1.0.0/'&lt;/span&gt;)[&lt;span class="pl-c1"&gt;1&lt;/span&gt;]
    &lt;span class="pl-s1"&gt;output&lt;/span&gt;.&lt;span class="pl-en"&gt;append&lt;/span&gt;({
        &lt;span class="pl-s"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s1"&gt;url&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"size"&lt;/span&gt;: &lt;span class="pl-s1"&gt;size&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"size_mb"&lt;/span&gt;: &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"size_gb"&lt;/span&gt;: &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"path"&lt;/span&gt;: &lt;span class="pl-s1"&gt;path&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"top_folder"&lt;/span&gt;: &lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;"/"&lt;/span&gt;)[&lt;span class="pl-c1"&gt;0&lt;/span&gt;],
        &lt;span class="pl-s"&gt;"top_folders"&lt;/span&gt;: &lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;rsplit&lt;/span&gt;(&lt;span class="pl-s"&gt;"/"&lt;/span&gt;, &lt;span class="pl-c1"&gt;1&lt;/span&gt;)[&lt;span class="pl-c1"&gt;0&lt;/span&gt;],
    })
&lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"/tmp/sizes.json"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;).&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;json&lt;/span&gt;.&lt;span class="pl-en"&gt;dumps&lt;/span&gt;(&lt;span class="pl-s1"&gt;output&lt;/span&gt;, &lt;span class="pl-s1"&gt;indent&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;))&lt;/pre&gt;
&lt;p&gt;I pasted the result &lt;a href="https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464"&gt;into a Gist&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="datasette"/><category term="datasette-lite"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="redpajama"/><category term="training-data"/></entry><entry><title>RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens</title><link href="https://simonwillison.net/2023/Apr/17/redpajama/#atom-tag" rel="alternate"/><published>2023-04-17T17:13:02+00:00</published><updated>2023-04-17T17:13:02+00:00</updated><id>https://simonwillison.net/2023/Apr/17/redpajama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.together.xyz/blog/redpajama"&gt;RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Given the number of projects that have used LLaMA as a foundation model since its release two months ago—despite its non-commercial license—it’s clear that there is a strong desire for a fully openly licensed alternative.

&lt;p&gt;RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute aiming to build exactly that.&lt;/p&gt;

&lt;p&gt;Step one is gathering the training data: the LLaMA paper described a 1.2 trillion token training set gathered from sources that included Wikipedia, Common Crawl, GitHub, arXiv, Stack Exchange and more.&lt;/p&gt;

&lt;p&gt;RedPajama-Data-1T is an attempt at recreating that training set. It’s now available to download, as 2,084 separate multi-GB jsonl files—2.67TB total.&lt;/p&gt;

&lt;p&gt;Even without a trained model, this is a hugely influential contribution to the world of open source LLMs. Any team looking to build their own LLaMA from scratch can now jump straight to the next stage: training the model.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="redpajama"/><category term="training-data"/></entry></feed>