<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: redpajama</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/redpajama.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2023-05-22T19:25:13+00:00</updated><author><name>Simon Willison</name></author><entry><title>MLC: Bringing Open Large Language Models to Consumer Devices</title><link href="https://simonwillison.net/2023/May/22/mlc-redpajama/#atom-tag" rel="alternate"/><published>2023-05-22T19:25:13+00:00</published><updated>2023-05-22T19:25:13+00:00</updated><id>https://simonwillison.net/2023/May/22/mlc-redpajama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://mlc.ai/blog/2023/05/22/bringing-open-large-language-models-to-consumer-devices"&gt;MLC: Bringing Open Large Language Models to Consumer Devices&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“We bring RedPajama, a permissive open language model to WebGPU, iOS, GPUs, and various other platforms.” I managed to get this running on my Mac (see via link) with a few tweaks to their official instructions.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://til.simonwillison.net/llms/mlc-chat-redpajama"&gt;mlc-chat - RedPajama-INCITE-Chat-3B on macOS&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mlc"&gt;mlc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/webgpu"&gt;webgpu&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gpus"&gt;gpus&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="local-llms"/><category term="llms"/><category term="mlc"/><category term="redpajama"/><category term="webgpu"/><category term="gpus"/></entry><entry><title>OpenLLaMA</title><link href="https://simonwillison.net/2023/May/3/openllama/#atom-tag" rel="alternate"/><published>2023-05-03T20:58:19+00:00</published><updated>2023-05-03T20:58:19+00:00</updated><id>https://simonwillison.net/2023/May/3/openllama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/openlm-research/open_llama"&gt;OpenLLaMA&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The first openly licensed model I’ve seen trained on the RedPajama dataset. This initial release is a 7B model trained on 200 billion tokens, but the team behind it are promising a full 1 trillion token model in the near future. I haven’t found a live demo of this one running anywhere yet.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llm-release"&gt;llm-release&lt;/a&gt;&lt;/p&gt;



</summary><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="redpajama"/><category term="llm-release"/></entry><entry><title>What's in the RedPajama-Data-1T LLM training set</title><link href="https://simonwillison.net/2023/Apr/17/redpajama-data/#atom-tag" rel="alternate"/><published>2023-04-17T18:57:42+00:00</published><updated>2023-04-17T18:57:42+00:00</updated><id>https://simonwillison.net/2023/Apr/17/redpajama-data/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://www.together.xyz/blog/redpajama"&gt;RedPajama&lt;/a&gt; is "a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens". It's a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute.&lt;/p&gt;
&lt;p&gt;They just announced their first release: &lt;a href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T"&gt;RedPajama-Data-1T&lt;/a&gt;, a 1.2 trillion token dataset modelled on the training data described in &lt;a href="https://www.arxiv-vanity.com/papers/2302.13971/"&gt;the original LLaMA paper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The full dataset is 2.67TB, so I decided not to try and download the whole thing! Here's what I've figured out about it so far.&lt;/p&gt;
&lt;h4&gt;How to download it&lt;/h4&gt;
&lt;p&gt;The data is split across 2,084 different files. These are listed in a plain text file here:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt"&gt;https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The dataset card suggests you could download them all like this - assuming you have 2.67TB of disk space and bandwidth to spare:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;wget -i https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I prompted GPT-4 a few times to write a quick Python script to run a &lt;code&gt;HEAD&lt;/code&gt; request against each URL in that file instead, in order to collect the &lt;code&gt;Content-Length&lt;/code&gt; and calculate the total size of the data. My script is at the bottom of this post.&lt;/p&gt;
&lt;p&gt;I then processed the size data into &lt;a href="https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464"&gt;a format&lt;/a&gt; suitable for loading into &lt;a href="https://github.com/simonw/datasette-lite"&gt;Datasette Lite&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Exploring the size data&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://lite.datasette.io/?json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464#/data/raw?_facet=top_folder&amp;amp;_facet=top_folders&amp;amp;_sort_desc=size_gb"&gt;Here's a link&lt;/a&gt; to a Datasette Lite page showing all 2,084 files, sorted by size and with some useful facets.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2023/redpajama-sizes.jpg" alt="Datasette showing the rows, faceted by top_folder and top_folders. The largest file is wikipedia/wiki.jsonl at 111GB, then book/book.jsonl at 100GB, then stackexchange/stackexchange.jsonl at 74GB, then various filtered GitHub files" style="max-width: 100%;" /&gt;&lt;/p&gt;
&lt;p&gt;This is already revealing a lot about the data.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;top_folders&lt;/code&gt; facet inspired me to &lt;a href="https://lite.datasette.io/?install=datasette-copyable&amp;amp;json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464#/data?sql=select%0A++top_folders%2C%0A++cast+%28sum%28size_gb%29+as+integer%29+as+total_gb%2C%0A++count%28*%29+as+num_files%0Afrom+raw%0Agroup+by+top_folders%0Aorder+by+sum%28size_gb%29+desc"&gt;run this SQL query&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;select&lt;/span&gt;
  top_folders,
  cast (&lt;span class="pl-c1"&gt;sum&lt;/span&gt;(size_gb) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;integer&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; total_gb,
  &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; num_files
&lt;span class="pl-k"&gt;from&lt;/span&gt; raw
&lt;span class="pl-k"&gt;group by&lt;/span&gt; top_folders
&lt;span class="pl-k"&gt;order by&lt;/span&gt; &lt;span class="pl-c1"&gt;sum&lt;/span&gt;(size_gb) &lt;span class="pl-k"&gt;desc&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Here are the results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;top_folders&lt;/th&gt;
&lt;th&gt;total_gb&lt;/th&gt;
&lt;th&gt;num_files&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;c4&lt;/td&gt;
&lt;td&gt;806&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2023-06&lt;/td&gt;
&lt;td&gt;288&lt;/td&gt;
&lt;td&gt;175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2020-05&lt;/td&gt;
&lt;td&gt;286&lt;/td&gt;
&lt;td&gt;198&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2021-04&lt;/td&gt;
&lt;td&gt;276&lt;/td&gt;
&lt;td&gt;176&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2022-05&lt;/td&gt;
&lt;td&gt;251&lt;/td&gt;
&lt;td&gt;157&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;common_crawl/2019-30&lt;/td&gt;
&lt;td&gt;237&lt;/td&gt;
&lt;td&gt;153&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;github&lt;/td&gt;
&lt;td&gt;212&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wikipedia&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;book&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arxiv&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;stackexchange&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There's a lot of Common Crawl data in there!&lt;/p&gt;
&lt;p&gt;The &lt;a href="https://www.together.xyz/blog/redpajama"&gt;RedPajama announcement&lt;/a&gt; says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;CommonCrawl: Five dumps of CommonCrawl, processed using the CCNet pipeline, and filtered via several quality filters including a linear classifier that selects for Wikipedia-like pages.&lt;/li&gt;
&lt;li&gt;C4: Standard C4 dataset&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;It looks like they used &lt;a href="https://commoncrawl.org/"&gt;CommonCrawl&lt;/a&gt; from 5 different dates, from 2019-30 (30? That's not a valid month - looks like &lt;a href="https://hachyderm.io/@xek/110215763306634784"&gt;it's a week number&lt;/a&gt;) to 2022-05. I wonder if they de-duplicated content within those different crawls?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://paperswithcode.com/dataset/c4"&gt;C4&lt;/a&gt; is "a colossal, cleaned version of Common Crawl's web crawl corpus" - so yet another copy of Common Crawl, cleaned in a different way.&lt;/p&gt;
&lt;p&gt;I downloaded the first 100MB of that 100GB &lt;code&gt;book.jsonl&lt;/code&gt; file - the first 300 rows in it are all full-text books from Project Gutenberg, starting with &lt;a href="https://www.gutenberg.org/ebooks/10"&gt;The Bible Both Testaments King James Version&lt;/a&gt; from 1611.&lt;/p&gt;
&lt;p&gt;The data all appears to be in JSONL format - newline-delimited JSON. Different files I looked at had different shapes, though a common pattern was a &lt;code&gt;"text"&lt;/code&gt; key containing the text and a &lt;code&gt;"meta"&lt;/code&gt; key containing a dictionary of metadata.&lt;/p&gt;
&lt;p&gt;For example, the first line of &lt;code&gt;book.jsonl&lt;/code&gt; looks like this (after pretty-printing using &lt;code&gt;jq&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight highlight-source-json"&gt;&lt;pre&gt;{
  &lt;span class="pl-ent"&gt;"meta"&lt;/span&gt;: {
    &lt;span class="pl-ent"&gt;"short_book_title"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;The Bible Both Testaments King James Version&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"publication_date"&lt;/span&gt;: &lt;span class="pl-c1"&gt;1611&lt;/span&gt;,
    &lt;span class="pl-ent"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;http://www.gutenberg.org/ebooks/10&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
  },
  &lt;span class="pl-ent"&gt;"text"&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-cce"&gt;\n\n&lt;/span&gt;The Old Testament of the King James Version of the Bible&lt;span class="pl-cce"&gt;\n&lt;/span&gt;...&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
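&lt;p&gt;If you want to peek at that file yourself, something like this &lt;code&gt;httpx&lt;/code&gt; snippet fetches just the first 100MB with an HTTP &lt;code&gt;Range&lt;/code&gt; request and parses the first few records - a rough sketch, assuming the server honours &lt;code&gt;Range&lt;/code&gt; headers and that the file lives at the &lt;code&gt;book/book.jsonl&lt;/code&gt; path listed in &lt;code&gt;urls.txt&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import httpx

url = "https://data.together.xyz/redpajama-data-1T/v1.0.0/book/book.jsonl"
# Ask the server for only the first 100MB (bytes 0 through 104,857,599)
response = httpx.get(url, headers={"Range": "bytes=0-104857599"}, timeout=None)

# Drop the final line, which is almost certainly cut off mid-record
lines = response.text.splitlines()[:-1]
for line in lines[:3]:
    record = json.loads(line)
    print(record["meta"]["short_book_title"], len(record["text"]))
&lt;/code&gt;&lt;/pre&gt;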
&lt;p&gt;There are more details on the composition of the dataset in &lt;a href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T#dataset-creation"&gt;the dataset card&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;My Python script&lt;/h4&gt;
&lt;p&gt;I wrote a quick Python script to do the next best thing: run a &lt;code&gt;HEAD&lt;/code&gt; request against each URL to figure out the total size of the data.&lt;/p&gt;
&lt;p&gt;I &lt;a href="https://gist.github.com/simonw/38246d2f230bd1d5cf8b4907e8871ed1"&gt;prompted GPT-4 a few times&lt;/a&gt;, and came up with this:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;httpx&lt;/span&gt;
&lt;span class="pl-k"&gt;from&lt;/span&gt; &lt;span class="pl-s1"&gt;tqdm&lt;/span&gt; &lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;tqdm&lt;/span&gt;

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;get_sizes&lt;/span&gt;(&lt;span class="pl-s1"&gt;urls&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;sizes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; {}
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;fetch_size&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;):
        &lt;span class="pl-k"&gt;try&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;response&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;client&lt;/span&gt;.&lt;span class="pl-en"&gt;head&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;)
            &lt;span class="pl-s1"&gt;content_length&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;response&lt;/span&gt;.&lt;span class="pl-s1"&gt;headers&lt;/span&gt;.&lt;span class="pl-en"&gt;get&lt;/span&gt;(&lt;span class="pl-s"&gt;'Content-Length'&lt;/span&gt;)
            &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;content_length&lt;/span&gt; &lt;span class="pl-c1"&gt;is&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-c1"&gt;None&lt;/span&gt;:
                &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;content_length&lt;/span&gt;)
        &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-v"&gt;Exception&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;e&lt;/span&gt;:
            &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f"Error while processing URL '&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;url&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;': &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;e&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;"&lt;/span&gt;)
        &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;
    &lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-s1"&gt;httpx&lt;/span&gt;.&lt;span class="pl-v"&gt;AsyncClient&lt;/span&gt;() &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;client&lt;/span&gt;:
        &lt;span class="pl-c"&gt;# Create a progress bar using tqdm&lt;/span&gt;
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;tqdm&lt;/span&gt;(&lt;span class="pl-s1"&gt;total&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-s1"&gt;urls&lt;/span&gt;), &lt;span class="pl-s1"&gt;desc&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"Fetching sizes"&lt;/span&gt;, &lt;span class="pl-s1"&gt;unit&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;"url"&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;pbar&lt;/span&gt;:
            &lt;span class="pl-c"&gt;# Use asyncio.as_completed to process results as they arrive&lt;/span&gt;
            &lt;span class="pl-s1"&gt;coros&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-en"&gt;fetch_size&lt;/span&gt;(&lt;span class="pl-s1"&gt;url&lt;/span&gt;) &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;urls&lt;/span&gt;]
            &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;coro&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;asyncio&lt;/span&gt;.&lt;span class="pl-en"&gt;as_completed&lt;/span&gt;(&lt;span class="pl-s1"&gt;coros&lt;/span&gt;):
                &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; &lt;span class="pl-s1"&gt;coro&lt;/span&gt;
                &lt;span class="pl-s1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-s1"&gt;url&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;size&lt;/span&gt;
                &lt;span class="pl-c"&gt;# Update the progress bar&lt;/span&gt;
                &lt;span class="pl-s1"&gt;pbar&lt;/span&gt;.&lt;span class="pl-en"&gt;update&lt;/span&gt;(&lt;span class="pl-c1"&gt;1&lt;/span&gt;)
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;sizes&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;I pasted this into &lt;code&gt;python3 -m asyncio&lt;/code&gt; - the &lt;code&gt;-m asyncio&lt;/code&gt; flag ensures the &lt;code&gt;await&lt;/code&gt; statement can be used in the interactive interpreter - and ran the following:&lt;/p&gt;
&lt;div class="highlight highlight-text-python-console"&gt;&lt;pre&gt;&amp;gt;&amp;gt;&amp;gt; urls &lt;span class="pl-k"&gt;=&lt;/span&gt; httpx.get(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;).text.splitlines()
&amp;gt;&amp;gt;&amp;gt; sizes &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;await&lt;/span&gt; get_sizes(urls)
Fetching sizes: 100%|██████████████████████████████████████| 2084/2084 [00:08&amp;lt;00:00, 256.60url/s]
&amp;gt;&amp;gt;&amp;gt; &lt;span class="pl-c1"&gt;sum&lt;/span&gt;(sizes.values())
2936454998167&lt;/pre&gt;&lt;/div&gt;
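&lt;p&gt;That byte total is consistent with the 2.67TB figure - dividing by 1024 four times converts bytes to TiB (a quick sanity check you can run in the same session):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; round(sum(sizes.values()) / 1024 ** 4, 2)
2.67
&lt;/code&gt;&lt;/pre&gt;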
&lt;p&gt;Then I added the following to turn the data into something that would work with Datasette Lite:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;output&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; []
&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;, &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;sizes&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-s1"&gt;path&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;url&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;'/redpajama-data-1T/v1.0.0/'&lt;/span&gt;)[&lt;span class="pl-c1"&gt;1&lt;/span&gt;]
    &lt;span class="pl-s1"&gt;output&lt;/span&gt;.&lt;span class="pl-en"&gt;append&lt;/span&gt;({
        &lt;span class="pl-s"&gt;"url"&lt;/span&gt;: &lt;span class="pl-s1"&gt;url&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"size"&lt;/span&gt;: &lt;span class="pl-s1"&gt;size&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"size_mb"&lt;/span&gt;: &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"size_gb"&lt;/span&gt;: &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt; &lt;span class="pl-c1"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;1024&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"path"&lt;/span&gt;: &lt;span class="pl-s1"&gt;path&lt;/span&gt;,
        &lt;span class="pl-s"&gt;"top_folder"&lt;/span&gt;: &lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;"/"&lt;/span&gt;)[&lt;span class="pl-c1"&gt;0&lt;/span&gt;],
        &lt;span class="pl-s"&gt;"top_folders"&lt;/span&gt;: &lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;rsplit&lt;/span&gt;(&lt;span class="pl-s"&gt;"/"&lt;/span&gt;, &lt;span class="pl-c1"&gt;1&lt;/span&gt;)[&lt;span class="pl-c1"&gt;0&lt;/span&gt;],
    })
&lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s"&gt;"/tmp/sizes.json"&lt;/span&gt;, &lt;span class="pl-s"&gt;"w"&lt;/span&gt;).&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;json&lt;/span&gt;.&lt;span class="pl-en"&gt;dumps&lt;/span&gt;(&lt;span class="pl-s1"&gt;output&lt;/span&gt;, &lt;span class="pl-s1"&gt;indent&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;2&lt;/span&gt;))&lt;/pre&gt;
&lt;p&gt;I pasted the result &lt;a href="https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464"&gt;into a Gist&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette-lite"&gt;datasette-lite&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai-assisted-programming"&gt;ai-assisted-programming&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ai"/><category term="datasette"/><category term="datasette-lite"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="ai-assisted-programming"/><category term="redpajama"/><category term="training-data"/></entry><entry><title>RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens</title><link href="https://simonwillison.net/2023/Apr/17/redpajama/#atom-tag" rel="alternate"/><published>2023-04-17T17:13:02+00:00</published><updated>2023-04-17T17:13:02+00:00</updated><id>https://simonwillison.net/2023/Apr/17/redpajama/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.together.xyz/blog/redpajama"&gt;RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Given the number of projects that have used LLaMA as a foundation model since its release two months ago—despite its non-commercial license—it’s clear that there is a strong desire for a fully openly licensed alternative.

&lt;p&gt;RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute aiming to build exactly that.&lt;/p&gt;

&lt;p&gt;Step one is gathering the training data: the LLaMA paper described a 1.2 trillion token training set gathered from sources that included Wikipedia, Common Crawl, GitHub, arXiv, Stack Exchange and more.&lt;/p&gt;

&lt;p&gt;RedPajama-Data-1T is an attempt at recreating that training set. It’s now available to download, as 2,084 separate multi-GB jsonl files—2.67TB total.&lt;/p&gt;

&lt;p&gt;Even without a trained model, this is a hugely influential contribution to the world of open source LLMs. Any team looking to build their own LLaMA from scratch can now jump straight to the next stage: training the model.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/open-source"&gt;open-source&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llama"&gt;llama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/local-llms"&gt;local-llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/redpajama"&gt;redpajama&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/training-data"&gt;training-data&lt;/a&gt;&lt;/p&gt;



</summary><category term="open-source"/><category term="ai"/><category term="generative-ai"/><category term="llama"/><category term="local-llms"/><category term="llms"/><category term="redpajama"/><category term="training-data"/></entry></feed>