<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: regular-expressions</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/regular-expressions.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2025-02-18T21:53:56+00:00</updated><author><name>Simon Willison</name></author><entry><title>tc39/proposal-regex-escaping</title><link href="https://simonwillison.net/2025/Feb/18/tc39proposal-regex-escaping/#atom-tag" rel="alternate"/><published>2025-02-18T21:53:56+00:00</published><updated>2025-02-18T21:53:56+00:00</updated><id>https://simonwillison.net/2025/Feb/18/tc39proposal-regex-escaping/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/tc39/proposal-regex-escaping"&gt;tc39/proposal-regex-escaping&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I just heard &lt;a href="https://social.coop/@kriskowal/114026510846190089"&gt;from Kris Kowal&lt;/a&gt; that this proposal for ECMAScript has been approved by ECMA TC-39:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Almost 20 years later, @simon’s RegExp.escape idea comes to fruition. This reached “Stage 4” at ECMA TC-39 just now, which formalizes that multiple browsers have shipped the feature and it’s in the next revision of the JavaScript specification.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I'll be honest, I had completely forgotten about my 2006 blog entry &lt;a href="https://simonwillison.net/2006/Jan/20/escape/"&gt;Escaping regular expression characters in JavaScript&lt;/a&gt; where I proposed that JavaScript should have an equivalent of the Python &lt;a href="https://docs.python.org/3/library/re.html#re.escape"&gt;re.escape()&lt;/a&gt; function.&lt;/p&gt;
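&lt;p&gt;For context, here is a minimal sketch of what Python's &lt;code&gt;re.escape()&lt;/code&gt; does when a pattern is built from a variable. The &lt;code&gt;remove_suffix&lt;/code&gt; helper and its sample inputs are hypothetical, made up for illustration:&lt;/p&gt;

```python
import re


def remove_suffix(text, suffix):
    """Strip a literal suffix from the end of text (hypothetical helper)."""
    # re.escape() backslash-escapes regex metacharacters in suffix,
    # so "." and "$" inside it are matched literally rather than as operators.
    pattern = re.escape(suffix) + "$"
    return re.sub(pattern, "", text)


print(remove_suffix("Total: 9.99$", "9.99$"))
print(remove_suffix("Total: 9x99$", "9.99$"))
```

&lt;p&gt;Without the escaping step, the unescaped &lt;code&gt;.&lt;/code&gt; would act as a wildcard and the &lt;code&gt;$&lt;/code&gt; inside the suffix would act as an anchor, so the pattern would no longer mean "this exact string".&lt;/p&gt;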
&lt;p&gt;It turns out my post was referenced in &lt;a href="https://esdiscuss.org/topic/regexp-escape"&gt;this 15 year old thread&lt;/a&gt; on the esdiscuss mailing list, which evolved over time into a proposal which turned into &lt;a href="https://caniuse.com/mdn-javascript_builtins_regexp_escape"&gt;implementations&lt;/a&gt; in Safari, Firefox and soon Chrome - here's &lt;a href="https://github.com/v8/v8/commit/b5c08badc7b3d4b85b2645b1a4d9973ee6efaa91"&gt;the commit landing it in v8&lt;/a&gt; on February 12th 2025.&lt;/p&gt;
&lt;p&gt;One of the best things about having a long-running blog is that sometimes posts you forgot about over a decade ago turn out to have a life of their own.&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/blogging"&gt;blogging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ecmascript"&gt;ecmascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/standards"&gt;standards&lt;/a&gt;&lt;/p&gt;



</summary><category term="blogging"/><category term="ecmascript"/><category term="javascript"/><category term="regular-expressions"/><category term="standards"/></entry><entry><title>Why I invented "dash encoding", a new encoding scheme for URL paths</title><link href="https://simonwillison.net/2022/Mar/5/dash-encoding/#atom-tag" rel="alternate"/><published>2022-03-05T21:50:38+00:00</published><updated>2022-03-05T21:50:38+00:00</updated><id>https://simonwillison.net/2022/Mar/5/dash-encoding/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;a href="https://datasette.io/"&gt;Datasette&lt;/a&gt; now includes its own custom string encoding scheme, which I've called &lt;strong&gt;dash encoding&lt;/strong&gt;. I really didn't want to have to invent something new here, but unfortunately I think this is the best solution to my very particular problem. Some notes on how dash encoding works and why I created it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 18th March 2022&lt;/strong&gt;: This turned out not to be the right idea for my project after all! I ended up settling on a &lt;a href="https://simonwillison.net/2022/Mar/19/weeknotes/#tilde-encoding"&gt;Tilde encoding&lt;/a&gt; scheme instead.&lt;/p&gt;

&lt;h4&gt;Table names and rows in URLs&lt;/h4&gt;
&lt;p&gt;I've put a lot of thought into the design of Datasette's URLs.&lt;/p&gt;
&lt;p&gt;Datasette exposes relational database tables as both web pages and a JSON API.&lt;/p&gt;
&lt;p&gt;Consider a database in a SQLite file called &lt;code&gt;legislators.db&lt;/code&gt;, containing a table called &lt;code&gt;legislator_terms&lt;/code&gt; (example from &lt;a href="https://datasette.io/tutorials/explore"&gt;this tutorial&lt;/a&gt;). The URL path to the web interface for that table will be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://congress-legislators.datasettes.com/legislators/legislator_terms"&gt;/legislators/legislator_terms&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And the JSON API will be here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://congress-legislators.datasettes.com/legislators/legislator_terms.json"&gt;/legislators/legislator_terms.json&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(Worth noting that Datasette supports other formats here too - &lt;a href="https://docs.datasette.io/en/stable/csv_export.html"&gt;CSV&lt;/a&gt; by default, and plugins can add more formats such as &lt;a href="https://datasette.io/plugins/datasette-geojson"&gt;GeoJSON&lt;/a&gt; or &lt;a href="https://datasette.io/plugins/datasette-atom"&gt;Atom&lt;/a&gt; or &lt;a href="https://datasette.io/plugins/datasette-ics"&gt;iCal&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Datasette also provides pages (and APIs) for individual rows, identified by their primary key:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://congress-legislators.datasettes.com/legislators/legislator_terms/1"&gt;/legislators/legislator_terms/1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://congress-legislators.datasettes.com/legislators/legislator_terms/1.json"&gt;/legislators/legislator_terms/1.json&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For tables with compound primary keys, these pages can include the primary key values separated by commas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://latest.datasette.io/fixtures/compound_three_primary_keys/a,a,a"&gt;/fixtures/compound_three_primary_keys/a,a,a&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is all pretty straightforward so far. But now we get to the challenge: what if a table's name or a row's primary key contains a forward slash or a period character?&lt;/p&gt;
&lt;p&gt;This could break the URL scheme!&lt;/p&gt;
&lt;p&gt;SQLite table names are allowed to contain almost any character, and Datasette is designed to work with any existing SQLite database - so I can't guarantee that a table with one of those characters won't need to be handled.&lt;/p&gt;
&lt;p&gt;Consider a database with two tables - one called &lt;code&gt;legislator_terms&lt;/code&gt; and another called &lt;code&gt;legislator_terms/1&lt;/code&gt; - given the URL &lt;code&gt;/legislators/legislator_terms/1&lt;/code&gt; it's no longer clear if it refers to the table with that name or the row with primary key 1 in the other table!&lt;/p&gt;
&lt;p&gt;A similar problem exists for table names such as &lt;code&gt;legislators.csv&lt;/code&gt;, which end in a format extension, and for primary key string values that end in &lt;code&gt;.json&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Why URL encoding doesn't work here&lt;/h4&gt;
&lt;p&gt;Up until now, Datasette has solved this problem using &lt;a href="https://en.wikipedia.org/wiki/Percent-encoding"&gt;URL percent encoding&lt;/a&gt;. This provides a standard mechanism for encoding "special" characters in URLs.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;legislator_terms/1&lt;/code&gt; encodes to &lt;code&gt;legislator_terms%2F1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;This should be enough to solve the problem. The URL to that weirdly named table can now be:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;/legislators/legislator_terms%2F1&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;When routing the URL, the application can take this into account and identify that this is a table named &lt;code&gt;legislator_terms/1&lt;/code&gt;, as opposed to a request for the row with ID &lt;code&gt;1&lt;/code&gt; in the &lt;code&gt;legislator_terms&lt;/code&gt; table.&lt;/p&gt;
&lt;p&gt;There are two remaining problems.&lt;/p&gt;
&lt;p&gt;Firstly, the "." character is ignored by URL encoding, so we still can't tell the difference between &lt;code&gt;/db/table.json&lt;/code&gt; and a table called &lt;code&gt;table.json&lt;/code&gt;. I worked around this issue in Datasette by supporting an optional alternative &lt;code&gt;?_format=json&lt;/code&gt; parameter, but it's &lt;a href="https://github.com/simonw/datasette/issues/1439"&gt;messy and confusing&lt;/a&gt;.&lt;/p&gt;
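&lt;p&gt;A short sketch of that behaviour using Python's standard library. The &lt;code&gt;safe=""&lt;/code&gt; argument is needed because &lt;code&gt;quote()&lt;/code&gt; leaves &lt;code&gt;/&lt;/code&gt; alone by default; this is only an illustration, not how Datasette itself builds URLs:&lt;/p&gt;

```python
from urllib.parse import quote

# safe="" asks quote() to percent-encode "/" as well.
print(quote("legislator_terms/1", safe=""))  # legislator_terms%2F1

# "." is an unreserved character, so quote() never touches it.
print(quote("table.json", safe=""))  # table.json
```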
&lt;p&gt;Much more seriously, it turns out there are numerous common pieces of web infrastructure that "helpfully" decode escaped characters in URLs before passing them on to the underlying web application!&lt;/p&gt;
&lt;p&gt;I first encountered this in the ASGI standard itself, which decoded characters in the &lt;code&gt;path&lt;/code&gt; field before they were passed to the rest of the application. I submitted &lt;a href="https://github.com/django/asgiref/issues/87#issuecomment-500168070"&gt;a PR&lt;/a&gt; adding &lt;code&gt;raw_path&lt;/code&gt; to ASGI precisely to work around this problem for Datasette.&lt;/p&gt;
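&lt;p&gt;Roughly, an application can prefer &lt;code&gt;raw_path&lt;/code&gt; when the server supplies it. This is a minimal sketch, assuming a plain ASGI HTTP scope dictionary; the &lt;code&gt;undecoded_path&lt;/code&gt; helper and the hand-built scope are hypothetical and are not Datasette's actual code:&lt;/p&gt;

```python
def undecoded_path(scope):
    """Return the request path with percent-escapes intact, where available.

    Hypothetical helper: scope is an ASGI HTTP connection scope dict.
    """
    raw_path = scope.get("raw_path")  # optional bytes field in the scope
    if raw_path is not None:
        # latin-1 maps every byte to a character, so decoding cannot fail.
        return raw_path.decode("latin-1")
    # Fall back to the already-decoded path when raw_path is missing.
    return scope["path"]


# Hand-built scope for illustration, not captured from a real server.
scope = {
    "type": "http",
    "path": "/legislators/legislator_terms/1",
    "raw_path": b"/legislators/legislator_terms%2F1",
}
print(undecoded_path(scope))
```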
&lt;p&gt;Over time though, the problem kept cropping up. Datasette aims to run on as many hosting platforms as possible. I've seen URL unescaping applied at a higher level enough times now to be very suspicious of any load balancer or proxy or other web server mechanism that might end up executing between Datasette and the rest of the web.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Flask core maintainer David Lord &lt;a href="https://twitter.com/davidism/status/1500251083070787585"&gt;confirms on Twitter&lt;/a&gt; that this is a long-standing known problem:&lt;/p&gt;

&lt;blockquote cite="https://twitter.com/davidism/status/1500251083070787585"&gt;&lt;p&gt;This behavior in Apache/nginx/etc is why WSGI/ASGI can't specify "literal URL the user typed in", because anything in front of the app might modify slashes or anything else. So all the spec can provide is "decoded URL".&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;So, I need a way of encoding a table name that might include &lt;code&gt;/&lt;/code&gt; and &lt;code&gt;.&lt;/code&gt; characters in a way that will survive some other layer of the stack decoding URL encoded strings in the URL path before Datasette gets to see them!&lt;/p&gt;
&lt;h4&gt;Introducing dash encoding&lt;/h4&gt;
&lt;p&gt;That's where dash encoding comes in. I tried to design the fastest, simplest encoding mechanism I could that would solve this very specific problem.&lt;/p&gt;
&lt;p&gt;Loose requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reversible - it's crucial that any possible value survives a round-trip through the encoding&lt;/li&gt;
&lt;li&gt;Avoid changing the string at all if possible. Otherwise I could use something like base64, but I wanted to keep the name in the URL as close to readable as possible&lt;/li&gt;
&lt;li&gt;Survive interference by proxies and load balancers that might try to be helpful&lt;/li&gt;
&lt;li&gt;Fast to apply the transformation&lt;/li&gt;
&lt;li&gt;As simple as possible&lt;/li&gt;
&lt;li&gt;Easy to implement, including in languages other than Python&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dash encoding consists of three simple steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Replace all single hyphen characters &lt;code&gt;-&lt;/code&gt; with two hyphens &lt;code&gt;--&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Replace any forward slash &lt;code&gt;/&lt;/code&gt; character with hyphen forward slash &lt;code&gt;-/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Replace any period character &lt;code&gt;.&lt;/code&gt; with hyphen period &lt;code&gt;-.&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To reverse the encoding, run those steps backwards.&lt;/p&gt;
&lt;p&gt;Here's the Python implementation of this encoding scheme:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;dash_encode&lt;/span&gt;(&lt;span class="pl-s1"&gt;s&lt;/span&gt;: &lt;span class="pl-s1"&gt;str&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;str&lt;/span&gt;:
    &lt;span class="pl-s"&gt;"Returns dash-encoded string - for example ``/foo/bar`` -&amp;gt; ``-/foo-/bar``"&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt;.&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"-"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--"&lt;/span&gt;).&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"."&lt;/span&gt;, &lt;span class="pl-s"&gt;"-."&lt;/span&gt;).&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"/"&lt;/span&gt;, &lt;span class="pl-s"&gt;"-/"&lt;/span&gt;)

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;dash_decode&lt;/span&gt;(&lt;span class="pl-s1"&gt;s&lt;/span&gt;: &lt;span class="pl-s1"&gt;str&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;str&lt;/span&gt;:
    &lt;span class="pl-s"&gt;"Decodes a dash-encoded string, so ``-/foo-/bar`` -&amp;gt; ``/foo/bar``"&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;s&lt;/span&gt;.&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"-/"&lt;/span&gt;, &lt;span class="pl-s"&gt;"/"&lt;/span&gt;).&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"-."&lt;/span&gt;, &lt;span class="pl-s"&gt;"."&lt;/span&gt;).&lt;span class="pl-en"&gt;replace&lt;/span&gt;(&lt;span class="pl-s"&gt;"--"&lt;/span&gt;, &lt;span class="pl-s"&gt;"-"&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;And the pytest tests for it:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-en"&gt;@&lt;span class="pl-s1"&gt;pytest&lt;/span&gt;.&lt;span class="pl-s1"&gt;mark&lt;/span&gt;.&lt;span class="pl-en"&gt;parametrize&lt;/span&gt;(&lt;/span&gt;
&lt;span class="pl-en"&gt;    &lt;span class="pl-s"&gt;"original,expected"&lt;/span&gt;,&lt;/span&gt;
&lt;span class="pl-en"&gt;    (&lt;/span&gt;
&lt;span class="pl-en"&gt;        (&lt;span class="pl-s"&gt;"abc"&lt;/span&gt;, &lt;span class="pl-s"&gt;"abc"&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;        (&lt;span class="pl-s"&gt;"/foo/bar"&lt;/span&gt;, &lt;span class="pl-s"&gt;"-/foo-/bar"&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;        (&lt;span class="pl-s"&gt;"/-/bar"&lt;/span&gt;, &lt;span class="pl-s"&gt;"-/---/bar"&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;        (&lt;span class="pl-s"&gt;"-/db-/table---.csv-.csv"&lt;/span&gt;, &lt;span class="pl-s"&gt;"---/db---/table-------.csv---.csv"&lt;/span&gt;),&lt;/span&gt;
&lt;span class="pl-en"&gt;    ),&lt;/span&gt;
&lt;span class="pl-en"&gt;)&lt;/span&gt;
&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test_dash_encoding&lt;/span&gt;(&lt;span class="pl-s1"&gt;original&lt;/span&gt;, &lt;span class="pl-s1"&gt;expected&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;actual&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;utils&lt;/span&gt;.&lt;span class="pl-en"&gt;dash_encode&lt;/span&gt;(&lt;span class="pl-s1"&gt;original&lt;/span&gt;)
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;actual&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s1"&gt;expected&lt;/span&gt;
    &lt;span class="pl-c"&gt;# And test round-trip&lt;/span&gt;
    &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-s1"&gt;original&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s1"&gt;utils&lt;/span&gt;.&lt;span class="pl-en"&gt;dash_decode&lt;/span&gt;(&lt;span class="pl-s1"&gt;actual&lt;/span&gt;)&lt;/pre&gt;
&lt;p&gt;Here's &lt;a href="https://github.com/simonw/datasette/commit/d1cb73180b4b5a07538380db76298618a5fc46b6"&gt;the full commit&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This meets my requirements.&lt;/p&gt;
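&lt;p&gt;One way to gain extra confidence in the reversibility requirement is a brute-force check over every short string built from the three special characters plus one ordinary one. This is a sketch rather than a proof: the two functions are copied from the implementation above, while &lt;code&gt;ALPHABET&lt;/code&gt;, &lt;code&gt;all_strings&lt;/code&gt; and the length limit of 5 are hypothetical choices made for this example:&lt;/p&gt;

```python
from itertools import product


def dash_encode(s):
    return s.replace("-", "--").replace(".", "-.").replace("/", "-/")


def dash_decode(s):
    return s.replace("-/", "/").replace("-.", ".").replace("--", "-")


# The three characters the scheme treats specially, plus one ordinary letter.
ALPHABET = "-./a"


def all_strings(max_length):
    """Yield every string over ALPHABET with length 0..max_length."""
    for length in range(max_length + 1):
        for chars in product(ALPHABET, repeat=length):
            yield "".join(chars)


# Collect any input that does not survive encode followed by decode.
failures = [s for s in all_strings(5) if dash_decode(dash_encode(s)) != s]
print(len(failures))
```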
&lt;h4&gt;Capturing these with a regular expression&lt;/h4&gt;
&lt;p&gt;There was one remaining challenge. Datasette uses regular expressions - inspired by Django - to route requests to the correct page.&lt;/p&gt;
&lt;p&gt;I wanted to use a regular expression to extract out dash encoded values, that could also distinguish them from &lt;code&gt;/&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt; and &lt;code&gt;.&lt;/code&gt; characters that were not encoded in that way.&lt;/p&gt;
&lt;p&gt;Here's the regular expression I came up with for matching dash-encoded strings:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;([^\/\-\.]*|(\-/)|(\-\.)|(\-\-))*&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Broken down:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;[^\/\-\.]*&lt;/code&gt; means 0 or more characters that are NOT one of &lt;code&gt;.&lt;/code&gt; or &lt;code&gt;/&lt;/code&gt; or &lt;code&gt;-&lt;/code&gt; - since we don't care about those characters at all&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(\-/)&lt;/code&gt; means the explicit sequence &lt;code&gt;-/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(\-\.)&lt;/code&gt; means the explicit sequence &lt;code&gt;-.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;(\-\-)&lt;/code&gt; means the explicit sequence &lt;code&gt;--&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Those four are wrapped in a group combined with the &lt;code&gt;|&lt;/code&gt; or operator&lt;/li&gt;
&lt;li&gt;The group is then wrapped in a &lt;code&gt;(..)*&lt;/code&gt; - specifying that it can repeat as many times as you like&lt;/li&gt;
&lt;/ul&gt;
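&lt;p&gt;To see the breakdown in action, here is a small sketch that applies the pattern with Python's &lt;code&gt;re.fullmatch()&lt;/code&gt;. The &lt;code&gt;looks_dash_encoded&lt;/code&gt; helper and the sample strings are hypothetical, and the inputs are kept deliberately short:&lt;/p&gt;

```python
import re

# The pattern from the text above, copied unchanged into a raw string.
DASH_ENCODED = re.compile(r"([^\/\-\.]*|(\-/)|(\-\.)|(\-\-))*")


def looks_dash_encoded(value):
    """True if the whole string is built only from the allowed pieces."""
    return DASH_ENCODED.fullmatch(value) is not None


samples = ["abc", "-/foo-/bar", "table-.json", "table.json", "foo/bar", "a-b"]
for candidate in samples:
    print(candidate, looks_dash_encoded(candidate))
```

&lt;p&gt;The first three samples match. The last three do not, because each contains a bare &lt;code&gt;.&lt;/code&gt;, &lt;code&gt;/&lt;/code&gt; or &lt;code&gt;-&lt;/code&gt; that is not part of one of the two-character sequences.&lt;/p&gt;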
&lt;p&gt;A better way to break down this regular expression is visually, &lt;a href="https://www.debuggex.com/r/KYfCocdmuBHxHETv"&gt;using Debuggex&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/dash-encoding-regex.png" alt="A visualization of the regular expression, showing how it loops around the inner concept of none of those three characters or one of the three explicit character groupings." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Combining this into the full regular expression that matches a &lt;code&gt;/database/table.format&lt;/code&gt; path is even messier, due to the need to add non-capturing group syntax &lt;code&gt;(?:..)&lt;/code&gt; and named groups &lt;code&gt;(?P&amp;lt;name&amp;gt;...)&lt;/code&gt; - it ends up looking like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;^/(?P&amp;lt;database&amp;gt;[^/]+)/(?P&amp;lt;table&amp;gt;(?:[^\/\-\.]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*?)\.(?P&amp;lt;format&amp;gt;\w+)?$&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Visualized &lt;a href="https://www.debuggex.com/r/aTF6lx5JpaMN6UYz"&gt;with Debuggex&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/dash-encoding-regex-full.png" alt="The more complex regex visualized." style="max-width:100%;" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; Thanks to suggestions &lt;a href="https://twitter.com/dracos/status/1500236433809973248"&gt;from Matthew Somerville&lt;/a&gt; I simplified this further to:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;^/(?P&amp;lt;database&amp;gt;[^/]+)/(?P&amp;lt;table&amp;gt;[^\/\-\.]*|\-/|\-\.|\-\-)*(?P&amp;lt;format&amp;gt;\.\w+)?$&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/dash-encoding-regex-simpler.png" alt="This looks less complex in Debuggex" style="max-width:100%;" /&gt;&lt;/p&gt;

&lt;h4&gt;Next steps: implementation&lt;/h4&gt;
&lt;p&gt;I'm currently working on integrating it into Datasette in &lt;a href="https://github.com/simonw/datasette/pull/1648"&gt;this PR&lt;/a&gt;. The full history of my thinking around this problem can be found &lt;a href="https://github.com/simonw/datasette/issues/1439"&gt;in issue 1439&lt;/a&gt;, with comments stretching back to August last year!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urls"&gt;urls&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="regular-expressions"/><category term="urls"/><category term="datasette"/></entry><entry><title>The unexpected Google wide domain check bypass</title><link href="https://simonwillison.net/2020/Mar/9/unexpected-google-wide-domain-check-bypass/#atom-tag" rel="alternate"/><published>2020-03-09T23:27:41+00:00</published><updated>2020-03-09T23:27:41+00:00</updated><id>https://simonwillison.net/2020/Mar/9/unexpected-google-wide-domain-check-bypass/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://bugs.xdavidhu.me/google/2020/03/08/the-unexpected-google-wide-domain-check-bypass/"&gt;The unexpected Google wide domain check bypass&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Fantastic story of discovering a devious security vulnerability in a bunch of Google products stemming from a single exploitable regular expression in the Google Closure JavaScript library.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=22527842"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;&lt;/p&gt;



</summary><category term="regular-expressions"/><category term="security"/></entry><entry><title>Details of the Cloudflare outage on July 2, 2019</title><link href="https://simonwillison.net/2019/Jul/12/details-cloudflare-outage-july-2-2019/#atom-tag" rel="alternate"/><published>2019-07-12T17:36:25+00:00</published><updated>2019-07-12T17:36:25+00:00</updated><id>https://simonwillison.net/2019/Jul/12/details-cloudflare-outage-july-2-2019/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/"&gt;Details of the Cloudflare outage on July 2, 2019&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Best retrospective I’ve read in a long time. The outage was caused by a backtracking regex rule that was added to the Web Application Firewall project, which rolls out globally and skips most of Cloudflare’s regular gradual rollout process (delightfully animal themed, named DOG for the dogfooding PoP that their employees use, PIG for the Guinea Pig PoPs reserved for free customers, then Canary for the final step) so that they can deploy counter-measures to newly discovered vulnerabilities as quickly as possible. The real value in the retro, though, is that it provides an extremely deep insight into how Cloudflare organize, test and manage their changes. Really interesting stuff.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=20421538"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/operations"&gt;operations&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/cloudflare"&gt;cloudflare&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/postmortem"&gt;postmortem&lt;/a&gt;&lt;/p&gt;



</summary><category term="operations"/><category term="regular-expressions"/><category term="cloudflare"/><category term="postmortem"/></entry><entry><title>r1chardj0n3s/parse: Parse strings using a specification based on the Python format() syntax.</title><link href="https://simonwillison.net/2018/Feb/25/parse/#atom-tag" rel="alternate"/><published>2018-02-25T16:58:32+00:00</published><updated>2018-02-25T16:58:32+00:00</updated><id>https://simonwillison.net/2018/Feb/25/parse/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/r1chardj0n3s/parse"&gt;r1chardj0n3s/parse: Parse strings using a specification based on the Python format() syntax.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Really neat API design: parse() behaves almost exactly in the opposite way to Python’s built-in format(), so you can use format strings as an alternative to regular expressions for extracting specific data from a string.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://github.com/kennethreitz/requests-html/blob/master/Pipfile"&gt;requests-html/Pipfile&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;



</summary><category term="python"/><category term="regular-expressions"/></entry><entry><title>A Regular Expression Matcher: Code by Rob Pike, Exegesis by Brian Kernighan</title><link href="https://simonwillison.net/2017/Dec/5/regex/#atom-tag" rel="alternate"/><published>2017-12-05T18:36:12+00:00</published><updated>2017-12-05T18:36:12+00:00</updated><id>https://simonwillison.net/2017/Dec/5/regex/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html"&gt;A Regular Expression Matcher: Code by Rob Pike, Exegesis by Brian Kernighan&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Delightfully clear and succinct 30-line C implementation of a regular expression matcher that supports $, ^, . and * operations.
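&lt;p&gt;To give a flavour of the approach, here is a rough Python transliteration of the same recursive idea. It is a sketch, not the C code from the article: the function names are adapted to Python style and the slicing is not tuned for speed.&lt;/p&gt;

```python
def match(regexp, text):
    """Search for regexp anywhere in text."""
    if regexp[:1] == "^":
        return match_here(regexp[1:], text)
    # Try every start position, including the empty string at the very end.
    return any(match_here(regexp, text[i:]) for i in range(len(text) + 1))


def match_here(regexp, text):
    """Match regexp at the beginning of text."""
    if regexp == "":
        return True
    if regexp[1:2] == "*":
        return match_star(regexp[0], regexp[2:], text)
    if regexp == "$":
        return text == ""
    if text != "" and regexp[0] in (".", text[0]):
        return match_here(regexp[1:], text[1:])
    return False


def match_star(char, regexp, text):
    """Match zero or more copies of char, then the rest of regexp."""
    while True:
        if match_here(regexp, text):
            return True
        if text == "" or char not in (".", text[0]):
            return False
        text = text[1:]


print(match("^ab*c$", "abbbc"))
print(match("^b", "abc"))
```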

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=15840487"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/c"&gt;c&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rob-pike"&gt;rob-pike&lt;/a&gt;&lt;/p&gt;



</summary><category term="c"/><category term="regular-expressions"/><category term="rob-pike"/></entry><entry><title>Escaping regular expression characters in JavaScript (updated)</title><link href="https://simonwillison.net/2010/Jul/4/escaping/#atom-tag" rel="alternate"/><published>2010-07-04T18:23:00+00:00</published><updated>2010-07-04T18:23:00+00:00</updated><id>https://simonwillison.net/2010/Jul/4/escaping/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://simonwillison.net/2006/Jan/20/escape/#p-6"&gt;Escaping regular expression characters in JavaScript (updated)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
The JavaScript regular expression meta-character escaping code I posted back in 2006 has some serious flaws—I’ve just posted an update to the original post.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/escaping"&gt;escaping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;&lt;/p&gt;



</summary><category term="escaping"/><category term="javascript"/><category term="regular-expressions"/><category term="recovered"/></entry><entry><title>Quoting Andrew Clover</title><link href="https://simonwillison.net/2009/Nov/16/regex/#atom-tag" rel="alternate"/><published>2009-11-16T10:32:15+00:00</published><updated>2009-11-16T10:32:15+00:00</updated><id>https://simonwillison.net/2009/Nov/16/regex/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454"&gt;&lt;p&gt;Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454"&gt;Andrew Clover&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/andrew-clover"&gt;andrew-clover&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/funny"&gt;funny&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/html"&gt;html&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/parsing"&gt;parsing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regex"&gt;regex&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/stackoverflow"&gt;stackoverflow&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/xhtml"&gt;xhtml&lt;/a&gt;&lt;/p&gt;



</summary><category term="andrew-clover"/><category term="funny"/><category term="html"/><category term="parsing"/><category term="regex"/><category term="regular-expressions"/><category term="stackoverflow"/><category term="xhtml"/></entry><entry><title>Django security updates released</title><link href="https://simonwillison.net/2009/Oct/10/django/#atom-tag" rel="alternate"/><published>2009-10-10T00:24:59+00:00</published><updated>2009-10-10T00:24:59+00:00</updated><id>https://simonwillison.net/2009/Oct/10/django/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.djangoproject.com/weblog/2009/oct/09/security/"&gt;Django security updates released&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
A potential denial of service vulnerability has been discovered in the regular expressions used by the Django form library’s EmailField and URLField: a malicious input could trigger pathological performance. Patches (and patched releases) for Django 1.1 and Django 1.0 have been published.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/denial-of-service"&gt;denial-of-service&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/security"&gt;security&lt;/a&gt;&lt;/p&gt;



</summary><category term="denial-of-service"/><category term="django"/><category term="python"/><category term="regular-expressions"/><category term="security"/></entry><entry><title>Introducing Yardbird</title><link href="https://simonwillison.net/2009/May/22/yardbird/#atom-tag" rel="alternate"/><published>2009-05-22T23:13:39+00:00</published><updated>2009-05-22T23:13:39+00:00</updated><id>https://simonwillison.net/2009/May/22/yardbird/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://zork.net/motd/nick/django/introducing-yardbird.html"&gt;Introducing Yardbird&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I absolutely love it: an IRC bot built on top of Twisted that passes incoming messages off to Django code running in a separate thread. Request and Response objects are used to represent incoming and outgoing messages, and Django’s regex-based URL routing is used to dispatch messages to different handling functions based on their content.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/django"&gt;django&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/irc"&gt;irc&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/threads"&gt;threads&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/twisted"&gt;twisted&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/yardbird"&gt;yardbird&lt;/a&gt;&lt;/p&gt;



</summary><category term="django"/><category term="irc"/><category term="regular-expressions"/><category term="threads"/><category term="twisted"/><category term="yardbird"/></entry><entry><title>Escaping regular expression characters in JavaScript</title><link href="https://simonwillison.net/2006/Jan/20/escape/#atom-tag" rel="alternate"/><published>2006-01-20T12:19:13+00:00</published><updated>2006-01-20T12:19:13+00:00</updated><id>https://simonwillison.net/2006/Jan/20/escape/#atom-tag</id><summary type="html">
    &lt;p id="p-0"&gt;JavaScript's support for regular expressions is generally pretty good, but there is one notable omission: an escaping mechanism for literal strings. Say for example you need to create a regular expression that removes a specific string from the end of a string. If you know the string you want to remove when you write the script this is easy:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;newString&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;oldString&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;replace&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-pds"&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;span class="pl-s"&gt;R&lt;/span&gt;&lt;span class="pl-s"&gt;e&lt;/span&gt;&lt;span class="pl-s"&gt;m&lt;/span&gt;&lt;span class="pl-s"&gt;o&lt;/span&gt;&lt;span class="pl-s"&gt;v&lt;/span&gt;&lt;span class="pl-s"&gt;e&lt;/span&gt;&lt;span class="pl-s"&gt; &lt;/span&gt;&lt;span class="pl-s"&gt;f&lt;/span&gt;&lt;span class="pl-s"&gt;r&lt;/span&gt;&lt;span class="pl-s"&gt;o&lt;/span&gt;&lt;span class="pl-s"&gt;m&lt;/span&gt;&lt;span class="pl-s"&gt; &lt;/span&gt;&lt;span class="pl-s"&gt;e&lt;/span&gt;&lt;span class="pl-s"&gt;n&lt;/span&gt;&lt;span class="pl-s"&gt;d&lt;/span&gt;&lt;span class="pl-cce"&gt;$&lt;/span&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;''&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p id="p-1"&gt;But what if the string to be removed comes from a variable? You'll need to construct a regular expression from the variable, using the RegExp constructor function:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;re&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;RegExp&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;stringToRemove&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s"&gt;'$'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;newString&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;oldString&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;replace&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;re&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;''&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p id="p-2"&gt;But what if the string you want to remove may contain regular expression metacharacters - characters like $ or . that affect the behaviour of the expression? Languages such as Python provide functions for escaping these characters (see &lt;a href="https://docs.python.org/2/library/re.html#re.escape" title="Python re module contents"&gt;re.escape&lt;/a&gt;); with JavaScript you have to write your own.&lt;/p&gt;

&lt;p id="p-3"&gt;Here's mine:&lt;/p&gt;

&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-v"&gt;RegExp&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;escape&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;function&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;!&lt;/span&gt;&lt;span class="pl-smi"&gt;arguments&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;callee&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;sRE&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;specials&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-kos"&gt;[&lt;/span&gt;
      &lt;span class="pl-s"&gt;'/'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'.'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'*'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'+'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'?'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'|'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
      &lt;span class="pl-s"&gt;'('&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;')'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'['&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;']'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'{'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'}'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'\\'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt;
      &lt;span class="pl-s"&gt;'^'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'$'&lt;/span&gt;
    &lt;span class="pl-kos"&gt;]&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
    &lt;span class="pl-smi"&gt;arguments&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;callee&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;sRE&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;new&lt;/span&gt; &lt;span class="pl-v"&gt;RegExp&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;
      &lt;span class="pl-s"&gt;'(\\'&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s1"&gt;specials&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;join&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;'|\\'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s"&gt;')'&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'g'&lt;/span&gt;
    &lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;return&lt;/span&gt; &lt;span class="pl-s1"&gt;text&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;replace&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-smi"&gt;arguments&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;callee&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;sRE&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;'\\$1'&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p id="p-4"&gt;This deals with another common problem in JavaScript: compiling a regular expression once (rather than every time you use it) while keeping it local to a function. &lt;code&gt;arguments.callee&lt;/code&gt; inside a function always refers to the function itself, and since JavaScript functions are objects you can store properties on them. In this case, the first time the function is run it compiles a regular expression and stashes it in the &lt;code&gt;sRE&lt;/code&gt; property. On subsequent calls the pre-compiled expression can be reused.&lt;/p&gt;

&lt;p id="p-5"&gt;In the above snippet I've added my function as a property of the &lt;code&gt;RegExp&lt;/code&gt; constructor. There's no pressing reason to do this other than a desire to keep generic functionality relating to regular expression handling in the same place. If you rename the function it will still work as expected, since the use of &lt;code&gt;arguments.callee&lt;/code&gt; eliminates any coupling between the function definition and the rest of the code.&lt;/p&gt;
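
&lt;p&gt;For comparison, here is a minimal sketch of the same suffix-removal task in Python, where &lt;code&gt;re.escape&lt;/code&gt; does the escaping. The &lt;code&gt;remove_suffix&lt;/code&gt; helper and the sample strings are hypothetical, invented for this illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import re


def remove_suffix(text, suffix):
    # Hypothetical helper for illustration only.
    # re.escape backslash-escapes regex metacharacters in suffix,
    # so characters such as "$" or "." are matched literally.
    pattern = re.compile(re.escape(suffix) + "$")
    return pattern.sub("", text)


print(remove_suffix("Total: $5.00", "$5.00"))  # prints "Total: "
&lt;/code&gt;&lt;/pre&gt;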

&lt;p&gt;&lt;strong&gt;Update 18th Feb 2025&lt;/strong&gt;: 19 years after I published this, &lt;code&gt;RegExp.escape()&lt;/code&gt; has &lt;a href="https://simonwillison.net/2025/Feb/18/tc39proposal-regex-escaping/"&gt;made it into the language&lt;/a&gt;!&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/escaping"&gt;escaping&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="escaping"/><category term="javascript"/><category term="regular-expressions"/></entry><entry><title>"sexeger"[::-1]</title><link href="https://simonwillison.net/2003/Sep/17/sexeger/#atom-tag" rel="alternate"/><published>2003-09-17T01:53:18+00:00</published><updated>2003-09-17T01:53:18+00:00</updated><id>https://simonwillison.net/2003/Sep/17/sexeger/#atom-tag</id><summary type="html">
    &lt;p&gt;Via &lt;a href="http://www.nedbatchelder.com/blog/200309.html#e20030916T162322" title="Reversing regular expressions"&gt;Ned Batchelder&lt;/a&gt;, an article on &lt;a href="http://www.perl.com/pub/a/2001/05/01/expressions.html"&gt;Reversing Regular Expressions&lt;/a&gt; from Perl.com. Otherwise known as &lt;a href="http://japhy.perlmonk.org/sexeger/sexeger.html" title="Sex, Eger! or Reverse Regular Expressions"&gt;Sexeger&lt;/a&gt;, these offer a performance boost over normal regular expressions for certain tasks. The basic idea is pretty simple: searching &lt;em&gt;backwards&lt;/em&gt; through a string using a regular expression can be a messy business, but by reversing both the string and the expression, running it, then reversing the result far better performance can be achieved (reversing a string is a relatively inexpensive operation). The example code is in Perl, but I couldn't resist trying it in Python. The challenge is to find the &lt;em&gt;last&lt;/em&gt; number occurring in a string.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;&amp;gt;&amp;gt;&amp;gt; import re
&amp;gt;&amp;gt;&amp;gt; lastnum = re.compile(r'(\d+)(?!\D*\d)')
&amp;gt;&amp;gt;&amp;gt; s = ' this isa 454 asd very very 
  very long strin9 asd9 009 76 with numbers 
  99 in it and here is the last 537 number'
  # NB this was all on one line originally
&amp;gt;&amp;gt;&amp;gt; lastnum.search(s).group(0)
'537'
&amp;gt;&amp;gt;&amp;gt; import timeit
&amp;gt;&amp;gt;&amp;gt; t1 = timeit.Timer('lastnum.search(s).group(0)', 
         'from __main__ import lastnum, s')
&amp;gt;&amp;gt;&amp;gt; print "%.2f usec/pass" % (1000000 * t1.timeit(number=100000)/100000)
26.82 usec/pass
&amp;gt;&amp;gt;&amp;gt; lastnumrev = re.compile(r'(\d+)')
&amp;gt;&amp;gt;&amp;gt; lastnumrev.search(s[::-1]).group(0)[::-1]
'537'
&amp;gt;&amp;gt;&amp;gt; t2 = timeit.Timer('lastnumrev.search(s[::-1]).group(0)[::-1]', 
         'from __main__ import lastnumrev, s')
&amp;gt;&amp;gt;&amp;gt; print "%.2f usec/pass" % (1000000 * t2.timeit(number=100000)/100000)
9.26 usec/pass
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are a few points worth explaining in the above code. The &lt;code class="python"&gt;(?!\D*\d)&lt;/code&gt; part of the first regular expression is a &lt;em&gt;negative lookahead assertion&lt;/em&gt; - it basically means "match the subpattern provided it isn't followed by zero or more non-digits and then a digit". This is the bit that ensures we only get back the last number in the string, and is also the bit that could cause a performance problem.&lt;/p&gt;
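
&lt;p&gt;A small sketch of that assertion in isolation, using a made-up input string: without the lookahead every run of digits matches, with it only the final run does.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import re

text = "a 12 b 34 c"  # made-up sample input

# Plain pattern: every run of digits matches.
print(re.findall(r"\d+", text))            # prints ['12', '34']

# With the negative lookahead: only the run with no digit after it matches.
print(re.findall(r"\d+(?!\D*\d)", text))   # prints ['34']
&lt;/code&gt;&lt;/pre&gt;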

&lt;p&gt;&lt;code class="python"&gt;'some string'[::-1]&lt;/code&gt; is an example of &lt;a href="http://www.python.org/doc/2.3/whatsnew/section-slices.html"&gt;Extended Slices&lt;/a&gt;, introduced in Python 2.3. Its effect is to reverse the string: a step of -1 walks through it from the end to the start, one character at a time.&lt;/p&gt;
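
&lt;p&gt;To see the slice on its own, here is a tiny sketch; the second line is just an extra illustration of what the step argument does.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;text = "some string"

# A step of -1 walks the string from its last character to its first.
print(text[::-1])   # prints "gnirts emos"

# Other steps work too: a step of 2 keeps every second character.
print(text[::2])    # prints "sm tig"
&lt;/code&gt;&lt;/pre&gt;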

&lt;p&gt;The actual benchmarking code makes use of the new &lt;a href="http://www.python.org/doc/2.3/lib/module-timeit.html"&gt;timeit module&lt;/a&gt; from Python 2.3 - I copied it verbatim from that module's &lt;a href="http://www.python.org/doc/2.3/lib/node397.html" title="10.10.2 Examples"&gt;example section&lt;/a&gt; in the manual.&lt;/p&gt;

&lt;p&gt;The results speak for themselves: 26.82 usec/pass for the lookahead assertion expression compared to just 9.26 usec/pass for the reversed regular expression, nearly three times faster. This is definitely a useful trick to add to the tool box.&lt;/p&gt;
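
&lt;p&gt;The transcript above is a Python 2 session. As a rough sketch, the same comparison could be rerun on Python 3 along these lines. The two function names are hypothetical, and timings will differ by machine and interpreter version, so no figures are quoted here.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import re
import timeit

s = ("this isa 454 asd very very very long strin9 asd9 009 76 "
     "with numbers 99 in it and here is the last 537 number")

lastnum = re.compile(r"(\d+)(?!\D*\d)")
lastnumrev = re.compile(r"(\d+)")


def last_number_lookahead(text):
    # Search forwards, rejecting any number that has a later digit after it.
    return lastnum.search(text).group(0)


def last_number_reversed(text):
    # Reverse the string, take the first number, then reverse that back.
    return lastnumrev.search(text[::-1]).group(0)[::-1]


passes = 10000
for func in (last_number_lookahead, last_number_reversed):
    seconds = timeit.timeit(lambda: func(s), number=passes)
    print("%s: %.2f usec/pass" % (func.__name__, 1000000 * seconds / passes))
&lt;/code&gt;&lt;/pre&gt;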
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="python"/><category term="regular-expressions"/></entry><entry><title>Verbose Regular Expressions</title><link href="https://simonwillison.net/2003/Apr/11/verboseRegularExpressions/#atom-tag" rel="alternate"/><published>2003-04-11T03:01:15+00:00</published><updated>2003-04-11T03:01:15+00:00</updated><id>https://simonwillison.net/2003/Apr/11/verboseRegularExpressions/#atom-tag</id><summary type="html">
    &lt;p&gt;Ned Batchelder describes &lt;a href="https://nedbatchelder.com/blog/200304/verbose_python_regular_expressions.html"&gt;Verbose Python regular expressions&lt;/a&gt;. This is one of the things I've known about (as in known that they exist) for ages but have never got around to using. I've been working with some pretty heavy regular expressions recently that could really do with the clarity of being defined in verbose format with comments.&lt;/p&gt;

&lt;p&gt;&lt;acronym title="PHP: Hypertext Preprocessor"&gt;PHP&lt;/acronym&gt; also has support for verbose &lt;acronym title="Regular Expressions"&gt;REs&lt;/acronym&gt;, thanks to the excellent &lt;a href="http://www.php.net/manual/en/ref.pcre.php" title="PHP: Regular Expression Functions (Perl-Compatible)"&gt;pcre functions&lt;/a&gt;. Just use the 'x' modifier as explained on &lt;a href="http://www.php.net/manual/en/pcre.pattern.modifiers.php" title="PHP: Pattern Modifiers"&gt;this manual page&lt;/a&gt;.&lt;/p&gt;
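
&lt;p&gt;For anyone who hasn't seen the Python form, here is a minimal sketch using &lt;code&gt;re.VERBOSE&lt;/code&gt;. The date pattern is a made-up example, not one of the expressions mentioned above.&lt;/p&gt;

&lt;pre&gt;&lt;code class="python"&gt;import re

# Hypothetical pattern for dates such as 2003-04-11.
# In verbose mode, whitespace in the pattern is ignored and
# "#" starts a comment that runs to the end of the line.
date_re = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

match = date_re.search("Posted on 2003-04-11")
print(match.groups())  # prints ('2003', '04', '11')
&lt;/code&gt;&lt;/pre&gt;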
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ned-batchelder"&gt;ned-batchelder&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/php"&gt;php&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/regular-expressions"&gt;regular-expressions&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ned-batchelder"/><category term="php"/><category term="python"/><category term="regular-expressions"/></entry></feed>