<?xml version="1.0" encoding="utf-8"?>
<feed xml:lang="en-us" xmlns="http://www.w3.org/2005/Atom"><title>Simon Willison's Weblog: seo</title><link href="http://simonwillison.net/" rel="alternate"/><link href="http://simonwillison.net/tags/seo.atom" rel="self"/><id>http://simonwillison.net/</id><updated>2024-08-24T23:00:01+00:00</updated><author><name>Simon Willison</name></author><entry><title>SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL</title><link href="https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/#atom-tag" rel="alternate"/><published>2024-08-24T23:00:01+00:00</published><updated>2024-08-24T23:00:01+00:00</updated><id>https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/"&gt;SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A new paper from Google Research describing custom syntax for analytical SQL queries that has been rolling out inside Google since February, reaching 1,600 "seven-day-active users" by August 2024.&lt;/p&gt;
&lt;p&gt;A key idea here is to fix one of the biggest usability problems with standard SQL: the order of the clauses in a query. Starting with &lt;code&gt;SELECT&lt;/code&gt; instead of &lt;code&gt;FROM&lt;/code&gt; has always been confusing; see &lt;a href="https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/"&gt;SQL queries don't start with SELECT&lt;/a&gt; by Julia Evans.&lt;/p&gt;
&lt;p&gt;Here's an example of the new alternative syntax, taken from the &lt;a href="https://github.com/google/zetasql/blob/2024.08.2/docs/pipe-syntax.md"&gt;Pipe query syntax documentation&lt;/a&gt; that was added to Google's open source &lt;a href="https://github.com/google/zetasql"&gt;ZetaSQL&lt;/a&gt; project last week.&lt;/p&gt;
&lt;p&gt;For this SQL query:&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; component_id, &lt;span class="pl-c1"&gt;COUNT&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;)
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; ticketing_system_table
&lt;span class="pl-k"&gt;WHERE&lt;/span&gt;
  &lt;span class="pl-c1"&gt;assignee_user&lt;/span&gt;.&lt;span class="pl-c1"&gt;email&lt;/span&gt; &lt;span class="pl-k"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;username@email.com&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
  &lt;span class="pl-k"&gt;AND&lt;/span&gt; status &lt;span class="pl-k"&gt;IN&lt;/span&gt; (&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;NEW&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;ASSIGNED&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;ACCEPTED&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;)
&lt;span class="pl-k"&gt;GROUP BY&lt;/span&gt; component_id
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; component_id &lt;span class="pl-k"&gt;DESC&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Pipe query alternative would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FROM ticketing_system_table
|&amp;gt; WHERE
    assignee_user.email = 'username@email.com'
    AND status IN ('NEW', 'ASSIGNED', 'ACCEPTED')
|&amp;gt; AGGREGATE COUNT(*)
   GROUP AND ORDER BY component_id DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The Google Research paper is released as a two-column PDF. I &lt;a href="https://news.ycombinator.com/item?id=41339138"&gt;snarked about this&lt;/a&gt; on Hacker News: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Google: you are a web company. Please learn to publish your research papers as web pages.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This remains a long-standing pet peeve of mine. PDFs like this are horrible to read on mobile phones, hard to copy-and-paste from, have poor accessibility (see &lt;a href="https://fedi.simonwillison.net/@simon/113017908957136345"&gt;this Mastodon conversation&lt;/a&gt;) and are generally just &lt;em&gt;bad citizens&lt;/em&gt; of the web.&lt;/p&gt;
&lt;p&gt;Having complained about this I felt compelled to see if I could address it myself. Google's own Gemini Pro 1.5 model can process PDFs, so I uploaded the PDF to &lt;a href="https://aistudio.google.com/"&gt;Google AI Studio&lt;/a&gt; and prompted the &lt;code&gt;gemini-1.5-pro-exp-0801&lt;/code&gt; model like this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Convert this document to neatly styled semantic HTML&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This worked &lt;em&gt;surprisingly well&lt;/em&gt;. It output HTML for about half the document and then stopped, presumably hitting the output length limit, but a follow-up prompt of "and the rest" caused it to continue from where it stopped and run until the end.&lt;/p&gt;
&lt;p&gt;Here's the result (with a banner I added at the top explaining that it's a conversion): &lt;a href="https://static.simonwillison.net/static/2024/Pipe-Syntax-In-SQL.html"&gt;Pipe-Syntax-In-SQL.html&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I haven't compared the two completely, so I can't guarantee there are no omissions or mistakes.&lt;/p&gt;
&lt;p&gt;The figures from the PDF aren't present - Gemini Pro output tags like &lt;code&gt;&amp;lt;img src="figure1.png" alt="Figure 1: SQL syntactic clause order doesn't match semantic evaluation order. (From [25].)"&amp;gt;&lt;/code&gt; but did nothing to help me create those images.&lt;/p&gt;
&lt;p&gt;Amusingly the document ends with &lt;code&gt;&amp;lt;p&amp;gt;(A long list of references, which I won't reproduce here to save space.)&amp;lt;/p&amp;gt;&lt;/code&gt; rather than actually including the references from the paper!&lt;/p&gt;
&lt;p&gt;So this isn't a perfect solution, but considering it took just the first prompt I could think of, it's a very promising start. I expect someone willing to spend more than the couple of minutes I invested in this could produce a very useful HTML alternative version of the paper with the assistance of Gemini Pro.&lt;/p&gt;
&lt;p&gt;One last amusing note: I posted a link to this &lt;a href="https://news.ycombinator.com/item?id=41339238"&gt;to Hacker News&lt;/a&gt; a few hours ago. Just now when I searched Google for the exact title of the paper my HTML version was already the third result!&lt;/p&gt;
&lt;p&gt;I've now added a &lt;code&gt;&amp;lt;meta name="robots" content="noindex, follow"&amp;gt;&lt;/code&gt; tag to the top of the HTML to keep this unverified &lt;a href="https://simonwillison.net/tags/slop/"&gt;AI slop&lt;/a&gt; out of their search index. This is a good reminder of how much better HTML is than PDF for sharing information on the web!&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41338877"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pdf"&gt;pdf&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sql"&gt;sql&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/julia-evans"&gt;julia-evans&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/generative-ai"&gt;generative-ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gemini"&gt;gemini&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/slop"&gt;slop&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="pdf"/><category term="seo"/><category term="sql"/><category term="ai"/><category term="julia-evans"/><category term="generative-ai"/><category term="llms"/><category term="gemini"/><category term="slop"/></entry><entry><title>Google is the only search engine that works on Reddit now thanks to AI deal</title><link href="https://simonwillison.net/2024/Jul/24/google-reddit/#atom-tag" rel="alternate"/><published>2024-07-24T18:29:55+00:00</published><updated>2024-07-24T18:29:55+00:00</updated><id>https://simonwillison.net/2024/Jul/24/google-reddit/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.404media.co/google-is-the-only-search-engine-that-works-on-reddit-now-thanks-to-ai-deal/"&gt;Google is the only search engine that works on Reddit now thanks to AI deal&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is depressing. As of around June 25th &lt;a href="https://www.reddit.com/robots.txt"&gt;reddit.com/robots.txt&lt;/a&gt; contains this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;User-agent: *
Disallow: /
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Along with a link to Reddit's &lt;a href="https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy"&gt;Public Content Policy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Is this a direct result of Google's deal to license Reddit content for AI training, rumored &lt;a href="https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/"&gt;at $60 million&lt;/a&gt;? That's not been confirmed but it looks likely, especially since accessing that &lt;code&gt;robots.txt&lt;/code&gt; using the &lt;a href="https://search.google.com/test/rich-results"&gt;Google Rich Results testing tool&lt;/a&gt; (hence proxied via their IP) appears to return a different file, via &lt;a href="https://news.ycombinator.com/item?id=41057033#41058375"&gt;this comment&lt;/a&gt;, &lt;a href="https://gist.github.com/simonw/be0e8e595178207b1b3dce3b81eacfb3"&gt;my copy here&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=41057033"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/reddit"&gt;reddit&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search-engines"&gt;search-engines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ai"&gt;ai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/llms"&gt;llms&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="reddit"/><category term="search-engines"/><category term="seo"/><category term="ai"/><category term="llms"/></entry><entry><title>Give people something to link to so they can talk about your features and ideas</title><link href="https://simonwillison.net/2024/Jul/13/give-people-something-to-link-to/#atom-tag" rel="alternate"/><published>2024-07-13T16:06:28+00:00</published><updated>2024-07-13T16:06:28+00:00</updated><id>https://simonwillison.net/2024/Jul/13/give-people-something-to-link-to/#atom-tag</id><summary type="html">
    &lt;p&gt;If you have a project, an idea, a product feature, or anything else that you want other people to understand and have conversations about... give them something to link to!&lt;/p&gt;
&lt;p&gt;Two illustrative examples are ChatGPT Code Interpreter and Boring Technology.&lt;/p&gt;
&lt;h4 id="chatgpt-code-interpreter-is-effectively-invisible"&gt;ChatGPT Code Interpreter is effectively invisible&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;ChatGPT Code Interpreter&lt;/strong&gt; has been one of my favourite AI tools for over a year. It's the feature of ChatGPT which allows the bot to write &lt;em&gt;and then execute&lt;/em&gt; Python code as part of responding to your prompts. It's incredibly powerful... and almost invisible! If you don't know how to use prompts to activate the feature you may not realize it exists.&lt;/p&gt;
&lt;p&gt;OpenAI don't even have a help page for it (and it very desperately needs documentation) - if you search their site you'll find &lt;a href="https://platform.openai.com/docs/assistants/tools/code-interpreter"&gt;confusing technical docs&lt;/a&gt; about an API feature and &lt;a href="https://community.openai.com/t/how-can-i-access-the-code-interpreter-plugin-model/205304"&gt;misleading outdated forum threads&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I evangelize this tool &lt;em&gt;a lot&lt;/em&gt;, but OpenAI really aren't helping me do that. I end up linking people to &lt;a href="https://simonwillison.net/tags/code-interpreter/"&gt;my code-interpreter tag page&lt;/a&gt; because it's more useful than anything on OpenAI's own site.&lt;/p&gt;
&lt;p&gt;Compare this with Claude's similar Artifacts feature, which at least has an &lt;a href="https://support.anthropic.com/en/articles/9487310-what-are-artifacts-and-how-do-i-use-them"&gt;easily discovered help page&lt;/a&gt; - though &lt;a href="https://www.anthropic.com/news/claude-3-5-sonnet"&gt;the Artifacts announcement post&lt;/a&gt; was shared alongside the Claude 3.5 Sonnet launch, so it isn't obviously linkable. Even that help page isn't quite what I'm after. Features deserve dedicated pages!&lt;/p&gt;
&lt;p&gt;GitHub understand this: here are their feature landing pages for &lt;a href="https://github.com/features/codespaces"&gt;Codespaces&lt;/a&gt; and &lt;a href="https://github.com/features/copilot"&gt;Copilot&lt;/a&gt; (I could even guess the URL for Copilot's page based on the Codespaces one).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; It turns out there IS documentation about Code Interpreter mode... but I failed to find it because it didn't use those terms anywhere on the page! The title is &lt;a href="https://help.openai.com/en/articles/8437071-data-analysis-with-chatgpt"&gt;Data analysis with ChatGPT&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This amuses me greatly because OpenAI have been oscillating on the name for this feature almost since they launched - Code Interpreter, then Advanced Data Analysis, now Data analysis with ChatGPT. I made fun of this &lt;a href="https://simonwillison.net/2023/Oct/17/open-questions/#open-questions.034.jpeg"&gt;last year&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id="boring-technology-an-idea-with-a-website"&gt;Boring Technology: an idea with a website&lt;/h4&gt;
&lt;p&gt;Dan McKinley coined the term &lt;strong&gt;Boring Technology&lt;/strong&gt; in &lt;a href="https://mcfunley.com/choose-boring-technology"&gt;an essay in 2015&lt;/a&gt;. The key idea is that any development team has a limited capacity to solve new problems which should be reserved for the things that make their product unique. For everything else they should pick the most boring and well-understood technologies available to them - stuff where any bugs or limitations have been understood and discussed online for years.&lt;/p&gt;
&lt;p&gt;(I'm very proud that Django has earned the honorific of "boring technology" in this context!)&lt;/p&gt;
&lt;p&gt;Dan turned that essay into a talk, and then he turned that talk into a website with a brilliant domain name:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://boringtechnology.club/"&gt;boringtechnology.club&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The idea has stuck. I've had many productive conversations about it, and more importantly if someone &lt;em&gt;hasn't&lt;/em&gt; heard the term before I can drop in that one link and they'll be up to speed a few minutes later.&lt;/p&gt;
&lt;p&gt;I've tried to do this myself for some of my own ideas: &lt;a href="https://simonwillison.net/2021/Jul/28/baked-data/"&gt;baked data&lt;/a&gt;, &lt;a href="https://simonwillison.net/2020/Oct/9/git-scraping/"&gt;git scraping&lt;/a&gt; and &lt;a href="https://simonwillison.net/series/prompt-injection/"&gt;prompt injection&lt;/a&gt; all have pages that I frequently link people to. I never went as far as committing to a domain though and I think maybe that was a mistake - having a clear message that "this is the key page to link to" is a very powerful thing.&lt;/p&gt;
&lt;h4 id="this-is-about-both-seo-and-conversations"&gt;This is about both SEO and conversations&lt;/h4&gt;
&lt;p&gt;One obvious goal here is SEO: if someone searches for your product feature you want them to land on your own site, not surrender valuable attention to someone else who's squatting on the search term.&lt;/p&gt;
&lt;p&gt;I personally value the conversation side of it even more. Hyperlinks are the best thing about the web - if I want to talk about something I'd much rather drop in a link to the definitive explanation rather than waste a paragraph (as I did earlier with Code Interpreter) explaining what the thing is for the umpteenth time!&lt;/p&gt;
&lt;p&gt;If you have an idea, project or feature that you want people to understand and discuss, build it the web page it deserves. &lt;strong&gt;Give people something to link to!&lt;/strong&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/github"&gt;github&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/marketing"&gt;marketing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/writing"&gt;writing&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/openai"&gt;openai&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/chatgpt"&gt;chatgpt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/claude"&gt;claude&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/boring-technology"&gt;boring-technology&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/code-interpreter"&gt;code-interpreter&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/coding-agents"&gt;coding-agents&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="github"/><category term="marketing"/><category term="seo"/><category term="writing"/><category term="openai"/><category term="chatgpt"/><category term="claude"/><category term="boring-technology"/><category term="code-interpreter"/><category term="coding-agents"/></entry><entry><title>Weeknotes: python_requires, documentation SEO</title><link href="https://simonwillison.net/2022/Jan/25/weeknotes/#atom-tag" rel="alternate"/><published>2022-01-25T23:54:52+00:00</published><updated>2022-01-25T23:54:52+00:00</updated><id>https://simonwillison.net/2022/Jan/25/weeknotes/#atom-tag</id><summary type="html">
    &lt;p&gt;Fixed Datasette on Python 3.6 for the last time. Worked on documentation infrastructure improvements. Spent some time with Fly Volumes.&lt;/p&gt;
&lt;h4&gt;Datasette 0.60.1 for Python 3.6&lt;/h4&gt;
&lt;p&gt;I got &lt;a href="https://github.com/simonw/datasette/issues/1609"&gt;a report&lt;/a&gt; that users of Python 3.6 were seeing errors when they tried to install Datasette.&lt;/p&gt;
&lt;p&gt;I actually &lt;a href="https://github.com/simonw/datasette/issues/1577"&gt;dropped support&lt;/a&gt; for 3.6 a few weeks ago, but that shouldn't have affected the already released Datasette 0.60 - so something was clearly wrong.&lt;/p&gt;
&lt;p&gt;This led me to finally get my head around how &lt;code&gt;pip install&lt;/code&gt; handles Python version support. It's actually a very neat system which I hadn't previously taken the time to understand.&lt;/p&gt;
&lt;p&gt;Python packages can (and should!) provide a &lt;code&gt;python_requires=&lt;/code&gt; line in their &lt;code&gt;setup.py&lt;/code&gt;. That line for Datasette currently looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python_requires="&amp;gt;=3.7"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But in the 0.60 release it was still this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python_requires="&amp;gt;=3.6"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you run &lt;code&gt;pip install package&lt;/code&gt; this becomes part of the &lt;code&gt;pip&lt;/code&gt; resolution mechanism - it will default to attempting to install the highest available version of the package that supports your version of Python.&lt;/p&gt;
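&lt;p&gt;To make that resolution rule concrete, here's a minimal, illustrative sketch of the behaviour - pick the highest release whose &lt;code&gt;python_requires&lt;/code&gt; matches the running interpreter. The release numbers are hypothetical, and real pip uses the &lt;code&gt;packaging&lt;/code&gt; library with full PEP 440 semantics rather than this simplified comparison:&lt;/p&gt;

```python
# Minimal, illustrative sketch of pip's python_requires resolution:
# pick the highest release whose specifier matches the running Python.
# Real pip uses the packaging library and full PEP 440 semantics.

def parse_version(version):
    """Turn a version string like '3.6' into a comparable tuple (3, 6)."""
    return tuple(int(part) for part in version.split("."))

def satisfies(python_version, requires):
    """Check a simple '>=X.Y' style python_requires specifier."""
    if requires.startswith(">="):
        return parse_version(python_version) >= parse_version(requires[2:])
    raise ValueError("unsupported specifier: " + requires)

def pick_release(releases, python_version):
    """Choose the highest release compatible with python_version."""
    compatible = [
        (parse_version(version), version)
        for version, requires in releases.items()
        if satisfies(python_version, requires)
    ]
    return max(compatible)[1] if compatible else None

# Hypothetical release metadata for illustration only
releases = {"0.60": ">=3.6", "0.60.1": ">=3.6", "0.61": ">=3.7"}
print(pick_release(releases, "3.6"))  # -> 0.60.1
print(pick_release(releases, "3.7"))  # -> 0.61
```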
&lt;p&gt;So why did &lt;code&gt;pip install datasette&lt;/code&gt; break? It turned out that one of Datasette's dependencies, &lt;a href="https://www.uvicorn.org/"&gt;Uvicorn&lt;/a&gt;, had dropped support for Python 3.6 but did not have a &lt;code&gt;python_requires&lt;/code&gt; indicator that pip could use to resolve the correct version.&lt;/p&gt;
&lt;p&gt;Coincidentally, Uvicorn actually added &lt;code&gt;python_requires&lt;/code&gt; just &lt;a href="https://github.com/encode/uvicorn/pull/1328"&gt;a few weeks ago&lt;/a&gt; - but it wasn't out in a release yet, so &lt;code&gt;pip install&lt;/code&gt; couldn't take it into account.&lt;/p&gt;
&lt;p&gt;I raised this issue with the Uvicorn development team and they turned around a fix really promptly - &lt;a href="https://github.com/encode/uvicorn/releases/tag/0.17.0.post1"&gt;0.17.0.post1&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But before I had seen how fast the Uvicorn team could move I figured out how to fix the issue myself, thanks to &lt;a href="https://twitter.com/samuel_hames/status/1484327636860293121"&gt;a tip from Sam Hames&lt;/a&gt; on Twitter.&lt;/p&gt;
&lt;p&gt;The key to fixing it was &lt;a href="https://www.python.org/dev/peps/pep-0508/#environment-markers"&gt;environment markers&lt;/a&gt;, a feature of Python's dependency resolution system that allows you to provide extra rules for when a dependency should be used.&lt;/p&gt;
&lt;p&gt;Here's an &lt;code&gt;install_requires=&lt;/code&gt; example showing these in action:&lt;/p&gt;
&lt;pre&gt;&lt;span class="pl-s1"&gt;install_requires&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;[
    &lt;span class="pl-s"&gt;"uvicorn~=0.11"&lt;/span&gt;,
    &lt;span class="pl-s"&gt;'uvicorn&amp;lt;=0.16.0;python_version&amp;lt;="3.6"'&lt;/span&gt;
]&lt;/pre&gt;
&lt;p&gt;This will install a Uvicorn version that loosely matches 0.11, but over-rides that rule to specify that it must be &lt;code&gt;&amp;lt;=0.16.0&lt;/code&gt; if the user's Python version is 3.6 or less.&lt;/p&gt;
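&lt;p&gt;As an illustration of how a marker like this gets evaluated, here's a small sketch that decides which requirement lines apply for a given Python version. It only handles the single &lt;code&gt;python_version&amp;lt;="X.Y"&lt;/code&gt; marker form used above - real tools evaluate the full PEP 508 grammar via &lt;code&gt;packaging.markers&lt;/code&gt;:&lt;/p&gt;

```python
# Sketch: evaluate simple PEP 508 environment markers to decide which
# install_requires lines apply. Only handles python_version<="X.Y";
# real tools use packaging.markers for the full grammar.
import sys

def active_requirements(install_requires, python_version=None):
    """Return the requirement specs whose environment markers apply."""
    if python_version is None:
        python_version = "%d.%d" % sys.version_info[:2]
    pv = tuple(int(p) for p in python_version.split("."))
    chosen = []
    for req in install_requires:
        if ";" not in req:
            # No marker: the requirement always applies
            chosen.append(req)
            continue
        spec, marker = req.split(";", 1)
        # Extract the upper bound from python_version<="X.Y"
        bound = marker.split("<=", 1)[1].strip().strip('"')
        if pv <= tuple(int(p) for p in bound.split(".")):
            chosen.append(spec.strip())
    return chosen

reqs = ["uvicorn~=0.11", 'uvicorn<=0.16.0;python_version<="3.6"']
print(active_requirements(reqs, "3.6"))  # both lines apply
print(active_requirements(reqs, "3.8"))  # only the loose pin applies
```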
&lt;p&gt;Since Datasette 0.60.1 will be the last version of Datasette to support Python 3.6, I decided to play it safe and pin the dependencies of every library to the most recent version that I have tested in Python 3.6 myself. Here's &lt;a href="https://github.com/simonw/datasette/blob/0.60.1/setup.py#L44-L78"&gt;the setup.py file&lt;/a&gt; I constructed for that.&lt;/p&gt;
&lt;p&gt;This ties into a larger open question for me about Datasette's approach to pinning dependencies.&lt;/p&gt;
&lt;p&gt;The rule of thumb I've heard is that you should pin dependencies for standalone released tools but leave dependencies loose for libraries that people will use as dependencies in their own projects - ensuring those users can run with different dependency versions if their projects require them.&lt;/p&gt;
&lt;p&gt;Datasette is &lt;em&gt;mostly&lt;/em&gt; a standalone tool - but it can also be used as a library. I'm actually planning to make its use as a library more obvious through &lt;a href="https://github.com/simonw/datasette/issues/1398"&gt;improvements to the documentation&lt;/a&gt; in the future.&lt;/p&gt;
&lt;p&gt;As such, pinning exact versions doesn't feel quite right to me.&lt;/p&gt;
&lt;p&gt;Maybe the solution here is to split the reusable library parts of Datasette out into a separate package - maybe &lt;code&gt;datasette-core&lt;/code&gt; - and have the &lt;code&gt;datasette&lt;/code&gt; CLI package depend on exact pinned dependencies while the &lt;code&gt;datasette-core&lt;/code&gt; library uses loose dependencies instead.&lt;/p&gt;
&lt;p&gt;Still thinking about this.&lt;/p&gt;
&lt;h4&gt;Datasette documentation tweaks&lt;/h4&gt;
&lt;p&gt;Datasette uses &lt;a href="https://readthedocs.org/"&gt;Read The Docs&lt;/a&gt; to host the documentation. Among other benefits, this makes it easy to host multiple documentation versions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.datasette.io/en/latest/"&gt;docs.datasette.io/en/latest/&lt;/a&gt; is the latest version of the documentation, continuously deployed from the &lt;code&gt;main&lt;/code&gt; branch on GitHub&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.datasette.io/en/stable/"&gt;docs.datasette.io/en/stable/&lt;/a&gt; is the documentation for the most recent stable (non alpha or beta) release - currently 0.60.1. This is the version you get when you run &lt;code&gt;pip install datasette&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.datasette.io/en/0.59/"&gt;docs.datasette.io/en/0.59/&lt;/a&gt; is the documentation for version 0.59 - and every version back to 0.22.1 is hosted under similar URLs, currently covering 73 different releases.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those previous versions all automatically show a note at the top of the page warning that this is outdated documentation and linking back to stable - a feature which Read The Docs provides automatically.&lt;/p&gt;
&lt;p&gt;But... I noticed that &lt;code&gt;/en/latest/&lt;/code&gt; didn't do this. I wanted a warning banner to let people know that they were looking at the in-development version of the documentation.&lt;/p&gt;
&lt;p&gt;After some digging around, I fixed it using &lt;a href="https://til.simonwillison.net/readthedocs/link-from-latest-to-stable"&gt;a little bit of extra JavaScript&lt;/a&gt; added to my documentation template. Here's the key implementation detail:&lt;/p&gt;
&lt;div class="highlight highlight-source-js"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;jQuery&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-k"&gt;function&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;$&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
  &lt;span class="pl-c"&gt;// Show banner linking to /stable/ if this is a /latest/ page&lt;/span&gt;
  &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-c1"&gt;!&lt;/span&gt;&lt;span class="pl-pds"&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;span class="pl-cce"&gt;\/&lt;/span&gt;latest&lt;span class="pl-cce"&gt;\/&lt;/span&gt;&lt;span class="pl-c1"&gt;/&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;test&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;pathname&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;return&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-k"&gt;var&lt;/span&gt; &lt;span class="pl-s1"&gt;stableUrl&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;location&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;pathname&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-en"&gt;replace&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s"&gt;"/latest/"&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;"/stable/"&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;;&lt;/span&gt;
  &lt;span class="pl-c"&gt;// Check it's not a 404&lt;/span&gt;
  &lt;span class="pl-en"&gt;fetch&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;stableUrl&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt; &lt;span class="pl-c1"&gt;method&lt;/span&gt;: &lt;span class="pl-s"&gt;"HEAD"&lt;/span&gt; &lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;then&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;response&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-c1"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-kos"&gt;(&lt;/span&gt;&lt;span class="pl-s1"&gt;response&lt;/span&gt;&lt;span class="pl-kos"&gt;.&lt;/span&gt;&lt;span class="pl-c1"&gt;status&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;200&lt;/span&gt;&lt;span class="pl-kos"&gt;)&lt;/span&gt; &lt;span class="pl-kos"&gt;{&lt;/span&gt;
      &lt;span class="pl-c"&gt;// Page exists, insert a warning banner linking to it&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;This uses &lt;code&gt;fetch()&lt;/code&gt; to make an HTTP HEAD request for the stable documentation page, and inserts a warning banner only if that page isn't a 404. This avoids linking to a non-existent documentation page if it has been created in development but not yet released as part of a stable release. &lt;a href="https://docs.datasette.io/en/latest/csv_export.html"&gt;Example here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://static.simonwillison.net/static/2022/latest-docs-warning.png" alt="Screenshot of the documentation page with a banner that says: This documentation covers the development version of Datasette. See this page for the current stable release." style="max-width:100%;" /&gt;&lt;/p&gt;
&lt;p&gt;Thinking about this problem got me thinking about SEO.&lt;/p&gt;
&lt;p&gt;A problem I've had with other projects that host multiple versions of their documentation is that sometimes I'll search on Google and end up landing on a page covering a much older version of their project. I think I've had this happen for both PostgreSQL and Python in the past.&lt;/p&gt;
&lt;p&gt;What's best practice for avoiding this? I &lt;a href="https://twitter.com/simonw/status/1484287724773203971"&gt;asked on Twitter&lt;/a&gt; and also started digging around for answers. "If in doubt, imitate Django" is a good general rule of thumb, so I had a look at how Django did this and spotted the following in the HTML of one of their &lt;a href="https://docs.djangoproject.com/en/2.2/topics/db/"&gt;prior version pages&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;link&lt;/span&gt; &lt;span class="pl-c1"&gt;rel&lt;/span&gt;="&lt;span class="pl-s"&gt;canonical&lt;/span&gt;" &lt;span class="pl-c1"&gt;href&lt;/span&gt;="&lt;span class="pl-s"&gt;https://docs.djangoproject.com/en/4.0/topics/db/&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;So Django are using the &lt;a href="https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls"&gt;rel=canonical&lt;/a&gt; tag to point crawlers towards their most recent release.&lt;/p&gt;
&lt;p&gt;I decided to implement that myself... and then discovered that the Datasette documentation was doing it already! Read The Docs &lt;a href="https://docs.readthedocs.io/en/latest/custom_domains.html#canonical-urls"&gt;implement this piece&lt;/a&gt; of SEO best practice out of the box.&lt;/p&gt;
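&lt;p&gt;If you want to check what a page declares as its canonical URL yourself, a few lines of stdlib Python will pull it out. This is a sketch (the class name is mine, not from any library) using only &lt;code&gt;html.parser&lt;/code&gt;:&lt;/p&gt;

```python
# Sketch: extract the rel=canonical URL from a page's HTML, the tag
# Django and Read The Docs use to point crawlers at the current
# release. Stdlib only; CanonicalFinder is an illustrative name.
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("rel") == "canonical":
            self.canonical = attributes.get("href")

page = '<link rel="canonical" href="https://docs.djangoproject.com/en/4.0/topics/db/">'
finder = CanonicalFinder()
finder.feed(page)
print(finder.canonical)
```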
&lt;h4&gt;Datasette on Fly volumes&lt;/h4&gt;
&lt;p&gt;This one isn't released yet, but I made some good progress on it this week.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://fly.io/"&gt;Fly.io&lt;/a&gt; announced this week that they would be providing 3GB of volume storage to accounts on their free tier. They called this announcement &lt;a href="https://fly.io/blog/free-postgres/"&gt;Free Postgres Databases&lt;/a&gt;, but tucked away in the blog post was this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The lede is "free Postgres" because that's what matters to full stack apps. You don't have to use these for Postgres. If SQLite is more your jam, mount up to 3GB of volumes and use "free SQLite." Yeah, we're probably underselling that.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;(There is &lt;a href="https://twitter.com/mrkurt/status/1484609372114272261"&gt;evidence&lt;/a&gt; that they may have been &lt;a href="https://xkcd.com/356/"&gt;nerd sniping&lt;/a&gt; me with that paragraph.)&lt;/p&gt;
&lt;p&gt;I have a plugin called &lt;a href="https://datasette.io/plugins/datasette-publish-fly"&gt;datasette-publish-fly&lt;/a&gt; which publishes Datasette instances to Fly. Obviously that needs to grow support for configuring volumes!&lt;/p&gt;
&lt;p&gt;I've so far &lt;a href="https://github.com/simonw/datasette-publish-fly/issues/11"&gt;completed the research&lt;/a&gt; on how that feature should work. The next step is to finish &lt;a href="https://github.com/simonw/datasette-publish-fly/issues/10"&gt;implementing the feature&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;sqlite-utils --help&lt;/h4&gt;
&lt;p&gt;I pushed out minor release &lt;a href="https://sqlite-utils.datasette.io/en/stable/changelog.html#v3-22-1"&gt;sqlite-utils 3.22.1&lt;/a&gt; today with one notable improvement: every single one of the 39 commands in the CLI tool now includes an example of usage as part of the &lt;code&gt;--help&lt;/code&gt; text.&lt;/p&gt;

&lt;p&gt;This feature was inspired by the new &lt;a href="https://sqlite-utils.datasette.io/en/stable/cli-reference.html#cli-reference"&gt;CLI reference page&lt;/a&gt; in the documentation, which shows the help output for every command on one page - making it much easier to spot potential improvements.&lt;/p&gt;
&lt;h4&gt;Releases this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/sqlite-utils/releases/tag/3.22.1"&gt;3.22.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/sqlite-utils/releases"&gt;94 releases total&lt;/a&gt;) - 2022-01-26
&lt;br /&gt;Python CLI utility and library for manipulating SQLite databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/s3-credentials"&gt;s3-credentials&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/s3-credentials/releases/tag/0.10"&gt;0.10&lt;/a&gt; - (&lt;a href="https://github.com/simonw/s3-credentials/releases"&gt;10 releases total&lt;/a&gt;) - 2022-01-25
&lt;br /&gt;A tool for creating credentials for accessing S3 buckets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette"&gt;datasette&lt;/a&gt;&lt;/strong&gt;: &lt;a href="https://github.com/simonw/datasette/releases/tag/0.60.1"&gt;0.60.1&lt;/a&gt; - (&lt;a href="https://github.com/simonw/datasette/releases"&gt;106 releases total&lt;/a&gt;) - 2022-01-21
&lt;br /&gt;An open source multi-tool for exploring and publishing data&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TIL this week&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/sqlite/json-extract-path"&gt;json_extract() path syntax in SQLite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/aws/helper-for-boto-aws-pagination"&gt;Helper function for pagination using AWS boto3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/pixelmator/pixel-editing-favicon"&gt;Pixel editing a favicon with Pixelmator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/readthedocs/documentation-seo-canonical"&gt;Promoting the stable version of the documentation using rel=canonical&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/readthedocs/link-from-latest-to-stable"&gt;Linking from /latest/ to /stable/ on Read The Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://til.simonwillison.net/fly/undocumented-graphql-api"&gt;Using the undocumented Fly GraphQL API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/python"&gt;python&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/fly"&gt;fly&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/sqlite-utils"&gt;sqlite-utils&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/read-the-docs"&gt;read-the-docs&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="python"/><category term="seo"/><category term="datasette"/><category term="weeknotes"/><category term="fly"/><category term="sqlite-utils"/><category term="read-the-docs"/></entry><entry><title>datasette-block-robots</title><link href="https://simonwillison.net/2020/Jun/23/datasette-block-robots/#atom-tag" rel="alternate"/><published>2020-06-23T03:28:00+00:00</published><updated>2020-06-23T03:28:00+00:00</updated><id>https://simonwillison.net/2020/Jun/23/datasette-block-robots/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/simonw/datasette-block-robots"&gt;datasette-block-robots&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Another little Datasette plugin: this one adds a &lt;code&gt;/robots.txt&lt;/code&gt; page with &lt;code&gt;Disallow: /&lt;/code&gt; to block all indexing of a Datasette instance from respectable search engine crawlers. I built this in less than ten minutes from idea to deploy to PyPI thanks to the &lt;a href="https://github.com/simonw/datasette-plugin"&gt;datasette-plugin&lt;/a&gt; cookiecutter template.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/crawling"&gt;crawling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/robots-txt"&gt;robots-txt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;&lt;/p&gt;



</summary><category term="crawling"/><category term="plugins"/><category term="projects"/><category term="robots-txt"/><category term="seo"/><category term="datasette"/></entry><entry><title>Building a sitemap.xml with a one-off Datasette plugin</title><link href="https://simonwillison.net/2020/Jan/6/sitemap-xml/#atom-tag" rel="alternate"/><published>2020-01-06T23:02:48+00:00</published><updated>2020-01-06T23:02:48+00:00</updated><id>https://simonwillison.net/2020/Jan/6/sitemap-xml/#atom-tag</id><summary type="html">
    &lt;p&gt;One of the fun things about launching a new website is re-learning what it takes to promote a website from scratch on the modern web. I've been thoroughly enjoying using &lt;a href="https://www.niche-museums.com/"&gt;Niche Museums&lt;/a&gt; as an excuse to explore 2020-era SEO.&lt;/p&gt;

&lt;p&gt;I used to use Google Webmaster Tools for this, but apparently that got rebranded as &lt;a href="https://en.wikipedia.org/wiki/Google_Search_Console"&gt;Google Search Console&lt;/a&gt; back in May 2015. It's really useful. It shows which search terms got impressions, which ones got clicks and lets you review which of your pages are indexed and which have returned errors.&lt;/p&gt;

&lt;p&gt;Niche Museums has been live since October 24th, but it was a SPA for the first month. I &lt;a href="https://simonwillison.net/2019/Nov/25/niche-museums/"&gt;switched it to server-side rendering&lt;/a&gt; (with separate pages for each museum) on November 25th. The Google Search Console shows it first appeared in search results on 2nd December.&lt;/p&gt;

&lt;p&gt;So far, I've had 35 clicks! Not exactly earth-shattering, but every site has to start somewhere.&lt;/p&gt;

&lt;img style="max-width: 100%" src="https://static.simonwillison.net/static/2020/google-search-console.png" alt="Screenshot of the Google Search Console." /&gt;

&lt;p&gt;In a bid to increase the number of indexed pages, I decided to build a &lt;a href="https://www.sitemaps.org/protocol.html"&gt;sitemap.xml&lt;/a&gt;. This probably isn't necessary - Google advise that you might not need one if your site is "small", defined as 500 pages or less (Niche Museums lists 88 museums, though it's still increasing by one every day). It's nice to be able to view that sitemap and confirm that those pages have all been indexed inside the Search Console though.&lt;/p&gt;

&lt;p&gt;Since Niche Museums is entirely powered by a customized &lt;a href="https://datasette.readthedocs.io/"&gt;Datasette&lt;/a&gt; instance, I needed to figure out how best to build that sitemap.&lt;/p&gt;

&lt;h3 id="one-off-plugins"&gt;One-off plugins&lt;/h3&gt;

&lt;p&gt;Datasette's most powerful customization options are provided by &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html"&gt;the plugins mechanism&lt;/a&gt;. Back in June I &lt;a href="https://simonwillison.net/2019/Jun/23/datasette-asgi/"&gt;ported Datasette to ASGI&lt;/a&gt;, and the subsequent release of Datasette 0.29 introduced &lt;a href="https://simonwillison.net/2019/Jul/14/sso-asgi/"&gt;a new asgi_wrapper plugin hook&lt;/a&gt;. This hook makes it possible to intercept requests and implement an entirely custom response - ideal for serving up a &lt;code&gt;/sitemap.xml&lt;/code&gt; page.&lt;/p&gt;

&lt;p&gt;I considered building and releasing a generic &lt;code&gt;datasette-sitemap&lt;/code&gt; plugin that could be used anywhere, but that felt like overkill for this particular problem. Instead, I decided to take advantage of &lt;a href="https://datasette.readthedocs.io/en/stable/plugins.html#writing-plugins"&gt;the --plugins-dir=&lt;/a&gt; Datasette option to build a one-off custom plugin for the site.&lt;/p&gt;

&lt;p&gt;The Datasette instance that runs Niche Museums starts up like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ datasette browse.db about.db \
    --template-dir=templates/ \
    --plugins-dir=plugins/ \
    --static css:static/ \
    -m metadata.json&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This serves the two SQLite database files, loads custom templates from the &lt;code&gt;templates/&lt;/code&gt; directory, sets up &lt;a href="https://www.niche-museums.com/css/museums.css"&gt;www.niche-museums.com/css/museums.css&lt;/a&gt; to serve files from the &lt;code&gt;static/&lt;/code&gt; directory and loads metadata settings from &lt;code&gt;metadata.json&lt;/code&gt;. All of these files &lt;a href="https://github.com/simonw/museums"&gt;are on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It also tells Datasette to look for any Python files in the &lt;code&gt;plugins/&lt;/code&gt; directory and load those up as plugins.&lt;/p&gt;

&lt;p&gt;I currently have four Python files in that directory - you can &lt;a href="https://github.com/simonw/museums/tree/c745927f989fc173cc30609d83fe36d847080621/plugins"&gt;see them here&lt;/a&gt;. The &lt;code&gt;sitemap.xml&lt;/code&gt; is implemented using the new &lt;a href="https://github.com/simonw/museums/blob/c745927f989fc173cc30609d83fe36d847080621/plugins/sitemap.py"&gt;sitemap.py plugin file&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the first part of that file, which wraps the Datasette ASGI app with middleware that checks for the URL &lt;code&gt;/robots.txt&lt;/code&gt; or &lt;code&gt;/sitemap.xml&lt;/code&gt; and returns custom content for either of them:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datasette import hookimpl
from datasette.utils.asgi import asgi_send


@hookimpl
def asgi_wrapper(datasette):
    def wrap_with_robots_and_sitemap(app):
        async def robots_and_sitemap(scope, receive, send):
            if scope["path"] == "/robots.txt":
                await asgi_send(
                    send, "Sitemap: https://www.niche-museums.com/sitemap.xml", 200
                )
            elif scope["path"] == "/sitemap.xml":
                await send_sitemap(send, datasette)
            else:
                await app(scope, receive, send)

        return robots_and_sitemap

    return wrap_with_robots_and_sitemap&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The boilerplate here is a little convoluted, but this does the job. I'm considering adding alternative plugin hooks for custom pages that could simplify this in the future.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;asgi_wrapper(datasette)&lt;/code&gt; plugin function is expected to return a function which will be used to &lt;em&gt;wrap&lt;/em&gt; the Datasette ASGI application. In this case that wrapper function is called &lt;code&gt;wrap_with_robots_and_sitemap(app)&lt;/code&gt;. Here's &lt;a href="https://github.com/simonw/datasette/blob/3c861f363df02a59a67c59036278338e4760d2ed/datasette/app.py#L647-L652"&gt;the Datasette core code&lt;/a&gt; that builds the ASGI app and applies the wrappers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;asgi = AsgiLifespan(
    AsgiTracer(DatasetteRouter(self, routes)), on_startup=setup_db
)
for wrapper in pm.hook.asgi_wrapper(datasette=self):
    asgi = wrapper(asgi)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So this plugin will be executed as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;asgi = wrap_with_robots_and_sitemap(asgi)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;wrap_with_robots_and_sitemap(app)&lt;/code&gt; function then returns another, asynchronous function. This function follows the ASGI protocol specification, and has the following signature and body:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;async def robots_and_sitemap(scope, receive, send):
    if scope["path"] == "/robots.txt":
        await asgi_send(
            send, "Sitemap: https://www.niche-museums.com/sitemap.xml", 200
        )
    elif scope["path"] == "/sitemap.xml":
        await send_sitemap(send, datasette)
    else:
        await app(scope, receive, send)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the incoming URL path is &lt;code&gt;/robots.txt&lt;/code&gt;, the function directly returns a reference to the sitemap, as seen at &lt;a href="https://www.niche-museums.com/robots.txt"&gt;www.niche-museums.com/robots.txt&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If the path is &lt;code&gt;/sitemap.xml&lt;/code&gt;, it calls the &lt;code&gt;send_sitemap(...)&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;For any other path, it proxies the call to the original ASGI app function that was passed to the wrapper function: &lt;code&gt;await app(scope, receive, send)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The most interesting part of the implementation is that &lt;code&gt;send_sitemap()&lt;/code&gt; function. This is the function which constructs the sitemap.xml returned by &lt;a href="https://www.niche-museums.com/sitemap.xml"&gt;www.niche-museums.com/sitemap.xml&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's what that function looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;async def send_sitemap(send, datasette):
    content = [
        '&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;',
        '&amp;lt;urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"&amp;gt;',
    ]
    for db in datasette.databases.values():
        hidden = await db.hidden_table_names()
        tables = await db.table_names()
        for table in tables:
            if table in hidden:
                continue
            for row in await db.execute("select id from [{}]".format(table)):
                content.append(
                    "&amp;lt;url&amp;gt;&amp;lt;loc&amp;gt;https://www.niche-museums.com/browse/{}/{}&amp;lt;/loc&amp;gt;&amp;lt;/url&amp;gt;".format(
                        table, row["id"]
                    )
                )
    content.append("&amp;lt;/urlset&amp;gt;")
    await asgi_send(send, "\n".join(content), 200, content_type="application/xml")&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The key trick here is to use the &lt;code&gt;datasette&lt;/code&gt; instance object which was passed to the &lt;code&gt;asgi_wrapper()&lt;/code&gt; plugin hook.&lt;/p&gt;

&lt;p&gt;The code uses that instance to introspect the attached SQLite databases. It loops through them listing all of their tables, and filtering out any hidden tables (which in this case are tables used by the SQLite FTS indexing mechanism). Then for each of those tables it runs &lt;code&gt;select id from [tablename]&lt;/code&gt; and uses the results to build the URLs that are listed in the sitemap.&lt;/p&gt;

&lt;p&gt;Finally, the resulting XML is concatenated together and sent back to the client with an &lt;code&gt;application/xml&lt;/code&gt; content type.&lt;/p&gt;

&lt;p&gt;For the moment, Niche Museums only has one table that needs including in the sitemap - the &lt;a href="https://www.niche-museums.com/browse/museums"&gt;museums table&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I have &lt;a href="https://github.com/simonw/datasette/issues/576"&gt;a longer-term goal&lt;/a&gt; to provide detailed documentation for the &lt;code&gt;datasette&lt;/code&gt; object here: since it's exposed to plugins it's become part of the API interface for Datasette itself. I want to stabilize this before I release Datasette 1.0.&lt;/p&gt;

&lt;h3 id="this-week-in-new-museums"&gt;This week's new museums&lt;/h3&gt;

&lt;ul&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/82"&gt;The Donkey Sanctuary&lt;/a&gt; in East Devon&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/84"&gt;Hazel-Atlas Sand Mine&lt;/a&gt; in Antioch&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/85"&gt;Ilfracombe Tunnels Beaches&lt;/a&gt; in North Devon&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/86"&gt;The Beat Museum&lt;/a&gt; in San Francisco&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/87"&gt;Sea Lions at Pier 39&lt;/a&gt; in San Francisco&lt;/li&gt;&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/88"&gt;Griffith Observatory&lt;/a&gt; in Los Angeles&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.niche-museums.com/browse/museums/89"&gt;Cohen Bray House&lt;/a&gt; in Oakland&lt;/li&gt;&lt;/ul&gt;

&lt;p&gt;I had a lot of fun writing up the &lt;a href="https://www.niche-museums.com/browse/museums/88"&gt;Griffith Observatory&lt;/a&gt;: it turns out founding donor Griffith J. Griffith was &lt;a href="https://www.kcet.org/shows/lost-la/the-complex-life-of-griffith-j-griffith"&gt;a truly terrible individual&lt;/a&gt;.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/museums"&gt;museums&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/plugins"&gt;plugins&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/projects"&gt;projects&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/datasette"&gt;datasette&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/weeknotes"&gt;weeknotes&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="museums"/><category term="plugins"/><category term="projects"/><category term="seo"/><category term="datasette"/><category term="weeknotes"/></entry><entry><title>Evolving “nofollow” – new ways to identify the nature of links</title><link href="https://simonwillison.net/2019/Sep/10/evolving-nofollow-new-ways-to-identify-the-nature-of-links/#atom-tag" rel="alternate"/><published>2019-09-10T21:16:53+00:00</published><updated>2019-09-10T21:16:53+00:00</updated><id>https://simonwillison.net/2019/Sep/10/evolving-nofollow-new-ways-to-identify-the-nature-of-links/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://webmasters.googleblog.com/2019/09/evolving-nofollow-new-ways-to-identify.html"&gt;Evolving “nofollow” – new ways to identify the nature of links&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Slightly confusing announcement from Google: they’re introducing rel=ugc and rel=sponsored in addition to rel=nofollow, and will be treating all three values as “hints” for their indexing system. They’re very unclear as to what the concrete effects of these hints will be, presumably because they will become part of the secret sauce of their ranking algorithm.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="https://news.ycombinator.com/item?id=20930270"&gt;Hacker News&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/nofollow"&gt;nofollow&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="nofollow"/><category term="seo"/></entry><entry><title>Googlebot's Javascript random() function is deterministic</title><link href="https://simonwillison.net/2018/Feb/7/googlebot-javascript/#atom-tag" rel="alternate"/><published>2018-02-07T02:41:16+00:00</published><updated>2018-02-07T02:41:16+00:00</updated><id>https://simonwillison.net/2018/Feb/7/googlebot-javascript/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.tomanthony.co.uk/blog/googlebot-javascript-random/"&gt;Googlebot&amp;#x27;s Javascript random() function is deterministic&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
random() as executed by Googlebot returns the same predictable sequence. More interestingly, Googlebot runs a much faster timer for setTimeout and setInterval&#8212;as Tom Anthony points out, &#8220;Why actually wait 5 seconds when you are a bot?&#8221;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/crawling"&gt;crawling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;&lt;/p&gt;



</summary><category term="crawling"/><category term="google"/><category term="seo"/></entry><entry><title>What is the plural of blitz?</title><link href="https://simonwillison.net/2017/Nov/25/what-is-the-plural-of-blitz/#atom-tag" rel="alternate"/><published>2017-11-25T17:42:07+00:00</published><updated>2017-11-25T17:42:07+00:00</updated><id>https://simonwillison.net/2017/Nov/25/what-is-the-plural-of-blitz/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.wordhippo.com/what-is/the-plural-of/blitz.html"&gt;What is the plural of blitz?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Wow, WordHippo is a straight up masterclass in keyword SEO tactics. Everything from the page URL to the keyword-crammed content to the enormous quantity of related links.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;&lt;/p&gt;



</summary><category term="seo"/></entry><entry><title>Whether 404 custom error page necessary for a website?</title><link href="https://simonwillison.net/2014/Jan/3/whether-404-custom-error/#atom-tag" rel="alternate"/><published>2014-01-03T13:14:00+00:00</published><updated>2014-01-03T13:14:00+00:00</updated><id>https://simonwillison.net/2014/Jan/3/whether-404-custom-error/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Whether-404-custom-error-page-necessary-for-a-website/answer/Simon-Willison"&gt;Whether 404 custom error page necessary for a website?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;They aren't required, but if you don't have a custom 404 page you're missing out on a very easy way of improving the user experience of your site, and protecting against expired or incorrect links from elsewhere on the web.&lt;/p&gt;

&lt;p&gt;Even just a search box and a link to your homepage is enough to ensure visitors who arrive on a 404 can still visit the rest of your site, and hopefully find what they were looking for when they clicked on the link.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/http"&gt;http&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="http"/><category term="seo"/><category term="quora"/></entry><entry><title>What are good sources to learn about SEO?</title><link href="https://simonwillison.net/2013/Dec/5/what-are-good-sources/#atom-tag" rel="alternate"/><published>2013-12-05T11:33:00+00:00</published><updated>2013-12-05T11:33:00+00:00</updated><id>https://simonwillison.net/2013/Dec/5/what-are-good-sources/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-are-good-sources-to-learn-about-SEO/answer/Simon-Willison"&gt;What are good sources to learn about SEO?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;span&gt;&lt;a href="http://moz.com/beginners-guide-to-seo"&gt;The Beginner's Guide to SEO&lt;/a&gt;&lt;/span&gt; from Moz (previously SEOMoz) is an excellent introduction to SEO fundamentals.
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="seo"/><category term="quora"/></entry><entry><title>Do comments really count for SEO link building?</title><link href="https://simonwillison.net/2013/Jan/1/do-comments-really-count/#atom-tag" rel="alternate"/><published>2013-01-01T09:48:00+00:00</published><updated>2013-01-01T09:48:00+00:00</updated><id>https://simonwillison.net/2013/Jan/1/do-comments-really-count/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Do-comments-really-count-for-SEO-link-building/answer/Simon-Willison"&gt;Do comments really count for SEO link building?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most sensible commenting systems will put rel=nofollow on links to discourage comment spam, which will have a significant effect on SEO.&lt;/p&gt;
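As a rough illustration of the kind of transformation such a commenting system applies before rendering user-submitted content - a minimal sketch, where the function name and regex approach are my own; real systems use a proper HTML sanitizer and handle existing rel attributes:

```python
import re

def nofollow_links(comment_html):
    # Add rel="nofollow" to every <a> tag in user-submitted comment HTML,
    # signalling to search engines that the links should not pass ranking
    # credit. Simplified: assumes no <a> tag already has a rel attribute.
    return re.sub(r"<a ", '<a rel="nofollow" ', comment_html)

print(nofollow_links('Great post! <a href="https://example.com/">my site</a>'))
# Great post! <a rel="nofollow" href="https://example.com/">my site</a>
```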
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/search-engines"&gt;search-engines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="search-engines"/><category term="seo"/><category term="quora"/></entry><entry><title>What are the ways to  Convert Dynamic JSP pages to a Static HTML to Appear in Google search results?</title><link href="https://simonwillison.net/2012/Sep/22/what-are-the-ways/#atom-tag" rel="alternate"/><published>2012-09-22T17:40:00+00:00</published><updated>2012-09-22T17:40:00+00:00</updated><id>https://simonwillison.net/2012/Sep/22/what-are-the-ways/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-are-the-ways-to-Convert-Dynamic-JSP-pages-to-a-Static-HTML-to-Appear-in-Google-search-results/answer/Simon-Willison"&gt;What are the ways to  Convert Dynamic JSP pages to a Static HTML to Appear in Google search results?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You don't have to do anything. You're misunderstanding how dynamic server-side languages like JSP work.&lt;/p&gt;

&lt;p&gt;Hit view source in your browser on a "static" HTML site (if you can find one - they're increasingly rare these days, and as an end user it's actually impossible to tell the difference, as I'm about to explain). You'll see what Google's search crawler sees: a bunch of HTML.&lt;/p&gt;

&lt;p&gt;Now do the same thing on a "dynamically" generated site - anything with a .php or .jsp extension is a good start (since they're revealing their technology choices through their URL, which is a bit tacky but does at least let you see what they're using). You'll see a bunch of HTML.&lt;/p&gt;

&lt;p&gt;Dynamic server-side technologies like JSP, PHP, Django, Rails, &lt;span&gt;&lt;a href="http://ASP.NET"&gt;ASP.NET&lt;/a&gt;&lt;/span&gt; etc run on the server - they generate HTML, which is then served to regular users and to search engine crawlers alike. It's not possible to tell for sure if that HTML was generated by code or is just a single static file that someone hosted on a web server.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="seo"/><category term="quora"/></entry><entry><title>What is the optimal description length in the Apple App Store?</title><link href="https://simonwillison.net/2012/Feb/9/what-is-the-optimal/#atom-tag" rel="alternate"/><published>2012-02-09T17:14:00+00:00</published><updated>2012-02-09T17:14:00+00:00</updated><id>https://simonwillison.net/2012/Feb/9/what-is-the-optimal/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-is-the-optimal-description-length-in-the-Apple-App-Store/answer/Simon-Willison"&gt;What is the optimal description length in the Apple App Store?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Have you ever come across one of those ugly, long pages advertising an ebook - the ones that bang on for dozens of paragraphs with bullet points, pictures, testimonials, headings, more testimonials, more bullet points and so on?&lt;/p&gt;

&lt;p&gt;Guess what: they work! The general format is called a "sales letter" - &lt;span&gt;&lt;a href="http://en.m.wikipedia.org/wiki/Sales_letter"&gt;http://en.m.wikipedia.org/wiki/S...&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;We know they work because people have been split testing them for decades.&lt;/p&gt;

&lt;p&gt;I imagine iPhone developers have discovered that the same trick (way too much information) works for 99 cent purchases on the App Store.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/iphone"&gt;iphone&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/ios"&gt;ios&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="iphone"/><category term="seo"/><category term="quora"/><category term="ios"/></entry><entry><title>Why does Google use "Allow" in robots.txt, when the standard seems to be "Disallow?"</title><link href="https://simonwillison.net/2012/Feb/4/why-does-google-use/#atom-tag" rel="alternate"/><published>2012-02-04T09:45:00+00:00</published><updated>2012-02-04T09:45:00+00:00</updated><id>https://simonwillison.net/2012/Feb/4/why-does-google-use/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Why-does-Google-use-Allow-in-robots-txt-when-the-standard-seems-to-be-Disallow/answer/Simon-Willison"&gt;Why does Google use &amp;quot;Allow&amp;quot; in robots.txt, when the standard seems to be &amp;quot;Disallow?&amp;quot;&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Disallow command prevents search engines from crawling your site.&lt;/p&gt;

&lt;p&gt;The Allow command allows them to crawl your site.&lt;/p&gt;
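To make the distinction concrete, Allow is mostly useful for carving an exception out of a broader Disallow rule. A hypothetical robots.txt (the paths here are invented for illustration) might look like this:

```
User-agent: *
Disallow: /admin/
Allow: /admin/help/
```

Everything under /admin/ is blocked except /admin/help/; a file with no Disallow rules at all already permits full crawling.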

&lt;p&gt;If you're using the Google Webmaster tools, you probably want Google to crawl your site.&lt;/p&gt;

&lt;p&gt;Am I misunderstanding your question?&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/crawling"&gt;crawling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search-engines"&gt;search-engines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="crawling"/><category term="google"/><category term="search-engines"/><category term="seo"/><category term="quora"/></entry><entry><title>Why is Google indexing &amp; displaying www1 versions of my site and how might I stop this?</title><link href="https://simonwillison.net/2012/Jan/9/why-is-google-indexing/#atom-tag" rel="alternate"/><published>2012-01-09T12:43:00+00:00</published><updated>2012-01-09T12:43:00+00:00</updated><id>https://simonwillison.net/2012/Jan/9/why-is-google-indexing/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Why-is-Google-indexing-displaying-www1-versions-of-my-site-and-how-might-I-stop-this/answer/Simon-Willison"&gt;Why is Google indexing &amp;amp; displaying www1 versions of my site and how might I stop this?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You should stop serving your site to the public on multiple subdomains. Configure your site to serve a 301 permanent redirect from www1-www4 to the equivalent page on www - and make sure that your site, when accessed without the www, redirects to the right place as well.&lt;/p&gt;

&lt;p&gt;The use of 301s will avoid any SEO penalty.&lt;/p&gt;

&lt;p&gt;Why should you take this relatively extreme measure? Because serving on multiple subdomains hurts you in a bunch of ways:&lt;/p&gt;

&lt;p&gt;1) Your SEO is spread across multiple copies of the same page, hurting your page rank.&lt;/p&gt;

&lt;p&gt;2) Your cookies may end up spread across multiple domains, hurting your analytics and resulting in frustrated users who are signed in on only some of your subdomains.&lt;/p&gt;

&lt;p&gt;3) You're damaging your scores on social media sharing sites. To use quite an old example, delicious used to use the number of bookmarks to a unique URL to decide what would appear on their "popular" page. Having multiple URLs for a piece of content split that score, making it much less likely you would appear there.&lt;/p&gt;

&lt;p&gt;4) You're making life harder for yourself should you need to switch to serving your entire site over SSL (which you may need to do to see Google search referral information as they move more of their search results pages to SSL).&lt;/p&gt;

&lt;p&gt;Using rel=canonical is a good short-term fix, but it's not too hard to implement the proper 301 fix, and in my opinion it's well worth the effort.&lt;/p&gt;
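&lt;p&gt;As a rough sketch of that 301 fix (assuming nginx and a made-up example.com - adapt the hostnames to your own setup):&lt;/p&gt;

```
# Hypothetical nginx config: 301 the www1-www4 and bare hosts to the canonical www
server {
    server_name example.com www1.example.com www2.example.com www3.example.com www4.example.com;
    return 301 http://www.example.com$request_uri;
}
```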
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/domains"&gt;domains&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search-engines"&gt;search-engines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="domains"/><category term="google"/><category term="search-engines"/><category term="seo"/><category term="quora"/></entry><entry><title>What are the best SEO conferences around Cincinnati?</title><link href="https://simonwillison.net/2012/Jan/6/what-are-the-best/#atom-tag" rel="alternate"/><published>2012-01-06T14:50:00+00:00</published><updated>2012-01-06T14:50:00+00:00</updated><id>https://simonwillison.net/2012/Jan/6/what-are-the-best/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/What-are-the-best-SEO-conferences-around-Cincinnati/answer/Simon-Willison"&gt;What are the best SEO conferences around Cincinnati?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It doesn't look like there are many (any?) SEO events in Cincinnati, but Chicago has SES in November 2012: &lt;span&gt;&lt;a href="http://sesconference.com/chicago/"&gt;http://sesconference.com/chicago/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/conferences"&gt;conferences&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="conferences"/><category term="seo"/><category term="quora"/></entry><entry><title>Does domain name masking negatively effect SEO?</title><link href="https://simonwillison.net/2011/Dec/30/does-domain-name-masking/#atom-tag" rel="alternate"/><published>2011-12-30T17:18:00+00:00</published><updated>2011-12-30T17:18:00+00:00</updated><id>https://simonwillison.net/2011/Dec/30/does-domain-name-masking/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Does-domain-name-masking-negatively-effect-SEO/answer/Simon-Willison"&gt;Does domain name masking negatively effect SEO?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yes, because you've made it impossible for people to share links to sub-pages on your site - which means you won't get incoming links to those pages, a crucial ranking metric.&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="seo"/><category term="quora"/></entry><entry><title>Is it a good idea to allocate URLs such as quora.com/username to users?</title><link href="https://simonwillison.net/2010/Dec/22/is-it-a-good/#atom-tag" rel="alternate"/><published>2010-12-22T15:17:00+00:00</published><updated>2010-12-22T15:17:00+00:00</updated><id>https://simonwillison.net/2010/Dec/22/is-it-a-good/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/Is-it-a-good-idea-to-allocate-URLs-such-as-quora-com-username-to-users/answer/Simon-Willison"&gt;Is it a good idea to allocate URLs such as quora.com/username to users?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There's an interesting discussion about this issue on this question: &lt;span&gt;&lt;a href="https://www.quora.com/How-do-sites-prevent-vanity-URLs-from-colliding-with-future-features"&gt;How do sites prevent vanity URLs from colliding with future features?&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urls"&gt;urls&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="seo"/><category term="urls"/><category term="quora"/></entry><entry><title>If I have data that loads using  json / JavaScript will it get indexed by Google?</title><link href="https://simonwillison.net/2010/Oct/29/if-i-have-data/#atom-tag" rel="alternate"/><published>2010-10-29T15:30:00+00:00</published><updated>2010-10-29T15:30:00+00:00</updated><id>https://simonwillison.net/2010/Oct/29/if-i-have-data/#atom-tag</id><summary type="html">
    &lt;p&gt;&lt;em&gt;My answer to &lt;a href="https://www.quora.com/If-I-have-data-that-loads-using-json-JavaScript-will-it-get-indexed-by-Google/answer/Simon-Willison"&gt;If I have data that loads using  json / JavaScript will it get indexed by Google?&lt;/a&gt; on Quora&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;No. Personally I dislike sites with content that is only accessible through JavaScript, but if you absolutely insist on doing this you should look into implementing the Google Ajax Crawling mechanism: &lt;span&gt;&lt;a href="http://code.google.com/web/ajaxcrawling/"&gt;http://code.google.com/web/ajaxc...&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ajax"&gt;ajax&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jquery"&gt;jquery&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/json"&gt;json&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/web-development"&gt;web-development&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/quora"&gt;quora&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="ajax"/><category term="jquery"/><category term="json"/><category term="seo"/><category term="web-development"/><category term="quora"/></entry><entry><title>Great Literature Retitled To Boost Website Traffic</title><link href="https://simonwillison.net/2010/Jun/17/copy/#atom-tag" rel="alternate"/><published>2010-06-17T10:32:00+00:00</published><updated>2010-06-17T10:32:00+00:00</updated><id>https://simonwillison.net/2010/Jun/17/copy/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.mcsweeneys.net/links/lists/27lacher.html"&gt;Great Literature Retitled To Boost Website Traffic&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
“7 Awesome Ways Barnyard Animals Are Like Communism”.

    &lt;p&gt;&lt;small&gt;Via &lt;a href="http://delicious.com/jacobian/mcsweeneys+seo"&gt;Jacob Kaplan-Moss&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/copy"&gt;copy&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/funny"&gt;funny&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/awful"&gt;awful&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/headlines"&gt;headlines&lt;/a&gt;&lt;/p&gt;



</summary><category term="copy"/><category term="funny"/><category term="seo"/><category term="recovered"/><category term="awful"/><category term="headlines"/></entry><entry><title>Quoting Mark Pilgrim</title><link href="https://simonwillison.net/2010/Jun/8/html5/#atom-tag" rel="alternate"/><published>2010-06-08T20:48:00+00:00</published><updated>2010-06-08T20:48:00+00:00</updated><id>https://simonwillison.net/2010/Jun/8/html5/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://hg.diveintohtml5.org/hgweb.cgi/rev/31e07449843a7982c119bc7fe2c69b595a7e46f5"&gt;&lt;p&gt;I’m renaming the book to “Dive Into HTML 5” for better SEO. This is not a joke. The book is the #5 search result for “HTML5” (no space) but #13 for “HTML 5” (with a space). I get 514 visitors a day searching Google for “HTML5” but only 53 visitors a day searching for “HTML 5”.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://hg.diveintohtml5.org/hgweb.cgi/rev/31e07449843a7982c119bc7fe2c69b595a7e46f5"&gt;Mark Pilgrim&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/html5"&gt;html5&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/mark-pilgrim"&gt;mark-pilgrim&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/recovered"&gt;recovered&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/diveintohtml5"&gt;diveintohtml5&lt;/a&gt;&lt;/p&gt;



</summary><category term="html5"/><category term="mark-pilgrim"/><category term="seo"/><category term="recovered"/><category term="diveintohtml5"/></entry><entry><title>Official Google Webmaster Blog: A proposal for making AJAX crawlable</title><link href="https://simonwillison.net/2009/Oct/8/horrible/#atom-tag" rel="alternate"/><published>2009-10-08T17:52:31+00:00</published><updated>2009-10-08T17:52:31+00:00</updated><id>https://simonwillison.net/2009/Oct/8/horrible/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html"&gt;Official Google Webmaster Blog: A proposal for making AJAX crawlable&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
It's horrible! The Google crawler would map &lt;code&gt;url#!state&lt;/code&gt; to &lt;code&gt;url?_escaped_fragment_=state&lt;/code&gt;, then expect your site to provide rendered HTML that reflects that state (they even go as far as to suggest running a headless browser within your web server to do this). Just stick to progressive enhancement instead, it's far less hideous. It looks like the proposal may have originated with the GWT team.
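&lt;p&gt;The URL mapping itself is easy to sketch - a rough illustration only, assuming no existing query string:&lt;/p&gt;

```python
from urllib.parse import quote

def escaped_fragment_url(url):
    # Google's proposed mapping: url#!state is fetched by the crawler as
    # url?_escaped_fragment_=state, with the fragment percent-encoded.
    # Assumes the URL carries no existing query string.
    base, sep, fragment = url.partition("#!")
    if not sep:
        return url  # no hash-bang fragment, nothing to rewrite
    return base + "?_escaped_fragment_=" + quote(fragment, safe="")

escaped_fragment_url("http://example.com/page#!p=about")
# "http://example.com/page?_escaped_fragment_=p%3Dabout"
```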


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/ajax"&gt;ajax&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/crawling"&gt;crawling&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/gwt"&gt;gwt&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/javascript"&gt;javascript&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/progressive-enhancement"&gt;progressive-enhancement&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search-engines"&gt;search-engines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;&lt;/p&gt;



</summary><category term="ajax"/><category term="crawling"/><category term="google"/><category term="gwt"/><category term="javascript"/><category term="progressive-enhancement"/><category term="search-engines"/><category term="seo"/></entry><entry><title>Specify your canonical</title><link href="https://simonwillison.net/2009/Feb/14/canonical/#atom-tag" rel="alternate"/><published>2009-02-14T11:28:20+00:00</published><updated>2009-02-14T11:28:20+00:00</updated><id>https://simonwillison.net/2009/Feb/14/canonical/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html"&gt;Specify your canonical&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
You can now use a link rel=“canonical” to tell Google that a page has a canonical URL elsewhere. I’ve run into this problem a bunch of times—in some sites it really does make sense to have the same content shown in two different places—and this seems like a neat solution that could apply to much more than just metadata for external search engines.
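&lt;p&gt;The markup is a one-liner in the page's &lt;code&gt;head&lt;/code&gt; (hypothetical URL):&lt;/p&gt;

```
&amp;lt;link rel="canonical" href="http://example.com/canonical-page"/&amp;gt;
```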


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/canonical"&gt;canonical&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/metadata"&gt;metadata&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/relcanonical"&gt;relcanonical&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/search-engines"&gt;search-engines&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/urls"&gt;urls&lt;/a&gt;&lt;/p&gt;



</summary><category term="canonical"/><category term="google"/><category term="metadata"/><category term="relcanonical"/><category term="search-engines"/><category term="seo"/><category term="urls"/></entry><entry><title>Underscores are now word separators, proclaims Google</title><link href="https://simonwillison.net/2008/Aug/13/underscores/#atom-tag" rel="alternate"/><published>2008-08-13T13:06:16+00:00</published><updated>2008-08-13T13:06:16+00:00</updated><id>https://simonwillison.net/2008/Aug/13/underscores/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://news.cnet.com/8301-10784_3-9748779-7.html"&gt;Underscores are now word separators, proclaims Google&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
I missed this story last year—the change was announced by Matt Cutts at WordCamp 2007.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/hyphens"&gt;hyphens&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/matt-cutts"&gt;matt-cutts&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/underscores"&gt;underscores&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/wordcamp"&gt;wordcamp&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/wordpress"&gt;wordpress&lt;/a&gt;&lt;/p&gt;



</summary><category term="google"/><category term="hyphens"/><category term="matt-cutts"/><category term="seo"/><category term="underscores"/><category term="wordcamp"/><category term="wordpress"/></entry><entry><title>Search Engine Optimization Through Hoax News</title><link href="https://simonwillison.net/2008/May/22/search/#atom-tag" rel="alternate"/><published>2008-05-22T18:09:27+00:00</published><updated>2008-05-22T18:09:27+00:00</updated><id>https://simonwillison.net/2008/May/22/search/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://blogoscoped.com/archive/2008-05-22-n16.html"&gt;Search Engine Optimization Through Hoax News&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Devious new black-hat SEO technique: invent a news story that’s pure link-bait. The recent “13 year old steals dad’s credit card to buy hookers” story was a hoax: it was a pure play for PageRank.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/blackhat"&gt;blackhat&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/google"&gt;google&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pagerank"&gt;pagerank&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;&lt;/p&gt;



</summary><category term="blackhat"/><category term="google"/><category term="pagerank"/><category term="seo"/></entry><entry><title>Some thoughts on Mahalo</title><link href="https://simonwillison.net/2007/Aug/20/some/#atom-tag" rel="alternate"/><published>2007-08-20T17:23:46+00:00</published><updated>2007-08-20T17:23:46+00:00</updated><id>https://simonwillison.net/2007/Aug/20/some/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.skrenta.com/2007/08/some_thoughts_on_mahalo.html"&gt;Some thoughts on Mahalo&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Rich Skrenta with notes on running a large site that lives and dies by SEO traffic.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/mahalo"&gt;mahalo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/rich-skrenta"&gt;rich-skrenta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;&lt;/p&gt;



</summary><category term="mahalo"/><category term="rich-skrenta"/><category term="seo"/></entry><entry><title>Quoting Rich Skrenta</title><link href="https://simonwillison.net/2007/Apr/7/early/#atom-tag" rel="alternate"/><published>2007-04-07T00:32:46+00:00</published><updated>2007-04-07T00:32:46+00:00</updated><id>https://simonwillison.net/2007/Apr/7/early/#atom-tag</id><summary type="html">
    &lt;blockquote cite="http://www.skrenta.com/2007/04/early_adopter_pilotfish_pornog.html"&gt;&lt;p&gt;If you're designing social media systems, you should be keeping an eye on the $2B industry that sells links from your site to their clients.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p class="cite"&gt;&amp;mdash; &lt;a href="http://www.skrenta.com/2007/04/early_adopter_pilotfish_pornog.html"&gt;Rich Skrenta&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/rich-skrenta"&gt;rich-skrenta&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;&lt;/p&gt;



</summary><category term="rich-skrenta"/><category term="seo"/></entry><entry><title>Why people hate SEO... (and why SMO is bulls$%t)</title><link href="https://simonwillison.net/2007/Feb/8/smo/#atom-tag" rel="alternate"/><published>2007-02-08T07:47:19+00:00</published><updated>2007-02-08T07:47:19+00:00</updated><id>https://simonwillison.net/2007/Feb/8/smo/#atom-tag</id><summary type="html">
    
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.calacanis.com/2007/02/07/why-people-hate-seo-and-why-smo-is-bulls-t/"&gt;Why people hate SEO... (and why SMO is bulls$%t)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
Jason Calacanis explains SMO, or “Social Media Optimisation”—digg spamming now has its own TLA.


    &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/jason-calacanis"&gt;jason-calacanis&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/smo"&gt;smo&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/spam"&gt;spam&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/tla"&gt;tla&lt;/a&gt;&lt;/p&gt;



</summary><category term="jason-calacanis"/><category term="seo"/><category term="smo"/><category term="spam"/><category term="tla"/></entry><entry><title>The dangers of PageRank</title><link href="https://simonwillison.net/2004/Feb/6/dangers/#atom-tag" rel="alternate"/><published>2004-02-06T16:58:23+00:00</published><updated>2004-02-06T16:58:23+00:00</updated><id>https://simonwillison.net/2004/Feb/6/dangers/#atom-tag</id><summary type="html">
&lt;p&gt;A well-documented side effect of the weblog format is that it brings Google PageRank in almost absurd quantities. I'm now the 5th result for &lt;a href="http://www.google.com/search?q=simon" title="Google Search: simon"&gt;simon&lt;/a&gt; on Google, and I've been the top result for &lt;a href="http://www.google.com/search?q=simon+willison"&gt;simon willison&lt;/a&gt; almost since the day I launched. High rankings, however, are not always a good thing, especially when combined with a comment system. A growing number of bloggers have found themselves at the top position for terms of little or no relevance to the rest of their sites, which in turn can attract truly surreal comments from visitors from search engines who may never have encountered a blog before.&lt;/p&gt;

&lt;p&gt;I know of a couple of entries on my own blog that are attracting this kind of traffic. The most interesting is probably &lt;a href="/2003/Aug/13/artificialDiamonds/"&gt;this entry&lt;/a&gt; on &lt;a href="http://www.google.com/search?q=artificial+diamonds" title="Google Search: artificial diamonds"&gt;artificial diamonds&lt;/a&gt;, which has attracted comments from both buyers and sellers of artificial gems. My &lt;a href="/2002/Dec/09/badInterfaceDesignFromMicrosof/"&gt;entry&lt;/a&gt; on MSN messenger usability problems from 2002 has drawn a steady stream of hilarious comments, no doubt caused in part by its top rating on Google for &lt;a href="http://www.google.com/search?q=msn+messenger+sucks" title="Google Search: msn messenger sucks"&gt;msn messenger sucks&lt;/a&gt;. Amusingly, for a long time &lt;a href="http://search.msn.com/"&gt;Microsoft's own search engine&lt;/a&gt; was giving my page a high rank for a wide variety of less negative messenger-related terms.&lt;/p&gt;

&lt;p&gt;My own experiences of this phenomenon pale into insignificance compared to some of the others I've seen. The most impressive example has to be Jason Kottke's &lt;a href="http://www.kottke.org/03/05/the-matrix-reloaded"&gt;brief review&lt;/a&gt; of the Matrix Reloaded, which drew over 900 comments from Google strays, developed its own micro-community and resulted in Jason pondering &lt;a href="http://www.kottke.org/03/06/own-conversation"&gt;who owns the conversation on my web site?&lt;/a&gt; He eventually decided to close and archive the thread after the page grew to more than a megabyte in size.&lt;/p&gt;

&lt;p&gt;The problem can take on a far more disturbing twist. I won't link directly to these entries for fear of adding to their predicaments, but searches for &lt;a href="http://www.google.com/search?q=crime+scene+cleanup" title="Google Search: crime scene cleanup"&gt;crime scene cleanup&lt;/a&gt; and &lt;a href="http://www.google.com/search?q=suicide+chat+rooms" title="Google Search: suicide chat rooms"&gt;suicide chat rooms&lt;/a&gt; both return blogs in the first two results. The former thread is mostly crime scene cleanup companies marketing their services, but the latter is quite frankly disturbing. It's certainly led me to double-check the titles of my entries before posting them.&lt;/p&gt;

&lt;p&gt;Thankfully, avoiding this kind of unwanted comment traffic is pretty simple. One way is to simply disable comments for entries older than a certain time (generally a couple of weeks), although personally I like to see the occasional comment on old entries. A neater solution proposed by Russell Beattie last year is to simply &lt;a href="http://www.beattie.info/notebook/1003990.html" title="Googler Comments"&gt;hide comments from search engine referrals&lt;/a&gt;, thus ensuring that random strays won't leave their mark without understanding the nature of your site first.&lt;/p&gt;
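&lt;p&gt;Russell's referral check is only a few lines - sketched here in Python with an illustrative (not exhaustive) hostname list:&lt;/p&gt;

```python
from urllib.parse import urlparse

SEARCH_HOSTS = ("google.com", "search.msn.com", "yahoo.com")  # illustrative only

def from_search_engine(referrer):
    # Hide the comment form when the visitor's Referer is a search engine,
    # so Google strays read the post before leaving their mark.
    host = urlparse(referrer).hostname or ""
    return any(host == h or host.endswith("." + h) for h in SEARCH_HOSTS)

from_search_engine("http://www.google.com/search?q=crime+scene+cleanup")  # True
```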
    
        &lt;p&gt;Tags: &lt;a href="https://simonwillison.net/tags/blogging"&gt;blogging&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/jason-kottke"&gt;jason-kottke&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/pagerank"&gt;pagerank&lt;/a&gt;, &lt;a href="https://simonwillison.net/tags/seo"&gt;seo&lt;/a&gt;&lt;/p&gt;
    

</summary><category term="blogging"/><category term="jason-kottke"/><category term="pagerank"/><category term="seo"/></entry></feed>