Simon Willison's Weblog: hadoop

Quoting Facebook Data Team

2009-11-30T11:30:12+00:00

Today, Facebook counts 29% of its employees (and growing!) as Hive users. More than half (51%) of those users are outside of Engineering. They come from distinct groups like User Operations, Sales, Human Resources, and Finance. Many of them had never used a database before working here. Thanks to Hive, they are now all data ninjas who are able to move fast and make great decisions with data.

— Facebook Data Team

Tags: facebook, hadoop, hive

Introducing Cloudera Desktop

2009-10-21T18:48:36+00:00

It’s a GUI for Hadoop, and under the hood is a whole stack of open source software, including Python, Django, MooTools, Twisted, lxml, CherryPy, Mako, Java and AspectJ.

Tags: aspectj, cherrypy, cloudera, django, hadoop, java, lxml, mako, mootools, open-source, python, twisted

Finding similar items with Amazon Elastic MapReduce, Python, and Hadoop streaming

2009-04-07T09:19:38+00:00

Tutorial for running Hadoop jobs on Elastic MapReduce using Python and the 2005 Audioscrobbler dataset.

Tags: amazon, amazon-web-services, audioscrobbler, elasticmapreduce, hadoop, mapreduce, python

Amazon Elastic MapReduce

2009-04-02T10:25:37+00:00

Hadoop as a service. Basically a web-based GUI around Hadoop: you could roll this yourself on EC2, but for a small markup on regular EC2 prices you get to avoid the extra work of setting everything up. Data processing scripts can be written in Java, Ruby, Perl, Python, PHP, R, or C++ and are loaded into S3 before firing off the job.

Via Joe Drumgoole

Tags: amazon, amazon-web-services, cloud-computing, ec2, hadoop, mapreduce, s3

Cascading

2008-10-01T13:22:19+00:00

A Java API abstraction layer over Hadoop that lets developers think in terms of pipes and filters rather than map/reduce. The Cascading developers claim that this model is easier to understand and less error-prone.
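To give a feel for the pipes-and-filters way of thinking, here is a rough sketch in plain Python. It is not Cascading's API (Cascading is a Java library); the stage names `split_words`, `drop_short` and `count`, and the `pipeline` helper, are all made up for illustration.

```python
from collections import Counter
from functools import reduce


def split_words(lines):
    """Filter: turn a stream of lines into a stream of lowercase words."""
    for line in lines:
        yield from line.lower().split()


def drop_short(words, min_length=3):
    """Filter: pass through only words of at least min_length characters."""
    return (word for word in words if len(word) >= min_length)


def count(words):
    """Sink: tally how often each word was seen."""
    return Counter(words)


def pipeline(*stages):
    """Compose stages left to right so data flows through them like a pipe."""
    return lambda source: reduce(lambda data, stage: stage(data), stages, source)


# Hypothetical pipeline: each stage only knows about its input stream.
word_counts = pipeline(split_words, drop_short, count)
```

Calling `word_counts(lines)` on any iterable of strings runs the whole chain. The point is that each stage can be read and tested on its own, without thinking about how the work is split into map and reduce steps.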

Tags: cascading, hadoop, java, mapreduce, pipesfilters

3 and 1/2 minutes to sort a Terabyte, and a look at Hadoop's code structure

2008-07-07T14:15:23+00:00

Bill de hÓra uses some clever static analysis tools to explore Hadoop’s 100,000+ lines of code.

Tags: bill-de-hora, hadoop, java, static-analysis

Python + Hadoop = Flying Circus Elephant

2008-05-31T14:14:56+00:00

Last.fm have released Dumbo, a Python module that lets you easily write Hadoop map/reduce tasks using Python and generators.
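A minimal sketch of what generator-style map/reduce tasks can look like, using word counting as the example. The `run_local` driver is a hypothetical stand-in that runs everything in memory; it is not part of Dumbo, and nothing here touches Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    """Yield a (word, 1) pair for every word in the input value."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Yield the total count for one word."""
    yield key, sum(values)


def run_local(records, mapper, reducer):
    """Hypothetical stand-in for the framework: map, sort by key, then reduce."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=itemgetter(0))
    for key, group in groupby(mapped, key=itemgetter(0)):
        yield from reducer(key, (value for _, value in group))
```

For example, `list(run_local([(0, "to be or not to be")], mapper, reducer))` produces the pairs for `be`, `not`, `or` and `to` in sorted key order. Generators keep both functions short because neither has to build up a list of results before returning.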

Tags: dumbo, generators, hadoop, lastfm, mapreduce, python

Writing An Hadoop MapReduce Program In Python

2007-10-09T11:33:58+00:00

Hadoop (the open source map/reduce framework) can interact with any program that reads from stdin and outputs on stdout—so it’s trivial to drop in Python scripts for the map and reduce steps.
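As a rough illustration of the stdin/stdout contract, here is a word-count sketch with both steps in one file. The tab-separated `word<TAB>count` record format and the `map`/`reduce` command-line argument are assumptions made for this example, not something taken from the linked tutorial.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Map step: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reduce step: sum the counts of consecutive records sharing a key.

    Assumes the input arrives sorted by key, so equal keys are adjacent.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: pass "map" or "reduce" as the only argument.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Saved as, say, `wordcount.py` (a made-up filename), the script can be tried without a cluster by piping a text file through `python wordcount.py map`, then `sort`, then `python wordcount.py reduce`. The `sort` in the middle stands in for the grouping by key that happens between the two steps.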

Tags: hadoop, mapreduce, python

Hadoop

2006-08-23T08:36:14+00:00

Open-source Google File System / map-reduce equivalent. Apparently scales amazingly well.

Tags: hadoop