Parallel Reindexing

Posted on September 24, 2007

One of the issues we’ve had to deal with at myYearbook is reducing index bloat in PostgreSQL. The REINDEX DATABASE command does this pretty handily, but since it works through the tables one at a time, it can be pretty slow when you're sporting 30GB+ tables. To solve this I threw together this little CLI PHP script.

The idea is pretty straightforward - get a list of all the tables with indexes from the PostgreSQL system catalog for a given database, split them up into chunks and run multiple reindexes at the same time. If you find it useful or have ideas to improve it, drop me a note and let me know.
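The approach can be sketched roughly as follows. This is a minimal Python sketch, not the PHP script itself: the function names, the round-robin chunking and the use of the psql client are all assumptions made for illustration. It reads the table list from the pg_indexes catalog view, deals the tables into chunks, and runs one worker per chunk.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Catalog query: every non-system table that has at least one index.
TABLES_WITH_INDEXES_SQL = """
SELECT DISTINCT schemaname, tablename
  FROM pg_indexes
 WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
 ORDER BY schemaname, tablename;
"""


def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def split_into_chunks(items, workers):
    """Deal items round-robin into at most `workers` non-empty chunks."""
    if not items:
        return []
    workers = max(1, min(workers, len(items)))
    return [items[i::workers] for i in range(workers)]


def reindex_statements(chunk):
    """Build one REINDEX TABLE statement per (schema, table) pair."""
    return [
        f"REINDEX TABLE {quote_ident(schema)}.{quote_ident(table)};"
        for schema, table in chunk
    ]


def run_psql(conn_args, sql):
    """Run one SQL string through the psql client and return its stdout."""
    cmd = ["psql", "-X", "-A", "-t", "-F", "\t", *conn_args, "-c", sql]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def reindex_chunk(conn_args, chunk):
    """Reindex the tables of one chunk serially, one statement at a time."""
    for statement in reindex_statements(chunk):
        run_psql(conn_args, statement)
    return len(chunk)


def parallel_reindex(conn_args, workers=15):
    """Reindex all indexed tables, using one thread per chunk."""
    rows = run_psql(conn_args, TABLES_WITH_INDEXES_SQL).splitlines()
    tables = [tuple(row.split("\t")) for row in rows if row]
    chunks = split_into_chunks(tables, workers)
    if not chunks:
        return 0
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return sum(pool.map(lambda c: reindex_chunk(conn_args, c), chunks))
```

Under these assumptions, calling parallel_reindex(["-h", "localhost", "-U", "postgres", "-d", "test"]) would reindex the test database with up to 15 workers. The sketch leaves password handling to psql (for example a .pgpass file). Round-robin chunking does nothing to balance table sizes, so a chunk that happens to hold several very large tables will finish last.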

Here is an example of usage:

[gmr@gmr-imac ~]$ ./parallelReindex.php 

Error: Required parameters not set.

Usage: parallelReindex.php parameters

Parameters:

  -host      Specify the database host to connect to - required
  -port      Specify the port to connect to
  -dbname    Specify the database name to connect to - required
  -user      Specify the database user to connect as - required
  -password  Specify the database user password
  -threads   Set the number of parallel reindex tasks to run (default 15)

Example:

  ./parallelReindex.php -host localhost -user postgres -dbname test

I don't get it

Posted on September 04, 2007

I was processing a bunch of responses to job postings today and, in responding via email, I received five bounced emails stating that the email address doesn't exist. Why would people apply for a job with bad email address info?

Amazon S3

Posted on August 29, 2007

I finally bit the S3 apple at work today. I've been using S3 for a bit with Bandwagon for backing up my iTunes files, but I hadn't seriously considered how to best use it at myYearbook.com until I had a call today with one of Amazon's Business Development folks. It was during the call that I realized that, instead of using it as a CDN as I had been thinking and buying the optical backup system I had been contemplating, I could use it as a perfect disaster-recovery backup mechanism.

Since I could sync our files up to our S3 buckets on a fairly regular basis, and the files can be served publicly off of S3 in a pinch, I could focus on putting systems in place to keep S3 in sync instead of installing, managing and using an optical backup system. In addition, I avoid the up-front purchasing cost of the system that we would have to put in place to back up the many terabytes of data we have. Since media, power and rack space are commodities that I'd pay for on an ongoing basis, the cost per megabyte when all is said and done is fairly similar, and I have the option of restoring to any location with an internet connection instead of having to purchase additional hardware for restorations outside the data center.

Installation of s3sync.rb was very easy, and while the documentation is terse, it took me all of about an hour to compile and install Ruby and have the app working. A few shell scripts and some crontab additions later, we have a regular, off-site backup taking place. Restoration is equally easy, as s3sync.rb works similarly to rsync. All things considered, off-site backup has never been easier.

Stapler: Coming Soon

Posted on August 29, 2007

We're polishing up our soon-to-be-released stats package, Stapler. My friend Carl was cool enough to throw together this neat logo for it:

[Image: Stapler logo]

Crazy Framewerk Changes

Posted on August 14, 2007

In working to optimize Framewerk to run on a site as large as myYearbook.com and our related sites, we've found several bottlenecks which we're now working to resolve. Some of the bottlenecks can be attributed to our hardware; for example, on our very cool Isilon NAS setup, there is an NFS penalty for file stat operations, and the Framewerk engine does its fair share of file stats.

As part of the cleanup we’re doing the following:

  • Adding a base level fFile class which will allow for transparent engine level caching using Shared Memory functions (shmop).
  • Unifying as much of the XML subdirectory as possible into one file. fRegistry has been completely rewritten to allow for nesting and arrays of nodes to support moving configuration.xml into registry.xml.
  • Caching maps of file locations that are used both at the base level class location and at the fTheme XSL, CSS and JavaScript auto-location detection routines.
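The file-location caching idea in the last item can be illustrated with a small sketch. This is Python rather than Framewerk's PHP, it keeps the map in an in-process dictionary instead of shared memory, and the FileLocator class and its stat_calls counter are hypothetical names used only for illustration. The point is that once a file's location is known, later lookups skip the filesystem checks that are expensive over NFS.

```python
import os


class FileLocator:
    """Resolve relative file names against search directories, caching hits."""

    def __init__(self, search_dirs):
        self.search_dirs = list(search_dirs)
        self._cache = {}      # relative name -> resolved path
        self.stat_calls = 0   # number of filesystem checks performed

    def locate(self, name):
        """Return the first matching path for `name`, or None if not found."""
        if name in self._cache:
            return self._cache[name]
        for directory in self.search_dirs:
            candidate = os.path.join(directory, name)
            self.stat_calls += 1
            if os.path.isfile(candidate):
                # Only hits are cached, so files added later are still found.
                self._cache[name] = candidate
                return candidate
        return None
```

A second call to locate() for the same name returns the cached path without touching the filesystem. A cache like this would also need some way to be invalidated when files move, which the sketch does not attempt.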

As part of gearing all this up, and wanting to get Framewerk to a releasable state, the development team decided that we would pare the distribution down to a core framework without applications and remove any database dependence from the core framework. The applications are not going away; they are just being moved to a satellite project level. As part of the base installer, the user will be prompted for their requirements, such as whether database support should be installed, what kind of user backend should be installed, if any, etc.

In addition, one of the ideas being tossed about is to add a function hook mechanism to fObject and fSingletonObject classes.

As such, the trunk of svn is once again unstable, but should become stable enough for a beta real soon. Keep an eye out for updates!

New blog

Posted on August 08, 2007

I'm experimenting with Mephisto, a Ruby on Rails blogging system, for a corporate blog. Updates to follow.

New Job

Posted on April 30, 2007

I've been a bit remiss in posting as of late, primarily because life has kept me very busy. On April 2nd, I started my new job as CTO of myYearbook.com, a social networking startup that just celebrated its second birthday. In line with taking the position, I have moved to Bucks County, PA and am working in New Hope.

It's a pretty neat place with lots of fun technical challenges, mainly keeping a site that's growing like crazy up and running.  We're trying to squeeze everything we can out of our production environment while testing our new environment which is designed to be more scalable.

Within my first month here, we've implemented memcache pretty heavily, which has eliminated a large number of queries; implemented Slony-based replication on PostgreSQL so that the slaves serve as the primary read databases for our more time-intensive queries; and gotten to play with some high-end equipment, including SANs and an SSD.

The SAN, SSD and operating system under PostgreSQL issue is an interesting one and we're not out of the woods yet. Once I have more concrete information I will post an entry dedicated to that subject alone.