Better, faster, smarter: Python yesterday, today … and tomorrow
Sunday, September 30th, 2007I got this video in Miro the other day. It’s a good overview of what’s changed in Python since 2.2.
I got this video in Miro the other day. It’s a good overview of what’s changed in Python since 2.2.
Ben and I have been discussing how to do the caching on Miro Guide. To summarize Matt’s analysis from earlier this month, the most expensive queries we have are doing COUNT() queries over 500k records. Even with the correct indicies, that’s a little expensive when it’s happening periodically (a couple seconds with no load on the DB) and really expensive when it happens a lot (often over 20 seconds). What Matt put together when he was doing that research was a small script which does that calculation and populates a database table with it. Then the guide looks up the value in that table instead of performing the calculation.
What I’ve done is brought that idea into the world of memcached. The popularity values are calculated if they’re not in the cache, and then they’re put there and retrieved. The keys are time-sensitive; they’re only good for 5 minutes for last 24 hour popularity, and an hour for the last month’s popularity. It’s working well, and it’s making the Guide faster. When subscriptions are added, the cache is also incremented, so the cached values stay very close to the actual database values without much effort.
What we are wondering about is what people think about the different approaches, either calculating the values and putting them in a new database table and recalculating them, or putting them into the memory cache. What do you all think?
I asked for the opinions of others, and I got them. The new caching code calculates the previous 24 hours and 31 days, respectively. Those counts are updated from the DB (to remove old data) every 5 minutes and 60 minutes. It actually wasn’t a difficult change. The only thing that’s different is the cache key. Previously, the month and today counts were based on the date. Now, they’re based on Unix timestamp integer divided by the time interval, so the cache key will change periodically. I don’t have to remove the old data, it’ll simply be removed from the cache eventually when it needs more memory.
We keep track of subscriptions to channels; it’s how we tell what channels are popular and what channels are similar. It’s also the largest table in our database, and accessing it has gotten increasingly expensive. To find out what channels are the most popular, we have to count through all those records. Today, I rolled out some caching to fix that.
Each channel has three subscription counts: one for all time, one for this month, and one for today. Each of these values is stored in the cache and retrieved when needed. If a value for a channel isn’t found, it’s calculated and placed into the cache. When someone subscribes to a channel, those values are incremented.
The change that’s controversial (at least between myself and BDK) is how the today and month values were calculated. The used to be calculated over the past 24 hours and 31 days respectively. But that means that the values need to be recalculated much more frequently. The new code calculates them from the start of today and the start of the month respectively. This means that the values will be 0 at the start of the day/month. BDK thinks that this is a big deal; I do not.
One idea I had was that if the top-n results (say, 10) are all 0, return the values for the previous time period. This will keep the display from ever being 0, but keep the efficiency.
Update: The old miss ratio was roughly 80%. Now with this new caching, it’s only 6.5%.
(This is simply the status e-mail I sent to the rest of the developers this evening.)
Earlier this week, I did a bit of work fixing small details on the new ratings page. However, I’ve mostly been working on getting Miro Guide to be more cache-friendly. The current hit rate is roughly 20%, up from a measly 14% last week. I think this can be much better, however.
There are two general ways to use the cache. One is to cached rendered templates, the other is to cache database queries. Both I think are important, but I’ve been focusing more on the database end of things. At that level, it’s easier to tell what objects/tables are being modified.
Monday and Tuesday I was working on implementing a List class which represents a database query that is cached in-memory and updated when inserts/updates/deletes happen in the database. However, I’m running in to problems with my current approach. Because there’s no master list of queries, there’s no way to make sure every list is updated when it should be. I came up with a new idea on the bus, which is to make each cached query have an id# which is incremented when the table is modified. This will invalidate all the old cache keys and make it hit the database to reload the cache. For queries like the popular channels, we can not invalidate the keys on table updates, but simply on time; this will keep the cached data around for a rough amount of time (say a minute or 5) before hitting the database again.
I’ll be working on implementing this idea tomorrow. What this can’t do, unfortunately, is update the cache at the same time the database is updated. I can do that for individual records (I have a nice CachedRecord class which does this, and I’ll start using it around the code base) but that doesn’t work for database queries. I plan to do a bit of research as well to see if other people have perhaps tried to solve this problem.
I’m Paul Swartz (if you didn’t get that from the title at the top of the page). I’m one of the newer devs here at the PCF. I started over the summer, working on buiding Miro on Windows with MinGW. Although it won’t ever be the default build (MinGW builds XULRunner aren’t supported by Mozilla), it was an interesting project. I also worked on Google Summer of Code 2007. My project was rewriting the tests for Twisted Conch, an SSHv2 client/server library for Python and Twisted.
Now that I’ve started full-time with the PCF, I’m focusing on Miro Guide. Some of the new features that have gone live are:
Right now, I’m working on making the Guide faster by caching a lot of our database access with memcached. I’ll be blogging here about that work, and in the future about the work that goes on behind the scenes of the Miro Guide.