Archive for the ‘Miro Guide’ Category

Better, faster, smarter: Python yesterday, today … and tomorrow

Sunday, September 30th, 2007

I got this video in Miro the other day. It’s a good overview of what’s changed in Python since 2.2.

Caching: database vs. memcached?

Friday, September 28th, 2007

Ben and I have been discussing how to do the caching on Miro Guide. To summarize Matt’s analysis from earlier this month, the most expensive queries we have are doing COUNT() queries over 500k records. Even with the correct indicies, that’s a little expensive when it’s happening periodically (a couple seconds with no load on the DB) and really expensive when it happens a lot (often over 20 seconds). What Matt put together when he was doing that research was a small script which does that calculation and populates a database table with it. Then the guide looks up the value in that table instead of performing the calculation.
What I’ve done is brought that idea into the world of memcached. The popularity values are calculated if they’re not in the cache, and then they’re put there and retrieved. The keys are time-sensitive; they’re only good for 5 minutes for last 24 hour popularity, and an hour for the last month’s popularity. It’s working well, and it’s making the Guide faster. When subscriptions are added, the cache is also incremented, so the cached values stay very close to the actual database values without much effort.
What we are wondering about is what people think about the different approaches, either calculating the values and putting them in a new database table and recalculating them, or putting them into the memory cache. What do you all think?

Caching: Popularity Redux

Friday, September 28th, 2007

I asked for the opinions of others, and I got them. The new caching code calculates the previous 24 hours and 31 days, respectively. Those counts are updated from the DB (to remove old data) every 5 minutes and 60 minutes. It actually wasn’t a difficult change. The only thing that’s different is the cache key. Previously, the month and today counts were based on the date. Now, they’re based on Unix timestamp integer divided by the time interval, so the cache key will change periodically. I don’t have to remove the old data, it’ll simply be removed from the cache eventually when it needs more memory.

Caching: Popular Channels

Thursday, September 27th, 2007

We keep track of subscriptions to channels; it’s how we tell what channels are popular and what channels are similar. It’s also the largest table in our database, and accessing it has gotten increasingly expensive. To find out what channels are the most popular, we have to count through all those records. Today, I rolled out some caching to fix that.
Each channel has three subscription counts: one for all time, one for this month, and one for today. Each of these values is stored in the cache and retrieved when needed. If a value for a channel isn’t found, it’s calculated and placed into the cache. When someone subscribes to a channel, those values are incremented.
The change that’s controversial (at least between myself and BDK) is how the today and month values were calculated. The used to be calculated over the past 24 hours and 31 days respectively. But that means that the values need to be recalculated much more frequently. The new code calculates them from the start of today and the start of the month respectively. This means that the values will be 0 at the start of the day/month. BDK thinks that this is a big deal; I do not.
One idea I had was that if the top-n results (say, 10) are all 0, return the values for the previous time period. This will keep the display from ever being 0, but keep the efficiency.
Update: The old miss ratio was roughly 80%. Now with this new caching, it’s only 6.5%.