Simple Tips to Build Scalable Websites

July 1st, 2009 3 comments

A few days ago I was invited to a launch party for a web product in Paris. While the product was nice and polished, it seemed the developers didn’t understand anything about scalability. They didn’t even understand my question when I asked them if the product could scale.

It’s probably not a big deal for them: they were presenting a CMS, so most of the time it will be installed for a limited user base. I guess most people will be happy to use it on a single server, so it’s probably OK for them not to be able to scale. However, I noticed that while scalability is now a fairly solved problem, there are not that many articles explaining how to prepare for scalability on the web. So here I go. I will not try to replace a good book, just give the very basics.

What is scalability?

It’s important to get that out of the way. Scalability is not performance: it’s not about making good use of CPU and bandwidth, and it’s not about having the page load quickly in the user’s browser. It’s about being able to balance the load between several servers. So when the load increases (more users creating accounts, more visitors, more page views) you can add servers to share the load. But you can’t just throw in a server: you need to design your software to work on a cluster of servers.

Another point is that you will rarely create a cluster of machines from scratch: when you launch a new website you will have few users and so few machines (one or two), and as your load increases you will increase the number of servers. You will have to scale different parts of your system one after the other.

#1: the web front-end

Most of the time you start with a front-end (PHP, Python, Ruby, Java…) and a data layer (MySQL, PostgreSQL, CouchDB…). As your load increases, the front-end will be the first to break. Of course server-side caching will help, but at some point you will need several front-end servers.

The key is to ensure you don’t store any data on the front-end. The problem sometimes arises with sessions: a lot of PHP libraries store session information locally on the server, and that prevents you from balancing the load. During a session, a user may hit one server for a given page, then another for the next page. If the session is only accessible to the first server, you’re screwed. You want it to be somewhere else: in the data layer or on a dedicated session server. If you write a Facebook app you don’t need to care, because Facebook takes care of the session.
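To make the session point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `SharedSessionStore` is just an in-memory stand-in for whatever lives outside the front-ends (a database table or a session server), and `FrontEnd` is a made-up class, not a real framework.

```python
import secrets


class SharedSessionStore:
    """Stand-in for a session store living outside the front-ends
    (for example a database table or a dedicated session server)."""

    def __init__(self):
        self._sessions = {}

    def create(self, data):
        # Generate an opaque session id and keep the data centrally.
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = dict(data)
        return session_id

    def load(self, session_id):
        # Returns None when the session id is unknown.
        return self._sessions.get(session_id)


class FrontEnd:
    """A hypothetical front-end server that keeps no session data itself."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, username):
        return self.store.create({"user": username})

    def render_page(self, session_id):
        session = self.store.load(session_id)
        if session is None:
            return f"{self.name}: please log in"
        return f"{self.name}: hello {session['user']}"


store = SharedSessionStore()
server1 = FrontEnd("server1", store)
server2 = FrontEnd("server2", store)

# The user logs in on server1, then the next page is served by server2.
session_id = server1.login("john")
page = server2.render_page(session_id)
print(page)
```

Because neither front-end holds the session, the load balancer is free to send each request to any server.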

Now we can have as many front-ends as we want, but we still have a single database server.

#2: the read operations on the database

Most applications will have many more reads than writes. For example, in blogging software each visitor will trigger a read on the database (OK, not each visitor if there is a good cache), but writes only occur when the author writes a new post or someone leaves a comment.

That’s good, because it’s much easier to scale reads than writes. Just make sure that in your code you have different settings for reads and writes. They can point to the same database at launch time, but when the time comes you can separate them. Writes will go to your “main” database, and reads will go to a copy. There are other approaches, but MySQL, for example, offers replication features. Once set up, the slaves will stay in sync with the master. You can have as many slaves as you need.
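Here is a rough sketch of what “different settings for reads and writes” could look like in Python. The host names and the `DatabaseRouter` class are invented for the illustration, and the sketch only picks a host; it does not open real connections. Keep in mind that a copy can lag slightly behind the main database, so a read right after a write may need to go to the main one.

```python
import itertools

# Hypothetical settings. At launch, "read" can simply list the same
# host as "write"; later you point it at the replicas.
DATABASES = {
    "write": "db-main.example.internal",
    "read": ["db-replica1.example.internal", "db-replica2.example.internal"],
}


class DatabaseRouter:
    """Chooses which database host a query should go to."""

    def __init__(self, settings):
        self._write_host = settings["write"]
        # Cycle through the read hosts in round-robin order.
        self._read_hosts = itertools.cycle(settings["read"])

    def host_for_write(self):
        return self._write_host

    def host_for_read(self):
        return next(self._read_hosts)


router = DatabaseRouter(DATABASES)
reads = [router.host_for_read() for _ in range(3)]
write = router.host_for_write()
print(reads, write)
```

The rest of the code only ever asks the router for a host, so moving from one database to a master and several copies is a settings change.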

OK: several front-ends, several read-only databases, but still one master database for writes. If your application has few writes, it may be fine with one beefy database server (and some major websites have just one master database), but if you have a lot of writes (highly social applications like Facebook or Twitter) you may want to continue the scaling process.

#3: the write database

Now we want several databases that we can write to. Obviously, we have to be careful not to introduce inconsistencies in the process. Having an old version of a blog post on one server and the new version on another is not great: what if some users see an old version of your post and others see the most recent one?

There are various strategies to divide data in a safe and consistent way, including:

  • Depending on the userid (or blogid, or whatever makes sense in your application), put the data on one server or another. For example, all users with an even id go to server1 and all users with an odd id go to server2. Hint: make sure your algorithm lets you add more servers later, which is not the case with my example where you will be stuck at 2 servers :)
  • Put some tables on one server and others on another. It doesn’t help you when a single table is growing too much, but it can be combined with the previous point.
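To illustrate the hint in the first point, here is a small Python sketch comparing the even/odd rule with one possible alternative: a directory that remembers where each user lives. The `ShardDirectory` class and server names are made up for the example, and the in-memory dictionary stands in for a lookup table you would have to store somewhere reliable.

```python
def naive_server_for(user_id, server_count):
    """The even/odd rule generalised: the server depends on the server count."""
    return user_id % server_count


class ShardDirectory:
    """Remembers which server holds each user's data."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._assignments = {}

    def add_server(self, server):
        # Existing assignments are untouched: no data has to move.
        self.servers.append(server)

    def server_for(self, user_id):
        if user_id not in self._assignments:
            # New users go to the server currently holding the fewest users.
            counts = {server: 0 for server in self.servers}
            for server in self._assignments.values():
                counts[server] += 1
            self._assignments[user_id] = min(
                self.servers, key=lambda server: counts[server]
            )
        return self._assignments[user_id]


# With the modulo rule, adding a third server moves user 4 elsewhere.
before = naive_server_for(4, 2)
after = naive_server_for(4, 3)

directory = ShardDirectory(["server1", "server2"])
first_four = [directory.server_for(user_id) for user_id in (1, 2, 3, 4)]
directory.add_server("server3")
newcomer = directory.server_for(5)
still_there = directory.server_for(1)
print(before, after, first_four, newcomer, still_there)
```

The price of the directory is one extra lookup per request, but it lets you add servers without reshuffling existing users.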

Conclusion

Here you go, the basics for building a scalable website. That’s not all you have to do: if your website keeps growing you will face more problems, such as having to scale your network. I’m not talking about outgoing bandwidth but about communication between your servers (front-end and data layers). But if your code is efficient, these simple recommendations will get you to a setup that can handle a fairly big load. I really recommend Building Scalable Websites, from O’Reilly, if you want to know more.

FAQ

Q: Language X doesn’t scale, but language Y does!

A: Bullshit. It’s not the language that scales, it’s your code. Some languages may not perform as well as others, so you will have to add boxes more often, but the way you scale is still the same.

Q: What about cloud computing? Virtualization? All these fancy buzzwords?

A: Virtualization means you run on virtual machines rather than on physical ones. The benefit is that you can easily add or remove machines. For example, using Amazon EC2 you can add as many machines as you want in a few minutes, and then remove them just as quickly. With a classical hosting company, you need to make a phone call, ask for the machines, and you get them in maybe one week. They’ll charge you for the set-up too, and if you no longer want a machine you still have to pay for a full term. So cloud computing offers are generally more flexible.

Q: Does Google App Engine make it easier to scale?

A: In short, yes. By not letting you access the machines, Google App Engine constrains you into writing scalable code. You also don’t have to request new machines when you need them or release them when you no longer need them; you just pay for what you use depending on the load of your application.

I am a big fan of Google App Engine, but be careful: since it’s programmed in a particular way, it’s not easy to move your project out of it. You may feel locked in once your project has started.

Facebook’s Bugzilla and Open Source Code

June 23rd, 2009 No comments

It’s always a pleasure when a company offers a public bug tracker for users and third-party developers. So it’s great to see that Facebook has a public Bugzilla.

The problem with having a public bug tracker is that you’re supposed to be responsive and fix bugs; nothing is worse than real bugs rotting in the tracker because the group behind the product doesn’t really care about those specific bugs.

So far I’ve filed a few bugs for the Facebook API:

  • A security bug, letting anyone deactivate feed templates for any app (fixed)
  • 5159: Incorrect error code returned for Stream.get – I can’t believe they won’t acknowledge it’s a real bug!
  • 5624: Stream.get is not really using the updated time. In short, we miss data because the query doesn’t do what the doc says it does. They do recognize that it should be fixed, but didn’t give it a very high priority.

I have also filed a few feature requests, but those don’t really matter: they obviously have no obligation to implement them. On the other hand, it would be nice if they fixed the bugs even when they’re not security issues.

One interesting thing I’ve noticed is that not only do they have a public Bugzilla, but the source code for their API is Open Source. Let’s see: “Facebook Open Platform is a snapshot of the infrastructure that runs Facebook Platform. It includes the API infrastructure, the FBML parser, the FQL parser, and FBJS, as well as implementations of many common methods and tags.”

Cool. That means that if I’m motivated enough, I could fix the bugs and submit a patch! Let’s look at the buggy Stream.get method… Wait a minute – that source code is just a tiny subset of the Facebook API! Oh crap. Looks like I can’t fix it.

Categories: tech

Google App Engine

May 26th, 2009 No comments

I’ve been playing with Google App Engine recently. It’s actually pretty cool, to the point that I’m almost ashamed to have ignored it when it was released. I kind of felt like it would be too restrictive with just Python, just their own database and so on.

But so far, I like it:

  • It’s only Python (or Java), but you can do pretty much anything you would do in a non-App Engine Python project. You can load pure-Python third-party libraries by just including them in your project.
  • The free quotas are really big. It’s enough for a hobby project, and if it becomes successful enough to hit the ceiling you should be able to figure out a way to monetize it to pay your Google bill.
  • You can use your own domain name even with a free account
  • There is no SQL, but Google’s BigTable seems to be good enough. Heck, that’s what they use for most of their products!

And you get all the App Engine specific goodness: easy authentication with Google Accounts, free hosting with huge quotas, and most importantly easy scalability on Google’s infrastructure… Having to call your hosting company to add new servers is a pain in the ass (and in the wallet), having to create and delete instances on Amazon EC2 is much better, but not having to think about it at all is just pure joy.

Categories: hacking, tech

Chromium extensions on Linux?

May 14th, 2009 2 comments

I just compiled a recent build of Chromium (unbranded Google Chrome) from here. I would love to start playing with extensions, so I tried stuff from there.

Alas, it doesn’t work. Since a bunch of stuff is still missing from the Linux builds, I guess extensions don’t work either.

Did anyone manage to get extensions working on Chromium Linux? Is there a flag or something to add at compilation time?

Categories: tech

OpenID to break the web 2.0?

April 28th, 2009 2 comments

Reading on TechCrunch that Facebook was going to let people log in with OpenID left me with mixed feelings. On one hand, it’s a shame that most services claimed to support OpenID while only offering an ID, but not letting people use outside IDs to log in. On the other hand, it’s going to be a big headache for API calls.

Think about it: with OpenID, there are 3 actors in place. The user, the OpenID provider, and the web service. So John@myspace will go to Facebook, Facebook will call MySpace to make sure that John is really that John, MySpace answers and we’re done. With a classical API call, again, 3 actors. The user, the web service, and the third party application. John goes to the RockYou page, RockYou calls Facebook to get a token to access John’s data, and we’re done.

Now, think about an API call through OpenID. 4 actors: the user, the OpenID provider, the web service and the third party application. The OAuth fiasco showed how hard it is to secure communication between 3 actors, and now we’ll get to 4?

Of course, people are going to figure out technical ways to do that. I’m sure Facebook wouldn’t support OpenID if it would break their app ecosystem. But how complex is it going to be? Does the user log in to the OpenID provider, then confirm on the web service, to finally go to the third party application? Talk about a simple process.

I just hope they have a good reason for supporting OpenID, not just “it would be cool to have a single login for all websites”.

Categories: tech

Thank you, AMO!

April 15th, 2009 No comments

The website addons.mozilla.org recently changed the rule for sandboxed add-ons: you no longer need to register and log in to install one. That was a big issue because the review process is a bit heavy, and a lot of add-ons were stuck in that Limbo of Firefox.

The difference for my recent sandboxed Video Games Spy (a sidebar to get aggregated info about games) is huge. It went from 0-ish to more than 50 downloads a day! It’s still not a lot, especially compared to the thousand a day Moji is still getting, but it shows there is some interest in the add-on.

Video Games Spy

Super Mario Galaxy!

There’s a New Gang in Paris

April 2nd, 2009 No comments

Thanks to my friend Manish, my homies just moved from the San Francisco Bay Area to Paris. Each one of them successfully passed all immigration checks at CDG airport, and they will soon be ruling the 17th ward.

Homies from California

All my homies came from vending machines at Fry’s, some in Palo Alto and others in Sunnyvale. If you really are into homies, you can also play one of the worst Nintendo DS games ever: Homies Rollerz.

Categories: Misc, life

Will Never Happen in US

March 23rd, 2009 No comments

I just saw that ad in the metro close to my office.

It’s a famous French TV show that’s been around for 20 years, and they just released a Best Of. The Best Of is called “Putain 20 ans !”, which I would translate into English as “Fuck, 20 years!”. It’s actually a reference to a recurring joke they were making back in 1993, but I couldn’t help thinking how impossible that would be in the US, where there are some forbidden words that you just can’t say in public.

It’s good to be back home.

Categories: life

SMS in Japanese, in France

March 16th, 2009 No comments

I’ve already written about reading and writing Japanese on a Western Nokia phone. Now that I am in France, I got to test sending SMS in Japanese from my E71 to my dear’s E65. And it works, in both directions! There is nothing to do other than setting both phones to display and input Japanese. I have no idea what encoding is used on the server, whether it’s really Unicode or whether the server thinks it’s Latin-1 while the data actually is Unicode; I just know it works.

For information, I’m using an MVNO on the SFR network, so I assume that it would work on SFR itself and any operator using their network.

Categories: life, tech

Flock switching to Chrome: opinion of an ex-Flocker

March 5th, 2009 4 comments

I’ve just read on TechCrunch that Flock was going to switch from Firefox to Chrome. This is not completely news to me because, when I was still at Flock, one colleague (Chris Campbell) was experimenting with Chrome to see how feasible a switch would be. I don’t know what’s going on inside Flock now, so I don’t know if they’re still just playing with Chromium or seriously planning a switch.

Gecko and Webkit… Or Firefox and Chrome?

The supposed switch revived a recurring discussion about how Gecko is a pain in the arse to embed, and how it’s a breeze to embed WebKit. Additionally WebKit is young, shiny and buzzy while Gecko is old and supposedly tired.

It is true that Gecko is much harder to embed than WebKit. It is also true that Mozilla doesn’t give a damn about people wanting to embed Gecko (their priority is Firefox), while embedding is the main purpose of WebKit. The natural consequence is that I don’t know of any recent project choosing Gecko as its rendering engine.

But that’s not the point. Flock is not embedding a rendering engine: they are taking a full-blown browser (currently Firefox) and tweaking it. Flock is not a browser based on Gecko, it is a browser based on Firefox. That explains a lot of the similarities between the products, and it also explains why most Firefox extensions work in Flock.

That means that if Flock switches to Chrome (actually Chromium, the Open Source version of Google Chrome), it doesn’t matter how easy it is to embed WebKit: what matters is how easy it is to hack Chromium. And it doesn’t matter how responsive and nice the WebKit community is: what really matters is how Google will deal with external contributors and their patches. And as Chromium is young, this is still unsure. Of course, considering how difficult it is to get patches accepted by Mozilla, it’s hard to imagine Chromium being as draconian.

But even if Chromium appears to be open to contributors, how they would react to a company building another browser based on Chromium is pretty unknown. It’s just like with Mozilla: they release their code under an Open Source license, so they explicitly allow that kind of derivative work and can’t prevent it from happening. But whether they will look more kindly on Flock than Mozilla did is a different question.

Would Flock benefit from a switch to Chrome?

I can really see only one good reason for Flock to switch to Chrome: the “process-per-tab” system.

Flock has always suffered from Mozilla’s single-thread, single-process logic. For example, when you fetch the Facebook friends list of a rockstar, you have to merge lists of thousands of friends and then refresh the UI to reflect that. Since this operation happens in the main thread (the only thread), it will literally block your browser. To work around that, Yosh wrote a simple Scheduler that sets timers to yield to the UI once in a while. It’s really a hacky way to get thread-like behavior, and since we didn’t apply it everywhere there was still code that froze the browser. With Chromium, Flock’s sidebars or topbars would run in separate processes and be less disruptive for usual browsing.
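To give an idea of the “yield to the UI once in a while” trick, here is a tiny Python sketch of the general technique, cooperative scheduling. It is not Flock’s Scheduler (which relied on timers inside the browser): the function names, the chunk size and the `ui_tick` callback are all invented for the illustration.

```python
from collections import deque


def merge_friends(lists, merged, chunk_size=2):
    """Merge friend lists a few items at a time, pausing between chunks."""
    done = 0
    for friends in lists:
        for friend in friends:
            merged.add(friend)
            done += 1
            if done % chunk_size == 0:
                yield  # hand control back so the UI can refresh


def run_cooperatively(tasks, ui_tick):
    """Run each task one chunk at a time, calling ui_tick between chunks."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)  # run until the task's next pause
        except StopIteration:
            continue  # task finished, drop it
        ui_tick()
        queue.append(task)


merged = set()
ui_refreshes = []
lists = [["ann", "bob", "carl"], ["bob", "dave", "eve"]]
run_cooperatively(
    [merge_friends(lists, merged)],
    lambda: ui_refreshes.append(len(merged)),
)
print(sorted(merged), ui_refreshes)
```

The weakness is the same one described above: any code that forgets to pause still freezes everything, which is why separate processes are a more robust answer.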

It’s hard for me to see how that single reason would outweigh the cost of the switch for the team and for the users. This is going to be a completely different product with new bugs, disappearing features and alienated users. I do not know Flock’s current strategy, so I can only assume they see benefits to the switch that I can’t.

Letting the user choose

That kind of controversy about what Flock should use (or will use) as a “host browser” is pretty irrelevant for extension companies. Foxmarks, CoolIris and the company I currently work for are all examples of companies that provide a rich browser feature while letting users choose their browser.

In its early days, Flock decided to be a browser instead of a set of extensions. This choice makes sense for Flock’s general strategy, but today I can’t help but think that if Flock were an extension company, it could just have released a Chrome version along with versions for other browsers.

Categories: flock, mozilla, tech, yoono