IIRC: Persistence in Second Life

In a recent presentation, a colleague mentioned Second Life in the context of persistence in virtual worlds. As I used to work at Linden Lab, I thought I’d follow up with some notes about how it actually worked. None of this is secret: they published it on their wiki and mentioned it in office hours, and I’ve seen other presentations cover it too. I was on the small team that worked on the Simulator.

The Simulator, or Sim, was responsible for all simulation of a 256m x 256m region and for the connections of all the players within it. The state of a Sim was periodically saved to disk and uploaded by another process to a SAN (often called the asset database). The state was also saved to disk in the event of a crash. Upon restarting, the Sim would attempt to use that crash-time state; if it could not, it loaded the last normal save from the SAN. This meant that there was a window of up to 15 minutes between a change in the state of a Sim and that change becoming persistent.
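The restart logic described above can be sketched roughly as follows. This is only an illustration: the file paths, the JSON format and the function names are my own assumptions, not how the real Sim stored or loaded its state, and raising an error when neither file is usable is simply a placeholder for whatever the real system did.

```python
import json
import os
import tempfile
from typing import Optional


def load_state(path: str) -> Optional[dict]:
    """Return the parsed state file, or None if it is missing or corrupt."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):
        # OSError: file missing or unreadable; ValueError: truncated or invalid JSON.
        return None


def restore_region(crash_path: str, last_upload_path: str) -> dict:
    """Prefer the crash-time save; fall back to the last normal (uploaded) save."""
    state = load_state(crash_path)
    if state is not None:
        return state
    state = load_state(last_upload_path)
    if state is not None:
        return state
    # Hypothetical behaviour: the source does not say what happened in this case.
    raise FileNotFoundError("no usable region state found")
```

Anything that changed after the last normal save is lost whenever the fallback path is taken, which is where the window of up to 15 minutes comes from.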

This gap could be exploited by players, who would take an item into their inventory and then deliberately crash the Sim using various exploits. When the Sim restored from an older save, the item reappeared in place in the Sim while a copy remained in the player’s inventory, so the item was duplicated.
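A toy model makes the duplication easier to see. It assumes that the inventory change is durable immediately, that the region is only durable as of its last save, and that the crash-time save turns out to be unusable. The class and method names are invented for this illustration.

```python
class ToyWorld:
    """Toy model: the region is durable only as of its last save; inventory is durable at once."""

    def __init__(self, items):
        self.region_live = set(items)    # in-memory Sim state
        self.region_saved = set(items)   # last normal save
        self.inventory = set()           # assumed to be stored durably elsewhere

    def take(self, item):
        """Move an item from the live region into the player's inventory."""
        self.region_live.remove(item)
        self.inventory.add(item)

    def periodic_save(self):
        """Persist the live region state."""
        self.region_saved = set(self.region_live)

    def crash_and_restart(self):
        """Assume the crash-time save is unusable, so restore the last normal save."""
        self.region_live = set(self.region_saved)


world = ToyWorld(["hat"])
world.take("hat")            # the inventory now holds the hat
world.crash_and_restart()    # the crash lands before the next periodic save
duplicated = "hat" in world.region_live and "hat" in world.inventory
```

If `periodic_save()` runs between the take and the crash, the restored region no longer contains the item and nothing is duplicated.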

To mitigate this, we worked to fix every reported cause of crashes. Discovering and fixing all those bugs took years, and it was a constant battle to keep on top of them. We had great reporting tools and statistics on call stacks. There was also an army of support people who could manually replace lost items. New features inevitably introduced new opportunities for exploits, and exploits in old code were still occasionally discovered. Although this mostly mitigated the problem, it did not solve it. Sadly, even if a Sim never crashes, you cannot be sure that your transaction will be durable; Sims died for other reasons too. For example, they were killed if their performance degraded too much, and occasionally there were accidental cable pulls, network problems and power outages.

The Sim state file could get pretty large (at least 100MB); it contained the full serialised representation of the whole hierarchy of entities (known as Prims in SL jargon) within the region. This was unlike the inventory database, which just held URLs to the items within it. This was a legacy from a time when everything was a file.

The Second Life Sim was effectively single threaded; it had a game loop with time slices for message handling, physics and scripts. IIRC, it wrote the state by forking the process to do the write. If we had attempted to write the entire Sim state each time we made a modification, it could have hurt performance and given users a way to mount DoS attacks.
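For anyone unfamiliar with the fork trick, here is a minimal POSIX-only sketch. The child process gets a copy-on-write snapshot of the parent’s memory, so it can serialise a consistent state while the parent carries on with its game loop. The JSON format, the temporary-file rename and the function name are my assumptions; I am not claiming this is how the Sim did it in detail.

```python
import json
import os


def snapshot_in_child(state: dict, path: str) -> int:
    """Fork and let the child write `state` to `path`; return the child's pid."""
    pid = os.fork()
    if pid == 0:
        # Child: sees the state exactly as it was at the moment of the fork.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "w", encoding="utf-8") as f:
                json.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)  # atomic rename, so readers never see a half-written file
            status = 0
        finally:
            os._exit(status)  # leave without running the parent's cleanup handlers
    # Parent: returns straight away and may keep mutating `state`.
    return pid
```

The parent should eventually call `os.waitpid(pid, 0)` to reap the child and check its exit status. Changes the parent makes after the fork do not appear in that snapshot.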

Users were not the only things that could modify Sim state; scripted items could spawn objects or even modify themselves. That’s why the state was saved at a limited rate rather than on every change. Serialised Sim state was not the only form of persistence. Second Life had a rich data model, including user representations, land ownership, classified adverts and many other areas.

One of the largest databases held the inventories of residents (the name for players). The inventory was stored in a sharded set of MySQL databases. The items contained in the inventory were serialised and stored in files on the SAN or S3, much like the Sim region data. The database contained a URL to the resource that represented each item. Some residents had huge inventories with hundreds of thousands of items. The inventory DB was so large that, given the operational decision to use commodity hardware, it needed to be sharded. The sharding strategy was to bucket users based on a hash of their UUID.
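Bucketing users by a hash of their UUID can be sketched like this. I don’t remember which hash function or how many shards were actually used, so SHA-256 and the shard count below are purely illustrative assumptions.

```python
import hashlib
import uuid


def shard_for(user_id: uuid.UUID, shard_count: int) -> int:
    """Map a user's UUID to a shard index in the range [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # Use a stable hash. Python's built-in hash() can vary between processes
    # and versions, so it is not safe for routing.
    digest = hashlib.sha256(user_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


resident = uuid.uuid4()
shard = shard_for(resident, shard_count=16)  # 16 shards is an arbitrary example
```

The same resident always lands on the same shard, so all of their inventory rows live in one database. The catch with a plain modulo scheme is that changing `shard_count` remaps most users, which means resharding involves moving a lot of data.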

Having a centralised store for inventory was essential; most users had inventories far too big to be migrated around the world. It also had operational advantages: the Dev/Ops team were well versed in maintaining and optimising MySQL. Unfortunately, by sharding like this, you lost the ability to do transactions across users, as they were no longer part of one database.

Back when Second Life was still growing, the main architectural aim was scalability; reliability was secondary. The database had historically been a point of failure. Significant effort was put into partitioning it so that it would not be one again. The architectural strategy was to migrate all clients of the DB to use loosely-coupled REST web services.

REST web services are a proven, scalable technology; much of the web is built on them. Provided the services are stateless, they scale well and can often exploit caching. The web technologies used (specifically LAMPy ones) were well known to Dev/Ops; they made scaling a deployment issue.

A secondary goal of this initiative was to allow a federated virtual world: something that would let other companies and individuals host regions while residents continued to use their SL identity. We got part way through this before I left, but I don’t think it was ever completed, since the growth of SL stopped.

Second Life went the long, hard way round to achieve durable transactions. This was in part due to the general issue of having a hard-to-change monolith in the Simulator. This caused many other problems; it made architectural changes hard. Importantly, it wasn’t difficult to change because the individual classes and files were badly coded. The Second Life Sim had parts that were too highly coupled; a change in one part could affect something seemingly unrelated. The Sim suffered from the ball-of-mud anti-pattern; it wasn’t originally badly designed, but it grew too organically and lacked structure.

I spent significant time introducing seams to produce sensible, workable sub-systems. Persistence is a hard thing to get right; Second Life needed to change its approach several times to support its scale. That said, the state of the art in distributed databases (and even in conventional databases, including MySQL) has progressed significantly since that time. Get it right: engineer in the qualities of transactions and persistence that we need from the start, and it will save you significant effort.
