Current status of the Telldus Live! servers

For more than a week now we have had serious problems with delays during peak load. The problem appeared very suddenly and did not follow our predicted capacity curves. Most of you will have noticed this when trying to control devices, having to wait several seconds and sometimes up to a minute. Worse, on a few occasions the whole database system stopped responding to new requests and had to be restarted manually.

During the last week, several attempts were made to stabilize the situation, and more servers were added. It became evident this weekend that this did not help.

One big problem was found today in the database connections between our servers. The servers were automatically updated with a security fix a while ago. This fix had the side effect that a simple query to the database could take up to ten seconds. We currently send 140 000 000 sensor values to the database each month, not including duplicates. When each of these values takes a couple of seconds to process, the servers clog up fast!
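To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes a 30-day month and an evenly spread load (real traffic has peaks), and the function names are made up for this illustration. It uses Little's law, which says the average number of tasks in progress equals the arrival rate multiplied by the time each task takes.

```python
SENSOR_VALUES_PER_MONTH = 140_000_000   # figure quoted in this post
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumption: a 30-day month


def average_rate_per_second(values_per_month=SENSOR_VALUES_PER_MONTH):
    """Average number of sensor values arriving per second."""
    return values_per_month / SECONDS_PER_MONTH


def inserts_in_progress(seconds_per_value, rate_per_second=None):
    """Little's law: work in progress = arrival rate * time per item."""
    if rate_per_second is None:
        rate_per_second = average_rate_per_second()
    return rate_per_second * seconds_per_value


if __name__ == "__main__":
    print(f"average rate: {average_rate_per_second():.1f} values per second")
    # Compare a fast database call with one that takes a couple of seconds.
    for seconds in (0.01, 2.0):
        busy = inserts_in_progress(seconds)
        print(f"{seconds} s per value -> about {busy:.1f} values in progress")
```

Under these assumptions the average is roughly 54 values per second, so a database call that takes two seconds instead of a few milliseconds means on the order of a hundred values being processed at any moment, before counting peak load.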

Our servers are queue based. When all available threads are busy inserting sensor values, other tasks quickly pile up in the queue, especially during high load.
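The effect can be illustrated with a small simulation. This is a simplified model and not our actual server code: it assumes a single first-in, first-out queue, a fixed number of worker threads, and that all the sensor inserts are already waiting when a device command arrives. The function name and the numbers in the example are hypothetical.

```python
import heapq


def command_wait_time(num_workers, insert_seconds, inserts_ahead):
    """Seconds a command waits in a FIFO queue behind `inserts_ahead` inserts.

    Each heap entry is the time at which one worker becomes free again.
    """
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    for _ in range(inserts_ahead):
        # The next insert in the queue goes to the worker that frees up first.
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + insert_seconds)
    # The command starts as soon as any worker is free after those inserts.
    return heapq.heappop(free_at)


if __name__ == "__main__":
    # Hypothetical numbers: 4 workers, 100 sensor values ahead of the command.
    print(command_wait_time(4, 0.01, 100))  # fast database calls
    print(command_wait_time(4, 2.0, 100))   # database calls taking 2 seconds
```

In this model the waiting time grows in direct proportion to how long each insert takes, which is why a slowdown in the database shows up as a delay when controlling devices.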

Another effect is that if a TellStick Net does not acknowledge to the server within two seconds that a message has been received, the server tries to send the message again. This adds even more tasks to the queue. You can recognize it when the TellStick Net lights up and sends a message more than once, even though only one command has been sent to it. This can be a problem with, for example, blinds that start and stop with the same command.
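The sketch below shows why a resend is harmful for this kind of device. It is a toy model under stated assumptions, not the real protocol: the two-second timeout comes from this post, while the limit on the number of attempts, the `ToggleBlind` class and the `"toggle"` command are hypothetical and exist only for the illustration.

```python
ACK_TIMEOUT_SECONDS = 2.0  # resend timeout described in this post


def transmissions(ack_delay_seconds, max_attempts=3):
    """How many times one message is sent, given how long the ACK takes.

    max_attempts is an assumption made for this sketch.
    """
    sent = 1
    while sent < max_attempts and ack_delay_seconds > sent * ACK_TIMEOUT_SECONDS:
        sent += 1
    return sent


class ToggleBlind:
    """Hypothetical blind where the same command both starts and stops it."""

    def __init__(self):
        self.moving = False

    def receive(self, command):
        if command == "toggle":
            self.moving = not self.moving


def deliver(blind, ack_delay_seconds):
    """Send one logical command, including any resends, and report the state."""
    for _ in range(transmissions(ack_delay_seconds)):
        blind.receive("toggle")
    return blind.moving


if __name__ == "__main__":
    print(deliver(ToggleBlind(), ack_delay_seconds=0.5))  # ACK in time
    print(deliver(ToggleBlind(), ack_delay_seconds=3.0))  # ACK too late
```

When the acknowledgement arrives in time the blind starts moving as intended. When it arrives after three seconds the command is sent twice, so the blind starts and then immediately stops again.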

We are implementing measures to make sure commands will execute in a normal fashion again, even during high load, and are currently rolling out the first series of fixes. In the next couple of days we will roll out even more updates to make sure the service is smoother than before.

The issue where the databases require a manual restart is also triggered by this high load, but it is a different problem. During the weekend we planned a fix, which was implemented this morning and should solve this.

We are truly sorry for all the trouble this may have caused you.
