It may become possible to rent a proper machine as a new donor is on the horizon, however i still need to gain (more) control over the database growth, as this wouldn't be a machine where i can configure or expand storage and so on... 🤔
Alright, i already posted about this case here and there; i will try to write it up in more detail here.
A bit of history
Some of you may remember the instability of our Mastodon instance about one year ago, a bit less i guess.
Around this time Mastodon was hosted on a quite powerful SSD KVM machine, in theory more powerful than equivalent ones. However, a virtual server is still a virtual server: it normally comes with a configured I/O limit and of course still shares I/O with the host machine, which becomes a bottleneck the more busy KVMs there are on that host machine. Logically.
We had three issues with Mastodon: iowait, cpu steal and memory.
Iowait was the biggest issue and was only temporarily fixed because the friendly hoster moved the vm to a fresh host, however it was back as soon as some more ppl joined the host machine .. as pointed out ... logically.
The database, for example, became too io intense at some point, besides the image resize operations, as far as i remember.
Luckily enough donors made it possible to move to real hardware some time later. Unluckily, i first tried a used HDD machine with a faulty disk at a German hoster. It took me a while to understand what exactly was happening: Mastodon became unusable ... or more precisely .. the whole server became unusable every few days. It was actually caused by the RAID check, which produced a crazy high ioload and ... later even killed a vm. I was able to recover almost all the data though; not a nice situation, but everything went fine.
I later moved to France (online.net) with the most resource hungry services and moved the others to a vm in Germany.
This time a fresh server, pure SSD, powerful, joy. (I didn't want to give the old hoster who gave me a faulty disk by default a second chance and online.net offered much better value anyway.)
Everything was fine and yes, Matrix was hosted there too, for many months.
After a while i realized how freaking crazy fast the Matrix database grew, and that i cannot keep it this way. SSD also comes at the cost of less disk space for the money; the value situation is improving though.
(yellow is Matrix, on the current machine)
So i had two options:
- nuke Matrix
- move it to an SSD boosted HDD VM with more disk space to keep it alive
I chose option two.
Back to present
To be honest, i already had this feeling when i chose the second option, and we have now actually reached the critical point we already had with Mastodon.
I tried many things to keep the ioload lower but i ran out of options, the VM can't carry this very active database anymore, the iowait is sometimes pretty stable at 98%(!!!).
It was problematic from the beginning, actually, but it's starting to totally escalate.
I actually tried to ignore this issue for a while. I did what was possible for me, and reaching out to users resulted in the decision to rather keep it as a slow service than nuke it, so this is what i did.
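For anyone who wants to check numbers like the iowait figure above (or the cpu steal from the Mastodon days) on their own Linux VM, here is a minimal sketch, not what the monitoring on this server actually uses. It assumes the usual `/proc/stat` layout where the aggregate `cpu` line lists user, nice, system, idle, iowait, irq, softirq and steal in that order; the function names are made up for illustration.

```python
import time

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]

def cpu_shares(before, after):
    """Percentage of CPU time spent in iowait and steal between two samples."""
    # Assumed field order: user nice system idle iowait irq softirq steal.
    # Guest time is already counted inside user, so only the first 8 are summed.
    deltas = [a - b for a, b in zip(after[:8], before[:8])]
    total = sum(deltas)
    if total <= 0:
        return {"iowait": 0.0, "steal": 0.0}
    return {
        "iowait": 100.0 * deltas[4] / total,
        "steal": 100.0 * deltas[7] / total,
    }

def sample(path="/proc/stat"):
    """Read the current counters from the first line of /proc/stat."""
    with open(path) as f:
        return parse_cpu_line(f.readline())

if __name__ == "__main__":
    first = sample()
    time.sleep(5)  # measure over a five second window
    print(cpu_shares(first, sample()))
```

The counters are cumulative since boot, which is why two samples are needed: only the difference between them says what the machine is doing right now.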
But yesterday two things happened at the same time:
- Synapse (the Matrix server software in use) got its typical after-some-days memory peak, but this time so big that it even blew the swapfile and actually got oomed (luckily)
- A little setup process of a Ruby application for testing purposes was running
The result was a corrupted filesystem. I had to boot the rescue system and fix this thing; luckily nothing worse happened, but it made me realize that the time has come. Today i even figured out that the high io makes the autovacuum process fail, plus some more quite terrible things.
But there was no other donation peak for a proper machine, so i cannot upgrade.
And no, i personally don't really have any money at all, i can barely afford Netflix with the support of other people using the account with me.
As of right now there is no "proper" way to clean up old events in Matrix, it's designed to keep everything forever. There is an API to remove old events from the database, but it is designed per-room.
There is also a project called Synpurge that tries to enhance this implementation, but i haven't managed to make it run against the external rooms this server holds history from yet. I will experiment with it some more and may reach out to some people for help, but still, i don't feel in control of the history and database size this way. In my opinion, like on XMPP, Synapse needs a configurable history limit, which i would set to something like three months.
This is not the case yet.
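Until such a limit exists, the per-room purge API could in principle be driven by a small script that walks a list of rooms with a three month cutoff. This is only a rough sketch, not something that is running here: it assumes the admin endpoint path `/_synapse/admin/v1/purge_history/<room_id>` and the `purge_up_to_ts` / `delete_local_events` body fields as described in the Synapse admin API documentation (older Synapse releases used a different path, so check your version). The base URL, room ids, token and function names are placeholders.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: path as in the Synapse admin API docs; verify for your version.
PURGE_PATH = "/_synapse/admin/v1/purge_history/"

def cutoff_ms(now_s, keep_days=90):
    """Timestamp in milliseconds; events older than this would be purged."""
    return int((now_s - keep_days * 86400) * 1000)

def build_purge_request(base_url, room_id, token, up_to_ms):
    """Build (but do not send) the purge request for a single room."""
    url = base_url.rstrip("/") + PURGE_PATH + urllib.parse.quote(room_id, safe="")
    body = json.dumps({
        "purge_up_to_ts": up_to_ms,
        "delete_local_events": False,  # keep events sent by local users
    }).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
    )

def purge_rooms(base_url, room_ids, token, keep_days=90):
    """Send one purge request per room, since the API works per-room."""
    up_to = cutoff_ms(time.time(), keep_days)
    for room_id in room_ids:
        req = build_purge_request(base_url, room_id, token, up_to)
        with urllib.request.urlopen(req) as resp:
            print(room_id, resp.status, resp.read().decode())
```

A call would look like `purge_rooms("https://matrix.example.org", ["!someroom:example.org"], "ADMIN_TOKEN")`. Purging rows does not hand disk space back to the operating system by itself, so the database still needs its vacuum afterwards.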
This happens next
As written above, i will do further experiments and investigation into downsizing the database and decreasing the server load. This may mean that the server is unavailable for a few hours from time to time.
Actually, after autovacuum failed today, i decided to do it manually with Synapse stopped; it took almost ten hours.
If i don't figure this out / if i cannot keep load away from this machine / if i cannot do an emergency migration to proper hardware due to missing money, i will shut this service down by the end of June (1)2018.
In this case i will actually keep a copy of the database in the hope of restoring it sometime in a brighter future.
I hope this made the current situation a bit more clear. 🙂
Feel free to participate in this thread.