Now that Lemmy 0.19.0 has been out for a few days, we will be proceeding with the update here on Lemmy.tf. I am tentatively planning to kick this off at 4pm EST today (3.5 hrs from the time of this post).
All instance data will be backed up prior to the update. This release includes a handful of major changes, the most impactful being that any existing 2FA configurations will be reset. Lemmy.ca has a post with a good rundown of the changes: https://lemmy.ca/post/11378137
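For anyone curious what "backed up" means here, it's essentially a Postgres dump plus a snapshot of the pict-rs volume. A rough sketch of that (container name, DB user, and paths are placeholders for a typical docker-compose deployment, not necessarily our exact setup):

```python
#!/usr/bin/env python3
"""Rough sketch of a pre-upgrade backup: dump the Lemmy database and
tar up the pict-rs data volume. Container name, DB user, and paths are
placeholders for a typical docker-compose deployment."""
import subprocess
from datetime import date

stamp = date.today().isoformat()

# Dump the DB from inside the postgres container (custom format).
with open(f"/backups/lemmy-{stamp}.dump", "wb") as out:
    subprocess.run(
        ["docker", "exec", "lemmy-postgres", "pg_dump", "-U", "lemmy", "-Fc", "lemmy"],
        stdout=out,
        check=True,
    )

# Snapshot the pict-rs volume (images) alongside the DB dump.
subprocess.run(
    ["tar", "-czf", f"/backups/pictrs-{stamp}.tar.gz", "-C", "/srv/lemmy", "volumes/pictrs"],
    check=True,
)
```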
I noticed some timeouts and DB lag when I logged in early this afternoon, so I have gone ahead and updated the instance to 0.18.4 to hopefully help clear this up.
We also have a status page available at https://overwatch.nulltheinter.net/status-page/946fd7fd-3ae3-4214-bbbf-dd7206566104 and will soon have this working on status.lemmy.tf.
As I'm sure everyone noticed, the server died hard last night. Even though OVH had me disable proactive interventions, I learned this morning that "the feature is not yet implemented" and that they have kept pressing the reset button on the machine every time their shitty monitoring detects the tiniest bit of ping loss. Last night, that finally made the server mad enough to not come back up.
Luckily, I did happen to have a backup from about 2 hours before the final outage. After a slow migration to the new DC, we are up and running on the new hardware. I'm still finalizing some configuration changes and need to do performance tuning, but once that's done our outage issue will be fully resolved.
Issues-
[Fixed] Pict-rs missing some images. This was caused by an incomplete OVA export; all older images were recovered from a slightly older backup.
[Fixed?] DB or federation issues- seeing some slowness and occasional errors/crashes due to the DB timing out. This appears to have resolved itself overnight; we were about 16 hours out of sync with the rest of the federation when I first posted this.
Improvements-
VM migrated to a new datacenter in Dallas, far away from OVH. The allocated CPU cores were doubled during the move.
We are now in a VMware cluster with the ability to hot migrate to other nodes in the event of any actual hardware issues.
Basic monitoring deployed; we are still working to stand up full-stack monitoring.
So after a few days of back and forth with support, I may have finally received some insight into why the server keeps randomly rebooting. Apparently, their crappy datacenter monitoring keeps triggering ping loss alerts, so they send an engineer over to physically reboot the server every time. I was not aware that this was the default monitoring option on their current server lines, and have disabled it, which should prevent forced reboots going forward.
I am standing up a basic ping monitor to alert me via email and SMS if the server actually goes down, so I can quickly reboot it myself if ever needed (I may even write a script to reboot via the API after x consecutive failed pings; a rough sketch of that is below). The full monitoring stack is still in progress, but it isn't truly necessary for stability at the moment.
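Something like this minimal sketch would do it. The target IP, service name, and thresholds are placeholders, and it assumes the `ovh` Python client library is installed and configured with API credentials; this is an illustration, not a finished implementation:

```python
#!/usr/bin/env python3
"""Rough watchdog sketch: hard-reboot the box via the OVH API after N
consecutive failed pings. The target IP, OVH service name, and thresholds
are placeholders; assumes the `ovh` client library is installed and
configured with API credentials."""
import subprocess
import time

import ovh  # pip install ovh

PING_TARGET = "203.0.113.10"               # instance's public IP (placeholder)
SERVICE_NAME = "ns1234.ip-203-0-113.net"   # OVH dedicated server ID (placeholder)
FAILS_BEFORE_REBOOT = 5                    # consecutive failures before acting
CHECK_INTERVAL = 30                        # seconds between checks

client = ovh.Client()  # credentials come from ovh.conf or env vars
failures = 0

while True:
    # Single ICMP echo with a 5-second timeout; returncode 0 means a reply.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "5", PING_TARGET],
        capture_output=True,
    )
    failures = 0 if result.returncode == 0 else failures + 1

    if failures >= FAILS_BEFORE_REBOOT:
        # POST /dedicated/server/{serviceName}/reboot is OVH's standard
        # hard-reboot call for dedicated servers.
        client.post(f"/dedicated/server/{SERVICE_NAME}/reboot")
        failures = 0  # reset and give the box time to come back up
        time.sleep(300)

    time.sleep(CHECK_INTERVAL)
```

In practice I'd want some alerting and backoff around that reboot call rather than a blind loop, but it shows the shape of the thing.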
OVH has scheduled a maintenance window for 5:00 PM EST this evening; hopefully they will be able to pinpoint the fault and replace parts at the same time. This will likely be an extended outage, since they can run more diagnostics than I was able to, so I would expect somewhere around an hour or two of downtime.
I am mildly tempted to go ahead and migrate Lemmy.tf off to my new environment, but it would incur even more downtime if I rush things, so it'll have to wait until sometime later.
Update 7:30PM:
I just received a response on my support case: they did not replace any hardware and claim their own diagnostics tool is buggy. We may be doing a rushed VM migration to a new server (and datacenter) in the next few days, which would incur a few hours of hard downtime to move everything over and switch DNS. Ideally I'd prefer to have time to plan it out and prep for a seamless cutover, but I think a few hours of downtime over the weekend is worth ending the random restarts. I'm open to suggestions on ideal times for this to happen.
Previous post: https://lemmy.tf/post/393063
UPDATE 07/25 10:00AM:
Support is getting a window scheduled for their maintenance. I've asked for late afternoon/early evening today with a couple hours advance notice so I can post an outage notice.
===========
UPDATE 12:00AM:
Diagnostics did in fact return a CPU fault. I've requested that they schedule the downtime with me, but technically they can proceed whenever they want, so there's a good chance there will be an hour or so of downtime whenever they get to my server. I'll post some advance notice if I'm able to.
===========
As I mentioned in the previous post, we appear to have a hardware fault on the server running Lemmy.tf. My provider needs full hardware diagnostics before they can take any action, and this will require the machine to be powered down and rebooted into diagnostics mode. This should be fairly quick (~15-20 mins ideally), and since it is required to determine the issue, it needs to be done ASAP.
I will be taking everything down at 11:00PM EST tonight to run diagnostics and will reboot into normal mode as soon as I've got a support pack. If the diagnostics pinpoint a hardware fault, followup maintenance will need to be scheduled immediately, ideally overnight but exact time is up to their engineers.
I'm also prioritizing prep work to get the instance migrated over to a better server. This has been in the works for a few weeks, but first I'll need to migrate the DB over to a new Postgres cluster and kick frontend traffic through a load balancer to prevent outages from DNS propagation whenever I finally cut over to the new server. I'd also like to get Pict-rs moved up to S3, but this will likely be a separate change down the road.
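The DB move itself should be straightforward; something like this minimal sketch (host names, user, and database name are placeholders), streaming the dump straight into the new cluster:

```python
#!/usr/bin/env python3
"""Rough sketch of the DB migration: stream a pg_dump from the current
Postgres host straight into the new cluster, no intermediate file needed.
Hosts, user, and database name are placeholders."""
import subprocess

DUMP = ["pg_dump", "-h", "old-db.internal", "-U", "lemmy", "-Fc", "lemmy"]
RESTORE = ["pg_restore", "-h", "new-db.internal", "-U", "lemmy",
           "-d", "lemmy", "--clean"]

# Pipe the custom-format dump directly into pg_restore on the new host.
dump = subprocess.Popen(DUMP, stdout=subprocess.PIPE)
restore = subprocess.run(RESTORE, stdin=dump.stdout)
dump.stdout.close()

if dump.wait() != 0 or restore.returncode != 0:
    raise SystemExit("Migration failed; check pg_dump/pg_restore output.")
```

Lemmy would stay down for the duration of the copy, which is why the load balancer piece matters: once DNS points at the LB, the final cutover becomes a backend swap instead of waiting on DNS propagation.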
EDIT 07/24: This is an ongoing issue and may be a hardware fault with the machine the instance is running on. I've opened a support case with OVH to have them run diagnostics and investigate. In the meantime, I am getting a SolarWinds server spun up to alert me anytime we have issues so I can jump on and restore service. I am also looking into migrating Lemmy.tf over to another server, but this will require some prep work to avoid hard downtime or DB conflicts during the DNS cutover.
==========
OP from 07/22:
Woke up this morning to find that everything was hard down- something tanked my bare-metal server at OVH overnight, and apparently the Lemmy VM was not set to autostart. This has been corrected, and I am digging into what caused the outage in the first place.
I know there is some malicious activity going on with some of the larger instances, but as of this time I am not seeing any evidence of intrusion attempts or a DDoS or anything.
https://lemmy.ml/post/1808829
## What is Lemmy?

Lemmy is a self-hosted social link aggregation and discussion platform. It is completely free and open, and not controlled by any company. This means that there is no advertising, tracking, or secret algorithms. Content is organized into communities, so it is easy to subscribe to topics that you are interested in, and ignore others. Voting is used to bring the most interesting items to the top.

## Major Changes

This release includes major improvements to performance, specifically optimizations of database queries. Special thanks to @phiresky, @ruud, @sunaurus and many others for investigating these. Additionally this version includes a fix for another cross-site scripting vulnerability. For these reasons instance admins should upgrade as soon as possible. As promised, captchas are supported again. And as usual there are countless bug fixes and minor improvements, many of them contributed by community members.

## Upgrade instructions

Follow the upgrade instructions for ansible [https://github.com/LemmyNet/lemmy-ansible#upgrading] or docker [https://join-lemmy.org/docs/en/administration/install_docker.html#updating]. If you need help with the upgrade, you can ask in our support forum [https://lemmy.ml/c/lemmy_support] or on the Matrix Chat [https://matrix.to/#/#lemmy-admin-support-topics:discuss.online].

## Support development

We (@dessalines and @nutomic) have been working full-time on Lemmy for almost three years. This is largely thanks to support from NLnet foundation [https://nlnet.nl/]. If you like using Lemmy, and want to make sure that we will always be available to work full time building it, consider donating to support its development [https://join-lemmy.org/donate]. No one likes recurring donations, but they've proven to be the only way that open-source software like Lemmy can stay independent and alive.
I'm running the Lemmy Community Seeder script on our instance to prepopulate some additional communities. This is causing sporadic JSON errors on the account I'm using with the script, but hopefully isn't impacting anyone else. Let me know if it is, and I'll halt it and schedule late-night runs only or something.
Right now I have it watching the following instances, grabbing the top 30 communities of the day on each scan.
```yaml
REMOTE_INSTANCES: '[
  "lemmy.world",
  "lemmy.ml",
  "sh.itjust.works",
  "lemmy.one",
  "lemmynsfw.com",
  "lemmy.fmhy.ml",
  "lemm.ee",
  "lemmy.dbzer0.com",
  "programming.dev",
  "vlemmy.net",
  "mander.xyz",
  "reddthat.com",
  "iusearchlinux.fyi",
  "discuss.online",
  "startrek.website",
  "lemmy.ca",
  "dormi.zone"]'
```
I may increase this beyond 30 communities per instance, and can add any other domains y'all want. This will hopefully make /All a bit more active for us. We've got plenty of storage available so this seems like a good way to make it a tad easier for everyone to discover new communities.
Also, just a reminder that I do have defed.lemmy.tf up and running to mirror some subreddits. Feel free to sign up and post on defed.lemmy.tf/c/requests2 with a post title of r/SUBREDDITNAME to have it automatically mirror new posts in a particular sub. Eventually I will federate that instance to lemmy.tf, but only after I'm done with the big historical imports from the reddit_archive user.