Prometheus and anti-pattern pushgateway timeouts

Prometheus is a pretty awesome recording server for metrics of all sorts. We use it at work to record data about servers, room temperatures, and other things. The whole server gets really nice and shiny if combined with a slick dashboard like Grafana.

But enough of this fanboy-ism, there is a problem with prometheus which almost became a deal breaker for us: Prometheus employs a (relatively) strict pull mechanism for fetching metrics from devices. The server is configured to regularly check on peers and fetch the metrics from them. Prometheus takes the active part of the data collector and can therefore detect downtimes of devices automatically. It nicely allows one to define what metrics should be available on a client and to configure a server to fetch them. A nicely encapsulated design!
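To make the pull idea concrete, here is a minimal sketch of what a scraping server does in principle. It is not how Prometheus itself is implemented; the function names and target URLs are made up for illustration. The one detail taken from real Prometheus is that a failed scrape is recorded as `up` being 0, which is how downtimes become visible.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Fetch the raw metrics text that a device exposes over HTTP."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def scrape_once(targets, fetch=fetch_metrics):
    """Visit every target once and note whether it answered.

    `fetch` can be swapped out so the logic can be tried without a network.
    """
    results = {}
    for target in targets:
        try:
            results[target] = {"up": 1, "body": fetch(target)}
        except OSError:
            # URLError, refused connections and timeouts are all OSError subclasses.
            results[target] = {"up": 0, "body": ""}
    return results
```

The important bit is that the server initiates every connection, which is exactly what a firewall between server and devices prevents.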

This design comes to its limits though when it collides with company restrictions on dataflow, also called “firewalls”. Publishing metrics to the internet from “the inside” becomes almost impossible since the active part of the prometheus system is isolated and cannot contact the machines it should “scrape” the data from. This is a well-known issue and it can partially be fixed by relying on the so-called pushgateway. Metrics are pushed to this pushgateway, are saved, and are later served to the prometheus server when the pushgateway is scraped. Since metrics are now pushed from the devices, it is possible to penetrate business firewalls and send data to servers on the internet.
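As a rough sketch of what such a push looks like from a device: the pushgateway accepts metrics in the plain-text exposition format via HTTP on a path of the form `/metrics/job/<job>/instance/<instance>`. The host name, job name, and metric name below are invented for the example, and the job and instance names are assumed to be simple strings without slashes or other special characters.

```python
import urllib.request


def build_push_request(gateway_url, job, instance, metrics):
    """Build a PUT request that replaces all metrics of one job/instance group."""
    # The grouping key (job and instance) is encoded in the URL path.
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job}/instance/{instance}"
    # Text exposition format: one "name value" pair per line, newline-terminated.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(gateway_url, job, instance, metrics, timeout=5.0):
    """Send the metrics to the pushgateway and return the HTTP status code."""
    request = build_push_request(gateway_url, job, instance, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical gateway and metric; this only prints what would be sent.
    request = build_push_request(
        "http://pushgateway.example.org:9091",
        "room_sensors",
        "rack42",
        {"room_temperature_celsius": 21.5},
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"), end="")
```

The device opens the connection here, so only outgoing traffic has to pass the firewall.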

However, the authors of prometheus see this use of the pushgateway as an antipattern. The official use case for the pushgateway is to persist metrics that are not continuously available but are generated, for example, by an automated script that runs for a short time. When it finishes, it produces some metric that needs to be made available to prometheus, but the script cannot serve it itself since it is not a continuously running server process. Pushing the generated metrics to a local(!) pushgateway for later scraping from an external(!) prometheus server is the solution. Note that this is different from the proposed firewall penetration use case for the pushgateway. To be able to push through the firewall, the pushgateway must be on the prometheus server side, not on the devices.

The consequence of this design decision is that an important feature is missing from the pushgateway: timeouts for stored metrics. These are important in the firewall usecase, because the prometheus server cannot check if a device is offline anymore. The last stored metric is persisted in the pushgateway forever and data just “flatlines” if a device goes offline.

At work, this was a real shame: the prometheus server worked fine and was great, but we could not use it through our business firewall. Personally, I see why the original developers consider it an antipattern to use a pushgateway for firewall circumvention. On the other hand, it is also a pity that this software becomes entirely unusable in this situation, especially since the missing feature is relatively small. Therefore, it was time to code the antipattern!

Since it was needed for work, I contributed to the project and implemented the unintended feature, which is available on github and also in binary form on docker-hub as a compiled docker image. The extension of the pushgateway allows devices to send timeout information about metrics. The pushgateway will then delete these metrics if they were not refreshed within the defined timeout.
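The idea behind the timeout can be sketched in a few lines. This is not the code of the actual extension, just an illustration of the principle: every pushed group of metrics may carry a timeout, and anything that was not refreshed in time is dropped before the next scrape. The class and parameter names are made up for this example.

```python
import time


class ExpiringMetricStore:
    """Toy in-memory store that forgets metric groups that are not refreshed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tried without waiting
        self._groups = {}    # group key -> (metrics dict, expiry time or None)

    def push(self, group, metrics, timeout_seconds=None):
        """Store or refresh a group; without a timeout it is kept forever."""
        expires_at = None
        if timeout_seconds is not None:
            expires_at = self._clock() + timeout_seconds
        self._groups[group] = (dict(metrics), expires_at)

    def scrape(self):
        """Drop expired groups, then return a copy of everything still stored."""
        now = self._clock()
        expired = [
            group
            for group, (_, expires_at) in self._groups.items()
            if expires_at is not None and expires_at <= now
        ]
        for group in expired:
            del self._groups[group]
        return {group: dict(metrics) for group, (metrics, _) in self._groups.items()}
```

A device that pushes every 30 seconds with a timeout of, say, 60 seconds stays visible as long as it is alive and disappears from the scrape output shortly after it goes offline. Groups pushed without a timeout keep the original "stored forever" behaviour.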

This works like a charm and allows us to make use of prometheus through our firewall. If servers are offline, their metrics no longer just flatline but are shown as missing. Perfect! I cannot support this project at work but will probably do so every now and then in my free time. So have fun using this feature, if you feel a bit “antipattern”.

Electronics: Controlling a Robot Arm

Last year, I bought a robot arm and “modded” it a little bit with custom electronics. Specifically, I created a USB controller board for it to be able to control it programmatically from my computer. This was also aimed at getting some use out of the Arduino microcontroller boards that I bought the summer before. So, the tightly scheduled two-week USB controller project was created. In the end, it took a bit longer because of a redesign of the input power supply, but the final design looked like this:

The robot arm in its final glory
The fully assembled controller board.

You can have a look at the USB Robot Arm Controller page for more information. The article currently focuses mainly on the electrical design of the board. I might add a second page later this year that will focus more on the controller software for the robot arm.

Playing around with the Space Engineers source code

I really like the game Space Engineers developed by Keen Software House. I recently picked it up again and was a little bit disappointed that there were a bunch of new features added to the game, but the underlying network engine did not seem to have changed in over a year. Since I am a Software Engineer and personally like to do a bit of game development on the side, I got interested in the code-base for Space Engineers which was released a year or so ago on github. I wanted to see how the game was coded and how the code base of the project looks.

First of all, I discovered that my initial concerns regarding the immediate usability of the code base were justified: even when following the official README instructions, it took me about 8 hours to get everything compiled. This morning, I got it working for the first time in a Release build and, with a little bit of extra effort, also in the Debug build. So far so good. To my surprise, running the actual game worked out of the box! One needs to own an original version of Space Engineers on Steam, from where extra libraries and assets are pulled, but the default run configuration in Visual Studio worked fine, immediately.

While trying to get the code base to compile, I got to see a lot of code already, and without a deep understanding of the code base, it seems to me that Space Engineers suffers from typical issues of large projects: feature creep, ad-hoc coding, and half-completed refactoring steps which make things worse rather than better. Oh well, business as usual. However, after spending eight hours of my free time on a code base, analyzing what the hell is happening where and why, basically the same stuff that I also do at work, this led to an interesting “mod” of the main menu:

A mad, mad mod for Space Engineers…

Contains the song Jaunty Gumption by Kevin MacLeod (
Licensed under Creative Commons: By Attribution 3.0 License

Making this mod also let me work more closely with the code base of Space Engineers. As it turns out, Keen Software House did not open up much of the underlying engine VRage together with Space Engineers. Only some parts of it (modules? parts?) are included in the code base. The remaining use of VRage consists of API calls into some compiled VRage DLLs published via Steam. This would not matter too much, but the complete lack of documentation makes browsing and understanding the code base a lot harder, especially for calls into the binary libraries. The resource configuration files that VRage processes, for example the music and sound configuration files, are undocumented as well. And finally, I discovered that audio files cannot be in typical sound formats but are required to be either a Wave file or some undocumented flavor of a Windows Media Audio 2 file, which I do not know how to generate.

Multiplayer synchronisation

Since I was so disappointed about the lack of improvements to the netcode in the past year, I also had a short look at that. From my initial sweep of the Space Engineers netcode, it seems to me that the multiplayer relies entirely on synchronized values. I could not quickly find any trace of specialized prediction algorithms that, for example, distinguish which player has control over a ship; everything is handled the same. This would explain the “bounciness” that one experiences when trying to play Space Engineers in multiplayer: different clients are fighting over and synchronizing the same values based on their local simulations, which cannot run perfectly synchronously. In this context, a lower ping would actually aggravate the issue since there would be more updates and bounces per second, and the oscillation of ships would be quicker. With a higher ping, the sudden jumps would be larger, but less frequent. This is contrary to popular belief, where the glitches are (sometimes) attributed to “lag”. I say, given my very superficial research, that these issues would still be there with any amount of “lag” and will keep appearing until the prediction code is made smarter. I have a few small ideas on how this could, perhaps, be improved, but I first have to see if this is indeed the problem… more research needed.
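The argument can be illustrated with a toy simulation. It has nothing to do with the real Space Engineers code; it simply assumes two simulations of the same ship that drift apart slightly, and a client that snaps to the remote value whenever an update arrives. It also assumes, as the reasoning above does, that the ping determines how often such corrections arrive. All numbers are arbitrary.

```python
def simulate_snapping(update_interval_ticks, total_ticks=600,
                      local_speed=1.00, remote_speed=1.02):
    """Return (number of corrections, largest visible jump) for one run.

    Both simulations advance the ship every tick at slightly different
    speeds. Whenever an update arrives, the local value is overwritten
    with the remote one, which the player sees as a sudden jump.
    """
    local = remote = 0.0
    jumps = []
    for tick in range(1, total_ticks + 1):
        local += local_speed
        remote += remote_speed
        if tick % update_interval_ticks == 0:
            jumps.append(abs(remote - local))
            local = remote  # naive synchronization: just take the remote value
    return len(jumps), max(jumps, default=0.0)


if __name__ == "__main__":
    for interval in (3, 30):
        count, largest = simulate_snapping(interval)
        print(f"interval={interval:2d} ticks: {count} jumps, largest {largest:.2f}")
```

With the short interval the ship is corrected many times by a small amount, with the long interval it is corrected rarely but by a larger amount. In both cases the jumps never go away, because nothing in this scheme predicts or reconciles the two simulations.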

Regarding the “rewrite of the netcode” with Raknet that many people want: I do not think that there is a good reason to “want” it. Regardless of the network layer used, the glitchfest will remain. The only fix for the multiplayer glitches is to change the way the game world is synchronized with other players.

A (new) hobby Programmer Website

Freshly recreated, this is my new website where I document my projects and general stuff that people might be interested in. The page needs to grow again after being down for almost a whole year. I had to take the website down after it was hacked by Google-Drive scammers.

First thing for now: re-discover what WordPress has to offer – the last version I was using was Version 2.XXX – time to rediscover those code highlighting plugins etc…