Erlang-Inspired Node.js

Despite the unshakable feeling that I’m a terrible programmer, I’ve ended up going on some very experimental journeys, writing code that reflects the way I think about software.

Unintentionally, though it makes sense in retrospect, my experiments in JavaScript have built on one another, reinforcing each other’s usefulness in my mind and giving me more confidence in all of them.

While it’s probably not the best first experiment to discuss, I’d like to write about my most recent one: a module that helps me write Erlang-inspired code for Node.js.

Why Erlang?

I’m not sure what initially brought Erlang to my attention, but I was completely fascinated by a language designed to let failures happen in production and to rectify them with the most stereotypical tech support tactic ever devised: “try turning it off and on”.

Turns out this tactic is stereotypical because it works. Instead of following a complex error-recovery strategy, you turn the thing off and restart it, forcing it back into a known good state.

As useful as the restart strategy is, you don’t want to shut down an entire system for every error. It would be tedious to restart your database because of an application error. Instead, you’d want to restart only the affected part of the system, so that other parts can stay busy while the faulty part restarts. Naturally, this leads to building components in a way that limits the reach of their errors, because better-isolated components can be restarted on any error without hurting the overall productivity of the system.
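To make the restart idea concrete, here is a minimal in-process sketch (not Erlang, and not supe): each hypothetical component owns its state, and when handling a message throws, only that component's state is discarded and rebuilt from its known good initial value while the other components keep working. The names `makeComponent` and `deliver` are made up for illustration. In-process isolation like this can't protect against a crashed process or a runaway loop, which is where separate processes come in.

```javascript
// Hypothetical component: init() returns a known good state,
// handle(state, msg) processes a message and may throw.
function makeComponent(name, init, handle) {
  return { name: name, init: init, handle: handle, state: init(), restarts: 0 };
}

// Deliver a message to one component. On error, "turn it off and on":
// throw away its state and rebuild it from init(), leaving other components alone.
function deliver(component, msg) {
  try {
    component.handle(component.state, msg);
    return true;
  } catch (err) {
    component.state = component.init();
    component.restarts += 1;
    return false;
  }
}

const counter = makeComponent('counter', () => ({ count: 0 }), (state, msg) => {
  if (typeof msg !== 'number') throw new TypeError('expected a number');
  state.count += msg;
});

const logger = makeComponent('logger', () => ({ lines: [] }), (state, msg) => {
  state.lines.push(String(msg));
});

deliver(counter, 5);
deliver(logger, 'started');
deliver(counter, 'oops'); // counter crashes and is reset to { count: 0 }
deliver(logger, 'still running'); // logger never noticed

console.log(counter.state, counter.restarts); // { count: 0 } 1
console.log(logger.state.lines); // [ 'started', 'still running' ]
```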

Better-isolated components are great for the restart strategy but much harder to build. Even the most vocal advocates of microservice architecture will quickly tell you to build a monolith first, because there’s tremendous overhead in building and coordinating microservices.

Erlang promises well-isolated components while keeping the overhead of building and coordinating them down. As a terrible programmer, I was very interested in a language that would forgive me all my sins and make me look like I know what I’m doing, so I consumed everything I could find about Erlang and Joe Armstrong, one of its creators.

Why Erlang-like Node.js?
(or Why I didn’t switch to Erlang)

While I started writing code in PHP (mostly in the WordPress ecosystem), I’m way more comfortable with JavaScript. My first memorable experiment, which eventually led to my first npm package, plays a non-trivial role in my journey to writing Erlang-inspired Node.js. I’m hopelessly in love with JavaScript, and I’d rather be working in its ecosystem than anywhere else, even though the CTO of our package manager prefers coding in other languages :'(

Erlang’s VM and the OTP framework do lots of impressive things to earn the language its reputation, and most of them can’t be trivially replicated in Node.js. Regardless, I became convinced it was possible to borrow a number of the ideas and rework them for a Node.js context. It won’t be the same, but if you squint hard enough after a whole bottle of Beefeater gin, you’ll see the resemblance.

Node.js: Needs More Erlang

Figuring it’d still be a win to bring just one concept from Erlang to Node.js, I started experimenting with running an application as distinct parts on the same machine. I’d written apps where every part lived in a single Node process, and other apps where parts ran on different machines and communicated over Redis pubsub. I wanted the isolation (like Erlang) but on a single machine. Was it even possible?

Writing different parts of the application as standalone components was pretty trivial; I’d already had to do that for multi-machine applications and microservices. To get them running on the same machine, I ended up leveraging Node’s child_process module. The main file simply used it to spawn child processes, each running a different part of the application.

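// spawn each component as its own Node process; the fourth stdio entry ('ipc')
// opens a message channel between the parent and that child
var child_process = require('child_process'),
    component_one = child_process.spawn( 'node', [ './component_one.js' ], { stdio: [ 'pipe', 'pipe', 'pipe', 'ipc' ] }),
    component_two = child_process.spawn( 'node', [ './component_two.js' ], { stdio: [ 'pipe', 'pipe', 'pipe', 'ipc' ] });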

Each component is isolated from the others, so a failure in one wouldn’t bring down the whole application the way it would in most Node apps, where everything shares a single process. Dramatically better apps with just a tiny sprinkling of Erlang!

Reading up on spawn made me realize Node provided links between the parent and child processes. Since I’d receive events when a process started, exited cleanly or crashed, I reflexively reached for my trusty pubsub to publish them to subscribers.

var Noticeboard = require('cjs-noticeboard'),
    noticeboard = new Noticeboard({ logging: false, logOps: false }),
    components = {}; // name -> { ref: <ChildProcess>, ... }, filled in wherever components are spawned

// publish child process shutdowns and crashes
noticeboard.watch( 'component-created', 'notify-on-crash-or-shutdown', function( msg ){

  var name = msg.notice,
      component = components[ name ];

  component.ref.on( 'close', function( code ){

    delete components[ name ].ref;

    // exit code 0 is a clean shutdown; anything else (including null, when a signal killed it) is a crash
    if( code === 0 ) noticeboard.notify( 'component-shutdown', name );
    else noticeboard.notify( 'component-crashed', name );
  });
});

// publish child process stdout & stderr
noticeboard.watch( 'component-created', 'pipe-component-process-output', function( msg ){

  var name = msg.notice,
      component = components[ name ];

  component.ref.stdout.on( 'data', function( data ){

    noticeboard.notify( name + '-output', data, name );
  });

  component.ref.stderr.on( 'data', function( data ){

    noticeboard.notify( name + '-error', data, name );
  });
});
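The 'close' handler above leans on what Node reports when a child exits: an exit code, or a signal name when the process was killed. A minimal standalone sketch using nothing beyond Node itself; the `watch` helper is hypothetical, and the two "components" are throwaway inline scripts run with `node -e` (via `process.execPath`) rather than real component files.

```javascript
const { spawn } = require('child_process');

// Spawn an inline Node script and resolve with how it ended.
// 'close' passes (code, signal); code is null when a signal killed the process.
function watch(name, script) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', script], { stdio: 'ignore' });
    child.on('error', reject); // e.g. the executable could not be spawned
    child.on('close', (code, signal) => {
      resolve({ name: name, outcome: code === 0 ? 'shutdown' : 'crashed', code: code, signal: signal });
    });
  });
}

const lifecycle = Promise.all([
  watch('component_one', 'process.exit(0)'),
  watch('component_two', 'process.exit(1)')
]);

lifecycle.then((results) => {
  results.forEach((r) => console.log(r.name + ': ' + r.outcome + ' (exit code ' + r.code + ')'));
  // component_one: shutdown (exit code 0)
  // component_two: crashed (exit code 1)
});
```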

With very little effort, I was able to tap into the lifecycle of each child process and programmatically hook into its start-up, shutdowns, crashes and output. This was starting to look a lot like Erlang’s supervisor, a very important component in Erlang application architecture, so I decided to draw more inspiration from it for my experiment.

Birth of the Supervisor

I wrote a subscriber to the component-crashed message that simply restarted the process. I ended up making it configurable, so you can specify how many crashes within a period of time are acceptable. Any more than that and the subscriber refuses to restart the process; instead, it publishes a component-excessive-crash message.

noticeboard.watch( 'component-crashed', 'handle-restart', function( msg ){

  var name = msg.notice,
      component = get_component( name );

  if( !component.retries ) component.retries = 0;

  // already restarted the configured number of times within the window: stop restarting
  if( component.retries >= component.config.retries ) return noticeboard.notify( 'component-excessive-crash', name );

  start_component( name );

  component.retries += 1;

  setTimeout( function(){

    component.retries -= 1;
  }, component.config.duration * 1000 * 60 ); // config.duration is in minutes
});
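The timer-based counter above works, but the same policy can also be expressed as a sliding window over crash timestamps, which is close to how Erlang/OTP supervisors define restart intensity (at most `intensity` restarts within `period` seconds). A minimal sketch with hypothetical names: `maxRetries` and `windowMs` play the roles of `config.retries` and `config.duration` (converted to milliseconds), and timestamps are passed in explicitly so the logic can be checked without waiting.

```javascript
// Decide whether a crashed component may be restarted.
// At most maxRetries restarts are allowed within any windowMs-long window.
function createRestartPolicy(maxRetries, windowMs) {
  const restarts = []; // timestamps (ms) of restarts still inside the window

  return function shouldRestart(now) {
    // Forget restarts that have slid out of the window.
    while (restarts.length > 0 && now - restarts[0] >= windowMs) restarts.shift();

    if (restarts.length >= maxRetries) return false; // crashing excessively
    restarts.push(now);
    return true;
  };
}

// Allow at most 3 restarts per minute.
const shouldRestart = createRestartPolicy(3, 60 * 1000);

console.log(shouldRestart(0));     // true
console.log(shouldRestart(1000));  // true
console.log(shouldRestart(2000));  // true
console.log(shouldRestart(3000));  // false: a 4th crash inside the same minute
console.log(shouldRestart(61000)); // true: the restarts at 0 and 1000 have expired
```

Keeping timestamps instead of scheduling a `setTimeout` per crash means there are no timers left running after a component is given up on.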

With the core of the supervisor bits done, squinting to see the Erlang in Node.js was a lot closer to reality than fantasy.

It'd take more than process monitoring to get there, so I named the project supe and started working on an important aspect of multi-component architecture: communication.

Hello Supe

Quickly discovering that I was able to send messages from a parent to its child process (and vice versa), I wrote some code to make it as trivial as possible for either party to send to the other.

// receiving
supe.mail.receive( function( envelope, ack ){

  console.log( 'received mail from "' + envelope.from + '"\ncontent: ', envelope.msg );
  ack();
});

// sending to parent process aka the supervisor
supe.mail.send( 'hello supe' );
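The parent/child messaging underneath is Node's built-in IPC channel: a child spawned with `'ipc'` in its `stdio` array (like the spawn calls earlier) gets `process.send()` and a `'message'` event on `process`, while the parent gets `child.send()` and a `'message'` event on the child. A minimal round trip without supe, assuming only Node itself; the `{ from, msg }` envelope is an illustration, not supe's actual message format.

```javascript
const { spawn } = require('child_process');

// Child: answer the parent, then close the IPC channel once the answer
// has been sent, so nothing keeps the child alive and it can exit.
const childScript = `
  process.on('message', (envelope) => {
    const answer = { from: 'child', msg: 'got: ' + envelope.msg };
    process.send(answer, () => process.disconnect());
  });
`;

// The fourth stdio entry opens the IPC channel.
const child = spawn(process.execPath, ['-e', childScript], {
  stdio: ['ignore', 'inherit', 'inherit', 'ipc']
});

const reply = new Promise((resolve, reject) => {
  child.once('message', resolve);
  child.once('error', reject);
});

// Parent -> child
child.send({ from: 'supervisor', msg: 'hello component' });

reply.then((envelope) => {
  console.log(envelope.from + ' says "' + envelope.msg + '"'); // child says "got: hello component"
});
```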

Components also need to talk to each other, so sending mail to another process under the same supervisor was just a matter of passing an option to the send method. The mail goes to the supervisor, which routes it to the intended recipient.
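The routing step can be pictured as a lookup table on the supervisor. This is a minimal in-memory model, not supe's implementation: the `to` field, `createRouter`, and the fake children (plain objects whose `send` method stands in for each child's `ChildProcess#send`) are all assumptions for illustration.

```javascript
// Route an envelope to a supervised child by name, or keep it if it has no recipient.
function createRouter(children, deliverToSelf) {
  return function route(envelope) {
    // No recipient: the mail is for the supervisor itself.
    if (!envelope.to) return deliverToSelf(envelope);

    const recipient = children[envelope.to];
    if (!recipient) throw new Error('unknown recipient "' + envelope.to + '"');

    recipient.send({ from: envelope.from, msg: envelope.msg });
  };
}

// Fake children that just collect their mail.
const inboxes = { forwarder: [], retrier: [], supervisor: [] };
const children = {
  forwarder: { send: (mail) => inboxes.forwarder.push(mail) },
  retrier: { send: (mail) => inboxes.retrier.push(mail) }
};

const route = createRouter(children, (mail) => inboxes.supervisor.push(mail));

route({ from: 'retrier', to: 'forwarder', msg: 'webhook #1' });
route({ from: 'forwarder', msg: 'hello supe' });

console.log(inboxes.forwarder);  // [ { from: 'retrier', msg: 'webhook #1' } ]
console.log(inboxes.supervisor); // [ { from: 'forwarder', msg: 'hello supe' } ]
```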

Since the supervisor is in charge of routing mail, it was trivial to have it hold a copy of each message until the recipient acknowledged receiving it. That way, if a process crashed before acknowledging a message, the supervisor could simply resend it once the process restarted.
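Holding mail until it's acknowledged amounts to keeping a per-recipient list of unacknowledged messages. A minimal sketch under assumed names (`createMailroom`, numeric message ids, a `deliver` callback standing in for the actual send to a child); supe's own bookkeeping may differ.

```javascript
// Keep every routed message until its recipient acknowledges it,
// so it can be resent if the recipient crashes and restarts first.
function createMailroom() {
  const pending = {}; // recipient name -> Map(id -> mail)
  let nextId = 1;

  return {
    send(to, msg, deliver) {
      const mail = { id: nextId++, to: to, msg: msg };
      if (!pending[to]) pending[to] = new Map();
      pending[to].set(mail.id, mail);
      deliver(mail);
      return mail.id;
    },
    ack(to, id) {
      if (pending[to]) pending[to].delete(id);
    },
    // Call after a recipient restarts: resend everything it never acknowledged.
    redeliver(to, deliver) {
      if (!pending[to]) return 0;
      pending[to].forEach((mail) => deliver(mail));
      return pending[to].size;
    },
    pendingCount(to) {
      return pending[to] ? pending[to].size : 0;
    }
  };
}

const mailroom = createMailroom();
const received = [];
const deliver = (mail) => received.push(mail.msg);

const first = mailroom.send('forwarder', 'webhook #1', deliver);
mailroom.send('forwarder', 'webhook #2', deliver);
mailroom.ack('forwarder', first); // #1 was handled; #2 was never acked before a crash

// ...forwarder crashes and is restarted...
const resent = mailroom.redeliver('forwarder', deliver);

console.log(resent, received); // 1 [ 'webhook #1', 'webhook #2', 'webhook #2' ]
```

This gives at-least-once delivery: a component that finished its work but crashed before acking will see the message again, so handlers should tolerate duplicates.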

As long as the supervisor doesn't crash, we can reliably send messages between components, queue up work for each one, and know we won't lose anything in the process. At worst, we'll get a notice that a component is crashing excessively, and we can persist its mailbox to disk or something while we investigate what went wrong.

Between the mailbox and process supervision, I felt I had enough tools to try building something real and see if it could hold up in the real world.

Hello World

My employer gives me a lot of rope, so I decided to build a service that'd be useful yet relatively trivial, just in case things went wrong. Our main app (WordPress + Apache + nginx) was getting slammed by SendGrid event webhooks, so it made sense to build a tiny app that received the webhooks from SendGrid and queued them up to be sent to the main app in a controlled manner. That's trivial to do in pure Node.js, so the real challenge was seeing if I could build it with Supe.

// structure

var root_supervisor = {
  http_server: 'receives webhooks from sendgrid',  
  connection_lock: 'ensures sendgrid dam doesn\'t exceed set amount of concurrent connections',

  sendgrid_dam: { // nested supervisor
    
    forwarder: 'sends queued webhooks to main app, queues failed forwards',
    retrier: 'sends queued failed forwards, queues failed retries',    
    rabbitmq_connection: 'queues received webhooks'
  }
}

Currently hosted on a Heroku hobby instance, this app runs pretty much flawlessly. http receives webhooks from SendGrid and mails them to rabbitmq-connection, which persists them in a queue. forwarder is started when there are queued messages to send, and it shuts down after thirty seconds of idle time. retrier checks for failures every five minutes and attempts to redeliver any it finds. connection-lock is the key that holds it all together: it makes sure the number of concurrent connections this app makes to the main app never exceeds a set limit. To do so, forwarder and retrier request a connection lock and wait until they are assigned one before forwarding a webhook.
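connection-lock behaves like a counting semaphore. A minimal in-process sketch of the idea with a callback API; the names (`createConnectionLock`, `acquire`, `release`) are hypothetical, and in the actual app the lock requests and grants would travel as mail between processes rather than as function calls.

```javascript
// Counting semaphore: at most `limit` holders at once, the rest wait in FIFO order.
function createConnectionLock(limit) {
  let active = 0;
  const waiting = [];

  return {
    acquire(onGranted) {
      if (active < limit) {
        active += 1;
        onGranted();
      } else {
        waiting.push(onGranted);
      }
    },
    release() {
      const next = waiting.shift();
      if (next) next(); // hand the freed slot straight to the next waiter
      else if (active > 0) active -= 1;
    },
    stats() {
      return { active: active, waiting: waiting.length };
    }
  };
}

const lock = createConnectionLock(2);
const log = [];

lock.acquire(() => log.push('forwarder: webhook #1'));
lock.acquire(() => log.push('forwarder: webhook #2'));
lock.acquire(() => log.push('retrier: failed webhook')); // has to wait
console.log(lock.stats()); // { active: 2, waiting: 1 }

lock.release(); // webhook #1 finished, so the retrier gets its slot
console.log(log.length, lock.stats()); // 3 { active: 2, waiting: 0 }
```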

While building this, I wrote a lot of useful little utilities and patterns. Some were discarded after some consideration; others became indispensable. One of the more useful ones is a module that automatically formats stdout and stderr from a child process, labels the output and sends it to the supervisor's stdout.
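A labelling helper like that mostly comes down to splitting each chunk into lines and prefixing them with the component's name, remembering that a chunk from a pipe can end mid-line. A minimal sketch with a hypothetical `createLabeller`; wiring it up would mean calling `push` from the child's `stdout` and `stderr` `'data'` handlers and `flush` when the child closes.

```javascript
// Prefix each complete line from a child process stream with a label.
// A 'data' chunk may end mid-line, so the unfinished tail is buffered.
function createLabeller(label, write) {
  let partial = '';

  return {
    push(chunk) {
      const lines = (partial + chunk.toString()).split('\n');
      partial = lines.pop(); // last element is an incomplete line (or '')
      lines.forEach((line) => write('[' + label + '] ' + line + '\n'));
    },
    flush() {
      if (partial) write('[' + label + '] ' + partial + '\n');
      partial = '';
    }
  };
}

const output = [];
const labeller = createLabeller('forwarder', (text) => output.push(text));

labeller.push(Buffer.from('forwarded webhook #1\nforwarded web'));
labeller.push(Buffer.from('hook #2\n'));
labeller.flush();

process.stdout.write(output.join(''));
// [forwarder] forwarded webhook #1
// [forwarder] forwarded webhook #2
```

If a component can print non-ASCII text, calling `setEncoding('utf8')` on the child's stdout and stderr first keeps multi-byte characters from being split across chunks.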

I've been collecting a few of these useful patterns and will bake them into a module soon enough. For now, I'm still using supe and figuring out the right patterns and structures to leverage.

Observations

Though my experiments with Erlang-like Node apps are in their infancy, I'd like to share a few things I've noticed.

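1. Node's "stop the world" garbage collection isn't so scary when your application is split into smaller distinct worlds. Each process has its own heap and collects its own garbage, so a pause in one component doesn't freeze the others.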

2. Coordinating multiple components is REALLY HARD!

3. Once you start dealing with messaging and queuing, you are squarely in dragon territory. To all who enter, abandon all hope. That being said, messaging and queues are the right way to do things, so (wo)man up.

4. Don't be afraid to reverse your approach and head in a different direction. At one point I wrote a tool to auto-magically forward notifications from a supervisor to its own supervisor. Terrible idea. It took all sorts of hacks (which worked, by the way) to make it work transparently, but it was really just a nightmarish ball of mud. It was way better to stop hacking the system and write explicit code in the supervised supervisor that receives its child processes' notifications directly and sends a message to its own supervisor instead.

5. Just because you're excited about something doesn't mean anyone else will be. Just keep swimming regardless.
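The explicit approach from the fourth observation might look something like this: the nested supervisor handles its own child's crash and decides, in plain code, what to tell its parent. Everything here is hypothetical (the `createCrashReporter` helper, the message shape, and the injected `sendUp` function standing in for `process.send`); it only shows the shape of such code, not supe's API.

```javascript
// Inside a nested supervisor such as sendgrid_dam: translate a child's crash
// into an explicit, self-describing message for this supervisor's own parent.
function createCrashReporter(supervisorName, sendUp) {
  return function onChildCrashed(childName, exitCode) {
    sendUp({
      type: 'child-crashed',
      supervisor: supervisorName,
      child: childName,
      exitCode: exitCode
    });
  };
}

// In a real nested supervisor process, sendUp could be (msg) => process.send(msg).
const sentUp = [];
const onChildCrashed = createCrashReporter('sendgrid_dam', (msg) => sentUp.push(msg));

onChildCrashed('forwarder', 1);
console.log(sentUp);
```

The parent only ever sees messages the nested supervisor chose to send, which keeps the forwarding visible in code instead of hidden behind transparent plumbing.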

To Be Continued ...

My adventure with Erlang-Inspired Node.js isn't over yet, so there'll be more to talk about in the future. Until then, I'll be over here trying weird new things 🙂