Clustering in Node.js (Part 2) – Round-robin load balancing in a node cluster – notifying all workers

Here is part two of the round-robin load balancing post, which explains how to notify all workers depending on the type of data. This is actually very simple to do in the default architecture laid out in Part 1, and it provides a lot of utility: each worker can maintain state based on the type or amount of work it has completed, while the Load Balancer itself remains stateless. Essentially, it works like this:

  1. Inspect the contents of the data. If the data warrants a notification to all workers, return the type of notification to be sent. If it doesn’t, don’t send any notification.
  2. If there is a notification to be sent, send it to all workers before actually distributing the data.
  3. When a worker receives the notification, it updates its state before processing the data.

It’s really that simple. To implement this, I added a prototype method called broadcast(data) to the LoadBalancer subclass I had written. Broadcast inspects the data and determines what type of notification, if any, to send to the workers.

/**
 * Returns a notification type based on the data.  If the returned type is
 * undefined, no notification is sent; otherwise every worker is notified with
 * it, so workers need to implement notify() to take advantage of the type.
 *
 * @method broadcast
 * @param {Object} data Data is of format { data: msg, rinfo: rinfo }
 * @return {Object} Returns a NotificationPrototype or undefined to disable
 */
MyLoadBalancer.prototype.broadcast = function(data) {
    if (needsNotification(data)) {
        return NotificationType;
    } else {
        return undefined;
    }
};

Next, I modified the LoadBalancer to send the notification first, if there was one to be sent. For each data source/server implemented in LoadBalancer, I added this:

function serverWrapper() { ...
    c.on('data', function(data) {
        console.log('[%s] Receiving TCP data', self.name);
        //wrap the raw buffer so we don't shadow the incoming data parameter
        var payload = {
            remote_address: c.remoteAddress,
            data: data
        };
        self.counter++;

        var notification = createNotification.call(self, self.broadcast(payload), payload);
        if (notification) {
            self.notify(self.workers, notification);
        }
        self.distribute(self.workers, payload);
    });

    ...
}
/**
 * Default implementation of notify - notifies all workers about some data
 *
 * @method notify
 * @param {Array[cluster.Worker]} workers The array of workers managed by this LoadBalancer.
 * @param {Object} notification The notification object to send to each worker
 */
LoadBalancer.prototype.notify = function(workers, notification) {
	var self = this;
	workers.forEach(function(worker) {
		//catch per worker so the failing worker is still in scope
		try {
			worker.send({
				type: 'notification',
				data: notification
			});
		} catch (e) {
			if (e instanceof WorkerDownError) {
				console.log('Worker %s is down.  Removing from workers for [%s:%d]', worker.name, self.name, self.instance);
				workers.splice(workers.indexOf(worker), 1);
			} else {
				throw e;
			}
		}
	});
};

/**
* Instantiate a notification if there is one
* @method createNotification
* @param {Function} BroadcastType The notification type constructor
* @param {Object} data The data associated with the receiver
* @return {Object} A new BroadcastType instance, or undefined if no type was given
*/
function createNotification(BroadcastType, data) {
	if (!BroadcastType) {
		return undefined;
	} else {
		return new BroadcastType(BroadcastType.prototype.type, this.name, data);
	}
}

This code is really simple. When the data event is emitted by the server, the prototype’s broadcast method is called to determine whether a notification needs to be broadcast to the workers. If broadcast returns a notification type (‘BroadcastType’ here), you end up with an instance of that notification wrapping the data. That notification is then sent to each worker, and each worker can determine what to do with it.
If there isn’t any notification (i.e. broadcast returns undefined), the LoadBalancer just proceeds as normal and distributes the work. Notifications here are really simple: essentially plain data objects wrapped in a constructor function. Here’s an example:

'use strict'

/**
* Notification class is a wrapper for a javascript object that gets passed upon some event triggered by a 
* data source consumed by a Load Balancer
*
* @class Notification
* @constructor
* @param {String} type A string of the type (similar to an enum)
* @param {String} source The name of the source instantiating this notification
* @param {Object} data The datagram bundled with this notification
*/
var Notification = function(type, source, data) {
	this.type = type;
	this.source = source;
	this.data = data;
	this.created = new Date().getTime();
};

//String representation of a notification
Notification.prototype.stringify = function() {
	var str = '';
	str += 'Notification @: ' + this.created + ' from ' + this.source + ' of type ' + this.type;
	return str;
};

Notification.prototype.type = undefined;

module.exports = Notification;
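
For concreteness, here’s what a concrete notification type might look like. This is a minimal sketch; the GenericNotification name and its 'Generic' type string are illustrative, not from the original code:

var Notification = require('./Notification.js');

//Hypothetical concrete type - broadcast() would return this constructor
var GenericNotification = function(type, source, data) {
	Notification.call(this, type, source, data);
};

GenericNotification.prototype = Object.create(Notification.prototype);
GenericNotification.prototype.type = 'Generic';

module.exports = GenericNotification;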

//And in the worker class, implement this.notify
var Worker = function(worker, instance) { 
...
	this.notify = function(notification) {
		console.log('[%s:%d] Received notification from %s: %s', this.name, this.instance, notification.source, notification.type);
		switch(notification.type) {
			case 'Generic':
				break;
			case 'Example':
				break;
			default:
				break;
		}
	};
}
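
To make the state-keeping idea concrete, a worker might fill in one of those switch cases like this (the lastNotified field is my own illustrative addition, not from the original worker):

//Hypothetical state update inside this.notify - illustrative only
this.notify = function(notification) {
	switch (notification.type) {
		case 'Generic':
			//remember when the load balancer last notified this worker
			this.lastNotified = notification.created;
			break;
		default:
			break;
	}
};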

That’s pretty much all there is to it. Now when we receive data of a certain type, we can tell all workers to update a state variable. Neat!

[GenericLoadBalancer] Starting up Load Balancer on Port 5005 with 2 workers.
[GenericLoadBalancer] Server opening socket and listening on port 5005
[GenericLoadBalancer] Finished starting up.  Instantiating workers.
Instantiating Worker Generic Worker 0
Instantiating Worker Generic Worker 1
[GenericLoadBalancer] Received a message from 1.1.0.1:65432 @ [Fri Sep 12 2014 15:11:52 GMT-0700 (PDT)] of size [48] bytes.
[GenericLoadBalancer] Received a message from 1.2.0.2:65432 @ [Fri Sep 12 2014 15:11:52 GMT-0700 (PDT)] of size [48] bytes.
[Generic Worker:1] Received notification from GenericLoadBalancer: Generic - WAKE UP LAZY WORKERS!
[Generic Worker:2] Received notification from GenericLoadBalancer: Generic - WAKE UP LAZY WORKERS!
[Generic Worker:1] Received notification from GenericLoadBalancer: Generic - WAKE UP LAZY WORKERS!
[Generic Worker:2] Doing work [0].
[Generic Worker:1] Doing work [0].

Clustering in Node.js (Part 1) – Round-robin load balancing in a node cluster

As we start to scale out our analytics service here, we’ve started thinking of ways to leverage node’s very good event-driven model across multiple cores and eventually multiple machines. Node clustering is still in its nascent stages, so there’s not a whole lot of functionality and it relies on OS-level forking to “scale out.” While this works in principle, it means that clustering in node is not native per se and that node’s cluster module essentially just provides a weak layer of abstraction across several node processes that are effectively launched from your command line.

Although node already implements load balancing for HTTP requests in its cluster module, it doesn’t handle load balancing for other protocols. Node’s original approach also had some quirks, since it relied on the OS to wake up dormant processes (see http://strongloop.com/strongblog/whats-new-in-node-js-v0-12-cluster-round-robin-load-balancing/). Hence, in Node.js v0.11.2, the developers at Joyent decided to move to a round-robin scheme. Round-robin is not terribly sophisticated and makes no judgment about the type of work a worker is currently processing. While it works well when each request requires essentially the same amount of work, it’s not so good when the incoming work varies in nature. At least it’s conceptually simple and simple to implement.

My first major stumbling block with node’s cluster module was that this load balancing isn’t supported for UDP. Furthermore, we receive analytics data over a range of protocols (e.g. SFTP, TCP/IP, FTP, etc.), so I didn’t want to rely on node’s native libraries in case no one thought it important enough to extend load balancing to a variety of protocols. Thus, I had to come up with my own generalized architecture for load balancing. Have a look:


var dgram = require('dgram'),
	cluster = require('cluster'),
	WorkerPrototype = require('../workers/WorkerPrototype.js'),
	WorkerDownError = require('../exceptions/WorkerDownError.js'),
	net = require('net');

// Private Functions
var protocols = {
	'udp': {
		'start': listen_udp,
		'stop': shutdown_udp
	},
	'tcp_server': {
		'start': open_tcp_server,
		'stop': shutdown_tcp_server
	},
	'tcp_client': {
		'start': open_tcp_client,
		'stop': shutdown_tcp_client
	}
};

function listen(callback) {
	protocols[this.protocol].start.call(this, callback);
}

function shutdown(callback) {
	protocols[this.protocol].stop.call(this, callback);
}

//This function will restart the server every retry interval if restart is enabled
function restart(callback) {
	var self = this;
	console.log('[%s] Trying restart in %d ms ...', self.name, self.retrydelay);
	setTimeout(function() {
		self.retrydelay = Math.floor(self.retrydelay * 1.5);
		listen.call(self, callback);
	}, self.retrydelay );
}

//UDP Handling
function listen_udp(callback) {
	var self = this;
	var server = dgram.createSocket('udp4');
	server.on('error', function (err) {
		console.log('[%s] Error:\n' + err.stack, self.name);
		try {
			server.close();
		} catch (closeErr) {
			if (closeErr) {
				console.log('[%s] Error closing UDP socket: %s', self.name, closeErr);
			}
		} finally {
			if (self.restart) {
				restart.call(self, callback);
			} else {	//otherwise, just call the callback
				if (callback)
					callback(err);
			}
		}
	});

	server.on('message', function (msg, rinfo) {
		console.log('[%s] Received a message from ' + rinfo.address + ':' + rinfo.port + ' @ [%s] of size [%d] bytes.', self.name, new Date(), msg.length);
		self.counter++;
		self.distribute.call(self, self.workers, {
			data: msg, 
			rinfo: rinfo
		});
	});

	server.bind(self.port, function() {
		console.log('[%s] Server opening socket and listening on port ' + self.port, self.name);
	});

	server.on('listening', function () {
		var address = server.address();
		var msg = '[' + self.name + ']  SUCCESS! UDP socket now listening @ ' +
			address.address + ':' + address.port;
		if (callback) {
			self.retrydelay = self.restartdelay;
			callback(null, msg);
		}
	});

	self.server = server;
	return server;
}

function shutdown_udp(callback) {
	var self = this;
	if (self.server) {
		self.server.close();
		var msg = '[' + self.name + '] Load Balancer shutdown successfully.';
		if (callback)
			callback(null, msg);
	} else {
		var err = '[' + self.name + '] Load Balancer not defined!  Could not be shutdown.';
		if (callback)
			callback(err);
	}
}

//TCP Server - accepts connections from TCP clients
function open_tcp_server(callback) {
   //implementation goes here
}

function shutdown_tcp_server(callback) {
   //implementation goes here
}

function open_tcp_client(callback) {
   //implementation goes here
}

function shutdown_tcp_client(callback) {
   //implementation goes here
}

//Private Load Balancer
function instantiateWorkers() {
	for (var i = 0; i < this.instances; i++) {
		console.log('Instantiating Worker %s %d', this.worker_name, i);
		var worker = cluster.fork({
			name: this.worker_name,
			id: i
		});
	    this.workers.push(worker);
	}
}

/**
* Load Balancer is a class that wraps functionality for opening/closing a UDP or TCP connection.
* It also implements default load balancing based on the number of workers specified.
*
* @class LoadBalancer
* @constructor
* @param {String} name The name for this load balancer
* @param {Number} port The port number
* @param {String} protocol Either udp or tcp at the moment; can be extended later on
* @param {String} worker_name Give each worker a name
* @param {Number} instances The number of worker processes to fork
* @param {Boolean} retryonerror If true, the balancer will try to reopen its socket upon error
*/
var LoadBalancer = function(name, port, protocol, worker_name, instances, retryonerror) {

	if (!port || !protocol) {
		throw new Error('Could not instantiate load balancer.  Port/Protocol/Workers undefined.');
	}

	if (instances <= 0) {
		throw new Error('Not enough worker instances for workers.  Failed to start.');
	}

	if (!protocols[protocol]) {
		throw new Error('Could not instantiate load balancer.  Unknown protocol %s', protocol);
	}

	//Set by User
	this.name = name;
	this.port = port;
	this.protocol = protocol;
	this.worker_name = worker_name;
	this.instances = instances;
	this.restart = retryonerror ? retryonerror : false;	//if true, will try to reopen connection upon error
	this.restartdelay = 5000;	//will multiply by 1.5 each time and reset when connection is reestablished

	//Private members
	this.workers = [];
	this.server = undefined;
	this.counter = 0;
	this.retrydelay = this.restartdelay;

	var self=this;

	this.start = function(callback) {
		listen.call(self, function(err, msg) {
			self.counter = 0;	//reset counter
			if (!err)
				instantiateWorkers.call(self);

			if (callback) {
				callback(err, msg);
			}
		});
	}

	this.stop = function(callback) {
		shutdown.call(self, function(err, msg) {
			if (err) 
				console.log(err);
			if (callback)
				callback(err, msg);
		});
	}

	//Named restartServer so it doesn't clobber the this.restart flag set above
	this.restartServer = function() {
		self.stop(function(err, msg) {
			if (err) {
				throw new Error(err);
			} else {
				self.start();
			}
		});
	}
};

/**
 * Default implementation of distribute - sequential cycling through workers
 *
 * @method distribute
 * @param {Array[cluster.Worker]} workers The array of workers managed by this LoadBalancer.  
 * @param {Object} data An object containing the data received on the port for this Load Balancer.  Ex:  UDP would send data {data: msg, rinfo: rinfo}
 */
LoadBalancer.prototype.distribute = function(workers, data) {
	var self = this;
	var worker = workers[self.counter % workers.length];
	try {
		if (worker) {
			worker.send(data);
		}
	} catch (e) {
		if (e instanceof WorkerDownError) {
			console.log('Worker %s is down.  Removing from workers for [%s:%d]', worker.name, self.name, self.instance);
			workers.splice(workers.indexOf(worker), 1);
		} else {
			throw e;
		}
	}
};

module.exports = LoadBalancer;

Basically, this prototype does a few things:

    1. Implements generalized servers by protocol to receive data

You can see that when one instantiates or extends the load balancer prototype, the user will need to specify the protocol to use. Since things like ports and instances of workers are common to all load balancers, these are private attributes of the load balancer instance. Essentially, all servers under this model work in the same way. A master server is instantiated which receives all data over a socket. When the server receives a request or message, it will pass it off to one of N workers (specified by the user) which is running in its own process. That worker then does the work.
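For instance, the TCP server variant (stubbed out as open_tcp_server/shutdown_tcp_server above) might be fleshed out roughly like this. This is a minimal sketch under the same conventions as listen_udp, not the actual implementation:

function open_tcp_server(callback) {
	var self = this;
	var server = net.createServer(function(c) {
		c.on('data', function(data) {
			self.counter++;
			self.distribute.call(self, self.workers, {
				remote_address: c.remoteAddress,
				data: data
			});
		});
	});
	server.on('error', function(err) {
		if (self.restart) {
			restart.call(self, callback);
		} else if (callback) {
			callback(err);
		}
	});
	server.listen(self.port, function() {
		if (callback) {
			self.retrydelay = self.restartdelay;
			callback(null, '[' + self.name + '] TCP server listening on port ' + self.port);
		}
	});
	self.server = server;
	return server;
}

function shutdown_tcp_server(callback) {
	var self = this;
	if (self.server) {
		self.server.close(function() {
			if (callback)
				callback(null, '[' + self.name + '] TCP server shut down.');
		});
	} else if (callback) {
		callback('[' + self.name + '] TCP server not defined!  Could not be shutdown.');
	}
}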

    2. Implements default methods to start, stop and restart servers

A user doesn’t need to concern herself about the starting, stopping and restarting of servers should a socket go down. That’s why there is some dummy logic for reopening a socket upon error (e.g. if the socket is flooded and closes). The default mechanism is to try reopening the socket after 5 seconds, which increases by 50% each time the socket fails to open.

    3. Can take an arbitrary number of “Workers” to handle the load

The load balancer wouldn’t be of much use if it handled the entire workload itself. This prototype (and extended versions of it) actually instantiates itself N times. I found this to be a quirk of node: when I launched the workers as separate processes (by pointing the cluster settings at separate javascript files), the master could not communicate with them, even when I passed a reference to the cluster to each worker process. The only way I could get communication to work was to fork the workers from the same node module. This behavior struck me as strange, and it makes for some odd-looking code (i.e. the default implementation of node clustering, which contains a massive if statement in a single javascript file representing not only the cluster but also all of its workers).
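
For readers unfamiliar with that pattern, this is roughly the shape node’s cluster module pushes you toward: one file, one big branch (a generic sketch, not code from this project):

var cluster = require('cluster');

if (cluster.isMaster) {
	//master branch: fork N workers from this very same file
	for (var i = 0; i < 4; i++) {
		cluster.fork();
	}
} else {
	//worker branch: same file, entirely different role
	process.on('message', function(msg) {
		//do the actual work here
	});
}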

    4. Offers a generic round-robin load balancer irrespective of protocol

The distribute function bound to the load balancer prototype is called whenever data is received by the master process. I added a default implementation so users get round-robin behavior out of the box, but it is easy to override: to implement a different type of load balancing, just bind a new distribute() function to the prototype and everything else keeps working.
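
For example, swapping in a random-assignment strategy would look something like this (a sketch of the override mechanism only; the strategy itself is arbitrary):

//Hypothetical override: pick a random worker instead of round-robin
LoadBalancer.prototype.distribute = function(workers, data) {
	var worker = workers[Math.floor(Math.random() * workers.length)];
	if (worker) {
		worker.send(data);
	}
};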

The last implementation step for the user is to extend the load balancer and specify what the workers do whenever data is received across the wire. A default implementation might look like this:

var cluster = require('cluster'),
	LoadBalancer = require('./LoadBalancerPrototype.js'),
	UDPWorker = require('../workers/UDPWorker.js');

var UDPLoadBalancer = function(name, port, worker_name, instances, retryonerror) {
	LoadBalancer.call(this, name, port, 'udp', worker_name, instances, retryonerror);

	console.log('Instantiating UDP Load Balancer on port %d with %d workers', port, instances);
};

UDPLoadBalancer.prototype = Object.create(LoadBalancer.prototype);
UDPLoadBalancer.prototype.name = 'UDP Load Balancer';
//CONFIG: To be set by the user, define the type of worker
UDPLoadBalancer.prototype.worker = UDPWorker;


// Custom distribution function - sequential.
// Overwrite this function to specify how you want to distribute the load.
// parameters msg, rinfo change depending on the protocol of the Load Balancer
// This is just an example - it is not strictly necessary

// UDPLoadBalancer.prototype.distribute = function(workers, msg, rinfo) {
// };


if (cluster.isMaster) {
	var port = 5005, instances = 2, retry = false;
	for (var i = 0; i < process.argv.length; i++) {
		if (process.argv[i] == '-p') {
			port = process.argv[i+1] ? parseInt(process.argv[i+1], 10) : port;
		}

		if (process.argv[i] == '-i') {
			instances = process.argv[i+1] ? parseInt(process.argv[i+1], 10) : instances;
		}

		if (process.argv[i] == '-r') {
			retry = true;
		}
	}

	var udpLoadBalancer = new UDPLoadBalancer('UDP Balancer 1', port, UDPWorker.prototype.name, instances, retry);
	udpLoadBalancer.start(function(err, msg) {
		if (err) {
			console.log(UDPLoadBalancer.prototype.name + ' error starting up: ' + err);
			udpLoadBalancer.stop(function(err, msg) {
				console.log('UDP load balancer shut down: ' + err + '   ' + msg);
			});
		} else {
			//successful
		}
	});

} else {
	var worker = new UDPLoadBalancer.prototype.worker(cluster.worker, cluster.worker.id);

	cluster.worker.on('online', function() {
		console.log('[%s:%d] is online.', worker.name, worker.instance);
	});
	//msg is of format {data: udp_msg, rinfo: rinfo}
	cluster.worker.on('message', function(msg) {
		worker.doWork(msg.data, msg.rinfo);
	});
}

module.exports = UDPLoadBalancer;

And here’s an example implementation of a worker that handles the load:

var WorkerPrototype = require('./WorkerPrototype.js'),
	WorkerDownError = require('../exceptions/WorkerDownError.js'),
	Decoder = require('../controllers/Decoder.js');

var UDPWorker = function(worker, instance) {
	WorkerPrototype.call(this, worker, UDPWorker.prototype.name, instance);

	//implementations need to specify what work to do upon receiving a message
	this.doWork = function(msg, rinfo) {
		if (!this.worker || this.worker.suicide === true) {
			throw new WorkerDownError('[%s:%d] Worker is down. Not doing work.', this.name, this.instance);
		}
		console.log('[%s:%d] Doing work [%d].', this.name, this.instance, this.counter);
	}
};

UDPWorker.prototype.name = 'UDP Worker';
UDPWorker.prototype.getPath = function () { 
	return __filename; 
}

module.exports = UDPWorker;

That’s pretty much all there is to it. You can see an example of messages being sent across the wire and how load is distributed for UDP:

Instantiating Worker UDP Worker 0
Instantiating Worker UDP Worker 1
Instantiating Worker UDP Worker 2
[UDP Balancer 1] Received a message from 1.1.0.1:65432 @ [Thu Sep 04 2014 14:27:35 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:2] Doing work [0].
[DECODER]: Decoding udp message with timestamp: 1409866055940.
[UDP Balancer 1] Received a message from 1.1.0.2:65432 @ [Thu Sep 04 2014 14:27:36 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:3] Doing work [0].
[DECODER]: Decoding udp message with timestamp: 1409866056190.
[UDP Balancer 1] Received a message from 1.1.0.3:65432 @ [Thu Sep 04 2014 14:27:36 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:1] Doing work [0].
[DECODER]: Decoding udp message with timestamp: 1409866056439.
[UDP Balancer 1] Received a message from 1.1.0.4:65432 @ [Thu Sep 04 2014 14:27:36 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:2] Doing work [1].
[DECODER]: Decoding udp message with timestamp: 1409866056687.
[UDP Balancer 1] Received a message from 1.1.0.5:65432 @ [Thu Sep 04 2014 14:27:36 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:3] Doing work [1].
[DECODER]: Decoding udp message with timestamp: 1409866056937.
[UDP Balancer 1] Received a message from 1.1.0.6:65432 @ [Thu Sep 04 2014 14:27:37 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:1] Doing work [1].
[DECODER]: Decoding udp message with timestamp: 1409866057187.
[UDP Balancer 1] Received a message from 1.2.0.1:65432 @ [Thu Sep 04 2014 14:27:37 GMT-0700 (PDT)] of size [204] bytes.
[UDP Worker:2] Doing work [2].
[DECODER]: Decoding udp message with timestamp: 1409866057214.
[UDP Balancer 1] Received a message from 1.1.0.7:65432 @ [Thu Sep 04 2014 14:27:37 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:3] Doing work [2].
[DECODER]: Decoding udp message with timestamp: 1409866057437.
[UDP Balancer 1] Received a message from 1.2.0.2:65432 @ [Thu Sep 04 2014 14:27:37 GMT-0700 (PDT)] of size [204] bytes.
[UDP Worker:1] Doing work [2].
[DECODER]: Decoding udp message with timestamp: 1409866057463.
[UDP Balancer 1] Received a message from 1.1.0.8:65432 @ [Thu Sep 04 2014 14:27:37 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:2] Doing work [3].
[DECODER]: Decoding udp message with timestamp: 1409866057687.
[UDP Balancer 1] Received a message from 1.2.0.3:65432 @ [Thu Sep 04 2014 14:27:37 GMT-0700 (PDT)] of size [204] bytes.
[UDP Worker:3] Doing work [3].
[DECODER]: Decoding udp message with timestamp: 1409866057714.
[UDP Balancer 1] Received a message from 1.1.0.9:65432 @ [Thu Sep 04 2014 14:27:37 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:1] Doing work [3].
[DECODER]: Decoding udp message with timestamp: 1409866057938.
[UDP Balancer 1] Received a message from 1.2.0.4:65432 @ [Thu Sep 04 2014 14:27:37 GMT-0700 (PDT)] of size [204] bytes.
[UDP Worker:2] Doing work [4].
[DECODER]: Decoding udp message with timestamp: 1409866057965.
[UDP Balancer 1] Received a message from 1.1.0.10:65432 @ [Thu Sep 04 2014 14:27:38 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:3] Doing work [4].
[DECODER]: Decoding udp message with timestamp: 1409866058188.
[UDP Balancer 1] Received a message from 1.2.0.5:65432 @ [Thu Sep 04 2014 14:27:38 GMT-0700 (PDT)] of size [204] bytes.
[UDP Worker:1] Doing work [4].
[DECODER]: Decoding udp message with timestamp: 1409866058215.
[UDP Balancer 1] Received a message from 1.1.0.11:65432 @ [Thu Sep 04 2014 14:27:38 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:2] Doing work [5].
[DECODER]: Decoding udp message with timestamp: 1409866058437.
[UDP Balancer 1] Received a message from 1.2.0.6:65432 @ [Thu Sep 04 2014 14:27:38 GMT-0700 (PDT)] of size [204] bytes.
[UDP Worker:3] Doing work [5].
[DECODER]: Decoding udp message with timestamp: 1409866058465.
[UDP Balancer 1] Received a message from 1.1.0.1:65432 @ [Thu Sep 04 2014 14:27:38 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:1] Doing work [5].
[DECODER]: Decoding udp message with timestamp: 1409866058655.
[UDP Balancer 1] Received a message from 1.1.0.12:65432 @ [Thu Sep 04 2014 14:27:38 GMT-0700 (PDT)] of size [196] bytes.
[UDP Worker:2] Doing work [6].
[DECODER]: Decoding udp message with timestamp: 1409866058689.
[UDP Balancer 1] Received a message from 1.2.0.7:65432 @ [Thu Sep 04 2014 14:27:38 GMT-0700 (PDT)] of size [204] bytes.
[UDP Worker:3] Doing work [6].

A simple n-queens solution in javascript

N Queens (8)

I was recently playing around with the n queens problem and decided to try to solve it. I’d never done it before, but figured I could use a depth-first-search kind of solution. I ended up building a solution in node.js which identifies all potential solutions, then ported it to angular.js to run on the front end so users can play around and visualize the solutions.

Essentially, the algorithm works like this:

  • If there’s only 1 queen left to place and at least one valid position remaining, then it’s a valid solution. If it’s a unique solution, let’s store it.
  • If there is more than one queen left to place and the number of valid positions left is less than the number of queens, then this is the wrong path; let’s backtrack.
  • If neither applies, let’s iterate through all the potential moves in the adjacent column, then recompute the possible positions from that grid state.
  • Finally, after we place the queen, we recursively call the algorithm with one less queen, the new grid state, the recomputed valid positions and the next column.

I haven’t had much time to optimize, but a few ideas I could possibly do:

  1. Since a chess board is symmetrical, I probably don’t need to check all four corners (since they’ll yield the same result if the board is rotated 90 degrees)
  2. Use a stack instead for managing the valid positions instead of a simple array. Right now I actually make a copy of all new possible positions after a queen is placed.
  3. Use a better way to identify unique solutions than concatenating a string and comparing each new solution that comes in

A couple optimizations that already helped:

  1. pruning next potential positions to only the starting column + 1 (enormous performance enhancement)
  2. pruning routes when number of queens left > number of open positions
  3. cutting down the length of the unique string when comparing objects

In any case, it was a fun little experiment. Here’s the source code, and here’s where you can see it working in jsfiddle. Warning: Angular tends to run slowly and blocks while you compute, so don’t try to run it with too many queens (> 12 or 13), or you may freeze your browser.

N Queens – http://jsfiddle.net/rocketegg0/wu6cpp5v/

Some average run times:

Num Queens | Solutions | # Comparisons | Run Time (ms)
         4 |         2 |            12 | <1
         5 |        10 |            41 | 1
         6 |         4 |           138 | 1
         7 |        40 |           473 | 1
         8 |        92 |          1758 | 7
         9 |       352 |          7077 | 16
        10 |       724 |         30654 | 45
        11 |      2680 |        142755 | 360
        12 |     14200 |        734048 | 4936

//Simple coordinate pair used for board positions
function Pair(x, y) {
    this.x = x;
    this.y = y;
}

function Grid(width, height) {
        this.width = width;
        this.height = height;
        var grid = [];
        var validPositions = [];
    
        for (var i = 0; i < width; i++) {
            grid.push([]);
            for (var j = 0; j < height; j++) {
                grid[i][j] = '_';
                validPositions.push(new Pair(i,j))
            }
        }
    
        this.getValidPositions = function() {
            return validPositions;
        }
    
        this.getGrid = function() {
            return grid;
        }
        
        //Position validation
        function isQueen(pair, x, y) {
            return pair.x == x && pair.y == y;
        }
        
        function isInRow(pair, y) {
            return pair.y == y;
        }
        
        function isInCol(pair, x) {
            return pair.x == x;
        }
        
        //Ex: 
        //3, 2 > 4, 1 = -1, 1
        //3, 2 > 4, 3 = -1, -1
        //3, 2 > 2, 1 = 1, 1
        //3, 2 > 2, 3 = 1, -1
        function isInDiagonal(pair, x, y) {
            if (Math.abs(pair.x - x) == Math.abs(pair.y - y)) {
                return true;
            }
            return false;
        }
        
        function testPosition(pair, x, y) {
            if (isQueen(pair, x, y) ||
                isInRow(pair, y) ||
                isInCol(pair, x) ||
                isInDiagonal(pair, x, y)) {
                return true;
            }
            return false;
        }
    
        function recomputeValid(validPositions, queenX, queenY) {
            var newValid = [];
            for (var i = 0; i < validPositions.length; i++) {
                if (!testPosition(validPositions[i], queenX, queenY)) {
                    newValid.push(validPositions[i]);
                }
            }
            return newValid;
        }
        
        function gridToString(grid) {
            var gridstr = '';
            for (var i = 0; i < grid.length; i++) {
                gridstr += '['
                for (var j = 0; j < grid[0].length; j++) {
                    gridstr += grid[i][j];
                    if (j < grid.length-1) {
                        gridstr += '|'
                    }
                }
                gridstr += ']'
                gridstr += '\n';
            }
            return gridstr;
        }
    
        var printGrid = function (grid) {
            console.log(gridToString(grid));
        }
    
        var solutions = [];
        $scope.solution_grids = [];
    
        function convertToSolution(grid) {
            var str = '';
            for (var i = 0; i < grid.length; i++) {
                for (var j = 0; j < grid.length; j++) {
                    if (grid[i][j] == 'Q') {
                        str += i + '#' + j;
                    }
                }
            }
            return str;
        }
        
        this.numcomputations = 0;
    
        this.solve = function (numQueens, grid, validPositions, startcol) {
            if (numQueens <= 1 && validPositions.length > 0) {
                grid[validPositions[0].x][validPositions[0].y] = 'Q';
                var solution = convertToSolution(grid);
                if (solutions.indexOf(solution) == -1) {
                    solutions.push(solution);
                    $scope.solution_grids.push(gridToString(grid));
                }
                grid[validPositions[0].x][validPositions[0].y] = '_'; //reset
                return true;
            } else if (numQueens > validPositions.length) {
                return false;
            } else {
                var x, y;
                var nextcol = validPositions.filter(function(point) {
                    return point.x == startcol + 1 ? true : false;
                });    //prune routes to only next col
                for (var i = 0; i < nextcol.length; i++) {
                    x = nextcol[i].x, y = nextcol[i].y;
                    grid[x][y] = 'Q';
                    this.solve(numQueens - 1, grid,
                               recomputeValid(validPositions, x, y), startcol+1);
                    grid[x][y] = '_';	//reset
                    this.numcomputations++;
                }
            }
        }
    }
    
    $scope.solve = function() {
        var nqueens = $scope.sides;
        var grid = new Grid(nqueens, nqueens);
        var startTime = new Date().getTime();
        console.log('Start Time: ' + startTime);
        grid.solve(nqueens, grid.getGrid(), grid.getValidPositions(), 0);
        $scope.numcomputations = grid.numcomputations;
        $scope.endTime = new Date().getTime() - startTime;
        console.log('End Time: ' + ($scope.endTime) + 'ms');
    };    

Mongoose / MongoDB performance enhancements and tweaks (Part 3)


Holy crap. MongoDB native drivers are SO much faster than updates through mongoose’s ORM.  Initially, when I set out on this quest to enhance mongoDB performance with node.js, I thought that modifying my queries and limiting the number of results returned would be sufficient to scale. I was wrong (thanks Bush for the imagery). It turns out that the overhead that mongoose adds for wrapping mongoDB documents within mongoose’s ORM is tremendous. Well, I should say tremendous for our use case.

In my previous two posts of tweaking mongoDB / mongoose for performance enhancements (Part 1 and Part 2), I discussed optimization of queries or making simple writes instead of reads. These were worthwhile improvements and the speed difference eventually added up to significant chunks, but I had no idea moving to the native driver would give me these types of improvements (See below).

Example #1: ~400 streams, insertion times.
These numbers are from after I made the initial tweaks in Part 1. Unfortunately, I don’t have that good a printout from mongotop, but this gives you an idea. Look at the write times for streams and packets, flowing at a rate of ~400 streams. This is for 400 sources of packets, all of which get written and persisted. Here you can see the write time to streams is @ 193ms / 400 streams, or 48.25 ms / 100 streams. Likewise, packet writing is 7.25 ms / 100 streams. (You can mostly ignore read time; it’s used for data aggregates and computing analytics.) Compare these with the results below:

ns      | total | read | write
streams | 193ms | 0ms  | 193ms
packets | 30ms  | 1ms  | 29ms
devices | 9ms   | 9ms  | 0ms

Example 2: ~1000 streams, insertion times.
You can see here that write time has dropped significantly. Writes to the packets collection hover at around 1.7 ms / 100 streams, and writes to the streams collection hover at around 7.6 ms / 100 streams. Respectively, that’s roughly a 4.3x and a 6.3x improvement in query write times to the packets and streams collections. And don’t forget, I had already begun the optimizations to mongoose. Even after the tweaks I made in Part 2, these numbers still represent a better than 100% improvement to query times. Huge, right?

ns      | total | read  | write
packets | 186ms | 169ms | 17ms
devices | 161ms | 159ms | 2ms
streams | 97ms  | 21ms  | 76ms

I knew using the mongoDB native drivers would be faster, but I hadn’t guessed that they would be this much faster.  

To make these changes, I updated mongoose to the latest version, 3.8.14, which enables queries to be made using the native mongoDB driver released by 10gen (github here: https://github.com/mongodb/node-mongodb-native) via Model.collection methods. These in turn call methods defined in node_modules/mongodb/lib/mongodb/collection/core.js, which essentially just execute raw commands in mongo. Using these native commands, one can take advantage of things like bulk inserts.
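
As a quick sketch of what that looks like in practice, here is a bulk insert through the native collection (the Packet model and the packets array are hypothetical, for illustration only):

var Packet = mongoose.model('Packet');

//packets is a hypothetical array of plain json documents
Packet.collection.insert(packets, {safe: true}, function(err, docs) {
	if (err) throw err;
	//no mongoose document wrapping or middleware here - just raw inserts
});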

I still like mongoose, because it helps instantiate the same object whenever you need to create and save something. If something isn’t defined in the mongoose.Schema, that object won’t get persisted to mongoDB either. Furthermore, it can still be tuned to be semi-quick, so it all depends on the use case. It just so happens that when you’re inserting raw json into mongoDB or don’t need the validation and other middleware that mongoose provides, you can use the mongoDB native drivers while still using mongoose for the good stuff. That’s cool.

Here’s what the new improvements look like:

    var Stream = mongoose.model('Stream');
    //query1, query1set, query2, query2set and pushNewStream are defined elsewhere
    async.waterfall([
        //Native mongoDB update returns # docs updated; update by default updates 1 document
        function (callback) {
            Stream.collection.update(query1, query1set, {safe: true}, function(err, writeResult) {
                if (err) throw err;
                if (writeResult == 1) {
                    callback('Found and updated call @ Query1');
                } else {
                    callback(null);
                }
            });
        },
        function(callback) {
            Stream.collection.update(query2, query2set, {safe: true}, function(err, writeResult) {
                if (err) throw err;
                if (writeResult == 1) {
                    callback('Found and updated stream @ Query2');
                } else {
                    pushNewStream(packet, cb);
                    callback('No stream found.  Pushing new stream.');
                }
            });

        }
    ], function(err, results) {});

Mongoose / MongoDB speed improvements (Part 2)

In a previous post MongoDB performance enhancements and tweaks, I described some techniques I’ve found to speed up mongoose inserts and updates on large volume performance statistic data. I was still seeing performance bottlenecks in MongoDB, especially after running a node cluster for our data analytics. Since node is now spread to multiple cores (scaling “horizontally”, to be described in another post), the writes generally come to MongoDB much faster. The main question for me was whether it was mongoose, the node-mongoDB driver or mongo per se slowing me down.

The problem:

When we get performance data, we start tracking it by creating a “stream” object. The stream object is first created with a unique identifier, and subsequent packets that come in update the stream: it stores the highest packet number seen so far and updates its timestamp with each packet’s timestamp. Later on, when we stop seeing packets flow for a particular stream, we time it out so analytics can be computed for all packets that came in from the beginning of the stream to the end.
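
To make the shape of that object concrete, the stream schema implied by the queries below would look something like this (a sketch with field names guessed from the queries; not the actual schema):

var mongoose = require('mongoose');

//Hypothetical sketch of the stream schema implied by the queries below
var StreamSchema = new mongoose.Schema({
	from: { ID: String, IP_ADDRESS: String, highestPacketNum: Number },
	to:   { ID: String, IP_ADDRESS: String, highestPacketNum: Number },
	stream_ended: { from: Boolean, to: Boolean },
	endTime: Number,
	metadata: { lastUpdated: Date }
});

mongoose.model('Stream', StreamSchema);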

My initial implementation made a series of read queries using mongoose’s find(query), returned all potential matching streams, then updated the matching stream. The source code looked something like this:

function updateStream(packet, cb) {
   var stream = mongoose.model('Stream');
   var query1 = {
        $and: [{
            'from.ID': packet.fromID
        }, {
            'to.ID': packet.toID
        }, {
            'stream_ended.from': false
        }, {
            'from.IP_ADDRESS': packet.IP
        }, {
            'from.highestPacketNum': {
                $lt: packet.highestPacketNum
            }
        }]
    };

   //Since streams are bilateral, we have to do two query reads in order to find the matching stream
   var query2 = {
        $and: [{
            'to.ID': packet.toID
        }, {
            'from.ID': packet.fromID
        }, {
            'stream_ended.to': false
        }, {
            'to.IP_ADDRESS': packet.IP
        }, {
            'to.highestPacketNum': {
                $lt: packet.highestPacketNum
            }
        }]
    };  

   async.waterfall([
      function (callback) {
         stream.find(query1).exec(function(err, calls) { 
            if (calls.length > 1) {  //throw error, there should only be 1 stream
               throw new Error('There should only be one stream with this unique identifier');
            } else if (calls.length == 1) {
               //update calls[0]
               calls[0].save(cb);
               callback(null);
            } else {
               callback('Query1 yielded no results');  //continue down the waterfall
            }
         });
      },
      function (callback) {
         stream.find(query2).exec(function(err, calls) { 
            if (calls.length > 1) {  //throw error, there should only be 1 stream
               throw new Error('There should only be one stream with this unique identifier');
            } else if (calls.length == 1) {
               //update calls[0]
               calls[0].save(cb);
               callback(null);
            } else {
               callback('Query2 yielded no results');  //continue down the waterfall
            }
         });
      }
   ], cb);
}

You can see that this source code was highly inefficient, because mongoose was returning all possible matches for the stream. This was due to a limitation in our packet simulator, which at the time did not spoof unique IDs in the packets it sent. At this point in time, we were capped at around 250 simultaneous streams, running in 1 node process.

Result: ~250 simultaneous streams of data

Improvement #1: Limit the search to the first object found, update it and persist it.

Essentially the source code remained the same, but the mongoose find queries changed from find(query1).exec() to find(query1).limit(1).exec(). With this change, we saw an improvement of around 50%, since mongoose would return after finding the first match. At this point, the blocker shifted back to node.js, and I noticed that at a certain point, the event queue would block up from handling too many events. At the time, node.js was responsible for doing aggregation, decoding the packets, invoking mongoose to store them, and running our REST API as well as serving up static data. I saw my poor little node.js process pegged at 100% trying to churn through all the data. One good thing I noticed, though, is that even when node.js was capped out in terms of resources, it still continued to churn through the data, eventually processing everything given enough time.
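
For reference, the change was just this one-liner (sketched with the same names as above):

//Before: returns every matching stream
stream.find(query1).exec(function(err, calls) { /* update calls[0] */ });

//After: stop at the first match
stream.find(query1).limit(1).exec(function(err, calls) { /* update calls[0] */ });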

Result: ~350 simultaneous streams of data, query time improved by about 50%

Improvement #2: Clustering node.js

This deserves a post in itself, since it required breaking out the various services the single node process was handling into multiple forked processes, each doing its own thing. Suffice it to say, I eventually decided to fork the stream processors into N instances, where N is the number of cores on the machine / 2, with basic load balancing. This let node write to mongoDB much faster, which eventually bogged mongoDB down. Thus the pendulum swung back to mongoDB.
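
The forking itself is simple; the cores / 2 rule described above looks roughly like this (a sketch; names are illustrative):

var cluster = require('cluster'),
	os = require('os');

//fork one stream processor for every two cores, but at least one
var instances = Math.max(1, Math.floor(os.cpus().length / 2));
if (cluster.isMaster) {
	for (var i = 0; i < instances; i++) {
		cluster.fork();
	}
}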

Result: ~400 simultaneous streams of data, query time remains the same, mongoDB topped out.

Improvement #3: mongoDB updates in place

Finally, I realized there was no reason to actually get the stream object returned, since I was just making updates to it. I had also noticed that it was actually the read time in mongotop that spiked when the number of streams increased. This was because the find() functions in mongoose return a mongoose-wrapped mongoDB document, so the update does not happen in place. For simple updates without much logic, there is no point in getting the object back, even when using the .lean() option to get plain json back. Here was the update:

function updateStream(packet, cb) {
    //queries remain the same ...

    async.waterfall([
        function(callback) {
            Stream.findOneAndUpdate(query1, {$set: { 'endTime': packet.timestamp, 'metadata.lastUpdated': new Date(), 'from.highestPacketNum': packet.highestPacketNum }}, {new: false}).exec(function(err, doc) {
                if (doc) {
                    cb(err, doc);
                    callback('Found and updated stream @ Query1');
                } else {
                    callback(null);
                }
            });
        },

        //No matching yet with IP, so try with just swapped SSRCs and update that one
        function(callback) {
            Stream.findOneAndUpdate(query2, {$set: { 'to.IP_ADDRESS': packet.IP, 'endTime': packet.timestamp, 'metadata.lastUpdated': new Date(), 'to.highestPacketNum': packet.highestPacketNum }}, {new: false}).exec(function(err, doc) {
                if (doc) {
                    cb(err, doc);
                    callback('Found and updated stream @ Query2');
                } else {
                    createNewStream(packet, cb);
                    callback('No streams found.  Pushing new stream.');
                }
            });
        }
    ], function(err, results) {

    });
}

It turns out that after this improvement, I saw in mongotop that read time dropped to 0ms, while write time spiked slightly. However, this was by far the biggest improvement in overall query time.

Result: ~600 simultaneous streams of data; overall query time dropped by more than half (with read time dropping to 0). This, combined with the stream processors running on multiple cores, seemed to be a scalable solution that would significantly increase our capacity. But I noticed that at around 600 simultaneous streams of data, our stream creation would suddenly spike, and continue increasing.

Improvement #4: MongoDB upgrade to 2.6.x and query improvement using $max field update operator

For all you readers who like mysteries: can you guess what was causing the stream creation to spike? I spent a long time thinking about it. For one, mongotop didn’t seem to be topped out in terms of the total query time on the streams collection. I noticed spikes of up to 400 or so ms for total query time, but it seemed by and large fine. Node.js was running comfortably at around 40% – 50% cpu usage on each of the four cores. So if everything seemed fine, why was stream creation spiking?

The answer, it turns out, was a race condition caused by the processing of the second packet of the simultaneous stream before the first packet could be used to instantiate a new stream object. At a certain point, when enough simultaneous streams were incoming, the delay in creation of the new stream would eclipse the duration between the first and second packets of that same stream. Hence, everything after this point created a new stream.

I thought for a while about a solution, but I kept getting hung up on either processing the incoming packets synchronously, which would significantly decrease our throughput, or using a “store and forward” approach where a reconciliation process would go back and match up streams. To be fair, I still have this problem, but I was able to reduce its occurrence to a significant extent. Because there’s no guarantee that we handle the packets in order, I updated our query to use the $max field update operator, which only updates the stream’s highest packet number if a packet with the same IDs and a higher packet number comes in. This in turn reduced query time, because I no longer had to query for a stream with a lower packet number than the incoming packet. After this update, the reduced per-query time significantly reduced the total query time on the collection and at the same time helped the race condition issue.

function updateStream(packet, cb) {
    var query1 = {
        $and: [{
            'from.ID': packet.fromID
        }, {
            'to.ID': packet.toID
        }, {
            'stream_ended.from': false
        }, {
            'from.IP_ADDRESS': packet.IP
        }]
    };

    var query2 = {
        $and: [{
            'to.ID': packet.fromID
        }, {
            'from.ID': packet.toID
        }, {
            'stream_ended.to': false
        }]
    };

    async.waterfall([

        //First see if there is a matching stream object
        function(callback) {
            Stream.findOneAndUpdate(query1, {
                $set: { 'endTime': packet.timestamp, 
                        'metadata.lastUpdated': new Date()
                    },
                $max: {
                    'from.highestPacketNum': packet.highestPacketNum
                }
            }, {new: false}).exec(function(err, doc) {
                if (doc) {
                    cb(err, doc);
                    callback('Found and updated stream @ Query1');
                } else {
                    callback(null);
                }
            });
        },

        function(callback) {
            Stream.findOneAndUpdate(query2, {
                $set: { 'to.IP_ADDRESS': packet.IP, 
                        'endTime': packet.timestamp, 
                        'metadata.lastUpdated': new Date()
                    },
                $max: {
                    'to.highestPacketNum': packet.highestPacketNum
                }
            }, {new: false}).exec(function(err, doc) {
                if (doc) {
                    cb(err, doc);
                    callback('Found and updated stream @ Query2');
                } else {
                    pushNewStream(packet, cb);
                    callback('No stream found.  Pushing new stream.');
                }
            });
        }
    ], function(err, results) {

    });
    return true;
}

Note that the threshold for the race condition is just higher with this approach; it is not completely solved. If enough streams come in, I’ll still eventually hit the race condition where the stream is not instantiated before the second packet is processed. I’m still not quite sure what the best solution is here, but as with everything, improvement is an incremental process.

Result: ~1000 simultaneous streams, query time dropped by 100%, 4 cores running at 40% – 50% cpu.

A candid look at biglaw and big law firms from a biglaw dropout

Leaving biglaw behind

To any readers out there, whether you’re a biglaw attorney, law student, engineer, working professional considering law school or spambot, here’s an update on my life just over a year after my exodus from biglaw:

Giving notice

I still remember the day I gave notice to my boss, a young and rising partner in my law firm, that I was leaving for a startup. My former partner, a highly intelligent and skilled attorney, of course sensed my reasons for leaving. But instead of speaking candidly about my experience, we spoke through the layers of etiquette built up over time by biglaw attorneys.

I told him how much I enjoyed working with my fellow associates and for him, how much I respected him and appreciated his mentorship, and how the opportunity was too good to pass up. He told me how he appreciated my contributions to the law firm, how I was practicing at a level beyond my seniority and about my bright future at the firm and that he would regret seeing me go. And though we spoke truthfully to one another, we managed to miss the truth entirely.

I didn’t tell him how much I disliked working at the law firm; nor did I tell him that I probably would have accepted any position that gave me the opportunity to leave. I didn’t tell him about the aggregate toll that responding to emails 18 hours a day had taken on my sense of normalcy and happiness. I didn’t tell him how I, in desperation after having worked the first 8 weekends in a row, applied to engineering jobs after just two months in biglaw. I didn’t tell him about the numerous spreadsheets I had created and obsessively updated detailing how much money I would have each month, the day my net worth would be zero, and the day when my ROI on law school would overcome the opportunity cost of giving up my past engineering career. I didn’t tell him how many late nights I had spent in the past 6 months fighting to keep my eyes open while I watched old CS lectures and studied abstract data types and binary tree implementations. I didn’t tell him how scared I was to accept the position, because I seriously doubted the viability of the startup and knew that I was completely leaving behind my legal background, perhaps never to use it again. I didn’t tell him how easy, despite all my fears and doubts, it was for me to make the decision.

The decision to leave

What had caused my desperation to get out after just two months? It wasn’t just the long hours. I thought that might be the case, that perhaps I was just lazy, but in the aftermath of leaving my law firm, I still continued to work 50 hour weeks, with about 20 hours of work a week on my personal projects. In fact, I was still putting in more time than I had at the firm. That actually led me to discover about myself that I don’t shy away from work.

It wasn’t just the unpredictability of the work, although that was a huge contributing factor. It’s really hard to describe what it’s like being tethered to your work phone, on the hook for any quantity of incoming work at any waking hour on any day of the week. I lived in fear of my phone, having been burned many times in the past by “short fuse” deals that needed me to drop everything and work the weekend. Like a traumatized animal, I learned to fear the words “what are you working on at the moment,” knowing that my answer would inevitably lead to more work. It turns out that it only took a few months of Friday night emails asking me to drop everything and work the weekend to break me.

It wasn’t just the nature of the work either, although each deal on which I was staffed meant the drudgery of hundreds of agreements, leases, and licenses being dropped into the dataroom at any time of day, waiting to be reviewed and summarized by me. It wasn’t solely the environment either, although it was astonishing to see the facades of contentment when so many associates were unhappy. You see, biglaw attorneys are exceedingly polite to one another. So polite, in fact, that the real feelings of biglaw attorneys rarely manifest themselves, except to those closest to the associate. This can’t be unknown to biglaw partners, but because associates don’t openly vocalize their discontent, biglaw partners have no incentive to improve conditions. Instead, we pretended to have fun, making casual jokes or observations about current affairs. We had RC car races, eating contests, a goodbye party for every associate or staffer who left. These events made for great PR to those outside the firm, but no one ever mentioned how the attorneys would just sit there, either awkwardly making small talk or checking their mobile devices, waiting for an excuse to get back to work. It was of course all of these things, and more, that brought me to my breaking point.

In the month leading up to my departure, after I had my offer in hand from the startup, I constantly wondered to myself if I was insane for wanting to leave my biglaw career behind. I was at one of the most prestigious law firms in the country, had spent the last 4 years of my life devoted to learning the law, and had a clear ticket to society’s elite and comfortable wealth if I could just put in the time. I sought counsel from others who had made the leap. Perhaps not coincidentally, almost all of them are startup founders. They, for the same reasons as me (and probably all associates), hated the biglaw machine and wanted out. Unlike me, however, they had already mustered the courage and taken the leap. I was still waffling back and forth over whether it would be worth trying to make it another year, or at least six months until I received my year-end bonus. Ultimately, it wasn’t anything any one of those other biglaw deserters told me. In fact, I heard nothing new from them. If you’re a biglaw reader, you’re also likely not reading anything new. No, I knew all the facts, had considered the cost numerous times, and was just trying to convince myself to do something about it. It turns out that I had already made my decision.

Once I decided I was leaving, I knew that nothing my partner or any other associate could say would convince me to stay. I knew that it was the right decision to leave, not in a year, not in six months, but at the very moment I was planning. It wasn’t about the startup or all the work I had put in to get back to the level of engineering competence to get hired. It turns out that it was just about the opportunity to leave; any opportunity to leave was good enough. I’ll never forget the feeling when I gave my notice to that young partner. I didn’t begrudge him at all, not for the weekends he made me work, the workload, or anything else at the firm. I knew it was just part of his job. Instead, my sense of unadulterated joy came from the hope of a better future and of a happier life that leaving instilled. As I told that partner that I was leaving, I felt an enormous weight lifted off my shoulders, like nothing in the world was or could go wrong. I felt emboldened, powerful and, most importantly, free. I joke to my friends that I’ll never have as good a feeling in my life ever again, not unless I become a slave and receive my freedom.

The aftermath

So then, what’s the postscript after leaving? Has my attitude changed in the last year?
No. The short answer is that I have not once regretted leaving biglaw for engineering. I miss some of the things biglaw afforded, like the salary or the prestige of being able to call myself a lawyer. But evaluating things as a whole, I would change nothing about my departure.

In the past year, I’ve set to work on several projects, one of which, Dockumo, I’ve described on this blog. I have other projects in the queue, like building a solution to bluebooking, an insane kind of drudgery that law students subject themselves to, which involves following imprecise rules on how to cite legal works. As I said before, I still work long hours, but the biggest difference is that I’m now working for myself. The work that I do feels like it has purpose, that it is bettering me and my skills as an engineer, and that it can build upon itself to enable me to create bigger and more expansive projects, projects that can help others. Software, to me, is still the most efficient solution to many of the world’s biggest problems. Being able to program is a powerful skill, and it gives me the ability to create anything I can dream of if I just put in the time and effort. That feeling of hope and potential is the greatest motivator of all; it is what gives my work purpose, and it was exactly what I was missing as a biglaw attorney.

Dockumo Public Document sharing is here

Hey.

I finally added public document sharing on Dockumo, two weeks after launching. It was my original goal to get something like this up and running, but it turned out to be slightly more complicated than I had thought. Since I had built everything from the perspective of a user who is logged in, I had to develop some workarounds for users who were just hitting my site. The whole experience actually turned me off a bit to requiring user login. It seems like such a huge barrier to conversion.

In any case, here’s what public documents do. Say a user wants to share a template (I’ve shared an example cover letter from when I was leaving law behind and coming back to software engineering). The user logs in, creates a new article, toggles “private” off (it’s enabled by default), then inputs the article’s content. Once the user saves the article, Dockumo runs a mapreduce behind the scenes and “indexes” the article so it becomes publicly searchable. The mapreduce runs by category; categories are fixed (via a JSON file), which forces the user into certain options. Tags are user-specific and are searchable through the interface. Then the main landing page gets updated with “trending” articles so users can see what’s popular. Here’s a screenshot:

Landing page @ Dockumo.

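The indexing step could be done with a small map-reduce. Here’s a minimal sketch of the idea, assuming a mongoose model named Article with private and category fields; the model name and output collection name are mine, not Dockumo’s actual internals:

//A sketch of the category index; Article and the output collection name are hypothetical
var o = {};
o.map = function() {
    if (!this.private) {
        emit(this.category, 1);  //one vote per public article, per category
    }
};
o.reduce = function(category, counts) {
    return Array.sum(counts);  //total public articles in this category
};
o.out = { replace: 'public_articles_by_category' };

Article.mapReduce(o, function(err, results) {
    //the output collection can then back category search and the
    //"trending" list on the landing page
});
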
Then, if a user finds an article that she likes, she can email it to herself, or download it locally as an HTML or Word file. If she wants to make some tweaks, she’ll need to log in, add a copy of the article and then edit it for her own use. Hence the cycle repeats. The hope is that, over time, shared articles get better. The goal is that all of us can use our collective intelligence to make and share better documents, whether it’s something simple like a cover letter, something more in depth like a lease, or something more personal like a statement of purpose.

Dockumo article search by category

That’s the power behind Dockumo, and my hope for the community: that someday, people can find some use for it. Now if I could just get one user who isn’t already a friend …

MongoDB performance enhancements and tweaks

In my travails building a real-time analytics engine, I’ve formed some opinions on how well mongoDB is suited for scalability, and on how to tweak queries and my node.js code to extract some extra performance. Here are some of my findings, from several standpoints: mongoDB itself, optimizations to the mongoose driver for Node, and node.js itself.

Mongoose Driver
1. Query optimization

A. Instead of using Model.findOne, or Model.find and iterating, try Model.find().limit(1) – I saw a severalfold speedup when doing this. This is discussed in several other places online.

B. If you have excess CPU, you can return a bigger chunk of documents and process them on your server instead, freeing up some cycles for MongoDB.

Improvement: Large (I saw peaks of 1500ms for reads in one collection using mongotop; afterwards, this dropped to 200ms)

Example:

//Before:
Collection.findOne(query3, function(err, doc) {
  //Returns 1 mongoose document
});

//After
Collection.find(query3).limit(1).exec(function(err, docs) {
  //returns an array of mongoose documents            
});

See these links for some more information: Checking if a document exists – MongoDB slow findOne vs find

2. Use lean()
According to the docs, if you run a query with lean(), plain javascript objects are returned rather than mongoose Documents. I’ve found that in many instances, where I was just reading the data and presenting it to the user via REST or a visual interface, there was no need for a mongoose document because nothing was manipulated after the read query.

Additionally, for relational data, if you have a Schema that contains an array of refs (e.g. friends: [{ type: mongoose.Schema.Types.ObjectId, ref: ‘User’}]), and you only need to return the first N friends to the user, you can use lean() to trim the returned javascript objects and then populate only that slice, instead of populating the entire array of friends.

Improvement: Large (depending on how much data is returned)

Example:

//Before:
User.find(query, function(err, users) {
  //users are mongoose Documents, so you can't add fields outside the Schema
  //(unless you have a { type: Any } field)
  var options = {
     path: 'friends',
     model: 'User',
     select: 'first last'
  };
  User.populate(users, options, function(err, populated) {
    //will populate ALL friends in each user's array
  });
});

//After
User.find(query).lean().exec(function(err, users) {
  //users are plain javascript objects, so you can go outside the schema
  //and return data in line with what you need
  users.forEach(function(user) {
     user.friends = user.friends.slice(0, 10);  //keep only the first ten friends
  });
  var options = {
     path: 'friends',
     model: 'User',
     select: 'first last'
  };
  User.populate(users, options, function(err, populated) {
    //now Model.populate fills in a potentially much smaller friends array
  });
});

Results (example on my node.js server using mongotop), load in ms:

Seconds     No lean()   lean()
5           561         524
10          371         303
15          310         295
20          573         563
25          292         291
30          302         291
35          544         520
40          316         307
45          289         286
50          537         503
Average     409.5       388.3

% improvement: 5.18%

3. Keep mongoDB “warm”
MongoDB implements pretty good caching. You can see this by running a query several times in quick succession: in my experience, the query time decreases (sometimes dramatically), for instance from 50ms to 10ms after running twice. We have one collection that is constantly queried – about 500 reads per second and another 500 writes per second. Keeping this collection “warm”, i.e. periodically running the query that will be called at some point in the future, can help keep the call responsive when Mongo starts to slow down.

Improvement: Untested
Example:

function keepwarm() {
   setTimeout(function() {
      //without exec() or a callback, mongoose only builds the Query and
      //never actually hits the database
      User.find(query).exec(function(err, docs) {
         //discard the results; the point is just to keep the data cached
      });
      keepwarm();
   }, 500);
}

Mongo Native
1. Compound indexing
For heavy-duty queries that run often, I decided to create compound indices using all the parameters that comprise the query. Even though it didn’t intuitively jump out at me that indexing by timestamp, for instance, would make a difference, it does. According to the mongoDB documentation, if your query sorts by timestamp (which ours did), indexing by timestamp can actually help.

Improvement: Large (depending on how many documents your collection has and how efficiently mongoDB can make use of your indices)
Example:

//in mongo shell
db.collection.ensureIndex({'timestamp': 1, 'user': 1});

//in mongoose schema definition (index() is declared on the Schema)
mySchema.index({'timestamp': 1, 'user': 1});

Alternative? Aggregating documents into larger documents, such as time slices. Intuitively, that would mean queries don’t have to traverse as large an index to reach the targeted documents. You may ask what the difference is between creating a compound index and breaking the documents down into aggregates like a day’s or an hour’s slice. Here are a few possibilities (a sketch of the time-slice idea follows the list):

  1. MongoDB tries to match up queries with indices or compound indices, but there’s no guarantee that this match will occur. Supposedly, the algorithm used to determine which index to use is pretty good, but I question how good it is if, for instance, the query you are running includes an additional search parameter. If MongoDB doesn’t see all the parameters in the index, will it still know to use a compound index or a combination of compound indices?
  2. Using aggregates could actually be slower if it requires traversal of the document for the relevant flight data (which might not afford fast reads).
  3. If writes are very heavy for the aggregate (e.g. you use an aggregate document that is too large in scope), the constant reading and writing of the document may cause delays via mongoDB’s need to lock the collection/document.
  4. Aggregates could make indexing more difficult.
  5. Aggregates could make aggregation/mapreduce more difficult, because your document no longer represents a single instance of an “event” (or is not granular enough).
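
To make the trade-off concrete, here’s a minimal sketch of the time-slice idea, assuming mongoose; the model and field names are hypothetical, not the ones from my collections:

//One document per user per hour; events land inside the slice
var mongoose = require('mongoose');

var packetHourSchema = new mongoose.Schema({
    user: { type: mongoose.Schema.Types.ObjectId, ref: 'User' },
    hour: Date,   //truncated to the hour this slice covers
    counts: {}    //Mixed type: per-minute counters inside the slice
});
packetHourSchema.index({ 'user': 1, 'hour': 1 });
var PacketHour = mongoose.model('PacketHour', packetHourSchema);

//One upsert per event instead of one insert per event, so the index
//stays small (one entry per user per hour)
function recordEvent(userId, timestamp, callback) {
    var hour = new Date(timestamp.getTime());
    hour.setMinutes(0, 0, 0);   //truncate to the hour
    var inc = {};
    inc['counts.' + timestamp.getMinutes()] = 1;
    PacketHour.update(
        { user: userId, hour: hour },
        { $inc: inc },
        { upsert: true },
        callback
    );
}

Whether this beats a compound index depends entirely on the caveats above, which is why I list it only as an alternative.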

2. Use mongotop to determine where your bottlenecks are
Mongotop shows each collection in your database and the amount of time spent on reads and writes. By default it updates every second. Bad things happen when the total query time jumps over a second: in Node, for instance, the event queue will begin to back up because mongo is taking too long.

Example:

//example output
                            ns       total        read       write		2014-07-31T17:02:06
              mean-dev.packets       282ms       282ms         0ms
             mean-dev.sessions         0ms         0ms         0ms
               mean-dev.series         0ms         0ms         0ms
              mean-dev.reduces         0ms         0ms         0ms
             mean-dev.projects         0ms         0ms         0ms

3. Use explain()… sparingly
I’ve found that explain() is useful initially, because it shows you the number of documents scanned to reach the result of the query. However, when trying to optimize queries further, I found it less useful. If I’ve already created my compound indices and MongoDB is using them, how much more performance can I extract with explain() when it may already show a duration of 0 to 1ms?

Example:

//in mongo shell
db.collection.find({
        $and: [{
            'from.ID': 956481854
        }, {
            'to.ID': 1038472857
        }, {
            'metadata.searchable': false
        }, {
            'to.IP_ADDRESS': '127.0.0.1'
        }, {
            'from.timestamp': {
                $lt: new Date(ISODate().getTime() - 1000 * 60 * 18)
            }
        }]
    }).explain()

4. For fast inserts into a collection of limited size, consider using a capped collection

A capped collection in mongoDB is essentially a queue-like data structure that enforces first-in, first-out. According to the mongoDB docs, capped collections maintain insertion order, so they’re perfect for time series. You just have to specify the max size of the collection in bytes. I sized mine from db.collection.stats(), which showed each record was about 450 bytes.

To enforce this, you can run this in the mongoDB shell:

db.runCommand({"convertToCapped": "mycoll", size: 100000}); //size in bytes
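
If the collection doesn’t exist yet, you can also create it capped from the start (the collection name and size here are just placeholders):

//in mongo shell
db.createCollection("mycoll", { capped: true, size: 100000 });  //size in bytes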

See the mongoDB docs on capped collections for more detail.

Node.js
1. Implement pacing for large updates
I’ve found that when a periodic update touches a large subset of a collection while many other updates are going on, the large update can cause the event queue in Node to back up as mongoDB tries to keep up. By throttling the number of updates allowed, based on total update time, I could adjust to the server’s current load. The philosophy: if node/mongoDB have extra cycles, dial up the pace of backfilling/updates a bit; when node/mongoDB are overloaded, back off.

Example:


//Runs periodically
_aggregator.updateStatistics(undefined, updateStatisticsPace, function(result) {
  console.log('[AGGREGATOR] updateStatistics() complete.  Result: [Num Updated: %d, Duration: %d, Average (ms) per update: %d]', result.updated, result.duration, result.average);
  if (result.average < 5) {  //< 5ms per update: speed up by 10%
    updateStatisticsPace = Math.min(MAX_PACE, Math.floor(updateStatisticsPace * 1.1));  //MAX_PACE = all records updated
  } else if (result.average >= 5 && result.average < 10) {  //5-10ms: maintain pace
    updateStatisticsPace = Math.min(MAX_PACE, updateStatisticsPace);
  } else {  //>= 10ms: slow down by a third, to a floor of updateStatisticsPace_min
    updateStatisticsPace = Math.min(MAX_PACE, Math.max(updateStatisticsPace_min, Math.floor(updateStatisticsPace * .66)));
  }

  if (MAX_PACE === updateStatisticsPace) { console.log('[Aggregator] updateStatistics() - Max pace reached: ' + _count); }
  console.log('[AGGREGATOR] updateStatistics() Setting new pace: %d', updateStatisticsPace);
  callback(null, result);
});

Custom tags directive in angular.js

To finish up phase 1 of Dockumo development (allowing users to tag and categorize articles), I had to let users add custom tags to their articles. If an article is made public, the user’s tags are indexed and searchable through the search interface. This lets other users sift through content by tag, and could eventually give some good insight into what content is most popular.

Why I created it: I’ve used angular-tag before, but I didn’t like the way it hooked the “Delete” key, which on my MacBook sometimes defaults to going to the previous page in history (like clicking the “Back” button). I also found the CSS a bit wonky and difficult to work with: when I clicked on the input box, the box would expand and didn’t play nicely with my CSS. I’ve been feeling more reticent these days about using untested third-party libraries (even small ones). Sometimes they save me lots of time, but other times they only cause headaches. Delving into someone else’s source code, modifying their CSS, and figuring out how to mesh my code with theirs takes a lot of time. Sometimes it’s not until I’ve done all that that I realize the library doesn’t really do what I want it to do. Hence the frustration.

What it does: My angular directive is very simple. It lets the user bind a variable to the directive through ngModel. The variable is an array of strings (a set). The directive renders an input text box, letting the user enter a comma-separated list of tags. When the user clicks “Add”, the tags are split on commas, trimmed, and their lowercase values added to the set of tags already bound. It works similarly to WordPress’s tagging system. That’s it: no fuss, no muss.


Without further ado, here’s the source:

angular.module('mean.articles')

.directive("mTag", ['$resource', function($resource) {
  return {
    restrict: 'E',
    controller: 'TagController',
    scope: {
      read: '=',  //read or write (an API in case the user only wants to show the tags and not provide the input)
      tags: '=ngModel'  //the ng-model binding
    },
    templateUrl: 'public/system/views/directives/tag.html'
  };
}]);
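
Using it is then just a matter of dropping the element into a template and binding an array of strings (article.tags here is a placeholder for whatever your scope holds):

<m-tag ng-model="article.tags" read="false"></m-tag>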

Here’s the controller code:

'use strict';

angular.module('mean.system').controller('TagController', 
    ['$scope', function ($scope) {

    //only allows inputs of alphabetic characters (no numbers or special chars)
    $scope.validate = function(string) {
    	if (!string || string.length === 0) {
    		return true;
    	} else {
    		return string.match(/^\s?[A-Z,\s]+$/i);	
    	}
    };

    //Adds the tag to the set (lowercased, so membership checks stay consistent)
    function addTag (string) {
    	var lower = string.toLowerCase();
    	if ($scope.tags.indexOf(lower) === -1) {
    		$scope.tags.push(lower);
    	}
    }

    //When the user clicks "Add", all unique tags will be added
    $scope.appendTags = function(string) {
    	if ($scope.validate(string)) {
    		if (string) {
	    		var split = string.split(',');

	    		split.forEach(function(tag) {
	    			if (tag.trim().length > 0) {
	    				addTag(tag.trim());
	    			}
	    		});

	    		$scope.temptags = "";
	    	}
    	}
    };

    //When the user wants to delete a tag
    $scope.deleteTag = function(tag) {
    	if (tag) {
    		var idx = $scope.tags.indexOf(tag);
    		if (idx > -1) {
    			$scope.tags.splice(idx, 1);
    		}
    	}
    };

}]);

And lastly, the HTML template:

<div class="input-group" ng-show="!read">
    <input type="text" name="temptags" ng-model="temptags" class="form-control"  placeholder="Enter comma separated tags" ng-class="{'haserror':!validate(temptags)}" style="margin-bottom:0px"/>
    <span ng-show="!validate(temptags)" class="label label-danger">Please only use regular characters for tags (a-z)</span>
    <div class="input-group-btn" style="vertical-align:top">
        <button class="btn btn-inline" ng-click="appendTags(temptags)">Add</button>
    </div>
</div>

<div class="margintopten">
    <span class="label label-default normal tag marginrightten" ng-repeat="tag in tags"><a ng-click="deleteTag(tag)" ng-if="!read"><i class="fa fa-times-circle"></i></a>  {{tag}}</span>
</div>

Here’s a working jsfiddle.

Custom tag directive with error handling

As always, let me know your comments or feedback.

Dockumo – the crowd in the cloud


In the past few months, I’ve been working on a project in my spare time called Dockumo. Dockumo is a web-based tool that lets users create “Articles” (basically blurbs of text), edit them, and create new versions of them easily. When a user edits an article, the user is given the option to “Save” it or “Save as New Version.” Saving as a new version generates a “Series” for the article, which then keeps track of all versions of the article. Pretty basic stuff. It’s my first foray into making consumer-facing software, including making better software tools for lawyers (my ultimate goal).

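As a rough illustration of the data model (a sketch only, with names I made up; this isn’t Dockumo’s actual schema), the article/series relationship could be modeled like this in mongoose:

//Hypothetical sketch of article versioning
var mongoose = require('mongoose');

var articleSchema = new mongoose.Schema({
    title: String,
    content: String,
    private: { type: Boolean, default: true },
    series: { type: mongoose.Schema.Types.ObjectId, ref: 'Series' },
    version: Number,
    created: { type: Date, default: Date.now }
});

//A Series simply collects every version of one article, in order
var seriesSchema = new mongoose.Schema({
    owner: { type: mongoose.Schema.Types.ObjectId, ref: 'User' },
    versions: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Article' }]
});

var Article = mongoose.model('Article', articleSchema);
var Series = mongoose.model('Series', seriesSchema);

“Save as New Version” then amounts to inserting a new Article and pushing its id onto the series’ versions array.
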
Dockumo main page (for text comparison)

The utility here is that the user can see the article and its versions in a timeline view and can easily compare any two versions (or any two articles, for that matter) with a couple of clicks. I built this because comparison used to mean making two Word documents, saving them, then comparing them and saving the result. Dockumo does the same thing, but much faster. Also, there’s no need to store lots of messy versions on your computer, since everything is stored in the cloud. I also built in functionality to export the result to a text file, HTML doc or Word doc (Word was by far the most challenging). Exporting actually keeps track of the changes in Word’s “Track changes” format, which I implemented through a modification to the node.js officegen module.

An example of an article

Here’s what the timeline view looks like:

Timeline view of an article

Finally, Dockumo lets users group articles together in “Journals.” Journals are a way to gather related content together and then share it with your friends or other users on the site. When you share a journal, it becomes a “collaboration” for all users who are invited. Those collaborators can add new versions of any articles grouped in the journal. The idea is that users can view and edit data together, while keeping track of new versions. There’s an option, for instance, to receive a notification whenever a collaborator edits a series. So, for example, if five users are collaborators on a journal, working on a report or a term paper, each one can make his or her modifications and save them as a new version. As the document evolves, each user can see which other user made each change, and how the article changed over time, through the comparison view.

A journal

What’s left?

When I started out making Dockumo, I only intended it to be a quick and easy way to diff two blurbs of text. I didn’t want feature creep to set in, but it inevitably did. Journals were not planned from the beginning, but I figured users would want a way to see related content. I ended up having to add lots of other elements as well, such as “friending” other users, user search, a notifications system, exporting to other formats, emailing users who want updates, and user permissions. All of these things added complexity and time. A good lesson for me, though: when I thought I was ready to launch the website, I was actually still about a month out. It wasn’t until I tried to launch that I realized there were so many niggling things left undone: for instance, not allowing users to register under the same username or email address, or providing a password reset mechanism. Fortunately, I can carry a lot of this code over to future projects down the line.

And feature creep. I am probably the worst offender ever. I always think to myself, “well, if the software doesn’t do [X], then users won’t want to use it!” For me, this is an ever-present temptation to keep adding feature upon feature to the source while never actually launching. Since launching, there have been other features that I really want to add (and probably will). For instance, I want to give users the option to make articles “public” and tag them with keywords, so that the community (all users within the cloud) can search for a particular type of document (e.g. a cover letter or a thank-you note), see an example of it, suggest modifications and rate existing articles. In other words, have a community of contributors who persistently improve cloud-based, crowd-sourced documents. That’s why I called the product “Dockumo”, or “doc” + “kumo” (“cloud” in Japanese).

Let me know what you guys think. I’m always happy to discuss improvements I can make or my technical/architectural decisions.

About page