Real Time Analytics Engine

My current project at work is to architect a solution that captures network packet data sent over UDP, aggregates it, analyzes it and reports it back to our end users. At least, that’s the vision. It’s a big undertaking and there are several approaches we could take. To that end, I’ve started prototyping a test harness to measure the write and read performance of various software at different levels of our stack.
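As a rough sketch of the capture-and-aggregate step, here is how a Node.js collector might look using the built-in dgram module. The 8-byte record layout (a source id followed by a byte count) and the names parseRecord, aggregate and startCollector are assumptions made for illustration; they are not our actual packet format or code.

```javascript
const dgram = require('dgram');

// Hypothetical fixed-format record: 4-byte source id, then 4-byte byte count,
// both unsigned big-endian. The real packet layout will differ.
const RECORD_SIZE = 8;

// Decode one datagram, or return null if it is too short to hold a record.
function parseRecord(buf) {
  if (buf.length < RECORD_SIZE) return null;
  return { sourceId: buf.readUInt32BE(0), bytes: buf.readUInt32BE(4) };
}

// Fold one record into the running per-source totals (a Map keyed by source id).
function aggregate(totals, record) {
  const entry = totals.get(record.sourceId) || { packets: 0, bytes: 0 };
  entry.packets += 1;
  entry.bytes += record.bytes;
  totals.set(record.sourceId, entry);
  return totals;
}

// Bind a UDP socket and aggregate every datagram that parses.
function startCollector(port, totals) {
  const socket = dgram.createSocket('udp4');
  socket.on('message', function (msg) {
    const record = parseRecord(msg);
    if (record) aggregate(totals, record);
  });
  socket.bind(port);
  return socket;
}
```

Calling startCollector(41234, new Map()) would start listening on an arbitrary example port. Keeping the parsing and aggregation as plain functions means they can be exercised in a test harness without opening a socket.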

Our criteria are as follows (taken from a slide I created):

Our Criteria

Based on these four requirements pillars, I’ve narrowed down the platform choices to the following:

SERVER

nodejs

Strengths

      • Single-threaded, event loop model
      • Callbacks execute asynchronously as requests complete; reportedly handles thousands more concurrent requests than Tomcat
      • No locks; almost no function in node directly performs I/O, so the process rarely blocks
      • Great at handling lots of small dynamic requests
      • Logo is cool

Weaknesses

      • Taking advantage of multi-core CPUs
      • Starting processes via child_process.fork()
      • APIs still in development
      • Not so good at handling large buffers
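The multi-core weakness is usually worked around with Node's built-in cluster module, which spawns worker processes via child_process.fork() and lets them share a listening port. A minimal sketch, where the helper names workerCount and startClustered and the trivial "Hello" response are assumptions for illustration:

```javascript
const cluster = require('cluster');
const http = require('http');
const os = require('os');

// Cap the requested worker count at the number of cores, with at least one worker.
function workerCount(requested, cores) {
  return Math.max(1, Math.min(requested, cores));
}

// In the primary process, fork the workers; in each worker, start an HTTP server.
// Workers re-run the same script, so this must be called at the top level of it.
function startClustered(port, requested) {
  // cluster.isMaster is the older name; newer Node versions also expose isPrimary.
  const isPrimary = cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;
  if (isPrimary) {
    const n = workerCount(requested, os.cpus().length);
    for (let i = 0; i < n; i++) cluster.fork();
  } else {
    http.createServer(function (req, res) {
      res.writeHead(200, { 'content-type': 'text/html' });
      res.end('Hello: ');
    }).listen(port);
  }
}
```

Ending a script with startClustered(8000, 4) would run up to four single-threaded servers on port 8000. Each worker is a separate process with its own memory, so any shared state has to live elsewhere (for example in the persistence layer).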

vert.x

Strengths

        • Allows writing code as single-threaded (Uses Distributed Event Bus, distributed peer-to-peer messaging system)
        • Can write “verticles” in any language
        • Built on JVM and scales over multiple cores w/o needing to fork multiple servers
        • Can be embedded in existing Java apps (more easily)

Weaknesses

        • Independent project (very new), maybe not well supported?
        • Verticles allow blocking of the event loop, which will cause blocking in other verticles. Distribution of work product makes it more likely to happen.
        • Multiple languages = debugging hell?
        • Is there overhead in scaling over multiple nodes?

PERSISTENCE

For our data, we are considering several NoSQL databases. Data integrity is not that important to us, because the data is not make-or-break for a company. Write performance, however, is essential: we expect on the order of 10k to 100k fixed-format writes per second, with far fewer reads.

mongo

Architecture

        • Maps in-memory space directly to disk (memory-mapped files)
        • B-tree indexing guarantees logarithmic performance

Data Storage

        • “Documents” akin to objects
        • Uses BSON (binary JSON)
        • Fields are key-value pairs
        • “Collections” akin to tables
        • No conversion necessary from object in data to object in programming language

Data-Replication

        • Mongo handles sharding for you
        • Uses a primary and secondary hierarchy (primary receives all writes)
        • Automatic failover

Strengths

        • Document BSON structure is very flexible, intuitive and appropriate for certain types of data
        • Easily query-able using mongo’s query language
        • Built-in MapReduce and aggregation
        • BSON maps easily onto JSON, makes it easy to consume in front end for charting/analytics/etc
        • Seems hip

Weaknesses

        • Questionable scalability and write performance for high volume bursts
        • Balanced B-Tree indices maybe not best choice for our type of data (more columnar/row based on timestamp)
        • NoSQL but our devs are familiar with SQL
        • High disk space/memory usage

 

cassandra
Architecture

        • Peer to peer distributed system, Cassandra partitions for you
        • No master node, so you can read-write anywhere

Data Storage

        • Row-oriented
        • Keyspace akin to database
        • Column family akin to table
        • Rows make up column families

Data Replication

        • Uses a “gossip” protocol
        • Data is written to a commit log, then to an in-memory table (memtable), then flushed to disk as a sorted string table (SSTable)
        • Customizable by user (allows tunable data redundancy)
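The commit log, memtable and SSTable sequence above can be illustrated with a toy in-memory model. This is only a sketch of the idea, not how Cassandra is implemented: ToyStore and its flushThreshold are hypothetical, and the real database persists the commit log and SSTables to disk.

```javascript
// Toy model of the write path: commit log -> memtable -> sorted table.
function ToyStore(flushThreshold) {
  this.flushThreshold = flushThreshold; // memtable size that triggers a flush
  this.commitLog = [];                  // append-only record of unflushed writes
  this.memtable = new Map();            // latest value per key, held in memory
  this.sstables = [];                   // immutable, key-sorted snapshots
}

ToyStore.prototype.write = function (key, value) {
  this.commitLog.push({ key: key, value: value }); // 1. log the write first
  this.memtable.set(key, value);                   // 2. then update memory
  if (this.memtable.size >= this.flushThreshold) this.flush(); // 3. then flush
};

// Turn the memtable into a sorted snapshot and start a fresh memtable.
ToyStore.prototype.flush = function () {
  const sorted = Array.from(this.memtable.entries()).sort(function (a, b) {
    return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
  });
  this.sstables.push(sorted);
  this.memtable = new Map();
  this.commitLog = []; // flushed writes no longer need replaying
};

// Check the memtable first, then the snapshots from newest to oldest.
ToyStore.prototype.read = function (key) {
  if (this.memtable.has(key)) return this.memtable.get(key);
  for (let i = this.sstables.length - 1; i >= 0; i--) {
    const hit = this.sstables[i].find(function (e) { return e[0] === key; });
    if (hit) return hit[1];
  }
  return undefined;
};
```

The point of the design is that every write is an append or an in-memory update, which is why this style of store suits write-heavy workloads like ours; reads pay for it by potentially checking several tables.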

Strengths

        • Highly scalable, adding new nodes gives linear performance increases
        • No single point of failure
        • Read-write from anywhere
        • No need for caching software (handled by the database cluster)
        • Tunable data consistency (depends on use case, but can enforce transaction)
        • Flexible schema
        • Data compression built in (no perf penalty)

Weaknesses

        • Data modeling is difficult
        • Another query language to learn
        • Best stuff is used by Facebook, but perhaps not released to the public?

 

redis

            TBD

 

FRONT-END/REST LAYER

For our front end, I’ve built apps in jQuery, Angular, PHP and JSP, and hands-down my favorite is Angular. So that seems like the obvious choice here.

AngularJS

Strengths

      • No DOM Manipulation! Can’t say how valuable this is …
      • Built in testing framework
      • Intuitive organization of MVC architecture (directives, services, controllers, html bindings)
      • Built by Google, so trustworthy software

Weaknesses

      • Higher level than jQuery, so direct DOM manipulation is discouraged and more difficult to do
      • Fewer 3rd party packages released for Angular
      • Who knows if it will be the winning JavaScript framework out there
      • Learning curve

Finally, for the REST API, I’ve also pretty much decided on express (if we go with node.js):

express

Strengths

      • Lightweight
      • Easy integration into node.js
      • Flexible routing and callback structure, authorization & middleware

Weaknesses

      • Yet another javascript package
      • Can’t think of any others, really (compared to what we are using in our other app, Java Spring)

These are my thoughts so far. In the following posts, I’ll begin describing how I ended up prototyping our real time analytics engine, creating a test harness, and providing modularity so that we can make design decisions without too much downtime.

Node.js vs Vert.x (Part 1)

Now for a change of pace. Recently at work, we’ve been trying to figure out which platform to use for an application that will serve realtime data to customers. We want the system to be fast, scalable and able to handle hundreds, maybe thousands or even tens of thousands of requests per second.

I did a bit of prototyping in both node and vert.x to see how they performed under pressure. To do this, I built a cute webapp that fires a bunch of requests at basic web servers written in node.js and vert.x, to see how fast they could respond and how they would hold up under a heavy load of requests. Here’s a picture of the UI made for the webapp (built in angular.js).

Node webapp

I created a form that accepts the following inputs:

  • Iterations – number of http requests to make
  • Block Size – How often a result is computed (reports start time for request, end time, average time per call (ms) and total time (ms) )
  • Range – How many results to display on screen – the graph tracks this
  • Polling – Toggle On/Off (this will start polling requests as fast as node.js can handle them. These are serial in nature).
  • Show Graph – toggle graphing on/off (off will provide better javascript performance)
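To make the Block Size and Range settings concrete, here is a small sketch of the bookkeeping they imply. The function names summarizeBlock and pushResult are made up for illustration and are not taken from the webapp's source.

```javascript
// Summarize one block of requests from its start and end timestamps (in ms).
function summarizeBlock(startMs, endMs, blockSize) {
  const totalMs = endMs - startMs;
  return {
    start: startMs,
    end: endMs,
    totalMs: totalMs,            // total time for the block
    avgMs: totalMs / blockSize   // average time per call
  };
}

// Append a result and keep only the newest `range` results for display.
function pushResult(results, result, range) {
  results.push(result);
  while (results.length > range) results.shift();
  return results;
}
```

With 1000 iterations and a block size of 50, summarizeBlock would be called 20 times, and a range of 10 would keep only the last 10 of those summaries on screen and in the graph.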

Thanks to angular-seed for the fast prototyping and angular-google-chart for charting.

Benchmarking Parameters: Each request is a simple GET request made to the respective webserver, which then writes a header and a “Hello” response. The requests are made through angular’s $http method. When a successful response is received, the callback function issues another $http request, until the number of successful responses received equals the number of iterations specified. For each block, I measure the time from when its first request is made until all of the block’s responses have been received.

Time Keeping: I try to avoid all delays that could be attributable to javascript rendering. A timestamp is taken when the first request in the block is made, and another is recorded when the last response in the block is received. I send both timestamps to a javascript function, which is responsible for rendering the results to the display. I also added a function to enable polling, which makes $http requests as fast as responses can be received in order to add stress to the server’s load. This is enabled with the “polling” button.
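The request chaining and per-block timing described above can be sketched outside of angular as follows. Here get stands in for the $http call and now for the clock; both are injected so the loop can be tested without a server. The name runSerial and its parameters are assumptions for illustration, not the webapp's actual code.

```javascript
// Issue `iterations` requests one after another, reporting timestamps per block.
// get(callback): performs one request and invokes callback on success.
// now(): returns the current time in ms.
// onBlock(startMs, endMs): called each time a block of responses completes.
// onDone(count): optional, called after the final response.
function runSerial(get, iterations, blockSize, now, onBlock, onDone) {
  if (iterations <= 0) return;
  let received = 0;
  let blockStart = now(); // taken when the first request of the block is made

  function next() {
    get(function onResponse() {
      received += 1;
      if (received % blockSize === 0) {
        onBlock(blockStart, now()); // rendering happens outside the timed span
        blockStart = now();         // the next block starts with the next request
      }
      if (received < iterations) {
        next();                     // chain the next request from the callback
      } else if (onDone) {
        onDone(received);
      }
    });
  }

  next();
}
```

Because each request is only sent once the previous response has arrived, this measures round-trip latency in series rather than the server's capacity for concurrent requests, which is one reason the numbers should be read as a rough data point.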

Here’s a snippet of the webserver source code.

In Node.js:

// Assumes fs = require('fs') and that req.url has already been parsed with url.parse().
StaticServlet.prototype.handleRequest = function(req, res) {
  var self = this;

  // Normalize the request path and decode percent-escapes (e.g. %20).
  var path = ('./' + req.url.pathname).replace('//','/').replace(/%(..)/g, function(match, hex){
    return String.fromCharCode(parseInt(hex, 16));
  });

  console.log(path);
  if (path === './show') {
    // Benchmark endpoint: a header plus a short body.
    res.writeHead(200, {'content-type': 'text/html'});
    res.write("Hello: ");
    res.end();
  } else if (path === './version') {
    res.writeHead(200, {'content-type': 'text/html'});
    res.write("Node.js version: " + 'v0.10.26');
    res.end();
  } else {
    // Everything else is served as a static file, except dotfiles.
    var parts = path.split('/');
    if (parts[parts.length-1].charAt(0) === '.')
      return self.sendForbidden_(req, res, path);
    else {
      fs.stat(path, function(err, stat) {
        if (err)
          return self.sendMissing_(req, res, path);
        if (stat.isDirectory())
          return self.sendDirectory_(req, res, path);
        return self.sendFile_(req, res, path);
      });
    }
  }
};

In Vert.x:


server.requestHandler(function(req) {
  var file = '';

  if (req.path() == '/show') {
    // Benchmark endpoint: a header plus a short body.
    req.response.chunked(true);
    req.response.putHeader('content-type','text/html');
    req.response.write("Hello: ");
    req.response.end();
  } else if (req.path() == '/version') {
    req.response.chunked(true);
    req.response.putHeader('content-type','text/html');
    req.response.write('Vertx version: ' + '2.0.2-final (built 2013-10-08 10:55:59');
    req.response.end();
  } else if (req.path() == '/') {
    file = 'index.html';
    req.response.sendFile('app/' + file);
  } else if (req.path().indexOf('..') == -1) {
    // Serve static files, refusing paths that try to climb out of app/.
    file = req.path();
    req.response.sendFile('app/' + file);
  }
});
// (The call that starts the server listening is not part of this snippet.)

See? Dead simple. Of course there are lots of flaws with this methodology (e.g. the webservers are only serving static data, are writing short responses, are not optimized, etc.). It wasn’t my intention to come to a hard conclusion with this. I just wanted a data point and to experiment with both platforms. It turns out they (at least in this test) came very close to one another in terms of performance. Both servers were running on my machine, whose specs are listed below.

System Specs: MacBook Pro, mid-2012, 2.3 GHz with 16 GB RAM and a 512 GB SSD. Both webservers are running locally on my machine with a bunch of other apps open.

And here are some preliminary results:

Here’s the node.js webserver, with polling turned on:

Node while polling, 1000 requests, 50 request block size

Here’s the vert.x webserver, with polling turned on:

Vert.x while polling, 1000 requests, 50 request block size

You can see that they’re very close. Next I tried stressing both servers a bit by running several concurrent queries and several “instances” of the web app. In a later post, I’ll put up more detailed results from stressing both webservers. The response time definitely slows down as more requests are made at once.

Conclusions: Both webservers are surprisingly close in terms of response/processing/overhead time. My CPU usage goes a bit higher on the vert.x server, but I do have several other applications running. I also haven’t tested running multiple instances of the same verticle yet in vert.x, or trying to fork processes in node.js. Both webservers are as barebones as they get. So in other words, I can’t draw any hard conclusions yet, except to say that:

  • Both servers are able to handle high loads of requests (probably on the order of 500 – 1000 per second)
  • Out of the box, both servers perform roughly equivalently

These results were a little surprising to me, given that benchmarks published on the web seem to show vert.x as faster. One factor that may contribute to this is the lack of realism in the server response. It’s probably not the case that so many requests would be coming in simultaneously (or there would be multiple server instances to handle such requests), and the response size is very small. Since both servers are basically just writing to a stream, as long as the I/O is written with performance in mind, this may be roughly as fast as my CPU can handle writing responses. Another hypothesis is that vert.x really shines when running multiple instances. I’ll have to experiment and report my results.

Postscript: In case you want to try it out for yourself, I’ve made the source code available on my GitHub (https://github.com/rocketegg/Code-Projects/tree/master/ServerPerformance). I know this test has been done with much more sophistication and by a lot of people out there, but hopefully you can play around with this webapp, have fun and learn something.