Real Time Analytics Engine

My current project at work is to architect a solution to capturing network packet data sent over UDP, aggregate it, analyze it and report it back up to our end users. At least, that’s the vision. It’s a big undertaking and there are several avenues of approach that we can take. To that end, I’ve started prototyping to create a test harness to see the write performance/read performance of various software at different levels in our software stack.

Our criteria are as follows (taken from a slide I created):

Our Criteria
Our Criteria

Based on these four requirements pillars, I’ve narrowed down the platform choices to the following:

SERVER

nodejsStrengths

      • Single-threaded, event loop model
      • Callbacks executed when request made (async), can handle 1000s more req than tomcat
      • No locks, no function in node directly performs I/O
      • Great at handling lots of small dynamic requests
      • Logo is cool

Weaknesses

      • Taking advantage of multi-core CPUs
      • Starting processes via child_process.fork()
      • APIs still in development
      • Not so good at handling large buffers

logo-white-big

Strengths

        • Allows writing code as single-threaded (Uses Distributed Event Bus, distributed peer-to-peer messaging system)
        • Can write “verticles” in any language
        • Built on JVM and scales over multiple cores w/o needing to fork multiple servers
        • Can be embedded in existing Java apps (more easily)

Weaknesses

        • Independent project (very new), maybe not well supported?
        • Verticles allow blocking of the event loop, which will cause blocking in other verticles. Distribution of work product makes it more likely to happen.
        • Multiple languages = debugging hell?
        • Is there overhead in scaling over multiple nodes?

PERSISTENCE

For our data, we are considering several NoSQL databases. Our data integrity is not that important, because our data is not make-or-break for a company. However, it is essential to be highly performant, with upwards of 10-100k, fixed-format data writes per second but much fewer reads.

mongo

Architecture

        • Maps in memory space directly to disk
        • B-tree indexing guarantee logarithmic performance

Data Storage

        • “Documents” akin to objects
        • Uses BSON (binary JSON)
        • Fields are key-value pairs
        • “Collections” akin to tables
        • No conversion necessary from object in data to object in programming language

Data-Replication

        • Mongo handles sharding for you
        • Uses a primary and secondary hierarchy (primary receives all writes)
        • Automatic failover

Strengths

        • Document BSON structure is very flexible, intuitive and appropriate for certain types of data
        • Easily query-able using mongo’s query language
        • Built-in MapReduce and aggregation
        • BSON maps easily onto JSON, makes it easy to consume in front end for charting/analytics/etc
        • Seems hip

Weaknesses

        • Questionable scalability and write performance for high volume bursts
        • Balanced B-Tree indices maybe not best choice for our type of data (more columnar/row based on timestamp)
        • NoSQL but our devs are familiar with SQL
        • High disk space/memory usage

 

cassandra
Architecture

        • Peer to peer distributed system, Cassandra partitions for you
        • No master node, so you can read-write anywhere

Data Storage

        • Row-oriented
        • Keyspace akin to database
        • Column family akin to table
        • Rows make up column families

Data Replication

        • Uses a “gossip” protocol
        • Data written to a commit log, then to an in-memory database (memtable) then to disk using a sorted-strings table
        • Customizable by user (allows tunable data redundancy)

Strengths

        • Highly scalable, adding new nodes give linear performance increases
        • No single point of failure
        • Read-write from anywhere
        • No need for caching software (handled by the database cluster)
        • Tunable data consistency (depends on use case, but can enforce transaction)
        • Flexible schema
        • Data compression built in (no perf penalty)

Weaknesses

        • Data modeling is difficult
        • Another query language to learn
        • Best stuff is used by Facebook, but perhaps not released to the public?

 

467px-Redis_Logo.svg

            TBD

 

        FRONT-END/REST LAYER

For our REST Layer/Front-end, I’ve built apps in JQuery, Angular, PHP, JSP and hands-down my favorite is Angular. So that seems like the obvious choice here.

AngularJS-large

Strengths

      • No DOM Manipulation! Can’t say how valuable this is …
      • Built in testing framework
      • Intuitive organization of MVC architecture (directives, services, controllers, html bindings)
      • Built by Google, so trustworthy software

Weaknesses

      • Higher level than JQuery, so DOM manipulation is unadvised and more difficult to do
      • Fewer 3rd party packages released for angular
      • Who knows if it will be the winning javascript framework out there
      • Learning curve

Finally, for the REST API, I’ve also pretty much decided on express (if we go with node.js):

express

Strengths

      • Lightweight
      • Easy integration into node.js
      • Flexible routing and callback structure, authorization & middleware

Weaknesses

      • Yet another javascript package
      • Can’t think of any, really (compared to what we are using in our other app – java spring

These are my thoughts so far. In the following posts, I’ll begin describing how I ended up prototyping our real time analytics engine, creating a test harness, and providing modularity so that we can make design decisions without too much downtime.