Real Time Analytics Engine

My current project at work is to architect a solution that captures network packet data sent over UDP, aggregates it, analyzes it, and reports it back to our end users. At least, that’s the vision. It’s a big undertaking, and there are several avenues of approach we can take. To that end, I’ve started prototyping a test harness to measure the write and read performance of various software at different levels of our stack.

Our criteria are as follows (taken from a slide I created):

Based on these four requirements pillars, I’ve narrowed down the platform choices to the following:

SERVER

Node.js

Strengths

Single-threaded, event loop model
Callbacks execute asynchronously when a request comes in, so it can handle thousands more concurrent requests than Tomcat
No locks; almost no function in Node directly performs I/O, so the process rarely blocks
Great at handling lots of small dynamic requests
Logo is cool
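To make the event loop model concrete for our use case, here is a minimal sketch of a UDP listener built on Node's built-in dgram module. The per-sender counters, the function names and the port in the usage note are my own assumptions for illustration, not part of our actual prototype.

```javascript
const dgram = require('dgram');

// Update per-sender counters for one datagram. This is a plain function,
// so it can be exercised without opening a socket.
function recordDatagram(stats, msg, rinfo) {
  const key = `${rinfo.address}:${rinfo.port}`;
  const entry = stats.get(key) || { packets: 0, bytes: 0 };
  entry.packets += 1;
  entry.bytes += msg.length;
  stats.set(key, entry);
  return stats;
}

// Bind a UDP socket and register the callbacks. Nothing here blocks: the
// event loop invokes the 'message' handler whenever a datagram arrives.
function startListener(port, stats = new Map()) {
  const socket = dgram.createSocket('udp4');
  socket.on('message', (msg, rinfo) => recordDatagram(stats, msg, rinfo));
  socket.on('error', (err) => {
    console.error(`UDP socket error: ${err.message}`);
    socket.close();
  });
  socket.bind(port);
  return { socket, stats };
}
```

Calling something like startListener(41234) (an arbitrary port) would start collecting counts. All of the aggregation happens inside the callback, which is why it needs to stay short.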

Weaknesses

Doesn’t take advantage of multi-core CPUs on its own; you have to start extra processes via child_process.fork()
APIs still in development
Not so good at handling large buffers
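As a rough idea of what the multi-core workaround could look like, the sketch below forks one worker per core and routes each packet summary to a worker chosen by hashing the sender address. The worker script path, the packet shape and the hash function are assumptions made up for this example.

```javascript
const { fork } = require('child_process');
const os = require('os');

// Deterministically map a sender address to a worker index, so that all
// packets from one source are aggregated by the same process.
function workerIndexFor(address, workerCount) {
  let hash = 0;
  for (const ch of address) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash % workerCount;
}

// Fork one worker process per CPU core by default.
function startWorkers(workerScript, count = os.cpus().length) {
  return Array.from({ length: count }, () => fork(workerScript));
}

// Send a packet summary over IPC to the worker that owns its source address.
// The worker would receive it via process.on('message', ...).
function dispatch(workers, packet) {
  const worker = workers[workerIndexFor(packet.address, workers.length)];
  worker.send(packet);
}

// Usage (hypothetical worker script):
// const workers = startWorkers('./aggregator-worker.js');
// dispatch(workers, { address: '10.0.0.1', bytes: 512 });
```

The cost is that every packet summary now crosses a process boundary, which is exactly the kind of overhead the test harness should measure.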

Vert.x

Strengths

Lets you write code as if it were single-threaded (uses a distributed event bus, a distributed peer-to-peer messaging system)
Can write “verticles” in several JVM-hosted languages, such as Java, JavaScript, Groovy, Ruby, and Python
Built on JVM and scales over multiple cores w/o needing to fork multiple servers
Can be embedded in existing Java apps (more easily)

Weaknesses

Independent project (very new), maybe not well supported?
Verticles can block the event loop, which in turn blocks the other verticles sharing it. Distributing the work makes this more likely to happen.
Multiple languages = debugging hell?
Is there overhead in scaling over multiple nodes?

PERSISTENCE

For our data, we are considering several NoSQL databases. Data integrity is not critical for us, because our data is not make-or-break for a company. Write performance, however, is essential: we expect on the order of 10k-100k fixed-format writes per second, but far fewer reads.
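To give a feel for what “fixed-format” buys us, here is a small sketch that decodes fixed-size binary records from a Buffer and rolls them up per second before anything is written to the database. The 10-byte layout (timestamp, source IPv4, payload length) is a made-up assumption, not our real wire format.

```javascript
// Assumed layout: 4 bytes timestamp (seconds), 4 bytes source IPv4,
// 2 bytes payload length, all big-endian.
const RECORD_SIZE = 10;

function encodeRecord({ timestamp, sourceIp, length }) {
  const buf = Buffer.alloc(RECORD_SIZE);
  buf.writeUInt32BE(timestamp, 0);
  buf.writeUInt32BE(sourceIp, 4);
  buf.writeUInt16BE(length, 8);
  return buf;
}

function decodeRecord(buf, offset = 0) {
  return {
    timestamp: buf.readUInt32BE(offset),
    sourceIp: buf.readUInt32BE(offset + 4),
    length: buf.readUInt16BE(offset + 8),
  };
}

// Walk a buffer of back-to-back records and total packets and bytes per
// second. Trailing bytes that do not form a full record are ignored.
function aggregateBySecond(buf) {
  const totals = new Map();
  for (let off = 0; off + RECORD_SIZE <= buf.length; off += RECORD_SIZE) {
    const { timestamp, length } = decodeRecord(buf, off);
    const entry = totals.get(timestamp) || { packets: 0, bytes: 0 };
    entry.packets += 1;
    entry.bytes += length;
    totals.set(timestamp, entry);
  }
  return totals;
}
```

Because every record has the same size, decoding is just offset arithmetic, and rolling up before persisting turns tens of thousands of raw writes per second into a handful of rows.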

MongoDB

Architecture

Memory-maps its data files, so in-memory space maps directly to disk
B-tree indexing guarantees logarithmic lookup performance

Data Storage

“Documents” akin to objects
Uses BSON (binary JSON)
Fields are key-value pairs
“Collections” akin to tables
No conversion necessary from object in data to object in programming language

Data Replication

Mongo handles sharding for you
Uses a primary and secondary hierarchy (primary receives all writes)
Automatic failover

Strengths

Document BSON structure is very flexible, intuitive and appropriate for certain types of data
Easily query-able using mongo’s query language
Built-in MapReduce and aggregation
BSON maps easily onto JSON, which makes it easy to consume in the front end for charting, analytics, etc.
Seems hip

Weaknesses

Questionable scalability and write performance for high volume bursts
Balanced B-tree indices may not be the best choice for our type of data (more columnar, or row-based on timestamp)
NoSQL but our devs are familiar with SQL
High disk space/memory usage

Cassandra

Architecture

Peer to peer distributed system, Cassandra partitions for you
No master node, so you can read-write anywhere

Data Storage

Row-oriented
Keyspace akin to database
Column family akin to table
Rows make up column families

Data Replication

Nodes use a “gossip” protocol to exchange cluster state
Data is written to a commit log, then to an in-memory table (memtable), then flushed to disk as a sorted string table (SSTable)
Replication is customizable by the user (allows tunable data redundancy)
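The commit log, memtable, SSTable sequence is easier to picture with a toy model. The class below is only an in-memory illustration of that write path as I understand it; the class name and flush threshold are invented, and it is in no way how Cassandra is actually implemented.

```javascript
class ToyWritePath {
  constructor(flushThreshold = 3) {
    this.commitLog = [];        // append-only record of every write
    this.memtable = new Map();  // latest value per key, held in memory
    this.sstables = [];         // immutable, key-sorted snapshots ("on disk")
    this.flushThreshold = flushThreshold;
  }

  write(key, value) {
    this.commitLog.push({ key, value }); // 1. log first, for durability
    this.memtable.set(key, value);       // 2. then update the memtable
    if (this.memtable.size >= this.flushThreshold) {
      this.flush();                      // 3. spill to a sorted table
    }
  }

  flush() {
    const sorted = [...this.memtable.entries()].sort(([a], [b]) =>
      a < b ? -1 : a > b ? 1 : 0
    );
    this.sstables.push(Object.freeze(sorted));
    this.memtable.clear();
    // A real system would also recycle the flushed part of the commit log;
    // this toy keeps it around for inspection.
  }

  read(key) {
    if (this.memtable.has(key)) return this.memtable.get(key);
    // Search the newest tables first so later writes win.
    for (let i = this.sstables.length - 1; i >= 0; i--) {
      const hit = this.sstables[i].find(([k]) => k === key);
      if (hit) return hit[1];
    }
    return undefined;
  }
}
```

The point of the design is that a write only ever appends to a log and touches memory, which is why this kind of store is attractive for a write-heavy workload like ours.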

Strengths

Highly scalable; adding new nodes gives linear performance increases
No single point of failure
Read-write from anywhere
No need for caching software (handled by the database cluster)
Tunable data consistency (depends on use case, but can enforce transaction)
Flexible schema
Data compression built in (no perf penalty)
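The tunable consistency point comes down to simple arithmetic. With n replicas of each row, a read that contacts r replicas is guaranteed to overlap the latest acknowledged write to w replicas whenever r + w > n. A minimal sketch, with function names of my own choosing:

```javascript
// Smallest number of replicas that forms a majority of the replication factor.
function quorum(replicationFactor) {
  return Math.floor(replicationFactor / 2) + 1;
}

// True when every read set must intersect every write set, i.e. a read
// cannot miss the most recent acknowledged write.
function readsSeeLatestWrite(n, r, w) {
  return r + w > n;
}

const replicas = 3;
// Read one, write one: fastest, but only eventually consistent.
console.log(readsSeeLatestWrite(replicas, 1, 1)); // false
// Quorum reads and quorum writes (2 + 2 > 3).
console.log(readsSeeLatestWrite(replicas, quorum(replicas), quorum(replicas))); // true
// Write to all replicas, read from any one.
console.log(readsSeeLatestWrite(replicas, 1, replicas)); // true
```

For our workload, where integrity matters less than write throughput, the first combination is probably the one to benchmark first.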

Weaknesses

Data modeling is difficult
Another query language to learn
Best stuff is used by Facebook, but perhaps not released to the public?

TBD

FRONT-END/REST LAYER

For our REST layer/front end, I’ve built apps in jQuery, Angular, PHP, and JSP, and hands down my favorite is Angular. So that seems like the obvious choice here.

Strengths

No DOM Manipulation! Can’t say how valuable this is …
Built in testing framework
Intuitive organization of MVC architecture (directives, services, controllers, html bindings)
Built by Google, so trustworthy software

Weaknesses

Higher level than jQuery, so direct DOM manipulation is discouraged and more difficult to do
Fewer third-party packages released for Angular
Who knows if it will end up being the winning JavaScript framework
Learning curve

Finally, for the REST API, I’ve also pretty much decided on Express (if we go with Node.js):

Strengths

Lightweight
Easy integration into node.js
Flexible routing and callback structure, authorization & middleware
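The middleware idea is what I find most appealing, so here is a dependency-free sketch of the (req, res, next) chaining pattern. To be clear, this is my own toy runner written to illustrate the pattern, not Express's implementation, and the x-api-key header and the two middleware functions are hypothetical.

```javascript
// Run middleware functions in order. Each receives (req, res, next) and calls
// next() to pass control on, or next(err) to abort the chain.
function runMiddleware(middlewares, req, res, done) {
  let index = 0;
  function next(err) {
    if (err) return done(err);
    const mw = middlewares[index++];
    if (!mw) return done(null);
    try {
      mw(req, res, next);
    } catch (thrown) {
      done(thrown);
    }
  }
  next();
}

// Hypothetical authorization step: reject requests without an API key.
function requireApiKey(req, res, next) {
  if (!req.headers['x-api-key']) {
    const err = new Error('missing API key');
    err.status = 401;
    return next(err);
  }
  req.authorized = true;
  next();
}

// Hypothetical step that normalizes the reporting window for later handlers.
function parseWindow(req, res, next) {
  req.windowSeconds = Number(req.query.window) || 60;
  next();
}
```

Each concern (authorization, parameter parsing, the actual report query) stays in its own small function, and the order of the array is the order of execution.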

Weaknesses

Yet another javascript package
Can’t think of any others, really (compared to what we are using in our other app, Java Spring)

These are my thoughts so far. In the following posts, I’ll begin describing how I ended up prototyping our real time analytics engine, creating a test harness, and providing modularity so that we can make design decisions without too much downtime.
