My current project at work is to architect a solution that captures network packet data sent over UDP, aggregates it, analyzes it, and reports it back to our end users. At least, that’s the vision. It’s a big undertaking, and there are several approaches we could take. To that end, I’ve started prototyping a test harness to measure the read and write performance of various software at different levels of our stack.
Our criteria are as follows (taken from a slide I created):
Based on these four requirements pillars, I’ve narrowed down the platform choices to the following:
Node.js
- Single-threaded, event loop model
- Callbacks are executed asynchronously when a request is made; can handle thousands more requests than Tomcat
- No locks; almost no function in Node directly performs I/O
- Great at handling lots of small dynamic requests
- Logo is cool
- Taking advantage of multi-core CPUs means starting extra processes via child_process.fork()
- APIs still in development
- Not so good at handling large buffers
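To make the event-loop model concrete for our UDP use case, here is a minimal sketch using Node's built-in `dgram` module. The 8-byte packet layout, the `createAggregator` helper and the `startListener` wrapper are all assumptions made up for illustration; our real packet format is not shown here.

```javascript
const dgram = require('dgram');

// Hypothetical fixed packet layout (an assumption, not our real format):
// bytes 0-3: sensor id (uint32, big-endian), bytes 4-7: value (uint32, big-endian).
const PACKET_SIZE = 8;

function createAggregator() {
  const totals = new Map(); // sensor id -> { count, sum }
  return {
    // Called once per datagram. It only touches memory, so it returns quickly
    // and does not hold up the event loop.
    onPacket(buf) {
      if (buf.length < PACKET_SIZE) return false; // ignore malformed packets
      const id = buf.readUInt32BE(0);
      const value = buf.readUInt32BE(4);
      const entry = totals.get(id) || { count: 0, sum: 0 };
      entry.count += 1;
      entry.sum += value;
      totals.set(id, entry);
      return true;
    },
    // Plain-object view of the running totals, ready to serialize.
    snapshot() {
      return Array.from(totals, ([id, { count, sum }]) => ({ id, count, sum }));
    },
  };
}

// Wires an aggregator to a UDP socket. The 'message' callback runs on the
// single event-loop thread, so no locks are needed around the Map.
function startListener(port, aggregator) {
  const socket = dgram.createSocket('udp4');
  socket.on('message', (msg) => aggregator.onPacket(msg));
  socket.bind(port);
  return socket;
}
```

Calling `startListener(41234, createAggregator())` would start receiving on a port of your choosing (41234 is arbitrary). Keeping the parsing in `onPacket` separate from the socket makes it easy to exercise from a test harness without any network traffic.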
Vert.x
- Allows writing code as if it were single-threaded (uses a distributed event bus, a distributed peer-to-peer messaging system)
- Can write “verticles” in any language
- Built on the JVM and scales over multiple cores without needing to fork multiple servers
- Can be embedded in existing Java apps (more easily)
- Independent project (very new), maybe not well supported?
- A verticle can block the event loop, which stalls the other verticles sharing it. Distributing the work across many verticles makes this more likely to happen.
- Multiple languages = debugging hell?
- Is there overhead in scaling over multiple nodes?
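The event bus idea is easier to see in code. The sketch below is a toy, in-process stand-in written in plain JavaScript; it is not the Vert.x API, and `createEventBus`, `register`, `publish` and `send` are names chosen for illustration. It mimics two delivery styles: publish (every registered handler gets the message) and send (exactly one handler gets it, chosen round-robin).

```javascript
// Toy in-process event bus: an illustration of the messaging model only.
function createEventBus() {
  const handlers = new Map(); // address -> array of handler functions
  const cursor = new Map();   // address -> next round-robin index for send()
  return {
    register(address, handler) {
      if (!handlers.has(address)) handlers.set(address, []);
      handlers.get(address).push(handler);
    },
    // Deliver the message to every handler registered at the address.
    publish(address, message) {
      const list = handlers.get(address) || [];
      list.forEach((h) => h(message));
      return list.length;
    },
    // Deliver the message to exactly one handler, chosen round-robin.
    send(address, message) {
      const list = handlers.get(address) || [];
      if (list.length === 0) return false;
      const i = (cursor.get(address) || 0) % list.length;
      cursor.set(address, i + 1);
      list[i](message);
      return true;
    },
  };
}
```

Because this toy calls handlers synchronously on one thread, a slow handler delays every handler after it, which is the same blocking concern listed above in miniature.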
For our data, we are considering several NoSQL databases. Data integrity is not that important to us, because our data is not make-or-break for a company. Write performance, however, is essential: we expect 10,000 to 100,000 fixed-format writes per second, but far fewer reads.
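To get a feel for what that write rate means for storage, here is a back-of-the-envelope calculation. The 64-byte record size is purely an assumption for illustration; substitute the real fixed-format record size once it is known.

```javascript
const SECONDS_PER_DAY = 86400;

// Raw payload throughput for a given write rate and record size.
function bytesPerSecond(writesPerSecond, recordBytes) {
  return writesPerSecond * recordBytes;
}

// Raw payload volume accumulated over one day at a sustained rate.
function bytesPerDay(writesPerSecond, recordBytes) {
  return bytesPerSecond(writesPerSecond, recordBytes) * SECONDS_PER_DAY;
}

const RECORD_BYTES = 64; // assumption: hypothetical fixed record size

const lowPerDay = bytesPerDay(10000, RECORD_BYTES);   // 55,296,000,000 bytes, about 55 GB
const highPerDay = bytesPerDay(100000, RECORD_BYTES); // 552,960,000,000 bytes, about 553 GB
```

These figures cover raw payload only. Indexes, replication and per-record overhead in whichever database we pick would add to them.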
MongoDB
- Maps in-memory space directly to disk
- B-tree indexing guarantees logarithmic lookup performance
- “Documents” akin to objects
- Uses BSON (binary JSON)
- Fields are key-value pairs
- “Collections” akin to tables
- No conversion necessary from object in data to object in programming language
- Mongo handles sharding for you
- Uses a primary and secondary hierarchy (primary receives all writes)
- Automatic failover
- Document BSON structure is very flexible, intuitive and appropriate for certain types of data
- Easily query-able using mongo’s query language
- Built-in MapReduce and aggregation
- BSON maps easily onto JSON, which makes the data easy to consume in the front end for charting, analytics, etc.
- Seems hip
- Questionable scalability and write performance for high volume bursts
- Balanced B-tree indices may not be the best choice for our type of data (more columnar/row-based, keyed on timestamp)
- NoSQL but our devs are familiar with SQL
- High disk space/memory usage
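As a rough illustration of the “documents akin to objects” point, here is what one captured packet might look like as a document. The field names and the `toDocument` helper are hypothetical, and the round trip below goes through JSON rather than BSON (BSON adds types such as dates and binary data that JSON lacks).

```javascript
// Hypothetical document shape for one captured packet (field names are assumptions).
function toDocument(sourceIp, port, payloadBytes, timestampMs) {
  return {
    source: { ip: sourceIp, port: port }, // nested object, no join table needed
    payloadBytes: payloadBytes,
    capturedAt: timestampMs,              // epoch milliseconds, kept numeric for JSON
  };
}

const doc = toDocument('10.0.0.5', 5140, 512, 1400000000000);

// The same object survives a trip through JSON unchanged, which is what
// makes it easy to hand straight to a charting front end.
const wire = JSON.stringify(doc);
const restored = JSON.parse(wire);
```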
Cassandra
- Peer-to-peer distributed system; Cassandra partitions the data for you
- No master node, so you can read-write anywhere
- Keyspace akin to database
- Column family akin to table
- Rows make up column families
- Uses a “gossip” protocol
- Data is written to a commit log, then to an in-memory structure (memtable), then flushed to disk as a sorted string table (SSTable)
- Customizable by user (allows tunable data redundancy)
- Highly scalable; adding new nodes gives linear performance increases
- No single point of failure
- Read-write from anywhere
- No need for caching software (handled by the database cluster)
- Tunable data consistency (depends on the use case, but can enforce transactions)
- Flexible schema
- Data compression built in (advertised as having no performance penalty)
- Data modeling is difficult
- Another query language to learn
- Best stuff is used by Facebook, but perhaps not released to the public?
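The commit log, memtable and sorted-strings table write path described above can be sketched as a toy in-memory model. This is only an illustration of the ordering of the steps, not how Cassandra is implemented; `ToyStore` and its flush threshold are made up for the example.

```javascript
// Toy model of a log-structured write path (illustration only).
class ToyStore {
  constructor(flushThreshold) {
    this.flushThreshold = flushThreshold;
    this.commitLog = [];       // append-only record of writes not yet flushed
    this.memtable = new Map(); // in-memory view of recent writes
    this.sstables = [];        // immutable, key-sorted batches standing in for disk files
  }

  write(key, value) {
    this.commitLog.push({ key, value }); // 1. append to the commit log
    this.memtable.set(key, value);       // 2. update the memtable
    if (this.memtable.size >= this.flushThreshold) this.flush(); // 3. flush when full
  }

  flush() {
    // Sort entries by key before "writing" them out as one immutable table.
    const sorted = Array.from(this.memtable.entries())
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    this.sstables.push(sorted);
    this.memtable.clear();
    this.commitLog = []; // flushed writes no longer need replaying
  }

  read(key) {
    if (this.memtable.has(key)) return this.memtable.get(key);
    // Search flushed tables from newest to oldest so the latest value wins.
    for (let i = this.sstables.length - 1; i >= 0; i--) {
      const hit = this.sstables[i].find(([k]) => k === key);
      if (hit) return hit[1];
    }
    return undefined;
  }
}
```

The point of the design is that every write is an append or an in-memory update, with sorting deferred to flush time, which is why this style of store suits a write-heavy workload like ours.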
FRONT-END/REST LAYER
For our REST layer/front end, I’ve built apps in jQuery, Angular, PHP and JSP, and hands down my favorite is Angular. So that seems like the obvious choice here.
- No DOM Manipulation! Can’t say how valuable this is …
- Built-in testing framework
- Intuitive organization of MVC architecture (directives, services, controllers, html bindings)
- Built by Google, so trustworthy software
- Higher level than jQuery, so direct DOM manipulation is discouraged and more difficult to do
- Fewer third-party packages released for Angular
- Learning curve
Finally, for the REST API, I’ve also pretty much decided on Express (if we go with Node.js):
- Easy integration into Node.js
- Flexible routing and callback structure, authorization & middleware
- Can’t think of any, really (compared to what we are using in our other app, Java Spring)
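The routing and middleware structure mentioned above follows a simple pattern: each handler receives `(req, res, next)` and either responds or passes control on. The sketch below reproduces that pattern in plain JavaScript so it runs without any dependencies. It is not Express itself, and `runChain`, `requireToken` and `statsHandler` are hypothetical names.

```javascript
// Runs (req, res, next) handlers in order. A handler ends the chain
// simply by not calling next().
function runChain(handlers, req, res) {
  let index = 0;
  function next() {
    const handler = handlers[index++];
    if (handler) handler(req, res, next);
  }
  next();
  return res;
}

// Hypothetical authorization middleware: rejects requests without a token.
function requireToken(req, res, next) {
  if (!req.token) {
    res.status = 401;
    res.body = { error: 'unauthorized' };
    return; // stop here; the route handler never runs
  }
  next();
}

// Hypothetical route handler for a made-up stats endpoint.
function statsHandler(req, res) {
  res.status = 200;
  res.body = { packets: 0 };
}
```

Running `runChain([requireToken, statsHandler], { token: 'abc' }, {})` reaches the route handler, while a request without a token stops at the authorization step.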
These are my thoughts so far. In the following posts, I’ll describe how I ended up prototyping our real-time analytics engine, creating a test harness, and keeping things modular so that we can make design decisions without too much downtime.