WitFoo Precinct persists and replicates data on the big-data NoSQL platform Apache Cassandra. Precinct 6.1.3 is built on Cassandra 3.11. In preparation for the upgrade to Cassandra 4.0, we conducted the following lab and production testing.
Lab Appliances
WitFoo Precinct clusters consisting of 1 Management, 1 Streamer, and 3 Data nodes were deployed in AWS using the official Marketplace images. The instances were configured to use AWS GP2 SSD volumes (the recommended default) and ran on c5d.2xlarge hardware (8 CPU cores, 16 GB RAM).
The code running on each deployment was identical except for the Cassandra version. The Cassandra 3.11 (C3) cluster was configured with AWS nodes identical to those of the Cassandra 4.0 (C4) cluster. Schema, replication strategies, and other key settings were also identical, and the replication factor was set to 3 in both clusters. The Cassandra heap was set to -Xms3866M -Xmx3866M on all nodes.
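As a point of reference, a keyspace with the replication factor used in both clusters can be created as follows. This is a minimal sketch using the DataStax Python driver; the hostname and the choice of SimpleStrategy are assumptions for illustration, not the actual Precinct configuration.

```python
from cassandra.cluster import Cluster

# Connect to one of the data nodes (hostname is illustrative).
cluster = Cluster(["data-node-1"], port=9042)
session = cluster.connect()

# Replication factor 3, matching the setting used on both the Cassandra 3.11
# and Cassandra 4.0 lab clusters. SimpleStrategy is an assumption for this
# sketch; the real replication strategy is not shown in this post.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS artifacts
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

cluster.shutdown()
```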
Test Data
Both clusters were configured to process, store, and replicate the same data. The TTL on inserts was set to 8640 seconds. Data was inserted at a rate of 3 million rows per hour, with 1,000 rows per partition and an average partition size of 16 MB.
Each record is inserted as JSON. For more details on how we store and process data, see Our Move from Elastic to Cassandra. In this test, each cluster had a separate Streamer node reading from AWS and independently fingerprinting, parsing, and normalizing the data through NLP semantic framing.
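For illustration, the insert pattern described above (JSON rows written with a TTL) can be approximated with the DataStax Python driver as shown below. The table and column names are hypothetical; the real Precinct schema is not shown in this post.

```python
import json
from cassandra.cluster import Cluster

cluster = Cluster(["data-node-1"], port=9042)
session = cluster.connect("artifacts")

# Hypothetical columns: many rows share a partition key, loosely mirroring
# the 1,000-rows-per-partition layout used in the test. Each insert carries
# the 8640-second TTL from the test configuration.
insert = session.prepare("""
    INSERT INTO artifacts (partition_id, row_id, body)
    VALUES (?, ?, ?) USING TTL 8640
""")

record = {"source": "aws", "message": "example event"}
session.execute(insert, (42, 1, json.dumps(record)))

cluster.shutdown()
```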
Performance Results
The following are the performance results from the test environment.
Table histograms
Results from tablehistograms artifacts.artifacts are as follows:

| tablehistograms artifacts.artifacts | Cassandra 3.11 (in microseconds) | Cassandra 4.0 beta (in microseconds) |
| --- | --- | --- |
| Read Latency 50P | 2,816.16 | 2,816.16 |
| Read Latency 75P | 4,055.27 | 4,055.27 |
| Read Latency 95P | 8,409.01 | 5,839.59 |
| Read Latency 98P | 17,436.92 | 5,839.59 |
| Read Latency 99P | 17,436.92 | 5,839.59 |
| Write Latency 50P | 9.89 | 9.89 |
| Write Latency 75P | 11.86 | 11.86 |
| Write Latency 95P | 17.08 | 17.09 |
| Write Latency 98P | 24.60 | 24.60 |
| Write Latency 99P | 29.52 | 29.52 |
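The latency percentiles above were collected per node with nodetool tablehistograms. A minimal way to capture that output, assuming nodetool is on the PATH of the node being sampled, is:

```python
import subprocess

# Dump read/write latency percentiles for the artifacts.artifacts table.
# Assumes nodetool is installed locally and can reach Cassandra over JMX.
result = subprocess.run(
    ["nodetool", "tablehistograms", "artifacts.artifacts"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```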
Repairs
Times for running nodetool repair are as follows:

| After inserts start | Cassandra 3.11 | Cassandra 4.0 beta |
| --- | --- | --- |
| 30 minutes | 32 seconds | 8 seconds |
| 90 minutes | 71 seconds | 15 seconds |
| 6 hours | 95 seconds | 34 seconds |
| 11 hours | 77 seconds | 34 seconds |
Compaction
Times for running nodetool compact are as follows:

| After inserts start | Cassandra 3.11 | Cassandra 4.0 beta |
| --- | --- | --- |
| 30 minutes | 28 seconds | 15 seconds |
| 90 minutes | 61 seconds | 25 seconds |
| 6 hours | 80 seconds | 41 seconds |
| 11 hours | 79 seconds | 38 seconds |
Garbage Collection
Times for running nodetool garbagecollect are as follows:

| After inserts start | Cassandra 3.11 | Cassandra 4.0 beta |
| --- | --- | --- |
| 30 minutes | 29 seconds | 19 seconds |
| 90 minutes | 62 seconds | 32 seconds |
| 6 hours | 85 seconds | 45 seconds |
| 11 hours | 86 seconds | 57 seconds |
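The repair, compaction, and garbage collection timings in the three tables above can be reproduced by timing the corresponding nodetool commands. A simple sketch of that measurement, assuming nodetool is on the PATH, is:

```python
import subprocess
import time

# Time each maintenance operation the same way on both clusters.
for command in ("repair", "compact", "garbagecollect"):
    start = time.monotonic()
    subprocess.run(["nodetool", command], check=True)
    elapsed = time.monotonic() - start
    print(f"nodetool {command} took {elapsed:.0f} seconds")
```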
Performance Observations
Cassandra 4.0 delivered mild improvements in reads and writes, with much more stable results at the higher percentiles. The largest gains were in the cost of maintenance actions, where repair, compaction, and garbage collection times dropped dramatically.
Production Testing
In addition to lab testing, we have deployed Cassandra 4.0 to 48 data nodes across 15 individual clusters. Using the approaches outlined in Metric Driven Development, we observed similar success across all clusters. Tested clusters included deployments on a wide array of disk configurations, from slow magnetic spindles to extremely fast SSD arrays. Cluster sizes ranged from 1 to 7 nodes, with data retention of up to 12 TB (compressed). Replication across geographies also improved in production. Memory, IOPS, and CPU utilization saw mild improvements over Cassandra 3.11.
Summary
The performance and stability improvements in Cassandra 4.0 are a stride forward for big-data efforts. We intend to include Cassandra 4.0 in the upcoming Precinct 6.1.4 release to deliver reliable function at a lower resource cost to our customers. Great work by the entire Cassandra community in taking big data to the next level.