
Leveraging ClickHouse for Efficient OpenTelemetry Tracing: A Resmo Case Study

ClickHouse
Application Infrastructure & Middleware - Database Management & Storage
Infrastructure as a Service (IaaS) - Cloud Storage Services
Equipment & Machinery
Retail
Intrusion Detection Systems
Time Sensitive Networking
System Integration
Resmo, a tool that gathers configuration data from Cloud and SaaS tools using APIs, struggled to manage the large volume of network calls that results from collecting data from thousands of APIs. Logs were too verbose and difficult to query, while aggregated metrics lacked the context needed to detect and diagnose specific issues. Resmo turned to tracing, which gave a better view of the flow of requests and their associated responses. However, Resmo's data collection generated a very large volume of spans, and the usual remedy, sampling, can create blind spots that make it hard to identify issues on rarely executed, non-happy paths. Furthermore, many vendors charge by the number of ingested events and by data volume per GB, which becomes costly when nothing is sampled. Only a few vendors allow custom SQL queries on the data.
Resmo is a tool that collects configuration data from Cloud and SaaS tools using APIs. It lets users explore this data with SQL to ask any question they want. Resmo ships with thousands of pre-built SQL-based rules and questions, and it also supports visual exploration of the collected data through filters, free-text search, or a graph view. Customers can create their own rules or use automation to receive notifications through various channels when the data or a rule's status changes. Resmo's data collection generates more than 300 million spans per day, and this number is growing rapidly as the customer base grows.
Resmo decided to use full tracing (no sampling) with OpenTelemetry and ClickHouse for cost-effective and efficient storage and querying of traces. Initially, they considered using S3 and Athena, but Athena's fixed startup delay of 2-3 seconds was a drawback. They hosted their own ClickHouse instance, which allowed them to store more than 4 billion spans at roughly 92% compression. To speed up common queries, they added materialized columns for the fields most frequently used in queries, monitors, and dashboards. They used the out-of-the-box configuration of the OpenTelemetry Collector with ClickHouse and the Java agent for distributed tracing, and added manual instrumentation in the form of context-specific tags on their spans. They connected ClickHouse to Postgres for their observability queries, joining the user and tenant IDs in their spans to the actual account names and account status in the Postgres database. For visualizing data they used Grafana, and for writing queries they used IntelliJ IDEA and DataGrip.
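The materialized-column step can be illustrated with a small sketch. It builds the kind of ClickHouse `ALTER TABLE ... ADD COLUMN ... MATERIALIZED` statement that lifts one span attribute out of an attribute map into its own column. The table name `otel_traces`, the map column `SpanAttributes`, and the attribute key `tenant.id` are assumptions chosen for illustration; the source does not state Resmo's actual schema or column names.

```python
def materialized_column_ddl(
    table: str,
    column: str,
    attribute: str,
    map_column: str = "SpanAttributes",  # assumed name of the attribute map column
) -> str:
    """Build a ClickHouse statement that extracts one span attribute into a column."""
    # Escape backslashes and single quotes so the key is a valid SQL string literal.
    escaped = attribute.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"ALTER TABLE {table} "
        f"ADD COLUMN IF NOT EXISTS {column} String "
        f"MATERIALIZED {map_column}['{escaped}']"
    )


# Hypothetical example: expose the tenant ID tag as a dedicated column.
ddl = materialized_column_ddl("otel_traces", "tenant_id", "tenant.id")
print(ddl)
```

Queries, monitors, and dashboards can then filter on `tenant_id` directly instead of performing a map lookup on every row.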
Adopting ClickHouse and OpenTelemetry for full tracing has significantly improved Resmo's observability. Resmo can now store and query every trace efficiently, which gives a better view of the flow of requests and their associated responses. Materialized columns for frequently used fields have significantly improved query performance without affecting storage or the compression rate. Connecting ClickHouse to Postgres lets Resmo join the user and tenant IDs in their spans to the actual account names and account status in the Postgres database. This has given Resmo a great deal of flexibility, which it also exposes to customers so they can easily ask arbitrary questions. Grafana for visualizing data, and IntelliJ IDEA and DataGrip for writing queries, round out the workflow.
Resmo's ClickHouse instance stores more than 4 billion spans.
The data stored in ClickHouse consumes 275 GiB on disk versus 3.40 TiB uncompressed, a size reduction of about 92%.
Queries that scan all of the data complete quickly and are mostly limited by disk bandwidth.
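As a quick check of the storage figures above, the sketch below recomputes the size reduction from the two reported numbers. It assumes binary units (1 TiB = 1024 GiB) and reads the 92% figure as the fraction of space saved.

```python
GIB_PER_TIB = 1024

compressed_gib = 275                    # reported on-disk size
uncompressed_gib = 3.40 * GIB_PER_TIB   # reported uncompressed size, in GiB

# How many times smaller the data is on disk.
compression_ratio = uncompressed_gib / compressed_gib

# Fraction of space saved, expressed as a percentage.
reduction_pct = (1 - compressed_gib / uncompressed_gib) * 100

print(f"ratio: {compression_ratio:.1f}x, reduction: {reduction_pct:.1f}%")
```

Under these assumptions the data is between 12 and 13 times smaller on disk than uncompressed, which is consistent with the reported 92%.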