Data Transformation & Log Analytics | ChaosSearch

Written by Thomas Hazel | Feb 11, 2021

Logs are automatically generated records of events that take place within a cloud-based application, network, or infrastructure service.

These records are stored in log files, creating an audit trail of system events that can be analyzed for a variety of purposes, including:

Monitoring the performance of applications
Troubleshooting systems, networks, and machines
Measuring and understanding user behavior
Verifying compliance with internal policies and industry-specific regulations
Detecting and responding to security incidents
Evaluating the root cause of a network event

Enterprise organizations use log analytics software to aggregate, transform, and analyze data from log files, developing insights that drive business decisions and operational excellence. Log analytics capabilities are essential for enterprise organizations that need to maintain oversight of increasingly complex cloud computing environments and make the best use of their data.

There’s one major stumbling block that can add significant costs and complexity to the log analytics process: data transformation.

What Is Data Transformation?

Data transformation is the systematic process of converting data from its “raw” source format into a “structured” destination format that’s ready for analysis. As organizations adopt new technologies and expand their presence in the cloud, they generate an increased volume of log data that must be cleaned and transformed before analysts can make use of it. With today’s popular log analytics solutions, increased demand for data transformation resources invariably correlates with greater complexity and higher total cost of ownership (TCO).

In this blog post, we’re looking at the role of data transformation in log analytics and how the data transformation process can be optimized to reduce costs and complexity when dealing with large volumes of data.

We’ll explore the drawbacks associated with data transformation in one of today’s most popular log analytics solutions, the ELK stack, and show you how ChaosSearch is revolutionizing data transformation in the cloud.

What Are the Steps in Data Transformation?

Data transformation is the process of converting data from its raw source format into a desired format that’s ready to be analyzed by humans or by a log analytics software program.

The Data Transformation Process

Data Discovery - Data transformation begins with data discovery, the use of data profiling tools or scripts to identify data sources, understand their structure and characteristics, and determine how they should be transformed to achieve the desired format for analysis.
Data Mapping - Data mapping is the manual process of determining how individual fields should be transformed to achieve the desired format. Several types of transformations can be applied to log data, such as:

Format revision
Decoding of fields
Calculated and derived values
Deduplication
Date/Time conversion
Unit of measurement conversion
Merging of information
Splitting of single fields

Code Generation - In the code generation step, an engineer or member of the cloud DevOps team writes a script that will transform the data based on the defined data mapping rules and desired format.
Code Execution - Once the data transformation code has been completed and tested, it can be executed against the log data. The log data will be transformed into the desired format for computer or human analysis.
Data Review - As a final step, a developer or data analyst may review the output of the data transformation process for errors or anomalies. Discrepancies should be investigated and corrected to ensure the accuracy of transformed data prior to its analysis.
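To make the mapping, code generation, execution, and review steps concrete, here is a minimal Python sketch. The log line format, the field names, and the helper names (transform, transform_all) are assumptions made up for illustration; they do not come from any particular product. The sketch applies a few of the transformation types listed above: splitting a single field, date/time conversion, unit of measurement conversion, a derived value, and deduplication.

```python
import re
from datetime import datetime, timezone

# Hypothetical raw log lines; this format is an assumption for illustration.
RAW_LINES = [
    "11/Feb/2021:10:15:32 +0000 GET /api/users 200 1532ms",
    "11/Feb/2021:10:15:32 +0000 GET /api/users 200 1532ms",  # exact duplicate
    "11/Feb/2021:10:15:40 +0000 POST /api/login 401 87ms",
]

# Data mapping: one raw line is split into named fields.
LINE_PATTERN = re.compile(
    r"(?P<ts>\S+ \S+) (?P<method>\S+) (?P<path>\S+) "
    r"(?P<status>\d{3}) (?P<duration_ms>\d+)ms"
)


def transform(line):
    """Turn one raw line into a structured record, or None if it does not parse."""
    match = LINE_PATTERN.fullmatch(line)
    if match is None:
        return None
    fields = match.groupdict()
    ts = datetime.strptime(fields["ts"], "%d/%b/%Y:%H:%M:%S %z")
    status = int(fields["status"])
    return {
        "timestamp": ts.astimezone(timezone.utc).isoformat(),  # date/time conversion
        "method": fields["method"],
        "path": fields["path"],
        "status": status,
        "duration_s": int(fields["duration_ms"]) / 1000,  # unit conversion: ms to s
        "is_error": status >= 400,  # derived value
    }


def transform_all(lines):
    """Code execution plus a basic data review: collect lines that failed to parse."""
    seen = set()
    records, rejected = [], []
    for line in lines:
        if line in seen:  # deduplication of identical raw lines
            continue
        seen.add(line)
        record = transform(line)
        if record is None:
            rejected.append(line)
        else:
            records.append(record)
    return records, rejected


records, rejected = transform_all(RAW_LINES)
```

The rejected list is what a developer or analyst would inspect in the data review step; in this sketch every sample line parses, so it stays empty.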

Next, we’ll look at how the data transformation process works in the ELK stack, one of today’s most popular log analytics solutions for cloud environments.

How Does Data Transformation Work in the ELK Stack?

Let’s start with a quick recap of the three main ELK stack components: Logstash, Elasticsearch, and Kibana.

Logstash is an open source tool that was designed to support log aggregation from multiple sources in complex cloud computing environments.

Elasticsearch acts as a searchable index for log data.

Kibana allows users to search for log data in Elasticsearch, analyze it, and create data visualizations that drive insights.

If we’re focusing on the data transformation capabilities of the ELK stack, we need to take a close look at Logstash and how it works to aggregate and transform data before pushing it into the Elasticsearch index. We also need to understand how Elasticsearch uses re-indexing to transform indexed data.

Event logs are processed by Logstash in three phases: aggregation, transformation, and dispatching.

These phases are governed by user-created Logstash configuration files containing three different types of plugins:

Input plugins allow users to collect and process data from 50+ different applications, platforms, and databases.
Filter plugins allow users to enrich, manipulate, and transform event log data. The most common filter plugin, known as Grok, allows users to transform unstructured data into structured data that is ready to be shipped to Elasticsearch and analyzed with Kibana.
Output plugins allow users to send enriched data to other locations or services, such as Elasticsearch or Amazon S3 buckets.
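The three plugin types map onto a simple input, filter, output pipeline. The Python sketch below imitates that shape; it is not Logstash and does not use Logstash configuration syntax. The log format, the field names, and the "parse_failure" tag are assumptions chosen for illustration. The named regular expression groups play a role similar to a Grok pattern, turning an unstructured line into named fields.

```python
import re

# Grok-like pattern written as named regex groups (assumed log format).
PATTERN = re.compile(
    r"(?P<client_ip>\d+\.\d+\.\d+\.\d+) (?P<level>[A-Z]+) (?P<message>.*)"
)


def input_stage(lines):
    """Input: wrap each raw line in an event."""
    for line in lines:
        yield {"raw": line.rstrip("\n")}


def filter_stage(events):
    """Filter: add structured fields, or tag the event if it does not parse."""
    for event in events:
        match = PATTERN.match(event["raw"])
        if match:
            yield {**event, **match.groupdict()}
        else:
            yield {**event, "tags": ["parse_failure"]}


def output_stage(events, sink):
    """Output: ship events to a destination (a list stands in for a real service)."""
    for event in events:
        sink.append(event)


def run_pipeline(lines):
    sink = []
    output_stage(filter_stage(input_stage(lines)), sink)
    return sink


events = run_pipeline(["10.0.0.1 ERROR disk full\n", "not a log line"])
```

The first event comes out with client_ip, level, and message fields; the second keeps only its raw text and a tag marking it as unparsed.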

Log data sent from Logstash to Elasticsearch is stored in an index.

Indexed data can be transformed and reorganized in Elasticsearch to generate different kinds of visualizations that may reveal new insights.

Data transformation in Elasticsearch requires log data to be aggregated from a source index (or indices), then re-indexed into a destination index.
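The shape of a re-index operation can be sketched with plain Python dictionaries standing in for indices. This does not call Elasticsearch; the index names, the reindex helper, and the duration_ms field are hypothetical. What it shows is that every source document is read, transformed, and written again into the destination.

```python
def reindex(indices, source_names, dest_name, transform):
    """Read every document from the source indices, transform a copy,
    and write the result to a new destination index.

    Returns (documents_read, documents_written).
    """
    dest = []
    docs_read = 0
    for name in source_names:
        for doc in indices[name]:
            docs_read += 1
            dest.append(transform(dict(doc)))  # copy so the source stays unchanged
    indices[dest_name] = dest
    return docs_read, len(dest)


def add_duration_seconds(doc):
    """Example transformation: derive seconds from milliseconds."""
    doc["duration_s"] = doc["duration_ms"] / 1000
    return doc


# Hypothetical in-memory "indices".
indices = {
    "logs-a": [{"duration_ms": 1500}],
    "logs-b": [{"duration_ms": 250}, {"duration_ms": 40}],
}
stats = reindex(indices, ["logs-a", "logs-b"], "logs-transformed", add_duration_seconds)
```

In this toy model the source indices still exist after the operation, so the transformed documents are stored in addition to the originals until the sources are removed.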

Top Data Transformation Challenges

Now that we’ve established how data transformation functions in the ELK stack, we can identify how using Logstash and Elasticsearch can lead to increased costs and complexity in the data transformation process - especially as organizations increase their daily ingest of log data.

1. Data Movement Increases Log Analytics Costs

The biggest issue with Logstash as data volume increases is the growing cost of inputting and outputting data. Data input/output from an Amazon EC2 Logstash instance can generate several types of fees, including:

API/data request fees are generated by input/output plugins that use API calls to retrieve log files from applications or infrastructure in the cloud.
Data egress fees are generated when Logstash transfers data out of the public cloud, between cloud regions, or between cloud availability zones.
Data retrieval fees may be generated when Logstash retrieves data from a lower-class storage tier, such as Amazon S3 Glacier.

As organizations produce increasing volumes of log data, the costs associated with moving that data in and out of Logstash can increase rapidly.

2. Data Transformation by Re-indexing is Resource Intensive

When data transformation takes place in Elasticsearch, it involves re-indexing: aggregating data from a source index, transforming it, then rewriting it to a destination index.

This process utilizes both computing resources and data storage and is always at least as resource-intensive as the initial aggregation and indexing of log data.

As an Elasticsearch index grows in size, more data storage and computing resources are needed to apply any transformation to the entire index.

This can make large-scale data transformation with Elasticsearch prohibitively expensive.

Some organizations try to cut costs by excluding data from transformation operations, a compromise that eventually limits their ability to realize the full value of data.

3. ELK Stack Growth Means More Complexity with Daily Log Volume

For organizations operating the ELK stack, increased daily log volume often requires a more complex deployment model to maximize data utilization and avoid data loss. Organizations may further customize their ELK stack by adding:

Lightweight data shipper utilities that capture event log data from additional sources.
A buffer tool like Kafka or RabbitMQ that sits in front of Logstash, queueing events to prevent data loss during the input process.
Custom scripts that archive event data to Amazon S3 buckets when Elasticsearch indices are too large, to prevent crashes and data loss.

These customizations allow the ELK stack to function more effectively with large volumes of data, but they also increase technical overhead and add to the number of things that can go wrong.
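The buffering idea can be illustrated with a bounded in-memory queue between a producer and a consumer. This is only a sketch of the decoupling: brokers such as Kafka or RabbitMQ are separate, durable services, and the run_buffered helper and its max_buffered parameter are hypothetical names used here for illustration.

```python
import queue
import threading


def run_buffered(events, process, max_buffered=1000):
    """Queue events in a bounded buffer so a slow consumer delays the
    producer instead of causing events to be dropped."""
    buffer = queue.Queue(maxsize=max_buffered)
    processed = []

    def consumer():
        while True:
            event = buffer.get()
            if event is None:  # sentinel: no more events
                break
            processed.append(process(event))

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in events:
        buffer.put(event)  # blocks while the buffer is full
    buffer.put(None)
    worker.join()
    return processed


result = run_buffered(["a", "b", "c"], str.upper, max_buffered=2)
```

Even this small example adds a thread, a sentinel value, and a buffer size to tune, which hints at the extra operational surface a real queueing layer brings.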

Ultimately, an overly complex solution for cloud log analysis can tie up valuable IT resources and stifle innovation - that’s why ChaosSearch is revolutionizing cloud data transformation with the Chaos Data Refinery.

Chaos Data Refinery: A New Approach to Data Transformation in the Cloud

ChaosSearch delivers a powerful new methodology for data transformation in the cloud, one that eliminates the need to move or tediously transform data prior to analysis - we call it the Chaos Data Refinery.

Here’s how it works:

Log data is automatically generated by applications, networks, services, and devices in the cloud.
Log data is sent directly to your Amazon S3 data lake.
Log data in Amazon S3 is stored using Chaos Index®, a highly compressed data representation that unlocks game-changing features and massively improved cost economics from cloud object storage.
The Chaos Refinery® platform cleans, prepares, and virtually transforms data directly in Amazon S3 storage - zero data movement required.
Data transformation in the Chaos Refinery® is completed without reindexing, resulting in substantially lower resource consumption and reduced costs.
Users can interact with data programmatically, or visually using the integrated Kibana interface for data analysis.
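The general idea of a virtual transformation, one applied when data is read rather than by rewriting stored data, can be illustrated in a few lines of Python. This is a generic sketch of the concept only; it is not how Chaos Index or Chaos Refinery is implemented, and the VirtualView class and the duration_ms field are hypothetical.

```python
class VirtualView:
    """Applies a transformation lazily when records are read.
    The stored records are never modified or copied into a new store."""

    def __init__(self, stored_records, transform):
        self._stored = stored_records
        self._transform = transform

    def __iter__(self):
        for record in self._stored:
            # Transform a copy at read time; nothing is written back.
            yield self._transform(dict(record))


stored = [{"duration_ms": 1500}, {"duration_ms": 250}]
view = VirtualView(stored, lambda r: {**r, "duration_s": r["duration_ms"] / 1000})
rows = list(view)
```

Changing the transformation here means constructing a new view over the same stored records, with no second copy of the data to build or store.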

The elimination of data movement and the ability to transform data without reindexing make Chaos Refinery® the most powerful and cost-effective solution for log analytics in the cloud.

Simplify Cloud Data Transformation

With ChaosSearch, organizations significantly reduce the cost and complexity of transforming data for cloud log analytics.

As a result, organizations are empowered to fully leverage their data in a variety of use cases, including security log analysis and application/service troubleshooting.

Developers can integrate our platform directly into SaaS applications with our cloud-based data integration service, providing their users with enhanced data access, observability, and search capabilities.

As organizations continue to experience unprecedented data growth, there’s never been a greater need for innovative technologies that streamline the data transformation process, refining unstructured data into usable, actionable data at scale. Are you ready for the future of data transformation in the cloud?
