

Cloud Data Retention & Analysis: Unlocking the Power of Your Data


Enterprise data growth is accelerating rapidly in 2021, challenging organizations to maximize the value of their data and fulfill compliance requirements while minimizing storage costs.

To meet this challenge, organizations are adopting or refining their cloud data retention strategies, addressing questions like:

  • Which data should be retained in the cloud?
  • How long should we retain data in cloud storage?
  • What type of cloud data storage should we use to retain data?


In this blog post, we’ll take a closer look at the state of data retention and analytics in the cloud. 

We’ll examine how organizations are storing their data in the cloud, the importance of cloud data retention, and the biggest challenges associated with analyzing large datasets in the cloud.

Finally, we’ll explore how innovative software technologies are addressing the challenges of cloud data retention and analysis at scale.

Data Retention Policy Considerations


What is Cloud Data Retention?

Cloud data retention is the practice of storing, archiving, or otherwise retaining data in cloud storage.

There are three types of cloud data storage that may be used to facilitate cloud data retention:

  1. Object Storage - Object storage designates each piece of data as an object, attaches rich metadata to every object, and does away with the hierarchical organization of “files and folders”. Data in object storage lives in a flat address space called a storage pool, which enables fast data retrieval and efficient analytics at scale (see the short sketch after this list).
  2. File Storage - In a file storage system, data exists in named files that are organized into folders. Folders may be nested within other folders, forming a hierarchy of directories and sub-directories. Files carry only limited metadata, such as the file name, creation date, and last-modified date.
  3. Block Storage - Block storage splits data into fixed-size blocks, assigns each block a unique identifier, and stores the blocks on a Storage Area Network (SAN). SANs present block storage to other networked systems, leveraging a high-speed architecture to deliver low-latency access for high-performance workloads.
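
To make the object storage model concrete, here is a minimal sketch using boto3, the AWS SDK for Python. The bucket name, object key, and metadata fields are hypothetical placeholders, not values from this post.

```python
import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are configured locally

# Objects live in a flat namespace: the key "logs/2021/06/app.log" is just a
# string, not a real directory path, and each object carries its own metadata.
s3.put_object(
    Bucket="example-retention-bucket",  # hypothetical bucket
    Key="logs/2021/06/app.log",
    Body=b"2021-06-01T12:00:00Z INFO service started",
    Metadata={"source": "app-server-01", "retention-class": "operational"},
)

# Retrieving the object's head returns its metadata alongside system attributes.
head = s3.head_object(Bucket="example-retention-bucket", Key="logs/2021/06/app.log")
print(head["Metadata"])  # {'source': 'app-server-01', 'retention-class': 'operational'}
```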

For many enterprise organizations that store data in the public cloud, Amazon Simple Storage Service (Amazon S3) is often considered the best option for long-term cloud data retention. S3 is a cloud object storage service with six storage classes, each designed to accommodate specific access requirements at a competitive cost:

  • Amazon S3 Standard - cost-optimized for frequently accessed data.
  • Amazon S3 Intelligent-Tiering - automatically moves data between access tiers, suited to data with changing or unknown access patterns.
  • Amazon S3 Standard-Infrequent Access - used for long-term storage of infrequently accessed data.
  • Amazon S3 One Zone-Infrequent Access - used for long-term storage of infrequently accessed data in a single availability zone.
  • Amazon S3 Glacier - provides durable, low-cost storage for data archiving.
  • Amazon S3 Glacier Deep Archive - an alternative to magnetic tape libraries for data that must be retained for 7-10 years.

With its multiple storage tiers and unlimited storage capacity, Amazon S3 is both a cost-effective and highly scalable option for long-term object storage in the cloud. Amazon also provides solutions for file storage (Amazon Elastic File System, or EFS) and block storage (Amazon Elastic Block Store, or EBS).
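
As a sketch of how these storage classes can support a retention strategy, the boto3 example below attaches a lifecycle rule that ages log objects from S3 Standard into Standard-IA, Glacier, and finally Deep Archive before expiring them. The bucket name, prefix, and day thresholds are illustrative assumptions, not recommendations.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical lifecycle policy: objects under the "logs/" prefix move to
# cheaper storage classes as they age, then expire after roughly seven years.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-retention-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-log-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Remove objects once the illustrative seven-year window ends.
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```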


Why is Cloud Data Retention Important?

Data retention has always been a requirement for businesses.

In the past, those requirements were fairly narrow in scope and easy to manage.

Today, cloud data retention requirements have become more complex as organizations face increased regulation of their data storage practices and a stronger need to utilize data for business decision-making.

Here’s why cloud data retention is becoming increasingly important in 2021:

Reason #1: Regulatory Compliance Management

Recent high-profile data breaches and reports of large-scale privacy violations have led to new regulations governing how corporations protect their data. One example is the Payment Card Industry Data Security Standard (PCI DSS), which requires organizations that collect customer credit card data to:

  1. Only retain credit card information if absolutely necessary, and
  2. Ensure that retained credit card data is adequately protected.

Organizations that wish to demonstrate compliance with PCI DSS may need to show evidence of quarterly scanning and penetration testing, as well as evidence of regular event log reviews. Long-term retention of event log data that accurately documents how credit card information was stored and accessed makes this possible.
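
As one way to enforce this kind of retention in AWS, the hedged sketch below uses S3 Object Lock to apply a default write-once-read-many (WORM) retention period to a bucket of audit logs. The bucket name and one-year period are assumptions for illustration; note that Object Lock must be enabled when the bucket is created.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical example: enforce a default one-year WORM retention on audit logs.
# COMPLIANCE mode prevents anyone, including the root account, from deleting
# locked objects or shortening their retention before the period elapses.
s3.put_object_lock_configuration(
    Bucket="example-audit-log-bucket",  # placeholder; Object Lock enabled at creation
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 365}},
    },
)
```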

Reason #2: Contracts & Litigation Protection

Organizations are legally obligated to protect documents that could become relevant to litigation. They may also be required to retain sales records, warranties, service records, and other documentation to meet their contractual obligations to customers and other stakeholders.

Reason #3: Supporting IT Functions & Processes

Cloud data retention can play an important role in supporting key IT functions, processes, and business intelligence initiatives.

This is especially true for the growing number of enterprise organizations that retain application, network, and system log files to support IT functions like system troubleshooting, network security monitoring, application performance optimization, and capacity management.


What Challenges are Impacting Cloud Data Retention?

Public cloud service providers like AWS deliver cost-effective solutions for retaining data in the cloud at scale, but most enterprises need better technology to efficiently analyze the vast amounts of log data they’re generating every day in increasingly complex cloud environments.

Many organizations are still using open-source solutions like the ELK Stack (Elasticsearch, Logstash, Kibana) to support their log analytics initiatives, and they’re running into major problems as they scale up operations in the cloud.

Here’s why:

  1. As organizations deploy more applications, services, and devices in the cloud, they produce greater volumes of log data.
  2. Before this log data can be used, it must be aggregated from a variety of sources (by Logstash or another log aggregation tool) and stored in an Elasticsearch index.
  3. Although Elasticsearch imposes no hard limit on the size of an index, query performance tends to degrade as indices grow larger. Indexing failures (and potential data loss) can also occur when an index outgrows the storage available on its host server.
  4. To avoid degrading performance as data volumes grow, ELK Stack users are forced to scale their deployments vertically or horizontally, adding cost and complexity (one common workaround, index rollover, is sketched after this list).
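
To illustrate the kind of workaround this forces, here is a minimal sketch using the official elasticsearch Python client (8.x-style API) to define an index lifecycle management (ILM) policy that rolls the write index over before it grows large enough to hurt query performance. The endpoint, policy name, and thresholds are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Hypothetical ILM policy: roll the write index over before it gets large,
# then delete aged-out indices to cap storage, trading retention for performance.
es.ilm.put_lifecycle(
    name="logs-rollover-policy",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"}
                }
            },
            "delete": {
                "min_age": "30d",  # indices are deleted 30 days after rollover
                "actions": {"delete": {}},
            },
        }
    },
)
```

Rollover caps the size of any individual index, but it does not change the underlying trade-off: keeping more data searchable still means more indices, more shards, and more cluster resources.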

The connection between growing log data volumes and the degrading performance of legacy log analytics solutions like the ELK Stack leaves enterprises with a difficult choice: either start reducing their data retention, or navigate the costs and technical challenges of scaling Elasticsearch.

And they’re both bad choices.

Reducing data retention means limiting data utilization in a way that can negatively impact security monitoring and other critical use cases for log data.

On the other hand, scaling Elasticsearch increases complexity and resource demands, threatening the cost-effectiveness of log analytics initiatives.

Dilemmas like this one are the driving force behind the adoption of new technologies that mitigate data retention challenges and truly enable data analytics at scale.

READ: The Ultimate Data Retention Policy Guide


Data Retention Storage Recommendations


How Does ChaosSearch Impact Cloud Data Retention & Analysis?

ChaosSearch brings a new approach to data analysis that gives organizations more streamlined and cost-effective access to analyze the massive quantities of data they have retained in the cloud. This approach consists of three key innovations:

Chaos Index® - a multi-model data format that delivers high-performance querying and extreme data compression. Chaos Index® supports both text search and relational queries, and it enables organizations to discover, normalize, and index data autonomously and at scale.

Chaos Fabric® - containerized orchestration of the core Chaos Index® functions: indexing, searching, and querying data. By eliminating resource contention, Chaos Fabric® keeps those functions high-performance and cost-effective at scale.

Chaos Refinery® - allows end users to clean, prepare, and transform data without moving it out of Amazon S3 buckets. Users can interact with data in real time and create visualizations using existing tools like Kibana.

ChaosSearch runs as a managed service with Amazon S3 as the sole backing store for data.

Users continue to benefit from the cost economics and unlimited data storage of Amazon S3 - but they also get the ability to search, query, and analyze their log data at scale using ChaosSearch.

As a result, users of ChaosSearch no longer have to choose between limiting their cloud data retention or adding complexity to their log analytics solution.

READ: Breaking the Logjam of Log Analytics


Summary

Cloud data retention is a growing concern for enterprise organizations who are producing large volumes of data in increasingly complex cloud environments.

Cloud data retention solutions include object storage, file storage, and block storage, with popular options including Amazon S3, Amazon EFS, and Amazon EBS. Organizations that use the ELK Stack for log analytics often depend on Lucene indices for long-term storage of log files in the cloud, which can be problematic at scale.

While expanding cloud data retention is important for compliance management, litigation protection, and supporting IT functions, most organizations don’t have the ability to truly analyze that data at scale. This leaves organizations to choose between reducing their cloud data retention or scaling up their log analytics solutions.

ChaosSearch is solving for data analysis at scale with new technology that leverages the cloud’s unlimited data retention and performs cost-effective log analysis directly in the cloud.

See the Solution

About the Author, Thomas Hazel

Thomas Hazel is Founder, CTO, and Chief Scientist of ChaosSearch. He is a serial entrepreneur at the forefront of communication, virtualization, and database technology, and the inventor of ChaosSearch’s patented IP. Thomas has also patented several other technologies in the areas of distributed algorithms, virtualization, and database science. He holds a Bachelor of Science in Computer Science from the University of New Hampshire, where he is a Hall of Fame Alumni inductee, and he founded both student and professional chapters of the Association for Computing Machinery (ACM).