
ChaosSearch Blog


Unlocking Data Literacy Part 3: Choosing Data Analytics Technology


This is part 3 of a 3-part blog series. See also part 1 and part 2.

Ringing in the new year with new goals for data literacy?

The right data management strategy can help democratize access to analytics across your entire team, without the need for a data scientist or data engineer to act as an intermediary or bottleneck. As you examine your people’s data skills and related data literacy training processes, it might be time to consider a new approach to data analytics technology that facilitates data democratization in 2022. That’s right, your platform.

Data science and data analytics skills are in high demand for 2022. Combined with last year’s Great Resignation, that demand makes this specialized talent increasingly difficult for organizations to find. At the same time, most organizations store and access data in unnecessarily complicated ways. Solving some of these data management challenges with technology can help alleviate data engineering talent shortages and make it easier for anyone to access their own analytical insights.

First, let’s look at why data management is so complex today.

 


 

Enterprise data management: It’s complicated

Most organizations store data in different formats across a variety of sources. However, data lakes and data warehouses still represent the dominant enterprise data management strategies in 2022. Data lakes and data warehouses share some common characteristics, but the two technologies are fundamentally different.

 

Data warehouses and ETL

Driven by the increasing number of applications (or data sources), databases and data models in use, data warehousing solutions emerged. These tools work to extract data from transactional or operational systems, transform it into a usable format, and load it into business intelligence (BI) systems to support analytical decision-making activities (a process otherwise known as ETL).

Because data warehouses follow a schema-on-write approach, data must have a defined schema before writing into a database. As a result, all warehouse data has been previously cleaned and transformed (usually by a data scientist or data engineer in the ETL process). When a BI user accesses the data warehouse, they’re actually accessing processed data (rather than raw data). The problem with this approach is that data has been transformed for a predetermined use case. This prevents BI users from interrogating and interacting with the data in different ways to reveal new information. Each new analytics request starts the ETL process over again, resulting in slower time to insights.
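The schema-on-write limitation described above can be sketched in a few lines of Python. This is a minimal illustration, not a real warehouse pipeline: the event records, the `sales` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw events from an operational system (illustrative data only).
RAW_EVENTS = [
    {"customer": "ana", "amount": "20.00", "device": "mobile", "ts": "2022-01-03T10:00:00"},
    {"customer": "bo", "amount": "5.00", "device": "desktop", "ts": "2022-01-03T11:30:00"},
    {"customer": "ana", "amount": "7.50", "device": "desktop", "ts": "2022-01-04T09:15:00"},
]


def extract():
    """Extract: pull raw records from the (hypothetical) source system."""
    return list(RAW_EVENTS)


def transform(events):
    """Transform: shape the data for one predetermined use case (revenue per customer per day).

    The 'device' field is not part of that use case, so it is dropped here.
    """
    return [(e["customer"], e["ts"][:10], float(e["amount"])) for e in events]


def load(conn, rows):
    """Load: the schema must exist before any row can be written (schema-on-write)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "customer TEXT NOT NULL, day TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the whole pipeline against an in-memory database standing in for a warehouse."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn


def revenue_by_customer(conn):
    """The question the pipeline was designed to answer."""
    return dict(conn.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer"))


if __name__ == "__main__":
    warehouse = run_etl()
    print(revenue_by_customer(warehouse))
    # The warehouse table has no 'device' column, so "revenue by device"
    # cannot be answered without changing transform() and reloading.
    print([row[1] for row in warehouse.execute("PRAGMA table_info(sales)")])
```

In this sketch, a new question such as revenue by device means editing `transform`, changing the table definition, and reloading the data, which is the "start the ETL process over again" cost described above.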

 

Data lakes and related analytical tools

Data lakes, or centralized data repositories, promised to replace data warehouses by storing data in a variety of formats and eliminating data silos. Their first iterations (e.g. Hadoop) largely fell short: as more data was added, the Hadoop-based data lake became a “data swamp” that was too expensive and difficult to maintain. Low-cost cloud object storage (such as Amazon S3) then emerged to relieve some of the administrative burdens of on-premises Hadoop clusters. Since these early days, there’s been a lot of progress.

While data warehouses can only ingest structured data that fits a predefined schema, data lakes can ingest all types of data in their source format. This encourages a schema-on-read model, where data is aggregated or transformed at the time of query (instead of during the ETL process, which happens before the data enters a data warehouse).
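As a rough sketch of schema-on-read, the Python below keeps records in their raw JSON form and only interprets fields when a query runs. The sample records and the `query` helper are hypothetical and stand in for whatever query engine sits on top of the data lake.

```python
import json
from collections import defaultdict

# Hypothetical raw records stored as-is in their source format (JSON lines).
# Note that the third record carries an extra field; nothing rejects it at ingest.
RAW_LINES = [
    '{"customer": "ana", "amount": "20.00", "device": "mobile"}',
    '{"customer": "bo", "amount": "5.00", "device": "desktop"}',
    '{"customer": "ana", "amount": "7.50", "device": "desktop", "coupon": "NY22"}',
]


def query(lines, group_by, value_field):
    """Apply structure at query time: parse each raw record, then aggregate.

    Records that lack either field are skipped rather than rejected at ingest.
    """
    totals = defaultdict(float)
    for line in lines:
        record = json.loads(line)
        if group_by in record and value_field in record:
            totals[record[group_by]] += float(record[value_field])
    return dict(totals)


if __name__ == "__main__":
    # The same raw data answers two different questions with no reload step.
    print(query(RAW_LINES, "customer", "amount"))
    print(query(RAW_LINES, "device", "amount"))
```

Because nothing was discarded or reshaped at ingest, asking a new question (here, grouping by `device` instead of `customer`) is just another query over the same raw records.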

One of the key benefits of schema-on-read is that it results in loose coupling of storage and compute resources needed to maintain a data lake. Bypassing the ETL process means you can ingest large volumes of data into your data lake without added time, cost, and complexity. Instead, compute resources are consumed at query-time where they’re more targeted and cost-effective. That means BI users can achieve faster time to insights using a data lake, if the right tools are in place.

Read: 3 Use Cases for Relational Access to Log Data

One thing to note is that data lake storage solutions don’t inherently include data analytic features. As a result, data lakes are often combined with other cloud-based services and downstream software tools to deliver data indexing, transformation, querying, and analytics functionality.

 


 

Choosing data analytics tools for business users

So which analytics tools are right for your use case? Ideally, a solution that doesn’t require data movement can empower business users to access and ask questions about data, on demand.

Some pundits argue that you still need to move data in order to analyze it in a centralized data lake architecture. While that may be true sometimes, self-service data analytics platforms can remove the need for data transformation and data movement, allowing users to query and analyze the data in place.

These platforms (such as ChaosSearch) sit on top of a cloud-based data repository, delivering key features that help organizations take advantage of a data lake architecture. This approach can activate low-cost cloud object storage (for example, Amazon S3 or Google Cloud Storage), enabling teams to ingest, index and analyze their data without having to move it into a separate ETL pipeline for analysis.

For some organizations, a data lake isn’t a catch-all for every workstream. Data may live in a distributed architecture (such as a data mesh), across a wide variety of sources. A cloud object store may be one of many endpoints for storing data within a data mesh architecture. However, for the purposes of query performance and faster time to insights, consolidating certain data into a cloud data platform often makes the most sense.

Let’s look at log data as one example. Log data is collected across devices, applications and networks within the organization. The sheer volume of log data can be overwhelming to certain analytics tools that weren’t designed for long-term retention of high-volume log data. Even so, the data consumers in this scenario (e.g. DevOps, security, cloud operations and more) need access to this log data to identify persistent performance or security issues, along with a variety of other scenarios that impact the user experience.

To work around existing tools’ architectural challenges, these teams will either reduce the amount of data they store, store data in different storage tiers (which impacts performance) or create a backup pipeline to ensure there’s no data loss. Managing these pipelines becomes complex and costly. Solutions like ChaosSearch activate cloud object storage for log search and analytics. This works by creating a unified data lake on top of an existing AWS or GCP cloud object storage environment. ChaosSearch natively includes Kibana for easy access to analytics. That way, organizations can store and analyze all their data, without any transformation or data movement.

Just as in the log analytics use case, solutions like ChaosSearch can be used to virtually prep data for consumption by popular BI tools, such as Tableau, Looker, Power BI and more, through a SQL API. Democratizing access to clean, normalized and trustworthy data can help everyday analysts and business users tap into the analytics tools of their choice, without having to rely on a data engineer as an intermediary.

 

Rethinking enterprise data management in 2022

Rather than hiring more resources for data transformation in 2022, focusing existing data engineering resources on a sound enterprise data management strategy can help eliminate complexity. Having the right technology in place frees up data scientists and data engineers to focus on more strategic initiatives, such as rolling out a data literacy training program or empowering employees to use self-service BI or machine learning tools. Rather than serving as an intermediary between data consumers and the enterprise data architecture, these valuable resources can reimagine the value of data and analytics throughout the organization.

To be truly data-driven, teams need the right tools and skills to analyze data at the moments they need it most. That way, they can use data as a consistent, strategic advantage, rather than a byproduct of a moment-in-time report. In the long run, business users can move from descriptive and diagnostic analytics (those that tell you what happened in the past and why), to predictive and prescriptive analytics that can help teams understand and take action on future outcomes.


 

Additional Resources

Read the Blog: Unlocking Data Literacy Part 1: How to Set Up a Data Analytics Practice That Works for Your People

Watch the Webinar: Why and How Log Analytics Makes Cloud Operations Smarter

Read the Blog: Unlocking Data Literacy Part 2: Building a Training Program

Check out the White Paper: Log Analytics for CloudOps Making Cloud Operations Stable and Agile

About the Author, Dave Armlin

Dave Armlin is the VP of Customer Success at ChaosSearch. In this role, he works closely with new customers to ensure successful deployments, as well as with established customers to help streamline integrating new workloads into the ChaosSearch platform. Dave has extensive experience in big data and customer success from prior roles at HubSpot, Deep Information Sciences, Verizon, and more. Dave loves technology and balances his addiction to coffee with quality time with his wife, daughter, and son as they attack whatever sport is in season. He holds a Bachelor of Science in Computer Science from Northeastern University.