Optimize Your AWS Data Lake with Data Enrichment and Smart Pipelines
As engaged members of the AWS community, we’re always on the lookout for new technologies and software tools that can help our customers succeed in their AWS data lake initiatives.
During the most recent AWS re:Invent conference in Las Vegas, we had the opportunity to engage directly with AWS partners, customers, and other technology companies operating in the AWS ecosystem.
That’s where we discovered StreamSets, a modern DataOps Platform helping customers build resilient data pipelines to enrich enterprise logs and other data before ingesting it into their data lake(s). After learning more about smart data pipelines and data drift detection, we saw StreamSets as a complementary solution for ChaosSearch - a tool that could help our customers optimize how they get data into AWS before indexing and analyzing it with ChaosSearch.
In this week’s blog, we’re taking a closer look at StreamSets data integration technology and how our customers can benefit from using StreamSets with ChaosSearch to power data enrichment and analytics on AWS.
Read: 10 AWS Data Lake Best Practices
Enterprises Struggle to Make Sense of Big Data
Enterprise data is exploding in size and complexity, with businesses capturing logs and other data from hundreds of sources (e.g. operating systems, microservices, applications, mobile devices, cloud services, etc.), in structured, unstructured, and semi-structured formats (JSON, LOG, CSV, etc.).
With enterprise data growing rapidly, there’s a huge opportunity for companies to transform the data they’re producing into actionable and timely insights that can inform business decision-making.
But here’s the problem:
Today’s analytics solutions - including data warehouses and data lakes - simply weren’t built to handle the scale of data that organizations are generating in 2022. As data scales up, data engineers get bogged down by time-consuming ETL and pipeline maintenance.
Data lakes often turn into data swamps where data is stockpiled with no way to make use of it. Data that can’t be analyzed due to high costs is often discarded, its value wholly lost.
ChaosSearch Delivers Data Lake Analytics at Scale
ChaosSearch was designed to solve data lake analytics at scale, leveraging public cloud architecture to fulfill the true data lake promise of cost-optimized data storage, easy data ingest, schema-on-read, and on-demand, multi-model data access.
The ChaosSearch cloud data platform sits on top of Amazon S3, transforming your public cloud storage into a hot data lake for analytics in the cloud. As your data moves into Amazon S3, our proprietary Chaos Index® technology indexes it for multi-model access (i.e., full-text search, SQL queries, and machine learning) with up to 95% compression.
Once data is indexed, you can create virtual views by cleaning, preparing and transforming your data with the Chaos Refinery®. Virtual views can be turned into visualizations and dashboards using Kibana, which is natively included with ChaosSearch, and it all happens without data movement or ETL processing.
To start using ChaosSearch, our customers need to get their data into cloud object storage such as Amazon S3. We’ve emphasized Elasticsearch replacement and log analytics as early use cases for our platform, so many of our customers use tools like Logstash and Fluentd to move log data into Amazon S3.
But, as ChaosSearch expands beyond log analytics into operational analytics and other use cases leveraging the SQL API, our customers will benefit from a more sophisticated data integration/pipeline tool that connects with diverse data sources and provides additional features to support analytics use cases - this is where StreamSets can help.
Watch the Webinar: Make Your Data Lake Deliver - AWSInsider
StreamSets Delivers Data Integration Innovation
StreamSets is bringing a DataOps approach to data integration and management, enabling its customers to build, run, monitor, and manage smart data pipelines at scale from a single log-in.
AWS customers can leverage the StreamSets platform in a variety of ways to enrich data and ingest it into Amazon S3 before indexing and analyzing the data with ChaosSearch.
Here’s some of what StreamSets is bringing to the table:
Data Collector Engine
StreamSets Data Collector Engine streamlines the process of building data pipelines for streaming, batch, and change data capture (CDC). Its visual interface lets anyone build and manage smart data pipelines while eliminating 90% of break-fix and maintenance time.
StreamSets also provides more than 100 native integrations and pre-built connectors that make it easy to set up a data pipeline in minutes without any special skills.
Data Enrichment
A data pipeline in StreamSets has an origin, a destination, and one or more processors between them that can execute transformations or enrich the data with additional information.
For example, a log file containing user IP addresses could be enriched with geolocation data, giving downstream analysts the ability to analyze user behavior trends by country or region. Or, a log file where devices are identified with a Machine ID could be enriched to display more user-friendly names for devices.
Data transformations are necessary for changing the data format or structure, while data enrichment provides added context, makes data more accessible and creates new possibilities for downstream analysts.
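To make the origin, processor, and destination idea concrete, here is a minimal sketch in plain Python. It is not StreamSets code and does not use any StreamSets API: the lookup tables, function names, and sample records are all hypothetical, and a real pipeline would draw on a GeoIP database and a device inventory instead of hard-coded dictionaries.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]

# Hypothetical lookup tables standing in for a GeoIP database and a device inventory.
GEO_BY_IP = {"203.0.113.7": "AU", "198.51.100.4": "US"}
DEVICE_NAMES = {"m-001": "Warehouse Scanner 1"}


def add_country(record: Record) -> Record:
    """Processor: enrich a record with a country code looked up from its IP address."""
    enriched = dict(record)
    enriched["country"] = GEO_BY_IP.get(record.get("ip", ""), "unknown")
    return enriched


def add_device_name(record: Record) -> Record:
    """Processor: add a friendly device name, falling back to the raw Machine ID."""
    enriched = dict(record)
    machine_id = record.get("machine_id", "")
    enriched["device_name"] = DEVICE_NAMES.get(machine_id, machine_id)
    return enriched


def run_pipeline(
    origin: Iterable[Record],
    processors: List[Callable[[Record], Record]],
) -> Iterator[Record]:
    """Pass each record from the origin through every processor, in order."""
    for record in origin:
        for processor in processors:
            record = processor(record)
        yield record


# Origin: two sample log records. Destination: an in-memory list.
logs = [
    {"ip": "203.0.113.7", "machine_id": "m-001"},
    {"ip": "192.0.2.9", "machine_id": "m-999"},
]
enriched_logs = list(run_pipeline(logs, [add_country, add_device_name]))
```

Each processor returns a new record rather than changing its input, so the original log data stays intact and processors can be reordered or removed without side effects.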
Data Drift Detection
StreamSets describes data drift as “unexpected and undocumented changes to data structure, semantics, and infrastructure that is a result of modern data architectures.”
Data drift often leads to broken processes or corrupted data, especially when changes to data output disrupt existing data pipelines.
StreamSets solves the data drift problem with patented smart data pipelines that can detect and handle drift in schema, semantics, and infrastructure. While traditional data pipelines often break in response to data drift, smart pipelines can automatically detect data drift, adjust to small changes, and alert on data drift events that threaten SLAs or pipeline health.
StreamSets’ smart data pipelines are fast to build and deploy, fault tolerant, adaptive, and self-healing - helping to relieve the back-pressure on developers to avoid breaking data pipelines and encouraging faster product updates and innovation.
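As a rough illustration of what detecting schema drift involves, the sketch below compares an incoming record against an expected schema and reports missing fields, changed types, and new fields. It is a simplified stand-in written in plain Python, not StreamSets’ patented mechanism; the schema, field names, and sample record are assumptions made for the example.

```python
from typing import Any, Dict, List

# Hypothetical schema the pipeline was originally built against.
EXPECTED_SCHEMA: Dict[str, type] = {"ip": str, "status": int, "path": str}


def detect_drift(record: Dict[str, Any], expected: Dict[str, type]) -> List[str]:
    """Return a description of each way the record departs from the expected schema."""
    findings: List[str] = []
    for field, field_type in expected.items():
        if field not in record:
            findings.append(f"missing field: {field}")
        elif not isinstance(record[field], field_type):
            actual = type(record[field]).__name__
            findings.append(
                f"type changed: {field} is {actual}, expected {field_type.__name__}"
            )
    for field in record:
        if field not in expected:
            findings.append(f"new field: {field}")
    return findings


# An upstream change now sends status as a string, drops path, and adds user_agent.
drifted_record = {"ip": "203.0.113.7", "status": "200", "user_agent": "curl/8.0"}
findings = detect_drift(drifted_record, EXPECTED_SCHEMA)
```

A pipeline could use findings like these to decide whether to adapt (for example, carry the new field through) or raise an alert rather than failing outright.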
Support for Multiple Data Formats
StreamSets supports all industry-standard data formats, including JSON, text files, Parquet, Binary, Avro, and more. Users simply choose their desired output format and StreamSets takes care of the rest.
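For a sense of what a format conversion step does under the hood, here is a small sketch that turns newline-delimited JSON records into CSV using only the Python standard library. The function name and sample data are hypothetical, and this is not how StreamSets implements its format support.

```python
import csv
import io
import json
from typing import List


def json_lines_to_csv(json_lines: str) -> str:
    """Convert newline-delimited JSON objects into CSV text with a header row."""
    records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

    # Collect column names in first-seen order so records with different keys still fit.
    fieldnames: List[str] = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return output.getvalue()


sample = (
    '{"ip": "203.0.113.7", "status": 200}\n'
    '{"ip": "198.51.100.4", "path": "/login"}\n'
)
csv_text = json_lines_to_csv(sample)
```

Fields that a record does not contain are written as empty cells, which keeps every row aligned with the header.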
Read: The New Best Way to Index and Query JSON Logs
ChaosSearch + StreamSets: A New Architecture for Cloud Data Analytics
With its smart data pipelines, data enrichment capabilities, and support for multiple data formats, StreamSets is simplifying the process of building data ingestion pipelines that can centralize data in your data lake.
With StreamSets, you can build resilient pipelines to capture data from various sources, transform it to your preferred file format, enrich it with additional context, and send it to your Amazon S3 data lake. Once your data arrives in Amazon S3, ChaosSearch provides data indexing, transformation, and analytics capabilities that help you extract insights from your data with no data movement and no ETL process.
Combining StreamSets with ChaosSearch gives AWS customers the ability to ingest and enrich data from virtually any source, centralize data in cost-effective Amazon S3 cloud object storage, and activate data for multi-model analytics at scale.
Additional Resources
Read the Blog: AWS vs GCP: Top Cloud Services Logs to Watch and Why
Watch the Webinar: Choosing an Analytical Cloud Data Platform: Trends, Strategies & Tech Considerations
Read the Blog: FinTech Companies Thrive and Innovate with ChaosSearch
Check out the Whitepaper: 2022 Gartner® Market Guide for Analytics Query Accelerators