Connector Update: AWS S3 & Azure Data Lake Storage

Introduction

We are pleased to announce a significant upgrade to Decube’s S3 and ADLS connectors that gives data teams finer control over data lake cataloging. With this release, you can define precisely which files are ingested, configure how they are interpreted, and see more detail about your datasets, all within Decube.

Why This Matters: Challenges in Data Lake Cataloging

Cataloging data lakes at scale is challenging. Without the right controls, teams often end up with noisy catalogs filled with irrelevant or misconfigured datasets. This not only makes data discovery harder, but also increases the risk of ingesting incorrect metadata, leading to confusion and governance headaches.

What’s New in the S3 & ADLS Connector Upgrade

This release introduces several powerful enhancements:

- Path Specification for File Ingestion:

Define exactly which files and folders should be included in your catalog by specifying path patterns. No more unwanted datasets cluttering your catalog.

- File Format Controls:

For CSV, JSON, and JSONL files, you can now specify encoding, delimiter, header presence, and more, directly in the path specification. This helps ensure that files are parsed and cataloged as intended.

- Regex-Based Dataset Inclusion:

Use regular expressions to include only datasets that match your criteria. Whether you want to catalog only certain tables or exclude test data, you’re in control.

- Enhanced Metadata Collection:

Decube now collects and displays additional metadata for each ingested dataset, including the number of files and total dataset size. This gives you a clearer picture of your data assets at a glance.
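
To make the two new metadata fields concrete, here is a minimal Python sketch of how file count and total size could be aggregated per dataset from an object listing. It is an illustration only, not Decube's implementation: the `DatasetStats` class, the `summarize` function, the sample keys, and the rule that a dataset is the folder directly containing a file are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DatasetStats:
    """Hypothetical container for the two metadata fields."""
    file_count: int = 0
    total_bytes: int = 0


def summarize(objects):
    """Group (key, size_in_bytes) pairs by parent folder and total them."""
    stats = defaultdict(DatasetStats)
    for key, size in objects:
        # Assumption: the dataset is the folder that directly contains the file.
        dataset = key.rsplit("/", 1)[0] if "/" in key else ""
        stats[dataset].file_count += 1
        stats[dataset].total_bytes += size
    return dict(stats)


# Hypothetical listing, shaped like the output of an object-store list call.
listing = [
    ("prod/orders/part-0001.csv", 1_200),
    ("prod/orders/part-0002.csv", 800),
    ("prod/customers/part-0001.csv", 500),
]
summary = summarize(listing)
```

In this sketch, `summary["prod/orders"]` holds a file count of 2 and a total size of 2,000 bytes.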

How It Works: Controlling Ingestion with Path Specifications

Getting started is simple. When configuring your S3 or ADLS connector, you can now:

1. Define Path Patterns:

Specify which folders or files to include using path specifications. For example, only include files in a certain directory or with a specific naming convention.

2. Set File Format Options:

For each path, define how files should be interpreted: choose the encoding, delimiter, and whether headers are present for CSVs, or set options for JSON/JSONL.

3. Apply Regex Filters:

Add a regex pattern to further filter which datasets are included. Only datasets matching your pattern will be cataloged.
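
The interplay between a path pattern and a regex filter can be sketched in a few lines of Python. This is a rough illustration of the idea, not Decube's actual matching logic or configuration syntax (see the connector documentation for that). The `select_files` function, the glob-style pattern, and the sample keys are hypothetical, and the sketch uses Python's `fnmatch` module, in which `*` also matches `/`.

```python
import re
from fnmatch import fnmatchcase


def select_files(keys, path_pattern, dataset_regex):
    """Keep keys that match the path pattern and whose dataset matches the regex."""
    dataset_re = re.compile(dataset_regex)
    selected = []
    for key in keys:
        # Step 1: the path pattern decides which files are considered at all.
        if not fnmatchcase(key, path_pattern):
            continue
        # Assumption: the dataset name is the folder that contains the file.
        dataset = key.rsplit("/", 1)[0]
        # Step 2: the regex narrows the selection to matching datasets.
        if dataset_re.fullmatch(dataset):
            selected.append(key)
    return selected


keys = [
    "prod/orders/2024-01.csv",
    "prod/orders_test/2024-01.csv",
    "prod/customers/2024-01.json",
    "staging/orders/2024-01.csv",
]
# Include CSV files under prod/, but skip datasets whose name contains "test".
selected = select_files(keys, "prod/*.csv", r"prod/(?!.*test)[a-z_]+")
```

Here only `prod/orders/2024-01.csv` survives: the JSON file and the staging file fail the path pattern, and `prod/orders_test` is rejected by the regex.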

Example Use Case:

A data team wants to catalog only production tables stored as CSVs in a specific S3 bucket, using UTF-8 encoding and a comma delimiter. With the new connector, they can define this in the path specification, ensuring only the right data is ingested and cataloged.
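
As a sketch of what those format options mean in practice, the Python snippet below reads column names from a CSV payload using an explicit encoding, delimiter, and header flag. It assumes nothing about how Decube parses files internally: `CsvOptions`, `read_columns`, and the sample data are hypothetical names chosen for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class CsvOptions:
    """Hypothetical format options mirroring the use case above."""
    encoding: str = "utf-8"
    delimiter: str = ","
    has_header: bool = True


def read_columns(raw: bytes, options: CsvOptions):
    """Return column names for a CSV payload under the given options."""
    text = raw.decode(options.encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=options.delimiter)
    first_row = next(reader, [])
    if options.has_header:
        return first_row
    # Without a header row, fall back to positional column names.
    return [f"column_{i}" for i in range(1, len(first_row) + 1)]


sample = "order_id,customer,amount\n1,Zoë,19.90\n".encode("utf-8")
columns = read_columns(sample, CsvOptions())
headerless = read_columns(sample, CsvOptions(has_header=False))
```

With the header flag set, the columns come from the first row. Without it, the same file would be described by positional names, which shows why getting these options right matters for catalog accuracy.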

Benefits for Data Teams

- Improved Catalog Quality:

Only relevant, well-structured datasets are included, making data discovery faster and more reliable.

- Reduced Noise:

Exclude test data, temporary files, or irrelevant datasets with ease.

- Better Visibility:

New metadata fields (file count, total size) help teams understand the scale and composition of their data assets.

- Stronger Governance:

Fine-grained control supports compliance and data management best practices.

Getting Started

Ready to take control of your data lake cataloging?

Check out our updated documentation for step-by-step guides:

- Amazon S3 Datalake

- Azure Data Lake Storage

If you’re already using Decube, you can update your existing connectors to take advantage of these new features. For new users, setup is as simple as following the documentation links above.

Conclusion

With this upgrade, Decube empowers data teams to build cleaner, more accurate catalogs for their S3 and ADLS data lakes. We can’t wait to see how you use these new controls to streamline your data operations and drive better insights.
