Over the last couple of years, data observability companies have pushed hard for adoption within the data stack. Educating the data community is crucial and challenging at the same time. There is no doubt about the value of data observability in improving efficiency and reliability across the data value chain. However, we also need to start building best practices across the ecosystem to maximise the value of data observability.
In the application observability ecosystem (e.g. Datadog), once DevOps and SRE teams are notified of an issue, the subsequent remedial actions are clear and well understood. They may involve well-known processes such as refactoring code or spawning new instances. Things are quite different in a data observability ecosystem: the remedial actions can be varied, complex and relatively challenging depending on the incident, the data stack and the data infrastructure.
So I am sharing a few remedial actions as good practices to consider after a notification. I'm not here to tell you which platform is best or which notifies better; the focus is on the next step: as a data engineer, what should you do with that notification/alert?
Let us revisit the data observability pillars, which together quantify data health.
Snippet from my previous post.
- Volume: Ensures all the rows/events are captured or ingested.
- Freshness: The recency of the data. If your data gets updated every xx minutes, this test ensures it is updated and raises an incident if not.
- Schema Change: If there is a schema change or a new feature is launched, your data team needs to be aware so it can update the scripts.
- Distribution: All the events are within an acceptable range. For example, if a critical field shouldn't contain null values, this test raises an alert for any null or missing values.
- Lineage: A must-have module, yet one we consistently underplay. Lineage gives the data team a handy view of upstream and downstream dependencies.
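To make the pillars concrete, here is a minimal sketch of what such checks look like under the hood. The table, column names and thresholds are all hypothetical; real observability tools run equivalent checks against your warehouse.

```python
from datetime import datetime, timedelta

# Hypothetical sample of ingested rows; in practice these would be
# queried from your warehouse (names and thresholds are illustrative).
rows = [
    {"id": 1, "amount": 10.0, "updated_at": datetime.now()},
    {"id": 2, "amount": None, "updated_at": datetime.now() - timedelta(minutes=5)},
]

def check_volume(rows, min_rows, max_rows):
    # Volume: row count must fall within the expected threshold band.
    return min_rows <= len(rows) <= max_rows

def check_freshness(rows, max_age):
    # Freshness: the newest record must be no older than the expected cadence.
    newest = max(r["updated_at"] for r in rows)
    return datetime.now() - newest <= max_age

def check_schema(rows, expected_columns):
    # Schema: every row must carry exactly the expected columns.
    return all(set(r) == expected_columns for r in rows)

def check_distribution(rows, field):
    # Distribution: return IDs of rows violating a not-null expectation.
    return [r["id"] for r in rows if r[field] is None]

print(check_volume(rows, 1, 100))                          # True
print(check_freshness(rows, timedelta(minutes=30)))        # True
print(check_schema(rows, {"id", "amount", "updated_at"}))  # True
print(check_distribution(rows, "amount"))                  # [2]
```

Lineage is the odd one out: it isn't a pass/fail test but a dependency graph that the other checks lean on during debugging.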
Here are the steps to consider when you are notified against each pillar.
- Volume: Incidents are raised when the number of events is greater or lesser than the expected threshold. This is usually notified at the table level.
- Freshness: This incident generally occurs at the table level rather than the field level and indicates that a certain table is not being updated as intended. It can be especially alarming when a team is about to publish a report or send out emails and discovers that the data is stale. A few debugging tips follow.
Volume and freshness debugging steps often overlap, which is why I decided to merge them.
- Understand whether any changes were made to the source table.
- A spike in volume could be due to a marketing campaign creating a bigger reach. Treat it as anomalous if it's a one-off, but readjust the thresholds if the trend persists long-term.
- Check with the engineering team whether the site or app had downtime. (Tools such as Datadog should have notified you as well!)
- Check for orchestration failures in Airflow or pipeline issues in dbt (or the relevant tools for your team).
- Consider data pipeline lag, which may explain the variance. If the lag is expected due to high demand or system issues, mark the incident as an anomaly.
- We also recommend reviewing the table lineage (upstream/downstream) for possible errors. For a given table, start from where the issue is and move upstream. The highest upstream error should be investigated and fixed first.
- Check for any known table migration performed during the incident window.
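The "move upstream and fix the highest error first" step can be sketched programmatically. This is a toy example: the table names and lineage map are hypothetical stand-ins for what your observability tool's lineage module would expose.

```python
# `upstreams` maps each table to its direct upstream tables (hypothetical names).
upstreams = {
    "report_daily": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_payments"],
    "stg_orders": ["raw_orders"],
    "stg_payments": ["raw_payments"],
    "raw_orders": [],
    "raw_payments": [],
}

# Tables currently flagged by the observability tool.
failing = {"report_daily", "fct_orders", "stg_orders"}

def root_failures(table, upstreams, failing):
    """Return the most-upstream failing tables reachable from `table`."""
    roots = set()

    def walk(t):
        failing_parents = [u for u in upstreams.get(t, []) if u in failing]
        if not failing_parents:
            roots.add(t)  # no failing upstream: fix this table first
        for u in failing_parents:
            walk(u)

    if table in failing:
        walk(table)
    return roots

print(root_failures("report_daily", upstreams, failing))  # {'stg_orders'}
```

Fixing `stg_orders` first often clears the downstream incidents for free, instead of patching `report_daily` symptoms.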
- Schema Change: This notifies you of any changes at the table level that may impact your end report or downstream dependencies.
- Understand the change deployed: whether it is an addition, modification or deletion of a table.
- Learn from the lineage which downstream resources are impacted.
- Update your scripts, whether SQL, Python, Airflow or dbt models, to accommodate the schema change.
- After a staging table has changed, run a data recon or understand the diff before deploying to production.
- There may be a possibility of backfilling the table to reflect data in the new field, although there are certain things to keep in mind, such as query time, strategy and table-lock considerations. I'll cover more of this in a future post.
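The recon/diff step above can be sketched as a simple comparison of keyed rows. The tables and field names here are illustrative; in practice both sides would be query results from staging and production.

```python
# Hypothetical keyed rows: primary key -> record.
prod = {1: {"amount": 10.0}, 2: {"amount": 20.0}}
staging = {
    1: {"amount": 10.0, "currency": "USD"},
    2: {"amount": 25.0, "currency": "USD"},
    3: {"amount": 5.0, "currency": "EUR"},
}

def recon(prod, staging, compare_fields):
    """Diff staging against production on the given fields."""
    added = staging.keys() - prod.keys()      # new keys introduced in staging
    missing = prod.keys() - staging.keys()    # keys that would be lost
    changed = {
        k for k in prod.keys() & staging.keys()
        if any(prod[k].get(f) != staging[k].get(f) for f in compare_fields)
    }
    return {"added": added, "missing": missing, "changed": changed}

print(recon(prod, staging, ["amount"]))
# {'added': {3}, 'missing': set(), 'changed': {2}}
```

A non-empty `missing` set is usually the red flag to block the deploy; `added` and `changed` just need to match the intent of the schema change.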
Field or column level (Distribution)
I personally believe this is where a data observability tool makes the difference and drives ROI. Whether it's validating a field or checking data quality for ML models, field-level monitoring is the most critical module for companies to leverage. The debugging steps vary with your architecture and with what falls under the data team's purview. Here are a few possible steps.
- Let's assume the incident was raised for a field that should not be null (a not-null constraint): pick the primary or transaction ID and compare the record with the production/source data. This tells you whether the problem exists in production or is a pipeline error.
- If it's a pipeline error, fix the script; you may need to perform upserts to correct the events.
- If it's a production/source DB issue, liaise with the tech/engineering team to see if they can fix and backfill it. Once they do, you will still need to perform upserts to correct the events or possibly backfill them.
- One of the most difficult modules/projects is data backfilling. "Upserting events" sounds straightforward, but it isn't. There are a few critical aspects to keep in mind: ensuring data/events aren't duplicated, records don't leak, and active tables get specific handling.
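To illustrate the duplication concern, here is a minimal sketch of an idempotent upsert keyed on the transaction ID, so re-running a correction never duplicates events. The field names are hypothetical; in a warehouse this would typically be a `MERGE` statement.

```python
def upsert(table, corrections, key="txn_id"):
    """Merge corrected events into `table`, matching on `key`."""
    by_key = {row[key]: row for row in table}
    for row in corrections:
        # Update the existing record or insert a new one; re-applying
        # the same correction is a no-op, which makes retries safe.
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return list(by_key.values())

events = [
    {"txn_id": "a1", "amount": None},  # the null that triggered the incident
    {"txn_id": "a2", "amount": 7.0},
]
fixed = upsert(events, [
    {"txn_id": "a1", "amount": 3.5},   # correct the null
    {"txn_id": "a1", "amount": 3.5},   # duplicate correction is harmless
])
print(fixed)
# [{'txn_id': 'a1', 'amount': 3.5}, {'txn_id': 'a2', 'amount': 7.0}]
```

Matching on a stable business key is what prevents duplicates; an append-only "fix" job without such a key is how backfills leak extra rows.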
My next post will go into detail on data backfilling and which technique a company may use based on its data stack.
I hope this post gives a high-level overview of what to do after receiving a notification from a data observability tool. I’m excited to learn more from the comments and feedback.