Some time ago, the long-awaited Pentaho Data Integration (PDI) 9.0 was released. The authors have published a list of new features. In short: data flow templates, a new flow monitoring system, support for distributed environments (containers), and new options for integrating with data sources. Because PDI is an important element of our data quality management solution, we have started testing the new version. Let’s see what’s new on the list.
Data pipeline templates
We are particularly interested in data flow templates. According to the documentation, business users can use them to transform data from various sources. At present, we achieve similar functionality with Metastudio DRM, our proprietary tool for managing reference data and dictionaries. We use PDI to create suitably parameterized data flows. Their parameters are stored in relational tables. We treat them as reference data and manage them using Metastudio.
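The idea of keeping flow parameters in a relational table can be illustrated with a minimal sketch. It is not the Metastudio DRM schema: the table flow_parameters, its columns, and the helper functions are hypothetical names chosen for this example, and SQLite stands in for whatever database is actually used. The argument list follows the -file=... and -param:NAME=value form of PDI's Pan command-line tool and should be checked against your PDI version.

```python
import sqlite3


def create_parameter_table(conn):
    # Hypothetical schema: one row per (flow, parameter) pair.
    conn.execute(
        """
        CREATE TABLE flow_parameters (
            flow_name   TEXT NOT NULL,
            param_name  TEXT NOT NULL,
            param_value TEXT NOT NULL,
            PRIMARY KEY (flow_name, param_name)
        )
        """
    )


def load_parameters(conn, flow_name):
    # Read all parameters of one flow into a plain dict.
    rows = conn.execute(
        "SELECT param_name, param_value FROM flow_parameters "
        "WHERE flow_name = ?",
        (flow_name,),
    )
    return dict(rows.fetchall())


def to_pan_args(ktr_path, params):
    # Build command-line arguments for a parameterized transformation.
    # Assumption: parameters are passed to Pan as -param:NAME=value.
    args = [f"-file={ktr_path}"]
    args += [f"-param:{name}={value}" for name, value in sorted(params.items())]
    return args
```

With such a table in place, changing a flow's behavior becomes a matter of editing rows of reference data rather than editing the flow itself.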
When business users can create new data flows or modify existing ones on their own, the validation and versioning of flow parameters need special attention. We have found that a tool for managing reference data and dictionaries is well suited to managing flow parameters. The same tool that lets you enter data into dictionary tables can, once extended with syntax validators, automatically verify the data flow code. At the same time, we can manage parameter versions. This gives end users the greatest possible control over data flows.
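Validation combined with versioning can be sketched as follows. This is an illustration under stated assumptions, not how Metastudio DRM implements it: the table flow_parameter_versions, the VALIDATORS rules, and both functions are hypothetical, and the rules check only simple parameter values rather than full data flow code. A new version is stored only when the value passes its validator, and earlier versions are never overwritten.

```python
import re
import sqlite3

# Hypothetical validation rules, one per known parameter name.
VALIDATORS = {
    "SOURCE_TABLE": lambda v: re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", v) is not None,
    "BATCH_SIZE": lambda v: re.fullmatch(r"[1-9][0-9]*", v) is not None,
}


def create_version_table(conn):
    # Hypothetical schema: every change adds a row with the next version number.
    conn.execute(
        """
        CREATE TABLE flow_parameter_versions (
            flow_name   TEXT NOT NULL,
            param_name  TEXT NOT NULL,
            version     INTEGER NOT NULL,
            param_value TEXT NOT NULL,
            PRIMARY KEY (flow_name, param_name, version)
        )
        """
    )


def save_parameter(conn, flow_name, param_name, value):
    # Reject unknown parameters and invalid values before anything is stored.
    validator = VALIDATORS.get(param_name)
    if validator is None or not validator(value):
        raise ValueError(f"invalid value for {param_name}: {value!r}")
    (last_version,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM flow_parameter_versions "
        "WHERE flow_name = ? AND param_name = ?",
        (flow_name, param_name),
    ).fetchone()
    version = last_version + 1
    conn.execute(
        "INSERT INTO flow_parameter_versions VALUES (?, ?, ?, ?)",
        (flow_name, param_name, version, value),
    )
    return version


def current_value(conn, flow_name, param_name):
    # The highest version number is the value currently in force.
    row = conn.execute(
        "SELECT param_value FROM flow_parameter_versions "
        "WHERE flow_name = ? AND param_name = ? "
        "ORDER BY version DESC LIMIT 1",
        (flow_name, param_name),
    ).fetchone()
    return row[0] if row else None
```

Because old versions remain in the table, a faulty change can be traced and compared with the last value that worked.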
Data pipeline monitor
Business users can be incredibly creative. Thanks to automatic validation, most errors are detected as soon as changes are made to data flows. Sometimes, however, processes fail for objective reasons. In those cases, quick diagnosis of the problem is important. Until now, we have used the Metastudio DRM interface to view flow execution reports. This made it convenient to keep flow parameters and execution monitoring together in one tool. We will check whether the new data pipeline monitor makes it possible to diagnose problems quickly.
Without a powerful tool that automatically verifies parameters and versions, the downpour of data flows generated by business users will quickly turn into a flood. The data pipeline monitor will tell us that something is not working, but fixing it will take longer than creating a new flow. That would bring us back to the starting point.
Modern Edge-to-Cloud Architectures
The ability to use containers is gradually becoming the standard. With the right infrastructure at our disposal, we can use it more efficiently. This is in line with the general trend in this type of solution. Our Metastudio DRM product faces similar challenges.
Expanded Analytics Ecosystem Support
New players are appearing on the market, and further integrations are required. The Snowflake tornado is forcing more and more vendors to adapt their solutions to work with it. We are gathering our first proof-of-concept experience with this technology.
Once the tests are complete, we will invite you to a webinar summarizing how Metastudio DRM works with PDI 9.0.