Gartner Report for Augmented Data Quality Solutions

 

Vendors

Ab Initio
Ab Initio Data Quality Environment (DQE) is offered as a stand-alone product or as part of the Ab Initio Data Platform that includes wider data management capabilities. It supports deployment on-premises or in the cloud via Kubernetes or virtual machines (VMs). While it does not have a SaaS offering, its platform as a service (PaaS) offering is a standard single-tenant deployment using Kubernetes/containers that customers can spin up via a cloud marketplace. Ab Initio DQE currently supports an estimated 1,900 active customers from key industries such as financial services and insurance, telecommunications and healthcare.
Ab Initio DQE scored highest for active metadata support, which is pervasive throughout the platform. It uses metadata to understand the context based on previous entries and suggest next steps to the user. Its business rules capability allows users to write rules in natural language expressions or as technical functions. The DQ rules can be automatically implemented into a pipeline and executed based on metadata definitions and discovered attributes. Ab Initio DQE also uses descriptive metadata (via NLP, machine learning [ML] and LLMs) to help determine the semantic content of a field or column and link it with a business concept or term. This information is added to the knowledge graph and used to automatically generate rules, inherit rules or metadata based on the business term or take actions.
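To illustrate the general pattern (not Ab Initio’s implementation), the sketch below classifies a column against a hypothetical business glossary and turns the matched term into an executable validity rule; the column names, sample values and glossary are invented, and simple heuristics stand in for the NLP/ML/LLM-based semantic matching described above.

```python
import re

# Hypothetical business glossary: business term -> value pattern for a validity rule
GLOSSARY = {
    "Email Address": r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
    "Phone Number":  r"^\+?[0-9 ()\-]{7,20}$",
}

def classify_column(name, sample_values):
    """Crude stand-in for semantic classification of a column."""
    if "mail" in name.lower() or all("@" in v for v in sample_values):
        return "Email Address"
    if "phone" in name.lower() or "tel" in name.lower():
        return "Phone Number"
    return None

def generate_rule(term):
    """Turn the matched business term into an executable DQ validity check."""
    pattern = re.compile(GLOSSARY[term])
    return lambda value: bool(pattern.match(value))

column = {"name": "cust_email", "samples": ["a@x.com", "b@y.org"]}
term = classify_column(column["name"], column["samples"])
rule = generate_rule(term)
print(term, rule("not-an-email"))   # Email Address False
```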
Ab Initio DQE’s automation capability allows DQ rules to be automatically implemented into a pipeline and executed based on metadata definitions and discovered attributes. The language model studio provides facilities to create and manage prompts, generate functions, and test and tune conversations. It also enables the user to interact with the LLM of their choice, including local or private ones via a chat interface. Users can display lineage with the DQ scores visible to identify if there are potential DQ issues along the data pipeline or data flow. The platform’s scores are consistent across most critical capabilities, and Ab Initio is positioned well above average in DQ use cases.
Ab Initio DQE received its lowest scores for unstructured data support due to its limited native functionalities. The platform can read and parse data from document and text formats. However, it relies on separate packages to convert pictures, videos and sound into text descriptions before profiling and sentiment analysis can be applied by calling an appropriate third-party ML model.
Anomalo
Anomalo is an enterprise data quality platform that supports multiple deployment options such as SaaS and platform as a service. It also supports customers’ virtual private cloud (VPC) or bring-your-own-cloud (BYOC) options for deployments in the customers’ cloud account. Anomalo currently supports nearly 60 customers from key industries such as financial services, insurance and consumer goods.
Anomalo received its highest scores for profiling and monitoring/detection; rule discovery, creation and management; and alerts, notifications and visualization. Anomalo uses built-in unsupervised ML checks on any monitored dataset, which allows for automated detection of unexpected values, nulls, schema changes, data volume shifts, type changes, and value, frequency or format anomalies. For those requiring rules or other custom checks, Anomalo includes custom data quality checks that apply desired business rules, profile time series data and perform anomaly analysis. It also allows users to describe the desired logic for complex custom checks in natural language, which it converts into SQL prompts using LLMs enabled by the AI Check Assist. Alternatively, users can describe a desired outcome, and the LLM identifies and configures the most relevant data quality rule.
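The unsupervised detection of volume shifts described above can be pictured with a much simpler stand-in: compare the latest value of a monitored metric against its recent history. The sketch below uses a plain z-score on illustrative daily row counts rather than Anomalo’s learned checks.

```python
from statistics import mean, stdev

# Daily row counts for a monitored table; the last value is suspiciously low.
row_counts = [10120, 9980, 10210, 10050, 9875, 10300, 4200]

history, latest = row_counts[:-1], row_counts[-1]
mu, sigma = mean(history), stdev(history)
z = (latest - mu) / sigma

if abs(z) > 3:   # simple fixed threshold; real checks learn this per metric
    print(f"Volume anomaly: {latest} rows vs. typical ~{mu:.0f} (z={z:.1f})")
```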
Anomalo’s augmentation capabilities help users detect DQ issues, provide support for rule creation and help with the resolution of identified DQ issues. The AI SQL Assist feature can streamline SQL writing and troubleshooting during custom data quality rules creation. It also extends its LLM support to assess and apply suitable algorithms to suggest potential matches out of the box, and uses ML to learn and automate the monitoring rules required by data stewards.
Anomalo received its lowest scores in data transformations, since it has no native, non-API interface for data cleansing, parsing and standardization, which adds complexity for nontechnical users. It also scored low in usability, workflow and issue resolution: the tool is available only in English, has limited built-in workflow functionality, and requires integration with project management tools such as Jira or ServiceNow. Lineage support is limited and currently available only for tables monitored in Databricks, Snowflake and BigQuery. Owing to this, Anomalo’s weakest use cases are data engineering, and data and analytics governance.
Ataccama
Ataccama’s ONE Data Quality Suite, part of the Ataccama ONE platform, provides integrated data management capabilities blending data catalog, data governance and master data management (MDM) capabilities with its core DQ offerings. It supports on-premises or hosted/private cloud deployment as well as public cloud. It also has a PaaS offering fully supported by Docker containers and Kubernetes clusters. The vendor offers a free Snowflake Native App offering specifically designed for data engineers, which can deliver data quality validation to all Snowflake customers. Ataccama has around 570 active customers for the data quality product line, from key industries such as financial services, manufacturing and software.
Ataccama ONE Data Quality Suite received its highest scores for profiling and monitoring/detection; alerts, notifications and visualization; and active metadata support. It provides AI-based anomaly detection to evaluate time series predictions of various data characteristics as well as user feedback on previously detected anomalies. Users can leverage profiling results and other functionalities to create solutions based on patterns and trends found in historical data.
Ataccama ONE Data Quality Suite provides AI-assisted cleansing, standardization and issue resolution features. Its GenAI assistant capabilities enable users to interact with the platform for content translation, DQ rule generation, rule suggestions and chat-based search of internal documentation. The GenAI-powered capabilities are backed by metadata to accomplish tasks on behalf of the user, including generation of DQ rules, explanation of DQ rules, suggestion of DQ rules based on data profiling, catalog item description generation, test data generation, content translation and more. The Ataccama ONE AI Agent is designed to execute complex, end-to-end data quality tasks autonomously for DQ use cases. Backed by these capabilities, Ataccama is rated top in this research for data engineering, data and analytics governance, and operational/transactional data quality use cases.
Ataccama ONE Data Quality Suite received its lowest score for unstructured data support, since its native capability covers only a limited set of data sources (currently only text-based data). For other data sources, it requires integration with external services to recognize and parse completely unstructured data and to enable the results to be used across standard DQ flows, such as further classification and DQ evaluation.
CluedIn
CluedIn’s core product, “CluedIn,” is distributed under MDM or DQ product deployments, both of which can be deployed as stand-alone. It also offers a “free 10,000 records” trial option with no functionality differences as compared to the complete platform. CluedIn offers a variety of access options to users, including self-hosted, PaaS, managed service and SaaS deployment options. It has introduced a community version (an on-prem offering) that allows customers to install CluedIn on any hyperscaler. The platform has nearly 90 global customers for the data quality offering from key industries such as financial services, consumer goods and healthcare.
CluedIn received its highest scores for matching, linking and merging; rule discovery, creation and management; and usability, workflow and issue resolution. It leverages metadata associated with data attributes to generate rules that consider data lineage, source credibility and business context, providing users with comprehensive quality checks. Matching leverages ML algorithms to suggest potential matches by analyzing patterns and relationships across data, flagging high-probability matches for user review. Consequently, CluedIn is placed above average in the master data management use case.
CluedIn has introduced augmentation capabilities to support a low-code/no-code environment, including AI agents that automate issue detection using LLMs and CluedIn Copilot capabilities that proactively inform users of potential actions to take. Its rule builder allows users to create rules that define filters and actions tailored to their data scenarios, including automating data transformations, capturing data quality issues and determining operational values. Users can input data quality requirements in natural language, and the system uses NLP to automatically translate those into technical rules. Automated task assignment directs specific remediation tasks to designated users or teams based on predefined rules.
CluedIn received its lowest scores for unstructured data support because this support is only available for email, document and web content data sources. CluedIn can extract text and relevant attributes from these sources and employs NLP to analyze the results. CluedIn also provides sentiment analysis to gauge perception and entity recognition to identify and classify entities.
DQLabs
DQLabs Platform for Modern Data Quality and Observability is a unified platform for DQ and data observability. Its delivery can be via self-hosted, PaaS, managed services or SaaS, enabled by fully containerized solutions (e.g., Kubernetes, Docker), and can be deployed across hybrid, cloud and on-premises environments. DQLabs has 125 active direct customers and an additional 500 customers who use the platform through its OEM partnerships with Hitachi Pentaho DQ and Quest Erwin DQ, from key industries such as banking and securities, technology, and healthcare providers.
DQLabs received its highest scores for profiling and monitoring/detection; rule discovery, creation and management; and usability, workflow and issue resolution. Its active semantic layer employs AI and GenAI to automatically profile and contextualize data. It then leverages active metadata to enable real-time data quality monitoring, root cause analysis, lineage tracking and automated rule generation. Its introduction of an AI assistant, featuring GenAI conversational capabilities with an NLP-based interface, allows technical and nontechnical users to ask questions about data quality, request profiles, explore metrics and execute quality rules without complex queries.
DQLabs’ augmentation capabilities support a low-code/no-code framework for custom rules, workflows and integration of DQ checks into data pipelines. It automates DQ checks by employing AI/ML for anomaly detection and leveraging custom LLMs to predict missing values in a dataset based on factors such as the data domain, industry, geography and syntax. Its ML algorithms track user interactions, learning which data quality issues are commonly addressed and how they are prioritized. Over time, these algorithms can automate data quality issue alerts for remediation based on user behavior and business criticality. DQLabs’ ability to analyze unstructured data sources and generate context-specific metadata relies on the Open LLaMA GenAI model, which is available under an open-source license, allowing users to freely use, modify and consume the model for their DQ tasks.
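As a rough, generic analogue of the missing-value prediction described above (DQLabs uses custom LLMs; this sketch uses scikit-learn and invented columns), a model trained on complete rows can suggest values for incomplete ones:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy customer records; 'segment' is missing for one row.
df = pd.DataFrame({
    "age":     [25, 40, 35, 52, 23, 48],
    "spend":   [120, 900, 450, 1500, 80, 1100],
    "segment": ["basic", "premium", "basic", "premium", "basic", None],
})

known   = df[df["segment"].notna()]
missing = df[df["segment"].isna()]

model = RandomForestClassifier(random_state=0)
model.fit(known[["age", "spend"]], known["segment"])

# Suggest (rather than silently overwrite) values for the missing entries.
df.loc[missing.index, "segment_suggested"] = model.predict(missing[["age", "spend"]])
print(df)
```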
DQLabs received its lowest scores for matching, linking and merging; and unstructured data support. While the platform does have out-of-the-box features to help data stewards find potential matches, it lacks advanced capabilities such as a supportive UI that can display potential matches based on linked relationships, enabling data stewards to take action more effectively, and the ability to predict optimal matching criteria. Consequently, the platform is positioned lowest for the master data management use case compared to other use cases.
Experian
Experian’s Data Quality (DQ) products include Experian Aperture Data Studio and several data validation and enrichment services, such as Experian Address Validation, Experian Email Validation and Experian Phone Validation. It supports on-premises, hosted/private cloud and public cloud deployment for Experian Aperture Data Studio, and its products and services are all offered as SaaS. Experian has an estimated 6,000 customers for these product lines, with the majority of them using data validation and enrichment services.
Experian received its highest scores for profiling and monitoring/detection; usability, workflow and issue resolution; and matching, linking and merging. It automatically detects when existing issues have been resolved and closes them without manual intervention. It has also introduced “Realtime Workflow,” which allows API calls from a website, mobile app, CRM or third-party application that captures or stores data, while offering similar functionality to a classic workflow. Experian enables validation, transformation and profiling directly on database table data without needing to transfer the data to the platform. Its most notable strength is in providing enrichment support for party data (e.g., consumers, businesses) while applying predictive analytics in real time.
Experian’s augmentation capabilities are most pronounced in its use of ML-based algorithms to detect anomalies in time series data; when an anomaly is detected, it is routed to stewards for investigation. The platform also automatically recognizes matches across all data, and users can construct similarity scoring based on combinations of built-in fuzzy matching algorithms. It also supports ML-driven data analysis to automatically suggest and apply DQ rules to datasets.
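A minimal sketch of similarity scoring built from combinations of fuzzy comparisons, using only the Python standard library; the fields, weights and threshold are illustrative and not Experian’s matching algorithms:

```python
from difflib import SequenceMatcher

WEIGHTS = {"name": 0.6, "address": 0.4}   # illustrative field weights

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec1, rec2):
    return sum(w * similarity(rec1[f], rec2[f]) for f, w in WEIGHTS.items())

a = {"name": "Jon Smyth",  "address": "12 High Street, Leeds"}
b = {"name": "John Smith", "address": "12 High St, Leeds"}

score = match_score(a, b)
print(f"score={score:.2f}", "likely match" if score > 0.8 else "needs review")
```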
Experian’s DQ product lines received their lowest scores for unstructured data support due to very limited capabilities in this category: they provide native support only for semistructured data formats such as JSON and XML, and they do not offer sentiment analysis or entity recognition. The products also scored low in active metadata support because they do not support lineage capabilities, and parsing relies on predefined approaches rather than ML. Due to these limitations, Experian is positioned below average across most DQ use cases.
IBM
IBM’s DQ products are Cloud Pak for Data (specifically, IBM Knowledge Catalog [IKC] and IBM Match 360), IBM Databand, IBM DataStage, IBM Manta Unified Data Lineage and legacy products InfoSphere Information Analyzer and QualityStage. IBM Cloud Pak for Data uses a cloud-native microservices architecture built on the Cloud Pak platform, which runs on Red Hat OpenShift. Each component is deployed as a set of microservices built primarily on Java with Node.js. Gartner estimates around 2,800 customers for these product lines. Its operations are geographically diversified with clients in various sectors.
IBM scored its highest for active metadata support; alerts, notifications and visualization; and usability, workflow and issue resolution. IBM’s product capabilities around alerts and notification, as well as usability, have been enhanced through the integration of IBM Knowledge Catalog (IKC) with Databand (acquired in 2022) and Manta (acquired in 2023). IKC’s integration with Databand enabled automatic collection of metadata for building historical baselines, detecting anomalies and triaging alerts to remediate DQ issues. It also offers API-based integration to enable proactive alerting and monitoring of DQ issues through embedded workflows in Cloud Pak for Data, thereby tracking issues to resolution within Databand. IKC’s integration with Manta provides a unified data lineage that enables a business user-friendly summary view and drill-down to view technical details, including visibility to data quality scores at different stages of the data pipelines. This allows users to identify upstream sources where DQ issues were introduced into the pipeline by clearly indicating data assets where DQ SLAs or values have been violated. IBM has introduced features to automate and simplify setting goals from SLAs for data freshness, completeness and validity across critical data elements to support regulatory reporting. SLA rules are tied to DQ scores and used to monitor changes, which ensures compliance and triggers third-party tool workflows for remediation.
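Conceptually, tying SLA rules to DQ scores amounts to comparing computed metrics for a critical data element against agreed thresholds and raising violations for downstream workflows. The sketch below is a generic illustration; the metric names and thresholds are assumptions, not IBM’s implementation:

```python
from datetime import datetime, timezone

# Quality metrics for a critical data element, e.g., computed by a nightly job.
metrics = {
    "completeness": 0.93,                       # share of non-null values
    "validity": 0.99,                           # share passing format rules
    "last_loaded": datetime(2025, 1, 10, 2, 0, tzinfo=timezone.utc),
}

sla = {"completeness": 0.95, "validity": 0.98, "max_age_hours": 24}

violations = []
if metrics["completeness"] < sla["completeness"]:
    violations.append("completeness below SLA")
if metrics["validity"] < sla["validity"]:
    violations.append("validity below SLA")
age_hours = (datetime.now(timezone.utc) - metrics["last_loaded"]).total_seconds() / 3600
if age_hours > sla["max_age_hours"]:
    violations.append("data freshness SLA breached")

for v in violations:
    print("SLA violation:", v)   # a real setup would trigger a ticket/workflow here
```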
IBM’s introduction of automated metadata discovery allows for the ingestion, profiling, data quality analysis and scoring, automatic data classification and automatic business-term assignment. It also introduced a relationship explorer, which is a visualization tool that leverages active metadata and graph technology to enable users to graphically navigate between business and technical assets. However, the platform has limited inclusion of augmentation capabilities and hence scores average across the DQ use cases.
IBM received its lowest score for unstructured data support, which is on its roadmap but has limited functionality in the current releases.
Informatica
Informatica offers Intelligent Data Management Cloud (IDMC), which includes Cloud Data Quality (CDQ) and Cloud Data Governance and Catalog as a data management platform supporting various data management capabilities and cloud services. CDQ is its data quality service delivered as part of IDMC that can also be utilized stand-alone. It also offers Informatica Data-as-a-Service for verifying addresses, phone numbers and emails, and Informatica Data Quality as its on-premises solution. IDMC supports multitenant SaaS or PaaS, where its runtime component may be installed on-premises or in a private cloud, but its Data-as-a-Service email and phone offerings are SaaS only. Informatica has an estimated 5,000 customers for these product lines, with top verticals being financial services, public sector and healthcare.
Informatica received its highest scores for profiling and monitoring/detection; rule discovery, creation and management; and active metadata support. The technical metadata captured via profiling and classification links data quality rules to business contexts, such as processes and policies, and is centralized in IDMC’s Data Governance and Catalog. IDMC’s integrated DQ and data observability features allow issue detection and resolution. Its ability to map data lineage provides clear visibility of data flow across systems to help users identify the root causes of data quality issues. It allows users to define DQ rules in natural language while automatically applying them across data stores to support data integration and MDM needs. These capabilities make analytics, AI and machine learning; and master data management Informatica’s strongest use cases.
Informatica’s augmentations leverage CLAIRE GPT and AI copilot to provide AI-enhanced features, such as NLP for easy interactions, automated rule generation, continuous learning for improved accuracy and answers to complex queries. CLAIRE GPT allows users to obtain insights into DQ by asking prompts related to DQ scores, assets with associated rules and scores, stale scores and assets missing quality checks. Informatica has also introduced auto catalog and intelligent glossary association through CLAIRE.
Informatica received its lowest scores for unstructured data support due to its limited ability to natively support unstructured data analysis. Analysis of image, audio and video data requires integration with third-party tooling for base analysis, and it lacks out-of-the-box features to support sentiment analysis.
Irion
Irion’s DQ tool is part of Irion EDM, an enterprise data management platform. It currently supports on-premises, hosted on-premises and SaaS offerings, and its PaaS DQ solution is currently in development. It has 70 active customers, all of which use the DQ features, primarily from the banking and securities, and insurance industries, with a small percentage in the utilities sector.
Irion EDM received its highest scores for rule discovery, creation and management; profiling and monitoring/detection; and active metadata support. It offers a user-friendly experience for creating and implementing rules that cater to nontechnical users. It leverages metadata to automatically trigger DQ rules based on threshold settings to generate a technical control rule. If the threshold is exceeded, the platform can trigger an email alert or open a ticket on an incident management system for users to act. It uses a smart data profiling function to analyze the data and calculate indicators, assigning a reliability percentage to each. Users can customize profiling tasks based on specific needs and requirements.
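A bare-bones analogue of such profiling is shown below: per-column fill rate and format conformity are computed and combined into a naive reliability percentage. The columns, patterns and weighting are invented for illustration and are not Irion’s smart profiling logic:

```python
import re

rows = [
    {"customer_id": "C001", "email": "a@x.com"},
    {"customer_id": "C002", "email": ""},
    {"customer_id": "C003", "email": "not-an-email"},
    {"customer_id": "",     "email": "d@z.io"},
]

PATTERNS = {"customer_id": r"^C\d{3}$", "email": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"}

for col, pattern in PATTERNS.items():
    values = [r[col] for r in rows]
    filled = [v for v in values if v]
    fill_rate = len(filled) / len(values)
    conformity = sum(bool(re.match(pattern, v)) for v in filled) / max(len(filled), 1)
    reliability = 100 * (fill_rate + conformity) / 2     # naive combined indicator
    print(f"{col}: fill={fill_rate:.0%} conform={conformity:.0%} reliability={reliability:.0f}%")
```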
Irion’s augmentation capabilities are driven by its Data Artificial Intelligence System (DAISY), which leverages GenAI to provide AI-driven insights, automation and extended features around DQ tasks such as defining transformations using NLP, data discovery, and rule creation and generation. It has introduced DAISY add-ons, such as Copilot and Rule Wizard, that enable nontechnical users to generate rules without deep understanding of technical details or code writing. The Copilot and Pipeline Wizard offering is another DAISY add-on that enables nontechnical users to create data pipelines. It also introduced Context Monitor to provide an advanced and real-time view for monitoring DQ processes, such as the collection and consolidation of DQ outcomes and statistics, sourced from both internal and external DQ spokes to the central hub.
Irion EDM received its lowest score for matching, linking and merging. It currently has minimal augmentation or automation features and provides users with the ability to define rules for performing exact value-based matching, which positions its MDM use case below average. The platform has limited native lineage capability, which is currently offered through integration and partnership with Orion, providing physical data lineage using 80-plus Orion Governance metadata ingestors.
Precisely
Precisely’s DQ products include the Precisely Data Integrity Suite, Data360, Spectrum Quality and Trillium. It supports SaaS, on-premises (single server or cluster), hosted/private cloud, public cloud or customer VPC environments. Precisely also offers many data validation services via SaaS offerings, such as email validation and address validation, and it provides a freemium, downloadable version of Data360 for its end users. Precisely has approximately 4,900 customers across sectors, including financial services, insurance and telecommunications industries.
Precisely received its highest scores for data transformations; matching, linking and merging; and usability, workflow and issue resolution. Its ML capability learns from human behavior to automate the process of entity resolution and parse data out-of-the-box. Users can receive automated suggestions for standardizing and cleansing data during the profiling process. It has APIs for address parsing (to streamline the complexity of local variations while separating an address into a standardized format), phone verification (to parse and standardize global numbers into the correct format) and email verification (to parse and validate existing email accounts).
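As a rough analogue of these parse-and-standardize services (not Precisely’s APIs), the sketch below normalizes a phone number into an E.164-style string and performs a basic email syntax check; real verification services add reference data, local formatting rules and deliverability checks:

```python
import re

def standardize_phone(raw, default_country="44"):
    """Strip formatting and return an E.164-style string (very simplified)."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits
    return "+" + default_country + digits.lstrip("0")

def email_looks_valid(addr):
    """Syntax check only; a verification service would also check the mailbox."""
    return bool(re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", addr))

print(standardize_phone("(0113) 496 0000"))   # +441134960000
print(email_looks_valid("jane.doe@example"))  # False (no domain suffix)
```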
Precisely’s augmentations include ML-based address parsing through geocoding modules to parse addresses, identify entities and validate addresses. It leverages LLMs to standardize country names and identify the related country for data, and when a country field is not supplied or cannot be determined, the solution relies on LLM calls to help improve country matches based on address validation. For matching feature automation, it offers a “Relationship Linker” called “Commonization” that enables the solution to automatically suggest the best record from a matched set using user-supplied criteria. Most of its key capabilities are centered around data validation and enrichment services (especially address- and location-related), resulting in its low scores across data quality use cases.
Precisely received its lowest scores for rule discovery, creation and management; and unstructured data support. Its ability to generate technical DQ rules using NLP or LLMs is still on the roadmap. It’s also worth noting that, while Precisely’s Data Integrity Suite integrates other solutions (e.g., Spectrum for data and embedded location intelligence, Trillium for deep rules libraries, Data360 DQ for reconciliation), these solutions have different maturity levels across DQ capabilities.
Qlik
Since its acquisition of Talend, Qlik has consolidated all the data governance and DQ capabilities under Qlik Talend to facilitate data foundations for AI, analytics and operations, and it no longer supports open-source or freemium versions of its DQ products. Its data quality products include Qlik Talend Cloud, Talend Data Fabric, Talend Data Catalog and Qlik Answers. Qlik Talend is available as a hybrid cloud and on-premises offering, and it is also available as a SaaS offering via Qlik Talend Cloud. Qlik has around 3,000 active customers for its DQ offering from key industries such as financial services, manufacturing, and retail/wholesale.
Qlik Talend received its highest scores for data transformations; matching, linking and merging; and active metadata support. Qlik Talend leverages ANother Tool for Language Recognition (ANTLR), a parser generator that allows developers to define complex parsing rules with a no-code approach, and it also supports multiple built-in preparation functions to allow business users to perform parsing. The parsing capabilities also extend to unstructured data, where custom, grammar-specific parsers can be built using ANTLR. Users can leverage LLMs and other AI/ML models to standardize structured data, such as names, phone numbers and addresses. Qlik Talend Cloud platform applications keep track of lineage internally for documentation, navigation and impact analysis. It also uses ML for data matching, allowing data experts to label suspected matches using a merging campaign enabled by automatic survivorship rules. Talend Data Fabric components like tMatchPairing, tMatchModel and tMatchPredict leverage ML techniques to automatically identify inconsistent data values in a dataset as well.
Qlik Talend’s augmentations include remediation and stewardship functions supported by self-training models that learn the rules from data stewards and apply them to the complete dataset. Qlik Talend’s latest development was its introduction of Trust Score for AI, a framework designed to enhance the reliability, trust and quality of the data for AI use cases. Consequently, Qlik’s positioning for the analytics, AI and machine learning use case is the strongest compared to other DQ use cases, in which Qlik is placed above average.
Qlik Talend received its lowest score for rule discovery, creation and management. Qlik Talend currently provides its Generative Assistant for SQL code, which can produce SQL transformation code based on a natural language description provided by the user. However, unlike most of its peers, Qlik Talend has yet to introduce LLMs to assist with the creation and modification of data quality rules based on users’ natural language prompts.
SAS
SAS delivers its DQ and data governance components as part of SAS Viya. The SAS Viya Enterprise version includes SAS Event Stream Processing (ESP), a real-time data analytics solution designed to handle streaming data from various sources to analyze and act on data as it arrives. It also includes SAS Data Management Advanced and SAS Data Quality Desktop from the SAS 9 Platform. SAS Viya is a containerized, cloud-agnostic platform that can be deployed on-premises or on cloud platforms like Microsoft Azure, Amazon Web Services (AWS), Google Cloud Platform (GCP) and Red Hat OpenShift using container orchestration systems. It currently does not have a SaaS offering, and it provides a 14-day free trial version for end users to test the capabilities. SAS has an estimated 2,670 active customers for the data quality products spread across key industries.
SAS’s key strength compared to its peers is its ability to provide unstructured data support. It features patented unstructured attribute extraction algorithms that allow users to tag and extract valuable information from free-format text columns and documents. SAS Viya also employs Bidirectional Encoder Representations from Transformers (BERT) for analyzing natural language unstructured data, enabling content identification, keyword extraction, semantic analysis, topic classification, sentiment analysis and language detection. Its advanced NLP capabilities enable powerful sentiment analysis and entity recognition as well.
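Outside SAS Viya, the same kind of transformer-based sentiment scoring can be sketched with the open-source Hugging Face transformers library; the default model and labels below are assumptions for illustration, not SAS’s implementation:

```python
from transformers import pipeline

# Downloads a default sentiment model on first use.
sentiment = pipeline("sentiment-analysis")

comments = [
    "The claims process was quick and the agent was very helpful.",
    "Still waiting three weeks for a response - terrible service.",
]

for text, result in zip(comments, sentiment(comments)):
    print(f"{result['label']:>8}  {result['score']:.2f}  {text}")
```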
SAS supports profiling capabilities with a discovery agent that systematically crawls through connected data sources to enrich metadata and extract insights by analyzing tables and views. Its quality knowledge base utilizes AI to facilitate efficient data parsing based on language, locale and data token types, while offering ML algorithms and techniques for automated data matching. It has also made enhancements to the metadata repository, which compiles metadata for the SAS Viya platform to enable lineage tracking and impact analysis. The latest development was its introduction of SAS Data Maker, which allows users to create and validate synthetic data for AI modeling and development. These capabilities and augmentations position SAS higher for analytics, AI and machine learning; and data engineering use cases.
SAS Viya scored its lowest for active metadata support; and rule discovery, creation and management, due to its limited augmentation capabilities for supporting users. Additionally, users need to integrate with Great Expectations to create and deploy data quality rules automatically. Gartner Peer Insights reviews indicate a steep learning curve for SAS Viya customers.

Context

The market for augmented data quality solutions continues to grow and show strong adoption, reaching $2 billion in 2023 from $1.88 billion in 2022 (see Market Share Analysis: Data Management Software (Excluding DBMS), Worldwide, 2023, published in July 2024). The market’s growth is significantly propelled by the need to support AI-ready data (see A Journey Guide to Delivering AI Success Through ‘AI-Ready’ Data), and vendors are introducing more innovations and enhancements into their platforms to include GenAI capabilities for a boost in user centricity.
Augmented data quality solutions vendors continue to mature their existing capabilities while also introducing additional features. These enhancements aim to support unstructured data quality, ensure the quality of data products, automate and reduce the overall complexity in identifying and resolving data quality issues, and introduce AI assistants that make interfaces more intuitive. These significant strides help organizations incorporate more diverse data sources as part of their data quality roadmap. Organizations can scale their AI ambitions, reduce technical complexity, reduce the overall time to value and ensure better adoption by enabling nontechnical users to be more hands-on with the tools. The most notable differentiating capabilities in this market are listed below.
Unstructured data quality support: While this is a relatively new and maturing capability, vendors acknowledge its importance and are prioritizing it as a key innovation in their platforms’ roadmap (see Develop Unstructured Data Management Capabilities to Support GenAI-Ready Data). Currently, most augmented data quality solutions can connect natively to limited unstructured data sources and leverage third-party services, such as AI image scanners, for advanced processing of image, audio or video data to enable base analysis. Unstructured data support allows users to parse the data into a structured format, perform entity extraction (identifying and classifying entities within text) and conduct sentiment analysis for categorizing the data as positive, negative or neutral. These actions are aided by the analysis of the underlying metadata, the capabilities of ML, fuzzy or user-defined parsing algorithms, and the strength of the supervised, semisupervised or unsupervised algorithms for matching and entity extraction that the platform can support.
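For example, a minimal entity-extraction pass that turns free text into structured fields can be sketched with spaCy; this is a generic illustration rather than any vendor’s implementation, and the en_core_web_sm model must be downloaded separately:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

note = ("Customer Jane Doe called on 3 March about invoice INV-2041 "
        "issued by Acme Ltd in Manchester.")

doc = nlp(note)
record = {ent.label_: ent.text for ent in doc.ents}
print(record)   # e.g. {'PERSON': 'Jane Doe', 'DATE': '3 March', 'ORG': 'Acme Ltd', 'GPE': 'Manchester'}
```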
Support for data products: Gartner’s client inquiries increasingly indicate a strong focus and interest among clients in governing and provisioning data products (see Quick Answer: What Do D&A Leaders Need to Know About Data Products?). This is equally evidenced by the augmented data quality vendors’ rising pursuit of providing support for streamlining data products that drive trusted data consumption (see A Technical Professional’s Guide to Governing Data Products). Vendors are utilizing and augmenting their solutions’ capabilities and extending them to data products, enabling users to profile source data, apply cleansing rules and transform, monitor and share data and insights regardless of where the data is stored (see Ignition Guide to Launching and Managing Data Products). The solutions’ catalog and source-to-destination lineage capabilities also provide impact analysis and end-to-end traceability, clarifying data flow and dependencies for users. Integration with data governance platforms with a data marketplace bridges the gap between data producers and consumers for data product consumption. To offer a stamp of trust on an ongoing basis, the raw data feeding into data products or data pipelines is backed by the observability capabilities of the platform. These capabilities detect anomalies and inaccurate data patterns, thereby helping organizations to quickly evaluate and provide remediation.
User-centric automation: Successful adoption means that data quality tools should be able to guide even nontechnical users to profile, cleanse and remediate DQ issues with a minimum learning curve. The use of NLP and GenAI utilizing LLMs allows users to write both technical and business data quality rules in natural language. Some solutions’ LLMs can even suggest appropriate rules based on best practices and historical data quality standards when business users describe their data quality needs. This helps reduce dependency on IT or technical counterparts, thereby increasing visibility and reducing the overall time to value.
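A hedged sketch of this natural-language-to-rule pattern using the OpenAI Python client is shown below; the prompt, model name and generated SQL format are assumptions, and each vendor’s assistant uses its own models and rule syntax:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

requirement = "Order amount must be positive and order_date cannot be in the future."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Translate data quality requirements into SQL checks that "
                    "return offending rows from the table named in the request."},
        {"role": "user",
         "content": f"Table: orders(amount, order_date). Requirement: {requirement}"},
    ],
)
print(response.choices[0].message.content)   # review before deploying the generated rule
```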
The vendors are also incorporating a self-learning loop, in which each new automated feature reduces manual dependency. This is prevalent in stewardship actions, such as triggering alerts, workflow automation and data remediation, allowing the platform to either take actions or suggest next steps based on past actions.
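A toy version of such a self-learning loop trains a classifier on past steward decisions and uses it to score the relevance of new alerts; the features, data and suppression threshold below are invented for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Past alerts: [severity (1-3), affected rows, business-critical table (0/1)]
X = [[1, 10, 0], [3, 5000, 1], [2, 40, 0], [3, 12000, 1], [1, 5, 0], [2, 800, 1]]
y = [0, 1, 0, 1, 0, 1]   # 1 = steward acted on the alert, 0 = dismissed

model = LogisticRegression().fit(X, y)

new_alert = [[2, 300, 1]]
p = model.predict_proba(new_alert)[0][1]
print(f"Probability the steward will act: {p:.2f}")
if p < 0.2:
    print("Suppressing low-relevance alert")
```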
Intuitive interfaces: With the introduction of an AI assistant, many augmented data quality solutions enable users to ask and structure both tasks and queries in natural language for a frictionless interaction. These capabilities extend to querying unstructured data files and providing analysis on the overall quality of the documentation for stewards or users to either take action or gain confidence in its usage.

Market Definition

Gartner defines augmented data quality (ADQ) solutions as a set of capabilities for enhanced data quality experience aimed at improving insight discovery, next-best-action suggestions and process automation by leveraging active metadata, AI/machine learning (ML), graph analysis and natural language processing (NLP) or large language models (LLMs). Each of these technologies can work independently or cooperatively, to create network effects that can be used to increase automation and effectiveness across a broad range of data quality use cases. These purpose-built solutions include a range of functions such as profiling and monitoring, data transformation, rule discovery and creation, matching, linking and merging, active metadata support, data remediation, and role-based usability to improve data quality.
Packaged ADQ solutions help implement and support the practice of data quality assurance, mostly embedded as part of a broader data and analytics (D&A) strategy. Various existing and upcoming use cases include:
  • Analytics, AI and ML development: Data quality capabilities supporting the preparation and ongoing monitoring of structured/unstructured data for operational analytics, performance management, sentiment analysis, improving the quality of data used for training AI models or algorithms, and actual data feeds to production.
  • Data engineering: Data quality capabilities supporting various key data processing in the context of data engineering initiatives, which include general data integration or data migration scenarios.
  • D&A governance: Data quality capabilities supporting the data governance initiative and its associated key roles (such as chief data and analytics officers [CDAOs] and data stewards) with a focus on increasing the value of data assets while managing risks and compliance.
  • Master data management: Data quality capabilities supporting various key master data domains in the context of master data management (MDM) initiatives and the deployment of custom or packaged MDM solutions.
  • Operational/transactional data quality: Data quality capabilities supporting control over the quality of data created by, maintained by and housed in operational/transactional applications, including Internet of Things (IoT) systems.

Mandatory Features

The mandatory features for this market include:
  • Connectivity: This is the ability to access and apply data quality across a wide range of data sources, including internal/external, at-rest/streaming, on-premises/cloud and relational/nonrelational data sources.
  • Profiling and monitoring/detection: This involves the statistical analysis of diverse datasets (ranging from structured to unstructured data and from on-premises to cloud) to give business users insights into the quality of data and to enable them to identify data quality issues. Results from profiling should be able to drive the ongoing monitoring for data quality issues based on preconfigured, custom-built monitoring rules (or adaptive rules) and alert violations. This automatically detects outliers, anomalies, patterns and drifts.
  • Alerts, notifications and visualization: The interactive analytical workflow and visual output of statistical analysis help business and IT users identify, understand and monitor data quality issues and discover patterns and trends over time through, for example, reports, scorecards, dashboards and mobile devices. Augmented solutions should provide recommendations to users about new alerts to add, based on anomalies detected. Solutions should actively learn which issues are not relevant from user behavior and feedback and then refine generated notifications accordingly.
  • Data transformations (parsing, cleansing and standardizing data): This involves the decomposition and formatting of diverse datasets based on government, industry or local standards, business rules, knowledge bases, metadata and ML. This also involves the modification of data values to comply with domain restrictions, integrity constraints or other business rules. Augmented solutions may use a combination of supervised and semisupervised AI and ML models to parse, standardize and cleanse data.
  • Matching, linking and merging: This involves matching, linking and merging related data entries within or across diverse datasets using a variety of traditional and new approaches, such as rules, algorithms, metadata, AI and ML. Augmented solutions utilize AI and ML models to automatically suggest potential matches and can tune the results based on user feedback. For merging tasks, consolidation rules for merging data are automatically suggested and refined based on user feedback. User involvement in terms of selecting algorithms, constructing specific match and consolidation rules, and configuring and tuning matching parameters should be minimal.
  • Usability, workflow and issue resolution: This involves the suitability of the solution to engage and support various roles (both technical and nontechnical roles) required in a data quality initiative. A workflow includes processes and a user interface to manage data quality issue resolution through the stewardship workflow, and to enable business users to easily identify, quarantine, assign, escalate and resolve data quality issues as facilitated by collaboration, pervasive monitoring and case management. Augmented solutions can initiate and assign data quality issues by leveraging and activating business, technical and operational metadata.
  • Rule discovery, creation and management: This is the ability to discover, recommend, design, deploy and manage business rules for specific data values throughout the life cycle of these rules. The rules can be called within the solution or by third-party applications for data validation or transformation purposes, which can be done in batch or real-time mode. Augmented solutions will support the creation of ML-supported data quality rules through a training interface, using unsupervised algorithms that can infer and create data quality rules automatically, and through the use of natural language statements that describe and can execute a data quality rule.

Common Features

The common features for the stand-alone or unified data management platform market include:
  • Data validation and enrichment: Augmented solutions support integration with third-party AI models, such as LLMs, to validate or enrich datasets. They can also integrate externally sourced data to improve accuracy and completeness, or add value.
  • Active metadata support: This is the ability to collect, discover or import metadata from partners and to build or import lineage to perform rapid root cause analysis of data quality issues and impact analysis of remediation. This also involves applying passive and active metadata findings, and using metadata-based rule recommendations and associations, and data discovery and cataloging. It includes a metrics view based on critical data elements.
  • Deployment environment, architecture and integration with other applications: This covers styles of deployment, hardware and operating system options, configuration of data quality operations and processes, and interoperability with third-party tools. Augmented solutions can integrate observability and monitoring metrics with DataOps tools to improve notification, analysis and orchestration of issues as they pertain to data flows.
  • Multidomain support and address validation/geocoding: This is the ability to address both multiple data subject areas (such as various master data domains and vertical industry domains) and depth of packaged support (such as prebuilt data quality rules) for these subject areas. This also includes capabilities that support location-related data standardization and cleansing, as well as completion for partial data in real-time or batch processes. For augmented solutions, multidomain support should be coupled with the ability to automatically recommend or deploy any prepackaged content based on the explicit or inferred semantics of the data being profiled.
  • Unstructured data support: This is the ability to analyze unstructured or semistructured data to highlight data quality issues based on semantic analysis and business validation logic, and to generate context-specific metadata. AI/ML capabilities are leveraged to validate the accuracy, completeness and consistency of unstructured data by assessing data quality based on the availability and completeness of metadata through the application of specific validation logic. Unstructured data support also includes aid in data preparation using NLP techniques to extract information based on entity recognition and sentiment analysis, and the application of parsing techniques to extract data and prevent noise.

Product/Service Trends

The augmented data quality solutions market consists of software products or services that provide the processes and technologies for identifying, understanding and correcting flaws in data that support effective data and analytics governance across operational business processes and decision making. The packaged solutions available include a range of critical functions, such as profiling, parsing, standardization, cleansing, matching, monitoring, rule creation and analytics, as well as built-in workflow, knowledge bases and collaboration.
The varieties of data quality capabilities and deployment options have expanded. Mainstream data quality vendors also provide greater data insights by detecting anomalies; discovering patterns, trends and relationships of data and tackling data quality issues. The solutions are also frequently integrated with other adjacent data management tools, such as data integration, master data management and metadata management. The deployment options include various and flexible methods and channels, such as SaaS-based services across modern and traditional data stacks.

Critical Capabilities Definition

Active Metadata Support
This is the ability to collect, discover or import metadata from partners, as well as to build or import lineage to perform rapid root cause analysis of data quality issues and impact analysis of remediation.
This also involves applying passive and active metadata findings and making use of metadata-based rule recommendations and associations, as well as data discovery and cataloging. It includes a metrics view based on critical data elements.
Alerts, Notifications and Visualization
The interactive analytical workflow and visual output of statistical analysis help business and IT users identify, understand and monitor data quality issues, as well as discover patterns and trends over time through reports, scorecards, dashboards and mobile devices.
Augmented solutions should provide recommendations to users about new alerts to add based on anomalies detected. Solutions should actively learn from user behavior and feedback about which issues are not relevant and then refine generated notifications accordingly.
Data Transformations
This involves the decomposition and formatting of diverse datasets based on government, industry or local standards, business rules, knowledge bases, metadata and machine learning.
This also involves the modification of data values to comply with domain restrictions, integrity constraints or other business rules. Augmented solutions may use a combination of supervised and semisupervised AI and ML models to parse, standardize and cleanse data.
Matching, Linking and Merging
This involves matching, linking and merging related data entries within or across diverse datasets using a variety of traditional and new approaches, such as rules, algorithms, metadata, AI and machine learning.
Augmented solutions utilize AI and ML models to automatically suggest potential matches and can tune the results based on user feedback. For merging tasks, consolidation rules for merging data are automatically suggested and refined based on user feedback. User involvement in terms of selecting algorithms, constructing specific match and consolidation rules, and configuring and tuning matching parameters should be minimal.
Profiling and Monitoring/Detection
This involves the statistical analysis of diverse datasets (ranging from structured to unstructured data and from on-premises to cloud) to provide business users with insights into the quality of data and enable them to identify data quality issues.
Results from profiling should drive the ongoing monitoring of data quality issues based on preconfigured, custom-built monitoring rules (or adaptive rules) and alert violations. This automatically detects outliers, anomalies, patterns and drifts.
Rule Discovery, Creation and Management
The ability to discover, recommend, design, deploy and manage business rules for specific data values throughout their life cycle. The rules can be called within the solution or by third-party applications for data validation or transformation purposes, which can be done in batch or real-time mode.
Augmented solutions will support the creation of ML-supported data quality rules through a training interface, using unsupervised algorithms that can infer and create data quality rules automatically and through the use of natural language statements that describe and can execute a data quality rule.
Unstructured Data Support
The ability to analyze unstructured or semistructured data to highlight data quality issues based on semantic analysis and business validation logic, and to generate context-specific metadata.
AI/ML capabilities are leveraged to validate the accuracy, completeness and consistency of unstructured data by assessing data quality based on the availability and completeness of metadata through the application of specific validation logic. They also aid in data preparation by using NLP techniques to extract information based on entity recognition and sentiment analysis, and by applying parsing techniques to extract data and prevent noise.
Usability, Workflow and Issue Resolution
The suitability of the solution to engage and support various roles (both technical and nontechnical) required in a data quality initiative.
These also include processes and a user interface to manage data quality issue resolution through the stewardship workflow, enabling business users to easily identify, quarantine, assign, escalate and resolve data quality issues as facilitated by collaboration, pervasive monitoring and case management.

Use Cases

Analytics, AI and Machine Learning
Data quality capabilities support the preparation of data sources for analytics and control the quality of data used for training models and actual data feeds.
The augmented data quality solutions can help data scientists prepare their datasets by fixing data problems, transforming data values and detecting any data bias. This use case increasingly involves combining disparate datasets — including unstructured data, Internet of Things (IoT) data and streaming data — of unknown quality in data warehouse or data lake environments. Therefore, this use case has a heavier emphasis on data profiling, data transformation, matching and detecting any data anomalies and bias. In addition, usability, performance and scalability are key. Data quality capabilities are useful for supporting components of analytics, data science solutions, AI and ML.
Note: This use case focuses on the degree to which augmented data quality solutions support analytics, data science or AI/ML algorithm development, independent of any sales or offerings in that area.
Note: AI and machine learning and analytics and data science were combined into one use case: analytics, AI and machine learning. Both use cases have similar capabilities and largely support data consumption activities.
Data Engineering
Data quality capabilities are applied to data processing in the context of data engineering initiatives, which include data integration or data migration scenarios.
Data engineering initiatives cannot be successful without mechanisms to assure the quality of the data being integrated and delivered. With the increasing complexity and diversity of datasets, the key critical capabilities include profiling, monitoring, observability and active metadata support.
Note: This use case focuses on the degree to which augmented data quality solutions support data engineering, data processing or data integration, independent of any sales or offerings in that area.
Data and Analytics Governance
Data quality capabilities support the data governance initiative and its associated key roles (e.g., data stewards and data owners).
Information leadership roles (e.g., chief data officer) and initiatives focus on increasing the value of data assets while managing risks and compliance. The data governance initiative requires superior capabilities for data profiling, augmented rule creation and management, active metadata support, visualization, workflow and usability to support data governance roles at all levels. This includes data stewards, members of data governance boards/councils and other business-side stakeholders. Nontechnical individuals increasingly support these roles.
Note: This use case focuses on the degree to which augmented data quality solutions support data and analytics governance, independent of any sales or offerings in that area.
Master Data Management
Data quality capabilities are applied to various key master data domains in the context of MDM initiatives and the deployment of custom or packaged MDM solutions.
This use case emphasizes the matching, augmented rule discovery and creation, workflow and multidomain capabilities of the solutions due to the common requirements to resolve master data that is authored in disparate sources. Data quality capabilities are one component, among many, that make up a comprehensive master data management solution.
Note: This use case focuses on the degree to which augmented data quality solutions support master data management, independent of any sales or offerings in that area.
Operational/Transactional Data Quality
Data quality capabilities are applied to controlling the quality of data created by, maintained by and housed in operational/transactional applications, including IoT systems.
As data quality controls are increasingly applied upstream, closer to the source of data, the ability to embed data quality capabilities in operational applications is key. This use case emphasizes core data quality operations, including parsing, standardizing and cleansing, workflow and multidomain, as well as rule management and monitoring.
Note: This use case is focused on the degree to which augmented data quality solutions support operational and transactional use cases, independent of any sales or offerings in that area.

Vendors Added and Dropped

This report has shifted from addressing data quality solutions to covering the augmented data quality solution market.

Added

  • Ab Initio: Ab Initio is an established vendor with wider data management capabilities as part of the larger Ab Initio Data Platform. Its stand-alone offering for data quality is called Ab Initio DQE, which meets the current inclusion criteria.
  • Anomalo: Anomalo is a new entrant into the augmented data quality solutions market and meets the inclusion criteria.
  • Irion: Irion’s platform Irion EDM is an enterprise data management platform and includes data quality as a capability. Irion meets the inclusion criteria.

Dropped

  • Collibra: The vendor was dropped because it does not provide cloud-based/SaaS deployment options.
  • Datactics: The vendor was dropped due to a lack of cloud-native data quality capabilities and unstructured data support.
  • MIOsoft: The vendor was dropped due to lack of support for augmentation of critical data quality functions that leverage AI/ML features, graph analysis and metadata analytics in its product offerings at the time of evaluation.
  • SAP: The vendor was dropped because the product SAP Datasphere was not positioned for general-purpose data quality use cases and required additional SAP components to fully address data quality scenarios at the time of evaluation.

Inclusion and Exclusion Criteria

To qualify for inclusion, vendors must meet all following inclusion criteria:
  • Offer stand-alone software solutions that are positioned, marketed and sold specifically for general-purpose data quality applications. Vendors that provide several data quality product components or unified data management platforms must demonstrate that these are integrated and collectively meet the full inclusion criteria for this Critical Capabilities.
  • Deliver critical augmented data quality functions at a minimum (descriptions are the same as those given in the Market Definition section):
    • Profiling and monitoring/detection
    • Data transformations
    • Rule discovery, creation and management
    • Matching, linking and merging
    • Active metadata support
    • Usability, workflow and issue resolution
    • Alerts, notifications and visualization
    • Unstructured data support
  • Support augmentation of the critical data quality functions listed above by leveraging AI/ML features (supervised, semisupervised or unsupervised methods, NLP-based or LLM-supported), graph analysis and metadata analytics (active metadata).
  • Support the above functions in both scheduled (batch) and interactive (real-time) modes.
  • Enable large-scale deployment via server-based or cloud-based runtime architectures that can support concurrent users and applications. Cloud-based/SaaS versions should support all critical data quality functions independently, as mentioned in the above criteria.
  • Maintain an installed base of at least 50 production paying customers (different companies/organizational entities) for their flagship data quality product (not individual smaller modules or capabilities). The customers must be running in production for at least six months.
  • Include a complete solution addressing administration and management, as well as end-user-facing functionality, for four or more of the following types of users: data steward, data architect, data quality analyst, data engineer, database administrator, data integration analyst, data scientist, data analyst, business intelligence analyst and a citizen user.
  • Provide out-of-the-box, prebuilt data quality rules for data profiling and monitoring, cleansing, standardization and transformation, based on common industry practices.
  • Support integrability and interoperability with other data management solutions such as metadata management, master data management and data integration solutions from third-party tools.
  • Provide direct sales and support operations, or a partner providing sales and support operations in at least two of the following regions: North America, South America, EMEA and Asia/Pacific.
  • The customer base for production deployment must include customers in multiple countries and in more than one region (North America, South America, EMEA, and Asia/Pacific), and be representative of at least three industry sectors.
  • The solution must demonstrate capabilities that were generally available as of 15 October 2024.
The following types of vendor were excluded from this Critical Capabilities, even if their products met the above criteria:
  • Vendors that meet the above criteria but are limited to deployments in a single specific application environment, industry or data domain.
  • Vendors that support limited data quality functionalities — offering no augmentation or automation, or addressing only very specific data quality problems (for example, cleansing and validation). They are excluded because they do not provide the complete suite of data quality functionality expected from today’s augmented data quality solutions.
  • Vendors that support only on-premises deployment and have no option in cloud-based deployment on any public cloud environment (for example, AWS, Azure, or Google Cloud).

Critical Capabilities Methodology

This methodology requires analysts to identify the critical capabilities for a class of products or services. Each capability is then weighted in terms of its relative importance for specific product or service use cases. Next, products/services are rated in terms of how well they achieve each of the critical capabilities. A score that summarizes how well they meet the critical capabilities for each use case is then calculated for each product/service.
"Critical capabilities" are attributes that differentiate products/services in a class in terms of their quality and performance. Gartner recommends that users consider the set of critical capabilities as some of the most important criteria for acquisition decisions.
In defining the product/service category for evaluation, the analyst first identifies the leading uses for the products/services in this market. What needs are end-users looking to fulfill, when considering products/services in this market? Use cases should match common client deployment scenarios. These distinct client scenarios define the Use Cases.
The analyst then identifies the critical capabilities. These capabilities are generalized groups of features commonly required by this class of products/services. Each capability is assigned a level of importance in fulfilling that particular need; some sets of features are more important than others, depending on the use case being evaluated.
Each vendor’s product or service is evaluated in terms of how well it delivers each capability, on a five-point scale. These ratings are displayed side-by-side for all vendors, allowing easy comparisons between the different sets of features.
Ratings and summary scores range from 1.0 to 5.0:
1 = Poor or Absent: most or all defined requirements for a capability are not achieved
2 = Fair: some requirements are not achieved
3 = Good: meets requirements
4 = Excellent: meets or exceeds some requirements
5 = Outstanding: significantly exceeds requirements
To determine an overall score for each product in the use cases, the product ratings are multiplied by the weightings to come up with the product score in use cases.
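For example, with hypothetical weights and ratings (not taken from this research), a use-case score is computed as the weighted sum of capability ratings:

```python
# Hypothetical capability weights for one use case (summing to 100%)
weights = {"profiling": 0.25, "rules": 0.25, "matching": 0.20,
           "active_metadata": 0.15, "usability": 0.15}

# Hypothetical 1-5 ratings for one product
ratings = {"profiling": 4.0, "rules": 3.5, "matching": 3.0,
           "active_metadata": 4.5, "usability": 4.0}

score = sum(weights[c] * ratings[c] for c in weights)
print(f"Use-case score: {score:.2f}")   # 3.75
```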
The critical capabilities Gartner has selected do not represent all capabilities for any product; therefore, they may not represent those most important for a specific use situation or business objective. Clients should use a critical capabilities analysis as one of several sources of input about a product before making a product/service decision.

