12 top machine learning data catalog firms

Published
  • June 27 2018, 3:37pm EDT

Which firms are best for machine learning data catalogs

The machine learning data catalog (MLDC) market is growing because firms want to scale data to the masses through self-service, according to Forrester analyst Michele Goetz. But many back-end data management systems can’t support what she calls “tribal knowledge, provide a good user experience (UX) for data consumers, and scale across a highly federated data ecosystem.” Together with Gene Leganza, Elizabeth Hoberman and Kara Hartig, she looks at 12 leading vendors in this space.

Leaders

Five vendors were named as leaders in the Machine Learning Data Catalogs report. They are Alation, Collibra, IBM, Reltio and Unifi Software. All subhead listings that follow for each firm are from the Forrester Research report.

Alation started the MLDC trend

“In 2012, Alation wanted to re-envision metadata management and governance and be first to market with an ML data catalog,” explain the Forrester analysts. “Today, it provides deep data introspection with its behavioral I/O analysis of data use and queries. Its strategy is to stay true to data cataloging and leverage point solution and platform partnerships, such as with Trifacta and Paxata for anomaly detection, to further extend its footprint.”

Collibra steps further into data management

Collibra is known for its “data governance prowess,” the analysts say. “Collibra now goes beyond semantic metadata management, business glossary functionality and stewardship cockpits. Adding support for systems and logical metadata into its core catalog, Collibra better manages data models, schemas, classification, tagging, and certification. Adding a data shopping environment makes Collibra a hub for data democratization and activation.”

IBM reimagines data

“Revisiting its traditional data management and governance approach to enablement, IBM designed its MLDC from the ground up around role intent and behavior, with ML at the core and the ability to tap into Watson APIs,” the analysts explain. “The UI lets roles work the way they want to and not reorient their data sourcing, stewardship, or administrative processes to match another role’s workspace. Data search builds confidence in data by allowing social sharing, ratings and communication, usage metrics, drill-down to sample views, and ML for classification and tagging.”

Reltio challenges data assumptions and then innovates data

The only master data management (MDM) vendor in this Wave Report, “Reltio continues to show that labels don’t always tell the full story. By building an MDM capability on ML, graph, big data, metadata and services, Reltio was a data catalog all along,” the analysts say. “Data engineers and stewards will be comfortable in the environment and can continue to take advantage of the self-data linking and curation, workflows, rich profiling and actionable charts.”

Unifi Software brings insight where none existed before

“Data users are immediately struck by the Unifi environment’s simplicity: a single search field, designed around the same concept as commercial search engines, so that user intent is center stage for immediate insights,” the analysts explain. “Search is through natural language where users can not only find data they need but also ask questions of the data such as: ‘What was my 2018 revenue?’ or, ‘What was the revenue trend from the past three years?’ Most data preparation vendors see catalogs as a function; Unifi treats it as a core differentiator.”

Strong performers

Six vendors were named as strong performers in the Machine Learning Data Catalogs report. They are Cambridge Semantics, Cloudera, Infogix, Informatica, Oracle and Waterline Data. Again, all subheads for each firm that follow are from the actual report.

Cambridge Semantics brings context to data for insight

“Cambridge Semantics’ cohesive platform of semantic tools, text analytics, cataloging and insight capabilities on a big data foundation helps enterprises across multiple industries overcome challenges of interpreting and standardizing complex data,” the analysts explain. “ML is embedded in the platform to alleviate data science efforts of customers. However, Cambridge Semantics can improve its ML results by augmenting its engineers with data scientists.”

Cloudera makes sense of the data within data lakes

“Cloudera offers advanced cataloging with sophisticated ML capabilities to understand, classify and catalog data ingested into the data lake,” the analysts explain. “The environment provides the right foundation to catalog and search data at scale, but there is an assumption that the data consumer will have expertise in structured query language (SQL) and database environments. However, Cloudera has one of the largest data science workbenches of all vendors in this evaluation and a product road map oriented to extending ML for metadata capture and data management.”

Infogix moves from data auditor to data activator

“Infogix began with a robust tool to audit data against governance policies, but with the acquisition of Lavastorm, Infogix with Data3Sixty is now a complete stewardship, quality and cataloging solution,” the analysts say. “The environment is intuitive and business-oriented for data stewards and data management teams to understand the conditions of the data and create data flows.”

Informatica revives existing data investments

Informatica recently entered the MLDC market with its Enterprise Data Catalog (EDC), the analysts explained. “Going beyond its metadata management and business glossary capabilities, Informatica has now evolved its linked data and graph-based prototypes into an environment that is intelligent, democratized, and user friendly. EDC blends the search-like experience for data with metadata and glossary capabilities that data stewards and engineers have come to expect from their information management solutions.”

Oracle powers up data management

“One of the most compelling aspects of Oracle Enterprise Metadata Management is the recognition that metadata and models exist beyond data sources and live within extract, transform, and load (ETL) and open source environments (e.g., Kafka),” the analysts say. “The ability to bring these data models into the catalog gives a level of data visibility that most modern and traditional tools have lacked. Currently, Oracle uses public and open source models. Moving forward, it can customize and extend models through its acquisition of DataScience.com.”

Waterline Data keeps back the big data swamp monster

“Waterline’s data catalog provides deep profiling of the data and incorporates tribal knowledge that connects system, logical, and semantic insights about data, its lineage, models and fitness for purpose,” according to the analysts. “Deployment of Waterline Data’s MLDC takes longer than average at three to four months. Partnerships are extensive and strong including some seemingly strange bedfellows with other catalog vendors. This allows Waterline customers to synchronize metadata with preexisting or embedded catalogs that may have overlapping functionality such as quality scores, security and access controls, and workflow/task management.”

Contenders

Only one firm, Hortonworks, was named as a contender in the report. No firms were named as niche players, a category normally featured in Forrester Research Wave reports.

Hortonworks knows your data

“The Data Steward Studio helps organizations understand data within the Hortonworks ecosystem through extensive metadata capture about data, data models and schemas from source systems at the file, table and column level,” the analysts explain. “Classifications and ongoing changes are automated in ingestion processes from connected source systems. Hortonworks’ capabilities span from stewardship of data policies and administrative capabilities to manage the clusters to maintenance of connectivity and data science efforts through Zeppelin integration.”