The rapid growth of data volumes and of the number of people working with data creates challenges that organizations have never faced before:
1. Knowledge about data is scattered across numerous systems and owners
2. Legacy data catalogs use proprietary, non-interchangeable metadata formats
3. Legacy data catalogs don't support building federated systems
4. Legacy data catalogs are limited to static data assets: tables, schemas, and so on
5. Legacy data catalogs provide no support for ML models and data pipelines
6. A centralized fleet of domain-specific crawlers is hard to develop, extend, and maintain
7. Data discovery consumes up to 30% of data teams' time
8. Data discovery and access are the major barriers to applying AI at scale in organizations
Several open-source initiatives are already trying to address these challenges. For example, open-source data catalogs like Amundsen, DataHub, and Marquez are built to reduce data discovery time. Their strength, and at the same time their weakness, is the monolithic and closed design of the discovery process: it significantly limits the reuse of discovered metadata across other data discovery products and reduces scalability. These products are successful solutions to problems 1, 6, and 7, but they do not cover problems 2-5 and 8.
The Marquez team introduced the OpenLineage specification to standardize the data lineage discovery process. However, it does not cover entities outside the data lake/warehouse world, such as dashboards, pipelines, and ML models, and it does not enrich metadata with data quality test information, test results, or data profiling.
Open Data Discovery is an open standards specification that unifies metadata formats and allows multiple data sources and participants of the data discovery landscape to exchange metadata effectively, transparently, and consistently.
It describes the semantics of the data discovery process as we envision it. It is data source agnostic by design and is intentionally not tied to the specifics of any particular data source or data catalog.
Core features:
- Standard Open Data Discovery API (ODD API) for currently known data entities
- Extensible model to include data entities that could show up
- Includes entities for the ML world
- Federated Catalog of Catalogs for data discovery
- A reference implementation based on ODD API specification
- An Open-Source product based on open standards
- Composable and pluggable architecture to fit any data strategy/business requirements
- Community-driven to achieve better compatibility with a wide list of integrations
Diagram 1 shows: 1) the currently existing discovery ecosystem, where multiple data sources (feature stores, ETL tools, ML pipelines, data warehouses, data quality tools) and data catalogs exchange data with each other directly; 2) how the process changes with ODD, where the various data sources and data catalogs exchange data in a unified format through ODD Adapters.
The diagrams are inspired by the OpenLineage documentation.
Diagram 2 shows the Open Data Discovery process with pull, push, and federation strategies. Any data source, including a data catalog, can expose the ODD Adapter API or have a dedicated adapter microservice to be discovered. It may also use a push strategy to be combined with already discovered data entities. ODD data catalogs intentionally have no access to the real data and operate only on the consumed metadata.
There are three main groups of ODD clients and partners:
- Data Catalogs (DataHub, Amundsen, Collibra, Alation, etc.)
- Data Assets - sources/consumers (ETL tools, warehouses, BI, feature stores, ML pipelines, data quality tools, etc.)
- End Users (enterprises: their data teams, engineers, product managers, analysts, etc.)
Each of the groups can benefit greatly from engagement with and early adoption of ODD.
For Data Catalogs. Goal: wider adoption & market integration, better product for users, market development.
- Faster time-to-value
- Better integration with the Data Discovery ecosystem
- Improvement of data discovery experience for users
- Covering more use cases for the community
- Onboarding more renowned companies
- Acceleration of the whole ecosystem
For Data Assets. Goal: wider adoption & market integration, better product for users, more & better clients, market development.
- Better integration with the data discovery ecosystem
- Faster adoption and recognition on the market
- Opportunity to onboard more and better clients
For End Users. Goal: fast finding & evaluation of trusted data, producing better end products with a faster time to market, effective metadata exchange between teams & departments.
- Better quality & speed of the data discovery and access
- Fast integration of metadata from various business units
- Ability to quickly find, evaluate and trust data
- Federated solution to bring together all business units
- Data observability, quality, health tracking
- Real end-to-end lineage
- Multi-cloud solution
ODD describes the process of gathering metadata from data sources such as data lakes and data warehouses, the data discovery process through push and pull models, and the APIs that should be provided.
ODD does not describe a data catalog and how it works: its authentication and authorization, how it provides access to data, and so on.
The metadata discovery process is very similar to that of gathering metrics/logs/traces. It can be done through a pull or push model (or both). Each of the models has a range of use cases it suits best. ODD uses both models to effectively cover all core use cases.
Pulling metadata directly from the source is the most straightforward way of gathering it. However, an attempt to develop and maintain a centralized fleet of domain-specific crawlers can easily become a nightmare. Pulling data from multiple sources without having a standard for it means writing multiple source-specific crawlers for each adapter, which would be an overly complex and ineffective solution. ODD solves this issue by providing a universal API adapter (ODD Adapter).
The ODD Adapter, introduced by this specification, is a lightweight service deployed next to a data source (for example, a data warehouse). It is a proxy layer that exposes metadata in a standardized format: the adapter receives requests for data entities and returns those entities. ODD Adapters are designed to be source-specific and expose only the information that can be gathered from a particular data source.
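As an illustration, a pull from an ODD Adapter could return a list of data entities in the format described later in this document. The sketch below is not a normative part of the Adapter API: the response wrapper (items) and the adapter host are assumptions, and only a minimal entity is shown.
{
  "items": [
    {
      "oddrn": "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items",
      "name": "public.items",
      "metadata": []
    }
  ]
}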
The pull model is preferred when:
- Latency of index updates is acceptable
- An adapter already exists
The push model supports a process where individual metadata providers push information to the central repository via APIs. This model is preferred for use cases such as Airflow job runs and data quality check runs, as sketched below.
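For instance, after an Airflow DAG run completes, a thin wrapper around the orchestrator could push a DataTransformerRun entity to the ODD Ingestion API. The payload below is a hedged sketch that reuses the DataTransformerRun fields defined later in this specification; the Airflow host, DAG and run identifiers, and the run ODDRN layout are assumptions.
{
  "oddrn": "//airflow/host/airflow.datacompany.domain/paths/dags/dags/goods_etl/jobs/load_items/runs/2021-02-11",
  "name": "load_items 2021-02-11",
  "metadata": [],
  "data_transformer_run": {
    "transformer_oddrn": "//airflow/host/airflow.datacompany.domain/paths/dags/dags/goods_etl/jobs/load_items",
    "start_time": "2021-02-11T00:00:00Z",
    "end_time": "2021-02-11T00:05:00Z",
    "status": "SUCCESS"
  }
}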
Federation allows scraping data entities from other Open Data Discovery servers.
There are many different use cases for federation. Commonly, it is used to either build scalable Data Catalogs or pull related data entities from one ODD server to another.
Hierarchical federation allows ODD servers to scale to environments with tens of data centers and millions of nodes. In this use case, the federation topology resembles a tree, with higher-level ODD servers collecting data entities from a larger number of subordinated servers.
For example, a setup might consist of many per-datacenter ODD servers that collect data in high detail (instance-level drill-down), and a set of global ODD servers that collect and store data entities from those local servers.
In the case of the cross-service federation, an ODD server of one service is configured to scrape selected data from another service's ODD server to enable queries against both datasets within a single server.
The goal of ODD is to provide a standard protocol on how metadata can be collected and correlated in the most automated way possible.
To enable many different data sources and tools to expose metadata, there must be agreement on what data should be exposed and in what format (structures).
The high-level entity of ODD is DataEntity. It could be any of these entities:
- DataInput (sources of data)
- DataSet (collections of data)
- DataTransformer (transformers of data: ETL or ML training jobs)
- DataTransformerRun (executions of ETL or ML training jobs)
- DataConsumer (consumers of data: ML model artifacts or BI dashboards)
- DataQualityTest (describes tests for particular DataSets)
- DataQualityTestRun (executions of data quality tests)
Each entity has:
- ODDRN (Open Data Discovery Resource Name) - a unique URL describing its place, system, and identifier in this system
- A human-friendly name
- List of metadata extension objects
ODD Data Entity:
DataEntity:
type: object
properties:
oddrn:
type: string
example: "//aws/glue/{account_id}/{database}/{tablename}"
name:
type: string
example:
owner:
type: string
example: "//aws/iam/{account_id}/user/name"
metadata:
type: array
items:
$ref: "#/components/schemas/MetadataExtension"
dataset:
$ref: '#/components/schemas/DataSet'
data_transformer:
$ref: '#/components/schemas/DataTransformer'
data_transformer_run:
$ref: '#/components/schemas/DataTransformerRun'
data_quality_test:
$ref: '#/components/schemas/DataQualityTest'
data_quality_test_run:
$ref: '#/components/schemas/DataQualityTestRun'
data_input:
$ref: '#/components/schemas/DataInput'
data_consumer:
$ref: '#/components/schemas/DataConsumer'
ODD Metadata Extension:
MetadataExtension:
type: object
properties:
schema_url:
description: "The JSON Pointer (https://tools.ietf.org/html/rfc6901) URL to the corresponding version of the schema definition for this extension"
example: https://raw.githubusercontent.com/opendatadiscovery/opendatadiscovery-specification/main/specification/extensions/glue.json#/definitions/GlueDataSetExtension
type: string
format: uri
metadata:
type: object
additionalProperties: true
required:
- schema_url
- metadata
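For example, an adapter for AWS Glue could attach source-specific properties to an entity through a metadata extension. The sketch below reuses the schema_url from the example above; the property names inside metadata are illustrative assumptions.
{
  "schema_url": "https://raw.githubusercontent.com/opendatadiscovery/opendatadiscovery-specification/main/specification/extensions/glue.json#/definitions/GlueDataSetExtension",
  "metadata": {
    "classification": "parquet",
    "compressed": false
  }
}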
DataInputs are sources of your data. They can be website URLs, external S3 buckets, or other real-world data locations.
DataInput:
properties:
outputs:
type: array
items:
type: string
required:
- outputs
Example:
{
"oddrn": "//http/host/www.amazon.com/path/goods",
"name": "Amazon Goods Website",
"metadata": [
{
"schema_url": null,
"metadata": {
"location": "internet"
}
}
],
"data_input": {
"owner": "Amazon",
"description": "Amazon Goods website with api"
}
}
A DataSet is a collection of data stored in a structured, semi-structured, or unstructured format. It might be a table in a relational database, a parquet file on an S3 bucket, a Hive catalog table, and so on.
DataSets can have sub-datasets. For example, a Hive table is a DataSet itself and consists of sub-DataSets in the form of folders/files on HDFS/S3.
ODD Dataset:
DataSet:
properties:
parent_oddrn:
type: string
description:
type: string
updated_at:
type: string
format: date-time
field_list:
type: array
items:
$ref: '#/components/schemas/DataSetField'
required:
- description
- field_list
DataSetField:
type: object
properties:
parent_field_oddrn:
type: string
type:
$ref: '#/components/schemas/DataSetFieldType'
is_key:
type: boolean
is_value:
type: boolean
default_value:
type: string
description:
type: string
stats:
$ref: '#/components/schemas/DataSetFieldStat'
required:
- name
- type
DataSetFieldType:
type: object
properties:
type:
type: string
enum:
- TYPE_STRING
- TYPE_NUMBER
- TYPE_INTEGER
- TYPE_BOOLEAN
- TYPE_CHAR
- TYPE_DATETIME
- TYPE_TIME
- TYPE_STRUCT
- TYPE_BINARY
- TYPE_LIST
- TYPE_MAP
- TYPE_UNION
- TYPE_DURATION
- TYPE_REFERENCE
- TYPE_VECTOR
- TYPE_UNKNOWN
logical_type:
type: string
is_nullable:
type: boolean
required:
- type
- is_nullable
Example (PostgreSQL table):
{
"oddrn": "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items",
"name": "public.items",
"owner": "root",
"metadata": {
"location": "internet",
},
"parent_oddrn": null,
"description": "Amazon Goods table",
"updated_at": "2021-02-11T00:01:00Z",
"fieldList": [
{
"oddrn": "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items/columns/id",
"name": "id",
"owner": "root",
"metadata": {},
"parentFieldOddrn": null,
"type": "TYPE_NUMBER",
"isKey": false,
"isValue": false,
"defaultValue": null,
"description": "Unique identifier field",
"stats": {
"numberStats": {
"lowValue": 1,
"highValue": 10000,
"meanValue": 5000,
"medianValue": 5000,
"nullsCount": 0,
"uniqueCount": 10000
}
}
},
{
"oddrn": "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items/columns/name",
"name": "name",
"owner": "root",
"metadata": {},
"parentFieldOddrn": null,
"type": "TYPE_STRING",
"isKey": false,
"isValue": false,
"defaultValue": null,
"description": "Goods name",
"stats": {
"stringStats": {
"maxLength": 120,
"avgLength": 52,
"nullsCount": 0,
"uniqueCount": 10000
}
}
}
]
}
Example (S3 file):
{
"oddrn": "//aws/s3/sample.data/path/to/folder/file.csv",
"name": "file.csv",
"owner": "aws:iam:88898998/username",
"metadata": {
"location": "internet",
},
"parentOddrn": null,
"description": "Amazon Goods table",
"updatedAt": "2021-02-11T00:01:00Z",
"subtype": "DATASET_TABLE",
"fieldList": [
{
"oddrn": "//aws/s3/sample.data/path/to/folder/file.csv/id",
"name": "id",
"owner": "aws:iam:88898998/username",
"metadata": {},
"parentFieldOddrn": null,
"type": "TYPE_NUMBER",
"isKey": false,
"isValue": false,
"defaultValue": null,
"description": "Unique identifier field",
"stats": {
"number_stats": {
"lowValue": 1,
"highValue": 10000,
"meanValue": 5000,
"medianValue": 5000,
"nullsCount": 0,
"uniqueCount": 10000
}
}
},
{
"oddrn": "//aws/s3/sample.data/path/to/folder/file.csv/name",
"name": "name",
"owner": "aws:iam:88898998/username",
"metadata": {},
"parentFieldOddrn": null,
"type": "TYPE_STRING",
"isKey": false,
"isValue": false,
"defaultValue": null,
"description": "Goods name",
"stats": {
"string_stats": {
"maxLength": 120,
"avgLength": 52,
"nullsCount": 0,
"uniqueCount": 10000
}
}
}
]
}
Feature Groups are entities provided by Feature Stores. They are similar to a table in a database but can expose additional information.
Example ODDRN:
//feast/host/{namespace}/{featuregroup}
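A feature group could be represented with the same DataSet structure, carrying store-specific details in a metadata extension. The sketch below fills the ODDRN template above with an assumed namespace (default) and feature group (driver_stats); the extension properties and the empty field list are also assumptions.
{
  "oddrn": "//feast/host/default/driver_stats",
  "name": "driver_stats",
  "metadata": [
    {
      "schema_url": null,
      "metadata": {
        "online_store": "redis",
        "entity": "driver_id"
      }
    }
  ],
  "dataset": {
    "description": "Driver statistics feature group",
    "field_list": []
  }
}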
Data transformers are entities that consume data and produce other objects, for example an ETL job, an ML experiment, or an ML training run.
DataTransformer:
type: object
properties:
description:
type: string
source_code_url:
type: string
sql:
type: string
inputs:
type: array
items:
type: string
outputs:
type: array
items:
type: string
required:
- description
- inputs
- outputs
DataTransformerRun:
properties:
transformer_oddrn:
type: string
start_time:
type: string
format: date-time
end_time:
type: string
format: date-time
status_reason:
type: string
status:
type: string
enum:
- SUCCESS
- FAILED
- SKIPPED
- BROKEN
- ABORTED
- RUNNING
- UNKNOWN
required:
- transformer_oddrn
- start_time
- end_time
- status
Example ODDRN (Airflow job):
//airflow/host/{host}/paths/{path}/dags/{dag_id}/jobs/{job_id}
Example ODDRN (Kubeflow job):
//kubeflow/host/{host}/paths/{path}/jobs/{job_id}
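Putting this together, an Airflow ETL job could be represented as a DataTransformer whose inputs and outputs reference dataset ODDRNs. The sketch below reuses the PostgreSQL table and S3 file ODDRNs from the dataset examples above; the Airflow host, DAG/job names, and source code URL are assumptions.
{
  "oddrn": "//airflow/host/airflow.datacompany.domain/paths/dags/dags/goods_etl/jobs/export_items",
  "name": "export_items",
  "metadata": [],
  "data_transformer": {
    "description": "Exports goods items from PostgreSQL to S3",
    "source_code_url": "https://git.datacompany.domain/etl/goods_etl",
    "inputs": [
      "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items"
    ],
    "outputs": [
      "//aws/s3/sample.data/path/to/folder/file.csv"
    ]
  }
}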
A data consumer is any data entity built on top of one or many datasets, for example a dashboard or an ML model.
DataConsumer:
type: object
properties:
description:
type: string
inputs:
type: array
items:
type: string
required:
- description
- inputs
Example ODDRN (SageMaker model):
//aws/sagemaker/{account_id}/{model_id}
Example ODDRN (Tableau dashboard):
//tableau/{host}/{path}/{dashboard_id}
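For example, a BI dashboard built on top of the goods table could be described as a DataConsumer whose inputs list the dataset ODDRNs it reads from. The sketch below fills the Tableau ODDRN template above with an assumed host, path, and dashboard id; the input reuses the PostgreSQL table from the dataset example.
{
  "oddrn": "//tableau/bi.datacompany.domain/goods/goods_overview",
  "name": "Goods Overview",
  "metadata": [],
  "data_consumer": {
    "description": "Tableau dashboard with an overview of goods",
    "inputs": [
      "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items"
    ]
  }
}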
Data quality tests are assertions about data. They are the workhorse abstraction covering all kinds of common data issues. Each dataset can be linked to a test suite (referenced by a URL such as a Jira ticket, a Confluence page, or any other link), and each test suite should be linked to one or many datasets. DataQualityTestRun describes the status of a test run.
DataQualityTest:
type: object
properties:
name:
type: string
suite_name:
type: string
suite_url:
type: string
dataset_list:
type: array
items:
type: string
expectation:
type: object
linkedUrlList:
type: array
items:
$ref: '#/components/schemas/LinkedUrl'
required:
- description
- dataset_list
DataQualityTestRun:
type: object
properties:
data_quality_test_oddrn:
type: string
start_time:
type: string
format: date-time
end_time:
type: string
format: date-time
status_reason:
type: string
description:
type: string
status:
type: string
enum:
- SUCCESS
- FAILED
- SKIPPED
- BROKEN
- ABORTED
- RUNNING
- UNKNOWN
required:
- data_quality_test_oddrn
- start_time
- end_time
- status
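To make the relationship between tests, runs, and datasets concrete, the sketch below shows a hypothetical not-null check on the items table published as a DataQualityTest, followed by one execution of it as a DataQualityTestRun. The ODDRN scheme for tests, the suite URL, and the expectation payload are assumptions for illustration.
DataQualityTest example (sketch):
{
  "oddrn": "//dq/host/dq.datacompany.domain/suites/goods_suite/tests/items_id_not_null",
  "name": "items_id_not_null",
  "metadata": [],
  "data_quality_test": {
    "suite_name": "goods_suite",
    "suite_url": "https://jira.datacompany.domain/browse/GOODS-1",
    "dataset_list": [
      "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items"
    ],
    "expectation": {
      "type": "not_null",
      "column": "id"
    }
  }
}
DataQualityTestRun example (sketch):
{
  "oddrn": "//dq/host/dq.datacompany.domain/suites/goods_suite/tests/items_id_not_null/runs/1",
  "name": "items_id_not_null run 1",
  "metadata": [],
  "data_quality_test_run": {
    "data_quality_test_oddrn": "//dq/host/dq.datacompany.domain/suites/goods_suite/tests/items_id_not_null",
    "start_time": "2021-02-11T00:02:00Z",
    "end_time": "2021-02-11T00:02:05Z",
    "status": "SUCCESS"
  }
}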
Data Discovery - the first step of working with data: finding the right data and evaluating it.
Open Data Discovery (ODD) Spec - a specification for the data discovery process.
Open Data Discovery (ODD) Platform - a reference implementation built upon the ODD Spec.
ODD Adapter API - an open API specification of the ODD Adapter to provide data to the ODD Puller.
ODD Adapter - a microservice that implements the ODD Adapter API and provides data source specific entities.
ODD Ingestion API - an open API specification for push strategy ingestion.
ODD Puller - a service that regularly pulls metadata from ODD Adapters.
ODDRN - Open Data Discovery Resource Name (the unique identifier of the data resource).
ETL tools - Extract, Transform, Load tools. They play a key role in data integration strategies, allowing businesses to gather data from multiple sources, consolidate it into a single centralized location, and make different types of data work together. ETL tools collect and refine different types of data and deliver it to data warehouses, or help to migrate it between different sources.
Name | GitHub |
---|---|
Stepan Pushkarev | spushkarev |
German Osin | germanosin |
Elena Goydina | Evanto |
Nikita Dementyev | DementevNikita |
Sofia Shnaidman | soffest |