Open Data Discovery Specification


Challenge

The rapid growth of data volumes, and of the number of people working with data inside organizations, creates challenges that did not exist before.

  1. Knowledge about data is scattered across numerous systems and owners
  2. Legacy data catalogs use proprietary non-interchangeable metadata formats
  3. Legacy data catalogs don’t support building federated systems
  4. Legacy data catalogs are limited to static data assets - tables, schemas and so on
  5. Legacy data catalogs provide no support for ML models and data pipelines
  6. A centralized fleet of domain-specific crawlers is hard to develop, extend and maintain
  7. Data discovery eats up to 30% of the time of data teams
  8. Data discovery/access is the major barrier to applying AI at scale in organizations

Some open-source initiatives are already trying to address these challenges. For example, open-source data catalogs like Amundsen, DataHub, and Marquez are built to reduce data discovery time. Their strength, and at the same time their weakness, is the monolithic and closed design of the discovery process. It significantly limits the possibilities of reusing discovered metadata in other data discovery products and reduces scalability. These products successfully address problems 1, 6, and 7; however, they don't cover 2-5 and 8.

Marquez's team introduced the OpenLineage specification to standardize the data lineage discovery process. However, it doesn't cover entities outside of the data lake / warehouse world, such as dashboards, pipelines, and ML models. It also doesn't enrich metadata with data quality test information, test results, or data profiling.


Proposed Solution

Open Data Discovery is an open standards specification that unifies metadata formats and allows multiple data sources and participants of the data discovery landscape to exchange metadata effectively, transparently, and consistently.

It describes the semantics of the data discovery process as we envision it. It is data source agnostic by design and is intentionally not tied to the specifics of any particular data source or data catalog.

Core features:

  1. Standard Open Data Discovery API (ODD API) for currently known data entities
  2. Extensible model to include new data entities as they appear
  3. Includes entities for the ML world
  4. Federated Catalog of Catalogs for data discovery
  5. A reference implementation based on ODD API specification
  6. An Open-Source product based on open standards
  7. Composable and pluggable architecture to fit any data strategy/business requirements
  8. Community-driven to achieve better compatibility with a wide range of integrations

[Diagram 1: open-data-discovery-standard-before-and-after]

Diagram 1 shows (1) the currently existing discovery ecosystem, in which multiple data sources (feature stores, ETL tools, ML pipelines, data warehouses, data quality tools) and data catalogs exchange data with each other directly, and (2) how the process changes with ODD: various data sources and data catalogs exchange metadata in a unified format through a single ODD Adapter.

The diagrams are inspired by the OpenLineage documentation.


[Diagram 2: open-data-discovery-reference-catalog-implementation]

Diagram 2 shows the Open Data Discovery process with pull, push, and federation strategies. Any data source, including a data catalog, can expose the ODD Adapter API, or have a dedicated adapter microservice, in order to be discovered. It may also use a push strategy so its metadata can be combined with already discovered data entities. ODD data catalogs intentionally have no access to the real data and operate only on the consumed metadata.


Engagement Benefits

There are 3 main groups of Clients & Partners of ODD:

  • Data Catalogs (DataHub, Amundsen, Collibra, Alation, etc.)
  • Data Assets - sources/consumers (ETL Tools, warehouses, BI, feature stores, ML pipelines, data quality tools, etc.)
  • End Users (enterprises: their data teams, engineers, product managers, analysts, etc.)

Each of the groups can benefit greatly from engagement with and early adoption of ODD.

Data Catalogs

Goal: wider adoption & market integration, better product for users, market development.

  • Faster time-to-value
  • Better integration with the Data Discovery ecosystem
  • Improvement of data discovery experience for users
  • Covering more use cases for the community
  • Onboarding more renowned companies
  • Acceleration of the whole ecosystem

Data Sources

Goal: wider adoption & market integration, better product for users, more & better clients, market development.

  • Better integration with the data discovery ecosystem
  • Faster adoption and recognition on the market
  • Opportunity to onboard more and better clients

End Users

Goal: fast finding & evaluation of trusted data, producing better end products with faster TTM, effective metadata exchange between teams & departments

  • Better quality & speed of the data discovery and access
  • Fast integration of metadata from various business units
  • Ability to quickly find, evaluate and trust data
  • Federated solution to bring together all business units
  • Data observability, quality, health tracking
  • Real end-to-end lineage
  • Multi-cloud solution

Scope

ODD Scope

ODD describes the process of gathering metadata from data storages/sources such as data lakes and data warehouses, the data discovery process through push and pull models, and the APIs that should be provided.

Out of ODD Scope

ODD does not describe the Data Catalog itself and how it works: its authentication and authorization, how it provides access to data, etc.


Discovery Models

Push and Pull Models

The metadata discovery process is very similar to that of gathering metrics/logs/traces. It can be done through a pull or push model (or both). Each of the models has a range of use cases it suits best. ODD uses both models to effectively cover all core use cases.

Pull Model

Pulling metadata directly from the source is the most straightforward way of gathering it. However, developing and maintaining a centralized fleet of domain-specific crawlers can easily become a nightmare. Pulling data from multiple sources without a standard means writing a separate source-specific crawler for every system, which is an overly complex and ineffective solution. ODD solves this issue by providing a universal API adapter (the ODD Adapter).

ODD Adapter

The ODD Adapter, an entity introduced by ODD, is a lightweight service deployed alongside a data source such as a data warehouse. It is a proxy layer that allows gathering metadata in a standardized format. An ODD Adapter receives requests for data entities and returns those entities. ODD Adapters are designed to be source-specific and expose only the information that can be gathered from a particular data source.
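
For illustration, the sketch below shows the kind of payload an ODD Adapter might return when asked for its data entities. The wrapper fields (data_source_oddrn, items) and the exact endpoint are assumptions made for this example; the entity itself follows the DataEntity model defined later in this document, and the ODDRNs are made up.

{
    "data_source_oddrn": "//postgresql/host/pg.sandbox.datacompany.domain",
    "items": [
        {
            "oddrn": "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items",
            "name": "public.items",
            "metadata": [],
            "dataset": {
                "description": "Goods table exposed by a PostgreSQL adapter",
                "field_list": []
            }
        }
    ]
}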

[Diagram: open-data-discovery-pull-strategy-odd]

The pull model is preferred when:

  • Latency in index updates is acceptable
  • An adapter already exists for the data source

Push Model

The push model supports the process in which individual metadata providers push information to the central repository via APIs. This model is preferred for use cases such as Airflow job runs and data quality check runs.
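
As an illustration, a push from an orchestrator such as Airflow might carry a request body like the sketch below. The ingestion endpoint and the wrapper fields (data_source_oddrn, items) are assumptions for this example; the run itself follows the DataTransformerRun model defined later in this document, and the host, DAG, and job names are made up.

{
    "data_source_oddrn": "//airflow/host/airflow.datacompany.domain",
    "items": [
        {
            "oddrn": "//airflow/host/airflow.datacompany.domain/paths/etl/dags/sales_etl/jobs/load_items/runs/1",
            "name": "sales_etl.load_items run 1",
            "metadata": [],
            "data_transformer_run": {
                "transformer_oddrn": "//airflow/host/airflow.datacompany.domain/paths/etl/dags/sales_etl/jobs/load_items",
                "start_time": "2021-02-11T00:00:00Z",
                "end_time": "2021-02-11T00:01:00Z",
                "status": "SUCCESS"
            }
        }
    ]
}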

[Diagram: open-data-discovery-push-strategy-odd]

Data Discovery Federation

Federation allows scraping data entities from other Open Data Discovery servers.

Use Cases

There are many different use cases for federation. Commonly, it is used to either build scalable Data Catalogs or pull related data entities from one ODD server to another.

Hierarchical Federation

Hierarchical federation allows ODD servers to scale to environments with tens of data centers and millions of nodes. In this use case, the federation topology resembles a tree, with higher-level ODD servers collecting data entities from a larger number of subordinated servers.

For example, a setup might consist of many per-datacenter ODD servers that collect data in high detail (instance-level drill-down), and a set of global ODD servers that collect and store aggregated data entities from those local servers.

[Diagram: open-data-discovery-hierarchical-federation]

Cross-Service Federation

In cross-service federation, an ODD server of one service is configured to scrape selected data entities from another service's ODD server, enabling queries against both datasets within a single server.

[Diagram: open-data-discovery-cross-service-federation]


Data Model Specification

The goal of ODD is to provide a standard protocol on how metadata can be collected and correlated in the most automated way possible.

To enable many different data sources and tools to expose their metadata, we need agreement on what data should be exposed and in what format (structures).

The high-level entity of ODD is DataEntity. It could be any of these entities:

  1. DataInput (sources of data)
  2. DataSet (collections of data)
  3. DataTransformer (transformers of data: ETL or ML training jobs)
  4. DataTransformerRun (executions of ETL or ML training jobs)
  5. DataConsumer (consumers of data: ML model artifacts or BI dashboards)
  6. DataQualityTest (describes tests for particular DataSets)
  7. DataQualityTestRun (executions of data quality tests)

Each entity has:

  • ODDRN (Open Data Discovery Resource Name) - a unique URL-like name describing the system the entity belongs to and its identifier within that system
  • A human-friendly name
  • List of metadata extension objects

ODD Data Entity:

DataEntity:
     type: object
     properties:
       oddrn:
         type: string
         example: "//aws/glue/{account_id}/{database}/{tablename}"
       name:
         type: string
         example:
       owner:
         type: string
         example: "//aws/iam/{account_id}/user/name"
       metadata:
         type: array
         items:
           $ref: "#/components/schemas/MetadataExtension"
       dataset:
         $ref: '#/components/schemas/DataSet'
       data_transformer:
         $ref: '#/components/schemas/DataTransformer'
       data_transformer_run:
         $ref: '#/components/schemas/DataTransformerRun'
       data_quality_test:
         $ref: '#/components/schemas/DataQualityTest'
       data_quality_test_run:
         $ref: '#/components/schemas/DataQualityTestRun'
       data_input:
         $ref: '#/components/schemas/DataInput'
       data_consumer:
         $ref: '#/components/schemas/DataConsumer'

ODD Metadata Extension:

MetadataExtension:
     type: object
     properties:
       schema_url:
         description: "The JSON Pointer (https://tools.ietf.org/html/rfc6901) URL to the corresponding version of the schema definition for this extension"
         example: https://raw.githubusercontent.com/opendatadiscovery/opendatadiscovery-specification/main/specification/extensions/glue.json#/definitions/GlueDataSetExtension
         type: string
         format: uri
       metadata:
         type: object
         additionalProperties: true
     required:
       - schema_url
       - metadata

Data Inputs

DataInputs are sources of your data. They can be website URLs, external S3 buckets, or other real-world data locations.

DataInput:
    properties:
      outputs:
        type: array
        items:
          type: string    
    required:
        - outputs

Example:

{
    "oddrn": "//http/host/www.amazon.com/path/goods",
    "name": "Amazon Goods Website",
    "metadata": [
        {
          "schema_url": null,
          "metadata": {
             "location": "internet"
          }
        }        
    ],
    "data_input": {
      "owner": "Amazon",
      "description": "Amazon Goods website with api"
    }
}

DataSets

A DataSet is a collection of data stored in a structured, semi-structured, or unstructured format. It might be a table in a relational database, a parquet file on an S3 bucket, a Hive catalog table, and so on.

DataSets can have sub-datasets. For example, a Hive table is a DataSet itself, and it consists of sub-DataSets in the form of folders/files on HDFS/S3.

ODD Dataset:

DataSet:
 properties:
   parent_oddrn:
     type: string
   description:
     type: string
   updated_at:
     type: string
     format: date-time
   field_list:
     type: array
     items:
       $ref: '#/components/schemas/DataSetField'
 required:
   - description
   - field_list


DataSetField:
 type: object
 properties:
   parent_field_oddrn:
     type: string
   type:
     $ref: '#/components/schemas/DataSetFieldType'
   is_key:
     type: boolean
   is_value:
     type: boolean           
   default_value:
     type: string
   description:
     type: string
    stats:
     $ref: '#/components/schemas/DataSetFieldStat'
 required:
   - name
   - type

DataSetFieldType:
 type: object
 properties:
   type:
     type: string
     enum:
        - TYPE_STRING
        - TYPE_NUMBER
        - TYPE_INTEGER
        - TYPE_BOOLEAN
        - TYPE_CHAR
        - TYPE_DATETIME
        - TYPE_TIME
        - TYPE_STRUCT
        - TYPE_BINARY
        - TYPE_LIST
        - TYPE_MAP
        - TYPE_UNION
        - TYPE_DURATION
        - TYPE_REFERENCE
        - TYPE_VECTOR
        - TYPE_UNKNOWN
   logical_type:
     type: string
   is_nullable:
     type: boolean
 required:
   - type
   - is_nullable

Tables

Example:

{
    "oddrn": "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items",
    "name": "public.items",
    "owner": "root",
    "metadata": {
        "location": "internet",        
    },
    "parent_oddrn": null,    
    "description": "Amazon Goods table",
    "updated_at": "2021-02-11T00:01:00Z",
    "fieldList": [
        {
           "oddrn": "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items/columns/id",
           "name": "id",
           "owner": "root",
           "metadata": {},
           "parentFieldOddrn": null,
           "type": "TYPE_NUMBER",
           "isKey": false,
           "isValue": false,
           "defaultValue": null,
           "description": "Unique identifier field",
           "stats": {
               "numberStats": {
                   "lowValue": 1,
                   "highValue": 10000,
                   "meanValue": 5000,
                   "medianValue": 5000,
                   "nullsCount": 0,
                   "uniqueCount": 10000
               }
           }
        },
        {
           "oddrn": "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items/columns/name",
           "name": "name",
           "owner": "root",
           "metadata": {},
           "parentFieldOddrn": null,
           "type": "TYPE_STRING",
           "isKey": false,
           "isValue": false,
           "defaultValue": null,
           "description": "Goods name",
           "stats": {
               "stringStats": {
                   "maxLength": 120,
                   "avgLength": 52,
                   "nullsCount": 0,
                   "uniqueCount": 10000
               }
           }
        }      
    ]
}

Files

Example:

{
    "oddrn": "//aws/s3/sample.data/path/to/folder/file.csv",
    "name": "file.csv",
    "owner": "aws:iam:88898998/username",
    "metadata": {
        "location": "internet",        
    },
    "parentOddrn": null,    
    "description": "Amazon Goods table",
    "updatedAt": "2021-02-11T00:01:00Z",
    "subtype": "DATASET_TABLE",
    "fieldList": [
        {
           "oddrn": "//aws/s3/sample.data/path/to/folder/file.csv/id",
           "name": "id",
           "owner": "aws:iam:88898998/username",
           "metadata": {},
           "parentFieldOddrn": null,
           "type": "TYPE_NUMBER",
           "isKey": false,
           "isValue": false,
           "defaultValue": null,
           "description": "Unique identifier field",
           "stats": {
               "number_stats": {
                   "lowValue": 1,
                   "highValue": 10000,
                   "meanValue": 5000,
                   "medianValue": 5000,
                   "nullsCount": 0,
                   "uniqueCount": 10000
               }
           }
        },
        {
           "oddrn": "//aws/s3/sample.data/path/to/folder/file.csv/name",
           "name": "name",
           "owner": "aws:iam:88898998/username",
           "metadata": {},
           "parentFieldOddrn": null,
           "type": "TYPE_STRING",
           "isKey": false,
           "isValue": false,
           "defaultValue": null,
           "description": "Goods name",
           "stats": {
               "string_stats": {
                   "maxLength": 120,
                   "avgLength": 52,
                   "nullsCount": 0,
                   "uniqueCount": 10000
               }
           }
        }      
    ]
}

Feature Groups

Feature Groups are entities provided by Feature Stores. They are similar to tables in a database but can expose additional information.

Example ODDRN:

//feast/host/{namespace}/{featuregroup}
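
A hypothetical feature group exposed as a DataSet might look as follows. The host, namespace, and feature names are made up for illustration; the structure follows the DataSet and DataSetField models above.

{
    "oddrn": "//feast/host/feast.datacompany.domain/default/driver_stats",
    "name": "driver_stats",
    "metadata": [],
    "dataset": {
        "description": "Driver statistics feature group",
        "field_list": [
            {
                "name": "driver_id",
                "type": {"type": "TYPE_INTEGER", "is_nullable": false},
                "is_key": true
            },
            {
                "name": "avg_daily_trips",
                "type": {"type": "TYPE_NUMBER", "is_nullable": true},
                "is_key": false
            }
        ]
    }
}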

DataTransformers

DataTransformers are entities that consume data and produce other objects, for example ETL jobs, ML experiments, or ML training jobs.

DataTransformer:
 type: object
 properties:
   description:
       type: string
   source_code_url:
       type: string
   sql:
       type: string                       
   inputs:
       type: array
       items:
            type: string
   outputs:
       type: array
       items:
            type: string
 required:
 - description
 - inputs
 - outputs

DataTransformerRun:
properties:
  transformer_oddrn:
    type: string
  start_time:
    type: string
    format: date-time
  end_time:      
    type: string
    format: date-time
  status_reason:
    type: string
  status:
    type: string
    enum:
      - SUCCESS
      - FAILED
      - SKIPPED
      - BROKEN
      - ABORTED
      - RUNNING
      - UNKNOWN
required:
  - transformer_oddrn
  - start_time
  - end_time
  - status

ETL Jobs

Example ODDRN:

//airflow/host/{host}/paths/{path}/dags/{dag_id}/jobs/{job_id} 
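
For illustration, an Airflow job exposed as a DataTransformer might look as follows. The host, DAG, and job names are made up; the inputs and outputs reuse ODDRNs from the earlier DataSet examples, and the structure follows the DataTransformer model above.

{
    "oddrn": "//airflow/host/airflow.datacompany.domain/paths/etl/dags/sales_etl/jobs/load_items",
    "name": "sales_etl.load_items",
    "metadata": [],
    "data_transformer": {
        "description": "Loads goods items from S3 into the PostgreSQL goods database",
        "source_code_url": "https://git.datacompany.domain/etl/sales_etl.py",
        "inputs": [
            "//aws/s3/sample.data/path/to/folder/file.csv"
        ],
        "outputs": [
            "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items"
        ]
    }
}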

ML Training Jobs

Example ODDRN:

//kubeflow/host/{host}/paths/{path}/jobs/{job_id}

DataConsumers

DataConsumers are data entities built from one or many datasets, for example BI dashboards or ML models.

DataConsumer:
 type: object
 properties:
   description:
     type: string
   inputs:
     type: array
     items:
       type: string
 required:
   - description
   - inputs

ML Models

Example ODDRN:

//aws/sagemaker/{account_id}/{model_id}
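
For illustration, a model registered in SageMaker could be described as a DataConsumer as sketched below. The account ID and model ID are made up; the structure follows the DataConsumer model above.

{
    "oddrn": "//aws/sagemaker/123456789012/price-prediction-model",
    "name": "price-prediction-model",
    "metadata": [],
    "data_consumer": {
        "description": "Price prediction model trained on the goods dataset",
        "inputs": [
            "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items"
        ]
    }
}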

BI Dashboards

Example ODDRN:

//tableau/{host}/{path}/{dashboard_id}

DataQualityTests

Data Quality Tests are assertions for data. They are the workhorse abstraction covering all kinds of common data issues. Each dataset can be linked to a test suite (identified, for example, by a Jira ticket, a Confluence page, or any other URL), and each test suite is linked to one or many datasets. A DataQualityTestRun describes the status of a test run.

DataQualityTest:
        type: object
        properties:
            name:
                type: string
            suite_name:
                type: string              
            suite_url:
                type: string
            dataset_list:
                type: array
                items:
                    type: string
            expectation:
                type: object
            linkedUrlList:
                type: array
                items:
                    $ref: '#/components/schemas/LinkedUrl'
        required:
            - description
            - dataset_list

DataQualityTestRun:
        type: object
        properties:
            data_quality_test_oddrn:
              type: string
            start_time:
              type: string
              format: date-time
            end_time:        
              type: string
              format: date-time
            status_reason:
              type: string
            description:
              type: string
            status:
              type: string
              enum:
                - SUCCESS
                - FAILED
                - SKIPPED
                - BROKEN
                - ABORTED
                - RUNNING
                - UNKNOWN
        required:
            - data_quality_test_oddrn
            - start_time
            - end_time
            - status
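
As an illustration, a data quality test could be described as follows. The ODDRN scheme for quality tests, the suite URL, and the expectation contents are made up for this example; the structure follows the DataQualityTest model above, and the dataset ODDRN reuses the earlier PostgreSQL table example.

{
    "oddrn": "//great_expectations/host/ge.datacompany.domain/suites/goods_suite/tests/items_id_not_null",
    "name": "items_id_not_null",
    "metadata": [],
    "data_quality_test": {
        "name": "items_id_not_null",
        "suite_name": "goods_suite",
        "suite_url": "https://jira.datacompany.domain/browse/DATA-123",
        "dataset_list": [
            "//postgresql/host/pg.sandbox.datacompany.domain/database/goods/schemas/public/tables/items"
        ],
        "expectation": {
            "type": "expect_column_values_to_not_be_null",
            "column": "id"
        }
    }
}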

Glossary

Data Discovery - the first step of working with data: finding the right data and evaluating it.

Open Data Discovery (ODD) Spec - a specification for the data discovery process.

Open Data Discovery (ODD) Platform - a reference implementation built upon the ODD Standard.

ODD Adapter API - an open API specification of the ODD Adapter to provide data to the ODD Puller.

ODD Adapter - a microservice that implements the ODD Adapter API and provides data source specific entities.

ODD Ingestion API - an open API specification for push strategy ingestion.

ODD Puller - a service that regularly pulls metadata from ODD Adapters.

ODDRN - Open Data Discovery Resource Name (the unique identifier of the data resource).

ETL tools - Extract, Transform, Load tools. They play a key role in data integration strategies by allowing businesses to gather data from multiple sources, consolidate it in a single centralized location, and make different types of data work together. ETL tools collect and refine different types of data and deliver it to data warehouses, or help migrate it between different sources.


Specification Status

Proposers

Name               GitHub
Stepan Pushkarev   spushkarev
German Osin        germanosin
Elena Goydina      Evanto
Nikita Dementyev   DementevNikita
Sofia Shnaidman    soffest