Data Dictionary Case

 

Key Questions

Technical metadata is scattered around, documented in different applications across different branches of the organization. This raises questions such as:

  • Which database attributes and columns are out there, and what do they mean in terms of the glossary?
  • What is the data type of such a data element, and does it have a limit on the length of the value?
  • Can I be absolutely certain that I have all data elements related to the business term “customer” and none are still left undocumented?
  • In which system or application is the field recorded?
  • Which data structures are my data elements part of?
  • Whom should I contact if I cannot find the data type or precision for a certain metric?

The data dictionary envisions a process to document logical and physical data structures and elements systematically, completely and collaboratively across the IT space. It also envisions complete business and technical traceability by relating these structures and elements to both business glossary assets and technology assets.

Traceability Requirements

This data dictionary case is concerned with the following asset types: 

  1. data structures such as data entity, data model, and database;
  2. data elements such as data attribute and table column.

Further, it aims at traceability between the above asset types in different domains, which translates to establishing the following relations in the Operating Model:

  • Data Table consists of Table Column
  • System is a system of record for Data Asset
  • Database is database for Data Asset
  • Code Set is allowed value set for Data Element
  • Business Asset represents Data Asset

These relation types and the associated attribute types are described in the section Asset Types. Naturally, other relation types may be added to the Operating Model to fit a more specific case.
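As an illustration of how such relations support traceability, the following is a minimal sketch in Python. It is not the Collibra API: the `Asset` and `RelationStore` classes and the sample asset names (`CRM`, `CUSTOMER`, `CUST_NAME`, `LEGACY_FLAG`) are hypothetical, and only the relation names are taken from the list above. The sketch stores each relation as a (head, relation, tail) triple and uses it to answer two of the key questions: in which system a table is recorded, and which columns are not yet linked to any business asset.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Asset:
    """Hypothetical stand-in for a data dictionary or glossary asset."""
    name: str
    asset_type: str  # e.g. "Data Table", "Table Column", "System"


@dataclass
class RelationStore:
    """Hypothetical in-memory store of (head, relation, tail) triples."""
    relations: list = field(default_factory=list)

    def relate(self, head, relation, tail):
        self.relations.append((head, relation, tail))

    def tails(self, head, relation):
        # All assets that `head` points to through `relation`.
        return [t for h, r, t in self.relations if h == head and r == relation]

    def heads(self, relation, tail):
        # All assets that point to `tail` through `relation`.
        return [h for h, r, t in self.relations if r == relation and t == tail]


# Sample assets (illustrative names only).
customer = Asset("Customer", "Business Term")
crm = Asset("CRM", "System")
table = Asset("CUSTOMER", "Data Table")
name_col = Asset("CUST_NAME", "Table Column")
flag_col = Asset("LEGACY_FLAG", "Table Column")

model = RelationStore()
model.relate(table, "consists of", name_col)
model.relate(table, "consists of", flag_col)
model.relate(crm, "is a system of record for", table)
model.relate(customer, "represents", name_col)

# In which system is the table recorded?
systems = model.heads("is a system of record for", table)

# Which columns of the table have no business asset linked to them yet?
untraced = [
    col for col in model.tails(table, "consists of")
    if not model.heads("represents", col)
]
```

Here `systems` contains the `CRM` asset and `untraced` contains only `LEGACY_FLAG`, which is the kind of gap the completeness question about the business term “customer” is meant to surface.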

Data Dictionary Lifecycle

The data dictionary business case is illustrated below. The diagram shows it as an extension of the business glossary case. Note that although the glossary is not a prerequisite for the data dictionary, it does offer additional business traceability that gives a complete overview of how business assets are represented in the IT space.

 

The lifecycle consists of the following stages:

  1. Define: in the first step, IT stewards gather all their existing metadata on logical and physical data assets, analyse it and import the relevant parts into the Business Semantics Glossary tool. These descriptions often reside in the systems themselves, or even in design documents in a private desktop folder. The import functionality makes it possible to import thousands of data structures and their elements in one pass. There are two other ways to populate the dictionary: (1) creating assets manually, or (2) discovering them from unstructured text. The outcome is an initial population of data structures and elements, assigned the “candidate” status and organised in different data asset domains. During this process, if a glossary is in place, these data assets may be related to the business assets in the glossary.
  2. Map: during the second step, the stewards define field mappings between data elements from structures in different systems. A field mapping may be attributed with transformation logic that explains how the elements from one structure are to be interpreted in terms of elements from the other. This transformation logic is usually programmed in application logic or left implicit. As mapping specifications are also assets, they also receive the “candidate” status; hence they are to be reviewed and approved using the workflows as explained (a step not visualized in the diagram).
  3. Visualize: approved data dictionary assets can be provisioned to the business user in different ways. They can be exported in MS Excel spreadsheet format, in comma-separated file format or as SQL statements. The Collibra API also offers ways to push approved assets to external applications. The latter can be automated in a timely manner through custom workflows. Once published, the assets are not assigned another status by default. However, it could be useful to distinguish between “Approved” and “Published”. To achieve this, the software can be customised.
  4. Use (not shown in the diagram): during the final stage in the lifecycle, the business users consume published assets in their own applications, such as lineage. This typically results in the identification of inconsistencies or incompleteness in the current data dictionary. These issues can be reported, triggering the cycle to restart.
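The Map and Visualize stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's import, workflow or export functionality: the `FieldMapping` class, the `export_approved` helper, the column paths and the sample transformation rules are all hypothetical. The sketch shows a field mapping that carries explicit transformation logic, starts in the “Candidate” status, and is included in a comma-separated export only once it has been approved.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldMapping:
    """Hypothetical field mapping between two data elements."""
    source: str                       # element in the source structure
    target: str                       # element in the target structure
    transform: Callable[[str], str]   # explicit transformation logic
    status: str = "Candidate"         # new assets start as candidates

    def approve(self):
        # Stands in for the review-and-approve workflow.
        self.status = "Approved"


def export_approved(mappings):
    """Return the approved mappings as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "status"])
    for mapping in mappings:
        if mapping.status == "Approved":
            writer.writerow([mapping.source, mapping.target, mapping.status])
    return buffer.getvalue()


# Illustrative mappings between a source system and a warehouse.
name_map = FieldMapping(
    source="CRM.CUSTOMER.CUST_NAME",
    target="DWH.DIM_CUSTOMER.NAME",
    transform=lambda value: value.strip().upper(),
)
country_map = FieldMapping(
    source="CRM.CUSTOMER.COUNTRY",
    target="DWH.DIM_CUSTOMER.COUNTRY_CODE",
    transform=lambda value: value[:2],
)

name_map.approve()  # country_map stays a candidate
csv_text = export_approved([name_map, country_map])
```

Making the transformation a named, inspectable part of the mapping is the point of the Map stage: logic that would otherwise stay buried in application code becomes something a steward can review before approval.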
