Reference Data Management Case


Key Questions

IT systems such as CRM or ERP systems in large organizations process transactions based on controlled lists of codes, also known as reference data. Examples are abundant: country codes, language codes, product codes, account identifiers, and many more. Business users also apply these code lists to annotate their daily administrative work or as keywords to look up information. The core problem of this business case is that different cooperating organizations may use different versions of their code sets. The mismatch becomes even worse when IT systems and business users work with different versions at the same time, while continuously handing off information that relies on these code sets. This business case aims to answer questions such as:

  • what version of the project identification codes is used in this database?
  • what ISO country code values are allowed for the instantiation of the business term Country in the marketing glossary?
  • what is the difference between last year's version of the ISO country codes and the version currently in operation internally?
  • if I cannot find a code for a specific account or project, whom should I report it to?

The reference data management case aims at a systematic approach to managing code value hierarchies, grouping code values in sets, and defining complex mappings between them. These mappings enable crosswalks from one information system to another, taking into account how the code sets differ over time. Moreover, code values and code sets may also be related to the business assets managed in a glossary domain.
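The question about differences between two versions of a code set can be illustrated with a small, product-independent sketch. It is not how the Reference Data Accelerator works internally; the `CodeSetVersion` class, the `diff_code_sets` function, and the sample country codes are all hypothetical and serve only to show the idea of comparing versions.

```python
from dataclasses import dataclass


@dataclass
class CodeSetVersion:
    """Hypothetical model: one version of a code set (code -> description)."""
    label: str
    codes: dict


def diff_code_sets(old, new):
    """Report codes added, removed, or re-described between two versions."""
    old_keys, new_keys = set(old.codes), set(new.codes)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
        "changed": sorted(
            code for code in old_keys & new_keys
            if old.codes[code] != new.codes[code]
        ),
    }


# Illustrative sample data only, not an authoritative copy of any ISO list.
previous = CodeSetVersion(
    "previous",
    {"BE": "Belgium", "CZ": "Czech Republic", "AN": "Netherlands Antilles"},
)
current = CodeSetVersion(
    "current",
    {"BE": "Belgium", "CZ": "Czechia", "SS": "South Sudan"},
)

result = diff_code_sets(previous, current)
print(result)
```

A report of this kind gives stewards a concrete list of codes to review before a new version is put into operation.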


The reference data management activities are supported by the Reference Data Accelerator product. In addition, the Business Semantics Glossary product can be used to relate code sets to business assets.

The Reference Data Lifecycle

The reference data case is often the critical starting point for data governance. It can also be the second phase: once a first business glossary has been set up, its assets can be related to the reference data.

 The reference data lifecycle is illustrated below. 

It consists of the following phases:

  1. Intake: the first step is to gather all existing reference data content, analyse it, and import the relevant parts into the Reference Data Accelerator. The import functionality can load thousands of assets in one pass. Two other options are to create assets manually or to discover them from unstructured text. The outcome is an initial population of code value assets, assigned the candidate status and organised in different codelist domains. During this process, the codes may optionally be related to the business assets in the glossary.
  2. Update: the second step involves assigning roles and responsibilities for the respective codelists. The steward teams improve the bulk-imported content and make it ready for review. To orchestrate the improving, reviewing and approving tasks among the team members, the out-of-the-box Operating Model features two approval workflow variants: Approval and Simple Approval. The outcome is a set of code value assets with the status “Approved”.
  3. Map: during the third step, the stewards define crosswalks between code values. A crosswalk may be further annotated with transformation logic that explains side effects of the mapping or possible exceptions. Before the business case, this transformation logic is usually hidden or implicit. Newly produced crosswalks also receive the candidate status; hence they are to be reviewed and approved using the workflows explained in the “Update” step.
  4. Publish/Provision: approved code values can be provisioned to the business user in different ways. They can be exported as an MS Excel spreadsheet, a comma-separated file, or SQL statements. The Collibra API also offers ways to push approved assets to external applications. The latter can be automated through custom workflows so that updates are pushed in a timely manner. Once published, the assets are not assigned another status by default. However, it could be useful to distinguish between “Approved” and “Published”. To achieve this, the software can be customised.
  5. Use (not shown in the diagram): during the final stage of the lifecycle, the business users consume published assets in their own applications, such as reporting. This typically results in the identification of inconsistencies or gaps in the current glossaries. These issues can be reported, which triggers the cycle to restart.
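The Map and Publish/Provision steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Collibra API or the product's data model: the `Status` enum, the `Crosswalk` class, the helper functions, and the sample codes are all hypothetical. It shows a crosswalk carrying a transformation note, starting in the candidate status, and being exported as comma-separated text only once approved.

```python
import csv
import io
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Hypothetical lifecycle statuses, named after those in the text."""
    CANDIDATE = "Candidate"
    APPROVED = "Approved"


@dataclass
class Crosswalk:
    """Hypothetical mapping from a source code value to a target code value."""
    source_code: str
    target_code: str
    transformation_note: str = ""
    status: Status = Status.CANDIDATE  # new crosswalks start as candidates


def approve(crosswalk):
    """Stand-in for an approval workflow: mark the crosswalk as approved."""
    crosswalk.status = Status.APPROVED
    return crosswalk


def translate(code, crosswalks):
    """Map a source code to a target code using approved crosswalks only."""
    for cw in crosswalks:
        if cw.source_code == code and cw.status is Status.APPROVED:
            return cw.target_code
    return None  # no approved crosswalk: a gap to report to the stewards


def export_csv(crosswalks):
    """Export the approved crosswalks as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_code", "target_code", "transformation_note"])
    for cw in crosswalks:
        if cw.status is Status.APPROVED:
            writer.writerow(
                [cw.source_code, cw.target_code, cw.transformation_note]
            )
    return buffer.getvalue()


# Illustrative sample data only.
crosswalks = [
    Crosswalk("UK", "GB", "Legacy system writes UK where the target expects GB"),
    Crosswalk("EL", "GR"),
]
approve(crosswalks[0])

print(translate("UK", crosswalks))
print(translate("EL", crosswalks))
print(export_csv(crosswalks))
```

Filtering on the approved status in both `translate` and `export_csv` mirrors the rule that only reviewed content reaches business users. A lookup that returns `None` corresponds to the kind of gap that the Use step feeds back into the cycle.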

Customer Examples
