How to Manage Entity Identifiers Across Multiple Micro Services

Jun 2

A critical aspect in the modern world of micro services is how to manage entities that are conceptually the same, but are stored separately in multiple micro services.

Some large companies have tried to tackle this by enforcing a company-wide entity schema to make sure that all systems use the exact same data model for an entity type. This approach normally fails because the schema would have to support all existing and future use cases. The model quickly becomes bloated with fields that are unused, nullable and hard to understand, and the development of the schema becomes a centralized bottleneck.

A better approach is to let each system use entities modeled according to their specific needs.

As an example, let’s look at a classified ad entity on a popular platform like Craigslist or eBay. Ads on such a platform may originate from a multitude of sources. Some ads get created by frontend clients on different platforms like web or mobile apps. Other ads get imported from various external partners or are created in other internal systems. These systems, in turn, might have imported the ads from another source system.

These ads are then used downstream by multiple systems, each saving and using the ads differently according to their unique requirements. Examples of such systems are front-facing read APIs, search APIs, analytics data lakes, export integrations etc.

Each of these systems will model the ads slightly differently, depending on their needs. One system might use a data model where the ads are decorated with related data, such as viewing statistics. Another system may project the ads data into a model that fits a specific export integration.

This flexible modelling approach is much better than using a centralized schema. However, it comes with its own challenges, such as deciding the ownership of entities and their fields, and mapping fields between the models without data being misinterpreted.

Another critical challenge that arises is how to maintain a consistent reference or identifier for these entities across systems. This challenge is what we will focus on in this article.

There are two main things to consider when it comes to identifiers:

How are identifiers generated?
How are identifiers referenced?

Generating Identifiers

The purpose of an identifier is to identify and look up entities within a system, so the identifier, by definition, must be unique. If an entity will only ever exist within a single system owned by a single team, the need for uniqueness is limited to that system. However, in the new world of distributed computing, requirements may arise that make uniqueness within only one system too limited. The different kinds of identifiers described below have different characteristics in this regard.

Sequence-based Identifiers

Traditionally, sequence-based identifiers generated by the same database that stores the entity have been the most common approach, thanks to their ease of use. However, sequence-based identifiers have some drawbacks:

The identifier can only be generated by a single system, the one that owns the database.
The identifier is predictable, which opens the door to security issues if the identifier is publicly exposed, for example in a URL or in an API call.
If many systems use number-based sequences for different kinds of entities, hard-to-detect bugs can arise where these numbers are mixed up. For example, a piece of code may accidentally access an order using an ad id. Since both identifiers are numbers and their ranges overlap, the lookup succeeds and the wrong data is exposed.
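
The mix-up described in the last point can be illustrated with a small sketch. The class, the maps, and the sample data are hypothetical; they only stand in for two stores that happen to use overlapping numeric sequences.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for two services that both key their data by numeric sequences.
final class SequenceMixUp {
    static final Map<Long, String> ADS = new HashMap<>();
    static final Map<Long, String> ORDERS = new HashMap<>();

    static {
        ADS.put(1001L, "ad: used bicycle");
        ORDERS.put(1001L, "order: invoice for another customer");
    }

    // Accepts any long; the type system cannot tell an ad id from an order id.
    static String findOrder(long orderId) {
        return ORDERS.get(orderId);
    }

    public static void main(String[] args) {
        long adId = 1001L;
        // Bug: an ad id is passed where an order id is expected. Because the
        // ranges overlap, the call succeeds and returns an unrelated order.
        System.out.println(findOrder(adId));
    }
}
```

Nothing fails at compile time or at runtime here, which is what makes this kind of bug hard to detect.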

Namespaced Identifiers

Natural Identifiers

A natural identifier (natural key) is a unique key formed from attributes that are used in the real world. An example is the use of ISO 3166 Alpha-2 country codes as keys for country entities. Another potential natural key is the ISBN for book entities, although it only works for published books. If a system also needs to reference unpublished books, the ISBN won’t be usable as the identifier across the entire lifecycle of the book.

It is often hard to find natural keys for your entities. Consider our classified ad example: what could serve as a natural key for it? Almost all fields in the ad would have to form a composite key, but any change to those fields would alter the entity's ID, undermining the model's ability to handle mutable entities.
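
A minimal sketch of that problem, assuming a hypothetical ad model with three made-up fields (title, city and price) that together form the composite key:

```java
// Hypothetical ad model whose identifier is derived from its own fields.
final class NaturalKeyAd {
    final String title;
    final String city;
    final int priceInCents;

    NaturalKeyAd(String title, String city, int priceInCents) {
        this.title = title;
        this.city = city;
        this.priceInCents = priceInCents;
    }

    // Composite key built from the fields; it changes whenever one of them changes.
    String naturalKey() {
        return title + "|" + city + "|" + priceInCents;
    }

    // Editing the price yields an ad with a different key, so existing
    // references to the old key no longer resolve to this ad.
    NaturalKeyAd withPrice(int newPriceInCents) {
        return new NaturalKeyAd(title, city, newPriceInCents);
    }
}
```

Calling withPrice on an ad produces a different naturalKey() than the original ad had, even though it is conceptually the same ad.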

Globally Unique Identifiers

Note that the systems must not each generate their own UUID for the same entity; that would defeat the purpose.

Referencing Identifiers

Strategy A: Utilizing a Consistent Natural Key Across All Services

A natural key is a unique key in a database formed from attributes that are used in the real world. An example is the use of ISO 3166 Alpha-2 country codes as keys for country entities. Another potential natural key is the ISBN for book entities, although it only works for published books. If a system also needs to reference unpublished books, the ISBN won’t be usable as the identifier across the entire lifecycle of the book.

Strategy B: Referencing Identities Between Systems

In this strategy, the systems keep references to the corresponding identifiers used by other systems. An upstream system can store a reference in its database to the identifier in a downstream system. The opposite is also possible, where the downstream system keeps a reference to the upstream identifier.

This strategy is very common and might seem like an intuitive approach, but it can lead to considerable complexity and potential issues when systems are migrated or modified.
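
As a rough illustration, here is a sketch of an upstream system that stores the downstream identifier for each ad. The class and method names are hypothetical, and the remap method only hints at the kind of work a downstream migration can force on every system that holds such references.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical upstream store that keeps, per ad, the id assigned by a downstream system.
final class CrossReferenceStore {
    // upstream ad id -> id of the same ad in the downstream system
    private final Map<Long, String> downstreamIdByAdId = new HashMap<>();

    void link(long adId, String downstreamId) {
        downstreamIdByAdId.put(adId, downstreamId);
    }

    String downstreamIdFor(long adId) {
        return downstreamIdByAdId.get(adId);
    }

    // If the downstream system is migrated and re-keys its entities, every stored
    // reference has to be rewritten, here via a hypothetical old-to-new id mapping.
    void remap(Map<String, String> oldToNewDownstreamId) {
        downstreamIdByAdId.replaceAll(
                (adId, oldId) -> oldToNewDownstreamId.getOrDefault(oldId, oldId));
    }
}
```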

Strategy C: Implementing a Consistent UUID Across All Services

Universally Unique Identifiers (UUIDs) are 128-bit numbers, usually written as 32 hexadecimal digits in five hyphen-separated groups. They offer numerous advantages in the context of distributed systems:

Global Uniqueness: UUIDs are, for all practical purposes, unique across every table, database, and system. This makes the risk of ID collisions negligible.
Ease of Merging: UUIDs enable easy merging of records from different databases and systems. If both systems already use UUIDs, records can simply be merged without ID collisions, and existing references from other systems will continue to work seamlessly.
Client-side Generation: UUIDs can be generated upstream, on the client side. The client can start the lifecycle of an entity by generating a UUID long before it is ready to push the entity to the downstream system. Before pushing the entity, the client can create references to it and create other entities in other backend systems referencing the UUID, all without first having to create the entity in the downstream system. This is extremely powerful since the downstream system no longer needs to provide a create API with sloppy validation and save an incomplete entity just to give the client an identifier. Instead, the downstream system can have a create endpoint with all the validations needed to create a consistent entity according to Domain-Driven Design best practices.
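
Client-side generation could look roughly like the sketch below. AdDraft and ImageUpload are hypothetical types invented for the illustration, and the final print statement stands in for the call to the downstream create endpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

final class ClientSideIdSketch {
    // Hypothetical draft kept on the client before anything is sent downstream.
    static final class AdDraft {
        final UUID id = UUID.randomUUID(); // random (version 4) UUID, no server round trip
        String title;
    }

    // Hypothetical entity in another backend system that refers to the ad by UUID.
    static final class ImageUpload {
        final UUID adId;
        final String fileName;

        ImageUpload(UUID adId, String fileName) {
            this.adId = adId;
            this.fileName = fileName;
        }
    }

    public static void main(String[] args) {
        AdDraft draft = new AdDraft();

        // Other entities can reference the ad before it exists downstream.
        List<ImageUpload> uploads = new ArrayList<>();
        uploads.add(new ImageUpload(draft.id, "front.jpg"));

        // Only once the draft is complete would it be sent to the downstream
        // create endpoint, carrying the UUID it was given at the start.
        draft.title = "Used bicycle";
        System.out.println("create ad " + draft.id + " with " + uploads.size() + " image(s)");
    }
}
```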

Mixing Strategies for the Real World

In an ideal world, we would most often prefer using Strategy A or C, but existing legacy systems might prevent that.

Legacy Scenario: Upstream system does not use UUIDs

Consider a scenario where the downstream system uses UUIDs, but the upstream system uses numeric sequences as keys and these keys are referenced extensively in many other systems. In such a case, it is not realistic in the short term to start using UUID identifiers in the upstream system itself. In this scenario, there are three options:

The upstream system keeps track of the mapping between its numeric id and the downstream UUID. The UUID is generated by the upstream system when creating the entity in the downstream system. Whenever the upstream system needs to update an entity in the downstream system, it will first do a lookup in its mapping table to get hold of the downstream UUID.
The downstream system keeps track of the mapping of the upstream id to the downstream UUID. This option is a little bit more complicated, because the downstream system must also extend its API to allow update requests that use the upstream identifier, often named external_id or source_id in the downstream system.
The upstream system calculates a deterministic downstream UUID based on the upstream numeric key. The calculated UUID is used when creating the entity in the downstream system. This can be achieved by generating a UUIDv5, which hashes a namespace identifier and a name with SHA-1 to create a deterministic UUID. This option is the most appealing one, but is only possible if you can convince the upstream system to calculate the UUID whenever it needs to reference an entity in the downstream system, which is not always possible when you integrate with third parties.
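
The first option could be sketched as below. The class is a hypothetical in-memory stand-in for the upstream mapping table, which in practice would be a persistent table; it generates the UUID the first time an entity is sent downstream and returns the same UUID on every later lookup.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the upstream system's mapping table.
final class UpstreamIdMapping {
    // upstream numeric id -> UUID used for the same entity in the downstream system
    private final Map<Long, UUID> downstreamIdByNumericId = new ConcurrentHashMap<>();

    // On first use (entity creation) a UUID is generated and stored;
    // later calls (updates) return the stored UUID for the same numeric id.
    UUID downstreamIdFor(long numericId) {
        return downstreamIdByNumericId.computeIfAbsent(numericId, id -> UUID.randomUUID());
    }
}
```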

For the third option, here’s a simple Java code example that shows how we could generate a UUIDv5:

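import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

public class UUIDv5 {
    public static UUID generate(UUID namespace, String name) {
        try {
            // UUIDv5 hashes the 16 namespace bytes followed by the name bytes with SHA-1.
            // (UUID.nameUUIDFromBytes is not used: it produces version 3, MD5-based UUIDs.)
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(ByteBuffer.allocate(16)
                    .putLong(namespace.getMostSignificantBits())
                    .putLong(namespace.getLeastSignificantBits()).array());
            ByteBuffer hash = ByteBuffer.wrap(sha1.digest(name.getBytes(StandardCharsets.UTF_8)));
            // Keep the first 128 bits of the hash, then set the version (5) and variant bits.
            long msb = (hash.getLong() & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000005000L;
            long lsb = (hash.getLong() & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
            return new UUID(msb, lsb);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }
}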

In this example, the 'generate' method generates a UUIDv5 using a namespace, which itself is a constant UUID representing the system generating the ID, and a name, which could be the internal numeric key. The result is a deterministic UUID that is globally unique.

Legacy Scenario: Downstream system does not support UUIDs

When faced with a downstream system that doesn't support UUIDs, the best option is to extend that system to store the UUID as an attribute on the entity. The UUID can then be passed to any further downstream systems, and the system can also be extended with an API to look up the entity by the UUID attribute.
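
A minimal sketch of such an extension, assuming a hypothetical legacy store that keeps its numeric primary key and adds the UUID as an ordinary attribute with a lookup on top. The in-memory maps stand in for a table and a unique index on the UUID column.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical legacy store that keeps its numeric key and adds the UUID as an attribute.
final class LegacyAdStore {
    static final class LegacyAd {
        final long id;      // legacy numeric primary key, unchanged
        final UUID uuid;    // UUID received from upstream, stored as a plain attribute
        final String title;

        LegacyAd(long id, UUID uuid, String title) {
            this.id = id;
            this.uuid = uuid;
            this.title = title;
        }
    }

    private final Map<Long, LegacyAd> adsById = new HashMap<>();
    private final Map<UUID, Long> idByUuid = new HashMap<>(); // plays the role of a unique index
    private long nextId = 1;

    LegacyAd create(UUID uuid, String title) {
        if (idByUuid.containsKey(uuid)) {
            throw new IllegalArgumentException("UUID already stored: " + uuid);
        }
        LegacyAd ad = new LegacyAd(nextId++, uuid, title);
        adsById.put(ad.id, ad);
        idByUuid.put(uuid, ad.id);
        return ad;
    }

    // The added lookup: resolve the UUID to the legacy key, then load the entity.
    Optional<LegacyAd> findByUuid(UUID uuid) {
        return Optional.ofNullable(idByUuid.get(uuid)).map(adsById::get);
    }
}
```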

Conclusion

Enforcing a shared, company-wide entity schema is a monumental challenge and, in most cases, isn't practical. Instead, allow each system to keep its own model of the entity, but try to maintain a consistent ID for the entity across all systems.

Aim to use natural keys or UUIDs across all services, but accept that you may need to employ a mix of strategies when legacy or third-party systems are involved. When using UUIDs as keys, ensure that they are generated as early as possible in the data flow. The uniqueness of UUIDs removes concerns around having multiple “first-creators” of an entity.

By carefully selecting and implementing the strategies described above, managing IDs in a distributed system can transition from being a daunting task to a more efficient process that enhances the overall data flow and operations of your systems.

Best of luck. I hope you found this article useful.

Mats Melke