Entity Storage, the Drupal 8 Way

Francesco Placella

In Drupal 7 the Field API introduced the concept of swappable field storage. This means that field data can live in any kind of storage, for instance a NoSQL database like MongoDB, provided that a corresponding backend is enabled in the system. This feature supports some nice use cases, like storing entity data remotely or exploiting storage backends that perform better in specific scenarios. However, it also introduces some problems with entity querying, because a query involving conditions on two fields might end up needing to query two different storage backends, which may be impractical or simply unfeasible.

That's the main reason why in Drupal 8 we switched from field-based storage to entity-based storage, which means that all fields attached to an entity type share the same storage backend. This nicely resolves the querying issue without imposing any practical limitation, because to obtain a truly working system in Drupal 7 you were already basically forced to configure all fields attached to the same entity type to share the same storage engine. The main feature that was dropped in the process was the ability to share a field between different entity types, which was another design choice that introduced quite a few troubles of its own and had no compelling reason to exist.

With this change each entity type has a dedicated storage handler, which for fieldable entity types is responsible for loading, storing, and deleting field data. The storage handler is defined in the handlers section of the entity type definition, through the storage key (surprise!), and can be swapped by modules implementing hook_entity_type_alter().
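As a minimal sketch of swapping a storage handler, a module could implement the alter hook along these lines. The module name mymodule and the class Drupal\mymodule\MyNodeStorage are hypothetical placeholders; such a class would still have to implement the entity storage interface for the entity type.

```php
<?php

/**
 * Implements hook_entity_type_alter().
 *
 * Sketch only: "mymodule" and MyNodeStorage are hypothetical names.
 */
function mymodule_entity_type_alter(array &$entity_types) {
  /** @var \Drupal\Core\Entity\EntityTypeInterface[] $entity_types */
  if (isset($entity_types['node'])) {
    // Replace the class registered under the "storage" handler key.
    $entity_types['node']->setStorageClass('Drupal\mymodule\MyNodeStorage');
  }
}
```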

Querying Entity Data

Since we now support pluggable storage backends, we need to write storage-agnostic contrib code. This means we cannot assume entities of any type will be stored in a SQL database, hence we need to rely more than ever on the Entity Query API, which is the successor of the Entity Field Query system available in Drupal 7. This API allows you to write complex queries involving relationships between entity types (implemented via entity reference fields) and aggregation, without making any assumptions about the underlying storage. Each storage backend requires a corresponding entity query backend, which translates the generic query into a storage-specific one. For instance, the default SQL query backend translates entity relationships to JOINs between entity data tables.

Entity identifiers can be obtained via an entity query or any other viable means, but existing entity (field) data should always be obtained from the storage handler via a load operation. Contrib module authors should be aware that retrieving partial entity data via direct DB queries is a deprecated approach and is strongly discouraged. By doing this you bypass many layers of the Entity API, including the entity cache system, which is likely to make your code less performant than the recommended approach. Aside from that, your code will break as soon as the storage backend is changed, and may not work as intended with modules that correctly use the API. The only legitimate use of backend-specific queries is when they cannot be expressed through the Entity Query API. Even in this case, however, only entity identifiers should be retrieved and then used to perform a regular (multiple) load operation.
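The recommended pattern, identifiers from an entity query followed by a multiple load, might look like the following sketch. The bundle name article is only an example, and the snippet assumes a Drupal 8 release that provides \Drupal::entityTypeManager().

```php
<?php

// Sketch: query for identifiers only, then load full entities through the
// storage handler so that the entity cache and all API layers are used.
$storage = \Drupal::entityTypeManager()->getStorage('node');

$ids = $storage->getQuery()
  // "article" is an example bundle, not something the text prescribes.
  ->condition('type', 'article')
  ->condition('status', 1)
  ->execute();

// One multiple load instead of reading field tables directly.
$nodes = $storage->loadMultiple($ids);
```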

Storage Schema

Probably one of the biggest changes introduced with the Entity Storage API is that the storage backend is now responsible for managing its own schema, if it uses any. Entity type and field definitions are used to derive the information required to generate the storage schema. For instance, the core SQL storage creates (and deletes) all the tables required to store data for the entity types it manages. An entity type can define a storage schema handler via the aptly-named storage_schema key in the handlers section of the entity type definition. However, it does not need to define one if it has no use for it.

Updates are also supported, and they are managed via the regular DB updates UI, which means that the schema will be adapted when entity type and field definitions are changed, added, or removed. The definition update manager also triggers some events for entity type and field definitions, which can be useful to react to the related changes. It is important to note that not all kinds of changes are allowed: if a change implies a data migration, Drupal will refuse to apply it and a migration (or a manual adjustment) will be required to proceed.

This means that if a module requires an additional field on a particular entity type to implement its business logic, it just needs to provide a field definition and apply changes (there is also an API available to do this) and the system will do the rest. The schema will be created, if needed, and field data will be natively loaded and stored. This is definitely a good reason to define every piece of data attached to an entity type as a field. However, if for any reason the system-provided storage is not a good fit, a field definition can specify that it has custom storage, which means the field provider will handle storage on its own. A typical example is a computed field, which may need no storage at all.
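The following sketch illustrates both options. The first function installs a field through the entity definition update manager from an update function; the second shows the flags a definition can set to opt out of system-provided storage. The module name mymodule, the field names, and the update number 8001 are hypothetical.

```php
<?php

use Drupal\Core\Field\BaseFieldDefinition;

/**
 * Sketch of an update function installing a new base field on users.
 *
 * "mymodule", "example_counter" and the update number are hypothetical.
 */
function mymodule_update_8001() {
  $definition = BaseFieldDefinition::create('integer')
    ->setLabel('Example counter')
    ->setDefaultValue(0);

  // Arguments: field name, entity type ID, providing module, definition.
  \Drupal::entityDefinitionUpdateManager()
    ->installFieldStorageDefinition('example_counter', 'user', 'mymodule', $definition);
}

/**
 * Sketch: definitions that do not use the system-provided storage.
 */
function mymodule_example_unstored_fields() {
  $fields = [];

  // The field provider loads and saves the values on its own.
  $fields['example_remote'] = BaseFieldDefinition::create('string')
    ->setLabel('Example remotely stored value')
    ->setCustomStorage(TRUE);

  // Computed fields derive their values at runtime and need no storage.
  $fields['example_computed'] = BaseFieldDefinition::create('integer')
    ->setLabel('Example computed value')
    ->setComputed(TRUE);

  return $fields;
}
```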

Core SQL Storage

The default storage backend provided by core is obviously SQL-based. It distinguishes between shared field tables and dedicated field tables: the former are used to store data for all the single-value base fields, that is fields attached to every bundle like the node title, while the latter are used to store data for multiple-value base fields and bundle fields, which are attached only to certain bundles. As the name suggests, dedicated tables store data for just one field.

The default storage supports four different shared table layouts depending on whether the entity type is translatable and/or revisionable:

  • Simple entity types use only a single table, the base table, to store all base field data.
    | entity_id | uuid | bundle_name | label | … |
    
  • Translatable entity types use two shared tables: the base table stores entity keys and metadata only, while the data table stores base field data per language.
    | entity_id | uuid | bundle_name | langcode |
    
    | entity_id | bundle_name | langcode | default_langcode | label | … |
    
  • Revisionable entity types also use two shared tables: the base table stores all base field data, while the revision table stores revision data for revisionable base fields and revision metadata.
    | entity_id | revision_id | uuid | bundle_name | label | … |
    
    | entity_id | revision_id | label | revision_timestamp | revision_uid | revision_log | … |
    
  • Translatable and revisionable entity types use four shared tables, combining the types described above: the base table stores entity keys and metadata only, the data table stores base field data per language, the revision table stores basic entity key revisions and revision metadata, and finally the revision data table stores base field revision data per language for revisionable fields.
    | entity_id | revision_id | uuid | bundle_name | langcode |
    
    | entity_id | revision_id | bundle_name | langcode | default_langcode | label | … |
    
    | entity_id | revision_id | langcode | revision_timestamp | revision_uid | revision_log |
    
    | entity_id | revision_id | langcode | default_langcode | label | … |
    

The SQL storage schema handler supports switching between these different table layouts, if the entity type definition changes and no data is stored yet.

Core SQL storage aims to support any table layout, hence modules explicitly targeting a SQL storage backend, such as Views, should rely on the Table Mapping API to build their queries. This API allows retrieval of information about where field data is stored and thus is helpful to build queries without hard-coding assumptions about a particular table layout. At least this is the theory. In practice, core currently does not fully support this use case, as some required changes have not been implemented yet (more on this below). Core SQL implementations currently rely on the specialized DefaultTableMapping class, which assumes one of the four table layouts described above.
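As a rough sketch of what relying on the Table Mapping API looks like, SQL-specific code can ask the storage where a field lives instead of hard-coding table names. The field name node_count is the example field defined later in this article; the snippet assumes the storage in use is SQL-based and checks for that explicitly.

```php
<?php

use Drupal\Core\Entity\Sql\SqlEntityStorageInterface;

$storage = \Drupal::entityTypeManager()->getStorage('user');

// Table mappings only exist for SQL-based storage handlers.
if ($storage instanceof SqlEntityStorageInterface) {
  $table_mapping = $storage->getTableMapping();

  // Name of the table holding the field data (shared or dedicated).
  $table = $table_mapping->getFieldTableName('node_count');

  // Column names keyed by field property name, e.g. "value".
  $columns = $table_mapping->getColumnNames('node_count');
}
```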

A Real Life Example

We will now have a look at a simple module exemplifying a typical use case: we want to display a list of active users having created at least one published node, along with the total number of nodes created by each user and the title of the most recent node. Basically a simple tracker.

User activity tracker

Displaying such data with a single query can be complex and will usually lead to very poor performance, unless the number of users on the site is quite small. A typical solution in these cases is to rely on denormalized data that is calculated and stored in a way that makes it easy to query efficiently. In our case we will add two fields to the User entity type to track the last node and the total number of nodes created by each user:

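use Drupal\Core\Entity\EntityTypeInterface;
use Drupal\Core\Field\BaseFieldDefinition;

/**
 * Implements hook_entity_base_field_info().
 */
function active_users_entity_base_field_info(EntityTypeInterface $entity_type) {
 $fields = [];

 // Only the User entity type gets the tracking fields.
 if ($entity_type->id() === 'user') {
   // Reference to the most recent node created by the user.
   $fields['last_created_node'] = BaseFieldDefinition::create('entity_reference')
     ->setLabel('Last created node')
     ->setRevisionable(TRUE)
     ->setSetting('target_type', 'node')
     ->setSetting('handler', 'default');

   // Denormalized total of nodes created by the user.
   $fields['node_count'] = BaseFieldDefinition::create('integer')
     ->setLabel('Number of created nodes')
     ->setRevisionable(TRUE)
     ->setDefaultValue(0);
 }

 return $fields;
}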

Note that the fields above are marked as revisionable so that if the User entity type itself is marked as revisionable, our fields will also be revisioned. The revisionable flag is ignored on non-revisionable entity types.

After enabling the module, the status report will warn us that there are DB updates to be applied. Once complete, we will have two new columns in our user_field_data table ready to store our data. We will now create a new ActiveUsersManager service responsible for encapsulating all our business logic. Let's add an ActiveUsersManager::onNodeCreated() method that will be called from a hook_node_insert implementation:

 public function onNodeCreated(NodeInterface $node) {
   // Update the denormalized tracking fields on the node author.
   $user = $node->getOwner();
   $user->last_created_node = $node;
   $user->node_count = $this->getNodeCount($user);
   $user->save();
 }

 protected function getNodeCount(UserInterface $user) {
   // Aggregated entity query: count the nodes authored by the user.
   $result = $this->nodeStorage->getAggregateQuery()
     ->aggregate('nid', 'COUNT')
     ->condition('uid', $user->id())
     ->execute();

   // The aggregate is returned under the "nid_count" alias.
   return (int) $result[0]['nid_count'];
 }

As you can see this will track exactly the data we need, using an aggregated entity query to compute the number of created nodes.

Since we need to also act on node deletion (hook_node_delete), we need to add a few more methods:

 public function onNodeDeleted(NodeInterface $node) {
   $user = $node->getOwner();
   // Only recompute the reference if it pointed to the deleted node.
   if ($user->last_created_node->target_id == $node->id()) {
     $user->last_created_node = $this->getLastCreatedNode($user);
   }
   $user->node_count = $this->getNodeCount($user);
   $user->save();
 }

 protected function getLastCreatedNode(UserInterface $user) {
   $result = $this->nodeStorage->getQuery()
     ->condition('uid', $user->id())
     ->sort('created', 'DESC')
     ->range(0, 1)
     ->execute();

   // reset() returns FALSE on an empty result: use NULL to empty the field.
   return $result ? reset($result) : NULL;
 }

In the case where the user's last created node is the one being deleted, we use a regular entity query to retrieve an updated identifier for the user's last created node.
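For completeness, the two hook implementations that call the manager could look like this sketch. The service ID active_users.manager is an assumption; it would need to match whatever ID the service is registered under in the module's services.yml file.

```php
<?php

use Drupal\node\NodeInterface;

/**
 * Implements hook_node_insert().
 */
function active_users_node_insert(NodeInterface $node) {
  // "active_users.manager" is a hypothetical service ID.
  \Drupal::service('active_users.manager')->onNodeCreated($node);
}

/**
 * Implements hook_node_delete().
 */
function active_users_node_delete(NodeInterface $node) {
  \Drupal::service('active_users.manager')->onNodeDeleted($node);
}
```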

Nice, but we still need to display our list. To accomplish this we add one last method to our manager service to retrieve the list of active users:

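 public function getActiveUsers() {
   $ids = $this->userStorage->getQuery()
     ->condition('status', 1)
     ->condition('node_count', 0, '>')
     // Follow the reference: the last created node must be published.
     ->condition('last_created_node.entity.status', 1)
     ->sort('login', 'DESC')
     ->execute();

   // Load through the same storage handler used for the query.
   return $this->userStorage->loadMultiple($ids);
 }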

As you can see, in the entity query above we effectively expressed a relationship between the User entity and the Node entity, imposing a condition through the entity syntax, which the SQL entity query backend implements as a JOIN.

Finally we can invoke this method in a separate controller class responsible for building the list markup:

 public function view() {
   $rows = [];

   foreach ($this->manager->getActiveUsers() as $user) {
     // Plain strings in render arrays are auto-escaped by Twig in released
     // Drupal 8, so no manual escaping is needed. String::checkPlain() was
     // removed before 8.0.0.
     $rows[]['data'] = [
       $user->label(),
       (int) $user->node_count->value,
       $user->last_created_node->entity->label(),
     ];
   }

   return [
     '#theme' => 'table',
     '#header' => [$this->t('User'), $this->t('Node count'), $this->t('Last created node')],
     '#rows' => $rows,
   ];
 }

This approach is way more performant when numbers get big, as we are running a very fast query involving only a single JOIN on indexed columns. We could even skip the JOIN by adding more denormalized fields to our User entity, but I wanted to outline the power of the entity syntax. A possible further optimization would be collecting all the identifiers of the nodes whose titles are going to be displayed and preloading them in a single multiple load operation preceding the loop.
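The preloading idea could be sketched as a small helper like the one below. The function name is hypothetical, and the sketch relies on the assumption that loaded entities are cached by the storage handler, so that later accesses to $user->last_created_node->entity do not trigger one load per row.

```php
<?php

use Drupal\node\Entity\Node;

/**
 * Preloads the last created nodes of the given users in one operation.
 *
 * Hypothetical helper, meant to be called right before the loop in view().
 *
 * @param \Drupal\user\UserInterface[] $users
 *   The active users about to be displayed.
 */
function active_users_preload_last_nodes(array $users) {
  $nids = [];
  foreach ($users as $user) {
    $nids[] = $user->last_created_node->target_id;
  }

  // Drop empty references and duplicates before loading.
  $nids = array_unique(array_filter($nids));

  return $nids ? Node::loadMultiple($nids) : [];
}
```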

Aside from the performance considerations, you should note that this code is fully portable: as long as the alternative backend complies with the Entity Storage and Query APIs, the result you will get will be the same. Pretty neat, huh?

What's Left?

What I have shown above is working code: you can use it right now in Drupal 8. However, there are still quite a few open issues to address before we can consider the Entity Storage API polished enough:

  • Switching between table layouts is supported by the API, but storage handlers for core entity types still assume the default table layouts, so they need to be adapted to rely on table mappings before we can actually change translatability or revisionability for their entity types. See https://www.drupal.org/node/2274017 and follow-ups.
  • In the example above we might have needed to add indexes to make our query more performant, for example, if we wanted to sort on the total number of nodes created. This is not supported yet, but of course «there's an issue for that!» See https://www.drupal.org/node/2258347.
  • There are cases when you need to provide an initial value for new fields when entity data already exists. Think, for instance, of the File Entity module, which needs to add a bundle column to the core File entity. Work is also in progress on this: https://www.drupal.org/node/2346019.
  • Last but not least, most of the time we don't want our users to go and run updates after enabling a module: that's bad UX! A friendlier approach would be to apply updates automatically under the hood. Guess what? You can join us at https://www.drupal.org/node/2346013.

Your help is welcome :)

So What?

We have seen the recommended ways to store and retrieve entity field data in Drupal 8, along with (just a few of) the advantages of relying on field definitions to write simple, powerful and portable code. Now, Drupal people, go and have fun!

Comments

From the article:

"The main feature that was dropped in the process, was the ability to share a field between different entity types, which was another design choice that introduced quite a few troubles on its own and had no compelling reason to exist."

I've actually worked on several Drupal 7 sites that rely on this feature. Typically a field collection (the backend representation of a complex visual design component) will be attached to both nodes and fieldable panels panes. Attaching it to nodes allows the component to appear as part of the site's structured content, and attaching it to fieldable panels panes allows the administrator to occasionally add it to unstructured landing pages also.

As a hypothetical example, consider a "call to action" component that needs to appear in the right sidebar of every blog post on a site (with content related to the particular blog post), but might also need to be added to either the left or right sidebar on occasional non-blog pages, with independent content.

Typically there will be custom theming and logic associated with this component, so having it be the same actual field is extremely handy.

I haven't yet seen a way to replicate that cleanly in Drupal 8, but I hope there might be one.

One thing that I did not clarify is that in D8 fields are identified by "$entity_type_id.$field_name", hence two entity types can have two different fields with the same name. This should make it easy to share custom theming and logic even if the fields are attached to two different entity types, and thus have two different definitions.

See https://www.drupal.org/node/2554097 for actually installing the fields.