Demystifying data models and query language
Today’s developers face mounting pressure to create data-intensive applications due to:
- The rising amount of data
- The complexity involved
- The increasing demand for quick processing speed
As a first step in building a data-intensive application, data modeling comes into place where they need to understand the data requirements and identify data models. It shows the logical structure of a database, data flow between different data elements, and how they relate to one another. To request and retrieve data from databases and information systems, a query language is used.
Data models and query language play a significant role in application development as they determine how data can be stored and retrieved. There are different types of data models, and those that we use now are an evolution of an initial hierarchical model.
1970: the emergence of data models and query language
The year was 1970, and IBM developed Information Management System (IMS) for stock-keeping in the Apollo space program. IMS was a simple hierarchical model and also the most popular database used for business processing during that time. In hierarchical models, data is represented as a tree with records nested within records. Though it solved the purpose of specific business functions, it could not solve the problem of many-to-many relation. As a solution, two models were proposed: relational model (which later became SQL) and network model.
The network model was standardized by Conference On Data System Language (CODASYL). Unlike the hierarchical model with one parent, the network model can have multiple parents for a record. The links in this model were not foreign keys but pointers, where the access path had to traverse from the root along with the chain of record. Even though the manual access path selection gave the efficient use of hardware at that time, the code for updating and querying the database became complicated and inflexible.
Unprepared for that time, the relational model was introduced with a query optimizer. It can automatically decide which part of the query to execute in which order, and also which index to use. Due to this nature, query optimizers developed much efficiently over the period of time. The main advantage of choosing relational models was that they build a query optimizer only once and all applications using the database benefit from it. In this model, foreign keys were used to represent the many-to-many and many-to-one relations.
Today: the usage of data models and query languages
Historically, data was represented as one big tree as the hierarchical model, but that couldn’t represent many-to-many relationships, so the relational and document model emerged to solve that problem. We will see some verticals on which we could compare the document and relational model, and the ways that can lead to a simpler application code.
A deeply nested document model has high workload and becomes difficult to process, likewise, shredding a document into multiple tables can result in the same. Also, due to the poor support for joints, it leads to a cumbersome schema and unnecessarily complicated application code. These reasons made the relational model a better choice for handling highly interconnected data.
The document model, aka ‘schemaless’, is actually schema-on-read that provides the power of a dynamically typed language. When the application code has to change the format of data, just a minor change on the application code is needed to handle and merge the data henceforth processed. Schema change on the relational model has a poor impact. Running an ‘update’ query on the large table would likely be slow and may need downtime. Some tools do exist as a workaround to this limitation. Schema-on-read is more suited if the data collection does not have the same structure for some reason.
A document model usually stores as a single continuous string, encoded as JSON or XML or binary variant (MongoDB’s BSON). This loads the entire document even if a small part of it gets accessed. It is better, in this case, to keep the document model small and avoid writes that increase the size. On the other hand, the relational database provides support for locality by allowing the schema to declare the table’s row. Oracle does this by ‘multi-table index cluster tables’; Cassandra and HBase use the concept of ‘column-family’.
With the relational model came in the ‘declarative’ query language (SQL), while the IMS and CODASYL queried the database using ‘imperative’ code.
The imperative language makes the computer to perform certain operations in a certain order, like a programming loop that prints elements of an array, whereas, in a declarative language you just have to specify the pattern of the data you want, this abstracts the implementation details of the database engine making it easier for adding performance optimizations for queries under the hood. This is not so for the imperative language, say the database implemented using the imperative language, needs to reclaim unused disk space, it has to move the records around without changing the order in which they appeared since the database never knows whether the code is relying on the ordering. A declarative one never bothers about ordering which makes it easier for parallel executions.
Graph data model and query language
The relation model can handle simple cases of many-to-many relationships, but as it gets more complex, it’s better to move to a graph model. For example, Facebook maintains a single graph with vertices representing people, locations, events, check-ins and comments of users, edges representing friends and who commented on which posts. Even though CODASYL’s network model looks similar to the graph model, both the models are different. Here are some of the differences between CODASYL and graph models.
- CODASYL had a schema that specified which record could be nested with what. But in a graph database, there are no such restrictions. It’s just vertex and edges providing more flexibility for applications to adopt changes.
- In a graph database, you can refer to a record directly by using its unique ID, whereas in the network model, the ‘access path’ was used for traversal.
- Since the network model had an ordering to be maintained, adding a new record made it difficult. However, the graph data model has no order to maintain.
- In CODASYL, the querying was done using imperative language, which made it hard for any schema changes. But, the graph model supports declarative languages like Cypher used in Neo4j or SPARQL.
We have all three models being used today, although one model can be emulated in terms of the other, which would often not be recommended: Each with its own purpose and none better than the other.
Advantages of the right data model
By knowing the pros and cons of different data models, developers can choose the right set of data models to develop their data intensive application. With right data modeling, organizations can
- Improve business processes
- Reduce complexity and risks
- Improve collaboration
Reference: ‘Designing data-intensive applications’ by ‘Martin Kleppmann’