Browsing Large Volumes of Data

This issue comes up very often and so I decided it deserves its own article. It almost invariably arises in the situation where a legacy application using a database like Paradox, Dbase, or Access, is being rewritten with a new backend, or the programmer has only previously written applications using such databases. 

The issue concerns the presentation of data. In such applications, it is common to see a form open with a grid that, by default, shows all records in a database table. It doesn't matter if the table contains 100 or 100,000 records, all are directly available in the grid and the user can browse forwards and backwards through all of them.

Performance in most cases is not a problem, this is mainly because, even in cases where the application is multi-user and is thus reading the data across a network, it is the application itself that is directly reading from the database files and is therefore in control of how much and what parts of a table it will read. Also, with grid controls such as the DBGrid in Delphi, the database components only fetch as many rows as necessary as a user scrolls. If the user tries to go to the end of the list directly, because of the nature of such databases, it is possible for the application to read directly from the end of the table without working through the earlier records. So while the wisdom of presenting data this way can be argued, it works and both programmers, and especially users of these applications, have become accustomed to working this way.

Enter the need to rewrite the application to use an SQL Client/Server database in a multitier design (typically using the dbExpress components combined with ClientDatasets in Delphi). One of the first problems encountered is performance. The programmer will drop a data-aware grid on a form, hook it up to the table with an SQL statement that selects all records, and gasp in horror when it takes several minutes to read all 100,000 records before the form is able to display any of them.

While this may seem at first to be a major failing of "new" technology, it really isn't. The trade-offs between desktop-type and SQL databases, and between two-tier and multitier architectures are beyond the scope of this article, suffice to say that the latter in both cases present advantages that generally outweigh the apparent disadvantages.

The first things to understand about Client/Server databases, whether InterBase, MS SQL, Oracle, etc, are the following:

With this much understood, the observed behaviour in a two-tier application using a DBGrid can also be understood. Generally in this case, the database connection is persistent and so it is possible for the application to fetch records only as needed, but only in a forward direction. Thus the initial display is still fast and, as long as the user only casually scrolls forward a page at a time, performance is still very reasonable. But as soon as the user drags the scrollbar to the bottom, the app will appear to lock up for a time before finally refreshing the display.  It must do this because it has no way to go directly to the end of the table (or to any arbitrary point) without reading all earlier records first. So if there are 100,000 records, then every one of them must be read from the server and saved in memory in order to display the last 10. Once it has done this then performance will be ok because the records will now be cached in local memory.

Things get worse with a multitier architecture. Once again, there are many advantages to multitier design, but it changes the way an application must be designed. One of the main purposes of a multitier design is scalability, and one of the main ways of accomplishing this is via stateless data access. This means an application only connects to a database long enough to get the requested data, and then immediately disconnects. The next time it needs data it connects again. Because there is no persistent connection, there is no way for the application to hold a cursor open within a table in order to fetch 10 records now and 10 more sometime in the future. If an SQL statement executed by an application results in 100,000 records, the application must fetch all 100,000 immediately so that it can close the connection.

Thus, if you are using dbExpress database components along with Providers and ClientDatasets, you will wait possibly several minutes while an entire table is fetched from the database, packaged in memory by the DatasetProvider, and then transferred to the ClientDataset. Once complete, performance will then be fine since everything is now cached by the ClientDataset.

This is not a flaw in the components mentioned here, it is the nature of both the database servers and of a multitier design. The dbExpress components reflect the former (they are unidirectional like the database servers they connect to) and ClientDatasets reflect the latter (by default they fetch all records so the database connection can be closed). It will be found that all new database technology will reflect these factors. This includes data access in both Java and .Net applications.

What to do about it 

Invariably, the claim is made "But the users  need this behaviour". The short answer is simply: No, they do not need it. That is simply what they (and many programmers) are used to. Like most people everywhere, users of such applications often do not appreciate change and will resist it. For the most part they cannot imagine any other way, or at least any better  way, to operate. That's not unreasonable, depending on the type of application and the users of it, it is not their job to imagine such things, it is their job to get their work done using the tools they have.

It is our job as designers and programmers to find a way to give users the actual functionality they need while accomodating the realities of the architecture we are designing for. Neither can be ignored. Insisting that a new application must work the same way the users are used to solves nothing - it improves nothing for them while ignoring the realities discussed above.

While users may be accustomed to having a grid that lets them browse thousands of records, there is no logical basis whatsoever for claiming this is needed. No user can work on thousands of records at once or even, in most cases, over the span of an entire day. It serves no purpose, other than possibly to waste time, for a user to repeatedly press the page down key looking at record after record until coming to one that needs attention. 

Most such applications, while initially displaying all records, also have a control that lets the user perform an incremental search or filter. Any user interested in being productive is going to use such controls to more quickly get to the records that actually need attention. That right there is confirmation that users do not need to see all the many records they are  not interested in.

Keying on this last, an appropriate design change really just reverses the default - the display starts out with  no records. Rather than an incremental search (very hard on performance as a query would need to be executed on every keystroke), the user enters search criteria just as before, but then hits a search button. The user quickly discovers the more they enter into the search control, the faster and more accurate the results, they will have little or no need for scrolling and will have the record or records of interest immediately in front of them.

But that's not all!

One of the major limitations of most desktop applications is that result sets must be ordered by picking a specific index to be used, and thus incremental searching/filtering is limited to one index order at a time. If the user decides to find a customer by last name instead of customer number, your application must present some way to select the appropriate index.

With SQL databases, this limitation does not exist. Onse can search by any number of different attributes at the same time, and order the results separately from that. Therefore, rather than only presenting one control for entering search criteria, the user can be presented with a number of different controls. For example, if the first part of the last name is known along with the city or area code, any or all of that can be entered and the results are returned immediately. This makes the application much more productive for the user.

I have rewritten a number of legacy applications and in every case I have introduced this new method of operating, and in every case, with perhaps one or two "stick-in-the-mud" users that refused to accept any change at all in each case, the new application was quickly accepted by users and always with many compliments on how much  better they were able to use it.