Keywords

1 Introduction

Being the first “programmer in a box” to come along for technology users, spreadsheets are widely used both by individuals to cope with simple needs like tracking personal finances, training plans, to-do lists, supplier databases, or any purpose that requires input of data and/or performing calculations; as well as large companies as integrators of complex systems and as support for informing business decisions especially in areas like marketing, business development, sales, and finance. As result of this general purpose flexibility, a plenty of spreadsheet layout designs are possible towards the physical organization of the data composing a spreadsheet.

Works proposing spreadsheet models [1, 2] already systematize common templates of table structures. Other works created a library containing common spreadsheet patterns [3] for later use of pattern matching algorithms in order to extract models from them. Other works implemented a header inference system for spreadsheets [4], describing the relation between the headers and their association with data.

However, these patterns are quite far from covering all existing kinds of spreadsheet’s data arrangements and do not take in consideration the domains those patterns are generally applied.

Knowing more about the typical data arrangement patterns, in other words, what people usually want to model in a spreadsheet and what they usually expect to see in a spreadsheet, can be very useful insight in how to build mechanisms and strategies to specify and maintain less erroneous spreadsheets.

This work intends to take a step on extending the current perception of the emerged spreadsheet patterns regarding the data arrangements. For this purpose, two large repositories of spreadsheets used in spreadsheet studies were directly observed and analyzed, namely:

  • The EUSES corpus [5] – published in 2005 and made available only to researchers, it is a dataset of over 4,500 spreadsheets gathered from the public world-wide-web;

  • Enron corpus [6, 7] – a recent large dataset containing around 15,000 industrial spreadsheets extracted from the Enron Corporation e-mail archive made public during the legal investigation concerning the company after it went bankrupt.

The analysis method consisted of manually selecting random spreadsheet samples from the datasets, until the patterns observed were becoming redundant. Due to the low diversity verified, only 80 spreadsheets representative of all of the spreadsheets existing in the datasets were selected and reunited. With them, a formal systemization of data arrangement patterns was made using the UML conceptual model, namely, class diagrams, which is one of the most proliferated conceptual models, having a high level of understanding.

The rest of the paper is organized as follows: in Sect. 2 we present the identified patterns, cataloging them and presenting related insights. Then, in Sect. 3 we present a metamodel of a spreadsheet concerning its data arrangement, and in Sect. 4 we conclude the paper.

2 Patterns

2.1 Table Structures

When thinking about spreadsheets we immediately conceive tabular forms constituted by a set of labels – usually called “headers” – associated with a set of values. Based on the spreadsheets observed, we can catalogue the common tables structures into three distinct groups which are defined by the table growth orientation and their purpose.

Vertical Tables.

The most linear table structure consists of a simple grown-vertically table, where there is a header in the first row; this structure is commonly associated with inventory, database (Fig. 1), or statistical data (Fig. 2). A header can represent a formula referring other row’s entry values.

Fig. 1.
figure 1

Vertical table used as a database

Fig. 2.
figure 2

Vertical table used to display statistical data

Also, sometimes there is an additional bottom row that applies an aggregation function to some specific column, as we can see in Fig. 2.

Horizontal Single Entry Tables.

A second table structure is a table whose headers are disposed vertically, and in which there is only one entry. Typically, the purpose of this kind of tables is to display summary data, and usually an aggregation function is applied on the solo entry values.

In Fig. 3, a SUM function is used to calculate the “TOTAL INCOME” from the above entry values.

Fig. 3.
figure 3

Horizontal single entry table example

Relationship Tables.

A third group of table structures are the relationship tables, consisting of tables that grow horizontally, with a highlighted header – the top one. The top header values are themselves headers, that is, without that header’s entry value, the other header entry values are meaningless. Sometimes the top header label is omitted, being only displayed its values. Aggregation functions are also commonly used on this tables, both vertically (see row “8” in Fig. 4) and horizontally (see column “F” in Fig. 5 of Sect. 2.2).

Fig. 4.
figure 4

Relationship table using calendar years

This table structure pattern dominates spreadsheets used for financial modeling and analysis, with the top header usually representing calendar years (Fig. 4), year quarters, months, etc.

2.2 Header Composition

In horizontal tables, it is usual to see headers composed by other headers. The main headers – the ones who are composed – typically represent categories, and the coupled ones are headers belonging to the category of the main header where they are attached.

Commonly, a main header’s entry value consists of an aggregation function – usually SUM – applied to the coupled headers’ entry values.

In Fig. 5, we can see a relationship table composed by six main headers: “Expected number of purses sold:”, “COSTS”, “Total Costs”, “REVENUE ($60/purse)”, “Total Revenue” and “TOTAL PROFIT”, with the last four ones consisting of formulas. The main header “COSTS” is composed by other six headers, with three of them – namely: “Cigar Boxes”, “Recourses” and “Technology” – having attached headers of their own. It is also possible to verify that “COST” has no table entry values associated, functioning as a pure categorization label, meanwhile the lower level main headers, such as “Cigar Boxes”, have entry values consisting of a SUM aggregation function applied to the headers’ values they have attached.

Fig. 5.
figure 5

Relationship table with coupling

2.3 Header Hierarchy

Similar to the composed headers, there are the hierarchically organized headers. Although in the header composition is express some sort of hierarchy, there are actually some major differences between the two header arrangements: in this type of header arrangement, the hierarchy is explicit, that is, the headers are not physically on the same level; also, unlike composed headers, in this arrangement the top headers (the ones who have at least one header below in the hierarchy) do not have any values in the table associated to them; lastly, a header hierarchy appears in both vertical and horizontal table structures, although it is very uncommon to see it in a horizontal one.

In Fig. 6 it is possible to see a vertical table with two header hierarchies (“Dimensions” and “Location”) which have a mere organizational purpose, with the intend to offer a clearer and focused table understating. However, header hierarchies can be use with a comparison purpose in mind. As we can see in Fig. 7, there is a hierarchy for each header naming a year quarter (“1st Quarter”, “2nd Quarter”, “3rd Quarter” and “4th Quarter”) with all of them sharing the same semantic yet physically different sub-headers. Using this kind of arrangement obviates the need for multiple tables, whose physical separation makes it difficult to compare the analogous data from the distinct tables; or obviates the need for unique header labels – for instance, using “1st Quarter 2002”, “2nd Quarter 2002”, etc., that also complicates the data analysis.

Fig. 6.
figure 6

Vertical table with a header hierarchy

Fig. 7.
figure 7

Relationship table with a header hierarchy

2.4 Table Replication

In a spreadsheet, it is often observed the replication of table structures, only differing semantically in a certain aspect. In Fig. 8 we can see two structure replicas of a total of five replicas of a relationship table, only differing in the year in which the table data concerns. In this case, the replicas are distributed by different worksheets, however, the replication can also occur on a single worksheet as shown in the example in Fig. 9, where to calculate the “INCOME” and the “EXPENSES” the same table structure can be used.

Fig. 8.
figure 8

Relationship table replicated in different worksheets

Fig. 9.
figure 9

Horizontal single entry replicated in the same worksheet

The choice between the two replication options seem to depend on the table dimensions: larger table structures will naturally fit better in a spreadsheet on distinct worksheets (Fig. 8), while smaller ones can perfectly fit on the same worksheet (Fig. 9); and on the table purpose: if the spreadsheet analysis mainly relies on the comparison of the output data from the distinct replicas, it is convenient that the replicas stay physically close, which is the case of the example in Fig. 9 – besides the fact that the structures are quite small, the obvious object of analysis of the worksheet is the comparison between the “TOTAL INCOME” and the “TOTAL EXPENSES.

3 A Metamodel for Spreadsheet Arrangement

The patterns identified in Sect. 2 can be formally systemized using and extending the UML conceptual model, specifically the UML class diagram metamodel. In Fig. 10, we present the metamodel in which spreadsheet elements – represented as entities – such as worksheets, tables, headers, etc., are an extension of the entity Class, and inherit some of its relations with other entities, namely, Association (with Aggregation and Composition specializations), Property and Usage.

Fig. 10.
figure 10

Spreadsheet metamodel according to the detected patterns identified

The spreadsheet entities may have their own constants, for instance, the entity Worksheet have an integer constant named “order”. That constant indicates in which order the worksheet appears in the workbook, and so does the entity Table, but to indicate its placement in the worksheet relative to other tables. Additionally, Table has another constant named “Table Type” that specifies if the table grows vertically, horizontally, or if it is a relationship table.

Entities such as Table and Header can have Properties, which in the context of a class diagram are the commonly named Attributes. Those attributes specify child-headers, which can be further expanded to other headers, or be “leaf” headers.

With Association and its two extensions we can specify to which the spreadsheets entities connect and how this connection is done in terms of data arrangement. For instance, in Fig. 11 we can see a model (according to the metamodel) of the spreadsheet table shown in Fig. 6 of Sect. 2.3, where the header hierarchies are expressed through two aggregations. If there were no hierarchies, that is, all the headers placed on the same row, a composition would be used instead.

Fig. 11.
figure 11

Model representation of the table presented in Fig. 6 of Sect. 2.3

Using the entity Usage it is possible to specify usage dependencies among instances of the spreadsheet entities. For instance, as we see in Fig. 12 – a partial model of the table presented in Fig. 5 of Sect. 2.2 – there is an entity Formula to specify a formula associated to the attribute of the same name of the class to which this entity Formula is associated by a composition. This entity has a string constant to express the formula text with the header reference between brackets. Moreover, there is expressed a dependency between the Formula entity and the corresponding header that is referenced, using Usage.

Fig. 12.
figure 12

Partial model representation of the table presented in Fig. 5 of Sect. 2.2

Furthermore, for a particular group of formulas, more specifically, the aggregation functions, there is a proper entity associated to the header of which attributes are input for the aggregation function specified in the entity CellsAggregation (see Fig. 13).

Fig. 13.
figure 13

Model representation of the table presented in Fig. 4 of Sect. 2.1

4 Conclusions

This paper presented a brief catalog of spreadsheet patterns regarding data arrangements layouts observed from two real-world spreadsheets datasets, extending and confirming the actual perceptions of the patterns in spreadsheets designs. Nevertheless, there is a major limitation on the approach taken, since neither of the datasets were fully covered, so it is possible that other existing patterns were not observed and, therefore, not registered. Moreover, this paper also presents a formalization of the identified patterns as a UML metamodel. This is an essential to design tools to build on top of the UML realm. In fact, the models we presented of the spreadsheets were created using a tool we implemented based on the metamodel. Conformance and other model-driven features are thus free to get.