On the Emergence of Patterns for Spreadsheets Data Arrangements

Teixeira, Ricardo; Amaral, Vasco

doi:10.1007/978-3-319-50230-4_25

Ricardo Teixeira¹⁶ &
Vasco Amaral¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNPSE,volume 9946))

Included in the following conference series:

Federation of International Conferences on Software Technologies: Applications and Foundations

899 Accesses
4 Citations

Abstract

Spreadsheets are widely used both by individuals as well as large companies in a vast plethora of application domains. One of the reasons for this popularity is the general purpose flexibility spreadsheets offer to the end user. This flexibility favors the existence of multiple spreadsheet designs regarding the physical organization of the data presented by a spreadsheet. Nevertheless, to the best of our knowledge, little is still known about patterns of spreadsheet data arrangements. Works refer the emergence of commonalities and templates but it is hard to find a systematic study on the topic that presents us catalogues. It is known that spreadsheets are extremely error-prone. Therefore, to know the typical data arrangement patterns can be very useful insight on how to build mechanisms and strategies in order to prevent errors regarding spreadsheets specification and maintenance. The present work aims at present data arrangement patterns that emerged from our studies and direct observation of real-world spreadsheet samples from two large datasets, and, additionally, a formal representation of the patterns identified through the use of conceptual models.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Spreadsheet Engineering

Adding Spreadsheets to the MDE Toolkit

Spreadsheet quality assurance: a literature review

Article Open access 22 January 2024

Keywords

1 Introduction

Being the first “programmer in a box” to come along for technology users, spreadsheets are widely used both by individuals to cope with simple needs like tracking personal finances, training plans, to-do lists, supplier databases, or any purpose that requires input of data and/or performing calculations; as well as large companies as integrators of complex systems and as support for informing business decisions especially in areas like marketing, business development, sales, and finance. As result of this general purpose flexibility, a plenty of spreadsheet layout designs are possible towards the physical organization of the data composing a spreadsheet.

Works proposing spreadsheet models [1, 2] already systematize common templates of table structures. Other works created a library containing common spreadsheet patterns [3] for later use of pattern matching algorithms in order to extract models from them. Other works implemented a header inference system for spreadsheets [4], describing the relation between the headers and their association with data.

However, these patterns are quite far from covering all existing kinds of spreadsheet’s data arrangements and do not take in consideration the domains those patterns are generally applied.

Knowing more about the typical data arrangement patterns, in other words, what people usually want to model in a spreadsheet and what they usually expect to see in a spreadsheet, can be very useful insight in how to build mechanisms and strategies to specify and maintain less erroneous spreadsheets.

This work intends to take a step on extending the current perception of the emerged spreadsheet patterns regarding the data arrangements. For this purpose, two large repositories of spreadsheets used in spreadsheet studies were directly observed and analyzed, namely:

The EUSES corpus [5] – published in 2005 and made available only to researchers, it is a dataset of over 4,500 spreadsheets gathered from the public world-wide-web;
Enron corpus [6, 7] – a recent large dataset containing around 15,000 industrial spreadsheets extracted from the Enron Corporation e-mail archive made public during the legal investigation concerning the company after it went bankrupt.

The analysis method consisted of manually selecting random spreadsheet samples from the datasets, until the patterns observed were becoming redundant. Due to the low diversity verified, only 80 spreadsheets representative of all of the spreadsheets existing in the datasets were selected and reunited. With them, a formal systemization of data arrangement patterns was made using the UML conceptual model, namely, class diagrams, which is one of the most proliferated conceptual models, having a high level of understanding.

The rest of the paper is organized as follows: in Sect. 2 we present the identified patterns, cataloging them and presenting related insights. Then, in Sect. 3 we present a metamodel of a spreadsheet concerning its data arrangement, and in Sect. 4 we conclude the paper.

2 Patterns

2.1 Table Structures

When thinking about spreadsheets we immediately conceive tabular forms constituted by a set of labels – usually called “headers” – associated with a set of values. Based on the spreadsheets observed, we can catalogue the common tables structures into three distinct groups which are defined by the table growth orientation and their purpose.

Vertical Tables.

The most linear table structure consists of a simple grown-vertically table, where there is a header in the first row; this structure is commonly associated with inventory, database (Fig. 1), or statistical data (Fig. 2). A header can represent a formula referring other row’s entry values.

Also, sometimes there is an additional bottom row that applies an aggregation function to some specific column, as we can see in Fig. 2.

Horizontal Single Entry Tables.

A second table structure is a table whose headers are disposed vertically, and in which there is only one entry. Typically, the purpose of this kind of tables is to display summary data, and usually an aggregation function is applied on the solo entry values.

In Fig. 3, a SUM function is used to calculate the “TOTAL INCOME” from the above entry values.

Relationship Tables.

A third group of table structures are the relationship tables, consisting of tables that grow horizontally, with a highlighted header – the top one. The top header values are themselves headers, that is, without that header’s entry value, the other header entry values are meaningless. Sometimes the top header label is omitted, being only displayed its values. Aggregation functions are also commonly used on this tables, both vertically (see row “8” in Fig. 4) and horizontally (see column “F” in Fig. 5 of Sect. 2.2).

This table structure pattern dominates spreadsheets used for financial modeling and analysis, with the top header usually representing calendar years (Fig. 4), year quarters, months, etc.

2.2 Header Composition

In horizontal tables, it is usual to see headers composed by other headers. The main headers – the ones who are composed – typically represent categories, and the coupled ones are headers belonging to the category of the main header where they are attached.

Commonly, a main header’s entry value consists of an aggregation function – usually SUM – applied to the coupled headers’ entry values.

In Fig. 5, we can see a relationship table composed by six main headers: “Expected number of purses sold:”, “COSTS”, “Total Costs”, “REVENUE ($60/purse)”, “Total Revenue” and “TOTAL PROFIT”, with the last four ones consisting of formulas. The main header “COSTS” is composed by other six headers, with three of them – namely: “Cigar Boxes”, “Recourses” and “Technology” – having attached headers of their own. It is also possible to verify that “COST” has no table entry values associated, functioning as a pure categorization label, meanwhile the lower level main headers, such as “Cigar Boxes”, have entry values consisting of a SUM aggregation function applied to the headers’ values they have attached.

2.3 Header Hierarchy

Similar to the composed headers, there are the hierarchically organized headers. Although in the header composition is express some sort of hierarchy, there are actually some major differences between the two header arrangements: in this type of header arrangement, the hierarchy is explicit, that is, the headers are not physically on the same level; also, unlike composed headers, in this arrangement the top headers (the ones who have at least one header below in the hierarchy) do not have any values in the table associated to them; lastly, a header hierarchy appears in both vertical and horizontal table structures, although it is very uncommon to see it in a horizontal one.

In Fig. 6 it is possible to see a vertical table with two header hierarchies (“Dimensions” and “Location”) which have a mere organizational purpose, with the intend to offer a clearer and focused table understating. However, header hierarchies can be use with a comparison purpose in mind. As we can see in Fig. 7, there is a hierarchy for each header naming a year quarter (“1st Quarter”, “2nd Quarter”, “3rd Quarter” and “4th Quarter”) with all of them sharing the same semantic yet physically different sub-headers. Using this kind of arrangement obviates the need for multiple tables, whose physical separation makes it difficult to compare the analogous data from the distinct tables; or obviates the need for unique header labels – for instance, using “1st Quarter 2002”, “2nd Quarter 2002”, etc., that also complicates the data analysis.

2.4 Table Replication

In a spreadsheet, it is often observed the replication of table structures, only differing semantically in a certain aspect. In Fig. 8 we can see two structure replicas of a total of five replicas of a relationship table, only differing in the year in which the table data concerns. In this case, the replicas are distributed by different worksheets, however, the replication can also occur on a single worksheet as shown in the example in Fig. 9, where to calculate the “INCOME” and the “EXPENSES” the same table structure can be used.

The choice between the two replication options seem to depend on the table dimensions: larger table structures will naturally fit better in a spreadsheet on distinct worksheets (Fig. 8), while smaller ones can perfectly fit on the same worksheet (Fig. 9); and on the table purpose: if the spreadsheet analysis mainly relies on the comparison of the output data from the distinct replicas, it is convenient that the replicas stay physically close, which is the case of the example in Fig. 9 – besides the fact that the structures are quite small, the obvious object of analysis of the worksheet is the comparison between the “TOTAL INCOME” and the “TOTAL EXPENSES.

3 A Metamodel for Spreadsheet Arrangement

The patterns identified in Sect. 2 can be formally systemized using and extending the UML conceptual model, specifically the UML class diagram metamodel. In Fig. 10, we present the metamodel in which spreadsheet elements – represented as entities – such as worksheets, tables, headers, etc., are an extension of the entity Class, and inherit some of its relations with other entities, namely, Association (with Aggregation and Composition specializations), Property and Usage.

The spreadsheet entities may have their own constants, for instance, the entity Worksheet have an integer constant named “order”. That constant indicates in which order the worksheet appears in the workbook, and so does the entity Table, but to indicate its placement in the worksheet relative to other tables. Additionally, Table has another constant named “Table Type” that specifies if the table grows vertically, horizontally, or if it is a relationship table.

Entities such as Table and Header can have Properties, which in the context of a class diagram are the commonly named Attributes. Those attributes specify child-headers, which can be further expanded to other headers, or be “leaf” headers.

With Association and its two extensions we can specify to which the spreadsheets entities connect and how this connection is done in terms of data arrangement. For instance, in Fig. 11 we can see a model (according to the metamodel) of the spreadsheet table shown in Fig. 6 of Sect. 2.3, where the header hierarchies are expressed through two aggregations. If there were no hierarchies, that is, all the headers placed on the same row, a composition would be used instead.

Using the entity Usage it is possible to specify usage dependencies among instances of the spreadsheet entities. For instance, as we see in Fig. 12 – a partial model of the table presented in Fig. 5 of Sect. 2.2 – there is an entity Formula to specify a formula associated to the attribute of the same name of the class to which this entity Formula is associated by a composition. This entity has a string constant to express the formula text with the header reference between brackets. Moreover, there is expressed a dependency between the Formula entity and the corresponding header that is referenced, using Usage.

Furthermore, for a particular group of formulas, more specifically, the aggregation functions, there is a proper entity associated to the header of which attributes are input for the aggregation function specified in the entity CellsAggregation (see Fig. 13).

4 Conclusions

This paper presented a brief catalog of spreadsheet patterns regarding data arrangements layouts observed from two real-world spreadsheets datasets, extending and confirming the actual perceptions of the patterns in spreadsheets designs. Nevertheless, there is a major limitation on the approach taken, since neither of the datasets were fully covered, so it is possible that other existing patterns were not observed and, therefore, not registered. Moreover, this paper also presents a formalization of the identified patterns as a UML metamodel. This is an essential to design tools to build on top of the UML realm. In fact, the models we presented of the spreadsheets were created using a tool we implemented based on the metamodel. Conformance and other model-driven features are thus free to get.

References

Engels, G., Erwig, M.: ClassSheets: automatic generation of spreadsheet applications from object-oriented specifications. In: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ASE 2005, pp. 124–133. ACM, New York (2005)
Google Scholar
Cunha, J., Fernandes, J.P., Mendes, J., Saraiva, J.: Embedding and evolution of spreadsheet models in spreadsheet systems. In: Proceedings of the 2011 IEEE Symposium on Visual Languages and Human-Centric Computing, Pittsburgh, pp. 179–186 (2011)
Google Scholar
Hermans, F., Pinzger, M., Deursen, A.: Automatically extracting class diagrams from spreadsheets. In: D’Hondt, T. (ed.) ECOOP 2010. LNCS, vol. 6183, pp. 52–75. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14107-2_4
Chapter Google Scholar
Abraham, R., Erwig, M.: Header and unit inference for spreadsheets through spatial analyses. In: Proceedings of the 2004 IEEE Symposium on Visual Languages and Human-Centric Computing, Rome, pp. 165–172 (2004)
Google Scholar
Fisher, M., Rothermel, G.: The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. In: 1st Workshop on End-User Software Engineering, pp. 47–51. ACM, New York (2005)
Google Scholar
Hermans, F., Murphy-Hill, E.: Enron’s spreadsheets and related emails: a dataset and analysis. In: 37th International Conference on Software Engineering, ICSE 2015, Florence, pp. 7–16 (2015)
Google Scholar
Jansen, B.: Enron versus EUSES: a comparison of two spreadsheet Corpora. In: Second Workshop on Software Engineering Methods in Spreadsheets, SEMS 2015, Florence, pp. 41–47 (2015)
Google Scholar

Download references

Acknowledgements

This work has been partially supported by NOVA LINCS through the FCT project with reference UID/CEC/04516/2013.

Author information

Authors and Affiliations

NOVA LINCS, DI, FCT, Universidade Nova de Lisboa, Lisbon, Portugal
Ricardo Teixeira & Vasco Amaral

Authors

Ricardo Teixeira
View author publications
You can also search for this author in PubMed Google Scholar
Vasco Amaral
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ricardo Teixeira .

Editor information

Editors and Affiliations

Dipartimento di Informatica, Universita di Pisa, Pisa, Italy
Paolo Milazzo
Budapest University of Technology and Economics, Budapest, Hungary
Dániel Varró
Vienna University of Technology, Vienna, Austria
Manuel Wimmer

Attachments

1.1 Attachment 1. EUSES’ Spreadsheet Files

Database
01_20_04.xls
consultants.xls
Database_excel95.xls
datadict.xls
dist_ed_courses_Jan2000.xls
document_de_reference#A828A.xls
EbscohostByDb2002-03.xls
epcdata2002.xls
FeatureList.xls
flip_usd5.XLS
FS_Upgrade_Plan_v3_111502.xls
FS_Upgrade_Proj_Mgmt_#A829F.xls
haymth.xls
haymth_old.xls
ps-cs-msc-new.xls
topconschedtemplate.xls
Financial
02rise.xls
costfactors.xls
departmental_sales_e.xls
FinancialReport.xls
hist4q_e.xls
hist_e.xls
PersonalFinanceScope.xls
Prq403.xls
Q3_Final.xls
Q4_02.XLS
quaterly.xls
tab004.xls
treasurers_report_aud#A7EA4.xls
UF_Genetics_Financial#A7E51.xls
USFAthleticFinancialSummary.xls
W_SBT_financial.xls
Grades
1A6EGrades.xls
262grades.xls
310Grades.xls
483_grades_web.xls
511Grades.xls
Inventory
am-template-inventory.xls
capitol_art_inventory.xls
ColdStorage.xls
inventor.xls
Inventory%20Schedule%202004.xls
Inventory-Emergency_C#A84CC.xls
InventoryList.xls
NMfgInventory04.xls
nonstandby_inventory_#A8712.xls
Overview.xls
Software_inventory_sheet.xls
temp_videos0304.xls
TuftsGHGInventory.xls
VRSinventory01.xls
VRSinventory03.xls

1.2 Attachment 2. EURON’s Spreadsheet Files

andrea_ring__4__BRLH Storage.xlsx
andrew_lewis__84__Notification Rpt 1200.xlsx
andy_zipper__109__Cost Allocation 02-21-01.xlsx
andy_zipper__112__mODEL 3 7 01 Base.xlsx
andy_zipper__115__DYNEGY-ICE VOL Jun1.xlsx
andy_zipper__266__Broker detail 5-29-01.xlsx
andy_zipper__290__AGA.xlsx
andy_zipper__342__COF Curves for Andy Zipper.xlsx
barry_tycholiz__870__EPNG BP Tariff Sheet.xlsx
benjamin_rogers__1003__NEPOOL-ZoneG Dailies.xlsx
benjamin_rogers__1024__TLR Analysis.xlsx
benjamin_rogers__1052__FPLE model.xlsx
benjamin_rogers__1058__newco development cash flow.xlsx
benjamin_rogers__1108__Wheatland O&M.xlsx
benjamin_rogers__1231__Comparison2.xlsx
benjamin_rogers__911__PJM Eastern Hub Pricing.xlsx
benjamin_rogers__936__PJM Model.xlsx
bill_williams_iii__1373__EOL 5-11.xlsx
bill_williams_iii__1395__EES September Daily.xlsx
chris_germany__2124__DecCohCHOICE-ENA.xlsx
chris_stokley__3947__NP15 DJ Charts.xlsx
darrell_schoolcraft__7827__imbalsumm0110.xlsx
larry_may__21636__ed052501.xlsx
louise_kitchen__22676__BGM 1024 ngpl.xlsx
phillip_m_love__30520__Paulacustomerlist.xlsx
stacey_white__39052__Summary Oct 15.xls
steven_p_south__39352__04-23-01 Earnings 2 of 2.xlsx
vladi_pimenov__41075__VLADI-GASDAILY-CURVEFETCH.xlsx

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Teixeira, R., Amaral, V. (2016). On the Emergence of Patterns for Spreadsheets Data Arrangements. In: Milazzo, P., Varró, D., Wimmer, M. (eds) Software Technologies: Applications and Foundations. STAF 2016. Lecture Notes in Computer Science(), vol 9946. Springer, Cham. https://doi.org/10.1007/978-3-319-50230-4_25

Download citation

DOI: https://doi.org/10.1007/978-3-319-50230-4_25
Published: 01 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50229-8
Online ISBN: 978-3-319-50230-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

On the Emergence of Patterns for Spreadsheets Data Arrangements

Abstract