Abstract
Spreadsheets are widely used both by individuals as well as large companies in a vast plethora of application domains. One of the reasons for this popularity is the general purpose flexibility spreadsheets offer to the end user. This flexibility favors the existence of multiple spreadsheet designs regarding the physical organization of the data presented by a spreadsheet. Nevertheless, to the best of our knowledge, little is still known about patterns of spreadsheet data arrangements. Works refer the emergence of commonalities and templates but it is hard to find a systematic study on the topic that presents us catalogues. It is known that spreadsheets are extremely error-prone. Therefore, to know the typical data arrangement patterns can be very useful insight on how to build mechanisms and strategies in order to prevent errors regarding spreadsheets specification and maintenance. The present work aims at present data arrangement patterns that emerged from our studies and direct observation of real-world spreadsheet samples from two large datasets, and, additionally, a formal representation of the patterns identified through the use of conceptual models.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Being the first “programmer in a box” to come along for technology users, spreadsheets are widely used both by individuals to cope with simple needs like tracking personal finances, training plans, to-do lists, supplier databases, or any purpose that requires input of data and/or performing calculations; as well as large companies as integrators of complex systems and as support for informing business decisions especially in areas like marketing, business development, sales, and finance. As result of this general purpose flexibility, a plenty of spreadsheet layout designs are possible towards the physical organization of the data composing a spreadsheet.
Works proposing spreadsheet models [1, 2] already systematize common templates of table structures. Other works created a library containing common spreadsheet patterns [3] for later use of pattern matching algorithms in order to extract models from them. Other works implemented a header inference system for spreadsheets [4], describing the relation between the headers and their association with data.
However, these patterns are quite far from covering all existing kinds of spreadsheet’s data arrangements and do not take in consideration the domains those patterns are generally applied.
Knowing more about the typical data arrangement patterns, in other words, what people usually want to model in a spreadsheet and what they usually expect to see in a spreadsheet, can be very useful insight in how to build mechanisms and strategies to specify and maintain less erroneous spreadsheets.
This work intends to take a step on extending the current perception of the emerged spreadsheet patterns regarding the data arrangements. For this purpose, two large repositories of spreadsheets used in spreadsheet studies were directly observed and analyzed, namely:
-
The EUSES corpus [5] – published in 2005 and made available only to researchers, it is a dataset of over 4,500 spreadsheets gathered from the public world-wide-web;
-
Enron corpus [6, 7] – a recent large dataset containing around 15,000 industrial spreadsheets extracted from the Enron Corporation e-mail archive made public during the legal investigation concerning the company after it went bankrupt.
The analysis method consisted of manually selecting random spreadsheet samples from the datasets, until the patterns observed were becoming redundant. Due to the low diversity verified, only 80 spreadsheets representative of all of the spreadsheets existing in the datasets were selected and reunited. With them, a formal systemization of data arrangement patterns was made using the UML conceptual model, namely, class diagrams, which is one of the most proliferated conceptual models, having a high level of understanding.
The rest of the paper is organized as follows: in Sect. 2 we present the identified patterns, cataloging them and presenting related insights. Then, in Sect. 3 we present a metamodel of a spreadsheet concerning its data arrangement, and in Sect. 4 we conclude the paper.
2 Patterns
2.1 Table Structures
When thinking about spreadsheets we immediately conceive tabular forms constituted by a set of labels – usually called “headers” – associated with a set of values. Based on the spreadsheets observed, we can catalogue the common tables structures into three distinct groups which are defined by the table growth orientation and their purpose.
Vertical Tables.
The most linear table structure consists of a simple grown-vertically table, where there is a header in the first row; this structure is commonly associated with inventory, database (Fig. 1), or statistical data (Fig. 2). A header can represent a formula referring other row’s entry values.
Also, sometimes there is an additional bottom row that applies an aggregation function to some specific column, as we can see in Fig. 2.
Horizontal Single Entry Tables.
A second table structure is a table whose headers are disposed vertically, and in which there is only one entry. Typically, the purpose of this kind of tables is to display summary data, and usually an aggregation function is applied on the solo entry values.
In Fig. 3, a SUM function is used to calculate the “TOTAL INCOME” from the above entry values.
Relationship Tables.
A third group of table structures are the relationship tables, consisting of tables that grow horizontally, with a highlighted header – the top one. The top header values are themselves headers, that is, without that header’s entry value, the other header entry values are meaningless. Sometimes the top header label is omitted, being only displayed its values. Aggregation functions are also commonly used on this tables, both vertically (see row “8” in Fig. 4) and horizontally (see column “F” in Fig. 5 of Sect. 2.2).
This table structure pattern dominates spreadsheets used for financial modeling and analysis, with the top header usually representing calendar years (Fig. 4), year quarters, months, etc.
2.2 Header Composition
In horizontal tables, it is usual to see headers composed by other headers. The main headers – the ones who are composed – typically represent categories, and the coupled ones are headers belonging to the category of the main header where they are attached.
Commonly, a main header’s entry value consists of an aggregation function – usually SUM – applied to the coupled headers’ entry values.
In Fig. 5, we can see a relationship table composed by six main headers: “Expected number of purses sold:”, “COSTS”, “Total Costs”, “REVENUE ($60/purse)”, “Total Revenue” and “TOTAL PROFIT”, with the last four ones consisting of formulas. The main header “COSTS” is composed by other six headers, with three of them – namely: “Cigar Boxes”, “Recourses” and “Technology” – having attached headers of their own. It is also possible to verify that “COST” has no table entry values associated, functioning as a pure categorization label, meanwhile the lower level main headers, such as “Cigar Boxes”, have entry values consisting of a SUM aggregation function applied to the headers’ values they have attached.
2.3 Header Hierarchy
Similar to the composed headers, there are the hierarchically organized headers. Although in the header composition is express some sort of hierarchy, there are actually some major differences between the two header arrangements: in this type of header arrangement, the hierarchy is explicit, that is, the headers are not physically on the same level; also, unlike composed headers, in this arrangement the top headers (the ones who have at least one header below in the hierarchy) do not have any values in the table associated to them; lastly, a header hierarchy appears in both vertical and horizontal table structures, although it is very uncommon to see it in a horizontal one.
In Fig. 6 it is possible to see a vertical table with two header hierarchies (“Dimensions” and “Location”) which have a mere organizational purpose, with the intend to offer a clearer and focused table understating. However, header hierarchies can be use with a comparison purpose in mind. As we can see in Fig. 7, there is a hierarchy for each header naming a year quarter (“1st Quarter”, “2nd Quarter”, “3rd Quarter” and “4th Quarter”) with all of them sharing the same semantic yet physically different sub-headers. Using this kind of arrangement obviates the need for multiple tables, whose physical separation makes it difficult to compare the analogous data from the distinct tables; or obviates the need for unique header labels – for instance, using “1st Quarter 2002”, “2nd Quarter 2002”, etc., that also complicates the data analysis.
2.4 Table Replication
In a spreadsheet, it is often observed the replication of table structures, only differing semantically in a certain aspect. In Fig. 8 we can see two structure replicas of a total of five replicas of a relationship table, only differing in the year in which the table data concerns. In this case, the replicas are distributed by different worksheets, however, the replication can also occur on a single worksheet as shown in the example in Fig. 9, where to calculate the “INCOME” and the “EXPENSES” the same table structure can be used.
The choice between the two replication options seem to depend on the table dimensions: larger table structures will naturally fit better in a spreadsheet on distinct worksheets (Fig. 8), while smaller ones can perfectly fit on the same worksheet (Fig. 9); and on the table purpose: if the spreadsheet analysis mainly relies on the comparison of the output data from the distinct replicas, it is convenient that the replicas stay physically close, which is the case of the example in Fig. 9 – besides the fact that the structures are quite small, the obvious object of analysis of the worksheet is the comparison between the “TOTAL INCOME” and the “TOTAL EXPENSES.
3 A Metamodel for Spreadsheet Arrangement
The patterns identified in Sect. 2 can be formally systemized using and extending the UML conceptual model, specifically the UML class diagram metamodel. In Fig. 10, we present the metamodel in which spreadsheet elements – represented as entities – such as worksheets, tables, headers, etc., are an extension of the entity Class, and inherit some of its relations with other entities, namely, Association (with Aggregation and Composition specializations), Property and Usage.
The spreadsheet entities may have their own constants, for instance, the entity Worksheet have an integer constant named “order”. That constant indicates in which order the worksheet appears in the workbook, and so does the entity Table, but to indicate its placement in the worksheet relative to other tables. Additionally, Table has another constant named “Table Type” that specifies if the table grows vertically, horizontally, or if it is a relationship table.
Entities such as Table and Header can have Properties, which in the context of a class diagram are the commonly named Attributes. Those attributes specify child-headers, which can be further expanded to other headers, or be “leaf” headers.
With Association and its two extensions we can specify to which the spreadsheets entities connect and how this connection is done in terms of data arrangement. For instance, in Fig. 11 we can see a model (according to the metamodel) of the spreadsheet table shown in Fig. 6 of Sect. 2.3, where the header hierarchies are expressed through two aggregations. If there were no hierarchies, that is, all the headers placed on the same row, a composition would be used instead.
Using the entity Usage it is possible to specify usage dependencies among instances of the spreadsheet entities. For instance, as we see in Fig. 12 – a partial model of the table presented in Fig. 5 of Sect. 2.2 – there is an entity Formula to specify a formula associated to the attribute of the same name of the class to which this entity Formula is associated by a composition. This entity has a string constant to express the formula text with the header reference between brackets. Moreover, there is expressed a dependency between the Formula entity and the corresponding header that is referenced, using Usage.
Furthermore, for a particular group of formulas, more specifically, the aggregation functions, there is a proper entity associated to the header of which attributes are input for the aggregation function specified in the entity CellsAggregation (see Fig. 13).
4 Conclusions
This paper presented a brief catalog of spreadsheet patterns regarding data arrangements layouts observed from two real-world spreadsheets datasets, extending and confirming the actual perceptions of the patterns in spreadsheets designs. Nevertheless, there is a major limitation on the approach taken, since neither of the datasets were fully covered, so it is possible that other existing patterns were not observed and, therefore, not registered. Moreover, this paper also presents a formalization of the identified patterns as a UML metamodel. This is an essential to design tools to build on top of the UML realm. In fact, the models we presented of the spreadsheets were created using a tool we implemented based on the metamodel. Conformance and other model-driven features are thus free to get.
References
Engels, G., Erwig, M.: ClassSheets: automatic generation of spreadsheet applications from object-oriented specifications. In: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, ASE 2005, pp. 124–133. ACM, New York (2005)
Cunha, J., Fernandes, J.P., Mendes, J., Saraiva, J.: Embedding and evolution of spreadsheet models in spreadsheet systems. In: Proceedings of the 2011 IEEE Symposium on Visual Languages and Human-Centric Computing, Pittsburgh, pp. 179–186 (2011)
Hermans, F., Pinzger, M., Deursen, A.: Automatically extracting class diagrams from spreadsheets. In: D’Hondt, T. (ed.) ECOOP 2010. LNCS, vol. 6183, pp. 52–75. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14107-2_4
Abraham, R., Erwig, M.: Header and unit inference for spreadsheets through spatial analyses. In: Proceedings of the 2004 IEEE Symposium on Visual Languages and Human-Centric Computing, Rome, pp. 165–172 (2004)
Fisher, M., Rothermel, G.: The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. In: 1st Workshop on End-User Software Engineering, pp. 47–51. ACM, New York (2005)
Hermans, F., Murphy-Hill, E.: Enron’s spreadsheets and related emails: a dataset and analysis. In: 37th International Conference on Software Engineering, ICSE 2015, Florence, pp. 7–16 (2015)
Jansen, B.: Enron versus EUSES: a comparison of two spreadsheet Corpora. In: Second Workshop on Software Engineering Methods in Spreadsheets, SEMS 2015, Florence, pp. 41–47 (2015)
Acknowledgements
This work has been partially supported by NOVA LINCS through the FCT project with reference UID/CEC/04516/2013.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Attachments
Attachments
1.1 Attachment 1. EUSES’ Spreadsheet Files
-
Database
-
01_20_04.xls
-
consultants.xls
-
Database_excel95.xls
-
datadict.xls
-
dist_ed_courses_Jan2000.xls
-
document_de_reference#A828A.xls
-
EbscohostByDb2002-03.xls
-
epcdata2002.xls
-
FeatureList.xls
-
flip_usd5.XLS
-
FS_Upgrade_Plan_v3_111502.xls
-
FS_Upgrade_Proj_Mgmt_#A829F.xls
-
haymth.xls
-
haymth_old.xls
-
ps-cs-msc-new.xls
-
topconschedtemplate.xls
-
Financial
-
02rise.xls
-
costfactors.xls
-
departmental_sales_e.xls
-
FinancialReport.xls
-
hist4q_e.xls
-
hist_e.xls
-
PersonalFinanceScope.xls
-
Prq403.xls
-
Q3_Final.xls
-
Q4_02.XLS
-
quaterly.xls
-
tab004.xls
-
treasurers_report_aud#A7EA4.xls
-
UF_Genetics_Financial#A7E51.xls
-
USFAthleticFinancialSummary.xls
-
W_SBT_financial.xls
-
Grades
-
1A6EGrades.xls
-
262grades.xls
-
310Grades.xls
-
483_grades_web.xls
-
511Grades.xls
-
Inventory
-
am-template-inventory.xls
-
capitol_art_inventory.xls
-
ColdStorage.xls
-
inventor.xls
-
Inventory%20Schedule%202004.xls
-
Inventory-Emergency_C#A84CC.xls
-
InventoryList.xls
-
NMfgInventory04.xls
-
nonstandby_inventory_#A8712.xls
-
Overview.xls
-
Software_inventory_sheet.xls
-
temp_videos0304.xls
-
TuftsGHGInventory.xls
-
VRSinventory01.xls
-
VRSinventory03.xls
1.2 Attachment 2. EURON’s Spreadsheet Files
-
andrea_ring__4__BRLH Storage.xlsx
-
andrew_lewis__84__Notification Rpt 1200.xlsx
-
andy_zipper__109__Cost Allocation 02-21-01.xlsx
-
andy_zipper__112__mODEL 3 7 01 Base.xlsx
-
andy_zipper__115__DYNEGY-ICE VOL Jun1.xlsx
-
andy_zipper__266__Broker detail 5-29-01.xlsx
-
andy_zipper__290__AGA.xlsx
-
andy_zipper__342__COF Curves for Andy Zipper.xlsx
-
barry_tycholiz__870__EPNG BP Tariff Sheet.xlsx
-
benjamin_rogers__1003__NEPOOL-ZoneG Dailies.xlsx
-
benjamin_rogers__1024__TLR Analysis.xlsx
-
benjamin_rogers__1052__FPLE model.xlsx
-
benjamin_rogers__1058__newco development cash flow.xlsx
-
benjamin_rogers__1108__Wheatland O&M.xlsx
-
benjamin_rogers__1231__Comparison2.xlsx
-
benjamin_rogers__911__PJM Eastern Hub Pricing.xlsx
-
benjamin_rogers__936__PJM Model.xlsx
-
bill_williams_iii__1373__EOL 5-11.xlsx
-
bill_williams_iii__1395__EES September Daily.xlsx
-
chris_germany__2124__DecCohCHOICE-ENA.xlsx
-
chris_stokley__3947__NP15 DJ Charts.xlsx
-
darrell_schoolcraft__7827__imbalsumm0110.xlsx
-
larry_may__21636__ed052501.xlsx
-
louise_kitchen__22676__BGM 1024 ngpl.xlsx
-
phillip_m_love__30520__Paulacustomerlist.xlsx
-
stacey_white__39052__Summary Oct 15.xls
-
steven_p_south__39352__04-23-01 Earnings 2 of 2.xlsx
-
vladi_pimenov__41075__VLADI-GASDAILY-CURVEFETCH.xlsx
Rights and permissions
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Teixeira, R., Amaral, V. (2016). On the Emergence of Patterns for Spreadsheets Data Arrangements. In: Milazzo, P., Varró, D., Wimmer, M. (eds) Software Technologies: Applications and Foundations. STAF 2016. Lecture Notes in Computer Science(), vol 9946. Springer, Cham. https://doi.org/10.1007/978-3-319-50230-4_25
Download citation
DOI: https://doi.org/10.1007/978-3-319-50230-4_25
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50229-8
Online ISBN: 978-3-319-50230-4
eBook Packages: Computer ScienceComputer Science (R0)