Introduction

Reproducibility is a fundamental scientific principle of topical importance given the current challenges to the value of science and claims of false discoveries which are later not substantiated; however, this topic appears to be addressed rarely in the literature (Casadevall and Fang 2010). A recent large-scale study raised doubts about the reproducibility of experiments carried out in the behavioural sciences (Open Science Collaboration 2015), and serious concerns about the inferences being drawn from single-laboratory studies have been highlighted in a far-reaching publication by Kafkafi et al. (2017). Studies on reproducibility are scarce because they are often not considered novel enough to publish, although scientists agree this information is beneficial for the research community (Moonesinghe et al. 2007). While the issue of reproducibility of animal experiments where behaviour is also analysed (e.g. assessing drug efficacy) has been addressed recently (Crabbe et al. 1999; Wahlsten et al. 2006; Baldini et al. 2013; Tuyttens et al. 2014; Kafkafi et al. 2017), we are not aware of a similar initiative regarding tests within the field of animal cognition.

The terms “reproducibility” and “replicability” are used differently across disciplines, which can lead to confusion. In natural sciences (see Yang et al. 2008; Richter et al. 2009; Casadevall and Fang 2010) the terms are used sensu Casadevall and Fang (2010) where “…reproducibility refers to a phenomenon that can be predicted to recur even when experimental conditions may vary to some degree”, while “replicability describes the ability to obtain an identical result when an experiment is performed under precisely identical conditions”. Unfortunately, the same terms are used with opposite meaning in the social sciences (e.g. Asendorpf et al. 2013; Klein et al. 2014). Reproducibility closely resembles the concept of systematic replication as introduced by Sidman (1960, Chapter 4).

The factors potentially influencing the reproducibility of behavioural studies include: (1) proximal environment and (2) phenotype (van der Staay et al. 2010). The first may include any type of deviation from the original methodology or from the testing environment including the pre-training procedures (if any), or how variables are defined and used. Phenotype refers to differences between the populations being tested such as different strains/lines, age, sex, human–animal relationship or previous experience of the subjects.

The recently raised concern regarding reproducibility (Open Science Collaboration 2015) is especially relevant to canine research because dogs, compared to chimpanzees for example, are tested in significantly higher numbers and in a wider range of laboratories, often in superficially similar tests worldwide. They do not live in the laboratory and are a very heterogeneous sample in regard to their anatomical features (including surgical alteration), previous experiences, and genetic background (Miklósi and Topál 2013).

Results of previous studies provide evidence that both proximal environment (such as local differences and slight deviations between the applied protocols) and phenotype can influence dogs’ performance in cognitive tests. Whether dogs tested in different countries vary in regard to their performance in cognitive tests has not been extensively investigated yet, although cultural differences were shown to affect attachment behaviour in Austrian and Hungarian family dogs (Horn et al. 2013a). Fujita et al. (2012) have also reported differences in performance between Japanese and German dogs in an incidental memory test. However, local (hereafter also used to describe effects at a national level) effects are often incidental and not the primary focus of the study design. The dog–human relationship is also known to affect dogs’ performance in cognitive tests (Topál et al. 1997; Horn et al. 2013b), and local variation in dog management practices (e.g. tendency to use food treats in training) has the potential to influence reproducibility, too.

The effect of deviations between the applied protocols was demonstrated in a two-way choice task, where differences in methodology (utilisation of a clicker) influenced dogs’ behaviour (Pongrácz et al. 2013). Their performance was significantly better in the pointing with clicker condition than in the standard momentary distal pointing paradigm. While in both cases dogs could choose from two containers (correct choice indicated by the pointing), in the ‘pointing with clicker’ condition the indicated container was not baited. Instead, when a correct choice was made, the experimenter clicked and delivered the treat into the container.

The effect of dog phenotype was recently demonstrated by Fadel et al. (2016), where the authors reported differences in trait impulsivity between Border Collies and Labrador Retrievers, and between working and show lines within these breeds. The role of previous training experience and breed group regarding dogs’ performance in cognitive tests has also been described (Marshall-Pescini et al. 2016). Trained dogs were faster in solving a detour task, while working breeds were performing better in a manipulation task than retriever and herding breeds.

Sex differences in cognitive performance have also been found in some set-ups, with male dogs performing better when presented with a novel manipulation task (Duranton et al. 2015) and female dogs being more sensitive towards size constancy violation (Müller et al. 2011).

Aim and hypotheses

We tested the influence of three factors, breed and sex (phenotypic factors) and testing site (proximal factor), on the reproducibility of dog cognition results using a systematic approach. Using the same experimenter and equipment, we compared performance in four cognitive tasks and assessed differences in an owner-instructed obedience task, using three comparable dog groups (Border collies, Labrador retrievers and various other breeds based on local availability) of both sexes and neuter statuses (phenotypic features) at three different testing sites (Hungary, Austria and Britain; proximal environments). This study allowed us to calculate the potential magnitude of differences that are not due to experimenter or equipment differences, and their relevance.

If the proximal factor has a significant influence on reproducibility, then we expected general testing site differences irrespective of breed. If breed has a major influence, we expected different performance in the pure breeds (Border collie and Labrador retriever) regardless of testing sites, but no such difference between the mixed breed groups in the different countries because this group consisted of various dog breeds kept as companions in human families. If sex/neuter status influenced the cognitive performance, these differences were expected to be present at each testing site. An interaction between testing site and breed would suggest, for example, local genetic effects or differences in dog-keeping habits. Differences in the obedience test would also reflect differences in dog-keeping practices, as these basic obedience tasks measure one aspect of the dog–human relationship, that is, the owner’s ability to control the dog in a novel environment instead of dog cognition.

General methods

Subjects

See Table 1 for the number of subjects included in the analysis listed by testing sites and breed. We tested dogs on three testing sites in three different countries: Budapest (Hungary—HU) and Vienna (Austria—AT) are capitals, both with a population of about 1.7 million, while Lincoln (United Kingdom—UK) has an estimated population of 93,000 (2011 census figures applying to defined administrative area). We recruited family dogs over 1 year of age without specific advanced level training from the following three populations: Border collies, Labrador retrievers and any other purebred dogs. We decided to test two popular breeds, which were easily available at all three sites because single-breed groups are genetically more homogenous. The third group (in which several breeds were represented) was targeted to resemble more closely the variability of samples currently being used in family dog studies in different countries. The dogs were required to be motivated by food. The subjects were recruited locally via social media, flyers and radio advertisements. The testing took place on a single occasion, and while we tested every dog in all tests, some of them needed to be excluded from one or more tests during data analysis.

Table 1 Overview of the subjects included in the statistical analysis in each subtest

Sex of the subjects was balanced across breed groups and testing sites as best as possible. Due to differences in dog-keeping practices, the percentage of neutered animals (of both sexes) recruited was higher in the UK sample. We combined sex and reproductive status into a single variable with four categories.

Testing sites

The testing room in Lincoln was secluded, while the other two testing sites were located in laboratories where there were occasional movement and some minor noise in the corridor (Fig. 1). For logistical reasons, the tests were carried out in one country after another in the following order: Budapest–Vienna–Lincoln–Budapest.

Test procedures

We looked for tests in the literature that met the following conditions:

  1. The test could be conducted without extensive pre-training. This was necessary because dogs visited the testing site on a single occasion to avoid dropout.

  2. The test does not last longer than 15 min and is not too exhausting for the dogs.

  3. The test required minimal equipment as it had to be transported from site to site. Since all tests took place in the same room, the set-up for each test had to be easy/quick to build up and remove.

  4. The tests overlap or interfere with each other as little as possible. The tests should not rely on the same type of manipulative skills or have the same set-up (e.g. two-way choice tasks). The tasks were provided in a fixed order with short breaks in between to standardise and minimise any carryover effect.

  5. The tests cover different facets of dog cognition. We intended to maximise the scope of the gathered behavioural data within the project.

  6. The reported performance of the dogs in the original publication was moderate but above chance at the group level. This was necessary in order to avoid ceiling and floor effects.

Based on these conditions, we selected the following tests (described in detail later).

  • Pointing test: following the human pointing gesture in a communicative situation, to assess behavioural flexibility in a social situation (Brúder 2010)

  • Problem-solving test: solving a problem without demonstration and after witnessing human demonstration of the solution, to assess problem-solving abilities in a social learning context (Pongrácz et al. 2012)

  • Means-end test: pulling out the baited one from two slides (based on visual cues in a support problem task paradigm), to assess physical cognition (Range et al. 2011)

  • Memory test: choosing from four previously investigated bowls after a 10-min break, to assess memory capacity (Fujita et al. 2012)

  • Obedience test: testing the dog in a set of basic obedience tasks by the owner, to assess owner control. This test was not intended to measure cognitive abilities as there was no training involved, but was used to assess the level of control the owner possessed over the dog (Fukuzawa et al. 2005), to be able to detect possible differences in dog management practices between the populations

Right before testing, dogs participated in the pre-training phase of the means-end test, while the presentation phase of the memory test took place before the Obedience test.

The protocols were mainly reproductions of already published studies, sometimes with slight modifications of the original protocol. We decided to use only food rewards (Frolic® Dog Food) to avoid losing subjects that are not both food and toy motivated. Tests were carried out between August 2012 and June 2013.

Data collection and analysis

The tests were recorded via video cameras and coded with the coding programme Solomon©. Individual results for each dog in each task are given in the Electronic Supplementary Material. Choice proportions are reported as percentages ± standard deviation. The statistical analyses were carried out with SPSS 21 and JASP 0.8.0.1. A priori sensitivity analysis and effect sizes were calculated with G*Power 3.1.9.2.

We coded the same behavioural variables as those coded in the original studies and compared our findings to those, using the same statistical methods where applicable. We used generalised linear models with Poisson distribution and log link to investigate the effects of proximal and phenotype-related factors. Model building was carried out via backward model selection. The initial factors were the following: testing site, breed, country, sex, testing site × breed. To compare the probability of the null hypothesis (no difference between the samples) and the Bayesian probability of the alternative hypotheses, Bayesian ANOVA was carried out with the following fixed factors: testing site, breed, sex, testing site × breed. Any deviations from this procedure are described in the relevant test section.
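As a reading aid, the full starting model before backward selection can be written out as below. This is a sketch of the general form only: the coefficient symbols are illustrative, and the term for country is omitted on the assumption that it coincides with testing site in this design.

```latex
% Y_i: count outcome for dog i (e.g. number of correct choices)
% Assumed notation; coefficients are indexed by the factor level of dog i.
Y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0
  + \beta^{\mathrm{site}}_{s(i)}
  + \beta^{\mathrm{breed}}_{b(i)}
  + \beta^{\mathrm{sex}}_{x(i)}
  + \beta^{\mathrm{site} \times \mathrm{breed}}_{s(i),\, b(i)}
```

Backward selection then removes terms from this model one at a time, starting with the interaction, as long as their removal is not penalised by the chosen criterion.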

Inter-observer coding

Four trained coders coded 20% of the videos and Cohen’s kappas (linearly weighted in the case of obedience scoring and unweighted for the rest of the variables) were calculated. This yielded excellent agreement (κ ≥ 0.75) between observers in all measured variables (exact values for the individual variables can be found in the “Appendix”).
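The two kappa variants mentioned above can be sketched as follows. This is a minimal illustration, not the authors' actual computation: the function name `cohen_kappa` and its arguments are hypothetical, and the sketch assumes two coders rating the same items on a known, ordered set of categories.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's kappa for two raters (hypothetical helper).

    `categories` lists all rating levels in order; the order only matters
    when `weighted=True`, which applies linear disagreement weights.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    n = len(rater_a)
    index = {c: i for i, c in enumerate(categories)}

    # Joint counts of (rater A level, rater B level) and per-rater marginals.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (k - 1)  # linear weights, 0 on the diagonal
        return 0.0 if i == j else 1.0    # unweighted: any mismatch counts fully

    observed = sum(disagreement(i, j) * c / n for (i, j), c in joint.items())
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j] / (n * n)
        for i in range(k)
        for j in range(k)
    )
    if expected == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1.0 - observed / expected
```

Writing kappa as one minus the ratio of observed to chance-expected disagreement lets the weighted and unweighted cases share one formula; only the disagreement weights differ.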

Ethical approval

The study was approved by the institutional ethics and animal welfare committee at the University of Veterinary Medicine Vienna (11/10/97/2012) and by the School of Life Sciences Ethics Committee at the University of Lincoln, UK (UID COSREC146). According to the Hungarian Animal Protection Act (“1998. évi XXVIII. Törvény”, 3. §/9.), which defines experiments on animals, our non-invasive observational study was not considered as an animal experiment and thus did not require approval.

Fig. 1 Three testing rooms from the position of recording cameras and room dimensions: top left Budapest (3.6 m × 4.6 m), top right Vienna (6 m × 7.2 m), bottom line Lincoln (5.2 m × 5.9 m)

Test 1: Response flexibility in utilising human communicative signals

Method

We used a dynamic, distant pointing test to investigate how accurately dogs follow the experimenter-given cues and how flexibly they can use the human pointing gesture. The protocol is based on Brúder (2010). This test investigates two aspects of dogs’ cognition: (1) performance in utilising a simple communicative signal and (2) dogs’ ability to shift (behavioural flexibility in their choice). Dogs are expected to perform above chance level in this task in general, but in this specific design, after having been presented with three consecutive pointing signals in the same direction, a drop in performance is expected at the first pointing in the opposite direction. This is followed by a recovery in performance when these latter signals are repeated. The test consisted of a total of 8 trials: two pre-training trials (see “Video protocol” section in Appendix) and six test trials. The six test trials occurred in a fixed order (AAABBB), while the direction of the pointing (left or right) was balanced. The dog was held by the collar by the owner 2.5 m from the experimenter. In the pre-training trials, the experimenter put a piece of treat in one container, placed it in front of herself and the dog was allowed to take the treat from the container. During the test trials, the bowls (diameter 18 cm) were 1.5 m from each other, with the experimenter standing 50 cm behind them facing the dog. Before each trial, the experimenter called the dog by its name, established eye contact and then performed the pointing gesture. Once she reached a static position (Fig. 2a), the dog was released. If the dog approached the indicated container first, it was allowed to eat the reward, and if the dog approached the non-baited container first, the experimenter removed the baited container, the owner called back the dog and the next trial followed. A trial ended when the dog approached one of the bowls within 10 cm.

Fig. 2 Demonstration of the equipment used during the project. a Set-up of the pointing test. b Set-up of the problem-solving test. c The apparatus used in the means-end test. d Set-up of the obedience task. e Set-up of the incidental memory test

Measured variables

We coded the total number of correct choices (out of six) as the number of trials in which the dog went to the container signalled by the experimenter. A choice was considered correct if the dog approached the baited bowl (within 10 cm) first.

For analysing the dogs’ performance (correct/incorrect choice) after switching from pointing to one side to the other (during Trials 4 and 5), we calculated a generalised linear model with binomial distribution and logit link.

Results and Discussion

Across the three countries, dogs’ mean performance in the first trial was 75.7 ± 42.9%. In a previous dataset (N = 117; Brúder 2010), dogs’ performance was 78.6 ± 41.2% in this task. We found that neither proximal nor phenotype-related factors influenced dogs’ performance regarding the number of total correct choices (Table 2).

Table 2 Effect of proximal and phenotypic-related factors on performance in the pointing test

Performance after pointing direction transition

In the fourth trial, when the experimenter first pointed in the other direction, similarly to the original findings (59.8%), dogs’ performance dropped (48.1%). While dogs in our study performed at chance level, dogs in Brúder (2010) performed above chance. Calculating the effect size revealed that this difference between the two populations/studies was small (η² = 0.014). Dogs in our study performed above chance level again in Trials 5 and 6 (Fig. 3).

Fig. 3 Dogs’ performance in the pointing test, * = significance of performance above chance

Dogs’ performance in this pointing test was robust; they performed at similar levels on all testing sites, among all three breed groups, regardless of sexual status (Tables 2, 3). The drop in dogs’ performance after the transition was also prevalent in every group, which shows that they reacted similarly in such a simple communicative situation.

Table 3 Dogs’ performance in the pointing test

Test 2: Problem solving before and after demonstration: the tube task

Method

In the tube task, the dogs were provided with a two-action task in an interspecific social learning context. The dogs could obtain a piece of food from a device (Fig. 2b) with two actions: via manipulating the plastic tube or via one of the two ropes attached to the left and right ends of the tube (see “Video protocol” section in Appendix). Our protocol was based on Pongrácz et al.’s (2012) study, in which the level of success in the control group did not differ from the groups witnessing human demonstration, but when presented with a human demonstrating a rope manipulation, dogs tended to favour the demonstrated action, although they did not routinely follow the demonstrated side. The owner and the dog were 2.5 m from the equipment. The height of the tube was adjusted to the height of the dog (the height of the tube could be adjusted between 40 and 100 cm based on the dogs’ height at the withers; 21–30 cm → 40 cm, 31–40 cm → 50 cm, etc.; see Fig. 2b).
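The height adjustment rule can be sketched as a small lookup. Only the first two bands (21–30 cm → 40 cm, 31–40 cm → 50 cm) and the 40–100 cm range come from the text; the sketch assumes that the "etc." continues in 10-cm steps and that dogs outside the listed bands get the nearest available tube height. The function name is hypothetical.

```python
import math

MIN_TUBE_CM = 40   # lowest tube setting stated in the text
MAX_TUBE_CM = 100  # highest tube setting stated in the text


def tube_height_cm(withers_cm):
    """Map height at the withers (cm) to a tube height (cm).

    Assumption: every further 10-cm band of withers height raises the
    tube by 10 cm, clamped to the 40-100 cm range of the apparatus.
    """
    if withers_cm <= 0:
        raise ValueError("withers height must be positive")
    band = math.ceil(withers_cm / 10)  # 21-30 -> 3, 31-40 -> 4, ...
    height = band * 10 + 10            # band 3 -> 40 cm, band 4 -> 50 cm, ...
    return max(MIN_TUBE_CM, min(MAX_TUBE_CM, height))
```

Under these assumptions a dog measuring 65 cm at the withers would be tested with the tube at 80 cm, and any dog taller than 80 cm would get the maximum setting of 100 cm.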

In two pre-training trials, the dog could witness the experimenter throw a piece of food into the slanted tube so that it fell out immediately at the other end and the dog could collect it. After this, we tested the dog in a control condition in a single trial, where the dog could attempt to extract the treat on its own without experimenter demonstration. After this, in three trials the experimenter demonstrated how the food could be extracted via pulling down the rope (always on the same side for a given subject). After the demonstration, the experimenter put the treat back into the tube, walked to the owner and the next trial began. Except for demonstrations, every manipulation of the tube happened behind an opaque screen, so that the dog could not see how the apparatus was loaded. The trial ended if the dog extracted the treat or after 60 s. During the test, the owner was allowed to encourage the dog but not to give instructions or commands to the dog and had to remain in the same position from the start.

Measured variables

We coded the number of successful trials (out of four) in which the dog released the treat from the tube within 60 s and the number of trials with successful rope manipulations, when the dog solved the task by manipulating the rope. We also coded on which side of the tube the successful manipulation occurred.

For the number of trials with successful rope manipulations, a generalised linear model with negative binomial distribution with log link was built.

Results and discussion

Dogs’ performance in this test was not influenced by either proximal or phenotype-related factors (Tables 4, 5). In our combined sample, success rate was 69.4 ± 46.1% in the control trial, which is comparable to the success level (72.2 ± 44.8%) of Pongrácz et al. (2012). In the first trial, the proportion of successful rope manipulations was 20.8%, while in Pongrácz et al. (2012) 16.7% of the dogs succeeded in the first control trial via manipulating the rope. While in Pongrácz et al. (2012) dogs followed the pull demonstration in 43.3% of cases, our dogs did so in only 22.9% of the trials.

Table 4 Dogs’ performance in the problem-solving test
Table 5 Effect of proximal and phenotypic-related factors on performance in the problem-solving test

In our study, dogs always (regardless of testing site, sex, breed, demonstrated side) preferred the left side (one sample binomial, P < 0.05), while no such preference occurred in Pongrácz et al.’s experiment. One possible explanation for this difference is a change in the procedure: The experimenter took 3 steps backward and remained behind the equipment in Pongrácz et al. (2012), while in our case, the experimenter went around on the left side to go back to the dog’s owner. It is probable that the experimenter’s movement to the left biased the dogs’ attention to the corresponding side.

We reproduced the success rate of the original study, and dogs in our sample were similarly likely to operate the apparatus via a rope during their first encounter without a demonstration, even though we used food as a reward. However, having first had the chance to interact with the apparatus without a demonstration, dogs in our study did not follow the method shown in the repeated human demonstrations (manipulating the rope), whereas in the original protocol of Pongrácz et al. (2012) they did.

Test 3: Means-end test

Method

The protocol was based on Range et al. (2011), who measured dogs’ performance in a support problem (physical cognition). The apparatus, consisting of two sliding boards, was slightly modified from the original study to test the dogs without the experimenter sitting in front of the dog during the trials (Fig. 2c). The two sliding wooden boards were connected with a 110-cm-long string, so that if the dog pulled out one board, the other one was mechanically pulled back into the metal cage (100 × 65 × 65 cm). For the present study, we only used the ‘same distance condition’ of the original study where the two treats were placed at the same distance from the end of the boards, one on a board and one next to the other board (see “Video protocol” section in Appendix). Thus, only pulling out the board with the treat on top would be rewarded. In this condition, dogs, as a group, are expected to perform above chance level.

During a pre-training phase, dogs were trained with shaping and positive reinforcement to pull out a baited board. During the pre-training, only a single board was available (the other was pushed back into the cage), but presentation of the boards alternated every time the dog got the treat so that the dogs received an even number of treats from both boards. Pre-training was completed if the dog was able to readily pull out the board without help three times in a row, if the dog lost interest in the task or if it did not learn the task within a 20-min session. Dogs that did not reach the learning criterion were excluded from the analysis.

The test consisted of six trials. During the test trials, the owner was seated on a chair 3 metres away from the apparatus. The dog was prevented from seeing the baiting of the apparatus via an opaque screen (at least 100 × 150 cm). After baiting, the experimenter removed the screen, walked back next to the owner and the trial began. The trial ended if (1) the dog pulled out one of the two boards or (2) after 60 s. During test trials, the owner was allowed to encourage the dog with his/her voice and gestures, but had to remain seated. When the trial ended, the owner called back the dog and the next trial began. We included only those subjects that made a valid choice in every trial (total number of excluded dogs = 90; reasons for exclusion: dog did not reach learning criteria, dog managed to reach the food without pulling out the slide, dog did not make a choice, equipment malfunction).

Measured variables

We coded the duration of the pre-training (seconds) as the time required to reach the pre-training criterion (pull out the slide without hesitation 3 times in a row). Duration of pre-training was used as a random factor in the statistical models. The number of correct choices was the number of trials in which the dog pulled out the baited slide within 60 s. The maximum value was 6.

Results and discussion

In line with the original study, dogs included in this study (N = 128) performed above chance level (binomial P < 0.001 two-tailed; 578 out of 768 trials). Dogs chose the correct slide in 75 ± 18% of the trials (see detailed information about performance level in Table 6). The influence of the testing site seems inconclusive. Testing site affected the performance of the dogs based on Bayesian analysis, but it did not have a significant effect based on the GLM (Table 7). Based on the results of the Bayesian analysis, dogs from AT showed lower performance, but still performed above chance level (Fig. 4). Interestingly, this lower performance was also closer to the performance level reported in the original study from AT.
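The group-level comparison with chance (578 correct choices in 768 trials, against a chance level of 0.5) can be illustrated with an exact two-tailed binomial test. The sketch below is not the authors' SPSS/JASP procedure: the function name and the small tie tolerance are illustrative choices, and pooling trials this way treats them as independent, which is a simplification.

```python
import math


def binomial_test_two_sided(successes, trials, p_chance):
    """Exact two-tailed binomial P value (hypothetical helper).

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the chance probability `p_chance`.
    """
    if not 0 < p_chance < 1:
        raise ValueError("p_chance must lie strictly between 0 and 1")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    def pmf(k):
        # Work in log space so large trial counts do not overflow.
        log_p = (
            math.lgamma(trials + 1)
            - math.lgamma(k + 1)
            - math.lgamma(trials - k + 1)
            + k * math.log(p_chance)
            + (trials - k) * math.log(1 - p_chance)
        )
        return math.exp(log_p)

    observed = pmf(successes)
    threshold = observed * (1 + 1e-7)  # tolerance for floating-point ties
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= threshold)
    return min(1.0, total)


if __name__ == "__main__":
    # Values taken from the text: 578 correct choices in 768 trials.
    print(578 / 768)
    print(binomial_test_two_sided(578, 768, 0.5))
```

The same helper applies to other chance levels, for example 0.25 in a four-choice task, because it does not rely on the symmetry of the 0.5 case.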

Table 6 Dogs’ performance in the means-end task at three different testing sites (HU, AT, UK)
Table 7 Effect of proximal and phenotypic-related factors on number of correct choices in the means-end task
Fig. 4 Boxplot of the dogs’ performance in the means-end test, * = significance of performance above chance

We could not reach the planned sample size because many dogs failed to learn how to operate the equipment within the short time frame available to them and we also had to exclude subjects due to equipment malfunction (e.g. the slides got stuck). The rate of exclusion did not differ between testing sites, χ²(2, N = 218) = 0.39, P = 0.825, but it differed among breed groups, with a higher dropout rate among the mixed breed dogs, χ²(2, N = 218) = 6.10, P = 0.047.
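Both exclusion-rate tests have 2 degrees of freedom, and for that case the upper-tail probability of the chi-square distribution has the closed form exp(−x/2). The sketch below uses this to relate the reported statistics to their P values; the function name is hypothetical and the shortcut holds only for df = 2.

```python
import math


def chi2_p_value_df2(statistic):
    """Upper-tail P value of a chi-square statistic with exactly 2 df.

    For df = 2 the chi-square distribution is an exponential distribution
    with mean 2, so the survival function is exp(-x / 2).
    """
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.exp(-statistic / 2)


if __name__ == "__main__":
    print(chi2_p_value_df2(0.39))  # testing-site comparison
    print(chi2_p_value_df2(6.10))  # breed-group comparison
```

The rounded statistics give roughly 0.823 and 0.047. The second matches the reported value; the small gap to the reported 0.825 for the first is in the range one would expect if the published statistic was rounded from about 0.385.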

We reproduced the original findings (Range et al. 2011) indicating that the results were robust, with dogs performing significantly above chance level at all testing sites and in every breed/population. The difference in performance levels between testing sites may be a consequence of random effects on the smaller sample size in comparison to the other tests, which highlights the importance of testing at least 15 dogs/group. A larger sample size would be required to test whether there is a real difference between the dogs’ performance in this task between testing sites.

Test 4: Incidental memory

Method

This test is an adapted reproduction of Experiment 2 from Fujita et al. (2012). The ‘incidental memory’ test measures how accurately dogs can recall information in an unexpected memory test. During the presentation phase, the dog (on a leash) is allowed to investigate four bowls (Fig. 2e): an empty one, one containing a pebble and two containing a single piece of food each. The dog is allowed to consume one treat and inhibited (via the leash) from consuming the other, therefore at the end of the presentation phase only a single container still had food in it. After a 10-min delay, the dog is allowed to choose which bowl to visit (see “Video protocol” section in Appendix). Based on Fujita et al. (2012), dogs are expected to remember the location of the remaining treat after the break and go for the container where they left the food.

The bowls were 26 cm in diameter and 10–12 cm in height. We put the bowls 2 m away from the starting point (in Fujita et al. this distance was 1.5 m), while keeping the angle (30°) the same between neighbouring bowls. We decided to increase the distance due to the larger body size of the dogs in our sample. The position of the objects and that from which a treat could be eaten was randomized and told to the owner in advance, so that she could be prepared to prevent the dog from eating the second treat via holding onto the leash. After the presentation phase, the dog and the owner left the room and the experimenter changed the set of containers to a clean one (otherwise identical, but never containing any food). During the 10-min delay, the owner, the dog, and the experimenter participated in the obedience test. After the delay, they returned to the room and the owner released the dog with a general release command without pointing in any direction. The trial ended after the dog made its second choice (visited the second bowl).

Behavioural variables

We coded the dogs’ first and second choice based on the bowls’ content at the end of the presentation phase (where it left the treat, where it had previously consumed the treat, the bowl containing the pebble or the empty container).

Results and discussion

Dogs chose the container in which they had left a treat significantly above chance level (25%). In our sample from three testing sites, on average 58.5% of the dogs went to the location where they left the food previously (compared to 51.3% in Fujita et al. 2012, and for more details see Table 8).

Table 8 Dogs’ performance in the memory test at three different testing sites (HU, AT, UK)

For our dog population, the proportion of dogs choosing the container from which they had previously eaten (but which was empty at the end of the presentation phase) was 22.6%, which falls between the German (42.8%) and Japanese (5.6%) results of Fujita et al. (2012). The dogs which made an error during their first choice were more likely to go to the container where they had previously eaten (Chi-square test, P < 0.001). Of those dogs that did not find the correct location on the first attempt (N = 90), 55.6% went there on their second attempt (64.7% in Fujita et al. 2012). We found that neither proximal nor phenotype-related factors influenced dogs’ performance regarding the measured variables in the test (Table 9).

Table 9 Effect of proximal and phenotypic-related factors on dogs’ performance in the memory test

Test 5: Obedience test

Method

By means of a short behavioural test battery, we measured the subject’s obedience level (the owner’s ability to control the dog with simple commands) outdoors, in an area with moderate disturbance (people occasionally walking by, but no traffic nearby; Fig. 2d, “Video protocol” section in Appendix). Our aim was to gather information about their training performance and relationship in a relatively objective manner. This part did not assess dog cognition per se as dogs’ performance in such a situation most likely depends on their training experience and their handler’s skilfulness. We used the following basic obedience tasks: call back, down (3 conditions: only verbal command, only hand signal, both) and stay. Between the tasks, the owner was allowed to praise/pet the dog and give treats. The owners were not allowed to hold treats or touch the dogs during the tasks. The commands were given in a fixed order for all dogs. The dog was on a long leash (5 m) throughout the test, but the owner was free to decide whether she/he held onto it.

Measured variable

The scoring system was based on Fukuzawa et al. (2005). Each task was evaluated with the same five-point scale (for a detailed description of the scoring see “Appendix”). Where tasks could be divided into subtasks (e.g. call back and make the dog sit down), we added additional scores to code the transition as well. The final score was the sum of the task and transition scores of the five commands.

We scored the dog’s performance and summed the scores received for the different commands (total score, maximum value = 32 points). A generalised linear model with a multinomial distribution and a cumulative logit link was used.
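As a minimal sketch of what a cumulative logit link does, the function below turns a set of ordered thresholds and a linear predictor into probabilities for each ordered score category. It assumes the common parametrisation P(Y ≤ j) = logistic(θ_j − η); the thresholds, predictor values and function names are hypothetical and are not taken from the fitted model.

```python
from math import exp


def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + exp(-x))


def category_probabilities(thresholds: list[float], eta: float) -> list[float]:
    """Category probabilities under a cumulative logit model.

    Assumes P(Y <= j) = logistic(theta_j - eta) with increasing thresholds.
    Returns len(thresholds) + 1 probabilities, lowest category first.
    """
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)  # P(Y = j) = P(Y <= j) - P(Y <= j-1)
        previous = c
    return probabilities


# Hypothetical thresholds and linear predictors, for illustration only.
thresholds = [-1.0, 0.0, 1.5]
low = category_probabilities(thresholds, eta=0.0)   # e.g. a reference group
high = category_probabilities(thresholds, eta=1.0)  # e.g. a group with a positive effect
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

Under this parametrisation, a larger linear predictor shifts probability mass towards the higher score categories, which is how effects of factors such as testing site or breed would be read.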

Results and discussion

In the GLM, a testing site × breed interaction was revealed (Wald χ² (4, N = 195) = 10.44, P = 0.034; Tables 10 and 11, Fig. 5). In contrast, the Bayesian analysis did not support the presence of a testing site × breed interaction and favoured the model including testing site and breed only as separate factors. Border collies achieved higher scores than Labradors and other breeds, and dogs from AT received higher scores than dogs from the UK. Sexual status had no influence on the received obedience score. Obedience score did not influence any other analysed variables including the duration of necessary pre-training for the means-end task.
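The reported P value can be checked from the Wald statistic and its degrees of freedom. The sketch below uses the closed-form upper tail of the chi-square distribution, which exists for even degrees of freedom; the function name is illustrative.

```python
from math import exp, factorial


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-square variable with even df.

    For df = 2m the survival function is exp(-x/2) * sum_{k<m} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    return exp(-half) * sum(half**k / factorial(k) for k in range(df // 2))


# Values reported in the text: Wald chi-square = 10.44 with 4 degrees of freedom.
p = chi2_sf_even_df(10.44, 4)
print(f"P = {p:.3f}")  # prints "P = 0.034"
```

The result agrees with the reported P = 0.034.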

Table 10 Dogs’ performance in the obedience test at three different testing sites (HU, AT, UK)
Table 11 Effect of proximal and phenotypic-related factors on dogs’ performance in the training level test
Fig. 5 Boxplot of the obedience scores by breed groups and countries. BC Border collie, LR Labrador retriever, VB various breeds, HU Hungary, AT Austria, UK United Kingdom

Although our goal was to recruit dogs without special training, Border collies and their owners may represent a special population: regardless of testing site (country of origin), these owners may provide some basic training to their dogs, as the breed is often selected for a range of popular sporting activities like agility. In addition, we found differences between two testing sites, Lincoln and Vienna. One possible explanation is that dogs living in a capital city need some basic training to cope with nearby traffic and crowded spaces, while this is not necessary in a small city like Lincoln, where owners have more opportunity to exercise their dogs in open fields away from others.

General discussion

To our knowledge, this is the first attempt to specifically measure the reproducibility of a range of measures of cognitive-behavioural performance in dogs. We have successfully replicated the main findings of a broad range of cognitive tests (indicating inter-experimenter reliability) across three testing sites (indicating intra-rater and inter-site reliability), using three groups of dogs (indicating inter-subject reliability). This suggests that these phenomena are robust and that the results are generalizable between geographic regions. Where we did find differences in level of performance among testing sites, these were in the means-end test (where the sample size was smaller than desired) and in the obedience test (which depended on the owner for execution).

Our findings do not question the view that certain breeds or even lines (e.g. working vs. show line, Fadel et al. 2016) differ as a population in their behaviour or problem-solving performance in some tasks, but indicate that these effects may be small. The current study compared only Border collies and Labradors, the samples were relatively small compared to that of Fadel et al. (2016), and the tasks were not selected with the aim of detecting breed differences. However, it is worth noting that differences between breeds can be due to genetic, functional, geographic and/or cultural factors (Miklósi 2014), and further work is required to tease out the relative importance of these factors in any discussion of the matter.

Although for most studies we replicated the main findings (whether dogs are able to perform on a similar level as a group in a given condition), there were some minor deviations from the original results, and many of these effects may be due to protocol differences. In the tube task, while we found that the level of success and the preferred method (push vs. pull) in the control condition were comparable, dogs in our sample did not copy the demonstrated method, since they were not more likely to perform a pull action following the demonstration. In this case, we deviated significantly from the original protocol (Pongrácz et al. 2012) to make this test suitable for the present project (switching from ball reward to food reward, testing with a within-subject design instead of the original between-subject design). We also found what appeared to be a local enhancement effect from the experimenter’s position during the task. This highlights how small changes in the protocol can have significant effects on the results. It is therefore essential that protocols are fully illustrated so that they can be faithfully reproduced, and to this end the use of video demonstration as supplementary material to the methods is invaluable (Kampis et al. 2010; Kaminski et al. 2011; Huber et al. 2012). Videos of all the protocols used here are available in the supplementary information.

The method of pre-training in the means-end task could also have affected the results. Müller et al. (2014) found that with a modified training protocol, Border collies performed at chance level in this test condition. As our aim was to reproduce the dogs’ performance from Range et al. (2011) (which we achieved), we did not test whether dogs understood the task from the beginning or learnt the means-end relations during the pre-training. Nevertheless, as in the original study, we found no effect of the number of pre-training trials on the success rate.

In the memory task, our dogs’ performance was between that of the German and Japanese dogs reported by Fujita et al. (2012). Compared to dogs from Japan, European dogs were more likely to visit the container where they had previously found food (JP 6% below chance level, DE 43% at chance level, our sample 23% at chance level). Whether Japanese dogs differ only in this or also in other cognitive aspects from European dogs requires further investigation across a wider range of tasks, and emphasises the need for caution when generalising. This is reinforced by the results of the obedience test, where testing site and breed influenced performance. Training level has been reported to influence performance of dogs both in a physical problem-solving task (Marshall-Pescini et al. 2008) and in a food choice task investigating social influence of the owner (Prato-Previde et al. 2008). In our study, we did not have enough dogs with advanced level training to test such effects, as our aim was to focus on effects relating to ordinary family dogs. A short obedience task as used here has the potential to provide objective data about the owner’s ability to control the dog in a novel environment. This is more informative in evaluating dogs that have not completed a formal dog training course and makes comparison across dogs with different training backgrounds feasible, although it is important to keep in mind that, to some extent, this test in its current form also relies on the owner’s abilities. This information about the subjects is especially relevant in more sophisticated testing set-ups, because training level has been shown to influence behaviours which are usually measured in cognitive tasks as dependent variables, such as latency and duration of interaction with the apparatus and performance in a manipulative task (Marshall-Pescini et al. 2016). Thus it may be advisable to test whether the level of training of dogs succeeding in specific training tasks is comparable to the typical family dog population before making any generalisations to the latter (Huber et al. 2013).

In future studies (via close collaboration or by utilising extensive video protocols, Kampis et al. 2010), the behaviour of a large number of dogs from different countries and multiple testing sites could be compared to establish the robustness of other widely used testing protocols. Moreover, another aspect that should be studied to understand the reproducibility of cognitive tests in dogs is the effect of the experimenter and/or handler, since we do not know to what extent the range of unintentional human cues could influence dogs’ performance in complex situations.