1 Introduction

Cities are complex systems: collections of interacting agents that exhibit non-trivial collective behavior [2, 19]. This observation has guided research into general principles of city planning that can govern the behavior of the complex adaptive system that the city manifests. Early work by Jacobs proposed ideal sizes and specific guidelines for city neighborhoods [25], and more recently, researchers have begun to empirically validate these ideas using mobile phone data [46]. West and Kempes model the scaling behavior of cities as a balance between sublinear growth in resource consumption (as a function of population) and superlinear growth in socioeconomic effects, both positive (per capita wages) and negative (inequity, disease), with power laws deviating from linear scaling by an exponent of about 0.15 (as opposed to, for example, the deviation of 0.25 identified in many biological processes) [28, 67]. More recently, the rate of COVID-19 spread has been shown to approximate the same power law [57]. As West and Kempes argue, “Cities are machines we evolved to facilitate, accelerate, amplify, and densify social interactions.”

The holistic study of cities as complex systems complements the rapid (yet ultimately opportunistic) proliferation of artificial intelligence technology in the public sector. Although conventional machine learning techniques are common in urban applications [44, 65, 71], deep learning is opening new opportunities by adapting convolutional, recurrent, and transformer architectures to spatiotemporal data [35, 38, 56, 70, 73, 75, 76]; see Grekousis [17] for a recent survey.

These two lines of inquiry—top–down modeling of cities as complex systems and bottom–up modeling of specific urban systems using deep learning—are difficult to reconcile. Complex systems are not amenable to reductionist statistical experiments: comparing the results of an agent-based model with observed data (e.g., for autonomous vehicle research [11]) is often the best we can do, despite the challenges of addressing the inverse problems implied [8, 59]. The central issue is that observational micro-data for cities are inconsistent in availability and quality, limiting the opportunity for validation of sophisticated models.

This inconsistency persists despite significant investments in open data. Over the last two decades, cities have increasingly released datasets publicly on the web, both proactively and in response to transparency regulation. For example, in the US, all 50 states and the District of Columbia have passed some version of the federal Freedom of Information (FOI) Act. While this first wave of open data was driven by FOI laws and made national government data available primarily to journalists, lawyers, and activists, a second wave of open data, enabled by the advent of open source and web 2.0 technologies, was characterized by an attempt to make data “open by default” to civic technologists, government agencies, and corporations [64]. While open data have indeed made significant data assets available online, their uptake and use have been weaker than anticipated [64], an effect many attribute to the inconsistent availability of high-value data across cities [32]. Ultimately, open data exhibit convenience sampling effects.

In this paper, we consider four research thrusts all aimed at using AI techniques to improve the coverage, access, and equity of urban data, and thereby reduce barriers and attract attention to the study of critical questions in city dynamics and socioeconomic interactions. Machine learning research is broadly recognized to be too narrow in applications and datasets, focusing on opportunistic, discriminatory, and profit-oriented applications [50, 53, 68]. By making high-quality urban data available across cities, across variables, across time-scales, and at multiple resolutions, we aim to make AI research on societally important problems the path of least resistance. However, to accomplish this long-term goal, we need to address specific challenges in working with urban data.

Expanding existing sources By simultaneously modeling multiple heterogeneous datasets [69], we aim to identify the underlying relationships and interactions between urban systems as a middle path between reductionist, application-specific prediction tasks and holistic, simulation-oriented inference. However, where our earlier work assumed uniform data coverage, we now need to apply advanced learning techniques to interpolate and extrapolate dense spatiotemporal datasets to account for inconsistent coverage (Sect. 2). These techniques help expand the utility and reach of data-hungry predictive models to counteract the sparse and inconsistent availability of public and private data. As an example, we show how the interpolation of urban transportation data is remarkably amenable to deep learning architectures developed for image inpainting on the web.

Developing new sources Open data in urban contexts are typically either spatiotemporal (vector or raster) or administrative (structured). However, by investing in infrastructure, we can develop and make available data sources around governance, economics, decision-making, and public participation. As an example, we show how transcripts from public meetings are amenable to computational processing to increase oversight and participation, if we can first establish an infrastructure and appropriate standards to collect and manage this data (Sect. 5).

Exploiting rich ontologies The use of large, noisy, and heterogeneous data motivates investment in data curation: associating contextual information with the data to mediate its collection and use. However, manual curation activities (e.g., human labeling of data) scale poorly. In complex domains, human expertise is better invested in developing richer labeling schemes than in actually labeling data. For example, ontologies have been developed for electric mobility [55], humanitarian efforts [3], and smart city applications [1, 9, 13, 18, 61]. However, categorizing public data (e.g., social media posts) using these ontologies requires new techniques in hierarchical multi-label classification. As an example, we show how graph encoding techniques can be used to significantly improve performance in these contexts (Sect. 4).

Incorporating fairness and interpretability In every application of urban machine learning, prediction and modeling carry enormous risks of exacerbating inequity and opacity [21, 42, 43, 49]. Building on recent advances in fair and explainable AI, we consider the interactions between accuracy, fairness, and explainability in urban applications. We then propose new methods for controlling these tradeoffs in response to emerging regulation (Sect. 3).

2 Interpolation of spatiotemporal data using deep learning

Image inpainting is the task of synthesizing missing pixels in images. In computer vision, there are two broad branches of approaches to inpainting. The first contains diffusion-based or patch-based methods that utilize low-level image features to reconstruct the missing regions. The second contains learning-based methods that involve training deep learning models. Traditional diffusion-based methods transfer information from the valid regions to the missing regions; they are convenient to apply but limited to small missing regions. Learning-based approaches aim to recover images based on patterns learned from large amounts of training data. Such methods include the context encoder of Pathak et al. [47], globally and locally consistent image inpainting by Iizuka et al. [22], the partial convolution method of Liu et al. [36], etc.

Image inpainting techniques have a wide range of potential applications, including in the geospatial domain, which works frequently with satellite images. Zhang et al. [77] proposed a unified spatial–temporal–spectral deep convolutional neural network (CNN) inpainting architecture to recover information obscured by poor atmospheric conditions in satellite images. Kang et al. [27] modified the architecture from [72] to restore missing patterns of sea surface temperature in satellite images. Tasnim and Mondal [60] also applied the inpainting architecture from [72] to remove redundancies in satellite images and restore the imagery.

We build on prior work from our group on learning fair integrations of heterogeneous urban data [69]. We originally assumed uniform spatial and temporal coverage, but in practice, urban datasets are spatially imbalanced: one neighborhood may be missing a variable of interest that is defined everywhere else, undermining trust in the results. Conventional statistical approaches to imputing missing data, such as global/local mean imputation, interpolation, and spatial regression models, are limited in their ability to capture non-linear interactions, which is where deep learning methods, including image inpainting techniques applied to geospatial imputation, excel.

Given the similar nature of images and gridded urban data, we conjecture that image inpainting techniques can be adapted to impute missing urban data, improving coverage and quality, and therefore usability. As far as we know, no prior work has exploited image inpainting techniques to reconstruct missing values in raster urban data. In this section, we present our preliminary experiments and results from utilizing an image inpainting technique to impute missing values in gridded urban data.

2.1 Example: interpolating urban mobility data

We use taxi trip data as a representative example of urban data, though the coverage of urban data is much broader. We used NYC taxi trip data from 2011 to 2016 from the NYC Open Data Portal [10]. The years 2011–2014 cover trips throughout the entire year, while 2015 and 2016 cover only half of the year. The raw data are collected in tabular format, where each record/row contains the information for a single taxi trip, including the longitude and latitude of the starting location. We considered the demand prediction problem, interpreting each record as an indicator of demand following Mooney et al. [43]. We processed the tabular data into raster format in the following steps (a minimal gridding sketch follows the list):

  • We defined a rectangular subset of the greater metropolitan area of New York City representing lower Manhattan. We only consider the taxi trips that began within this rectangular region.

  • We imposed a 32 \(\times \) 32 grid over our selected region. This choice of dimensions is somewhat arbitrary, balancing fidelity (reducing the need to upsample or downsample datasets too much), computational efficiency, and interpretability (one grid cell is approximately 1 km\(^2\)). For each year, each unique date, and each unique hour, we count how many taxi trips began within each grid cell and interpret these counts as an estimate of taxi demand in that cell at that time.

  • In total, we have 32,616 samples to model with, each of dimension 32 \(\times \) 32. 70% of the samples (23,482) are used as training data, 10% (2610) as the validation set, and the remaining 20% (6524) as test data.
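The gridding step can be expressed compactly with pandas and NumPy. The sketch below is illustrative only: the column names (pickup_datetime, pickup_longitude, pickup_latitude) and the bounding-box coordinates are assumptions and should be checked against the actual NYC Open Data export.

```python
import numpy as np
import pandas as pd

trips = pd.read_csv(
    "yellow_tripdata.csv",
    usecols=["pickup_datetime", "pickup_longitude", "pickup_latitude"],
    parse_dates=["pickup_datetime"],
)

LON_MIN, LON_MAX = -74.03, -73.93   # illustrative lower-Manhattan bounding box
LAT_MIN, LAT_MAX = 40.70, 40.82
GRID = 32

# Keep only trips that began within the study region.
in_box = trips[
    trips.pickup_longitude.between(LON_MIN, LON_MAX)
    & trips.pickup_latitude.between(LAT_MIN, LAT_MAX)
].copy()

# Map each pickup to a grid cell and a calendar hour.
in_box["col"] = np.clip(
    ((in_box.pickup_longitude - LON_MIN) / (LON_MAX - LON_MIN) * GRID).astype(int),
    0, GRID - 1)
in_box["row"] = np.clip(
    ((in_box.pickup_latitude - LAT_MIN) / (LAT_MAX - LAT_MIN) * GRID).astype(int),
    0, GRID - 1)
in_box["hour"] = in_box.pickup_datetime.dt.floor("H")

# One 32 x 32 demand raster per (date, hour): counts of pickups per cell.
rasters = {}
for hour, rows in in_box.groupby("hour"):
    raster = np.zeros((GRID, GRID), dtype=np.float32)
    np.add.at(raster, (rows["row"].to_numpy(), rows["col"].to_numpy()), 1)
    rasters[hour] = raster
```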

Fig. 1 Inpainting results of taxi trip data. From left to right, the columns are: ground truth images; the irregular masks; the masked ground truth; the final inpainting results

2.2 Modeling and results

We implemented the architecture of Liu et al. [36]. Many prior works in image inpainting only considered rectangular missing regions, but rarely are the patterns of missing data so regular. In urban data, the missing regions may be scattered or irregularly shaped, corresponding to irregular political boundaries or inconsistent data collection; Liu et al.’s approach, which handles irregular masks, therefore fits the urban setting well. Liu et al. used the sum of four different losses as the objective function to account for factors related to the perception of the resulting image, which is appropriate for web images but less appropriate for quantitative urban data. We only used the \(\ell _1\) loss between the original data and the inpainted data. The model hyper-parameters are set to be consistent with Liu et al.’s work. The learning rate is set to a constant \(10^{-4}\). The batch size is set to 32. The maximum number of iterations is 10,000, and we evaluated the model on the validation set every 100 iterations.
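For reference, the sketch below shows the core idea of a partial convolution layer in the spirit of Liu et al. [36]: convolve only over observed cells, renormalize by the number of valid cells under each window, and propagate an updated mask. It is a minimal single-layer sketch under assumed shapes, not the full U-Net architecture or our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Single partial-convolution layer: convolve over observed cells only,
    renormalize by the count of valid cells in each window, update the mask."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used only to count valid cells under each window.
        self.register_buffer(
            "mask_kernel", torch.ones(out_ch, in_ch, kernel_size, kernel_size))
        self.window_size = in_ch * kernel_size * kernel_size

    def forward(self, x, mask):
        # mask is 1 where data are observed, 0 where they are missing.
        with torch.no_grad():
            valid = F.conv2d(mask, self.mask_kernel,
                             stride=self.conv.stride, padding=self.conv.padding)
        out = self.conv(x * mask)
        bias = self.conv.bias.view(1, -1, 1, 1)
        # Renormalize so windows with few valid cells are not systematically dimmed.
        out = (out - bias) * (self.window_size / valid.clamp(min=1)) + bias
        out = out.masked_fill(valid == 0, 0.0)   # windows with no valid cells stay empty
        new_mask = (valid > 0).float()           # a cell becomes valid once any input is
        return out, new_mask

# Training stacks such layers in an encoder-decoder and, in our setting, uses an
# L1 objective on the holes, e.g.:
# loss = F.l1_loss(inpainted * (1 - mask), ground_truth * (1 - mask))
```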

Five inpainted examples are presented in Fig. 1. We can see that the inpainting technique can be naturally applied to gridded urban data and yields promising results. Imputing missing values in the urban setting can also be viewed as a type of synthesis. The synthesis of partial urban data could improve the applicability and usability of urban data, but will require future work in multiple areas:

  • Though deep learning methods are powerful, we need rigorous evaluation against traditional imputation techniques to determine whether these complex methods are warranted. Additionally, visual similarity is subjective, which is acceptable for web images but not if we intend to use these datasets for quantitative analysis. Additional quantitative measurements should be incorporated.

  • The region of the experiment (NYC) and the dimensions of the urban grid (32 \(\times \) 32) are both limited. Expanding the region to cover more area would capture more urban dynamics, and evaluating the effect of different grid sizes will be necessary to test generalizability.

  • We treat each date and hour as a unique sample. In reality, however, demand in the current hour is closely related to demand in the previous hour, and modeling each sample separately ignores this dependency. Therefore, we hypothesize that modifying the architecture to work with temporal blocks would improve performance.

3 Trade-offs between distributive and procedural fairness

Real-world datasets often contain societal biases, which are perpetuated in machine learning models, leading to discriminatory decisions in high-stakes domains. In response, many methods have been developed to mitigate bias by achieving some statistical measure of equity between majority and minority groups (e.g., equalized odds and equality of opportunity) [4, 7, 30, 39, 74]. This line of work is guided primarily by the notion of distributive fairness, which emphasizes a fair allocation of resources.

However, prior work has shown that procedural fairness, the perceived fairness of the process that leads to the outcome, is just as important as distributive fairness [5, 63]. For example, in court systems, studies have shown that “most people care more about procedural fairness ... than they do about winning or losing the particular case” [63]. Recent studies have also shown that procedural fairness is critical to automated decision systems [33, 40]. For instance, through a cross-sectional survey study at a large German university, Marcinkowski et al. found that both distributive and procedural fairness have significant implications for higher education admissions that use an automated decision system [40].

The interaction between distributive fairness, interpretability, and procedural fairness is rapidly becoming a compliance issue. In April 2021, the EU released a proposal for sweeping regulation of algorithmic bias [14]. In the same week, the Federal Trade Commission released a blog post [26] describing a legal framework for evaluating AI bias, foreshadowing enforcement. In California, a bill regulating automated decision systems is in committee [23].

Drawing from procedural fairness theory, we propose Explanation Loss (see Eq. 1), a novel metric that measures procedural fairness, along with a method to optimize for it [34]. In particular, this metric measures the neutrality of the decision process with respect to different demographic groups. Since complex black-box models (e.g., deep neural networks and tree-based ensembles) are often used for their high predictive power, we use interpretability methods to generate explanations that reveal the model’s decision process for each datum. The metric then computes the average absolute difference of the explanations over all possible pairs of input samples, one from the minority group and one from the majority group. The intuition is that the difference in the model’s explanations for the two groups can be approximated by the average difference over all pairs of individual explanations. Explanation Loss therefore measures how far the decision process is from being perfectly neutral. In the following sections, we describe the data we used, the method that optimizes for the metric, and the preliminary results we obtained.

3.1 Data

We used the COMPAS dataset [31] for this preliminary study. The COMPAS dataset, which contains attributes of criminal defendants, is often used to study (deeply flawed) recidivism models: whether a person will reoffend within 2 years. It is known to exhibit severe biases against minority groups. Specifically, studies have shown that models trained on COMPAS tend to overpredict recidivism for Black defendants and underpredict recidivism for white defendants [31].

We preprocessed the dataset following Rieger et al. [52]. The dataset contains a total of 7214 samples; we filtered out 1042 due to missing recidivism information. We categorized age into under 25, 25 to 45 (inclusive), and above 45, and categorized sex into Male and Female. We also categorized the crime description based on matching words, resulting in the categories Possession (of drugs), Driving, Violence, Theft, and No Charge. For example, descriptions that match “theft” or “burglary” are categorized as Theft. We then one-hot encoded all categorical variables and used the numeric variables as is. We focused on equalizing explanations of the Black and Caucasian records, since these two are the predominant groups in the data.
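A minimal preprocessing sketch is shown below. The column names follow the public COMPAS release (e.g., two_year_recid, c_charge_desc), and the keyword lists beyond “theft”/“burglary” are illustrative placeholders rather than our exact matching rules.

```python
import pandas as pd

df = pd.read_csv("compas-scores-two-years.csv")
df = df.dropna(subset=["two_year_recid"])          # drop rows with missing target

# Age categories: under 25, 25 to 45 (inclusive), above 45.
df["age_cat"] = pd.cut(df["age"], bins=[0, 24, 45, 200],
                       labels=["under_25", "25_to_45", "over_45"])

def crime_category(desc) -> str:
    desc = str(desc).lower()
    if "theft" in desc or "burglary" in desc:
        return "Theft"
    if "possession" in desc:
        return "Possession"
    if "driv" in desc:
        return "Driving"
    if "assault" in desc or "battery" in desc:     # illustrative keywords
        return "Violence"
    return "No Charge"

df["crime_cat"] = df["c_charge_desc"].apply(crime_category)

# One-hot encode the categorical variables; numeric variables are used as is.
X = pd.get_dummies(df[["age_cat", "sex", "crime_cat", "priors_count"]],
                   columns=["age_cat", "sex", "crime_cat"])
y = df["two_year_recid"].astype(int)
```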

We split the data into train, validation, and test sets, with a ratio of 80/10/10.

3.2 Method

Interpretability techniques aim to generate explanations for a model’s individual predictions. A popular class of such techniques is known as feature attribution: given an input, the model, and a prediction, these methods assign a number to each feature of the input representing how much it contributes to the prediction [37, 45, 51, 58]. There are two reasons that feature attribution methods are appropriate for our study: (1) they allow us to compare the model’s explanation for each prediction at the feature level, which is especially important for fairness, since certain features are more sensitive than others; and (2) feature attribution vectors can be interpreted as attribution priors to incorporate the notion of procedural fairness into the model.

We propose the following regularization to achieve procedural fairness:

$$\begin{aligned} \hat{\theta } = \arg \min _{\theta } \sum _{(x_i, y_i) \in D} \mathcal {L} (f_\theta (x_i), y_i) + \lambda \frac{1}{|D_{s1}||D_{s2}|}\sum _{x_j \in D_{s1},\, x_k \in D_{s2}}\left| \mathrm {expl}(x_j) - \mathrm {expl}(x_k)\right| . \end{aligned}$$
(1)

The regularization term computes the average L1 norm of the difference between explanations for every pair of instances (one from each group), \(x_j \in D_{s1}\), \(x_k \in D_{s2}\). Each explanation, \(\mathrm {expl}(x)\), is a vector of feature importance scores with dimension equal to the number of input features; this vector can be generated using any feature attribution method. In this study, we used Contextual Decomposition (CD) as the feature attribution method [45]. This regularization term takes the exact form of our proposed metric for procedural fairness.

We trained a simple multi-layer neural network with one hidden fully connected layer of 100 neurons and ReLU activation, and varied the weight of the regularization term over 0 (no explanation loss), 0.2, 0.4, 0.6, 0.8, and 1.0. The model was trained with a batch size of 256 and a learning rate of 0.001 for 5 random seeds, and the average results are reported. While the regularization equation refers to explanations over the entire dataset, in practice, this term is computed per batch for faster convergence. Specifically, for each batch, we partition the instances into the two groups, then for every possible pair (one from each group), we compute the L1 norm of the difference of the feature attributions, and finally, we average over the pairs. An ablation study of the effect of batch size on the results remains future work.
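A per-batch version of the regularizer can be sketched as follows. The 0/1 group encoding and the bce and cd_attributions helpers mentioned in the trailing comment are illustrative assumptions; only the pairwise L1 computation mirrors Eq. 1.

```python
import torch

def explanation_loss(attributions: torch.Tensor, group: torch.Tensor) -> torch.Tensor:
    """Per-batch form of the regularizer in Eq. 1.

    attributions: (batch, n_features) feature-attribution vectors (here from CD [45]).
    group: (batch,) 0/1 indicator of the two demographic groups (encoding assumed).
    """
    a = attributions[group == 0]              # explanations for one group
    b = attributions[group == 1]              # explanations for the other group
    if a.shape[0] == 0 or b.shape[0] == 0:
        return attributions.new_zeros(())     # no cross-group pairs in this batch
    # L1 norm of expl(x_j) - expl(x_k) for every cross-group pair, via broadcasting.
    pair_dist = (a.unsqueeze(1) - b.unsqueeze(0)).abs().sum(dim=-1)
    return pair_dist.mean()                   # average over the |a| * |b| pairs

# Hypothetical training step: task loss plus the weighted regularizer.
# loss = bce(model(x), y) + lam * explanation_loss(cd_attributions(model, x), group)
```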

3.3 Metrics

In addition to our proposed metric of procedural fairness (Explanation Loss), the metrics we used to evaluate the model include 1) accuracy and 2) fairness. We considered two popular fairness metrics: equality of opportunity [20] and equalized odds [54]. A model is said to satisfy equality of opportunity if the false-positive rates are equivalent across two demographic groups. Similarly, equalized odds require false-negative rates to be equivalent across two groups in addition to false-positive rates.

We report each fairness metric as a fairness distance: how far the model is from perfectly fair. The fairness distance based on equality of opportunity [20] measures the absolute difference between the false-positive rates of one demographic group (FPR1) and another (FPR2): \(|\mathrm{{FPR}}1 - \mathrm{{FPR}}2|\). The fairness distance based on equalized odds [54] measures \(|\mathrm{{FPR}}1 - \mathrm{{FPR}}2| + |\mathrm{{FNR}}1 - \mathrm{{FNR}}2|\), adding the absolute difference between the false-negative rates.
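For concreteness, the following sketch computes both distances from binary labels, predictions, and group indicators, following the definitions above; the array names and 0/1 group encoding are illustrative.

```python
import numpy as np

def fairness_distances(y_true, y_pred, group):
    """Fairness distances per the definitions above (binary labels, binary group)."""
    def error_rates(mask):
        yt, yp = y_true[mask], y_pred[mask]
        fpr = np.mean(yp[yt == 0])        # false-positive rate among true negatives
        fnr = np.mean(1 - yp[yt == 1])    # false-negative rate among true positives
        return fpr, fnr

    fpr1, fnr1 = error_rates(group == 0)
    fpr2, fnr2 = error_rates(group == 1)
    eq_opportunity = abs(fpr1 - fpr2)                 # |FPR1 - FPR2|
    eq_odds = abs(fpr1 - fpr2) + abs(fnr1 - fnr2)     # plus |FNR1 - FNR2|
    return eq_opportunity, eq_odds
```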

Table 1 The effect of equal explanations on accuracy and fairness distances of the model

3.4 Results

The results are summarized in Table 1. From the first two columns of the table, we can see that the regularization effectively encourages the model to predict with similar explanations across the two demographic groups, which we interpret as improved procedural fairness: it penalizes the tendency for a model to essentially learn two separate submodels, one for each group. Second, adding the regularization term does not reduce the accuracy of the model. Third, equalizing the explanations has a minor effect on the fairness of the outcomes, causing a slight increase in the fairness distances.

4 Hierarchical multi-label classification

We demonstrate hierarchical multi-label classification (HMC) in the urban domain. HMC tasks involve a large set of labels organized into parent–child relationships, typically representing increasing specificity or is-a relationships. Each input record is associated with multiple labels in the hierarchy, reflecting the uncertainty associated with a large label space in a complex domain. HMC has received increasing attention with the adoption of neural networks [15, 16, 66, 78], often in contexts requiring significant human expertise, making large-scale labeling exercises infeasibly expensive. In other words, human expertise is invested in modeling the world through a complex ontology rather than in labeling data using that ontology. As a result, the machine learning tasks present a different set of requirements: the number of labels can be large relative to the number of labeled items, but there is structure among the labels that algorithms can exploit for distant supervision.

Ontology development is common in urban planning, where the complexity of the domain and the multiplicity of perspectives require building consensus around a universe of discourse. For example, ontologies have been developed by teams of experts to describe electric mobility [55], humanitarian efforts [3], and smart city applications [1, 9, 13, 18, 61]. The HMC literature rarely considers these urban applications, instead favoring biological and scientific domains where public data are more readily available.

We are exploring new approaches for HMC that involve learning reusable representations of the ontology itself (using graph encoding techniques) to tame the complexity, then using these learned representations as the labels when training a classifier. We show that using these ontologies as a source of supervision can significantly improve the classification performance over other HMC techniques, motivating greater investment in developing comprehensive ontologies to represent the complex urban domain as a whole rather than expending resources on creating expensive labeled datasets for myriad specific applications.

4.1 Case study: community listener

We worked with a local non-profit organization to identify community needs from several sources of data, such as social media (Twitter, Reddit, and Facebook conversations) and long-form survey responses. We classify these discourses into the Sustainable Development Goals Ontology (SDG) [3] and the Social Progress Index (SPI). The data and the predicted labels are then aggregated and visualized on an online dashboard serving policymakers and entrepreneurs. The Sustainable Development Goals Interface Ontology (SDG) was developed by the United Nations Environment Programme to support the achievement of the 17 United Nations Sustainable Development Goals to promote human rights and equity. The ontology includes 169 nodes across 3 levels. The Social Progress Index (SPI) was introduced by the Social Progress Imperative to promote improvement and action for social progress. They define social progress as “the capacity of a society to meet the basic human needs of its citizens, establish the building blocks that allow citizens and communities to enhance and sustain the quality of their lives, and create the conditions for all individuals to reach their full potential.” SPI includes three levels with 124 nodes.

4.2 Experimental settings

Table 2 Dataset statistics

There are two datasets used in this experiment, Programs and Organizations. Programs is a list of descriptions of humanitarian programs; the task is to determine which areas of humanitarian need each program intends to address. The description typically mentions the mission and areas of focus for the program, which we anticipated would make Programs relatively easy to classify. The Organizations dataset is a list of companies and non-profits that may work in areas of interest for humanitarian causes; in this case, the descriptions are less likely to explicitly mention areas of humanitarian need. For both datasets, we associate each record with zero or more labels from the SDG and SPI ontologies. Statistics of the two datasets are shown in Table 2. We split each dataset into training, validation, and test sets with an 8:1:1 ratio. The models are tuned on the validation set, and the experimental results are reported on the test set.

We experimented with different text embedders and classification models to find the best combinations. Because the organization did not have abundant computational resources, we limited our choices to computationally efficient models. We chose TF-IDF and GloVe [48] as our text embedders. For classification, we adopted two frameworks: one that considered the hierarchical structure of the labels with graph encoding (named Ontology) and one that did not (named Naive). The Naive model treated the labels as a flat list; it consisted of two fully connected layers and was optimized with binary cross-entropy, which is often used for multi-label classification. The diagram of the Ontology framework is shown in Fig. 2. The framework learns a representation for the label ontology using a graph autoencoder [29]; the model then maps input instances onto the node embedding space and scores each label by cosine similarity. Finally, the model is optimized with binary cross-entropy and produces a confidence score for each label; the classification threshold is set to 0.5. We evaluate all models with Precision (P), Recall (R), and F1 score, which are commonly used in the multi-label classification community. Following the literature, a data record is considered correctly classified only when the predicted leaves match the ground truth exactly: there is no partial credit for siblings, for example.
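The sketch below illustrates the core of the Ontology framework under assumed shapes and names: pre-trained node embeddings from a graph autoencoder [29] serve as label representations, input text embeddings are projected into that space, and each label is scored by (scaled) cosine similarity and trained with binary cross-entropy. The temperature parameter and layer sizes are illustrative, not our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OntologyClassifier(nn.Module):
    """Score each ontology label by cosine similarity between a projected text
    embedding and a fixed node embedding learned from the label graph."""

    def __init__(self, text_dim: int, node_embeddings: torch.Tensor):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(),
            nn.Linear(256, node_embeddings.shape[1]))
        # Label representations from the graph autoencoder, frozen during training.
        self.register_buffer("nodes", F.normalize(node_embeddings, dim=1))
        self.scale = nn.Parameter(torch.tensor(5.0))   # temperature for sharper scores

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        z = F.normalize(self.project(text_emb), dim=1)
        cos = z @ self.nodes.t()                 # (batch, n_labels) cosine similarities
        return torch.sigmoid(self.scale * cos)   # per-label confidence in [0, 1]

# Training uses binary cross-entropy against multi-hot label vectors; at inference,
# labels with confidence above 0.5 are predicted.
# loss = F.binary_cross_entropy(model(text_embeddings), y_multi_hot)
```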

4.3 Experimental results

Fig. 2 Illustration of our framework

Table 3 Experimental results on the Program and Organization datasets

We present our experimental results in Table 3. Because these two datasets are custom and not publicly available, we provide results from two baseline methods: majority vote and random selection. We observe that considering the label ontology significantly improves the results. The trained model then allows us to tag discourse on social media according to the humanitarian ontologies from SPI and SDG and to visualize it within an online dashboard serving policymakers and entrepreneurs. As a result, we can organize public discourse and participation to capture levels of interest in various topics.

This approach is potentially critical for addressing data scarcity in practice. As we have argued, in complex domains, obtaining labeled data is expensive and requires significant human expertise. For example, determining whether a potential project is related to a goal to enhance inclusive and sustainable urbanization (SDG 11.3), achieve sustainable management of resources (SDG 12.2), encourage adoption of sustainable practices (SDG 12.6), or all three, requires significant expertise with the SDG ontology, municipal government practices, and the data being labeled. Moreover, labeled datasets can be rendered obsolete with only minor changes to the ontology, requiring an expensive re-labeling exercise. To enable comprehensive cross-sector models that can be deployed in a variety of contexts, we need to make efficient use of the human attention invested in creating the ontology.

5 Modeling governance behaviors

The source of municipal democracy in the United States is found in city halls across the country. Even as our collective work in the analysis of urban space is used to create, debate, and ultimately enact urban policy, there is a lack of large-scale quantitative studies on municipal government. Comparative research into municipal governance in the USA is often prohibitively difficult due to a broad federal system in which states, counties, and cities divide legislative powers differently. This power distribution has contributed to the lack of necessary research into the procedural elements of administrative and legislative processes, because it allows each municipality to set its own standards for archiving and publishing municipal data [62].

To better study the complexities of municipal councils across the country, multiple tools are needed to standardize and aggregate data into large research databases and access portals. The data from municipal government meetings (videos, transcripts, voting records, etc.) must be made more accessible to both the general public and researchers, and such tools must be deployed in multiple municipalities across the nation so that the data can be used in aggregate to study the spread of policy, topic coverage, public sentiment, and more.

Once this infrastructure is available, it becomes possible to conduct large-scale quantitative studies on the dynamics of discourse in policy deliberation and enactment: quantifying how much policy is decided using community sentiment as its basis, how such policy is supported (or not) by the public, and how similar policy proposals in different municipalities (or levels of government) are discussed and either enacted or rejected.

5.1 Council data project

To enable such large-scale studies, we have begun work on the Council Data Project (CDP) [6], a suite of tools for deploying and managing infrastructure for rapidly generating, archiving, and analyzing transcript datasets of municipal council meeting content. CDP is easily deployable and generalizes to many different meeting venues, but is specifically built with municipal council meetings in mind.

For each meeting a CDP deployment processes, our tools generate a transcript of timestamped sentences and archive the produced transcript along with all attached metadata (minutes items, presentations and attachments, voting records, etc.). CDP deployments additionally create a keyword-based index multiple times a week to enable plain-text search of events.
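As a toy illustration of the kind of index involved (the actual CDP schema and indexing pipeline differ; see the project repository [6]), a keyword index over timestamped sentences can be as simple as:

```python
from collections import defaultdict

# Hypothetical record layout for timestamped transcript sentences.
sentences = [
    {"event_id": "example-council-meeting", "start_sec": 12.4,
     "text": "The committee moves to amend the transportation levy."},
]

index = defaultdict(list)
for s in sentences:
    for raw in s["text"].lower().split():
        token = raw.strip(".,;:!?\"'")
        index[token].append((s["event_id"], s["start_sec"]))

# Plain-text search: which events mention a keyword, and where in the recording.
hits = index.get("levy", [])
```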

To further the utility of the CDP-produced corpus, we are creating audio classification models for labeling each sentence with the classified speaker, aligning sentences in the transcript to the provided list of minutes items, re-using the generated keyword-based index for a municipality-level n-gram viewer [41], and more. Such work will enable the creation of datasets such as discussions where only a specific set of councilmembers is present, or discussions regarding specific pieces of legislation (minutes items).

Fig. 3 Examples of analysis made possible through CDP infrastructure. Using the produced transcripts, we can build topic models to tag topics in a single meeting's transcript and track topic trends over time. With multiple CDP instances, we can show how these trends hold (and spread, investigating the topical latency between municipalities) over entire regions. Additionally, we can build models to track the sentiment of discussions regarding specific pieces of legislation as they move through council

Council Data Project enables large-scale quantitative studies by generating standardized municipal governance corpora—including legislative voting records, timestamped transcripts, and full legislative matter attachments (related reports, presentations, amendments, etc.). CDP enables the reproduction of political science research such as studying the effects of gender, ideology, and seniority in council deliberation [24], and studying the effects that adopting information communication technologies have on the civic participation process [12].

By constructing CDP to be as easily deployable as possible, we enable studies of how these behaviors generalize (or act as outliers) in different municipalities and settings.

Effective use of this new source of data motivates research in adapting deep learning techniques to multi-speaker, structured settings. Tasks include identifying speakers, labeling topic and sentiment by speaker to understand political positions, labeling speech by agenda topic, and summarizing public sentiment to guide outreach and communication investments. These problems appear to be within the capabilities of emerging deep learning techniques, but require research attention in formulating the problems and evaluating competing techniques, which in turn requires access to high-quality labeled datasets. Moreover, linking public discourse over social media with formal discourse in public hearings, administrative data collected through municipal service delivery, and geospatial data collected through sensing technologies will be required to meet our goal of a holistic study of the science of cities.

6 Conclusions

We aim to improve the coverage, access, and equity of urban data to advance understanding of city dynamics, unifying a top–down, holistic view of cities as complex systems with a bottom–up, application-oriented view of cities as assemblies of independent subsystems. We aim to combat the disproportionate attention received by the online advertising, face recognition, image labeling, and NLP tasks that dominate the machine learning literature by making high-quality, comprehensive urban datasets available for research. We identify four areas of research, with promising preliminary results, that involve the application of AI in urban contexts: spatiotemporal interpolation of data; unifying fairness and interpretability in the context of emerging regulation of algorithms; accommodating the complex domain models that are necessary to describe cities holistically; and engaging with new sources of data at the intersection of public discourse and policymaking.