PLoS ONE
Machine learning for buildings’ characterization and power-law recovery of urban metrics
DOI 10.1371/journal.pone.0246096 , Volume: 16 , Issue: 1
Article Type: research-article, Article History


Abstract

In this paper we focus on a critical component of the city: its building stock, which holds much of its socio-economic activities. In our case, the lack of a comprehensive database of building features, and its limitation to a surveyed subset, led us to adopt data-driven techniques to extend our knowledge to the near-city scale. Neural networks and random forests are applied to identify how the buildings’ number of floors and construction periods depend on a set of shape features (area, perimeter, and height) along with the annual electricity consumption, relying on surveyed data from the city of Beirut. The predicted results are then compared with established scaling laws of urban forms, which constitutes a further consistency check and validation of our workflow.

Krayem, Yeretzian, Faour, Najem, and Rozenblat: Machine learning for buildings’ characterization and power-law recovery of urban metrics

## Introduction

A key question for planning, designing, and managing urban spaces is how a city’s different components interact and influence its dynamics [1]. It is therefore important to recognize cities as complex systems with emergent properties resulting from far-from-equilibrium dynamics and with energy requirements for self-maintenance [1, 2].

Specifically, the evolution of urban form, or equivalently of the spatial patterns of the city’s infrastructure, is governed by the rules of competitive processes, which manifest themselves in self-similar fractal patterns or scaling laws that govern how city components change with city size [1, 3–7]. At the micro-scale, urban buildings, which are described as “containers of socio-economic activities”, are of particular interest [8]. Comprehensive surveys of buildings detailing their uses, ages, and sizes are essential to support more effective policymaking relating to the sustainable management of cities [9]. For instance, much of the work on the building stock has been driven by the energy sector. In particular, urban buildings account for a high portion of greenhouse gas emissions through electricity consumption [10, 11]. Therefore, identifying building attributes helps in simulating their energy performance, in identifying spatiotemporal patterns to assess the impact of retrofitting strategies that reduce energy consumption, and in adapting buildings to climate change [12–16]. Moreover, tracking the rate of change of cities and the survival of buildings is essential to estimate the distribution and lifecycle of stock material, in order to inform best practices for its management and utilization, the so-called “industrial ecology” [17, 18].

Therefore, generating a building database is important for urban science broadly speaking. Methodologically, it can rely on collecting existing information, as in the property taxation database [19], or on conducting ground surveys. However, such data are sometimes expensive, unavailable, or insufficient. For this reason, taking advantage of new data sources, methods, and tools is a central focus area in urban research. Volunteered Geographic Information (VGI) platforms, which are crowdsourcing tools, are attracting growing interest. Coloring London [20] and Coloring Beirut [21], where residents are encouraged to fill in, correct, and update the buildings database themselves, are two such examples. Moreover, the automated capture and extraction of building attribute data are increasingly facilitated by the development of computational resources, machine learning techniques, and remote sensing [9, 22–24].

In this paper, we apply machine learning algorithms to assist and complement the collection of building data. Neural Networks and Random Forests are built to link the physical character of the buildings to their number of floors and vintage, respectively. We first outline the database we are working on, and then develop the machine learning algorithms for the prediction of the buildings’ attributes. The relation of the results to established urban scaling laws is then illustrated. Finally, we stress the importance of such a methodology in data-scarce environments for promoting transparency, despite the challenges and bureaucratic impediments that stand in the way of forming a national repository.

## Data collection and preprocessing

The city of Beirut is located on the eastern shore of the Mediterranean Sea and had a stock of 17,742 buildings in 2016. The footprint attributes used in this work (area, perimeter, and height) were obtained from the National Center for Scientific Research (CNRS) Lebanon, while additional information on a subset of 7,122 of the 17,742 buildings was surveyed by the Saint-Joseph University (USJ) as part of the LIBRIS program [ANR-LIBRIS project (ANR-09-RISK-006): contribution to seismic hazard assessment in Lebanon. A joint project between the ISTERRE, IPGP, EDYTEM and RESONNANCE laboratories with the AUB, NDU, USJ universities and the CNRS-L. Collaboration under the task 1.3. “Speleoseismicity and the Lebanese endokarst”]. This survey includes each building’s year of construction, type, number of floors, and number of apartments. The construction years were converted into construction periods based on the city’s architectural history, which witnessed five major waves of construction, each with distinctive features [16, 25, 26]. The distribution of the USJ subset according to the period of construction is given in Table 1.

Table 1
Percentage of buildings per period of construction in the dataset.

| Construction period | Label | Percentage of buildings |
|---|---|---|
| Before 1923 | 1 | 1.2% |
| 1924-1940 | 2 | 7.8% |
| 1941-1960 | 3 | 42.1% |
| 1961-1990 | 4 | 39.1% |
| After 1991 | 5 | 9.7% |

Moreover, data on the annual electricity consumption of many buildings were obtained from the national power utility, Electricité du Liban (EDL). Entries with missing fields, incorrect building heights (≤ 2.8 m), or atypical floor heights (≤ 2.8 m or ≥ 4.5 m) were removed from the dataset. It is worth noting that the buildings’ footprints were manually digitized over the entire city of Beirut by CNRS-L using aerial photos at 15 cm resolution and VHR panchromatic satellite images from Pléiades-1A at 70 cm resolution. This is not part of our work but rather a CNRS-L in-house processing step, which was made available to us. The floor height ranges were determined as in [25], where a height of 4.5 m corresponds to a floor of a building constructed before 1923, while a height of 2.8 m mainly corresponds to recent buildings. Residential and mixed building types were kept, while others such as hospitals, places of worship, and schools were removed, which left us with 1,968 buildings of the USJ dataset.

This step was followed by the application of an Isolation Forest (iForest) scheme [27], an outlier detection procedure that is essential to the removal of noisy, incorrect, or aberrant information in the dataset. The outlier removal was based on the following features: the yearly electricity consumption, number of floors, height, perimeter, area, period of construction, and type of the building. Subsequently, 434 buildings were classified as outliers and removed from the dataset, leaving us with a dataset of 1,534 buildings, which we used in what follows. To visualize the outliers, these features were reduced to three dimensions using Principal Component Analysis (PCA). The correlations between each new dimension and the two others are illustrated using the two-dimensional scatter plot matrix shown in Fig 1. The diagonal plots show the univariate distribution of each dimension. The spatial distribution of the buildings used in the development of the predictive algorithms is shown in Fig 2.

Fig 1
Correlation plot of the buildings samples after applying PCA for dimension reduction, with outliers highlighted in blue.
Fig 2
Spatial distribution of the accepted buildings after the data pre-processing.

## Methods

Many machine learning algorithms are available, with different architectures. In this manuscript, we chose three well-known families of algorithms (linear and logistic regression, NN, and RF), which are described in more detail below. Each model with a given architecture learns its parameters from the training set. To evaluate the model’s performance, and thus to choose the best among the candidates, a performance metric is then applied to measure how close the model’s predictions are to the actual data. This metric is normally a measure of error, that is, of how far the predictions are from the actual data, and the algorithm with the best metric value is chosen.

The building’s floor number was shown to depend on the building’s height, area, perimeter, and annual electricity consumption (Table 2). Having established this dependency, the relation of the building’s construction period to the same features was investigated. The selection of these features as independent variables is justified in two ways (Table 2). For the number of floors, we used a correlation analysis based on the Pearson coefficient, which evaluates the bivariate correlation between continuous variables. For the construction period, we used the accuracy score of a logistic regression, which assesses the quality of a multi-class categorical classification. The Pearson coefficient is the ratio of the covariance of the two variables to the product of their standard deviations, $\mathrm{cov}(x, y)/(\sigma_x \sigma_y)$. The accuracy score is given by $\mathrm{accuracy} = \frac{1}{n_{samples}} \sum_{i=1}^{n_{samples}} 1\left(y_{pred,i} = y_{true,i}\right)$, where $1(\cdot)$ is the indicator function, $y_{pred,i}$ is the predicted value of the i-th sample, and $y_{true,i}$ is the corresponding true value.

Table 2
Pearson coefficient and logistic regression’s accuracy score describing, respectively, the correlation and dependency between dependent variables and selected variables for prediction.

| Feature | Pearson coefficient (floor number) | Accuracy score (construction period) |
|---|---|---|
| Electricity consumption | 0.57 | 0.51 |
| Height | 0.95 | 0.59 |
| Area | 0.37 | 0.46 |
| Perimeter | 0.38 | 0.46 |
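As an illustration of the two screening scores reported in Table 2, the following minimal sketch computes a Pearson coefficient and an accuracy score from their definitions. The function names and the toy values are hypothetical and are not taken from the Beirut dataset; the accuracy score is applied here to arbitrary label lists rather than to the output of a fitted logistic regression.

```python
import math


def pearson(x, y):
    """Pearson coefficient: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Hypothetical toy values: building heights (m) and surveyed floor counts.
heights = [9.5, 12.0, 15.5, 21.0, 29.0]
floors = [3, 4, 5, 7, 10]
r_height_floors = pearson(heights, floors)

# Hypothetical construction-period labels (1 to 5) and predictions.
period_true = [1, 2, 3, 3, 5]
period_pred = [1, 3, 3, 3, 4]
period_accuracy = accuracy(period_true, period_pred)
```

In this toy case the height and floor count are almost proportional, so the coefficient is close to 1, which mirrors the strong height dependence seen in Table 2.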

For the prediction of the number of floors, which is an integer value, a multi-layer feedforward (MLF) neural network (NN) was trained for non-linear regression. For the classification of the construction periods, which is a categorical value with labels ranging from 1 to 5 as given in Table 1, NN, random forests (RF), and multiple logistic regression with classes from 1 to 5 were applied. MLF neural networks are the most popular type of NN. Their design is inspired by the brain: networks of simple processing elements, neurons, operate on their local input data and communicate their output to other elements. Each neuron is connected to at least one other neuron, and each connection is assigned a weight coefficient. Training a NN amounts to adjusting these weights in such a way that the calculated outputs of the whole network are as close as possible to the actual ones [28]. RF are an ensemble learning method for classification or regression, which consists of constructing several estimators, or decision trees, at training time and outputting the majority vote of the estimators for class prediction, or their mean prediction for regression [29]. Finally, multiple logistic regression is a classification method that describes the relationship between a nominal-scaled, i.e. categorical, variable and a set of independent variables. It consists of calculating the probabilities of the different possible outcomes of the categorical variable [30].

The number of hidden layers of the NN that outputs the number of floors ranged from 1 to 3, with the corresponding number of neurons ranging from 3 to 8 and the learning rate from 0.001 to 0.1. For the construction period’s logistic regression, we used multiple solvers to guarantee convergence, such as the Newton and BFGS solvers, together with a one-vs-rest (OVR) multi-class strategy, which consists of fitting one classifier per class. Features were selected according to their k-score, which is an inter-rater reliability measure for categorical variables. For the construction period’s NN, the number of hidden layers ranged from 3 to 8, with the corresponding number of neurons varying from 1 to 40. The solvers we used were ADAM, BFGS, and Sigmoid, and a variety of activation functions were applied, such as logistic, tanh, and relu. The learning rate was varied between 0.001 and 0.1. Finally, the number of RF estimators ranged from 10 to 500, with a maximum depth between 3 and 6; the maximum number of features considered when searching for the optimal split was set to auto, and the quality of a split was measured by the Gini impurity and the entropy. To measure the performance of the NN with different architectures, regression metrics such as the mean absolute error (MAE), the mean squared error (MSE), the mean absolute percentage error (MAPE), and the coefficient of determination ($R^2$) were computed. Similarly, performance measures for the classification algorithms were computed, namely the accuracy score and the F1-score. The resulting models were then applied to the test sets to evaluate their performance, and the best performing models were extended to the whole city.
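The regression metrics named above can be written out directly from their standard definitions. The sketch below is illustrative only: the function name `regression_metrics` and the toy floor counts are hypothetical, and the MAPE is expressed in percent on the assumption that no true value is zero.

```python
def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, MAPE (in percent) and R^2 for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    # MAPE assumes every true value is non-zero (floor counts start at 1).
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "mape": mape, "r2": r2}


# Hypothetical surveyed and predicted floor counts.
surveyed = [3, 5, 7, 10]
predicted = [3, 6, 7, 9]
scores = regression_metrics(surveyed, predicted)
```

A lower MAE, MSE, or MAPE and a higher $R^2$ indicate a better fit, which is the sense in which competing architectures are ranked.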

### Floors

The dataset of 1,536 samples was subdivided into training, validation, and test sets containing 859, 369, and 308 samples respectively, which correspond approximately to the 55%, 25%, and 20% splits often recommended in the literature [31]. The features of the training set were normalized so that their values ranged from 0 to 1. The NN’s hyper-parameter tuning was carried out by exploring different numbers of hidden layers, neurons, and learning rates. Additionally, we computed the cumulative distribution of the number of floors P(f) for the 1,536 buildings, and for the combination of that set with the 6,877 buildings whose floors were predicted. We tested whether these distributions can be explained by a power-law, $P(f) = (f/f_{min})^{-\alpha+1}$, where f is the floor number, α is the exponent, and $f_{min}$ is the cutoff of the power-law, or whether a lognormal, whose parameters are given in [32], better explains the distributions. The parameters were determined using the poweRlaw package in R for a discrete data set and bootstrapped, and the models were subsequently compared using the likelihood ratio test.
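To make the power-law step more concrete, the sketch below estimates the exponent with the continuous maximum-likelihood estimator described in [32], $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / f_{min})$, and evaluates the cumulative form $P(f) = (f/f_{min})^{-\alpha+1}$, assuming the decreasing form of the cumulative distribution. This is a simplified stand-in and not the discrete fit performed with the poweRlaw package in the paper; the function names and the synthetic sample are hypothetical.

```python
import math
import random


def ccdf_power_law(f, f_min, alpha):
    """Cumulative form P(f) = (f / f_min) ** (-alpha + 1), valid for f >= f_min."""
    return (f / f_min) ** (-alpha + 1)


def fit_alpha(samples, f_min):
    """Continuous maximum-likelihood estimate of the exponent above f_min."""
    tail = [x for x in samples if x >= f_min]
    return 1.0 + len(tail) / sum(math.log(x / f_min) for x in tail)


def sample_power_law(n, f_min, alpha, seed=0):
    """Draw n synthetic values from a continuous power law by inverse transform."""
    rng = random.Random(seed)
    return [f_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


# Synthetic data only: the exponent and cutoff are chosen for illustration.
synthetic = sample_power_law(20000, f_min=11.0, alpha=5.35)
alpha_hat = fit_alpha(synthetic, f_min=11.0)
```

On such synthetic data the estimate should land close to the exponent used to generate it. For real floor counts, which are integers, a discrete estimator and a data-driven choice of $f_{min}$ are needed, as done in the paper.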

### Period of construction

Table 1 shows that the dataset is highly imbalanced, with only 1.2% of the buildings belonging to the first construction period, compared to 42.1% belonging to the third period. Resampling the data was therefore crucial before proceeding. Since we had a relatively small dataset (1,536 samples), the minority classes of the training set were oversampled to improve the quality of the predictive model. This was achieved using SVMSMOTE [33], which creates synthetic observations of the minority classes, at each iteration of the cross-validation. Different configurations of logistic regression, RF classifiers, and NN classifiers were examined and compared using the accuracy score.
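The key point of resampling inside the cross-validation loop is that only the training fold is rebalanced, while the validation fold keeps its original class proportions. The sketch below illustrates that ordering with plain random oversampling (duplicating minority samples), which is a deliberately simpler technique swapped in for SVMSMOTE; the helper names, the fold scheme, and the toy labels are all hypothetical.

```python
import random
from collections import Counter


def k_folds(n_samples, k):
    """Yield (train_idx, val_idx) pairs, assigning sample i to fold i % k."""
    for fold in range(k):
        val_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, val_idx


def oversample(features, labels, rng):
    """Duplicate random minority samples until every class matches the largest."""
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    new_x, new_y = [], []
    for y, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        new_x.extend(rows + extra)
        new_y.extend([y] * target)
    return new_x, new_y


# Hypothetical imbalanced construction-period labels with one toy feature each.
labels = [3] * 8 + [4] * 7 + [1] * 2 + [5] * 3
features = [[float(i)] for i in range(len(labels))]

rng = random.Random(0)
fold_summaries = []
for train_idx, val_idx in k_folds(len(labels), k=2):
    x_train = [features[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    # Only the training fold is rebalanced; the validation fold is untouched.
    x_bal, y_bal = oversample(x_train, y_train, rng)
    fold_summaries.append((Counter(y_bal), len(val_idx)))
```

Resampling before splitting would let copies of the same building appear in both training and validation folds, which is why the rebalancing is repeated within each fold.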

## Results

### Floors prediction

The input layer’s 4 neurons correspond to the area, perimeter, height, and electricity consumption, while the output layer’s single neuron corresponds to the number of floors. The optimum numbers of hidden layers and of neurons were found to be 1 and 8 respectively, with a learning rate of 0.01 and a sigmoid transfer function. The scores of the applied NN on the test set are given by:

• mean absolute error MAE = 0.54
• mean squared error MSE = 0.73
• mean absolute percentage error MAPE = 7.2%
• R2 = 87.7%
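The selected architecture (4 inputs, one hidden layer of 8 sigmoid neurons, one output) can be sketched as a single forward pass. The weights below are random placeholders rather than the trained values, the input vector is a hypothetical normalized sample, and the linear output neuron is an assumption suited to regression that the text does not state explicitly.

```python
import math
import random


def sigmoid(z):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: sigmoid hidden layer followed by a linear output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


N_INPUTS, N_HIDDEN = 4, 8
rng = random.Random(0)

# Placeholder weights; training would adjust these to minimize the error.
w_hidden = [[rng.uniform(-1, 1) for _ in range(N_INPUTS)]
            for _ in range(N_HIDDEN)]
b_hidden = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
w_out = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
b_out = rng.uniform(-1, 1)

# Hypothetical normalized inputs: area, perimeter, height, electricity use.
sample = [0.2, 0.3, 0.5, 0.1]
floors_estimate = forward(sample, w_hidden, b_hidden, w_out, b_out)
```

With untrained weights the output is meaningless as a floor count; the sketch only shows how the four features flow through the eight hidden neurons to a single value.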

The prediction of the number of floors could now be extended to the rest of the city’s buildings, keeping in mind that buildings with missing input features had to be excluded. This left us with 6,877 buildings whose number of floors was to be predicted. The results, along with the surveyed data from USJ, were mapped as shown in Fig 3.

Fig 3
Distribution of buildings per floors’ number in Beirut administrative area.

The cumulative distribution of the number of floors of the USJ buildings was evaluated. Additionally, the distribution of the USJ buildings combined with the buildings whose floor number was predicted was also computed. They are shown in Figs 4 and 5, respectively. In the first case, the ratio r of the log-likelihoods of the data between the power-law and the lognormal is negative, which means that the lognormal is a better fit, while in the second case r > 0, indicating that the power law with exponent α = 5.35 ± 0.30 is a better fit. This latter parameter is in accordance with the findings of [34], where the exponent of the height distribution of London was shown to be α = 5.26.

Fig 4
P(f) of the 1,536 buildings is shown in blue, the power-law in green, and the lognormal in red. The parameters of the power-law are $f_{min} = 11$ and $\alpha = 12.92$, while those of the lognormal are $f_{min} = 11$, $\mu = 1.55$, and $\sigma = 0.27$.
Fig 5
P(f) for all the buildings is shown in blue. The green line is the power-law with $f_{min} = 14$ and $\alpha = 5.35$, while the red line corresponds to the lognormal with $f_{min} = 6$ and parameters 2.05 and 0.28.

### Period of construction prediction

The exhaustive hyper-parameter tuning and cross-comparison of models converged to a random forest with 100 decision trees and an accuracy score of 56.7%. Its accuracy score on the test set was 48.7%. Using this model, despite its low accuracy, the rest of Beirut’s buildings were tagged with their predicted construction period (Fig 6). The confusion matrix is plotted in Fig 7. It revealed that the algorithm best predicted the third construction period, with an accuracy of 63%, while its worst accuracy was attained for the second construction period, with only 40%.

Fig 6
Spatial distribution of buildings per period of construction in Beirut administrative area.
Fig 7
Distribution of buildings per predicted period of construction in Beirut administrative area, with colorbar depicting the number of floors.
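The per-period accuracies quoted from the confusion matrix can be read as the diagonal entry of each row divided by the row total. The sketch below assumes rows index the true class and columns the predicted class; the matrix values are hypothetical and do not reproduce Fig 7.

```python
def per_class_accuracy(confusion):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]


def overall_accuracy(confusion):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical 3-class confusion matrix: rows are true classes,
# columns are predicted classes.
toy_confusion = [
    [5, 3, 2],
    [1, 8, 1],
    [2, 2, 6],
]
class_scores = per_class_accuracy(toy_confusion)
total_score = overall_accuracy(toy_confusion)
```

Reading the matrix row by row in this way shows which periods are confused with which, something a single overall accuracy figure hides.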

The sensitivity of the pipeline to the chosen preprocessing steps was also tested. Here we present an illustration of the effect of sampling and normalization on RF. It is worth noting that without sampling the model misses all of the buildings from the first construction period, as shown in Table 3.

Table 3
Sample of the pipeline’s sensitivity analysis.

| Construction period | RF (sampling only) | RF (sampling and normalization) |
|---|---|---|
| 1 | 0% | 37.5% |
| 2 | 30.9% | 43.6% |
| 3 | 62.1% | 51.5% |
| 4 | 46.1% | 44.7% |
| 5 | 53.1% | 46.5% |

## Discussion

In the previous sections, we have presented a pipeline that relies on machine learning to complement a database of urban buildings. It should be noted that our starting point was a dataset whose floor distribution is best described by a lognormal. However, the combination of this dataset with the predicted data was shown to follow a power-law, which is in accordance with the one measured for other cities [34]. The fact that the combined distribution follows a power-law and not a lognormal confirms that our model recovers a known property of building heights, and it serves as a further consistency check on the validity of the results. High accuracy in predicting the number of floors was attained, revealing a strong relation with the building height, area, perimeter, and electricity consumption. The quantification of the floors is relevant to energy planning, as it helps simulate the energy demand by representing each floor as a thermal zone. A floor can be further subdivided into subzones for a more accurate building performance simulation [35]. Furthermore, it can help approximate the building population for micro-scale modeling and analysis of human behavior [36].

On the other hand, the period of construction could be predicted with an accuracy score of 48.7% only. More training data may be required. However, the low accuracy may be related to the need for more variables on which the period of construction depends, such as window to wall ratio (WWR), wall thickness and other era-specific descriptors. The construction period gives insight into the materials of the buildings, which can inform materials flows and stocks models for valuation of buildings, as well as the determination of their energy performance and refurbishment techniques [37, 38], and the identification of future waste streams along with recovery strategies [39].

## Conclusion

In conclusion, we developed machine learning algorithms to predict the number of floors and the construction period of buildings given their heights, areas, perimeters, and electricity consumption. We began by cleaning the available dataset and removing unreliable entries and outliers. Then, we evaluated the significance of each input feature for the output to justify its selection. The NN was able to predict the number of floors with high accuracy, with a coefficient of determination $R^2$ of 87.7%. The construction period’s Random Forest was then built after re-sampling the data to overcome its imbalance. Finally, the exponent of the power-law governing the floor distribution was shown to conform with that reported in the literature.

In developing cities like Beirut, available urban data is often underutilized because of its sporadic nature and/or access challenges. With the approach adopted in this paper, the lack of full datasets is compensated by machine learning interventions that can fill in data gaps and offer policy designers a powerful and verifiable new lever. Besides the immediate applications reported above, related to service provision, efficiency, and analytics (e.g. electricity) and to buildings’ characteristics, the presented methodology can be an effective tool to generate wider policy insights despite data irregularities. The two main areas where such an approach can be particularly advantageous are: (1) assessing urban resiliency, risk, and emergency planning. For example, having an accurate distribution of the number of floors and building materials would be critical for a rapid assessment of the human loss in the case of a natural disaster such as an earthquake or large fires; (2) generating demographic and socio-economic insights related to population concentration, census data, which has not been available in Lebanon since 1994, and public services. For example, the distribution of the number of floors could be used to distinguish between residential, commercial, and industrial units or zones within the city and to inform policy experts about electricity rationing strategy (as in the case of Beirut, where power outages are regular but randomly allocated geographically), or to provide information on energy consumption “hot spots”, which could help with predicting surges in electricity demand and the grid reinforcement strategy they require.

## Acknowledgements

We would like to thank Dr. Ali Ahmad for the useful discussion and feedback on the manuscript.

## References

1. McPhearson T, Pickett STA, Grimm NB, Niemelä J, Alberti M, Elmqvist T, et al. Advancing Urban Ecology toward a Science of Cities. BioScience. 2016;66(3):198–212.
2. Batty M. The size, scale, and shape of cities. Science. 2008;319(5864):769–771.
3. Batty M, Carvalho R, Hudson-Smith A, Milton R, Smith D, Steadman P. Scaling and allometry in the building geometries of Greater London. European Physical Journal B. 2008;63(3):303–314.
4. Bettencourt LMA. The origins of scaling in cities. Science. 2013;340(6139):1438–1441.
5. Batty M. Competition in the Built Environment: Scaling Laws for Cities, Neighbourhoods and Buildings. Nexus Network Journal. 2015;17(3):831–850.
6. Steadman P, Evans S, Batty M. Wall area, volume and plan depth in the building stock. Building Research and Information. 2009;37(5-6):455–467.
7. Schläpfer M, Lee J, Bettencourt L. Urban Skylines: building heights and shapes as measures of city size. arXiv preprint arXiv:1512.00946. 2015.
8. Ravetz J. State of the stock-What do we know about existing buildings and their future prospects? Energy Policy. 2008;36(12):4462–4470.
9. Hudson P. Urban Characterisation; Expanding Applications for, and New Approaches to Building Attribute Data Capture. Historic Environment: Policy and Practice. 2018;9(3-4):306–327.
10. Chen Y, Hong T, Luo X, Hooper B. Development of city buildings dataset for urban building energy modeling. Energy and Buildings. 2019;183:252–265.
11. Cerezo C, Reinhart CF. Urban Energy Lifecycle: An Analytical Framework To Evaluate The Embodied Energy Use Of Urban Developments. In: Proceedings of Building Simulation 2013: 13th Conference of International Building Performance Simulation Association; 2013. p. 1280–1287.
12. Davila CC, Reinhart C, Bemis J. Modeling Boston: A workflow for the generation of complete urban building energy demand models from existing urban geospatial datasets. Energy. 2016;117:237–250.
13. Hong T, Chen Y, Lee SH, Piette MA. CityBES: A Web-based Platform to Support City-Scale Building Energy Efficiency. In: 5th International Urban Computing Workshop; 2016.
14. Evans S, Liddiard R, Steadman P. 3DStock: A new kind of three-dimensional model of the building stock of England and Wales, for use in energy analysis. Environment and Planning B: Urban Analytics and City Science. 2017;44(2):227–255.
15. Costanzo V, Yao R, Li X, Liu M, Li B. A multi-layer approach for estimating the energy use intensity on an urban scale. Cities. 2019;95(September):102467.
16. Krayem A, Al Bitar A, Ahmad A, Faour G, Gastellu-Etchegorry JP, Lakkis I, et al. Urban Energy Modeling and Calibration of a Coastal Mediterranean City: The Case of Beirut. Energy and Buildings. 2019;199:223–234.
17. Tanikawa H, Hashimoto S. Urban stock over time: Spatial material stock analysis using 4d-GIS. Building Research and Information. 2009;37(5-6):483–502.
18. Kohler N, Steadman P, Hassler U. Research on the building stock and its applications. Building Research and Information. 2009;37(5-6):449–454.
19. Bruhns H. Property taxation data for nondomestic buildings in England and Wales. Environment and Planning B: Planning and Design. 2000;27(1):33–49. doi: 10.1068/bst6
20. Hudson P, Dennett A, Russell T, Smith D. Colouring London: A Crowdsourcing Platform for Geospatial Data Related to London’s Building Stock; 2019.
21. Coloring Beirut; 2019. Available from: https://www.coloringbeirut.com/.
22. Hu S, Wang L. Automated urban land-use classification with remote sensing. International Journal of Remote Sensing. 2013;34(3):790–803.
23. Meinel G, Hecht R, Herold H. Analyzing building stock using topographic maps and GIS. Building Research and Information. 2009;37(5-6):468–482.
24. Belgiu M, Tomljenovic I, Lampoltshammer TJ, Blaschke T, Höfle B. Ontology-based classification of building types detected from airborne laser scanning data. Remote Sensing. 2014;6(2):1347–1366.
25. Arbid GJ. Practicing modernism in Beirut architecture in Lebanon 1946-1970. Cambridge, Massachusetts: Harvard University; 2002.
26. Saliba R. Beirut 1920-1940 Domestic Architecture Between Tradition and Modernity Paperback. Beirut: The Order of Engineers and Architects; 1998.
27. Liu FT, Ting KM, Zhou ZH. IsolationForest: Isolation Forest. In: 2008 Eighth IEEE International Conference on Data Mining; 2008.
28. Svozil D, Kvasnicka V, Pospichal J. Introduction to multi-layer feed-forward neural networks. Chemometrics and Intelligent Laboratory Systems. 1997;39(1):43–62.
29. Liaw A, Wiener M. Classification and regression by randomForest. R news. 2002;2(3):18–22.
30. Hosmer DW, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods. 1980;9(10):1043–1069.
31. Clark A. The machine learning audit- CRISP-DM Framework. ISACA Journal. 2018;1:42–47.
32. Clauset A, Shalizi CR, Newman MEJ. Power-Law Distributions in Empirical Data. SIAM Review. 2009;51(4):661–703.
33. Nguyen HM, Cooper EW, Kamei K. Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms. 2011;3(1):4.
34. Batty M, Carvalho R, Hudson-Smith A, Milton R, Smith D, Steadman P. Scaling and allometry in the building geometries of Greater London. European Physical Journal B. 2008;63(3):303–314.
35. Dogan T, Reinhart C. Shoeboxer: An algorithm for abstracted rapid multi-zone urban building energy model generation and simulation. Energy and Buildings. 2017;140:140–153.
36. Greger K. Spatio-temporal building population estimation for highly urbanized areas using GIS. Transactions in GIS. 2015;19(1):129–150.
37. Aksoezen M, Daniel M, Hassler U, Kohler N. Building age as an indicator for energy consumption. Energy and Buildings. 2015;87:74–86.
38. Ng ST, Gong W, Loveday DL. Sustainable refurbishment methods for uplifting the energy performance of high-rise residential buildings in Hong Kong. In: Procedia Engineering. vol. 85; 2014. p. 385–392.
39. Heinrich MA, Lang W. Capture and Control of Material Flows and Stocks in Urban Residential Buildings. In: IOP Conference Series: Earth and Environmental Science. vol. 225; 2019.

8 Jul 2020

PONE-D-20-10968

Machine learning for buildings' characterization and power-law recovery of urban metrics

PLOS ONE

Dear Dr. Najem,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The paper is very exciting. However, as reviewer 1 says clearly: the manuscript in its current form is not suitable for publication in an interdisciplinary journal like PLOS One, as it is currently located in a space where it has not enough detail for subject experts (e.g. what kind of NN model did you use?) and is not explanatory enough for non-experts (e.g. what does it mean if the results fit one distribution better than another?). Both reviewers give you insights to improve it and to bring it more into a form that is suitable for this journal, as it is in its core very interesting work about which one would like to know more details.

Please submit your revised manuscript by Aug 21 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

• A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
• A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
• An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Celine Rozenblat

PLOS ONE


Journal requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. There are a number of broken figure references, e.g. line 69, 134.

Please ensure these are fixed in the revised version of the manuscript.

In addition, please update your data availability statement to give a full list of data sources and URL links or contact details that future researchers can use to access the data.

3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide

4. Please include a separate caption for each figure in your manuscript.


Reviewer's Responses to Questions

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Reviewer #1: This work estimates additional properties of buildings in Beirut, Lebanon. On the basis of a smaller data set, the authors use methods from machine learning to estimate the number of floors and year of construction. The resulting data - given a certain validity of the method - provides novel insights into the building stock of the city, which is crucial in developing cities such as Beirut. It might be a valuable tool for a resilient future development in a place where no recent census data exists, or (building) information is scarce. I think this work is of interest for a wide community of scholars, as it shows some of the potential machine learning methods have when data accessibility and availability is limited, as it is often the case in poorer countries.

Although the manuscript appeals to me, I think a few clarifications and improvements are necessary. It seems to me, that the data cleaning process is a little too stringent, as more than two thirds of the buildings get removed. Does the USJ data set not represent a good sample of the building stock in Beirut? I think it would improve this part, if a little discussion about the data would be there, or a table/figure that shows the distribution of the two used parameters for the full data set. Especially, since so many buildings are removed already in this first step.

I always have a hard time reading 3D plots in manuscripts. I think it might be more informative if Figure 1 were similar to a correlation plot (sometimes called a 'scatter plot matrix'). Also, I would find it informative to see what the principal components look like, as they can tell a lot about what the data look like in general.

I think it would be good to make the methods section a little more understandable for non-experts, as PLOS One has a readership from across different fields. For example, it would be great if the specific choices for the different algorithms were explained in a little more detail. Also, a non-expert might not immediately know what the different scores mean and what information they provide about the precision of the methods. The same goes for why the authors chose the specific subset sizes to train, validate, and test the model. I believe that the whole Methods section would benefit from such additional explanations.

This extends into the results section, where I would love to see the different results from the exploration of different models, number of layers, non-normalized vs. normalized data, and so on.

In general, it would be interesting to see how sensitive the pipeline is to changes and what the different results were during the exploration step. As this might be crucial if other people would want to use the same method.

The authors compare the distribution of floors in Beirut to a power-law and a lognormal distribution. What does it mean that they follow one more than the other? What additional insights does one gain from this?

As I have mentioned before, the work is very appealing to me. However, I think the manuscript in its current form is not suitable for publication in an interdisciplinary journal like PLOS One, as it currently does not have enough detail for subject experts (e.g. what kind of NN model did you use?) and is not explanatory enough for non-experts (e.g. what does it mean if the results fit one distribution better than another?). I advocate for some major revisions to bring it into a form that is suitable for this journal, as it is at its core very interesting work about which I want to know more detail.

Reviewer #2: “Machine learning for buildings’ characterization and power-law recovery of urban metrics”

referee-report

The authors analyze building data of Beirut in Lebanon with the purpose of predicting building age and the number of floors. Specifically, a somewhat small subset of surveyed buildings is considered. As “independent variables”, height, area, perimeter, and electricity consumption are used and fed into the neural networks. The work achieves good performance for the number of floors and modest performance for the construction period. The authors complement this with an analysis of the distribution of the number of floors and find somewhat large exponents beyond 5.

The paper is well written and the approach can be of importance for similar applications in other cities and countries. The need for building data is well justified in the introduction. I appreciate that the manuscript is short.

I have a few issues that the authors should address:

- Please clarify if the city of Beirut or the metro-region is considered. I guess it is the former. Please also add an approximate population figure. Dividing the population by the number of buildings gives a rough idea about population density/floors.

- In my opinion, a 3D representation never works adequately in 2D. Please develop an alternative representation.

- The power-law exponent is very large (also in the publication by Batty). The problem is that such steep power-law distributions lose what makes power-laws special and they become similar to other distributions.

- The prediction of period of construction could be improved by including information on location, e.g. distance from center.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

26 Aug 2020

We thank the Editor for seeing merit in our work and for sending us the Referees’ reports. These provided valuable input and recommendations. We also appreciate that both Referees saw value in our work “I think this work is of interest for a wide community of scholars,” (quoting the first Referee) and “The paper is well written and the approach can be of importance for similar applications in other cities and countries.” as the second Referee states. We particularly appreciate the Referees’ helpful suggestions on technical points and details of presentation, which will ensure our paper is more easily accessible to a broad audience.

As we were preparing for the reply, a devastating blast hit Beirut, and took us all by surprise. We see the urgency of putting this work out to inform the scientific community about the buildings’ details, which are part of all the modeling initiatives calling for the dissemination of such data: be it the shock wave simulation in this complex urban environment, damage assessment, retrograding and buildings’ preservation. We also provide a link to our dataset: https://zenodo.org/record/4001720#.X0ZaeZMzblw

Concerning the detailed comments, we address them below and list the corresponding changes in the manuscript. Original comments of the Referees are in blue and changes in the manuscript are in red, both here and in the revised manuscript.

With these changes and clarification, we trust our manuscript is now suitable for publication in PLOS One.

Referee 1: This work estimates additional properties of buildings in Beirut, Lebanon. On the basis of a smaller data set, the authors use methods from machine learning to estimate the number of floors and year of construction. The resulting data - given a certain validity of the method - provides novel insights into the building stock of the city, which is crucial in developing cities such as Beirut. It might be a valuable tool for a resilient future development in a place where no recent census data exists, or (building) information is scarce. I think this work is of interest for a wide community of scholars, as it shows some of the potential machine learning methods have when data accessibility and availability is limited, as it is often the case in poorer countries. Although the manuscript appeals to me, I think a few clarifications and improvements are necessary. It seems to me, that the data cleaning process is a little too stringent, as more than two thirds of the buildings get removed. Does the USJ data set not represent a good sample of the building stock in Beirut? I think it would improve this part, if a little discussion about the data would be there, or a table/figure that shows the distribution of the two used parameters for the full data set. Especially, since so many buildings are removed already in this first step.

Reply: We thank the Referee for characterizing our work as novel, crucial, and valuable. We also acknowledge that their reservations about the current manuscript’s methodology section, given the readership of PLOS One, are very valid.

Indeed, as the Referee notes, in the data cleaning process more than two thirds of the buildings are filtered out. The USJ data set is a very representative one; however, it did not include the building’s height as a descriptor. Buildings’ footprints were manually digitized over the entire city of Beirut by CNRS-L using aerial photos at 15 cm resolution and VHR pan-chromatic satellite images from Pleiades-1A at 70 cm resolution. This is not part of our work but rather a CNRS-L in-house processing step, which was made available to us. A large number of these buildings, when overlaid on the USJ dataset, turned out to have an atypical floor height (≤ 2.8 m or ≥ 4.5 m) and thus were removed from the dataset. This led to the filtering of around two-thirds of the buildings.
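The filtering rule described in this reply can be illustrated with a minimal sketch. It assumes the mean floor height is computed as building height divided by the surveyed number of floors, and it reads the upper bound as 4.5 m; the `Building` class, its field names, and the sample values are hypothetical and do not come from the dataset.

```python
from dataclasses import dataclass


@dataclass
class Building:
    height_m: float  # building height from the footprint layer (hypothetical field name)
    floors: int      # surveyed number of floors (hypothetical field name)


def has_typical_floor_height(b, low=2.8, high=4.5):
    """Return True when the mean floor height lies strictly between the two bounds."""
    if b.floors <= 0:
        return False
    return low < b.height_m / b.floors < high


# Illustrative values only: mean floor heights of 3.0, 2.0, 5.0 and 3.2 m.
buildings = [Building(30.0, 10), Building(12.0, 6), Building(20.0, 4), Building(9.6, 3)]
kept = [b for b in buildings if has_typical_floor_height(b)]
```

In this toy list the second and third buildings are dropped because their mean floor heights fall outside the accepted range.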

Action: This sentence was added to the text: It is worth noting that buildings’ footprints were manually digitized over the entire city of Beirut by CNRS-L using aerial photos at 15 cm resolution and VHR pan-chromatic satellite images from Pleiades-1A at 70 cm resolution. This is not part of our work but rather a CNRS-L in-house processing step, which was made available to us.

Referee 1: I always have a hard time reading 3D plots in manuscripts. I think it might be more informative if Figure 1 were similar to a correlation plot (sometimes called a ’scatter plot matrix’). Also, I would find it informative to see what the principal components look like, as they can tell a lot about what the data look like in general.

Reply: We thank the Referee for his/her suggestion to use a different representation for the PCA. We added the below scatter plot matrix in addition to the 3D plot in our original document as they carry the same information in different dimensions.

Action: The below figure was added along with the sentence “To visualize the outliers, the six features were reduced to three using Principal Component Analysis (PCA), which allowed for a 3D representation of the samples as a function of the new dimensions, as shown in Fig. 1, as well as the corresponding two-dimensional scatter plot matrix shown in Fig. 2.”

Fig. 2. Correlation plot of the buildings samples after applying PCA for dimension reduction, with outliers highlighted in brown.

Referee 1: I think it would be good to make the methods section a little more understandable for non-experts, as PLOS One has a readership from across different fields. For example, it would be great if the specific choices for the different algorithms were explained in a little more detail. Also, a non-expert might not immediately know what the different scores mean and what information they provide about the precision of the methods. The same goes for why the authors chose the specific subset sizes to train, validate, and test the model. I believe that the whole Methods section would benefit from such additional explanations. This extends into the results section, where I would love to see the different results from the exploration of different models, number of layers, non-normalized vs. normalized data, and so on.

Reply: We agree with the Referee that the algorithms and the scores need to be defined properly to justify their use. As for the sizes of the training, validation, and test sets these correspond to the 55%, 25%, and 20% respectively often recommended in the literature. For the results section, we presented the optimal combination of parameters for the different models which minimized the error measures.

Action: The following text in red was added to the beginning of the Methods section: “The selection of the buildings’ features, that is the independent variables, can be justified by a correlation analysis achieved with the Pearson coefficient for the floors’ number, which is used to evaluate bivariate correlation between continuous variables, and a dependency strength achieved with the logistic regression’s accuracy score for the construction period, which is used to assess the accuracy of multi-label categorical classification as seen in Table 2. The Pearson coefficient is defined as the ratio of the covariance between the variables to the product of their respective standard deviations, $\mathrm{cov}(x, y)/(\sigma_x \sigma_y)$. The accuracy score is given by $\mathrm{accuracy} = \frac{1}{n_{\mathrm{samples}}} \sum_{i=1}^{n_{\mathrm{samples}}} \mathbf{1}(y_{\mathrm{pred},i} = y_{\mathrm{true},i})$, where $y_{\mathrm{pred},i}$ is the predicted value of the $i$-th sample and $y_{\mathrm{true},i}$ is the corresponding true value.”
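The two scores defined in the quoted passage can be written out directly. The following is a minimal sketch in plain Python, not the authors' code; the function names and the sample values are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson coefficient: cov(x, y) divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)


def accuracy(y_pred, y_true):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


# Illustrative values: heights exactly proportional to floor counts.
heights = [9.0, 12.0, 15.0, 30.0]
floors = [3, 4, 5, 10]
r = pearson(heights, floors)

# Three of four construction-period labels predicted correctly.
acc = accuracy([1, 2, 3, 3], [1, 2, 3, 4])
```

A coefficient close to 1 indicates a strong linear relationship between a feature and the number of floors, while the accuracy score simply counts the share of correctly classified construction periods.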

We also justify the split in the sizes of the training, validation, and test sets by adding the following: “The dataset of 1,536 samples was subdivided into training, validation, and test sets each containing respectively 859, 369, and 308 samples, which correspond respectively to the 55%, 25%, and 20% splits, often recommended in the literature.”
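As a worked check of the arithmetic, the quoted counts are consistent with a two-stage split: hold out 20% of the 1,536 samples for testing, then 30% of the remainder for validation, rounding each held-out size up. This is an assumption about how the numbers could arise, not something the letter states; the function name, the fractions, and the seed below are hypothetical.

```python
import math
import random


def two_stage_split(items, test_frac=0.2, val_frac=0.3, seed=0):
    """Shuffle, hold out a test set, then hold out a validation set from the rest."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = math.ceil(test_frac * len(items))
    test_set, rest = items[:n_test], items[n_test:]
    n_val = math.ceil(val_frac * len(rest))
    val_set, train_set = rest[:n_val], rest[n_val:]
    return train_set, val_set, test_set


train_set, val_set, test_set = two_stage_split(range(1536))
sizes = (len(train_set), len(val_set), len(test_set))
# Actual share of each subset in percent, rounded to one decimal place.
shares = tuple(round(100 * s / 1536, 1) for s in sizes)
```

Under these assumptions the subsets contain 859, 369, and 308 samples, that is roughly 55.9%, 24.0%, and 20.1% of the data, close to the nominal 55/25/20 proportions.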

The following was added to give all the details about the algorithms used: “MLF neural networks are the most popular type of NN. Their design is motivated by a real brain: networks of simple processing elements, neurons, operating on their local input data and communicating the output to other elements. Each neuron is connected to at least one other neuron, and each connection is evaluated by a weight coefficient. The training of a NN consists of adjusting these weights in such a way that the calculated outputs of the whole network are as close as possible to the actual ones [37]. RF are an ensemble learning method for classification or regression, which consists of constructing several estimators or decision trees at training time and outputting the majority vote of the estimators for class prediction, or their mean prediction for regression [36]. Finally, multiple logistic regression is a classification method that describes the relationship between a nominal-scaled, i.e. categorical, variable and a set of independent variables. It consists of calculating the probabilities of the different possible outcomes of the categorical variable [38]. The number of hidden layers of the NN, which outputs the number of floors, ranged from 1 to 3, with the corresponding number of neurons ranging from 3 to 8, and the learning rate from 0.001 to 0.1. As for the construction period’s logistic regression algorithm, we used multiple solvers to guarantee convergence, such as the Newton and BFGS solvers, and a one-vs-the-rest (OVR) multi-class strategy, which consists of fitting one classifier per class; finally, features were selected according to their k-score, which is an inter-reliability measure for categorical variables. As for its NN, the hidden layers ranged from 3 to 8, with the corresponding number of neurons varying from 1 to 40. The solvers we used were ADAM, BFGS, and Sigmoid, and a variety of activation functions were applied such as logistic, tanh, and relu. The learning rate was varied between 0.001 and 0.1. Finally, the RF estimators ranged from 10 to 500, with maximum depth ranging between 3 and 6; the maximum features used when considering the optimal split were defined using auto, and the quality of the split was measured by the Gini impurity and the entropy.”
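To make the search over random-forest settings concrete, the sketch below enumerates a grid of candidate configurations. Only the end points of the ranges (10 to 500 estimators, depth 3 to 6) and the two split criteria come from the passage above; the intermediate values and the names `rf_grid` and `expand` are assumptions for illustration.

```python
from itertools import product

# Hypothetical grid: only the range end points and the criteria come from the letter.
rf_grid = {
    "n_estimators": [10, 50, 100, 500],  # intermediate values are assumed
    "max_depth": [3, 4, 5, 6],
    "criterion": ["gini", "entropy"],
}


def expand(grid):
    """Return every combination of the grid values as a list of parameter dictionaries."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand(rf_grid)
```

Each candidate would then be trained on the training set and scored on the validation set, and the combination with the lowest error retained.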

Referee 1: In general, it would be interesting to see how sensitive the pipeline is to changes and what the different results were during the exploration step. As this might be crucial if other people would want to use the same method.

Reply: We thank the Referee for asking this important question on the sensitivity of the pipeline. We reassert that the reported models outperformed all the others, and here we show a sample of how sensitive the prediction of the construction period is to normalization and sampling.

TABLE I. Sample of the pipeline’s sensitivity analysis.

| Construction period | RF with sampling without normalization | RF with sampling and normalization |
| --- | --- | --- |
| 1 | 0% | 37.5% |
| 2 | 30.9% | 43.6% |
| 3 | 62.1% | 51.5% |
| 4 | 46.1% | 44.7% |
| 5 | 53.1% | 46.5% |

Action: The above table was added to the text. The sensitivity of the pipeline to our desired methodology was also tested. Here we present an illustration of the effect of sampling and normalization on RF. It is worth noting that without sampling the model misses all of the buildings from the first construction period as shown in Table 3.
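The letter does not specify which normalization or sampling scheme was used. The sketch below therefore assumes min-max scaling of each feature and random oversampling of under-represented construction periods, purely to illustrate the two preprocessing steps being compared; the function names and sample values are hypothetical.

```python
import random
from collections import Counter


def min_max_normalize(rows):
    """Rescale every feature column to the [0, 1] interval (assumed normalization)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def oversample(rows, labels, seed=0):
    """Duplicate random samples of minority classes until all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, lab in zip(rows, labels) if lab == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Illustrative features (area, height) with one building from period 1 and three from period 2.
features = [[100.0, 10.0], [200.0, 20.0], [300.0, 40.0], [400.0, 30.0]]
periods = [1, 2, 2, 2]
normalized = min_max_normalize(features)
balanced_rows, balanced_periods = oversample(normalized, periods)
```

Balancing the classes in this way gives a classifier more examples of a rare construction period, which is one plausible reason for the change in the first row of the table.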

Referee 1: The authors compare the distribution of floors in Beirut to a power-law and a lognormal distribution. What does it mean that they follow one more than the other? What additional insights does one gain from this?

Reply/Action: We added the following sentences in red to clarify the meaning of this finding: The fact that the distribution of predicted buildings’ heights follows a power-law and not a log-normal is a confirmation that our model recovers known properties about the heights. This is further a consistency check on the validity of the results. These distributions are namely indicators of the underlying dynamical processes that generate them: power-laws result from multiplicative processes while log-normal from additive log-Gaussian ones [28].

Referee 1: As I have mentioned before, the work is very appealing to me. However, I think the manuscript in its current form is not suitable for publication in an interdisciplinary journal like PLOS One, as it currently does not have enough detail for subject experts (e.g. what kind of NN model did you use?) and is not explanatory enough for non-experts (e.g. what does it mean if the results fit one distribution better than another?). I advocate for some major revisions to bring it into a form that is suitable for this journal, as it is at its core very interesting work about which I want to know more detail.

Reply: We thank the Referee for stressing what we need to direct our attention to in this revised version of the manuscript. These points have been addressed in detail above.

Referee 2: Please clarify if the city of Beirut or the metro-region is considered. I guess it is the former. Please also add an approximate population figure. Dividing the population by the number of buildings gives a rough idea about population density/floors.

Reply: As the Referee correctly points out, our study area is the city of Beirut. As the first Referee states, one of the challenges of studying a city like Beirut is the absence of census data. The latest was carried out in 1994. Therefore, any population density estimation is fraught with errors. We argue that surveying

Action: This was added to clarify the area of our study: “Beirut, the city, is located on the eastern shore of the Mediterranean sea with a stock of 17,742 buildings (in 2016).”

Referee 2: In my opinion, a 3D representation never works adequately in 2D. Please develop an alternative representation.

Reply: Both Referees agree about the representation of the PCA in 3D. The point the second Referee raises was addressed by adding the 2D scatter plot matrix shown in Figure 2 in this revised version of the manuscript.

Referee 2: The power-law exponent is very large (also in the publication by Batty). The problem is that such steep power-law distributions lose what makes power-laws special and they become similar to other distributions.

Reply: We agree with the Referee that the value of the power-law exponent is large. However, it could still be differentiated from a lognormal distribution through the log-likelihood ratio.
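The log-likelihood ratio mentioned in this reply can be illustrated with a simplified sketch. It treats the data as continuous (floor counts are discrete), fits the power-law exponent by maximum likelihood above a chosen `x_min`, and evaluates a lognormal whose parameters are the sample mean and standard deviation of the logarithms, renormalized above `x_min`. These simplifications, the synthetic data, and all function names are assumptions; no significance test is included.

```python
import math
import random


def fit_power_law(xs, x_min):
    """Maximum-likelihood exponent of a continuous power law on x >= x_min."""
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)


def loglik_power_law(xs, x_min, alpha):
    """Log-likelihood under p(x) = ((alpha - 1) / x_min) * (x / x_min) ** (-alpha)."""
    const = math.log((alpha - 1.0) / x_min)
    return sum(const - alpha * math.log(x / x_min) for x in xs)


def loglik_lognormal(xs, x_min):
    """Log-likelihood under a lognormal renormalized to the tail x >= x_min."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    # Probability mass of the lognormal above x_min.
    tail = 0.5 * math.erfc((math.log(x_min) - mu) / (sigma * math.sqrt(2.0)))
    const = -math.log(sigma * math.sqrt(2.0 * math.pi)) - math.log(tail)
    return sum(const - v - (v - mu) ** 2 / (2.0 * sigma ** 2) for v in logs)


# Synthetic power-law sample drawn by inverse-transform sampling (illustrative only).
rng = random.Random(42)
true_alpha, x_min = 3.5, 1.0
sample = [x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0)) for _ in range(5000)]

alpha_hat = fit_power_law(sample, x_min)
# Positive ratio: the power law describes the sample better than the lognormal.
ratio = loglik_power_law(sample, x_min, alpha_hat) - loglik_lognormal(sample, x_min)
```

A positive ratio favors the power law and a negative one the lognormal; a full analysis would also test whether the sign of the ratio is statistically significant.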

Referee 2: The prediction of period of construction could be improved by including information on location, e.g. distance from center.

Reply: We agree with the Referee that the distance from the center could be a contributing factor in the prediction of the year of construction. It is worth noting that we started our experimentation with the latitude and longitude as independent variables, which proved to be far less significant than the ones we kept. We also looked at the correlation between buildings’ year of construction and their relative distance from the center. This we believe to be caused by the fact that our dataset is already not spatially very extended from the center and sparse when it comes to the distribution of years of construction. However, the point the Referee makes is a very important one and would give a lead into the historical evolution of the city.


26 Nov 2020

PONE-D-20-10968R1

Machine learning for buildings' characterization and

power-law recovery of urban metrics

PLOS ONE

Dear Dr. Najem,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

Thanks for having revised the article which is improved now. Please address the requests of reviewer 1...

==============================

Please submit your revised manuscript by Jan 10 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.


We look forward to receiving your revised manuscript.

Kind regards,

Celine Rozenblat

PLOS ONE


Reviewer's Responses to Questions

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: I thank the authors for taking my input into account. The manuscript has largely been improved in my opinion. However, there are still a few minor points I would like the authors to address:

1) The diagonal of the scatter plot matrix could be used to show the distribution of the two classes. It would add valuable information to the figure. An example of such a plot can be found at, for example,

2) Since all information is contained in figure 2, I would recommend removing figure 1, as it does not provide any useful additional information. But this I leave to the authors.

3) It would additionally be very informative to have a map similar to figures 3 and 4 with the buildings that are actually used in the analysis.

4) Please provide a few references to the added part in lines 136-137 to justify the percentages used.

5) In line 202 you mention that you recover known properties about heights.

6) I apologize for reiterating this point, but in the case of buildings, what does it mean that the underlying processes are either multiplicative or additive? I personally have no intuition what that means in terms of building

7) I appreciate that you added the section from line 100 and onward. However, it still does not clarify why MLF-NNs are a good choice for the analysis you did. I'm no expert in machine learning, so please make this point a little clearer for people like me.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: No

Reviewer #2: No


2 Dec 2020

We thank the Editor for seeing merit in our work and for sending us the Referees’ reports. We also thank the Referees for their positive comments on the revised version of the manuscript, which has “largely been improved in my opinion,” quoting the first Reviewer, while “All comments have been addressed,” as the second Reviewer states.

Concerning the detailed comments, we address them below and list the corresponding changes in the manuscript. Original comments of the first Referee are in blue and changes in the manuscript are in red, both here and in the revised manuscript.

With these changes and clarifications, we trust our manuscript is now suitable for publication in PLOS ONE.

Referee 1: The diagonal of the scatter plot matrix could be used to show the distribution of the two classes. It would add valuable information to the figure. An example of such a plot can be found at, for example, https://seaborn.pydata.org/examples/scatterplotmat

Reply: We agree with the Referee and we thank him/her for the reference he/she provided us with. We added the below scatter plot matrix.

Action: The below figure was added along with the sentence “To visualize the outliers, the six features were reduced to three using Principal Component Analysis (PCA). The correlations between each new dimension and the two others were then illustrated using a two-dimensional scatter plot matrix shown in Fig. 1. The diagonal plots show the univariate distribution of each dimension.”

Correlation plot of the buildings samples after applying PCA for dimension reduction, with outliers highlighted in blue.

Referee 1: It would, additionally, be very informative to have a map similar to figures 3 and 4 with the buildings that are actually used in the analysis.

Reply: Indeed, we agree that the suggested map would inform the reader about the buildings used in the analysis.

Action: The below figure was added along with the sentence “To visualize the outliers, the six features were reduced to three using Principal Component Analysis (PCA). The correlations between each new dimension and the two others were then illustrated using a two-dimensional scatter plot matrix shown in Fig. 1. The diagonal plots show the univariate distribution of each dimension. The spatial distribution of the buildings used in the development of the predictive algorithms is shown in Fig. 2.”

[Figure: map with legend entries “Excluded Buildings” and “Buildings Used”; scale bar from 0 to 2 km.] Spatial distribution of the accepted buildings after the data pre-processing.

Referee 1: Since all information is contained in figure 2, I would recommend removing figure 1, as it does not provide any useful additional information. But this I leave to the authors.

Reply/Action: We agree with the Referee that the information shown in the two figures is redundant, especially after updating the scatter plot. The figure illustrating the 3D correlations between PCA dimensions was removed.

Referee 1: Please provide a few references to the added part in lines 136-137 to justify the percentages used.

Reply: We thank the Referee for highlighting the need for references. The paper of A. Clark entitled “The machine learning

Action: The reference was added along with the sentence “The dataset of 1,536 samples was subdivided into training, validation, and test sets each containing respectively 859, 369, and 308 samples, which correspond respectively to the 55%, 25%, and 20% splits, often recommended in the literature [31].”
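The three-way split quoted above can be illustrated with a minimal sketch. Only the sample counts (1,536 total; 859, 369, and 308 per subset) come from the text; the function name `split_indices`, the random shuffle, and the seed are assumptions for illustration, since the authors' actual splitting procedure is not described here. The counts work out to roughly 56%, 24%, and 20% of the dataset.

```python
import random


def split_indices(n_samples, n_train, n_val, seed=0):
    """Shuffle sample indices and cut them into three disjoint subsets.

    Hypothetical helper: the shuffle and the seed are illustrative choices,
    not the procedure used in the manuscript.
    """
    if n_train + n_val > n_samples:
        raise ValueError("train + validation sizes exceed the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


# Sample counts quoted in the response letter.
train_idx, val_idx, test_idx = split_indices(1536, 859, 369)

for name, part in (("train", train_idx), ("validation", val_idx), ("test", test_idx)):
    print(f"{name}: {len(part)} samples ({100 * len(part) / 1536:.1f}% of the dataset)")
```

Fixing the seed keeps the subsets reproducible, so the test set stays untouched across repeated training runs.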

Referee 1: In line 202 you mention that you recover known properties about heights. Please add a sentence what these properties are or add references.

Reply: We thank the Referee for asking for clarification. Indeed the sentence is not clear.

Action: We added the following to the text: The fact that the distribution of predicted buildings’ heights follows a power-law and not a log-normal is a confirmation that our model recovers known properties about the heights; namely that they follow a power-law and not a log-normal distribution.
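One way to make the power-law versus log-normal comparison concrete is to fit both distributions by maximum likelihood and compare their log-likelihoods. The sketch below is an illustration under stated assumptions, not the procedure used in the manuscript: the data are synthetic (drawn from a power law with a known exponent, not the Beirut heights), the lower cutoff `x_min` is taken as known, and the helper names `fit_power_law` and `fit_log_normal` are hypothetical. A rigorous analysis would also estimate `x_min` and apply a likelihood-ratio test.

```python
import math
import random


def fit_power_law(xs, x_min):
    """MLE for alpha in p(x) = ((alpha - 1) / x_min) * (x / x_min) ** (-alpha), x >= x_min."""
    logs = [math.log(x / x_min) for x in xs]
    n = len(xs)
    alpha = 1.0 + n / sum(logs)
    loglik = n * math.log((alpha - 1.0) / x_min) - alpha * sum(logs)
    return alpha, loglik


def fit_log_normal(xs):
    """MLE for the log-normal parameters (mu, sigma) and the resulting log-likelihood."""
    logs = [math.log(x) for x in xs]
    n = len(xs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    loglik = sum(
        -v - math.log(sigma) - 0.5 * math.log(2 * math.pi) - (v - mu) ** 2 / (2 * sigma ** 2)
        for v in logs
    )
    return mu, sigma, loglik


# Synthetic stand-in data: inverse-transform sampling from a power law (assumed values).
rng = random.Random(42)
x_min, true_alpha = 1.0, 2.5
heights = [x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0)) for _ in range(2000)]

alpha_hat, ll_power = fit_power_law(heights, x_min)
mu_hat, sigma_hat, ll_lognormal = fit_log_normal(heights)
print(f"estimated alpha = {alpha_hat:.2f}")
print(f"log-likelihood difference (power law - log-normal) = {ll_power - ll_lognormal:.1f}")
```

A positive difference favours the power law for the given sample; on real data the sign and size of the difference depend strongly on the choice of `x_min`.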

Referee 1: I apologize reiterating this point again, but in the case of buildings, what does it mean that the underlying processes are either multiplicative or additive? I personally have no intuition what that means in terms of building heights. Please clarify further.

Reply: We thank the Referee for his/her care for clarity. In terms of distributions, power laws arise mathematically when the underlying process is multiplicative. This reference includes the mathematical details of our statement: Mitzenmacher, Michael. “A brief history of generative models for power law and lognormal distributions.” Internet mathematics 1.2 (2004): 226-251.

Action: Rereading the statement we made about the multiplicative processes, we see that it has no relevance to the flow of ideas and thus decided to remove it.

Referee 1: I appreciate that you added the section from line 100 and onward. However, it does still not clarify why MLF-NNs are a good choice for the analysis you did. I’m no expert in machine learning, so please make this point a little clearer for people like me.

Reply: Many machine learning algorithms are available with different architectures. In our manuscript, we chose three well-known algorithms (linear regression, NN, and RF). Each model with a given architecture learns its parameters based on the training set. After that, to evaluate the model’s performance, and thus to choose the best among them, a performance metric is applied to compare how close the actual data is to the model’s prediction. This metric is normally a measure of error, or how far the predictions are from the actual data, and thus the algorithm with the best metric value is chosen. After running the algorithms with our dataset, the NN algorithm described in the manuscript performed better than all the others. Therefore, it was considered the best choice for our analysis.

Action: The following sentence was added at the beginning of the Methods section: Many machine learning algorithms are available with different architectures. In our manuscript, we chose three well-known algorithms (linear and logistic regression, NN, and RF) that are described in more detail below. Each model with a given architecture learns its parameters based on the training set. After that, to evaluate the model’s performance, and thus to choose the best among them, a performance metric is applied to compare how close the actual data is to the model’s prediction. This metric is normally a measure of error, or how far the predictions are from the actual data, and thus the algorithm with the best metric value is chosen.
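The selection procedure described above (fit each candidate on the training set, score it with an error metric on held-out data, keep the candidate with the lowest error) can be sketched as follows. The two toy models and the data are assumptions made for illustration: they stand in for the linear regression, NN, and RF models of the manuscript, which are not reproduced here, and mean squared error is just one possible metric.

```python
from statistics import mean


def mean_squared_error(y_true, y_pred):
    """Average squared distance between the actual values and the predictions."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def fit_constant(xs, ys):
    """Toy model 1: always predict the mean of the training targets."""
    c = mean(ys)
    return lambda x: c


def fit_line(xs, ys):
    """Toy model 2: one-dimensional least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def select_best(candidates, train, val):
    """Fit every candidate on the training set and rank them by validation error."""
    x_tr, y_tr = train
    x_val, y_val = val
    scores = {}
    for name, fit in candidates.items():
        model = fit(x_tr, y_tr)
        scores[name] = mean_squared_error(y_val, [model(x) for x in x_val])
    best = min(scores, key=scores.get)  # lowest error wins
    return best, scores


# Made-up data following roughly y = 3x + 1 (illustrative only).
train = ([0, 1, 2, 3, 4], [1.1, 3.9, 7.2, 9.8, 13.1])
val = ([5, 6], [16.0, 19.1])

best, scores = select_best({"constant": fit_constant, "line": fit_line}, train, val)
print("selected model:", best)
print("validation errors:", scores)
```

Scoring on data the models never saw during fitting is what keeps the comparison fair: a model that merely memorizes the training set gains no advantage.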


14 Jan 2021

Machine learning for buildings' characterization and power-law recovery of urban metrics

PONE-D-20-10968R2

Dear Dr. Najem,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

Kind regards,

Celine Rozenblat

PLOS ONE

Reviewer's Responses to Questions

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: All my questions have been clearly addressed. I thank the authors for their careful revisions and clarifications!

Reviewer #2: The authors already in the previous iteration addressed my comments. Now, under consideration of the comments from the other reviewer, the manuscript further improved.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: No

Reviewer #2: No

18 Jan 2021

PONE-D-20-10968R2

Machine learning for buildings’ characterization and power-law recovery of urban metrics

Dear Dr. Najem:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Celine Rozenblat