Data:
    This research relies on data from 2016 only, and all data were cleaned to retain residential information only. The data do not contain any personal information. The analysis was performed at the zip code level, which seemed a reasonable geographical unit for observing urban renewal trends. The study required several data sources.
    First, permit issuance data were obtained from the Department of Buildings (DOB) Permit Issuance open data and cleaned to include residential permits only. They were then filtered to include only New Building (NB) and major Alteration (AL) permit types, ignoring permit types of little relevance to urban renewal, such as plumbing, signs, and equipment. The permit counts were normalized by the total number of occupied housing units, obtained from the United States Census Bureau's American FactFinder website, using the ACS 5-year estimates. Housing-unit data for 2016 are not available at the zip code level; for that reason data from 2013 were used, on the assumption that the number of housing units did not change meaningfully in the interim. All data were grouped by zip code to count the number of permits issued in each zip code in 2016.
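    The filtering and normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the record layout, the field names (`zip`, `job_type`, `residential`) and the sample values are all hypothetical and do not reflect the actual DOB or ACS schemas.

```python
from collections import Counter

# Hypothetical permit records; field names are illustrative, not the DOB schema.
permits = [
    {"zip": "10001", "job_type": "NB", "residential": True},
    {"zip": "10001", "job_type": "AL", "residential": True},
    {"zip": "10001", "job_type": "PL", "residential": True},   # plumbing: dropped
    {"zip": "11201", "job_type": "NB", "residential": True},
    {"zip": "11201", "job_type": "NB", "residential": False},  # non-residential: dropped
]

# Hypothetical occupied housing units per zip code (ACS-style estimate).
housing_units = {"10001": 1000, "11201": 500}

# Permit types kept for the analysis, as named in the text.
KEPT_JOB_TYPES = {"NB", "AL"}


def permits_per_housing_unit(permits, housing_units):
    """Count residential NB/AL permits per zip code and divide by housing units."""
    counts = Counter(
        p["zip"]
        for p in permits
        if p["residential"] and p["job_type"] in KEPT_JOB_TYPES
    )
    # Zip codes with no housing units are skipped to avoid division by zero.
    return {
        zip_code: counts[zip_code] / units
        for zip_code, units in housing_units.items()
        if units > 0
    }


permit_rate = permits_per_housing_unit(permits, housing_units)
```

    Zip codes that have housing units but no qualifying permits receive a rate of zero rather than being dropped, which keeps them in the later comparison with the complaint data.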
    The Department of Buildings (DOB) violations data are divided into more than a hundred complaint categories, most of which are minor from an urban renewal perspective. It was hard to define the exact categories that would best contribute to this analysis. To avoid possible biases and misinterpretations, another data source was selected: 311 complaints. Before retrieval, the 311 data were filtered to include only building-related complaints filed in 2016. The complaints in the 311 dataset are also divided by complaint descriptor; the descriptors included in this analysis are:
    311 is a relatively new system for engagement between citizens and city government, and not all citizens are aware of it or take advantage of it. To reduce this bias, the 311 data were normalized by dividing each zip code's number of building-related complaints by the overall number of 311 complaints in that zip code. Because of the large size of the 311 data, the overall number of complaints in 2016 was estimated from two months only, January and June, as proxies for overall winter and summer complaints respectively. A possible weakness of the 311 data is that even after neutralizing the bias in citizens' use of the 311 system, it remains hard to assess differences in citizens' involvement and engagement with the city's processes and in their willingness to complain about their surroundings. Again, the data were grouped by zip code to count the number of building violation complaints in each zip code in 2016.
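    The 311 normalization can be sketched as below. The text does not state how the January and June totals were scaled to a full year, so the factor of six months per sampled month is an assumption made here for illustration; a constant factor changes the scale of the share but not the ranking of zip codes. All names and counts are hypothetical.

```python
def building_complaint_share(building_counts, january_totals, june_totals,
                             months_per_sample=6):
    """Share of building-related complaints among all 311 complaints per zip code.

    The annual total is estimated from two sampled months; months_per_sample
    is an assumed scaling factor, not a value taken from the study.
    """
    shares = {}
    for zip_code, building in building_counts.items():
        sampled = january_totals.get(zip_code, 0) + june_totals.get(zip_code, 0)
        if sampled == 0:
            continue  # no denominator available for this zip code
        shares[zip_code] = building / (sampled * months_per_sample)
    return shares


# Hypothetical counts per zip code.
building_counts = {"10001": 120, "11201": 30, "10301": 5}
january_totals = {"10001": 100, "11201": 40}
june_totals = {"10001": 100, "11201": 60}

complaint_share = building_complaint_share(
    building_counts, january_totals, june_totals
)
```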
    Additionally, for the second part of the analysis, the numbers of renter-occupied and owner-occupied housing units were obtained, also from the US Census Bureau's American FactFinder website. Finally, shapefiles of New York City's zip codes, also obtained from the NYC Open Data website, were included for spatial analysis of the study's variables.
    
Methodology:
    The first step of the analysis was to observe and describe the data for both primary variables, permit issuance and 311 building violation complaints. The distributions of the normalized variables were examined to assess possible similarity in their statistical behaviour:
 (plot of distributions)
    Then, a bar plot overlaying the two variables across all zip codes was examined. The plot did not give any further understanding of the data, nor of a possible relationship between the variables, its magnitude, or its direction.
(plot of bar plot)
    A linear model was built to assess the power of permit issuance in predicting building violation complaints. As shown in the model summary, the R-squared is extremely low, although the p-value is below the significance threshold, which might imply that permit issuance has a meaningful effect on building violation complaints. Taken together, a significant p-value with a very low R-squared suggests that any such effect explains only a small part of the variation in complaints. Figure 3 is a scatter plot of the data, building violation (BV) complaints against permit issuance, with the fitted line.
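    A one-predictor linear model of this kind can be sketched with ordinary least squares in plain Python. This is an illustration only: the study's own model summary presumably came from a statistics package, and the values below are made up rather than taken from the data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, with R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical normalized values per zip code (not the study's data).
permit_rate = [0.001, 0.002, 0.003, 0.004, 0.005]
complaint_share = [0.10, 0.08, 0.09, 0.05, 0.07]

slope, intercept, r_squared = fit_line(permit_rate, complaint_share)
```

    A negative slope would be in line with the initial assumption that more permits go with fewer complaints, while R-squared indicates how much of the variation in complaints the permits actually account for.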
(plot of scatter)
    At that point I looked for possible outliers in the data. This part was interesting and surprising: the zip codes with the highest number of permits are in Staten Island and South Queens. I then repeated the methods above on a reduced dataset containing only Manhattan and Brooklyn. The results were even less significant than those for the whole city.
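    The text does not say which rule was used to detect outliers. One common choice, shown here purely as an assumption, is the interquartile range (IQR) rule, which flags values more than 1.5 IQRs outside the quartiles. The zip code labels and rates below are hypothetical.

```python
import statistics


def flag_outliers(rates, k=1.5):
    """Return the keys whose values fall outside the IQR fences.

    This is the IQR rule, assumed for illustration; it is not necessarily
    the rule used in the study.
    """
    q1, _, q3 = statistics.quantiles(rates.values(), n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return sorted(z for z, value in rates.items() if value < lower or value > upper)


# Hypothetical permit rates per zip code; "z10" is deliberately extreme.
rates = {"z01": 1, "z02": 2, "z03": 3, "z04": 4, "z05": 5,
         "z06": 6, "z07": 7, "z08": 8, "z09": 9, "z10": 100}

outlier_zips = flag_outliers(rates)
```

    Refitting the model with and without the flagged zip codes shows how much the fitted line depends on them.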
(table or scatter plot with and without outliers)
Conclusions and Limitations:
    The results of the analysis were not significant enough to consider permit issuance a reliable predictor of building violation complaints. The initial assumption, that a higher number of permits issued in a given place would probably mean fewer building violation complaints in that place, was not confirmed by this research. Even when the data were divided into two subsamples, Manhattan and Brooklyn, no consistency was detected in the behaviour of the variables or in their relationship.
    A possible limitation of this study is the use of 311 complaints about building violations as a proxy for actual building or use violations, which by definition are non-compliant. Furthermore, as noted in the Data section, it is hard to assess how use of the 311 system varies across the city. Further work could analyze violations confirmed by the DOB rather than 311 data, even though the latter were normalized.
    Another approach that could have made this study more accurate is to assess the number of housing units per permit, and in this way to weight the overall effect of each permit on the city as a whole. This could prevent areas of single-family houses, such as Staten Island, or of row houses from being mistakenly identified as areas of urban renewal.
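    The units-per-permit weighting proposed above could look roughly like this. It assumes each permit record carries the number of dwelling units it covers; the `units` field and the sample values are hypothetical and were not part of the data used in this study.

```python
from collections import defaultdict

# Hypothetical permits, each with the number of dwelling units it covers.
permits = [
    {"zip": "10301", "units": 1},    # single-family house
    {"zip": "10301", "units": 1},
    {"zip": "10301", "units": 2},
    {"zip": "11201", "units": 120},  # one large multi-family building
]


def units_weighted_permits(permits):
    """Summarize permits per zip code by count, total units, and units per permit."""
    permit_count = defaultdict(int)
    unit_total = defaultdict(int)
    for p in permits:
        permit_count[p["zip"]] += 1
        unit_total[p["zip"]] += p["units"]
    return {
        zip_code: {
            "permits": permit_count[zip_code],
            "units": unit_total[zip_code],
            "units_per_permit": unit_total[zip_code] / permit_count[zip_code],
        }
        for zip_code in permit_count
    }


weighted = units_weighted_permits(permits)
```

    In this made-up example the zip code with three permits adds only four units, while the zip code with a single permit adds 120, which is the distortion the weighting is meant to expose.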
    To conclude, the assumption that higher permit issuance will result in fewer building violation complaints is a long shot. It might hold in very specific areas of rapid urban renewal, and even then other factors, such as the ratio of renter to owner occupancy and income level, could be significant in predicting housing violations and should be taken into consideration.