
Data Sources

The setting for my dissertation is the city of Madrid, Spain, from around 2005 through 2015. The main spatial unit for most analyses is the census section, the smallest census unit available in Spain, with around 1,500 people living in each (the city of Madrid has around 2,400 census sections). For scale, a census section covers about half a block in the densest areas, an entire block in some other areas, and several blocks in less dense or suburban areas.
I'm using a wide range of data sources for this. Most of them have not been used in health research before, although the fields of demography, sociology, urban planning and regional studies have made good use of some of them. The combination of all these data sources is, I believe, unique so far. Most of the data have been obtained from the City of Madrid government, which wonderfully makes them freely available here and here.
  • Padrón: this is a continuous census of the entire population of Spain, with data on education, country of origin, age and sex. This is a dataset of around 2,400 units × 11 years.
  • Catastro (Cadastre): this is a universal registry of all properties, including housing, industrial and commercial. It includes data on year of construction, surface area and renovations. This is a 6 GB dataset for the city of Madrid, with around 2 million properties.
  • Censo de Locales (commercial spaces census): this is a universal census, collected for licensing and inspection purposes, that includes all commercial spaces in Madrid, whether open or not. It includes the geocoded location and economic activity of all active spaces. Each monthly cut has around 150,000 units for analysis.
  • Idealista Report and Idealista API: data on the average sale price of housing sold through Idealista (the biggest real estate company in Spain), by neighborhood, from 2000 through 2015. The API also provides cross-sectional, geocoded data on the properties currently listed on its website.
  • Unemployment rate and Social Security: data compiled at the neighborhood level from around 2002 or 2010 through 2016, with universal data on unemployment rate, occupational class, job tenure, work hours, etc.
  • Elections: data on all elections since 1979, with registered voters, votes cast and votes for each party.
  • Other: total number of registered vehicles, GDP/capita, fertility and mortality rates, marriage/divorce rates, etc.
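To give a rough sense of the shape of the Padrón panel described above (around 2,400 units observed over 11 years), here is a minimal Python sketch that builds one row key per unit and year. The unit identifiers (`S0001`, ...) are hypothetical placeholders, not real census section codes, and the year range 2005 through 2015 is an assumption taken from the study period rather than a confirmed property of the dataset.

```python
from itertools import product

N_UNITS = 2400             # approximate number of census sections (from the text)
YEARS = range(2005, 2016)  # assumed study period: 2005..2015 inclusive, 11 years


def build_panel_index(n_units, years):
    """Return one (unit_id, year) pair per unit-year observation."""
    # Hypothetical identifiers; real census sections use official codes.
    unit_ids = [f"S{i:04d}" for i in range(1, n_units + 1)]
    return list(product(unit_ids, years))


panel = build_panel_index(N_UNITS, YEARS)
print(len(panel))  # 2400 * 11 = 26400 unit-year rows
```

Each yearly variable (education, country of origin, age, sex) would then be attached to these unit-year keys, and the other area-level sources could be joined on the same keys where their geography allows it.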
Health data have been obtained from the Retrospective Study of the HeartHealthyHoods project, which covers Area 4 of Madrid (four districts in northeastern Madrid). This includes electronic health records on around 600,000 people for each year from 2009 through 2014, with data on diabetes, hypertension, cardiovascular disease and laboratory values related to cardiovascular disease risk factors.