
DAT102x: Predicting Evictions
Hosted By Microsoft


Problem Description

About the Data

Your goal is to predict the number of evictions at the county level from other socioeconomic and demographic indicators. According to the Eviction Lab, "An eviction happens when a landlord expels people from property he or she owns. Evictions are landlord-initiated involuntary moves that happen to renters." The data is compiled from a wide range of sources and made publicly available by the United States Department of Agriculture Economic Research Service and the Eviction Lab.

Target Variable

We're trying to predict the variable evictions (a positive integer) for each row of the test data set.

Your job is to:

  1. Train a model using the inputs in train_values.csv and the labels in train_labels.csv.
  2. Predict the number of evictions (an integer) for each row in test_values.csv, for which you don't know the true number of evictions.
  3. Output your predictions in a format that matches submission_format.csv exactly.
  4. Upload your predictions to this competition in order to get a score.
  5. Export your token and paste it into the assignment grader on edX to get your course grade.
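
Steps 1 and 2 can be sketched with a deliberately simple baseline. The example below is an illustration only, not the recommended model: it assumes the values and labels files share a row_id column, uses small made-up CSV text in place of the real files, and fits a one-feature least-squares line on renter_occupied_households. The helper names read_rows and fit_line are hypothetical.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope

# Made-up stand-ins for train_values.csv, train_labels.csv and test_values.csv.
# Assumption: the values and labels files can be joined on a shared row_id column.
train_values = "row_id,renter_occupied_households\n0,1000\n1,2000\n2,4000\n"
train_labels = "row_id,evictions\n0,10\n1,20\n2,40\n"
test_values = "row_id,renter_occupied_households\n0,3000\n"

# Step 1: join inputs to labels by row_id and fit the baseline.
labels = {r["row_id"]: float(r["evictions"]) for r in read_rows(train_labels)}
train = read_rows(train_values)
xs = [float(r["renter_occupied_households"]) for r in train]
ys = [labels[r["row_id"]] for r in train]
intercept, slope = fit_line(xs, ys)

# Step 2: predict for every test row (raw values, not yet rounded to integers).
predictions = [
    intercept + slope * float(r["renter_occupied_households"])
    for r in read_rows(test_values)
]
```

In practice you would read the real files from disk and use many more of the available variables; the single feature here only keeps the sketch short.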

Submission Format

The submission file has two columns: row_id and evictions. The data type of evictions is integer, so make sure there is no decimal point in your submission. For example, 100 is a valid value but 100.0 is not.

If you predicted 1 eviction for each county, the .csv file that you submit would look like:
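
As a hedged illustration of that layout, the sketch below builds the CSV text for a constant prediction of 1. The row ids 0, 1, 2 are hypothetical placeholders (the real ones come from submission_format.csv), and the helper name to_submission_csv is not part of the competition materials. Clipping negative predictions to zero is an assumption made here because eviction counts cannot be negative.

```python
import csv
import io

def to_submission_csv(row_ids, predictions):
    """Return CSV text with a row_id column and an integer evictions column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["row_id", "evictions"])
    for row_id, value in zip(row_ids, predictions):
        # Round to the nearest integer so no decimal point is written,
        # and clip at zero (assumption: a count cannot be negative).
        writer.writerow([row_id, max(0, int(round(value)))])
    return buffer.getvalue()

# Hypothetical row ids; float predictions are converted to integers on write.
submission = to_submission_csv([0, 1, 2], [1.0, 1.0, 1.0])
print(submission)
```

Converting inside the writer keeps the model free to produce real-valued predictions while guaranteeing that values such as 100.0 never reach the file.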


Performance Metric

We're predicting a numeric quantity, so this is a regression problem. To measure regression performance, we'll use a metric known as R-squared, also called the coefficient of determination. It ranges from -∞ to 1, where a higher value is better.

$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

where $\hat{y}$ is the predicted number of evictions, $y$ is the actual number of evictions, and $\bar{y}$ is the average of the actual number of evictions. A score of 1 means predictions exactly match the test values.
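
To make the formula concrete, here is a minimal pure-Python sketch of R-squared with three small made-up cases. The function name r_squared is an illustrative choice, and the competition's own scoring code is not shown in this description, so treat this as a local approximation for checking models rather than the official scorer.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    # Undefined when every actual value is identical (ss_tot == 0).
    return 1 - ss_res / ss_tot

actual = [10, 20, 30, 40]                    # made-up eviction counts
perfect = r_squared(actual, actual)          # predictions match exactly
mean_only = r_squared(actual, [25] * 4)      # always predict the mean
poor = r_squared(actual, [40, 30, 20, 10])   # worse than predicting the mean
```

The three cases show the range of the metric: exact predictions score 1, always predicting the mean scores 0, and predictions worse than the mean score below 0.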


There are 47 variables in this dataset. Each row represents a United States county, and the dataset covers two particular years, denoted a and b. We provide a unique identifier for each county, but note that the counties in the test set are distinct from those in the train set: no county that appears in the train set will appear in the test set. Thus, county-specific features (e.g., county dummy variables) will not be an option. However, the counties in the test set still share patterns similar to those in the train set, so other feature engineering will work as usual.
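
Because test counties never appear in the train set, a local validation split that mimics this keeps all rows of a county on the same side. The sketch below is one way to do that with the standard library; the function name split_by_county, the 20% holdout fraction, and the made-up county codes are assumptions for illustration, as is the idea that a county can appear once per year.

```python
import random

def split_by_county(rows, holdout_fraction=0.2, seed=0):
    """Split rows so that no county_code appears in both train and validation."""
    counties = sorted({row["county_code"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(counties)
    n_holdout = max(1, int(len(counties) * holdout_fraction))
    holdout = set(counties[:n_holdout])
    train = [r for r in rows if r["county_code"] not in holdout]
    valid = [r for r in rows if r["county_code"] in holdout]
    return train, valid

# Made-up rows: five hypothetical counties, each observed in years a and b.
rows = [
    {"county_code": code, "year": year}
    for code in ("c1", "c2", "c3", "c4", "c5")
    for year in ("a", "b")
]
train, valid = split_by_county(rows)
```

A plain random split over rows could place the same county's two years on both sides, which would likely make validation scores look better than the test score.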

The variables are as follows:


  • county_code - Unique identifier for each county
  • year - Year, denoted as a or b
  • state - Unique identifier for each state
  • population - Total population


  • renter_occupied_households - Count of renter-occupied households
  • pct_renter_occupied - Percent of occupied housing units that are renter-occupied
  • median_gross_rent - Median cost of rent
  • median_household_income - Median household income
  • median_property_value - Median property value
  • rent_burden - Median gross rent as a percentage of household income


  • pct_white - Percent of population that is White alone and not Hispanic or Latino
  • pct_af_am - Percent of population that is Black or African American alone and not Hispanic or Latino
  • pct_hispanic - Percent of population that is of Hispanic or Latino origin
  • pct_am_ind - Percent of population that is American Indian and Alaska Native alone and not Hispanic or Latino
  • pct_asian - Percent of population that is Asian alone and not Hispanic or Latino
  • pct_nh_pi - Percent of population that is Native Hawaiian and Other Pacific Islander alone and not Hispanic or Latino
  • pct_multiple - Percent of population that is two or more races and not Hispanic or Latino
  • pct_other - Percent of population that is other race alone and not Hispanic or Latino


  • poverty_rate - Percent of the population with income in the past 12 months below the poverty level
  • rucc - Rural-Urban Continuum Codes "form a classification scheme that distinguishes metropolitan counties by the population size of their metro area, and nonmetropolitan counties by degree of urbanization and adjacency to a metro area. The official Office of Management and Budget (OMB) metro and nonmetro categories have been subdivided into three metro and six nonmetro categories. Each county in the U.S. is assigned one of the 9 codes." (USDA Economic Research Service)
  • urban_influence - Urban Influence Codes "form a classification scheme that distinguishes metropolitan counties by population size of their metro area, and nonmetropolitan counties by size of the largest city or town and proximity to metro and micropolitan areas." (USDA Economic Research Service)
  • economic_typology - County Typology Codes "classify all U.S. counties according to six mutually exclusive categories of economic dependence and six overlapping categories of policy-relevant themes. The economic dependence types include farming, mining, manufacturing, Federal/State government, recreation, and nonspecialized counties. The policy-relevant types include low education, low employment, persistent poverty, persistent child poverty, population loss, and retirement destination." (USDA Economic Research Service)
  • pct_civilian_labor - Civilian labor force, annual average, as percent of population
  • pct_unemployment - Unemployment, annual average, as percent of population


  • pct_uninsured_adults - Percent of adults without health insurance
  • pct_uninsured_children - Percent of children without health insurance
  • pct_adult_obesity - Percent of adults who meet clinical definition of obese
  • pct_adult_smoking - Percent of adults who smoke
  • pct_diabetes - Percent of population with diabetes
  • pct_low_birthweight - Percent of babies born with low birth weight
  • pct_excessive_drinking - Percent of adult population that engages in excessive consumption of alcohol
  • pct_physical_inactivity - Percent of adult population that is physically inactive
  • air_pollution_particulate_matter_value - Fine particulate matter in µg/m³
  • homicides_per_100k - Deaths by homicide per 100,000 population
  • motor_vehicle_crash_deaths_per_100k - Deaths by motor vehicle crash per 100,000 population
  • heart_disease_mortality_per_100k - Deaths from heart disease per 100,000 population
  • pop_per_dentist - Population per dentist
  • pop_per_primary_care_physician - Population per Primary Care Physician


  • pct_female - Percent of population that is female
  • pct_below_18_years_of_age - Percent of population that is below 18 years of age
  • pct_aged_65_years_and_older - Percent of population that is aged 65 years or older
  • pct_adults_less_than_a_high_school_diploma - Percent of adult population that does not have a high school diploma
  • pct_adults_with_high_school_diploma - Percent of adult population which has a high school diploma as highest level of education achieved
  • pct_adults_with_some_college - Percent of adult population which has some college as highest level of education achieved
  • pct_adults_bachelors_or_higher - Percent of adult population which has a bachelor's degree or higher as highest level of education achieved
  • birth_rate_per_1k - Births per 1,000 of population
  • death_rate_per_1k - Deaths per 1,000 of population

Example Row

Here's an example row from the dataset, showing the kinds of values you can expect. Most variables are numeric, a few are categorical, and some values may be missing.

county_code a4e2211
year b
state d725a95
population 45009
renter_occupied_households 6944
pct_renter_occupied 37.218
median_gross_rent 643
median_household_income 33315
median_property_value 98494
rent_burden 33.389
pct_white 0.41207
pct_af_am 0.493459
pct_hispanic 0.0701932
pct_am_ind 0.00258823
pct_asian 0.00457455
pct_nh_pi 0.000200638
pct_multiple 0.0159206
pct_other 0.000993158
poverty_rate 18.451
rucc Nonmetro - Urban population of 20,000 or more, adjacent to a metro area
urban_influence Micropolitan adjacent to a large metro area
economic_typology Nonspecialized
pct_civilian_labor 0.407
pct_unemployment 0.093
pct_uninsured_adults 0.239
pct_uninsured_children 0.068
pct_adult_obesity 0.332
pct_adult_smoking 0.277
pct_diabetes 0.145
pct_low_birthweight 0.12
pct_excessive_drinking 0.077
pct_physical_inactivity 0.313
air_pollution_particulate_matter_value 12.1653
homicides_per_100k 14.01
motor_vehicle_crash_deaths_per_100k 18.21
heart_disease_mortality_per_100k 318
pop_per_dentist 2420
pop_per_primary_care_physician 1960
pct_female 0.532
pct_below_18_years_of_age 0.252
pct_aged_65_years_and_older 0.153
pct_adults_less_than_a_high_school_diploma 0.233
pct_adults_with_high_school_diploma 0.375
pct_adults_with_some_college 0.278
pct_adults_bachelors_or_higher 0.114
birth_rate_per_1k 12.9151
death_rate_per_1k 11.2051
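
Since the columns mix numeric values, categorical values, and gaps, some parsing is usually needed before modeling. The sketch below shows one possible approach using a few fields from the example row. It assumes missing values arrive as empty strings (the description does not say how they are encoded), and the names CATEGORICAL, parse_row, and one_hot are hypothetical.

```python
# Columns treated as categorical, based on the variable descriptions above.
CATEGORICAL = {
    "county_code", "year", "state",
    "rucc", "urban_influence", "economic_typology",
}

def parse_row(raw):
    """Keep categoricals as strings, turn numerics into floats, blanks into None."""
    parsed = {}
    for name, value in raw.items():
        value = value.strip()
        if value == "":
            parsed[name] = None          # assumption: blank means missing
        elif name in CATEGORICAL:
            parsed[name] = value
        else:
            parsed[name] = float(value)
    return parsed

def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators."""
    return [1 if value == c else 0 for c in categories]

# A few fields from the example row, with one value blanked out for illustration.
raw = {
    "year": "b",
    "population": "45009",
    "economic_typology": "Nonspecialized",
    "pct_adult_smoking": "",
}
row = parse_row(raw)
year_features = one_hot(row["year"], ["a", "b"])
```

How to fill the None values (for example with a column median) is a modeling choice left open here.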