DAT102x: Predicting County-Level Rents
Hosted By Microsoft

 

Problem Description

About the Data

Your goal is to predict the median gross rent at the county level from other socioeconomic and demographic indicators.

Target Variable

We're trying to predict the variable gross_rent (a positive integer) for each row of the test data set.

Your job is to:

  1. Train a model using the inputs in train_values.csv and the labels in train_labels.csv.
  2. Predict an integer gross_rent for each row in test_values.csv, for which you don't know the true gross rent.
  3. Output your predictions in a format that matches submission_format.csv exactly.
  4. Upload your predictions to this competition in order to get a score.
  5. Export your grading token (click the "Export Score for EdX" tab) and paste it into the assignment grader on edX to get your course grade.
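Steps 1 and 2 can be sketched with a deliberately trivial baseline. The sketch below assumes that train_labels.csv and test_values.csv each carry a row_id column (this page only confirms row_id for the submission file). It "trains" by taking the median of the training labels, which is a placeholder for a real model and not a recommended method. The helper names and the in-memory CSV snippets are hypothetical stand-ins for the real files.

```python
import csv
import io
import statistics

def fit_median_baseline(labels_file):
    """'Train' by taking the median gross_rent of the training labels."""
    rents = [float(row["gross_rent"]) for row in csv.DictReader(labels_file)]
    return statistics.median(rents)

def predict_baseline(median_rent, test_values_file):
    """Predict the same integer rent for every row_id in the test values."""
    prediction = int(round(median_rent))
    return {row["row_id"]: prediction for row in csv.DictReader(test_values_file)}

# Hypothetical in-memory stand-ins for train_labels.csv and test_values.csv.
train_labels = io.StringIO("row_id,gross_rent\n0,500\n1,700\n2,650\n")
test_values = io.StringIO("row_id,population\n0,3876\n1,12000\n")

median_rent = fit_median_baseline(train_labels)
baseline_predictions = predict_baseline(median_rent, test_values)
print(baseline_predictions)
```

A constant prediction like this cannot score above 0 on R-squared, so it is only useful as a floor to compare real models against.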

Submission Format

The submission file has two columns, row_id and gross_rent. The data type of gross_rent is an integer, so make sure there is no decimal point in your submission. For example, 100 is a valid integer but 100.0 is not.

If you predicted a gross rent of 1 for each county, the .csv file that you submit would look like:

row_id,gross_rent
0,1
1,1
2,1
3,1
4,1
⁝
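One way to avoid stray decimal points is to convert every prediction to an integer just before writing. This is a minimal sketch using Python's standard csv module. The function name write_submission and the example predictions are hypothetical, and the row order is assumed to match submission_format.csv.

```python
import csv
import io

def write_submission(predictions, out_file):
    """Write (row_id, predicted_rent) pairs as row_id,gross_rent with integer rents."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["row_id", "gross_rent"])
    for row_id, predicted_rent in predictions:
        # int(round(...)) guarantees "100" is written rather than "100.0".
        writer.writerow([row_id, int(round(predicted_rent))])

# Hypothetical float predictions from some model.
buffer = io.StringIO()
write_submission([(0, 612.4), (1, 1049.6), (2, 100.0)], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write a real file, you could pass open("submission.csv", "w", newline="") in place of the in-memory buffer.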

Performance Metric

We're predicting a numeric quantity, so this is a regression problem. To measure regression performance, we'll use a metric known as R-squared, also called the coefficient of determination. It is a quantity between -∞ and 1, where a higher value is better.

$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

where $\hat{y}_i$ is the predicted gross rent for row $i$, $y_i$ is the actual gross rent, and $\bar{y}$ is the average of the actual gross rents. A score of 1 means predictions exactly match the test values.
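To check a model locally before uploading, the formula can be computed directly. The following is a small sketch in plain Python. The function name r_squared and the example numbers are made up for illustration, and the competition's own scoring code may differ in details.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    # Undefined (division by zero) if every actual value is identical.
    return 1 - ss_res / ss_tot

actual = [500, 700, 900]  # made-up rents
perfect = r_squared(actual, [500, 700, 900])    # exact match
mean_only = r_squared(actual, [700, 700, 700])  # always predicts the mean
poor = r_squared(actual, [900, 700, 500])       # worse than predicting the mean
print(perfect, mean_only, poor)  # 1.0 0.0 -3.0
```

The three cases show why the range is -∞ to 1: an exact match gives 1, always predicting the mean gives 0, and predictions worse than the mean go negative.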

Features

There are 43 variables in this dataset. Each row represents a United States county in a single year. We provide a unique identifier for each county, but note that the counties in the test set are distinct from those in the train set: no county that appears in the train set will appear in the test set. County-specific features (e.g., county dummy variables) are therefore not an option. However, the counties in the test set still follow patterns similar to those in the train set, so other feature engineering will work as usual.
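Because test counties never appear in the train set, a local validation split that keeps each county entirely on one side gives a more honest estimate than a purely random row split. Here is a minimal sketch, assuming a county can appear in more than one row (one per year). The function name split_by_county, the 20% fraction, and the toy county codes are all hypothetical.

```python
import random

def split_by_county(rows, validation_fraction=0.2, seed=0):
    """Split rows so that no county_code appears in both train and validation."""
    counties = sorted({row["county_code"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(counties)
    n_validation = max(1, int(len(counties) * validation_fraction))
    validation_counties = set(counties[:n_validation])
    train = [r for r in rows if r["county_code"] not in validation_counties]
    validation = [r for r in rows if r["county_code"] in validation_counties]
    return train, validation

# Toy data: five hypothetical counties, two rows (years) each.
rows = [{"county_code": code, "year_index": year}
        for code in ["a1", "b2", "c3", "d4", "e5"] for year in (0, 1)]
train_rows, validation_rows = split_by_county(rows)

train_counties = {r["county_code"] for r in train_rows}
held_out_counties = {r["county_code"] for r in validation_rows}
print(len(train_rows), len(validation_rows), train_counties & held_out_counties)
```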

The variables are as follows:

ID

  • county_code - Unique identifier for each county
  • state - Unique identifier for each state
  • population - Total population

Housing

  • renter_occupied_households - Count of renter-occupied households
  • pct_renter_occupied - Percent of occupied housing units that are renter-occupied
  • evictions - Number of eviction judgments in which renters were ordered to leave in a given area and year
  • rent_burden - Median gross rent as a percentage of household income

Ethnicity

  • pct_white - Percent of population that is White alone and not Hispanic or Latino
  • pct_af_am - Percent of population that is Black or African American alone and not Hispanic or Latino
  • pct_hispanic - Percent of population that is of Hispanic or Latino origin
  • pct_am_ind - Percent of population that is American Indian and Alaska Native alone and not Hispanic or Latino
  • pct_asian - Percent of population that is Asian alone and not Hispanic or Latino
  • pct_nh_pi - Percent of population that is Native Hawaiian and Other Pacific Islander alone and not Hispanic or Latino
  • pct_multiple - Percent of population that is two or more races and not Hispanic or Latino
  • pct_other - Percent of population that is other race alone and not Hispanic or Latino

Economic

  • poverty_rate - Percent of the population with income in the past 12 months below the poverty level
  • rucc - Rural-Urban Continuum Codes "form a classification scheme that distinguishes metropolitan counties by the population size of their metro area, and nonmetropolitan counties by degree of urbanization and adjacency to a metro area. The official Office of Management and Budget (OMB) metro and nonmetro categories have been subdivided into three metro and six nonmetro categories. Each county in the U.S. is assigned one of the 9 codes." (USDA Economic Research Service)
  • urban_influence - Urban Influence Codes "form a classification scheme that distinguishes metropolitan counties by population size of their metro area, and nonmetropolitan counties by size of the largest city or town and proximity to metro and micropolitan areas." (USDA Economic Research Service)
  • economic_typology - County Typology Codes "classify all U.S. counties according to six mutually exclusive categories of economic dependence and six overlapping categories of policy-relevant themes. The economic dependence types include farming, mining, manufacturing, Federal/State government, recreation, and nonspecialized counties. The policy-relevant types include low education, low employment, persistent poverty, persistent child poverty, population loss, and retirement destination." (USDA Economic Research Service)
  • pct_civilian_labor - Civilian labor force, annual average, as percent of population
  • pct_unemployment - Unemployment, annual average, as percent of population

Health

  • pct_uninsured_adults - Percent of adults without health insurance
  • pct_uninsured_children - Percent of children without health insurance
  • pct_adult_obesity - Percent of adults who meet clinical definition of obese
  • pct_adult_smoking - Percent of adults who smoke
  • pct_diabetes - Percent of population with diabetes
  • pct_low_birthweight - Percent of babies born with low birth weight
  • pct_excessive_drinking - Percent of adult population that engages in excessive consumption of alcohol
  • pct_physical_inactivity - Percent of adult population that is physically inactive
  • air_pollution_particulate_matter_value - Fine particulate matter in µg/m³
  • homicides_per_100k - Deaths by homicide per 100,000 population
  • motor_vehicle_crash_deaths_per_100k - Deaths by motor vehicle crash per 100,000 population
  • heart_disease_mortality_per_100k - Deaths from heart disease per 100,000 population
  • pop_per_dentist - Population per dentist
  • pop_per_primary_care_physician - Population per Primary Care Physician

Demographic

  • pct_female - Percent of population that is female
  • pct_below_18_years_of_age - Percent of population that is below 18 years of age
  • pct_aged_65_years_and_older - Percent of population that is aged 65 years or older
  • pct_adults_less_than_a_high_school_diploma - Percent of adult population that does not have a high school diploma
  • pct_adults_with_high_school_diploma - Percent of adult population which has a high school diploma as highest level of education achieved
  • pct_adults_with_some_college - Percent of adult population which has some college as highest level of education achieved
  • pct_adults_bachelors_or_higher - Percent of adult population which has a bachelor's degree or higher as highest level of education achieved
  • birth_rate_per_1k - Births per 1,000 of population
  • death_rate_per_1k - Deaths per 1,000 of population

Example Row

Here's an example of one of the rows in the dataset so that you can see the kinds of values you might expect in the dataset. Most are numeric, a few are categorical, and there can be missing values.

0
county_code 8e686a7
state fb8cab1
population 3876
renter_occupied_households 408
pct_renter_occupied 24.583
evictions NaN
rent_burden 18.38
pct_white 0.945945
pct_af_am 0.0107611
pct_hispanic 0.0260384
pct_am_ind 0.00568528
pct_asian 0.00563532
pct_nh_pi 0
pct_multiple 0.00593507
pct_other 0
poverty_rate 4.172
rucc Nonmetro - Completely rural or less than 2,500 urban population, not adjacent to a metro area
urban_influence Noncore adjacent to micro area and does not contain a town of at least 2,500 residents
economic_typology Manufacturing-dependent
pct_civilian_labor 0.55
pct_unemployment 0.023
pct_uninsured_adults 0.107
pct_uninsured_children 0.062
pct_adult_obesity 0.31
pct_adult_smoking 0.166
pct_diabetes 0.1
pct_low_birthweight NaN
pct_excessive_drinking 0.262
pct_physical_inactivity 0.342
air_pollution_particulate_matter_value 11.0229
homicides_per_100k NaN
motor_vehicle_crash_deaths_per_100k NaN
heart_disease_mortality_per_100k 217
pop_per_dentist NaN
pop_per_primary_care_physician NaN
pct_female 0.471
pct_below_18_years_of_age 0.218
pct_aged_65_years_and_older 0.19
pct_adults_less_than_a_high_school_diploma 0.0832497
pct_adults_with_high_school_diploma 0.327984
pct_adults_with_some_college 0.389168
pct_adults_bachelors_or_higher 0.199599
birth_rate_per_1k 10.009
death_rate_per_1k 9.75234
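The example row has NaN entries and text-valued columns, so some preprocessing is usually needed before fitting a model. The sketch below shows one possible approach, not a prescribed one: median imputation for a numeric column and one-hot encoding for a categorical column. The helper names and sample values are hypothetical, and "Farming-dependent" is an assumed label used only for illustration.

```python
import math
import statistics

def parse_number(text):
    """Convert a raw field to float; 'NaN' or an empty string becomes math.nan."""
    text = text.strip()
    return float(text) if text else math.nan

def impute_median(values):
    """Replace NaN entries with the median of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = statistics.median(observed)
    return [fill if math.isnan(v) else v for v in values]

def one_hot(values):
    """Return the sorted category list and a 0/1 indicator row per value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical raw column values.
evictions = impute_median([parse_number(t) for t in ["NaN", "12", "30", "", "18"]])
categories, encoded = one_hot(
    ["Manufacturing-dependent", "Farming-dependent", "Manufacturing-dependent"]
)
print(evictions)
print(categories, encoded)
```

In practice the fill value and the category list would be computed on the train set and then reused on the test set, so that both are transformed the same way.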

References