
DAT264x: Identifying Topics of World Bank Publications
Hosted By Microsoft

 

Problem Description

About the Data

Your goal is to predict the topic(s) of publications from the World Bank, where there are 29 possible topics. You will be given the first six pages of text from each document. Each document has at least one topic and can have multiple topics; this is known as a multilabel problem.

Target Variable

Your job is to:

  1. Train a model using the inputs in train_values.csv and the labels train_labels.csv.
  2. Predict labels for each row (document) in test_values.csv for which you don't know the true topics.
  3. Output your predictions in a format that matches submission_format.csv exactly.
  4. Upload your predictions to this competition in order to get an accuracy score.
  5. Export your token and paste it into the assignment grader on edX to get your course grade.

For each document in train_values.csv, you are given 29 columns in train_labels.csv. Each column corresponds to a possible topic and contains either a 1 or a 0, where 1 means the topic applies to the document and 0 means it does not.

The topic columns are as follows:

  • information_and_communication_technologies
  • governance
  • urban_development
  • law_and_development
  • public_sector_development
  • agriculture
  • communities_and_human_settlements
  • health_and_nutrition_and_population
  • culture_and_development
  • environment
  • social_protections_and_labor
  • industry
  • macroeconomics_and_economic_growth
  • international_economics_and_trade
  • conflict_and_development
  • finance_and_financial_sector_development
  • science_and_technology_development
  • rural_development
  • poverty_reduction
  • private_sector_development
  • informatics
  • energy
  • social_development
  • water_resources
  • education
  • transport
  • water_supply_and_sanitation
  • gender
  • infrastructure_economics_and_finance
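Since each document carries at least one topic and possibly several, a quick sanity check on the label file is to count the 1s per row. A minimal sketch, assuming pandas is available; a tiny two-column stand-in is used here in place of the real 29-column train_labels.csv:

```python
import pandas as pd

# Tiny stand-in for train_labels.csv: the real file has 29 topic columns,
# but the structure is the same -- one 0/1 column per topic.
labels = pd.DataFrame(
    {"governance": [1, 0, 1], "education": [0, 1, 1]},
    index=pd.Index([0, 1, 2], name="row_id"),
)

# Row sums give topics per document; every document has at least one topic,
# and some (row 2 here) have more than one -- the multilabel case.
topics_per_doc = labels.sum(axis=1)
print(topics_per_doc.tolist())  # [1, 1, 2]
```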

Submission Format

The format for the submission file is a CSV with a header row (row_id,information_and_communication_technologies,governance,urban_development,law_and_development,public_sector_development,agriculture,communities_and_human_settlements,health_and_nutrition_and_population,culture_and_development,environment,social_protections_and_labor,industry,macroeconomics_and_economic_growth,international_economics_and_trade,conflict_and_development,finance_and_financial_sector_development,science_and_technology_development,rural_development,poverty_reduction,private_sector_development,informatics,energy,social_development,water_resources,education,transport,water_supply_and_sanitation,gender,infrastructure_economics_and_finance).

Each row contains a row id followed by 29 topic columns, all separated by commas. The data type for all the topic labels is an integer, so your topic labels must be integers.

Note: Except for the actual prediction values in the topic columns, your submission must exactly match the submission_format.csv file provided, including the order of the rows.

For example, if you guessed all topics applied to every document, your submission would look like:

row_id information_and_communication_technologies governance urban_development law_and_development public_sector_development agriculture communities_and_human_settlements health_and_nutrition_and_population culture_and_development environment social_protections_and_labor industry macroeconomics_and_economic_growth international_economics_and_trade conflict_and_development finance_and_financial_sector_development science_and_technology_development rural_development poverty_reduction private_sector_development informatics energy social_development water_resources education transport water_supply_and_sanitation gender infrastructure_economics_and_finance
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Your .csv file that you submit would look like:

row_id,information_and_communication_technologies,governance,urban_development,law_and_development,public_sector_development,agriculture,communities_and_human_settlements,health_and_nutrition_and_population,culture_and_development,environment,social_protections_and_labor,industry,macroeconomics_and_economic_growth,international_economics_and_trade,conflict_and_development,finance_and_financial_sector_development,science_and_technology_development,rural_development,poverty_reduction,private_sector_development,informatics,energy,social_development,water_resources,education,transport,water_supply_and_sanitation,gender,infrastructure_economics_and_finance
0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
⁝
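The all-ones submission above can be generated programmatically. A minimal sketch with pandas: the stand-in frame below mimics submission_format.csv with two of the 29 topic columns, while in practice you would load the real file with pd.read_csv("submission_format.csv", index_col="row_id").

```python
import numpy as np
import pandas as pd

# Stand-in for submission_format.csv with two of the 29 topic columns; in
# practice: submission = pd.read_csv("submission_format.csv", index_col="row_id")
submission = pd.DataFrame(
    0,
    index=pd.Index([0, 1, 2], name="row_id"),
    columns=["governance", "education"],
)

# Hypothetical predictions: here, "every topic applies to every document".
predictions = np.ones(submission.shape, dtype=int)

submission[:] = predictions        # keeps the format file's row order
csv_text = submission.to_csv()     # labels are written as integers, not floats
print(csv_text.splitlines()[:2])   # ['row_id,governance,education', '0,1,1']
```

Writing through the loaded format file, rather than building a new frame from scratch, is the easiest way to guarantee the column order and row order match exactly.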

Performance metric


To measure your model's performance, we'll use the F1 score, which balances the precision and recall of a classifier. Traditionally, the F1 score is used to evaluate the performance of a binary classifier, but since we have 29 possible labels we will use a variant called the micro-averaged F1 score.

$$F_{micro} = \frac{2 \cdot P_{micro} \cdot R_{micro}}{P_{micro} + R_{micro}}$$

where

$$P_{micro} = \frac{\sum_{k=1}^{29}TP_{k}}{\sum_{k=1}^{29}(TP_{k} + FP_{k})},~~R_{micro} = \frac{\sum_{k=1}^{29}TP_{k}}{\sum_{k=1}^{29}(TP_{k} + FN_{k})}$$

and $TP$ is True Positive, $FP$ is False Positive, $FN$ is False Negative, and $k$ represents each class in $\{1, 2, 3, \dots, 29\}$.

In Python, you can easily calculate this metric using sklearn.metrics.f1_score with the keyword argument average='micro'.
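As a quick check of the formula, here is a small worked example with sklearn.metrics.f1_score, using three label columns instead of 29; the pooling works the same way:

```python
import numpy as np
from sklearn.metrics import f1_score

# Two documents, three topic columns (the real problem has 29).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# Pooled over all columns: TP = 2, FP = 0, FN = 1,
# so P_micro = 1.0, R_micro = 2/3, and F_micro = 0.8.
score = f1_score(y_true, y_pred, average="micro")
print(score)  # 0.8
```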

Hints for How to Approach this Problem

Since you are given over 18,000 documents in the train set, running models on the entire set can be computationally intensive. If so, try running your preprocessing and modeling on a smaller subset of the data to identify the best model, then re-train it on the entire dataset.

Remember that this is a multilabel problem, meaning each document can have multiple topics associated with it (i.e., for a given row in train_labels.csv, there may be multiple 1s across the 29 different columns). Scikit-learn, for example, has some helpful resources on multilabel problems, including a OneVsRestClassifier that may be useful.
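As a sketch of one possible baseline (not the intended solution), a TF-IDF representation of the document text can be fed into a OneVsRestClassifier, which fits one binary classifier per topic column. The corpus and labels below are tiny stand-ins for train_values.csv and train_labels.csv:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; in practice X is the page text from train_values.csv
# and Y is the 29-column 0/1 matrix from train_labels.csv.
X = ["irrigation and crop yields", "school enrollment rates",
     "rural irrigation canals", "teacher training programs"]
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # two of the 29 topics

# One binary logistic-regression classifier per topic column.
model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression()))
model.fit(X, Y)

# predict() returns a 0/1 matrix with one column per topic.
pred = model.predict(["new irrigation projects"])
print(pred.shape)  # (1, 2)
```

Because each topic gets its own classifier, this setup naturally allows a document to receive several 1s, which the micro-averaged F1 score then evaluates jointly across all columns.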

References

The World Bank. Publications. http://www.worldbank.org/en/research/brief/publications.