A Language for Survey Programming

Ben Ewing

2020/01/05

What’s wrong with XLSForms?

XLSForm defines a very accessible (other than its reliance on Excel) standard for programming surveys compatible with OpenDataKit, SurveyCTO, and a myriad of other data collection tools used in low/no connectivity environments. XLSForms are dead easy to program, but can produce very complex surveys. However, in my ~5 years of programming XLSForm surveys I’ve encountered a few problems caused by the reliance on Excel for programming:

To address these issues, I’ve tried building a domain-specific language (DSL) embedded in R that can generate XLSForms. This takes the form of the XLSFormTools R package. My first iteration used a tidyverse-style API, but I’ve found the embedded DSL approach to be a little more flexible and easier for the end user. This solves all of the problems I outlined: users can use Git (or any text/diff-based version control system) for collaboration and version control, and they do not need Excel (or any alternative) to program their surveys.

Why embed in R?

I chose to use R for a number of reasons:

The Language

I tried to keep the language as close to the XLSForm standard as possible; if the user thinks something should work, it hopefully does.

Surveys are defined inside the to_survey environment. Within this environment all R syntax and functions apply (though I’ve gone back and forth on importing the global environment; using a closed environment would allow for more specialized syntax).

library(XLSFormTools)

to_survey({
  # This is a survey, note that this is an R-style comment!
})

This function returns a list with three tibbles which represent the survey, choices, and settings sheets of an XLSForm.
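To make the environment idea concrete, here is a minimal base-R sketch of how an embedded DSL of this kind can be built. It is not the XLSFormTools implementation: `mini_survey`, `add_row`, and the two verbs it defines are hypothetical names for illustration, and it only produces the survey sheet.

```r
# Hypothetical mini-DSL for illustration; NOT the XLSFormTools implementation.
mini_survey <- function(expr) {
  caller <- parent.frame()  # where the user wrote the survey block
  rows <- list()            # accumulates one data frame per question

  add_row <- function(type, name, label = "") {
    rows[[length(rows) + 1]] <<- data.frame(
      type = type, name = name, label = label, stringsAsFactors = FALSE
    )
    invisible(NULL)
  }

  # DSL verbs live in their own environment. Its parent is the caller's
  # environment, so ordinary R functions and variables still resolve.
  dsl <- new.env(parent = caller)
  dsl$int  <- function(name, label = "") add_row("integer", name, label)
  dsl$text <- function(name, label = "") add_row("text", name, label)

  # Capture the block unevaluated, then run it inside the DSL environment.
  eval(substitute(expr), dsl)

  list(survey = do.call(rbind, rows))
}

prefix <- "A"
form <- mini_survey({
  int("hh_size", paste0(prefix, "1. How many people live in your household?"))
  text("hh_name", paste0(prefix, "2. What is the household head's name?"))
})
```

Swapping `parent = caller` for `parent = baseenv()` would give the closed variant: base R functions would still work inside the block, but variables such as `prefix` would no longer be visible.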

Settings and choices are easily defined:

to_survey({
  # Settings, this syntax could be improved
  setting(form_title, "example_form")
  setting(version, "1")
  
  # Choices
  choice(list_name = "yn", "0" = "No", "1" = "Yes")
  choice(list_name = "state", "1" = "CA", "2" = "NC")
})

All meta and regular question types work as expected:

to_survey({
  # Choices
  choice(list_name = "yn", "0" = "No", "1" = "Yes")
  choice(list_name = "states", "1" = "CA", "2" = "NC")
  
  # Meta questions for start and end time
  start(name = "start_time")
  end(name = "end_time")
  
  # Some questions, this is not a good survey :)
  select_one(choices = "states", name = "favorite_state", 
             label = "A1. What is your favorite state?")
  select_one(choices = "yn", name = "live_in_fav_state",
             label = "A2. Do you live in your favorite state?")
  
  # Verify location
  geopoint(name = "survey_location", label = "A3. Enumerator: Please take a GPS recording.")
})

Lastly, groups and repeat groups are quite intuitive:

a_survey <- to_survey({
  # Group A
  begin_group(name = "section_a", label = "Section A", {
    # Integer questions use `int` so they don't overwrite base::integer
    int("hh_size", "A1. How many people live in your household?")
  })
  
  begin_repeat(name = "section_b", label = "Section B: Household Roster", 
               repeat_count = "${hh_size}", {
    int("age", "B1. What is this household member's age?")
  })
})

As mentioned, the to_survey function just returns a list of tibbles:

a_survey
## $survey
## # A tibble: 6 x 14
##   type  name  hint  constraint constraint_mess~ required required_message
##   <chr> <chr> <chr> <chr>      <chr>            <chr>    <chr>           
## 1 begi~ sect~ ""    ""         ""               ""       ""              
## 2 inte~ hh_s~ ""    ""         ""               ""       ""              
## 3 end_~ <NA>   <NA>  <NA>       <NA>             <NA>     <NA>           
## 4 begi~ sect~ ""    ""         ""               ""       ""              
## 5 inte~ age   ""    ""         ""               ""       ""              
## 6 end_~ <NA>   <NA>  <NA>       <NA>             <NA>     <NA>           
## # ... with 7 more variables: default <chr>, relevant <chr>, read_only <chr>,
## #   calculation <chr>, appearance <chr>, label <chr>, repeat_count <chr>
## 
## $choices
## # A tibble: 0 x 0
## 
## $settings
## # A tibble: 0 x 0
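
Because the result is a plain named list of data frames, turning it into a workbook takes little code. The sketch below assumes the writexl package, which writes a named list of data frames as one sheet each; XLSFormTools may well ship its own export helper, which is not shown here. The `form` object is a hand-built stand-in for what to_survey returns, and the output path is arbitrary.

```r
# Stand-in for the list returned by to_survey(); column names follow XLSForm.
form <- list(
  survey = data.frame(
    type = c("integer", "text"),
    name = c("hh_size", "hh_name"),
    label = c("A1. How many people live in your household?",
              "A2. What is the household head's name?"),
    stringsAsFactors = FALSE
  ),
  choices = data.frame(
    list_name = c("yn", "yn"), name = c("0", "1"), label = c("No", "Yes"),
    stringsAsFactors = FALSE
  ),
  settings = data.frame(form_title = "example_form", version = "1",
                        stringsAsFactors = FALSE)
)

out_path <- file.path(tempdir(), "example_form.xlsx")

# writexl is an assumption here; the write is skipped if it is not installed.
if (requireNamespace("writexl", quietly = TRUE)) {
  # Each list element becomes a sheet named survey, choices, or settings.
  writexl::write_xlsx(form, path = out_path)
}
```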

Downsides

There are of course several downsides. First, some users will find text-based programming more intimidating than a nicely organized Excel workbook. Second, XLSFormTools does add another layer of abstraction. An ODK user must first program the survey in R, export to an XLSForm, then convert that to ODK’s XForm standard. However, I consider this additional layer to be pretty thin.

Future Work

There is still a lot of work to be done: