Here at the RCE, we’ve come to learn that more detailed datasets are not always better suited to what we do.

Recently, I was looking to see if anybody at the ACS Data Users Group had dealt with an issue and stumbled upon a statement that caught me off guard. The post mentioned that the Census Bureau has said that derived estimates of American Community Survey (ACS) data should not be built from more than four component estimates, because the derived Margins of Error (MoEs) will diverge. We’ve always used guidance from the published Handbooks for Data Users in deriving our MoEs, and there is no mention of this concern.

After thinking about it though, it made sense that the calculation (the square root of the sum of the squared component estimate MoEs) would be affected by large aggregations. Just by virtue of the equation, each component estimate MoE is going to increase the derived MoE. However, we like having documented statements to support common (or not so common) sense. Plus, this was going to be a relatively large overhaul of our Communities Reporter Tool (CRT) data. I found a reference to the “more than four estimates” problem in a PowerPoint given by a Census Bureau employee at one point. While looking for discussions of Coefficients of Variation (CVs), Lindsay found the following information in a Census Bureau publication (Instructions for Applying Statistical Testing to the 2011-2013 ACS 3-Year Data and the 2009-2013 ACS 5-Year Data) discussing standard errors:

All methods in this section are approximations and users should be cautious in using them. This is because these methods do not consider the correlation or covariance between the basic estimates. They may be overestimates or underestimates of the derived estimate’s standard error, depending on whether the two basic estimates are highly correlated in either the positive or negative direction. As a result, the approximated standard error may not match direct calculations of standard errors or calculations obtained through other methods.

As the number of basic estimates involved in the sum or difference increases, the results of this formula become increasingly different from the standard error derived directly from the ACS microdata. Care should be taken to work with the fewest number of basic estimates as possible. If there are estimates involved in the sum that are controlled in the weighting then the approximate standard error can be tremendously different.
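The approximation in question is the sum/difference formula from the ACS handbooks: the derived MoE is the square root of the sum of the squared component MoEs. A minimal sketch (with hypothetical, equal MoE values for illustration) shows why each added component pushes the derived MoE upward:

```python
import math

def derived_moe(component_moes):
    """Approximate MoE for a sum or difference of estimates:
    square root of the sum of the squared component MoEs."""
    return math.sqrt(sum(m ** 2 for m in component_moes))

# Hypothetical component MoEs, all equal to 50, aggregated in
# increasing numbers; the derived MoE grows with sqrt(n).
for n in (2, 4, 8, 16):
    print(n, round(derived_moe([50] * n), 1))
# → 2 70.7 / 4 100.0 / 8 141.4 / 16 200.0
```

With equal components the derived MoE grows as the square root of the component count, which is why collapsing 16 columns into 2 helps so much.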

With this in mind, I went on a hunt to find ACS tabulations that were not as rich, but whose estimates were more closely aligned to what we were using. Being a data nerd, I was really curious about how big a difference this change in source tables would make in the data we present. In the future, we will be using CVs on the CRT to indicate which data to use with caution, so I calculated CVs for our old derived estimates and for our new published or derived estimates.
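Since published ACS MoEs are at the 90 percent confidence level, the standard error is the MoE divided by 1.645, and the CV expresses that standard error as a percentage of the estimate. A sketch of that calculation, with illustrative reliability cutoffs (12% and 40% are common conventions, not necessarily the thresholds the CRT uses):

```python
def cv_percent(estimate, moe):
    """CV: the standard error (MoE / 1.645 for ACS 90% MoEs)
    as a percentage of the estimate."""
    return (moe / 1.645) / estimate * 100

def reliability(cv):
    # Illustrative cutoffs; the actual "excellent/good/poor"
    # thresholds used on the CRT may differ.
    if cv <= 12:
        return "excellent"
    if cv <= 40:
        return "good"
    return "poor"

# Hypothetical estimate and MoE.
cv = cv_percent(estimate=1200, moe=150)
print(round(cv, 1), reliability(cv))  # → 7.6 excellent
```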

For educational attainment, we had previously used a table that provided educational attainment broken out by sex and by 16 different levels of education. Really interesting, but far more detail than what we use. We house data on the Percentage of Adults with less than High School Education, and previously, that was derived from 16 different columns of data. Using a different table, I was able to reduce this to two. For each of the geographies we report on, I’ve graphed how our old table CVs match up with our new table CVs.
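For a percentage like this, the handbook approximation for a derived proportion (where the numerator is a subset of the denominator) subtracts rather than adds the denominator term, falling back to the ratio form when the quantity under the radical goes negative. A sketch with hypothetical counts and MoEs:

```python
import math

def proportion_moe(num, num_moe, den, den_moe):
    """MoE of a proportion p = num/den, per the ACS handbook
    approximation: sqrt(MoE_num^2 - p^2 * MoE_den^2) / den.
    Falls back to the ratio formula (plus sign) when the
    quantity under the radical is negative."""
    p = num / den
    under = num_moe ** 2 - (p ** 2) * den_moe ** 2
    if under < 0:
        under = num_moe ** 2 + (p ** 2) * den_moe ** 2
    return math.sqrt(under) / den

# Hypothetical: 300 of 1,200 adults with less than a high
# school education, with MoEs of 40 and 90 respectively.
print(round(proportion_moe(300, 40, 1200, 90), 4))
```

Every column folded into the numerator inflates its MoE before this step, which is the practical payoff of starting from a two-column table.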

While the number of geographies with “excellent” CVs has remained pretty consistent across the years, I find the distribution of “good” and “poor” CVs in earlier years quite interesting. I don’t know for sure why earlier year MoEs and CVs seem more vulnerable to aggregation in this dataset, especially since other tables do not have the same kind of divergence. That might be a rabbit hole for another day.

Before we dig into some of the many intricacies of the data world, we will begin by sharing how we approach working with data. Standards of practice for data collection and processing are very often challenging to find. In an effort to be transparent within our team as well as with users interacting with the site, we developed the following list of data quality standards:

• We will ethically use and display data
• We will be transparent in our data collection and processing methods
• We will be as precise with the data as possible
• We will be explicit about weaknesses in the data with our users
• We will have a firm grasp of the methods (data collection and our own processing methods) before we post data on the Communities Reporter Tool (CRT)
• We will explain methods clearly to everyone, both on the website and in other communications
• We will be mindful of the story we are telling with the data, as well as the story that needs to be told regarding the topic at hand
• We will address data errors as quickly as possible when discovered
• We will view the discovery of data errors as opportunities to learn and improve our methods

It is with these standards in mind that we dive into the depths of the data world.

Join us, won’t you?