Joe Spearing

Latent social distancing with rubbish data

I have just finished reading a really interesting paper by Attar and Tekin-Koru ("Latent Social Distancing", issue 26 of Covid Economics, a series of "vetted" but not yet peer-reviewed papers). They adapt an epidemiological SEIRD (susceptible, exposed, infected, recovered, dead) model to include a "latent social distancing" term: the number of people newly exposed is a function of the number susceptible and the number infected, as in the baseline model, but also of social distancing, a measure they calculate as a kind of residual from the data. The punchline is that social distancing increases with the stringency of government measures and with the number of deaths recorded in a country so far (there is a behavioural effect of increased deaths on compliance with social distancing), and that Africa is the only continent which, on average, socially distanced more effectively than Asia. They also confirm their measure "works" by matching it to Apple and Google data on the movement of consumers.


This is quite cool. What I'm interested in here is a techy point: how on earth does this measure work, given how unusually bad the data on infection rates are for Covid-19? The model uses observed data on the number of infections, deaths and recoveries to infer the level of "social distancing". The data on the number of cases are very poor, however, simply because a large number of people may be asymptomatic, and even where symptoms are mild the testing infrastructure does not exist to ensure that everybody is tested. The scale of this problem can be illustrated thus: when a random sample of the Spanish public was tested on 13th May, 5% had Covid-19 antibodies. The official cumulative case load in Spain at the time was 229 thousand, or 0.49% of the Spanish population, roughly a tenth of the infections implied by the antibody survey. In other words, these data are junk. How do Attar and Tekin-Koru get such a reliable measure of social distancing out of them? To assess this, I recreate a model of an epidemic, and then investigate what happens if I introduce bias into the data and estimate social distancing using their method.


The SEIRD model with social distancing


The full SEIRD model works as follows. Define the following five measures: susceptible (the percentage of the population who have not yet been exposed to the virus), exposed (those who have come into contact with an infected person), infected (those who currently have the virus), recovered (those who have had the virus and are now immune) and dead (fairly self-explanatory). Now let's consider the laws of motion for these variables. The percentage of the population which is susceptible declines when someone who is susceptible comes into contact with someone who is infected. This in turn is proportional to the percentage of the population which is infected times the percentage of the population which is susceptible: the more people susceptible to infection and the more people infected, the more likely it is that somebody susceptible will be infected. So far so good.


The percentage of the population exposed in each period is equal to the percentage exposed last period, plus the newly exposed (those who were susceptible last period and came into contact with the infected), less the fraction of the exposed last period who become infected. Empirically, this fraction turns out to be roughly a seventh. Those who become infected are added to the stock of infected people each period, and a fraction of the infected either recover or die each period.


Attar and Tekin-Koru's innovation is to introduce a time-varying "social distancing" term which mediates the number of newly exposed people. Assuming that we observe infection rates, death rates and recovery rates, in principle we then have three equations in three unknowns (the susceptibles, the exposed and the social distancing term), and we can infer the level of social distancing.
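In symbols, the system I work with is the following (a sketch based on my reading: in particular, I let the social distancing term d enter the exposure term squared, which is one way to rationalise the square root in the identifying equation discussed below; the paper's exact functional form may differ):

\[
\begin{aligned}
S_{t+1} &= S_t - \beta\, d_t^2\, S_t I_t \\
E_{t+1} &= E_t + \beta\, d_t^2\, S_t I_t - \alpha E_t \\
I_{t+1} &= I_t + \alpha E_t - (\gamma_r + \gamma_d) I_t \\
R_{t+1} &= R_t + \gamma_r I_t \\
D_{t+1} &= D_t + \gamma_d I_t
\end{aligned}
\]

Here \(\alpha \approx 1/7\) is the rate at which the exposed become infected, \(\gamma_r\) and \(\gamma_d\) are the recovery and death rates, and \(\beta\) is a transmission parameter.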


I have simulated an epidemic using these laws of motion below. I pick the same structural parameters as the authors and set the infected percentage of the population at the outset as if 1,000 people in the UK were infected (1,000/70,000,000). Social distancing is a time-varying number bounded between 0 and 1, and I let its logarithm follow an AR(1) process with a persistence parameter of 0.8 and a normally distributed error term with a standard deviation of 0.1.
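A minimal sketch of this simulation in Python is below. The AR(1) persistence and standard deviation, the initial infected share and alpha are as stated above; beta, the recovery and death rates, the starting level of social distancing and the horizon are illustrative guesses of mine rather than the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 250                          # periods to simulate (assumption)
alpha = 1 / 7                    # exposed -> infected rate (roughly a seventh, as above)
gamma_r, gamma_d = 0.09, 0.01    # recovery and death rates (illustrative guesses)
beta = 1.0                       # transmission parameter (illustrative guess)

# Social distancing d_t in (0, 1]: log d_t follows an AR(1) with persistence 0.8, s.d. 0.1
rho, sigma = 0.8, 0.1
log_d = np.empty(T)
log_d[0] = np.log(0.5)           # arbitrary starting level (assumption)
for t in range(1, T):
    # cap at log(1) = 0 to keep d between 0 and 1
    log_d[t] = min(rho * log_d[t - 1] + sigma * rng.standard_normal(), 0.0)
d = np.exp(log_d)

# Population shares: susceptible, exposed, infected, recovered, dead
S, E, I = np.empty(T), np.empty(T), np.empty(T)
R, D = np.zeros(T), np.zeros(T)
I[0] = 1000 / 70_000_000         # as if 1,000 people in the UK were infected
E[0], S[0] = 0.0, 1.0 - I[0]

for t in range(T - 1):
    new_exposures = beta * d[t] ** 2 * S[t] * I[t]
    S[t + 1] = S[t] - new_exposures
    E[t + 1] = E[t] + new_exposures - alpha * E[t]
    I[t + 1] = I[t] + alpha * E[t] - (gamma_r + gamma_d) * I[t]
    R[t + 1] = R[t] + gamma_r * I[t]
    D[t + 1] = D[t] + gamma_d * I[t]
```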

[Figure: the simulated epidemic]

Introducing bad data


So then, suppose the data are rubbish: that is, suppose that the picture above describes the actual epidemic, but I in fact only get to observe a fraction of the infected (and not an unbiased fraction). How badly off would my measure of social distancing be? To find out, I make some assumptions about which of the data I would get to observe, construct a dataset of these, and then follow Attar and Tekin-Koru's methodology to uncover the extent of social distancing.


Specifically, I assume 1) that only half of those who have coronavirus ever actually test positive for it and 2) that all the serious cases would get tested and found positive. This means that although my measure of deaths would be correct, my measure of the infected would be off by 50%, and my measure of the recovered would be similarly wrong, because I would disproportionately observe the more serious cases. Taking this "data" and inferring social distancing gives me the picture below, of actual versus observed social distancing. Strikingly, it is pretty good: the two series have a correlation of 0.64, and average values of 0.54 and 0.46 respectively.
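Concretely, here is one way to construct the "observed" dataset from the simulation above, under the two assumptions just stated. Scaling recoveries by the same detection rate is my simple way of encoding "similarly wrong"; the susceptibles are then reconstructed from the observed series, as an analyst would have to do.

```python
# "Observed" series, continuing from the simulation above
detect = 0.5                 # assumption 1: only half of infections are ever detected
I_obs = detect * I           # observed infected
R_obs = detect * R           # recoveries among detected cases only (simple encoding)
D_obs = D.copy()             # assumption 2: every death is recorded

# Susceptibles as reconstructed from observables; this ignores the
# unobserved exposed and the undetected cases, so it is biased upwards
S_obs = 1.0 - I_obs - R_obs - D_obs
```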

[Figure: actual versus observed social distancing]

So how is this possible? The answer, as far as I can tell, lies in their identifying equation for the social distancing term.
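Working from the laws of motion above, my reconstruction of it is roughly the following (I omit the structural parameter \(\mu\), whose exact placement I do not attempt to reproduce):

\[
d_t = \sqrt{\frac{e_{t+1}\,(1 - \gamma + \alpha e_t) - (1 - \alpha)\, e_t}{\beta\, S_t}},
\qquad
e_t \equiv \frac{E_t}{I_t} = \frac{I_{t+1} - (1 - \gamma)\, I_t}{\alpha\, I_t}.
\]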
Here alpha is the fraction of the exposed who go on to become infected each period, gamma is the sum of the death rate and the recovery rate, S is the percentage of the population which is susceptible, and beta and mu are structural parameters. The variable e is the ratio of exposed to infected. If my measures of infected and recovered people are biased, then presumably the measures of gamma and of the susceptibles are also biased.
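In code, the inference step might look like the sketch below, following my reconstruction of the identifying equation (infer_distancing is a hypothetical helper of mine, not the authors' code; it continues from the simulation above):

```python
def infer_distancing(I, R, D, S, alpha=1/7, beta=1.0):
    """Back out d_t from (possibly mis-measured) series via the identifying equation."""
    gamma = (np.diff(R) + np.diff(D)) / I[:-1]               # per-period death + recovery rate
    e = (I[1:] - (1 - gamma) * I[:-1]) / (alpha * I[:-1])    # e_t = E_t / I_t from the I law of motion
    num = e[1:] * (I[2:] / I[1:-1]) - (1 - alpha) * e[:-1]   # (E_{t+1} - (1-alpha) E_t) / I_t
    return np.sqrt(np.maximum(num / (beta * S[:-2]), 0.0))   # the square root in the d calculation

d_hat_true = infer_distancing(I, R, D, S)                    # with the true series: recovers d exactly
d_hat_obs = infer_distancing(I_obs, R_obs, D_obs, S_obs)     # with the biased "data"
print(np.corrcoef(d_hat_true, d_hat_obs)[0, 1])              # how close are the two?
```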


Indeed, this is what we find. In the graph below, the red lines are the actual series and the blue lines are what we would estimate the values to be if we were mis-observing the data as described above. Naturally, there is a large difference between the actually infected and the observed infected; that is by assumption. We overestimate the rate of deaths and recoveries because, although we only get to see half of the infections, we get to see all of the deaths. However, the e term, the ratio of exposed to infected which is used to calculate the social distancing measure, is not so bad. This is because it is identified from the law of motion for infections, so it depends only on the rate of increase in infections and the rate of death and recovery. Provided the percentage bias in the infection rate is consistent from one period to the next, and the bias in the rate of deaths and recoveries is small, the estimate will be relatively accurate. Because our estimate of gamma is biased upwards, our estimate of the ratio of exposed to infected is biased upwards too; but precisely because the bias in gamma is small, this bias is also small. Further, it is offset by the bias in the susceptibles estimate, and dampened by the square root in the calculation of d.

[Figure: actual (red) versus estimated (blue) series under mis-observation]

Therefore, even though the sample of infections we observe is incomplete and biased, the measure of social distancing comes out reasonably close.



This argument goes through when, instead of restricting our observations to half of infections, we assume that an increasing percentage of infections is caught as the epidemic proceeds. The reason is that keeping the bias in e small only requires that the percentage of infections observed be similar from one period to the next. In the graph below, I assume that the percentage of infections observed is not a flat 50%, but rather rises linearly from 20% at the beginning of the epidemic to 75% by the end. The resulting estimate is again very close.
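In the sketch above, this amounts to replacing the flat detection rate with a rising path (the 20% and 75% endpoints are the ones I assume here):

```python
# Detection rate rising linearly from 20% at the outset to 75% by the end
detect_path = np.linspace(0.20, 0.75, T)
I_obs2 = detect_path * I
R_obs2 = detect_path * R          # rough: applies the current detection rate to the stock
D_obs2 = D.copy()                 # deaths still fully observed
S_obs2 = 1.0 - I_obs2 - R_obs2 - D_obs2

d_hat_obs2 = infer_distancing(I_obs2, R_obs2, D_obs2, S_obs2)
```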
[Figure: actual versus estimated social distancing with a rising detection rate]

Conclusion


So what does this mean for a SEIRD model with social distancing? It suggests that even with patchy testing, and with variation in the level of testing over the course of an epidemic, this methodology is an appropriate way of assessing the extent of social distancing during a Covid-19 epidemic. We knew this already, because Attar and Tekin-Koru have shown it by matching their social distancing measure to other data. The contribution of this modelling (other than entertaining me for a Sunday) is that it shows why the measure still works. It depends crucially on the (relatively) low mortality rate of Covid-19, which suggests that it may not map as straightforwardly onto more deadly infectious diseases.
