
Monday, March 9, 2020

COVID19: How many unconfirmed cases are there?

We have data on the number of confirmed cases and on the delay from symptom onset to confirmation.
We use this to estimate a best-case scenario for the number of unconfirmed cases.

Answer: the best case in Europe and the US is that the total number of cases is about 2 times the confirmed cases, so double the official figure. The situation is better in Asia: Japan and South Korea have a best-case bound of only about 10% extra.

More in-depth information about the method is below the graphs.

Belgium

France

Germany

Iran

Italy

Japan

China

Netherlands

Norway

Singapore

South Korea

Spain

Sweden

Switzerland

United Kingdom

United States

Method

Suppose that the pattern of delay from symptom onset to confirmed COVID19 infection is as in the figure below:


For each country, we have a time series of the number of confirmed cases; call these Y. The number of cases with symptom onset each day is unknown; call these X. For each day there is one observation and one unknown, plus some extra unknowns at the start because of the delay. So there are many sets of X that will fit the data Y.
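Written out (my reconstruction of the model; D is the delay cutoff and f_d the fraction of cases confirmed d days after symptom onset, matching the matrix A built in the code below):

y_i \approx \sum_{d=0}^{D-1} f_d \, x_{i-d}

where y_i is the number of newly confirmed cases on day i and x_j the number of new symptomatic cases on day j. Each observation y_i mixes together D different unknowns, which is why many X fit the same Y.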

How to choose between them? Instead of picking the model for X that I think is most likely, I pick the model that gives the most actionable information. Actionable information in this case means knowing that at least so-and-so many people are infected. If the best case is bad, and it is, then we know that inaction is not an option.

Choose the most optimistic X by penalizing every x value above 0 in the fitted model. Implicitly, this means that we treat a world with no new cases each day as the most likely one, so the data has to do the whole job of "lifting" our estimate of X up from zero.
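Putting it together, the program solved for each country is (my summary in equation form; the weight 0.1 and the cumulative-count constraint come from the code below):

\min_{x \ge 0} \; \sum_i \Big( y_i - \sum_{d=0}^{D-1} f_d \, x_{i-d} \Big)^2 + 0.1 \sum_j x_j^2 \quad \text{subject to} \quad \sum_{j \le i} x_j \ge C_i \text{ for every observed day } i,

where C_i is the cumulative number of confirmed cases up to day i. The quadratic penalty is the "mass" term in the code: it pulls every x_j towards zero, so any cases that remain in the solution are ones the data forces to be there.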

# pandas, numpy and datetime are needed below; num_cases_data,
# get_delay2freq() and get_solver() are helpers defined elsewhere in the
# original code (data path, delay distribution, solver wrapper).
from datetime import datetime

import numpy as np
import pandas as pd

def estimate_unconfirmed():
    df = pd.read_csv(num_cases_data)
    for country in ["Singapore", "Sweden", "US", "France", "Germany", "Italy",
                    "Iran", "UK", "South Korea", "Netherlands",
                    "Norway", "Belgium", "Spain", "Switzerland",
                    "Japan", "Mainland China"]:
        df_country = df[df["Country/Region"] == country]
        day2case = df_country.groupby(["ObservationDate"])["Confirmed"].sum()
        days = [datetime.strptime(date, "%m/%d/%Y") for date in day2case.keys()]
        total_confirmed = day2case.to_numpy()
        times_arr = np.array(range(len(total_confirmed)))

        # Delay distribution from onset to confirmation: truncate to two weeks,
        # renormalize, and reverse so it can be written into the convolution
        # matrix row by row.
        cutoff_days = 14
        delay2freq = get_delay2freq()
        delay2freq = delay2freq[:cutoff_days]  # two weeks cutoff
        delay2freq = delay2freq / np.sum(delay2freq)  # normalize
        delay2freq = delay2freq[::-1]

        # Daily new confirmations; row i of A maps the unknown new cases onto
        # the confirmations expected on day i.
        new_confirmed = np.diff(total_confirmed)
        #new_confirmed = new_confirmed[:-28]
        N = len(new_confirmed)
        A = np.zeros([N, N + cutoff_days - 1])
        for i in range(N):
            A[i, i:i+cutoff_days] = delay2freq

        # One non-negative unknown per day of new cases, including the extra
        # days before the first observation.
        solver = get_solver("couenne")
        new_cases = [solver.NumVar(lb=0) for _ in range(N + cutoff_days)]

        # Squared fit error between predicted and observed new confirmations.
        errs = []
        for row, new_conf in zip(A, new_confirmed):
            est_confirmed = solver.Sum([nc * elem for elem, nc in zip(row, new_cases)])
            err = (est_confirmed - new_conf)**2
            errs.append(err)
        total_err = solver.Sum(errs)

        # Cumulative cases must never fall below cumulative confirmed cases.
        total_cases = [solver.Sum(new_cases[:i+1]) for i in range(N + cutoff_days)]
        total_cases = total_cases[cutoff_days-1:]
        print(len(total_confirmed), len(total_cases))  # sanity check: lengths should match
        for tn_conf, tn_case in zip(total_confirmed, total_cases):
            solver.Add(tn_case >= tn_conf)

        # mass: baseline hypothesis is that the number of new cases is 0
        mass = solver.Sum(nc**2 for nc in new_cases)
        # diff: baseline hypothesis is that the number of new cases is like yesterday's
        #diff = solver.Sum([(nc1 - nc0)**2 for nc0, nc1 in zip(new_cases, new_cases[1:])])
        # exp_diff: baseline hypothesis is that the number of new cases grows by 30% each day
        #exp_diff = solver.Sum([(nc1 - 1.3 * nc0)**2 for nc0, nc1 in zip(new_cases, new_cases[1:])])

        solver.SetObjective(total_err + mass * 0.1, maximize=False)
        solver.Solve(time_limit=10, verbose=False)
        new_cases_solve = [solver.solution_value(nc) for nc in new_cases]
        total_cases = np.cumsum(new_cases_solve)
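For readers who want to try the idea without the couenne-based solver wrapper above, here is a minimal self-contained sketch of the same deconvolution on synthetic data, using cvxpy as a stand-in QP solver. The delay distribution and the outbreak curve below are invented for illustration only; they are not the ones behind the graphs above.

import numpy as np
import cvxpy as cp

D = 14                                  # delay cutoff in days
f = np.exp(-0.3 * np.arange(D))         # illustrative (made-up) delay distribution
f = f / f.sum()                         # normalize to a probability distribution

N = 40                                  # number of observed days
M = N + D - 1                           # unknowns: one per onset day, incl. D-1 early days
x_true = 5.0 * 1.15 ** np.arange(M)     # synthetic exponential outbreak

# Convolution matrix: onsets on a given day show up as confirmations spread
# over the following D days according to f.
A = np.zeros((N, M))
for i in range(N):
    A[i, i:i + D] = f[::-1]

new_confirmed = A @ x_true              # synthetic daily new confirmations
total_confirmed = np.concatenate([[0.0], np.cumsum(new_confirmed)])

# Most-optimistic reconstruction: fit the observations, penalize total case
# "mass", and keep cumulative cases above cumulative confirmations.
x = cp.Variable(M, nonneg=True)
fit_err = cp.sum_squares(A @ x - new_confirmed)
mass = cp.sum_squares(x)
total_cases = cp.cumsum(x)[D - 2:]      # aligned with the N+1 cumulative observations
prob = cp.Problem(cp.Minimize(fit_err + 0.1 * mass),
                  [total_cases >= total_confirmed])
prob.solve()

print("estimated total cases:", float(cp.sum(x).value))
print("total confirmed:      ", float(total_confirmed[-1]))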
