24 Mar 2020

How to predict the future of SARS-CoV-2 pandemic in Poland with mathematical models and machine learning?

In the age of the coronavirus pandemic, we are dealing with uncertainty and a very vague vision of what will happen in the near future. Due to the fast spread of the virus and high mortality among older people, the Polish government quite quickly decided to implement a restrictive lockdown: closing schools, kindergartens, universities, cultural venues, pubs, and restaurants, and banning events and gatherings of more than 50 people. At the same time, it recommended social distancing and self-isolation, limiting contact with other people, and, where possible, switching to remote work and staying at home.

In addition, stricter measures were introduced as of March 24th, under which people may leave home only for work or necessary shopping. Public transportation may be filled to only half of its seating capacity, and people may not walk the streets in groups larger than two, unless they are family. Unfortunately, the current situation will have serious economic and social consequences, so everyone realizes that it cannot be maintained for too long. Most of us are therefore asking the following questions:

  • How long should we keep the lockdown?
  • How many people will be infected?
  • How long will the pandemic last?
  • How many people will die?
  • What if we release the lockdown after Easter?
  • What if we do more testing for COVID-19?

In this article, we show how to perform a COVID-19 pandemic analysis using mathematical modeling and machine learning, and answer the above questions in an approximate way. The models were fitted to data for Poland from March 5 to March 23, containing the numbers of total cases, deaths, and recoveries [1] (see Figure 1). At the outset, it should be noted that every mathematical model is limited by the set of assumptions that must be made before fitting it to the data. In addition, we currently do not have a large amount of data on the course of the pandemic in Poland. Therefore, the results presented in this article are primarily illustrative, and the numbers should be treated with some caution.

We based our analysis on the SEIR epidemiological model, which is commonly used to analyze the spread of coronavirus [2][3]. However, this model has three limitations.

  • First of all, it does not directly model the course of the number of cases, deaths, and recoveries that are available in the data.
  • Secondly, it assumes that all patients are recognized and isolated, and does not include people who have no symptoms or have not been tested for the virus.
  • Thirdly, it does not model uncertainty, which would show how strongly reality can depart from what the model shows us.

Our contribution is then fourfold.

  • First, we have extended the SEIR model with additional elements that allow us to forecast the values available in the data.
  • Secondly, we extended the model to include asymptomatic or untested patients.
  • Thirdly, we used the Bayesian approach to model uncertainty.
  • Fourthly, we fitted our model to the data for Poland and showed how to perform some basic analysis.

The article is divided into three parts:

  • Results & Discussion. The section contains the results of experiments along with a comment and answers to the questions posed in the introduction. If you are not interested in technical details, you can jump to Conclusions immediately after this section.
  • Model Details. The section contains a detailed description of the model used, code fragments and mathematical equations that will allow you to repeat the analysis contained in the article yourself.
  • Limitations. The section contains a list of selected limitations and imperfections of the model, which can be a good starting point for independent improvement and experimenting with data.

Results & Discussion

To answer the questions, we conducted three simulations, where we control the lockdown and the number of people diagnosed with tests in various ways:

  • In the first simulation, the lockdown is removed after Easter.
  • In the second, the lockdown is maintained until the epidemic stops, and the number of tests stays at the current level.
  • In the third, the lockdown is maintained and we additionally increase the number of tests, isolating infected people more effectively.

Forecasts for the following values were presented in all experiments: Exposed – people who are infected on a given day; Infectious – people who are sick and infect others; Never Diagnosed – people whose course was asymptomatic or who were never tested; Total Cases Diagnosed – reported coronavirus cases; Diagnosed & Recovered – reported recoveries; Dead – reported deaths.

For the purposes of the following analysis, we assumed that by introducing lockdown and widespread self-isolation in Poland, each citizen reduced the number of people with whom they interact by a factor of 5-10. In addition, we assume that we do not know what percentage of patients is diagnosed through testing, and we conservatively assume that it lies between 10% and 90% (see the Model Details section for more insight). This is an important assumption, as a recent publication shows that up to 86% of patients may undergo COVID-19 asymptomatically [8]; without testing a wide spectrum of people, many cases may therefore never be isolated.

Experiment 1: Releasing the lockdown after Easter

In Figure 2, we see that releasing the lockdown after Easter would cause the epidemic to shoot up sharply at the beginning of May, with the number of infected growing into the tens of thousands. In this scenario, the peak incidence falls in the second half of May and may reach several million, and the number of deaths could run into the hundreds of thousands. The epidemic would end around the beginning of the vacation season.

Experiment 2: Keeping the lockdown

Assuming that the lockdown is maintained, Figure 3 shows that the number of cases will probably keep increasing steadily in the near future. Additionally, using the model, we can calculate that under these assumptions the probability that the epidemic ends by the end of May is ~7.5% (it can be approximated using the predictive distribution for the Infectious curve; see the Model Details section).
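
As a sketch of how such a probability can be read off the posterior samples: the fraction of accepted trajectories whose Infectious count has effectively dropped to zero by the target day approximates it. The array shape and helper name below are hypothetical, not part of the original analysis:

```python
import numpy as np

def prob_epidemic_over(trajectories, day_index, population, threshold=1.0):
    """Estimate P(epidemic over by a given day) as the fraction of posterior
    trajectories whose Infectious count falls below `threshold` people.

    trajectories: (n_samples, n_days) array of the Infectious *fraction*
    of the population, one row per accepted ABC sample (assumed layout).
    """
    infectious_counts = trajectories[:, day_index] * population
    return float(np.mean(infectious_counts < threshold))
```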

Figure 4 gives us a broader perspective: strong social distancing stretches the epidemic out in time, shifting its peak to the autumn. This buys time during which a drug or vaccine can be developed, or the epidemic may die out for other reasons. In addition, we observe the flattening of the curve, with the peak incidence reduced to ~200,000. Assuming that ~5% of cases require specialist care, it should be realistic to prepare adequate medical facilities by that time.

It is worth noting, however, that the economic effects of a long lockdown would be catastrophic. To partly address this problem, Neil Ferguson and his group from Imperial College London [4] proposed a strategy of cyclically lifting and reimposing the lockdown: the public would adapt to an epidemic in which everything is alternately open for a few weeks and then closed again for a few more.

Experiment 3: Keeping the lockdown & increasing the number of tested people

In the experiment below, we assume that we conduct extensive testing for the presence of coronavirus in society, so that fewer than 20% of infected people remain undiagnosed and unisolated.

An extensive testing strategy is used in Germany, where despite a large number of cases (~30,000) the mortality rate is 0.4%. This supports the hypothesis that there are many more mild and asymptomatic cases, which must be isolated so that they do not infect other people.

Figure 5 shows that lockdown combined with extensive testing and isolation of patients ensures that we are able to control and gradually quench the epidemic. Even if the numerical values are underestimated due to the data we have, the trend should hold. At the same time, the number of deaths would be kept at a very low level. Using the model, we can estimate that in such a scenario the probability of the epidemic ending by the end of May is ~44%, and by the end of July already ~78%, with a clear slow-down of the epidemic visible by mid-April.

Model Details

Epidemic model

The SEIR model (S – susceptible, E – exposed, I – infectious, R – recovered) is a model where epidemic dynamics is described by a set of differential equations [2][3]. To use publicly available data, we have extended this model to the form of SEINCRAD (S – susceptible, E – exposed, I – infectious, N – never diagnosed, C – total cases diagnosed, R – diagnosed & recovered, A – active cases, D – dead). Writing each letter for the fraction of the population in that group, it is described as follows:

dS/dt = -β·S·I
dE/dt = β·S·I - ε·E
dI/dt = ε·E - η·(1-ndr)·I - σ·ndr·I
dN/dt = σ·ndr·I
dC/dt = η·(1-ndr)·I
dR/dt = κ·(1-mr)·A
dA/dt = η·(1-ndr)·I - κ·(1-mr)·A - ω·mr·A
dD/dt = ω·mr·A

Some of the extensions above were also proposed by Gabriel Goh [5]. Intuitively, the above model can be imagined as a set of communicating vessels, where each vessel describes one of the groups of people defined above. The dynamics of the spread of the disease is modeled as the flow of people between individual vessels with the pace set by the parameters. The interpretation of the model parameters is as follows:

  • ε = 1/T_inc, where T_inc is the incubation period (in days) of the virus, i.e. the time from infection to the beginning of infecting others. For coronavirus, the incubation time is estimated at about 5 days, with uncertainty in the range (2, 8) [5].
  • η = 1/T_inf, where T_inf is the infectious period, i.e. the time from the start of infectiousness to isolation in a hospital or at home. The infectious period is estimated at about 3 days, with uncertainty in the range (0, 15) [5]. This parameter applies only to people who have pronounced symptoms of COVID-19 and will be diagnosed in that period.
  • σ = 1/T_con, where T_con is the contagious time during which a patient with mild or no symptoms is not isolated and can infect others. It is estimated that the contagious time can last up to 14 days since the first symptoms were observed.
  • β is the transmission rate, derived from the reproduction number R0, i.e. the average number of people who will be infected by one infected person. For coronavirus, R0 is estimated at about 2.2, with uncertainty in the range (1.4, 4.6) [5]. In addition, it is estimated that restrictive lockdown and social distancing can reduce the reproduction number by more than 80% (the reduction level). This means that on average we start to have contact with over 5 times fewer people than normal.
  • κ = 1/T_iso, where T_iso is the average isolation time, i.e. the duration of the disease during which the patient remains isolated. After this time, the disease ends with recovery. The data show that on average it is about 14 days, with uncertainty in the range (5, 30).
  • mr is the mortality rate. We took 1.7% for the purposes of the experiment, although its value for Poland to date is 1.07%.
  • ndr is the non-detection rate, i.e. the percentage of people who are ill but, because of having no symptoms or not being tested, are never diagnosed. According to the paper published in Science by Li Ruiyun [8], the share of infections with no symptoms is at the level of 86%.
  • ω = 1/T_dth, where T_dth is the period until death, i.e. the average time from isolation to death in the case of serious or critical symptoms. According to the data, it is about 5-10 days.
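
Since the equations work with rates rather than times, each time-based parameter enters the model as its reciprocal. A minimal sketch with illustrative midpoint values; the relation beta = R0/T_inf is a standard SEIR choice and an assumption here, as are all variable names:

```python
# Illustrative conversion of the interpretable epidemiological times into
# the rate parameters used by the ODE model. The point values are midpoints
# from the text; the actual analysis samples them from prior distributions.
tinc, tinf, tcon = 5.0, 3.0, 14.0   # incubation, infectious, contagious (days)
tiso, tdth = 14.0, 7.0              # isolation time, period until death (days)
r0, ndr, mr = 2.2, 0.5, 0.017       # reproduction number, non-detection, mortality

eps = 1.0 / tinc     # rate of leaving the Exposed group
eta = 1.0 / tinf     # rate at which symptomatic cases get diagnosed and isolated
sigma = 1.0 / tcon   # rate at which never-diagnosed cases stop being contagious
kappa = 1.0 / tiso   # recovery rate after isolation
omega = 1.0 / tdth   # death rate for fatal cases
beta = r0 * eta      # transmission rate (beta = R0 / T_inf, an assumed mapping)
```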

In addition, to perform the simulation using the SEINCRAD model for Poland, we must know the following data:

  • Lockdown day. In the case of Poland, this is March 15, which is 10 days after the first COVID-19 case was diagnosed.
  • Population. We assume that for Poland there are 37.98M inhabitants.
  • The number of initially infected people, I0, who returned from abroad. For this parameter, we will assume uncertainty in the range [1, 200].

The following Python code shows how to perform a single simulation for the SEINCRAD model using the odeint function from the scipy.integrate package.

import numpy as np
from scipy.integrate import odeint

def seincrad_model_with_lockdown(state, t, beta, eps, eta, kappa,
                                 mr, reduction_level, sigma, omega,
                                 ndr, lockdown_day):
    # After the lockdown day, the transmission rate is reduced.
    if t >= lockdown_day:
        beta = reduction_level * beta

    s, e, i, n, c, r, a, d = state
    dsdt = -beta * i * s
    dedt = beta * i * s - eps * e
    didt = eps * e - eta * (1 - ndr) * i - sigma * ndr * i
    dndt = sigma * ndr * i
    dcdt = eta * (1 - ndr) * i
    drdt = kappa * (1 - mr) * a
    dadt = eta * (1 - ndr) * i - kappa * (1 - mr) * a - omega * mr * a
    dddt = mr * omega * a

    return [dsdt, dedt, didt, dndt, dcdt, drdt, dadt, dddt]

i0 = initially_infected / population
state0 = [1.0, 0.0, i0, 0.0, 0.0, 0.0, 0.0, 0.0]  # eight states: S, E, I, N, C, R, A, D
t = np.linspace(0, simulation_days, time_steps)

simulation = odeint(seincrad_model_with_lockdown, state0, t,
                    args=(beta, eps, eta, kappa, mr, reduction_level,
                          sigma, omega, ndr, lockdown_day))

Uncertainty modeling

To perform the correct simulation of the epidemic, we still need to properly choose the model parameters, such as incubation period, infectious period, reproduction number, or non-detection rate. The values of these parameters are difficult to estimate precisely and often depend on the dispersion and lifestyle of a given society, and also on the number of performed tests and diagnosed cases. Therefore, instead of providing a point estimate, we will use the Bayesian approach and we will model each parameter with its prior distribution. Based on the data, we estimate their marginal posterior distributions. Thanks to this, we can assume that the parameters are uncertain and let the data explain them.

The posterior follows Bayes' rule, p(θ | D) ∝ p(D | θ)·p(θ), where D is the available data presented in Figure 1, and θ is the set of model parameters, including the initial reproduction number R0 and its reduction level after the introduction of lockdown. You can read about the various uses of the Bayesian approach in machine learning in the book by Kevin Murphy [6]. We assume uniform prior distributions for all parameters. The ranges from which the parameters are drawn are defined in the following code snippet:

param_dict = {
    'tinc': (1.0, 8.0),
    'tinf': (1.0, 10.0),
    'tcon': (1.0, 15.0),
    'tiso': (14.0, 30.0),
    'tdth': (1.0, 14.0),
    'ndr': (0.1, 0.9),
    'r0': (2.0, 4.5),
    'initially_infected': (1.0, 200.0),
    'reduction_level': (0.1, 0.2)
}
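
Sampling a candidate parameter set then amounts to one independent uniform draw per range. A minimal sketch; the helper name sample_prior is ours, not from the original analysis:

```python
import numpy as np

rng = np.random.default_rng(42)

# Prior ranges for the model parameters (same values as in the article).
param_dict = {
    'tinc': (1.0, 8.0), 'tinf': (1.0, 10.0), 'tcon': (1.0, 15.0),
    'tiso': (14.0, 30.0), 'tdth': (1.0, 14.0), 'ndr': (0.1, 0.9),
    'r0': (2.0, 4.5), 'initially_infected': (1.0, 200.0),
    'reduction_level': (0.1, 0.2),
}

def sample_prior(param_dict, rng):
    """Draw one parameter set from the independent uniform priors."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in param_dict.items()}

theta = sample_prior(param_dict, rng)
```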

We use the Approximate Bayesian Computation (ABC) approach to estimate the posterior distribution. It allows you to easily approximate marginal posteriors using a finite set of random samples. The application of the ABC method to differential equation models can be found e.g. in the paper by Jakub Tomczak & Ewelina Węglarz-Tomczak [7]. The procedure is as follows:

  • We generate a random set of parameters, θ, from the prior distribution.
  • We simulate the SEINCRAD model run, x(θ), for the parameters θ.
  • We calculate the Euclidean distance, d(x_data, x(θ)), between the data and the simulation. If the distance is below a given threshold, we keep the sample; otherwise we reject it.
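
The three steps above can be sketched as a plain rejection loop. Here simulate and sample_prior are placeholders for the SEINCRAD simulation and the prior sampling described earlier, and the names are ours:

```python
import numpy as np

def abc_rejection(simulate, data, sample_prior, n_draws, threshold, rng):
    """ABC rejection sampling: keep draws whose simulation lies close to data.

    simulate(theta) must return an array with the same shape as `data`
    (in our case, the simulated cases/deaths/recoveries curves).
    """
    kept = []
    for _ in range(n_draws):
        theta = sample_prior(rng)          # draw from the prior
        sim = simulate(theta)              # run the model
        if np.linalg.norm(sim - data) < threshold:  # Euclidean distance
            kept.append(theta)
    return kept
```

The kept samples approximate the marginal posteriors; the threshold trades off accuracy against the acceptance rate.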

Figure 6 shows the marginal posterior distributions generated for data from Poland. There are several interesting insights from it:

  • Distributions of incubation period and infectious period center around 4-5 and 3, respectively, which coincides with the data presented in publications [5].
  • The reproduction number is concentrated at ~4.0, which lies within published ranges, although above the most frequently quoted average (~2.8).
  • The initially infected distribution is concentrated in the range [0, 25], suggesting that the epidemic in Poland probably began with a small group of infected people.
  • The non-detection rate is between 70-80%, which suggests that only roughly one in four or five infected people is diagnosed.

Having estimated posterior distributions for parameters, we can also approximate the posterior predictive distribution for individual components of the SEINCRAD model, and consequently obtain the probability distribution of each variable as a function of time.

Its exact calculation is intractable, however, using the kept samples we can estimate its moments. In particular, we calculate the predictive means together with percentiles 5% and 95% to illustrate uncertainties. Predictive distributions for most of the components of the SEINCRAD model are presented in Figures 2-5.
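
Computing the pointwise predictive mean and a 5%/95% band from the kept samples can be done directly with NumPy; the shape of the trajectories array and the helper name are assumptions:

```python
import numpy as np

def predictive_band(trajectories, lower=5, upper=95):
    """Pointwise predictive mean and percentile band over posterior samples.

    trajectories: (n_samples, n_days) array holding one simulated curve
    (e.g. Infectious) per accepted ABC sample (assumed layout).
    """
    mean = trajectories.mean(axis=0)
    lo, hi = np.percentile(trajectories, [lower, upper], axis=0)
    return mean, lo, hi
```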

Limitations

Like any mathematical model, the SEINCRAD model is only a simplification of reality and has many limitations that introduce systematic bias into its forecasts. In particular:

  • The model does not take into account the capacity of the healthcare system, or how the mortality rate changes once it becomes overloaded.
  • The model assumes that society is infected and isolated in a homogeneous manner. For example, it does not take into account professional groups (e.g. doctors, salespeople) that for obvious reasons cannot keep social distancing at a certain level, nor periodic changes in behavior, e.g. around Easter.
  • Prior distributions of the model parameters are imprecise. Better estimation of uncertainty for them would improve the quality of forecasts.

Conclusions

In this article, we showed how to predict the development of the coronavirus epidemic using mathematical modeling and machine learning. In particular, the results show how essential social distancing and self-isolation are for keeping the number of infected people at a controlled level and not overloading the healthcare system. In addition, we see that widespread testing and isolation of all infected people is an important element, which on a large scale can even bring the epidemic to an end in a short time. The simulations show that we must at least take measures to control the pace of the epidemic until a drug or vaccine is developed, or the pandemic burns out on its own.

At the same time, we hope that the tutorial above contains enough details to repeat the whole experiment. We encourage you to stay at home and experiment with data.

#stayathome #zostanwdomu


  1. Worldometer. https://www.worldometers.info/coronavirus/
  2. Wu et al. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. 2020
  3. Kucharski et al. Analysis and projections of transmission dynamics of nCoV in Wuhan. 2020
  4. Ferguson et al. Impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand. 2020
  5. Gabriel Goh. Epidemic calculator. http://gabgoh.github.io/COVID/index.html
  6. Kevin Murphy. Machine Learning: A Probabilistic Perspective. 2012
  7. Tomczak & Węglarz-Tomczak. Estimating kinetic constants in the Michaelis–Menten model from one enzymatic assay using Approximate Bayesian Computation. 2019
  8. Ruiyun Li et al. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2). 2020
