## Time Series Analysis and Forecasting: Examples, Approaches, and Tools

- 21 Mar, 2022

## What are time series forecasting and analysis?

What time series analysis is: trends, seasons, cycles, and irregularities.

Source: Forecasting: Principles & Practice, Rob J Hyndman, 2014

Trends and seasonality are clearly visible

## Time series forecasting and analysis: examples and use cases

Demand forecasting supports retail, procurement, and dynamic pricing, while price prediction powers customer-facing apps and a better user experience.

Source: Fareboom.com

The engine has 75 percent confidence that the fares will rise soon

## Forecasting pandemic spread, diagnosis, and medication planning in healthcare

Anomaly detection serves fraud detection, cybersecurity, and predictive maintenance.

Finding anomalies in time series data. Source: Neptune.ai

## Approaches to time series forecasting

“Prediction is very difficult, especially if it’s about the future.”

Niels Bohr, Nobel laureate in Physics

Bringing stationarity to data

## Traditional machine learning methods

Stream learning approach, ensemble methods.

Source: Our Quest for Robust Time Series Forecasting at Scale, Eric Tassone and Farzan Rohani, 2017. Forecasting procedure at Google

## Tools and services used for time series forecasting

Facebook’s Prophet.

Source: Forecasting at Scale, Sean J. Taylor and Benjamin Letham, 2017

## Google’s TensorFlow, BigQuery, and Vertex AI

Amazon Forecast.

Comparison of Amazon time series forecasting algorithms. Source: AWS

## Azure Time Series Insights for IoT Data

Time series forecasting will become more automated in the future.




## Introduction to Environmental Data Science

## 16 Time Series Case Studies

## 16.1 Loney Meadow Flux Data

At the beginning of this chapter, we looked at an example of a time series in flux tower measurements of northern Sierra meadows, such as Loney Meadow, where during the 2016 season a flux tower was used to capture CO₂ flux and related micrometeorological data.

We also captured multispectral imagery using a drone, allowing us to create high-resolution (5-cm pixel) imagery of the meadow in false color (with NIR as red, red as green, green as blue), useful for showing healthy vegetation (as red) and water bodies (as black).

Figure 16.1: Loney Meadow False Color image from drone-mounted multispectral camera, 2017

Figure 16.2: Flux tower installed at Loney Meadow, 2016. Photo credit: Darren Blackburn

The flux tower data were collected at a high frequency for eddy covariance processing, in which 3D wind speed data are used to model the movement of atmospheric gases, including CO₂ flux driven by photosynthesis and respiration processes. Note the sign convention for CO₂ flux: positive flux is release to the atmosphere, which might happen when less photosynthesis is occurring but respiration and other CO₂ releases continue, while negative flux might occur when more photosynthesis is capturing more CO₂.

A spreadsheet of 30-minute summaries from 17 May to 6 September can be found in the igisci extdata folder as "meadows/LoneyMeadow_30minCO2fluxes_Geog604.xls", and includes data on photosynthetically active radiation (PAR), net radiation (Qnet), air temperature, relative humidity, soil temperature at 2 and 10 cm depth, wind direction, wind speed, rainfall, and soil volumetric water content (VWC). There’s clearly a lot more we can do with these data (see Blackburn, Oliphant, and Davis (2021)), but we’ll look at CO₂ flux and PAR using some of the same methods we’ve just explored.

First we’ll do a test read of the data (I’d encourage you to also look at the spreadsheet in Excel [but don’t change it] to see how it’s organized) …

… and see that, just as with the Bugac, Hungary, data, it has half-hour readings, and the second line of the file has measurement units. There are multiple ways of dealing with that, but this time we’ll capture the variable names and then add them back after removing the first two rows:
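The book’s workflow here is in R; as an illustration, here is a pandas sketch of the same idea, using a small in-memory CSV with hypothetical values to stand in for the spreadsheet:

```python
import io
import pandas as pd

# Toy stand-in for the spreadsheet: row 1 = variable names, row 2 = units.
raw = io.StringIO(
    "TIMESTAMP,CO2flux,PAR\n"
    "ts,umol m-2 s-1,umol m-2 s-1\n"
    "2016-06-20 00:00,0.5,0\n"
    "2016-06-20 00:30,0.4,0\n"
)

# Capture the variable names from the header, skip the units row,
# then reassign the names -- mirroring the R workflow described in the text.
names = pd.read_csv(raw, nrows=0).columns
raw.seek(0)
flux = pd.read_csv(raw, skiprows=2, names=names, parse_dates=["TIMESTAMP"])
print(flux.dtypes)
```

The same two-step pattern (read the header for names, re-read the body past the units row) works for the real `.xls` file with `pd.read_excel`.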

The time unit we’ll want for the time series is days, so we can look at the data over time: a group_by-summarize process by day will give us a generalized picture of changes over the collection period, reflecting phenological changes from first exposure after snowmelt, through the maximum growth period, to the major senescence period of late summer. We’ll look at a faceted graph built from a pivot_longer table.

Figure 16.3: Facet plot with free y scale of Loney flux tower parameters
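In R this is a group_by/summarize plus pivot_longer; a rough pandas equivalent (with made-up half-hourly values) looks like:

```python
import pandas as pd

# Hypothetical half-hourly readings for two days (stand-in for the Loney data).
idx = pd.date_range("2016-06-20", periods=96, freq="30min")
flux = pd.DataFrame(
    {"CO2flux": range(96), "PAR": [v * 2 for v in range(96)]}, index=idx
)

# group_by-summarize by day, then pivot to long form
# (pandas' melt is the analogue of tidyr's pivot_longer) for faceting.
daily = flux.resample("D").mean()
long = daily.reset_index().melt(id_vars="index", var_name="parameter", value_name="value")
print(long)
```

The long table has one row per day per parameter, which is the shape a faceted plot (e.g. with seaborn’s `relplot`, or ggplot’s `facet_wrap` in R) expects.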

Now we’ll build a time series for CO₂ for an 8-day period over the summer solstice, using the start time and frequency (there’s also a time stamp, but this was easier, since I knew the data had no gaps):

Figure 16.4: Loney CO₂ decomposition by day, 8-day period at summer solstice

Finally, we’ll create a couple of ensemble average plots from all of the data, with sd error bars similar to what we did for Manaus, and with cowplot used again to compare two ensemble plots:

Figure 16.5: Loney meadow CO₂ and PAR ensemble averages
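An ensemble average in this sense is the mean (and sd) for each time of day across all days; a minimal pandas sketch with synthetic data:

```python
import pandas as pd

# Hypothetical half-hourly series over ten days; the "ensemble average" is
# the mean and sd at each time of day across all days.
idx = pd.date_range("2016-06-01", periods=48 * 10, freq="30min")
co2 = pd.Series([(i % 48) * 0.1 for i in range(len(idx))], index=idx)

ensemble = co2.groupby(idx.time).agg(["mean", "std"])
print(ensemble.head())
```

The `std` column supplies the error bars; plotting two such tables side by side mirrors what cowplot does for the CO₂ and PAR panels.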

We can explore the Loney meadow data further, perhaps by comparing multiple ensemble averages or relating variables (like the example here):

Figure 16.6: Loney CO₂ flux vs Qnet

## Time Series Analysis: A Quick Introduction with Examples

We all know Coca-Cola, the conglomerate giant. The company makes millions of dollars each year and can seemingly anticipate every issue or new market trend. However, have you ever thought about how they stay on top of the game? It’s because their data science teams use time series analysis.

In this article, we’ll explore what this technique entails through real-world examples, and discuss the types of time series modeling you’re likely to encounter.

## What Is Time Series Analysis?

In a management context, we are typically interested in forecasting certain types of outcomes . Some examples are sales (at a total or a division level), customer satisfaction levels, the company’s ability to achieve target cost levels, or capability to deliver successful projects. In all these cases, we’ll use past data to come up with a prediction about the future. Time series analysis is part of predictive analysis, gathering data over consistent intervals of time (a.k.a. collecting time series data ). It’s an effective tool that allows us to quantify the impact of management decisions on future outcomes.

Let’s take Coca-Cola again and look at a time series analysis example through the lens of the company’s sales. Two quarters from now, their expected sales will be anywhere between 250,000 and 300,000 units. Historical sales indicate a strong relationship between unit sales and weather – otherwise known as correlation analysis . Based on that, it is likely that the numbers will be closer to 290,000 in the summer months. However, to achieve similar results in the winter quarter, the company will need some additional marketing investments.

The technique the Coca-Cola team can use to perform this type of future forecasting is precisely time series analysis. When applied, the model will provide a range of potential outcomes. In our example, the variable we are interested to predict is future sales volume. Therefore, the outcomes will vary depending on numerous factors, which may affect sales development throughout the year.

Let’s suppose the weather is 5% warmer than average, and Coca-Cola spends 5% more on marketing by investing in TV ads and promotional events. Then, based on historical data, we can reasonably expect that sales will be on the higher end of the range we indicated: 290,000 units. By changing the weather condition assumptions and running hypothesis testing on different marketing spend, the model would yield a separate time series analysis forecast. Typically, in practice, we will provide a range of estimates. For Coca-Cola, they might look something like this:

- 290,000 units in the best-case scenario
- 250,000 units in the worst-case scenario
- 270,000 units in the base-case scenario

## What Are the Types of Time Series Modelling Methods?

There are four modeling methods that analysts often use to support time series analysis:

- Naive
- Probabilistic
- Deterministic
- Hybrid

We’ll now explore each type and give you examples of how to apply them in a business setting.

## Naive Time Series Method

A naive forecast – or persistence forecast – is the simplest form of time series analysis where we take the value from the previous period as a reference:

\[\hat{x}_{t+1} = x_t\]

It does not require large amounts of data – one data point for each previous period is sufficient. Additionally, naive time series modeling can take seasonality and trend into account.

If you recall the Coca-Cola example, seasonality suggests that there is a cyclical pattern in the data that only appears periodically. Instead of taking the sales volume for the previous month, you can take last year’s value for the same month you’re trying to predict now: \(\hat{x}_t = x_{t-12}\) for monthly data.

In essence, you would be using last December’s numbers, instead of this November’s values, to forecast the sales for this upcoming December.

Another option is to consider the trend. For example, based on our historical analysis, we can see that last year’s September sales dropped 10% versus those made in August. We can use this information to forecast September of this year by applying the same 10% reduction to this year’s August sales.

The naive forecasting method is easy to understand and to use. However, the past is not always a good indicator of the future. That is why more sophisticated analytical techniques are often required to come up with more accurate sales forecasts.
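The three naive variants just described can be sketched in a few lines of Python, with hypothetical monthly sales figures:

```python
import pandas as pd

# Hypothetical monthly sales (thousands of units) for one year.
sales = pd.Series(
    [250, 255, 260, 270, 265, 280, 290, 285, 256, 260, 258, 262],
    index=pd.period_range("2023-01", periods=12, freq="M"),
)

# Naive forecast for Jan 2024: just the last observed value (Dec 2023).
naive = sales.iloc[-1]

# Seasonal naive for Jan 2024: the same month a year earlier (Jan 2023).
seasonal_naive = sales.loc["2023-01"]

# Trend-adjusted naive: apply last year's Aug -> Sep drop (about 10%)
# to this year's August figure when forecasting September.
drop = sales.loc["2023-09"] / sales.loc["2023-08"]
print(naive, seasonal_naive, round(drop, 2))
```

All three need only one reference value per forecast, which is exactly why the naive method is so cheap on data.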

## Probabilistic Time Series Method

Probabilistic modeling is also known as a Monte Carlo simulation. It’s named after the gambling hot spot in Monaco as it simulates real-life events with uncertain outcomes.

When faced with significant uncertainty, the Monte Carlo Simulation allows you to use a range of input values rather than just replacing the uncertain variable with a number. More precisely, these input values make use of the variable’s distribution function and help obtain a large number of possible realizations of the output variable.

To illustrate, here is an example of a Monte Carlo simulation for a revenue forecast:

There is a 90% chance that total revenue will be between X and Y, and a 61.2% chance that revenue will be higher than the forecast.

The advantage of Monte Carlo simulation is that it fully explores the probability distribution function of a certain variable. In our example, that’s the development of sales. As a result, we are able to study the probability that sales will fall within a certain bandwidth. Knowing this will help us manage risk.

However, to be successful, we need reliable data. If we do a Monte Carlo simulation and obtain no certainty that sales will fall within a specific bandwidth, then the probabilistic modeling has no added value.
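A small Monte Carlo sketch in Python (the distributions and target below are illustrative assumptions, not figures from the article):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical inputs: units sold ~ Normal(270k, 15k), price ~ Normal(2.0, 0.1).
n = 100_000
units = rng.normal(270_000, 15_000, n)
price = rng.normal(2.0, 0.1, n)
revenue = units * price

# Summarise the simulated distribution instead of a single point estimate.
low, high = np.percentile(revenue, [5, 95])   # 90% interval
p_above = (revenue > 540_000).mean()          # chance of beating a target
print(round(low), round(high), round(p_above, 2))
```

Each draw is one possible realization of the output variable; the percentiles give the bandwidth discussed above, and the exceedance probability is the kind of statement the revenue example makes.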

## Deterministic Time Series Method

The third method we’ll be looking at is the deterministic model – a more complex form of time series analysis that includes user-defined confidence intervals . As an example, let’s examine a historical trend and a forecast with a certain level of confidence for the year to come:

Suppose we want to see the sales forecast within a 95% range of certainty. Then, based on the graph, we can say with 95% certainty that we expect sales to be in the region from 240,000 to 280,000 units.

In other words, we provide an interval based on a deterministic trend, instead of making a definitive claim that we will make 265,000 sales. Thus, we have a better chance of preparing for the future because we know what the best- and worst-case scenarios look like.
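One simple way to produce such an interval is to fit a linear trend and widen it by the residual standard deviation; a sketch with hypothetical quarterly data (the ±1.96 sd band is an assumption standing in for a proper prediction interval):

```python
import numpy as np

# Hypothetical quarterly sales (thousands of units) with an upward trend.
t = np.arange(12)
sales = 240 + 2.5 * t + np.array([3, -4, 2, -1, 5, -3, 1, -2, 4, -5, 2, -2])

# Fit a deterministic linear trend, then form a ~95% band from residual sd.
slope, intercept = np.polyfit(t, sales, 1)
resid_sd = (sales - (intercept + slope * t)).std()
forecast = intercept + slope * 12          # next quarter's point forecast
low, high = forecast - 1.96 * resid_sd, forecast + 1.96 * resid_sd
print(round(low, 1), round(forecast, 1), round(high, 1))
```

The output is a range rather than a single number, which is the point of the deterministic method: we report what the best- and worst-case scenarios look like.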

## Hybrid Time Series Method

The last type of time series analysis we will discuss is called hybrid modeling. As the name suggests, it combines two other types of models: probabilistic and deterministic. The hybrid model considers the available data, then builds on it to simulate how uncertainties can affect the output.

For example, suppose we increase our marketing budget whilst having similar weather as last year. Then, we can expect a sales volume between 240,000 and 280,000 units with 90% certainty:

This is a multi-step process, so we don’t get these numbers right away. Instead, we first go with the deterministic approach to find a model which describes the data well. In most cases, this is some variation of an ARIMA model: $x_t = \alpha + \beta_1 x_{t-1} + \beta_2 x_{t-2} + \dots$

Then, we expand it to include a trend or a seasonal component based on some manual analysis. After we find the best fitting model, we conduct a Monte Carlo simulation to see how a random variable with the same statistical parameters would evolve over time. Of course, the simulation conducts this forecast thousands of times. At the end of it, we get a range of the most frequently predicted values to create our 90% confidence interval.
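A compact sketch of the hybrid idea: take an AR(1) (a special case of ARIMA) with assumed, already-estimated parameters from the deterministic step, then simulate many future paths and read off a 90% interval:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical AR(1) parameters, assumed already fitted in the
# deterministic step: x_t = alpha + beta * x_{t-1} + shock.
alpha, beta, resid_sd = 27.0, 0.9, 5.0
last_value = 265.0

n_paths, horizon = 10_000, 4
paths = np.empty((n_paths, horizon))
x = np.full(n_paths, last_value)
for h in range(horizon):
    x = alpha + beta * x + rng.normal(0, resid_sd, n_paths)  # AR(1) step + shock
    paths[:, h] = x

# 90% interval for the final quarter, as in the hybrid example above.
low, high = np.percentile(paths[:, -1], [5, 95])
print(round(low), round(high))
```

The range of most frequently simulated values gives the confidence interval the hybrid method reports.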

Overall, we can say that hybrid modeling is the most popular approach as it combines two types of methods to give us the highest percentage of certainty possible.

## Time Series Analysis: What’s Next?

Time series analysis brings exponential value to business development. Analysts utilize it to help companies estimate their revenue, predict trends, and future-proof their products.

As this type of analysis is part of business analytics, having it in your toolbox means you will hold a vital position in a company, with heaps of career growth opportunities.


Randy Rosseel

Business Analytics expert

Randy Rosseel is a Six Sigma Master Black Belt, and a CFA charter holder with long-standing executive career at world-class organizations. Apart from leading global change projects, Randy also enjoys sharing his expertise with aspiring professionals, which inspired him to create the Introduction to Business Analytics course in collaboration with 365 Data Science.


- Open access
- Published: 30 April 2022

## A tutorial on the case time series design for small-area analysis

- Antonio Gasparrini 1 , 2

BMC Medical Research Methodology, volume 22, Article number: 129 (2022)


The increased availability of data on health outcomes and risk factors collected at fine geographical resolution is one of the main reasons for the rising popularity of epidemiological analyses conducted at small-area level. However, this rich data setting poses important methodological issues related to modelling complexities and computational demands, as well as the linkage and harmonisation of data collected at different geographical levels.

This tutorial illustrates the extension of the case time series design, originally proposed for individual-level analyses of short-term associations with time-varying exposures, for applications using data aggregated over small geographical areas. The case time series design embeds the longitudinal structure of time series data within the self-matched framework of case-only methods, offering a flexible and highly adaptable analytical tool. The methodology is well suited for modelling complex temporal relationships, and it provides an efficient computational scheme for large datasets including longitudinal measurements collected at a fine geographical level.

The application of the case time series for small-area analyses is demonstrated using a real-data case study to assess the mortality risks associated with high temperature in the summers of 2006 and 2013 in London, UK. The example makes use of information on individual deaths, temperature, and socio-economic characteristics collected at different geographical levels. The tutorial describes the various steps of the analysis, namely the definition of the case time series structure and the linkage of the data, as well as the estimation of the risk associations and the assessment of vulnerability differences. R code and data are made available to fully reproduce the results and the graphical descriptions.

## Conclusions

The extension of the case time series for small-area analysis offers a valuable analytical tool that combines modelling flexibility and computational efficiency. The increasing availability of data collected at fine geographical scales provides opportunities for its application to address a wide range of epidemiological questions.


## Introduction

The field of epidemiology has experienced profound changes in the last decade, with the fast development of data science methods and technologies. Modern monitoring devices, for instance remote sensing instruments or mobile wearables [ 1 ], provide real-time measurements of a variety of risk factors with unparalleled coverage, quantity, and precision. Similarly, advancements in linkage procedures [ 2 ], together with improved computational capabilities, storage, and accessibility [ 3 ], offer epidemiologists rich and high-quality data to investigate health risks.

The availability of data on health outcomes and exposures with increased resolution is the main driver of the rising popularity of epidemiological analyses at small-area level [ 4 ]. Originally developed in spatial analysis, small-area methods have then been extended for spatio-temporal data to analyse observations collected longitudinally [ 5 , 6 ]. Similarly to traditional studies based on aggregated data, these investigations often make use of administratively collected information, usually more available to researchers and less sensitive to confidentiality restrictions. Nonetheless, these studies provide a richer data framework, merging information gathered from various sources at multiple geographical levels. The aggregation of information at finer spatial scales makes small-area studies less prone to ecological fallacies affecting traditional investigations using large-scale aggregations, and the availability of more detailed data can inform about more complex epidemiological mechanisms. Still, this context poses non-trivial practical and methodological problems, for instance high computational requirements related to the size of the data, and modelling issues due to their complexity [ 7 ].

The case time series (CTS) design is a methodology recently proposed for epidemiological analyses of short-term risks associated with time-varying exposures [ 8 ]. The design combines the modelling flexibility of time series models with the self-matched structure of case-only methods [ 9 ], providing a suitable framework for complex longitudinal data. Originally illustrated in individual-level analyses, the CTS design can be easily adapted for studies using data aggregated over small areas. This extension makes available a flexible methodology applicable for a wide range of research topics.

In this contribution, we provide a tutorial on the application of the CTS design for the analysis of small-area data. The tutorial describes several steps, including data gathering and linkage, modelling of epidemiological associations, and definition of effect summaries and outputs, illustrated through a case study of the mortality risks associated with non-optimal temperature in London, United Kingdom. The example is fully reproducible, with data and code in the R software available in a GitHub repository.

## The case time series data structure

The real-data example is based on a dataset published by the Office for National Statistics (ONS), reporting the deaths that occurred in London in the summer period (June to August) of two years, 2006 and 2013. The data are aggregated by day of occurrence across 983 middle layer super output areas (MSOAs), small census-based aggregations with approximately 7,200 residents each. The dataset includes the death counts for both the age group 0–74 and 75 and older, which are combined in total numbers of daily deaths for this analysis. The paragraph below describes how these data must be formatted in a CTS structure.

The CTS design is based on the definition of cases , representing observational units for which data are longitudinally collected. The design involves the definition of case-specific series of continuous sequential observations. In the applications of the original article presenting the methodology [ 8 ], cases were represented by subjects, but the design can be extended by defining the observational units as small geographical areas. In this example, the process implies the aggregation of the mortality data in MSOA-specific daily series of mortality counts, including days with no death. It is worth noting that the design is similarly applicable with different types of health outcomes, for instance continuous variables obtained by averaging measurements within each area.

The mortality series derived for five of the 983 MSOAs in the summer of 2006 are displayed in Fig. 1 (top panel). Each MSOA is characterised by no more than one or a few daily deaths, with most of the days totalling none. The data can be then aggregated further by summing across all MSOAs, thus defining a single daily mortality series for the whole area of London, shown in Fig. 1 (bottom panel). These fully aggregated data will be used later to compare the results of the CTS methodology with a traditional time series analysis.

Daily series of deaths for all causes in the period June–August 2006 in five random MSOAs (top panel) and aggregated across all the 983 MSOAs of London (bottom panel)

The definition of the geographical units depends both on the research question and practical considerations. The areas should be representative of exposure and health risk processes, in addition to being consistent with the resolution of the available data. Choosing finely aggregated areas can better capture underlying associations in the presence of small-scale dependencies, but would pointlessly inflate the computational demand in the presence of low-resolution exposure data or risk mechanisms acting at wider spatial scales.

## Linking high-resolution exposure data

In this setting, one of the important advantages of the CTS design is the use of exposure measurements assigned to small areas (each of them representing a case), rather than averaging their values across large regions. The same applies to potential co-exposures or time-varying factors acting as confounders, which can be collected at the same small-area scale. Researchers nowadays have access to a variety of resources to retrieve high-resolution measurements of a multitude of risk factors across large populations. These resources include clinical and health databases, census and administrative data, consumer and marketing company data, and measurement networks, among others [ 3 ].

Environmental studies, for instance, can now rely on climate re-analysis and atmospheric emission-dispersion models that offer full coverage and high-resolution measures for a number of environmental stressors. In this case study, we extracted temperature data from the HadUK-Grid product developed by the Met Office [ 10 ]. This database includes daily values of minimum and maximum temperature on a 1 × 1 km grid across the United Kingdom. These data were averaged to derive mean daily temperature values and linked with the mortality series.

The linkage process consists of spatially aligning the two sources of information, namely the polygons defining the 983 MSOAs and the intersecting grid cells with corresponding temperature data. Figure 2 displays the two spatial structures, with the average summer temperature in the two years in each of the grid cells overlaid with the MSOA boundaries. The maps show the spatial differences in temperature within the areas of London, with higher values in more densely urbanised zones.

Average summer temperature (°C) in 2006 (left) and 2013 (right) in a 1 × 1 km grid of the London area, with the boundaries of the 983 MSOAs superimposed

The alignment procedure is carried out using GIS techniques to compute the area-weighted average of the cells intersecting each MSOA, with weights proportional to the intersection areas. This step creates MSOA-specific daily series of temperatures that can be linked with the mortality data. The results are illustrated in Fig. 3 , which shows the temperature distribution in three consecutive days in July 2006, demonstrating the differential temporal changes of temperature across areas of the city. The same linkage process can be applied to other exposures or confounders, each potentially defined over different spatial boundaries.

Mean temperature in three consecutive days (13–15 July 2006) across the 983 MSOAs of London
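The area-weighted averaging step reduces, for each MSOA and day, to a dot product of grid-cell temperatures with weights proportional to the intersection areas (the paper performs the full GIS overlay in R; the numbers below are hypothetical):

```python
import numpy as np

# Hypothetical: one MSOA intersecting three temperature grid cells.
cell_temps = np.array([18.2, 18.9, 19.4])      # daily mean temperature per cell
intersect_areas = np.array([0.6, 0.3, 0.1])    # km^2 of overlap with the MSOA

# Area-weighted average: weights proportional to the intersection areas.
weights = intersect_areas / intersect_areas.sum()
msoa_temp = float(np.dot(weights, cell_temps))
print(round(msoa_temp, 2))
```

Repeating this for every MSOA and day yields the MSOA-specific daily temperature series described above.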

An important advantage of the CTS design is the possibility to use data disaggregated at smaller scales, thus capturing differential changes in exposure across space and time, compared to traditional analyses using a single aggregated series that rely entirely on temporal contrasts. Even in the absence of measurement errors in both disaggregated and aggregated analysis, the former is therefore expected to result in more precise estimates. In this specific example, though, the gain in precision can be limited, as Fig. 3 indicates that the temporal variation seems to dominate compared to spatial differences. The two components of variation can be quantified by the average between-day and between-MSOA standard deviations in temperature, respectively. Results confirm the visual impression, with a temporal deviation of 3.0 °C compared to 0.4 °C of the spatial one.
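The two components of variation can be computed along these lines (toy data; the paper reports 3.0 °C temporal versus 0.4 °C spatial deviation for the real dataset):

```python
import pandas as pd

# Hypothetical long-format data: one temperature per MSOA per day.
df = pd.DataFrame({
    "msoa": ["A", "A", "A", "B", "B", "B"],
    "day":  [1, 2, 3, 1, 2, 3],
    "temp": [15.0, 20.0, 25.0, 15.4, 20.4, 25.4],
})

# Between-day (temporal) variation: sd over days of the daily MSOA-mean series.
temporal_sd = df.groupby("day")["temp"].mean().std()
# Between-MSOA (spatial) variation: average over days of the within-day sd.
spatial_sd = df.groupby("day")["temp"].std().mean()
print(round(temporal_sd, 2), round(spatial_sd, 2))
```

When, as here, the temporal component dominates, the precision gain from disaggregation is expected to be limited, which is exactly the point made in the text.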

## Main analysis

The CTS design allows the application of flexible modelling techniques developed for time series analysis, but without requiring the aggregation of the data in a single series. The modelling framework is based on regression models with the following general form:

\[g\big(E(y_{it})\big) = \xi_{i(k)} + f(x_{i,t-\ell};\, \ell=0,\dots,L) + s(t) + h(z_{it}) \tag{1}\]

The model in Eq. 1 has a classical time series form, with outcomes \({y}_{it}\) collected along time \(t\) modelled through multiple regression terms [ 11 ]. Specific functions can be used to define the association with the exposure of interest \(x\) , potentially including delayed effects through the inclusion of lagged values \(x_{t-\ell}\) along lag period \(\ell=0,\dots,L\) . Other terms can be represented by functions modelling the underlying temporal trends using multiple transformations of \(t\) , and potential time-varying predictors \(z\) . The main difference from traditional time series models is in the presence of multiple series for cases represented by the index \(i\) . In particular, cases define matched risk sets , with intercepts \({\xi }_{i}\) expressing baseline risks varying across observational units. The risk sets can be stratified further by defining different intercepts \({\xi }_{i(k)}\) for each time stratum \(k\) , thus modelling within-case variations in risk. The regression is efficiently performed using fixed-effects estimators available for different outcome families [ 12 , 13 ].

In our illustrative example, \({y}_{it}\) represents daily death counts for each of the \(i=1,\dots ,983\) MSOAs. The risk association with temperature \(x\) is modelled through a distributed lag non-linear model (DLNM) with a cross-basis term [ 14 ]. This bi-dimensional parametrisation is obtained using natural cubic splines defining the exposure–response (two knots at the 50th and 90th temperature percentiles) and lag-response (one knot at lag 1 over lag period 0–3) relationships. The other terms are two functions of time \(t\) , specifically natural cubic splines of day of the year with 3 degrees of freedom and an interaction with year indicators to model differential seasonal effects in 2006 and 2013, plus indicators for day of the week. Risk sets are defined by MSOA/year/month strata indicators \({\xi }_{i(k)}\) , allowing within-MSOA variation in baseline risks in addition to common trends captured by the temporal terms in Eq. 1 above. The model is fitted using a fixed-effects regression model with a quasi-Poisson family to account for overdispersion.

Results are displayed in Fig. 4 , which shows the overall cumulative exposure–response curve (dark gold) expressing the temperature-mortality association. The curve indicates an increase in mortality risks above 16 °C, the optimal value corresponding to the minimum mortality temperature (MMT). The left tail of the curve suggests an increased risk also for relatively cold temperatures experienced during the summer period.

Exposure–response relationships representing the temperature-mortality risk cumulated within lag 0–3 estimated using the CTS model on data disaggregated by MSOAs (dark gold) and from the standard time series model with the aggregated data (green)

The CTS model can be compared to a standard time series analysis performed by aggregating the data in single mortality (Fig. 1 , bottom panel) and temperature series, the latter obtained by averaging the daily values across MSOAs. The model is specified using the same terms and parameterisation as above. The estimated relationship is added to Fig. 4 (green curve). The aggregated analysis reports the association over a narrower range, as local extreme temperatures are averaged out (see Fig. 3 ), and indicates slightly lower risks, in particular failing to capture the residual cold effects. As anticipated, there seems to be little gain in statistical precision from the CTS model, given that in this example the temperature variation is mainly driven by day-to-day variation more than by spatial differences.

## Assessing differentials in vulnerability

The analysis can be extended by introducing additional terms in the model of Eq. 1 , for instance to control for confounders or investigate effect modifications. Associations with time-varying factors can be specified in the usual way through main and interaction terms included directly in the model. In contrast, the conditional framework of fixed-effects regression removes effects associated with time-invariant factors, which are absorbed in the intercepts \({\xi }_{i(k)}\) [ 12 ]. This ensures that potential confounding from such terms is controlled for by design, but has the drawback that their main effects cannot be estimated. Still, interactions with time-invariant terms can be specified to model differential health risks across small areas. In our case study, we apply this method to investigate vulnerability to extreme temperature depending on socio-economic status, represented by the index of multiple deprivation (IMD).

As mentioned above, small-area studies can rely on information collected at different geographical levels, but this requires all the variables to be re-aligned over the same spatial structure, as shown for mortality and temperature above. In this example, IMD scores (defined from 0 as the most deprived to 1 as the least deprived) were originally collected at the smallest census level, the lower super-output areas (LSOAs). Therefore, this information is first re-aligned by averaging the values by MSOA.

The model is then extended by specifying a linear interaction between the cross-basis of temperature and the IMD score. The results are shown in Fig. 5, which displays the overall cumulative exposure–response curves predicted for low (in blue) and high (red) IMD scores, with values set at the inter-quartile range. The graph suggests little evidence of differential risks by deprivation, as confirmed by the likelihood ratio test (accounting for overdispersion), which returns a p-value of 0.73. It is worth noting, however, that this lack of evidence can be explained by the limited statistical power due to the short study period (two summers).

Fig. 5 Exposure–response relationships representing the temperature-mortality risk cumulated within lag 0–3, predicted for less (blue) and more (red) deprived areas, defined by the inter-quartile range of the IMD score

## Discussion

This contribution presents a tutorial on the extension of the CTS design for the analysis of small-area data. The tutorial illustrates the analytical steps using a real-data example, and it discusses practical issues, for instance linkage procedures and data analysis, as well as methodological aspects. The case study uses publicly available datasets, with data and R code documented and made available in a GitHub repository. The example is therefore fully reproducible and can be easily adapted to other settings for epidemiological analyses using small-area data.

The main feature of the CTS design is the embedment of flexible time series methods within a self-matched framework based on multiple observational units. This setting offers strong control for both time-invariant and time-varying confounding as well as the possibility to model complex temporal relationships using finely disaggregated data. These aspects are demonstrated in the case study illustrated above. Specifically, the stratification of the baseline risk removes structural differences between MSOAs, while allowing control for area-specific temporal variations on top of common trends modelled through interactions between splines terms and year indicators. Likewise, the time series structure lends itself neatly to the application of distributed lag linear and non-linear models to define complex exposure-lag-response relationships. Finally, the design can improve the characterisation of the association of interest by providing both spatial and temporal contrasts. This is demonstrated in the case study example, where we show how the case time series framework can account for local exposure differences, for instance due to heat island effects, and allows investigating geographical variations in vulnerability.

The advantages of small-area studies, when compared to more traditional approaches based on largely aggregated data, are obvious. First, measurements of health outcomes and risk factors at a small scale are expected to represent risk association mechanisms more appropriately and to provide better control for confounding, thus reducing potential biases that affect ecological studies [7]. Even in the absence of classical measurement error, whereby the aggregated exposure value is a valid proxy of the true population average, small-area studies can reduce the Berkson-type error and therefore increase the statistical power [15]. As discussed in the example above, the gain in precision is proportional to the geographical differences in exposure across the study area relative to temporal variations.

The CTS design can be compared to other approaches previously used for epidemiological analyses using small-area data. Traditionally, spatial and spatio-temporal analyses are performed using Bayesian hierarchical models [6]. These methods provide a powerful framework that accounts for spatial correlations and allows geographically-varying risks, but they present high computational demands that pose limits in the analysis of large datasets and/or complex associations. In contrast, the CTS design offers a flexible and computationally efficient scheme to analyse temporal dependencies while entirely removing potential biases linked to between-area comparisons. As an alternative approach, other studies have applied two-stage designs developed in multi-city investigations to small-area analyses [16, 17]. However, this method encounters estimation issues in the presence of sparse information due to finely disaggregated data, and it would, for instance, be unfeasible for the analysis of MSOAs in the illustrative example (see Fig. 1). Conversely, the CTS design sets no limit to data disaggregation, being applicable with the same structure to individual-level analyses. This aspect is shared by the case-crossover design, a popular methodology previously proposed in small-area analysis [18, 19]. In fact, the CTS methodology can replicate exactly the matching structure of the case-crossover scheme [20], while allowing a more flexible control for temporal trends and modelling of temporal relationships, as demonstrated in the illustrative case study.

Some limitations must be acknowledged. First, similarly to traditional time series methods, the CTS design is applicable only to studying short-term risk associations with time-varying exposures, and cannot be used to assess long-term health effects. Likewise, its application in small-area studies is still based on aggregated data and essentially retains an ecological nature. However, the extreme stratification can prevent some of the associated biases, and it is worth noting that the CTS methodology can be seamlessly applied to individual-level data, when these are available. Finally, its time series structure is ideal for modelling complex temporal dependencies and trends, but presents limitations in capturing spatially correlated and varying risks.

In conclusion, the CTS methodology represents a valuable analytical tool for the analysis of small-area data. The framework is highly adaptable to various data settings, and it offers flexible features for modelling complex temporal patterns while controlling for time-varying factors and trends. The availability of data collected at small-area level provides opportunities for its application in a variety of epidemiological investigations of risk associations.

## Availability of data and materials

The data, software and code for replicating the analysis and the complete set of results are made fully available in a GitHub repository ( https://github.com/gasparrini/CTS-smallarea ). The original data, at the time of writing, were publicly available from online resources. Specifically, the number of daily deaths by MSOAs of London in the summers of 2006 and 2013 was published by ONS ( link ); the geographical boundaries of the MSOAs and the lookup table between LSOAs and MSOAs (for the 2011 census) were available at the Open Geography Portal of ONS ( link ) and the Open Data portal of GOV.UK ( link ); the gridded daily temperature data in the HadUK-Grid database from the Met Office were extracted from the Centre for Environmental Data Analysis (CEDA) archive ( link ); the IMD scores by LSOAs (for the year 2015) were provided at GOV.UK ( link ). Additional information on the linkage procedure with the original resources to obtain the final data, as well as on the use of the R scripts, is provided in the GitHub repository.

## Abbreviations

- CTS: Case time series
- DLNM: Distributed lag non-linear model
- MSOA: Middle layer super output area
- LSOA: Lower layer super output area
- IMD: Index of multiple deprivation
- ONS: Office for National Statistics

## References

1. Reis S, Seto E, Northcross A, Quinn NWT, Convertino M, Jones RL, et al. Integrating modelling and smart sensors for environmental and human health. Environ Model Softw. 2015;74:238–46.
2. Harron KL, Doidge JC, Knight HE, Gilbert RE, Goldstein H, Cromwell DA, et al. A guide to evaluating linkage quality for the analysis of linked data. Int J Epidemiol. 2017;46(5):1699–710.
3. Hodgson S, Fecht D, Gulliver J, Iyathooray Daby H, Piel FB, Yip F, et al. Availability, access, analysis and dissemination of small-area data. Int J Epidemiol. 2020;49(Suppl 1):i4–14.
4. Fecht D, Cockings S, Hodgson S, Piel FB, Martin D, Waller LA. Advances in mapping population and demographic characteristics at small-area levels. Int J Epidemiol. 2020;49(Suppl 1):i15–25.
5. Meliker JR, Sloan CD. Spatio-temporal epidemiology: principles and opportunities. Spat Spatio-Temporal Epidemiol. 2011;2(1):1–9.
6. Blangiardo M, Cameletti M, Baio G, Rue H. Spatial and spatio-temporal models with R-INLA. Spat Spatio-Temporal Epidemiol. 2013;4:33–49.
7. Piel FB, Fecht D, Hodgson S, Blangiardo M, Toledano M, Hansell AL, et al. Small-area methods for investigation of environment and health. Int J Epidemiol. 2020;49(2):686–99.
8. Gasparrini A. The case time series design. Epidemiology. 2021;32(6):829–37.
9. Mostofsky E, Coull BA, Mittleman MA. Analysis of observational self-matched data to examine acute triggers of outcome events with abrupt onset. Epidemiology. 2018;29(6):804–16.
10. Met Office, Hollis D, McCarthy M, Kendon M, Legg T, Simpson I. HadUK-Grid gridded climate observations on a 1km grid over the UK, v1.0.1.0 (1862–2018). Centre for Environmental Data Analysis, 2019.
11. Bhaskaran K, Gasparrini A, Hajat S, Smeeth L, Armstrong B. Time series regression studies in environmental epidemiology. Int J Epidemiol. 2013;42(4):1187–95.
12. Gunasekara FI, Richardson K, Carter K, Blakely T. Fixed effects analysis of repeated measures data. Int J Epidemiol. 2013;43(1):264–9.
13. Allison PD. Fixed Effects Regression Models. US: SAGE Publications Inc; 2009.
14. Gasparrini A, Armstrong B, Kenward MG. Distributed lag non-linear models. Stat Med. 2010;29(21):2224–34.
15. Armstrong BG. Effect of measurement error on epidemiological studies of environmental and occupational exposures. Occup Environ Med. 1998;55(10):651.
16. Benmarhnia T, Kihal-Talantikite W, Ragettli MS, Deguen S. Small-area spatiotemporal analysis of heatwave impacts on elderly mortality in Paris: a cluster analysis approach. Sci Total Environ. 2017;592:288–94.
17. Zafeiratou S, Analitis A, Founda D, Giannakopoulos C, Varotsos KV, Sismanidis P, et al. Spatial variability in the effect of high ambient temperature on mortality: an analysis at municipality level within the Greater Athens area. Int J Environ Res Public Health. 2019;16(19):3689.
18. Bennett JE, Blangiardo M, Fecht D, Elliott P, Ezzati M. Vulnerability to the mortality effects of warm temperature in the districts of England and Wales. Nat Clim Chang. 2014;4(4):269.
19. Stafoggia M, Bellander T. Short-term effects of air pollutants on daily mortality in the Stockholm county – a spatiotemporal analysis. Environ Res. 2020;188:109854.
20. Armstrong BG, Gasparrini A, Tobias A. Conditional Poisson models: a flexible alternative to conditional logistic case cross-over analysis. BMC Med Res Methodol. 2014;14(1):122.


## Acknowledgements

Not applicable.

Funding: This work was supported by the Medical Research Council-UK (Grant ID: MR/R013349/1).

## Author information

Authors and Affiliations

Department of Public Health, Environments and Society, London School of Hygiene and Tropical Medicine (LSHTM), 15-17 Tavistock Place, London, WC1H 9SH, UK

Antonio Gasparrini

Centre for Statistical Methodology, London School of Hygiene & Tropical Medicine (LSHTM), Keppel Street, London, WC1E 7HT, UK


## Contributions

AG is the sole author of this article. The author(s) read and approved the final manuscript.

## Corresponding author

Correspondence to Antonio Gasparrini.

## Ethics declarations


The authors declare that they have no competing interests.

## Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


## About this article

Cite this article.

Gasparrini, A. A tutorial on the case time series design for small-area analysis. BMC Med Res Methodol 22, 129 (2022). https://doi.org/10.1186/s12874-022-01612-x


Received : 12 February 2022

Accepted : 12 April 2022

Published : 30 April 2022



- Time series
- Distributed lag models
- Study design
- Temperature

## BMC Medical Research Methodology

ISSN: 1471-2288


## Time series analysis of COVID-19 cases

World Journal of Engineering

ISSN : 1708-5284

Article publication date: 11 January 2021

Issue publication date: 22 February 2022

## Purpose

This study analyses the prevalent coronavirus disease (COVID-19) epidemic using machine learning algorithms. The data set used is API data provided by the Johns Hopkins University resource centre, and a Web crawler was used to gather the data features: confirmed, recovered and death cases. Because no COVID-19 drug was available at the time of writing, the outbreak was not expected to end in the near future, so the case numbers in this study are date specific. The analysis demonstrated in this paper focuses on the monthly analysis of confirmed, recovered and death cases, which helps identify the trend and seasonality in the data. The purpose of this study is to explore the essential concepts of time series algorithms and use those concepts to perform time series analysis on the infected cases worldwide, forecast the spread of the virus over the next two weeks and thus aid health-care services. The low mean absolute percentage error obtained over the forecasting interval validates the model's credibility.

## Design/methodology/approach

In this study, time series forecasting of the outbreak was done using the auto-regressive integrated moving average (ARIMA) model and the seasonal auto-regressive integrated moving average with exogenous regressors (SARIMAX) model, optimized to achieve better results.

## Findings

The ARIMA and SARIMAX time series forecasting models proved effective, producing close approximations. The forecasting results indicate an increasing trend and a steep rise in COVID-19 cases in many regions and countries, which might face some of their worst days unless measures are taken quickly to curb the spread of the disease. The pattern of the rise of the spread of the virus in such countries closely mimics that of countries hit early by COVID-19, such as Italy and the USA. Further, the numbers obtained from the models are date specific, so the most recent execution of the model would return more recent results. The future scope of the study involves analysis with other models, such as long short-term memory, and comparison with time series models.

## Originality/value

A time series is a time-stamped data set in which each data point corresponds to a set of observations made at a particular time instance. This work is novel in addressing COVID-19 with the help of time series analysis, with the ARIMA and SARIMAX forecasting models producing close approximations.

- Time series analysis
- Forecasting

Bhangu, K.S. , Sandhu, J.K. and Sapra, L. (2022), "Time series analysis of COVID-19 cases", World Journal of Engineering , Vol. 19 No. 1, pp. 40-48. https://doi.org/10.1108/WJE-09-2020-0431

Emerald Publishing Limited

Copyright © 2020, Emerald Publishing Limited



## The Ultimate Guide to Time-Series Analysis (With Examples and Applications)

## What is time-series analysis?

Time-series analysis is a statistical technique that deals with time-series data, or trend analysis. It involves the identification of patterns, trends, seasonality, and irregularities in the data observed over different time periods. This method is particularly useful for understanding the underlying structure and pattern of the data.

When performing time-series analysis, you will use a mathematical set of tools to look into time-series data and learn not only what happened but also when and why it happened.

While both time-series analysis and time-series forecasting are powerful tools that developers can harness to glean insights from data over time, they each have specific strengths, limitations, and applications.

Time-series analysis isn't about predicting the future; instead, it's about understanding the past. It allows developers to decompose data into its constituent parts—trend, seasonality, and residual components. This can help identify any anomalies or shifts in the pattern over time.

Key methodologies used in time-series analysis include moving averages, exponential smoothing, and decomposition methods. Methods such as Autoregressive Integrated Moving Average (ARIMA) models also fall under this category—but more on that later.
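As a quick illustration of the first two methodologies, here is a minimal pandas sketch, with made-up daily step counts, showing a moving average and exponential smoothing side by side:

```python
import pandas as pd

# A week of made-up daily step counts
steps = pd.Series([8000, 9500, 7200, 11000, 10400, 6800, 12100],
                  index=pd.date_range("2024-01-01", periods=7, freq="D"))

# Moving average: smooths short-term fluctuations over a 3-day window
sma = steps.rolling(window=3).mean()

# Exponential smoothing: recent days get exponentially more weight
ewm = steps.ewm(alpha=0.5).mean()

print(sma.iloc[-1], ewm.iloc[-1])
```

Both produce a smoothed series; the moving average treats every day in its window equally, while exponential smoothing reacts faster to recent changes.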

On the other hand, time-series forecasting uses historical data to make predictions about future events. The objective here is to build a model that captures the underlying patterns and structures in the time-series data to predict future values of the series.

## Use Cases for Time-Series Analysis

The “time” element in time-series data means that the data is ordered by time. In this type of data, each entry is preceded and followed by another and has a timestamp that determines the order of the data. Check out our earlier blog post to learn more and see examples of time-series data .

A typical example of time-series data is stock prices or a stock market index. However, even if you’re not into financial and algorithmic trading, you probably interact daily with time-series data.

When you drive your car through a digital toll or your smartphone tells you to walk more or that it will rain, time-series data is part of these interactions. If you're working with observability, monitoring different systems to track their performance and ensure they run smoothly, you're also working with time-series data. And if you have a website where you track customer or user interactions (event data), guess what? You're also a time-series analysis use case.

To illustrate this in more detail, let’s look at the example of health apps—we'll refer back to this example throughout this blog post.

## A Real-World Example of Time-Series Analysis

If you open a health app on your phone, you will see all sorts of categories, from step count to noise level or heart rate. By clicking on “show all data” in any of these categories, you will get an almost endless scroll (depending on when you bought the phone) of step counts, which were timestamped when the data was sampled.

This is the raw data of the step count time series. Remember, this is just one of many parameters sampled by your smartphone or smartwatch. While many parameters don’t mean much to most people (yes, I’m looking at you, heart rate variability), when combined with other data, these parameters can give you estimations on overall quantifiers, such as cardio fitness.

To achieve this, you need to connect the time-series data into one large dataset with two identifying variables—time and type of measurement. This is called panel data . Separating it by type gives you multiple time series, while picking one particular point in time gives you a snapshot of everything about your health at a specific moment, like what was happening at 7:45 a.m.
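In pandas terms, that panel-data reshaping might look like the following sketch (the measurement names, timestamps, and values are invented for illustration):

```python
import pandas as pd

# Hypothetical health-app samples: one row per (timestamp, measurement) pair
panel = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 07:45", "2024-01-01 07:45",
                            "2024-01-01 08:00", "2024-01-01 08:00"]),
    "measure": ["steps", "heart_rate", "steps", "heart_rate"],
    "value": [120, 72, 340, 95],
})

# Separating by type gives one time series per measurement...
by_type = panel.pivot(index="time", columns="measure", values="value")

# ...while picking one point in time gives a snapshot of everything at 7:45 a.m.
snapshot = by_type.loc["2024-01-01 07:45"]
print(snapshot)
```

Each column of `by_type` is a separate time series; each row is the kind of moment-in-time snapshot described above.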

## Why Should You Use Time-Series Analysis?

Now that you’re more familiar with time-series data, you may wonder what to do with it and why you should care. So far, we’ve been mostly just reading off data—how many steps did I take yesterday? Is my heart rate okay?

But time-series analysis can help us answer more complex or future-related questions, such as forecasting. When did I stop walking and catch the bus yesterday? Is exercise making my heart stronger?

To answer these, we need more than just reading the step counter at 7:45 a.m.; we need time-series analysis. Time-series analysis happens when we consider part of or the entire time series to see the “bigger picture.” We can do this manually in straightforward cases: for example, by looking at the graph that shows the days when you took more than 10,000 steps this month.

But if you wanted to know how often this occurs or on which days, that would be significantly more tedious to do by hand. Very quickly, we bump into problems that are too complex to tackle without using a computer, and once we have opened that door, a seemingly endless stream of opportunities emerges. We can analyze everything, from ourselves to our business, and make them far more efficient and productive than ever.
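Counting the over-10,000-step days, and seeing which weekdays they fall on, takes only a few lines once the data is in a time-indexed structure (the numbers below are invented for illustration):

```python
import pandas as pd

# A month of invented daily step counts that drift upward
days = pd.date_range("2024-03-01", periods=31, freq="D")
steps = pd.Series([9000 + 250 * i for i in range(31)], index=days)

# How often did the count exceed 10,000 steps?
over = steps[steps > 10000]
print(len(over))

# ...and on which days of the week did it happen?
by_weekday = over.groupby(over.index.day_name()).size()
print(by_weekday)
```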

## Time-series components

To correctly analyze time-series data, we need to look at the four components of a time series:

- Trend : this is a long-term movement of the time series, such as the decreasing average heart rate of workouts as a person gets fitter.
- Seasonality : regular periodic occurrences within a time interval smaller than a year (e.g., higher step count in spring and autumn because it’s not too cold or too hot for long walks).
- Cyclicity : repeated fluctuations around the trend that are longer in duration than irregularities but shorter than what would constitute a trend. In our walking example, this would be a one-week sightseeing holiday every four to five months.
- Irregularity : short-term irregular fluctuations or noise, such as a gap in the sampling of the pedometer or an active team-building day during the workweek.

Let’s go back to our health app example. One thing you may see immediately, just by looking at a time-series analysis chart, is whether your stats are trending upward or downward. That indicates whether your stats are generally improving or not. By ignoring the short-term variations, it's easier to see if the values rise or decline within a given time range. This is the first of the four components of a time series—trend.

## Limitations of Time-Series Analysis

If you’re performing time-series analysis, it can be helpful to decompose it into these four elements to explain results and make predictions. Trend and seasonality are deterministic, whereas cyclicity and irregularities are not.

Therefore, you first need to eliminate random events to isolate what can be understood and predicted. No technique is perfect, and to capture the full power of time-series analysis without misusing it and obtaining incorrect results and conclusions, it's essential to address and understand its limitations.

Generalizations from a single subject or a small sample must be made very carefully (e.g., finding the time of day a customer is most likely to go running requires analyzing the run frequencies of many customers). Predicting future values may be impossible if the data hasn't been prepared well, and even then, new irregularities can always appear in the future.

Forecasting is usually only stable when you consider the near future. Remember how inaccurate the weather forecast can be when you look it up 10 days in advance. Time-series analysis will never allow you to make exact predictions, only probability distributions of specific values. For example, you can never be sure that a health app user will take more than 10,000 steps on Sunday, only that it is highly likely that they will or that you're 95% certain they will.

## Types of Time-Series Analysis

Time to dive deeper into how time-series analysis can extract information from time-series data. To do this, let’s divide time-series analysis into five distinct types.

## Exploratory analysis

An exploratory analysis is helpful when you want to describe what you see and explain why you see it in a given time series. It essentially entails decomposing the data into trend, seasonality, cyclicity, and irregularities.

Once the series is decomposed, we can explain what each component represents in the real world and even, perhaps, what caused it. This is not as easy as it may seem and often involves spectral decomposition to find any specific frequencies of recurrences and autocorrelation analysis to see if current values depend on past values.
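The autocorrelation calculation itself is simple enough to sketch by hand in NumPy. This toy version checks a pure 12-step cycle, for which the lag-12 correlation should be strongly positive and the lag-6 correlation negative:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation between a series and itself shifted by `lag`."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# A pure 12-step cycle: the series repeats exactly every 12 samples
t = np.arange(120)
series = np.sin(2 * np.pi * t / 12)
print(autocorr(series, 12), autocorr(series, 6))
```

Real libraries (e.g., the ACF utilities in statsmodels) add confidence intervals and handle edge effects, but the core computation is this lagged dot product.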

## Curve fitting

Since time series is a discrete set, you can always tell exactly how many data points it contains. But what if you want to know the value of your time-series parameter at a point in time that is not covered by your data?

To answer this question, we have to supplement our data with a continuous set—a curve. You can do this in several ways, including interpolation and regression. The former is an exact match for parts of the given time series and is mostly useful for estimating missing data points. On the other hand, the latter is a “best-fit” curve, where you have to make an educated guess about the form of the function to be fitted (e.g., linear) and then vary the parameters until your best-fit criteria are satisfied.

What constitutes a “best-fit” situation depends on the desired outcome and the particular problem. Using regression analysis, you also obtain the best-fit function parameters, which can have real-world meaning, for example, post-run heart rate recovery as an exponential decay fit parameter. In regression, we get a function that describes the best fit to our data even beyond the last record, opening the door to extrapolation predictions.
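The contrast between the two approaches can be sketched in a few lines of NumPy, with hypothetical heart-rate samples and a deliberately missing reading:

```python
import numpy as np

# Invented heart-rate samples with a missing reading at t = 2
t = np.array([0.0, 1.0, 3.0, 4.0])
hr = np.array([150.0, 140.0, 120.0, 110.0])

# Interpolation: an exact match through the known points, used to fill the gap
hr_at_2 = np.interp(2.0, t, hr)

# Regression: a best-fit line whose parameters have real-world meaning
# (the slope is the recovery rate), and which can be extrapolated
slope, intercept = np.polyfit(t, hr, deg=1)
hr_at_6 = slope * 6.0 + intercept

print(hr_at_2, slope, hr_at_6)
```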

## Forecasting

Statistical inference is the process of generalization from sample to whole. It can be done over time in time-series data, giving way to future predictions or forecasting: from extrapolating regression models to more advanced techniques using stochastic simulations and machine learning. If you want to know more, check out our article about time-series forecasting .

## Classification and segmentation

Time-series classification is the process of identifying the categories or classes of an outcome variable based on time-series data. In other words, it's about associating each time-series data with one label or class.

For instance, you might use time-series classification to categorize server performance into 'Normal' or 'Abnormal' based on CPU usage data collected over time. The goal here is to create a model that can accurately predict the class of new, unseen time-series data.

Classification models commonly used include decision trees, nearest neighbor classifiers, and deep learning models. These models can handle the temporal dependencies present in time-series data, making them ideal for this task.

Time-series segmentation , on the other hand, involves breaking down a time series into a series of segments, each representing a specific event or state. The objective is to simplify the time-series data by representing it as a sequence of more manageable segments.

For example, in analyzing website traffic data, you might segment the data into periods of 'High,' 'Medium,' and 'Low' activity. This segmentation can provide simpler, more interpretable insights into your data.

Segmentation methods can be either top-down, where the entire series is divided into segments, or bottom-up, where individual data points are merged into segments. Each method has its strengths and weaknesses, and the choice depends on the nature of your data and your specific requirements.
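A minimal top-down sketch of the traffic example above, using simple fixed thresholds (all numbers and cut-offs invented for illustration):

```python
import pandas as pd

# Eight days of invented daily website-traffic counts
traffic = pd.Series([30, 45, 220, 480, 510, 190, 60, 25],
                    index=pd.date_range("2024-01-01", periods=8, freq="D"))

# Top-down segmentation into 'Low' / 'Medium' / 'High' via fixed thresholds
labels = pd.cut(traffic, bins=[0, 100, 400, float("inf")],
                labels=["Low", "Medium", "High"])

# Collapse consecutive identical labels into segments (state + start date)
segments = labels[labels != labels.shift()]
print(segments)
```

More sophisticated segmenters pick the boundaries from the data itself (e.g., change-point detection) rather than from fixed thresholds, but the output has the same shape: a sequence of labeled segments.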

As you may have already guessed, problems rarely require just one type of analysis. Still, it is crucial to understand the various types to appreciate each aspect of the problem correctly and formulate a good strategy for addressing it.

## Visualization and Examples—Run, Overlapping, and Separated Charts

There are many ways to visualize a time series and certain types of its analysis . A run chart is the most common choice for simple time series with one parameter, essentially just data points connected by lines.

However, there are usually several parameters you would like to visualize at once. You have two options in this case: overlapping or separated charts. Overlapping charts display multiple series on a single pane, whereas separated charts show individual series in smaller, stacked, and aligned charts, as seen below.

Let’s take a look at three different real-world examples illustrating what we’ve learned so far. To keep things simple and best demonstrate the analysis types, the following examples will be single-parameter series visualized by run charts.

## Electricity demand in Australia

Stepping away from our health theme, let's explore the time series of Australian monthly electricity demand in the figures below. Visually, it is immediately apparent there is a positive trend, as one would expect with population growth and technological advancement.

Second, there is a pronounced seasonality to the data, as demand in winter will not be the same as in summer. An autocorrelation analysis can help us understand this better. Fundamentally, this checks the correlation between two points separated by a time delay or lag.

As we can see in the autocorrelation function (ACF) graph, the highest correlation comes with a delay of exactly 12 months (implying a yearly seasonality), and the lowest with a half-year separation since electricity consumption is highly dependent on the time of year (air-conditioning, daylight hours, etc.).

Since the underlying data has a trend (it isn’t stationary), as the lag increases, the ACF dies down since the two points are further and further apart, with the positive trend separating them more each year. These conclusions can become increasingly non-trivial when data spans less intuitive variables.
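The lag-correlation idea is easy to sketch in code. Below is a minimal illustration in Python on synthetic monthly data (made up for the example, not the actual Australian series) with a positive trend and yearly seasonality: the lag-12 autocorrelation comes out higher than the lag-6 one, just as in the ACF graph described above.

```python
import numpy as np

def acf(x, lag):
    """Sample autocorrelation of x at a given lag (mean-removed)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Synthetic monthly demand: positive trend + yearly seasonality + noise
rng = np.random.default_rng(0)
months = np.arange(120)  # ten years of monthly observations
series = 0.5 * months + 10 * np.cos(2 * np.pi * months / 12) + rng.normal(0, 1, 120)

# As with the electricity data, correlation is strongest at a 12-month lag
print(acf(series, 12) > acf(series, 6))  # True
```

The same comparison across many lags is what an ACF plot shows all at once.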

## Boston Marathon winning times

Back to our health theme from the more exploratory previous example, let’s look at the winning times of the Boston Marathon. The aim here is different: we don’t particularly care why the winning times are such. We want to know whether they have been trending and where we can expect them to go.

To do this, we need to fit a curve and assess its predictions. But how do we know which curve to choose? There is no universal answer; however, even visually, you can eliminate a lot of options. In the figure below, we show four different choices of fitted curves:

1. A linear fit

f(t) = at + b

2. A piecewise linear fit, which is just several linear fit segments spliced together

3. An exponential fit

f(t) = ae^(bt) + c

4. A cubic spline fit that’s like a piecewise linear fit where the segments are cubic polynomials that have to join smoothly

f(t) = at^3 + bt^2 + ct + d

Looking at the graph, it’s clear that the linear and exponential options aren’t a good fit. It boils down to the cubic spline and the piecewise linear fits. In fact, both are useful, although for different questions.

The cubic spline is visually the best historical fit, but in the future (purple section), it trends upward in an intuitively unrealistic way, with the piecewise linear actually producing a far more reasonable prediction. Therefore, one has to be very careful when using good historical fits for prediction, which is why understanding the underlying data is extremely important when choosing forecasting models.
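This fit-versus-forecast tension is easy to reproduce. Here is a hedged sketch with NumPy on synthetic "winning times" (not the real marathon data): a higher-order fit always matches history at least as well as a straight line, yet the two can disagree sharply once extrapolated beyond the data.

```python
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1950, 2020)
x = years - years.mean()  # center the predictor to keep the fit well-conditioned
# Synthetic winning times: a decline that levels off, plus noise
times = 150 + 80 * np.exp(-(years - 1950) / 25) + rng.normal(0, 2, years.size)

linear = np.polyfit(x, times, deg=1)   # straight-line fit
cubic = np.polyfit(x, times, deg=3)    # cubic polynomial fit

def sse(coeffs):
    """Sum of squared errors of a polynomial fit over the historical data."""
    return float(np.sum((np.polyval(coeffs, x) - times) ** 2))

# The cubic nests the line, so it always fits history at least as well...
print(sse(cubic) <= sse(linear))  # True

# ...but 30 years past the data, the two fits can tell very different stories
x_future = 2050 - years.mean()
print(np.polyval(linear, x_future), np.polyval(cubic, x_future))
```

The better historical fit is not automatically the better forecaster, which is exactly the caution raised above.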

## Electrocardiogram analysis

As a final example to illustrate the classification and segmentation types of problems, take a look at the following graph. Imagine wanting to train a machine to recognize certain heart irregularities from electrocardiogram (ECG) readings.

First, this is a segmentation problem, as you need to split each ECG time series into sequences corresponding to one heartbeat cycle. The dashed red lines in the diagram are the splittings of these cycles. Having done this on both regular and irregular readings, this becomes a classification problem—the algorithm should now analyze other ECG readouts and search for patterns corresponding to either a regular or irregular heartbeat.
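As a toy sketch of the segmentation step (synthetic spikes standing in for real ECG data): detect the R-peak-like spikes and cut the series at each one, producing the per-beat segments that a classifier would then label as regular or irregular.

```python
import numpy as np

# Toy "ECG": flat baseline with a sharp spike (an R-peak stand-in) every 40 samples
signal = np.zeros(400)
signal[20::40] = 5.0  # 10 beats

# Segmentation sketch: cut the series at every detected spike
peaks = np.where(signal > 2.5)[0]
segments = np.split(signal, peaks)  # one piece per heartbeat cycle, plus a lead-in piece

print(len(peaks), len(segments))  # 10 11
```

Real ECG segmentation uses more robust peak detection, but the shape of the problem—find cycle boundaries, then classify each segment—is the same.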

## Challenges in Handling Time-Series Data

Although time-series data offers valuable insights, it also presents unique challenges that need to be addressed during analysis.

## Dealing with missing values

Time-series data often contains missing or incomplete values, which can adversely affect the accuracy of analysis and modeling. To handle missing values, various techniques like interpolation or imputation can be applied, depending on the nature of the data and the extent of missingness.
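For instance, with pandas, a gap in a daily series can be filled along a straight line between its neighbors, or by carrying the last observation forward (the toy values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Daily series with two missing observations
s = pd.Series([10.0, np.nan, 14.0, 16.0, np.nan, 20.0],
              index=pd.date_range("2022-01-01", periods=6, freq="D"))

filled_linear = s.interpolate(method="linear")  # fill each gap along a straight line
filled_ffill = s.ffill()                        # or carry the last value forward

print(filled_linear.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

Which technique is appropriate depends on how the data behaves between observations; linear interpolation assumes a smooth change, while forward fill assumes the value held steady.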

## Overcoming noise in time-series data

Noise refers to random fluctuations or irregularities in time-series data, which can obscure the underlying patterns and trends. Filtering techniques, such as moving averages or wavelet transforms, can help reduce noise and extract the essential information from the data.
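A centered moving average is the simplest of these filters. On a synthetic noisy sine wave (invented for the example), smoothing brings the series measurably closer to the underlying pattern:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
t = np.arange(200)
clean = np.sin(2 * np.pi * t / 50)                  # the underlying pattern
noisy = pd.Series(clean + rng.normal(0, 0.5, 200))  # pattern + random noise

# Centered 11-point moving average damps the noise but keeps the slow wave
smoothed = noisy.rolling(window=11, center=True, min_periods=1).mean()

# The smoothed series tracks the clean signal more closely than the raw one
err_raw = float(np.mean((noisy - clean) ** 2))
err_smooth = float(np.mean((smoothed - clean) ** 2))
print(err_smooth < err_raw)  # True
```

The window size is a trade-off: wider windows remove more noise but also flatten genuine short-term features.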

## Learn More About Time-Series Analysis

This was just a glimpse of what time-series analysis offers. By now, you should know that time-series data is ubiquitous, and to measure the constant change around you for added efficiency and productivity (whether in life or business), you need to start analyzing it.

I hope this article has piqued your interest, but nothing compares to trying it out yourself. And for that, you need a robust database to handle massive time-series datasets. Try Timescale, a modern, cloud-native relational database platform for time series that will give you reliability, fast queries, and the ability to scale infinitely to better understand what is changing, why, and when.

Continue your time-series journey:

- What Is Time-Series Data? (With Examples)
- What Is Time-Series Forecasting?
- Time-Series Database: An Explainer
- A Guide on Data Analysis on PostgreSQL
- What Is a Time-Series Graph With Examples
- What Is a Time-Series Plot, and How Can You Create One
- Get Started With TimescaleDB With Our Tutorials
- How to Write Better Queries for Time-Series Data Analysis With Custom SQL Functions
- Speeding Up Data Analysis With TimescaleDB and PostgreSQL
Clay Grewcoe


## The Case Time Series Design

Affiliations.

- 1 Department of Public Health Environments and Society, London School of Hygiene & Tropical Medicine, London, United Kingdom.
- 2 Centre for Statistical Methodology, London School of Hygiene & Tropical Medicine, London, United Kingdom.
- PMID: 34432723
- PMCID: PMC7611753
- DOI: 10.1097/EDE.0000000000001410

Modern data linkage and technologies provide a way to reconstruct detailed longitudinal profiles of health outcomes and predictors at the individual or small-area level. Although these rich data resources offer the possibility to address epidemiologic questions that could not be feasibly examined using traditional studies, they require innovative analytical approaches. Here we present a new study design, called case time series, for epidemiologic investigations of transient health risks associated with time-varying exposures. This design combines a longitudinal structure and flexible control of time-varying confounders, typical of aggregated time series, with individual-level analysis and control-by-design of time-invariant between-subject differences, typical of self-matched methods such as case-crossover and self-controlled case series. The modeling framework is highly adaptable to various outcome and exposure definitions, and it is based on efficient estimation and computational methods that make it suitable for the analysis of highly informative longitudinal data resources. We assess the methodology in a simulation study that demonstrates its validity under defined assumptions in a wide range of data settings. We then illustrate the design in real-data examples: a first case study replicates an analysis on influenza infections and the risk of myocardial infarction using linked clinical datasets, while a second case study assesses the association between environmental exposures and respiratory symptoms using real-time measurements from a smartphone study. The case time series design represents a general and flexible tool, applicable in different epidemiologic areas for investigating transient associations with environmental factors, clinical conditions, or medications.

Copyright © 2021 The Author(s). Published by Wolters Kluwer Health, Inc.

## Publication types

- Research Support, Non-U.S. Gov't
- Computer Simulation
- Environmental Exposure* / analysis
- Research Design*

## Grants and funding

- MR/R013349/1/MRC_/Medical Research Council/United Kingdom

## Advanced Epidemiological Analysis

Chapter 3: Time series / case-crossover studies

We’ll start by exploring common characteristics in time series data for environmental epidemiology. In the first half of the class, we’re focusing on a very specific type of study—one that leverages large-scale vital statistics data, collected at a regular time scale (e.g., daily), combined with large-scale measurements of a climate-related exposure, with the goal of estimating the typical relationship between the level of the exposure and risk of a health outcome. For example, we may have daily measurements of particulate matter pollution for a city, measured daily at a set of Environmental Protection Agency (EPA) monitors. We want to investigate how risk of cardiovascular mortality changes in the city from day to day in association with these pollution levels. If we have daily counts of the number of cardiovascular deaths in the city, we can create a statistical model that fits the exposure-response association between particulate matter concentration and daily risk of cardiovascular mortality. These statistical models—and the type of data used to fit them—will be the focus of the first part of this course.

## 3.1 Readings

The required readings for this chapter are:

- Bhaskaran et al. ( 2013 ) Provides an overview of time series regression in environmental epidemiology.
- Vicedo-Cabrera, Sera, and Gasparrini ( 2019 ) Provides a tutorial of all the steps for a projecting of health impacts of temperature extremes under climate change. One of the steps is to fit the exposure-response association using present-day data (the section on “Estimation of Exposure-Response Associations” in the paper). In this chapter, we will go into details on that step, and that section of the paper is the only required reading for this chapter. Later in the class, we’ll look at other steps covered in this paper. Supplemental material for this paper is available to download by clicking http://links.lww.com/EDE/B504 . You will need the data in this supplement for the exercises for class.

The following are supplemental readings (i.e., not required, but may be of interest) associated with the material in this chapter:

- B. Armstrong et al. ( 2012 ) Commentary that provides context on how epidemiological research on temperature and health can help inform climate change policy.
- Dominici and Peng ( 2008c ) Overview of study designs for studying climate-related exposures (air pollution in this case) and human health. Chapter in a book that is available online through the CSU library.
- B. Armstrong ( 2006 ) Covers similar material as Bhaskaran et al. ( 2013 ) , but with more focus on the statistical modeling framework
- Gasparrini and Armstrong ( 2010 ) Describes some of the advances made to time series study designs and statistical analysis, specifically in the context of temperature
- Basu, Dominici, and Samet ( 2005 ) Compares time series and case-crossover study designs in the context of exploring temperature and health. Includes a nice illustration of different referent periods, including time-stratified.
- B. G. Armstrong, Gasparrini, and Tobias ( 2014 ) This paper describes different data structures for case-crossover data, as well as how conditional Poisson regression can be used in some cases to fit a statistical model to these data. Supplemental material for this paper is available at https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-122#Sec13 .
- Imai et al. ( 2015 ) Typically, the time series study design covered in this chapter is used to study non-communicable health outcomes. This paper discusses opportunities and limitations in applying a similar framework for infectious disease.
- Dominici and Peng ( 2008b ) Heavier on statistics. Describes some of the statistical challenges of working with time series data for air pollution epidemiology. Chapter in a book that is available online through the CSU library.
- Lu and Zeger ( 2007 ) Heavier on statistics. This paper shows how, under conditions often common for environmental epidemiology studies, case-crossover and time series methods are equivalent.
- Gasparrini ( 2014 ) Heavier on statistics. This provides the statistical framework for the distributed lag model for environmental epidemiology time series studies.
- Dunn and Smyth ( 2018 ) Introduction to statistical models, moving into regression models and generalized linear models. Chapter in a book that is available online through the CSU library.
- James et al. ( 2013 ) General overview of linear regression, with an R coding “lab” at the end to provide coding examples. Covers model fit, continuous, binary, and categorical covariates, and interaction terms. Chapter in a book that is available online through the CSU library.

## 3.2 Time series and case-crossover study designs

In the first half of this course, we’ll take a deep look at how researchers use time series studies to investigate the links between environmental exposures and health risk. Let’s start by exploring the study design for this type of study, as well as a closely linked design, that of case-crossover studies .

It’s important to clarify the vocabulary we’re using here. We’ll use the terms time series study and case-crossover study to refer specifically to a type of study common for studying air pollution and other climate-related exposures. However, both terms have broader definitions, particularly in fields outside environmental epidemiology. For example, a time series study more generally refers to a study where data is available for the same unit (e.g., a city) for multiple time points, typically at regularly-spaced times (e.g., daily). A variety of statistical methods have been developed to apply to gain insight from this type of data, some of which are currently rarely used in the specific fields of air pollution and climate epidemiology that we’ll explore here. For example, there are methods to address autocorrelation over time in measurements—that is, that measurements taken at closer time points are likely somewhat correlated—that we won’t cover here and that you won’t see applied often in environmental epidemiology studies, but that might be the focus of a “Time Series” course in a statistics or economics department.

In air pollution and climate epidemiology, time series studies typically begin with study data collected for an aggregated area (e.g., city, county, ZIP code) and with a daily resolution. These data are usually secondary data, originally collected by the government or other organizations through vital statistics or other medical records (for the health data) and networks of monitors for the exposure data. In the next section of this chapter, we’ll explore common characteristics of these data. These data are used in a time series study to investigate how changes in the daily level of the exposure are associated with risk of a health outcome, focusing on the short-term period. For example, a study might investigate how risk of respiratory hospitalization in a city changes in relation to the concentration of particulate matter during the week or two following exposure. The study period for these studies is often very long (often a decade or longer), and while single-community time series studies can be conducted, many time series studies for environmental epidemiology now include a large set of communities of national or international scope.

The study design essentially compares a community with itself at different time points—asking if health risk tends to be higher on days when exposure is higher. By comparing the community to itself, the design removes many challenges that would come up when comparing one community to another (e.g., is respiratory hospitalization risk higher in city A than city B because particulate matter concentrations are typically higher in city A?). Communities differ in demographics and other factors that influence health risk, and it can be hard to properly control for these when exploring the role of environmental exposures. By comparison, demographics tend to change slowly over time (at least, compared to a daily scale) within a community.

One limitation, however, is that the study design is often best-suited to study acute effects, but more limited in studying chronic health effects. This is tied to the design and traditional ways of statistically modeling the resulting data. Since a community is compared with itself, the design removes challenges in comparing across communities, but it introduces new ones in comparing across time. Both environmental exposures and rates of health outcomes can have strong patterns over time, both across the year (e.g., mortality rates tend to follow a strong seasonal pattern, with higher rates in winter) and across longer periods (e.g., over the decade or longer of a study period). These patterns must be addressed through the statistical model fit to the time series data, and they make it hard to disentangle chronic effects of the exposure from unrelated temporal patterns in the exposure and outcome. As a result, most time series studies focus on the short-term (or acute) association between exposure and outcome, typically looking at a period of at most about a month following exposure.

The term case-crossover study is a bit more specific than time series study , although there has been a strong movement in environmental epidemiology towards applying a specific version of the design, and so in this field the term often now implies this more specific version of the design. Broadly, a case-crossover study is one in which the conditions at the time of a health outcome are compared to conditions at other times that should otherwise (i.e., outside of the exposure of interest) be comparable. A case-crossover study could, for example, investigate the association between weather and car accidents by taking a set of car accidents and investigating how weather during the car accident compared to weather in the same location the week before.

One choice in a case-crossover study design is how to select the control time periods. Early studies tended to use a simple method for this—for example, taking the day before, or a day the week before, or some similar period somewhat close to the day of the outcome. As researchers applied the study design to large sets of data (e.g., all deaths in a community over multiple years), they noticed that some choices could create bias in estimates. As a result, most environmental epidemiology case-crossover studies now use a time-stratified approach to selecting control days. This selects a set of control days that typically include days both before and after the day of the health outcome, and are a defined set of days within a “stratum” that should be comparable in terms of temporal trends. For daily-resolved data, this stratum typically will include all the days within a month, year, and day of week. For example, one stratum of comparable days might be all the Mondays in January of 2010. These strata are created throughout the study period, and then days are only compared to other days within their stratum (although, fortunately, there are ways you can apply a single statistical model to fit all the data for this approach rather than having to fit the model stratum by stratum over many years).
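The stratum-building step can be sketched in a few lines. As an illustration in Python/pandas (analyses in this literature are usually done in R), each day is keyed by its year, month, and day of week:

```python
import pandas as pd

# Daily dates spanning two months of a hypothetical study period
dates = pd.date_range("2010-01-01", "2010-02-28", freq="D")
df = pd.DataFrame({"date": dates})

# Time-stratified strata: all days sharing year, month, and day of week
df["stratum"] = (df["date"].dt.year.astype(str) + "-"
                 + df["date"].dt.month.astype(str).str.zfill(2) + "-"
                 + df["date"].dt.day_name())

# e.g., all the Mondays in January 2010 form one stratum
jan_mondays = df.loc[df["stratum"] == "2010-01-Monday", "date"]
print(jan_mondays.dt.day.tolist())  # [4, 11, 18, 25]
```

In a case-crossover analysis, each case day would then be compared only against the other days in its stratum.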

When this is applied to data at an aggregated level (e.g., city, county, or ZIP code), it is in spirit very similar to a time series study design, in that you are comparing a community to itself at different time points. The main difference is that a time series study uses statistical modeling to control for potential confounding from temporal patterns, while a case-crossover study of this type instead controls for this potential confounding by only comparing days that should be “comparable” in terms of temporal trends, for example, comparing a day only to other days in the same month, year, and day of week. You will often hear that case-crossover studies therefore address potential confounding for temporal patterns “by design” rather than “statistically” (as in time series studies). However, in practice (and as we’ll explore in this class), in environmental epidemiology, case-crossover studies often are applied to aggregated community-level data, rather than individual-level data, with exposure assumed to be the same for everyone in the community on a given day. Under these assumptions, time series and case-crossover studies have been determined to be essentially equivalent (and, in fact, can use the same study data), only with slightly different terms used to control for temporal patterns in the statistical model fit to the data. Several interesting papers have been written to explore differences and similarities in these two study designs as applied in environmental epidemiology (Basu, Dominici, and Samet 2005; B. G. Armstrong, Gasparrini, and Tobias 2014; Lu and Zeger 2007).

These types of study designs in practice use similar datasets. In earlier presentations of the case-crossover design, these data would be set up a bit differently for statistical modeling. More recent work, however, has clarified how they can be modeled similarly to when using a time series study design, allowing the data to be set up in a similar way ( B. G. Armstrong, Gasparrini, and Tobias 2014 ) .

Several excellent commentaries or reviews are available that provide more details on these two study designs and how they have been used specifically to investigate the relationship between climate-related exposures and health (Bhaskaran et al. 2013; B. Armstrong 2006; Gasparrini and Armstrong 2010). Further, these designs are just two tools in a wider collection of study designs that can be used to explore the health effects of climate-related exposures. Dominici and Peng ( 2008c ) provides a nice overview of this broader set of designs.

## 3.3 Time series data

Let’s explore the type of dataset that can be used for these time series–style studies in environmental epidemiology. In the examples in this chapter, we’ll be using data that comes as part of the Supplemental Material in one of this chapter’s required readings, ( Vicedo-Cabrera, Sera, and Gasparrini 2019 ) . Follow the link for the supplement for this article and then look for the file “lndn_obs.csv.” This is the file we’ll use as the example data in this chapter.

These data are saved in a csv format (that is, a plain text file, with commas used as the delimiter), and so they can be read into R using the read_csv function from the readr package (part of the tidyverse). For example, you can use the following code to read in these data, assuming you have saved them in a “data” subdirectory of your current working directory:
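The chapter’s R code (using readr’s read_csv) is not reproduced in this copy. As an equivalent sketch in Python/pandas, reading a tiny inline stand-in for the file—the column names here follow the chapter’s description of the obs data and are otherwise hypothetical:

```python
import io
import pandas as pd

# Tiny stand-in for data/lndn_obs.csv (columns modeled on the chapter's description)
csv_text = """date,all,tmean
1993-01-01,220,3.3
1993-01-02,196,5.1
"""

obs = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
# For the real file, saved in a "data" subdirectory, this would be:
# obs = pd.read_csv("data/lndn_obs.csv", parse_dates=["date"])

print(obs["date"].dtype)  # datetime64[ns]
```

Parsing the date column into a genuine date type up front makes all the date arithmetic in the following sections possible.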

This example dataset shows many characteristics that are common for datasets for time series studies in environmental epidemiology. Time series data are essentially a sequence of data points collected repeatedly at a certain time interval (e.g., day, week, month, etc.). General characteristics of time series data for environmental epidemiology studies are:

- Observations are given at an aggregated level. For example, instead of individual observations for each person in London, the obs data give counts of deaths throughout London. The level of aggregation is often determined by geopolitical boundaries, for example, counties or ZIP codes in the US.
- Observations are given at regularly spaced time steps over a period. In the obs dataset, the time interval is day. Typically, values will be provided continuously over that time period, with observations for each time interval. Occasionally, however, the time series data may only be available for particular seasons (e.g., only warm season dates for an ozone study), or there may be some missing data on either the exposure or health outcome over the course of the study period.
- Observations are available at the same time step (e.g., daily) for (1) the health outcome, (2) the environmental exposure of interest, and (3) potential time-varying confounders. In the obs dataset, the health outcome is mortality (from all causes; sometimes, the health outcome will focus on a specific cause of mortality or other health outcomes such as hospitalizations or emergency room visits). Counts are given for everyone in the city for each day ( all column), as well as for specific age categories ( all_0_64 for all deaths among those up to 64 years old, and so on). The exposure of interest in the obs dataset is temperature, and three metrics of this are included ( tmean , tmin , and tmax ). Day of the week is one time-varying factor that could be a confounder, or at least help explain variation in the outcome (mortality). This is included through the dow variable in the obs data. Sometimes, you will also see a marker for holidays included as a potential time-varying confounder, or other exposure variables (temperature is a potential confounder, for example, when investigating the relationship between air pollution and mortality risk).
- Multiple metrics of an exposure and / or multiple health outcome counts may be included for each time step. In the obs example, three metrics of temperature are included (minimum daily temperature, maximum daily temperature, and mean daily temperature). Several counts of mortality are included, providing information for specific age categories in the population. The different metrics of exposure will typically be fit in separate models, either as a sensitivity analysis or to explore how exposure measurement affects epidemiological results. If different health outcome counts are available, these can be modeled in separate statistical models to determine an exposure-response function for each outcome.

## 3.4 Exploratory data analysis

When working with time series data, it is helpful to start with some exploratory data analysis. This type of time series data will often be secondary data—data that someone else collected previously and that you are re-using. Exploratory data analysis is particularly important with secondary data like this. For primary data that you collected yourself, following protocols that you designed yourself, you will often be very familiar with the structure of the data and any quirks in it by the time you are ready to fit a statistical model. With secondary data, however, you will typically start with much less familiarity about the data, how it was collected, and any potential issues with it, like missing data and outliers.

Exploratory data analysis can help you become familiar with your data. You can use summaries and plots to explore the parameters of the data, and also to identify trends and patterns that may be useful in designing an appropriate statistical model. For example, you can explore how values of the health outcome are distributed, which can help you determine what type of regression model would be appropriate, and to see if there are potential confounders that have regular relationships with both the health outcome and the exposure of interest. You can see how many observations have missing data for the outcome, the exposure, or confounders of interest, and you can see if there are any measurements that look unusual. This can help in identifying quirks in how the data were recorded—for example, in some cases ground-based weather monitors use -99 or -999 to represent missing values, definitely something you want to catch and clean up in your data (replacing them with R’s NA for missing values) before fitting a statistical model!
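That sentinel clean-up step looks like this in a pandas sketch (the chapter itself works in R, where you would replace the sentinels with NA; the values below are invented):

```python
import numpy as np
import pandas as pd

# Some ground-based weather archives use sentinels like -99 or -999 for "missing"
tmean = pd.Series([3.3, -999.0, 5.1, -99.0, 4.0])

# Replace the sentinels with a genuine missing-value marker before modeling
cleaned = tmean.replace([-99.0, -999.0], np.nan)

print(int(cleaned.isna().sum()))  # 2
```

Left uncleaned, those sentinels would silently distort every mean, trend, and model fit downstream.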

The following applied exercise will take you through some of the questions you might want to answer through this type of exploratory analysis. In general, the tidyverse suite of R packages has loads of tools for exploring and visualizing data in R. The lubridate package from the tidyverse , for example, is an excellent tool for working with date-time data in R, and time series data will typically have at least one column with the timestamp of the observation (e.g., the date for daily data). You may find it worthwhile to explore this package some more. There is a helpful chapter in Wickham and Grolemund ( 2016 ) , https://r4ds.had.co.nz/dates-and-times.html , as well as a cheatsheet at https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_lubridate.pdf . For visualizations, if you are still learning techniques in R, two books you may find useful are Healy ( 2018 ) (available online at https://socviz.co/ ) and Chang ( 2018 ) (available online at http://www.cookbook-r.com/Graphs/ ).

Applied: Exploring time series data

Read the example time series data into R and explore it to answer the following questions:

- What is the study period for the example obs dataset? (i.e., what dates / years are covered by the time series data?)
- Are there any missing dates (i.e., dates with nothing recorded) within this time period? Are there any recorded dates where health outcome measurements are missing? Any where exposure measurements are missing?
- Are there seasonal trends in the exposure? In the outcome?
- Are there long-term trends in the exposure? In the outcome?
- Is the outcome associated with day of week? Is the exposure associated with day of week?

Based on your exploratory analysis in this section, talk about the potential for confounding when these data are analyzed to estimate the association between daily temperature and city-wide mortality. Is confounding by seasonal trends a concern? How about confounding by long-term trends in exposure and mortality? How about confounding by day of week?

Applied exercise: Example code

In the obs dataset, the date of each observation is included in a column called date . The data type of this column is “Date”—you can check this by using the class function from base R:

Since this column has a “Date” data type, you can run some mathematical function calls on it. For example, you can use the min function from base R to get the earliest date in the dataset and the max function to get the latest.

You can also run the range function to get both the earliest and latest dates with a single call:

This provides the range of the study period for these data. One interesting point is that it’s not a round set of years—instead, the data ends during the summer of the last study year. This doesn’t present a big problem, but is certainly something to keep in mind if you’re trying to calculate yearly averages of any values for the dataset. If you’re getting the average of something that varies by season (e.g., temperature), it could be slightly weighted by the months that are included versus excluded in the partial final year of the dataset. Similarly, if you group by year and then count totals by year, the number will be smaller for the last year, since only part of the year’s included. For example, if you wanted to count the total deaths in each year of the study period, it will look like they go down a lot the last year, when really it’s only because only about half of the last year is included in the study period:
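The R calls described above (class, min, max, range, and the by-year counts) were stripped from this copy. A pandas equivalent on a synthetic stand-in for the obs data (a study period ending mid-year, with a made-up constant death count to keep the arithmetic obvious):

```python
import pandas as pd

# Stand-in for obs: one row per day over a period that ends mid-year
obs = pd.DataFrame({"date": pd.date_range("1993-01-01", "1994-06-30", freq="D")})
obs["all"] = 2  # pretend two deaths per day

# Earliest and latest dates (the analogue of R's min / max / range on the date column)
print(obs["date"].min(), obs["date"].max())

# Total deaths by year: the partial final year looks artificially low
yearly = obs.groupby(obs["date"].dt.year)["all"].sum()
print(yearly.to_dict())  # {1993: 730, 1994: 362}
```

The drop in the final year here is purely an artifact of the half-year of coverage, which is exactly the caution raised in the text.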

- Are there any missing dates within this time period? Are there any recorded dates where health outcome measurements are missing? Any where exposure measurements are missing?

There are a few things you should check to answer this question. First (and easiest), you can check to see if there are any NA values within any of the observations in the dataset. This helps answer the second and third parts of the question. The summary function will provide a summary of the values in each column of the dataset, including the count of missing values ( NA s) if there are any:
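The R summary call itself is not shown in this copy; a pandas analogue on a toy data frame (values invented) reports the missing-value count per column alongside summary statistics:

```python
import numpy as np
import pandas as pd

# Toy stand-in with one missing value in each column
obs = pd.DataFrame({
    "tmean": [3.3, np.nan, 5.1, 4.0],
    "all":   [220.0, 196.0, np.nan, 201.0],
})

# Count of missing values per column (part of what R's summary reports)
print(obs.isna().sum().to_dict())  # {'tmean': 1, 'all': 1}

# Column-by-column summary statistics, analogous to R's summary(obs)
print(obs.describe())
```

A zero in every entry of that missing-value count is what the text's "no NAs" conclusion corresponds to.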

Based on this analysis, all observations are complete for all dates included in the dataset. There are no listings for NA s for any of the columns, and this indicates no missing values in the dates for which there’s a row in the data.

However, this does not guarantee that every date between the start date and end date of the study period are included in the recorded data. Sometimes, some dates might not get recorded at all in the dataset, and the summary function won’t help you determine when this is the case. One common example in environmental epidemiology is with ozone pollution data. These are sometimes only measured in the warm season, and so may be shared in a dataset with all dates outside of the warm season excluded.

There are a few alternative explorations you can do to check this. Perhaps the easiest is to check the number of days between the start and end date of the study period, and then see if the number of observations in the dataset is the same:
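One way to sketch this check, assuming obs has one row per day with the date stored in a date column:

```r
# Days spanned by the study period (difference between last and first date)
date_diff <- as.numeric(max(obs$date) - min(obs$date))

# If every date is present exactly once, the row count should be
# one more than the difference in days
date_diff
nrow(obs)
```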

This indicates that there is an observation for every date over the study period, since the number of observations should be one more than the time difference. In the next question, we’ll be plotting observations by time, and typically this will also help you see if there are large chunks of missing dates in the data.

You can use a simple plot to visualize patterns over time in both the exposure and the outcome. For example, the following code plots a dot for each daily temperature observation over the study period. The points are set to a smaller size ( size = 0.5 ) and plotted with some transparency ( alpha = 0.5 ) since there are so many observations.
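A sketch of such a plot with ggplot2, assuming columns date and tmean:

```r
library(ggplot2)

# Small, semi-transparent points keep the dense daily series readable
ggplot(obs, aes(x = date, y = tmean)) +
  geom_point(size = 0.5, alpha = 0.5) +
  labs(x = "Date", y = "Daily mean temperature")
```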

There is (unsurprisingly) clear evidence here of a strong seasonal trend in mean temperature, with values typically lowest in the winter and highest in the summer.

You can plot the outcome variable in the same way:
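For example, assuming the all-ages daily mortality count is in the all column:

```r
library(ggplot2)

ggplot(obs, aes(x = date, y = all)) +
  geom_point(size = 0.5, alpha = 0.5) +
  labs(x = "Date", y = "Daily mortality count")
```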

Again, there are seasonal trends, although in this case they are inverted: mortality tends to be highest in the winter and lowest in the summer. Further, the seasonal pattern is not equally strong in all years; some years have a much higher winter peak, probably in conjunction with severe influenza seasons.

Another way to look for seasonal trends is with a heatmap-style visualization, with day of year along the x-axis and year along the y-axis. This allows you to see patterns that repeat around the same time of the year each year (and also unusual deviations from normal seasonal patterns).

For example, here's a plot showing temperature in each year, where the observations are aligned on the x-axis by time in year. We're using doy, which stands for "day of year" (i.e., Jan 1 = 1, Jan 2 = 2, ..., Dec 31 = 365, as long as it's not a leap year), as the measure of time within the year. We've reversed the y-axis so that the earliest years in the study period start at the top of the visual and later study years come below; this is a personal style, and it would be no problem to leave the y-axis as-is. We've used the viridis color scale for the fill, since it has a number of features that make it preferable to the default R color scale, including that it remains perceptible for most types of color blindness and can be printed in grayscale and still be interpreted correctly.
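A sketch of this style of heatmap; doy and year are recomputed here with lubridate in case they are not already columns in the data:

```r
library(dplyr)
library(ggplot2)
library(lubridate)

obs |>
  mutate(doy = yday(date),       # day of year (1-366)
         year = year(date)) |>
  ggplot(aes(x = doy, y = year, fill = tmean)) +
  geom_tile() +
  scale_y_reverse() +                          # earliest study years at the top
  scale_fill_viridis_c(name = "Mean\ntemperature") +
  labs(x = "Day of year", y = "Year")
```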

From this visualization, you can see that temperatures tend to be higher in the summer months and lower in the winter months. “Spells” of extreme heat or cold are visible—where extreme temperatures tend to persist over a period, rather than randomly fluctuating within a season. You can also see unusual events, like the extreme heat wave in the summer of 2003, indicated with the brightest yellow in the plot.

We created the same style of plot for the health outcome. In this case, we focused on mortality among the oldest age group, as temperature sensitivity tends to increase with age, so this might be where the strongest patterns are evident.

For mortality, there tends to be an increase in the winter compared to the summer. Some winters have stretches with particularly high mortality—these are likely a result of seasons with strong influenza outbreaks. You can also see on this plot the impact of the 2003 heat wave on mortality among this oldest age group—an unusual spot of light green in the summer.

Some of the plots we created in the last section help in exploring this question. For example, the following plot shows a clear pattern of decreasing daily mortality counts, on average, over the course of the study period:

It can be helpful to add a smooth line to help detect these longer-term patterns, which you can do with geom_smooth :
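For example (the smoothing method and span here are the ggplot2 defaults):

```r
library(ggplot2)

ggplot(obs, aes(x = date, y = all)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth() +   # default smoother (loess or GAM, depending on data size)
  labs(x = "Date", y = "Daily mortality count")
```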

You could also take the median mortality count across each year in the study period, although you should take out any years without a full year’s worth of data before you do this, since there are seasonal trends in the outcome:
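One way to sketch this, dropping partial years before summarizing:

```r
library(dplyr)
library(ggplot2)
library(lubridate)

obs |>
  mutate(year = year(date)) |>
  group_by(year) |>
  filter(n() >= 365) |>                  # drop partial years (seasonal trends would bias them)
  summarize(median_mort = median(all)) |>
  ggplot(aes(x = year, y = median_mort)) +
  geom_point() +
  labs(x = "Year", y = "Median daily mortality count")
```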

Again, we see a clear pattern of decreasing mortality rates in this city over time. This means we need to think carefully about long-term time patterns as a potential confounder. It will be particularly important to think about this if the exposure also has a strong pattern over time. For example, air pollution regulations have meant that, in many cities, there may be long-term decreases in pollution concentrations over a study period.

The data already include day of week as a column ( dow ). However, it is stored as a character data type, so the order of weekdays is not encoded (e.g., Monday comes before Tuesday). This makes it hard to look for patterns related to things like weekend versus weekday.

We could convert this to a factor and encode the weekday order when we do it, but it’s even easier to just recreate the column from the date column. We used the wday function from the lubridate package to do this—it extracts weekday as a factor, with the order of weekdays encoded (using a special “ordered” factor type):
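A sketch of recreating the column from date :

```r
library(dplyr)
library(lubridate)

obs <- obs |>
  mutate(dow = wday(date, label = TRUE, abbr = FALSE))

# wday(..., label = TRUE) returns an ordered factor,
# so weekday order is encoded
levels(obs$dow)
```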

We looked at the mean, median, and 25th and 75th quantiles of the mortality counts by day of week:
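For example, with dplyr (the summary column names are illustrative):

```r
library(dplyr)

obs |>
  group_by(dow) |>
  summarize(mean_mort   = mean(all),
            median_mort = median(all),
            q25         = quantile(all, 0.25),
            q75         = quantile(all, 0.75))
```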

Mortality tends to be a bit higher on weekdays than weekends, but it’s not a dramatic difference.

We did the same check for temperature:

In this case, there does not seem to be much of a pattern by weekday.

You can also visualize the association using boxplots:

You can also try violin plots—these show the full distribution better than boxplots, which only show quantiles.
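Sketches of both plot styles, assuming dow holds the weekday factor:

```r
library(ggplot2)

# Boxplots: quantiles of daily mortality by day of week
ggplot(obs, aes(x = dow, y = all)) +
  geom_boxplot() +
  labs(x = "Day of week", y = "Daily mortality count")

# Violin plots: the full distribution, not just quantiles
ggplot(obs, aes(x = dow, y = all)) +
  geom_violin() +
  labs(x = "Day of week", y = "Daily mortality count")
```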

All these reinforce that there are some small differences in weekend versus weekday patterns for mortality. There isn't much pattern by weekday for temperature, so in this case weekday is unlikely to be a confounder (the same is not true of air pollution, which often varies with commuting patterns and so can have stronger weekend/weekday differences). However, since day of week does help explain some variation in the health outcome, it might be worth including in our models anyway, to help reduce random noise.

Exploratory data analysis is an excellent tool for exploring your data before you begin fitting a statistical model, and you should get in the habit of using it regularly in your research. Dominici and Peng ( 2008a ) provides another walk-through of exploring this type of data, including some more advanced tools for exploring autocorrelation and time patterns.

## 3.5 Statistical modeling for a time series study

Now that we’ve explored the data typical of a time series study in climate epidemiology, we’ll look at how we can fit a statistical model to those data to gain insight into the relationship between the exposure and acute health effects. Very broadly, we’ll be using a statistical model to answer the question: How does the relative risk of a health outcome change as the level of the exposure changes, after controlling for potential confounders?

In the rest of this chapter and the next chapter, we’ll move step-by-step to build up to the statistical models that are now typically used in these studies. Along the way, we’ll discuss key components and choices in this modeling process. The statistical modeling is based heavily on regression modeling, and specifically generalized linear regression. To help you get the most of this section, you may find it helpful to review regression modeling and generalized linear models. Some resources for that include Dunn and Smyth ( 2018 ) and James et al. ( 2013 ) .

One of the readings for this week, Vicedo-Cabrera, Sera, and Gasparrini ( 2019 ) , includes a section on fitting exposure-response functions to describe the association between daily mean temperature and mortality risk. This article includes example code in its supplemental material, with code for fitting the model to these time series data in the file named “01EstimationERassociation.r.” Please download that file and take a look at the code.

The model in the code may at first seem complex, but it is made up of a number of fairly straightforward pieces:

- The model framework is a generalized linear model (GLM)
- This GLM is fit assuming an error distribution and a link function appropriate for count data
- The GLM is fit assuming an error distribution that is also appropriate for data that may be overdispersed
- The model includes control for day of the week by including a categorical variable
- The model includes control for long-term and seasonal trends by including a spline (in this case, a natural cubic spline ) for the day in the study
- The model fits a flexible, non-linear association between temperature and mortality risk, also using a spline
- The model fits a flexible non-linear association between mortality risk on the current day and temperature on the current day and a series of preceding days, using a distributed lag approach
- The model describes the two previous non-linear associations jointly, fitting them through one construct in the GLM, a cross-basis term

In this section and the next chapter, we will work through the elements, building up the code to get to the full model that is fit in Vicedo-Cabrera, Sera, and Gasparrini ( 2019 ) .

Fitting a GLM to time series data

The generalized linear model (GLM) framework unites a number of types of regression models you may have previously worked with. One basic regression model that can be fit within this framework is a linear regression model. However, the framework also allows you to fit, among others, logistic regression models (useful when the outcome variable can only take one of two values, e.g., success / failure or alive / dead) and Poisson regression models (useful when the outcome variable is a count or rate). This generalized framework brings some unity to these different types of regression models. From a practical standpoint, it has allowed software developers to easily provide a common interface to fit these types of models. In R, the common function call to fit GLMs is glm .

Within the GLM framework, the elements that separate different regression models include the link function and the error distribution. The error distribution encodes the assumption you are enforcing about how the errors after fitting the model are distributed. If the outcome data are normally distributed (a.k.a., follow a Gaussian distribution), after accounting for variance explained in the outcome by any of the model covariates, then a linear regression model may be appropriate. For count data—like numbers of deaths a day—this is unlikely, unless the average daily mortality count is very high (count data tend to come closer to a normal distribution the further their average gets from 0). For binary data—like whether each person in a study population died on a given day or not—normally distributed errors are also unlikely. Instead, in these two cases, it is typically more appropriate to fit GLMs with Poisson and binomial “families,” respectively, where the family designation includes an appropriate specification for the variance when fitting the model based on these outcome types.

The other element that distinguishes different types of regression within the GLM framework is the link function. The link function applies a transformation on the combination of independent variables in the regression equation when fitting the model. With normally distributed data, an identity link is often appropriate—with this link, the combination of independent variables remain unchanged (i.e., keep their initial “identity”). With count data, a log link is often more appropriate, while with binomial data, a logit link is often used.

Finally, data will often not perfectly adhere to assumptions. For example, the Poisson family of GLMs assumes that variance follows a Poisson distribution (the probability mass function for a Poisson distribution \(X \sim {\sf Poisson}(\mu)\) is \(f(k;\mu)=Pr[X=k]= \displaystyle \frac{\mu^{k}e^{-\mu}}{k!}\) , where \(k\) is the number of occurrences and \(\mu\) is the expected number of cases). With this distribution, the variance is equal to the mean ( \(\mu=E(X)=Var(X)\) ). With real-life data, this assumption is often not valid; in many cases the variance in real-life count data is larger than the mean. This can be accounted for when fitting a GLM by setting an error distribution that does not require the variance to equal the mean. Instead, both a mean value and something like a variance are estimated from the data, assuming an overdispersion parameter \(\phi\) so that \(Var(X)=\phi E(X)\) . In environmental epidemiology, time series models are often fit to allow for this overdispersion, because if the data are overdispersed but the model does not account for this, the standard errors on the estimates of the model parameters may be artificially small. If the data are not overdispersed ( \(\phi=1\) ), the model will identify this when being fit to the data, so it is typically better to allow for overdispersion in the model (if the dataset were small, you might want to be parsimonious and avoid unneeded complexity, but this is typically not the case with time series data).

In the next section, you will work through the steps of developing a GLM to fit the example dataset obs . For now, you will only fit a linear association between mean daily temperature and mortality risk, eventually including control for day of week. In later work, especially the next chapter, we will build up other components of the model, including control for the potential confounders of long-term and seasonal patterns, as well as advancing the model to fit non-linear associations, distributed by time, through splines, a distributed lag approach, and a cross-basis term.

Applied: Fitting a GLM to time series data

In R, the function call used to fit GLMs is glm . Most of you have likely covered GLMs, and ideally this function call, in previous courses. If you are unfamiliar with its basic use, you will want to refresh yourself on this topic—you can use some of the resources noted earlier in this section and in the chapter’s “Supplemental Readings” to do so.

- Fit a GLM to estimate the association between mean daily temperature (as the independent variable) and daily mortality count (as the dependent variable), first fitting a linear regression. (Since the mortality data are counts, we will want to shift to a different type of regression within the GLM framework, but this step allows you to develop a simple glm call, and to remember where to include the data and the independent and dependent variables within this function call.)
- Change your function call to fit a regression model in the Poisson family.
- Change your function call to allow for overdispersion in the outcome data (daily mortality count). How does the estimated coefficient for temperature change between the model fit for #2 and this model? Check both the central estimate and its estimated standard error.
- Change your function call to include control for day of week.
- Fit a GLM to estimate the association between mean daily temperature (as the independent variable) and daily mortality count (as the dependent variable), first fitting a linear regression.

This is the model you are fitting:

\(Y_{t}=\beta_{0}+\beta_{1}X1_{t}+\epsilon\)

where \(Y_{t}\) is the mortality count on day \(t\) , \(X1_{t}\) is the mean temperature for day \(t\) and \(\epsilon\) is the error term. Since this is a linear model we are assuming a Gaussian error distribution \(\epsilon \sim {\sf N}(0, \sigma^{2})\) , where \(\sigma^{2}\) is the variance not explained by the covariates (here just temperature).

To do this, you will use the glm call. If you would like to save model fit results to use later, you assign the output to a named R object ( mod_linear_reg in the example code). If your study data are in a dataframe, you can specify these data in the glm call with the data parameter. Once you do this, you can use column names directly in the model formula. In the model formula, the dependent variable is specified first ( all , the column for daily mortality counts for all ages, in this example), followed by a tilde ( ~ ), followed by all independent variables (only tmean in this example). If multiple independent variables are included, they are joined using + . We'll see an example when we start adding control for confounders later.
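For example:

```r
# Gaussian family with identity link is the default for glm,
# so this fits an ordinary linear regression
mod_linear_reg <- glm(all ~ tmean, data = obs)
```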

Once you have fit a model and assigned it to an R object, you can explore it and use resulting values. First, the print method for a regression model gives some summary information. This method is automatically called if you enter the model object’s name at the console:

More information is printed if you run the summary method on the model object:

Make sure you are familiar with the information provided from the model object, as well as how to interpret values like the coefficient estimates and their standard errors and p-values. These basic elements should have been covered in previous coursework (even if a different programming language was used to fit the model), and so we will not be covering them in great depth here, but instead focusing on some of the more advanced elements of how regression models are commonly fit to data from time series and case-crossover study designs in environmental epidemiology. For a refresher on the basics of fitting statistical models in R, you may want to check out Chapters 22 through 24 of Wickham and Grolemund ( 2016 ) , a book that is available online, as well as Dunn and Smyth ( 2018 ) and James et al. ( 2013 ) .

Finally, there are some newer tools for extracting information from model fit objects. The broom package extracts different elements from these objects and returns them in a “tidy” data format, which makes it much easier to use the output further in analysis with functions from the “tidyverse” suite of R packages. These tools are very popular and powerful, and so the broom tools can be very useful in working with output from regression modeling in R.

The broom package includes three main functions for extracting data from regression model objects. First, the glance function returns overall data about the model fit, including the AIC and BIC:

The tidy function returns data at the level of the model coefficients, including the estimate for each model parameter, its standard error, test statistic, and p-value.

Finally, the augment function returns data at the level of the original observations, including the fitted value for each observation, the residual between the fitted and true value, and some measures of influence on the model fit.
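A sketch of all three calls, applied to the linear model fit earlier:

```r
library(broom)

glance(mod_linear_reg)   # model-level: AIC, BIC, deviance, ...
tidy(mod_linear_reg)     # coefficient-level: estimate, std.error, statistic, p.value
augment(mod_linear_reg)  # observation-level: .fitted, .resid, influence measures
```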

One way you can use augment is to graph the fitted values for each observation after fitting the model:
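A sketch, plotting the observed counts with the fitted line overlaid:

```r
library(broom)
library(ggplot2)

augment(mod_linear_reg) |>
  ggplot(aes(x = tmean)) +
  geom_point(aes(y = all), size = 0.5, alpha = 0.5) +
  geom_line(aes(y = .fitted), color = "red") +
  labs(x = "Daily mean temperature", y = "Daily mortality count")
```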

For more on the broom package, including some excellent examples of how it can be used to streamline complex regression analyses, see Robinson ( 2014 ) . There is also a nice example of how it can be used in one of the chapters of Wickham and Grolemund ( 2016 ) , available online at https://r4ds.had.co.nz/many-models.html .

A linear regression is often not appropriate when the outcome variable is a count, as with the example data, since such data often don't follow a normal distribution. A Poisson regression is typically preferred.

For a count distribution where \(Y \sim {\sf Poisson(\mu)}\) we typically fit a model such as

\(g(Y)=\beta_{0}+\beta_{1}X1\) , where \(g()\) represents the link function, in this case a log function so that \(log(Y)=\beta_{0}+\beta_{1}X1\) . We can also express this as \(Y=exp(\beta_{0}+\beta_{1}X1)\) .

In the glm call, you can specify this with the family parameter, for which “poisson” is one choice.
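For example (the model object name is illustrative):

```r
# family = "poisson" uses a log link by default
mod_pois_reg <- glm(all ~ tmean, data = obs, family = "poisson")
```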

One thing to keep in mind with this change is that the model now uses a non-identity link between the combination of independent variable(s) and the dependent variable. You will need to keep this in mind when you interpret the estimates of the regression coefficients. While the coefficient estimate for tmean from the linear regression could be interpreted as the expected increase in mortality counts for a one-unit (i.e., one degree Celsius) increase in temperature, now the estimated coefficient should be interpreted as the expected increase in the natural log-transform of mortality count for a one-unit increase in temperature.

You can see this even more clearly if you look at the association between temperature for each observation and the expected mortality count fit by the model. First, if you look at the fitted values without transforming them, they will still be on the scale of the log-transformed mortality count. You can see from the range of the y-scale that these values are the log of expected mortality rather than expected mortality itself (compare, for example, to the similar plot from the first, linear model), and that the fitted association is linear for that transformation, not for untransformed mortality counts:

You can use exponentiation to transform the fitted values back to just be the expected mortality count based on the model fit. Once you make this transformation, you can see how the link in the Poisson family specification enforced a curved relationship between mean daily temperature and the untransformed expected mortality count.

For this model, we can interpret the coefficient for the temperature covariate as the expected log relative risk in the health outcome associated with a one-unit increase in temperature. We can exponentiate this value to get an estimate of the relative risk:

If you want to estimate the confidence interval for this estimate, you should calculate that before exponentiating.
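A sketch using broom, where the Poisson model object is assumed to be named mod_pois_reg:

```r
library(broom)
library(dplyr)

tidy(mod_pois_reg, conf.int = TRUE) |>
  filter(term == "tmean") |>
  mutate(rr      = exp(estimate),   # relative risk per one-unit temperature increase
         rr_low  = exp(conf.low),   # exponentiate *after* computing the CI
         rr_high = exp(conf.high))
```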

In the R glm call, there is a family that is similar to Poisson (including using a log link), but that allows for overdispersion. You can specify it with the “quasipoisson” choice for the family parameter in the glm call:
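For example (the object name is illustrative):

```r
# Quasi-Poisson: still a log link, but the dispersion parameter
# is estimated from the data rather than fixed at 1
mod_ovrdsp_reg <- glm(all ~ tmean, data = obs, family = "quasipoisson")
summary(mod_ovrdsp_reg)   # the printout now includes the dispersion parameter
```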

When you use this family, there will be some new information in the summary for the model object. It will now include a dispersion parameter ( \(\phi\) ). If this is close to 1, then the data were close to the assumed variance for a Poisson distribution (i.e., there was little evidence of overdispersion). In the example, the overdispersion is around 5, suggesting the data are overdispersed (this might come down some when we start including independent variables that explain some of the variation in the outcome variable, like long-term and seasonal trends).

If you compare the estimates of the temperature coefficient from the Poisson regression with those when you allow for overdispersion, you’ll see something interesting:

The central estimate ( estimate column) is very similar. However, the estimated standard error is larger when the model allows for overdispersion. This indicates that the Poisson model was too simple, and that its inherent assumption that data were not overdispersed was problematic. If you naively used a Poisson regression in this case, then you would estimate a confidence interval on the temperature coefficient that would be too narrow. This could cause you to conclude that the estimate was statistically significant when you should not have (although in this case, the estimate is statistically significant under both models).

Day of week is included in the data as a categorical variable, using a data type in R called a factor. You are now essentially fitting this model:

\(log(Y)=\beta_{0}+\beta_{1}X1+\gamma^{'}X2\) ,

where \(X2\) is a categorical variable for day of the week and \(\gamma^{'}\) represents a vector of parameters associated with each category.

It is pretty straightforward to include factors as independent variables in calls to glm : you just add the column name to the list of other independent variables with a + . In this case, we need to do one more step: earlier, we added order to dow , so it would “remember” the order of the week days (Monday before Tuesday, etc.). However, we need to strip off this order before we include the factor in the glm call. One way to do this is with the factor call, specifying ordered = FALSE . Here is the full call to fit this model:
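A sketch of that call, building on the quasi-Poisson model from earlier (the object name is illustrative):

```r
mod_ctrl_dow <- glm(all ~ tmean + factor(dow, ordered = FALSE),
                    data = obs, family = "quasipoisson")
summary(mod_ctrl_dow)
```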

When you look at the summary for the model object, you can see that the model has fit a separate model parameter for six of the seven weekdays. The one weekday that isn’t fit (Sunday in this case) serves as a baseline —these estimates specify how the log of the expected mortality count is expected to differ on, for example, Monday versus Sunday (by about 0.03), if the temperature is the same for the two days.

You can also see from this summary that the coefficients for the day of the week are all statistically significant. Even though we didn’t see a big difference in mortality counts by day of week in our exploratory analysis, this suggests that it does help explain some variance in mortality observations and will likely be worth including in the final model.

The model now includes day of week when fitting an expected mortality count for each observation. As a result, if you plot fitted values of expected mortality versus mean daily temperature, you’ll see some “hoppiness” in the fitted line:

This is because each fitted value is also incorporating the expected influence of day of week on the mortality count, and that varies across the observations (i.e., you could have two days with the same temperature, but different expected mortality from the model, because they occur on different days).

If you plot the model fits separately for each day of the week, you’ll see that the line is smooth across all observations from the same day of the week:

Wrapping up

At this point, the coefficient estimates suggest that risk of mortality tends to decrease as temperature increases. Do you think this is reasonable? What else might be important to build into the model based on your analysis up to this point?


## The Case Time Series Design

Antonio Gasparrini

1 Department of Public Health Environments and Society, London School of Hygiene & Tropical Medicine, London UK

2 Centre for Statistical Methodology, London School of Hygiene & Tropical Medicine, London UK

## Associated Data

Online supplemental material includes documents for simulating data with the same features of the datasets used in the two case studies, and for reproducing the steps and results of the analyses presented in the article. An updated version complemented with scripts of the R statistical software is available at https://github.com/gasparrini/CaseTimeSeries .

Modern data linkage and technologies provide a way to reconstruct detailed longitudinal profiles of health outcomes and predictors at the individual or small-area level. While these rich data resources offer the possibility to address epidemiologic questions that could not be feasibly examined using traditional studies, they require innovative analytical approaches. Here we present a new study design, called case time series, for epidemiologic investigations of transient health risks associated with time-varying exposures. This design combines a longitudinal structure and flexible control of time-varying confounders, typical of aggregated time series, with individual-level analysis and control-by-design of time-invariant between-subject differences, typical of self-matched methods such as case–crossover and self-controlled case series. The modelling framework is highly adaptable to various outcome and exposure definitions, and it is based on efficient estimation and computational methods that make it suitable for the analysis of highly informative longitudinal data resources. We assess the methodology in a simulation study that demonstrates its validity under defined assumptions in a wide range of data settings. We then illustrate the design in real-data examples: a first case study replicates an analysis on influenza infections and the risk of myocardial infarction using linked clinical datasets, while a second case study assesses the association between environmental exposures and respiratory symptoms using real-time measurements from a smartphone study. The case time series design represents a general and flexible tool, applicable in different epidemiologic areas for investigating transient associations with environmental factors, clinical conditions, or medications.

Observational studies aim to discover and understand causal relationships between exposures and health outcomes through the analysis of epidemiologic data. 1 Paramount to this objective is removing biases due to the non-experimental setting, in the first place confounding. It is, therefore, no surprise that traditional approaches based on cohort and case–control methods have been complemented with, and extended by, alternative study designs and statistical techniques applicable in specific contexts. An active area of research is so-called self-matched studies, which investigate acute effects of intermittent exposures by comparing observations sampled at different times within the same unit. These include individual-level designs such as the case–crossover, 2 the case-only, 3 the case–time–control, 4 the exposure–crossover, 5 and the self-controlled case series, 6 among others. An alternative but related epidemiologic method for aggregated data is the time series design, applied in particular in environmental studies. 7 A thorough overview of self-matched methods is provided in a recent publication by Mostofsky and colleagues. 8

This landscape is likely to be transformed further by ongoing technologic and methodologic developments in data science, which offers unique opportunities for epidemiologic investigations, for instance through electronic health records linkage, 9 exposure modelling, 10 and real-time measurements technologies. 11 , 12 Ultimately, these data resources can be used to reconstruct detailed longitudinal profiles with repeated measures of health outcomes and various risk factors, offering the chance to investigate complex aetiological mechanisms and to test elaborate causal hypotheses. However, existing self-matched methods present limitations in this context, and new analytical techniques must be developed for epidemiologic investigations in these intensive longitudinal and big data settings. 13

In this contribution, we present the case time series design , a novel self-matched method for the analysis of transient changes in risk of acute outcomes associated with time-varying exposures. This innovative design combines the longitudinal modelling structure of time series analysis with the individual-level setting of other self-matched methods, offering a flexible and generally applicable tool for modern epidemiologic studies. First, we introduce the case time series design and its features, including the design structure, modelling framework, estimation methods, and key assumptions. Later, we assess the methodology in a simulation study that evaluates its performance under various data generating scenarios. Then, we demonstrate its application through two real-data epidemiologic analyses. In a final discussion section, we describe the epidemiologic context, advantages, and limitations, and areas of further development. We add documents for reproducing the real-data examples and the simulation study as eAppendix 1–3 in the online supplementary material, with an updated version complemented with R scripts available at the author's personal website and GitHub page (see ‘Data and Code’).

## A Novel Self-Matched Design

The study design proposed here, called case time series, is a generally applicable tool for the analysis of transient health associations with time-varying risk factors. This novel design considers multiple observational units, defined as cases, for which data are longitudinally collected over a pre-defined follow-up period. The main design feature that defines the case time series methodology is the split of the follow-up period into equally spaced time intervals, which results in a set of multiple case-level time series. Data forming the series can originate from actual sequential observations or be reconstructed by aggregating or averaging longitudinal measurements, but, ultimately, they are assumed to represent a continuous temporal frame. A graphical representation is provided in Figure 1 , showing case-specific time series data with various types of measurements of outcome and exposure collected for multiple subjects.

Graphical representation of data configurations for the case time series design applied in the analysis of transient health risks of time-varying exposures. The figure represents three examples of data for three subjects (cases) followed for a period of time, with equally spaced measures of outcome and exposure that form case-level time series. This setting allows the definition of predictors and time axes as unique and sequential observations. The three examples illustrate different measures of outcome and exposure. The former is represented as counts (top), a binary indicator (middle), or a continuous measure (bottom). Similarly, exposure can be represented by a simple binary episode indicator (top), or continuous term (middle and bottom). Continuous variables are represented by shaded colours. The graphical representation demonstrates the potential of the case time series design to be applied in various research areas for modelling associations defined by different types of measurements.

The case time series data setting provides a flexible framework that can be adapted for studying a wide range of epidemiologic associations. For instance, outcomes, exposures, and other predictors can be represented by indicators for events or episodes, or by continuous measurements that vary across units and times, as in Figure 1 . The time intervals can be of any length (from seconds to years), depending on the temporal association between outcome and exposures and on practical design considerations. A case is a general definition, and it can represent a subject or other entities, such as a geographic area to which observations are assigned, thus allowing analyses to be conducted either at individual level or with aggregated data. Ultimately, the case time series structure combines characteristics of various other study designs: it allows individual-level analyses of transient risk associations as in traditional self-matched methods, but it retains the longitudinal temporal frame typical of time series data, with ordered repeated measures of outcomes, exposures, and other predictors. As discussed below, this flexible design setting offers important advantages.
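
As a minimal illustration of this data structure, the sketch below (in Python, purely illustrative; the paper's own material uses R) splits a short follow-up into equally spaced daily intervals and aggregates hypothetical event records into one count series per case:

```python
import numpy as np

# Hypothetical raw records as (case_id, event_day) pairs over a
# 10-day follow-up. Splitting the follow-up into equally spaced
# intervals (here, days) turns them into one time series per case.
events = [(0, 2), (0, 2), (0, 7), (1, 0), (1, 9)]
n_cases, n_days = 2, 10

series = np.zeros((n_cases, n_days), dtype=int)
for case_id, day in events:
    series[case_id, day] += 1

print(series[0])  # daily event counts for case 0: two events on day 2
```

The same scheme extends to binary indicators or averaged continuous measurements, as in the middle and bottom panels of Figure 1.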

## Modelling Framework

A case time series model can be written in a regression form by defining the expectation of a given health outcome y_it for case i at time t in relation to a series of predictor terms. Algebraically, the model can be written as:

g(E[y_it]) = ξ_i(k) + f(x_it, ℓ) + Σ_j s_j(t) + Σ_p h_p(z_p,it)    (1)

The definition in Eq. (1) resembles a classic time series regression model traditionally used in environmental epidemiology, where the ordered and sequential nature of the data allows the application of cutting-edge analytical techniques. 7 Specifically, the function f(x, ℓ) specifies the association with the exposure of interest x, defined either as a binary episode indicator or as a continuous variable, optionally allowing for non-linearity and complex temporal dependencies along the lag dimension ℓ. These complex relationships can be modelled through distributed lag linear and non-linear models (DLMs and DLNMs), which can flexibly define cumulative effects of multiple exposure episodes. 14 The terms s_j represent functions expressed at different timescales to model temporal variations in risk associated with underlying trends or seasonality, among others. 15 Other measurable time-varying confounders z_p can be modelled through functions h_p, which can include for instance age or time since a specific intervention. The two sets of terms s_j and h_p ensure strict control of temporal variation in risks over multiple time axes. The outcome y can represent binary indicators, counts of rare or frequent events, or continuous measures. The analysis can be performed on multiple cases i = 1, …, n, with intercepts ξ_i(k) expressing baseline risks for different risk sets, optionally stratified further in time strata k = 1, …, K_i nested within them, allowing an additional within-case control for temporal variations in risk.

The estimation procedures in case time series analyses rely on estimators and efficient computational algorithms provided by the general framework of fixed-effects models. 16 These were developed in econometrics and often applied in panel studies with repeated observations. 10 , 17 Fixed-effects methods allow the estimation of coefficients for the various functions in Eq. (1) , without including the potentially high number of case/stratum-specific intercepts ξ i(k) , treated as nuisance (or incidental) parameters. 16

Fixed-effects estimators are available for the three main types of outcomes and distributions within the extended exponential family of generalized linear models (GLMs). Specifically, for continuous outcomes with a Gaussian distribution, the estimation procedure involves mean-centring and a simple correction of the degrees of freedom. For event-type indicator or count outcomes following a Bernoulli and Poisson distribution, respectively, estimators for fixed-effects models with canonical logit and log links can be defined through conditional likelihoods for logistic and Poisson regression. 18 , 19 These are forms of partial likelihoods that are derived by defining reduced sufficient statistics for ξ i(k) , obtained by conditioning on the total number of events within each of the n cases or n × K strata.
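
A minimal numeric sketch of the Gaussian fixed-effects ("within") estimator, using simulated data in Python (illustrative only, not the paper's implementation). Mean-centring outcome and exposure within each case absorbs the case-specific intercepts, after which ordinary least squares recovers the common exposure effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated case time series: 50 cases followed for 100 intervals,
# with case-specific baseline levels (intercepts) and a common
# exposure effect beta = 0.5 on a continuous Gaussian outcome.
n_cases, n_times, beta = 50, 100, 0.5
case = np.repeat(np.arange(n_cases), n_times)
x = rng.normal(size=n_cases * n_times)            # time-varying exposure
xi = rng.normal(scale=3, size=n_cases)[case]      # baseline risks xi_i
y = xi + beta * x + rng.normal(scale=0.1, size=x.size)

def within_transform(v, groups):
    """Mean-centre v within each group, removing group intercepts."""
    means = np.bincount(groups, weights=v) / np.bincount(groups)
    return v - means[groups]

# Fixed-effects estimate: OLS on the within-transformed variables.
y_c = within_transform(y, case)
x_c = within_transform(x, case)
beta_hat = (x_c @ y_c) / (x_c @ x_c)
print(round(beta_hat, 3))  # close to the true beta of 0.5
```

Standard errors would additionally require the degrees-of-freedom correction mentioned above, which this sketch omits.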

The main advantage of fixed-effects models is that the effect of any unmeasured predictor that does not vary within each risk set is absorbed by the intercept ξ i(k) , and therefore the related confounding effect is controlled for implicitly by design, as in other self-matched methods. 8 In addition, the within-case design offers important computational advantages, especially from a big data perspective. First, the analysis is restricted to informative strata, i.e. cases and risk sets with variation in both outcome and exposure. Second, the estimators are based on efficient computational schemes, where the conditional or fixed-effect likelihood is defined by the sum of parts related to multiple risk sets, and the corresponding nuisance parameters ξ i(k) are not directly estimated.
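
The restriction to informative strata follows directly from the conditional likelihood. A small sketch (illustrative Python, with a simplified one-case contribution) shows that, after conditioning on the case total, a case with no events contributes a constant to the log-likelihood and can be dropped:

```python
import numpy as np

# Conditioning a Poisson model on the total count of a case turns its
# contribution into a multinomial kernel; a case with zero events has
# log-likelihood 0 whatever the coefficients, so it is non-informative.
def cond_poisson_loglik(y, eta):
    """Conditional (on sum(y)) Poisson log-likelihood kernel for one case."""
    if y.sum() == 0:
        return 0.0                       # contributes nothing to estimation
    p = np.exp(eta - eta.max())          # softmax of the linear predictors
    p /= p.sum()
    return float(y @ np.log(p))

no_events = np.zeros(5)
one_event = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
print(cond_poisson_loglik(no_events, np.array([0.0, 1.0, 2.0, 0.0, 0.0])))  # 0.0
print(cond_poisson_loglik(one_event, np.zeros(5)))                          # log(1/5)
```

The total log-likelihood is the sum of such case-level (or stratum-level) parts, which is what makes the computational scheme efficient.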

## Key Assumptions and Threats to Validity

As discussed above, the case time series framework has interesting design and modelling features that offer important advantages. On the other hand, its self-controlled structure, while appealing, only operates within an elementary causal framework and requires relatively strict assumptions to protect against key threats to validity. Specifically, the main requirements are the following:

- Distributional assumptions on the outcome. The outcome y it must represent conditionally independent observations originating from one of the standard family distributions, for instance, Poisson counts, Bernoulli binary indicators, or Gaussian continuous measures.
- Outcome-independent follow-up period. The period of observation for each case i must be independent of a given outcome, meaning that the follow-up period cannot be defined or modified by the outcome itself.
- Outcome-independent exposure distribution. The probability of the exposure x t must be independent of the outcome history prior to t , meaning that the occurrence of a given outcome must not modify the exposure distribution in the following period.
- Constant baseline risk conditionally on measured time-varying predictors. The baseline risk along the (strata of) follow-up period of each case i must be constant, meaning that variations in risks must be fully explained by model covariates.

These requirements enable valid conditional comparison of observations at different times within the follow-up of each case. Departures from these assumptions can produce imbalances in the temporal distribution of the outcome, the exposure, or unmeasured risk factors, thus determining spurious associations.

Some of these assumptions have been separately described in the literature of self-matched designs and fixed-effects models. 20 – 23 Specifically, Assumption 1 dictates that outcomes must occur independently, and in particular that the occurrence of a given outcome level or event must not modify the risk of following outcomes. 24 This assumption indirectly implies that outcomes are recurrent, and non-recurrent events can only be analysed if rare in the population of interest. 25 , 26 Assumptions 2 and 3 are those posing more limitations to the application of self-matched methods, as for many associations of interest an outcome can modify both the follow-up period and exposure distribution. 27 , 28 These requirements often restrict the case time series design to the analysis of exogenous exposures, which are by definition outcome-independent, and for which the observation period can be extended even beyond a terminal event, as in bi-directional case–crossover schemes. 29 Assumption 4 requires a constant baseline risk to ensure conditional exchangeability between observations within each risk set, 20 , 30 , 31 requiring that relevant time-varying confounders are included and all the terms in Eq. (1) are correctly specified.

Importantly, the design setting described above is not suited to represent complex causal scenarios characterised by dynamic mechanisms between time-varying terms. Specifically, feedback between outcomes and between outcome and exposure are forbidden by Assumptions 1 and 3, respectively, while more generally exposure–confounder feedback cannot be validly handled through traditional regression-based methods for longitudinal data. 32

## Simulation Study

We evaluated the performance of the case time series design in a set of simulated scenarios that involved various data-generating processes and assumptions ( Table ). Detailed information on the simulation settings, definitions, and additional results are provided in eAppendix 3 (online supplementary material). Briefly, we simulated and analysed data for 500 subjects followed up for one year, testing the method in terms of relative bias, coverage, and relative root mean square error (RMSE) in 50,000 replications. The basic scenario involves an outcome represented by repeated event counts and binary indicators of exposure episodes associated with a constant increase in risk in the next 10 days.

Results of the simulation study, with ten scenarios representing increasingly complex data settings (Scenarios 1-10), and four additional scenarios simulating data where the key design assumptions are violated (Scenarios 11-14). The table reports empirical figures of relative bias (%), coverage, and relative root mean square error (RMSE, %) in 50,000 replications. A detailed description of the scenarios, definitions, and additional results and graphs are provided in the supplementary material [ Appendix A ].

The first part of the simulation study (Scenarios 1-10) evaluates the performance of the new design in recovering the true association under increasingly complex data settings. Specifically, the scenarios depict different outcome and exposure types, the presence of common or subject-specific trends, time-invariant and time-dependent confounders, and more complex lag structures. Results in the Table indicate that the case time series design provides correct point estimates and confidence intervals in almost all ten scenarios. The small underestimation in Scenario 2 is consistent with the asymptotic bias of maximum likelihood estimators originating from the extreme imbalance of expected events between risk and control periods, previously described and defined analytically in the self-controlled case series literature. 33 eFigure 1 (online supplementary material) shows that the case time series models can correctly recover the true association, both in the basic Scenario 1 with constant risk and no confounding, and in the more complex Scenario 10 representing varying lag effects, strong temporal trends, and highly correlated confounders.

The second part of the simulation study (Scenarios 11-14) illustrates basic applications, but where each of the four assumptions, in turn, does not hold. Specifically, Scenario 11 describes the case where the occurrence of an outcome can change the risk status of a subject and temporarily reduce their underlying risk. This can occur for instance when the event results in the prescription of drugs or therapies. This induces a form of dependency in the outcome series that violates Assumption 1 and, in this example, results in a negative bias ( Table ). Scenario 12 simulates a different situation, namely when the outcome event carries a risk of censoring the follow-up, for instance, if it increases the probability of death. This contravenes Assumption 2 and generates a bias in the opposite direction. In Scenario 13, the outcome event reduces instead the probability of exposure episodes in the following two weeks, a situation that can occur for example if the event results in hospitalization or lifestyle changes. Here Assumption 3 does not hold, and the estimators are again biased upward. Finally, Scenario 14 illustrates the case of unobserved periods of lower baseline risk within the follow-up, for instance corresponding to holiday periods with a reduced probability of an outcome being reported. This undermines the conditional exchangeability requirements of Assumption 4 and induces a large positive bias.

## Illustrative Examples

This section illustrates the application of the case time series design in two real-data examples. These case studies are described here only for illustrative purposes, and they are not meant to offer substantive epidemiological evidence on the associations under study. Detailed information on the setting and sources of data can be found in the cited references. Documents in the online supplementary material ( eAppendix 1 and 2 ) provide notes and R code that reproduce the steps of these analyses using simulated data, and they offer details on the specific modelling choices.

## Flu and Myocardial Infarction

The first example replicates a published analysis that assessed the role of influenza infection as a trigger for acute myocardial infarction (AMI). 34 The data, retrieved by linking electronic health records from primary care and cohort databases for England and Wales, include 3,927 acute MI cases with at least one flu episode in the period 2003-2009. A representation of a sub-interval of the follow-up for six subjects is reported in eFigure 2 (online supplementary material). The original analysis relied on the self-controlled case series design to examine the association, using exposure windows in the 1-91 days after each flu episode and controlling for trends using 5-year age strata and trimester indicators. Limitations of this approach are the use of stratification to describe smooth continuous dependencies and the fact that multiple flu episodes experienced by some subjects caused the long exposure windows to overlap (see eFigure 2 ), requiring ad-hoc fixes that can generate biases. 35 Conversely, the rarity of the exposure, with most of the subjects experiencing a single flu episode, prevents the application of the case–crossover design, as most control sampling schemes would generate non-discordant case–referent sets.

We replicated the analysis with a case time series design, splitting the follow-up period of each subject into daily time series (see eAppendix 1 , online supplementary material). We fitted a fixed-effects Poisson model to estimate the flu–AMI association while controlling for underlying trends across multiple time scales. The model includes smooth functions to define the baseline risk, specifically using natural splines (with two knots at the interquartile range) for age and cyclic splines (with three degrees of freedom) for seasonality. More importantly, we applied DLMs defined by either splines (with knots at 3, 10, and 29 lags) or step functions (with strata 1-3, 4-7, 8-14, 15-28, and 29-91 lags) to describe temporal effects along the exposure window.
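
To make the step-function DLM concrete, the following sketch (illustrative Python; lag strata taken from the text, the paper's analysis uses R) builds the distributed-lag design matrix for a binary episode series, showing how overlapping 1-91 day windows from repeated episodes cumulate rather than requiring ad-hoc fixes:

```python
import numpy as np

# Step-function DLM basis with the lag strata of the flu-AMI example
# (lags 1-3, 4-7, 8-14, 15-28, 29-91 days). The exposure history of a
# binary episode indicator is cumulated within each lag stratum, so
# overlapping windows from repeated flu episodes simply add up.
strata = [(1, 3), (4, 7), (8, 14), (15, 28), (29, 91)]

def dlm_step_basis(x, strata):
    """Return a (len(x), len(strata)) matrix of cumulated lagged exposure."""
    n = len(x)
    basis = np.zeros((n, len(strata)))
    for j, (lo, hi) in enumerate(strata):
        for lag in range(lo, hi + 1):
            basis[lag:, j] += x[:n - lag]
    return basis

# Two flu episodes 10 days apart: their 1-91 day windows overlap.
x = np.zeros(200)
x[[50, 60]] = 1
B = dlm_step_basis(x, strata)
print(B[62])  # day 62 sits at lag 12 of one episode and lag 2 of the other
```

Regressing the outcome series on these columns yields one (log) rate ratio per lag stratum; the spline DLM replaces the step basis with smooth basis functions of the lag.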

Results are reported in Figure 2 . The left and middle panels display the variation in risk of AMI by age and season, showing how the case time series design allows modelling baseline trends fluctuating smoothly across multiple time axes. The right panel illustrates the risk after a flu episode within the selected lag period, as estimated using a DLM with spline functions. The graph indicates a high risk in the first days after a flu episode, which then attenuates and disappears after approximately one month. The same panel also includes the fit of the alternative DLM defined by step functions, which assumes a constant risk within exposure windows (see also eFigure 3 in the online supplementary material). This specification matches the stratification approach in the original self-controlled case series analysis, 34 although the case time series design with DLMs accounts for cumulative effects of potentially overlapping periods of flu episodes.

Results of the analysis on the association between influenza infection and acute myocardial infarction (AMI), as incident rate ratio (IRR) and 95% confidence intervals. The three panels show the AMI risk by age (left) and by season (middle), and the lag-response curve representing the risk in the 1-91 days after a flu episode (right). The latter is estimated in the main model using natural splines (continuous red line), with the results from an alternative model using step functions superimposed (dashed grey line).

## Environmental Exposures and Respiratory Symptoms

The second example illustrates a preliminary analysis of the role of multiple environmental stressors in increasing the risk of respiratory symptoms using smartphone technology. Data were collected within AirRater, an integrated online platform operating in Tasmania that combines symptom surveillance, environmental monitoring, and real-time notifications. 12 A smartphone app allowed the self-reported recording of respiratory symptoms and the reconstruction of personalized exposure series by linking geo-located positions with high-resolution spatio-temporal maps derived from environmental monitors (see Figure 3 ). Standard cohort analyses based on between-subject comparisons are unsuitable in this complex study setting, characterized by continuous recruitment, high dropout rates, and intermittent participation (see eFigure 4 in the online supplementary material). Similarly, the frequent and highly seasonal outcome poses problems in adopting a case–crossover design, with issues in selecting control times and concerns about the assumption of constant within-stratum risk. Finally, the presence of multiple continuous exposures prevents the application of the self-controlled case series design, either in its standard or extended forms. 36 , 37

Graphical representation of the individual time series of a subject participating in the AirRater study on the association between environmental exposures and respiratory symptoms. The four panels (from top to bottom) display the daily series of indicators of allergic events and levels of the three environmental stressors, represented by pollen (grains/m 3 ), PM 2 . 5 (μg/m 3 ), and temperature (°C).

We therefore applied a case time series design (see eAppendix 2 , online supplementary material). The analysis included 1,601 subjects followed between October 2015 and November 2018, with a total of 364,384 person–days. The event-type outcome was defined as daily indicators of reported respiratory symptoms, associated with individual exposure to pollen (grains/m 3 ), fine particulate matter (PM 2.5 , μg/m 3 ), and temperature (°C) ( Figure 3 ). We modelled the relationships using a fixed-effects logistic regression over a lag period of 0-3 days, using an unconstrained DLM for the linear association with PM 2.5 , and bi-dimensional spline DLNMs for specifying non-linear dependencies with pollen and temperature. 14 , 38 Strict temporal control was enforced by using subject/month strata intercepts, natural splines of time (with 8 df/year), and indicators of the day of the week, thus modelling individually varying baseline risks on top of shared long-term, seasonal, and weekly trends.

Figure 4 shows the preliminary results, with estimated associations reported as odds ratios (ORs) from the model that includes the three environmental stressors simultaneously. The graphs display the overall cumulative exposure-response relationships (top panels), interpreted as the net effects across lags, and the full bi-dimensional exposure-lag-response associations (bottom panels). 14 , 38 The lefthand panels indicate a positive association between risk of allergic symptoms and pollen, with a steep increase in risk that flattens out at high exposures, and a lagged effect up to 2 days. The middle panels suggest an independent association with PM 2.5 , where the risk is entirely limited to the same-day exposure. Finally, results in the righthand panels show a positive association with high ambient temperature, with the OR increasing above 1 beyond daily averages of 15°C.

Results of the analysis on the association between environmental exposures and respiratory symptoms, as odds ratio (OR) and 95% confidence intervals. The three columns of panels show estimated associations with pollen (left, grains/m 3 ), PM 2.5 (middle, μg/m 3 ), and temperature (right, °C). The top row of panels displays the net risk cumulated in the lag period 0-3 days as overall cumulative exposure-response associations, assumed linear for PM 2.5 and non-linear for pollen and temperature. The bottom row of panels shows instead the full exposure-lag-response associations, represented as the bi-dimensional risk surface for pollen and temperature or the lag-specific risks for a 10 μg/m 3 increase in PM 2.5 .

## Discussion

The novel case time series methodology offers a general modelling framework for the analysis of epidemiologic associations with time-varying exposures. The design is adaptable to various data settings for the analysis of highly informative longitudinal measurements, and it is particularly well-suited in applications with modern data resources such as individual-level exposure models and real-time technologies.

The main feature of the methodology is a flexible scheme that embeds a longitudinal time series structure in a within-subject design, providing unique modelling advantages. For instance, the sequential order of observations offers the opportunity to assess complex temporal relationships with multiple exposures, where patterns of cumulative effects for linear or non-linear dependencies can be easily modelled. Furthermore, the time series and self-controlled features offer a structure that enables strict control for confounding: time-invariant and time-varying factors can be adjusted for by stratifying the baseline risk between and within subjects, respectively, while residual temporal variations can be directly modelled through time-varying predictors that represent confounders or shared trends across multiple time axes.

The new design complements and extends the already rich set of self-matched methods for observational studies described in the epidemiological literature. 8 Previous methodological contributions have highlighted links and similarities between various designs, 18 , 21 , 29 , 30 , 39 – 41 and ultimately these can be seen as alternative approaches to model the same risk associations. However, each method relies on different sets of assumptions and modelling choices, which explain in part their separate areas of application. The case time series methodology, nevertheless, offers a general framework that combines and extends features of existing designs, with important advantages. For example, it borrows flexible modelling tools from the aggregated-data time series design, but it implements them in individual-level analyses that allow a finer reconstruction of outcomes, exposures, and other risk factors. It is applicable to assess associations with multiple continuous predictors as in the case–crossover design, and it can model recurrent events, either common or rare, as in self-controlled case series analyses, but it can be extended to the analysis of outcomes represented by binary indicators or continuous measures, simply assuming different distributions. Finally, its time series structure allows the application of sophisticated techniques such as smoothing methods and distributed lag models, characterized by well-defined parameterizations, computational efficiency, and standard software implementations. A thorough and critical comparison of the case time series methodology with alternative approaches will be provided in future contributions.

Together with other self-matched methods, the new case time series design is based on strict assumptions to protect against key threats to validity. However, these conditions are not always met in practice, and their violations can lead to important biases. Specifically, the requirement that both exposures and follow-up periods are independent of the outcome poses severe limitations to the application of the method, in particular in clinical and pharmaco-epidemiologic studies. In fact, the temporal distribution of endogenous predictors such as behaviours, clinical therapies, or drug prescriptions is often modified by an outcome event. In contrast, the case time series and other self-controlled designs are well suited for the analysis of exogenous exposures such as environmental factors, as discussed before. Extensions to test and relax these strong assumptions have been developed for the self-controlled case series design, 27 , 28 but further research is needed to implement them and assess their validity in case time series models. Conversely, the new design is well suited to control for temporal confounding that can invalidate the assumption of constant baseline risk, through the stratification of the follow-up period and the inclusion of lagged and smooth continuous terms in the model.

Other limitations and areas of current research must be discussed. First, as a method based on a within-subject comparison, the case time series design is ideal for investigating phenomena with short-term changes in risk relative to the study period, while it is less suitable for the analysis of long-term effects and chronic exposures. In fact, while it is in theory possible to extend indefinitely the lag period within the follow-up interval, there is a limit to which the model can disentangle long-lagged effects from seasonal and other trends. 42 In addition, the splitting of the follow-up period into individual-level time series produces a substantial data expansion, with considerable computational demand especially in the presence of a high number of subjects or long study periods. Schemes based on risk-set sampling, previously proposed for cohort and nested case–control studies, 43 – 45 are currently under development to address this issue. Finally, the simulation study and the two real-data examples presented basic epidemiological relationships between time-varying variables. However, more complex causal dependencies, involving, for instance, dynamic feedback or multiple pathways, explicitly violate the strict assumptions underpinning the case time series design, and cannot be modelled in the proposed framework. The definition, limitations, and potential extensions of fixed-effects models and related designs within a general causal inference setting are an area of current research. 23

In conclusion, the case time series design represents a novel epidemiologic method for the analysis of transient health associations with time-varying exposures. Its flexible modelling framework can be adapted to various contexts and research areas, for instance in clinical, environmental, and pharmaco-epidemiology, and it is suitable for the analysis of intensive longitudinal data provided by modern data technologies.

## Supplementary Material

## Acknowledgments

The author is thankful to Dr Charlotte Warren-Gash, and Dr Fay Johnston and Mr Iain Koolhof for providing data access and information for the two case studies used as illustrative examples. The author is also grateful to colleagues who provided comments on various drafts of the manuscript and analyses, in particular Mr Francesco Sera, Dr Ana Maria Vicedo-Cabrera, and Prof Ben Armstrong. Finally, the author is indebted to Prof Paddy Farrington for offering critical insights on asymptotic biases of maximum likelihood estimators in self-controlled case series. The study on influenza and AMI was originally approved by the Independent Scientific Advisory Committee (ISAC) of the Clinical Practice Research Datalink (Ref: 09_034), the Cardiovascular Disease Research Using Linked Bespoke Studies and Electronic Records (CALIBER) Scientific oversight committee and Myocardial Ischaemia National Audit Project (MINAP) Academic Group (ref: 09_08), and the UCL Research Ethics committee (Ref: 2219/001). This study, which used the analysis dataset only, was approved through a minor ISAC amendment (granted on 12/01/2016) and a MINAP Academic Group amendment (granted on 11/01/2016). More information about AirRater are available at https://airrater.org .

This work was supported by the Medical Research Council-UK [Grant ID: MR/R013349/1].

Competing financial interests : The author declares he has no actual or potential conflict of interest.

## Data and code


Businesses from all over the world require predictive models to forecast sales, revenue growth, costs, etc. It is imperative for firms to gain a proper understanding of what the future holds for them, based on their actions or the lack thereof, because this enables them to craft and implement a viable and robust enterprise-wide strategy. Thus, time series analysis (TSA) serves as a powerful tool to forecast changes in product demand, price fluctuations of raw materials, and other such factors that impact decision-making.

This is a case study that shows how Xavor built a top-notch predictive pricing model using time series analysis for a company in the polymer chemical industry.

## Problem Statement

The management team at a client company in Texas wanted to predict monthly product sales and the price of chemical raw materials. The client used these raw materials to develop polymer products. The company was particularly interested in knowing whether the sale of one polymer product model could be used to predict the sale of another polymer product.

Xavor’s data scientists realized that time is money. Therefore, we offered the client a dynamic decision-making technology as a crucial tool to facilitate the managerial decision-making process. The tool sought to enable managers to make informed decisions regarding a wide range of business activities. Such decisions also involved analyzing the direct relation between time, sales, and money.

We all make forecasts when making strategic decisions under uncertainty. Sometimes we think we are not forecasting, but in truth all our decisions are guided by our anticipation of the results of our actions or inactions.

## Analytical Design

Our Advanced Analytics and Data Science team designed a state-of-the-art solution for our client. The goal was to determine the optimal price of the chemical raw materials sourced by the client.

A time series analysis data set was created by combining data from different sources, including data from the client as well as the internet. The TSA dataset was resampled to a monthly frequency, with missing values, holidays, sales seasons, and similar effects handled appropriately.

A brief description of each feature was also recorded (Feature Name | Data Type | Brief Description), which helped us better understand the business problem and its dependencies.

## Data Statistics

The client’s sales department gave us two years of data; it contained the following distinct parameters:

- 1,294 variables ingested, 95% external vs. 5% internal.
- Number of features: 74.
- Number of samples: 19,993.
- Economic indicators for the Chinese mainland and its five biggest export partners (Hong Kong, Japan, Korea, US, and Vietnam) – trade, wages, economic conditions, and country-specific industry health indicators.
- Market indicators of the four major customer industries of the client’s polymer products – automotive, construction, appliances, and containers.
- Schedules for major trade and production spikes – consumer shopping holidays, western and Chinese national holidays, and weather conditions.
- Industry data – futures and market prices of upstream raw materials, factory capacity and shutdowns in major producing countries, and MDI imports.
- Internal data from SAP and Vendavo – inventory, deal values and volumes, production, costs, and customer information.

## Data Cleaning and Transformation

Based on this study and an in-depth understanding of the given data set, we removed all redundant features from the raw data. We handled missing values with imputation techniques, from simple mean and mode filling to more advanced model-based imputation.

The features used for training had very different ranges, which made gradient-based training unstable. As part of the data transformation step, we therefore standardized the data set’s attributes before passing them to the machine learning algorithms and statistical models.
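As an illustrative sketch (not the client's actual code), standardizing features in R with the base `scale()` function looks like this, on simulated data:

```r
# Simulated features with very different ranges; scale() centers each
# column to mean 0 and rescales it to standard deviation 1.
set.seed(42)
raw <- data.frame(
  price_index = runif(100, 800, 1200),  # wide-range feature
  pct_change  = rnorm(100, 0, 0.02)     # narrow-range feature
)
scaled <- as.data.frame(scale(raw))

round(colMeans(scaled), 6)   # both column means are now ~0
apply(scaled, 2, sd)         # both column standard deviations are 1
```

After this step, all features contribute on a comparable scale, which keeps gradient-based optimization stable.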

## Feature Engineering

We built micro- and macroeconomic time series variables to serve as potential predictors of demand. We also selected the transactional data (19,993 samples), using 90% of it for model training and the remaining 10% for testing.

Furthermore, we applied variable selection methods to more than 74 macroeconomic variables to identify the most promising linear and nonlinear predictors, lagged predictors, and combinations of predictors.

## Statistical Tests

Statistical models such as ARIMA assume that the underlying series is stationary, and we also needed to check for predictive relationships between series. We performed two statistical tests to verify these initial assumptions.

- Augmented Dickey-Fuller Test (ADF)
- Granger Causality Test
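A hedged sketch of these two tests in R, using the CRAN packages `tseries` and `lmtest` on simulated series (the client data is not public):

```r
library(tseries)  # adf.test()
library(lmtest)   # grangertest()

set.seed(7)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 120))  # stationary AR(1) series
y <- 0.6 * c(0, head(x, -1)) + rnorm(120, sd = 0.3)  # y driven by lagged x

adf.test(x)                     # small p-value: reject unit root (stationary)
grangertest(y ~ x, order = 1)   # do lags of x help predict y?
```

A significant ADF statistic supports stationarity, and a significant Granger test supports using one series to forecast the other, as the case study does with related polymer products.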

## Strategy

The project started by treating the time series as a regression problem, in line with the client’s requirements. We used machine learning algorithms such as random forest to select optimal features, then applied the same algorithms to the selected features to predict the prices of chemicals used in the production of polymer products.

The main challenge was that prices had to be predicted for the most recent month from two years of data, and changing market conditions had caused a drastic price hike in that final month.

Therefore, we devised a semi-supervised machine learning technique consisting of k-means clustering and random forest to address the problem.

The natural structure of the TSA data set reflected the characteristics of different industries and business applications, observed mainly through current customers, exports, and prospective customers.

The method first assigned each product to a cluster, using k-means together with statistical diagnostics such as silhouette scores to divide the categories into groups. Each group of chemical products had its own characteristics, needs, and customer domains.

We found the optimal number of clusters to be eight, based on our understanding of the variables.

Next, we applied supervised learning algorithms (random forest and SVM) to every cluster to predict the price of each individual chemical product. Dividing the data into clusters was very helpful as it significantly reduced the error rate. Later on, we devised marketing strategies for each category of users based on these features, sales, and needs of the client organization.
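A minimal sketch of this cluster-then-predict strategy in R, on simulated data (the `randomForest` package is assumed installed, and the variable names are hypothetical):

```r
library(randomForest)

set.seed(1)
n <- 400
feats <- data.frame(volume = rnorm(n), export_share = rnorm(n))  # simulated features
price <- 10 + 2 * feats$volume - feats$export_share + rnorm(n, sd = 0.5)

k  <- 8                                          # as chosen via silhouette analysis
cl <- kmeans(scale(feats), centers = k, nstart = 25)$cluster

# Fit one random forest per cluster, mirroring the per-group models above
models <- lapply(split(seq_len(n), cl), function(idx) {
  randomForest(x = feats[idx, ], y = price[idx], ntree = 100)
})
length(models)   # one supervised model per cluster
```

Predicting each product with its cluster's own model is what reduces the error rate relative to a single global regressor.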

We used the TSA data set to train the ARTXP algorithm of Microsoft Time Series Analysis, optimized for predicting the next likely value in a series. We then used the results of a standard ARIMA model to improve the accuracy of long-term price predictions for the chemical products manufactured by the client.

An ARIMA model forecasts a time series by regressing it on its own lagged values and past forecast errors. We trained each algorithm separately; however, blending the results into an ARTXP-ARIMA hybrid yielded better price predictions across the range of polymer product chemical prices.

We also used the Prophet library, which handles many of the inherent complexities of time series forecasting, to predict future sales of individual polymer products.

Prophet fits an additive model with trend, seasonal, and holiday components; by default it uses maximum a posteriori estimation, with MCMC sampling available for fuller uncertainty estimates. Because we were working with monthly data, we had to specify the desired frequency of the generated timestamps explicitly (in this case 'MS', the month-start alias that Prophet inherits from pandas).

The make_future_dataframe function in Facebook’s Prophet library then generated 24 monthly timestamps extending beyond the historical data provided by the client company. In other words, we were predicting the sales of the polymer products two years into the future.

We also incorporated holiday effects into the forecasting model, tailored to the client’s requirements and supplemented with prior knowledge of the industries driving demand.

Also, we used Prophet to return the components of our forecasts. This helped reveal how the trend, seasonal, and holiday patterns of the time series contributed to the overall sales of the company’s polymer products.
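The case study references Prophet's Python API (`'MS'` is a pandas frequency alias). For reference, the equivalent workflow in the R `prophet` package looks roughly like this sketch with simulated monthly sales (package assumed installed; column names `ds`/`y` are Prophet's required schema):

```r
library(prophet)

# Prophet expects a data frame with columns ds (date) and y (value)
set.seed(3)
history <- data.frame(
  ds = seq(as.Date("2019-01-01"), by = "month", length.out = 24),
  y  = 100 + 1:24 + rnorm(24, sd = 5)
)

m      <- prophet(history, weekly.seasonality = FALSE, daily.seasonality = FALSE)
future <- make_future_dataframe(m, periods = 24, freq = "month")  # 2 years ahead
fcst   <- predict(m, future)
prophet_plot_components(m, fcst)   # trend and seasonal components
```

`make_future_dataframe` includes the historical dates by default, so `future` here spans 48 monthly timestamps: the 24 observed plus the 24 to forecast.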

## Visualization

Some distinguishable patterns and trends appeared when we plotted the data. We also visualized the TSA data using time series decomposition, which breaks a series into three distinct components: trend, seasonality, and noise.

The plot above clearly shows that the prices of chemical raw materials are unstable. It also shows the seasonal price hike at the beginning of each month.

## Handling

We suggested that the company update the model with recent sales data, and refresh its predictions to capture recent trends, on a quarterly schedule. Xavor also created a general prediction model to correct for departments that did not update their sales data accurately or consistently; we used it to generate predictions for all regions.

A web-based forecasting and prediction tool was developed that allowed the client’s management team to enter updated values of predictor variables each month and forecast the future demand for the specific chemical product.

The client company subsequently validated the custom cluster-based model by comparing forecast against actual chemical prices over the first several months. The resulting forecast accuracy was impressive, so they could use the model in their business to increase revenue and product sales. It enabled them to:

- Use the model forecasts with weekly and monthly visualizations as input for business operations.
- Conduct a subsequent study, applying the forecasting and prediction method in another similar polymer product category.

We recorded metrics for each characteristic of every time series model and drew conclusions from them. Based on the MAE and MAPE metrics, the custom semi-supervised model stood out in terms of both forecasting sales and predicting chemical prices.

In terms of execution time, however, all the models performed on roughly the same scale, with ARIMA running marginally longer.

## Conclusion

The approach and experiments adopted in this study reasonably met the client company’s objectives: we compared the performance of the in-house time series model against widely used conventional models and libraries for TSA.

Our team also studied generalizability and model behavior for long-term forecasts. As the forecast horizon grew, the cluster-based semi-supervised model generalized better than the other time series models employed, despite those being state-of-the-art models in the industry.

This was particularly true in this case, where market change was correlated with confounding variables. This behavior holds over a standard forecast length; beyond it, the volatility in the market outweighs any model’s behavior.

The customer was given a choice in the web application to select the best-performing model based on the forecast’s duration and sales prediction.

Are you looking to build and use a similar price-predicting model for your company? Get in touch with us at [email protected] .


## How To Use R For Time Series Analysis: A Step-By-Step Approach

Grasping R for time series analysis is crucial for developers looking to analyze temporal data. This article guides you through importing data, identifying patterns, and forecasting, with practical R code examples.

## 💡 KEY INSIGHTS

- Global vs. local methods in time series analysis are crucial; global methods fit a regression over the entire series, while local methods, like LOESS, offer more flexibility by focusing on smaller segments.

R, a programming language renowned for its statistical capabilities, offers a robust toolkit for time series analysis. This article guides programmers and developers through practical steps and techniques in R, enhancing their ability to analyze and interpret temporal data effectively. With a focus on real-world applications, we explore how R's specialized packages and functions streamline the process of time series analysis.

## Importing Time Series Data


Importing data into R is the first critical step in time series analysis. R supports various data formats, including CSV, Excel, and databases. The read.csv() function is commonly used for CSV files.

After importing, it's essential to check the data structure . The str() function provides a concise summary of the object type and the structure of your dataset.

For time series analysis, data should be in a time series object format. The ts() function in R converts a numeric vector into a time series object. Specify the start and end time, and the frequency of the data points.

Dates in R might require formatting to ensure they are recognized correctly. The as.Date() function is used for this purpose.

Finally, use the summary() function to get a basic statistical summary of your time series data. This step is crucial for a preliminary understanding of the data's characteristics.
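Putting the import steps above together, a self-contained sketch (the `sales.csv` file and its columns are hypothetical, so the example writes one first):

```r
# Write a small hypothetical CSV so the example runs end to end
write.csv(
  data.frame(date  = as.character(seq(as.Date("2020-01-01"), by = "month",
                                      length.out = 24)),
             sales = round(100 + 1:24)),
  "sales.csv", row.names = FALSE
)

data <- read.csv("sales.csv", stringsAsFactors = FALSE)
str(data)                                             # check the structure

data$date <- as.Date(data$date, format = "%Y-%m-%d")  # fix the date format

# Monthly time series starting January 2020
sales_ts <- ts(data$sales, start = c(2020, 1), frequency = 12)
summary(sales_ts)                                     # preliminary summary
```

The `frequency = 12` argument tells R the data are monthly, which later functions such as `decompose()` rely on.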

## Time Series Analysis


Once your data is in a time series format, the next step is trend analysis . This involves identifying patterns over time. The plot() function in R is a straightforward way to visualize these trends.

## Seasonality Detection

Detecting seasonal patterns is a key aspect of time series analysis. The decompose() function helps in breaking down the series into trend, seasonal, and irregular components.
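For example, on the built-in monthly `AirPassengers` dataset:

```r
# decompose() splits a series into trend, seasonal, and irregular parts;
# AirPassengers has growing seasonal swings, so a multiplicative model fits.
parts <- decompose(AirPassengers, type = "multiplicative")
names(parts)   # includes "trend", "seasonal", and "random"
plot(parts)    # stacked panels: observed, trend, seasonal, random
```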

## Autocorrelation Analysis

Understanding autocorrelation is crucial for identifying the relationship of a variable with its past values. The acf() function in R helps in analyzing autocorrelation.
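For instance, with the same built-in dataset:

```r
# acf() measures correlation between the series and its own lagged values;
# for a monthly ts object, a spike near lag 1.0 (one year) signals yearly
# seasonality.
ac <- acf(AirPassengers, lag.max = 24, plot = FALSE)
ac$acf[1]                         # the lag-0 autocorrelation is always 1
acf(AirPassengers, lag.max = 24)  # the same values drawn as a correlogram
```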

A basic yet essential part of time series analysis is forecasting future values . The forecast() function from the forecast package can be used for simple predictions.
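A minimal example (the `forecast` package is assumed installed):

```r
library(forecast)

# forecast() on a bare ts picks a sensible default model automatically
fc <- forecast(AirPassengers, h = 12)   # 12 months ahead
fc$mean                                 # the point forecasts
```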

Gaining a statistical summary of your time series can provide insights into central tendencies and dispersion . The summary() function is useful for this purpose.

## Forecasting Future Values


In time series analysis, forecasting future values is a crucial step. The forecast package in R offers various methods, with ARIMA being one of the most popular.
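A short ARIMA sketch with the `forecast` package on built-in data:

```r
library(forecast)

# auto.arima() searches (p,d,q)(P,D,Q) orders by AICc and returns a fit
fit <- auto.arima(AirPassengers)
fc  <- forecast(fit, h = 24)
summary(fit)   # chosen orders, coefficients, accuracy measures
```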

Another common technique is Exponential Smoothing , suitable for data with trends and seasonality. The ets() function from the same package can be used.
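For example:

```r
library(forecast)

# ets() selects the error/trend/seasonality form automatically
fit <- ets(AirPassengers)
forecast(fit, h = 12)$mean   # exponential-smoothing point forecasts
```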

To ensure the accuracy of forecasts, cross-validation is essential. The tsCV() function helps in understanding the forecast error over different horizons.
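For instance, rolling-origin errors for a naive benchmark forecast:

```r
library(forecast)

# tsCV() re-forecasts from each origin and returns the forecast errors
e <- tsCV(AirPassengers, forecastfunction = naive, h = 1)
mean(abs(e), na.rm = TRUE)   # mean absolute error at horizon 1
```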

Selecting the appropriate model is key to accurate forecasting. Comparing models based on criteria like AIC (Akaike Information Criterion) can guide this choice.
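Note that information criteria are only comparable within a model class; for example, comparing two candidate ARIMA specifications:

```r
library(forecast)

# Two ARIMA candidates for the same series; lower AICc is preferred
m1 <- Arima(AirPassengers, order = c(1, 1, 1), seasonal = c(0, 1, 1))
m2 <- Arima(AirPassengers, order = c(2, 1, 2), seasonal = c(0, 1, 1))
c(m1 = m1$aicc, m2 = m2$aicc)
```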

Finally, visualizing the forecast results is important for interpretation. The plot() function can be used to display the forecasted values against the actual data.

This plot provides a visual comparison between the actual time series data and the forecasted values, highlighted in red.

## Visualizing Results


Visualizing time series data is a powerful way to communicate findings and insights. The plot() function in R is a versatile tool for this purpose.

To enhance understanding, adding layers like trends or forecasts to your plot can be beneficial. The lines() function allows for this addition.

Comparing multiple time series in one plot can offer valuable insights. The ts.plot() function is useful for plotting multiple series together.

Customizing plots to make them more informative is often necessary. Adjusting aspects like colors, labels, and legends enhances readability.

For datasets with seasonality, a seasonal decomposition plot can be very revealing. The plot() function applied to a decomposed object displays trend, seasonal, and random components.
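The plotting techniques above can be sketched on built-in datasets:

```r
# Base series with an added moving-average trend layer
plot(AirPassengers, main = "Monthly airline passengers")
lines(stats::filter(AirPassengers, rep(1 / 12, 12)), col = "red", lwd = 2)

# Several related monthly series in one frame
ts.plot(mdeaths, fdeaths, col = c("blue", "red"))

# Seasonal decomposition plot: observed, trend, seasonal, random panels
plot(decompose(AirPassengers, type = "multiplicative"))
```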

## What's the Difference Between ARIMA and Exponential Smoothing?

ARIMA models are generally used for data showing correlations between successive observations, while Exponential Smoothing is better for data with a clear trend and seasonality.

## How Can I Handle Missing Values in Time Series Data?

Missing values can be handled by imputation methods like linear interpolation or by using specific functions in R designed for time series data that can handle gaps.
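For example, linear interpolation with base R's `approx()` (packages such as `zoo` and `forecast` offer `na.approx()` and `na.interp()` for the same job):

```r
x   <- c(10, 12, NA, NA, 18, 20)
idx <- seq_along(x)

# Interpolate the missing points along the line between known neighbours
filled <- approx(idx[!is.na(x)], x[!is.na(x)], xout = idx)$y
filled   # the NA gap is filled with 14 and 16
```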

## Is It Possible to Analyze Multiple Time Series Together?

Yes, R allows for the analysis of multiple time series simultaneously. Techniques like Vector Autoregression (VAR) can be used for analyzing interdependencies between multiple series.
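A brief VAR sketch with the CRAN `vars` package on two related built-in monthly series:

```r
library(vars)

# UK male and female lung-disease deaths, two related monthly series
fit <- VAR(cbind(mdeaths, fdeaths), p = 2)   # VAR with 2 lags
summary(fit)
causality(fit, cause = "mdeaths")            # Granger-type test across series
```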

## How Do I Choose the Right Forecasting Model?

The choice of model depends on the data's characteristics. It's often recommended to compare different models using criteria like the Akaike Information Criterion (AIC) or Mean Squared Error (MSE).



## Short term exposure to low level ambient fine particulate matter and natural cause, cardiovascular, and respiratory morbidity among US adults with health insurance: case time series study

- Yuantong Sun , research data analyst 1 ,
- Chad W Milando , research scientist 1 ,
- Keith R Spangler , research scientist 1 ,
- Yaguang Wei , postdoctoral researcher 2 ,
- Joel Schwartz , professor 2 ,
- Francesca Dominici , professor 3 ,
- Amruta Nori-Sarma , assistant professor 1 ,
- Shengzhi Sun , professor 4 5 6 ,
- Gregory A Wellenius , professor 1
- 1 Department of Environmental Health, Boston University School of Public Health, Boston, MA, USA
- 2 Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- 3 Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- 4 School of Public Health, Capital Medical University, Beijing 100069, China
- 5 Beijing Municipal Key Laboratory of Clinical Epidemiology, Capital Medical University, Beijing, China
- 6 School of Public Health, The Key Laboratory of Environmental Pollution Monitoring and Disease Control, Ministry of Education Guizhou Medical University, Guiyang, China
- Correspondence to: S Sun shengzhisun{at}ccmu.edu.cn (or @darrensun3 on Twitter)
- Accepted 17 January 2024

Objective To estimate the excess relative and absolute risks of hospital admissions and emergency department visits for natural causes, cardiovascular disease, and respiratory disease associated with daily exposure to fine particulate matter (PM 2.5 ) at concentrations below the new World Health Organization air quality guideline limit among adults with health insurance in the contiguous US.

Design Case time series study.

Setting US national administrative healthcare claims database.

Participants 50.1 million commercial and Medicare Advantage beneficiaries aged ≥18 years between 1 January 2010 and 31 December 2016.

Main outcome measures Daily counts of hospital admissions and emergency department visits for natural causes, cardiovascular disease, and respiratory disease based on the primary diagnosis code.

Results During the study period, 10.3 million hospital admissions and 24.1 million emergency department visits occurred for natural causes among 50.1 million adult enrollees across 2939 US counties. The daily PM 2.5 levels were below the new WHO guideline limit of 15 μg/m 3 for 92.6% of county days (7 360 725 out of 7 949 713). On days when daily PM 2.5 levels were below the new WHO air quality guideline limit of 15 μg/m 3 , an increase of 10 μg/m 3 in PM 2.5 during the current and previous day was associated with higher risk of hospital admissions for natural causes, with an excess relative risk of 0.91% (95% confidence interval 0.55% to 1.26%), or 1.87 (95% confidence interval 1.14 to 2.59) excess hospital admissions per million enrollees per day. The increased risk of hospital admissions for natural causes was observed exclusively among adults aged ≥65 years and was not evident in younger adults. PM 2.5 levels were also statistically significantly associated with relative risk of hospital admissions for cardiovascular and respiratory diseases. For emergency department visits, a 10 μg/m 3 increase in PM 2.5 during the current and previous day was associated with respiratory disease, with an excess relative risk of 1.34% (0.73% to 1.94%), or 0.93 (0.52 to 1.35) excess emergency department visits per million enrollees per day. This association was not found for natural causes or cardiovascular disease. The higher risk of emergency department visits for respiratory disease was strongest among middle aged and young adults.

Conclusions Among US adults with health insurance, exposure to ambient PM 2.5 at concentrations below the new WHO air quality guideline limit is statistically significantly associated with higher rates of hospital admissions for natural causes, cardiovascular disease, and respiratory disease, and with emergency department visits for respiratory diseases. These findings constitute an important contribution to the debate about the revision of air quality limits, guidelines, and standards.

## Introduction

A large number of epidemiological studies have consistently reported that exposure to ambient fine particulate matter (PM 2.5 ) is associated with increased risk of morbidity and mortality. 1 2 3 4 5 6 According to the Global Burden of Disease study, 7 exposure to PM 2.5 accounts for an estimated 7.6% of total global mortality and 4.2% of global disability adjusted life years. In light of this extensive body of evidence, the World Health Organization recently introduced an ambitious new air quality guideline limit in 2021, recommending that the 24 hour average PM 2.5 levels should not exceed 15 μg/m 3 on more than 3-4 days each year. 8 In the US, the current national ambient air quality standards for the 24 hour average PM 2.5 (calculated as 98th centile concentrations averaged over a three year period) were set at 35 μg/m 3 in 2012, and a revision of these standards is currently being considered. 9

Although the literature on the adverse health effects of short term exposure to PM 2.5 is vast, 1 2 3 4 5 6 several key knowledge gaps remain. Specifically, owing to the accessibility of national vital statistics data and Medicare data, most large scale studies in the US have focused on the health effects of PM 2.5 among adults aged ≥65 years, 1 5 with relatively few studies including young or middle aged adults; most studies have focused on mortality and hospital admissions as the outcomes, 1 2 3 5 6 10 with relatively less information available on the impact of PM 2.5 on emergency department visits (ie, where individuals are treated in the emergency department room but not admitted to hospital for inpatient care) 11 ; and no study has specifically examined whether the observed associations persist at daily PM 2.5 levels below the 2021 WHO air quality guideline limit.

To tackle these knowledge gaps, we estimated the association between short term exposure to PM 2.5 at concentrations below the 2021 WHO air quality guideline limit and the risks of hospital admissions and emergency department visits for natural causes, cardiovascular disease, and respiratory disease among adults with health insurance in the contiguous US from 2010 to 2016 using the healthcare utilization deidentified claims data from the Optum Laboratories Data Warehouse (OLDW). We also examined whether the observed associations of PM 2.5 level and morbidity differed across strata defined by age, sex, insurance type, and geographic region.

## Study population

This study used deidentified medical claims data between 1 January 2010 and 31 December 2016 from OLDW. 12 This database includes facility and physician claims and enrollment records for enrollees with either commercial or Medicare Advantage health insurance. The database contains longitudinal health information on more than 200 million enrollees, representing a diversity of ages and geographic regions across the US.

Medical claims in OLDW are classified using ICD-9 and ICD-10 (international classification of diseases, ninth revision and 10th revision, respectively) codes, revenue codes, current procedure terminology codes, and place of service codes (see supplementary table S1). We used claims to identify clinical encounters classified as either hospital admissions or emergency department visits. For each patient encounter, we extracted information on age, sex, county of residence, date of service, insurance type, and principal diagnoses. Analyses were limited to adults aged ≥18 years who had hospital admissions or emergency department visits for natural causes (ICD-9: 0-799 or ICD-10: A0-R99), cardiovascular disease (ICD-9: 390-459 or ICD-10: I00-I99), or respiratory disease (ICD-9: 460-519 or ICD-10: J00-J99).

For each health outcome, we aggregated medical claims into daily counts of hospital admissions or emergency department visits by age (18-29, 30-39, 40-49, 50-64, 65-74, or ≥75 years), sex (male v female), insurance type (commercial insurance v Medicare Advantage), and geographic regions of the country, as defined by the US Global Change Research Program’s Fourth National Climate Assessment (NCA4) for each county (see supplementary figure S1).

## Environmental data

To estimate daily concentrations of 24 hour average PM 2.5 at 1 km × 1 km grid cells in the contiguous US, we used a previously developed spatiotemporal ensemble model. 13 The model incorporates three machine learning algorithms: neural network, random forest, and gradient boosting. These algorithms relied on multiple predictor variables, including satellite data at 1 km × 1 km resolution, meteorological conditions, such as ambient temperature, land use variables, such as elevation and road density, and predictions from chemical transport models. 13 The predictions of daily PM 2.5 level from each algorithm were then combined with a geographically weighted generalized additive model. 13 14 Additionally, we obtained monitoring data for PM 2.5 from 2156 surveillance sites operated by the US Environmental Protection Agency. To account for spatial and temporal autocorrelation, we calculated lagged PM 2.5 levels from monitoring sites and incorporated these as supplementary inputs into the ensemble models, along with the aforementioned predictor variables. 13 14 The final ensemble model showed good performance when predicted values were compared with monitored data, achieving a 10-fold cross validation R 2 of 0.86. 13 14

We also derived meteorological variables, including daily outdoor ambient air temperature and relative humidity, from the Parameter-elevation Relationships on Independent Slopes model (PRISM) dataset, which is a publicly available gridded climate dataset. 15 16 The dataset we used provides daily estimates of several meteorological variables at a horizontal grid spacing of about 4 km across the contiguous US. 15 16

Daily PM 2.5 concentrations and meteorological variables were calculated at the county level by extracting the grid values at the population centroids for each census tract (based on the 2000 US Census) within a given county. We then calculated the population weighted average values based on the proportion of the county’s population residing in each census tract. 15 17 The concentration of PM 2.5 and meteorological variables were assigned to each county and the exposure data linked with time series data of hospital admissions and emergency department visits in each respective county.

## Statistical analysis

To estimate the association between short term exposure to PM 2.5 and risks of hospital admissions and emergency department visits for natural causes, cardiovascular disease, and respiratory disease, we used a case time series design—a novel approach suitable for analyzing small area data. 18 19 This design incorporates the self-matched structure in case only models into a traditional time series form, providing a flexible and computationally efficient tool for complex longitudinal data. 18 19 In the case time series design, observations defined as cases are collected longitudinally over a period of time at equally spaced time intervals, forming a set of case level time series data. 18 The time series of cases can be either individual level outcomes or aggregated measurements over small geographic areas. 18 19 In our study, we aggregated medical claims into county specific daily time series of hospital admission and emergency department visit counts and linked these case time series with county level PM 2.5 concentrations and meteorological conditions. We chose a case time series design because of its suitability to accommodate the longitudinal time series structure of health outcomes at small area level, strict control for time invariant confounding, and flexibility to adjust time varying confounding in the modeling framework. 18 19

We used a conditional quasi-Poisson log-link regression model stratified by year, month, day of week, and county of residence, therefore adjusting for the differential baseline morbidity risk and trends across the different spatiotemporal strata. Meanwhile, we controlled the residual temporal variations with time varying covariates, such as ambient temperature and relative humidity. Consistent with previous studies, 1 2 6 our primary analyses considered the moving average of the current and previous day (lag 0-1) PM 2.5 as the exposure of interest and assumed a linear exposure-response association between exposure to PM 2.5 and risk of morbidity. In the models, we adjusted for the two day moving average of daily mean ambient temperature using a natural cubic spline function with three degrees of freedom, and the two day moving average of daily mean relative humidity using a natural cubic spline function with three degrees of freedom. 1 To account for spatial heterogeneity of ambient temperature, we interacted the temperature splines with NCA4 regions. We additionally adjusted for federal holiday (dummy variable). Low level PM 2.5 was defined as daily PM 2.5 concentrations below the 2021 WHO air quality guideline limit of 15 μg/m 3 . To estimate the association between short term exposure to low level PM 2.5 and risk of morbidity, we restricted our analyses to days with daily PM 2.5 concentrations <15 μg/m 3 . To examine the lag structure of the association, we also fitted the models with separate terms for PM 2.5 on the same day of admission (lag 0) and 1 day previously (lag 1).

We presented results as both the excess relative risk and the excess absolute risk of hospital admissions or emergency department visits associated with a 10 µg/m 3 increase in PM 2.5 concentrations. 1 4 20 Excess relative risk was defined as (relative risk−1)×100%, and excess absolute risk was calculated as α×(relative risk−1)/relative risk, where α is the incidence rate of cause specific hospital admissions or emergency department visits (see supplementary eAppendix).
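In display form, with RR the relative risk and α the baseline incidence rate (events per million enrollees per day), these definitions are:

```latex
\text{Excess relative risk} = (\mathrm{RR} - 1) \times 100\%,
\qquad
\text{Excess absolute risk} = \alpha \times \frac{\mathrm{RR} - 1}{\mathrm{RR}}.
```

As a check against the reported results: for hospital admissions for natural causes, RR = 1.0091 and α = 207.9 per million enrollees per day give 207.9 × 0.0091/1.0091 ≈ 1.87 excess admissions per million enrollees per day.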

Several sensitivity analyses were performed to examine the robustness of our findings. First, to examine the exposure-response association between exposure to PM 2.5 and risk of morbidity, we used a natural cubic spline with three degrees of freedom for lag 0-1 PM 2.5 . Second, to assess whether our findings were robust to the choice of temperature metric, we replaced the daily mean ambient temperature in the models with daily maximum ambient temperature or daily minimum ambient temperature. Third, to account for spatial heterogeneity in relative humidity, we additionally included an interaction term between relative humidity splines and NCA4 regions in the models. Fourth, to examine the association between exposure to PM 2.5 and morbidity across the range of PM 2.5 levels, we repeated the main analysis without restricting our analyses to days with daily PM 2.5 concentrations <15 μg/m 3 .

To evaluate potential differences in susceptibility, we examined whether the association between exposure to PM 2.5 and risk of morbidity differed across subgroups of the population defined by age, sex, insurance type, and geographic region. We fit separate models for each stratum and conducted Wald tests to evaluate the heterogeneity across these strata. 21

All statistical analyses were conducted in R software (version 4.0.2). We used the “gnm” (version 1.1.1) package 22 to fit the conditional Poisson regression models and the “dlnm” (version 2.4.2) package 23 to model the non-linear exposure-response functions.

## Patient and public involvement

As the study used deidentified medical claims data, no patients or members of the public were involved in implementing the study design.

## Descriptive statistics

Of the 3144 counties in the contiguous US, we excluded those with no recorded hospital admissions or emergency department visits during the study period, resulting in a total of 2935 to 2939 counties included in the analysis, depending on the specific outcome (see supplementary table S2). Between 1 January 2010 and 31 December 2016, a total of 10.3 million hospital admissions and 24.1 million emergency department visits for natural causes were recorded among 50.1 million adults aged ≥18 years with commercial or Medicare Advantage health insurance (see supplementary table S3). The incidence rates of hospital admissions and emergency department visits for natural causes were 207.9 and 485.7 per million enrollees per day during the study period, respectively (see supplementary table S4).

Of these healthcare encounters, more than 50% of hospital admissions and 28% of emergency department visits were for cardiovascular and respiratory diseases, and the distribution varied considerably across different age groups ( fig 1 and fig 2 ). For example, among adults aged <30 years, only 11.3% of all hospital admissions were attributed to cardiovascular or respiratory diseases, but this percentage increased to 92.7% among those aged ≥75 years. In terms of absolute numbers, the incidence rates for hospital admissions and emergency department visits increased with age and tended to be higher in women compared with men, except for cardiovascular disease (see supplementary table S4).

Number of hospital admissions for cardiovascular, respiratory, and other diseases categorized by age and sex among adults with commercial or Medicare Advantage health insurance in the contiguous US, 2010-16. The width of age groupings is not uniform; the higher number of hospital admissions observed among adults aged 50-64 years partly reflects the wider age band used for this group


Number of emergency department visits for cardiovascular, respiratory, and other diseases categorized by age and sex among adults with commercial or Medicare Advantage health insurance in the contiguous US, 2010-16. The width of age groupings is not uniform; the higher number of hospital admissions and emergency department visits observed among adults aged 50-64 years partly reflects the wider age band used for this group

Incidence rates of hospital admissions and emergency department visits also varied across geographic regions. The highest incidence rates for hospital admissions related to natural causes were observed in the northern Great Plains and the northeast, whereas the highest incidence rates for emergency department visits for natural causes were documented in the southeast and midwest ( fig 3 ). Supplementary tables S3 and S4 show the total number and incidence rates of hospital admissions and emergency department visits for natural causes and for cardiovascular and respiratory diseases across different geographic areas, respectively.

Maps of the 98th centile distribution of PM 2.5 (fine particulate matter) concentrations and incidence rates per million enrollees per day of hospital admissions and emergency department visits for natural causes across US counties, 2010-16: (A) 98th centile distributions of PM 2.5 concentrations; (B) 98th centile distributions of PM 2.5 concentrations categorized by World Health Organization and national ambient air quality standards; (C) incidence rate per million enrollees per day of hospital admissions for natural causes in US counties; (D) incidence rate per million enrollees per day of emergency department visits for any cause in US counties. Crosshatching represents counties with missing air pollution data—the total number of hospital admissions or emergency department visits in these counties fell below the small cell suppression limit of 11

During the study period, only 0.1% of county days (8344 out of 7 949 713) recorded daily PM 2.5 concentrations that exceeded the current national ambient air quality standards of 35 μg/m 3 ; these counties were primarily located in central California, northwestern Utah, southwestern Montana, and east Idaho. Daily PM 2.5 levels were below the new WHO air quality guideline limit of 15 μg/m 3 in 92.6% of county days (7 360 725 out of 7 949 713) (see supplementary figure S2). Restricting our sample of events to days below this level resulted in the exclusion of 9.4% of hospital admissions and 9.1% of emergency department visits.

## Regression results

Exposure to PM 2.5 at concentrations below the new WHO air quality guideline limit was associated with an increased risk of hospital admissions for natural causes, cardiovascular disease, and respiratory disease. Specifically, each 10 μg/m 3 increase in lag 0-1 PM 2.5 was associated with a 0.91% (95% confidence interval 0.55% to 1.26%) higher relative risk of hospital admissions for natural causes, 1.39% (0.81% to 1.98%) higher relative risk of hospital admissions for cardiovascular disease, and 1.90% (1.15% to 2.66%) higher relative risk of hospital admissions for respiratory disease ( fig 4 , also see supplementary table S5). The corresponding excess absolute risk was 1.87 (95% confidence interval 1.14 to 2.59), 1.04 (0.61 to 1.48), and 0.85 (0.52 to 1.18) per million enrollees per day for hospital admissions related to natural causes, cardiovascular disease, and respiratory disease, respectively ( fig 4 , also see supplementary table S6).
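Because the models use a log link, fitted coefficients translate to excess relative risks via exponentiation. A hedged sketch of that conversion (the coefficient below is back-derived to reproduce an excess relative risk near the reported 0.91%, not the actual fitted value):

```python
import math

# In a log-link (Poisson) model the coefficient beta is a log rate ratio per
# 1 ug/m3, so a 10 ug/m3 increase corresponds to a relative risk exp(10*beta).

def err_per_10_units(beta):
    """Excess relative risk (%) for a 10-unit exposure increase."""
    return (math.exp(10 * beta) - 1) * 100

beta = math.log(1.0091) / 10  # illustrative coefficient, not the fitted value
print(round(err_per_10_units(beta), 2))  # 0.91
```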

Excess relative risk (percentage) and excess events of hospital admissions and emergency department visits for natural causes, cardiovascular disease, and respiratory disease associated with each 10 μg/m 3 increase in fine particulate matter (PM 2.5 ) during the current and previous day

The association between exposure to PM 2.5 and risk of hospital admissions was most pronounced at lag 0, but with little evidence of continued higher risk at lag 1, except for respiratory diseases. For example, a 10 μg/m 3 increase in PM 2.5 was associated with a 0.86% (95% confidence interval 0.52% to 1.19%) higher relative risk at lag 0 and a 0.04% (−0.31% to 0.38%) higher relative risk at lag 1 for hospital admissions related to natural causes. The corresponding excess relative risk for respiratory disease was 0.86% (95% confidence interval 0.16% to 1.57%) at lag 0 and 1.03% (0.31% to 1.76%) at lag 1 (see supplementary table S7).

For emergency department visits, a 10 μg/m 3 increase in lag 0-1 PM 2.5 was associated with a 1.34% (95% confidence interval 0.73% to 1.94%) excess relative risk of emergency department visits for respiratory disease ( fig 4 , also see supplementary table S5), corresponding to 0.93 (95% confidence interval 0.52 to 1.35) additional emergency department visits per million enrollees per day ( fig 4 , also see supplementary table S6). The estimated association between exposure to PM 2.5 and emergency department visits for natural causes or for cardiovascular disease was weaker and not statistically significant.

We performed a series of sensitivity analyses to evaluate the robustness of our findings. In analyses that allowed for a flexible exposure-response relation between exposure to PM 2.5 and morbidity, we found a monotonic association between exposure to PM 2.5 and the relative risk of hospital admissions for natural causes, cardiovascular disease, and respiratory disease, with no indication of a threshold at lower concentrations ( fig 5 ). We also found a monotonic association between exposure to PM 2.5 and the relative risk of emergency department visits for respiratory disease, with the association appearing more pronounced at lower PM 2.5 levels ( fig 5 ). When we adjusted for daily maximum or daily minimum ambient temperature instead of daily mean temperature, the results remained consistent with our main findings, except that the association between PM 2.5 level and emergency department visits for natural causes and cardiovascular disease became statistically significant when we adjusted for daily minimum temperature (see supplementary table S8). Our results were not materially different when we additionally included an interaction term between relative humidity splines and NCA4 regions in the models (see supplementary table S9). Additionally, when we expanded our analysis beyond days with daily PM 2.5 concentrations <15 μg/m 3 , the associations were generally attenuated compared with those for low level exposure, suggesting that the association per unit of PM 2.5 is stronger below 15 μg/m 3 than above it (see supplementary table S10).

Exposure-response curve for association between two day moving average of PM 2.5 (fine particulate matter) concentrations and hospital admissions and emergency department visits for natural causes, cardiovascular disease, and respiratory disease. The reference point on the curve is the counterfactual concentration at 0 μg/m 3 of the two day moving average PM 2.5 during the study period

We found that the association between exposure to PM 2.5 and hospital admissions for natural causes was statistically significant only among adults aged ≥65 years ( fig 6 ). For example, a 10 μg/m 3 increase in PM 2.5 was associated with an excess relative risk of 0.36% (95% confidence interval −0.72% to 1.45%) among adults aged 18-29 years compared with 1.43% (0.60% to 2.26%) and 2.21% (1.52% to 2.91%) among those aged 65-74 years and ≥75 years, respectively. The corresponding excess absolute risks were 0.41 (95% confidence interval −0.84 to 1.67), 4.76 (2.04 to 7.48), and 14.57 (10.09 to 19.06) per million enrollees per day for adults aged 18-29 years, 65-74 years, and ≥75 years, respectively. The association between exposure to PM 2.5 and hospital admissions for natural causes was more pronounced among men, those residing in the northeast US, and those with Medicare Advantage health insurance.

Excess relative risk (percentage) and excess events of hospital admissions for natural causes, cardiovascular disease, and respiratory disease associated with each 10 μg/m 3 increase in PM 2.5 (fine particulate matter) during the current and previous day, stratified by age, sex, insurance type, and NCA4 (US Global Change Research Program’s Fourth National Climate Assessment) regions

For emergency department visits, we found a statistically significant association between exposure to PM 2.5 and respiratory disease (see supplementary figure S3). This association was most pronounced in young and middle aged adults and in the southern Great Plains. For example, we found that adults aged 40-49 years had the highest excess relative risk of 2.57% (95% confidence interval 0.87% to 4.30%) compared with the older population among whom the association was attenuated and not statistically significant. A 10 μg/m 3 increase in PM 2.5 was associated with an excess relative risk of 5.64% (3.77% to 7.54%) in the southern Great Plains versus 1.07% (−0.43% to 2.59%) in the US northeast.

Using data encompassing more than 10 million hospital admissions and 24 million emergency department visits across the contiguous US from 2010 to 2016, we found that short term exposure to PM 2.5 , even at concentrations below the new WHO air quality guideline limit of 15 μg/m 3 , was statistically significantly associated with a higher risk of hospital admissions for natural causes, cardiovascular disease, and respiratory disease, as well as emergency department visits for respiratory disease.

## Comparison with other studies

Studies of the potential health effects of PM 2.5 at low levels (by today’s standards) provide valuable insights and critically inform national health policies. However, relatively few such studies have been conducted. 1 24 25 For example, in a pooled analysis of multiple European cohorts, one study evaluated the association between long term exposure to PM 2.5 and mortality, with a focus on the health effects of such exposure below the current standards and guidelines of the European Union (25 μg/m 3 ) and US (12 μg/m 3 ) and previous guidelines from WHO (10 μg/m 3 ). 24 Another study used mortality data from the US Medicare fee-for-service population to estimate the association between short term exposure to PM 2.5 below the current daily national ambient air quality standard (35 μg/m 3 ) and mortality in the US. 1 In a recent study, researchers found that the adverse health effects of PM 2.5 on all cause mortality persisted at lower levels of PM 2.5 below the previous WHO air quality guideline limit of 25 μg/m 3 . 3 26 Our study investigated whether the risk of morbidity occurs at levels of PM 2.5 below the newly revised WHO air quality guideline limit of 15 μg/m 3 , using data on both hospital admissions and emergency department visits among younger and older adults in the US.

The new WHO air quality guideline calls for limiting 24 hour mean PM 2.5 concentrations to 15 μg/m 3 , a reduction from the previous limit of 25 μg/m 3 set in 2005, in response to compelling evidence of substantial health effects of PM 2.5 even at concentrations below the earlier limit. Our study identified a consistent monotonic exposure-response association between exposure to PM 2.5 at levels below the new WHO air quality guideline limit and hospital admissions for natural causes, cardiovascular disease, and respiratory disease, and emergency department visits for respiratory disease. Notably, when examining emergency department visits, we noted that the association between PM 2.5 level and respiratory disease appears to be more pronounced at lower PM 2.5 levels. These findings provide evidence that PM 2.5 continues to pose adverse health risks even when concentrations are below the newly revised WHO air quality guideline limit, corroborating the conclusion that there is probably no safe level of exposure to PM 2.5 —that is, no level below which adverse health effects are not observed. 1 24 25 26

Our finding that PM 2.5 level was associated with hospital admissions among adults aged ≥65 years but not among adults aged <65 years was consistent with previous studies. 10 27 For example, a nationwide study in Italy, examining the link between air pollution and hospital admissions for respiratory disease, found that PM 2.5 level was associated with a higher risk of hospital admissions among adults aged ≥75 years but not among adults aged <75 years. 10

We found statistically significant associations between exposure to PM 2.5 and increased risk of emergency department visits for respiratory disease exclusively among adults aged <50 years. These findings indicate that previous studies that focused on older populations may not have fully captured the adverse health effects of PM 2.5 on respiratory related emergency department visits. Relying on such studies could potentially lead to an underestimation of the health effects of PM 2.5 , particularly among younger age groups. Our findings were consistent with a US nationwide study among 40 million respiratory related emergency department visits collected through the Centers for Disease Control and Prevention’s national environmental public health tracking network. 11 This study found a strong association between PM 2.5 level and emergency department visits for respiratory disease among children and young people aged 0-18 years, a moderate association among adults aged 19-64 years, and no significant association among older adults aged ≥65 years. 11

We found that most of the adverse health effects of PM 2.5 were more pronounced among adults with Medicare Advantage health insurance than among adults with commercial health insurance. Enrollees in the Medicare Advantage programme are largely adults aged ≥65 years. Thus, the health effect of PM 2.5 among Medicare Advantage enrollees is similar to that among people aged ≥65 years. For example, no association was found between short term exposure to PM 2.5 and emergency department visits for respiratory disease in either Medicare Advantage enrollees or older adults. We observed geographic differences in the health effects of PM 2.5 , with beneficiaries residing at higher latitudes tending to show higher risk of hospital admissions and emergency department visits for cardiovascular disease associated with exposure to PM 2.5 . We speculate that geographic location may modify the association between exposure to PM 2.5 and morbidity, possibly through factors such as differences in local temperature, 28 29 and particle composition contributing to varying levels of oxidative potential. 30

## Limitations and strengths of this study

Our study has several limitations. First, we used county level PM 2.5 level as a proxy for personal exposure to PM 2.5 , which could potentially lead to misclassification of exposure. Additionally, the absence of information on patients’ time-activity patterns might introduce additional misclassification of exposure. Nevertheless, we expect that any potential misclassification would likely be non-differential and would tend to bias our results toward the null hypothesis of no association. 31 Second, our study population was limited to US adults with health insurance, which may limit the generalizability of our findings to individuals without medical insurance, children and adolescents, and individuals living outside the contiguous US. Third, our analysis was conducted using data available up to 2016. Future studies with more recent data on exposure to air pollution and morbidity are warranted to provide further evaluation of the evidence on the association between exposure to PM 2.5 and morbidity, particularly at daily PM 2.5 levels below the 2021 WHO air quality guideline limit.

Our study has three main strengths. First, our study population included more than 10 million hospital admissions and 24 million emergency department visits among beneficiaries with commercial and Medicare Advantage insurance. The enrollees span a broad spectrum of age groups older than 18 years, and they are distributed across different climate regions in the contiguous US. Second, we not only assessed the excess relative risk of morbidity for natural causes, cardiovascular disease, and respiratory disease but also quantified the absolute risk associated with short term exposure to PM 2.5 . Therefore, our study provides a comprehensive evaluation of the health effects of daily PM 2.5 at levels below the new WHO air quality guideline limit of 15 µg/m 3 . Third, we included both hospital admissions and emergency department visits as our primary health outcomes, allowing us to compare the health impacts of PM 2.5 levels on different morbidity metrics.

## Conclusions and policy implications

Short term exposure to PM 2.5 below the new WHO air quality guideline limit of 15 µg/m 3 is associated with higher risks of hospital admissions for natural causes, cardiovascular disease, and respiratory disease, as well as emergency department visits for respiratory disease, among adults with health insurance in the contiguous US. Our study contributes to the evidence that ambient air pollution is associated with morbidity even at PM 2.5 levels below the current WHO air quality guideline limit. It provides an initial assessment of the newly revised WHO guidelines for PM 2.5 and a valuable reference for future national air pollution standards.

## What is already known on this topic

Short term exposure to fine particulate matter (PM 2.5 ) has been associated with increased risk of morbidity and mortality

Previous studies have primarily focused on older adults or used data on hospital admissions

Evidence for the association between short term exposure to PM 2.5 and morbidity at levels below the new World Health Organization air quality guideline limit remains unclear

## What this study adds

In this nationwide study in the US, exposure to ambient PM 2.5 at levels below the new WHO air quality guideline limit of 15 µg/m 3 was statistically significantly associated with hospital admissions for natural causes, cardiovascular disease, and respiratory disease, and emergency department visits for respiratory disease

These findings provide evidence of health harms associated with short term exposure to PM 2.5 even at levels below the new WHO air quality guideline limit

## Ethics statements

Ethical approval.

This study involved analysis of pre-existing, deidentified data and was exempted from institutional review board approval.

## Data availability statement

No additional data available.

Web extra: extra material supplied by authors

Contributors: SS and GAW contributed equally to this paper. YS, SS, and GAW designed the study. YS, CWM, SS, and GAW developed the analysis plan. YS performed statistical analysis and took responsibility for the accuracy of the data analysis. YS and GAW drafted the manuscript. CWM, KRS, YW, JS, FD, and AN-S contributed to the interpretation of the results and revision of the final manuscript. YS is the guarantor. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

Funding: This study was supported by the National Institutes of Health (R01-ES029950). JS was supported by the National Institute of Environmental Health Sciences (R01ES032418-01). The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: all authors had financial support from the National Institutes of Health and the National Institute of Environmental Health Sciences for the submitted work; GAW previously served as a consultant for Google (Mountain View, CA) and currently serves as a consultant for the Health Effects Institute (Boston, MA); no other relationships or activities that could appear to have influenced the submitted work.

The lead author (YS) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned and registered have been explained.

Dissemination to participants and related patient and public communities: There are no plans to disseminate the results of the research directly to study participants. We plan to disseminate the results of the research to the general public through media outreach, including press releases by the media departments of the authors’ research institutes and plain language messaging in social media.

Provenance and peer review: Not commissioned; externally peer reviewed.

Publisher’s note: Published maps are provided without any warranty of any kind, either express or implied. BMJ remains neutral with regard to jurisdictional claims in published maps.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/ .

- Christiani DC ,
- Warrington JA ,
- Dominici F ,
- Burnett R ,
- ↵ World Health Organization. WHO global air quality guidelines: Particulate matter (PM 2.5 and PM 10 ), ozone, nitrogen dioxide, sulfur dioxide and carbon monoxide. World Health Organization, 2021 [cited 2023 Nov 29]. https://iris.who.int/bitstream/handle/10665/345329/9789240034228-eng.pdf?sequence=1
- ↵ United States Environmental Protection Agency (EPA). NAAQS Table. EPA, 2012 [cited 2023 Oct 31]; https://www.epa.gov/criteria-air-pollutants/naaqs-table
- Scortichini M ,
- Forastiere F ,
- BEEP collaborative Group
- Strosnider HM ,
- Darrow LA ,
- Vaidyanathan A ,
- Strickland MJ
- ↵ Optum Labs. Optum Labs and Optum Labs Data Warehouse (OLDW) Descriptions and Citation . Eden Prairie, MN: March 2023. PDF. Reproduced with permission from Optum Labs.
- Koutrakis P ,
- Lyapustin A ,
- Spangler KR ,
- Weinberger KR ,
- Wellenius GA
- Halbleib M ,
- Yanosky JD ,
- Gasparrini A
- Nori-Sarma A ,
- ↵ Rothman KJ, Greenland S, Lash TL. Modern epidemiology. 3rd ed. Philadelphia: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2008. Chapter 15, Introduction to Stratified Analysis – Test Homogeneity; p279.
- ↵ Turner H, Firth D. Generalized nonlinear models in R: An overview of the gnm package. 2007.
- Brunekreef B ,
- Zanobetti A ,
- Kosheleva A ,
- Schwartz JD
- Bergmann S ,
- Sidwell A ,
- Kioumourtzoglou MA ,
- Spiegelman D ,
- Szpiro AA ,


## Time Series Analysis

A time series is an ordered sequence of values of a variable at equally spaced time intervals. Time series occur frequently when looking at industrial data. The essential difference between modeling data via time series methods and other methods is that time series analysis accounts for the fact that data points taken over time may have an internal structure, such as autocorrelation, trend, or seasonal variation, that should be accounted for. A time series model explains a variable with regard to its own past and a random disturbance term.
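The autocorrelation mentioned above can be computed directly. A small Python sketch (the helper function and data are invented for illustration):

```python
# Sample lag-1 autocorrelation: the kind of internal structure that time
# series methods account for and cross-sectional methods ignore.

def lag1_autocorrelation(series):
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

# A steadily rising series is strongly autocorrelated at lag 1:
print(lag1_autocorrelation([1, 2, 3, 4, 5, 6, 7, 8]))  # 0.625
```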

Special attention is paid to exploring the historic trends and patterns (such as seasonality) of the time series involved, and to predicting the future of the series based on the trends and patterns identified in the model. Since time series models require only historical observations of a variable, they are less costly in data collection and model estimation. Time series models can broadly be categorized into linear and nonlinear models. Linear models depend linearly on previous data points.


They include the autoregressive (AR) models, the integrated (I) models, and the moving average (MA) models.

The general autoregressive model of order p, AR(p), can be written as

$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + e_t,$$

and the moving average model of order q as

$$X_t = e_t + \theta_1 e_{t-1} + \cdots + \theta_q e_{t-q}.$$

The autoregressive (AR) models were first introduced by Yule (1927), while the moving average process was developed by Slutzky (1937). Combining these ideas produces the autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) models. A process $\{X_t\}$ is an autoregressive moving average process of order (p,q), denoted ARMA(p,q), if it is stationary and if for every t

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + e_t + \theta_1 e_{t-1} + \cdots + \theta_q e_{t-q},$$

so that $X_t$ is linearly related to the p most recent observations $X_{t-1}, \ldots, X_{t-p}$, the q most recent forecast errors $e_{t-1}, \ldots, e_{t-q}$, and the current disturbance $e_t$. A non-stationary process which requires differencing d times before it becomes stationary is said to follow an autoregressive integrated moving average model of order (p,d,q), abbreviated ARIMA(p,d,q). The difference operator $\nabla$, when applied to the entry $X_t$, yields the difference $\nabla X_t = X_t - X_{t-1}$.
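The building blocks above can be sketched in a few lines of Python: an AR(2) recursion simulated with invented coefficients, and the first-difference operator that underlies the "I" in ARIMA.

```python
import random

# AR(2) simulation: X_t = phi1*X_{t-1} + phi2*X_{t-2} + e_t, with invented
# coefficients chosen inside the stationarity region.
def simulate_ar2(phi1, phi2, n, seed=0):
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n - 2):
        x.append(phi1 * x[-1] + phi2 * x[-2] + rng.gauss(0, 1))
    return x

# First difference: one application of the d = 1 step that reduces an
# ARIMA(p,1,q) process to a stationary ARMA(p,q) process.
def difference(series):
    return [b - a for a, b in zip(series[:-1], series[1:])]

x = simulate_ar2(0.5, 0.2, 500)
print(len(x))                            # 500

# Differencing removes a deterministic linear trend entirely:
print(difference([0, 3, 6, 9, 12, 15]))  # [3, 3, 3, 3, 3]
```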

Nonlinear time series models are able to capture cyclicity, asymmetry, and higher moments such as skewness and kurtosis. They include the bilinear models introduced by Subba Rao and Gabr (1984), the exponential autoregressive (EAR) models introduced by Ozaki and Oda (1978), and the autoregressive conditional heteroscedastic (ARCH) models introduced by Engle (1982). The general bilinear model is given by

$$X_t = \sum_{i=1}^{p} a_i X_{t-i} + \sum_{j=1}^{q} c_j e_{t-j} + \sum_{i=1}^{P} \sum_{j=1}^{Q} b_{ij} X_{t-i} e_{t-j} + e_t,$$

where $\{e_t\}$ is a sequence of i.i.d. random variables, usually but not always with zero mean and variance $\sigma_e^2$, and $a_i$, $c_j$, and $b_{ij}$ are model parameters.

The EAR model is given by

$$X_t = \left(\phi_1 + \pi_1 e^{-\gamma X_{t-1}^2}\right) X_{t-1} + \cdots + \left(\phi_p + \pi_p e^{-\gamma X_{t-1}^2}\right) X_{t-p} + e_t,$$

and the ARCH model, in its regression form, by

$$y_t = x_t^{\prime} \beta + \varepsilon_t, \qquad \varepsilon_t \mid \psi_{t-1} \sim N(0, h_t), \qquad h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2,$$

for t = 1, …, T, where $x_t$ is a k×1 vector of exogenous variables and $\beta$ is a k×1 vector of regression parameters.

2.4 Autoregressive moving average (ARMA) models

The ARMA model is expressed as ARMA(p,q), where p is the number of autoregressive parameters and q the number of moving average parameters. It is defined as

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + e_t + \theta_1 e_{t-1} + \cdots + \theta_q e_{t-q}.$$

The basic assumption in estimating the ARMA coefficients is that the data are stationary, that is, that the mean and variance are constant over time and unaffected by trend or seasonality.
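The ARCH variance recursion can be demonstrated with a minimal simulation (parameters invented; this is an illustrative sketch of the ARCH(1) recursion h_t = α0 + α1·e²_{t-1}, not an estimation procedure):

```python
import random

# ARCH(1): the conditional variance h_t depends on the squared previous
# shock, producing the volatility clustering these models are used for.
def simulate_arch1(a0, a1, n, seed=1):
    rng = random.Random(seed)
    e = [0.0]            # initial shock
    h = [a0 / (1 - a1)]  # start at the unconditional variance
    for _ in range(n - 1):
        h.append(a0 + a1 * e[-1] ** 2)
        e.append(rng.gauss(0, 1) * h[-1] ** 0.5)
    return e, h

e, h = simulate_arch1(0.2, 0.5, 1000)
print(h[1])                   # 0.2: e_0 = 0, so h_1 = a0
print(all(v > 0 for v in h))  # True: variance stays positive for a0, a1 > 0
```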


## IMAGES

## VIDEO

## COMMENTS

Time series analysis focuses on understanding the dependencies in data as it changes over time. Unlike forecasting, it tries to answer the questions what happens? and why does that happen? Forecasting, on the other hand, corresponds to finding out what will happen.

Description Chapters Reviews This book is a monograph on case studies using time series analysis, which includes the main research works applied to practical projects by the author in the past 15 years.

Time series analysis is a specific way of analyzing a sequence of data points collected over an interval of time. In time series analysis, analysts record data points at consistent intervals over a set period of time rather than just recording the data points intermittently or randomly.

16 Time Series case studies 16.1 Loney Meadow flux data At the beginning of this chapter, we looked at an example of a time series in flux tower measurements of northern Sierra meadows, such as in Loney Meadow where during the 2016 season a flux tower was used to capture CO 2 flux and related micrometeorological data.

Time series, measurements of a quantity taken over time, are fundamental data objects studied across the scientific disciplines, including measurements of stock prices in finance, ion fluxes in astrophysics, atmospheric air temperatures in meteorology and human heartbeats in medicine.

Time series analysis is part of predictive analysis, gathering data over consistent intervals of time (a.k.a. collecting time series data ). It's an effective tool that allows us to quantify the impact of management decisions on future outcomes.

Introduction Linear regression is a very common model used by Data Scientist. An outcome or target variable is explained by a set of features. There is a case where the same variable is collected over time and we used a sequence of measurements of that variable made at regular time intervals. Welcome to Time Series.

The increased availability of data on health outcomes and risk factors collected at fine geographical resolution is one of the main reasons for the rising popularity of epidemiological analyses conducted at the small-area level.

The term "lag" is used often in time series analysis. To understand that term, consider a column of numbers starting with Y(2) and ending with Y(n) where n is the number of observations available. A corresponding column beginning with Y(1) and ending with Y(n-1) would constitute the "lag 1" values of that first column.
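The lag-1 construction described above can be written out directly in a few lines of Python; the numbers here are purely illustrative:

```python
# Build the "lag 1" column described above: pair each Y(t) with Y(t-1).
y = [112, 118, 132, 129, 121, 135]   # illustrative observations Y(1)..Y(n)

current = y[1:]    # Y(2) .. Y(n)
lag1 = y[:-1]      # Y(1) .. Y(n-1): the "lag 1" values of the first column
pairs = list(zip(current, lag1))
print(pairs[0])    # (118, 112): Y(2) next to its lag-1 value Y(1)
```

Note that the lagged column is one observation shorter than the original, which is why lagged regressions lose a data point per lag.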


An R script ("fm_casestudy_1_0.r") collects daily US Treasury yield data from FRED, the Federal Reserve Economic Database, and stores it in the R workspace "casestudy_1.RData". The following commands re-load the data and evaluate the presence and nature of missing values: source("fm_casestudy_0_InstallOrLoadLibraries.r") # load the R ...

The study design proposed here, called case time series, is a generally applicable tool for the analysis of transient health associations with time-varying risk factors. This novel design considers multiple observational units, defined as cases, for which data are longitudinally collected over a predefined follow-up period.

This article contributes to the policy and methodological literature in two ways: First, by providing a synthesis of available methodological literature on qualitative time-series analysis; and second, by providing two illustrative qualitative case studies that used different time-series approaches to examine policy development over time while ...

Time-series analysis is a statistical technique that deals with time-series data, or trend analysis. It involves the identification of patterns, trends, seasonality, and irregularities in the data observed over different time periods. This method is particularly useful for understanding the underlying structure and pattern of the data.
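To make the identification of trend and seasonality concrete, here is a minimal sketch using only a centered moving average on a synthetic series (an upward trend plus a repeating 4-period seasonal pattern); the data and window size are assumptions for illustration:

```python
# Synthetic monthly-style series: linear trend (slope 2) plus a
# zero-mean seasonal pattern that repeats every 4 periods.
series = [t * 2 + [10, -5, 0, -5][t % 4] for t in range(12)]

def moving_average(xs, window=4):
    # Moving average over the seasonal window: averages out the
    # seasonal component and leaves an estimate of the trend.
    half = window // 2
    return [sum(xs[i - half:i + half]) / window
            for i in range(half, len(xs) - half)]

trend = moving_average(series)
residual = [series[i + 2] - trend[i] for i in range(len(trend))]
print(trend[:3])   # [3.0, 5.0, 7.0] -- rises by 2 per step, the trend slope
```

Because the seasonal component averages to zero over its own period, the smoothed values recover the underlying slope, and the residuals expose the repeating seasonal swings.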

Here we present a new study design, called case time series, for epidemiologic investigations of transient health risks associated with time-varying exposures.

Gasparrini and Armstrong (2010) describe some of the advances made to time series study designs and statistical analysis, specifically in the context of temperature. Basu, Dominici, and Samet (2005) compare time series and case-crossover study designs in the context of exploring temperature and health.

Observational studies aim to discover and understand causal relationships between exposures and health outcomes through the analysis of epidemiologic data [1]. Paramount to this objective is removing biases due to the non-experimental setting, first and foremost confounding.

Time series analysis (TSA) serves as a powerful tool to forecast changes in product demand, price fluctuations of raw materials, and other factors that impact decision-making, enabling companies to craft and implement a viable and robust enterprise-wide strategy. This is a case study that shows how Xavor built a top-notch ...

The purpose of this paper is to illustrate, by means of a "case study," the application of the techniques developed by Box and Jenkins [4] for analyzing time series. Specifically, forecasting models are constructed for the inward and outward station movements using the monthly data from January 1951 to October 1966 of the ...

In this case study example, we will learn about time series analysis for a manufacturing operation. Time series analysis and modeling have many business and social applications. They are extensively used to forecast company sales, product demand, stock market trends, agricultural production, etc.

The ts() function in R converts a numeric vector into a time series object. Specify the start and end time, and the frequency of the data points:

```r
# Converting data to a time series object
my_time_series <- ts(time_series_data$ColumnOfInterest,
                     start = c(Year, Month), frequency = 12)
```

In this example, ColumnOfInterest represents the ...
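For readers working in Python rather than R, a rough pandas analogue of ts() attaches a regular monthly DatetimeIndex to a plain vector of values; the column values and start date below are placeholders, not data from any source above:

```python
import pandas as pd

# Illustrative observations; "MS" = month-start frequency, the pandas
# counterpart of frequency = 12 with a (Year, Month) start in R's ts().
values = [120, 135, 148, 160]
ts = pd.Series(values,
               index=pd.date_range("2021-01", periods=len(values), freq="MS"))
print(ts.index[0])   # 2021-01-01 00:00:00
```

As in R, once the index carries the frequency, downstream resampling and plotting tools can treat the vector as a proper time series rather than an unordered column.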

Time series analysis (TSA) is a technique to describe the structure and forecast values of a particular variable based on a series of sequential observations. While commonly used in finance and ...

Design: Case time series study. Setting: US national administrative healthcare claims database. Participants: 50.1 million commercial and Medicare Advantage beneficiaries aged ≥18 years between 1 January 2010 and 31 December 2016. ... a long-term time series analysis of the US Medicare dataset. Lancet Planet Health 2021; 5: e534-41. doi: 10.1016 ...

Since time series models only require historical observations of a variable, they are less costly in data collection and model estimation. Time series models can broadly be categorized into linear and nonlinear models. Linear models depend linearly on previous data points.
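A linear model in this sense can be sketched in a few lines: an AR(1) regression of each observation on its immediate predecessor, fit by ordinary least squares. The series and its values are invented for illustration:

```python
# AR(1) sketch: regress y(t) on y(t-1) with closed-form least squares.
y = [1.0, 1.5, 2.1, 2.9, 4.0, 5.5, 7.6]   # illustrative series

x = y[:-1]   # predictors: lag-1 values
t = y[1:]    # targets: current values
n = len(x)
mx, mt = sum(x) / n, sum(t) / n
slope = (sum((a - mx) * (b - mt) for a, b in zip(x, t))
         / sum((a - mx) ** 2 for a in x))
intercept = mt - slope * mx
next_value = intercept + slope * y[-1]    # one-step-ahead forecast
print(round(slope, 2))                    # 1.36
```

This is the cheapness the paragraph refers to: the only input is the variable's own history, so "model estimation" reduces to a single regression on lagged values.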