Forecasting the present: The role of Google data in assessing real-time economic conditions

What if we could forecast the present? If you are not an economist, this question may seem silly. After all, if you want to know today's weather, all you have to do is open the window and look at the sky. But if you need to know the economic situation of a country in real time, the question makes sense.

Indeed, an extremely large amount of information becomes available every day, sometimes contradictory and stemming from various sources such as statistical institutes, financial markets and the media. In addition, official economic figures are generally published with long delays, which adds uncertainty about the current situation. In this respect, the concept of nowcasting in macroeconomics has been popularized in recent years by many researchers and forecasters involved in business cycle analysis, in the wake of the seminal work by Giannone, Reichlin and Small (2008). This concept differs from standard macroeconomic forecasting in that it involves assessing the real-time economic performance of a given country. Being able to establish an accurate diagnosis of the current state of the economy is often regarded as a first step towards building a longer-term outlook.

The idea is to provide policy-makers with a real-time evaluation of the state of the economy, so that they can react rapidly to sudden changes in economic conditions and make better economic decisions.

Big Data is watching you

However, the task is particularly challenging because most countries publish their official Quarterly National Accounts, including the benchmark macroeconomic indicator, gross domestic product (GDP), after the close of the period and often with a significant lag. For example, in the euro area, Eurostat publishes its preliminary flash estimate of GDP growth around 30 days after the end of the quarter. This means that economists trying to evaluate first-quarter economic activity in the euro area do not have access to any official estimate of GDP from 1 January to 30 April. In the interim, however, they can collect a very large number of economic variables of various types released at a higher frequency, i.e. on a monthly, weekly or daily basis.

Nowadays, large sets of alternative data are also widely used by economists for macroeconomic analysis, forecasting and monetary policy decisions. Alternative data are defined in opposition to standard official macroeconomic information stemming, for instance, from national statistical institutes, central banks and international organizations. Various sources of alternative data have been used in the recent literature, such as social network data, web-scraped data, scanner data and satellite data. Generally, those datasets are extremely large and can be described as “Big Data”.

Hey Google, can you help us to forecast the present?

One of the main sources of alternative data is Google search data, and the seminal papers on the use of such data for nowcasting/forecasting are those by Hal Varian and co-authors (see for example here). In the area of nowcasting/forecasting, the literature tends to show evidence of some forecasting power for Google data, at least for some specific macroeconomic variables such as consumption (Choi and Varian, 2012), the unemployment rate (D’Amuri and Marcucci, 2017), building permits (Coble and Pincheira, 2017) or car sales (Nymand and Pantelidis, 2018). However, once Google data are properly compared with other sources of information, the jury is still out on how much economists can gain from using them for forecasting and nowcasting.

In a recent paper (Ferrara and Simoni, 2022, forthcoming in the Journal of Business and Economic Statistics), we ask whether Google data are still useful in nowcasting euro area GDP when controlling for the official variables generally used by forecasters, such as opinion surveys or production. And if so, when exactly do those alternative data actually improve nowcasting accuracy in a true real-time framework?

Because Google search data are high dimensional, in the sense that the number of variables is large compared to the time series dimension, there is a price to pay for using them: first, we need to reduce their dimensionality from ultra-high to high by using a screening procedure and, second, we need to use a regularized estimator to deal with the pre-selected variables. In this respect, we put forward a new approach that combines variable pre-selection and Ridge regularization, making it possible to handle a large database. This two-step approach is referred to as Ridge after Model Selection, and involves the following steps: (i) first, Google Search variables are preselected, conditional on the official variables, by targeting the macroeconomic aggregate to be nowcast, and (ii) second, a Ridge regularization is applied to those preselected Google Search variables and official variables. The Ridge regularization can be seen as an extended linear regression with a penalty on the size of the coefficients.
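To make the two steps concrete, here is a minimal sketch in Python on synthetic data. It is an illustration under stated assumptions, not the exact procedure of the paper: the screening rule (a t-statistic threshold on each Google variable in a regression that also includes the official variables), the threshold of 2, the penalty `alpha` and all function names are hypothetical choices made for this example.

```python
import numpy as np

def preselect(y, official, google, t_threshold=2.0):
    """Step (i): keep the Google columns whose t-statistic exceeds the
    threshold in a regression of y on the official variables plus that
    single column. The t-statistic rule is an assumption of this sketch."""
    n = len(y)
    keep = []
    for j in range(google.shape[1]):
        X = np.column_stack([np.ones(n), official, google[:, j]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - X.shape[1])
        cov = sigma2 * np.linalg.pinv(X.T @ X)
        t_stat = beta[-1] / np.sqrt(cov[-1, -1])
        if abs(t_stat) > t_threshold:
            keep.append(j)
    return keep

def ridge_fit(X, y, alpha):
    """Step (ii): closed-form Ridge on centered data, so that the
    intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

def ridge_after_selection(y, official, google, alpha=1.0, t_threshold=2.0):
    """Preselect Google variables, then fit Ridge on official + selected."""
    keep = preselect(y, official, google, t_threshold)
    X = np.column_stack([official, google[:, keep]])
    intercept, coef = ridge_fit(X, y, alpha)
    return keep, intercept, coef

# Synthetic data (made up for illustration): 60 quarters, 2 official
# variables, 40 Google variables, only one of which (column 3) matters.
rng = np.random.default_rng(0)
n = 60
official = rng.normal(size=(n, 2))
google = rng.normal(size=(n, 40))
y = 0.5 * official[:, 0] + 1.0 * google[:, 3] + 0.1 * rng.normal(size=n)

keep, intercept, coef = ridge_after_selection(y, official, google)
print("selected Google columns:", keep)
```

In this toy setting the informative Google column survives the screening, and the Ridge step then shrinks the coefficients of whatever else passed the threshold by chance. A larger `alpha` means stronger shrinkage.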

Finally, we conduct an empirical study to assess the role of Google Search data when nowcasting GDP growth for three countries/areas: the euro area, the U.S. and Germany. Usual GDP nowcasting tools integrate standard official macroeconomic information stemming, for instance, from national statistical institutes, central banks and international organizations. Typically, two sources of official data are considered: (i) hard data (production, sales, employment…) and (ii) opinion surveys (households or companies are asked about their view on current and future economic conditions). Sometimes, financial market information, generally available at high frequency, is also integrated into the information set. In our study, we include official data (i) and (ii) together with the alternative Google Search data in our information set. In addition, we consider financial market information as a robustness check.

Google data through the recession

We analyze three different periods: a period of cyclical stability (2014q1-2016q1), a period that exhibits a sharp downturn in the GDP growth rate (2017q1-2018q4) and a period of recession (the Great Recession period from 2008q1 to 2009q2). Overall, empirical results show that Google Search variables are useful when trying to nowcast GDP growth. At the beginning of the quarter, when there is no official information available about the current state of the economy, we show that using only Google data leads to very reasonable Mean Squared Forecasting Errors (MSFEs), sometimes only slightly higher than those obtained at the end of the quarter when the information set is complete. As soon as we integrate official macroeconomic information, starting from the fifth week of the quarter, MSFEs decrease, reflecting the importance of this type of data in nowcasting. Overall, combining macroeconomic variables and Google variables in the same model appears to be generally fruitful.
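For readers unfamiliar with the metric, the MSFE is simply the average of the squared gaps between realized GDP growth and the nowcasts made at a given point in the quarter. The short sketch below computes it for two nowcasting dates. All numbers are invented for illustration and are not results from our study, and the function and variable names are hypothetical.

```python
import numpy as np

def msfe(actual, nowcasts):
    """Mean squared forecasting error between realized values and nowcasts."""
    actual = np.asarray(actual, dtype=float)
    nowcasts = np.asarray(nowcasts, dtype=float)
    return float(np.mean((actual - nowcasts) ** 2))

# Made-up quarterly GDP growth rates (in %) over four quarters.
actual = [0.4, 0.3, 0.5, 0.2]

# Made-up nowcasts of the same quarters, produced in week 1 of each quarter
# (little information) and in week 13 (complete information set).
nowcasts_by_week = {
    1: [0.2, 0.5, 0.3, 0.4],
    13: [0.3, 0.3, 0.4, 0.3],
}

msfe_by_week = {week: msfe(actual, nc) for week, nc in nowcasts_by_week.items()}
for week, value in msfe_by_week.items():
    print(f"week {week}: MSFE = {value:.4f}")
```

Tracking this number week by week within the quarter shows how much each new data release improves the nowcast.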

A striking result of our empirical analysis is that, on the one hand, the preselection step is crucial in the first two periods considered (that is, 2014q1-2016q1 and 2017q1-2018q4), as it generates better outcomes than nowcasting procedures without any preselection. This result confirms previous findings from the nowcasting literature, in particular those dealing with dynamic factor models. On the other hand, we highlight that a recession period presents specific patterns, as a model that only contains Google variables, without any preselection step, tends to be preferred in terms of nowcasting accuracy. This result is quite robust across the three countries/areas that we consider in the study. It likely reflects the fact that uncertainty increases during a recession, meaning that accounting for a larger information set is useful during such economic episodes.

Overall, we believe that new alternative data, such as Google data, are useful for improving the real-time economic diagnosis, in addition to more standard sources of information. In particular, those data appear extremely useful when official economic information is lacking or fragmented, as is the case for example in emerging and low-income countries. However, more efficient econometric tools have to be developed in order to deal with the stylized facts of those alternative data, such as volume, variety, velocity, variability and veracity, also known as the 5 Vs.