## Abstract

Commercial buildings generally have large thermal inertia and thus can provide services to power grids (e.g., demand response (DR)) by modulating their heating, ventilation, and air conditioning (HVAC) systems. Shifting consumption on timescales of minutes to an hour can be accomplished through temperature setpoint adjustments that affect HVAC fan consumption. Estimating the counterfactual baseline power consumption of HVAC fans is challenging but is critical for assessing the capacity and participation of DR from HVAC fans in grid-interactive efficient buildings (GEBs). DR baseline methods have been developed for whole-building power profiles. This study evaluates those methods on total HVAC fan power profiles, which have different characteristics than whole-building power profiles. Specifically, we assess averaging methods (e.g., Y-day average, HighXofY, and MidXofY, with and without additive adjustments), which are the most commonly used in practice, and a least squares-based linear interpolation method recently developed for baselining HVAC fan power. We use empirical submetering data from HVAC fans in three University of Michigan buildings in our assessment. We find that the linear interpolation method has a low bias and by far the highest accuracy, indicating that it is potentially the most effective existing baseline method for quantifying the effects of short-term load shifting of HVAC fans. Overall, our results provide new insights on the applicability of existing DR baseline methods to baselining fan power and enable the more widespread contribution of GEBs to DR and other grid services.

## 1 Introduction

The U.S. Federal Energy Regulatory Commission (FERC) defines demand response (DR) as “changes in electric usage by demand-side resources from their normal consumption patterns in response to changes in the price of electricity over time or to incentive payments designed to induce lower electricity use at times of high wholesale market prices or when system reliability is jeopardized” [1]. DR is one of the most flexible and effective solutions to reduce power system investment and operation costs and to displace generation and network reinforcement [2]. It is also capable of mitigating the impacts of renewable energy fluctuations and enhancing system reliability during periods of high demand [3].

Commercial buildings account for roughly 20% of the energy consumed in the United States [4]. They are well suited for DR as they generally have high thermal inertia that can be utilized as an energy reservoir for short periods, without negatively impacting the comfort of occupants [5]. Within a building, some equipment is more responsive and has a higher DR potential, while other equipment is less responsive or even nonresponsive. Heating, ventilation, and air conditioning (HVAC) systems represent the bulk of power consumption in most commercial buildings. They are also becoming increasingly controllable with relatively sophisticated control and communications architectures, especially in grid-interactive efficient buildings (GEBs) [6,7]. Thus, HVAC systems are one of the largest DR resources [8]. Among many DR strategies that modulate HVAC power consumption, directly or indirectly controlling HVAC supply and return fans is a key source of flexibility enabling the provision of grid services [7,9]. For example, Ref. [10] introduces a feed-forward architecture to control the fans in commercial building HVAC systems to provide frequency regulation to the grid.

The objective of this article is to evaluate the performance (e.g., accuracy and bias) of existing DR baseline estimation methods in quantifying the effects of short-term (i.e., timescales of minutes to an hour) load shifting of commercial building HVAC fan power using empirical data from real buildings. We apply the methods to estimate the baseline power on days without DR events and then compare the estimated baseline with the actual measurements, i.e., the true baseline on days without DR. The evaluation is focused on simple, interpretable, and DR participant-friendly baseline methods that are commonly used or satisfy the practical requirements of electric utilities and independent system operators (ISOs) [11–13].

DR baseline estimation seeks to estimate the counterfactual power profile that would have occurred without DR. Accurate baseline estimation methods are critical for assessing the performance of both individual GEBs and DR programs. Baselines are also needed for determining compensation to GEBs participating in DR programs or electricity markets, providing utilities and ISOs with a prediction of how much GEB flexibility was or is expected, and for a variety of grid operational and planning problems [14,15]. It is challenging to measure or calculate what would have occurred without DR and thus, fundamentally, baselines are imperfect [16]. In particular, for smaller devices and more granular end-uses with irregular or unpredictable power consumption, establishing a robust and accurate baseline can be difficult [17–19]. When a load is dependent on consumer behavior, it is also typically more difficult to establish a baseline [20].

We note that baseline estimation is different from electric load forecasting. The latter forecasts what the load will be in the future. In contrast, DR baseline estimation predicts what the load would have been without DR in the past or future, where prediction of the past can leverage a posteriori data. Similar building models could be used for both tasks; however, the choice of model/method is always a trade-off between accuracy and simplicity, and DR applications tend to favor simplicity. Specifically, DR baseline methods should be simple enough for all stakeholders (including electricity customers participating in DR programs, referred to as *DR participants*) to understand, calculate, and implement [21–23]. Therefore, in practice, much simpler methods are used for DR baseline estimation than for load forecasting.

A variety of DR baseline methods have been proposed, generally based on whole-building electric load (power) profiles. Those baseline methods can be classified into four categories: averaging methods, regression methods, control group methods, and machine learning methods. Among them, averaging methods are the most commonly used by electric utilities and ISOs [12,21,22,24]. Averaging methods use the average load of several days selected from recent days without DR events to estimate the baseline [25,26]. Averaging methods typically also incorporate a multiplicative or additive adjustment [25]. Their application and performance highly depend on the availability of recent days without DR events, which can be very limited if the DR resource is frequently actuated to provide DR services.

It may be possible to establish more accurate DR baseline estimates if, rather than using whole-building electric load data, we use submetered load data from the equipment or devices providing DR [14]. For instance, for short-term load shifting via room temperature setpoint control, we expect a response primarily from the supply and return fans and secondarily from the chiller(s) [6,10]. Submetering not only has the potential to improve baselining, and thus DR performance estimation and grid service delivery verification, but also provides more granular insight into the impact of DR actions [14].

In this article, we evaluate the ability of existing DR baseline methods to estimate total HVAC fan power baselines. Since these methods were designed for baselining whole-building electric load profiles, their applicability for baselining the total HVAC fan power is unknown. Table 1 lists the baseline methods selected for the evaluation. In addition to averaging methods, we investigate a least squares-based linear interpolation method recently developed for baselining HVAC fan power data [6]. All of these methods are commonly used or are adequately simple and interpretable to be used by utilities and ISOs. More sophisticated methods (e.g., regression methods, control group methods, and machine learning methods) are not included in the evaluation, as they are typically not used in practice due to their much higher data requirements, excessive complexity, lower interpretability, and only marginal gains in accuracy [14].

Method type | Method examples | Examples of application in practice |
---|---|---|
Y-day average | 5-day average | ISO New England [11] |
Y-day average | 10-day average | California ISO [12] |
HighXofY | High4of5 | PJM Interconnection [13] |
HighXofY | High5of10 | New York ISO [23] |
MidXofY | Mid4of6 | Electric Reliability Council of Texas [13,21] |
LowXofY | Low4of5 | Unknown (proposed in Ref. [22], used in Refs. [24,27,28]) |
LowXofY | Low5of10 | Unknown (proposed in Ref. [22], used in Refs. [28,29]) |
NearestXofY | Nearest3of6 | Unknown (proposed in Ref. [30], used in Ref. [31]) |
NearestXofY | Nearest5of10 | Unknown (proposed in Ref. [30]) |
Linear interpolation | Linear interpolation | Unknown (proposed in Ref. [6], used in Refs. [30–34]) |


The contributions of this article are threefold: (1) We evaluate how existing baseline methods developed for baselining whole-building power consumption translate to baselining HVAC fan power consumption; (2) we apply selected baseline methods to empirical data from three University of Michigan buildings to quantitatively assess their performance; and (3) we find that the linear interpolation method has a low bias and by far the highest accuracy among the methods tested, making it potentially the most effective existing DR baseline method for quantifying the effects of short-term load shifting of HVAC fans. Our evaluation serves as one of the first systematic assessments of the applicability of existing DR baseline methods in baselining HVAC fan power. Note that compared with our technical report [30], this study contributes novel results based on a refined selection of evaluation metrics, improved and more practical implementation of adjustments for averaging methods, and new analyses and discussions. The results inform better baselining and in turn can enhance the implementation, financial settlement, and benefits realization of HVAC-based DR from GEBs.

The remainder of this article is organized as follows. Section 2 describes how HVAC fans can be used for DR. Section 3 introduces the baseline methods evaluated in this study. Section 4 introduces the evaluation methodology. Section 5 describes the data used in our evaluation. Section 6 presents the numerical results of the evaluation. Section 7 concludes this article and discusses future work.

## 2 Demand Response From HVAC Fans

Modulating the power consumption of HVAC fans in GEBs is a key source of flexibility for grid services [7,9,10]. HVAC fans can respond quickly, providing high-value fast DR, which can help accommodate more renewable energy sources in the grid. Figure 1 shows an example of our recent experiments using temperature setpoint control to attain short-term load shifting of HVAC fans [32,33]. The morning experiment is called an up-down test. During the response window, room temperature setpoints are decreased below nominal values for 30 min and then increased symmetrically above nominal values for 30 min, causing HVAC fan power to go up and then go down. The fans return to normal operation after a settling window. The afternoon experiment is a down-up test with opposite setpoint changes. In this building, we have sub-metered all supply and return fans in the HVAC system. As shown in Fig. 1, the response is clearly identifiable from the fan power data. In contrast, it can be far less obvious with only whole-building electric load data.

To assess the impact of the load shifting strategy, fan power baselines, i.e., counterfactual fan power profiles that would have occurred without DR events, need to be estimated. In Fig. 1, we have included a linear baseline created through linear interpolation (see Sec. 3.8 for details), which does not aim to accurately capture the unknown time-series but aims to capture average changes in fan power consumption during the response and settling windows (specifically, here, we are interested in the average change in power in the first 30 min of the response window, second 30 min of the response window, and in the 60-min settling window). Figure 1 also helps us to highlight how baseline estimation is different from electric load forecasting. Load forecasting is based on predictions of explanatory variables and aims to produce accurate time-series forecasts of load profiles. In contrast, baseline estimation methods used to predict the past can leverage a posteriori knowledge, e.g., here we use data from before the response window and after the settling window to develop our linear interpolation. Furthermore, to assess the capacity, participation, performance, and financial rewards of GEBs providing grid services, which are the primary uses of DR baseline models [9,14], in most cases, we do not necessarily need accurate time-series predictions (i.e., to capture detailed load dynamics), but rather we need accurate predictions of average load over time windows.

As mentioned in Sec. 1, in this article, we consider only simple and interpretable DR baseline methods. Here, we provide more justification for why we have excluded alternative methods. Although regression methods [35] are used by some entities to baseline whole-building electric load, in our previous study, we found that HVAC fan power and outdoor temperature (the most commonly used regressor in estimating whole-building electric load baselines) have a low correlation (0.28 on average) [30]. Therefore, the applicability of regression methods for baselining HVAC fan power is greatly limited. Control group and machine learning methods are typically not used in practice as they can be complicated, less interpretable, and/or require a large volume of data. Control group methods estimate the baseline using load data of nonresponsive buildings with the most similar load patterns to DR participants [36]. However, it is challenging to match or cluster similar buildings since there is generally a large amount of variability in HVAC equipment. Machine learning methods find the relationship between the load and its related factors by training black-box models (e.g., neural networks [37]), which are difficult to interpret and explain to DR participants. Moreover, these more sophisticated methods only attain marginal gains in accuracy according to some ISOs [14]. Physics-based HVAC fan power models could also be used to generate DR baseline estimates, but they require significant data and effort for model calibration. Such models are generally more suitable for building analysis and load forecasting than DR baseline estimation.

To the best of our knowledge, there are only three methods in the literature specifically designed to baseline HVAC fan power. One is the linear interpolation method proposed in Ref. [6], illustrated in Fig. 1, and included in our evaluation. Another is the tensor completion-based method developed in Ref. [31]. It estimates baselines by finding dominant fan power patterns hidden in high-dimensional data and is classified as a machine learning method. We do not include it in our evaluation for the reasons mentioned earlier. The third is the signal bandwidth separation method proposed in Ref. [10]. It estimates the fan power baseline using a low-pass filter on load data from DR events. This method is applicable to settings in which DR signals vary much faster than the baseline load. This method is not evaluated here either, as our experimental data correspond to load shifting on timescales of minutes to an hour, i.e., timescales that overlap with the timescales of baseline load variation. In contrast, Ref. [10] uses it for baselining fan power in GEBs participating in ancillary services by following DR signals that vary every 2–10 s.

We also note that for some DR programs ex-post baselines are not necessary because buildings are asked to self-schedule their baseline and deliver services around that baseline. However, this makes the job of the building harder as it not only needs to perform DR but also compensate for baseline forecast error. Self-scheduled baselines are less common in traditional DR programs than in emerging DR programs like loads providing frequency regulation.

## 3 Baseline Methods

Before describing the baseline methods, we define *DR days* as the days when DR events occur and other days as *baseline days*. Note that weekdays and weekends normally have different load patterns. Our current data set only includes a limited number of weekends. Therefore, we only consider weekdays in this work. Nevertheless, the methods for weekday baseline estimation also apply to weekends, and the same process of baseline method performance evaluation could be conducted on load data from weekends.

In this section, we first present the generic form of averaging baseline methods and then explain how it varies in different averaging methods. After that, we introduce an additive adjustment method that can be applied to the averaging methods. Finally, we present the linear interpolation baseline method. The nomenclature used in this section is partly based on Ref. [22], but with modifications for clarity.

### 3.1 Generic Form of Averaging Methods.

Let *i* be the index for GEBs and *d* be the index for days. According to the sampling time of the power data, we divide a day into a set of time-steps $T=\{1,2,\ldots,|T|\}$, and let *t* be the index for time-steps. We define the actual load and estimated baseline load (i.e., the total HVAC fan power) of GEB *i* on day *d* at time-step *t* as $p_i(d,t)$ and $\hat{p}_i(d,t)$, respectively.

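Each averaging method selects a set of baseline days, denoted by $D(d)$, to estimate the baseline on day *d*. Let *d*′ be the index of baseline days in the set $D(d)$. The baseline of GEB *i* on day *d* at time-step *t*, i.e., $\hat{p}_i(d,t)$, is then obtained by taking the mean of the same GEB’s load at time-step *t* among the selected days:

$$\hat{p}_i(d,t)=\frac{1}{|D(d)|}\sum_{d'\in D(d)}p_i(d',t)$$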

### 3.2 Y-Day Average Method.

For the Y-day average method [25], the set $D(d)$ consists of the *Y* most recent baseline days preceding day *d* with the same day type (e.g., day of week, or weekend/weekday) as day *d*. We denote this set of days by $R(Y,d)$, i.e., $D(d)=R(Y,d)$. In this study, we use weekday/weekend day types and evaluate the 5-day average method (i.e., *Y* = 5) and the 10-day average method (i.e., *Y* = 10), which are used by ISO New England and California ISO, respectively [11,12].
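As an illustration, the Y-day average can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in this study: the function name `y_day_average` and the list-of-lists data layout are assumptions made for the example, and `history` is assumed to already contain only baseline days of the same day type that precede the target day.

```python
def y_day_average(history, y):
    """Estimate a baseline as the mean of the Y most recent baseline days.

    history: daily load profiles (lists of equal length), ordered oldest to
             newest; assumed to hold only baseline days of the same day type
             that precede the target day (hypothetical data layout).
    y:       number of most recent baseline days to average.
    """
    if y < 1 or len(history) < y:
        raise ValueError("not enough baseline days in history")
    recent = history[-y:]  # plays the role of the set R(Y, d)
    n_steps = len(recent[0])
    # Mean over the selected days at each time-step.
    return [sum(day[t] for day in recent) / y for t in range(n_steps)]


# Toy data: three baseline days with four time-steps each (kW).
history = [
    [10.0, 12.0, 14.0, 12.0],
    [11.0, 13.0, 15.0, 13.0],
    [12.0, 14.0, 16.0, 14.0],
]
baseline = y_day_average(history, y=2)
print(baseline)  # [11.5, 13.5, 15.5, 13.5]
```

With *Y* = 2, only the two most recent days contribute, so the oldest profile has no effect on the result.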

### 3.3 HighXofY Average Method.

The HighXofY average method averages the load of the *X* days that have the highest daily electricity consumption among the *Y* most recent baseline days of the same day type as the DR day. We denote this set of days by $H(X,Y,d)$, i.e., $D(d)=H(X,Y,d)$. Let *d*″ be the index of baseline days in the set $R(Y,d)$ but not in the set $H(X,Y,d)$, which is determined by the following conditions:

(1) $H(X,Y,d)\subset R(Y,d)$;

(2) $|H(X,Y,d)|=X$; and

(3) $p_i(d')\ge p_i(d'')$ for any $d'\in H(X,Y,d)$ and $d''\in R(Y,d)\setminus H(X,Y,d)$, where $p_i(d)$ is the daily total load of GEB *i* on day *d*, specifically, $p_i(d)=\sum_{t\in T}p_i(d,t)$.

The first condition requires that the method select days from the set $R(Y,d)$, i.e., the *Y* most recent baseline days preceding day *d* with the same day type as day *d*. The second condition requires that *X* days are selected. The third condition requires that, among the days in $R(Y,d)$, any selected day has a daily total load greater than or equal to that of any unselected day. In this article, we evaluate the High4of5 baseline method and the High5of10 baseline method used by PJM Interconnection and New York ISO, respectively [13,23].

Note that this method is useful for baselining peak days with high electricity consumption. We do not expect it to work well here, since we use it to baseline both peak days and off-peak days, which is required for assessing DR used for grid services that may be needed at any time. We expect this method to produce positively biased baselines and so we explore the impact of an additive adjustment, described in Sec. 3.7, which can reduce bias.
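The day selection behind HighXofY can be sketched as follows. This is a hedged illustration under stated assumptions, not the implementation used in this study: `high_x_of_y` is a hypothetical helper, `history` is assumed to contain only eligible baseline days ordered oldest to newest, and ties in daily total load are broken by the order in which Python's stable sort encounters them.

```python
def high_x_of_y(history, x, y):
    """Average the X highest-consumption days among the Y most recent ones.

    history: daily load profiles (lists of equal length), oldest to newest,
             assumed to hold only baseline days of the right day type.
    """
    if not 1 <= x <= y <= len(history):
        raise ValueError("need 1 <= x <= y <= number of baseline days")
    recent = history[-y:]  # R(Y, d)
    # Rank days by daily total load, highest first, and keep the top X.
    selected = sorted(recent, key=sum, reverse=True)[:x]  # H(X, Y, d)
    n_steps = len(selected[0])
    return [sum(day[t] for day in selected) / x for t in range(n_steps)]


# Toy data: five baseline days with two time-steps each (kW).
history = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0], [40.0, 40.0], [50.0, 50.0]]
high = high_x_of_y(history, x=4, y=5)
plain = high_x_of_y(history, x=5, y=5)  # X = Y reduces to the 5-day average
print(high, plain)  # [35.0, 35.0] [30.0, 30.0]
```

In this toy example, the High4of5 baseline sits above the plain 5-day average because the lowest-consumption day is discarded, which is the mechanism behind the positive bias anticipated above.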

### 3.4 LowXofY Average Method.

The LowXofY average method averages the load of the *X* days that have the lowest daily electricity consumption among the *Y* most recent baseline days of the same day type as the DR day. We denote this set of days by $L(X,Y,d)$, i.e., $D(d)=L(X,Y,d)$, which is determined by the following conditions:

(1) $L(X,Y,d)\subset R(Y,d)$;

(2) $|L(X,Y,d)|=X$; and

(3) $p_i(d')\le p_i(d'')$ for any $d'\in L(X,Y,d)$ and $d''\in R(Y,d)\setminus L(X,Y,d)$.

Similar to the HighXofY average method, the first and second conditions require that *X* days are selected from the set $R(Y,d)$. The third condition requires that, among the days in $R(Y,d)$, any selected day has a daily total load less than or equal to that of any unselected day. In this article, we evaluate the Low4of5 and Low5of10 baseline methods, which are proposed in Ref. [22] and used in many articles such as Refs. [24,27–29].

This method is useful for baselining days with low electricity consumption and, again, we do not expect it to work well here. Reference [22] reports that it produces negatively biased baselines, but that the baselines can also have high accuracy. Again, we explore the impact of an additive adjustment, described in Sec. 3.7, to reduce bias.

### 3.5 MidXofY Average Method.

The MidXofY average method is used by the Electric Reliability Council of Texas [21]. It averages the load of the *X* days that have middling levels of daily electricity consumption among the *Y* most recent baseline days of the same day type as the DR day. We denote this set of days by $M(X,Y,d)$, i.e., $D(d)=M(X,Y,d)$, which is determined based on the following conditions:

(1) $M(X,Y,d)\subset R(Y,d)$;

(2) $|M(X,Y,d)|=X$; and

(3) $M(X,Y,d)=R(Y,d)\setminus\{H(Z,Y,d)\cup L(Z,Y,d)\}$, where $Z=(Y-X)/2$ and $(Y-X)\bmod 2=0$.

The first and second conditions again require that *X* days are selected from the set $R(Y,d)$. The third condition requires that the *Z* days with the lowest electricity consumption and the *Z* days with the highest electricity consumption are dropped, retaining *X* days with the middling levels of electricity consumption. In this article, we evaluate the Mid4of6 baseline method [13].
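A possible sketch of the MidXofY selection is given below. It is illustrative only: `mid_x_of_y` is a hypothetical name, the input layout is assumed, and ties in daily total load are resolved by sort order.

```python
def mid_x_of_y(history, x, y):
    """Average the X middle days after dropping the Z highest and Z lowest.

    history: daily load profiles (lists of equal length), oldest to newest,
             assumed to hold only baseline days of the right day type.
    """
    if not 1 <= x <= y <= len(history):
        raise ValueError("need 1 <= x <= y <= number of baseline days")
    if (y - x) % 2 != 0:
        raise ValueError("Y - X must be even")
    z = (y - x) // 2
    recent = history[-y:]             # R(Y, d)
    ranked = sorted(recent, key=sum)  # lowest daily total load first
    selected = ranked[z:y - z]        # drop the Z lowest and the Z highest
    n_steps = len(selected[0])
    return [sum(day[t] for day in selected) / x for t in range(n_steps)]


# Toy data: six baseline days, one of them an unusually high-load day.
history = [[5.0, 5.0], [10.0, 10.0], [20.0, 20.0],
           [30.0, 30.0], [40.0, 40.0], [90.0, 90.0]]
mid = mid_x_of_y(history, x=4, y=6)
print(mid)  # [25.0, 25.0]
```

Here Mid4of6 discards the 5 kW and 90 kW days, so a single atypical day does not pull the baseline up or down.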

### 3.6 NearestXofY Average Method.

We also evaluate the NearestXofY average method proposed in our technical report [30] and used in Ref. [31]. It averages the load of the *X* days among the *Y* most recent baseline days of the same day type as the DR day that have load profiles outside of the DR event window nearest to that of the DR day. The DR event window includes the response window and a settling window, as shown in Fig. 1. We denote this set of days by $N(X,Y,d)$, i.e., $D(d)=N(X,Y,d)$, which is determined based on the following conditions:

(1) $N(X,Y,d)\subset R(Y,d)$;

(2) $|N(X,Y,d)|=X$; and

(3) $\left|\sum_{t\in T\setminus T_i^{\mathrm{DR}}(d)}[p_i(d',t)-p_i(d,t)]\right|\le\left|\sum_{t\in T\setminus T_i^{\mathrm{DR}}(d)}[p_i(d'',t)-p_i(d,t)]\right|$ for any $d'\in N(X,Y,d)$ and $d''\in R(Y,d)\setminus N(X,Y,d)$, where $T_i^{\mathrm{DR}}(d)$ is the set of time-steps within the DR event window (including the response window and settling window) for GEB *i* on day *d*.

Again, the first and second conditions require that *X* days are selected from the set of baseline days $R(Y,d)$. The third condition requires that the electricity consumption over the entire DR day except for the DR event window is closer to that of selected baseline days than to that of unselected baseline days. In this article, we test the Nearest3of6 and Nearest5of10 baseline methods.
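The nearest-day criterion can be sketched as follows. The function name `nearest_x_of_y`, the argument `event_steps` (indices of the time-steps inside the DR event window), and the data layout are assumptions for illustration; the distance is the absolute value of the summed load difference outside the event window, as in the third condition.

```python
def nearest_x_of_y(history, target, event_steps, x, y):
    """Average the X recent days closest to the target day outside the event.

    history:     baseline-day profiles (lists), oldest to newest (assumed).
    target:      measured profile of the day being baselined.
    event_steps: indices of time-steps inside the DR event window.
    """
    if not 1 <= x <= y <= len(history):
        raise ValueError("need 1 <= x <= y <= number of baseline days")
    recent = history[-y:]  # R(Y, d)
    event = set(event_steps)
    outside = [t for t in range(len(target)) if t not in event]

    def distance(day):
        # |sum over non-event time-steps of (candidate load - target load)|
        return abs(sum(day[t] - target[t] for t in outside))

    selected = sorted(recent, key=distance)[:x]  # N(X, Y, d)
    return [sum(day[t] for day in selected) / x for t in range(len(target))]


# Toy data: time-steps 2 and 3 form the event window of the target day.
target = [10.0, 11.0, 99.0, 99.0, 12.0, 13.0]
history = [
    [10.0, 11.0, 12.0, 12.0, 12.0, 13.0],  # distance 0
    [20.0, 21.0, 22.0, 22.0, 22.0, 23.0],  # distance 40
    [11.0, 12.0, 13.0, 13.0, 13.0, 14.0],  # distance 4
]
baseline = nearest_x_of_y(history, target, event_steps=[2, 3], x=2, y=3)
print(baseline)  # [10.5, 11.5, 12.5, 12.5, 12.5, 13.5]
```

Because the criterion sums signed differences before taking the absolute value, positive and negative deviations outside the event window can offset each other; the sketch keeps that behavior rather than substituting a per-time-step distance.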

### 3.7 Adjustment Method.

Averaging methods are easy to understand and implement for both utilities and DR participants, but can have large errors [26]. Their performance highly depends on the similarity of power profiles between the DR day and the selected baseline days. However, conditions on DR and baseline days can be very different. Therefore, adjustments (including additive and multiplicative adjustments) based on the DR day data are frequently applied to improve accuracy and reduce bias. Additive adjustments add or subtract a fixed load to the estimated baseline load at each time-step, while multiplicative adjustments multiply the estimated baseline load at each time-step by a fixed amount, such that the adjusted baseline is equal to the observed load on average during a time window shortly before the start of the DR event, referred to as an *adjustment window*. Additive adjustments are generally preferred to multiplicative adjustments, as baselines can become volatile under multiplicative adjustments [38]. Note that by using an adjustment, we assume that the GEB does not take anticipatory actions (e.g., building pre-cooling) during the adjustment window.

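With the additive adjustment, the adjusted baseline load of GEB *i* on day *d* at time-step *t* is

$$\hat{p}_i^{\mathrm{adj}}(d,t)=\hat{p}_i(d,t)+\frac{1}{|T_i^{\mathrm{adj}}(d)|}\sum_{\tau\in T_i^{\mathrm{adj}}(d)}\left[p_i(d,\tau)-\hat{p}_i(d,\tau)\right]$$

where $T_i^{\mathrm{adj}}(d)$ is the set of time-steps within the adjustment window for GEB *i* on DR day *d*. In this study, we use the 2-h period directly before the DR event as the adjustment window, as in Refs. [24,25]. None of the buildings take anticipatory actions before the DR event. To evaluate the effectiveness of the adjustment method, we compare the overall performance of the averaging methods with and without the additive adjustment.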
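A minimal sketch of an additive adjustment is shown below, assuming the adjustment shifts the whole baseline by the mean difference between the observed load and the unadjusted baseline over the adjustment window. The name `additive_adjustment` and the argument `adjustment_steps` are hypothetical.

```python
def additive_adjustment(baseline, actual, adjustment_steps):
    """Shift a baseline so it matches the observed load on average
    over the adjustment window.

    baseline:         unadjusted baseline profile for the DR day.
    actual:           measured load profile for the DR day.
    adjustment_steps: indices of time-steps in the adjustment window.
    """
    steps = list(adjustment_steps)
    if not steps:
        raise ValueError("adjustment window is empty")
    # Mean gap between observed load and baseline in the adjustment window.
    offset = sum(actual[t] - baseline[t] for t in steps) / len(steps)
    # The same fixed offset is applied at every time-step.
    return [b + offset for b in baseline]


# Toy data: time-steps 0 and 1 form the adjustment window (kW).
baseline = [10.0, 10.0, 12.0, 12.0]
actual = [13.0, 13.0, 20.0, 20.0]
adjusted = additive_adjustment(baseline, actual, adjustment_steps=[0, 1])
print(adjusted)  # [13.0, 13.0, 15.0, 15.0]
```

The offset is computed only from the adjustment window, so the later time-steps, which here stand in for the DR event, do not influence the adjusted baseline.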

### 3.8 Linear Interpolation Method.

The linear interpolation method estimates the baseline as a linear function of time. The estimated baseline load of GEB *i* on day *d* at time-step *t* is

$$\hat{p}_i(d,t)=a_i(d)\,t+b_i(d)$$

where $a_i(d)$ and $b_i(d)$ are scalar constants. To obtain $a_i(d)$ and $b_i(d)$, we use least squares to fit the 1-min interval load data from the 5-min period just before the DR event and the 5-min period immediately after the settling time [33]. See Fig. 2 for an example. Since the linear interpolation method uses some data from immediately after the DR event window, it is applicable to ex-post analyses, but not to look-ahead analyses requiring forecasts. In line with the objectives of baseline estimation described in Sec. 2, the linear interpolation method does not aim to produce accurate time-series predictions, but instead to generate estimates that are accurate on average over DR event windows. In the previous study, this baseline method seemed to perform well; this article evaluates it on larger data sets and compares it with the other baseline methods.
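The least-squares fit can be sketched as follows, assuming 1-min samples indexed by minute and an event window given as a half-open index range. The helper names `fit_line` and `linear_interpolation_baseline` and the synthetic load profile are assumptions for illustration, not the implementation of Ref. [6].

```python
def fit_line(times, loads):
    """Ordinary least-squares fit of loads = a * times + b."""
    n = len(times)
    t_mean = sum(times) / n
    p_mean = sum(loads) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (p - p_mean) for t, p in zip(times, loads))
    a = sxy / sxx
    b = p_mean - a * t_mean
    return a, b


def linear_interpolation_baseline(load, event_start, event_end, fit_len=5):
    """Fit a line to the fit_len samples before and after the event window.

    load: 1-min load samples for one day (or part of a day).
    The event window (response plus settling) covers the indices
    event_start, ..., event_end - 1.
    """
    if event_start - fit_len < 0 or event_end + fit_len > len(load):
        raise ValueError("not enough data around the event window")
    fit_times = list(range(event_start - fit_len, event_start)) + list(
        range(event_end, event_end + fit_len)
    )
    a, b = fit_line(fit_times, [load[t] for t in fit_times])
    return [a * t + b for t in range(event_start, event_end)]


# Synthetic data: a slow ramp plus an up-down response inside the window.
load = [20.0 + 0.1 * t for t in range(140)]
for t in range(10, 40):
    load[t] += 5.0  # first 30 min of the response window
for t in range(40, 70):
    load[t] -= 5.0  # second 30 min of the response window
baseline = linear_interpolation_baseline(load, event_start=10, event_end=130)
mean_change = sum(load[10 + k] - baseline[k] for k in range(30)) / 30
print(round(mean_change, 6))  # close to 5.0 for this synthetic ramp
```

Only data outside the event window enter the fit, so the response itself does not distort the estimated line; the sketch then reports the average change over the first 30 min, which is the kind of window-averaged quantity the method targets.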

## 4 Performance Evaluation Methodology

In this section, we introduce the methodology used to evaluate the performance of the selected baseline methods. We first explain the evaluation process. After that, the metrics for assessing the baseline methods are introduced. Finally, we discuss the limitations of our evaluation methodology.

### 4.1 Evaluation Process.

We evaluate the baseline methods assuming two different DR event windows, i.e., 9:00 a.m. to 11:00 a.m. (referred to as the morning event window) and 1:00 p.m. to 3:00 p.m. (referred to as the afternoon event window), which correspond to the times of our DR events on DR days [32,33,39]. Specifically, on a DR day, we conducted two short-term load shifting DR experiments, each lasting 1 h (i.e., 9:00 a.m. to 10:00 a.m. and 1:00 p.m. to 2:00 p.m.) and each followed by a 1-h settling window.

We use 1-min interval data corresponding to the total HVAC fan power on baseline days to evaluate the baseline methods. That is, we apply the methods to estimate the baseline fan power on days without DR events (i.e., baseline days). If a baseline method is perfectly accurate, the estimated fan power should be exactly the same as the measured fan power data on baseline days. By comparing the estimated baseline with the measured fan power data (i.e., the true baseline), we can calculate and evaluate the baseline method error.

A rolling origin blocked cross validation process is conducted to evaluate each averaging method on each data set [31,40,41]. For example, to evaluate the 5-day averaging method on a data set with a total of 16 baseline days that are chronologically ordered, we first use the data of days 1–5 to estimate the baseline of day 6 and then use the data of days 2–6 to estimate the baseline of day 7. This process goes on until we reach the last run using the data of days 11–15 to estimate the baseline of day 16. The linear interpolation method does not require a cross validation paradigm. It is run directly on each day of a data set. In each run of a baseline method, evaluation metrics are calculated and used for the statistical analysis of the method’s performance on the data set.
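The rolling-origin scheme described above can be sketched as a generator of training/test splits. The function name `rolling_origin_splits` is hypothetical, and day indices are 0-based in the code, whereas the text counts days from 1.

```python
def rolling_origin_splits(n_days, y):
    """Yield (training day indices, test day index) pairs, 0-based.

    Each run uses the y days immediately preceding the test day.
    """
    for test_day in range(y, n_days):
        yield list(range(test_day - y, test_day)), test_day


# 16 chronologically ordered baseline days, 5-day averaging method.
splits = list(rolling_origin_splits(n_days=16, y=5))
first_train, first_test = splits[0]
last_train, last_test = splits[-1]
# Convert to 1-based day numbers to match the example in the text.
print([d + 1 for d in first_train], first_test + 1)  # [1, 2, 3, 4, 5] 6
print([d + 1 for d in last_train], last_test + 1)    # [11, 12, 13, 14, 15] 16
print(len(splits))                                   # 11
```

For this example, the method is therefore tested on 11 days; a larger *Y* leaves fewer test days for the same data set.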

### 4.2 Evaluation Metrics.

The performance of the evaluated baseline methods is quantified by the following metrics.

#### 4.2.1 Coefficient of Variation.

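The coefficient of variation (CV) of the root mean square error measures the accuracy of a baseline method, with lower values indicating higher accuracy. For GEB *i* on day *d*, the CV is expressed as follows:

$$\mathrm{CV}_i(d)=\frac{\sqrt{\frac{1}{|T_i^{\mathrm{DR}}(d)|}\sum_{t\in T_i^{\mathrm{DR}}(d)}\left[\hat{p}_i(d,t)-p_i(d,t)\right]^2}}{\frac{1}{|T_i^{\mathrm{DR}}(d)|}\sum_{t\in T_i^{\mathrm{DR}}(d)}p_i(d,t)}$$

where $T_i^{\mathrm{DR}}(d)$ is the set of time-steps within the assumed DR event window.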

#### 4.2.2 Normalized Mean Bias Error.

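The normalized mean bias error (NMBE) measures the bias of a baseline method, with positive values indicating that the baseline overestimates the load. For GEB *i* on day *d*, the NMBE is expressed as follows:

$$\mathrm{NMBE}_i(d)=\frac{\sum_{t\in T_i^{\mathrm{DR}}(d)}\left[\hat{p}_i(d,t)-p_i(d,t)\right]}{\sum_{t\in T_i^{\mathrm{DR}}(d)}p_i(d,t)}$$

The average CV and NMBE of a baseline method for GEB *i* are expressed as follows:

$$\overline{\mathrm{CV}}_i=\frac{1}{N_i}\sum_{d\in E_i}\mathrm{CV}_i(d),\qquad\overline{\mathrm{NMBE}}_i=\frac{1}{N_i}\sum_{d\in E_i}\mathrm{NMBE}_i(d)$$

where $E_i$ is the set of days on which the method is tested for GEB *i* and $N_i$ is the number of those days [31,40,41]. We also present box plots of the metrics enabling visualization of the error performance statistics of each method.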
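For concreteness, the two metrics can be sketched in Python using their standard definitions: the CV as the root mean square error normalized by the mean measured load, and the NMBE as the summed error normalized by the summed measured load. The sign convention (estimate minus measurement, so that an overestimating baseline has a positive NMBE) and the function names are assumptions of this sketch; the inputs are assumed to be restricted to the time-steps of one event window.

```python
from math import sqrt


def cv_rmse(estimate, actual):
    """Root mean square error divided by the mean measured load."""
    n = len(actual)
    rmse = sqrt(sum((e - a) ** 2 for e, a in zip(estimate, actual)) / n)
    return rmse / (sum(actual) / n)


def nmbe(estimate, actual):
    """Summed error (estimate minus measurement) over summed measured load."""
    return sum(e - a for e, a in zip(estimate, actual)) / sum(actual)


# Toy event window with four time-steps (kW).
actual = [10.0, 10.0, 10.0, 10.0]
estimate = [11.0, 9.0, 11.0, 11.0]
print(cv_rmse(estimate, actual), nmbe(estimate, actual))  # 0.1 0.05
```

The example shows why both metrics are reported: the errors partly cancel in the NMBE (5%), while the CV (10%) still reflects their magnitude.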

### 4.3 Limitations of the Evaluation Process.

Our error evaluation process may underestimate the true baseline method error because we have assumed that we know the length of the settling window exactly. Specifically, we assume the HVAC system settles back to its baseline operation an hour after the DR event and use that time frame for our error assessment. However, in practice, the settling time is uncertain and unknown. Therefore, our assumption will affect the accuracy of our error assessment. Moreover, the practical implementation of some baseline methods including the linear interpolation method depends on an estimate of the settling time, which introduces additional error. Our error assessment does not capture that. We leave the estimation of the settling time and how its error impacts the baseline error for future investigation.

## 5 Data

We have installed current sensors in three buildings on the University of Michigan campus to submeter HVAC supply and return fans. The three buildings are the Bob and Betty Beyster Building (BBB), the Rackham Building (RAC), and Weill Hall (WH). BBB is a 104,132 ft$^2$ classroom/office building constructed in 2005; RAC is a 157,957 ft$^2$ office/auditorium building constructed in 1938; and WH is a 97,989 ft$^2$ classroom/office building constructed in 2006 [32,33]. All three buildings have single-duct variable air volume HVAC systems.

We use data from the summers of 2017 and 2018, specifically, 1-min interval single-phase current measurements of each HVAC system fan in each building. As the voltages and power factors generally have only small variations when the buildings are occupied, we assume constant power factors (0.95 for supply fans and 0.99 for return fans) and voltage (275.8 V), which were determined using 1 week of measured voltage and power factor data, and use these values to estimate the three-phase fan power [32,33]. The data are separated into five data sets corresponding to five building-years, i.e., BBB-2017, RAC-2017, WH-2017, BBB-2018, and RAC-2018. Table 2 summarizes the five data sets used in our evaluation. This study evaluates the baseline methods on each data set separately. In future work, we aim to evaluate how utilizing multiple years of building data and data from other buildings with similar characteristics could improve baseline estimation.
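The conversion from measured current to fan power is not spelled out here. One plausible reading, sketched below, assumes balanced three-phase fans and treats the 275.8 V figure as a line-to-neutral voltage (it is close to the 277 V nominal of a 480Y/277 V system), so that power is 3 × voltage × current × power factor. Both assumptions, the function name, and the 10 A example current are hypothetical; the `line_to_neutral` flag switches to the √3 form for a line-to-line voltage.

```python
from math import sqrt


def three_phase_power_kw(current_a, voltage_v, power_factor,
                         line_to_neutral=True):
    """Estimate three-phase real power (kW) from one phase current.

    Assumes a balanced load, so every phase carries current_a.
    voltage_v is read as line-to-neutral by default (an assumption);
    set line_to_neutral=False if it is a line-to-line voltage.
    """
    factor = 3.0 if line_to_neutral else sqrt(3.0)
    return factor * voltage_v * current_a * power_factor / 1000.0


# Hypothetical 10 A reading on a supply fan (power factor 0.95).
supply_kw = three_phase_power_kw(10.0, voltage_v=275.8, power_factor=0.95)
print(round(supply_kw, 4))
```

Under the constant-voltage and constant-power-factor assumptions, the estimated power is simply proportional to the measured current, so any error in the assumed constants scales the whole profile rather than changing its shape.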

Table 2: Summary of the five data sets used in the evaluation

| Data set | # of fans (SF: supply fan; RF: return fan) | # of baseline days | Peak total fan power in occupied mode (kW) | Average total fan power in occupied mode (kW) |
|---|---|---|---|---|
| BBB-2017 | 1 SF, 1 RF | 55 (June to October) | 35.8 | 12.2 |
| BBB-2018 | 4 SFs, 3 RFs | 16 (October) | 105.3 | 38.3 |
| RAC-2017 | 4 SFs, 4 RFs | 49 (July to October) | 63.6 | 18.7 |
| RAC-2018 | 4 SFs, 4 RFs | 30 (May to October) | 63.0 | 24.7 |
| WH-2017 | 2 SFs, 2 RFs | 86 (June to October) | 125.9 | 45.7 |
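The conversion from measured single-phase current to three-phase fan power described in this section can be sketched as follows. This is a minimal illustration, not the exact procedure of Refs. [32,33]: it assumes that the reported 275.8 V is a line-to-neutral (phase) voltage and that the three phases are balanced, so the total power is three times the per-phase power. The function names and the example currents are hypothetical.

```python
# Constants reported in the text. Treating 275.8 V as the line-to-neutral
# (phase) voltage is an assumption of this sketch.
PHASE_VOLTAGE_V = 275.8
POWER_FACTOR = {"supply": 0.95, "return": 0.99}


def fan_power_kw(phase_current_a: float, fan_type: str) -> float:
    """Estimate three-phase real power (kW) from one phase's current (A).

    Assumes balanced phases: total power = 3 * V_phase * I_phase * PF.
    """
    pf = POWER_FACTOR[fan_type]
    return 3.0 * PHASE_VOLTAGE_V * phase_current_a * pf / 1000.0


def total_fan_power_kw(fans: list[tuple[str, float]]) -> float:
    """Sum the estimated power of all fans at one time step.

    fans: (fan_type, phase_current_a) pairs, one per submetered fan.
    """
    return sum(fan_power_kw(current, fan_type) for fan_type, current in fans)


# Hypothetical currents for one supply fan and one return fan.
total = total_fan_power_kw([("supply", 20.0), ("return", 10.0)])
print(f"{total:.2f} kW")
```

Applying `total_fan_power_kw` to every minute of current data would give a total fan power profile of the kind that the baseline methods are evaluated on.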


## 6 Results and Discussion

In this section, numerical results quantifying the performance of the baseline methods are presented and discussed. The baseline methods are evaluated on the data of each building-year.

### 6.1 Overall Performance.

Figure 3 compares the overall performance of the averaging and linear interpolation (Lin. intrpl.) methods. The left plot shows the results of the averaging methods without the additive adjustment, and the right plot shows the results with the additive adjustment. The CV is the mean of ten average CV values corresponding to five building-years and two DR event windows, and likewise for the NMBE. That is, although the evaluation data sets of different building-years have different numbers of days that a baseline method is tested on, the average CV or NMBE for each building-year has the same weight in assessing the method’s overall performance. Table 3 lists the number of days that each baseline method is tested on for each data set. The numbers (i.e., *N*_{i}) in Table 3 are also used to calculate the confidence intervals presented in Sec. 6.2.

Table 3: Number of days each baseline method is tested on for each data set

| Baseline method | BBB-2017 | RAC-2017 | WH-2017 | BBB-2018 | RAC-2018 |
|---|---|---|---|---|---|
| 5-day average | 50 | 44 | 81 | 11 | 25 |
| 10-day average | 45 | 39 | 76 | 6 | 20 |
| High4of5 | 50 | 44 | 81 | 11 | 25 |
| High5of10 | 45 | 39 | 76 | 6 | 20 |
| Mid4of6 | 49 | 43 | 80 | 10 | 24 |
| Low4of5 | 50 | 44 | 81 | 11 | 25 |
| Low5of10 | 45 | 39 | 76 | 6 | 20 |
| Nearest3of6 | 49 | 43 | 80 | 10 | 24 |
| Nearest5of10 | 45 | 39 | 76 | 6 | 20 |
| Linear interpolation | 55 | 49 | 86 | 16 | 30 |
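The equal-weight averaging described above can be illustrated with a short sketch. It takes the ten average CV values of the linear interpolation method reported in Tables 4 and 5 (five building-years, two DR event windows) and averages them without weighting by the number of tested days. The variable and function names are hypothetical, and the printed value is this sketch's arithmetic rather than a number read from Fig. 3.

```python
from statistics import mean

# Average CV (%) of the linear interpolation method, copied from Tables 4 and 5
# (one value per building-year and DR event window).
lin_interp_cv = {
    "Table 4": {"BBB-2017": 14.47, "RAC-2017": 5.95, "WH-2017": 12.60,
                "BBB-2018": 9.68, "RAC-2018": 6.48},
    "Table 5": {"BBB-2017": 12.38, "RAC-2017": 3.90, "WH-2017": 9.20,
                "BBB-2018": 7.21, "RAC-2018": 3.25},
}


def overall_metric(per_window: dict[str, dict[str, float]]) -> float:
    """Unweighted mean over all building-year and event-window averages."""
    values = [v for window in per_window.values() for v in window.values()]
    return mean(values)


overall_cv = overall_metric(lin_interp_cv)
print(round(overall_cv, 2))  # 8.51
```

Because every building-year contributes one value per window, a data set with few tested days (e.g., BBB-2018) influences the overall figure as much as one with many (e.g., WH-2017).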


Figure 3 shows that the additive adjustment greatly improves the performance of the averaging baseline methods. Specifically, the CV and the NMBE magnitude of all averaging methods decrease, except for the NMBE of the Nearest5of10 average method, whose magnitude increases slightly (from $0.86\%$ to $-1.36\%$). The figure also shows that the Nearest3of6 and Nearest5of10 average methods perform well in selecting the baseline days used to compute the averages. Specifically, from the left plot (without the additive adjustment), we can see that they have the smallest CV values (i.e., the highest accuracy) and low NMBE values (i.e., small bias) as compared to the other averaging methods. However, from the right plot, we see that they are no longer the best averaging methods when the adjustment is applied. In the future, it will be worth investigating how the NearestXofY average method can be appropriately adjusted and improved.

In general, the linear interpolation method is the most accurate (i.e., it has the lowest CV). The Low4of5 and 5-day average methods with additive adjustments and the linear interpolation method are the best in terms of NMBE, which, as explained earlier, is a better metric than CV for most DR applications. According to guidelines by the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) [42], it is much more difficult to achieve a lower CV than a lower NMBE. Therefore, the linear interpolation method generally has the best performance in our evaluation. Furthermore, as will be shown later in this section, when tested on different building-years and DR event windows, the linear interpolation method always attains the smallest confidence interval for the average NMBE, indicating that it is the most stable method in terms of bias variability. However, we note that these results may be specific to our setting; in particular, we assume short (i.e., 2 h) DR event windows. The relative performance of the linear interpolation method is likely a function of the event window duration, and more bias may be present for longer windows.

### 6.2 Different Building-Years and Demand Response Event Windows.

In Tables 4–7, we report the average values and 95% confidence intervals associated with CV and NMBE for each baseline method (averaging methods were implemented with the additive adjustment), building-year, and DR event window. In each column of Tables 4–7, the smallest average value and smallest confidence interval are given in bold. The smallest average CV (respectively, the smallest average NMBE) corresponds to the method with the highest accuracy (respectively, the lowest bias) on average, and the smallest confidence interval corresponds to the method with the most consistent accuracy or bias across different days in the data set.
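To make the table entries concrete, the sketch below shows one way a per-day CV and NMBE could be computed and then summarized as an average with a 95% confidence interval over the tested days. It rests on several assumptions: it uses common ASHRAE-style definitions (RMSE and mean error normalized by the mean actual power, with no degrees-of-freedom correction), it takes the error as actual minus estimate, and it uses a normal approximation (z = 1.96) for the interval. The definitions given earlier in the article take precedence wherever they differ, and all function names are hypothetical.

```python
import math
from statistics import mean, stdev


def cv_percent(actual: list[float], estimate: list[float]) -> float:
    """RMSE of the baseline estimate, as a percentage of mean actual power."""
    n = len(actual)
    rmse = math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, estimate)) / n)
    return 100.0 * rmse / mean(actual)


def nmbe_percent(actual: list[float], estimate: list[float]) -> float:
    """Mean error (actual minus estimate, an assumed sign convention),
    as a percentage of mean actual power."""
    n = len(actual)
    bias = sum(a - e for a, e in zip(actual, estimate)) / n
    return 100.0 * bias / mean(actual)


def mean_with_ci95(daily_values: list[float]) -> tuple[float, float]:
    """Return (average, half-width of the 95% interval) over tested days.

    Uses z = 1.96, a normal approximation assumed for this sketch.
    """
    n = len(daily_values)
    half_width = 1.96 * stdev(daily_values) / math.sqrt(n)
    return mean(daily_values), half_width


# Toy example: four samples of actual power and a flat baseline estimate.
actual = [10.0, 12.0, 11.0, 13.0]
estimate = [11.0, 11.0, 11.0, 11.0]
print(cv_percent(actual, estimate), nmbe_percent(actual, estimate))
```

In this reading, each table entry would be `mean_with_ci95` applied to the per-day CV or NMBE values of one method on one building-year and event window, with the number of days given in Table 3.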

Table 4: Average CV (%) and 95% confidence interval for each baseline method and building-year (morning DR event window)

| Baseline method | BBB-2017 | RAC-2017 | WH-2017 | BBB-2018 | RAC-2018 |
|---|---|---|---|---|---|
| 5-day average | 21.42 ± 2.32 | 13.31 ± 3.24 | 20.50 ± 3.34 | 13.97 ± 3.32 | 20.24 ± 7.04 |
| 10-day average | 21.48 ± 2.50 | 13.36 ± 2.95 | 19.74 ± 3.03 | 16.00 ± 4.31 | 18.56 ± 7.51 |
| High4of5 | 21.99 ± 2.31 | 14.19 ± 3.85 | 21.47 ± 3.47 | 14.66 ± 4.05 | 21.73 ± 7.23 |
| High5of10 | 23.41 ± 2.60 | 18.14 ± 4.12 | 22.78 ± 3.41 | 16.86 ± 6.02 | 17.61 ± 6.37 |
| Mid4of6 | 22.39 ± 2.51 | 13.66 ± 3.09 | 20.30 ± 3.04 | 15.07 ± 3.73 | 20.11 ± 7.57 |
| Low4of5 | 22.18 ± 2.42 | 13.06 ± 2.99 | 19.89 ± 3.08 | 14.29 ± 3.27 | 18.86 ± 6.76 |
| Low5of10 | 22.68 ± 2.80 | 12.39 ± 2.87 | 18.88 ± 2.98 | 15.96 ± 3.29 | 20.82 ± 9.02 |
| Nearest3of6 | 22.23 ± 2.27 | 12.39 ± 2.77 | 19.45 ± 3.50 | 16.26 ± 4.59 | 19.27 ± 7.52 |
| Nearest5of10 | 21.30 ± 2.13 | 11.98 ± 2.92 | 17.61 ± 2.83 | 15.72 ± 6.42 | 18.34 ± 7.78 |
| Linear interpolation | 14.47 ± 0.48 | 5.95 ± 1.33 | 12.60 ± 1.43 | 9.68 ± 0.69 | 6.48 ± 3.64 |


Table 5: Average CV (%) and 95% confidence interval for each baseline method and building-year (afternoon DR event window)

| Baseline method | BBB-2017 | RAC-2017 | WH-2017 | BBB-2018 | RAC-2018 |
|---|---|---|---|---|---|
| 5-day average | 14.08 ± 0.89 | 7.09 ± 1.58 | 15.26 ± 1.60 | 10.24 ± 1.50 | 8.68 ± 2.29 |
| 10-day average | 14.17 ± 0.98 | 6.85 ± 1.66 | 14.49 ± 1.54 | 10.17 ± 1.16 | 8.15 ± 1.99 |
| High4of5 | 14.26 ± 0.91 | 7.42 ± 1.60 | 16.34 ± 1.75 | 10.86 ± 1.58 | 9.10 ± 2.50 |
| High5of10 | 14.79 ± 0.99 | 8.65 ± 1.82 | 17.40 ± 2.05 | 10.69 ± 1.20 | 9.88 ± 2.42 |
| Mid4of6 | 14.38 ± 0.91 | 6.80 ± 1.53 | 16.70 ± 1.76 | 11.22 ± 1.66 | 7.55 ± 2.04 |
| Low4of5 | 14.39 ± 0.93 | 6.61 ± 1.54 | 15.60 ± 1.69 | 10.41 ± 1.46 | 6.83 ± 1.80 |
| Low5of10 | 14.81 ± 1.01 | 6.09 ± 1.55 | 13.31 ± 1.60 | 10.29 ± 1.51 | 7.27 ± 1.81 |
| Nearest3of6 | 14.68 ± 0.93 | 6.39 ± 1.44 | 14.82 ± 1.60 | 10.89 ± 1.45 | 8.35 ± 1.76 |
| Nearest5of10 | 14.39 ± 1.00 | 6.10 ± 1.67 | 14.07 ± 1.58 | 10.66 ± 1.13 | 7.71 ± 2.37 |
| Linear interpolation | 12.38 ± 0.52 | 3.90 ± 0.35 | 9.20 ± 1.29 | 7.21 ± 0.91 | 3.25 ± 0.55 |


Table 6: Average NMBE (%) and 95% confidence interval for each baseline method and building-year (morning DR event window)

| Baseline method | BBB-2017 | RAC-2017 | WH-2017 | BBB-2018 | RAC-2018 |
|---|---|---|---|---|---|
| 5-day average | −1.02 ± 4.82 | 0.28 ± 4.42 | −0.93 ± 5.14 | −5.78 ± 6.50 | 0.50 ± 9.97 |
| 10-day average | −2.81 ± 5.24 | 0.50 ± 4.44 | −2.50 ± 4.91 | −12.13 ± 5.30 | 1.93 ± 10.39 |
| High4of5 | −3.89 ± 4.82 | 0.54 ± 4.98 | −1.74 ± 5.38 | −7.06 ± 6.91 | 0.27 ± 10.51 |
| High5of10 | −9.45 ± 5.15 | 3.33 ± 6.26 | −4.02 ± 5.65 | −13.06 ± 6.99 | 0.59 ± 9.08 |
| Mid4of6 | −2.34 ± 5.18 | 0.18 ± 4.47 | −1.72 ± 4.91 | −9.57 ± 5.63 | −0.29 ± 10.42 |
| Low4of5 | 0.88 ± 5.06 | 0.05 ± 4.17 | −0.18 ± 4.86 | −5.92 ± 6.53 | 0.68 ± 9.37 |
| Low5of10 | 3.84 ± 5.61 | −2.34 ± 4.04 | −0.99 ± 4.70 | −11.19 ± 5.08 | 3.27 ± 12.22 |
| Nearest3of6 | −1.08 ± 4.90 | −0.76 ± 3.91 | −1.58 ± 5.04 | −9.31 ± 7.32 | 0.88 ± 10.23 |
| Nearest5of10 | −1.66 ± 4.88 | −2.23 ± 3.98 | −2.55 ± 4.33 | −10.67 ± 7.95 | 2.41 ± 10.49 |
| Linear interpolation | 1.58 ± 1.17 | −1.23 ± 1.60 | −4.25 ± 2.11 | 2.21 ± 1.46 | 1.46 ± 3.31 |


Table 7: Average NMBE (%) and 95% confidence interval for each baseline method and building-year (afternoon DR event window)

| Baseline method | BBB-2017 | RAC-2017 | WH-2017 | BBB-2018 | RAC-2018 |
|---|---|---|---|---|---|
| 5-day average | −0.42 ± 1.97 | −0.91 ± 2.36 | 1.02 ± 2.79 | 1.61 ± 3.90 | 2.07 ± 3.80 |
| 10-day average | −0.40 ± 2.32 | −1.66 ± 2.39 | 1.24 ± 2.73 | 0.64 ± 5.36 | −0.21 ± 3.91 |
| High4of5 | 0.01 ± 1.98 | −1.39 ± 2.43 | 1.78 ± 3.03 | 2.93 ± 4.13 | 1.63 ± 4.06 |
| High5of10 | 1.29 ± 2.37 | −3.84 ± 2.76 | 3.72 ± 3.30 | 2.20 ± 5.33 | −2.45 ± 4.66 |
| Mid4of6 | −0.01 ± 2.07 | −1.42 ± 2.23 | 2.92 ± 3.11 | 2.90 ± 4.41 | −0.68 ± 3.38 |
| Low4of5 | −0.71 ± 2.05 | −0.85 ± 2.20 | 1.80 ± 2.90 | 1.86 ± 3.88 | 0.41 ± 2.96 |
| Low5of10 | −2.08 ± 2.40 | 0.53 ± 2.15 | −1.25 ± 2.54 | −0.93 ± 5.40 | 2.04 ± 3.34 |
| Nearest3of6 | −0.17 ± 2.07 | −0.07 ± 2.08 | 0.39 ± 2.69 | 1.95 ± 3.92 | 0.75 ± 3.55 |
| Nearest5of10 | 0.04 ± 2.26 | −0.15 ± 2.24 | −0.02 ± 2.65 | 1.18 ± 5.50 | 0.07 ± 3.90 |
| Linear interpolation | −3.13 ± 1.02 | −0.16 ± 0.53 | −2.36 ± 1.30 | −0.12 ± 1.09 | −0.20 ± 0.93 |


As shown in Tables 4 and 5, the linear interpolation method has the smallest average CV and smallest confidence interval for all building-years and both event windows. According to the ASHRAE guidelines [42], the suggested acceptable maximum CV is $30\%$ when using hourly data and $15\%$ when using monthly data. Here, with minutely data, we find the average CV of the linear interpolation method is smaller than $15\%$ in all cases. In contrast, the averaging baseline methods have average CV values larger than $15\%$ in many cases and even larger than $20\%$ in some cases.

As for the bias reported in Tables 6 and 7, none of the baseline methods consistently attains the smallest average NMBE. Nevertheless, the linear interpolation method always has the smallest confidence interval. This indicates that the linear interpolation method has the least variable bias across different days in the data set. The performance of the linear interpolation method is also consistent across different data sets. According to the ASHRAE guidelines [42], the suggested acceptable maximum NMBE is $\pm 10\%$ when using hourly data and $\pm 5\%$ when using monthly data. Here, with minutely data, we find the average NMBE of the linear interpolation method is within $\pm 5\%$ in all cases, while the other methods have average NMBE values outside $\pm 5\%$ or even outside $\pm 10\%$ in some cases.

The average CV, average NMBE, and their confidence intervals are generally smaller in the afternoon event window. That is, the baseline methods perform better in baselining afternoon DR events. The reason may be that the total fan power profile is more stable in the afternoon, while it is more volatile in the morning.

### 6.3 Example Time Series Plots.

To give more intuition for the results, Fig. 9 shows time-series plots of the actual total fan power of the WH building on Aug. 2, 2017, and baselines estimated by the implemented methods, where the averaging methods use additive adjustments. Specifically, the upper-left and upper-right plots show the actual and estimated fan power curves during the morning event window and the lower-left and lower-right plots show similar results for the afternoon event window. For clarity, each plot only includes five estimated baselines. For the same example, Fig. 10 shows the time-series errors of three baseline methods: the linear interpolation method, the Low4of5 average method, and the 5-day average method, which generally have the best overall performance as discussed in Sec. 6.1. However, they are not necessarily the best for this specific example.

As shown in Figs. 9 and 10, the linear interpolation method has the best performance. The averaging methods have relatively larger errors in this case, though the CV and NMBE of the High4of5 average method baseline for the morning window are $8.40\%$ and $1.26\%$, respectively, which are comparable to those of the linear interpolation baseline.

In general, we have found that averaging methods do not perform well in baselining fan power data, as it is difficult for them to precisely capture the minute-scale variation in HVAC fan power. As shown in Fig. 9, the actual baseline fan power is highly volatile, and while the averaging methods produce volatile estimates, they do not usually align with the actual load. In contrast, the linear interpolation method does not try to estimate the minute-scale variation of the HVAC fan power. It assumes that the HVAC fan power trend is approximately linear and expects the positive and negative errors to balance out over time. It is appropriate to use such a method when we care about the average response over the event (e.g., for overall impact assessment and financial settlement) instead of the accuracy of instantaneous estimates. As shown in Figs. 9 and 10, the linear interpolation method captures the hourly trend and its minute-scale positive and negative errors somewhat balance out over the DR event window. In many cases, the actual baseline fan power generally follows a linear trend in the short term, also helping explain the result that the linear interpolation method is generally best.
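The idea behind the linear interpolation baseline can be sketched as follows: fit a straight line by least squares to fan power samples taken before the DR event and after the system has settled, then evaluate that line inside the event window. This is only an illustration of the principle and not the exact formulation of the method evaluated here. The 2 h event and 1 h settling time follow the text, while the lengths of the pre-event and post-settling fitting windows (`pre_len`, `post_len`) and all function names are hypothetical.

```python
def fit_line(times: list[int], values: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit of values = slope * t + intercept."""
    n = len(times)
    t_bar = sum(times) / n
    v_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_bar - slope * t_bar


def linear_interpolation_baseline(power, event_start, event_end,
                                  pre_len=60, settle_len=60, post_len=60):
    """Estimate baseline power for minutes event_start .. event_end - 1.

    power: minutely total fan power for the DR day, indexed by minute.
    Samples during the event and the settling period are excluded from the fit.
    """
    pre = range(event_start - pre_len, event_start)
    post_start = event_end + settle_len
    post = range(post_start, post_start + post_len)
    fit_times = list(pre) + list(post)
    fit_values = [power[t] for t in fit_times]
    slope, intercept = fit_line(fit_times, fit_values)
    return [slope * t + intercept for t in range(event_start, event_end)]


# Toy day: a linear trend, with the event window (minutes 120-239) perturbed.
power = [10.0 + 0.1 * t for t in range(400)]
for t in range(120, 240):
    power[t] = 0.0
baseline = linear_interpolation_baseline(power, event_start=120, event_end=240)
```

In the toy example the fit recovers the underlying trend exactly because the unperturbed data are perfectly linear. With real minutely data the line only tracks the trend, and the positive and negative minute-scale errors are expected to partly cancel over the event.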

Another reason that the linear interpolation method generally performs better than the averaging methods is its more effective use of a posteriori knowledge. The HighXofY, MidXofY, LowXofY, and NearestXofY methods use the daily electricity consumption on the DR day in selecting baseline days included in the baseline calculation. However, as evidenced by the results, the daily electricity consumption might not be a sufficiently effective indicator of short-term fan power. In contrast, the data from a short window (i.e., 2 h) before the DR event, which the additive adjustment uses, prove to be an effective indicator and greatly improve the performance of the averaging methods. The linear interpolation method uses data both before and after the DR event window, which turn out to be effective indicators of the baseline fan power.
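For comparison, an averaging baseline with an additive adjustment can be sketched as below, using HighXofY as the example. The sketch assumes that the X days are chosen by their total daily consumption and that the adjustment shifts the averaged profile by the mean difference between the DR day and the averaged profile over the 2 h before the event. The precise selection and adjustment rules are defined earlier in the article and may differ in detail, and the function names are hypothetical.

```python
def high_x_of_y_profile(day_profiles: list[list[float]], x: int) -> list[float]:
    """Average the x profiles with the highest daily consumption.

    day_profiles: power profiles of the Y candidate baseline days,
    all sampled on the same time grid.
    """
    ranked = sorted(day_profiles, key=sum, reverse=True)[:x]
    n = len(ranked[0])
    return [sum(profile[t] for profile in ranked) / x for t in range(n)]


def additive_adjustment(avg_profile: list[float], dr_day_power: list[float],
                        event_start: int, adj_len: int = 120) -> list[float]:
    """Shift the averaged profile so that it matches the DR day, on average,
    over the adj_len samples immediately before the event."""
    window = range(event_start - adj_len, event_start)
    offset = sum(dr_day_power[t] - avg_profile[t] for t in window) / adj_len
    return [p + offset for p in avg_profile]


# Toy example: five flat candidate days and a flat DR day, six samples each.
days = [[10.0] * 6, [20.0] * 6, [30.0] * 6, [40.0] * 6, [50.0] * 6]
avg = high_x_of_y_profile(days, x=4)           # High4of5
adjusted = additive_adjustment(avg, [38.0] * 6, event_start=4, adj_len=2)
```

The adjustment only changes the level of the averaged profile. Its minute-scale shape still comes from other days, which is consistent with the observation that the adjusted averages remain volatile without aligning with the actual load.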

### 6.4 Practical Implications.

While HVAC fan power control represents an important DR source and submetering fan power data can improve baselining [7,14], the averaging baseline methods commonly used by utilities and ISOs do not produce accurate fan power baseline estimates. In general, such commonly used methods are outperformed by the simple linear interpolation method. As a result, building owners and operators aiming to utilize HVAC fans to provide DR and other grid services may benefit from adopting this simple method over traditional baselining methods.

## 7 Conclusions and Future Work

In this article, a variety of baseline methods were evaluated on building HVAC fan power data. Our numerical results show that, with an additive adjustment and assuming the building takes no anticipatory actions before the DR event, averaging methods work well for baselining fan power in some cases. Nevertheless, their performance is not consistent across all cases. The simple linear interpolation method generally has the best performance. In particular, it has a low bias and by far the highest accuracy on average. For DR applications such as analyzing the overall impacts of DR actions or financial settlement, the linear interpolation method, Low4of5 average method, and 5-day average method are the best baseline methods among the methods we tested as they have the smallest levels of bias on average. However, the linear interpolation method may be preferable for two reasons. First, as mentioned, it has low bias and the highest accuracy on average. Second, for every building-year in both DR event windows, it has the smallest CV and NMBE confidence intervals, indicating that its performance is more stable than that of the other methods.

In future work, we aim to collect HVAC fan power data sets covering more diverse geographical areas with different climates, and to further evaluate the existing baseline methods and validate our results using such extended data sets, which have or can be processed to have different temporal granularities. We also plan to explore the applicability of other baseline methods. While in Ref. [30], we found that regression methods using outdoor air temperature as the main explanatory variable are inappropriate for baseline estimation of total fan power, some time-series methods, e.g., Refs. [43,44], and machine learning methods, e.g., Refs. [45,46], may be applicable. It also may be possible to take advantage of fan power data from individual fans rather than using only total fan power data. Specifically, it is possible that more granular data can be utilized to obtain fan power patterns that are consistent among different fans and over different days, in turn improving our estimates of the total fan power baseline. In this regard, it is worth exploring methods such as tensor decomposition [31,47,48], which is capable of high-dimensional data mining and analysis. A further research topic is exploring how these methods could be leveraged to improve the linear interpolation method.

Finally, our results have implications for enabling deeper participation of commercial buildings in grid services. The widespread use of DR strategies that control building HVAC system fan power could provide much-needed quick response to grids with increasing penetrations of intermittent renewable energy sources. Through better baselining of HVAC fan power, individual GEBs, flexibility aggregators, and ISOs can better assess the capacity and participation of building HVAC systems to deliver grid services that improve the reliability, economics, and sustainability of power grids.

## Acknowledgment

This work was supported by the U.S. Department of Energy Building Technologies Office under the project *I-DREEM: Impact of Demand Response on short and long-term building Energy Efficiency Metrics* (Contract Number DE-AC02-76SF00515).

## Conflict of Interest

There are no conflicts of interest.

## Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.