Large-scale two-phase estimation of wood production by poplar plantations exploiting Sentinel-2 data as auxiliary information

Growing demand for wood products, combined with efforts to conserve natural forests, have supported a steady increase in the global extent of planted forests. Here, a two-phase sampling strategy for large-scale assessment of the total area and the total wood volume of fast-growing forest tree crops within agricultural land is presented. The first phase is performed using tessellation stratified sampling on high-resolution remotely sensed imagery and is sufficient for estimating the total area of plantations by means of a Monte Carlo integration estimator. The second phase is performed using stratified sampling of the plantations selected in the first phase and is aimed at estimating total wood volume by means of an approximation of the first-phase Horvitz-Thompson estimator. Vegetation indices from Sentinel-2 are exploited as freely available auxiliary information in a linear regression estimator to improve the design-based precision of the estimator based on the sole sample data. Estimators of the totals and of the design-based variances of total estimators are presented. A simulation study is developed in order to check the design-based performance of the two alternative estimators under several artificial distributions supposed for poplar plantations (random, clustered, spatially trended). An application in Northern Italy is also reported. The regression estimator turns out to be invariably better than that based on the sole sample information. Possible integrations of the proposed sampling scheme with conventional national forest inventories adopting tessellation stratified sampling in the first phase are discussed.


Introduction
Growing demand for wood products, combined with efforts to conserve natural forests, have supported a 65% increase in the global extent of planted trees since 1990 (FAO 2015). Tree plantations contributed approximately to half the industrial roundwood production in 2012 (Payn et al. 2015). As distinctively concerns fast-growing tree plantations within agricultural farms, characterized by relatively short rotations and an intrinsic high dynamism in relation to the temporal variability of farmland destinations for cultivation, there is the need for a frequent updating of accurate and spatially detailed statistical data about tree species composition, stand structure and wood supply attributes. Such needs may exceed the scope of conventional forest inventories, making room for ad-hoc information sources (Alam et al. 2014).
From a methodological perspective, we have approached the large-scale estimation of fastgrowing tree plantations by a two-phase probabilistic sampling scheme. Stratified or systematic schemes are usually adopted in the first phase of forest surveys to ensure a uniform sampling coverage over the survey area. Under this scenario, we have followed the proposal by Baffetta et al. (2011), adopting the tessellation stratified sampling (TSS) to select plantations in the first phase. TSS is based on covering the area by means of regular polygons of equal size, each of them containing at least a portion of the area, and then randomly selecting one point in each polygon. Plantations spotted by one or more sample points are selected. TSS choice is consistent with the trend of last years, in which TSS is becoming increasingly popular. Both Italian and U.S.D.A. Forest Services are currently adopting TSS in the first phase of their NFIs (Fattorini et al. 2006;Tomppo et al. 2010), while the Italian Ministry of Environment and Protection of Land and Sea adopts TSS in the first phase of the IUTI (from the Italian acronym of Inventario dell'Uso delle Terre d'Italia) land-use survey (Corona et al. 2012).
While the first phase -performed on screen using high-resolution remotely sensed imageryis sufficient to estimate the total area of plantations by means of a Monte Carlo integration estimator, a second phase is necessary to reduce the sample size, achieving a stratified sampling of plantations to be visited on the ground. Once wood volumes are recorded for the plantations selected in the second phase, a two-phase estimator of total wood volume is derived starting from an approximation of the first-phase Horvitz-Thompson estimator, once again proposed by Baffetta et al. (2011) and suitable when units (in this case, fast-growing tree plantations within agricultural land, with size usually of few hectares) have extents much smaller than the survey area. Finally, the use of a linear regression estimator is proposed to exploit vegetation indices from high-resolution satellite images as auxiliary information, for improving the design-based precision of the estimator based on the sole sample information.

First-phase sampling and estimation
Denote by U the population of N tree plantations in a delineated survey area A, of size | |.
A We superimposed a region of regular shape, say Q A ⊃ , of size | |, Q partitioned into R non-overlapping regular polygons Q Q R 1 ,..., of equal size (e.g. quadrats or hexagons) and such that Q A i z Ø for all i = 1,...,R. We aimed at estimating the total area | | P of the region P constituted by the union of the N plantations, henceforth referred to as the coverage. We performed estimation as Monte Carlo integration, i.e. for each point p Q ∈ we considered the surface in such a way that the coverage to be estimated was expressed as the integral | | ( ) P z p dp Q ³ Following the sampling scheme referred to as TSS (Stevens 1997;Barabesi and Franceschi 2011;Barabesi et al. 2012), we randomly located a point within each polygon and we recorded the surface value z at any sampled point exploiting satellite imagery or orthophotos. Then, we computed the standard Monte Carlo estimate of | | P (e.g. Gregoire and Valentine 2008, chapter 4) by means of where R(P) was the number of sample points falling in P, and r R P R P = ( ) was the fraction of sample points falling in P. We then estimated the design-based variance of (1) by means of as if the R points were randomly and independently selected throughout the whole region Q. Under TSS, the Monte Carlo estimator (1) is unbiased with variance that cannot be estimated unbiasedly, and the variance estimator (2) is conservative (Barabesi and Franceschi 2011;Barabesi et al. 2012). Finally, we computed a nominal 95% confidence interval by means of We also aimed at estimating the total wood volume of plantations where y j was the value of wood volume in the j-th plantation. Following Baffetta et al. (2011), we exploited TSS as a scheme for selecting plantations, in the sense that a plantation was selected when at least one of the R sample points falls inside it (Fig. 1). Therefore, if we were able to record the volume and the extent for each plantation in the set S of the n poplar plantations selected by TSS, then we could use the following estimator of total wood volume together with a conservative estimator of its design-based variance where | | a j was the size of the j-th plantation. Baffetta et al. (2011) derived estimator (3) as an approximation of the familiar Horvitz-Thompson estimator, also providing the conservative estimator of its variance (4). The approximation was proposed because the direct use of the Horvitz-Thompson estimator would require the cumbersome quantification of the sizes of the portions of the selected plantations lying in adjacent polygons. The Authors proved that the approximation (3) is suitable when the sizes of patches to be sampled are small compared with the size of polygons, in such a way that patches are likely to lie within a unique polygon. Therefore, apart the pedagogical representation of Fig. 1, the considered fast-growing tree plantations within agricultural farms rarely straddle quadrat boundaries for quadrats sufficiently large. Example of a population of tree plantations sampled by means of tessellation stratified sampling. The region Q is tessellated by 12 quadrats, the green ellipses represent plantations and the red points represent sample points. Plantations encircled in red are those selected in the first phase.

Second-phase sampling and estimation
The estimator of type (3) and its variance estimator (4) would require that each plantation selected in the first phase was visited on the ground, thus involving a prohibitive effort for S large. To reduce sampling effort, we performed a second phase of sampling in which a sub-sample of plantations was selected from S and subsequently visited. We selected the second-phase sample W S ⊂ by means of stratified sampling. We partitioned the first-phase sample S into L strata S ,...,S 1 L of size n 1 ,...,n L and from each stratum we selected m 1 ,...,m L plantations by means of simple random sampling without replacement (SRSWOR) (Fig. 2). Stratification can be performed on the basis of some criterion regarding plantation characteristics (e.g. tree species or clone, rotation age, etc.) or on the basis of spatial criteria. For example, strata can coincide with administrative districts in such a way that the uniform sampling coverage ensured in the first phase by TSS is maintained in the second phase. Regarding spatial stratification, it is worth noting that plantations lying in more than one spatial stratum must be assigned to one of them, in such a way to ensure a proper partition of the first-phase sample S without plantations belonging to more than one stratum.
Denoting by W S l ⊂ l the set of plantations selected in the second phase in the stratum l, we estimated T y by means of and we estimated its design-based sampling variance by means of Fig. 2. Example of the stratified second-phase sampling. The six plantations selected in the first phase are partitioned into three strata (yellow, green and light blue plantations) and the 50% is selected from each stratum. Plantations encircled in red are those selected in the second phase.
It should be noticed that the double summand in equation (5) was the secondphase estimator of the summand in (3), based on the second-phase stratified sample W. Moreover, the first term in (6) constitutes an unbiased estimator of the first-phase variance estimator (4), while the second term constitutes an unbiased estimator of the variance increase due to the second phase over the variance due to the first phase.
Finally, we computed a nominal 95% confidence interval by means of

Linear regression second-phase estimator
Denote by x j the value of a vegetation index derived from satellite imagery in the j-th plantation, available for each j ∈S. Assuming a linear relationship between the x j s and volume densities d j s, we exploited the auxiliary information provided by x j s adopting the second-phase simple regression estimator (Särndal et al. 1992, Chapter 6) where and Practically speaking, while the summand in the first-phase estimator (3) was estimated in equation (5) by means of the sole sample information collected in the second phase, in equation (2) (2)ˆ1 .96 ( ).
it was estimated by means of a regression estimator, exploiting the information provided by vegetation indexes. Therefore, in analogy with the variance estimator (6), we estimated the designbased sampling variance of the regression estimator (7) by means of where Clearly, the two terms in equation (8) have analogous meaning of the two terms in equation (6).
Finally, we computed a nominal 95% confidence interval by means of

Simulation study
We checked the design-based performance of the proposed estimators by means of a simulation performed on a set of artificial populations of poplar plantations. The density, the size, the wood volume and the values of the auxiliary variable approximately resembled the results of some real assessments on poplar plantations in Northern Italy (Mattioli et al. 2019). Apart for these characteristics, the first part of the simulation (i.e. the generation of populations in accordance with several spatial patterns and the first phase of sampling) followed the simulation performed by Baffetta et al. (2011), regarding woodlots and tree-rows. A quadrat of size 9 000 000 ha (side 300 km) was taken as the survey area. Populations of N = 20 000 poplar plantations (density of about 0.22 poplar plantations per 100 ha) were settled within the quadrat. For simplicity, the shape of poplar plantations was assumed to be circular. In order to consider several spatial patterns, the centers of plantations were distributed over the quadrat (a) completely at random, (b) in accordance with a clustered process in which 10 cluster centers were randomly distributed over the quadrat and for each cluster center 2000 centers were generated from a bivariate normal distribution centered at the cluster center and having independent marginal distributions with standard deviation equal to 25 km; (c) in accordance with a spatially-trended process in which the coordinates of centers were independent random variables of type ( ) 1 2 300 u U with U uniformly distributed on (0,1). The centers of poplar plantations falling outside the quadrat were discarded and newly generated. Once the centers were located on the quadrat, circular poplar plantations were determined around each center with size generated from a lognormal distribution with expectation 1.29 ha and standard deviation 1.05 ha, in order to obtain populations with an average size of about 6 ha. Poplar plantations smaller than 0.5 ha, or partially lying outside the quadrat or overlapping previously generated poplar plantations were discarded and newly generated (see Fig. 3). (2) (2)ˆ1 .96 ( ).
ySREG ySREG T V T  Fig. 3. Spatial distribution of populations of N = 20 000 circular poplar plantations artificially generated from random (a), clustered (b) and trended (c) spatial patterns.
For any poplar plantation, the wood volume was generated from the linear relationship where the ε j s were stochastic errors with zero expectation and standard deviation proportional to | | a j values. The coefficients α y and β y were chosen in such a way to give an average wood volume of about 900 m 3 (150 m 3 ha -1 ), while the proportionality constant of the error standard deviations was chosen in such a way to provide a correlation coefficient between wood volume and size of about 0.90. Tree volumes less than 0 were discarded and newly generated. The linear relationship was adopted to generate the vegetation indexes to be exploited as auxiliary information in the linear regression estimator. The coefficients α x and β x were chosen in such a way to give an average vegetation index of about 4.20, while the ε j s were stochastic errors with zero expectation and standard deviation such that to provide a correlation coefficient between wood density and vegetation index of about 0.60. Vegetation indexes less than 0 were discarded and newly generated.
In the first phase we simulated TSS by partitioning the quadrat into R = 90 000 quadrats of size 100 ha, in order to ensure, in accordance with Baffetta et al. (2011), that quadrat size was relevantly larger than plantation sizes (100 ha against an average plantation size of 6 ha). The coverage estimate was computed using Eq. 1, randomly selecting a point in each quadrat and counting the number of points falling within poplar plantations. The corresponding estimate of the variance was computed from Eq. 2. All the poplar plantations containing at least one of the 90 000 sampling points were included in the first-phase sample.
In the second phase we simulated stratified sampling by partitioning the quadrat into L = 5 strata of size 2 340 000 ha, 2 160 000 ha, 1 710 000 ha, 2 070 000 ha and 720 000 ha, respectively. The dimensions of the strata were chosen to resemble those of the five administrative regions in the Padan Plain of Northern Italy (see the case study). Within each stratum, the 3% of poplar plantations selected in the first phase were selected by SRSWOR. Then we estimated the total wood volumes by means of Eq. 5 and Eq. 7, while the corresponding estimates of their variances were achieved using Eq. 6 and Eq. 8, respectively. From the variance estimates, the corresponding estimates of relative standard errors were achieved by the ratio of the square root of the variance estimate to the parameter estimate. Finally, for each parameter estimate, the nominal 95% confidence interval was computed by the estimate plus and minus 1.96 times the square root of the variance estimate.
For each artificial population, sampling and estimation were replicated 10 000 times, achieving the Monte Carlo distributions of estimators (1), (5) and (7) and of the corresponding variance estimators (2), (6) and (8). From the Monte Carlo distributions, expectations and mean squared errors of estimators were achieved together with expectations of the relative standard error estimators (ERSEE). From expectation and mean squared error of each estimator the relative bias (RB) and relative root mean squared error (RRMSE) were achieved. Finally, for each parameter, the actual coverage of the nominal 95% confidence intervals (AC95) was empirically determined as the percentage of times the simulated intervals contained the true parameter value.

Case study
We adopted the developed strategy for assessing poplar plantations in Northern Italy. The survey area was composed by five administrative Regions across the Padan Plain: Piedmont, Lombardy, Emilia-Romagna, Veneto and Friuli-Venezia Giulia, for a total extent of about 98 000 square kilometers. In the first phase, we adopted the TSS scheme of IUTI permanent national land-use pure-panel survey (Corona et al. 2012). The scheme was based on a grid of 360 000 semi-kilometer quadrats covering the whole Italian territory and as many sampling points randomly selected within each quadrat. Sampling points were photo-interpreted by high-resolution airborne imagery available for the whole area for the years 2014, 2015 and 2016, with geometric resolution <50 cm, and subsequently checked to obtain an error-free photo-interpretation using Google Maps images available for late summer 2017. We focused on the five northern regions that were covered by R = 390 643 quadrats. In those quadrats, sampling points classified as poplar plantations were 1736 (see Fig. 4). We then adopted estimators (1) and (2) to estimate the plantation area and its relative standard error. In the second phase, a subset of 57 (about 3%) poplar plantations of the 1736 plantations selected in the first-phase sample was selected by means of stratified sampling with proportional allocation. Strata were represented by the five administrative Regions. The plantations selected in the second-phase sample were visited on the ground in May-July 2017. Each tree in each visited plantation was enumerated and the diameter at breast height and the tree height were recorded, along with the plantation age. We then reckoned the total wood volume of second-phase plantations as the sum of volumes of single trees that, in turn, were determined by means of allometric equations developed by Chiarabaglio and Coaloa (2002).
We estimated total wood volume on the basis of the sole sample information by means of Eq. 5. Sampling variance was estimated by means of Eq. 6. Vegetation indexes were computed from ESA Sentinel-2 satellite mission, a constellation of two twin heliosynchronous satellites which provide freely available multispectral bands for the entire globe with a spatial resolution ranging between 10 m and 20 m (depending on the spectral band) and a revisit time of 5 days. The capabilities of Sentinel-2 spectral bands and vegetation indexes for wood volume estimation in Mediterranean forest ecosystems has been investigated by  (Louis et al. 2016) was used to calculate the Bottom of Atmosphere reflectance (BOA). This BOA dataset was used to create a cloud free image of the survey area. To do this, the spectral value of pixels covered by clouds was recomputed as the mean of values of cloud free pixels in the selected time window. Then, seven vegetation indices (Table 1) were computed from Sentinel-2 data taking into account previous studies (Chrysafis et al. 2017;Puletti et al. 2018;Chrysafis et al. 2019). The Pearson's r correlation coefficient was used to investigate the relationships between the wood volume measured in the field and vegetation indices. Finally, the vegetation index with the best Pearson coefficient value was used as auxiliary information to estimate total wood volume by means of the regression estimator given by Eq. 7. Sampling variance was estimated by means of Eq. 8. Table 2 reports the percent values of RB, RRMSE, ERSEE, AC95 for each artificial population and each parameter. The coverage estimator proves to be unbiased and precise with RB always smaller than 0.06% and RRMSE invariably around 3%. The estimator of RRMSE turns out to be approximately unbiased and conservative, with a slight overestimation. The actual coverage of nominal 95% confidence intervals is always greater than the nominal level. Concerning the two alternative estimators of wood volume, they both result approximately unbiased, with RRMSE invariably below 10%. The RRMSE estimators are approximately unbiased, but except for the estimator based on the sole sample information under the clustered spatial pattern, they all provide moderate underestimation. As to the performance of nominal 95% confidence intervals, the actual coverage is always smaller than the nominal coverage of 95%, although it never turns out to be smaller than 92%. Finally, the regression estimator turns out to be invariably better than the estimator based on the sole sample information.

Results
Regarding the case study, the estimated area of poplar plantations in Northern Italy is 43 400 ha with an estimated relative error of 2.4% and nominal 95% confidence interval of 41 358-45 442. The estimate of total wood volume based on the sole sample information is 6 098 438 m 3 with an estimated relative error of 11.7% and nominal 95% confidence interval of 4 699 944-7 496 932, while the linear regression estimate is 5 867 700 m 3 with an estimated relative error of 10.7% and nominal 95% confidence interval of 4 637 126-7 098 274. The linear regression estimate of total wood volume was computed using the SR vegetation index from Sentinel-2 data, as among the tested vegetation indices SR has proved the best relationships with field wood volume measures (Table 3), albeit very similar correlation was found for the considered indices.

Discussion
We have developed a two-phase sampling strategy for large scale estimation of the total area and the total wood volume of fast-growing tree crops within agricultural farms. Regarding the scheme for locating points in the first phase, we adopted TSS instead of the most common systematic grid sampling (SGS), widely adopted in forest surveys over large scale (Tomppo et al. 2010). SGS as TSS involves covering the survey region by a grid of regular polygons of equal sizes, selecting a point at random in one polygon and the repeating it in the remaining polygons. From a practical point of view, SGS can be straightforwardly performed by a random shift of the grid that is superimposed onto the survey area, taking the nodes as sample points. For this reason, SGS has constituted a sort of standard design of wide application in environmental surveys (e.g. U.S. Environmental Protection Agency 2002). However, from a theoretical point of view, Barabesi and Franceschi (2011) argue about the superiority of TSS over SGS. While TSS invariably outperforms the benchmark scheme achieved by randomly and independently selecting point throughout the whole survey area (the so called URS from the acronym of uniform random sampling), in the presence of some spatial regularity, SGS may be even worse than URS. Moreover while TSS variance can be conservatively estimated in a very simple way using equation (2), no conclusion about the properties of (2) under SGS can be drawn. Finally, while TSS ensures the asymptotic normality of the resulting Monte Carlo estimator (1), normality cannot be proven under SGS.
Regarding the second-phase scheme for selecting and visiting a sub-sample of plantations selected in the first phase, the use of a stratified sample ensured the spreading of second-phase samples throughout plantation characteristics or throughout space or both, in accordance with second-phase stratification criteria. Moreover, the use of second-phase estimators (5) and (7) based on an approximation of the first-phase Horvitz-Thompson estimator (3) involved relevant simplifications in estimation without deteriorating the precision of the strategy when the size of the population units were much smaller than the polygon size (like in the considered case: few hectares against hundreds).
The strategy has been tested by a simulation study and applied in a case study on poplar plantations in Northern Italy. The linear regression estimator (7) turns out to be invariably better than the estimator based on the sole sample information (5), and, from a practical point of view, is proven fully feasible to be implemented and replicated, with reduced costs when exploiting freely available remotely sensed data like Sentinel-2 images.
From a practical point of view, it is important to stress that the first phase of the strategy proposed here coincides with the first phase performed in NFIs for those countries adopting TSS as first phase (e.g. Italy and USA). Therefore, in order to save time and resources it may be appealing to exploit the first phase of these inventories as the first phase of plantation surveys. In these cases, plantation surveys can benefit of the intensive sampling usually performed by NFIs in the first phase. In this sense, the proposed approach may represent a suitable reference for integrated NFI frameworks effectively supporting outside-forest resource monitoring and assessment.
The benefits of the strategy proposed here will be distinctively valuable for those countries and regions where short-rotation plantation forestry is well widespread, like e.g. France, Turkey, Iran, Spain, Brazil or China. Taking fixed the first-phase sample of plantations -possibly selected during a NFI -the proposed second-phase sampling strategy can allow to update information on a short time; it has the advantage of being rapid and simple to be carried out, being particularly suited for fast-growing plantations, which rapidly alternate on agricultural lands, making conventional 5-10 years periodic inventory information temporally obsolete.