Modelling Percentile Based Basal Area Weighted Diameter Distribution

In the percentile method, percentiles of the diameter distribution are predicted with a system of models. The continuous empirical diameter distribution function is then obtained by interpolating between the predicted percentile values. In Finland, the distribution is typically modelled as a basal-area weighted distribution, which is transformed to a traditional density function for applications. Earlier studies have noted that, when calculated from the basal-area weighted diameter distribution, the density function is decreasing in most stands, especially for Norway spruce. This behaviour is not supported by the data. In this paper, we investigate the reasons for the unsatisfactory performance and present possible solutions to the problem. Besides the predicted percentiles, the problems are due to implicit assumptions about the diameter distribution in the system. The effect of these assumptions can be lessened somewhat with simple ad hoc methods, such as adding new percentiles to the system. This approach does not, however, utilize all the available information in the estimation, namely the analytical relationships between basal area, stem number and diameter. Accounting for these relationships gives further possibilities for improving the results. The results show, however, that to achieve further improvements, the implicit assumptions should be made more realistic. Furthermore, height variation within stands seems to contribute substantially to the uncertainty of some forest characteristics, especially sawnwood volume.


Introduction
Diameter distribution is one of the most descriptive and important stand characteristics. However, in forestry practice the empirical diameter distribution is seldom measured. For example, in Finnish compartmentwise inventory, the growing stock is described by partly visually assessed stand characteristics, such as mean diameter and basal area, for each tree species. In applications, the diameter distribution is predicted with models. The predicted distribution is used to compute stand volume characteristics with treewise height and volume models and as a basis for tree growth predictions (e.g. Päivinen 1980).
In Finland, the basal-area weighted diameter distribution has been commonly used, since it can be easily scaled to the observed basal area, and basal area is the most important forest characteristic assessed in practical field inventories. Scaling the distribution to the observed basal area also ensures good estimates for the stand volume calculated from the distribution (e.g. Kangas and Maltamo 2000b). In applications, however, the basal-area weighted distribution is transformed to a frequency distribution. This is done in order to be able to utilize single-tree growth models, for example. Therefore, in addition to good estimates of volume, good estimates of the frequency distribution are also needed.
Borders et al. (1987) developed the percentile based diameter distribution prediction method. This method characterises an empirical distribution function with 12 percentiles defined with respect to the number of stems in a stand. The number of stems in the desired diameter classes was calculated by linear interpolation between the predicted percentiles. Maltamo et al. (2000) used the percentile based approach to predict irregular diameter distributions of stands in a natural state. Gobakken and Naesset (2005) and Maltamo et al. (2006) expanded the use of percentile based distributions to applications where diameter distributions are predicted by using airborne laser scanning data. Kangas and Maltamo (2000a) estimated percentile based basal-area weighted diameter distribution models for the three most common tree species in Finland. Two sets of models were estimated: one set with and another without the number of stems as a predictor. Transforming the basal-area weighted diameter distribution to the traditional frequency distribution has not, however, produced satisfactory results. The predicted frequency distribution has been decreasing in most stands, especially for spruce (Kangas and Maltamo 2000b, Bollandsås and Naesset 2007). This problem has been particularly evident for the model set without stem number as an independent variable.
Overall, the accuracy of percentile based diameter distributions has proved to be quite similar to that of diameter distributions based on probability distributions (e.g. Kangas and Maltamo 2000b). The accuracy depends more on the amount of information available from the stand than on the estimation method. Inclusion of the number of stems as a predictor improved the total volume and saw timber volume estimates for all species, but the improvements were especially large for the number of stems estimates obtained from the predicted distribution. This means that the predicted stem numbers are not exactly correct even when the stem number is known and used as an independent variable, but that knowing the stem number improves the behaviour of the model system. Thus, it can be assumed that the stem number carries useful information about the shape of the diameter distribution.
Stem number is not, however, measured in the usual field work, so the models including stem number as a predictor cannot be applied in most cases. Using a predicted stem number in the models does not improve the model behaviour: the important information in the stem number is evidently just the variation that cannot be explained with the basal area and mean diameter or other forest characteristics. Siipilehto (2006) presented an approach where a group of stand characteristics including stem number were predicted simultaneously. A measured value of any of these characteristics, e.g. basal area, can then be used to calibrate the estimate of stem number based on the correlations of the errors across the models. This kind of approach could possibly also improve the usability of a predicted (and calibrated) stem number in diameter distribution models.
It is, however, also possible to utilize stem number indirectly, by accounting for the analytical relationships between the basal-area weighted diameter distribution and the frequency distribution, i.e. between diameter, stem number and basal area. This analytical relationship can be accounted for using a set of models where the stem number is included as a soft constraint in the modelling process. The relations between the errors of the models are then utilized in Seemingly Unrelated Regression (SUR) or Three-Stage Least Squares (3SLS) estimation. This makes it possible to strive for coefficients that are optimal for estimating both the basal-area weighted distribution and the frequency distribution.
Furthermore, the models of Kangas and Maltamo (2000a) were estimated from angle-count sampling data, which are unreliable for the frequencies in the smallest diameter classes (e.g. Schreuder et al. 1993). Estimating the models from fixed-size sample plot data may also improve the model behaviour in the smallest diameter classes, since such data have smaller variances of frequency in these classes.
The aim of this paper is to investigate the reasons for the unsatisfactory performance of percentile methods without stem number as a predictor and to present possible solutions to the problem. We analyze how much the problem can be lessened with simple ad hoc methods such as adding new percentiles to the system. We also present a method that accounts for the analytical relationships between basal area, stem number and diameter in the modelling. Our assumption is that taking the analytical relationships into account in modelling is the best approach to alleviate the problems.

Material
The data set includes the permanent sample plots (INKA sample plots) measured by the Finnish Forest Research Institute (FFRI), originally for growth modelling purposes (Gustavsen et al. 1988). The sample plots were established on mineral soils across Finland. The data include clusters of three circular plots located systematically within a stand, avoiding stand edges. For estimating the basal-area weighted diameter distribution, the information from these circular plots was combined. Altogether 100-120 trees were measured in each stand for diameter at breast height to the nearest 0.1 cm. Tree height was measured from about 30 sample trees to the nearest 0.1 m. Data were selected according to the following criteria: the basal area of spruce in the stand had to be over 1 m²/ha and the number of spruce stems over 50 per hectare; the number of measured trees in the three sample plots had to be at least 10; the basal area median diameter had to be over 5 cm; and the ranges between the minimum and median diameters and between the median and maximum diameters both had to be over 2 cm. Altogether, the data included 328 stands.
Näslund's height model (1937) was constructed separately for each stand using sample tree measurements. The height of each tally tree was then predicted with these models. A random component was added to the predictions from a normal distribution using the estimated standard deviation of each height model. This was done in order to retain a realistic height variation in the data set. Total, sawnwood and pulpwood volumes were calculated for each tree using the taper curve functions presented by Laasasenaho (1982). Basal-area weighted diameter distributions were formed by using the basal areas of individual trees. Finally, stand characteristics were calculated as averages and sums of tallied trees (Table 1).
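The height imputation step can be illustrated with a short sketch. It assumes the common two-parameter form of Näslund's curve, h = 1.3 + d²/(a + b·d)², fitted by ordinary least squares on its linearised form; the exponent, the fitting method, the function names and the simulated sample trees are all assumptions made for illustration and are not taken from the study.

```python
import numpy as np

def fit_naslund(d, h):
    """Fit h = 1.3 + d**2 / (a + b*d)**2 by OLS on the linearised form
    d / sqrt(h - 1.3) = a + b*d. Returns (a, b, residual sd of height in m)."""
    d, h = np.asarray(d, float), np.asarray(h, float)
    b, a = np.polyfit(d, d / np.sqrt(h - 1.3), 1)  # slope first, then intercept
    resid = h - (1.3 + d**2 / (a + b * d) ** 2)
    sd = resid.std(ddof=2)  # two estimated parameters
    return a, b, sd

def predict_heights(d, a, b, sd, rng):
    """Model prediction plus a normal random component with the model's sd."""
    d = np.asarray(d, float)
    return 1.3 + d**2 / (a + b * d) ** 2 + rng.normal(0.0, sd, size=d.shape)

rng = np.random.default_rng(1)
# Hypothetical sample trees simulated from a = 1.2, b = 0.2 with 1 m noise.
d_sample = np.linspace(8.0, 35.0, 30)
h_sample = 1.3 + d_sample**2 / (1.2 + 0.2 * d_sample) ** 2 + rng.normal(0.0, 1.0, 30)
a, b, sd = fit_naslund(d_sample, h_sample)
h_tally = predict_heights([10.0, 20.0, 30.0], a, b, sd, rng)
```

Adding the random component keeps the between-tree height variance of the imputed tally trees close to that of the measured sample trees, instead of placing every tree exactly on the fitted curve.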

The Original Models
In the original models estimated by Kangas and Maltamo (2000a), the empirical basal area diameter distribution was described with the aid of percentiles of stand basal area (0, 10, …, 90, 95 and 100%), denoted by $d_0, d_{10}, \ldots, d_{100}$. The 5th percentile was not used in this system, since $d_0$ and $d_{10}$ were deemed to be quite close in most stands. The logarithms of these 12 diameters were modelled with seemingly unrelated regression (SUR) (Zellner 1962), using measured stand variables as predictors. The median of the distribution (50th percentile) is commonly assessed in compartmentwise inventory in Finland and was thus assumed to be known.
To be able to construct the diameter distribution using the predicted diameter percentiles, all the diameters must be positive. Logarithmic models were used in order to meet this requirement. The diameters are also required to be monotonic, with $d_0 < d_{10} < \ldots < d_{100}$, in order to produce a monotone distribution function and nonnegative frequencies for the diameter classes. Excluding the 5th percentile was assumed to help in producing monotonic distributions. However, to meet this requirement, an additional model was needed. This additional model was used to model the difference between $d_{10}$ and $d_0$ with an intercept term. Since SUR estimation minimises the variance with respect to each model considered, the additional model worked like an ad hoc constraint in the estimation process. This procedure ensured monotonicity in the estimation data set, but it does not guarantee it in all conditions.
In the application stage, the estimate of the relative basal area in each 1-cm diameter class $[d, d + 1]$ was calculated from the cumulative distribution of diameters $F$ as $F(d + 1) - F(d)$. The value of the empirical distribution $F$ was obtained by interpolating between the predicted percentiles with Späth's rational spline interpolation (Späth 1974, Lether 1984, Maltamo et al. 2000), with parameters $q_i$ and $p_i$ having fixed values 25 and 30 for each interval $i$. When $q_i$ and $p_i$ approach infinity, the rational spline degenerates to a piecewise linear function, and setting $q_i$ and $p_i$ to zero produces a cubic spline. The parameter values used thus produced a nearly piecewise linear interpolation.
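The class-proportion calculation can be sketched as follows. This is a minimal illustration that assumes plain piecewise-linear interpolation (`numpy.interp`) in place of Späth's rational spline, which is nearly piecewise linear at the parameter values used; the function name `class_proportions` and the example percentile diameters are hypothetical.

```python
import numpy as np

# Cumulative basal-area proportions of the 12 percentiles in the original system.
P = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 100]) / 100.0

def class_proportions(d_pct, p=P):
    """Relative basal area in 1-cm classes [d, d+1] from percentile diameters.

    Assumption: linear interpolation of F between the percentiles stands in
    for the rational spline used in the study.
    """
    d_pct = np.asarray(d_pct, dtype=float)
    if np.any(np.diff(d_pct) <= 0):
        raise ValueError("percentile diameters must be strictly increasing")
    lower = np.arange(np.floor(d_pct[0]), np.ceil(d_pct[-1]))  # class lower limits
    # np.interp returns 0 below d_0 and 1 above d_100, so F is a valid CDF.
    F = lambda d: np.interp(d, d_pct, p)
    return lower, F(lower + 1.0) - F(lower)

# Hypothetical predicted percentile diameters (cm) for one stand.
d_hat = [6.2, 11.0, 13.1, 14.8, 16.3, 17.9, 19.4, 21.2, 23.5, 26.8, 29.5, 35.1]
classes, share = class_proportions(d_hat)
```

Multiplying `share` by the observed stand basal area scales the distribution to the measured total, which is the property that makes the basal-area weighted form convenient in practice.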

The Implicit Assumptions in the System
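The analytical relationship between the basal-area weighted and unweighted diameter distributions can be presented with the formulas
$$f_G(d) = \frac{d^{2} f_N(d)}{\int_0^\infty u^{2} f_N(u)\,du} \tag{1}$$
and
$$f_N(d) = \frac{d^{-2} f_G(d)}{\int_0^\infty u^{-2} f_G(u)\,du}, \tag{2}$$
where $f_G$ denotes the density of the basal-area weighted distribution and $f_N$ the frequency distribution, and the denominator scales the density to unity (e.g. Gove and Patil 1998). With the basal-area weighted diameter distribution, the stem number between diameters $L$ and $U$ can be estimated from
$$N_{LU} = \frac{40000}{\pi}\, G \int_L^U u^{-2} f_G(u)\,du, \tag{3}$$
where $G$ is the stand basal area (m²/ha) and diameters are in cm, and the stand stem number is obtained by setting $L = d_0$ and $U = d_{100}$.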
Assuming linear interpolation between the predicted percentiles means that the density of basal area is assumed to be uniform within each interval. Because the frequency density is proportional to the basal-area density divided by $d^2$, this implicitly assumes a decreasing stem number across the diameter classes within the interval (Fig. 1). This assumption may be realistic for most of the distribution, but not for all of it. Thus, the unsatisfactory results may be partly due to these implicit assumptions in the linear interpolation rather than to the percentile estimates.
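The implicit assumption can be made concrete with a small numerical sketch. It assumes diameters in cm and basal area in m²/ha, so that one tree of diameter d has basal area πd²/40000 m²; the function name `stems_between` and the example interval are hypothetical.

```python
import math

K = 40000.0 / math.pi  # converts basal area (m2/ha) and d**-2 (cm**-2) to stems/ha

def stems_between(G, p_lo, p_hi, d_lo, d_hi, L, U):
    """Stems/ha between L and U inside one percentile interval [d_lo, d_hi].

    Assumes a uniform basal-area density (p_hi - p_lo) / (d_hi - d_lo) in the
    interval, so the integral of u**-2 has the closed form 1/L - 1/U.
    """
    density = (p_hi - p_lo) / (d_hi - d_lo)
    return K * G * density * (1.0 / L - 1.0 / U)

# Hypothetical interval: 10 % of G = 20 m2/ha lies between 10 and 15 cm.
G, p_lo, p_hi, d_lo, d_hi = 20.0, 0.10, 0.20, 10.0, 15.0
per_class = [stems_between(G, p_lo, p_hi, d_lo, d_hi, d, d + 1) for d in range(10, 15)]
total = stems_between(G, p_lo, p_hi, d_lo, d_hi, d_lo, d_hi)
```

In this example the 1-cm class frequencies in `per_class` fall steadily from the 10-11 cm class to the 14-15 cm class even though each class holds the same basal area, which is the decreasing pattern described above.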
It would, however, be possible to utilise special assumptions for the tails only. For instance, it could be assumed that the tails are estimated with second- or third-order polynomials (David and Nagaraja 2003, Mehtätalo et al. 2007). In the lower tail, this would cause the unweighted density to be increasing from $d = d_0$ upwards, but not necessarily for the whole interval (Fig. 1). For example, with quadratic interpolation it would turn into a decreasing function at $d = 2d_0$. In the upper tail, higher-order polynomials would make the tail lighter. Thus, higher-order polynomials could be more realistic assumptions than linear interpolation, especially for the lower tail. However, they are also more difficult to parameterize into the model, as the analytic functions for stem number become more complicated.
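The turning point at $d = 2d_0$ can be checked with a short derivation. It assumes, as one natural reading of the quadratic case, that the basal-area weighted distribution function in the lower tail is $F_G(d) = a(d - d_0)^2$ with $a > 0$; the symbol $a$ is introduced here for illustration only.

```latex
f_G(d) = F_G'(d) = 2a\,(d - d_0), \qquad
f_N(d) \propto d^{-2} f_G(d) = 2a\,\frac{d - d_0}{d^{2}}, \qquad
\frac{\mathrm{d}}{\mathrm{d}d}\left(\frac{d - d_0}{d^{2}}\right)
  = \frac{2d_0 - d}{d^{3}} .
```

The derivative is positive for $d_0 \le d < 2d_0$ and negative for $d > 2d_0$, so under this assumption the unweighted density rises from zero at $d_0$, peaks at $2d_0$ and decreases thereafter.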
In the case of linear interpolation, the density of the basal-area weighted diameter distribution is constant within each interval,
$$f_G(d) = \frac{p_{i+1} - p_i}{d_{i+1} - d_i}, \qquad d_i \le d < d_{i+1}, \tag{4}$$
where $p_i$ is the cumulative distribution value at percentile $d_i$. The stem number for each interval can be obtained from Eq. 3 as
$$N_i = \frac{40000}{\pi}\, G\, \frac{p_{i+1} - p_i}{d_i\, d_{i+1}}. \tag{5}$$
When second-order polynomials are used for the tails, the tails are specified so that the global minimum of the polynomial used in the lower tail is 0 and the global maximum of the polynomial used in the upper tail is 1. The parameters $a_l$, $b_l$, $a_u$ and $b_u$ of the tail polynomials can be solved by forcing the interpolated distribution function to pass through the predicted 1st and 2nd percentiles in the lower tail and through the 90th and 95th percentiles in the upper tail, for instance. This leads to estimates of $a_l$ and $b_l$ for the lower tail and of $a_u$ and $b_u$ for the upper tail, and these, in turn, lead to estimates of the minimum and maximum diameters (Eqs. 11 and 12). Thus, the minimum and maximum diameters need not be estimated with separate models. Using the tail model, the stem number for the interval $[d_0, d_2]$ can be obtained from Eq. 3 (Eqs. 13 and 14).

Approaches for Improving the Results

Methods
If several separate but interrelated models are estimated, simultaneous estimation is needed. Assuming the models to be independent will lead to biased coefficients or, at least, to inefficient estimation (e.g. Zellner 1962, Zellner and Theil 1962). In forest modelling, simultaneous estimation has been used in some growth and yield models (e.g. Borders and Bailey 1986, Zhang et al. 1997, Hasenauer et al. 1998, Eerikäinen 2002, and Siipilehto 2006), and also for diameter distribution models (Borders et al. 1997, Maltamo et al. 2000, Kangas and Maltamo 2000a, Robinson 2004, Maltamo et al. 2006). Simultaneous equations may be seemingly unrelated or directly related. In the first case, none of the independent variables are estimated with an equation in the system, but the errors of the separate models may be correlated. In the second case, some of the independent variables in the equations are estimated with another equation (endogenous variables), and some are not (exogenous variables). The errors of the models may or may not be correlated.
In Seemingly Unrelated Regression (SUR), the correlations between the errors of the models are accounted for in estimating the coefficients of the models. First, the coefficients are estimated with Ordinary Least Squares (OLS) separately for each model in the group, and the correlations between the errors are estimated. Then, these correlations are used in estimating a final set of parameters. In this case, the OLS estimates are unbiased for each separate model, but modelling efficiency can be improved if the correlations are accounted for and the independent variables are not the same in each model. If the models are directly related, two-stage (2SLS) or three-stage least squares (3SLS) methods are used. In 2SLS, the equations for the endogenous variables are first estimated, and the predicted values of these variables are then used as independent variables when estimating the final set of parameters. This is to ensure the unbiasedness of the approach. In 3SLS, it is also assumed that the errors of the models are correlated, so that after 2SLS the estimated correlations of the model errors are used in the same way as in SUR for estimating the final set of parameters.
It is also possible to use SUR and/or 3SLS estimation as a sort of "soft constraint". For instance, Zhang et al. (1997) used the sum of treewise growth estimates to estimate the stand growth simultaneously. This helped to constrain the treewise growth models so that the estimates of standwise growth were more precise. Similar ideas are utilized in this study: the estimates of stem numbers in different diameter classes, based on the estimates of the percentiles, were used to constrain the percentile models.
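The two SUR steps described above can be sketched with NumPy as a two-step feasible generalised least squares estimator. This is an illustrative implementation for a small linear system with simulated data, not the software or models used in the study; the function name `sur_fgls` and all data are hypothetical.

```python
import numpy as np

def sur_fgls(X_list, y_list):
    """Two-step feasible GLS for a linear SUR system with n observations per equation.

    X_list[j] is the (n, k_j) design matrix and y_list[j] the (n,) response of
    equation j. Returns (OLS coefficients, SUR coefficients, error covariance).
    """
    n, m = len(y_list[0]), len(y_list)
    # Step 1: equation-by-equation OLS and the residual covariance between equations.
    b_ols = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(X_list, y_list)]
    E = np.column_stack([y - X @ b for X, y, b in zip(X_list, y_list, b_ols)])
    sigma = E.T @ E / n
    # Step 2: GLS on the stacked system with error covariance sigma kron I_n.
    ks = [X.shape[1] for X in X_list]
    X_big = np.zeros((m * n, sum(ks)))
    col = 0
    for j, X in enumerate(X_list):
        X_big[j * n:(j + 1) * n, col:col + ks[j]] = X
        col += ks[j]
    y_big = np.concatenate(y_list)
    W = np.kron(np.linalg.inv(sigma), np.eye(n))
    b_sur = np.linalg.solve(X_big.T @ W @ X_big, X_big.T @ W @ y_big)
    return b_ols, np.split(b_sur, np.cumsum(ks)[:-1]), sigma

# Simulated two-equation system with different regressors and correlated errors.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
e = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])
y1 = X1 @ np.array([1.0, 2.0]) + e[:, 0]
y2 = X2 @ np.array([-1.0, 0.5]) + e[:, 1]
b_ols, b_sur, sigma = sur_fgls([X1, X2], [y1, y2])
```

When every equation shares the same design matrix, the second step reproduces the OLS coefficients, which is why the efficiency gain requires both correlated errors and differing independent variables.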

Modelling Approaches
The first attempt to improve the original models was to re-estimate them from the fixed-area INKA sample plots and to include three new percentiles, namely 1%, 2% and 5%, in the system. This approach may improve the results in the sense that the intervals in the lower tail of the distribution become narrower, so that the implicit assumption of a decreasing stem-number density within each interval has a smaller effect. It also means that the produced distributions should have less heavy tails overall. On the other hand, this approach is expected to produce more problems due to non-monotonicity than the original model.
The modelling technique was the same as in the original models, SUR. The old models included a dummy variable for mesic and poorer mineral soils, which was not included in the new ones; conversely, the new models included the temperature sum (TS), which was not in the old ones. These models are later called re-estimated models. The models were re-estimated in logarithmic form, which requires a bias correction when transforming the estimates back to the arithmetic scale.
In the second attempt to improve the model behaviour, the relationship between basal area and stem number in a diameter class $[d_i, d_{i+1}]$ was used to constrain the model. Assuming linear interpolation, the model for stem number between any two percentiles is obtained from Eq. 5 as
$$N_i = b_i\, \frac{G}{d_i\, d_{i+1}} + \varepsilon_i. \tag{15}$$
Thus, the strict analytical relationship (Eq. 5) is loosened by including a parameter $b_i$ and an error term $\varepsilon_i$ for each class. If the estimated parameters $b_i$ differ from their theoretical values $12732.4\,(p_{i+1} - p_i)$, this indicates that the linear interpolation assumption does not fit. If the estimated values are close to their theoretical values, the linear interpolation assumption is suitable, and the imposed restrictions can be assumed to improve the parameters of interest, namely the coefficients of the percentile models.
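The theoretical values of the parameters and the comparison against estimated values can be sketched as follows. The sketch assumes that the systematic part of the stem-number model is $b_i G / (d_i d_{i+1})$, which is what integrating a uniform basal-area density over an interval gives; the function names are hypothetical.

```python
import math

# Cumulative proportions of the 15 percentiles in the extended system.
P = [0.00, 0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40,
     0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 1.00]

def theoretical_b(p):
    """b_i = (40000 / pi) * (p_{i+1} - p_i) for each percentile interval."""
    k = 40000.0 / math.pi  # about 12732.4 with d in cm and G in m2/ha
    return [k * (hi - lo) for lo, hi in zip(p[:-1], p[1:])]

def stems_in_interval(b_i, G, d_lo, d_hi):
    """Assumed systematic part of the stem-number model: b_i * G / (d_lo * d_hi)."""
    return b_i * G / (d_lo * d_hi)

def relative_deviation(b_est, b_theory):
    """Percentage difference of an estimated b_i from its theoretical value."""
    return 100.0 * (b_est - b_theory) / b_theory

b = theoretical_b(P)
```

A negative value from `relative_deviation` for an interval means that the fitted model predicts fewer stems there than strict linear interpolation would, for the same percentile diameters.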
As the equations are not linear with respect to the percentile diameters, a nonlinear modelling approach is needed. The models were fitted using the MODEL procedure of SAS with the nonlinear 3SLS method. The diameter percentiles were estimated with a model in which $\beta_0, \ldots, \beta_p$ are the parameters to be estimated and $x_1, \ldots, x_p$ are the independent variables. The percentiles were assumed to be endogenous variables, and the exogenous variables were stand basal area ($G$), logarithm of stand age ($t$), logarithm of the stand age divided by basal area, the basal area of spruce divided by the stand total basal area, temperature sum (TS) and the logarithm of the basal area median diameter $d_{50}$. These independent variables are the same as those used in the re-estimated models. Later these models are referred to as new models. In the new models, diameters were estimated directly, so bias corrections were not needed.
Finally, improvements were attempted using second-order polynomials for the tails. In this case, models for the minimum and maximum diameter were not estimated at all; instead, these diameters were obtained from Formulas 11 and 12. The estimates of the stem numbers in the interval $[d_0, d_2]$ were calculated with Eqs. 13 and 14, and in the interval $[d_1, d_2]$ with Eq. 5. The stem number in the interval $[d_1, d_2]$ could also have been calculated exactly from the second-order polynomial, but this approximation seemed to work well enough. The stem number in the interval $[d_0, d_1]$ was obtained by subtraction. Eqs. 13 and 14 were also used as soft constraints in the modelling phase, so that $80000/\pi$ was replaced with parameters $b_1$ and $b_{14}$ and an error term was included as in Eq. 15.
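The tail estimator can be illustrated with a sketch under explicit assumptions. It takes the lower-tail distribution function to be $F(d) = a_l (d - b_l)^2$ and the upper-tail function to be $F(d) = 1 - a_u (b_u - d)^2$, which satisfy the stated minimum-0 and maximum-1 conditions and reproduce the constant $80000/\pi$; the exact parameterisation behind Eqs. 11-14 is not shown in the text, so these forms and the function names are reconstructions for illustration only.

```python
import math

SQRT2 = math.sqrt(2.0)

def lower_tail(d1, d2, G):
    """Quadratic lower tail F(d) = a * (d - b)**2 through (d1, 0.01) and (d2, 0.02).

    Returns (a, d0, stems): d0 = b is the implied minimum diameter and stems
    is the stem number per hectare in [d0, d2].
    """
    # (d2 - b) / (d1 - b) = sqrt(0.02 / 0.01) = sqrt(2)
    b = (SQRT2 * d1 - d2) / (SQRT2 - 1.0)
    if b <= 0.0:
        raise ValueError("implied minimum diameter is not positive")
    a = 0.01 / (d1 - b) ** 2
    # N = (40000 / pi) * G * integral from b to d2 of 2a(u - b) / u**2 du
    stems = (80000.0 / math.pi) * G * a * (math.log(d2 / b) + b / d2 - 1.0)
    return a, b, stems

def upper_tail(d90, d95):
    """Quadratic upper tail F(d) = 1 - a * (b - d)**2 through the 90th and
    95th percentiles. Returns (a, d100) with d100 = b the implied maximum."""
    b = (SQRT2 * d95 - d90) / (SQRT2 - 1.0)
    a = 0.05 / (b - d95) ** 2
    return a, b

# Hypothetical percentile diameters (cm) and basal area (m2/ha).
a_l, d0, n_low = lower_tail(d1=7.0, d2=8.0, G=20.0)
a_u, d100 = upper_tail(d90=27.0, d95=29.0)
```

Under these assumed forms the minimum and maximum diameters fall out of the two nearest percentiles in each tail, so no separate models for $d_0$ and $d_{100}$ are required.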

Comparison of the Models
First, the modelling approaches were compared based on the standard errors of the estimated models. The problems due to non-monotonicity were also considered for each case. Stem numbers for each interval were calculated using the analytical relationships presented, and their accuracy was analysed.
Then, the models were compared in an application stage. In this stage, a rational spline was used for interpolation, with the same parameters as in the original study (Kangas and Maltamo 2000a). The height and volume models were applied and the accuracy of the resulting stand characteristics was calculated. The basic performance of the models was examined by calculating the root mean square errors and biases of the stand volume estimates (m³/ha) obtained with these methods. Tree total and sawnwood volumes for each diameter class were calculated with Laasasenaho's taper curve models (1982), using diameter at breast height and tree height as predictors. Tree height was predicted by using the models of Siipilehto (1999). In this approach, the parameters of Näslund's height model are predicted for each stand so that the height of the mean tree (the tree with $d_{50}$) coincides with the observed value.
In both stages, the results were compared to the results obtained by using the true percentiles instead of the estimated ones. This was done in order to find out how much of the problem was due to the percentile estimates and how much was due to other reasons.
The absolute root mean square error (RMSE) was calculated as
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (V_i - \hat{V}_i)^2}{n}},$$
where $n$ is the number of sample stands, $V_i$ is the true volume of stand $i$ and $\hat{V}_i$ is the volume of stand $i$ estimated from the predicted distribution. The relative RMSE of the volume estimate was calculated by dividing the absolute RMSE by the true mean volume $\bar{V}$ of the stands. The bias of the predictions was calculated as the mean of the stand-level prediction errors. In addition to stand total volume, the RMSE and bias of sawnwood volume and number of stems were considered.
Finally, an error index proposed by Reynolds et al. (1988) was used in the comparisons as a measure of the goodness-of-fit of the distributions. The error index was calculated in 1-cm diameter classes for stem numbers. Thus, the error index of a given stand was the sum of the absolute differences between the actual and predicted stem frequencies of the diameter classes,
$$e = \sum_{i=1}^{K} \left| \hat{f}_i - f_i \right|,$$
where $\hat{f}_i$ and $f_i$ are the predicted and true frequency of diameter class $i$, respectively, and $K$ is the number of diameter classes.
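The accuracy measures can be computed with a few lines of NumPy. This sketch assumes that the RMSE uses the divisor n and that class frequencies are expressed in stems per hectare; the function names and the example figures are hypothetical.

```python
import numpy as np

def rmse(true, pred):
    """Root mean square error over stands (divisor n is an assumption)."""
    true, pred = np.asarray(true, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((true - pred) ** 2)))

def relative_rmse(true, pred):
    """RMSE as a percentage of the true mean value."""
    return 100.0 * rmse(true, pred) / float(np.mean(true))

def error_index(f_true, f_pred):
    """Sum of absolute differences between predicted and true class frequencies."""
    f_true, f_pred = np.asarray(f_true, float), np.asarray(f_pred, float)
    return float(np.sum(np.abs(f_pred - f_true)))

# Hypothetical stand volumes (m3/ha) and class frequencies (stems/ha).
v_true, v_pred = [100.0, 200.0, 300.0], [110.0, 190.0, 300.0]
rel = relative_rmse(v_true, v_pred)
idx = error_index([10.0, 20.0, 5.0], [12.0, 17.0, 5.0])
```

Because the error index sums absolute class-level differences, errors in different diameter classes cannot cancel out, unlike in the stand-level totals.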

Results
The parameters of the original models (Kangas and Maltamo 2000a) are presented in Table 2, those of the re-estimated models, including the models for the 1%, 2% and 5% percentiles, in Table 3 and those of the new models in Table 4. It can be noted that, contrary to prior expectations, the re-estimated model for minimum diameter had a larger standard error than that of the original models, so using fixed sample plots did not improve the models in this respect. With respect to the standard errors of the other common percentiles, those of the re-estimated models were a little smaller than those of the original models in 5 cases out of 10. The re-estimated models required three additional restricting models with only an intercept term in order to produce monotonic distributions: between 2% and 0%, between 100% and 95% and between 5% and 2%. This is in line with prior expectations.
The RMSEs of the new models are not directly comparable with those of the other, logarithmic models, but relative RMSEs provide a suitable basis for comparison. The standard errors of the logarithmic models can be interpreted as approximate relative RMSEs for the diameters on the arithmetic scale (Lappi 1986). Interpreted in this way, the relative RMSEs of the new models were better than those of the original models for most of the percentiles (excluding minimum diameter) and also better than those of the re-estimated models, except for percentiles 1, 2 and 5%. This may be partly due to the smaller number of independent variables used in the 3SLS approach for those percentiles (all variables not significant at the 5% risk level were excluded), and partly due to the constraining stem number models. The new models required one additional model, between 1% and 0%, in order to produce monotonic results. Thus, the information concerning the stem numbers led to more satisfactory behaviour of the model in this sense.
In the stem number models, the parameters $b_i$ differed from their theoretical values by less than 1% in 7 cases out of 14, and by less than 6.5% in 12 cases out of 14 (Table 5). In the first class, from 0% to 1%, however, the parameter was about 42% smaller than the theoretical value, and in the last interval it was 28% smaller than the theoretical value. This indicates that in these two intervals, the stem numbers are consistently overestimated if linear interpolation is used. The small value of $b_i$ thus compensates for the overestimation.
This can also be seen from the estimates of stem number for the 14 intervals obtained from the analytical relationships. Using the true percentiles and the estimated percentiles from the re-estimated models, the estimates of stem number were obtained using Eq. 5. Correspondingly, in the case of the new models, the estimates of stem number were obtained using the estimated percentiles and the model (Eq. 15). The stem number estimates obtained from the new models were better than those estimated from true percentiles in three intervals, and better than those estimated from the re-estimated models in 11 intervals out of 14. Thus, accounting for the analytical relationships improved the estimates (slightly) in 11 cases.
In the first interval, the best estimates were obtained from the re-estimated model (RMSE 70.21), and the worst with true percentiles (RMSE 165.94). Thus, in this particular interval, the linear interpolation produced the greatest errors. In the case of the re-estimated models, the errors in the estimated percentiles compensated for that error, which reduced the RMSE. The most probable reason is that using a model shortens the tail and therefore lessens the effect of the implicit assumptions. In the new models, both the small value of $b_i$ in the model and the shortened tail compensated for the errors due to the implicit assumptions, and the result was almost as large a bias as in the case of true percentiles, but in the opposite direction. If the theoretical value of $b_i$ had been used in this interval instead of the estimated one, the results would have been better: bias 10.5 and RMSE 108.8. The error due to linear interpolation accounts for a large part of the uncertainty involved in the stem number estimates. The RMSE based on true percentiles, as a proportion of that of the new models, varies from about 20% to 135% and averages about 75%. Roughly three fourths of the uncertainty in the stem number estimates in each interval is thus due to linear interpolation (ignoring the effect of possible compensation).
When the new models were used so that the minimum and maximum diameters and the stem numbers in the first and last interval were predicted with the tail estimators (Eqs. 11-14) and the corresponding models were excluded from the modelling phase, the results were bias 38.32 and RMSE 80.91 in the first interval and bias -5.57 and RMSE 7.83 in the last interval. Thus, the tail estimator improved the result in the first interval but slightly worsened it in the last one. This also indicates that linear interpolation does not fit the first interval. In the last interval, the second-order polynomial seems to produce too light a tail, and linear interpolation seems to be better. If models based on Eqs. 13 and 14 were also included as soft constraints in the system, the results were bias -43.1 and RMSE 86.83 for the first interval, and bias -5.32 and RMSE 6.62 for the last one. Thus, this constraint did not improve the fit in the first interval but did so in the last one. In this case, the parameters were 29932 (17.5% greater than the theoretical value) and 22176 (12.9% smaller than the theoretical value). Thus, the constraint fitted clearly better than the one based on linear interpolation, but a still better assumption would be required to obtain a truly useful constraint. On the other hand, using the second-order polynomial for the tails produced monotonic distributions in all stands without any ad hoc constraints.
When the models were implemented in an application, and the resulting stand characteristics from all three models (original, re-estimated and new) were compared to the corresponding results obtained by using true percentiles, the results were quite surprising. The accuracy of the forest characteristics obtained in the application phase was fairly similar in all these cases, except for the error index (Table 7).
The error index, which describes how well the distribution fits, was clearly better with true percentiles. The error index could also be improved from 10.808 to 9.461 by introducing three new percentiles, but it could not be further improved by using the analytical stem number information, even though the class-wise stem number estimates could be improved in most classes (Table 6). True percentiles produced only slightly better results than the percentiles estimated with the new models for volume (RMSEs 11.828% and 12.062%, respectively) and sawnwood volume (RMSEs 18.336% and 22.195%, respectively) (Table 7).
For stem number, the relative RMSE decreased from 22.116% (original models) to 20.587% (new models).
As the stem number estimate is likely to be affected by long and heavy tails, an ad hoc shortening of the tail was also tested. This was carried out by using the 1% diameter as a "true minimum" in the estimation. With this value, the RMSE of stem number could be reduced to 6.059 and the bias to 5.212 (Table 7). However, using the 1% diameter as a minimum with the true percentile values produced worse diameter distributions than the re-estimated or new models when the distributions were visually inspected (Fig. 2).
Irrespective of the seemingly minor improvements, using the stem number information for constraining the percentile models seems to force the models to behave visually more satisfactorily. Fig. 2 shows three example stands for which the diameter distributions obtained with the different models are presented. In these examples, the density functions obtained with the new models are no longer decreasing, as they were with the original models.
Table 6. The accuracy of class stem number estimates (i.e. stem numbers at percentiles 1-100 minus the stem number at the preceding percentile, denoted by $ir_1, \ldots, ir_{100}$), estimated with true percentiles, with the re-estimated models and Eq. 5, and with estimated percentiles and the stem number models included in the new model system.

Discussion
In this paper, we investigated the reasons for the unsatisfactory performance of basal area weighted percentile-based diameter distribution models and presented possible solutions for the problem.The improvements were attempted by including new percentiles into the system and using analytical relationships between basal area, stem number and diameter.The results were indeed better than those obtained using the original model in the sense that the basal-area weighted distributions transformed to traditional density functions were not decreasing (Fig. 2).However, otherwise the improvements were minor, except in the error index, which could be decreased by 12.4%.When the results were compared to those obtained using true percentiles in the application phase, it can be seen that the results cannot be much improved by improving the percentile models.Small improvements are still possible with respect to error index, but in other respects further improvements in the results seem very difficult.Therefore, there are other sources of uncertainty involved, which seem to be more important than the errors in percentile estimates.
Besides the (nearly linear) spline interpolation used, another likely source of error is the height variation in the data set. This is supported by the fact that the original percentile models produced an RMSE of about 3.29% (Kangas and Maltamo 2000b) for volume in the same INKA data set, while in this study the RMSE was 11.83% even with true percentiles. In the former case, the volumes of the trees were calculated using diameter as the only independent variable, both for the measured trees used as reference information and for the trees sampled from the predicted distribution. Thus, height variation did not affect the results. In this study, a volume model with both height and diameter as independent variables was used, and the height variation of the reference data was accounted for.
For sawnwood volume, the RMSE was 18.34% with true percentiles and 21.89% with the original models. The effect of the percentile estimates in this study thus seems to be fairly low, while height variation and the linear interpolation assumption account for most of the error (roughly four fifths). In the study by Kangas and Maltamo (2000b), the original models produced an RMSE of 11.85% for sawnwood volume in the same INKA data set that was used for the current study. The difference can partly be explained by the fact that height variation was not accounted for in the original study, and partly it may be due to differences in the definition of sawnwood volume. In the original study, all trees with a breast height diameter greater than 17 cm were considered sawnwood trees, and their volume formed the sawnwood volume. Because of this definition, most of the errors were likely to be due to the errors in percentile estimates. In this study, the sawnwood volume was calculated as the proportion of tree volume fulfilling the minimum diameter requirements for each tree sampled from the distribution. Consequently, since the RMSE observed in the original study was fairly high, it can be assumed that some of the errors in percentiles were compensated by other error sources in the case of the new models, and that the true effect is more than one fifth of the errors.
Another interesting phenomenon in this study is that the other set of original percentile models, which have stem number as an independent variable (Kangas and Maltamo 2000a), produced clearly better estimates of stem number in the same data set than the true percentiles did in this study, namely an RMSE of 6.88% and a bias of -40.69. It can be assumed that the models with stem number adjusted the estimate of the minimum percentiles so that the implicit bias could be partly corrected. The most reasonable explanation for this is that those models shortened the tails, and thus lessened the effect of the implicit assumptions.
Therefore, it seems that the bias and problems in the stem number are mostly due to the (nearly) linear interpolation used in the application. There may be single trees in a stand with a diameter much smaller than that of the other trees, and these cases may cause problems in estimation, as linear interpolation of basal-area weighted diameter percentiles implicitly assumes a decreasing density between the 0% and 1% percentiles.
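The implicit assumption can be made explicit with a short calculation. If the basal-area weighted distribution function is linear between two percentile diameters d_a and d_b (in cm), the basal area share Δp of the interval is spread evenly over diameter. Dividing by the basal area of a single tree, (π/4)(d/100)^2 m², gives a stem density proportional to 1/d^2 and a class stem number of (40000/π) G Δp / (d_a d_b). The following Python sketch illustrates this under the assumption of exactly linear interpolation (the study itself uses a nearly linear spline). The function names and the example values (G = 20 m²/ha, percentile diameters of 2, 8 and 10 cm) are hypothetical and are not taken from the data.

```python
import math


def stem_density(d_cm, d_lo, d_hi, share, basal_area_m2_ha):
    """Stems per hectare per cm at diameter d_cm inside [d_lo, d_hi].

    Assumes the basal-area weighted distribution function is linear within
    the interval, so basal area per cm of diameter is constant there.
    """
    basal_area_per_cm = basal_area_m2_ha * share / (d_hi - d_lo)
    tree_basal_area = math.pi / 4.0 * (d_cm / 100.0) ** 2  # m2 per tree
    return basal_area_per_cm / tree_basal_area


def class_stem_numbers(percents, diameters_cm, basal_area_m2_ha):
    """Stems per hectare between consecutive percentile diameters.

    Closed form of the integral of stem_density over each interval:
    N = (40000 / pi) * G * share / (d_lo * d_hi), with diameters in cm.
    """
    counts = []
    for i in range(1, len(percents)):
        share = (percents[i] - percents[i - 1]) / 100.0
        d_lo, d_hi = diameters_cm[i - 1], diameters_cm[i]
        counts.append(
            40000.0 / math.pi * basal_area_m2_ha * share / (d_lo * d_hi)
        )
    return counts


# Hypothetical stand: G = 20 m2/ha, 0%, 1% and 2% diameters of 2, 8 and 10 cm.
long_tail = class_stem_numbers([0, 1, 2], [2.0, 8.0, 10.0], 20.0)
# Same stand with the minimum diameter raised to 6 cm (a shortened tail).
short_tail = class_stem_numbers([0, 1, 2], [6.0, 8.0, 10.0], 20.0)
```

With these hypothetical values, the lowest class alone contains about 159 stems per hectare, and raising the minimum diameter from 2 cm to 6 cm reduces this to about 53. The density within the class falls steeply with diameter, so a single small tree that stretches the lower tail can inflate the stem number estimate considerably.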
Thus, the most significant remaining error sources are other than the percentile estimates. It seems that further improvements would require that the implicit assumptions in the models are made more realistic. It would also mean that the height models should be improved. Adjusting the shape of the height model in addition to its level could help. This means that more than one height sample tree should be measured and the height sample trees should not be too similar to each other in diameter. It could also be useful to utilize the whole conditional distribution of tree heights given the tree diameter, instead of using only conditional expectations. This would require integration of the volume estimates over the distribution of heights. Simpler approaches would be simulation of heights from the conditional distribution or a two-point distribution approach as used by Lappi et al. (2006).
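The difference between using only the conditional expectation of height and using the whole conditional distribution can be sketched as follows. The Python example below is a minimal illustration under stated assumptions, not the procedure of this study: the volume function and its coefficients are hypothetical, heights are assumed to be normally distributed around the conditional mean, and the two-point rule is assumed to take the form of equal weights at the mean plus and minus one standard deviation, which may differ from the formulation of Lappi et al. (2006).

```python
import random


def volume(d_cm, h_m):
    # Hypothetical allometric stem volume (m3); the coefficients are
    # illustrative only and do not come from any published model.
    return 4.0e-5 * d_cm ** 2 * h_m ** 1.1


def plug_in_volume(d_cm, mean_h):
    """Volume evaluated at the conditional expectation of height only."""
    return volume(d_cm, mean_h)


def two_point_volume(d_cm, mean_h, sd_h):
    """Expected volume approximated with two equally weighted heights.

    Assumed form of the rule: heights at mean - sd and mean + sd.
    """
    low = volume(d_cm, mean_h - sd_h)
    high = volume(d_cm, mean_h + sd_h)
    return 0.5 * (low + high)


def simulated_volume(d_cm, mean_h, sd_h, n=20000, seed=1):
    """Expected volume estimated by simulating heights (assumed normal)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # Truncate at breast height so the power function stays defined.
        h = max(rng.gauss(mean_h, sd_h), 1.3)
        total += volume(d_cm, h)
    return total / n


# Hypothetical tree: d = 20 cm, conditional mean height 18 m, sd 2 m.
v_plug_in = plug_in_volume(20.0, 18.0)
v_two_point = two_point_volume(20.0, 18.0, 2.0)
v_simulated = simulated_volume(20.0, 18.0, 2.0)
```

Because the hypothetical volume function is nonlinear in height, the plug-in value differs from the expected volume, whereas the two-point and simulated values approximate the same expectation. The size and direction of the difference depend entirely on the volume model used. It can be expected to matter more for quantities that depend on height in a strongly nonlinear way.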

Conclusions
The percentile models of Kangas and Maltamo (2000a) have been widely used in forest planning calculations, e.g. in the MELA simulator. Although test results have shown the accuracy of these models to be comparable with that of other models, the behaviour of percentile-based models has been unsatisfactory, especially in Norway spruce stands. The work in this study has improved the behaviour of the percentile models in the sense that the frequency functions are no longer always decreasing. This study concentrated on the prediction of diameter distribution using current stand characteristics, but the new percentile models are expected to be more usable also in growth simulations, because the number of small trees will now be more realistic. Future research will include estimating new percentile models for other tree species, such as Scots pine and birch species.

Fig. 2. Examples of predicted density functions in three example stands (1-3): a = true percentile values (1% diameter used as a true minimum), b = original percentile models of Kangas & Maltamo (2000b), c = re-estimated percentile models and d = new model. True distribution is presented with bars, and the estimate with lines.

Table 2. The original model, estimated from angle count plot data (Kangas and Maltamo 2000a). Median point (d_gM) is expected to be known. Clarifications of variable codes: d_0, …, d_100 = diameter percentiles, Soil = dummy variable for stands on mesic and poorer mineral soil. For other variable codes, see Table 1.

Table 3. Re-estimated models (SUR) for different percentile diameters of Norway spruce from INKA data, including the new percentiles (1, 2 and 5%).

Table 4. New models estimated with 3SLS for different percentile diameters of Norway spruce, including the constraining models of stem numbers in different diameter classes. The standard errors of the coefficients are presented in brackets.

Table 5. The parameters b_i of the stem number models, the theoretical parameter values and the standard errors of the estimates.

Table 7. The relative RMSE and absolute biases of volume, sawnwood volume, stand stem number and error index, calculated with true percentiles, with true percentiles modified so that the 1% diameter was used as a minimum diameter, with original model percentiles, with re-estimated models, and with the new models.