• Paper analysis -- Efficiently Approximating Selectivity Functions using Low Overhead Regression Models (2020)


    The main point of this paper is how to reduce the cost of model training.

    It attacks this from two directions: reducing the size of the training set, and reducing the cost of collecting the label for each individual training example.

    ABSTRACT

    (Status quo) Today's query optimizers use fast selectivity estimation techniques but are known to be susceptible to large estimation errors.

    Recent work on supervised learned models for selectivity estimation significantly improves accuracy while ensuring relatively low estimation overhead.

    (Problem) However, these models impose significant model construction cost, as they need large numbers of training examples and computing selectivity labels is costly for large datasets.

    (Proposal) We propose a novel model construction method that incrementally generates training data and uses approximate selectivity labels,

    that reduces total construction cost by an order of magnitude while preserving most of the accuracy gains.

    The proposed method is particularly attractive for model designs that are faster-to-train for a given number of training examples, but such models are known to support a limited class of query expressions.

    We broaden the applicability of such supervised models to the class of select-project-join query expressions with range predicates and IN clauses.

    (Result) Our extensive evaluation on synthetic benchmark and real-world queries shows that the 95th-percentile error of our proposed models is 10-100x better than traditional selectivity estimators.

    We also demonstrate significant gains in plan quality as a result of improved selectivity estimates.

     

    INTRODUCTION

    (Problem statement) Selectivity estimates are necessary inputs for a query optimizer, in order to identify a good execution plan [39].

    A good selectivity estimator should provide accurate and fast estimates for a wide variety of intermediate query expressions at reasonable construction overhead [16].

    Estimators in most database systems make use of limited statistics on the data, e.g., per-attribute histograms or small data samples on the base tables [41, 36, 35], to keep the query optimization overhead small [11].

    However, these statistics are insufficient to capture correlations across query predicates and can produce inaccurate estimates for intermediate query expressions [11, 28].

    Such inaccurate estimates often lead the optimizer to choose orders of magnitude slower plans [28].

    While there is a huge literature on selectivity estimators (refer to survey [16]), accurate selectivity estimation with low overhead remains an unsolved problem.

     

    Learned models for selectivity estimation

    (Supervised learned models seem able to solve this problem)

    Recent works [25, 17, 46, 47] have shown that the supervised learned models may fit well into the low overhead expectation of query optimization, as they can provide better accuracy than traditional techniques with low estimation overhead.

    As a case in point, the regression model design proposed in [17] requires only tens of KB of memory and about 100 microseconds per estimation call, to deliver accurate estimates for correlated range filters on the base tables in the query.

    Their ability to adapt to the query workload, similar to self-tuning methods [10, 43], is also considered to be a significant advantage.

    (The new problem this introduces is the cost of collecting the training set)

    However, constructing supervised models can take many hours [25, 46, 22], which is orders of magnitude slower than the traditional statistics collection methods.

    To construct a supervised model for selectivity estimation of a given query expression, we need a set of example instances along with their true selectivities.

    Such training data is then used to train a regression model, which approximates the selectivity function captured by the given examples.

    It has been shown that the model training step is reasonably efficient - a few seconds with optimized libraries for gradient boosted trees [17] and several minutes for a simple neural network [46, 17].

    The bottleneck lies in the generation of labeled training examples, where the overhead increases with

    (1) the number of query expressions to be supported by the model,

    (2) the number of training examples per query expression, 

    and (3) the cost of generating true selectivity label for each training example.
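
    To make this setup concrete, below is a minimal sketch (ours, not the authors' code) of the supervised pipeline: each training example is a featurized instance of a query expression (here, just the bounds of its range predicates), the label is its true selectivity, and a fast tree-ensemble regressor is fit on log-selectivity. The synthetic dataset, the feature encoding, and the scikit-learn regressor are illustrative assumptions; note how the labeling loop, not the model fit, is the expensive part.

```python
# Minimal sketch of training a regression-based selectivity estimator (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical dataset: 100k rows, 2 correlated numeric attributes.
N = 100_000
a = rng.normal(0, 1, N)
b = a + rng.normal(0, 0.5, N)                      # correlated with a
data = np.stack([a, b], axis=1)

def true_selectivity(lo, hi):
    """Label generation: fraction of rows inside the range box (the costly step at scale)."""
    mask = np.all((data >= lo) & (data <= hi), axis=1)
    return max(mask.mean(), 1.0 / N)               # avoid log(0) for empty results

# Training examples: random range queries featurized as [lo_a, lo_b, hi_a, hi_b].
X, y = [], []
for _ in range(2000):
    lo = rng.uniform(-2, 1, 2)
    hi = lo + rng.uniform(0.1, 2, 2)
    X.append(np.concatenate([lo, hi]))
    y.append(np.log(true_selectivity(lo, hi)))     # learn log-selectivity

model = GradientBoostingRegressor(n_estimators=200, max_depth=4)
model.fit(np.array(X), np.array(y))                # training itself takes only seconds

# Estimation call: featurize the query instance and exponentiate the prediction.
q = np.array([[-1.0, -0.5, 0.0, 0.5]])             # lo_a, lo_b, hi_a, hi_b
print("estimated selectivity:", float(np.exp(model.predict(q)[0])))
```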

    Recently, generation of labeled training examples has been highlighted as a major limitation of supervised models for selectivity estimation [22].

    (Using past query execution logs as the training set has certain limitations)

    Contributions

    The focus of this work is to efficiently construct supervised selectivity estimation model for a given query expression without compromising accuracy.

    We present a novel model construction procedure that can reduce the training data collection overhead by reducing the number of training examples and the per-example label generation cost.

    Further, our analysis shows that faster-to-train models are better suited to reduce the total model construction cost for different query expressions with varying selectivity labeling cost.

    With this motivation, we build upon the low overhead regression models proposed in [17] to show that these simpler models can achieve accuracy similar to the more complex model design [25].

     

    Efficient model construction.

    Recent proposals with supervised approach for selectivity estimation [25, 17, 46, 47] use tens of thousands of training examples for model training.

    They also empirically observed that similar accuracy could be achieved with a much smaller number of examples [17, 46] and approximate labels [17, 47].

    We formalize these observations and propose a method to efficiently construct a supervised model with a target accuracy.

     

    Automated determination of training data size

    Determining the required number of training examples is a challenging task [38, 23].

    To avoid generating an unnecessarily large number of training examples, a potent approach is to use an iterative method that incrementally generates training examples.

    It can lead to significant savings, whenever a large enough test set is available to monitor the stopping criteria based on model accuracy [38, 23].

    (Problems with current iterative approaches)

    Also, if the model of choice is relatively slow to train, repetitive model training in each iteration can significantly increase the total time spent on model training.

    Overall, there is no existing iterative training method that can optimize the total model construction cost (model training cost and labeling cost) without compromising model accuracy.

     

    We propose a novel iterative procedure that

    (1) supports early stopping for query expressions that need a small number of examples to train, by using cross-validation to avoid generation of a large set of test examples,

    (2) ensures robust monitoring of model accuracy by using confidence intervals on tail q-errors to suit the selectivity estimation context, and 

    (3) optimizes the worst case total model construction cost by adapting the geometric step size to provide a constant-factor approximation guarantee.

    The optimal geometric step size is a function of the per-example label cost for the query expression and the per-example training cost for the chosen model training method.

    For instance, the constant factor is approximately 1.2 when the training cost is 100x smaller than the label cost.
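
    The loop below is a minimal sketch of this iterative procedure under our own simplifying assumptions: the labeled set grows geometrically, a cheap model is retrained each round, and training stops once the cross-validated 95th-percentile q-error (q-error = max(est/true, true/est)) meets a target. The paper additionally uses confidence intervals on the tail q-error and derives the optimal growth factor from the labeling-to-training cost ratio; here the growth factor is just a parameter and the stopping check uses the point estimate.

```python
# Minimal sketch of incremental training-data generation with a geometric step size
# and a tail q-error stopping criterion (illustrative simplification, not the paper's algorithm).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def q_error(est, true):
    """q-error = max(est/true, true/est); 1.0 means a perfect estimate."""
    return np.maximum(est / true, true / est)

def build_model(sample_query, label_query, target_q95=2.0,
                initial_size=128, growth=2.0, max_size=65_536):
    """sample_query() draws a featurized query instance; label_query(x) returns its
    (possibly approximate) selectivity. Labeling is assumed to dominate the cost."""
    X, y = [], []
    n = initial_size
    while True:
        while len(X) < n:                          # label only the newly requested examples
            x = sample_query()
            X.append(x)
            y.append(np.log(label_query(x)))
        Xa, ya = np.array(X), np.array(y)

        model = GradientBoostingRegressor(n_estimators=200, max_depth=4)
        # Cross-validation stands in for a large held-out test set when checking accuracy.
        pred = cross_val_predict(model, Xa, ya, cv=5)
        q95 = np.percentile(q_error(np.exp(pred), np.exp(ya)), 95)

        if q95 <= target_q95 or n >= max_size:
            model.fit(Xa, ya)                      # final fit on all labeled examples
            return model, len(X), q95
        n = int(n * growth)                        # geometric step bounds wasted labeling work
```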

     

    Efficient approximation of training labels

    To reduce the absolute cost of training data generation in each iteration, we propose to use approximate selectivity labels for model training.

    To control the adverse impact on model accuracy, we use a small relative error threshold during this approximation.

    The key idea is that if the true selectivity is large, it can be efficiently approximated using a relatively small uniform random sample, compared to the case where the true selectivity is small.
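
    A back-of-the-envelope calculation (ours, not from the paper) makes this concrete. For a uniform sample of n rows and true selectivity p, the sampled estimate has relative standard error roughly sqrt((1-p)/(np)), so reaching relative error δ at about 95% confidence (two standard errors) needs on the order of:

```latex
% Sample size needed to estimate selectivity p within relative error delta (~95% confidence).
\[
  \mathrm{relerr}(\hat{p}) \approx \sqrt{\frac{1-p}{n\,p}}
  \quad\Longrightarrow\quad
  n \;\approx\; \frac{4\,(1-p)}{\delta^{2}\,p}
\]
% With delta = 0.1: p = 0.1 needs on the order of 3.6k sampled rows,
% while p = 1e-4 needs on the order of 4 million rows.
```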

    The challenge lies in identifying appropriate sample size for different examples.

    We present an algorithm that takes a set of unlabeled training examples and an error threshold as input, and uses uniform random data samples of increasing sizes to progressively refine the probabilistic estimate of the selectivity labels,

    until sufficiently accurate labels are determined.
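
    A minimal sketch of this idea (ours; the paper's probabilistic estimate refinement is more careful) is shown below: evaluate the predicate on uniform samples of geometrically increasing size and stop once a normal-approximation confidence interval on the selectivity is tighter than the relative-error threshold. Queries with large selectivity terminate on small samples.

```python
# Progressive sampling to produce an approximate selectivity label (illustrative only).
import numpy as np

def approximate_selectivity(data, predicate, rel_err=0.1, z=2.0,
                            initial_sample=1_000, growth=4, seed=0):
    rng = np.random.default_rng(seed)
    n_total = len(data)
    size = initial_sample
    while True:
        size = min(size, n_total)
        idx = rng.choice(n_total, size=size, replace=False)
        matches = predicate(data[idx]).sum()
        p_hat = max(matches, 1) / size                     # guard against zero matches
        half_width = z * np.sqrt(p_hat * (1 - p_hat) / size)
        if half_width <= rel_err * p_hat or size == n_total:
            return p_hat                                   # approximate selectivity label
        size *= growth                                     # need a larger sample; try again

# Example: label a conjunctive range predicate on a synthetic 2-column table.
rng = np.random.default_rng(1)
table = rng.normal(size=(1_000_000, 2))
label = approximate_selectivity(table, lambda t: (t[:, 0] > 0) & (t[:, 1] > 0))
print("approximate selectivity:", label)
```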

     

    Extended applicability of regression models.

    With reduced label cost due to approximation, faster-to-train models proposed in [17], such as tree-ensembles, become more attractive as they can ensure that the total model construction overhead is closer to its lower bound,

    i.e., the cost of generating only the required number of training examples.

    But these models were evaluated only for single-table range predicates [17].

    The second contribution of this work is to demonstrate that low overhead regression models [17] can be extended to support selectivity estimation for

    select-project-join queries with multiple categorical (IN) and range filters on the base tables, which is an important subclass of queries.

    We also discuss how these models can support other filter types by leveraging run-time features, i.e., sample-bitmap [25].
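
    As an illustration of what such support could look like, the sketch below encodes one instance of a fixed query expression with range and IN filters as a fixed-length feature vector: each range predicate contributes its normalized lower/upper bounds, and each IN clause over a low-cardinality categorical column contributes a membership bitmap. The column names, domains, and exact encoding are our own assumptions; the paper's featurization may differ in detail.

```python
# Illustrative featurization of a query instance with range predicates and IN clauses.
import numpy as np

# Per-expression metadata (assumed known when the model for this expression is built).
RANGE_COLUMNS = {"price": (0.0, 500.0), "quantity": (1.0, 100.0)}   # (min, max) per column
IN_COLUMNS = {"country": ["US", "DE", "JP", "FR", "IN"]}            # domain per column

def featurize(range_preds, in_preds):
    """range_preds: {col: (lo, hi)}, in_preds: {col: set of values}."""
    feats = []
    for col, (cmin, cmax) in RANGE_COLUMNS.items():
        lo, hi = range_preds.get(col, (cmin, cmax))        # missing predicate = full range
        feats += [(lo - cmin) / (cmax - cmin), (hi - cmin) / (cmax - cmin)]
    for col, domain in IN_COLUMNS.items():
        values = in_preds.get(col, set(domain))            # missing IN clause = all values
        feats += [1.0 if v in values else 0.0 for v in domain]
    return np.array(feats)

# One query instance of the expression:
#   ... WHERE price BETWEEN 10 AND 50 AND country IN ('US', 'DE')
x = featurize({"price": (10.0, 50.0)}, {"country": {"US", "DE"}})
print(x)   # e.g. [0.02, 0.1, 0.0, 1.0, 1, 1, 0, 0, 0]
```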

    Finally, our model design choices focus on reducing estimation overhead.

     

    Using models for join estimates

    Regression models for single tables [17] can be adapted for joins by following the same approach as statistics on views [9, 19], i.e., treating a materialized view of the join as a base table.

    The materialized join can be discarded after constructing the required model.

    However, the cost of materialization adds to the total model construction cost.

    We empirically found that a reasonable approximation of the join selectivity function can be learned using a large enough sample of the join, which can be collected relatively efficiently without materializing the entire join result using existing techniques [18, 45, 50].

    Overall, our models can typically deliver accuracy comparable to join-samples by using only compile-time information similar to histograms.

    While we need to create a dedicated model for each query expression (refer to Section 5.5 for exceptions) rather than a single global model as in [25], the total memory footprint does not blow up quickly as individual models are small.
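
    The join-sample labeling idea above can be sketched as follows (our simplification; the paper relies on dedicated join-sampling techniques [18, 45, 50] to collect the sample without materializing the full join): once a uniform sample of the join result is available, it is treated like a base table, and each training example for the join expression is labeled with the fraction of sampled join rows that satisfy its filters.

```python
# Illustrative labeling of training examples for a join expression from a join sample.
import numpy as np

def label_from_join_sample(join_sample, predicate):
    """join_sample: ndarray of sampled join rows; predicate: boolean filter over rows."""
    matches = predicate(join_sample).sum()
    return max(matches, 1) / len(join_sample)   # fraction of sampled join rows matching

# Hypothetical sample of an R ⋈ S join with two filterable attributes (one from each side).
rng = np.random.default_rng(2)
join_sample = rng.uniform(0, 100, size=(50_000, 2))
sel = label_from_join_sample(join_sample, lambda t: (t[:, 0] < 20) & (t[:, 1] > 50))
print("label (selectivity w.r.t. the join):", sel)
```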

     

    Extensive evaluation and plan improvements.

    We present an extensive evaluation with 42 query expressions across 3 different datasets, using queries with joins of up to 5 tables and filter predicates on multiple base tables.

    We show that regression models with only 16KB memory can deliver 10-100x better 95th percentile error values compared to traditional techniques such as histograms, statistics on views, and join-samples.

    Surprisingly, our 16 KB regression models delivered accuracy comparable to custom-designed supervised models [25] with a much larger number of model parameters.

    Our model construction improvements bring 10x or more savings for a large fraction of expressions compared to the existing training method.

    Finally, we also evaluate improvement in quality of plans when estimates produced by our trained model are injected during query optimization.

    A sample experiment consisting of 500 test query instances with a fixed query template shows that injected estimates bring improvement similar to injecting true selectivities.

    Overall, 30% of the queries improved by a factor of at least 1.2, and 10% of the instances improved by a factor of 10 or more.

     

     

     

     

  • Original post: https://www.cnblogs.com/fxjwind/p/14475896.html