• Airflow notes


    Official site: http://airflow.incubator.apache.org/project.html

    Here we pass a string that defines the dag_id, which serves as a unique identifier for your DAG.
    The first argument task_id acts as a unique identifier for the task.

    The precedence rules for a task's arguments are as follows (highest first):
      1. Explicitly passed arguments
      2. Values that exist in the default_args dictionary
      3. The operator's default value, if one exists
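
The precedence order above can be sketched in plain Python. This is only an illustration of the lookup order, not Airflow's actual implementation; the function name `resolve_argument` and the sample dictionaries are hypothetical.

```python
def resolve_argument(name, explicit, default_args, operator_defaults):
    """Return the value for `name`, following the precedence order above."""
    if name in explicit:            # 1. explicitly passed arguments
        return explicit[name]
    if name in default_args:        # 2. values in the default_args dictionary
        return default_args[name]
    if name in operator_defaults:   # 3. the operator's default value, if one exists
        return operator_defaults[name]
    raise KeyError("no value available for argument {!r}".format(name))


# Hypothetical sample values, for illustration only.
default_args = {"owner": "airflow", "retries": 1}
operator_defaults = {"retries": 0, "queue": "default"}

print(resolve_argument("retries", {"retries": 3}, default_args, operator_defaults))
print(resolve_argument("retries", {}, default_args, operator_defaults))
print(resolve_argument("queue", {}, default_args, operator_defaults))
```

The first call takes the explicit value, the second falls back to default_args, and the third falls back to the operator default.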

    A task must include or inherit the arguments task_id and owner

    Let’s assume we’re saving the code from the previous step in tutorial.py in the DAGs folder referenced in your airflow.cfg.
    The default location for your DAGs is ~/airflow/dags.

    Note that if you use depends_on_past=True, individual task instances will depend on the success of the preceding task instance, except for the start_date specified itself, for which this dependency is disregarded.

    You can also set options with environment variables by using this format: $AIRFLOW__{SECTION}__{KEY}
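
As a small sketch of the naming format above, the helper below builds the variable name for a given section and key. The helper name `config_env_var` is hypothetical; the section and key are upper-cased, as in AIRFLOW__CORE__SQL_ALCHEMY_CONN.

```python
def config_env_var(section, key):
    """Build the environment variable name for an airflow.cfg option."""
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())


# The sql_alchemy_conn option in the [core] section maps to this name.
print(config_env_var("core", "sql_alchemy_conn"))
```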
    ================================
    # print the list of active DAGs
    airflow list_dags

    # print the list of tasks in the "tutorial" DAG
    airflow list_tasks tutorial

    airflow backfill: runs a DAG over a date range, e.g. airflow backfill tutorial -s 2015-06-01 -e 2015-06-07
    airflow test: tests a single task instance.
    airflow webserver: starts a web server.

    ===============

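    1 "LocalExecutor": an executor that can parallelize task instances locally.

    2 Configuration file path: $AIRFLOW_HOME/airflow.cfg. The sql_alchemy_conn option in this file points to the metadata database.

    3 Default value of AIRFLOW_HOME: ~/airflow

    4 Admin -> Connections: the pipeline code you author references the 'conn_id' of the Connection objects defined there.

    5 A value set in an environment variable takes precedence over the corresponding value in the configuration file.

    6 Environment variables for connections must have the prefix AIRFLOW_CONN_ and must be all uppercase: if the conn_id is named postgres_master, the environment variable should be named AIRFLOW_CONN_POSTGRES_MASTER.

    The value of a connection environment variable should be in URI format, e.g. postgres://user:password@localhost:5432/master or s3://accesskey:secretkey@S3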
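
A minimal sketch of the connection naming rule and the URI format, using only the standard library. The helper names `conn_env_var` and `parse_conn_uri` and the returned field names are assumptions made for illustration; this is not how Airflow itself parses connections.

```python
from urllib.parse import urlparse


def conn_env_var(conn_id):
    """Environment variable name for a connection: prefix plus upper-cased conn_id."""
    return "AIRFLOW_CONN_" + conn_id.upper()


def parse_conn_uri(uri):
    """Split a connection URI into its parts (field names are illustrative)."""
    parts = urlparse(uri)
    return {
        "conn_type": parts.scheme,
        "login": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "schema": parts.path.lstrip("/"),
    }


print(conn_env_var("postgres_master"))
print(parse_conn_uri("postgres://user:password@localhost:5432/master"))
```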

    7 Users can specify a logs folder in airflow.cfg. By default, it is in the AIRFLOW_HOME directory. 

      Logs are stored in the log folder as {dag_id}/{task_id}/{execution_date}/{try_number}.log.
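
The log path pattern above can be sketched as follows. The function name `task_log_path`, the example base folder, and the choice of ISO format for the execution date are assumptions for illustration.

```python
import posixpath
from datetime import datetime


def task_log_path(log_folder, dag_id, task_id, execution_date, try_number):
    """Build {dag_id}/{task_id}/{execution_date}/{try_number}.log under log_folder."""
    return posixpath.join(
        log_folder,
        dag_id,
        task_id,
        execution_date.isoformat(),  # assumption: the date is rendered in ISO format
        "{}.log".format(try_number),
    )


# Hypothetical log folder and task names.
print(task_log_path("/home/user/airflow/logs", "tutorial", "print_date",
                    datetime(2016, 1, 1), 1))
```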

    8 Operators: the airflow/contrib/ directory contains more operators built by the community.

    9 a) SubDAG operators should contain a factory method that returns a DAG object.

      b) SubDAGs must have a schedule and be enabled.

      c) Refrain from using depends_on_past=True in tasks within the SubDAG, as this can be confusing.

      d) It is common to use the SequentialExecutor if you want to run the SubDAG in-process and effectively limit its parallelism to one. Using LocalExecutor can be problematic because it may over-subscribe the worker by running multiple tasks in a single slot.

    10

     

    11 If you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59
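
The timing rule above amounts to "a run starts once the period it covers has ended". A minimal sketch of that arithmetic, with a hypothetical helper name `earliest_trigger_time`:

```python
from datetime import datetime, timedelta


def earliest_trigger_time(execution_date, schedule_interval):
    """A run stamped execution_date can start only after its interval has ended."""
    return execution_date + schedule_interval


# The daily run stamped 2016-01-01 covers that whole day,
# so it can be triggered only once 2016-01-02 begins.
print(earliest_trigger_time(datetime(2016, 1, 1), timedelta(days=1)))
```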

    12 The scheduler starts an instance of the executor specified in your airflow.cfg. If it happens to be the LocalExecutor, tasks will be executed as subprocesses; in the case of CeleryExecutor and MesosExecutor, tasks are executed remotely.

    13 Airflow lets you assign any Task to an abstract Pool, and each Pool has a configurable number of Slots. A Task occupies one Slot when it starts; once all Slots are taken, the remaining tasks wait.
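
The slot behaviour can be illustrated with a toy model. The class `SlotPool` and the task names are hypothetical, and the sketch ignores everything else the scheduler does; it only shows tasks waiting once the slots are full.

```python
from collections import deque


class SlotPool:
    """Toy model of a pool: at most `slots` tasks run, the rest wait in order."""

    def __init__(self, slots):
        self.slots = slots
        self.running = set()
        self.waiting = deque()

    def start(self, task_id):
        """Take a slot if one is free; otherwise queue the task. Returns True if running."""
        if len(self.running) < self.slots:
            self.running.add(task_id)
            return True
        self.waiting.append(task_id)
        return False

    def finish(self, task_id):
        """Release the task's slot and hand it to the longest-waiting task, if any."""
        self.running.discard(task_id)
        if self.waiting and len(self.running) < self.slots:
            self.running.add(self.waiting.popleft())


pool = SlotPool(slots=2)
results = [pool.start(t) for t in ("extract", "transform", "load")]
print(results)          # the third task has to wait
pool.finish("extract")  # frees a slot, so "load" starts
print(sorted(pool.running))
```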

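    14 Processing a DAG in one scheduling round may take so long that it is still unfinished when the next round begins. In that case Airflow does not create a process for that DAG in the new round, so the processing of the other DAGs is not blocked.

    15 A task must include or inherit the arguments task_id and owner, otherwise Airflow will raise an exception. For the full list of arguments a task supports, see the BaseOperator constructor.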

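    16 retry_exponential_backoff: makes the interval between retries progressively longer
    wait_for_downstream: a task instance waits until the tasks immediately downstream of its previous instance have finished successfully, so it does not run while the previous run is still in progress
    weight_rule: determines how each task's priority weight is computed
    execution_timeout: limits how long a task may run
    trigger_rule: controls the condition under which a task is triggered
    task_concurrency: limits how many instances of the same task can run in parallel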
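
To show what "progressively longer" retry intervals look like, here is a simplified doubling sketch. It is not Airflow's exact formula (the real scheduler computes the delay differently); the function name `retry_delays` is hypothetical, while retry_delay and max_retry_delay mirror the BaseOperator argument names.

```python
from datetime import timedelta


def retry_delays(retry_delay, retries, max_retry_delay=None):
    """Return the wait before each retry, doubling every time (simplified model)."""
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)
        if max_retry_delay is not None:
            delay = min(delay, max_retry_delay)  # cap the delay if a maximum is set
        delays.append(delay)
    return delays


for d in retry_delays(timedelta(minutes=5), retries=4):
    print(d)
```

With a 5-minute base delay the waits grow to 5, 10, 20 and 40 minutes; passing max_retry_delay caps the growth.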

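    17 airflow test runs the task directly, without checking its dependencies.

    18 Diagnosing occasional excessive memory usage in Airflow: old tasks had hung and never finished, so they stayed in the running state. Over time, the number of tasks in the running and queued states exceeded concurrency, and every newly created TaskInstance remained in the scheduled state. Each time the scheduler schedules tasks, it fetches all task instances in the scheduled state and sorts and otherwise processes them; with so many scheduled task instances, this consumed a large amount of memory.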

    19 Template parameters:

    {
        'dag': task.dag,
        'ds': ds,
        'ds_nodash': ds_nodash,
        'ts': ts,
        'ts_nodash': ts_nodash,
        'yesterday_ds': yesterday_ds,
        'yesterday_ds_nodash': yesterday_ds_nodash,
        'tomorrow_ds': tomorrow_ds,
        'tomorrow_ds_nodash': tomorrow_ds_nodash,
        'END_DATE': ds,
        'end_date': ds,
        'dag_run': dag_run,
        'run_id': run_id,
        'execution_date': self.execution_date,
        'prev_execution_date': prev_execution_date,
        'next_execution_date': next_execution_date,
        'latest_date': ds,
        'macros': macros,
        'params': params,
        'tables': tables,
        'task': task,
        'task_instance': self,
        'ti': self,
        'task_instance_key_str': ti_key_str,
        'conf': configuration,
        'test_mode': self.test_mode,
        'var': {
            'value': VariableAccessor(),
            'json': VariableJsonAccessor()
        }
    }
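
Several of the date-related entries above are derived from the execution date. The sketch below shows one plausible derivation with the standard library; the function name `date_template_vars` is hypothetical and the exact formats are assumptions based on the variable names, not a copy of Airflow's code.

```python
from datetime import datetime, timedelta


def date_template_vars(execution_date):
    """Derive the date-string template variables from an execution date."""
    ds = execution_date.strftime("%Y-%m-%d")
    yesterday_ds = (execution_date - timedelta(days=1)).strftime("%Y-%m-%d")
    tomorrow_ds = (execution_date + timedelta(days=1)).strftime("%Y-%m-%d")
    ts = execution_date.isoformat()
    return {
        "ds": ds,
        "ds_nodash": ds.replace("-", ""),
        "ts": ts,
        "ts_nodash": ts.replace("-", "").replace(":", ""),
        "yesterday_ds": yesterday_ds,
        "yesterday_ds_nodash": yesterday_ds.replace("-", ""),
        "tomorrow_ds": tomorrow_ds,
        "tomorrow_ds_nodash": tomorrow_ds.replace("-", ""),
    }


for name, value in date_template_vars(datetime(2016, 1, 1)).items():
    print(name, value)
```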

    20

  • Original article: https://www.cnblogs.com/testzcy/p/8480036.html