os: centos 7.4
gp: gpdb-5.7.0
三台机器
node1 为master host
node2、node3为segment host
psql 登录 node1 master
$ psql -d peiybdb
peiybdb=# select current_database();
current_database
------------------
peiybdb
(1 row)
create table tmp_t0(
c1 varchar(100),
c2 varchar(100),
c3 varchar(100)
);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'c1' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
peiybdb=# d+ tmp_t0
Table "public.tmp_t0"
Column | Type | Modifiers | Storage | Description
--------+------------------------+-----------+----------+-------------
c1 | character varying(100) | | extended |
c2 | character varying(100) | | extended |
c3 | character varying(100) | | extended |
Has OIDs: no
Distributed by: (c1)
查看node1、node2、node3的文件目录大小
[gpadmin@node1 gpseg-1]$ pwd
/u01/greenplum-data/gpseg-1
[gpadmin@node1 gpseg-1]$ du -sh
109M .
[gpadmin@node2 gpseg0]$ pwd
/u01/greenplum-data/gpseg0
[gpadmin@node2 gpseg0]$ du -sh
109M .
[gpadmin@node3 gpseg1]$ pwd
/u01/greenplum-data/gpseg1
[gpadmin@node3 gpseg1]$ du -sh
109M .
插入数据
peiybdb=# select gp_opt_version();
gp_opt_version
-----------------------------------------------
GPOPT version: 2.55.20, Xerces version: 3.1.2
(1 row)
set optimizer=on;
set optimizer_enumerate_plans=on;
set optimizer_minidump=always;
set optimizer_enable_constant_expression_evaluation=off;
set client_min_messages='debug5';
insert into tmp_t0
(c1,c2,c3)
select generate_series(1, 1000000) as c1,
md5(random()::text) as c2 ,
md5(random()::text) as c3
;
node1 master上查询pg_stat_activity
peiybdb=# x
Expanded display is on.
peiybdb=# select * from pg_stat_activity;
-[ RECORD 1 ]----+-------------------------------------------
datid | 16384
datname | peiybdb
procpid | 10904
sess_id | 20
usesysid | 10
usename | gpadmin
current_query | insert into tmp_t0
| (c1,c2,c3)
| select generate_series(1, 1000000) as c1,
| md5(random()::text) as c2 ,
| md5(random()::text) as c3
|
waiting | f
query_start | 2018-05-02 07:20:33.325637+00
backend_start | 2018-05-02 07:13:17.573451+00
client_addr | 192.168.56.1
client_port | 63217
application_name |
xact_start | 2018-05-02 07:20:33.3212+00
waiting_reason |
rsgid | 0
rsgname |
rsgqueueduration |
-[ RECORD 2 ]----+-------------------------------------------
datid | 16384
datname | peiybdb
procpid | 10906
sess_id | 21
usesysid | 10
usename | gpadmin
current_query | <IDLE>
waiting | f
query_start | 2018-05-02 07:17:17.675179+00
backend_start | 2018-05-02 07:13:17.580506+00
client_addr | 192.168.56.1
client_port | 63218
application_name |
xact_start |
waiting_reason |
rsgid | 0
rsgname |
rsgqueueduration |
-[ RECORD 3 ]----+-------------------------------------------
datid | 16384
datname | peiybdb
procpid | 11095
sess_id | 22
usesysid | 10
usename | gpadmin
current_query | select * from pg_stat_activity;
waiting | f
query_start | 2018-05-02 07:28:07.30111+00
backend_start | 2018-05-02 07:27:54.501407+00
client_addr |
client_port | -1
application_name | psql
xact_start | 2018-05-02 07:28:07.30111+00
waiting_reason |
rsgid | 0
rsgname |
rsgqueueduration |
再次查看node1、node2、node3的文件目录大小
[gpadmin@node1 gpseg-1]$ du -sh
109M .
[gpadmin@node2 gpseg0]$ du -sh
224M .
[gpadmin@node3 gpseg1]$ du -sh
224M .
可以观察到 master 节点node1的数据文件大小并没有发生变化,segment的node2、node3的数据文件增长了不少。
主要就是由于 master 节点是用来存储定义,segment是用来存储数据的。
查看tmp_t0的定义
peiybdb=#
peiybdb=# d
List of relations
Schema | Name | Type | Owner | Storage
--------+--------+-------+---------+---------
public | tmp_t0 | table | gpadmin | heap
(1 row)
peiybdb=# d+ tmp_t0;
Table "public.tmp_t0"
Column | Type | Modifiers | Storage | Description
--------+------------------------+-----------+----------+-------------
c1 | character varying(100) | | extended |
c2 | character varying(100) | | extended |
c3 | character varying(100) | | extended |
Has OIDs: no
Distributed by: (c1)
Distributed by 这个就是tmp_t0表的分布列,表的分布列一定要合理,能够降数据比较均匀的分布到各个segment节点上。
检索数据时能够在多个节点并发处理数据。