Reindexing
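The preceding sections described how to build graph indexes and vertex-centric indexes to improve query performance. If the indexed label or key is created in the same transaction as the index, the index takes effect immediately and no reindexing is needed. Conversely, if the key or label to be indexed already existed before the index was defined, the entire graph must be reindexed before the index takes effect.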
Overview
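JanusGraph can apply incremental updates to an index as soon as the index is defined. Before the index is complete and usable, however, JanusGraph must also perform a one-time full read over all existing elements associated with the newly indexed schema types. Once this reindexing job has finished, the index is fully populated and ready for use. It must then be activated by setting its status to ENABLED.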
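The difference between incremental updates and the one-time full read can be illustrated with a small toy model. This is only a sketch: ToyIndexedGraph and all of its methods are hypothetical names invented for illustration and are not part of JanusGraph.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (NOT JanusGraph code): an index on a single property that only
// sees writes made after it was defined, until a full reindex pass runs.
final class ToyIndexedGraph {
    private final List<String> values = new ArrayList<>();            // one property value per vertex
    private final Map<String, List<Integer>> index = new HashMap<>(); // value -> vertex positions
    private boolean indexDefined = false;

    void addVertex(String value) {
        values.add(value);
        if (indexDefined) {
            // Incremental update: writes after the definition are indexed as they arrive.
            index.computeIfAbsent(value, k -> new ArrayList<>()).add(values.size() - 1);
        }
    }

    void defineIndex() {
        indexDefined = true;
    }

    // One-time full read over every existing element, rebuilding the index.
    void reindex() {
        index.clear();
        for (int i = 0; i < values.size(); i++) {
            index.computeIfAbsent(values.get(i), k -> new ArrayList<>()).add(i);
        }
    }

    int indexHits(String value) {
        return index.getOrDefault(value, List.of()).size();
    }

    public static void main(String[] args) {
        ToyIndexedGraph g = new ToyIndexedGraph();
        g.addVertex("foo");   // written before the index exists
        g.defineIndex();
        g.addVertex("foo");   // written afterwards, indexed incrementally
        System.out.println("hits before reindex: " + g.indexHits("foo"));
        g.reindex();
        System.out.println("hits after reindex:  " + g.indexHits("foo"));
    }
}
```

In this model the vertex written before defineIndex() is invisible to the index until reindex() runs, which is the situation reindexing is meant to resolve.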
Prior to Reindexing
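The starting point of the reindexing process is the construction of an index. Note that a global graph index is uniquely identified by its name, whereas a vertex-centric index is uniquely identified by the combination of its name and the edge label or property key on which it is defined.
After defining a new index on existing schema elements, it is recommended to wait a few minutes so that the index can be announced to the whole cluster. Take note of the index name, because it is required when reindexing.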
Preparing to Reindex
A reindex job can be run on either of two execution frameworks:
- MapReduce
- JanusGraphManagement
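Reindexing on MapReduce supports large, horizontally distributed databases. Reindexing on JanusGraphManagement runs a single-machine OLAP job, which offers convenience and speed for databases small enough to be handled by one node.
Reindexing requires:
- index name (the name supplied by the user when the index was built)
- index type (the name of the edge label or property key on which a vertex-centric index is built); applies only to vertex-centric indexes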
Executing a Reindex Job on MapReduce
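The recommended way to run a reindex job on MapReduce is through the MapReduceIndexManagement class. The rough outline is:
- Open a JanusGraph instance
- Pass the graph instance to the MapReduceIndexManagement constructor
- Call updateIndex(<index>, SchemaAction.REINDEX) on the MapReduceIndexManagement instance
- If the index has not yet been enabled, enable it through JanusGraphManagement
This class implements an updateIndex method that supports only the REINDEX and REMOVE_INDEX actions. It starts a Hadoop MapReduce job using the Hadoop configuration and libraries on the classpath. Both Hadoop 1 and Hadoop 2 are supported. The class obtains index and storage backend metadata from the JanusGraph instance passed to its constructor.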
    graph = JanusGraphFactory.open(...)
    mgmt = graph.openManagement()
    mr = new MapReduceIndexManagement(graph)
    mr.updateIndex(mgmt.getRelationIndex(mgmt.getRelationType("battled"), "battlesByTime"), SchemaAction.REINDEX).get()
    mgmt.commit()
Reindex Example on MapReduce
The following is a complete example that uses Cassandra as the storage backend.
    import org.janusgraph.graphdb.database.management.ManagementSystem

    // Open a graph
    graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
    g = graph.traversal()

    // Define a property
    mgmt = graph.openManagement()
    desc = mgmt.makePropertyKey("desc").dataType(String.class).make()
    mgmt.commit()

    // Insert some data
    graph.addVertex("desc", "foo bar")
    graph.addVertex("desc", "foo baz")
    graph.tx().commit()

    // Run a query -- note the planner warning recommending the use of an index
    g.V().has("desc", containsText("baz"))

    // Create an index
    mgmt = graph.openManagement()
    desc = mgmt.getPropertyKey("desc")
    mixedIndex = mgmt.buildIndex("mixedExample", Vertex.class).addKey(desc).buildMixedIndex("search")
    mgmt.commit()

    // Rollback or commit transactions on the graph which predate the index definition
    graph.tx().rollback()

    // Block until the SchemaStatus transitions from INSTALLED to REGISTERED
    // (awaitGraphIndexStatus is a static utility, so it is called on the class)
    report = ManagementSystem.awaitGraphIndexStatus(graph, "mixedExample").call()

    // Run a JanusGraph-Hadoop job to reindex
    mgmt = graph.openManagement()
    mr = new MapReduceIndexManagement(graph)
    mr.updateIndex(mgmt.getGraphIndex("mixedExample"), SchemaAction.REINDEX).get()

    // Enable the index
    mgmt = graph.openManagement()
    mgmt.updateIndex(mgmt.getGraphIndex("mixedExample"), SchemaAction.ENABLE_INDEX).get()
    mgmt.commit()

    // Block until the SchemaStatus is ENABLED
    mgmt = graph.openManagement()
    report = ManagementSystem.awaitGraphIndexStatus(graph, "mixedExample").status(SchemaStatus.ENABLED).call()
    mgmt.rollback()

    // Run a query -- JanusGraph will use the new index, no planner warning
    g.V().has("desc", containsText("baz"))

    // Concerned that JanusGraph could have read cache in that last query, instead of relying on the index?
    // Start a new instance to rule out cache hits. Now we're definitely using the index.
    graph.close()
    graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
    // Recreate the traversal source: the old one is bound to the closed graph
    g = graph.traversal()
    g.V().has("desc", containsText("baz"))
Executing a Reindex Job on JanusGraphManagement
To run a reindex job on JanusGraphManagement, call JanusGraphManagement.updateIndex() with the SchemaAction.REINDEX argument. For example:
    m = graph.openManagement()
    i = m.getGraphIndex('indexName')
    m.updateIndex(i, SchemaAction.REINDEX).get()
    m.commit()
Example for JanusGraphManagement
The following example uses BerkeleyDB as the storage backend.
    import org.janusgraph.graphdb.database.management.ManagementSystem

    // Load some data from a file; only the property keys are defined up front, no index yet
    graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje.properties')
    g = graph.traversal()
    m = graph.openManagement()
    m.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.LIST).make()
    m.makePropertyKey('lang').dataType(String.class).cardinality(Cardinality.LIST).make()
    m.makePropertyKey('age').dataType(Integer.class).cardinality(Cardinality.LIST).make()
    m.commit()
    graph.io(IoCore.gryo()).readGraph('data/tinkerpop-modern.gio')
    graph.tx().commit()

    // Run a query -- note the planner warning recommending the use of an index
    g.V().has('name', 'lop')
    graph.tx().rollback()

    // Create an index
    m = graph.openManagement()
    m.buildIndex('names', Vertex.class).addKey(m.getPropertyKey('name')).buildCompositeIndex()
    m.commit()
    graph.tx().commit()

    // Block until the SchemaStatus transitions from INSTALLED to REGISTERED
    ManagementSystem.awaitGraphIndexStatus(graph, 'names').status(SchemaStatus.REGISTERED).call()

    // Reindex using JanusGraphManagement; get() blocks until the job has finished
    m = graph.openManagement()
    i = m.getGraphIndex('names')
    m.updateIndex(i, SchemaAction.REINDEX).get()
    m.commit()

    // Block until the SchemaStatus is ENABLED
    ManagementSystem.awaitGraphIndexStatus(graph, 'names').status(SchemaStatus.ENABLED).call()

    // Run a query -- JanusGraph will use the new index, no planner warning
    g.V().has('name', 'lop')
    graph.tx().rollback()

    // Concerned that JanusGraph could have read cache in that last query, instead of relying on the index?
    // Start a new instance to rule out cache hits. Now we're definitely using the index.
    graph.close()
    graph = JanusGraphFactory.open("conf/janusgraph-berkeleyje.properties")
    g = graph.traversal()
    g.V().has('name', 'lop')
Index Removal
Index removal is a manual process made up of several steps. These steps must be carried out carefully to avoid inconsistencies.
Overview
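Index removal is a two-stage process. In the first stage, JanusGraph signals, through the storage backend, to all other JanusGraph instances that the index is slated for deletion. This step sets the index status to DISABLED. From that point on, JanusGraph stops using the index to answer queries and stops updating it incrementally. The data associated with the index remains in the storage backend but is ignored.
The second stage depends on whether the index is a mixed index or a composite index. A composite index can be deleted through JanusGraph; as with reindexing, the deletion can be run on either MapReduce or JanusGraphManagement. A mixed index, however, must be dropped manually in the index backend, because JanusGraph provides no automated mechanism for deleting it.
Index removal deletes everything associated with the index except its schema definition and its DISABLED status. This schema stub remains even after the index has been removed.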
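The two stages can be pictured with a small toy model. This is a sketch for illustration only: ToyIndex, its Status enum, and its methods are hypothetical names and do not correspond to JanusGraph classes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (NOT JanusGraph code) of the two removal stages.
final class ToyIndex {
    enum Status { ENABLED, DISABLED }

    private final String name;                                     // schema stub: survives removal
    private Status status = Status.ENABLED;
    private final Map<String, Integer> entries = new HashMap<>();  // index data

    ToyIndex(String name) {
        this.name = name;
    }

    void put(String key, int vertexId) {
        // Incremental updates stop once the index is disabled.
        if (status == Status.ENABLED) {
            entries.put(key, vertexId);
        }
    }

    boolean usableForQueries() {
        return status == Status.ENABLED;
    }

    // Stage 1: stop using and updating the index; its data stays in place.
    void disable() {
        status = Status.DISABLED;
    }

    // Stage 2: delete the index data; only permitted once the index is DISABLED.
    void remove() {
        if (status != Status.DISABLED) {
            throw new IllegalStateException("index " + name + " must be DISABLED before removal");
        }
        entries.clear();
    }

    String name() { return name; }
    Status status() { return status; }
    int entryCount() { return entries.size(); }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex("byName");
        idx.put("jupiter", 1);
        idx.disable();
        System.out.println("after stage 1: status=" + idx.status() + ", entries=" + idx.entryCount());
        idx.remove();
        System.out.println("after stage 2: status=" + idx.status() + ", entries=" + idx.entryCount());
    }
}
```

After both stages the model still knows the index name and reports DISABLED, mirroring the schema stub that remains after removal.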
Preparing for Index Removal
If the index is currently enabled, it must first be disabled. This is done through the ManagementSystem.
    mgmt = graph.openManagement()
    rindex = mgmt.getRelationIndex(mgmt.getRelationType("battled"), "battlesByTime")
    mgmt.updateIndex(rindex, SchemaAction.DISABLE_INDEX).get()
    gindex = mgmt.getGraphIndex("byName")
    mgmt.updateIndex(gindex, SchemaAction.DISABLE_INDEX).get()
    mgmt.commit()
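Once the status of every key on the index has changed to DISABLED, the index is ready to be removed. A utility in ManagementSystem can be used to wait for the DISABLED status: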
ManagementSystem.awaitGraphIndexStatus(graph, 'byName').status(SchemaStatus.DISABLED).call()
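Conceptually, waiting for a status is a poll-until-match loop with a timeout. The following is a generic sketch of that pattern and makes no claim about how ManagementSystem implements it; StatusWaiter, its await method, and the status source in main are all hypothetical.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Generic polling sketch (NOT the JanusGraph implementation).
final class StatusWaiter {
    // Polls `current` until it equals `target` or `timeout` has elapsed.
    // Returns true if the target status was observed, false on timeout or interruption.
    static <S> boolean await(Supplier<S> current, S target, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (target.equals(current.get())) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Hypothetical status source: reports DISABLED roughly 200 ms after start.
        Supplier<String> status =
                () -> System.currentTimeMillis() - start >= 200 ? "DISABLED" : "ENABLED";
        boolean reached = await(status, "DISABLED", Duration.ofSeconds(5), Duration.ofMillis(50));
        System.out.println("reached DISABLED: " + reached);
    }
}
```

A bounded timeout matters here because, as described under the common problems, a failed instance can keep a status change from ever completing.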
After a composite index has been set to DISABLED, it can be removed using either of two execution frameworks:
- MapReduce
- JanusGraphManagement
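Index removal on MapReduce supports large, horizontally distributed databases. Index removal on JanusGraphManagement runs a single-machine OLAP job and is generally suited to databases that fit on one node.
Index removal requires:
- index name
- index type (applies only to vertex-centric indexes)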
Executing an Index Removal Job on MapReduce
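As with reindexing, the recommended way to run an index removal job on MapReduce is through the MapReduceIndexManagement class. The rough outline is:
- Open a JanusGraph instance
- If the index has not yet been disabled, disable it through JanusGraphManagement
- Pass the graph instance to the MapReduceIndexManagement constructor
- Call updateIndex(<index>, SchemaAction.REMOVE_INDEX)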
Example for MapReduce
    import org.janusgraph.graphdb.database.management.ManagementSystem

    // Load the "Graph of the Gods" sample data
    graph = JanusGraphFactory.open('conf/janusgraph-cassandra-es.properties')
    g = graph.traversal()
    GraphOfTheGodsFactory.load(graph)
    g.V().has('name', 'jupiter')

    // Disable the "name" composite index
    m = graph.openManagement()
    nameIndex = m.getGraphIndex('name')
    m.updateIndex(nameIndex, SchemaAction.DISABLE_INDEX).get()
    m.commit()
    graph.tx().commit()

    // Block until the SchemaStatus transitions to DISABLED
    ManagementSystem.awaitGraphIndexStatus(graph, 'name').status(SchemaStatus.DISABLED).call()

    // Delete the index using MapReduceIndexManagement
    m = graph.openManagement()
    mr = new MapReduceIndexManagement(graph)
    future = mr.updateIndex(m.getGraphIndex('name'), SchemaAction.REMOVE_INDEX)
    m.commit()
    graph.tx().commit()
    future.get()

    // Index still shows up in management interface as DISABLED -- this is normal
    m = graph.openManagement()
    idx = m.getGraphIndex('name')
    idx.getIndexStatus(m.getPropertyKey('name'))
    m.rollback()

    // JanusGraph should issue a warning about this query requiring a full scan
    g.V().has('name', 'jupiter')
Executing an Index Removal Job on JanusGraphManagement
To run an index removal job on JanusGraphManagement, call JanusGraphManagement.updateIndex() with the SchemaAction.REMOVE_INDEX argument. For example:
    m = graph.openManagement()
    i = m.getGraphIndex('indexName')
    m.updateIndex(i, SchemaAction.REMOVE_INDEX).get()
    m.commit()
Example for JanusGraphManagement
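The following example is intended for a BerkeleyDB backend. Note that the listing as given opens conf/janusgraph-cassandra-es.properties; substitute the properties file that matches the backend in use.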
    import org.janusgraph.graphdb.database.management.ManagementSystem

    // Load the "Graph of the Gods" sample data
    graph = JanusGraphFactory.open('conf/janusgraph-cassandra-es.properties')
    g = graph.traversal()
    GraphOfTheGodsFactory.load(graph)
    g.V().has('name', 'jupiter')

    // Disable the "name" composite index
    m = graph.openManagement()
    nameIndex = m.getGraphIndex('name')
    m.updateIndex(nameIndex, SchemaAction.DISABLE_INDEX).get()
    m.commit()
    graph.tx().commit()

    // Block until the SchemaStatus transitions to DISABLED
    ManagementSystem.awaitGraphIndexStatus(graph, 'name').status(SchemaStatus.DISABLED).call()

    // Delete the index using JanusGraphManagement
    m = graph.openManagement()
    nameIndex = m.getGraphIndex('name')
    future = m.updateIndex(nameIndex, SchemaAction.REMOVE_INDEX)
    m.commit()
    graph.tx().commit()
    future.get()

    // The index definition can still be looked up after removal
    m = graph.openManagement()
    nameIndex = m.getGraphIndex('name')
    // Close the management transaction opened only for the lookup
    m.rollback()

    // The index is no longer used to answer this query
    g.V().has('name', 'jupiter')
Common Problems with Index Management
IllegalArgumentException when starting job
The error typically looks like one of the following:
The index mixedExample is in an invalid state and cannot be indexed.
The following index keys have invalid status: desc has status INSTALLED
(status must be one of [REGISTERED, ENABLED])
or
The index mixedExample is in an invalid state and cannot be indexed.
The index has status INSTALLED, but one of [REGISTERED, ENABLED] is required
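When an index is created, its existence is broadcast to all JanusGraph instances. The instances in the cluster need some time to become aware of the index (possibly a few minutes, depending on the size of the cluster). Reindexing can only be run once every instance has acknowledged the index. For this reason, users should wait for a while after creating an index before starting a reindex job.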
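The precondition behind the first error message can be sketched as a per-key status check. This is an illustrative model only: ReindexPrecheck, its Status enum, and blockingKeys are hypothetical names, and the set of acceptable statuses is taken from the error text quoted above.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative precondition check (NOT JanusGraph code).
final class ReindexPrecheck {
    enum Status { INSTALLED, REGISTERED, ENABLED, DISABLED }

    // Statuses from which a reindex may start, per the error message above.
    static final Set<Status> REINDEXABLE = EnumSet.of(Status.REGISTERED, Status.ENABLED);

    // Returns the index keys whose status blocks reindexing, sorted by key name.
    // An empty result means the reindex job may start.
    static Map<String, Status> blockingKeys(Map<String, Status> keyStatuses) {
        Map<String, Status> blocking = new TreeMap<>();
        for (Map.Entry<String, Status> e : keyStatuses.entrySet()) {
            if (!REINDEXABLE.contains(e.getValue())) {
                blocking.put(e.getKey(), e.getValue());
            }
        }
        return blocking;
    }

    public static void main(String[] args) {
        // Same situation as in the error above: key "desc" is still INSTALLED.
        Map<String, Status> blocking = blockingKeys(Map.of("desc", Status.INSTALLED));
        System.out.println("keys blocking the reindex: " + blocking);
    }
}
```

A key stays in the blocking set until its status has moved from INSTALLED to REGISTERED, which is why waiting for that transition comes before the reindex step in the examples.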
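Note, however, that this acknowledgement may never complete if a JanusGraph instance has failed. In other words, the cluster may wait indefinitely for a response from an instance that is no longer alive. In that case, the user must manually remove the failed instance; see http://docs.janusgraph.org/latest/failure-recovery.html. After the cluster state has been restored, the index must be registered again manually through the management system: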
    mgmt = graph.openManagement()
    rindex = mgmt.getRelationIndex(mgmt.getRelationType("battled"), "battlesByTime")
    mgmt.updateIndex(rindex, SchemaAction.REGISTER_INDEX).get()
    gindex = mgmt.getGraphIndex("byName")
    mgmt.updateIndex(gindex, SchemaAction.REGISTER_INDEX).get()
    mgmt.commit()
Could not find index
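This error means that the index was specified incorrectly. For a global index, the index name alone is sufficient. For a vertex-centric index, the index type (the edge label or property key on which the index is defined) must be supplied in addition to the index name.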
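The reason a name alone does not locate a vertex-centric index can be shown with a toy catalog. This is a sketch only: ToyIndexCatalog and its methods are hypothetical and do not mirror JanusGraph's internal schema storage. The names battled, battlesByTime and byName are reused from the examples in this document.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy catalog (NOT JanusGraph code): global indexes are keyed by name,
// vertex-centric indexes by the pair (index type, name).
final class ToyIndexCatalog {
    private final Map<String, String> globalIndexes = new HashMap<>();
    private final Map<String, String> vertexCentricIndexes = new HashMap<>();

    private static String pairKey(String indexType, String name) {
        return indexType + "/" + name;
    }

    void addGlobal(String name) {
        globalIndexes.put(name, "graph index " + name);
    }

    void addVertexCentric(String indexType, String name) {
        vertexCentricIndexes.put(pairKey(indexType, name),
                "vertex-centric index " + name + " on " + indexType);
    }

    Optional<String> findGlobal(String name) {
        return Optional.ofNullable(globalIndexes.get(name));
    }

    Optional<String> findVertexCentric(String indexType, String name) {
        return Optional.ofNullable(vertexCentricIndexes.get(pairKey(indexType, name)));
    }

    public static void main(String[] args) {
        ToyIndexCatalog catalog = new ToyIndexCatalog();
        catalog.addGlobal("byName");
        catalog.addVertexCentric("battled", "battlesByTime");

        // Looking up a vertex-centric index by name alone finds nothing.
        System.out.println(catalog.findGlobal("battlesByTime").isPresent());
        // Supplying the index type as well locates it.
        System.out.println(catalog.findVertexCentric("battled", "battlesByTime").isPresent());
    }
}
```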
Cassandra Mappers Fail with "Too many open files"
The stack trace looks similar to the following:
    java.net.SocketException: Too many open files
            at java.net.Socket.createImpl(Socket.java:447)
            at java.net.Socket.getImpl(Socket.java:510)
            at java.net.Socket.setSoLinger(Socket.java:988)
            at org.apache.thrift.transport.TSocket.initSocket(TSocket.java:118)
            at org.apache.thrift.transport.TSocket.<init>(TSocket.java:109)
Possible remedies:
- Reduce the maximum size of the Cassandra connection pool. For example, consider setting the cassandrathrift storage backend’s max-active and max-idle options to 1 each, and setting max-total to -1. See Chapter 12, Configuration Reference for full listings of connection pool settings on the Cassandra storage backends.
- Increase the nofile ulimit. The ideal value depends on the size of the Cassandra dataset and the throughput of the reindex mappers; if starting at 1024, try an order of magnitude larger: 10000. This is just necessary to sustain lingering TIME_WAIT sockets. The reindex job won’t try to open nearly that many sockets at once.
- Run the reindex task on a multi-node MapReduce cluster to spread out the socket load.