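Introduction to the pipeline syntax
A MongoDB aggregation applies a series of special operators to a collection. An operator is a JavaScript object with a single property: the property name is the operator name, and its value is an options object:
    { $name: { /* options */ } }
The supported operators are $project, $match, $limit, $skip, $unwind, $group, and $sort, each with its own set of options. A series of operators is called a pipeline:
    [
      { $project: { /* options */ } },
      { $match: { /* options */ } },
      { $group: { /* options */ } }
    ]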
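When it executes a pipeline, MongoDB pipes the operators into each other. "Pipe" is used here in its Linux sense: the output of one operator becomes the input of the next one. The result of each operator is a new collection of documents. So MongoDB executes the previous pipeline as follows:
    collection | $project | $match | $group => result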
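To make the piping idea concrete, here is a minimal sketch in plain JavaScript that runs outside MongoDB. The helpers `project`, `match`, `limit`, and `runPipeline` are hypothetical stand-ins written for this illustration, not MongoDB APIs; they only mimic the way each stage receives the previous stage's output.

```javascript
// Each "stage" is a function from an array of documents to a new array.
// These helpers are illustrative stand-ins, not part of MongoDB.
const project = (fields) => (docs) =>
  docs.map((doc) => Object.fromEntries(fields.map((f) => [f, doc[f]])));
const match = (predicate) => (docs) => docs.filter(predicate);
const limit = (n) => (docs) => docs.slice(0, n);

// Running a pipeline feeds each stage's output into the next stage.
const runPipeline = (docs, stages) =>
  stages.reduce((current, stage) => stage(current), docs);

const books = [
  { _id: 147, title: "War and Peace", author_id: 72347 },
  { _id: 148, title: "Anna Karenina", author_id: 72347 },
  { _id: 149, title: "Pride and Prejudice", author_id: 42345 }
];

const result = runPipeline(books, [
  match((doc) => doc.author_id === 72347),
  project(["title"]),
  limit(1)
]);

console.log(result); // [ { title: 'War and Peace' } ]
```

The order of the stages matters: swapping `project(["title"])` and the `match` stage would leave nothing to match on, because `author_id` would already have been dropped.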
You can add as many operators as you like to a pipeline, and even use the same operator twice at different positions:
    collection | $match | $group | $match | $project | $group => result
This explains why a pipeline is not written as a simple JavaScript object but as an array of objects: in a single object, the same operator cannot appear twice:
    // The first appearance of $match and $group would be ignored with this syntax
    {
      $match: { /* options */ },
      $group: { /* options */ },
      $match: { /* options */ },
      $project: { /* options */ },
      $group: { /* options */ }
    }

    // So MongoDB imposes a collection of JavaScript objects instead
    [
      { $match: { /* options */ } },
      { $group: { /* options */ } },
      { $match: { /* options */ } },
      { $project: { /* options */ } },
      { $group: { /* options */ } }
    ]
    // That's longer and cumbersome to read, but you'll get used to it
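The duplicate-key behaviour is a property of JavaScript object literals rather than of MongoDB, and it can be checked in any recent JavaScript runtime. In the sketch below, the `step` option is a made-up placeholder used only to tell the stages apart.

```javascript
// A repeated key in an object literal silently overwrites the earlier value.
// "step" is a hypothetical placeholder option, not a real MongoDB option.
const asObject = {
  $match: { step: 1 },
  $group: { step: 2 },
  $match: { step: 3 }
};

console.log(Object.keys(asObject)); // [ '$match', '$group' ]
console.log(asObject.$match);       // { step: 3 }

// An array keeps every stage, in order, even when operators repeat.
const asArray = [
  { $match: { step: 1 } },
  { $group: { step: 2 } },
  { $match: { step: 3 } }
];

console.log(asArray.length); // 3
```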
To execute a pipeline on a MongoDB collection, call the aggregate() function on that collection:
    db.books.aggregate([{ $project: { title: 1 } }]);
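Tip: if you use Node.js, both the native driver (since v0.9.9.2) and the ODM (since v3.1.0) support the new aggregation framework. For example, to execute the previous pipeline on a model, you only need to write:
    Books.aggregate([{ $project: { title: 1 } }], function(err, results) {
      // do something with the result
    });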
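The main benefit of the aggregation framework is that MongoDB runs it without the overhead of the JavaScript engine. It is implemented directly in C++, which makes it very fast. Compared with classic SQL aggregation, its main limitation is that it is restricted to a single collection. In other words, you cannot run a MongoDB aggregation across several collections with a join-like operation. Apart from that, it is very powerful.
In the rest of this article, I'll illustrate the power of the pipeline operators with examples and compare them with their SQL equivalents.
Selecting, renaming, combining
Use the $project operator to select or rename properties of a collection, much like the SELECT clause in SQL.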
    // sample data
    > db.books.find();
    [
      { _id: 147, title: "War and Peace", ISBN: 9780307266934 },
      { _id: 148, title: "Anna Karenina", ISBN: 9781593080273 },
      { _id: 149, title: "Pride and Prejudice", ISBN: 9783526419358 }
    ]

    # sample data
    > SELECT * FROM book;
    +-----+-----------------------+---------------+
    | id  | title                 | ISBN          |
    +-----+-----------------------+---------------+
    | 147 | 'War and Peace'       | 9780307266934 |
    | 148 | 'Anna Karenina'       | 9781593080273 |
    | 149 | 'Pride and Prejudice' | 9783526419358 |
    +-----+-----------------------+---------------+

    > db.books.aggregate([
      { $project: {
        // title is not listed, so it is left out of the output
        // (mixing "title: 0" with included fields is rejected by $project)
        reference: "$ISBN" // use ISBN as source
      } }
    ]);
    [
      { _id: 147, reference: 9780307266934 },
      { _id: 148, reference: 9781593080273 },
      { _id: 149, reference: 9783526419358 }
    ]

    > SELECT id, ISBN AS reference FROM book;
    +-----+---------------+
    | id  | reference     |
    +-----+---------------+
    | 147 | 9780307266934 |
    | 148 | 9781593080273 |
    | 149 | 9783526419358 |
    +-----+---------------+
The $project operator can also use any of the supported expression operators ($and, $or, $gt, $lt, $eq, $add, $mod, $substr, $toLower, $toUpper, $dayOfWeek, $hour, $cond, $ifNull, to name a few) to create computed fields and sub-documents.
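As a hedged illustration of computed fields, the pipeline below uses $toUpper and $substr on the title field of the books sample data and nests the ISBN in a sub-document. The output field names `upperTitle`, `titleStart`, and `info` are invented for this sketch; the pipeline is only defined here as data, and running it requires a MongoDB shell and the `books` collection.

```javascript
// A $project stage that builds computed fields and a sub-document.
// upperTitle, titleStart, and info are hypothetical output names.
const pipeline = [
  {
    $project: {
      // title converted to upper case
      upperTitle: { $toUpper: "$title" },
      // first three characters of the title: [string, start, length]
      titleStart: { $substr: ["$title", 0, 3] },
      // sub-document built from an existing field
      info: { isbn: "$ISBN" }
    }
  }
];

// In a MongoDB shell, assuming the books collection from the sample data:
//   db.books.aggregate(pipeline);
console.log(JSON.stringify(pipeline, null, 2));
```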
Grouping documents
Grouping documents is the job of the $group operator.
    // fastest way
    > db.books.count();
    3

    // if you really want to use aggregation
    > db.books.aggregate([
      { $group: {
        // _id is required, so give it a constant value
        // to group all the collection into one result
        _id: null,
        // increment nbBooks for each document
        nbBooks: { $sum: 1 }
      } }
    ]);
    [
      { _id: null, nbBooks: 3 }
    ]

    > SELECT COUNT(*) FROM book;
    +----------+
    | COUNT(*) |
    +----------+
    | 3        |
    +----------+

    // sample data
    > db.books.find()
    [
      { _id: 147, title: "War and Peace", author_id: 72347 },
      { _id: 148, title: "Anna Karenina", author_id: 72347 },
      { _id: 149, title: "Pride and Prejudice", author_id: 42345 }
    ]

    # sample data
    > SELECT * FROM book
    +-----+---------------------+-----------+
    | id  | title               | author_id |
    +-----+---------------------+-----------+
    | 147 | War and Peace       | 72347     |
    | 148 | Anna Karenina       | 72347     |
    | 149 | Pride and Prejudice | 42345     |
    +-----+---------------------+-----------+

    > db.books.aggregate([
      { $group: {
        // group by author_id
        _id: "$author_id",
        // increment nbBooks for each document
        nbBooks: { $sum: 1 }
      } }
    ]);
    [
      { _id: 72347, nbBooks: 2 },
      { _id: 42345, nbBooks: 1 }
    ]

    > SELECT author_id, COUNT(*) FROM book GROUP BY author_id;
    +-----------+----------+
    | author_id | COUNT(*) |
    +-----------+----------+
    | 72347     | 2        |
    | 42345     | 1        |
    +-----------+----------+
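If the { $sum: 1 } idiom looks odd, it may help to see the same counting written by hand. The sketch below is plain JavaScript over the sample data, not MongoDB code: it keeps one counter per author_id and adds 1 for every document, which is what the $group stage above is assumed to do conceptually.

```javascript
const books = [
  { _id: 147, title: "War and Peace", author_id: 72347 },
  { _id: 148, title: "Anna Karenina", author_id: 72347 },
  { _id: 149, title: "Pride and Prejudice", author_id: 42345 }
];

// Hand-written stand-in for:
//   { $group: { _id: "$author_id", nbBooks: { $sum: 1 } } }
const counts = new Map();
for (const book of books) {
  // add 1 to the counter of this book's author
  counts.set(book.author_id, (counts.get(book.author_id) || 0) + 1);
}

// Shape the counters like the documents $group would emit.
const grouped = [...counts].map(([_id, nbBooks]) => ({ _id, nbBooks }));

console.log(grouped);
// [ { _id: 72347, nbBooks: 2 }, { _id: 42345, nbBooks: 1 } ]
```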
Multi-operator pipelines
A pipeline can contain more than one operator. Here is a combination of the $group and $project operators:
    > db.books.aggregate([
      { $group: {
        _id: "$author_id",
        nbBooks: { $sum: 1 }
      } },
      { $project: {
        _id: 0,
        authorId: "$_id",
        nbBooks: 1
      } }
    ]);
    [
      { authorId: 72347, nbBooks: 2 },
      { authorId: 42345, nbBooks: 1 }
    ]

    > SELECT author_id AS author, COUNT(*) AS nb_books FROM book GROUP BY author_id;
    +--------+----------+
    | author | nb_books |
    +--------+----------+
    | 72347  | 2        |
    | 42345  | 1        |
    +--------+----------+
More complex aggregations
$group supports many aggregation functions: $first, $last, $min, $max, $avg, $sum, $push, and $addToSet. See the MongoDB documentation at http://docs.mongodb.org/manual/reference/aggregation
    // sample data
    > db.reviews.find();
    [
      { _id: "455", bookId: "974147", date: new Date("2012-07-10"), score: 1 },
      { _id: "456", bookId: "345335", date: new Date("2012-07-12"), score: 5 },
      { _id: "457", bookId: "345335", date: new Date("2012-07-13"), score: 2 },
      { _id: "458", bookId: "974147", date: new Date("2012-07-16"), score: 3 }
    ]

    # sample data
    > SELECT * FROM review;
    +-----+---------+--------------+-------+
    | id  | book_id | date         | score |
    +-----+---------+--------------+-------+
    | 455 | 974147  | "2012-07-10" | 1     |
    | 456 | 345335  | "2012-07-12" | 5     |
    | 457 | 345335  | "2012-07-13" | 2     |
    | 458 | 974147  | "2012-07-16" | 3     |
    +-----+---------+--------------+-------+

    > db.reviews.aggregate([
      { $group: {
        _id: "$bookId",
        avgScore: { $avg: "$score" },
        maxScore: { $max: "$score" },
        nbReviews: { $sum: 1 }
      } }
    ]);
    [
      { _id: "345335", avgScore: 3.5, maxScore: 5, nbReviews: 2 },
      { _id: "974147", avgScore: 2, maxScore: 3, nbReviews: 2 }
    ]

    > SELECT book_id, AVG(score) as avg_score, MAX(score) as max_score, COUNT(*) as nb_reviews FROM review GROUP BY book_id;
    +---------+-----------+-----------+------------+
    | book_id | avg_score | max_score | nb_reviews |
    +---------+-----------+-----------+------------+
    | 345335  | 3.5       | 5         | 2          |
    | 974147  | 2         | 3         | 2          |
    +---------+-----------+-----------+------------+
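Conditions
You can restrict the set of documents to be processed by passing a query object to the $match operator. Whether you place this operator before or after the $group operator determines whether its SQL equivalent is WHERE or HAVING.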
    > db.reviews.aggregate([
      { $match: {
        date: { $gte: new Date("2012-07-11") }
      } },
      { $group: {
        _id: "$bookId",
        avgScore: { $avg: "$score" }
      } }
    ]);
    [
      { _id: "345335", avgScore: 3.5 },
      { _id: "974147", avgScore: 3 }
    ]

    > SELECT book_id, AVG(score) FROM review WHERE review.date >= "2012-07-11" GROUP BY review.book_id;
    +---------+------------+
    | book_id | AVG(score) |
    +---------+------------+
    | 345335  | 3.5        |
    | 974147  | 3          |
    +---------+------------+

    > db.reviews.aggregate([
      { $group: {
        _id: "$bookId",
        avgScore: { $avg: "$score" }
      } },
      { $match: {
        avgScore: { $gt: 3 }
      } }
    ]);
    [
      { _id: "345335", avgScore: 3.5 }
    ]

    > SELECT book_id, AVG(score) AS avg_score FROM review GROUP BY review.book_id HAVING avg_score > 3;
    +---------+-----------+
    | book_id | avg_score |
    +---------+-----------+
    | 345335  | 3.5       |
    +---------+-----------+
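The difference between the two placements can be reproduced by hand on the sample reviews. The following is a plain JavaScript sketch, not MongoDB code: `averageByBook` is a hypothetical helper standing in for the $group stage, and Array.prototype.filter stands in for $match.

```javascript
const reviews = [
  { bookId: "974147", date: new Date("2012-07-10"), score: 1 },
  { bookId: "345335", date: new Date("2012-07-12"), score: 5 },
  { bookId: "345335", date: new Date("2012-07-13"), score: 2 },
  { bookId: "974147", date: new Date("2012-07-16"), score: 3 }
];

// Hypothetical stand-in for:
//   { $group: { _id: "$bookId", avgScore: { $avg: "$score" } } }
function averageByBook(docs) {
  const totals = new Map();
  for (const { bookId, score } of docs) {
    const t = totals.get(bookId) || { sum: 0, count: 0 };
    totals.set(bookId, { sum: t.sum + score, count: t.count + 1 });
  }
  return [...totals].map(([_id, t]) => ({ _id, avgScore: t.sum / t.count }));
}

// Filtering BEFORE grouping acts on raw reviews, like WHERE.
const whereLike = averageByBook(
  reviews.filter((r) => r.date >= new Date("2012-07-11"))
);

// Filtering AFTER grouping acts on the computed averages, like HAVING.
const havingLike = averageByBook(reviews).filter((g) => g.avgScore > 3);

console.log(whereLike);
// [ { _id: '345335', avgScore: 3.5 }, { _id: '974147', avgScore: 3 } ]
console.log(havingLike);
// [ { _id: '345335', avgScore: 3.5 } ]
```

Note that book 974147 averages 3 in the first case, because its review from 2012-07-10 is filtered out before the average is computed, but only 2 over all of its reviews, which is why it disappears in the second case.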
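Unwinding embedded arrays
If the documents in a collection contain arrays, you can use the $unwind operator to expand each array into several documents, one per array element.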
    // sample data
    > db.articles.find();
    [
      { _id: 12351254, title: "Space Is Getting Closer", tags: ["science", "space", "iss"] },
      { _id: 22956492, title: "Computer Solves Rubiks Cube", tags: ["computing", "science"] }
    ]

    # sample data
    > SELECT * FROM article;
    +----------+-----------------------------+
    | id       | title                       |
    +----------+-----------------------------+
    | 12351254 | Space Is Getting Closer     |
    | 22956492 | Computer Solves Rubiks Cube |
    +----------+-----------------------------+

    > SELECT * FROM tag;
    +-----+------------+-----------+
    | id  | article_id | name      |
    +-----+------------+-----------+
    | 534 | 12351254   | science   |
    | 535 | 12351254   | space     |
    | 536 | 12351254   | iss       |
    | 816 | 22956492   | computing |
    | 817 | 22956492   | science   |
    +-----+------------+-----------+

    > db.articles.aggregate([
      { $unwind: "$tags" }
    ]);
    [
      { _id: 12351254, title: "Space Is Getting Closer", tags: "science" },
      { _id: 12351254, title: "Space Is Getting Closer", tags: "space" },
      { _id: 12351254, title: "Space Is Getting Closer", tags: "iss" },
      { _id: 22956492, title: "Computer Solves Rubiks Cube", tags: "computing" },
      { _id: 22956492, title: "Computer Solves Rubiks Cube", tags: "science" }
    ]

    > SELECT article.id, article.title, tag.name FROM article LEFT JOIN tag ON article.id = tag.article_id;
    +------------+-----------------------------+-----------+
    | article.id | article.title               | tag.name  |
    +------------+-----------------------------+-----------+
    | 12351254   | Space Is Getting Closer     | science   |
    | 12351254   | Space Is Getting Closer     | space     |
    | 12351254   | Space Is Getting Closer     | iss       |
    | 22956492   | Computer Solves Rubiks Cube | computing |
    | 22956492   | Computer Solves Rubiks Cube | science   |
    +------------+-----------------------------+-----------+
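What $unwind does to the sample articles can be approximated with flatMap in plain JavaScript. This is a sketch over in-memory data, not MongoDB code: every array element yields a copy of the parent document in which tags holds that single element.

```javascript
const articles = [
  { _id: 12351254, title: "Space Is Getting Closer", tags: ["science", "space", "iss"] },
  { _id: 22956492, title: "Computer Solves Rubiks Cube", tags: ["computing", "science"] }
];

// Stand-in for { $unwind: "$tags" }:
// one output document per element of the tags array.
const unwound = articles.flatMap((article) =>
  article.tags.map((tag) => ({ ...article, tags: tag }))
);

console.log(unwound.length);               // 5
console.log(unwound.map((doc) => doc.tags));
// [ 'science', 'space', 'iss', 'computing', 'science' ]
```

With this approach an article whose tags array is empty produces no output document at all, which matches the default behaviour of $unwind.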
Aggregating unwound arrays
The real power of the aggregation framework shows when you pipe $unwind into $group. This is similar to using LEFT JOIN ... GROUP BY in SQL.
    > db.articles.aggregate([
      { $unwind: "$tags" },
      { $group: {
        _id: "$tags",
        nbArticles: { $sum: 1 }
      } }
    ]);
    [
      { _id: "science", nbArticles: 2 },
      { _id: "space", nbArticles: 1 },
      { _id: "iss", nbArticles: 1 },
      { _id: "computing", nbArticles: 1 }
    ]

    > SELECT tag.name, COUNT(article.id) AS nb_articles FROM article LEFT JOIN tag ON article.id = tag.article_id GROUP BY tag.name;
    +-----------+-------------+
    | tag.name  | nb_articles |
    +-----------+-------------+
    | science   | 2           |
    | space     | 1           |
    | iss       | 1           |
    | computing | 1           |
    +-----------+-------------+

    > db.articles.aggregate([
      { $unwind: "$tags" },
      { $group: {
        _id: "$tags",
        articles: { $addToSet: "$_id" }
      } }
    ]);
    [
      { _id: "science", articles: [12351254, 22956492] },
      { _id: "space", articles: [12351254] },
      { _id: "iss", articles: [12351254] },
      { _id: "computing", articles: [22956492] }
    ]

    > SELECT tag.name, GROUP_CONCAT(article.id) AS articles FROM article LEFT JOIN tag ON article.id = tag.article_id GROUP BY tag.name;
    +-----------+-------------------+
    | tag.name  | articles          |
    +-----------+-------------------+
    | science   | 12351254,22956492 |
    | space     | 12351254          |
    | iss       | 12351254          |
    | computing | 22956492          |
    +-----------+-------------------+
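The $unwind plus $group combination amounts to inverting the article-to-tags relation. As a rough plain JavaScript sketch (in-memory data, not MongoDB code), a Set per tag plays the role of $addToSet, and its size gives the same count as { $sum: 1 } as long as no article lists the same tag twice.

```javascript
const articles = [
  { _id: 12351254, tags: ["science", "space", "iss"] },
  { _id: 22956492, tags: ["computing", "science"] }
];

// Walk every (article, tag) pair, as $unwind would produce them,
// and collect article ids per tag, as $group with $addToSet would.
const idsByTag = new Map();
for (const { _id, tags } of articles) {
  for (const tag of tags) {
    if (!idsByTag.has(tag)) idsByTag.set(tag, new Set());
    idsByTag.get(tag).add(_id);
  }
}

const byTag = [...idsByTag].map(([tag, ids]) => ({
  _id: tag,
  nbArticles: ids.size, // equals { $sum: 1 } when tags are unique per article
  articles: [...ids]
}));

console.log(byTag);
```

The order of the groups in this sketch follows the order in which tags are first seen; the order of groups in a real aggregation result should not be relied on unless a $sort stage is added.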
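Conclusion
Imagine what you can do with this feature. Operators piped one after another can group, sort, limit, and more. The MongoDB documentation includes a representative example: a pipeline made of two consecutive $group operators. In a SQL database, you can only achieve that with a subquery.
If your MapReduce functions are simple enough, you can refactor your MongoDB code to use the aggregation framework instead, and it will run faster.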