Removing Duplicate Data with PyMongo

Reposted from: 李冬琳's blog. URL: http://ldllidonglin.github.io/blog/2015/12/14/2015-12-14-mongodb%E5%8E%BB%E9%99%A4%E9%87%8D%E5%A4%8D%E6%95%B0%E6%8D%AE/

1. Unique index

db.things.ensureIndex({'key' : 1}, {unique : true, dropDups : true})

However, dropDups is not supported by MongoDB 2.7.5 or newer, so this method only works on versions below 2.7.5.
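On newer servers, one possible order of operations is to remove the duplicates first (with one of the methods below) and only then create the unique index. The following is a minimal sketch, not the original author's code: it assumes `collection` is a PyMongo Collection, and the helper name `ensure_unique_index` is hypothetical.

```python
def ensure_unique_index(collection, field):
    """Create an ascending unique index on `field` (hypothetical helper).

    Assumes `collection` is a PyMongo Collection that no longer holds
    duplicate values for `field`; if duplicates remain, the server
    rejects the index build with a duplicate key error.
    """
    # 1 means ascending order, the same value as pymongo.ASCENDING
    return collection.create_index([(field, 1)], unique=True)
```

Calling `ensure_unique_index(collection, 'key')` would then play the role of the shell command above, minus the automatic removal of duplicates.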

2. Use aggregate to find the duplicated documents, then delete them one by one (fairly inefficient). Python code:

# Assumes `collection` is an existing PyMongo Collection.
# First, find the duplicated documents.
deleteData = collection.aggregate([
    {'$group': {
        '_id': {'firstField': "$area", 'secondField': "$time_point"},
        'uniqueIds': {'$addToSet': "$_id"},
        'count': {'$sum': 1}
    }},
    {'$match': {
        'count': {'$gt': 1}
    }}
])
for d in deleteData:
    first = True
    for did in d['uniqueIds']:
        if not first:    # keep the first document, delete the rest
            collection.delete_one({'_id': did})
        first = False
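Since deleting one document per round trip is the slow part, the deletions could instead be sent in batches with delete_many and an $in filter. This is a sketch under assumptions rather than the original author's method: the helper names `ids_to_delete` and `delete_duplicates` and the `batch_size` default are hypothetical, and `groups` is expected to be the aggregate result (or any iterable of documents with a `uniqueIds` list).

```python
def ids_to_delete(groups):
    """Collect every _id except the first one from each duplicate group."""
    doomed = []
    for group in groups:
        # uniqueIds lists all _ids sharing the same key; keep index 0
        doomed.extend(group['uniqueIds'][1:])
    return doomed


def delete_duplicates(collection, groups, batch_size=1000):
    """Delete the surplus documents in batches; return how many were deleted."""
    doomed = ids_to_delete(groups)
    deleted = 0
    for start in range(0, len(doomed), batch_size):
        batch = doomed[start:start + batch_size]
        # one round trip removes the whole batch
        result = collection.delete_many({'_id': {'$in': batch}})
        deleted += result.deleted_count
    return deleted
```

With the pipeline above, the call would look like `delete_duplicates(collection, deleteData)`.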


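3. When the data set is very large, the second method has to write its results into a collection. Add an $out stage to the aggregate pipeline. Also, because aggregate accepts only two positional arguments (self, which is implicit, and the pipeline), any option has to be passed as a keyword argument, in the form allowDiskUse=True.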

# Assumes `db` is a PyMongo Database and `collection` one of its collections.
# Find the duplicates and write them to the result collection
def findDuplicate():
    collection.aggregate([
        {'$group': {
            '_id': {'firstField': "$mid", 'secondField': "$created_at"},
            'uniqueIds': {'$addToSet': "$_id"},
            'count': {'$sum': 1}
        }},
        {'$match': {
            'count': {'$gt': 1}
        }},
        {'$out': 'result'}    # write the duplicate groups to the result collection
    ], allowDiskUse=True)     # let large stages spill to disk

# Delete every duplicate except the first one in each group
def deleteDup():
    for d in db.result.find():
        first = True
        for did in d['uniqueIds']:
            if not first:
                collection.delete_one({'_id': did})
            first = False
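The two pipelines in this post differ only in the fields they group on and in the optional $out stage, so they could be generated by one helper. This is a sketch only: `build_duplicate_pipeline` is a hypothetical name, and unlike the code above it labels the group keys with the field names themselves instead of firstField and secondField.

```python
def build_duplicate_pipeline(fields, out_collection=None):
    """Build a pipeline that groups on `fields` and keeps groups with count > 1."""
    pipeline = [
        {'$group': {
            # group key: one entry per field, e.g. {'mid': '$mid'}
            '_id': {name: '$' + name for name in fields},
            'uniqueIds': {'$addToSet': '$_id'},
            'count': {'$sum': 1},
        }},
        {'$match': {'count': {'$gt': 1}}},
    ]
    if out_collection is not None:
        # $out has to be the last stage of the pipeline
        pipeline.append({'$out': out_collection})
    return pipeline
```

For the large-data case it would be used as `collection.aggregate(build_duplicate_pipeline(['mid', 'created_at'], 'result'), allowDiskUse=True)`.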


Original article: https://www.cnblogs.com/yuanyongqiang/p/13324762.html