• Computing TF-IDF and extracting keywords --- Python


    1. Read in the text content

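    import codecs
    import os

    import numpy
    import pandas

    # Collect every file under the corpus directory into a DataFrame
    corpos = pandas.DataFrame(columns=['filePath','content'])
    for root, dirs, files in os.walk(r'H:19113117 - 副本'):
        for name in files:
            filePath = os.path.join(root, name)
            # Files are assumed to be UTF-8 encoded
            with codecs.open(filePath, 'r', 'utf-8') as f:
                content = f.read()
            corpos.loc[len(corpos)+1] = [filePath, content.strip()]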

    2. Count word frequencies in the manually segmented text

    filePaths=[]
    segments=[]
    for filePath,content in corpos.itertuples(index=False):
        for item in content.split('/'):
            segments.append(item)
            filePaths.append(filePath)
    segmentDF=pandas.DataFrame({'filePath':filePaths,'segments':segments})
                 
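    # Count occurrences of each word per file; the column name '计数' means "count".
    # (Passing a renaming dict to agg() no longer works in current pandas.)
    segStat = segmentDF.groupby(
                by=["filePath","segments"]
            ).size().reset_index(name="计数")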
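The counting step can be checked on a small example. The sketch below uses a hypothetical two-file corpus (the `docs` dictionary and the English column name `count` are assumptions for illustration, not part of the original data) and produces one row per (file, word) pair:

```python
import pandas

# Hypothetical toy corpus: two "files" whose words are separated by '/'.
docs = {"a.txt": "cat/dog/cat", "b.txt": "dog/fish"}

# One row per word occurrence, tagged with the file it came from.
rows = [(path, word) for path, text in docs.items() for word in text.split("/")]
segmentDF = pandas.DataFrame(rows, columns=["filePath", "segments"])

# One row per (file, word) pair with its number of occurrences.
segStat = (segmentDF.groupby(["filePath", "segments"])
           .size()
           .reset_index(name="count"))
print(segStat)
```

Here `size()` counts the rows in each group, which is all the word-frequency step needs.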

    3. Compute the TF values

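    # Rows are words, columns are files, cells are counts (0 if the word is absent)
    textVector=segStat.pivot_table(
               index='segments',
               values='计数',
               columns='filePath',
               fill_value=0)
    counts=textVector.to_numpy(dtype=float)
    # Log-scaled TF: 1 + ln(count) where the word occurs, 0 otherwise
    # (taking the log of a zero count would give -inf)
    tF=numpy.zeros_like(counts)
    tF[counts>0]=1+numpy.log(counts[counts>0])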
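To make the log scaling concrete, here is a minimal sketch on a hypothetical count matrix (the values are made up). It applies 1 + ln(count) only where a word actually occurs, so absent words keep a TF of 0 instead of becoming -inf:

```python
import numpy

# Hypothetical word-by-document count matrix (rows: words, columns: documents).
counts = numpy.array([[2.0, 0.0],
                      [1.0, 1.0],
                      [0.0, 4.0]])

# 1 + ln(count) where the word occurs; 0 where it does not.
tf = numpy.zeros_like(counts)
present = counts > 0
tf[present] = 1 + numpy.log(counts[present])
print(tf)
```

A word that occurs once gets a TF of exactly 1, and further occurrences add to it only logarithmically.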

    4. Compute the IDF

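    def handle(x):
        # x holds one word's counts across all texts; numpy.sum(x>0) is its document frequency
        idf=1+numpy.log(len(corpos)/(numpy.sum(x>0)+1))
        return idf
    zhuan=textVector.T
    iDF=zhuan.apply(handle).to_numpy()
    # Column vector with one row per word, so it broadcasts against tF
    iDF=iDF.reshape(-1,1)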
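The same smoothed formula, idf = 1 + ln(N / (df + 1)), can be computed for all words at once without `apply`. This is a sketch on a hypothetical count matrix, not the original data:

```python
import numpy

# Hypothetical counts: 3 words (rows) across 3 documents (columns).
counts = numpy.array([[2, 0, 1],
                      [1, 1, 3],
                      [0, 4, 0]])

n_docs = counts.shape[1]
# Document frequency: in how many documents each word appears.
doc_freq = (counts > 0).sum(axis=1)

# Smoothed IDF, reshaped to a column vector with one row per word.
idf = (1 + numpy.log(n_docs / (doc_freq + 1))).reshape(-1, 1)
print(idf)
```

Because of the +1 in the denominator, a word that appears in every document ends up with an IDF slightly below 1, while rarer words score higher.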

    5. Compute TF-IDF

    TFIDF=tF*iDF
    tFIDF_DF=pandas.DataFrame(TFIDF)
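The multiplication relies on NumPy broadcasting: a (words x documents) matrix times a (words x 1) column scales every row by that word's IDF. The sketch below uses made-up numbers, and additionally labels the result with hypothetical word and file names, which is one way to keep the words attached to their scores:

```python
import numpy
import pandas

tf = numpy.array([[1.0, 0.0],
                  [1.0, 2.0]])     # shape (2 words, 2 documents)
idf = numpy.array([[2.0],
                   [0.5]])         # shape (2 words, 1)

# Each row of tf is multiplied by the IDF of the corresponding word.
tfidf = tf * idf

# Hypothetical labels, so rows and columns can be looked up by name.
labelled = pandas.DataFrame(tfidf, index=["cat", "dog"], columns=["a.txt", "b.txt"])
print(labelled)
```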

    6. For each text, print the 100 words with the highest TF-IDF values, together with those values

    # Column i of tFIDF_DF corresponds to textVector.columns[i] (one column per file)
    for i, filePath in enumerate(textVector.columns):
        # Print the file name without its extension as a heading
        print(os.path.splitext(os.path.basename(filePath))[0])
        # Sort this file's scores in descending order and keep the top 100
        # (Series.order() was removed from pandas; sort_values() replaces it)
        top = tFIDF_DF.loc[:, i].sort_values(ascending=False)[:100]
        for pos, value in top.items():
            # pos is the row position of the word in textVector
            print(textVector.index[pos], value)
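As a small check of the ranking step, the sketch below picks the top words per document from a hypothetical labelled TF-IDF table (the scores and names are made up), using `Series.nlargest` as an alternative to sorting and slicing:

```python
import pandas

# Hypothetical TF-IDF scores: rows are words, columns are documents.
tfidf = pandas.DataFrame(
    {"a.txt": [0.9, 0.1, 0.5], "b.txt": [0.0, 0.7, 0.2]},
    index=["cat", "dog", "fish"])

top_k = 2  # the steps above use 100
top_words = {}
for path in tfidf.columns:
    best = tfidf[path].nlargest(top_k)   # highest scores first
    top_words[path] = list(best.items())
    print(path)
    for word, value in best.items():
        print(word, value)
```

Keeping the words as the DataFrame index means no separate lookup is needed to map scores back to words.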
  • Original post: https://www.cnblogs.com/chenyaling/p/5559906.html