• SIMS, MOSI, MOSEI


    SIMS: a Chinese multimodal sentiment analysis dataset

    label

    **sentiment state**

    sentiment label
    negative: -1
    neutral: 0
    positive: 1

    **regression task: average the five annotators' labels.**
    {-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0}.

    These values are divided into five classes (a conversion sketch follows the list):

    sentiment label
    negative: {-1.0, -0.8}
    weakly negative: {-0.6, -0.4, -0.2}
    neutral: {0.0}
    weakly positive: {0.2, 0.4, 0.6}
    positive: {0.8, 1.0}
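
    A minimal sketch of that binning in NumPy, assuming the averaged regression score is the input (the helper name is hypothetical):

    import numpy as np

    def to_five_class(score):
        # bin edges are the midpoints between adjacent class boundaries above
        edges = [-0.7, -0.1, 0.1, 0.7]
        names = ['negative', 'weakly negative', 'neutral', 'weakly positive', 'positive']
        return names[np.digitize(score, edges)]

    print(to_five_class(-0.8))  # negative
    print(to_five_class(0.4))   # weakly positive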

    Feature

    Text

    BERT-base word embeddings (768-dimensional word vectors)

    Audio

    The LibROSA speech toolkit is used with default parameters to extract acoustic features at 22050 Hz.
    In total, 33-dimensional frame-level acoustic features are extracted, including 1-dimensional logarithmic fundamental frequency (log F0), 20-dimensional Mel-frequency cepstral coefficients (MFCCs), and 12-dimensional Constant-Q chromatogram (CQT).
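
    A comparable extraction might look like the sketch below; the authors used LibROSA defaults, so the pyin pitch tracker and the unvoiced-frame handling here are assumptions:

    import librosa
    import numpy as np

    def extract_acoustic(path):
        y, sr = librosa.load(path, sr=22050)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)          # (20, T)
        cqt = librosa.feature.chroma_cqt(y=y, sr=sr, n_chroma=12)   # (12, T)
        f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                                fmax=librosa.note_to_hz('C7'), sr=sr)
        log_f0 = np.log(np.nan_to_num(f0, nan=1e-8))[None, :]       # (1, T); NaN = unvoiced
        # the extractors may disagree by a frame; truncate to the shortest
        T = min(mfcc.shape[1], cqt.shape[1], log_f0.shape[1])
        return np.vstack([log_f0[:, :T], mfcc[:, :T], cqt[:, :T]]).T  # (T, 33)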

    Vision

    Frames are extracted from the video segments at 30 Hz.
    The MTCNN face detection algorithm is used to extract aligned faces.
    The MultiComp OpenFace 2.0 toolkit is then used to extract 68 facial landmarks, 17 facial action units, head pose, head orientation, and eye gaze. In total, 709-dimensional frame-level visual features are extracted.
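
    A rough sketch of the frame-and-face step, using OpenCV for decoding and the facenet-pytorch MTCNN implementation (an assumption; the landmark and action-unit features come from OpenFace 2.0, a separate toolkit):

    import cv2
    from facenet_pytorch import MTCNN

    mtcnn = MTCNN(image_size=224)  # hypothetical crop size

    def extract_faces(video_path):
        faces = []
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            face = mtcnn(rgb)  # aligned crop as a tensor, or None if no face
            if face is not None:
                faces.append(face)
        cap.release()
        return faces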

    Dataset structure

    import pickle
    import numpy as np
    
    with open('data/SIMS/unaligned_39.pkl', 'rb') as f:
        data = pickle.load(f)
    
    
    print(data.keys())
    output:
    dict_keys(['train', 'valid', 'test'])
    
    print(data['train'].keys())
    output:
    dict_keys(['raw_text', 'text_bert', 'audio_lengths', 'vision_lengths', 'classification_labels', 'regression_labels', 'classification_labels_T', 'regression_labels_T', 'classification_labels_A', 'regression_labels_A', 'classification_labels_V', 'regression_labels_V', 'text', 'audio', 'vision', 'id'])
    
    print(data['train']['raw_text'][0])
    output:
    闭嘴,不是来抓你的。
    
    Loading the data

    use_bert = True  # choose BERT inputs or the plain text features

    for mode in ['train', 'valid', 'test']:
        # cast feature arrays to float32
        if use_bert:
            text = data[mode]['text_bert'].astype(np.float32)
        else:
            text = data[mode]['text'].astype(np.float32)

        vision = data[mode]['vision'].astype(np.float32)
        audio = data[mode]['audio'].astype(np.float32)
        raw_text = data[mode]['raw_text']
        ids = data[mode]['id']
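
    Since unaligned_39.pkl stores padded sequences together with their true lengths, a validity mask can be rebuilt from the audio_lengths / vision_lengths keys shown above (a sketch):

    audio_lens = np.array(data['train']['audio_lengths'])
    T = data['train']['audio'].shape[1]                  # padded length
    mask = np.arange(T)[None, :] < audio_lens[:, None]   # (N, T), True on real frames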
    

    Statistics

    print(len(data['train']['id']))
    print(len(data['valid']['id']))
    print(len(data['test']['id']))
    
    output:
    1368
    456
    457
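
    The class balance of a split can be checked directly from the labels (a sketch; output omitted):

    labels, counts = np.unique(data['train']['classification_labels'], return_counts=True)
    print(dict(zip(labels.tolist(), counts.tolist())))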
    

    MOSI: an English multimodal sentiment analysis dataset

    label

    sentiment label
    strongly positive: +3
    positive: +2
    weakly positive: +1
    neutral: 0
    weakly negative: -1
    negative: -2
    strongly negative: -3
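
    Continuous scores can be mapped onto these seven classes by rounding and clipping; this is a common convention, not necessarily how the classification_labels in the pickle below were produced:

    import numpy as np

    scores = np.array([2.6, -0.4, 0.0, -3.0])
    seven_class = np.clip(np.rint(scores), -3, 3).astype(int)
    print(seven_class)  # [ 3  0  0 -3]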

    Feature

    Audio and visual features are automatically extracted from the MPEG files, at rates of 1000 Hz for audio and 30 Hz for video.

    Visual

    16 facial action units, 68 facial landmarks, head pose and orientation, 6 basic emotions, and eye gaze

    Audio

    COVAREP: pitch, energy, NAQ (normalized amplitude quotient), MFCCs (Mel-frequency cepstral coefficients), peak slope, and energy slope

    Dataset structure

    import pickle
    import numpy as np
    
    with open('data/MOSI/aligned_50.pkl', 'rb') as f:
        data = pickle.load(f)
    
    
    print(data.keys())
    output:
    dict_keys(['train', 'valid', 'test'])
    
    print(data['train'].keys())
    output:
    dict_keys(['raw_text', 'audio', 'vision', 'id', 'text', 'text_bert', 'annotations', 'classification_labels', 'regression_labels'])
    
    print(data['train']['raw_text'][0])
    output:
    A LOT OF SAD PARTS
    
    Loading the data

    use_bert = True  # choose BERT inputs or the plain text features

    for mode in ['train', 'valid', 'test']:
        if use_bert:
            text = data[mode]['text_bert'].astype(np.float32)
        else:
            text = data[mode]['text'].astype(np.float32)

        vision = data[mode]['vision'].astype(np.float32)
        audio = data[mode]['audio'].astype(np.float32)
        raw_text = data[mode]['raw_text']
        ids = data[mode]['id']
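
    To feed these arrays to a model, one option is a thin PyTorch Dataset wrapper (a sketch assuming PyTorch; the class name is hypothetical, and this is not the authors' loader):

    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class MOSIDataset(Dataset):
        def __init__(self, data, mode='train', use_bert=True):
            split = data[mode]
            key = 'text_bert' if use_bert else 'text'
            self.text = split[key].astype(np.float32)
            self.audio = split['audio'].astype(np.float32)
            self.vision = split['vision'].astype(np.float32)
            self.labels = split['regression_labels'].astype(np.float32)

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, i):
            return (torch.from_numpy(self.text[i]),
                    torch.from_numpy(self.audio[i]),
                    torch.from_numpy(self.vision[i]),
                    torch.tensor(self.labels[i]))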
    

    Statistics

    print(len(data['train']['id']))
    print(len(data['valid']['id']))
    print(len(data['test']['id']))
    
    output:
    1284
    229
    686
    

    MOSEI: an English multimodal sentiment recognition dataset

    label

    sentiment label
    strongly positive: +3
    positive: +2
    weakly positive: +1
    neutral: 0
    weakly negative: -1
    negative: -2
    strongly negative: -3

    Feature Extraction

    Text

    All videos have manual transcriptions; GloVe word embeddings are used to represent the words.
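
    A sketch of embedding a transcript with pretrained GloVe vectors via the gensim downloader (the model name here is an assumption; CMU-MOSEI transcripts are typically embedded with 300-dimensional GloVe vectors):

    import numpy as np
    import gensim.downloader as api

    glove = api.load('glove-wiki-gigaword-300')   # KeyedVectors, 300-d
    tokens = 'a lot of sad parts'.split()
    vecs = np.stack([glove[t] if t in glove else np.zeros(300, dtype=np.float32)
                     for t in tokens])            # (n_tokens, 300)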

    Visual:

    Frames are extracted from the full videos at 30 Hz.

    The bounding box of the face is extracted using the MTCNN face detection algorithm.

    Facial action units are extracted through the Facial Action Coding System (FACS).

    A set of six basic emotions is extracted purely from static faces using Emotient FACET.

    MultiComp OpenFace is used to extract 68 facial landmarks, 20 facial shape parameters, facial HoG features, head pose, head orientation, and eye gaze.

    Face embeddings are extracted from commonly used facial recognition models such as DeepFace, FaceNet, and SphereFace (a sketch of this step follows below).
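
    A sketch of the face-embedding step with facenet-pytorch's pretrained InceptionResnetV1 (an assumption; the authors do not say which FaceNet implementation they used):

    import torch
    from PIL import Image
    from facenet_pytorch import MTCNN, InceptionResnetV1

    mtcnn = MTCNN(image_size=160)  # FaceNet expects 160x160 crops
    resnet = InceptionResnetV1(pretrained='vggface2').eval()

    img = Image.open('frame.jpg')  # hypothetical frame path
    with torch.no_grad():
        face = mtcnn(img)                    # aligned crop (3, 160, 160), or None
        if face is not None:
            emb = resnet(face.unsqueeze(0))  # 512-d face embedding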

    Acoustic

    The COVAREP software is used to extract acoustic features, including 12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters, and maxima dispersion quotients.

    Dataset structure

    import pickle
    import numpy as np
    
    with open('data/MOSEI/aligned_50.pkl', 'rb') as f:
        data = pickle.load(f)
    
    
    print(data.keys())
    output:
    dict_keys(['train', 'valid', 'test'])
    
    print(data['train'].keys())
    output:
    dict_keys(['raw_text', 'audio', 'vision', 'id', 'text', 'text_bert', 'annotations', 'classification_labels', 'regression_labels'])
    
    print(data['train']['raw_text'][0])
    output:
    Key is part of the people that we use to solve those issues, whether it's stretch or outdoor resistance or abrasions or different technical aspects that we really need to solve to get into new markets, they've been able to bring solutions.
    
    Loading the data

    use_bert = True  # choose BERT inputs or the plain text features

    for mode in ['train', 'valid', 'test']:
        if use_bert:
            text = data[mode]['text_bert'].astype(np.float32)
        else:
            text = data[mode]['text'].astype(np.float32)

        vision = data[mode]['vision'].astype(np.float32)
        audio = data[mode]['audio'].astype(np.float32)
        raw_text = data[mode]['raw_text']
        ids = data[mode]['id']
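
    MOSEI's pickle shares the MOSI key layout, so the MOSIDataset wrapper sketched in the MOSI section batches it the same way (a usage sketch):

    from torch.utils.data import DataLoader

    train_set = MOSIDataset(data, mode='train')
    loader = DataLoader(train_set, batch_size=32, shuffle=True)
    text, audio, vision, y = next(iter(loader))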
    

    Statistics

    print(len(data['train']['id']))
    print(len(data['valid']['id']))
    print(len(data['test']['id']))
    
    output:
    16326
    1871
    4659
    