• Deal with relational data using libFM with blocks


    Original: https://thierrysilbermann.wordpress.com/2015/09/17/deal-with-relational-data-using-libfm-with-blocks/

    An answer to this question: [Example] Files for Block Structure

    There is a quick explanation in the README doc here: libFM1.42 Manual


    Here is a quick explanation, in case you don’t want to read this whole blog post.

    I’ll reuse the toy dataset from this previous blog post. Look at it to see what the features mean.

    train.libfm

    5 0:1 2:1 6:1 9:12.5
    5 0:1 3:1 6:1 9:20
    4 0:1 4:1 6:1 9:78
    1 1:1 2:1 8:1 9:12.5
    1 1:1 3:1 8:1 9:20

    and test.libfm

    0 1:1 4:1 8:1 9:78
    0 0:1 5:1 6:1

    I’ll merge them, which makes the whole process easier:

    dataset.libfm

    5 0:1 2:1 6:1 9:12.5
    5 0:1 3:1 6:1 9:20
    4 0:1 4:1 6:1 9:78
    1 1:1 2:1 8:1 9:12.5
    1 1:1 3:1 8:1 9:20
    0 1:1 4:1 8:1 9:78
    0 0:1 5:1 6:1

    To use the block structure, we first need these 5 files:

    • rel_user.libfm (features 0-1 and 6-8 are user features)

    0 0:1 6:1
    0 1:1 8:1

    In fact, you don’t have to keep the feature ids split like that (0-1, 6-8). We can renumber them so that they are contiguous (0-1 -> 0-1 and 6-8 -> 2-4):

    0 0:1 2:1
    0 1:1 4:1

    • rel_product.libfm (features 2-5 and 9 are product features). In the same way, we can compress from:

    0 2:1 9:12.5
    0 3:1 9:20 
    0 4:1 9:78 
    0 5:1

    to

    0 0:1 4:12.5
    0 1:1 4:20
    0 2:1 4:78
    0 3:1

    • rel_user.train (this is now the mapping: each line gives the row of rel_user.libfm used by the corresponding training example, so the first 3 training examples point to the first line of rel_user.libfm. Warning: the indexing is 0-based)

    0
    0
    0
    1
    1

    • rel_product.train (the same kind of mapping, for the products)

    0
    1
    2
    0
    1

    • y.train, which contains only the ratings

    5
    5
    4
    1
    1
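As a rough sketch of how these files relate to dataset.libfm, the snippet below derives both block files and both mappings from the merged toy dataset. The helper names parse_row and build_block are made up for this illustration and are not part of libFM; the id ranges are the ones listed above.

```python
def parse_row(line):
    """Split a libFM line into (target, {feature_id: value_string})."""
    target, *pairs = line.split()
    feats = {}
    for pair in pairs:
        fid, value = pair.split(":")
        feats[int(fid)] = value
    return target, feats


def build_block(rows, ranges):
    """Return (block_lines, mapping) for the features covered by `ranges`.

    `ranges` is a list of inclusive (first_id, last_id) pairs. Ids are
    renumbered so that the ranges become contiguous, starting at 0.
    """
    new_id = {}
    for first, last in ranges:
        for fid in range(first, last + 1):
            new_id[fid] = len(new_id)

    block_lines, mapping, seen = [], [], {}
    for _, feats in rows:
        kept = sorted((new_id[f], v) for f, v in feats.items() if f in new_id)
        line = " ".join(["0"] + ["{}:{}".format(f, v) for f, v in kept])
        if line not in seen:
            # First time this user (or product) shows up: add a block row
            seen[line] = len(block_lines)
            block_lines.append(line)
        mapping.append(seen[line])
    return block_lines, mapping


dataset = """\
5 0:1 2:1 6:1 9:12.5
5 0:1 3:1 6:1 9:20
4 0:1 4:1 6:1 9:78
1 1:1 2:1 8:1 9:12.5
1 1:1 3:1 8:1 9:20
0 1:1 4:1 8:1 9:78
0 0:1 5:1 6:1
""".splitlines()

rows = [parse_row(line) for line in dataset]
user_block, user_map = build_block(rows, [(0, 1), (6, 8)])
product_block, product_map = build_block(rows, [(2, 5), (9, 9)])
targets = [target for target, _ in rows]
```

The first five entries of user_map, product_map and targets are the contents of rel_user.train, rel_product.train and y.train; the last two entries belong to the test rows.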

    Almost done…

    Now you need to create the .x and .xt files for the user block and the product block. For this you need the convert and transpose tools that ship with libFM; they are in /bin/ once you have compiled libFM.

    ./bin/convert --ifile rel_user.libfm --ofilex rel_user.x --ofiley rel_user.y

    You have to pass the --ofiley flag even though rel_user.y will never be used. You can delete that file every time.

    and then

    ./bin/transpose --ifile rel_user.x --ofile rel_user.xt
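Both blocks need the same two commands, so a small helper can generate them. This is only a sketch: block_commands is a hypothetical name, the binaries are assumed to live in ./bin, and the flags are assumed to be spelled with two ASCII hyphens.

```python
def block_commands(name, bin_dir="./bin"):
    """Return the convert and transpose command lines for one block."""
    convert = [bin_dir + "/convert",
               "--ifile", name + ".libfm",
               "--ofilex", name + ".x",
               "--ofiley", name + ".y"]
    transpose = [bin_dir + "/transpose",
                 "--ifile", name + ".x",
                 "--ofile", name + ".xt"]
    return [convert, transpose]


for block in ("rel_user", "rel_product"):
    for argv in block_commands(block):
        # Only print the command here; passing argv to
        # subprocess.run(argv, check=True) would execute it.
        print(" ".join(argv))
```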

    Now do the same thing for the test set. Because we merged the train and test datasets at the beginning, the block files already cover the test rows, so we only need to generate rel_user.test, rel_product.test and y.test.
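A possible way to produce the .train and .test files in one pass is sketched below. The helper write_split is hypothetical; the values are the mappings and ratings of the merged toy dataset (5 training rows followed by 2 test rows), and the files go to a temporary directory so that nothing is overwritten.

```python
import os
import tempfile


def write_split(prefix, values, n_train, out_dir="."):
    """Write the first n_train values to <prefix>.train, the rest to <prefix>.test."""
    parts = {".train": values[:n_train], ".test": values[n_train:]}
    for suffix, part in parts.items():
        path = os.path.join(out_dir, prefix + suffix)
        with open(path, "w") as handle:
            handle.write("".join("{}\n".format(v) for v in part))


# One entry per row of dataset.libfm, in the same order
user_map = [0, 0, 0, 1, 1, 1, 0]
product_map = [0, 1, 2, 0, 1, 2, 3]
targets = [5, 5, 4, 1, 1, 0, 0]

out_dir = tempfile.mkdtemp()
for prefix, values in (("rel_user", user_map),
                       ("rel_product", product_map),
                       ("y", targets)):
    write_split(prefix, values, n_train=5, out_dir=out_dir)
```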

    At this point you will have a lot of files: rel_user.train, rel_user.test, rel_user.x, rel_user.xt, rel_product.train, rel_product.test, rel_product.x, rel_product.xt, y.train, y.test
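With that many files it is easy to forget one, so a quick check before launching libFM can help. The sketch below only encodes the file list given above; expected_files and missing_files are hypothetical helper names.

```python
import os


def expected_files(relations):
    """List the files needed for the given relation names, plus y.train/y.test."""
    files = ["y.train", "y.test"]
    for name in relations:
        files += [name + ext for ext in (".train", ".test", ".x", ".xt")]
    return files


def missing_files(relations, directory="."):
    """Return the expected files that do not exist in `directory`."""
    return [name for name in expected_files(relations)
            if not os.path.exists(os.path.join(directory, name))]


print(expected_files(["rel_user", "rel_product"]))
```

An empty list from missing_files(["rel_user", "rel_product"]) means everything listed above is in place.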

    And run the whole thing:

    ./bin/libFM -task r -train y.train -test y.test -relation rel_user,rel_product -out output

    It’s a bit overkill for this problem but I hope you get the point.


    Now a real example

    For this example, I’ll use the ml-1m.zip MovieLens dataset that you can get from here (1 million ratings)

    ratings.dat (sample) / Format: UserID::MovieID::Rating::Timestamp

    1::1193::5::978300760
    1::661::3::978302109
    1::914::3::978301968
    1::3408::4::978300275
    1::2355::5::978824291

    movies.dat (sample) / Format: MovieID::Title::Genres

    1::Toy Story (1995)::Animation|Children’s|Comedy
    2::Jumanji (1995)::Adventure|Children’s|Fantasy
    3::Grumpier Old Men (1995)::Comedy|Romance
    4::Waiting to Exhale (1995)::Comedy|Drama

    users.dat (sample) / Format: UserID::Gender::Age::Occupation::Zip-code

    1::F::1::10::48067
    2::M::56::16::70072
    3::M::25::15::55117
    4::M::45::7::02460
    5::M::25::20::55455
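To make the three formats concrete, here is a minimal parsing sketch that uses only the standard library. The function names are made up for this illustration, and it assumes that no field contains the "::" separator itself. The zip code stays a string so that a leading zero such as in 02460 is not lost.

```python
def parse_rating(line):
    """UserID::MovieID::Rating::Timestamp"""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


def parse_movie(line):
    """MovieID::Title::Genres, with genres separated by '|'"""
    movie_id, title, genres = line.strip().split("::")
    return int(movie_id), title, genres.split("|")


def parse_user(line):
    """UserID::Gender::Age::Occupation::Zip-code"""
    user_id, gender, age, occupation, zip_code = line.strip().split("::")
    return int(user_id), gender, int(age), int(occupation), zip_code


print(parse_rating("1::1193::5::978300760"))
print(parse_movie("3::Grumpier Old Men (1995)::Comedy|Romance"))
print(parse_user("4::M::45::7::02460"))
```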

    I’ll create 3 different models.

    1. The simplest libFM file, trained without blocks. Features: UserID, MovieID
    2. A regular libFM file, trained without blocks. Features: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)
    3. libFM files trained with blocks. Same features as model 2: UserID, MovieID, Gender, Age, Occupation, Genre (of movie)

    Models 1 and 2 can be created using the following code:

    # -*- coding: utf-8 -*-
    __author__ = 'Silbermann Thierry'
    __license__ = 'WTFPL'

    import numpy as np
    import pandas as pd


    def create_libfm(w_filename, model_lvl=1):

        # Load the data
        data_ratings = pd.read_csv('ratings.dat', delimiter='::', engine='python',
                    names=['UserID', 'MovieID', 'Ratings', 'Timestamp'])
        # Movie titles contain accented characters that are not valid UTF-8
        data_movies = pd.read_csv('movies.dat', delimiter='::', engine='python',
                    encoding='latin-1',
                    names=['MovieID', 'Name', 'Genre_list'])
        data_users = pd.read_csv('users.dat', delimiter='::', engine='python',
                    names=['UserID', 'Gender', 'Age', 'Occupation', 'ZipCode'])

        # Transform data: join the user and movie attributes onto every rating,
        # so that row i of `data` carries all the features of rating i
        data = data_ratings.drop(['Timestamp'], axis=1)
        data = data.merge(data_users.drop(['ZipCode'], axis=1),
                          on='UserID', how='left')
        data = data.merge(data_movies.drop(['Name'], axis=1),
                          on='MovieID', how='left')

        print('Data loaded')

        # Map the data: every feature gets its own contiguous range of ids,
        # starting where the range of the previous feature ended
        feat = ['UserID', 'MovieID']
        if model_lvl > 1:
            # Genre_list is used as is: one category per genre combination
            feat.extend(['Gender', 'Age', 'Occupation', 'Genre_list'])

        offset = 0
        columns = []
        for feature_name in feat:
            uniq = np.unique(data[feature_name])
            mapping = {key: value + offset for value, key in enumerate(uniq)}
            columns.append(data[feature_name].map(mapping).values)
            offset += len(uniq)

        print('Mapping done')

        # Create libFM file: one line per rating, "rating id:1 id:1 ..."
        with open(w_filename, 'w') as w:
            for row in zip(data['Ratings'].values, *columns):
                s = "{0}".format(row[0])
                for feature_id in row[1:]:
                    s += " {0}:1".format(feature_id)
                w.write(s + '\n')


    if __name__ == '__main__':
        create_libfm('model1.libfm', 1)
        create_libfm('model2.libfm', 2)

    So you end up with two files, model1.libfm and model2.libfm. You just need to split each of them in two to create a training set and a test set, which I’ll call train_m1.libfm and test_m1.libfm (and likewise train_m2.libfm and test_m2.libfm for model 2).
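The post does not say how the files are split, so the following is just one option: a random split, assuming one rating per line. The 80/20 ratio, the fixed seed and the helper names split_lines and split_file are all assumptions made for this sketch.

```python
import random


def split_lines(lines, test_fraction=0.2, seed=0):
    """Shuffle a copy of `lines` and return (train_lines, test_lines)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_file(path, train_path, test_path, test_fraction=0.2, seed=0):
    """Split the libFM file at `path` into a training file and a test file."""
    with open(path) as handle:
        lines = handle.readlines()
    train, test = split_lines(lines, test_fraction, seed)
    with open(train_path, "w") as handle:
        handle.writelines(train)
    with open(test_path, "w") as handle:
        handle.writelines(test)


# Example call (needs model1.libfm in the current directory):
# split_file('model1.libfm', 'train_m1.libfm', 'test_m1.libfm')
```

Using the same seed for model1.libfm and model2.libfm puts the same ratings in both test sets, because both files list the ratings in the same order. That keeps the two models comparable.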

    Then you just run libFM like this:

    ./libFM -train train_m1.libfm -test test_m1.libfm -task r -iter 20 -method mcmc -dim '1,1,8' -out output_m1

    But I guess you already know how to do those.
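To compare the models afterwards, the predictions can be scored against the ratings in the test file. The sketch below assumes that the libFM output file holds one prediction per line, in the same order as the test file; the helper names are hypothetical.

```python
import math


def read_targets(libfm_path):
    """Read the first column (the rating) of every line of a libFM file."""
    with open(libfm_path) as handle:
        return [float(line.split()[0]) for line in handle if line.strip()]


def read_predictions(output_path):
    """Read one prediction per line (assumed output format)."""
    with open(output_path) as handle:
        return [float(line) for line in handle if line.strip()]


def rmse(targets, predictions):
    """Root mean squared error between two equally long sequences."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions differ in length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Example call (needs the files produced by the commands above):
# print(rmse(read_targets('test_m1.libfm'), read_predictions('output_m1')))
```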


    Now the interesting part, using blocks.

    [TODO]

  • Source: https://www.cnblogs.com/zhizhan/p/5099751.html