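Text Sentiment Classification: Lab Notes
This experiment is HW4 of Prof. Hung-yi Lee's Machine Learning 2020 course at National Taiwan University. [Assignment description] [Official reference implementation] [My implementation]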
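Data overview
The data for this experiment consists of tweets from Twitter, each labeled as positive or negative: 0 --> negative, 1 --> positive.
The data includes:
Labeled training data: 200,000 samples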
1 +++$+++ are wtf ... awww thanks !
1 +++$+++ leavingg to wait for kaysie to arrive myspacin itt for now ilmmthek .!
0 +++$+++ i wish i could go and see duffy when she comes to mamaia romania .
1 +++$+++ i know eep ! i can ' t wait for one more day ....
0 +++$+++ so scared and feeling sick . fuck ! hope someone at hr help ... wish it would be wendita or karen .
0 +++$+++ my b day was thurs . i wanted 2 do 5 this weekend for my b day but i guess close enough next weekend . going alone
1 +++$+++ e3 is in the trending topics only just noticed ive been tweeting on my iphone until now
1 +++$+++ where did you get him from i know someone who would love that !
0 +++$+++ dam just got buzzed by another huge fly ! this time it landed on my head ... not impressed
1 +++$+++ tomorrowwwwwwwww !!! you ' ll love tomorrow ' s news !
0 +++$+++ gonna try 2 sleep . damn garageband next to me won ' t let me tho
0 +++$+++ wish weekend .. but not really also .. cuz next monday is exam and i haven ' t studied at all yet hate exam .. grr
1 +++$+++ check this vid out .... you ' ll piss yourself laughin
0 +++$+++ damn you gavin !!!!!! i want my computer back !!!!
1 +++$+++ it ' s great that you feel better , fresh air is nice im sure it will help too
0 +++$+++ got a bloody wheel clamp yesterday ï ¿ ½150 for 15 mins parking
0 +++$+++ homework and summer school . we ' ll go soon though !
0 +++$+++ no it ' s not right ..... it is so wrong ... i would never have expected it
1 +++$+++ says sa mga mag gf bf na nag aaway make piece not war
1 +++$+++ only has under 200 words left to write on her assignment
0 +++$+++ son graduated 5th grade today hes so grown !
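As a minimal sketch of reading this format, assuming each line of the labeled file is laid out as `label +++$+++ text` with tokens already separated by spaces, one line can be split as follows. The helper name `parse_labeled_line` is made up for illustration and is not taken from the homework code.

```python
SEPARATOR = " +++$+++ "  # assumed delimiter between the label and the tweet text


def parse_labeled_line(line):
    """Split one 'label +++$+++ text' line into (tokens, label)."""
    label_str, text = line.rstrip("\n").split(SEPARATOR, 1)
    # The label arrives as a string ("0" or "1"); convert it to int here.
    return text.split(), int(label_str)


sample = "1 +++$+++ are wtf ... awww thanks !"
tokens, label = parse_labeled_line(sample)
print(tokens, label)
```

Splitting with a maximum of one split keeps the tweet intact even if the separator string were to appear again inside the text.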
Unlabeled training data: 1,200,000 samples, used for semi-supervised learning
mkhang mlbo . dami niang followers ee . di q rin naman sia masisisi . desperate n kng desperate , pero dpt tlga replyn nia q = d
don ' t you hate it when you hang on to a seemingly interesting movie to see the ending only to find out that the ending sucks ?
ok so never went to the movies because friend wasn ' t feeling well but next weekend . back to work today , wasn ' t too bad .
can ' t wait to see diversity ' s performance !
i love britney spears haha joey this is what u do go party with eric or do things haha
wish i could call in but i can ' t do blogtalk from work
1 more day !
nursing celeste with a tummy ache .
hates being this burnt !! ouch
just couldn ' t sleep last night . working 7a 3p , than dinner with megan . happy bday jl !
i love slaves ! by david raccah , linkedin , rotfl
is being super organised and making up orders to post first thing tomorrow !
laying in the bed . it feels soooooo good . what a long day
finally , at the airport . currently chilling out at the citibank lounge . maaaan , the wi fi here doesn ' t work ! lameeee !
back and still feeling shattered . still no cockney ... i ' m ashamed to say .
so do i
don ' t ask me difficult questions , i know how to spell , but not ponder the bigger picture !
hey guys ! i am a big fan too just like my twin lol .. have a good day ! and wishin ya the best of luck ! xd
ay dios mio ! 2 weeks left of college !!! ah can ' t wait !!
oh , we must be related ! i ' ve heard that line before !
i know , i don ' t know if kayley knows . he ' ll probably be resting again tomorrow , i hope not he ' ll be better .
good luck
the app never works for me
ew , im not that clever , im just lucky what bother you at the class ? the lessons ?
whoah crap , that was a mistake ... do not put the three letters together in a tweet im el im .. just got overwhelmed with follow bots .
problem with feedburner again . showing no . of feed readers less than actual ones .
im having problems don ' t worry
listening too mgmt time to pretend
Testing data: 200,000 samples (100,000 public, 100,000 private)
id,text
0,my dog ate our dinner . no , seriously ... he ate it .
1,omg last day sooon n of primary noooooo x im gona be swimming out of school wif the amount of tears am gona cry
2,stupid boys .. they ' re so .. stupid !
3,hi ! do u know if the nurburgring is open for tourists today ? we want to go , but there is an event today
4,having lunch in the office , and thinking of how to resolve this discount form issue
5,shopping was fun
6,wondering where all the nice weather has gone .
7,morning ! yeeessssssss new mimi in aug
8,umm ... maybe that ' s how the british spell it ?
9,yes it ' s 3 : 50 am . yes i ' m still awake . yes i can ' t sleep . yes i ' ll regret it tomorrow . haha i love you mr saturday
10,cute heart shaped portal cube . my baby is playing games , im reading fan fictions !
11,had a song on mtv movie awards !!!!!
12,thanks nite
13,did not start her religion isu i will fail
14,that sounds wonderful !! i shall have to try it one day soon !
15,i love ya mariah , i love listening to your songs , your such an inspiration for alot of people out there !!!
16,there is sooo much love on here that i could faint ! lol . go celtics !! i miss my b ball team . i ' m proud of you donnie !
17,just found out i ' m gonna be let out early tomorrow , cos we ' re getting the results . omg if i fail science ...
18,that was a good thing to wake up to your right we will , and thats why god made us friends !!! ily
19,and old cam ' pic of tene and i . goodtimes . heehe . i want my cake now , mum
20,ooh my god ! i know the feeling i cannot stand getting into london from harold wood
21,nothing ! just kept us there for 20 minutes until they realized a walkie talkie is just a little toy and not a spy tool
22,6flags today teexxxt i need to shower but i ' m being lazy . i really don ' t feel that good
23,apparently , these are from filming , not the aftermath of the skanky hoebag fans . celebrity sites twisted the truth
24,hey fairuz ili ! nice to see some friends here
25,also cancelled my nikon 50mm lens order needed to buy some struts and tires for my car ...
26,headed to dallas tomorrow ... need some sleep but i am not tired yet !!
27,i just found out that i won a shirt from pretty effin sweet , eh ? i wonder what i ' ll get
28,sad i didnt get tickets 2 nin ja in albuquerque and it sold out
29,has had the most enjoyable day she ' s had for a lonng time
30,this should do the trick
31,o i have 21 tests i do 10 subjects lucky ... n o right ... kl is it hard ??
32,lol ! i thought so ! have fun in vegas .
33,sarah vowell ? if your dad likes humor with his history
34,i like corpus
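The test file looks like a CSV, but the tweets themselves contain commas, so splitting on every comma would break the text apart. A small sketch, assuming a header row `id,text` followed by one `id,text` row per line (the function name `parse_test_lines` is hypothetical), splits on the first comma only:

```python
def parse_test_lines(lines):
    """Skip the 'id,text' header and split each row on the first comma only."""
    rows = []
    for line in lines[1:]:
        idx, text = line.rstrip("\n").split(",", 1)
        rows.append((int(idx), text.split()))
    return rows


raw = [
    "id,text",
    "0,my dog ate our dinner . no , seriously ... he ate it .",
    "2,stupid boys .. they ' re so .. stupid !",
]
rows = parse_test_lines(raw)
print(rows[0])
```

The unlabeled file needs even less work under the same assumptions: each line is just a tweet, so `line.split()` yields its tokens.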
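Implementation steps
1. Data preprocessing
1.1) Read the data, including train_label_data, train_no_label_data, and test_data, and feed it into a word2vec model (gensim) to train w2v_all.model.
1.2) Read the labeled training data train_label --> input, then convert each sentence in input into word-embedding form --> train_x:
- From input, build the embedding matrix and the dictionaries that map between word and idx. Remember to add <PAD> and <UNK>.
- Convert every word of every sentence in input to its idx, and store the result sentence by sentence in train_x as the model input. Any word that has not been seen before maps to <UNK>.
- Using the hyperparameter sen_len, trim or pad each sentence (padding fills every missing position with <PAD>) so that all sentences have the same length.
1.3) Convert label from str to int --> y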
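Steps 1.2 and 1.3 can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_vectors` is a hand-written stand-in for the vectors that the trained word2vec model would provide, the helper names `build_vocab` and `sentence_to_idx` are made up here, and the choice of small random vectors for <PAD> and <UNK> is one common option rather than something the notes specify.

```python
import random

PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(word_vectors, dim, seed=0):
    """Return (word2idx, idx2word, embedding_matrix) from a word -> vector mapping."""
    rng = random.Random(seed)
    word2idx, idx2word, matrix = {}, [], []
    for word, vec in word_vectors.items():
        word2idx[word] = len(idx2word)
        idx2word.append(word)
        matrix.append(list(vec))
    # <PAD> and <UNK> have no trained vector, so give each a small random one.
    for token in (PAD, UNK):
        word2idx[token] = len(idx2word)
        idx2word.append(token)
        matrix.append([rng.uniform(-1, 1) for _ in range(dim)])
    return word2idx, idx2word, matrix


def sentence_to_idx(tokens, word2idx, sen_len):
    """Map tokens to indices, then trim or right-pad to exactly sen_len."""
    ids = [word2idx.get(tok, word2idx[UNK]) for tok in tokens][:sen_len]
    ids += [word2idx[PAD]] * (sen_len - len(ids))
    return ids


# Hypothetical 2-dimensional vectors standing in for the word2vec output.
toy_vectors = {"i": [0.1, 0.2], "love": [0.3, 0.4], "exam": [0.5, 0.6]}
word2idx, idx2word, matrix = build_vocab(toy_vectors, dim=2)

sentences = [["i", "love", "nlp"], ["i", "love", "exam", "i", "love"]]
train_x = [sentence_to_idx(s, word2idx, sen_len=4) for s in sentences]
y = [int(label) for label in ["1", "0"]]  # step 1.3: str -> int
print(train_x, y)
```

In this toy run the unseen word "nlp" falls back to the <UNK> index, the short sentence is padded with <PAD>, and the long one is cut to sen_len.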
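2. Preparing the data
2.1) Split train_x and y into a training set and a validation set: X_train, X_val, y_train, y_val
2.2) Build the Dataset and DataLoader for train and val, so that shuffling, batching, and similar operations are handled conveniently when feeding the model.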
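A possible shape for steps 2.1 and 2.2 with PyTorch is sketched below. The class name `TwitterDataset`, the toy data, the validation size, and the batch size are all assumptions chosen for illustration; the split simply holds out the last `n_val` samples.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TwitterDataset(Dataset):
    """Pairs index sequences with labels; y may be None for unlabeled or test data."""

    def __init__(self, X, y=None):
        self.X = torch.as_tensor(X, dtype=torch.long)
        self.y = None if y is None else torch.as_tensor(y, dtype=torch.float)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if self.y is None:
            return self.X[idx]
        return self.X[idx], self.y[idx]


# Toy stand-ins for train_x and y: 10 sentences of length 4.
train_x = [[i, i + 1, i + 2, i + 3] for i in range(10)]
y = [i % 2 for i in range(10)]

n_val = 2  # hypothetical validation size
X_train, X_val = train_x[:-n_val], train_x[-n_val:]
y_train, y_val = y[:-n_val], y[-n_val:]

train_loader = DataLoader(TwitterDataset(X_train, y_train), batch_size=4, shuffle=True)
val_loader = DataLoader(TwitterDataset(X_val, y_val), batch_size=4, shuffle=False)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)
```

Labels are stored as floats here because a sigmoid output is usually paired with a binary cross-entropy loss, which expects float targets.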
3. Preparing the RNN model
3.1) Define an LSTM_Net model
# model.py
import torch
from torch import nn


class LSTM_Net(nn.Module):
    # `embedding` here is the pretrained embedding matrix (a tensor)
    def __init__(self, embedding, embedding_dim, hidden_dim, num_layers, dropout=0.5, fix_embedding=True):
        super(LSTM_Net, self).__init__()
        # Embedding layer initialised from the pretrained matrix
        self.embedding = torch.nn.Embedding(embedding.size(0), embedding.size(1))
        self.embedding.weight = torch.nn.Parameter(embedding)
        # Freeze the embedding when fix_embedding is True; otherwise fine-tune it
        self.embedding.weight.requires_grad = not fix_embedding
        self.embedding_dim = embedding.size(1)

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout = dropout
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.classifier = nn.Sequential(nn.Dropout(dropout),
                                        nn.Linear(hidden_dim, 1),
                                        nn.Sigmoid())

    def forward(self, inputs):
        inputs = self.embedding(inputs)
        x, _ = self.lstm(inputs, None)
        # x has shape (batch, seq_len, hidden_size)
        # Take the top LSTM layer's hidden state at the last time step
        x = x[:, -1, :]
        x = self.classifier(x)
        return x
4. Training the model
4.1) Train in model.train() mode and validate in model.eval() mode. The procedure is similar to the earlier image CNN experiment.
4.2) Once all epochs have finished, the model that is kept is the one that reached best_acc on the validation set.
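The train/validate cycle and the best_acc bookkeeping can be sketched as follows. Everything here is an assumption made for a runnable toy: `TinyNet` is a hypothetical stand-in for LSTM_Net, the data is random, each epoch does a single full-batch update instead of iterating a DataLoader, and the best weights are kept in memory where a real run would more likely write them to disk with torch.save.

```python
import copy

import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for LSTM_Net: embedding -> LSTM -> sigmoid score."""

    def __init__(self, vocab_size=20, dim=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, inputs):
        x, _ = self.lstm(self.embedding(inputs))
        return self.classifier(x[:, -1, :])


def evaluate(model, X, y):
    """Accuracy with a 0.5 threshold, computed in eval mode without gradients."""
    model.eval()
    with torch.no_grad():
        preds = (model(X).squeeze(1) >= 0.5).float()
    return (preds == y).float().mean().item()


torch.manual_seed(0)
X_train = torch.randint(0, 20, (32, 5))
y_train = (X_train[:, 0] < 10).float()  # toy labelling rule
X_val = torch.randint(0, 20, (16, 5))
y_val = (X_val[:, 0] < 10).float()

model = TinyNet()
criterion = nn.BCELoss()  # matches the sigmoid output
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

best_acc, best_state = -1.0, None
for epoch in range(5):
    model.train()  # enables dropout and other training-only behaviour
    optimizer.zero_grad()
    loss = criterion(model(X_train).squeeze(1), y_train)
    loss.backward()
    optimizer.step()

    val_acc = evaluate(model, X_val, y_val)
    if val_acc > best_acc:  # keep only the best checkpoint seen so far
        best_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())

print(f"best validation accuracy: {best_acc:.3f}")
```

Checking after every epoch means the kept weights come from whichever epoch validated best, which is not necessarily the last one.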
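5. Predicting on the test data
5.1) Read the test data, and remember to apply the same embedding preprocessing to it.
5.2) Feed the processed test data to the model to obtain predictions, and save them to a CSV file.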
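Because the model ends in a sigmoid, its outputs are probabilities that still have to be turned into 0/1 labels before saving. A minimal sketch, assuming a 0.5 cut-off and an `id,label` column layout for the CSV (both are assumptions, as are the helper names), writes to an in-memory buffer so it runs without touching the disk:

```python
import csv
import io


def probs_to_labels(probs, threshold=0.5):
    """Turn sigmoid outputs into hard 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probs]


def write_predictions(labels, fileobj):
    """Write one 'id,label' row per test sample, with ids following row order."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["id", "label"])
    for idx, label in enumerate(labels):
        writer.writerow([idx, label])


probs = [0.91, 0.12, 0.5, 0.49]  # hypothetical model outputs
labels = probs_to_labels(probs)
buffer = io.StringIO()
write_predictions(labels, buffer)
print(buffer.getvalue())
```

For a real run, the buffer would be replaced by a file opened with `open(path, "w", newline="")`.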
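Supplement: semi-supervised learning
This makes use of the unlabeled data. The method used here is one that is relatively easy to implement: self-training.
Self-Training: use the trained model to predict on the unlabeled data, turn those predictions into labels for the unlabeled data, and add the newly labeled data to the training set. You can adjust the threshold, or sample several times, to keep only the data the model is more confident about.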
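The selection step of self-training can be sketched without any model at all, given the predicted probabilities. The function name `select_pseudo_labels`, the threshold of 0.8, and the toy probabilities are assumptions for illustration; the notes do not state which threshold was used.

```python
def select_pseudo_labels(samples, probs, threshold=0.8):
    """Keep only samples the model is confident about and assign them hard labels."""
    pseudo_x, pseudo_y = [], []
    for sample, p in zip(samples, probs):
        if p >= threshold:  # confidently positive
            pseudo_x.append(sample)
            pseudo_y.append(1)
        elif p <= 1 - threshold:  # confidently negative
            pseudo_x.append(sample)
            pseudo_y.append(0)
    return pseudo_x, pseudo_y


train_x = [[0, 1, 2], [2, 1, 0]]
y = [1, 0]
unlabeled_x = [[3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]]
probs = [0.95, 0.55, 0.05, 0.30]  # hypothetical predictions on unlabeled_x

pseudo_x, pseudo_y = select_pseudo_labels(unlabeled_x, probs, threshold=0.8)
train_x = train_x + pseudo_x
y = y + pseudo_y
print(len(train_x), y)
```

Raising the threshold admits fewer but cleaner pseudo-labels, and lowering it adds more data at the cost of more label noise. The enlarged training set is then used for another round of training.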