While learning Spark recently I needed a large number of text files, so I made some improvements on top of an earlier project. This post is a summary.
First, define the MarkovChain class. It works on bigrams (adjacent word pairs), which makes it a first-order Markov chain over words: each entry of the transition matrix can be estimated as P_ij = (number of observed transitions from word i to word j) / (total number of transitions out of word i).
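Before the full class, here is a minimal sketch of that transition-matrix estimate on its own. The function name `transition_probs` and the toy sentence are illustrative, not part of the original code:

```python
from collections import Counter, defaultdict

def transition_probs(tokens):
    """Estimate P_ij = count(i -> j) / total transitions out of i."""
    counts = defaultdict(Counter)
    # Count every adjacent word pair (bigram) in the token stream.
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    # Normalize each row so the outgoing probabilities sum to 1.
    return {
        state: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for state, nxts in counts.items()
    }

probs = transition_probs("the cat sat on the mat".split())
print(probs["the"])  # {'cat': 0.5, 'mat': 0.5}
```

"the" is followed once by "cat" and once by "mat", so each gets probability 0.5; the class below stores the raw follower lists instead and samples from them, which is equivalent to sampling from these normalized rows.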
```python
import random

class MarkovChain:
    def __init__(self):
        # Maps each word to the list of words observed to follow it.
        self.memory = {}

    def _learn_key(self, key, value):
        if key not in self.memory:
            self.memory[key] = []
        self.memory[key].append(value)

    def learn(self, text):
        tokens = text.split(" ")
        bigrams = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        for bigram in bigrams:
            self._learn_key(bigram[0], bigram[1])

    def _next(self, current_state):
        next_possible = self.memory.get(current_state)
        if not next_possible:
            # Unknown state: fall back to sampling among all known states.
            next_possible = self.memory.keys()
        # random.choice needs a sequence, so materialize the view first
        # (random.sample on dict keys raises TypeError in Python 3).
        return random.choice(list(next_possible))

    def babble(self, amount, state=''):
        # Iterative rather than recursive, so generating long texts
        # does not hit Python's recursion limit (~1000 frames).
        words = [state]
        for _ in range(amount):
            state = self._next(state)
            words.append(state)
        return ' '.join(words)
```
Next, load the txt file you want the chain to learn from (if you need a large amount of txt text for training, you can download it from this github page).
```python
with open("../books/3001.txt", "r") as myfile:
    data = myfile.readlines()

m = MarkovChain()
for i in data:
    m.learn(i)
```
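If you have a whole folder of training texts rather than a single file, a small helper can gather every line from every .txt file in it. The function name `load_corpus` and the `../books` folder layout are assumptions for illustration:

```python
import glob

def load_corpus(folder):
    """Collect all lines from every .txt file under `folder`."""
    lines = []
    for path in glob.glob(f"{folder}/*.txt"):
        # errors="ignore" skips stray non-UTF-8 bytes in downloaded books.
        with open(path, "r", errors="ignore") as f:
            lines.extend(f.readlines())
    return lines

# Then feed every line to the chain, e.g.:
# m = MarkovChain()
# for line in load_corpus("../books"):
#     m.learn(line)
```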
Finally, you can generate txt text of any specified length (written to data/test.txt). If you need a large number of files, just wrap this in a loop.
```python
length = 1000

with open("../data/test.txt", "w") as text_file:
    text_file.write(m.babble(length))
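The "just add a loop" part could be sketched like this; `generate_corpus`, the output directory, and the `gen_i.txt` naming scheme are all hypothetical choices, not from the original post:

```python
import os

def generate_corpus(chain, out_dir, n_files=10, length=1000):
    """Write n_files generated texts of `length` words each into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(n_files):
        with open(os.path.join(out_dir, f"gen_{i}.txt"), "w") as f:
            f.write(chain.babble(length))

# e.g. generate_corpus(m, "../data", n_files=100)
```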
I didn't have time to note the sources of some of this code; I'll add them later!
Reposting is welcome, just credit the source: 嘻哈小屋 » Automatically generating large numbers of txt files in Python with Markov chains