您现在的位置是：主页 > news > 网站建设价格女/中国站长之家

网站建设价格女/中国站长之家

admin2025/5/25 18:07:16【news】

简介网站建设价格女,中国站长之家,公司网站开发费用计入哪个科目,上海2023年建设市场放假时间单词纠错在我们平时使用Word或者其他文字编辑软件的时候，常常会遇到单词纠错的功能。比如在Word中：单词拼写错误单词纠错算法首先，我们需要一个语料库，基本上所有的NLP任务都会有语料库。单词纠错的语料库为bit.txt，里…

网站建设价格女,中国站长之家,公司网站开发费用计入哪个科目,上海2023年建设市场放假时间单词纠错在我们平时使用Word或者其他文字编辑软件的时候，常常会遇到单词纠错的功能。比如在Word中：单词拼写错误单词纠错算法首先，我们需要一个语料库，基本上所有的NLP任务都会有语料库。单词纠错的语料库为bit.txt，里…

单词纠错

在我们平时使用Word或者其他文字编辑软件的时候，常常会遇到单词纠错的功能。比如在Word中：

单词拼写错误

单词纠错算法

首先，我们需要一个语料库，基本上所有的NLP任务都会有语料库。单词纠错的语料库为bit.txt，里面包含的内容如下：

Gutenberg语料库数据;

维基词典;

英国国家语料库中的最常用单词列表。

下载的网址为：https://github.com/percent4/-word- 。

Python实现

实现单词纠错的完整Python代码(spelling_correcter.py)如下：

# -*- coding: utf-8 -*-

import re, collections

def tokens(text):

"""

Get all words from the corpus

"""

return re.findall('[a-z]+', text.lower())

with open('E://big.txt', 'r') as f:

WORDS = tokens(f.read())

WORD_COUNTS = collections.Counter(WORDS)

def known(words):

"""

Return the subset of words that are actually

in our WORD_COUNTS dictionary.

"""

return {w for w in words if w in WORD_COUNTS}

def edits0(word):

"""

Return all strings that are zero edits away

from the input word (i.e., the word itself).

"""

return {word}

def edits1(word):

"""

Return all strings that are one edit away

from the input word.

"""

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def splits(word):

"""

Return a list of all possible (first, rest) pairs

that the input word is made of.

"""

return [(word[:i], word[i:]) for i in range(len(word) + 1)]

pairs = splits(word)

deletes = [a + b[1:] for (a, b) in pairs if b]

transposes = [a + b[1] + b[0] + b[2:] for (a, b) in pairs if len(b) > 1]

replaces = [a + c + b[1:] for (a, b) in pairs for c in alphabet if b]

inserts = [a + c + b for (a, b) in pairs for c in alphabet]

return set(deletes + transposes + replaces + inserts)

def edits2(word):

"""

Return all strings that are two edits away

from the input word.

"""

return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def correct(word):

"""

Get the best correct spelling for the input word

"""

# Priority is for edit distance 0, then 1, then 2

# else defaults to the input word itself.

candidates = (known(edits0(word)) or

known(edits1(word)) or

known(edits2(word)) or

[word])

return max(candidates, key=WORD_COUNTS.get)

def correct_match(match):

"""

Spell-correct word in match,

and preserve proper upper/lower/title case.

"""

word = match.group()

def case_of(text):

"""

Return the case-function appropriate

for text: upper, lower, title, or just str.:

"""

return (str.upper if text.isupper() else

str.lower if text.islower() else

str.title if text.istitle() else

str)

return case_of(word)(correct(word.lower()))

def correct_text_generic(text):

"""

Correct all the words within a text,

returning the corrected text.

"""

return re.sub('[a-zA-Z]+', correct_match, text)

测试

有了上述的单词纠错程序，接下来我们对一些单词或句子做测试。如下：

original_word_list = ['fianlly', 'castel', 'case', 'monutaiyn', 'foresta', \

'helloa', 'forteen', 'persreve', 'kisss', 'forteen helloa', \

'phons forteen Doora. This is from Chinab.']

for original_word in original_word_list:

correct_word = correct_text_generic(original_word)

print('Orginial word: %s\nCorrect word: %s'%(original_word, correct_word))

输出结果如下：

Orginial word: fianlly

接着，我们对如下的Word文档(Spelling Error.docx)进行测试(下载地址为：https://github.com/percent4/-word-)，

有单词错误的Word文档

对该文档进行单词纠错的Python代码如下：

from docx import Document

from nltk import sent_tokenize, word_tokenize

from spelling_correcter import correct_text_generic

from docx.shared import RGBColor

# 文档中修改的单词个数

COUNT_CORRECT = 0

#获取文档对象

file = Document("E://Spelling Error.docx")

#print("段落数:"+str(len(file.paragraphs)))

punkt_list = r",.?\"'!()/\\-<>:@#$%^&*~"

document = Document() # word文档句柄

def write_correct_paragraph(i):

global COUNT_CORRECT

# 每一段的内容

paragraph = file.paragraphs[i].text.strip()

# 进行句子划分

sentences = sent_tokenize(text=paragraph)

# 词语划分

words_list = [word_tokenize(sentence) for sentence in sentences]

p = document.add_paragraph(' '*7) # 段落句柄

for word_list in words_list:

for word in word_list:

# 每一句话第一个单词的第一个字母大写，并空两格

if word_list.index(word) == 0 and words_list.index(word_list) == 0:

if word not in punkt_list:

p.add_run(' ')

# 修改单词，如果单词正确，则返回原单词

correct_word = correct_text_generic(word)

# 如果该单词有修改，则颜色为红色

if correct_word != word:

colored_word = p.add_run(correct_word[0].upper()+correct_word[1:])

font = colored_word.font

font.color.rgb = RGBColor(0x00, 0x00, 0xFF)

COUNT_CORRECT += 1

else:

p.add_run(correct_word[0].upper() + correct_word[1:])

else:

p.add_run(word)

else:

p.add_run(' ')

# 修改单词，如果单词正确，则返回原单词

correct_word = correct_text_generic(word)

if word not in punkt_list:

# 如果该单词有修改，则颜色为红色

if correct_word != word:

colored_word = p.add_run(correct_word)

font = colored_word.font

font.color.rgb = RGBColor(0xFF, 0x00, 0x00)

COUNT_CORRECT += 1

else:

p.add_run(correct_word)

else:

p.add_run(word)

for i in range(len(file.paragraphs)):

write_correct_paragraph(i)

document.save('E://correct_document.docx')

print('修改并保存文件完毕！')

print('一共修改了%d处。'%COUNT_CORRECT)

输出的结果如下：

修改并保存文件完毕！

修改后的Word文档如下：

单词纠错后的Word文档

其中的红色字体部分为原先的单词有拼写错误，进行拼写纠错后的单词，一共修改了19处。

总结

单词纠错实现起来并没有想象中的那么难，但也不是那么容易~https://github.com/percent4/-word- 。

您现在的位置是：主页 > news > 网站建设价格女/中国站长之家

网站建设价格女/中国站长之家

相关文章

最新文章