Sunday, 18 August 2013

Spell Check Wordlist Cleaner Sanitiser Script dictionary .dic

Spell-checker word lists have been polluted with too much crap. They don't seem to be maintained very well, if at all.

Do a file search for your browser's dictionary file. For instance, en-US.dic contains entries such as 45593, awe-inspiring, BIOS's, hep, hosp, Jugiong, mer's, ODs, Sc, Supt. and kw, and the list goes on and on.

Whether this is the right solution is open to debate, but the results are obvious: out of about 43000 words, 13000 were found to be unsatisfactory.

The Perl script below runs the dictionary file through a bunch of regexes, pulling out unwanted expressions. By default it is set up to be very harsh, but users can comment out the 'elsif' to '}' lines for any rules they don't need.
The script outputs two files: one cleaned, and one containing the removed lines.
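
To make the idea of a chain of regex rules concrete, here is a minimal sketch that keeps a few of the rules in a table and reports which one catches each word. The names @rules and first_bad_rule are made up for this illustration and are not part of spellclean.pl, and the word-length rules are left out to keep it short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical rule table (not taken from spellclean.pl): each row pairs a
# label with a regex. Comment out a row to switch that rule off.
my @rules = (
    [ 'starts with a digit' => qr/^\d/       ],
    [ 'acronym'             => qr/^[A-Z]{2}/ ],
    [ 'hyphen'              => qr/.-/        ],
    [ 'dot'                 => qr/.\./       ],
    [ 'capitalised'         => qr/^[A-Z]/    ],
    [ 'apostrophe'          => qr/'/         ],
);

# Return the label of the first rule that matches the word,
# or undef if no rule matches and the word is kept.
sub first_bad_rule {
    my ($word) = @_;
    for my $rule (@rules) {
        my ($label, $regex) = @$rule;
        return $label if $word =~ $regex;
    }
    return;
}

# Try the rules on a few of the entries mentioned above, plus one plain word.
for my $word (qw(45593 awe-inspiring BIOS's Supt. Jugiong mer's hello)) {
    my $label = first_bad_rule($word);
    printf "%-14s %s\n", $word, defined $label ? "removed ($label)" : 'kept';
}
```

As in the script itself, the order matters: the first rule that matches decides the outcome, so a word like BIOS's is reported as an acronym and never reaches the apostrophe rule.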

By the way, it has been tested on Linux with Perl 5, version 14. Good luck getting it to run on Windows; the line terminators may get you. A good test is probably to re-run the script on the cleaned file and see if it yields any more results.
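
On the line terminator issue, one possible workaround is to strip the terminator before a line is measured or matched, so that a CRLF file and an LF file behave the same. This is only a sketch: strip_eol is a hypothetical helper and is not part of spellclean.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in spellclean.pl): strip a trailing LF or CRLF so
# that only the word itself is measured and matched, whichever platform
# wrote the file.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# The same word with Unix and Windows line endings.
for my $raw ("cat\n", "cat\r\n") {
    my $word = strip_eol($raw);
    printf "%d characters after stripping\n", length $word;
}
```

With the terminator gone, a length rule sees the same number of characters for the same word on either kind of file.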


Be sure to copy your dictionary file to a local directory before running the script. You may have to rerun the script if the dictionary updates itself.

USAGE spellclean.pl FILE.dic

Link to script:
http://pastebin.com/PAa9m0Ud

References:
http://www.chromium.org/developers/how-tos/editing-the-spell-checking-dictionaries
http://www.openoffice.org/lingucomponent/dictionary.html
http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html
http://txt2re.com/

#!/usr/bin/perl

#spellclean.pl
#cleans the spell checking word list.
#USAGE spellclean.pl FILE.dic


my $filename = shift or die "USAGE: spellclean.pl FILE.dic\n";  #take the dictionary file name from the command line
my $filenamemod = $filename.'.cleaned';  #cleaned filename and extension
my $filenamebad = $filename.'.bad';  #bad lines filename and extension
my $linecount = 0;  #whole file line counter
my $removeline = 0;  #delete line flag
my @badlines = ();  #delete line array
my $modlinecount = 0;   #line counter for the destination file.

# open file for parsing
open(SPELLFILE, '<', $filename) or die "error opening file $filename: $!\n";


print "A preview of the words to be removed follows; press [Enter] to continue: ";
<STDIN>;  #wait for input


while (my $line = <SPELLFILE>) {
 $linecount++;  #increment linecount

 # NOTE: '.' also matches a trailing \r, so on files with CRLF line endings the
 # length rules below count one extra character. The letter counts in their
 # comments assume CRLF; on LF-only files each rule removes words one letter longer.
 if ($line =~ /^\d/) { #remove any words beginning with numbers
  $removeline = 1;
 }
 elsif ($line =~ /^[A-Z]{2}/) { #remove any words starting with two CAPitals, like acronyms
  $removeline = 1;
 }
 elsif ($line =~ /.-/) { #remove any words with hyphens
  $removeline = 1;
 }
 elsif ($line =~ /.\./) { #remove any words with dots
  $removeline = 1;
 }
 elsif ($line =~ /^.{1,3}$/) { #remove any one and two letter words
  $removeline = 1;
 }
 elsif ($line =~ /^.{4}$/) { #remove any three letter words
  $removeline = 1;
 }
 elsif ($line =~ /^.{5}$/) { #remove any four letter words
  $removeline = 1;
 }
 elsif ($line =~ /^[A-Z]/) { #remove any Name style words
  $removeline = 1;
 }
 elsif ($line =~ /'/) { #remove any words with apostrophes
  $removeline = 1;
 }
 
 # push the line number onto the array, then reset the removeline flag to 0 before looping back
 if ($removeline == 1) {
  push (@badlines, $linecount);
  print "Bad line $linecount -> $line";
  $removeline = 0;
 }

}
close (SPELLFILE); # close the file


print "Preview complete, press [Enter] to remove words from list or ^C to abort! ";
<STDIN>;   #wait for input


#open the original file and the two destination files.
open(SPELLFILE, '<', $filename) or die "error opening file $filename: $!\n";
open(SPELLFILEMOD, '>', $filenamemod) or die "error creating file $filenamemod: $!\n";
open(SPELLFILEBAD, '>', $filenamebad) or die "error creating file $filenamebad: $!\n";

$linecount = 0;  #reset line counter to zero.
$removeline = 0; #reuse this variable.
 

#Walk the original file again. @badlines is in ascending order, so only its next unused
#element, $badlines[$removeline], needs checking. If it equals the current line number, the
#line is bad: it goes to the .bad file instead of the cleaned file, and $removeline is
#incremented to point at the next array element.

while (my $line = <SPELLFILE>) {
 $linecount++;
 if ($removeline < @badlines && $badlines[$removeline] == $linecount) {
  print "Removing line $linecount -> $line";
  print SPELLFILEBAD $line;
  $removeline++;
 }
 else {
  print SPELLFILEMOD $line;
  $modlinecount++;
 }
}
#close the files
close SPELLFILE;
close SPELLFILEMOD;
close SPELLFILEBAD;

# summary
print "\n\nCompleted!\n";
print "Total lines in the original file is: $linecount\n";
print "Total lines removed is: ".scalar(@badlines)."\n";
print "Cleaned file has: $modlinecount lines\n";
print "Cleaned file is saved as: $filenamemod\n";
print "Bad lines file is saved as: $filenamebad\n";



