Word frequency list

grantb
Posts: 5
Joined: Wed Sep 13, 2017 3:26 am

Word frequency list

Post by grantb » Wed Sep 13, 2017 3:38 am

Today I wanted to create a word frequency list of the words used in job descriptions at my company. I didn't find a single ready-made filter for exactly this in TextPipe Standard, but it was easy to do in two steps: first use the "text to word list" filter to create a single text file of all the words in my text-format job description archive, then use that text file as input for the "count duplicate lines" filter.

The "text to word list" filter read all my job descriptions and put each word on its own line; the "count duplicate lines" filter then counted the words and produced a second text file listing each word with its count.
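
For anyone who wants to see the same two-step idea outside TextPipe, here is a rough Python sketch. It is not how TextPipe implements its filters; the function names, the word pattern, and the lowercasing are all my own assumptions for illustration.

```python
import re
from collections import Counter


def text_to_word_list(text):
    """Step 1: return the words of `text`, one list entry per word.

    Assumption: a word is a run of ASCII letters/digits, with internal
    hyphens allowed; words are lowercased so "Team" and "team" match.
    """
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text.lower())


def count_duplicate_lines(lines):
    """Step 2: count how often each entry occurs, most frequent first."""
    return Counter(lines).most_common()


sample = "Manage the team. Lead the team; manage budgets."
for word, count in count_duplicate_lines(text_to_word_list(sample)):
    print(f"{count}\t{word}")
```

Keeping the two steps as separate functions mirrors the two filters, so either one can be swapped out on its own.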

Just what I needed. I'm sharing this here in case others search for something similar.

DataMystic Support
Site Admin
Posts: 2220
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Word frequency list

Post by DataMystic Support » Wed Sep 13, 2017 10:54 pm

Thanks, Grant. You can upload filters too, provided they are zipped.
Regards,

Simon Carter, http://DataMystic.com/forums/index.php

DFH
Posts: 716
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Word frequency list

Post by DFH » Tue Sep 19, 2017 3:35 am

Help for the Text to word list filter states:
This filter takes all the incoming words and outputs them one per line, with a DOS line feed between them. This can be used to generate word lists for Indexes, encryption programs etc. Hyphenated words are recognised as single words, provided that they aren't broken across lines. To get around this limitation, use a Search and Replace filter to replace hyphens followed by line feeds with just a hyphen.
Only the hyphen/minus is counted as a special case.
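
The Search and Replace step that the help text suggests (a hyphen followed by a line feed becomes just a hyphen) can be sketched in Python as below. This is only an illustration of the replacement itself, not TextPipe's own pattern syntax; the function name is made up, and accepting both DOS (CR LF) and Unix (LF) line endings is my assumption.

```python
import re


def join_hyphenated_line_breaks(text):
    """Replace a hyphen followed by a line break with just the hyphen.

    Assumption: the line break may be CR LF or a bare LF.
    """
    return re.sub(r"-\r?\n", "-", text)


print(join_hyphenated_line_breaks("a well-\r\nknown employer"))
```

Note that this keeps the hyphen, as the help text describes, so a word that was split only for layout reasons would still come out hyphenated.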

One obvious limitation is how the filter should deal with English possessives (or other abbreviations) ending with ’s.

Currently, such words would be stripped of the ’s, which may not be what the user requires.

To work around this, use a Search and Replace filter to replace the ’ with an unused letter such as the small letter thorn þ.
Then restore the ’ afterwards by means of another Search and Replace filter.
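
A rough Python sketch of this placeholder trick follows. The letters-only splitter is a stand-in that I am assuming behaves like the filter (it drops anything that is not a letter); it is not TextPipe's actual logic, and the function name is hypothetical.

```python
import re

PLACEHOLDER = "þ"  # assumed not to occur anywhere in the input text


def words_keeping_possessives(text):
    """Split text into words while keeping ’s attached to its word."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder letter already occurs in the text")
    # First Search and Replace: hide the apostrophe behind a letter.
    protected = text.replace("’", PLACEHOLDER)
    # Stand-in for the word list step: keep runs of letters only.
    words = re.findall(r"[A-Za-zþ]+", protected)
    # Second Search and Replace: restore the apostrophe.
    return [word.replace(PLACEHOLDER, "’") for word in words]


print(words_keeping_possessives("The manager’s role"))
```

The check for the placeholder at the start matters: if þ already appears in the text, the restore step would turn genuine þ letters into apostrophes.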
Last edited by DFH on Tue Sep 19, 2017 3:38 am, edited 1 time in total.

DFH
Posts: 716
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Re: Word frequency list

Post by DFH » Tue Sep 19, 2017 3:37 am

To what extent is the Text to word list filter UTF-8 aware?

The help page does not even indicate whether it's limited to ANSI or ASCII letters.

David
