Count Duplicate Lines filter


DFH
Posts: 654
Joined: Sun Dec 09, 2007 2:49 am
Location: UK

Post by DFH » Tue May 29, 2012 4:01 am

This new topic is copied from one of my comments in the thread "Please add Unicode support to the Text to Word List filter".
I've reposted it here to focus attention on the Count Duplicate Lines filter.


My existing Text to Word List filter didn't cope properly with soft hyphens,
presumably because U+00AD lies outside ASCII, although it is part of Windows-1252 (aka ANSI).

Nothing indicates that characters U+00A0 to U+00FF are unsupported by the Count Duplicate Lines filter,
which follows the Text to Word List subfilter in my two-stage filter.
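
To make the ASCII versus ANSI distinction concrete, here is a minimal Python sketch. It assumes Python's `cp1252` codec as a stand-in for Windows-1252; the helper names `is_ascii` and `cp1252_byte` are hypothetical, and none of this reflects how TextPipe handles characters internally.

```python
# Illustration only: checks where U+00AD and the range U+00A0..U+00FF sit
# relative to ASCII and Windows-1252. Helper names are made up for this sketch.

SOFT_HYPHEN = "\u00ad"


def is_ascii(ch: str) -> bool:
    """True if the character is in the 7-bit ASCII range."""
    return ord(ch) < 0x80


def cp1252_byte(ch: str):
    """Return the Windows-1252 encoding of ch, or None if it has none."""
    try:
        return ch.encode("cp1252")
    except UnicodeEncodeError:
        return None


# Every code point from U+00A0 to U+00FF maps to the single byte of the same value.
upper_range = [chr(cp) for cp in range(0xA0, 0x100)]
all_single_byte = all(cp1252_byte(ch) == bytes([ord(ch)]) for ch in upper_range)

print(is_ascii(SOFT_HYPHEN))     # soft hyphen is not ASCII
print(cp1252_byte(SOFT_HYPHEN))  # but it does have a Windows-1252 byte
print(all_single_byte)
```

So a filter that handles only 7-bit ASCII would miss the soft hyphen, while one that handles the full Windows-1252 code page would not.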

Contrast this with the Sort filter, whose help states:
Sort Type

The sort type controls the method by which items are sorted. The available options are:
· ANSI sort (case insensitive)
· ANSI sort (case sensitive) - faster than case insensitive as no case-mapping is performed
· ASCII sort (case insensitive)
· ASCII sort (case sensitive) - faster than case insensitive as no case-mapping is performed
...

So before extending the Text to Word List filter to cope with Unicode in general,
please could you first extend the Count Duplicate Lines filter to support ANSI.

Meanwhile, I'll tweak my two-stage filter to investigate further.

David


Re: Count Duplicate Lines filter

Post by DFH » Fri Jun 01, 2012 12:41 am

The help for the Count Duplicate Lines filter includes:
"The file need NOT be sorted prior to this filter."

Yet it's clear that the filter does actually sort the results.
Judging by the sample below, the sort seems to be
ASCII sort (case insensitive), since "a" precedes "Aaron" and "abased" precedes "Abba".

Here's an example of the first few output lines from my KJV NT Word List.

001911   a
000004   Aaron
000001   Aaron's
000001   Abaddon
000004   abased
000001   abasing
000003   Abba
000004   Abel
000001   Abhor
000001   abhorrest
000003   Abia
000001   Abiathar
000032   abide
000020   abideth
000004   abiding
000001   Abilene
000003   ability
000002   Abiud
000061   able
000001   aboard
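
For comparison, here is a minimal Python sketch of counting duplicate lines without pre-sorting the input and then emitting the results in sorted order. It is not TextPipe's implementation: the function name `count_duplicate_lines`, the `case_sensitive_sort` flag, and the six-digit, three-space output layout (copied from the listing above) are assumptions made for illustration.

```python
from collections import Counter


def count_duplicate_lines(lines, case_sensitive_sort=False):
    """Count identical lines, then return 'NNNNNN   line' rows in sorted order.

    The input does not need to be sorted first; Counter tallies lines in
    whatever order they arrive.
    """
    counts = Counter(lines)
    if case_sensitive_sort:
        # Plain code-point order: every capital letter sorts before lowercase.
        key = lambda s: s
    else:
        # Case-insensitive order; the original string breaks ties.
        key = lambda s: (s.lower(), s)
    return [f"{counts[line]:06d}   {line}" for line in sorted(counts, key=key)]


words = ["abased", "Aaron", "a", "Abba", "a", "Aaron's", "Aaron"]

for row in count_duplicate_lines(words, case_sensitive_sort=False):
    print(row)
print("---")
for row in count_duplicate_lines(words, case_sensitive_sort=True):
    print(row)
```

With `case_sensitive_sort=True` the capitalised words all come before `a`; with `False`, `a` comes first and `abased` falls between `Aaron's` and `Abba`. Comparing both against a real output file shows which collation it follows.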

DataMystic Support
Site Admin
Posts: 2164
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Count Duplicate Lines filter

Post by DataMystic Support » Wed Feb 24, 2016 8:58 am

Just found this. The ANSI change was made some time ago.
Regards,

Simon Carter, http://DataMystic.com/forums/index.php

