ID unique/duplicate lines

Get help with installation and running here.

Moderators: DataMystic Support, Moderators

alnico
Posts: 74
Joined: Fri Oct 12, 2007 11:57 pm

ID unique/duplicate lines

Postby alnico » Fri Apr 15, 2011 2:05 am

I am ripping data to create a relational database.

When I rip a column/field from my source file…it needs to be de-duplicated and then an ID assigned (add line number)…pretty simple.
I also, must create a replace list that will find the string in the source file and replace it with the ID. (running a total of two fll in succession).
Now this gets a little complex when fields overlap and share many ID’s.

Anyway, I am looking for a way to simplify this whole process. Here is what I am thinking…

A new filter ('ID lines') based on the ‘Count duplicate lines’ filter.
The filter would simply find all unique/duplicate lines and assign a number (ID)…it would only increment the ID number for each unique match. In the Output format field, replace %d with ‘The ID’)
Although this ID is just a number, it would be easy to wrap it for identification for later processing, i.e. FName=%d\t%s\t/Fname

Here is an example…the first three ‘ID lines’ process single fields…the fourth processes two fields:

Input (tab delimited for clarification):

Code: Select all

Joe   Jones   Germany
Joe   Jones   Australia
Joe   Johnson   USA
Tim   Johnson   USA
Tim   Jones   Germany
Fred   Smith


Output after ‘ID lines’ first field:

Code: Select all

FName=1   Joe   /Fname   Jones   Germany
FName=1   Joe   /Fname   Jones   Australia
FName=1   Joe   /Fname   Johnson   USA
FName=2   Tim   /Fname   Johnson   USA
FName=2   Tim   /Fname   Jones   Germany
FName=3   Fred   /Fname   Smith


Output after ‘ID lines’ second field:

Code: Select all

FName=1   Joe   /Fname   LName=1   Jones   /Lname   Germany
FName=1   Joe   /Fname   LName=1   Jones   /Lname   Australia
FName=1   Joe   /Fname   LName=2   Johnson   /Lname   USA
FName=2   Tim   /Fname   LName=2   Johnson   /Lname   USA
FName=2   Tim   /Fname   LName=1   Jones   /Lname   Germany
FName=3   Fred   /Fname   LName=3   Smith   /Lname


Output after ‘ID lines’ third field:

Code: Select all

FName=1   Joe   /Fname   LName=1   Jones   /Lname   Country=1   Germany   /Country
FName=1   Joe   /Fname   LName=1   Jones   /Lname   Country=2   Australia   /Country
FName=1   Joe   /Fname   LName=2   Johnson   /Lname   Country=3   USA   /Country
FName=2   Tim   /Fname   LName=2   Johnson   /Lname   Country=3   USA   /Country
FName=2   Tim   /Fname   LName=1   Jones   /Lname   Country=1   Germany   /Country
FName=3   Fred   /Fname   LName=3   Smith   /Lname


Output after ‘ID lines’ first and second field:

Code: Select all

FLName=1   FName=1   Joe   /Fname   LName=1   Jones   /Lname   /FLName   Country=1   Germany   /Country
FLName=1   FName=1   Joe   /Fname   LName=1   Jones   /Lname   /FLName   Country=2   Australia   /Country
FLName=2   FName=1   Joe   /Fname   LName=2   Johnson   /Lname   /FLName   Country=3   USA   /Country
FLName=3   FName=2   Tim   /Fname   LName=2   Johnson   /Lname   /FLName   Country=3   USA   /Country
FLName=4   FName=2   Tim   /Fname   LName=1   Jones   /Lname   /FLName   Country=1   Germany   /Country
FLName=5   FName=3   Fred   /Fname   LName=3   Smith   /Lname   /FLName



I would then rip all fields and create individual output for database import which would look like this:

TableName=FirstNames

Code: Select all

ID   FName
1   Joe
2   Tim
3   Fred


TableName=LastNames

Code: Select all

ID   LName
1   Jones
2   Johnson
3   Smith


TableName=Country

Code: Select all

ID   Country
1   Germany
2   Australia
3   USA


TableName=FirstLastName (shows relationships to tables: FirstNames and LastNames)

Code: Select all

ID   FNameID   LNameID
1   1   1
2   1   2
3   2   2
4   2   1
5   3   3


This may look complicated...but I don't think it is...padding the fields is the biggest issue.


Regards,
Brent

On another related note; it would be sweet if the ‘Count/Remove duplicate lines’ and similar filters could sit as a subfilter of the ‘Restrict to delimited fields’…then we would not have to worry about padding, etc. (maybe this could happen with current filters; Logic…‘IF Count/Remove duplicate lines and similar filters' IS a subfilter of ‘Restrict to delimited fields’ gray out/ignore Start column and Length). Just an idea, I have no idea of other implications ;-)

Return to “TextPipe Tips and Tricks, Questions and Support”

Who is online

Users browsing this forum: No registered users and 1 guest