Remove duplicate content inside unique data blocks


alnico
Posts: 74
Joined: Fri Oct 12, 2007 11:57 pm

Post by alnico » Sat Sep 19, 2009 1:22 am

Hi,

How can I remove duplicate tables where the structure of the table and the content of each cell are the same, while ignoring any attributes, id numbers, etc. within the tags?

Here are five tables (Input below). Four have the same content (they are not adjacent), and I want to remove all but one of them. However, one of the four has a <div> tag that makes its structure non-identical, so only three are identical in both content AND structure.
I would like to retain the table order after the duplicates are removed, keeping the first of each set of duplicates (Output below).
Note: I need to retain one of the duplicate tables with all its tag attributes. Otherwise I could remove the attributes, put each table on a single line, sort, and remove duplicates. Maybe there is a way to capture the attributes and re-insert them at the end?

Any ideas on how to accomplish this?

Thanks,
Brent

Input:

<table>
   <tr id="1">
      <td id="1">
         <content>XXX</content>
      </td>
   </tr>
   <tr id="2">
      <td id="1">
         <content>XXX</content>
      </td>
   </tr>
</table>

<table>
   <tr id="3">
      <td id="2">
         <content>XXX</content>
      </td>
   </tr>
   <tr id="4">
      <td id="2">
         <content>XXX</content>
      </td>
   </tr>
</table>

<table>
   <tr id="5">
      <td id="3">
         <content>X</content>
      </td>
   </tr>
   <tr id="6">
      <td id="3">
         <content>X</content>
      </td>
   </tr>
</table>

<table>
   <tr id="7">
      <td id="4">
         <content>XXX</content>
      </td>
   </tr>
   <tr id="8">
      <td id="4">
         <content>XXX</content>
      </td>
   </tr>
</table>

<table>
   <div>
      <tr id="9">
         <td id="5">
            <content>XXX</content>
         </td>
      </tr>
   </div>
   <tr id="10">
      <td id="5">
         <content>XXX</content>
      </td>
   </tr>
</table>


Output:

<table>
   <tr id="1">
      <td id="1">
         <content>XXX</content>
      </td>
   </tr>
   <tr id="2">
      <td id="1">
         <content>XXX</content>
      </td>
   </tr>
</table>

<table>
   <tr id="5">
      <td id="3">
         <content>X</content>
      </td>
   </tr>
   <tr id="6">
      <td id="3">
         <content>X</content>
      </td>
   </tr>
</table>

<table>
   <div>
      <tr id="9">
         <td id="5">
            <content>XXX</content>
         </td>
      </tr>
   </div>
   <tr id="10">
      <td id="5">
         <content>XXX</content>
      </td>
   </tr>
</table>

Re: Remove duplicate content inside unique data blocks

Post by alnico » Thu Sep 24, 2009 1:12 am

I have figured out a way to do this...

Put each table on a single line
Add a line number to each table for sorting and identification
Duplicate each table and tag one of the copies
Remove the non-comparable text (attributes, id numbers) from one copy of each table
Sort and remove duplicates
Find and extract the matches, keeping the original table
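
For anyone without TextPipe, the same idea can be roughly sketched in Python. This is not the attached filter, only an illustration of the approach: build an attribute-free copy of each table for comparison, and keep the first original whose copy has not been seen yet. The names `comparison_key` and `remove_duplicate_tables` are made up for this example, and it assumes tables are not nested and that attribute values never contain a `>` character.

```python
import re

# Matches one whole table, including line breaks (assumes no nested tables).
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)
# Matches any tag and captures only its name (with a leading "/" if closing).
TAG_RE = re.compile(r"<(/?\w+)[^>]*>")


def comparison_key(table: str) -> str:
    """Return the table with attributes and inter-tag whitespace removed."""
    no_attrs = TAG_RE.sub(r"<\1>", table)              # <tr id="1"> becomes <tr>
    return re.sub(r">\s+<", "><", no_attrs).strip()    # put table on one line


def remove_duplicate_tables(text: str) -> str:
    """Keep the first table of each group that shares a comparison key."""
    seen = set()

    def keep_first(match: re.Match) -> str:
        original = match.group(0)
        key = comparison_key(original)
        if key in seen:
            return ""          # later duplicate: drop it
        seen.add(key)
        return original        # first occurrence: keep it with its attributes

    deduped = TABLE_RE.sub(keep_first, text)
    # Tidy up the runs of blank lines left where tables were removed.
    return re.sub(r"\n{3,}", "\n\n", deduped)
```

Because the tables are processed in document order and only the comparison copy is stripped, no sorting or line numbering is needed here: the table order is preserved and the surviving table keeps all its attributes. A table with an extra `<div>` produces a different key, so it is treated as unique.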

Filter attached for anybody to use.

Brent
Attachments
Unique tables-remove content duplicates.zip (988 Bytes)

DataMystic Support
Site Admin
Posts: 2138
Joined: Mon Jun 30, 2003 12:32 pm
Location: Melbourne, Australia

Re: Remove duplicate content inside unique data blocks

Post by DataMystic Support » Thu Sep 24, 2009 6:21 am

Thanks Brent - scary what you can achieve!
Regards,

Simon Carter, http://DataMystic.com/forums/index.php

