Deduplication is broken into three processes:
Import, Find Matches and Output. The process that is running is shown in the
Step Type in the Statistics for this Step (lower left) section of the dedupeIT
window. The Import and Output processes should run at a fairly even rate - the
progress information in this section of the window shows how far each step has
still to run. The Find Matches process can run at an uneven rate, depending on which step is being processed.
If there is no sign of activity (e.g. hard disk light flashing)
during the Import or Output process, dedupeIT has probably hung and you will
need to End Task (via Ctrl+Alt+Del) and try again. If necessary, reboot the PC and try again.
If there is no sign of activity during the Find Matches process,
it may simply be that your file has a lot of incomplete data (e.g. empty names),
which can cause dedupeIT to slow down significantly. Try leaving dedupeIT
running for another 5 minutes or so, and if there is still no evidence of
progress, End Task as described above.
dedupeIT is limited to files of 50,000 records or fewer, because
larger files tend to require more flexibility for matching options than
dedupeIT is designed to allow. dedupeIT displays this message when its approximate
calculations suggest that the file may exceed this limit. However, the import of the file will
continue, so that dedupeIT can check whether the actual number of records is in
fact greater than 50,000. If it is greater than 50,000 records, the job will
stop and you cannot process this file in dedupeIT. You can buy a more
functional product from the helpIT systems' range, matchIT, to dedupe this
file. For more information, visit www.helpit.com.
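The estimate-then-check behaviour described above can be pictured with a short Python sketch. This is an illustration only, not dedupeIT's actual method: it assumes a plain text file with one record per line, and the function names and the sampling approach (file size divided by the average length of the first few lines) are hypothetical.

```python
import os

RECORD_LIMIT = 50_000  # limit stated in the text above


def estimate_record_count(path, sample_lines=1000):
    """Rough estimate: file size divided by the average sampled line length.

    Assumed method for illustration; not taken from dedupeIT itself.
    """
    total_size = os.path.getsize(path)
    sampled_bytes = 0
    sampled = 0
    with open(path, "rb") as handle:
        for line in handle:
            sampled_bytes += len(line)
            sampled += 1
            if sampled >= sample_lines:
                break
    if sampled == 0:
        return 0  # empty file
    return round(total_size / (sampled_bytes / sampled))


def count_records(path):
    """Exact count, which requires reading the whole file (the 'import')."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_file(path):
    """Warn on the estimate, but only refuse on the exact count."""
    warned = estimate_record_count(path) > RECORD_LIMIT
    too_large = count_records(path) > RECORD_LIMIT
    return warned, too_large
```

The point of the sketch is that the warning can be wrong in either direction, because the estimate depends on how typical the sampled lines are. Only the full count decides whether the job stops.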
Refer to the answer to the previous question.
When dedupeIT is processing your data, you will see some bar
graphs appear on the screen, which show the progress of the job and various
statistics about your file and the data within it. These are explained fully in
the on-line Help.
If you are working on large files, or the files have a really
high level of duplication in them, then you may see that there are a "number of
groups greater than 500". The reason for this is as follows:
To avoid grinding to a halt (or appearing to "hang"), dedupeIT
will not compare every single record for a given match key with every other
record for that match key if the group of records with that match key value
(e.g. a specific postcode) exceeds 500 records. Generally, this will not affect
the quality of your results, because any duplicates missed on one pass of
matching will be found on another. If you find that the number of groups
reported here is very high, or you are concerned about it, then you can do
another dedupe run on the output file from dedupeIT. This will tell you if
dedupeIT missed any dupes the first time around and should find them for you,
unless the data has an excessive number of records with the same match key value (e.g. the same postcode).
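The idea of grouping records by match key and limiting the size of each group can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not dedupeIT's implementation: the function name `candidate_pairs`, the record layout and the choice to skip an oversized group entirely are all hypothetical. The text above only says that such a group is not fully compared.

```python
from collections import defaultdict
from itertools import combinations

MAX_GROUP_SIZE = 500  # group limit quoted in the text above


def candidate_pairs(records, key_func, max_group_size=MAX_GROUP_SIZE):
    """Return (pairs, oversized_keys) for one matching pass.

    Records sharing a match key value are compared pairwise, unless the
    group is larger than max_group_size, in which case it is skipped
    (an assumption made for this sketch) and its key is reported.
    """
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[key_func(record)].append(index)

    pairs = []
    oversized = []
    for key, members in groups.items():
        if len(members) > max_group_size:
            oversized.append(key)
            continue
        pairs.extend(combinations(members, 2))
    return pairs, oversized


# Tiny example with the limit lowered to 2 so the effect is visible.
records = [
    {"name": "ann smith", "postcode": "AB1 2CD"},
    {"name": "ann smith", "postcode": "AB1 2CD"},
    {"name": "bob jones", "postcode": "AB1 2CD"},
    {"name": "cat brown", "postcode": "ZZ9 9ZZ"},
]

# Pass 1: the postcode group has 3 records, so it is skipped.
by_postcode = candidate_pairs(records, lambda r: r["postcode"], max_group_size=2)

# Pass 2: a name key still brings the two "ann smith" records together.
by_name = candidate_pairs(records, lambda r: r["name"], max_group_size=2)
```

In this example the postcode pass produces no pairs and reports one oversized group, while the name pass pairs records 0 and 1. This mirrors the point made above that duplicates missed on one pass of matching will generally be found on another.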