APPENDIX C - MD5 Hash Generation
During import, every original file is assigned an MD5 Hash value, which Discovery Manager uses to identify duplicates. The following table describes the data used to generate the MD5 Hash for each Document Type. In addition to the email metadata properties listed in the table, the following normalization is applied when creating an email MD5 Hash:
Milliseconds are removed from all time values.
Recipients are sorted by email address alphanumerically.
Display Names are not used.
Attachments are sorted by filename alphanumerically.
All whitespace, hard line returns, and non-alphanumeric characters are removed from the email body, leaving only letters and numbers.
Whitespace, hard line returns, and non-alphanumeric characters are not removed from the email subject.
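The normalization rules above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the dictionary keys, the `fields` parameter, the `|` and `;` separators, the ISO date format, and the UTF-8 encoding are all assumptions made for the example, and the sort is a plain case-sensitive string sort because the exact collation is not documented here.

```python
import hashlib
from datetime import datetime

# Hypothetical field names for this sketch; they are not product identifiers.
DEFAULT_FIELDS = ("date_sent", "sender", "recipients", "subject", "body", "attachments")


def normalize_body(body: str) -> str:
    """Keep only characters for which str.isalnum() is true (letters and digits)."""
    return "".join(ch for ch in body if ch.isalnum())


def email_md5(email: dict, fields=DEFAULT_FIELDS) -> str:
    """Hash the normalized values of the selected fields (assumed layout)."""
    normalized = {
        # Drop the sub-second part of the timestamp.
        "date_sent": email["date_sent"].replace(microsecond=0).isoformat(),
        # Addresses only; display names are never part of the input.
        "sender": email["sender"],
        "recipients": ";".join(sorted(email["recipients"])),
        # The subject is used exactly as stored.
        "subject": email["subject"],
        "body": normalize_body(email["body"]),
        # (filename, size) pairs, ordered by filename first.
        "attachments": ";".join(
            f"{name}:{size}" for name, size in sorted(email["attachments"])
        ),
    }
    payload = "|".join(normalized[name] for name in fields)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


original = {
    "date_sent": datetime(2020, 1, 2, 3, 4, 5, 123000),
    "sender": "alice@example.com",
    "recipients": ["carol@example.com", "bob@example.com"],
    "subject": "Quarterly report",
    "body": "Hello, world!\r\n",
    "attachments": [("z.txt", 10), ("a.txt", 5)],
}
copy = dict(
    original,
    date_sent=datetime(2020, 1, 2, 3, 4, 5, 999000),
    recipients=["bob@example.com", "carol@example.com"],
    body="Helloworld",
)
print(email_md5(original) == email_md5(copy))  # True
```

Passing a shorter `fields` tuple (for example, one without `"subject"`) makes messages that differ only in the omitted field hash to the same value, which is the trade-off between finding more duplicates and producing more false positives.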
Document Type | Values Used To Generate MD5 Hash |
---|---|
Efiles (Including Efile Attachments) | Generated on the bit stream of the file |
Outlook Items¹ | Date Sent, Sender Email Address, Recipient Email Addresses, Subject, Body, Attachment Names, Attachment Size |
Lotus Notes Items | |
Memo, Reply, Notice | From, DateSent, SendTo, CopyTo, BlindCopyTo, Attachment Name ($FILE), Subject, Body |
Appointment | Subject, Chair, STARTDATETIME, Location, EndDateTime, RequiredAttendees, RepeatDates, OptionalAttendees, FYIAttendees, Attachment Name ($FILE), Body |
Task | Subject, DateSent, STARTDATETIME, DueDateTime, Principal, AssignedTo, OptionalAssignedTo, FYIAssignedTo, Body, Attachment Name ($FILE) |
Non Delivery Report | Subject, IntendedRecipient, FailureReason, From, DateSent, SendTo, CopyTo, BlindCopyTo, Attachment Name ($FILE), OriginalSubject, Body |
Delivery Report, Return Receipt | DateSent, Subject, IntendedRecipient, From, Attachment Name ($FILE), OriginalSubject, Body, SendTo, CopyTo, BlindCopyTo |
Unrecognized Forms | All properties except UNID |
Note
¹ The fields listed for Outlook Items can be adjusted within Project Settings, shown below. Removing fields identifies more duplicates, but it also produces more false positives.
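For efiles, the table states that the hash is generated on the bit stream of the file, so the file name, path, and timestamps play no part. A minimal sketch of that idea in Python follows; the function name `file_md5` and the chunk size are assumptions for illustration, not part of Discovery Manager.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical content produce the same digest under this approach even if they have different names, while a single changed byte produces a different digest.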