APPENDIX F - dtSearch Syntax Guide
Noise Words | | | | | | | |
---|---|---|---|---|---|---|---|
a | being | furthermore | in | must | she | they | when |
about | between | get | indeed | my | should | this | where |
after | both | got | into | never | since | those | which |
all | but | had | is | not | some | through | while |
also | by | has | it | now | still | thus | who |
an | came | have | its | of | such | to | will |
and | can | he | just | on | take | too | with |
another | come | her | like | only | than | under | would |
any | could | here | made | or | that | up | you |
are | did | hi | many | other | the | very | your |
as | do | him | me | our | their | was | |
at | each | himself | might | out | them | way | |
be | even | how | more | over | then | we | |
because | for | however | moreover | said | there | well | |
been | from | I | most | same | therefore | were | |
before | further | if | much | see | these | what | |
Usage:
If a phrase contains a noise word, dtSearch will skip over the noise word when searching for it.
Example:
"statue of liberty"
This example would retrieve any file containing the word statue, any intervening noise word, and the word liberty.
For more accurate searching, the noise word ‘of’ could be removed from the Stop Word List.
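The noise word behaviour described above can be sketched in a few lines of Python. This is only an illustration of the matching rule, not dtSearch's implementation: the function name `phrase_matches` is hypothetical, `NOISE_WORDS` is a small subset of the list above, and the rule that a noise word in the phrase matches any noise word in the text is an assumption taken from the example.

```python
import re

# Illustrative subset of the noise word list above, not the full list.
NOISE_WORDS = {"a", "an", "and", "of", "the", "to"}

def phrase_matches(phrase: str, text: str) -> bool:
    """Return True if the phrase occurs in the text, skipping over noise words.

    A noise word in the phrase is allowed to match any noise word in the text.
    """
    query = phrase.lower().split()
    words = re.findall(r"[a-z0-9]+", text.lower())
    for start in range(len(words) - len(query) + 1):
        window = words[start:start + len(query)]
        if all(q == w or (q in NOISE_WORDS and w in NOISE_WORDS)
               for q, w in zip(query, window)):
            return True
    return False

print(phrase_matches("statue of liberty", "A statue and liberty for all"))
```

Under these assumptions, a search for "statue of liberty" also hits "statue and liberty", which is why removing ‘of’ from the list gives more accurate results.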
Important
When building or rebuilding an Index, the old Indexes must first be deleted.
Phrases and Words
Quotation Marks
Usage:
Quotation Marks should be used around a phrase to ensure that connector words are interpreted as part of the phrase.
Example:
"clear and present danger"
Without the quotation marks, clear and present danger would be interpreted as a Boolean search for "clear" and "present danger".
Punctuation
Usage:
Punctuation inside of a search word is treated as a space.
Examples:
can't
dtSearch would interpret this as a phrase consisting of two words: can and t.
1843(c)(8)(ii)
dtSearch would interpret this as four words: 1843 c 8 ii.
To customize the way dtSearch handles punctuation in text, edit the Advanced Options – Alphabet File in the Project Settings.
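As a rough illustration of the default behaviour, the following Python sketch splits a search word wherever punctuation occurs. The function name `split_search_word` is hypothetical, and treating every non-alphanumeric ASCII character as a space is an assumption; the actual behaviour depends on the Alphabet File.

```python
import re

def split_search_word(term: str) -> list[str]:
    """Split a term into words, treating every non-alphanumeric character as a space."""
    return re.findall(r"[A-Za-z0-9]+", term)

print(split_search_word("can't"))
print(split_search_word("1843(c)(8)(ii)"))
```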
Special Characters
Character | Use |
---|---|
? | matches any single character |
* | matches any number of characters |
# | phonic search |
~ | stemming search |
% | fuzzy search |
~~ | numeric range search |
: | variable term weighting |
## | regular expression |
Wildcard Searches
= Wildcard Search
Usage:
The ‘=’ wildcard matches any single digit.
Example:
NUM===
This would retrieve any files that contain NUM followed by any three digits, e.g. NUM123 or NUM321.
"330 == ===="
This would look for any Social Security number that starts with "330".
? Wildcard Search
Usage:
The ‘?’ wildcard matches any single character.
Example:
appl?
This search would retrieve any files that match apple or apply, but not apples or application.
* Wildcard Search
Usage:
The ‘*’ wildcard matches any number of characters.
Examples:
appl*
This search would retrieve any files that matched apple, apply, apples, application, applications etc.
*cipl*
This search would retrieve any files that match principle, principles, discipline, etc.
ap*ed
This search would retrieve any files that match applied, approved, etc.
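One way to reason about the three wildcards is to translate them into regular expressions. The Python sketch below does this for a single word; `wildcard_to_regex` and `word_matches` are hypothetical helper names, and the choice of `\w` for "any character" is an assumption, since the characters that count as part of a word depend on the Alphabet File.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate the =, ? and * wildcards into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "=":
            parts.append(r"\d")      # exactly one digit
        elif ch == "?":
            parts.append(r"\w")      # exactly one word character
        elif ch == "*":
            parts.append(r"\w*")     # zero or more word characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def word_matches(pattern: str, word: str) -> bool:
    """Return True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

for pattern, word in [("appl?", "apply"), ("appl?", "apples"),
                      ("ap*ed", "approved"), ("NUM===", "NUM123")]:
    print(pattern, word, word_matches(pattern, word))
```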
Stemming Searches
~ Stemming Search
Usage:
Stemming extends a search to cover grammatical variations on a word.
Example:
priv~
This would retrieve any files that contained privilege or privileged, but not Privileged.doc.
Fuzzy Searches
% Fuzzy Search
Usage:
Fuzzy searching will find a word even if it is misspelled. Fuzzy searching can be useful when you are searching text that may contain typographical errors, or for text that has been scanned using optical character recognition (OCR). The number of % characters you add determines the number of differences dtSearch will ignore when searching for a word. The position of the % characters determines how many letters at the start of the word have to match exactly.
Examples:
ba%nana
This search would retrieve any files where a word within the file begins with ba and has at most one difference between it and banana.
ba%%nana
This search would retrieve any files where a word within the file begins with ba and has at most two differences between it and banana.
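The two rules (exact prefix, limited number of differences) can be sketched as follows. This sketch assumes that a "difference" is one insertion, deletion, or substitution (Levenshtein distance) and that the % characters appear together in one place; the guide does not state how dtSearch actually measures differences, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each count as 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def fuzzy_matches(request: str, word: str) -> bool:
    """Check a word against a request such as 'ba%nana' or 'ba%%nana'."""
    prefix_len = request.index("%")      # letters before the first % must match exactly
    max_diff = request.count("%")        # number of % characters = allowed differences
    target = request.replace("%", "").lower()
    word = word.lower()
    return (word[:prefix_len] == target[:prefix_len]
            and edit_distance(word, target) <= max_diff)

print(fuzzy_matches("ba%nana", "bannana"))
print(fuzzy_matches("ba%nana", "bonana"))
```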
Numeric Range Searching
~~ Numeric Range Search
Usage:
Use ‘~~’ between two numbers to search for a numeric range within files.
Example:
apple and 12~~17
This search would retrieve any files that had the word apple and a number between 12 and 17.
Numeric Range Notes:
A numeric range search includes the upper and lower bounds (so 12 and 17 would be retrieved in the above example). Numeric range searches work only with positive integers. For purposes of numeric range searching, decimal points and commas are treated as spaces, and minus signs are ignored. For example, 123,456.78 would be interpreted as: 123 456 78 (three numbers).
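The notes above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: every run of digits is treated as a separate positive integer, which reproduces the handling of commas, decimal points, and minus signs described above; `numbers_in` and `in_numeric_range` are hypothetical names.

```python
import re

def numbers_in(text: str) -> list[int]:
    """Extract positive integers; commas and decimal points split numbers, minus signs are dropped."""
    return [int(token) for token in re.findall(r"\d+", text)]

def in_numeric_range(text: str, low: int, high: int) -> bool:
    """Return True if any number in the text lies in the inclusive range low..high."""
    return any(low <= n <= high for n in numbers_in(text))

print(numbers_in("123,456.78"))
print(in_numeric_range("apple 17", 12, 17))
```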
Making Special Characters Searchable
Usage:
If one of the dtSearch special characters is part of the search terms for a project, the dtSearch Index needs to be manipulated to make these characters searchable.
Steps To Making the Special Characters Searchable
To make the special characters searchable, do the following when creating a project. If the project has been created, click the Project Settings button, choose Indexing Settings, choose Advanced Options – Alphabet File, follow the steps below and reindex all imports within the project.
If the character is a dtSearch Special Character it needs to be replaced with another character. In the screen shot above the '^' has been used in place of the '&' character. If the character that needs indexing is not a dtSearch Special Character, skip this step.
The below screen shots show how to make the '&' character searchable.
Add the character under the letter 'Z' in the [Letters] portion of the project Alphabet File. Only characters found under the heading [Letters] will be searchable. When adding the character under the letter 'Z', the following should be done in the exact order described:
Make a new line under the letter 'Z' in the [Letters] portion of the Project Alphabet File.
The character must be written 4 times, with a leading space and a space between each character. If the leading space is not added in front of the character, this will not work. Note that this must be done twice, once for each set of the letter ‘Z’.
The below screenshots display how to make the '&' character searchable.
Hyphens
The three categories for hyphen handling are listed below.
3 - Treat hyphens as spaces (index “first-class” as “first” and “class”). This is the default.
2 - Treat hyphens as searchable (index “first-class” as “first-class”).
1 - Ignore hyphens (index “first-class” as “firstclass”).
Steps To Changing the Hyphen Handling
To change the hyphen handling, change the HyphenValue = 3 to HyphenValue = 1 or HyphenValue = 2 depending on how you would like the hyphens handled within the project.
This should be done when creating a project. If the project has been created, click the Project Settings button, choose Indexing Settings, choose Advanced Options – Alphabet File, change the HyphenValue to the desired setting, and reindex all imports within the project.
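The effect of the three hyphen settings on a single word can be sketched in Python. The function `index_hyphenated` is a hypothetical illustration of the outcome only; it does not read or write the Alphabet File.

```python
def index_hyphenated(word: str, hyphen_value: int) -> list[str]:
    """Return the word(s) that would be indexed for a given hyphen setting."""
    if hyphen_value == 3:    # treat hyphens as spaces (default)
        return word.split("-")
    if hyphen_value == 2:    # treat hyphens as searchable
        return [word]
    if hyphen_value == 1:    # ignore hyphens
        return [word.replace("-", "")]
    raise ValueError("hyphen_value must be 1, 2 or 3")

for value in (3, 2, 1):
    print(value, index_hyphenated("first-class", value))
```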
Spaces
A character that causes a word break. For example, if you classify the period (".") as a space character, then dtSearch would process U.S.A. as three separate words: U, S and A.
Ignore
A character that is disregarded in processing text. For example, if you classify the period as ignore instead of space then dtSearch would process U.S.A. as one word: USA.
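The difference between the two classifications can be shown with a small Python sketch. The `tokenize` function and its parameters are hypothetical and only model the outcome described above.

```python
def tokenize(text: str, space_chars: str = "", ignore_chars: str = "") -> list[str]:
    """Split text into words: space characters break words, ignore characters are dropped."""
    table = {ord(c): " " for c in space_chars}
    table.update({ord(c): None for c in ignore_chars})
    return text.translate(table).split()

print(tokenize("U.S.A.", space_chars="."))    # period classified as a space
print(tokenize("U.S.A.", ignore_chars="."))   # period classified as ignore
```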
CJK Word Breaking
Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words. Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word.
To make this type of text searchable, you can enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters. With this option enabled, each character will be treated as a single word.
The CJKRanges setting specifies which Unicode character ranges should have this treatment.
AdditionalLetters
The AdditionalLetters section is for adding characters that don't exist within the ASCII code range. This is done by listing the Unicode code point (in hexadecimal) for each character you want searchable. The below example makes all of the Unicode currency characters such as the Euro, Pound, and Lira searchable characters.
Example:
AdditionalLetters = 00a2 00a3 00a4 00a5 20a0 20a1 20a2 20a3 20a4 20a5 20a6 20a7 20a8 20a9 20aa 20ab 20ac
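To check which characters a list of code points refers to, the values can be decoded with a few lines of Python. This is a convenience sketch for verifying a setting before reindexing; `decode_letters` is a hypothetical helper and is not part of the product.

```python
import unicodedata

ADDITIONAL_LETTERS = ("00a2 00a3 00a4 00a5 20a0 20a1 20a2 20a3 20a4 "
                      "20a5 20a6 20a7 20a8 20a9 20aa 20ab 20ac")

def decode_letters(setting: str) -> list[str]:
    """Convert space-separated hexadecimal code points into characters."""
    return [chr(int(code, 16)) for code in setting.split()]

for ch in decode_letters(ADDITIONAL_LETTERS):
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch, 'UNNAMED')}")
```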
Boolean Search Requests
A Boolean search request consists of a group of words, phrases, or macros linked by connectors such as ‘AND’, ‘OR’, ‘NOT’ that indicate the relationship between them. Boolean connectors are not case sensitive so they can be written as ‘AND’ or ‘and’.
Search Request | Explanation |
---|---|
apple and pear | Both words must be present. |
apple or pear | Either word can be present. |
apple w/5 pear | Apple must occur within 5 words of pear. |
apple pre/5 pear | Apple must occur before pear and within 5 words of it. |
apple not w/5 pear | Apple must not occur within 5 words of pear. |
apple and not pear | Apple must be present and pear must not be present. |
apple or not pear | Apple must be present or pear must not be present. |
name contains smith | The field name must contain smith. |
apple w/5 xfirstword | Apple must occur in the first five words of the file. |
apple w/5 xlastword | Apple must occur in the last five words of the file. |
AND Connector
Usage:
Use the ‘AND’ connector in a search request to connect two expressions, both of which must be found in any file retrieved.
Examples:
apple pie and poached pear
This search would return any file that contained both phrases.
(apple or banana) and (pear w/5 grape)
This search would retrieve any file that (1) contained either apple ‘OR’ banana, ‘AND’ (2) contained pear within 5 words of grape.
OR Connector
Usage:
Use the ‘OR’ connector in a search request to connect two expressions, at least one of which must be found in any file retrieved.
Examples:
"apple pie" or "poached pear"
This search would retrieve any file that contained apple pie, poached pear, or both.
NOT Connector
Usage:
NOT standing alone can be used only at the start of a search request. If the NOT connector is not the first connector in a request, you need to use either AND NOT or OR NOT.
Example:
not pear
This search would retrieve all files that did not contain pear.
AND NOT Connector
Usage:
Use ‘AND NOT’ in front of any search expression to reverse its meaning. This allows you to exclude files from a search.
Examples:
"apple sauce" and not pear
This search would retrieve all files that contained apple sauce but did not contain pear.
W/N Connector
Usage:
Use the W/N connector in a search request to specify that one word or phrase must occur within N words of the other.
Examples:
apple w/5 pear would retrieve any file that contained apple within 5 words of pear. The following are examples of search requests using W/N:
(apple or pear) w/5 banana
(apple w/5 banana) w/10 pear
(apple and banana) w/10 pear
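For single words, the proximity test behind W/N (and the ordered variant pre/N from the table above) can be sketched in Python. The sketch assumes that "within N words" means the word positions differ by at most N; the exact counting convention used by dtSearch is not stated in this guide, and the function names are hypothetical.

```python
import re

def positions(words: list[str], term: str) -> list[int]:
    """Return the index of every occurrence of term in the word list."""
    return [i for i, w in enumerate(words) if w == term]

def within(text: str, a: str, b: str, n: int) -> bool:
    """a w/n b: some occurrence of a is at most n words away from some occurrence of b."""
    words = re.findall(r"\w+", text.lower())
    return any(abs(i - j) <= n
               for i in positions(words, a) for j in positions(words, b))

def pre(text: str, a: str, b: str, n: int) -> bool:
    """a pre/n b: some occurrence of a comes before b, at most n words earlier."""
    words = re.findall(r"\w+", text.lower())
    return any(0 < j - i <= n
               for i in positions(words, a) for j in positions(words, b))

print(within("apple one two three pear", "apple", "pear", 5))
print(pre("pear then apple", "apple", "pear", 5))
```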
W/N Connector Syntax Notes:
Incorrect W/N Syntax:
Some types of complex expressions using the W/N connector will produce ambiguous results and should not be used. The following are examples of ambiguous search requests:
Incorrect Syntax Examples:
(apple and banana) w/10 (pear and grape)
(apple w/10 banana) w/10 (pear and grape)
Correct W/N Syntax:
In general, at least one of the two expressions connected by W/N must be a single word or phrase or a group of words and phrases connected by OR. Below are the corrected examples of the search requests:
Corrected Syntax Examples:
(apple and banana) w/10 (pear or grape)
(apple and banana) w/10 "orange tree"
NOT W/N Connector
Usage:
The NOT W/ ("not within") operator allows you to search for a word or phrase not in association with another word or phrase.
Example:
apple not w/20 pear
This would search for files that contain the word apple, excluding occurrences where apple is within 20 words of pear.
pear not w/20 apple
This would search for files that contain the word pear, excluding occurrences where pear is within 20 words of apple.
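The NOT W/N test can be sketched in Python as follows. The sketch assumes that a file is retrieved when at least one occurrence of the first word has no occurrence of the second word within N positions; `not_within` is a hypothetical name, and the distance convention is an assumption.

```python
import re

def not_within(text: str, a: str, b: str, n: int) -> bool:
    """a not w/n b: some occurrence of a has no occurrence of b within n words."""
    words = re.findall(r"\w+", text.lower())
    a_pos = [i for i, w in enumerate(words) if w == a]
    b_pos = [i for i, w in enumerate(words) if w == b]
    return any(all(abs(i - j) > n for j in b_pos) for i in a_pos)

print(not_within("apple pear", "apple", "pear", 20))
print(not_within("apple pie", "apple", "pear", 20))
```

Under this reading the operator is not symmetrical: swapping the two words changes which occurrences are tested.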
Special W/N Connector xfirstword/xlastword
Usage:
dtSearch uses two built-in search words to mark the beginning and end of a file.
Examples:
apple w/10 xfirstword
This search would retrieve any files where apple was within 10 words of the beginning of the file.
apple w/10 xlastword
This search would retrieve any files where apple was within 10 words of the end of the file.
Fielded Searching
If the Project Settings Index Project for FullText Searching and Index Senders/Recipients in Fields are both enabled, the 6 fields listed below will automatically be added and searchable in the dtSearch Index.
Usage:
The (field contains (term)) syntax allows a user to search a specific field in the file. Without specifying the field, searches will run across all available fields. The following fields are available for fielded searching in the dtSearch Index:
Index Project for FullText Searching
FullText
Index Sender/Recipients in Fields
Sender
Recipients
Includes To, CC, and BCC
To
CC
BCC
Examples:
(//text contains (water*))
(Sender contains (*enron.com))
This search would retrieve any files that have the value *enron.com in the Sender field for the parent email.
(Sender contains (jdoe*)) and (Recipients contains (bdoe*))
This search would retrieve any files that have the value jdoe* in the Sender field and the value bdoe* in the Recipients field for the parent email.
Note
By default, Reveal Discovery Platform indexes and searches the fields FULLTEXT, SENDER, RECIPIENTS, TO, FROM, CC, BCC. The sender and recipient email address fields contain both the display name and the fully qualified email address. As a result, a Keyword Search Term may hit on one of the email address fields even though the fully qualified email address is not visible in the extracted text (FULLTEXT). To search only the extracted text, use the syntax //text contains (<Term>). This is the only fielded search that requires the // syntax. Alternatively, within the Project Settings, the sender and recipient fields can be excluded from the dtSearch Index leaving only the FULLTEXT.
Recognize Dates/Email Addresses/Credit Cards
One of the dtSearch Settings that is not selected by default is Recognize Dates/Email Addresses/Credit Cards. If this setting is selected within the project, searches for various formats of dates, email addresses, or credit card numbers can be executed. Please note that activating the feature will dramatically impact indexing and searching performance.
Recognize Dates
Usage:
Date recognition looks for anything that appears to be a date, using English language months (including common abbreviations) and numerical formats. To search for a date, put "date()" around the date expression or range.
Examples:
date(January 10, 2010)
date(10 Jan 10)
date(2010/01/10)
date(1/10/10)
date(1-10-10)
date(The tenth of January, two thousand ten)
Email Addresses
Usage:
Email address recognition looks for text that follows the syntax for a valid email address (example: sales@dtsearch.com). This makes it possible to search for a specific email address regardless of the alphabet settings for the @ and . characters, as well as any other punctuation that may be present in an email address. Also, this makes it possible to use the word listing functions in dtSearch to enumerate all email addresses in a file collection. To search for an email address, put "mail()" around the address. The * and ? wildcard expressions are supported inside the () marks.
Examples:
mail(*@mindseyesolutions.com)
Credit Card Numbers
Usage:
Credit card number recognition looks for any sequence of numbers that appears to satisfy the criteria for a valid credit card number issued by one of the major credit card issuers. Credit card numbers are recognized regardless of the pattern of spaces or punctuation embedded in the number. Numerical tests used by credit card issuers for card validity are used to exclude sequences of numbers that are not credit card numbers. However, these tests are not perfect and so the credit card number recognition feature may pick up some numbers that are not really credit card numbers. To search for a credit card number, put "creditcard()" around the number.
Examples:
creditcard(654654654231323)
creditcard(5405 2465 7894 8798)
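The guide does not say which numerical tests are applied. A widely used card-number check is the Luhn checksum, so the Python sketch below uses it as an assumption to illustrate how a candidate number can be validated regardless of embedded spaces or punctuation. The function name `luhn_valid` is hypothetical, and 4111 1111 1111 1111 is a commonly used test number, not a real card.

```python
def luhn_valid(number: str) -> bool:
    """Apply the Luhn checksum, ignoring spaces and punctuation in the input."""
    digits = [int(c) for c in number if c.isdecimal()]
    if not digits:
        return False
    total = 0
    for idx, d in enumerate(reversed(digits)):
        if idx % 2 == 1:     # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111-1111-1111-1112"))
```

A checksum of this kind explains why some digit sequences that are not card numbers can still be picked up: any sequence of the right length that happens to satisfy the checksum passes the test.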