Updated: April 2022
Email: canary@partners.org
Web: http://canary.bwh.harvard.edu/
Thank you for using Canary. This file includes the complete documentation for using the software. For further assistance or bug reports, please contact us via the email address above.
Table of Contents
K) Output Mapping functionality
O) Running from the Command Line
4) What operating systems does Canary run on?
5) Can I run Canary on a Mac or Linux?
6) Can I run Canary on a cluster?
7) Can I run Canary from the command line?
8) How much hard drive space does Canary need?
10) Does Canary require a UMLS or any other licenses?
12) Does Canary use machine learning techniques?
13) What file formats can Canary extract information from?
14) Does Canary connect directly to any electronic medical records systems?
15) Can Canary be used with languages other than English?
17) How much does Canary cost?
18) What is the difference between a “Flag” and an “Output Criteria”?
19) How can negated structures be identified?
20) Where should output files generally be stored?
21) How can I create a duplicate copy of an existing project?
The adoption of Electronic Health Records (EHR) in the last two decades has greatly increased the amount of digital data available to researchers. A large fraction of information in electronic health records is contained in narrative documents (e.g. provider notes, radiology reports, etc.). This has led to the development of computational methods to process and mine these documents for information of interest.
Canary is a platform designed to help users create natural language processing tools based on user-defined rules. It was designed for users without software development or engineering experience, allowing the vocabulary and grammar rules to be defined via a Graphical User Interface (GUI).
Canary takes plain text files as input. These text files may contain many individual documents which are delimited using a "document separator" sequence of characters along with a unique identifier. More details about this can be found in the “Run” section of this document.
Canary can be installed (i.e. unzipped) into any directory on the computer. The directory can be moved anywhere else on the computer after Canary is unzipped. Canary can also work from a flash drive or other portable media.
After installation, instructions for starting the software can be found in a file called “HOW TO RUN” in the installation folder. This file also contains other relevant instructions, such as how to create additional shortcuts.
Canary is compatible with Windows 10, Windows 11 and Windows Server 2012 / 2016 / 2019 / 2022. It is compatible with both 32-bit and 64-bit architectures.
The first step in using Canary is creating a project. This is done on the Projects tab. On this tab the user can open one of the projects previously created by this user on this computer (Load Selected Project), create a completely new project (New Project), or load a project that is not on the list (Load Project from Folder). To save a modified project without changing the original, use the Save Project to Folder button and choose or create a new folder for the modified project.
Note:
You will not be able to open any other tabs until a project is open. Only one project can be open at a time.
Canary can take as input any number of files, each of which can contain an unlimited number of individual documents. The files have to be in plain text format. Individual documents in each file have to be separated by the following sequence:
a) New line (i.e. each document has to start on a new line)
b) A unique “separator” (a unique sequence of characters that is unlikely to be found inside of a document). The separator is defined on the Run tab.
c) Document ID (a number or a character sequence that uniquely identifies the document). The format of the Document ID is also defined on the Run tab.
As a first step, each text file is read and split into individual documents. Next, each individual document is further processed.
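The splitting step can be pictured with a short Python sketch. This is not Canary's own code; the function name split_documents is hypothetical, and the sketch assumes that each document begins on a new line with the Document ID immediately followed by the separator, and that the ID format is given as a regular expression.

```python
import re

def split_documents(text, separator, id_pattern=r"\d+"):
    """Return {document_id: body} for a multi-document plain text file.

    Assumption: a document header is a line that starts with the
    document ID followed directly by the separator.
    """
    header = re.compile(
        r"^(" + id_pattern + r")" + re.escape(separator),
        flags=re.MULTILINE,
    )
    matches = list(header.finditer(text))
    documents = {}
    # Each document body runs from the end of its header to the next header.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        documents[current.group(1)] = text[current.end():end].strip()
    return documents
```

Calling split_documents on the contents of one input file, with the separator and ID format configured for the project, would yield one entry per document keyed by its ID.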
In this step each document is scanned to identify abbreviations. Abbreviations need to be identified so that the periods after abbreviations are not interpreted by Canary as the end of a sentence.
Abbreviations are recognized using a user-defined list. This list is contained in the >ABBREVIATION word class and can contain as many or as few entries as the user requires. When a new project is started, the >ABBREVIATION word class is auto-populated with abbreviations commonly used in the medical domain. This word class can then be edited according to the user’s needs. Like all other members of word classes, the entered abbreviations are not case sensitive.
The user can choose a different name for a word class containing abbreviations. The name of the word class containing abbreviations (whether named >ABBREVIATION or otherwise) has to be indicated on the Run tab.
Note:
· A word class containing abbreviations is required – Canary will not execute a project without one. However, it may be an empty class.
Canary will next attempt to split each document into individual sentences.
It is important for Canary to identify abbreviations in order to achieve good results here. This is because periods in abbreviations such as "Mr." or "Dr." may look like sentence-ending periods.
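The following Python sketch illustrates why abbreviation handling matters for sentence splitting. It is a simplified illustration, not Canary's implementation: the ABBREVIATIONS set stands in for the members of the abbreviation word class, the function names are hypothetical, and the splitting rule (punctuation followed by whitespace) is an assumption.

```python
import re

# Hypothetical stand-in for the members of the >ABBREVIATION word class.
ABBREVIATIONS = {"mr", "dr", "pt"}

def strip_abbreviation_periods(text, abbreviations=ABBREVIATIONS):
    """Drop the period after known abbreviations (case-insensitive)."""
    def fix(match):
        word = match.group(1)
        # Keep the period unless the word is a listed abbreviation.
        return word if word.lower() in abbreviations else match.group(0)
    return re.sub(r"\b([A-Za-z]+)\.", fix, text)

def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    cleaned = strip_abbreviation_periods(text)
    parts = re.split(r"(?<=[.!?])\s+", cleaned)
    return [part.strip() for part in parts if part.strip()]
```

Without the first step, "Dr." and "pt." in a note such as "Dr. Smith saw the pt. today." would each be treated as the end of a sentence.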
After Canary identifies sentences, it performs customized text replacement. Text replacements are defined on the Replace Text tab. They are defined using regular expressions; more information on regular expressions can be found in the regular expression help file. These replacements only take place inside Canary; the original input file is not modified. Text replacements can be helpful under several circumstances:
a) Clitics
Clitics are morphemes that have a syntactic function. In English they can represent contractions of phrases such as I'm (I am) or couldn't (could not). They can pose a challenge in NLP as the same information may appear in different forms. It is desirable to map the variations to a standard form to make identifying them easier.
In this process, Canary attempts to replace such clitics with a normalized form of the word. For example, can't can be mapped to cannot. The Canary distribution comes with a standard set of clitics that are pre-populated on the Replace Text tab whenever a new project is created.
b) Noun phrases
The text replacement function can also be used to normalize (bring to the same form) noun phrases, verb phrases, etc. For example, the text “d.m.”, “diabetes” and “diabetes mellitus” can all be replaced with a single form “diabetesmellitus” that can be used in subsequent text analysis by Canary.
Text replacement can also be used to normalize noun phrases whose components are separated by other words. For example, a regular expression s/\bmyocardial.*infarction/myocardialinfarction/gi will replace the phrase “myocardial ischemia, eventually leading to an infarction” with “myocardialinfarction”. However, this approach will lead to loss of information that is contained between the noun phrase components (it will not be available for further analysis) and possibly loss of accuracy, and therefore should be used with caution.
The Replace Text feature can also be used to perform any other text replacement before the documents are further processed.
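The effect of such replacement rules can be tried out with a small Python sketch. The rules below are illustrative re-creations of the examples in this section written in Python regular expression syntax (real rules are entered on the Replace Text tab), and the names REPLACEMENTS and apply_replacements are hypothetical.

```python
import re

# Illustrative rules only, applied in order and case-insensitively.
REPLACEMENTS = [
    (r"\bcan't\b", "cannot"),
    (r"\b(?:diabetes(?:\s+mellitus)?\b|d\.m\.)", "diabetesmellitus"),
    (r"\bmyocardial.*infarction", "myocardialinfarction"),
]

def apply_replacements(text, rules=REPLACEMENTS):
    """Return a modified copy of the text; the input string is not changed."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text
```

The last rule shows the information loss described above: everything between "myocardial" and "infarction" disappears from the text that later steps see.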
Punctuation Removal
In order to facilitate easier text matching, Canary removes any punctuation remaining after text replacement has been performed. If you wish to preserve any punctuation tokens for rule matching or inclusion in your output, you must create a text replacement rule to substitute these with non-punctuation tokens.
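As a rough Python illustration of preserving a punctuation mark, the sketch below substitutes a word token for the forward slash before stripping the remaining punctuation. The token name "forwardslash" and the definition of punctuation used here (anything that is not a letter, digit, underscore or whitespace) are assumptions made for the example, not a description of Canary's internals.

```python
import re

def preserve_then_strip(text):
    """Replace "/" with a word token, then remove remaining punctuation."""
    # Step 1: a replacement rule protects the character we want to keep.
    text = text.replace("/", " forwardslash ")
    # Step 2: everything that still counts as punctuation is removed.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse any doubled spaces left behind.
    return " ".join(text.split())
```

A word class containing the substituted token can then be used in phrase structures in place of the original punctuation mark.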
Note:
· Canary output will include text after replacement has taken place
· The Canary distribution includes companion software, UMLS Concept Exporter, that can export selected UMLS concepts in a format suitable for creating Canary projects. This includes generating Canary clitic files that include replacement rules that merge multi-word UMLS concepts into a single word to facilitate subsequent processing by Canary.
The vocabulary (a.k.a. lexicon or ontology), which is the organization of words into user-specified categories, is a core part of Canary. Each sentence in the data will be processed and each word (or 'token') in the sentence will be assigned to a 'class'. Each class is essentially a category that you create. You will also choose the words that belong to each class. These classes can be thought of as a grouping of words from the same semantic category. Each word class name has to start with a “>” character and the class name has to be capitalized.
For example, if we were working on a project to identify references to body parts in documents, we would define two classes referring to body parts and anatomical adjectives, which could be defined as:
>ANATOMICAL -> !(bi)?-?lat(eral)?, anterior, caudal, upper, lower, left, right, [...]
>BODYPART -> (gastro)?intestinal, (gastro)?o?esophageal, (musculo)?skeletal, abdom.?, abdomen, [...]
Users may define as many classes as needed and the words belonging to each class can be matched using a regular expression, as seen in the above example. This allows users to create customizable vocabularies to meet their needs.
Any words that you have not included in your vocabulary will be assigned to a special word class called >UNKNOWN.
To start a new word class, enter its name in the Word class name search / entry box above the Defined word classes list and press Enter. After a word class has been started, members can be added to it by entering them above the Defined members list and pressing Enter.
As you enter the name of the word class in the Word class name search / entry box, Canary will first search for an existing word class with that name. If no word class with that name is found, then it will be possible to create a new class (by pressing Enter).
Word classes can be exported (all together) into a file or imported from a file. This functionality allows word classes to be shared, in part or in full, with another Canary project.
Note:
· Members of word classes have to be single words – no spaces inside word class members are allowed. To encode phrases that consist of several words use Phrase Structures or Replace Text functions.
· Any word class without members will automatically be deleted when the project is saved.
· The Canary distribution includes companion software, UMLS Concept Exporter, that can export selected UMLS concepts in a format suitable for creating Canary projects. This includes generating Canary word class files that represent UMLS concepts.
When adding entries to different classes in the project vocabulary, it is possible that a word can be added to multiple classes. This is because a single lexical entry may belong to several semantic or syntactic categories. For example, when defining different classes for groups of drugs, certain medications may fall into more than one class. This is also possible with regular expressions, as we describe below.
The multiple resolver mechanism helps manage these conflicts by allowing the user to define an order of precedence for assigning words to classes. It allows you to list a set of matching classes and to define which one will be selected in case of a conflict.
For example, we may have two word classes: >A with a regular expression member defined as “\d+” and >B with a member “\d{3,4}”. The token “1234” may match both of these classes. A multiple resolver rule will be required to help the software decide to which class the word must be assigned. Of course it would be preferable to avoid such ambiguous word class definitions, but this is not always possible.
Using multiple resolvers we can create a rule that says: if a token matches multiple classes >A, >B, … >N, assign it to the class that we specify for such conflicting cases, e.g. >C.
If such a conflict occurs and no multiple resolver rule is defined, the token will be assigned to the >UNKNOWN class.
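The assignment logic described in this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than Canary's code: the dictionaries WORD_CLASSES and RESOLVERS and the function classify are hypothetical, members are assumed to be matched against the whole token without regard to case, and the resolver here happens to pick >B.

```python
import re

# Hypothetical word classes; each member is a regular expression.
WORD_CLASSES = {
    ">A": [r"\d+"],
    ">B": [r"\d{3,4}"],
    ">BODYPART": [r"abdomen", r"(gastro)?intestinal"],
}

# Hypothetical multiple resolver rules: conflicting classes -> chosen class.
RESOLVERS = {frozenset({">A", ">B"}): ">B"}

def classify(token, classes=WORD_CLASSES, resolvers=RESOLVERS):
    """Return the word class for a token, or >UNKNOWN."""
    matches = {
        name
        for name, members in classes.items()
        if any(re.fullmatch(m, token, flags=re.IGNORECASE) for m in members)
    }
    if len(matches) == 1:
        return next(iter(matches))
    if not matches:
        return ">UNKNOWN"  # word not in the vocabulary
    # Conflict: use a resolver rule if one exists, otherwise >UNKNOWN.
    return resolvers.get(frozenset(matches), ">UNKNOWN")
```

With these definitions the token "12" matches only >A, while "1234" matches both classes and is decided by the resolver rule; removing the rule sends it to >UNKNOWN.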
Once the vocabulary (word classes) has been created, the next major step involves creating grammatical rules that define how these word classes can be combined to form phrase structures. A phrase structure can be a single word or a combination of words, as allowed by the grammar. For example:
<BODYPARTPHRASE -> >BODYPART
<BODYPARTPHRASE -> >ANATOMICAL, >BODYPART
The above rules state that a body part phrase can be a single body part, or an anatomical adjective followed by a body part. These rules are then processed by a parser to match all fragments of a text that match any of the supplied rules. Some examples matched by this simple grammar include knee, left hand and lateral ankle.
We can also extend the grammar to match nested body parts by simply adding a recursive rule:
<BODYPARTPHRASE -> <BODYPARTPHRASE, <BODYPARTPHRASE
A recursive rule expresses that a phrase may include a sub-phrase of the same type as its own constituent, allowing us to capture the recursive property of natural language. This is important because, for example, English nouns and sentences can be infinitely recursive. In the above example, this extension allows the capture of one or more adjacent body part phrases.
In order to create phrase structures that are composed, in part or in whole, from other phrase structures (including recursive phrase structures, as in the above example), multiple tiers of phrase structures have to be created. Phrase structures in tier 2 can use as components phrase structures from tier 1; phrase structures from tier 3 can use phrase structures from tiers 1 and 2, etc. There is no limit on the number of tiers that a project can have.
At least one phrase structure tier has to be created in a Canary project. Creation of the first tier has to be the first step in creating phrase structures. Click Add new tier to create the first (or subsequent) tiers. To add phrase structures to a tier, select the tier number and click Add new structure.
Every single word in a phrase structure has to be specified – it is not possible to create a phrase structure that encodes the beginning and the end of a phrase, without specifying in some way what is in between. However, it is possible to specify that a particular word in a phrase structure can be any word other than those specified elsewhere in word classes. The special word class >UNKNOWN can be used for this purpose. >UNKNOWN parts of a phrase structure can be “strung” together so one could create several phrase structures with sequences of one, two, three, etc. >UNKNOWN components between the specified components.
A phrase structure with the same name can have multiple definitions, on the same or different tiers, reflecting multiple language models for the concept represented by the phrase structure.
Phrase structure names have to start with a “<” and have to be capitalized.
Note:
· Phrase structures are required for the project to be executed by Canary
· >UNKNOWN will only match words that are not members of any of the other word classes
· Within the same tier, a particular text will only match one phrase structure. If several phrase structures within the tier match the text, the text will be “assigned” to the first phrase structure by alphabetical order.
· If two phrase structures in two different tiers match a particular part of the text, the text will be “assigned” to both phrase structures.
· Canary distribution includes companion software UMLS Concept Exporter that can be used to export selected UMLS concepts in the format to create Canary projects. This includes generating Canary phrase structure files that represent UMLS concepts.
Additional Details about Phrase Structure Rule Processing:
Canary implements a shift-reduce parser, with tiers being processed in the specified tier order.
Within each tier:
· The parser processes all rules within each tier continuously until no further matches are found.
o This means that a tier can contain structures that are derived from other structures in the same tier.
· Tiers can also reference structures defined in previous tiers, but not subsequent ones.
· It is not possible to control the rule precedence (order) within a tier.
· All tiers can access word classes, so long as the word has not been matched and merged into a structure by another rule on the same or previous tier.
Tiers can be used for organization of rules (e.g. managing large rule sets by splitting them into related subsets). More importantly, tiers can also help control the order of rule matching, which can help resolve shift-reduce conflicts.
The placement of phrases of different lengths between tiers can also affect matching:
· Within the same tier, a sentence will preferentially match a longer phrase structure over a shorter one. For example, if trying to exclude a negated concept, the phrase structure that includes both the negation and the concept must be in the same tier as the phrase structure that has just the concept alone. If the phrase structure that has both the negation and the concept is in an upper tier, then both phrase structures will match.
· If the longer phrase structure is on the lower tier, and the shorter phrase structure is on a higher tier, only the longer phrase structure will match.
· If the shorter phrase structure is on the lower tier, and the longer phrase structure is on a higher tier, both will match.
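To make the tier-by-tier processing more concrete, here is a deliberately simplified Python sketch. It is not Canary's shift-reduce parser: it only rewrites a list of symbols, prefers longer rules within a tier, repeats each tier until nothing changes, and does not record the overlapping matches across tiers described above. The names reduce_tier, parse and TIERS are hypothetical.

```python
def reduce_tier(symbols, rules):
    """Apply one tier's rules until no rule matches, longest rule first.

    Note: a rule whose only component is its own name would loop forever;
    the sketch assumes no such rule exists.
    """
    ordered = sorted(rules, key=lambda rule: len(rule[1]), reverse=True)
    changed = True
    while changed:
        changed = False
        for name, parts in ordered:
            for i in range(len(symbols) - len(parts) + 1):
                if symbols[i:i + len(parts)] == parts:
                    # Merge the matched components into the structure name.
                    symbols = symbols[:i] + [name] + symbols[i + len(parts):]
                    changed = True
                    break
            if changed:
                break
    return symbols

def parse(symbols, tiers):
    """Process tiers in order; later tiers see earlier tiers' structures."""
    for rules in tiers:
        symbols = reduce_tier(list(symbols), rules)
    return symbols

# The body part grammar from this section, split over two tiers.
TIERS = [
    [("<BODYPARTPHRASE", [">ANATOMICAL", ">BODYPART"]),
     ("<BODYPARTPHRASE", [">BODYPART"])],
    [("<BODYPARTPHRASE", ["<BODYPARTPHRASE", "<BODYPARTPHRASE"])],
]
```

For a word class sequence such as >ANATOMICAL, >BODYPART, >BODYPART, the first tier produces two body part phrases and the second tier merges them into one.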
Flags are user-defined criteria that cause entire documents to be "flagged" as containing information of interest. This is an important step in extracting these documents or text fragments from within documents.
To create a new flag, click the Add new flag button and type a name for the flag in the Flag name field of the Edit flag dialog box that opens. The next step is to specify which phrase structure (the search term) must be found within a document or a sentence for it to be flagged. For example, in a simple scenario we may be interested in all documents that mention a specific illness or medication.
After the search term is selected, the user can optionally select output terms. Output terms are phrase structures that describe what Canary needs to produce as output. Note that it is not possible for the Search Term and Output Term to be identical. If the entire search term needs to be included in the output, a “shell” phrase structure that only includes the Output Term phrase structure needs to be created and used as the Search Term. For example, if text “heart attack” matching the phrase structure <MYOCARDIALINFARCTION should be included as output, a “shell” phrase structure <MYOCARDIALINFARCTIONSH can be created that is set to <MYOCARDIALINFARCTION. Then the parent (“shell”) phrase structure <MYOCARDIALINFARCTIONSH is set as the Search term and the child phrase structure <MYOCARDIALINFARCTION is set as the Output term to allow Canary to print “heart attack” in the output file (to generate output, Canary needs to be additionally configured on the Output Criteria tab – see the next section).
In a more complicated case, Canary may be configured to detect a phrase documenting blood pressure, such as “BP is 130/80”. However, the output the user is interested in is not the entire phrase, but only the blood pressure value. Then the phrase structure that corresponds to the entire phrase “BP is 130/80” is selected as a search term, while its “subcomponent” phrase structure that only represents the actual blood pressure value (“130/80”) is selected as the output term. The main phrase structure (representing “BP is 130/80” in this example) has to actually be composed of several phrase structures that include the “subcomponent” structure. The subcomponent structure then has to be on a lower tier than the main structure. For example, the main phrase structure could be <ENTIREBPSENTENCE and be defined as <BPTERM, <BPVALUE where <BPTERM and <BPVALUE are its component phrase structures.
A single flag can have multiple outputs. For example, if it is desirable to have separate output fields for systolic and diastolic blood pressure, then the blood pressure example above could include the main phrase structure <ENTIREBPSENTENCE that is defined as <BPTERM, <SYSTOLICBP, <FORWARDSLASH, <DIASTOLICBP. The <ENTIREBPSENTENCE would be selected as the search term, while <SYSTOLICBP and <DIASTOLICBP would be selected as output terms.
Users may define as many flags as they require.
Note:
· A flag can have only one search term, but multiple outputs
· A flag may contain an output term that occurs multiple times in a single match. In this case the output terms will be grouped together and output in comma-separated format.
· A flag cannot have the phrase structure itself as an output – only its subcomponent(s), even if it’s equal to the phrase structure.
· Digits are not allowed in flag names.
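The relationship between a search term and its output terms can be sketched in Python. The Match class, the helper functions and the ", " delimiter below are assumptions made for illustration; they model a matched phrase structure as a small tree and are not taken from Canary itself.

```python
from dataclasses import dataclass, field

@dataclass
class Match:
    """A matched phrase structure, the text it covers, and its components."""
    name: str
    text: str
    children: list = field(default_factory=list)

def collect(match, term):
    """Return the text of every subcomponent of `match` named `term`."""
    found = []
    for child in match.children:
        if child.name == term:
            found.append(child.text)
        found.extend(collect(child, term))
    return found

def flag_output(match, search_term, output_terms):
    """If `match` is the search term, map each output term to its text.

    Repeated output terms are joined in comma-separated form.
    """
    if match.name != search_term:
        return None
    return {term: ", ".join(collect(match, term)) for term in output_terms}
```

Because only subcomponents are collected, the search term itself never appears as output, which mirrors the need for a "shell" phrase structure described above.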
Canary can utilize output criteria that include flags located at a distance from each other in a single sentence or even in different sentences. However, when the flags are far away from each other in the text, the topic of the narrative can change between them. The topic boundaries functionality is designed to identify these changes.
Click on Add New to create a new topic boundary. Then select the flags that will serve as topic (“anchor”) 1 and topic (“anchor”) 2 and the flag that will serve as the topic boundary. For example, in a natural language processing tool that identifies records of aortic valve surgery, Topic 1 could be a phrase structure representing aortic valve (e.g. <AORTICVALVE) and Topic 2 could be a phrase structure representing surgery (e.g. <SURGERY), while Topic Boundary could be a phrase structure representing coronary bypass (e.g. <CORONARYBYPASS). This topic boundary would be triggered in the sentence “Catheterization showed that her aortic valve was fine but found extensive coronary stenosis requiring coronary bypass surgery”, indicating that between “aortic” and “surgery” the topic of the narrative changed, and “surgery” did not refer to “aortic”. On the other hand, this topic boundary would not be triggered in the sentence “Aortic valve was severely stenosed and surgical repair was carried out within a month”, and an instance of documentation of aortic valve surgery would be detected.
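The check that a topic boundary performs can be approximated with a few lines of Python. This is a sketch under assumptions: matches are represented simply as a list of names in the order they occur in the text, only the case where topic 1 precedes topic 2 is considered, and the function name boundary_triggered is hypothetical.

```python
def boundary_triggered(hits, topic1, topic2, boundary):
    """Return True if `boundary` occurs between `topic1` and `topic2`.

    `hits` lists the matched names in text order.
    """
    if topic1 not in hits:
        return False
    start = hits.index(topic1)
    if topic2 not in hits[start + 1:]:
        return False
    end = hits.index(topic2, start + 1)
    return boundary in hits[start + 1:end]
```

For the two example sentences above, the first would produce the sequence <AORTICVALVE, <CORONARYBYPASS, <SURGERY (boundary triggered) and the second <AORTICVALVE, <SURGERY (boundary not triggered).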
"Output Criteria" are used to specify which flagged sentences should be included in the output. In the simplest case, a sentence with a specific flag may be included in the output. However, more advanced rules can also be created to identify sequences of flags. For example, we may wish to identify a sentence specifying an illness, followed by another sentence mentioning a medication or drug class. By creating flags for each of these concepts, we can specify an output criteria that matches this chain of concepts or topics within a document.
Output criteria may include multiple sentences, each of which may be required to match one or more flags. There are no limits on either how many flags a sentence may be required to match, or how many sentences a particular output criterion can include. In order for output to be generated, every flag in every sentence included in an output criterion has to match. However, the order in which flags have to match in a particular sentence cannot be specified – the flags can match in any order.
To create a new output criterion, click on the Add New button. In the resulting Edit output criterion dialogue, every row in the Criterion parts box represents a separate sentence that is a component of that criterion. In every row, different flags that a sentence will be required to match are separated by commas. For example, in the screenshot below, the criterion includes two sentences. The first sentence has to match flags DIABETES and FAMILYHISTORY (in no particular order), and the second sentence has to match flags DIET and COUNSELING (again in no particular order). Both flags in both sentences have to match in order for output to be generated from this criterion. To change the order in which the sentences have to match, use up and down arrows to the left of the Criterion parts box.
To add a new sentence to the criterion, click on Add Sentence. You will be prompted to select the first flag that this sentence has to match (as each sentence has to match at least one flag). Your selection will automatically populate the list of Flags in selected sentence on the right. To require the sentence to match additional flags, select the flag from the Add a flag to the current dropdown and click on Add flag.
Note:
· A sentence in a criterion can be required to match the same flag only once
· Flags can match the sentence in any order
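The matching of an output criterion against a document can be sketched in Python as below. The sketch assumes that the sentences of a criterion must match consecutive sentences of the document, which this documentation does not state explicitly; the function name criterion_matches and the data layout are likewise hypothetical.

```python
def criterion_matches(sentence_flags, criterion):
    """Find where a criterion matches a document.

    sentence_flags: one set of matched flag names per sentence, in order.
    criterion: one set of required flag names per criterion row, in order.
    Returns the index of the first sentence of each match.
    """
    width = len(criterion)
    starts = []
    for i in range(len(sentence_flags) - width + 1):
        window = sentence_flags[i:i + width]
        # Every required flag of every row must be present; order within
        # a sentence does not matter because sets are compared.
        if all(required <= found for required, found in zip(criterion, window)):
            starts.append(i)
    return starts
```

Using sets for each sentence reflects the two notes above: a flag is either required or not, and the order of flags within a sentence is irrelevant.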
It is often necessary to process large amounts of data, but substantial portions of this data may not include relevant information. Filtering the input data can help optimize the information extraction process by discarding or filtering data that does not contain topics of interest. This feature can significantly reduce processing time for large datasets.
Canary can filter the data at both the document and sentence level to speed up processing. This functionality is based on word classes, giving users the ability to require that certain word classes be found in documents and sentences. For example, if your project targets diabetes or related concepts, you may wish to discard any documents that do not mention "diabetes" anywhere. This is an optional feature and all data will be processed if no filters are defined.
The document-level filter will discard entire documents in the early processing stage if they do not contain any of the listed classes. This can provide a substantial reduction of processing time.
The sentence-level filter is a more fine-grained one that can be used to filter out irrelevant sentences in a document. For example, if you are interested in certain concepts related to insulin, you may wish to eliminate any sentences that do not explicitly mention insulin.
Note:
Filtering uses text after it has been modified by the rules on the Replace Text tab.
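The two filter levels can be illustrated with a short Python sketch. It assumes each sentence has already been reduced to the list of word classes found in it, and that the sentence-level filter uses the same "at least one listed class" test described for documents; the function names are hypothetical.

```python
def passes_filter(found_classes, required_classes):
    """True if no filter is defined or any listed class is present."""
    if not required_classes:
        return True
    return bool(set(found_classes) & set(required_classes))

def filter_documents(documents, doc_filter, sentence_filter):
    """Drop documents, then sentences, that contain none of the listed classes.

    documents: {doc_id: [sentence, ...]}, each sentence a list of class names.
    """
    kept = {}
    for doc_id, sentences in documents.items():
        all_classes = [name for sentence in sentences for name in sentence]
        if not passes_filter(all_classes, doc_filter):
            continue  # whole document discarded early
        kept[doc_id] = [s for s in sentences if passes_filter(s, sentence_filter)]
    return kept
```

Passing empty filters leaves the data unchanged, matching the behaviour when no filters are defined.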
Canary can map output it generates to a set of phrase structures. This functionality can be used, for example, to map Canary output to standard vocabularies, such as UMLS. In this process, the output that Canary generated (rather than the entire original text) is submitted for analysis using word classes defined on the Word Classes Mapper tab and phrase structures defined on the Structures Mapper tab. The resulting output that includes the names of the phrase structures (e.g. UMLS Concept Unique Identifiers) is found in the MappedOutput file.
The first step in the process of setting up output mapping by Canary is in the Word Classes Mapper tab. This tab describes word classes used to form phrase structures that will be mapped. For example, to map mitral stenosis to its corresponding UMLS CUI C0026269, one would create word classes mitral and stenosis. The overall functionality of this tab is analogous to the Word Classes tab.
The second step in the process of setting up output mapping by Canary is in the Structures Mapper tab. This tab describes phrase structures that will be mapped. For example, to map mitral stenosis to its corresponding UMLS CUI C0026269, one would create a phrase structure named C0026269 that would be set to the combination of previously created word classes mitral and stenosis. The overall functionality of this tab is analogous to the Structures tab.
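As a rough Python illustration of the mapping idea, the sketch below looks for a sequence of words in generated output and reports the name of the corresponding structure. It is heavily simplified: the words stand in directly for mapper word classes, and the names STRUCTURES and map_output are hypothetical.

```python
# Hypothetical mapper rules: a sequence of words -> mapped structure name.
STRUCTURES = {("mitral", "stenosis"): "C0026269"}

def map_output(output_text, structures=STRUCTURES):
    """Return the names of all mapper structures found in the output text."""
    tokens = output_text.lower().split()
    found = []
    for parts, name in structures.items():
        width = len(parts)
        for i in range(len(tokens) - width + 1):
            if tuple(tokens[i:i + width]) == parts:
                found.append(name)
    return found
```

Only the output text is examined, not the original document, in line with the description above.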
The Run tab allows the user to set several options needed for execution of the project:
· Abbreviations word class
This is a word class (as discussed in more detail in the Abbreviation Preprocessing section above) that lists common abbreviations whose periods are removed by Canary so that they are not confused with sentence boundaries. The periods are only removed from the text in Canary’s memory; the original text is not affected.
· Maximum number of threads
Canary supports parallel processing of documents for faster execution of a project. Many modern computers have multiple processor “cores”, each of which can host a separate Canary “thread” that analyzes its own subset of the overall batch of documents (all threads subsequently pool their output together transparently to the user). It is best, however, not to use up all CPU cores for Canary execution, so that the computer can still be used by other programs while Canary is running. We therefore recommend that the Maximum number of threads be set to no more than the number of processor (CPU) cores minus one. By default, Canary sets the Maximum number of threads to the number of processor cores minus one when a new project is created.
To determine the number of processor cores available, right-click on the Windows Taskbar and select Windows Task Manager. Click on the Performance tab. The number of processor (CPU) cores will be reflected as the number of columns in the CPU Usage History display.
· Documents per batch
The number of documents each Canary thread will process. We recommend between 1,000 and 10,000, depending on the overall number of documents in the batch. If the number is too small, the overhead of starting a thread will decrease performance. If the number is too large, each thread will take up too much memory.
· Document separator
A sequence of characters that is unlikely to be found in the text that separates individual documents from each other within each file. Canary recognizes the beginning of an individual document in a file containing multiple documents when it encounters all of the following:
1) New line (each new document has to start with a new line)
2) Document ID (a number or a sequence of characters that uniquely identifies the document; this document ID will be included in the output to identify the document the output came from)
3) Document separator
Here is an example of a text file containing several “documents”:
001*|#*|#*|#1*|#*|#*|#
This is the first document.
002*|#*|#*|#1*|#*|#*|#
This is the second document (or record) in this file.
Each document can be of varying length.
003*|#*|#*|#1*|#*|#*|#
This is the third and final item in this file.
The document ID is numeric in this example and the separator is “*|#*|#*|#1*|#*|#*|#”. More examples of input files can be found in the various sample projects that are distributed with the software.
· Document ID format
The format of the document ID (a number or a sequence of characters that uniquely identifies the document; this document ID will be included in the output to identify the document the output came from) in the form of a regular expression. For example, \d+ (the default value) indicates that document ID will be a number (sequence of digits) unique to the document.
· Document Directory
The directory where the documents to be processed by the project are located. All files with the extension *.txt in that directory will be processed when the project is executed.
· Output Directory
The directory where Canary will place the output files (see Output Files section below for more details). Any existing output files in that directory will be overwritten without prompting.
· Log File Directory
Canary can optionally create a log file where it will record every instance of Replace Text, Word Class matching or Phrase Structure matching. This can help debug a language model that does not perform according to expectations or to generally understand how Canary analyzes text. On the other hand, log files that are created by Canary are likely to be very large relative to the size of the text files being analyzed. It is therefore recommended to only use Log functionality on small amounts of text (e.g. < 10 kB).
To use Log File functionality, check off the check box to the left of Log file directory and then indicate the directory / folder where the log file (it will be named log.txt) will be created.
· Save Project
Click on this button to save the project at any time. The project is saved automatically if the Run Project button is clicked.
· Run Project
Click on this button to execute the project on the documents in the Document directory. Large document sets and / or projects with a large number of rules may take a significant amount of time to execute.
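To illustrate how the Document ID format and the document separator work together, here is a minimal Python sketch that splits the text of an input file into (ID, text) pairs. It is not Canary's own parser. The function name split_documents is hypothetical, the first document in the sample is made up for illustration, and the sketch assumes each document starts on a new line with its ID immediately followed by the separator, as in the sample input above.

```python
import re

def split_documents(text, id_pattern=r"\d+", separator="*|#*|#*|#1*|#*|#*|#"):
    """Split raw file text into (document_id, body) pairs.

    Assumes each document begins on its own line with an ID matching
    id_pattern, immediately followed by the separator string.
    """
    # Escape the separator so characters such as * and | are matched literally.
    header = re.compile(
        r"^(" + id_pattern + r")" + re.escape(separator) + r"[ \t]*\r?\n?",
        re.MULTILINE,
    )
    matches = list(header.finditer(text))
    documents = []
    for i, match in enumerate(matches):
        # A document's body runs until the next header (or the end of the file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        documents.append((match.group(1), text[match.end():end].strip()))
    return documents

# Illustrative input only; the first document's text is invented.
sample = (
    "001*|#*|#*|#1*|#*|#*|#\n"
    "This is the first document.\n"
    "002*|#*|#*|#1*|#*|#*|#\n"
    "This is the second document (or record) in this file.\n"
    "Each document can be of varying length.\n"
    "003*|#*|#*|#1*|#*|#*|#\n"
    "This is the third and final item in this file.\n"
)

for doc_id, body in split_documents(sample):
    print(doc_id, "->", body)
```

A sketch like this can be handy for checking that a regular expression entered as the Document ID format matches the IDs in your files before running a project.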
Canary also supports processing files directly using the command line. This can be useful for batch processing, automatically processing data with Canary, or enabling integration of Canary models with other systems.
A working distribution of Perl 5 is required for running the Canary command line. The Windows version of Canary includes this as part of the bundle.
The Perl code itself resides in the "perl_code" directory of the Canary package.
Canary projects can be run via the "Wrapper.pl" file, which takes 3 arguments:
(a) The location of a directory with your documents
(b) The location of your Canary project's configuration file (usually named "ConfigFile.txt")
(c) The location of the directory where results should be saved
The following steps walk you through the process of running a Canary project.
1. Open your command line terminal and navigate to the "perl_code" directory, e.g.:
cd "C:\Users\John\Desktop\canary\perl_code\"
2. Execute the project, passing the 3 arguments described above to Wrapper.pl as below:
perl Wrapper.pl "C:\data\notes" %C:\Users\John\Desktop\MyCanaryProject\ConfigFile.txt +C:\output
· The above command must be run from within the perl_code directory
· Please note that the order of the arguments is important.
· The path to the configuration file should be prefixed with "%"
· The path to the output directory should be prefixed with "+"
· We also recommend that you use complete and absolute paths instead of relative paths.
· When running on Windows, there must not be a backslash at the end of a folder name in the above command as it will be interpreted as an escape character.
3. Monitor the command line for output from the Perl script. A message will confirm that processing has been completed, at which point the specified output directory should contain your results. These are identical to the files generated when the project is run via the Canary GUI.
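For batch processing, the command can also be assembled and launched from a script. The following Python sketch builds the argument list in the order described above ("%" prefix for the configuration file, "+" prefix for the output directory, trailing backslashes removed from folder names) and runs it from within the perl_code directory. The function names build_canary_command and run_canary are hypothetical, the paths are the example paths from this section, and the sketch assumes that perl is available on the system PATH.

```python
import subprocess

def build_canary_command(documents_dir, config_file, output_dir):
    """Return the argument list for Wrapper.pl in the order it expects."""
    def clean(path):
        # Remove trailing slashes/backslashes from folder names, since a
        # trailing backslash can be interpreted as an escape character.
        return path.rstrip("\\/")

    return [
        "perl",
        "Wrapper.pl",
        clean(documents_dir),       # (a) directory with the documents
        "%" + config_file,          # (b) configuration file, prefixed with %
        "+" + clean(output_dir),    # (c) output directory, prefixed with +
    ]

def run_canary(perl_code_dir, documents_dir, config_file, output_dir):
    """Run Wrapper.pl from within the perl_code directory.

    Assumes a perl executable can be found on the PATH.
    """
    command = build_canary_command(documents_dir, config_file, output_dir)
    # cwd makes the command run from within the perl_code directory.
    return subprocess.run(command, cwd=perl_code_dir, check=True)

command = build_canary_command(
    r"C:\data\notes",
    r"C:\Users\John\Desktop\MyCanaryProject\ConfigFile.txt",
    "C:\\output\\",
)
print(command)
```

Passing the arguments as a list leaves quoting to the subprocess module, which avoids hand-written quotes around paths that contain spaces.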
When a project is executed, Canary will generate the following categories of output files (Log.txt is created only if the Log File functionality is enabled):
· MappedOutput_N.txt
This file contains Canary output values mapped to a vocabulary (e.g. SNOMED). It contains the following information for each sentence that contains text that was mapped to the vocabulary:
o DocumentFile: name of the file containing the document
o DocumentID: unique ID for the document
o OutputCriteria: the ID(s) of the output criteria that triggered the output
o StartInDocument: number of characters from the start of the document to the start of the text that matched the output criterion
o EndInDocument: number of characters from the start of the document to the end of the text that matched the output criterion
o StartInFile: number of characters from the start of the file to the start of the text that matched the output criterion
o EndInFile: number of characters from the start of the file to the end of the text that matched the output criterion
o Sentence: the entire sentence where the output criterion was matched
o <XXXX>: vocabulary code that corresponds to the text that matched the specific output criterion <XXXX>
· DocumentInfoOutput_N.txt
This file contains the following information for each document analyzed by Canary:
o DocumentFile: name of the file containing the document
o DocumentID: unique ID for the document
o Start: number of characters from the start of the file to the start of the document
o End: number of characters from the start of the file to the end of the document
o Output: “1” if output was produced for this file, and “0” if it was not
· DocumentLevelOutput_N.txt
Contains the list of document IDs for documents for which output was generated.
· SentenceLevelOutput_N.txt
This file includes sentence-level output. It has the following columns:
o DocumentFile: the name of the file that contained the documents being analyzed
o DocumentID: the unique ID of the document
o OutputCriteria: the ID(s) of the output criteria that triggered the output
o StartInDocument: number of characters from the start of the document to the start of the text that matched the output criterion
o EndInDocument: number of characters from the start of the document to the end of the text that matched the output criterion
o StartInFile: number of characters from the start of the file to the start of the text that matched the output criterion
o EndInFile: number of characters from the start of the file to the end of the text that matched the output criterion
o Sentence: the entire sentence where the output criterion was matched
o <XXXX>: text that matched the specific output criterion <XXXX>
· Log.txt
This file includes every instance of Replace Text, Word Class matching or Phrase Structure matching. It has the following columns:
o DocumentFile: the name of the file that contained the documents being analyzed
o DocumentID: the unique ID of the document
o Event: the Canary text analysis event (e.g. matched word class or a phrase structure) that is being recorded
o Sentence: the sentence where the Canary text analysis event took place
o Position in Sentence: the position of the first character of the word/phrase in the sentence (left to right; the first character is “1”) that the Canary text analysis event applies to
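The offset columns described above can be used to locate matched text in the original input file. The Python sketch below shows the idea on toy data. It rests on assumptions this documentation does not state: that offsets are zero-based character counts, that an end offset points just past the match, and that a file offset equals the document's Start value plus the offset within the document. The function names and the sample values are hypothetical, so verify the behaviour against your own output files.

```python
def to_file_offset(document_start, offset_in_document):
    """Convert a document-relative offset to a file-relative one.

    document_start corresponds to the Start column of DocumentInfoOutput
    (assumed here to share its origin with StartInDocument).
    """
    return document_start + offset_in_document

def extract_match(file_text, start_in_file, end_in_file):
    """Return the text between two file-relative character offsets."""
    return file_text[start_in_file:end_in_file]

# Illustrative values only, not real Canary output.
file_text = "first note text\nno recurrent mass seen\n"
document_start = 16      # the second toy document begins here in the file
start_in_document = 3    # "recurrent mass" begins 3 characters into it
end_in_document = 17     # and ends 17 characters into it

start_in_file = to_file_offset(document_start, start_in_document)
end_in_file = to_file_offset(document_start, end_in_document)
matched_text = extract_match(file_text, start_in_file, end_in_file)
print(matched_text)
```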
Canary is a free, open-source software program for extraction of information from narrative text. Canary was designed for users without a software engineering or computer science background. Canary uses a graphical user interface to guide users through creation of a language model (a set of rules) that will allow them to extract a concept of interest from electronic text (e.g. whether or not the patient had a mass reported on their brain MRI; what their blood pressure was; whether the patient suffered an adverse reaction to a medication; etc.). Canary was initially designed for use with medical texts, but can be used outside of medicine as well.
If you wish to cite the Canary software, please reference the following publication:
Malmasi S, Sandor NL, Hosomura N, Goldberg M, Skentzos S, Turchin A. Canary: An NLP Platform for Clinicians and Researchers. Applied Clinical Informatics. 2017;8(2):447-53. doi: 10.4338/ACI-2017-01-IE-0018.
You can find information about all Canary-related publications on the website:
http://canary.bwh.harvard.edu/publications/
Canary can be installed in any folder or even on a flash drive. Double-click on the installation file and choose the folder where Canary should be installed.
4) What operating systems does Canary run on?
Canary has been tested on Windows 7, Windows 8 and Windows 10, as well as Windows Server 2008 and 2012.
5) Can I run Canary on a Mac or Linux?
Canary consists of a Perl-based command-line component that could potentially run on multiple operating systems, and a .NET-based graphical user interface component that runs only on Windows systems. You may be able to run the Perl component on non-Windows operating systems, including Mac and Linux, but this has not been tested.
6) Can I run Canary on a cluster?
Canary cannot currently make use of distributed computing systems.
7) Can I run Canary from the command line?
Canary consists of a Perl-based command-line component that could potentially run on multiple operating systems, and a .NET-based graphical user interface component that runs only on Windows systems. The graphical user interface component creates several configuration files that describe the language model and are subsequently used by the command-line component. You could create the language model / configuration files using the graphical user interface component, and then execute the project using just the command-line Perl component.
8) How much hard drive space does Canary need?
Currently the software requires around 300 MB of disk space.
Processing speed depends on the complexity of the language model (higher complexity = slower processing) and the number of threads you are using in parallel to process the text (more threads = faster processing). For example, using a language model consisting of 284 phrase structures (rules) and 12 threads, Canary processes 2.8 GB of text in 4.5 hours.
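As a rough illustration of the figures above, the following Python sketch derives a throughput from the quoted example and scales it to another corpus size. Linear scaling is an assumption made here for the sake of the arithmetic; actual speed depends on your language model, thread count and hardware. The function name estimate_hours is hypothetical.

```python
BENCHMARK_GB = 2.8      # amount of text in the example above
BENCHMARK_HOURS = 4.5   # processing time in the example above

def estimate_hours(corpus_gb, benchmark_gb=BENCHMARK_GB,
                   benchmark_hours=BENCHMARK_HOURS):
    """Linearly scale the example run time to a corpus of corpus_gb gigabytes."""
    return corpus_gb * benchmark_hours / benchmark_gb

throughput_gb_per_hour = BENCHMARK_GB / BENCHMARK_HOURS
print(f"Example throughput: {throughput_gb_per_hour:.2f} GB/hour")
print(f"Estimated time for 1 GB: {estimate_hours(1.0):.1f} hours")
```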
10) Does Canary require a UMLS or any other licenses?
Using the Mapper function of Canary with UMLS requires a UMLS license. Otherwise using Canary does not require any licenses.
Contact us at canary@partners.org
12) Does Canary use machine learning techniques?
No, Canary is a rule-based information extraction tool.
13) What file formats can Canary extract information from?
Only plain text files are supported at this point. If your data is in another format, such as XML, CSV or HTML, you will need to use other tools to convert it to text format.
Canary expects the plain text files to be in a particular format, with every individual document starting with a unique document ID followed by a sequence of characters (“document separator”). If the files you are working with have a different format, some of them can be converted to the Canary-expected format by a companion tool, Canary Data Converter (see Canary Data Converter manual for more information). Currently Canary Data Converter supports Epic and Research Patient Data Registry (RPDR) formats, among others.
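If your notes are stored in a CSV file, a short script can write them out in the ID-plus-separator layout described above. The Python sketch below is not the Canary Data Converter. The function name csv_to_canary_text and the column names note_id and note_text are hypothetical, the sample rows are invented, and the separator is the one used in the sample input earlier in this manual (it must match the document separator configured in your project).

```python
import csv
import io

SEPARATOR = "*|#*|#*|#1*|#*|#*|#"  # must match the project's document separator

def csv_to_canary_text(csv_file, id_column="note_id", text_column="note_text",
                       separator=SEPARATOR):
    """Convert CSV rows to text: each document is ID + separator, then its text."""
    parts = []
    for row in csv.DictReader(csv_file):
        parts.append(f"{row[id_column]}{separator}\n{row[text_column].strip()}\n")
    return "".join(parts)

# In-memory example; for a real file use open(path, newline="", encoding="utf-8").
example_csv = io.StringIO(
    "note_id,note_text\n"
    "101,No recurrent mass.\n"
    '102,"Blood pressure 120/80, stable."\n'
)
converted = csv_to_canary_text(example_csv)
print(converted)
```

The resulting text would then be saved with a .txt extension in the project's document directory so that Canary picks it up.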
14) Does Canary connect directly to any electronic medical records systems?
Not at this point. You will need to extract EMR data from your system manually and process it with Canary.
15) Can Canary be used with languages other than English?
Canary supports Unicode input and has been tested with non-English languages. A sample project using Korean data is included.
16) Does Canary map the information it extracts to standard dictionaries (e.g. UMLS; ICD-10; SNOMED; etc.)?
This is possible using the UMLS Exporter companion software that comes bundled with Canary and the Mapper tabs of Canary, but it is not currently implemented in the user interface.
17) How much does Canary cost?
Canary is free software provided at no cost to the user.
18) What is the difference between a “Flag” and an “Output Criteria”?
A flag represents a single target structure that should be searched for in every sentence; it also lets you define sub-parts of this structure to output separately in structured format. Output criteria are definitions of requirements that must be met in order for Canary to generate output. These criteria may be as simple as a sentence that contains a specific flag. This can be extended to outputting sentences with multiple flags or creating even more sophisticated criteria that match flags across multiple consecutive sentences.
19) How can negated structures be identified?
Specially crafted phrase structure rules can be used to identify negated phrases. New projects are automatically assigned a >NEGATION word class that includes some words to assist with this. A working example of how this technique can be used to identify negated mentions of masses in radiology reports, e.g. “no recurrent mass”, can be found in the Radiology sample project.
20) Where should output files generally be stored?
Just like the input documents, Canary output can be placed in any accessible directory. It does not need to be stored in a subdirectory within the project directory, although this can be convenient. When a new project is created, by default Canary creates a subdirectory for your output, for example: "C:\Canary\projects\yourproject\OutputFolder". You can change this output location to something else.
Placing the Canary output, or any other files, directly in a Canary project directory is highly discouraged. This directory should generally only contain the project files. Attempting to manually manage files in this folder may lead to the accidental modification or deletion of project files, which may stop your project from working correctly.
21) How can I create a duplicate copy of an existing project?
Duplicating an existing project is a common requirement when iteratively developing and testing a model. It allows for new rules and features to be added and the results can then be compared against a previous version of the model. It is also a good way to back up your work.
This can be performed via the GUI, from the projects tab. Once a project has been loaded you can use the “Save Project to Folder” button to create an additional copy of the project. This will copy the project files, but not any documents that may be stored with the project.
Alternatively, you can copy a project by simply duplicating the directory it is stored in.
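Duplicating the directory can also be scripted, for example to keep a dated backup before each round of rule changes. The Python sketch below uses the standard library's shutil.copytree; the function name duplicate_project and the folder names are hypothetical, and the demonstration runs on a temporary placeholder folder rather than a real Canary project.

```python
import shutil
import tempfile
from pathlib import Path

def duplicate_project(project_dir, copy_dir):
    """Copy an entire project directory to a new location.

    Does not overwrite: shutil.copytree raises FileExistsError
    if copy_dir already exists.
    """
    source = Path(project_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"Project directory not found: {source}")
    return Path(shutil.copytree(source, copy_dir))

# Demonstration on a temporary placeholder folder (not a real project).
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "yourproject"
    original.mkdir()
    (original / "ConfigFile.txt").write_text("placeholder", encoding="utf-8")
    copy = duplicate_project(original, Path(tmp) / "yourproject_v2")
    copied_names = sorted(p.name for p in copy.iterdir())
    print(copied_names)
```

Note that a directory copy includes everything inside the folder, so any documents or output stored there are duplicated as well.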