Category Data In/Out¶

Category Latino¶

Widget: Get Plain Texts¶

Automatically generated widget from function GetDocStrings in package latino. The original function signature: GetDocStrings.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Parameter: Token Annotation (System.String)
- Default value: TextBlock
Parameter: Feature Condition (Condition specifying which tokens to include based on their features. Format examples: -Feature1 (don’t include tokens with Feature1 set to any value) -Feature1=Value1 (don’t include tokens with Feature1 set to the value Value1) -Feature1 +Feature2 (don’t include tokens with Feature1 set unless Feature2 is also set) -Feature1=Value1 +Feature2 (don’t include tokens with Feature1 set to Value1 unless Feature2 is also set to any value)...)
Parameter: Delimiter for token concatenation (System.String)
Parameter: Include Document Identifier (System.Boolean)
Output: Texts

Widget: Load Document Corpus From File¶

This widget processes a raw text file and loads the texts into an ADC (Annotated Document Corpus) structure. The input file contains one document per line; the whole line represents the body text of a document. If lines contain additional document properties (e.g. IDs, titles, labels), then other widgets should be used to load the ADC structure.

Input: Raw Text File (Input Text File: Contains one document per line - the whole line represents text from the body of a document.)
Parameter: Text before the first tabulator [/t] represents the title of a document (System.Boolean)
- Default value: false
Parameter: First words in a line (after optional title) with preceding exclamation (!) present labels (System.Boolean)
- Default value: false
Output: Annotated Document Corpus

Widget: Load Document Corpus From String¶

This widget processes a raw text string and loads the texts into an ADC (Annotated Document Corpus) structure. The input string contains one document per line; the whole line represents the body text of a document. If lines contain additional document properties (e.g. IDs, titles, labels), then other widgets should be used to load the ADC structure.

Input: String (Input Text String: Contains one document per line - the whole line represents text from the body of a document.)
Parameter: Text before the first tabulator [/t] represents the title of a document (System.Boolean)
- Default value: false
Parameter: First words in a line (after optional title) with preceding exclamation (!) present labels (System.Boolean)
- Default value: false
Output: Annotated Document Corpus

Widget: Get Plain Texts¶

Widget transforms Annotated Document Corpus to string.

Input: Annotated Document Corpus (Annotated Document Corpus.)
Parameter: Feature Annotation (Select a feature annotation.)
- Default value: Stem
Parameter: Token Annotation (Select token annotation.)
- Default value: Token
Parameter: Delimiter for token concatenation (Delimiter for token concatenation.)
- Default value:
Parameter: Include Document Identifier (Include Document Identifier.)
Output: Texts (String with all documents in Annotated Document Corpus.)
Example usage: LBD workflows for outlier detection

Widget: Load Document Corpus¶

This widget processes input text and loads it into an ADC (Annotated Document Corpus) structure. The input text contains one document per line; the whole line represents the body text of a document. If lines contain additional document properties (e.g. IDs, titles, labels), then other widgets should be used to load the ADC structure.

Input: Input (Input can be a string (str) or a file (fil).)
Parameter: Text before the first tabulator [/t] represents the title of a document (Text before the first tabulator [/t] represents the title of a document.)
- Default value: false
Parameter: First words in a line (after optional title) with preceding exclamation (!) present labels (First words in a line (after optional title) with preceding exclamation (!) present labels.)
- Default value: false
Output: Annotated Document Corpus (Annotated Document Corpus.)
Example usage: Evaluation of POS 3-gram sequences in gender classification task
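
The two Boolean parameters above describe a simple line format. The following is a minimal sketch of how one line might be split under those rules; the helper name `parse_corpus_line`, the returned dictionary, and the assumption that labels are space-separated words each starting with `!` are illustrative and not taken from the widget's implementation.

```python
def parse_corpus_line(line, title_before_tab=False, labels_with_bang=False):
    """Split one input line into title, labels and body text (hypothetical helper)."""
    title, labels = None, []
    text = line.rstrip("\n")
    # Optional title: everything before the first tab character.
    if title_before_tab and "\t" in text:
        title, text = text.split("\t", 1)
    # Optional labels: leading words that start with "!" (assumed format).
    if labels_with_bang:
        words = text.split(" ")
        while words and len(words[0]) > 1 and words[0].startswith("!"):
            labels.append(words.pop(0)[1:])
        text = " ".join(words)
    return {"title": title, "labels": labels, "text": text}


doc = parse_corpus_line(
    "Match report\t!sports !positive Great game last night.",
    title_before_tab=True,
    labels_with_bang=True,
)
print(doc)
```

With both parameters left at their default value (false), the whole line is kept as the document body.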

Widget: Load Document Corpus From String¶

This widget processes input text and loads it into an ADC (Annotated Document Corpus) structure. The input text contains one document per line; the whole line represents the body text of a document. If lines contain additional document properties (e.g. IDs, titles, labels), then other widgets should be used to load the ADC structure.

Input: String (Input Text String: Contains one document per line - the whole line represents text from the body of a document.)
Parameter: Text before the first tabulator [/t] represents the title of a document (Text before the first tabulator [/t] represents the title of a document.)
- Default value: false
Parameter: First words in a line (after optional title) with preceding exclamation (!) present labels (First words in a line (after optional title) with preceding exclamation (!) present labels.)
- Default value: false
Output: Annotated Document Corpus (Annotated Document Corpus.)

Widget: Load PTB Corpus¶

Loads a corpus in Penn Treebank format with part-of-speech or lemma annotations. The corpus should be a directory with .ptb or .txt files, or it could be just one file with one nested tuple per line. Below is an example of how the input format could look:

(ROOT
(S
(VP (VBG Making): (NP (NNPS Skittles))))
(NP (NN vodka)) (VP (VBZ is)
(NP (DT a) (JJ fun) (NN way)
(S
(VP (TO to)
(VP (VB add)
(NP
(NP (DT a) (NN splash)) (PP (IN of)
(NP (JJ fruity) (NN flavor)
(CC and) (NN color))))
(PP (TO to)
(NP (JJ regular) (NN vodka))))))))
(. .)))

The widget returns a list of tokenized sentences with part of speech or lemma tags.

Input: Input (input should be a zipped directory of files or a file)
Output: PTB document corpus
Example usage: POS tagger intrinsic evaluation - experiment 2
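
To illustrate what "tokenized sentences with part of speech tags" means for this format, here is a minimal sketch that pulls the innermost `(TAG word)` pairs out of a bracketed string with a regular expression. It is a simplification for illustration only; the helper name `tagged_tokens` is hypothetical and the widget itself may parse the trees differently.

```python
import re

# Matches innermost "(TAG word)" pairs such as "(NN vodka)".
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def tagged_tokens(ptb_text):
    """Return (word, tag) pairs in the order they appear in the bracketed text."""
    return [(word, tag) for tag, word in LEAF.findall(ptb_text)]


sentence = "(S (NP (DT a) (NN splash)) (PP (IN of) (NP (JJ fruity) (NN flavor))))"
print(tagged_tokens(sentence))
```

Phrase-level brackets such as `(NP ...)` are skipped because their second element is another bracket rather than a word.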

Widget: PTB To ADC Converter¶

Convert PTB corpus to pseudo ADC corpus. Can be used after ‘Load PTB corpus’ widget.

Input: PTB Document Corpus (Corpus in Penn Treebank format)
Parameter: Annotation name (Give the name to the annotations from the Penn Treebank Format, for example, ‘POS Tag’ or ‘Lemma’. This annotation will be tagged in the ADC corpus under this name.)
- Default value: POS Tag
Output: Annotated Document Corpus
Example usage: POS tagger intrinsic evaluation - experiment 2

Widget: Crawl URL links¶

This widget takes either a list of URLs or a string in which each URL is on a separate line. For every input URL, the system crawls the page and extracts its content using the boilerpipe library.

Input: Input (Input can be a string (str) or a file (fil).)
Parameter: Extractor Name (Extractor which is used for content extraction)
- Possible values:
  - Article Extractor
  - Article Sentences Extractor
  - Canola Extractor
  - Default Extractor
  - Keep Everything Extractor
  - Largest Content Extractor
- Default value: DefaultExtractor
Parameter: Label (label which is set to all documents)
Output: Annotated Document Corpus (Annotated Document Corpus.)

Widget: Load Document Corpus From File¶

This widget processes a raw text file and loads the texts into an ADC (Annotated Document Corpus) structure. The input file contains one document per line; the whole line represents the body text of a document. If lines contain additional document properties (e.g. IDs, titles, labels), then other widgets should be used to load the ADC structure.

Input: Raw Text File (Input Text File: Contains one document per line - the whole line represents text from the body of a document.)
Parameter: Text before the first tabulator [/t] represents the title of a document (Text before the first tabulator [/t] represents the title of a document.)
- Default value: false
Parameter: First words in a line (after optional title) with preceding exclamation (!) present labels (First words in a line (after optional title) with preceding exclamation (!) present labels.)
- Default value: false
Output: Annotated Document Corpus (Annotated Document Corpus.)

Category Triplet Extraction¶

Category Document Corpus¶

Category Latino¶

Widget: Add Feature¶

Automatically generated widget from function AddDocumentsFeatures in package latino. The original function signature: AddDocumentsFeatures.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Input: Feature Values (Array of Labels) (System.Collections.Generic.List`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Parameter: New Feature Name (System.String)
- Default value: feature
Parameter: New Feature Value Prefix (System.String)
Output: Annotated Document Corpus

Widget: Add Computed Feature¶

Automatically generated widget from function AddComputedFeatures in package latino. The original function signature: AddComputedFeatures.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Parameter: New Feature Name (System.String)
- Default value: feature
Parameter: New Feature Computation (System.String)
- Default value: {feature2:name}{feature3}, {feature1:value}
Parameter: Old Features Specification (Comma-separated list of names of old features used in the ‘New Feature Computation’.)
- Default value: feature1, feature2
Output: Annotated Document Corpus

Widget: Add Set Feature¶

Automatically generated widget from function MarkDocumentsWithSetFeature in package latino. The original function signature: MarkDocumentsWithSetFeature.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Parameter: Feature Name (System.String)
- Default value: set
Parameter: Feature Value Prefix (System.String)
Parameter: Num of Sets (System.Int32)
- Default value: 10
Parameter: Assign Sets Randomly (System.Boolean)
- Default value: true
Parameter: Use Seed for Random (System.Boolean)
- Default value: false
Parameter: Random Seed (System.Int32)
- Default value: 0
Output: Annotated Document Corpus

Widget: Extract Documents¶

Automatically generated widget from function ExtractDocuments in package latino. The original function signature: ExtractDocuments.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Input: List of Document Indexes to be Extracted (System.Collections.Generic.List`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Parameter: Discard The Rest (The Filtered Out) (System.Boolean)
- Default value: false
Output: Annotated Document Corpus of Extracted Documents
Output: Annotated Document Corpus of the Rest of Documents

Widget: Add Feature¶

Add a feature to Annotated Document Corpus.

Input: Annotated Document Corpus
Input: Feature Values (List of feature values)
Parameter: New Feature Name
- Default value: feature
Parameter: New Feature Value Prefix
Output: Annotated Document Corpus

Widget: Display Document Corpus¶

The Display Document Corpus widget displays the ADC (Annotated Document Corpus) structure. It shows a detailed view of the selected document with its annotations.

Input: Annotated Document Corpus (Annotated Document Corpus.)
Outputs: Popup window which shows widget’s results
Example usage: POS tagging classification evaluation

Widget: Extract Documents¶

Extract documents, given document indices, from Annotated Document Corpus.

Input: List of Document Indexes to be Extracted
Input: Annotated (Annotated Document Corpus.)
Parameter: Discard The Rest (The Filtered Out)
- Default value: false
Output: Annotated Document Corpus of Extracted Documents
Output: Annotated Document Corpus of the Rest of Documents
Example usage: LBD workflows for outlier detection

Widget: Extract Feature¶

Extract document features.

Input: Annotated Document Corpus (Annotated Document Corpus.)
Parameter: Extracted Feature Name
Output: List of Extracted Features

Widget: Merge Corpora¶

Merge multiple Annotated Document Corpora into one.

Input: Annotated
Output: Merged Annotated Document Corpus

Widget: NLTK Document Corpus¶

NLTK corpus readers. The modules in this package provide functions that can be used to read corpus files in a variety of formats. These functions can be used to read both the corpus files that are distributed in the NLTK corpus package, and corpus files that are part of external corpora.

Please see http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml for a complete list. Install corpora using nltk.download().

Corpus has the following available functions:

- words(): list of str
- sents(): list of (list of str)
- paras(): list of (list of (list of str))
- tagged_words(): list of (str,str) tuple
- tagged_sents(): list of (list of (str,str))
- tagged_paras(): list of (list of (list of (str,str)))
- chunked_sents(): list of (Tree w/ (str,str) leaves)
- parsed_sents(): list of (Tree with str leaves)
- parsed_paras(): list of (list of (Tree with str leaves))
- xml(): A single xml ElementTree
- raw(): unprocessed corpus contents

Parameter: NLTK Document Corpus Name (NLTK Document Corpus Name)
- Possible values:
  - Brown
  - Cess Esp (Spanish)
  - Floresta
  - NPS chat
  - Treebank
- Default value: brown
Parameter: Corpus Chunk (Define the chunk of the corpus you want. You can define the chunk as a percentage (e.g. ‘80%’) of the corpus, or as the number of sentences from the beginning of the corpus. For example, the value ‘1000’ will return the first 1000 sentences in the corpus.

You can also define the chunk you want to discard. For example, ‘^80%’ will discard the first 80% of the corpus and return the last 20%. ‘^1000’ will discard the first 1000 sentences of the corpus and return the rest.)
- Default value: 100%
Output: NLTK document corpus (NLTK document corpus name)
Example usage: POS tagger intrinsic evaluation - experiment 2
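
The Corpus Chunk syntax can be read as a small slicing rule. The sketch below shows one way to interpret it on a plain Python list of sentences; the function name `select_chunk` and the rounding of percentages (truncation towards zero) are assumptions, not documented behaviour of the widget.

```python
def select_chunk(sentences, spec="100%"):
    """Return the part of `sentences` described by a Corpus Chunk value (hypothetical helper)."""
    discard = spec.startswith("^")      # a leading "^" means: drop the chunk, keep the rest
    amount = spec[1:] if discard else spec
    if amount.endswith("%"):
        # Percentage of the corpus; truncating to an integer is an assumption.
        count = int(len(sentences) * float(amount[:-1]) / 100)
    else:
        # Absolute number of sentences from the beginning of the corpus.
        count = int(amount)
    return sentences[count:] if discard else sentences[:count]


sentences = list(range(10))  # stand-in for ten tagged sentences
print(select_chunk(sentences, "80%"))   # first 80% of the corpus
print(select_chunk(sentences, "^80%"))  # everything after the first 80%
```

Using ‘80%’ in one widget instance and ‘^80%’ in another gives two non-overlapping parts of the same corpus, for example a training part and a test part.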

Widget: Split¶

Split Annotated Document Corpus by conditions with features and values.

Input: Annotated Document Corpus (Annotated )
Parameter: Feature Condition
Parameter: Discard The Rest (The Filtered Out)
Output: Filtered Annotated Document Corpus
Output: The Rest of Annotated Document Corpus

Widget: Statistics¶

Statistics of Annotated Document Corpus.

Input: Annotated Document Corpus
Output: Number of Documents (Number of Documents.)
Output: Number of Features (Number of Features.)
Output: Statistics (Statistics.)

Widget: Add Computed Token Features¶

For every annotation of the selected type generate an additional feature. Between { } a feature name can be entered and it will be replaced with its value.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Parameter: New Feature Name (System.String)
- Default value: feature
Parameter: Annotation Name (Add features to tokens of this type.)
- Default value: Token
Parameter: New Feature Computation (Values for the new features. Between { } a feature name can be entered and it will be replaced with its value.)
- Default value: {Stem}_{POS Tag}
Output: Annotated Document Corpus
Example usage: COMTRADE demo
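
The `{ }` substitution described above can be sketched as a simple template expansion. In this illustration a token's features are modelled as a plain dictionary, and a name with no matching feature is left unchanged; both choices, and the function name `compute_feature`, are assumptions.

```python
import re


def compute_feature(template, features):
    """Replace every {name} in `template` with the value of that feature (hypothetical helper)."""
    def substitute(match):
        name = match.group(1)
        # Unknown feature names are left as they are (assumption).
        return str(features.get(name, match.group(0)))

    return re.sub(r"\{([^{}]+)\}", substitute, template)


token_features = {"Stem": "run", "POS Tag": "VBG"}
print(compute_feature("{Stem}_{POS Tag}", token_features))  # run_VBG
```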

Category Tokenization¶

Category Latino¶

Category Advanced¶

Widget: Split Sentences Hub¶

Automatically generated widget from function TokenizeSentences in package latino. The original function signature: TokenizeSentences.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Input: Tokenizer (Latino.TextMining.ITokenizer)
Parameter: Annotation to be tokenized (Which annotated part of the document is to be split)
- Default value: TextBlock
Parameter: Annotation to be produced (How to annotate found sentences)
- Default value: Sentence
Output: Annotated Document Corpus

Widget: Unicode Tokenizer¶

Automatically generated widget from function ConstructUnicodeTokenizer in package latino. The original function signature: ConstructUnicodeTokenizer.

Parameter: Filter (Latino.TextMining.TokenizerFilter)
- Possible values:
  - AlphaLoose: accept tokens that contain at least one alphabetic character
  - AlphanumLoose: accept tokens that contain at least one alphanumeric character
  - AlphanumStrict: accept tokens that contain alphanumeric characters only
  - AlphaStrict: accept tokens that contain alphabetic characters only
  - None: accept all tokens
- Default value: None
Parameter: Minimal Token Length (System.Int32)
- Default value: 1
Output: Tokenizer
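
The filter values and the Minimal Token Length parameter can be summarised as a single accept/reject test per token. The following is a sketch of that logic in Python, using `str.isalpha` and `str.isalnum` as stand-ins for the character classes; the function name `accept_token` is hypothetical and the Latino implementation may classify characters differently.

```python
def accept_token(token, token_filter="None", min_length=1):
    """Decide whether a token passes the filter and the minimal length (illustrative only)."""
    if len(token) < min_length:
        return False
    if token_filter == "AlphaLoose":      # at least one alphabetic character
        return any(ch.isalpha() for ch in token)
    if token_filter == "AlphanumLoose":   # at least one alphanumeric character
        return any(ch.isalnum() for ch in token)
    if token_filter == "AlphanumStrict":  # alphanumeric characters only
        return all(ch.isalnum() for ch in token)
    if token_filter == "AlphaStrict":     # alphabetic characters only
        return all(ch.isalpha() for ch in token)
    return True                           # "None": accept all tokens


tokens = ["data", "R2D2", "3.14", "--"]
for name in ("AlphaLoose", "AlphanumLoose", "AlphanumStrict", "AlphaStrict", "None"):
    print(name, [t for t in tokens if accept_token(t, name)])
```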

Widget: Regex Tokenizer¶

Automatically generated widget from function ConstructRegexTokenizer in package latino. The original function signature: ConstructRegexTokenizer.

Parameter: Regular Expression (System.String)
- Default value: \p{L}+(-\p{L}+)*
Parameter: Ignore Unknown Tokens (System.Boolean)
Parameter: Ignore Case (System.Boolean)
Parameter: Multiline (System.Boolean)
Parameter: Explicit Capture (System.Boolean)
Parameter: Compiled (System.Boolean)
Parameter: Singleline (System.Boolean)
Parameter: Ignore Pattern Whitespace (System.Boolean)
Parameter: Right To Left (System.Boolean)
Parameter: ECMA Script (System.Boolean)
Parameter: Culture Invariant (System.Boolean)
Output: Tokenizer

Widget: Tokenizer Hub¶

Automatically generated widget from function TokenizeWords in package latino. The original function signature: TokenizeWords.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Input: Tokenizer (Latino.TextMining.ITokenizer)
Parameter: Annotation to be tokenized (Which annotated part of the document is to be split)
- Default value: TextBlock
Parameter: Annotation to be produced (How to annotate found tokens)
- Default value: Token
Output: Annotated Document Corpus

Category Nltk¶

Widget: Line Tokenizer¶

Tokenize a string into its lines, optionally discarding blank lines.

Parameter: Blank Lines (blanklines: Indicates how blank lines should be handled. Options are:
- discard: strip blank lines out of the token list before returning it. A line is considered blank if it contains only whitespace characters.
- keep: leave all blank lines in the token list.
- discard-eof: if the string ends with a newline, then do not generate a corresponding token '' after that newline.)
- Possible values:
  - discard
  - discard-eof
  - keep
- Default value: discard
Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
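
As a brief illustration of the Blank Lines options, the sketch below calls NLTK's `LineTokenizer` directly, outside the widget, on a short string containing one blank line. It assumes NLTK is installed; the sample text is made up.

```python
from nltk.tokenize import LineTokenizer

text = "first line\n\nsecond line\n"

# "discard" drops the blank line, "keep" leaves it in the token list.
discarded = LineTokenizer(blanklines="discard").tokenize(text)
kept = LineTokenizer(blanklines="keep").tokenize(text)

print(discarded)
print(kept)
```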

Widget: Regex Tokenizer¶

The Regex Tokenizer splits a string into substrings using a regular expression.

Parameter: Regular Expression (The pattern used to build this tokenizer. This pattern may safely contain capturing parentheses.)
- Default value: \p{L}+(-\p{L}+)*
Parameter: Gaps (True if this tokenizer’s pattern should be used to find separators between tokens; False if this tokenizer’s pattern should be used to find the tokens themselves.)
Parameter: Discard empty (True if any empty tokens ‘’ generated by the tokenizer should be discarded. Empty tokens can only be generated if Gaps is set.)
Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
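
The difference between matching tokens and matching gaps is easiest to see side by side. The sketch below uses NLTK's `RegexpTokenizer` directly and assumes NLTK is installed. Python's built-in `re` module does not support the `\p{L}` Unicode letter class, so the first pattern uses `[^\W\d_]` as a stand-in for "a letter"; that substitution is an assumption made for this example.

```python
from nltk.tokenize import RegexpTokenizer

# Gaps = False: the pattern describes the tokens themselves
# (runs of letters, optionally joined by hyphens).
word_tokenizer = RegexpTokenizer(r"[^\W\d_]+(?:-[^\W\d_]+)*")
words = word_tokenizer.tokenize("State-of-the-art text-mining tools, 2 of them.")

# Gaps = True: the pattern describes the separators between tokens.
gap_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
pieces = gap_tokenizer.tokenize("one  two\tthree")

print(words)
print(pieces)
```

In the first call, digits and punctuation are dropped because they never match the token pattern; in the second, every run of whitespace is treated as a single separator.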

Widget: S-Expression Tokenizer¶

S-Expression Tokenizer is used to find parenthesized expressions in a string. In particular, it divides a string into a sequence of substrings that are either parenthesized expressions (including any nested parenthesized expressions), or other whitespace-separated tokens.

Parameter: Parentheses (A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.)
- Default value: ()
Parameter: Strict (If true, then raise an exception when tokenizing an ill-formed sexpr.)
- Default value: true
Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
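
A short example may help to show what counts as one token here. The sketch calls NLTK's `SExprTokenizer` directly, assuming NLTK is installed, with the default parentheses and with Strict enabled.

```python
from nltk.tokenize import SExprTokenizer

tokenizer = SExprTokenizer(parens="()", strict=True)

# Each top-level parenthesized expression is one token; everything else
# is split on whitespace.
tokens = tokenizer.tokenize("(a b (c d)) e f (g)")
print(tokens)

# With strict=True an ill-formed expression raises ValueError.
try:
    tokenizer.tokenize("(a (b")
    ill_formed_rejected = False
except ValueError:
    ill_formed_rejected = True
print(ill_formed_rejected)
```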

Widget: Simple Tokenizer¶

These tokenizers divide strings into substrings using the string split() method.

The available tokenizers are:

- Space Tokenizer: tokenize a string using the space character as a delimiter, which is the same as s.split(' ').
- Tab Tokenizer: tokenize a string using the tab character as a delimiter, which is the same as s.split('\t').
- Char Tokenizer: tokenize a string into individual characters.
- Whitespace Tokenizer: tokenize a string on whitespace (space, tab, newline).
- Blankline Tokenizer: tokenize a string, treating any sequence of blank lines as a delimiter. Blank lines are defined as lines containing no characters, except for space or tab characters.
- Word Punct Tokenizer: tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.

Parameter: Type (Select a tokenizer from the list above.)
- Possible values:
  - Blankline Tokenizer
  - Char Tokenizer
  - Space Tokenizer
  - Tab Tokenizer
  - Whitespace Tokenizer
  - WordPunct Tokenizer
- Default value: wordpunct_tokenizer
Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
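
Most of these tokenizers have a one-line plain Python equivalent, which makes their differences easy to compare. The sketch below uses only the standard library and a made-up sample string; it illustrates the splitting rules and is not the widget's own code.

```python
import re

s = "Good muffins\tcost $3.88\nin New York."

space_tokens = s.split(" ")                        # Space Tokenizer
tab_tokens = s.split("\t")                         # Tab Tokenizer
char_tokens = list(s)                              # Char Tokenizer
whitespace_tokens = s.split()                      # Whitespace Tokenizer
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", s)  # Word Punct Tokenizer

print(whitespace_tokens)
print(wordpunct_tokens)
```

Note how the Whitespace Tokenizer keeps `$3.88` and `York.` intact, while the Word Punct Tokenizer separates punctuation from the surrounding characters.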

Widget: Text Tiling Tokenizer¶

Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.

Parameter: Pseudosentence size (Pseudosentence size.)
- Default value: 20
Parameter: Size (Size (in sentences) of the block used in the block comparison method.)
- Default value: 10
Parameter: Stopwords (A list of stopwords that are filtered out (defaults to NLTK’s stopwords corpus). Example: the, a)
- Default value: None
Parameter: Smoothing width (The width of the window used by the smoothing method.)
- Default value: 2
Parameter: Smoothing rounds (The number of smoothing passes.)
- Default value: 1
Parameter: Similarity method (The method used for determining similarity scores: Block comparison (default) or Vocabulary introduction.)
- Possible values:
  - Block comparison
  - Vocabulary introduction
- Default value: BLOCK_COMPARISON
Parameter: Cutoff policy (The policy used to determine the number of boundaries: HC (default) or LC.)
- Possible values:
  - HC
  - LC
- Default value: HC
Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)

Widget: Treebank Word Tokenizer¶

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.

This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

This tokenizer performs the following steps:

- split standard contractions, e.g. don't -> do n't and they'll -> they 'll
- treat most punctuation characters as separate tokens
- split off commas and single quotes, when followed by whitespace
- separate periods that appear at the end of line

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks.'''
>>> TreebankWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "They'll save and invest more."
>>> TreebankWordTokenizer().tokenize(s)
['They', "'ll", 'save', 'and', 'invest', 'more', '.']

NB. this tokenizer assumes that the text is presented as one sentence per line, where each line is delimited with a newline character. The only periods to be treated as separate tokens are those appearing at the end of a line.

Output: Tokenizer

Widget: Tokenizer Hub¶

Apply the tokenizer object to the Annotated Document Corpus (adc):

- first, select only annotations of type input_annotation,
- apply the tokenizer,
- create new annotations output_annotation with the outputs of the tokenizer.

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: Tokenizer (Python dictionary containing the Tokenizer object and its arguments.)
Parameter: Annotation to be tokenized (Which annotated part of the document is to be split.)
- Default value: TextBlock
Parameter: Annotation to be produced (How to annotate the newly discovered tokens.)
- Default value: Token
Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Example usage: LBD workflows for outlier detection

Category POS Tagging¶

Category Latino¶

Category Advanced¶

Category Nltk¶

Widget: NLTK Corpus to ADC Format¶

Extract tagged sentences in PTB format from an NLTK corpus and convert them to ADC format.

Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
Parameter: Annotation name (Give the name to the annotations from the Penn Treebank Format, for example, ‘POS Tag’ or ‘Lemma’. This annotation will be tagged in the ADC corpus under this name.)
- Default value: POS Tag
Output: Annotated Document Corpus

Widget: Perceptron POS tagger¶

Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal with a fix that makes it not crash on 0 length tokens during training. The implementation is identical to the one implemented in NLTK 3.1.0 and later if you do not consider the fix.

Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
Output: POS Tagger (A python dictionary containing the POS tagger)
Example usage: POS tagger extrinsic evaluation in gender classification task

Widget: POS Affix Tagger¶

A tagger that chooses a token’s tag based on a leading or trailing substring of its word string. (It is important to note that these substrings are not necessarily “true” morphological affixes). In particular, a fixed-length substring of the word is looked up in a table, and the corresponding tag is returned. Affix taggers are typically constructed by training them on a tagged corpus.

Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
Input: Backoff Tagger (A backoff tagger, to be used by the new tagger if it encounters an unknown context.)
Parameter: Affix Length (The length of the affixes that should be considered during training and tagging. Use negative numbers for suffixes.)
- Default value: -3
Parameter: Cutoff (If the most likely tag for a context occurs fewer than cutoff times, then exclude it from the context-to-tag table for the new tagger.)
- Default value: 0
Parameter: Minimum Stem Length (Any words whose length is less than min_stem_length+abs(affix_length) will be assigned a tag of None by this tagger.)
- Default value: 2
Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)
Example usage: POS tagging classification evaluation (copy)
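
The interplay of Affix Length, Minimum Stem Length and Cutoff can be sketched in a few lines of plain Python. This is an illustration of the lookup-table idea described above, not NLTK's implementation; the helper names `affix_context` and `train_affix_table` and the tiny training sample are made up, and the cutoff comparison follows the parameter description given here.

```python
from collections import Counter, defaultdict


def affix_context(word, affix_length=-3, min_stem_length=2):
    """Return the affix used as lookup key, or None if the word is too short."""
    if len(word) < min_stem_length + abs(affix_length):
        return None
    # Negative lengths select a suffix, positive lengths a prefix.
    return word[affix_length:] if affix_length < 0 else word[:affix_length]


def train_affix_table(tagged_sents, affix_length=-3, min_stem_length=2, cutoff=0):
    """Map each affix to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            context = affix_context(word, affix_length, min_stem_length)
            if context is not None:
                counts[context][tag] += 1
    table = {}
    for context, tag_counts in counts.items():
        tag, hits = tag_counts.most_common(1)[0]
        if hits >= cutoff:  # drop tags seen fewer than `cutoff` times
            table[context] = tag
    return table


train = [[("running", "VBG"), ("jumping", "VBG"), ("quickly", "RB"), ("cat", "NN")]]
table = train_affix_table(train)
print(table)
print(table.get(affix_context("singing")))  # tag looked up via the suffix "ing"
print(affix_context("dog"))                 # too short, so no affix is used
```

A word that yields no affix or an unseen affix gets no tag from this table, which is where the Backoff Tagger input comes in.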

Widget: POS Brill’s rule-based Tagger¶

Brill’s transformational rule-based tagger. Brill taggers use an initial tagger (such as tag.DefaultTagger) to assign an initial tag sequence to a text; and then apply an ordered list of transformational rules to correct the tags of individual tokens. These transformation rules are specified by the BrillRule interface.

Brill taggers can be created directly, from an initial tagger and a list of transformational rules; but more often, Brill taggers are created by learning rules from a training corpus, using either BrillTaggerTrainer or FastBrillTaggerTrainer.

Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
Input: Initial Tagger (The initial tagger. Brill taggers use an initial tagger (such as DefaultTagger) to assign an initial tag sequence to a text.)
Parameter: Max Rules (The maximum number of transformations to be created)
- Default value: 200
Parameter: Min Score (The minimum acceptable net error reduction that each transformation must produce in the corpus.)
- Default value: 2
Parameter: Templates (Templates to be used in training. Options:
- nltkdemo18: Return 18 templates, from the original nltk demo, in multi-feature syntax
- nltkdemo18plus: Return 18 templates, from the original nltk demo, and additionally a few multi-feature ones (the motivation is easy comparison with nltkdemo18)
- brill24: Return 24 templates of the seminal TBL paper, Brill (1995)
- fntbl37: Return 37 templates taken from the postagging task of the fntbl distribution http://www.cs.jhu.edu/~rflorian/fntbl/ (37 is after excluding a handful which do not condition on Pos[0]; fntbl can do that but the current nltk implementation cannot.))
- Possible values:
  - brill24
  - fntbl37
  - nltkdemo18
  - nltkdemo18plus
- Default value: brill24
Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)
Example usage: POS tagging classification evaluation (copy)

Widget: POS Classifier-based Tagger¶

A sequential tagger that uses a classifier to choose the tag for each token in a sentence. The featureset input for the classifier is generated by a feature detector function:

feature_detector(tokens, index, history) -> featureset

Where tokens is the list of unlabeled tokens in the sentence; index is the index of the token for which feature detection should be performed; and history is a list of the tags for all tokens before index.

Construct a new classifier-based sequential tagger.

Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
Input: Backoff Tagger (A backoff tagger, to be used by the new tagger if it encounters an unknown context.)
Input: Classifier (The classifier that should be used by the tagger. This is useful if you want to use a manually constructed classifier for POS tagging.)
Parameter: Cutoff Probability (If specified, then this tagger will fall back on its backoff tagger if the probability of the most likely tag is less than cutoff_prob.)
Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)
Example usage: POS tagging classification evaluation (copy)
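
To make the feature detector signature concrete, here is a minimal sketch of a function with that shape. The particular features (lowercased word, three-character suffix, capitalization, previous tag) and the `<START>` marker are illustrative choices, not the features used by the widget or by NLTK's default detector.

```python
def feature_detector(tokens, index, history):
    """Build a featureset for tokens[index]; history holds the tags assigned so far."""
    word = tokens[index]
    return {
        "word": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[:1].isupper(),
        # Tag of the previous token, or a marker at the start of the sentence.
        "prev_tag": history[index - 1] if index > 0 else "<START>",
    }


tokens = ["The", "dog", "barks"]
history = ["DT", "NN"]  # tags already chosen for the tokens before index 2
features = feature_detector(tokens, 2, history)
print(features)
```

The classifier then receives one such featureset per token and predicts the tag, which is appended to `history` before the next token is processed.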

Widget: POS Default Tagger¶

A tagger that assigns the same tag to every token.

>>> from nltk.tag.sequential import DefaultTagger
>>> default_tagger = DefaultTagger('NN')
>>> default_tagger.tag('This is a test'.split())
[('This', 'NN'), ('is', 'NN'), ('a', 'NN'), ('test', 'NN')]

This tagger is recommended as a backoff tagger, in cases where a more powerful tagger is unable to assign a tag to the word (e.g. because the word was not seen during training).

Parameter: Default tag (The default tag “-None-”. Set this to a different tag, such as “NN”, to change the default tag.)
- Default value: -None-
Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)
Example usage: POS tagging classification evaluation (copy)

Widget: POS N-gram Tagger¶

A tagger that chooses a token’s tag based on its word string and on the tags of the preceding n words. In particular, a tuple (tags[i-n:i-1], words[i]) is looked up in a table, and the corresponding tag is returned. N-gram taggers are typically trained on a tagged corpus.

Train a new NgramTagger using the given training data or the supplied model. In particular, construct a new tagger whose table maps from each context (tag[i-n:i-1], word[i]) to the most frequent tag for that context. But exclude any contexts that are already tagged perfectly by the backoff tagger.

Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
Input: Backoff Tagger (A backoff tagger, to be used by the new tagger if it encounters an unknown context.)
Parameter: N-gram (N-gram is a contiguous sequence of n items from a given sequence of text or speech.)
- Default value: 1
Parameter: Cutoff (If the most likely tag for a context occurs fewer than cutoff times, then exclude it from the context-to-tag table for the new tagger.)
- Default value: 0
Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)
Example usage: POS tagging classification evaluation (copy)
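
The table construction described for the POS N-gram Tagger can be sketched in plain Python. This is only an illustration of the idea, not the NLTK or TextFlows implementation: the helper names (context_of, train_ngram_table, tag_sentence) are hypothetical, the backoff tagger is modelled as a simple function from word to tag, and the context is assumed to hold the n-1 preceding tags, so that N-gram = 1 looks at the word alone.

```python
from collections import Counter, defaultdict


def context_of(tags_so_far, word, i, n):
    """Context = (the n-1 preceding tags, the current word). Assumed layout."""
    return (tuple(tags_so_far[max(0, i - n + 1):i]), word)


def train_ngram_table(tagged_sents, n=1, cutoff=0, backoff=None):
    """Map each context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        gold_tags = [tag for _, tag in sent]
        for i, (word, tag) in enumerate(sent):
            counts[context_of(gold_tags, word, i, n)][tag] += 1

    table = {}
    for context, tag_counts in counts.items():
        best_tag, hits = tag_counts.most_common(1)[0]
        if hits < cutoff:
            continue  # most likely tag seen fewer than `cutoff` times
        word = context[1]
        if backoff is not None and set(tag_counts) == {backoff(word)}:
            continue  # the backoff tagger already tags this context perfectly
        table[context] = best_tag
    return table


def tag_sentence(words, table, n=1, backoff=None):
    """Tag left to right, falling back to `backoff` for unknown contexts."""
    tags = []
    for i, word in enumerate(words):
        tag = table.get(context_of(tags, word, i, n))
        if tag is None and backoff is not None:
            tag = backoff(word)
        tags.append(tag)
    return list(zip(words, tags))


train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
default_nn = lambda word: "NN"  # stands in for a default backoff tagger
table = train_ngram_table(train, n=2, backoff=default_nn)
print(tag_sentence("the dog barks loudly".split(), table, n=2, backoff=default_nn))
```

Contexts such as (('DT',), 'dog') never enter the table here, because the stand-in backoff tagger already assigns NN to them.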

Widget: Display Annotation Statistics¶

Displays statistics for a specific annotation or annotation sequence in an ADC corpus. The widget shows annotations ranked by frequency, PMI, or chi-square, together with the scores they achieved.

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Parameter: Statistic type (Choose what kind of statistics you would like to show)
- Possible values:
  - Chi square test
  - frequency
  - PMI of bigrams
  - PMI of trigrams
- Default value: frequency
Parameter: Annotation name (Choose annotation)
- Default value: Token/POS Tag
Parameter: N-gram (Choose what kind of n-gram features you would like to score.)
- Possible values:
  - 1
  - 2
  - 3
  - 4
  - 5
  - 6
- Default value: 1
Outputs: Popup window which shows widget’s results
Example usage: Evaluation of POS 3-gram sequences in gender classification task
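
To make the "PMI of bigrams" statistic more concrete, the following sketch ranks adjacent tag pairs by pointwise mutual information. It assumes the common definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))) with probabilities estimated from raw counts; the exact formula and smoothing used by the widget are not stated in this documentation, and the function name bigram_pmi is hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(sequence):
    """Return (bigram, PMI) pairs sorted from highest to lowest PMI."""
    unigrams = Counter(sequence)
    bigrams = Counter(zip(sequence, sequence[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


pos_tags = ["DT", "NN", "VB", "DT", "NN", "IN", "DT", "JJ", "NN"]
ranked = bigram_pmi(pos_tags)
for bigram, score in ranked:
    print(bigram, round(score, 3))
```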

Widget: POS Tagger Hub¶

TODO

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: POS Tagger (OpenNLP.Tools.PosTagger.EnglishMaximumEntropyPosTagger)
Parameter: Sentence’s Annotation (System.String)
- Default value: Sentence
Parameter: Element’s Annotation (System.String)
- Default value: Token
Parameter: Output Feature Name (System.String)
- Default value: POS Tag
Parameter: Take first k letters from POS tag
- Possible values:
  - 1
  - 2
  - 3
  - all
- Default value: -1
Output: Annotated Document Corpus
Example usage: download_adc_annotations_as_csv

Category Bag of Words¶

Category Latino¶

Category Advanced¶

Widget: Construct BOW Model (Text)¶

Automatically generated widget from function ConstructBowSpace in package latino. The original function signature: ConstructBowSpace.

Input: Textual Documents (Array of strings) (System.Object)
Input: Tokenizer (Latino.TextMining.ITokenizer)
Input: Stemmer or Lemmatizer (Tagger) (Latino.TextMining.IStemmer)
Input: Stopwords (Array of Stopwords) (System.Collections.Generic.List`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Parameter: Maximum N-Gram Length (System.Int32)
- Default value: 2
Parameter: Minimum Word Frequency (System.Int32)
- Default value: 5
Parameter: Word Weighting Type (Latino.TextMining.WordWeightType)
- Possible values:
  - Log Df Tf Idf
  - Term Freq
  - Tf Idf
  - Tf Idf Safe
- Default value: TfIdf
Parameter: Cut Low Weights Percentage (System.Double)
- Default value: 0.2
Parameter: Normalize Vectors (System.Boolean)
- Default value: true
Output: Bag of Words Model
Output: Dataset

Widget: Construct BOW Model and Dataset¶

Automatically generated widget from function ConstructBowSpace in package latino. The original function signature: ConstructBowSpace.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Parameter: Token Annotation (System.String)
- Default value: Token
Parameter: Stem Feature Name (System.String)
- Default value: stem
Parameter: Stopword Feature Name (System.String)
- Default value: stopword
Parameter: Label Document Feature Name (System.String)
- Default value: label
Parameter: Maximum N-Gram Length (System.Int32)
- Default value: 2
Parameter: Minimum Word Frequency (System.Int32)
- Default value: 5
Parameter: Word Weighting Type (Latino.TextMining.WordWeightType)
- Possible values:
  - Log Df Tf Idf
  - Term Freq
  - Tf Idf
  - Tf Idf Safe
- Default value: TfIdf
Parameter: Cut Low Weights Percentage (System.Double)
- Default value: 0.2
Parameter: Normalize Vectors (System.Boolean)
- Default value: true
Output: Bag of Words Model
Output: Dataset

Widget: Construct BOW Model¶

Automatically generated widget from function ConstructBowModel in package latino. The original function signature: ConstructBowModel.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Parameter: Token Annotation (System.String)
- Default value: Token
Parameter: Stem Feature Name (System.String)
- Default value: stem
Parameter: Stopword Feature Name (System.String)
- Default value: stopword
Parameter: Label Document Feature Name (System.String)
- Default value: label
Parameter: Maximum N-Gram Length (System.Int32)
- Default value: 2
Parameter: Minimum Word Frequency (System.Int32)
- Default value: 5
Parameter: Word Weighting Type (Latino.TextMining.WordWeightType)
- Possible values:
  - Log Df Tf Idf
  - Term Freq
  - Tf Idf
  - Tf Idf Safe
- Default value: TfIdf
Parameter: Cut Low Weights Percentage (System.Double)
- Default value: 0.2
Parameter: Normalize Vectors (System.Boolean)
- Default value: true
Output: Bag of Words Model

Widget: Construct BoW Dataset and BoW Model Constructor¶

The Construct BoW Dataset and BoW Model Constructor widget takes an ADC data object as input and generates a sparse BoW model dataset (which can then be handed to, e.g., a classifier). The widget also takes several user-defined parameters, such as weighting type, minimum word frequency, and n-gram length.

Besides the sparse BoW model dataset this widget also outputs a BowModelConstructor instance. This additional object contains settings which allow repetition of the feature construction steps on another document corpus. These settings include the inputted parameters, as well as the learned term weights and vocabulary.

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: Controlled Vocabulary (List of terms which will be used for building the vocabulary. The ‘Maximum N-gram Length’ parameter of this widget is also applied to the vocabulary. The final vocabulary is the intersection of the controlled vocabulary and the dataset vocabulary.)
Parameter: Token Annotation (This is the type of Annotation instances, which mark parts of the document (e.g., words, sentences or terms), which will be used for generating the vocabulary and the dataset.)
- Default value: Token
Parameter: Feature Name (If present, the model will be constructed out of annotations’ feature values instead of document text. For example, this is useful when we wish to build the BoW model using stems instead of the original word forms.)
- Default value: Stem
Parameter: Stopword Feature Name (This is an annotation feature name which was used to tag tokens as stop words. These tokens will be excluded from the BoW representational model. If blank, no stop words will be used.)
- Default value: StopWord
Parameter: Label Document Feature Name (This is the name of the document feature which will be used for class labeling examples in the dataset. If blank, the generated sparse dataset will be unlabeled.)
- Default value: Labels
Parameter: Maximum N-Gram Length (The upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that 1 <= n <= max_ngram will be used.)
- Default value: 2
Parameter: Minimum Word Frequency (When building the vocabulary, ignore terms that have a term frequency strictly lower than the given threshold. This value is also called cut-off in the literature.)
- Default value: 5
Parameter: Word Weighting Type (The user can select among various weighting models for assigning weights to features)
- Possible values:
  - Log Df TF-IDF
  - Term Frequency
  - TF-IDF
  - TF-IDF Safe
- Default value: tf_idf
Parameter: Cut Low Weights Percentage (System.Double)
- Default value: 0.2
Parameter: Normalize Vectors (The weighting methods can be further modified by vector normalization. If the user opts to use it, TextFlows performs L2 normalization.)
- Default value: true
Output: Bag of Words Model Constructor (Bag of Words Model Constructor (BowModelConstructor) gathers utilities to build feature vectors from annotated document corpus.)
Output: BOW Model Dataset (Sparse BOW feature vectors.)
Example usage: Outlier document detection
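
The interplay of the parameters above (maximum n-gram length, minimum word frequency, TF-IDF weighting and L2 normalization) and the role of the reusable model constructor can be sketched as follows. This is a simplified illustration under stated assumptions, not the TextFlows code: the function names are hypothetical, the weighting is the textbook tf * log(N / df), the "learned" state is reduced to a term-to-IDF dictionary, and the Cut Low Weights step is left out.

```python
import math
from collections import Counter


def extract_terms(tokens, max_ngram=2):
    """All n-grams with 1 <= n <= max_ngram, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_ngram + 1)
            for i in range(len(tokens) - n + 1)]


def fit_bow(docs, max_ngram=2, min_freq=5):
    """Learn the vocabulary and IDF weights (the reusable part of the model)."""
    term_lists = [extract_terms(doc, max_ngram) for doc in docs]
    term_freq = Counter(t for terms in term_lists for t in terms)
    doc_freq = Counter(t for terms in term_lists for t in set(terms))
    # Keep only terms whose corpus frequency is not below the threshold.
    return {t: math.log(len(docs) / doc_freq[t])
            for t, count in term_freq.items() if count >= min_freq}


def transform(doc, idf, max_ngram=2, normalize=True):
    """Build one sparse TF-IDF vector using a previously learned model."""
    tf = Counter(t for t in extract_terms(doc, max_ngram) if t in idf)
    vector = {t: count * idf[t] for t, count in tf.items() if idf[t] > 0}
    if normalize:
        norm = math.sqrt(sum(w * w for w in vector.values()))
        if norm > 0:
            vector = {t: w / norm for t, w in vector.items()}
    return vector


corpus = [
    ["gene", "expression", "data"],
    ["gene", "therapy", "data"],
    ["protein", "expression"],
]
idf = fit_bow(corpus, max_ngram=2, min_freq=2)
dataset = [transform(doc, idf) for doc in corpus]
print(dataset)

# The same learned settings can be applied to a document from another corpus.
print(transform(["gene", "editing"], idf))
```

Keeping fit_bow and transform separate mirrors the two outputs of the widget: the dataset for the training corpus and an object that repeats the same feature construction elsewhere.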

Widget: Construct BoW Model Constructor¶

The Construct BoW Model Constructor widget takes an ADC data object as input and generates a BowModelConstructor instance. This object contains settings which allow repetition of the feature construction steps on another document corpus. These settings include the inputted parameters, as well as the learned term weights and vocabulary. The widget also takes several user-defined parameters, such as weighting type, minimum word frequency, and n-gram length.

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: Controlled Vocabulary (List of terms which will be used for building the vocabulary. The ‘Maximum N-gram Length’ parameter of this widget is also applied to the vocabulary. The final vocabulary is the intersection of the controlled vocabulary and the dataset vocabulary.)
Parameter: Token Annotation (This is the type of Annotation instances, which mark parts of the document (e.g., words, sentences or terms), which will be used for generating the vocabulary and the dataset.)
- Default value: Token
Parameter: Feature Name (If present, the model will be constructed out of annotations’ feature values instead of document text. For example, this is useful when we wish to build the BoW model using stems instead of the original word forms.)
- Default value: Stem
Parameter: Stopword Feature Name (This is an annotation feature name which was used to tag tokens as stop words. These tokens will be excluded from the BoW representational model. If blank, no stop words will be used.)
- Default value: StopWord
Parameter: Label Document Feature Name (This is the name of the document feature which will be used for class labeling examples in the dataset. If blank, the generated sparse dataset will be unlabeled.)
- Default value: Labels
Parameter: Maximum N-Gram Length (The upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that 1 <= n <= max_ngram will be used.)
- Default value: 2
Parameter: Minimum Word Frequency (When building the vocabulary, ignore terms that have a term frequency strictly lower than the given threshold. This value is also called cut-off in the literature.)
- Default value: 5
Parameter: Word Weighting Type (The user can select among various weighting models for assigning weights to features)
- Possible values:
  - Log Df TF-IDF
  - Term Frequency
  - TF-IDF
  - TF-IDF Safe
- Default value: tf_idf
Parameter: Cut Low Weights Percentage (System.Double)
- Default value: 0.2
Parameter: Normalize Vectors (The weighting methods can be further modified by vector normalization. If the user opts to use it, TextFlows performs L2 normalization.)
- Default value: true
Output: Bag of Words Model Constructor (Bag of Words Model Constructor (BowModelConstructor) gathers utilities to build feature vectors from annotated document corpus.)
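
The effect of the Controlled Vocabulary input can be pictured as a set intersection. The sketch below is an assumption about how the restriction might be applied: the function name restrict_vocabulary is hypothetical, and n-gram length is measured here simply by counting space-separated words.

```python
def restrict_vocabulary(dataset_vocab, controlled_vocab, max_ngram=2):
    """Keep only dataset terms that also appear in the controlled vocabulary."""
    # The maximum n-gram length is applied to the controlled vocabulary too.
    allowed = {t for t in controlled_vocab if len(t.split()) <= max_ngram}
    return sorted(set(dataset_vocab) & allowed)


dataset_vocab = ["gene", "gene expression", "cell", "protein binding site"]
controlled_vocab = ["gene expression", "protein binding site", "enzyme"]
final_vocab = restrict_vocabulary(dataset_vocab, controlled_vocab, max_ngram=2)
print(final_vocab)
```

In this example "protein binding site" is dropped because it is a 3-gram, and "enzyme" is dropped because it does not occur in the dataset.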

Category Chunking¶

Category Nltk¶

Widget: N-gram Chunker¶

Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
Input: Backoff Chunker (A backoff chunker, to be used by the new chunker if it encounters an unknown context.)
Parameter: N-gram (N-gram is a contiguous sequence of n items from a given sequence of text or speech.)
- Default value: 1
Output: Chunker (A python dictionary containing the Chunker object and its arguments.)

Widget: Regex Chunker¶

A grammar based chunk parser. chunk.RegexpParser uses a set of regular expression patterns to specify the behavior of the parser. The chunking of the text is encoded using a ChunkString, and each rule acts by modifying the chunking in the ChunkString. The rules are all implemented using regular expression matching and substitution.

A grammar contains one or more clauses in the following form:

NP:
  {<DT|JJ>}          # chunk determiners and adjectives
  }<[\.VI].*>+{      # chink any tag beginning with V, I, or .
  <.*>}{<DT>         # split a chunk at a determiner
  <DT|JJ>{}<NN.*>    # merge chunk ending with det/adj
                     # with one starting with a noun

The patterns of a clause are executed in order. An earlier pattern may introduce a chunk boundary that prevents a later pattern from executing. Sometimes an individual pattern will match on multiple, overlapping extents of the input. As with regular expression substitution more generally, the chunker will identify the first match possible, then continue looking for matches after this one has ended.

The clauses of a grammar are also executed in order. A cascaded chunk parser is one having more than one clause. The maximum depth of a parse tree created by this chunk parser is the same as the number of clauses in the grammar.
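
The idea of matching regular expressions against a sequence of POS tags can be illustrated without NLTK. The sketch below is not chunk.RegexpParser; it only mimics a single chunk rule, roughly {<DT>?<JJ>*<NN.*>+}, by encoding the tags as one string and scanning it left to right, taking the first possible match and then continuing after it. The function name chunk_noun_phrases is hypothetical, and the pattern requires at least one noun so that it can never match an empty span.

```python
import re

# One noun-phrase rule: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+")


def chunk_noun_phrases(tagged):
    """Return the words of every non-overlapping NP match, left to right."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Convert character offsets back to token positions by counting tags.
        start = tag_string[:match.start()].count("<")
        length = match.group().count("<")
        chunks.append([word for word, _ in tagged[start:start + length]])
    return chunks


sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
            ("at", "IN"), ("the", "DT"), ("cat", "NN")]
np_chunks = chunk_noun_phrases(sentence)
print(np_chunks)
```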

Parameter: Grammar (Grammar: a set of regular expression patterns to specify the behavior of the parser)
- Default value:
  NP: {<DT>? <JJ>* <NN>*} # NP
  P: {<IN>} # Preposition
  V: {<V.*>} # Verb
  PP: {<P> <NP>} # PP -> P NP
  VP: {<V> <NP|PP>*} # VP -> V (NP|PP)*
Output: Chunker (A python dictionary containing the Chunker object and its arguments.)

Widget: Chunking Hub¶

TODO

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: Chunker (Chunker which will be used to parse the text into chunks.)
Parameter: Sentence’s Annotation (System.String)
- Default value: Sentence
Parameter: Element’s Annotation (Tokens whose features will be used for tagging.)
- Default value: Token
Parameter: POS Feature Name (Element Annotations’ POS Tag Feature Names )
- Default value: POS Tag
Parameter: Output Feature Name (System.String)
- Default value: IOB Tag
Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))

Widget: Extract Annotations from IOB tags¶

TODO

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Parameter: Sentence’s Annotation (Tokens which will be used to group element annotations.)
- Default value: Sentence
Parameter: Element’s Annotation (Tokens whose features will be used in extraction.)
- Default value: Token
Parameter: IOB Feature Name (Element Annotations’ IOB Tag Feature Names )
- Default value: IOB Tag
Parameter: POS Feature Name (Element Annotations’ POS Tag Feature Names )
- Default value: POS Tag
Parameter: Grammar Labels to be extracted (Grammar labels which will be extracted from the text as new annotations (NP,PP,VP), separated by a comma. NP - noun phrases, VP - verb phrases.)
- Default value: NP,VP
Parameter: Annotation to be produced (The prefix for annotations of newly discovered tokens. Annotation names will be constructed as a combination of this prefix and the label type, e.g. “Chunk_NP”.)
- Default value: Chunk
Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
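
Since this widget is not described further, the following sketch shows one plausible reading of its parameters: consecutive tokens carrying B-/I- tags with the same label are grouped, only the requested grammar labels are kept, and each group is named with the prefix plus the label. The function name extract_chunks and the tuple output are assumptions for illustration; the actual widget writes annotations into the ADC.

```python
def extract_chunks(tokens, iob_tags, labels=("NP", "VP"), prefix="Chunk"):
    """Group tokens by IOB tag and return (annotation name, text) pairs."""
    chunks = []      # finished (label, [tokens]) pairs
    current = None   # chunk being built, or None when outside a chunk
    for token, iob in zip(tokens, iob_tags):
        marker, _, label = iob.partition("-")
        if marker == "I" and current is not None and current[0] == label:
            current[1].append(token)  # continue the open chunk
            continue
        if current is not None:
            chunks.append(current)    # close the previous chunk
        # "B" always opens a chunk; a stray "I" is treated as an opener too.
        current = (label, [token]) if marker in ("B", "I") else None
    if current is not None:
        chunks.append(current)
    return [(f"{prefix}_{label}", " ".join(words))
            for label, words in chunks if label in labels]


tokens = ["the", "dog", "chased", "a", "cat", "quickly"]
iob = ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP", "O"]
annotations = extract_chunks(tokens, iob)
print(annotations)
```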

Category Stemming¶

Category Latino¶

Category Advanced¶

Widget: Stemming Tagger Hub¶

Tags the given annotated document corpus with the given tagger.

Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
Input: Token Tagger (Token Annotation of the token to be tagged. If the feature name is also given, then the feature value of the selected token will be tagged. Usage: 1. TokenName 2. TokenName/FeatureName If multiple taggers are used, then one line per tagger must be specified.)
Parameter: Token Annotation (System.String)
- Default value: Token
Parameter: Output Feature Name (System.String)
- Default value: stem
Output: Annotated Document Corpus

Category Nltk¶

Widget: ISRI Stemmer¶

ISRI Arabic stemmer based on the algorithm described in “Arabic Stemming without a root dictionary”, Information Science Research Institute, University of Nevada, Las Vegas, USA. A few minor modifications have been made to the basic ISRI algorithm.

See the source code of this module for more information. isri.stem(token) returns Arabic root for the given token. The ISRI Stemmer requires that all tokens have Unicode string types. If you use Python IDLE on Arabic Windows you have to decode text first using Arabic ‘1256’ coding.

Output: Stemmer (Tagger)
Example usage: Stemmer and Lemmatizer classification evaluation

Widget: Snowball Stemmer¶

The following languages are supported:

Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.

The algorithm for English is documented here: Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.

The algorithms have been developed by Martin Porter. These stemmers are called Snowball, because Porter created a programming language with this name for creating new stemming algorithms. There is more information available at http://snowball.tartarus.org/

Parameter: Language (The following languages are supported: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.)
- Possible values:
  - Danish
  - Dutch
  - English
  - Finnish
  - French
  - German
  - Hungarian
  - Italian
  - Norwegian
  - Portuguese
  - Romanian
  - Russian
  - Spanish
  - Swedish
- Default value: danish
Parameter: Ignore stopwords (If set to True, stopwords are not stemmed and returned unchanged. Set to False by default.)
Output: Stemmer (Tagger)
Example usage: Stemmer and Lemmatizer classification evaluation

Widget: Lancaster Stemmer¶

A word stemmer based on the Lancaster stemming algorithm.

>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('maximum')     # Remove "-um" when word is intact
'maxim'
>>> st.stem('presumably')  # Don't remove "-um" when word is not intact
'presum'
>>> st.stem('multiply')    # No action taken if word ends with "-ply"
'multiply'
>>> st.stem('provision')   # Replace "-sion" with "-j" to trigger "j" set of rules
'provid'
>>> st.stem('owed')        # Word starting with vowel must contain at least 2 letters
'ow'
>>> st.stem('ear')         # ditto
'ear'
>>> st.stem('saying')      # Words starting with consonant must contain at least 3
'say'
>>> st.stem('crying')      #     letters and one of those letters must be a vowel
'cry'
>>> st.stem('string')      # ditto
'string'
>>> st.stem('meant')       # ditto
'meant'
>>> st.stem('cement')      # ditto
'cem'

Output: Stemmer (Tagger)
Example usage: Stemmer and Lemmatizer classification evaluation

Widget: Porter Stemmer¶

This is the Porter stemming algorithm, ported to Python from the version coded up in ANSI C by the author. It follows the algorithm presented in

Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.

only differing from it at the points marked --DEPARTURE-- and --NEW-- in the source code.

For a more faithful version of the Porter algorithm, see http://www.tartarus.org/~martin/PorterStemmer/

Output: Stemmer (Tagger)
Example usage: Stemmer and Lemmatizer classification evaluation

Widget: Stem/Lemma Tagger Hub¶

Tags the given annotated document corpus with the given tagger.

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: Token Tagger (Token Annotation of the token to be tagged. If the feature name is also given, then the feature value of the selected token will be tagged. Usage: 1. TokenName 2. TokenName/FeatureName If multiple taggers are used, then one line per tagger must be specified.)
Parameter: Token Annotation (System.String)
- Default value: Token
Parameter: POS Annotation (Name of Part of Speech annotation in ADC corpus if ADC corpus contains part of speech tags. Used by wordnet lemmatizer which uses POS tags for lemma prediction.)
- Default value: POS Tag
Parameter: Output Feature Name (System.String)
- Default value: Stem
Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Example usage: LBD workflows for outlier detection

Category Chunking¶

Widget: Chunking Hub¶

TODO

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: Chunker (TODO)
Parameter: Input Feature Name (System.String)
- Default value: POS Tag
Parameter: Output Feature Name (System.String)
- Default value: Chunk
Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))

Category Dataset¶

Category Latino¶

Widget: Add Labels¶

Automatically generated widget from function AddLabelsToDocumentVectors in package latino. The original function signature: AddLabelsToDocumentVectors.

Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Input: Labels (Array of Strings) (System.Collections.Generic.List`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Output: Dataset

Widget: Extract Labels¶

Automatically generated widget from function ExtractDatasetLabels in package latino. The original function signature: ExtractDatasetLabels.

Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Output: Labels (Array of Strings)

Widget: Remove Labels¶

Automatically generated widget from function RemoveDocumentVectorsLabels in package latino. The original function signature: RemoveDocumentVectorsLabels.

Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Output: Dataset

Widget: Split¶

Automatically generated widget from function DatasetSplitSimple in package latino. The original function signature: DatasetSplitSimple.

Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Parameter: Percentage (System.Double)
- Default value: 10
Parameter: Random Seed (-1 for a random (time-dependent) seed)
- Default value: -1
Output: Dataset with Extracted Set
Output: Dataset of Remaining Sets
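
A percentage split with a reproducible seed can be sketched as below. This is a plain-Python illustration, not the Latino DatasetSplitSimple function: the name split_dataset is hypothetical, and the rounding of the extracted set size is an assumption.

```python
import random


def split_dataset(examples, percentage=10.0, seed=-1):
    """Return (extracted, remaining); seed == -1 means a time-dependent seed."""
    rng = random.Random(None if seed == -1 else seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_extracted = round(len(examples) * percentage / 100.0)
    picked = set(indices[:n_extracted])
    extracted = [ex for i, ex in enumerate(examples) if i in picked]
    remaining = [ex for i, ex in enumerate(examples) if i not in picked]
    return extracted, remaining


data = list(range(20))
extracted, remaining = split_dataset(data, percentage=10, seed=42)
print(len(extracted), len(remaining))
```

With a fixed seed the same examples are extracted on every run, which keeps downstream evaluation results comparable.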

Widget: Split to Predefined Sets¶

Automatically generated widget from function DatasetSplitPredefined in package latino. The original function signature: DatasetSplitPredefined.

Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Input: Sets (List with predefined set numbers) (System.Int32[])
Input: SetId (System.Int32)
Output: Dataset with Extracted Set
Output: Dataset of Remaining Sets
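
One plausible reading of the Sets and SetId inputs is that Sets holds one set number per example and SetId selects which set to extract. The sketch below follows that assumption; the function name split_predefined is hypothetical.

```python
def split_predefined(examples, sets, set_id):
    """Extract the examples whose predefined set number equals set_id."""
    extracted = [ex for ex, s in zip(examples, sets) if s == set_id]
    remaining = [ex for ex, s in zip(examples, sets) if s != set_id]
    return extracted, remaining


docs = ["d0", "d1", "d2", "d3", "d4", "d5"]
set_numbers = [0, 1, 2, 0, 1, 2]
test_set, train_set = split_predefined(docs, set_numbers, set_id=1)
print(test_set, train_set)
```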

Widget: Dataset to Object¶

Automatically generated widget from function DatasetToObject in package latino. The original function signature: DatasetToObject.

Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Output: Standard Object Representation of Dataset (List<Tuple<int,string,Dictionary<int,double>>> explained as: (List of Examples)<(Example Tuple)<(Id) int,(Label) string,(BOW Dictionary)<(Word Id) int,(Word Weight) double>>>)

Category Stop Words¶

Category Latino¶

Category Advanced¶

Category Nltk¶

Widget: Stop Word Tagger¶

Constructs a python stop word tagger object.

Input: Stop Words (A list or string (stop words separated by new lines) of stop words.)
Parameter: Ignore Case (If true, then words are marked as stop words regardless of their casing.)
- Default value: true
Output: Stop Word Tagger (A python dictionary containing the StopWordTagger object and its arguments.)
Example usage: Simple Document Preprocessing

Widget: Stop Word Tagger Hub¶

Apply the stop_word_tagger object on the Annotated Document Corpus (adc):

- first, select only annotations of type Token Annotation (element_annotation),
- apply the stop word tagger,
- create new annotations output_feature with the outputs of the stop word tagger.

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: Stop Word Tagger (A python dictionary containing the stop word tagger object and its arguments.)
Parameter: Token Annotation (Which annotated part of document to be searched for stopwords.)
- Default value: Token
Parameter: Output Feature Name (How to annotate the newly discovered stop word features.)
- Default value: StopWord
Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Example usage: LBD workflows for outlier detection
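
The two stop word widgets can be pictured as a small factory plus an application step. The sketch below is a simplified stand-in, not the TextFlows StopWordTagger: the function name make_stop_word_tagger is hypothetical, and tokens are plain strings rather than ADC annotations.

```python
def make_stop_word_tagger(stop_words, ignore_case=True):
    """Build a function that tells whether a token is a stop word."""
    if isinstance(stop_words, str):
        stop_words = stop_words.splitlines()  # one stop word per line
    words = {w.strip() for w in stop_words if w.strip()}
    if ignore_case:
        words = {w.lower() for w in words}

    def tagger(token):
        key = token.lower() if ignore_case else token
        return key in words

    return tagger


tagger = make_stop_word_tagger("the\nand\nof", ignore_case=True)
tokens = ["The", "cat", "and", "dog"]
# The hub step: attach the tagger output to every selected token.
tagged = [(token, tagger(token)) for token in tokens]
print(tagged)
```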

Category Similarity Matrix¶

Category Latino¶

Widget: Calculate Similarity Matrix¶

Automatically generated widget from function CalculateSimilarityMatrix in package latino. The original function signature: CalculateSimilarityMatrix.

Input: Dataset (Latino.Model.IUnlabeledExampleCollection`1[[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Input: Dataset (Latino.Model.IUnlabeledExampleCollection`1[[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Parameter: Similarity Threshold (System.Double)
- Default value: 0
Parameter: Full Matrix (not only Lower Triangular) (System.Boolean)
- Default value: true
Output: Similarity Matrix
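
How the Similarity Threshold and Full Matrix parameters might interact is sketched below for sparse vectors stored as dictionaries. Several points are assumptions: cosine similarity is used as the measure, entries are kept only when strictly above the threshold, and the lower-triangular option keeps entries with column index not greater than row index. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {index: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(rows, cols, threshold=0.0, full=True):
    """Return a sparse matrix {(i, j): similarity} of entries above threshold."""
    entries = {}
    for i, u in enumerate(rows):
        for j, v in enumerate(cols):
            if not full and j > i:
                continue  # skip the upper triangle
            sim = cosine(u, v)
            if sim > threshold:
                entries[(i, j)] = sim
    return entries


vectors = [{0: 1.0}, {0: 1.0, 1: 1.0}, {1: 1.0}]
full_matrix = similarity_matrix(vectors, vectors)
lower_matrix = similarity_matrix(vectors, vectors, full=False)
print(len(full_matrix), len(lower_matrix))
```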

Category Clustering¶

Category Latino¶

Widget: KMeans Clusterer¶

Automatically generated widget from function ConstructKMeansClusterer in package latino. The original function signature: ConstructKMeansClusterer.

Parameter: K (Number of Clusters) (System.Int32)
- Default value: 10
Parameter: Centroid Type (Latino.Model.CentroidType)
- Possible values:
  - Avg
  - Nrm L2
  - Sum
- Default value: NrmL2
Parameter: Similarity Measure (LatinoInterfaces.SimilarityModel)
- Possible values:
  - Cosine
  - Dot Product
- Default value: Cosine
Parameter: Random Seed (-1: Use Always Different) (System.Int32)
- Default value: -1
Parameter: Eps (System.Double)
- Default value: 0.0005
Parameter: Trials (Num of Initializations) (System.Int32)
- Default value: 1
Output: Clusterer

Category Scikit¶

Widget: k-Means¶

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.

Parameter: Number of clusters (The number of clusters to form as well as the number of centroids to generate.)
- Default value: 8
Parameter: Max iterations (Maximum number of iterations of the k-means algorithm for a single run.)
- Default value: 300
Parameter: Tolerance (Relative tolerance with regards to inertia to declare convergence.)
- Default value: 1e-4
Output: Clustering
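
The inertia criterion and the role of the Max iterations and Tolerance parameters can be illustrated with a tiny version of Lloyd's algorithm. This is a teaching sketch, not the scikit-learn implementation: the initial centroids are simply the first k points, and the tolerance is applied to the total squared centroid shift rather than scaled as a relative value. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=300, tol=1e-4):
    """Return (labels, centroids, inertia) after Lloyd iterations."""
    centroids = [list(p) for p in points[:k]]  # naive initialisation
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        new_centroids = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        shift = sum(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break  # converged
    labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
    # Inertia: within-cluster sum of squared distances to the centroids.
    inertia = sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, inertia


points = [(0, 0), (10, 10), (0, 1), (10, 11)]
labels, centroids, inertia = kmeans(points, k=2)
print(labels, centroids, inertia)
```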

Category Classification¶

Category Latino¶

Widget: SVM Binary Classifier¶

Automatically generated widget from function ConstructSvmBinaryClassifier in package latino. The original function signature: ConstructSvmBinaryClassifier.

Parameter: C (zero implies default value ([avg. x*x]^-1))
- Default value: 0
Parameter: Biased Hyperplane (System.Boolean)
- Default value: true
Parameter: Kernel Type (Latino.Model.SvmLightKernelType)
- Possible values:
  - Linear
  - Polynomial
  - Radial Basis Function
  - Sigmoid
- Default value: Linear
Parameter: Kernel Parameter Gamma (System.Double)
- Default value: 1
Parameter: Kernel Parameter D (System.Double)
- Default value: 1
Parameter: Kernel Parameter S (System.Double)
- Default value: 1
Parameter: Kernel Parameter C (System.Double)
- Default value: 0
Parameter: Eps (System.Double)
- Default value: 0.001
Parameter: Max Iterations (System.Int32)
- Default value: 100000
Parameter: Custom Parameter String (System.String)
Output: Classifier

Widget: Maximum Entropy Classifier¶

Automatically generated widget from function ConstructMaximumEntropyClassifier in package latino. The original function signature: ConstructMaximumEntropyClassifier.

Parameter: Move Data (System.Boolean)
- Default value: false
Parameter: Num of Iterations (System.Int32)
- Default value: 100
Parameter: CutOff (System.Int32)
- Default value: 0
Parameter: Num of Threads (System.Int32)
- Default value: 1
Parameter: Normalize (System.Boolean)
- Default value: false
Output: Classifier
Example usage: Classifier evaluation

Widget: Maximum Entropy Fast Classifier¶

Automatically generated widget from function ConstructMaximumEntropyClassifierFast in package latino. The original function signature: ConstructMaximumEntropyClassifierFast.

Parameter: Move Data (System.Boolean)
- Default value: false
Parameter: Num of Iterations (System.Int32)
- Default value: 100
Parameter: CutOff (System.Int32)
- Default value: 0
Parameter: Num of Threads (System.Int32)
- Default value: 1
Parameter: Normalize (System.Boolean)
- Default value: false
Output: Classifier

Widget: Accuracy Claculation¶

Automatically generated widget from function AccuracyClaculation in package latino. The original function signature: AccuracyClaculation.

Input: True Labels (System.Collections.IList)
Input: Predicted Labels (System.Collections.IList)
Output: Accuracy
Output: Statistics (Statistics:confusionMatrix: the first level of the confusion matrix dictionary represents the true labels (first input), while the second, inner level represents the predicted labels (second input). Statistics:additinalScores: the dictionary key is the label that was treated as positive for the calculation, and the dictionary value holds the additional scores.)
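
The nested layout of the confusion matrix can be illustrated in a few lines. This is a sketch only, not the widget's implementation: the helper name `accuracy_and_confusion` is hypothetical, and it assumes that the outer dictionary level is keyed by true label and the inner level by predicted label, as described above.

```python
from collections import defaultdict


def accuracy_and_confusion(true_labels, predicted_labels):
    """Return (accuracy, confusion) where confusion[true][predicted] is a count."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    confusion = defaultdict(lambda: defaultdict(int))
    correct = 0
    for true, predicted in zip(true_labels, predicted_labels):
        confusion[true][predicted] += 1
        if true == predicted:
            correct += 1
    accuracy = correct / len(true_labels) if true_labels else 0.0
    # Convert to plain dicts so the result prints and compares cleanly.
    return accuracy, {true: dict(row) for true, row in confusion.items()}


y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
acc, cm = accuracy_and_confusion(y_true, y_pred)
```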

Widget: Cross Validation¶

Automatically generated widget from function CrossValidation in package latino. The original function signature: CrossValidation.

Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Parameter: Num of Sets (System.Int32)
- Default value: 10
Parameter: Assign Sets Randomly (System.Boolean)
- Default value: true
Parameter: Use Seed for Random (System.Boolean)
- Default value: false
Parameter: Random Seed (System.Int32)
- Default value: 0
Output: Data Object with results

Widget: Cross Validation (Predefined Splits)¶

Automatically generated widget from function CrossValidationPredefSplits in package latino. The original function signature: CrossValidationPredefSplits.

Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Input: Sets (List with predefined set numbers) (System.Collections.Generic.List`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Output: Data Object with results
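
One way to read the Sets input is as a list that gives each example the number of the set it belongs to; each set then serves once as the test set while the remaining sets form the training set. The sketch below illustrates that reading. The function name `folds_from_set_ids` and the 0-based example indexes are assumptions, not confirmed behaviour of the widget.

```python
def folds_from_set_ids(set_ids):
    """Yield (set_id, train_indexes, test_indexes) for each distinct set id."""
    for fold in sorted(set(set_ids)):
        test = [i for i, s in enumerate(set_ids) if s == fold]
        train = [i for i, s in enumerate(set_ids) if s != fold]
        yield fold, train, test


# Five examples assigned to three sets.
splits = list(folds_from_set_ids([0, 1, 0, 2, 1]))
```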

Widget: Multiple Splits Validation¶

Automatically generated widget from function CrossValidationPredefMultiSplits in package latino. The original function signature: CrossValidationPredefMultiSplits.

Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Input: Multiple Set Indexes (Dictionary with multiple predefined split element indexes. {“train0”:[1,2,3],”test0”:[4,5],”train1”:[2,3,4],”test1”:[5,6]})
Output: Data Object with results
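
The dictionary format shown above pairs each "trainN" key with a "testN" key. A minimal sketch of grouping such a dictionary into (train, test) pairs follows; the helper `pair_splits` is hypothetical and only illustrates the key naming used in the example.

```python
import re


def pair_splits(multi_split_indexes):
    """Group 'trainN'/'testN' keys into {N: (train_indexes, test_indexes)}."""
    pairs = {}
    for key, indexes in multi_split_indexes.items():
        match = re.fullmatch(r"(train|test)(\d+)", key)
        if match is None:
            raise ValueError("unexpected key: %r" % key)
        kind, number = match.group(1), int(match.group(2))
        train, test = pairs.get(number, ([], []))
        pairs[number] = (indexes, test) if kind == "train" else (train, indexes)
    return pairs


example = {"train0": [1, 2, 3], "test0": [4, 5], "train1": [2, 3, 4], "test1": [5, 6]}
paired = pair_splits(example)
```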

Widget: Predict Classification¶

Automatically generated widget from function PredictClassification in package latino. The original function signature: PredictClassification.

Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Output: Prediction(s)
Output: Labeled dataset

Category Nltk¶

Widget: Naive Bayes Classifier¶

A classifier based on the Naive Bayes algorithm. In order to find the probability for a label, this algorithm first uses the Bayes rule to express P(label|features) in terms of P(label) and P(features|label):

P(label|features) = P(label) * P(features|label) / P(features)

The algorithm then makes the ‘naive’ assumption that all features are independent, given the label:

P(label|features) = P(label) * P(f1|label) * ... * P(fn|label) / P(features)

Rather than computing P(features) explicitly, the algorithm just calculates the numerator for each label, and normalizes them so they sum to one:

P(label|features) = P(label) * P(f1|label) * ... * P(fn|label) / SUM[l]( P(l) * P(f1|l) * ... * P(fn|l) )

Parameter: Normalize (System.Boolean)
- Default value: false
Parameter: Log Sum Exp Trick (System.Boolean)
- Default value: true
Output: Classifier
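
The normalization step can be carried out in log space, which is presumably what the Log Sum Exp Trick parameter refers to. The sketch below works under that assumption; the function `naive_bayes_posteriors` and the toy probabilities are illustrative and are not taken from NLTK.

```python
import math


def naive_bayes_posteriors(log_priors, log_likelihoods, features):
    """Return {label: P(label|features)} normalized with the log-sum-exp trick."""
    # Unnormalized log score: log P(label) + sum over i of log P(f_i|label).
    scores = {
        label: log_priors[label] + sum(log_likelihoods[label][f] for f in features)
        for label in log_priors
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(scores.values())
    log_total = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {label: math.exp(s - log_total) for label, s in scores.items()}


log_priors = {"spam": math.log(0.5), "ham": math.log(0.5)}
log_likelihoods = {
    "spam": {"free": math.log(0.8), "meeting": math.log(0.1)},
    "ham": {"free": math.log(0.1), "meeting": math.log(0.6)},
}
posteriors = naive_bayes_posteriors(log_priors, log_likelihoods, ["free"])
```

Here the numerators are 0.5 * 0.8 = 0.4 for "spam" and 0.5 * 0.1 = 0.05 for "ham", so the normalized posterior for "spam" is 0.4 / 0.45.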

Category Scikit¶

Widget: Decision Tree Classifier¶

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

Parameter: Max features (The number of features to consider when looking for the best split: If int, then consider max_features features at each split. If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.)
- Default value: auto
Parameter: Max depth (The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. )
- Default value: 100
Output: Classifier
Example usage: LBD workflows for outlier detection
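
The rules listed for Max features can be written down as a small function. This is a sketch of the rules as stated in the parameter description, not scikit-learn's own code; the helper name `resolve_max_features` and the lower bound of one feature are assumptions.

```python
import math


def resolve_max_features(max_features, n_features):
    """Translate a Max features setting into a number of features per split."""
    if max_features is None:
        return n_features
    if max_features in ("auto", "sqrt"):
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        # A float is read as a fraction of all features.
        return max(1, int(max_features * n_features))
    raise ValueError("unsupported max_features: %r" % (max_features,))


auto_count = resolve_max_features("auto", 100)
log2_count = resolve_max_features("log2", 64)
fraction_count = resolve_max_features(0.5, 10)
```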

Widget: k-Nearest Neighbours Classifier¶

Classifier implementing the k-nearest neighbors vote.

Parameter: Number of neighbors (Number of neighbors to use by default for k_neighbors queries.)
- Default value: 5
Parameter: Algorithm (Algorithm used to compute the nearest neighbors: ‘ball_tree’ will use BallTree; ‘kd_tree’ will use KDTree; ‘brute’ will use a brute-force search; ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit method. Note: fitting on sparse input will override the setting of this parameter, using brute force.)
- Possible values:
  - ball tree
  - brute
  - kd tree
  - most appropriate (automatically)
- Default value: auto
Parameter: Weights (Weight function used in prediction. Possible values: ‘uniform’: uniform weights, all points in each neighborhood are weighted equally. ‘distance’: weight points by the inverse of their distance, so closer neighbors of a query point have a greater influence than neighbors which are further away. [callable]: a user-defined function which accepts an array of distances and returns an array of the same shape containing the weights. Uniform weights are used by default.)
- Possible values:
  - distance
  - uniform
- Default value: uniform
Output: Classifier
Example usage: Classifier evaluation
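
The difference between the two Weights settings is easiest to see on a small vote. The following sketch assumes the k nearest neighbors have already been found; the function `knn_vote` and the example data are illustrative and are not part of scikit-learn.

```python
from collections import defaultdict


def knn_vote(neighbors, weights="uniform"):
    """Pick a label from (distance, label) pairs of the k nearest points."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            totals[label] += 1.0
        elif weights == "distance":
            # A zero distance would divide by zero, so it gets infinite weight.
            totals[label] += 1.0 / distance if distance > 0 else float("inf")
        else:
            raise ValueError("unknown weights: %r" % weights)
    return max(totals, key=totals.get)


nearest = [(0.1, "a"), (2.0, "b"), (2.5, "b")]
uniform_label = knn_vote(nearest, "uniform")
distance_label = knn_vote(nearest, "distance")
```

With uniform weights the two "b" neighbors outvote the single "a"; with distance weights the very close "a" neighbor (weight 10) outweighs both "b" neighbors (0.5 + 0.4).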

Widget: Logistic regression Classifier¶

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

Parameter: Penalty (Used to specify the norm used in the penalization.)
- Possible values:
  - l1
  - l2
- Default value: l1
Parameter: C (Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.)
- Default value: 1.0
Output: Classifier
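
As a brief illustration of the logistic function mentioned above, the sketch below maps a linear score to a probability for the binary case. The names `logistic` and `predict_probability` are illustrative; regularization (the Penalty and C parameters) affects how the weights are learned and is not shown.

```python
import math


def logistic(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, x):
    """Probability of the positive class for feature vector x."""
    score = bias + sum(w * v for w, v in zip(weights, x))
    return logistic(score)


p_even = predict_probability([1.0, -1.0], 0.0, [2.0, 2.0])
p_high = predict_probability([1.0, -1.0], 0.0, [5.0, 0.0])
```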

Widget: Multinomial Naive Bayes Classifier¶

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

Parameter: Alpha (Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). )
- Default value: 1.0
Parameter: Fit prior (Whether to learn class prior probabilities or not. If false, a uniform prior will be used.)
Output: Classifier
Example usage: Outlier document detection
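
The effect of the Alpha parameter can be shown with the usual additive smoothing estimate, (count + alpha) / (total + alpha * number_of_features), applied to the feature counts of one class. The helper `smoothed_feature_probs` and the counts are made up for illustration.

```python
def smoothed_feature_probs(counts, alpha=1.0):
    """Additive (Laplace/Lidstone) smoothing of per-class feature counts."""
    total = sum(counts.values())
    n_features = len(counts)
    return {
        feature: (count + alpha) / (total + alpha * n_features)
        for feature, count in counts.items()
    }


counts = {"good": 3, "bad": 0, "ok": 1}
smoothed = smoothed_feature_probs(counts, alpha=1.0)
unsmoothed = smoothed_feature_probs(counts, alpha=0.0)
```

With alpha = 1.0 the unseen feature "bad" gets probability 1/7 instead of 0, which keeps a single unseen word from zeroing out the whole product of likelihoods.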

Widget: SVM Classifier¶

Implementation of a Support Vector Machine classifier using libsvm: the kernel can be non-linear, but its SMO algorithm does not scale to large numbers of samples as LinearSVC does. Furthermore, the SVC multi-class mode is implemented using a one-vs-one scheme, while LinearSVC uses one-vs-the-rest.

Parameter: C (Penalty parameter C of the error term.)
- Default value: 1.0
Parameter: Degree (Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.)
- Default value: 3
Parameter: Kernel (Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix.)
- Possible values:
  - linear
  - poly
  - precomputed
  - rbf
  - sigmoid
- Default value: rbf
Output: Classifier
Example usage: POS tagger intrinsic evaluation - experiment 1

Widget: SVM Linear Classifier¶

Similar to Support Vector Classification with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better (to large numbers of samples).

Parameter: C (Penalty parameter C of the error term.)
- Default value: 1.0
Parameter: Loss (Specifies the loss function. ‘l1’ is the hinge loss (standard SVM) while ‘l2’ is the squared hinge loss.)
- Possible values:
  - l1
  - l2
- Default value: l2
Parameter: Penalty (Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.)
- Possible values:
  - l1
  - l2
- Default value: l2
Parameter: Multi class (Determines the multi-class strategy if y contains more than two classes. ovr trains n_classes one-vs-rest classifiers, while crammer_singer optimizes a joint objective over all classes. Although crammer_singer is interesting from a theoretical perspective because it is consistent, it is seldom used in practice, rarely leads to better accuracy, and is more expensive to compute. If crammer_singer is chosen, the options loss, penalty and dual will be ignored.)
- Possible values:
  - crammer singer
  - ovr
- Default value: ovr
Output: Classifier
Example usage: Classifier evaluation
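
The two Loss options differ only in how a margin violation is penalized. A minimal sketch for a single example with label y in {-1, +1} and decision score f(x) follows; the function names are illustrative.

```python
def hinge_loss(y, score):
    """Hinge loss ('l1' in the widget): max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * score)


def squared_hinge_loss(y, score):
    """Squared hinge loss ('l2' in the widget): max(0, 1 - y * f(x)) ** 2."""
    return hinge_loss(y, score) ** 2


inside_margin = (hinge_loss(1, 0.5), squared_hinge_loss(1, 0.5))
misclassified = (hinge_loss(-1, 0.5), squared_hinge_loss(-1, 0.5))
```

Examples beyond the margin (y * f(x) >= 1) cost nothing under either loss, while the squared variant penalizes large violations more heavily.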

Widget: Apply Classifier Hub¶

TODO

Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Parameter: Calculate class probabilities (Calculate classification class probabilities. May slow down algorithm prediction.)
- Default value: true
Output: Prediction(s)
Output: Labeled dataset

Widget: Train Classifier Hub¶

Automatically generated widget from function TrainClassifier in package latino. The original function signature: TrainClassifier.

Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
Output: Classifier

Category Lexicology¶

Category Controlled Vocabularies¶

Category Literature Based Discovery¶

Category Heuristic Calculation¶

Widget: Calculate Term Heuristics Scores¶

Calculate all input heuristics.

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: Bag of Words Model (Bag of Words Model Constructor (BowModelConstructor) gathers utilities to build feature vectors from annotated document corpus.)
Input: Heuristic or Heuristic list (List of names of the heuristics whose scores will be calculated.)
Output: Heuristic Scores (Calculated B-Term Heuristic Scores)
Example usage: Literature Based Discovery (overview with vocab)

Category Heuristic Specification¶

Category Term ranking and Exploration¶

Widget: Explore in CrossBee¶

Explore heuristic scores and terms in CrossBee.

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Input: Bag of Words Model Constructor (Bag of Words Model Constructor )
Input: BOW Model Dataset (Sparse BOW feature vectors)
Input: B-terms (List of bridging terms)
Input: Heuristic Scores (Calculated B-term)
Parameter: CrossBee API URL (URL of the CrossBee API for exploring external data. Data to be displayed in CrossBee will be available at a TextFlows URL. This URL will be sent to the CrossBee API by replacing the “{dataurl.json}” string in the supplied CrossBee API URL.)
- Default value: http://crossbee.ijs.si/Home/ImportFromJson
Parameter: Primary Heuristic Index (Index of the primary heuristic to be analyzed as an ensemble)
- Default value: 0
Output: Serialized Annotated Document Corpus (Serialized Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Output: Vocabulary
Output: Heuristic Scores (Calculated B-Term Heuristic Scores)
Output: B-terms (List of bridging terms)
Output: Serialized BOW Model Dataset (Serialized sparse BOW feature vectors)
Output: Primary Heuristic Index (Index of the primary heuristic to be analyzed as an ensemble)
Example usage: LBD workflows for outlier detection

Category Helpers¶

Category Tagging¶

Widget: Condition Tagger¶

Automatically generated widget from function ConstructConditionTagger in package latino. The original function signature: ConstructConditionTagger.

Parameter: Feature Condition (Condition which tokens to include based on their features. Format examples: -Feature1 (don’t include tokens with Feature1 set to any value) -Feature1=Value1 (don’t include tokens with Feature1 set to the value Value1) -Feature1 +Feature2 (don’t include tokens with Feature1 set unless it has also Feature2 set) -Feature1=Value1 +Feature2 (don’t include tokens with Feature1 set to Value1 unless it has also Feature2 set to any value)...)
Parameter: output Feature Value (System.String)
- Default value: true
Parameter: Put token/feature text as the output feature value (If set to true, then the token text or the token’s feature text is assigned as the output feature value)
Output: Tagger

Widget: Random Cross Validation Sets¶

Automatically generated widget from function RandomCrossValidationSets in package latino. The original function signature: RandomCrossValidationSets.

Input: Example List (Not required, but if set, then it overrides parameter ‘numOfExamples’ and len(examples) is used for ‘numOfExamples’. This should be a type implementing Count, Count() or Length.)
Parameter: Num of Examples (This determines the length of the set id list. If input ‘examples’ is set then len(examples) is used for ‘numOfExamples’ and this setting is overridden.)
- Default value: 100
Parameter: Num of Sets (System.Int32)
- Default value: 10
Parameter: Assign Sets Randomly (System.Boolean)
- Default value: true
Parameter: Use Seed for Random (System.Boolean)
- Default value: false
Parameter: Random Seed (System.Int32)
- Default value: 0
Output: Example SetIds List
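
To illustrate what a set id list might look like, the sketch below assigns each example one of Num of Sets ids. It is not the Latino implementation: the function name, the 0-based ids, and the round-robin assignment used when sets are not assigned randomly are all assumptions.

```python
import random


def cross_validation_set_ids(num_examples, num_sets=10, assign_randomly=True,
                             use_seed=False, seed=0):
    """Return one set id per example, with set sizes differing by at most one."""
    set_ids = [i % num_sets for i in range(num_examples)]
    if assign_randomly:
        # A fixed seed makes the assignment reproducible across runs.
        rng = random.Random(seed) if use_seed else random.Random()
        rng.shuffle(set_ids)
    return set_ids


ids = cross_validation_set_ids(10, num_sets=3, use_seed=True, seed=42)
ids_again = cross_validation_set_ids(10, num_sets=3, use_seed=True, seed=42)
```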

Widget: Random Sequential Validation Sets¶

Automatically generated widget from function RandomSequentialValidationSets in package latino. The original function signature: RandomSequentialValidationSets.

Input: Example List (Not required, but if set, then it overrides parameter ‘numOfExamples’ and len(examples) is used for ‘numOfExamples’. This should be a type implementing Count, Count() or Length.)
Parameter: Num of Examples (This determines the length of the set id list. If input ‘examples’ is set then len(examples) is used for ‘numOfExamples’ and this setting is overridden.)
- Default value: 100
Parameter: Num of Sets (System.Int32)
- Default value: 10
Parameter: Assign Sets Randomly (If not set, then sets are distributed exactly evenly across the whole dataset.)
- Default value: true
Parameter: Use Seed for Random (System.Boolean)
- Default value: false
Parameter: Random Seed (System.Int32)
- Default value: 0
Parameter: Size of Train Set (May be specified as an absolute number or as a number followed by ‘%’ to denote a percentage of the whole dataset.)
- Default value: 40%
Parameter: Size of Test Set (May be specified as an absolute number or as a number followed by ‘%’ to denote a percentage of the whole dataset.)
- Default value: 10%
Parameter: Size of Space Between Train and Test Set (May be specified as an absolute number or as a number followed by ‘%’ to denote a percentage of the whole dataset.)
- Default value: 1%
Output: Multiple Set Indexes
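
The size parameters accept either an absolute number or a percentage. The sketch below shows one way such a value could be interpreted and how a single train/gap/test window could be laid out. The helper names, the rounding of percentages, and the ordering (train block, then the space, then the test block) are assumptions and not confirmed behaviour of the widget.

```python
def parse_size(spec, dataset_size):
    """Interpret '40%' as a share of the dataset and '25' as an absolute count."""
    spec = str(spec).strip()
    if spec.endswith("%"):
        return int(round(float(spec[:-1]) * dataset_size / 100.0))
    return int(spec)


def sequential_split(start, train_size, gap_size, test_size):
    """Lay out a train block, an unused gap, and a test block from index start."""
    train = list(range(start, start + train_size))
    test_start = start + train_size + gap_size
    test = list(range(test_start, test_start + test_size))
    return train, test


train_size = parse_size("40%", 200)
test_size = parse_size("10%", 200)
gap_size = parse_size("1%", 200)
train, test = sequential_split(0, 4, 1, 2)
```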

Widget: C#.NET Snippet¶

Runs a C#.NET snippet. Inputs are available inside the code as the variables “in1” .. “inN”. Whatever you want to output must be assigned to the variable “out1” before the code terminates.

Input: Snippet Input Parameter(s) (input can be accessed as variable “in1” .. “inN” inside the code)
Parameter: C# Snippet Code (Input can be accessed as variable “in1” .. “inN” inside the code and output can be accessed/assigned as variable “out1” inside the code.)
- Default value:
  // This is the C#.NET Code Snippet where you can modify the data.
  // Variable “in1” .. “inN” contains whatever you connected to the input port
  // Input variables are correctly typed.
  // Whatever is assigned to the variable “out1” will be transferred to the output port.
  out1 = in1;
Parameter: Namespace Section (using directives) (System.String)
- Default value:
  using System;
  using System.Collections.Generic;
  using System.Linq;
  using Latino;
  using Latino.TextMining;
  using LatinoInterfaces;
Parameter: Additional References (imports) (System.String)
- Default value:
  System.dll
  System.Xml.dll
  System.Core.dll
  workflowstextflows_dot_netbinLatino.dll
  workflowstextflows_dot_netbinLatinoWorkflows.dll
  workflowstextflows_dot_netbinLatinoInterfaces.dll
Output: out (output can be accessed/assigned as variable “out1” inside the code)
Output: Console Output
Output: Possible compile/runtime errors
Output: Generated Code

Widget: Python Snippet¶

Runs a Python snippet. Inputs are available inside the code as the variables “in1” .. “inN”. Whatever you want to output must be assigned to the variable “out1” before the code terminates.

Input: in (input can be accessed as variable “in1” .. “inN” inside the code)
Parameter: Python Snippet Code (Input can be accessed as variable “in1” .. “inN” inside the code and output can be accessed/assigned as variable “out1” inside the code.)
- Default value:
  # This is the Python Code Snippet where you can modify the data however is needed.
  # Variable “in1” .. “inN” contains whatever you connected to the input port
  # Whatever is assigned to the variable “out1” will be transferred to the output port.
  out1 = in1
Output: out (output can be accessed/assigned as variable “out1” inside the code)
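
As an example of a snippet that does more than pass its input through, the code below drops empty texts and collapses repeated whitespace. To keep the example runnable outside the widget, the first line defines a stand-in for “in1”; inside the widget that variable would be supplied by the input port, and the assumption here is that it holds a list of strings.

```python
# Stand-in for the value the widget would supply on the first input port.
in1 = ["First document.", "Second   document", "   "]

# Snippet body: keep non-empty texts and normalize their whitespace.
out1 = [" ".join(text.split()) for text in in1 if text.strip()]
```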

Category Noise Handling¶

Category Noise Filters¶

Category Performance Evaluation¶

Widget: Classification statistics¶

Calculates various classification statistics from true and predicted labels. Labels can be provided in two ways:

[y_true, y_predicted]

or for folds:

[[y_true_1, y_predicted_1], [y_true_2, y_predicted_2], ...]

Input: True and predicted labels (List of true and predicted labels (see help for details))
Output: Classification accuracy
Output: Precision
Output: Recall
Output: F1 (F1 measure)
Output: AUC
Output: Confusion matrix
Example usage: COMTRADE demo
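
Both input layouts can be reduced to a single pair of label lists before computing statistics. The sketch below pools the folds, which is an assumption (the widget may instead compute statistics per fold and average them); the function names are illustrative.

```python
def flatten_folds(labels):
    """Accept [y_true, y_pred] or [[y_true_1, y_pred_1], ...]; return one pair."""
    first = labels[0]
    is_folds = bool(first) and isinstance(first[0], (list, tuple))
    if not is_folds:
        return list(labels[0]), list(labels[1])
    y_true, y_pred = [], []
    for fold_true, fold_pred in labels:
        y_true.extend(fold_true)
        y_pred.extend(fold_pred)
    return y_true, y_pred


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the label treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


single = flatten_folds([[1, 1, 0, 0], [1, 0, 1, 0]])
folds = flatten_folds([[[1, 1], [1, 0]], [[0, 0], [1, 0]]])
scores = precision_recall_f1(single[0], single[1], positive=1)
```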

Widget: Evaluation Results to 2d Table¶

Table that can be used in workflows with nested loops. You can define names on x and y axis. You can also choose the evaluation metrics that you want to show from a dropdown menu.

Input: Evaluation Results
Parameter: Evaluation metric (Choose the evaluation measurement you would like to show in the table.)
- Possible values:
  - accuracy
  - auc
  - fscore
  - precision
  - recall
- Default value: accuracy
Outputs: Popup window which shows widget’s results
Example usage: POS tagger intrinsic evaluation - experiment 5

Widget: Extract Actual and Predicted features¶

Takes as input an ADC object with predicted features and an ADC object with actual features (gold standard). The output is a list containing a list of predicted features and a list of actual features.

Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
Parameter: Predicted annotation (System.String)
- Default value: POS tag
Parameter: Actual annotation (System.String)
- Default value: POS tag
Parameter: Lowercase (Convert features to lowercase)
- Default value: False
Output: Actual and Predicted Values (List of Actual and Predicted Values)

Category Data In/Out¶

Category Latino¶

Widget: Convert Corpus to XML String¶

Widget: Convert XML String to Corpus¶

Widget: Get Plain Texts¶

Widget: Load Document Corpus From File¶

Widget: Load Document Corpus From String¶

Widget: Get Plain Texts¶

Widget: Load Document Corpus¶

Widget: Load Document Corpus From String¶

Widget: Load PTB Corpus¶

Widget: PTB To ADC Converter¶

Widget: Search with Bing¶

Widget: Crawl URL links¶

Widget: Search with Faroo¶

Widget: Load Document Corpus from MySQL¶

Widget: Load Document Corpus From File¶

Category Triplet Extraction¶

Widget: Triplet Extraction Hub¶

Category Document Corpus¶

Category Latino¶

Widget: Display Document Corpus¶

Widget: Statistics¶

Widget: Extract Feature¶

Widget: Add Feature¶

Widget: Add Computed Feature¶

Widget: Add Set Feature¶

Widget: Split¶

Widget: Extract Documents¶

Widget: Merge Corpora¶

Widget: Add Feature¶

Widget: Display Document Corpus¶

Widget: Extract Documents¶

Widget: Extract Feature¶

Widget: Merge Corpora¶

Widget: NLTK Document Corpus¶

Widget: Split¶

Widget: Statistics¶

Widget: Add Computed Document Features¶

Widget: Add Computed Token Features¶

Widget: Extract ADC Name¶

Widget: Extract NLTK Corpus Name¶

Category Tokenization¶

Category Latino¶

Category Advanced¶

Widget: Split Sentences Hub (Text)¶

Widget: Tokenizer Hub (Text)¶

Widget: Max Entropy Sentence Splitter¶

Widget: Split Sentences Hub¶

Widget: Max Entorpy Tokenizer¶

Widget: Unicode Tokenizer¶

Widget: Regex Tokenizer¶

Widget: Simple Tokenizer¶

Widget: Tokenizer Hub¶

Category Nltk¶

Widget: Line Tokenizer¶

Widget: Regex Tokenizer¶

Widget: S-Expression Tokenizer¶

Widget: Simple Tokenizer¶

Widget: Stanford Tokenizer¶

Widget: Text Tiling Tokenizer¶

Widget: Punkt Sentence Tokenizer¶

Widget: Treebank Word Tokenizer¶

Widget: Tokenizer Hub¶

Category POS Tagging¶

Category Latino¶

Category Advanced¶

Widget: POS Tagger Hub (Text)¶

Widget: Max Entropy POS Tagger¶

Category Nltk¶

Widget: CRF POS tagger¶

Widget: NLP4J POS tagger¶

Widget: NLTK Corpus to ADC Format¶

Widget: NLTK maxent treebank tagger¶

Widget: NLTK perceptron tagger¶

Widget: Perceptron POS tagger¶

Widget: Stanford POS tagger¶

Widget: TNT POS tagger¶

Widget: Tree Tagger¶

Widget: POS Affix Tagger¶