Category Data In/Out

Category Latino

Widget: Convert Corpus to XML String


Automatically generated widget from function SaveADCToXml in package latino. The original function signature: SaveADCToXml.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Output: XML String

Widget: Convert XML String to Corpus


Automatically generated widget from function LoadADCFromXml in package latino. The original function signature: LoadADCFromXml.

  • Input: XML String (System.String)
  • Output: Annotated Document Corpus

Widget: Get Plain Texts


Automatically generated widget from function GetDocStrings in package latino. The original function signature: GetDocStrings.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Parameter: Token Annotation (System.String)
    • Default value: TextBlock
  • Parameter: Feature Condition (Condition determining which tokens to include based on their features. Format examples: -Feature1 (don't include tokens with Feature1 set to any value); -Feature1=Value1 (don't include tokens with Feature1 set to the value Value1); -Feature1 +Feature2 (don't include tokens with Feature1 set unless they also have Feature2 set); -Feature1=Value1 +Feature2 (don't include tokens with Feature1 set to Value1 unless they also have Feature2 set to any value).)
  • Parameter: Delimiter for token concatenation (System.String)
  • Parameter: Include Document Identifier (System.Boolean)
  • Output: Texts

Widget: Load Document Corpus From File


This widget processes a raw text file and loads the texts into the ADC (Annotated Document Corpus) structure. The input file contains one document per line; the whole line represents the text of the document body. If the lines contain additional document properties (e.g., IDs, titles, labels), other widgets should be used to load the ADC structure.

  • Input: Raw Text File (Input Text File: Contains one document per line - the whole line represents text from the body of a document.)
  • Parameter: Text before the first tabulator [\t] represents the title of a document (System.Boolean)
    • Default value: false
  • Parameter: First words in a line (after optional title) with a preceding exclamation mark (!) represent labels (System.Boolean)
    • Default value: false
  • Output: Annotated Document Corpus
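
For illustration, with both options enabled, a hypothetical input line might look like this (where \t stands for a real tabulator character):

    First document title\t!sports !news The rest of the line is the body text of the document.

With both parameters left at false, each whole line is read as the document body.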

Widget: Load Document Corpus From String


This widget processes raw text and loads it into the ADC (Annotated Document Corpus) structure. The input contains one document per line; the whole line represents the text of the document body. If the lines contain additional document properties (e.g., IDs, titles, labels), other widgets should be used to load the ADC structure.

  • Input: String (Input Text String: Contains one document per line - the whole line represents text from the body of a document.)
  • Parameter: Text before the first tabulator [\t] represents the title of a document (System.Boolean)
    • Default value: false
  • Parameter: First words in a line (after optional title) with a preceding exclamation mark (!) represent labels (System.Boolean)
    • Default value: false
  • Output: Annotated Document Corpus

Widget: Get Plain Texts


This widget transforms an Annotated Document Corpus into a string.

  • Input: Annotated Document Corpus (Annotated Document Corpus.)
  • Parameter: Feature Annotation (Select a feature annotation.)
    • Default value: Stem
  • Parameter: Token Annotation (Select token annotation.)
    • Default value: Token
  • Parameter: Delimiter for token concatenation (Delimiter for token concatenation.)
    • Default value:
  • Parameter: Include Document Identifier (Include Document Identifier.)
  • Output: Texts (String with all documents in Annotated Document Corpus.)
  • Example usage: LBD workflows for outlier detection

Widget: Load Document Corpus


This widget processes input text and loads it into the ADC (Annotated Document Corpus) structure. The input text contains one document per line; the whole line represents the text of the document body. If the lines contain additional document properties (e.g., IDs, titles, labels), other widgets should be used to load the ADC structure.

  • Input: Input (Input can be a string (str) or a file (fil).)
  • Parameter: Text before the first tabulator [\t] represents the title of a document (Text before the first tabulator [\t] represents the title of a document.)
    • Default value: false
  • Parameter: First words in a line (after optional title) with a preceding exclamation mark (!) represent labels (First words in a line (after optional title) with a preceding exclamation mark (!) represent labels.)
    • Default value: false
  • Output: Annotated Document Corpus (Annotated Document Corpus.)
  • Example usage: Evaluation of POS 3-gram sequences in gender classification task

Widget: Load Document Corpus From String


This widget processes input text and loads it into the ADC (Annotated Document Corpus) structure. The input text contains one document per line; the whole line represents the text of the document body. If the lines contain additional document properties (e.g., IDs, titles, labels), other widgets should be used to load the ADC structure.

  • Input: String (Input Text String: Contains one document per line - the whole line represents text from the body of a document.)
  • Parameter: Text before the first tabulator [\t] represents the title of a document (Text before the first tabulator [\t] represents the title of a document.)
    • Default value: false
  • Parameter: First words in a line (after optional title) with a preceding exclamation mark (!) represent labels (First words in a line (after optional title) with a preceding exclamation mark (!) represent labels.)
    • Default value: false
  • Output: Annotated Document Corpus (Annotated Document Corpus.)

Widget: Load PTB Corpus


Loads a corpus in Penn Treebank format with part-of-speech or lemma annotations. The corpus should be a directory with .ptb or .txt files, or a single file with one nested tuple per line. Below is an example of the input format:

(ROOT
  (S
    (S
      (VP (VBG Making)
        (NP (NNPS Skittles))))
    (NP (NN vodka))
    (VP (VBZ is)
      (NP (DT a) (JJ fun) (NN way)
        (S
          (VP (TO to)
            (VP (VB add)
              (NP
                (NP (DT a) (NN splash))
                (PP (IN of)
                  (NP (JJ fruity) (NN flavor) (CC and) (NN color))))
              (PP (TO to)
                (NP (JJ regular) (NN vodka))))))))
    (. .)))

The widget returns a list of tokenized sentences with part of speech or lemma tags.

Widget: PTB To ADC Converter


Converts a PTB corpus to a pseudo-ADC corpus. Can be used after the 'Load PTB Corpus' widget.

  • Input: PTB Document Corpus (Corpus in Penn Treebank format)
  • Parameter: Annotation name (The name given to the annotations from the Penn Treebank format, for example 'POS Tag' or 'Lemma'. Annotations will be tagged in the ADC corpus under this name.)
    • Default value: POS Tag
  • Output: Annotated Document Corpus
  • Example usage: POS tagger intrinsic evaluation - experiment 2

Widget: Search with Bing


This widget makes a web search query on the Bing search engine and returns a list of top k URLs.

  • Parameter: Search Query (Search Query.)
  • Parameter: Limit (Limit the number of results.)
    • Default value: 50
  • Output: List of URLs

Widget: Search with Faroo


This widget makes a web search query on the Faroo search engine and returns a list of top k URLs.

  • Parameter: Search Query (Search Query.)
  • Parameter: Limit (Limit the number of results.)
    • Default value: 50
  • Output: List of URLs

Widget: Load Document Corpus from MySQL


This widget processes input text from the specified MySQL database and loads it into the ADC (Annotated Document Corpus) structure.

  • Parameter: DB Username
  • Parameter: DB Password
  • Parameter: Hostname
  • Parameter: Database Name
  • Parameter: Table Name
  • Parameter: Title Column Name
  • Parameter: Text Column Name
  • Parameter: Label Column Name
  • Output: Annotated Document Corpus (Annotated Document Corpus.)

Widget: Load Document Corpus From File


This widget processes a raw text file and loads the texts into the ADC (Annotated Document Corpus) structure. The input file contains one document per line; the whole line represents the text of the document body. If the lines contain additional document properties (e.g., IDs, titles, labels), other widgets should be used to load the ADC structure.

  • Input: Raw Text File (Input Text File: Contains one document per line - the whole line represents text from the body of a document.)
  • Parameter: Text before the first tabulator [\t] represents the title of a document (Text before the first tabulator [\t] represents the title of a document.)
    • Default value: false
  • Parameter: First words in a line (after optional title) with a preceding exclamation mark (!) represent labels (First words in a line (after optional title) with a preceding exclamation mark (!) represent labels.)
    • Default value: false
  • Output: Annotated Document Corpus (Annotated Document Corpus.)

Category Triplet Extraction

Widget: Triplet Extraction Hub


TODO

  • Input: Annotated Document Corpus
  • Parameter: Annotation to be tokenized (Which annotated part of the document is to be split.)
    • Default value: Sentence
  • Parameter: Annotation to be produced (How to annotate the newly discovered tokens.)
    • Default value: Triplet
  • Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Output: Triplets (List of triplets from all documents.)

Category Document Corpus

Category Latino

Widget: Display Document Corpus


Automatically generated widget from function DisplayDocumentCorpus_PYTHON in package latino. The original function signature: DisplayDocumentCorpus_PYTHON.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Outputs: Popup window which shows widget’s results

Widget: Statistics


Automatically generated widget from function CorpusStatistics in package latino. The original function signature: CorpusStatistics.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Output: Number of Documents
  • Output: Number of Features
  • Output: Statistics

Widget: Extract Feature


Automatically generated widget from function ExtractDocumentsFeatures in package latino. The original function signature: ExtractDocumentsFeatures.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Parameter: Extracted Feature Name (System.String)
  • Output: List of Extracted Features

Widget: Add Feature


Automatically generated widget from function AddDocumentsFeatures in package latino. The original function signature: AddDocumentsFeatures.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Input: Feature Values (Array of Labels) (System.Collections.Generic.List`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Parameter: New Feature Name (System.String)
    • Default value: feature
  • Parameter: New Feature Value Prefix (System.String)
  • Output: Annotated Document Corpus

Widget: Add Computed Feature


Automatically generated widget from function AddComputedFeatures in package latino. The original function signature: AddComputedFeatures.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Parameter: New Feature Name (System.String)
    • Default value: feature
  • Parameter: New Feature Computation (System.String)
    • Default value: {feature2:name}{feature3}, {feature1:value}
  • Parameter: Old Features Specification (Comma-separated list of names of old features used in the 'New Feature Computation'.)
    • Default value: feature1, feature2
  • Output: Annotated Document Corpus

Widget: Add Set Feature


Automatically generated widget from function MarkDocumentsWithSetFeature in package latino. The original function signature: MarkDocumentsWithSetFeature.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Parameter: Feature Name (System.String)
    • Default value: set
  • Parameter: Feature Value Prefix (System.String)
  • Parameter: Num of Sets (System.Int32)
    • Default value: 10
  • Parameter: Assign Sets Randomly (System.Boolean)
    • Default value: true
  • Parameter: Use Seed for Random (System.Boolean)
    • Default value: false
  • Parameter: Random Seed (System.Int32)
    • Default value: 0
  • Output: Annotated Document Corpus

Widget: Split


Automatically generated widget from function SplitDocumentsByFeatureValue in package latino. The original function signature: SplitDocumentsByFeatureValue.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Parameter: Feature Condition (System.String)
  • Parameter: Discard The Rest (The Filtered Out) (System.Boolean)
    • Default value: false
  • Output: Filtered Annotated Document Corpus
  • Output: The Rest of Annotated Document Corpus

Widget: Extract Documents


Automatically generated widget from function ExtractDocuments in package latino. The original function signature: ExtractDocuments.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Input: List of Document Indexes to be Extracted (System.Collections.Generic.List`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Parameter: Discard The Rest (The Filtered Out) (System.Boolean)
    • Default value: false
  • Output: Annotated Document Corpus of Extracted Documents
  • Output: Annotated Document Corpus of the Rest of Documents

Widget: Merge Corpora


Automatically generated widget from function JoinDocumentsCorpora in package latino. The original function signature: JoinDocumentsCorpora.

  • Input: Annotated Document Corpus (System.Collections.Generic.List`1[[LatinoInterfaces.DocumentCorpus, LatinoInterfaces, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Output: Merged Annotated Document Corpus

Widget: Add Feature


Add a feature to Annotated Document Corpus.

  • Input: Annotated Document Corpus
  • Input: Feature Values (List of feature values)
  • Parameter: New Feature Name
    • Default value: feature
  • Parameter: New Feature Value Prefix
  • Output: Annotated Document Corpus

Widget: Display Document Corpus


The Display Document Corpus widget displays the ADC (Annotated Document Corpus) structure. It shows a detailed view of the selected document with its annotations.

Widget: Extract Documents


Extract documents, given document indices, from Annotated Document Corpus.

  • Input: List of Document Indexes to be Extracted
  • Input: Annotated Document Corpus (Annotated Document Corpus.)
  • Parameter: Discard The Rest (The Filtered Out)
    • Default value: false
  • Output: Annotated Document Corpus of Extracted Documents
  • Output: Annotated Document Corpus of the Rest of Documents
  • Example usage: LBD workflows for outlier detection

Widget: Extract Feature


Extract document features.

  • Input: Annotated Document Corpus (Annotated Document Corpus.)
  • Parameter: Extracted Feature Name
  • Output: List of Extracted Features

Widget: Merge Corpora


Merge multiple Annotated Document Corpora into one.

  • Input: Annotated Document Corpus
  • Output: Merged Annotated Document Corpus

Widget: NLTK Document Corpus


NLTK corpus readers. The modules in this package provide functions that can be used to read corpus files in a variety of formats. These functions can be used to read both the corpus files that are distributed in the NLTK corpus package, and corpus files that are part of external corpora.

Please see http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml for a complete list. Install corpora using nltk.download().

The corpus object has the following functions:

  • words(): list of str
  • sents(): list of (list of str)
  • paras(): list of (list of (list of str))
  • tagged_words(): list of (str, str) tuples
  • tagged_sents(): list of (list of (str, str))
  • tagged_paras(): list of (list of (list of (str, str)))
  • chunked_sents(): list of (Tree with (str, str) leaves)
  • parsed_sents(): list of (Tree with str leaves)
  • parsed_paras(): list of (list of (Tree with str leaves))
  • xml(): a single XML ElementTree
  • raw(): unprocessed corpus contents

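As a minimal sketch of these corpus readers outside TextFlows (assuming the Brown corpus has been fetched beforehand with nltk.download('brown')):

    from nltk.corpus import brown

    print(brown.words()[:10])        # words(): list of str
    print(brown.sents()[0])          # sents(): first sentence, a list of str
    print(brown.tagged_words()[:5])  # tagged_words(): list of (str, str) tuples
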
  • Parameter: NLTK Document Corpus Name (NLTK Document Corpus Name)

    • Possible values:
      • Brown
      • Cess Esp (Spanish)
      • Floresta
      • NPS chat
      • Treebank
    • Default value: brown
  • Parameter: Corpus Chunk (Define the chunk of the corpus you want. You can define the chunk as a percentage (e.g. '80%') of the corpus, or as the number of sentences from the beginning of the corpus. For example, the value '1000' will return the first 1000 sentences of the corpus. You can also define the chunk you want to discard. For example, '^80%' will discard the first 80% of the corpus and return the last 20%, and '^1000' will discard the first 1000 sentences and return the rest of the corpus.)

    • Default value: 100%
  • Output: NLTK document corpus (NLTK document corpus)

  • Example usage: POS tagger intrinsic evaluation - experiment 2

Widget: Split


Split an Annotated Document Corpus by conditions on features and values.

  • Input: Annotated Document Corpus (Annotated Document Corpus.)
  • Parameter: Feature Condition
  • Parameter: Discard The Rest (The Filtered Out)
  • Output: Filtered Annotated Document Corpus
  • Output: The Rest of Annotated Document Corpus

Widget: Statistics


Statistics of Annotated Document Corpus.

  • Input: Annotated Document Corpus
  • Output: Number of Documents (Number of Documents.)
  • Output: Number of Features (Number of Features.)
  • Output: Statistics (Statistics.)

Widget: Add Computed Document Features


TODO

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Parameter: New Feature Name (System.String)
    • Default value: feature
  • Parameter: New Feature Computation (System.String)
    • Default value: {feature2:name}{feature3}, {feature1:value}
  • Output: Annotated Document Corpus

Widget: Add Computed Token Features


For every annotation of the selected type, generate an additional feature. A feature name entered between { } will be replaced with its value.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Parameter: New Feature Name (System.String)
    • Default value: feature
  • Parameter: Annotation Name (Add features to tokens of this type.)
    • Default value: Token
  • Parameter: New Feature Computation (Values for the new features. A feature name entered between { } will be replaced with its value.)
    • Default value: {Stem}_{POS Tag}
  • Output: Annotated Document Corpus
  • Example usage: COMTRADE demo
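
For instance, with the default computation {Stem}_{POS Tag}, a token with features Stem=run and POS Tag=VBG would receive the new feature value run_VBG.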

Widget: Extract ADC Name


Returns the name of the ADC corpus.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Output: ADC Name

Widget: Extract NLTK Corpus Name


Returns the name of the NLTK corpus.

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
  • Output: NLTK Name

Category Tokenization

Category Latino

Category Advanced

Widget: Split Sentences Hub (Text)


Automatically generated widget from function TokenizeStringString in package latino. The original function signature: TokenizeStringString.

  • Input: Text (System.Object)
  • Input: Tokenizer (Latino.TextMining.ITokenizer)
  • Output: Text

Widget: Tokenizer Hub (Text)


Automatically generated widget from function TokenizeStringWords in package latino. The original function signature: TokenizeStringWords.

  • Input: Text (System.Object)
  • Input: Tokenizer (Latino.TextMining.ITokenizer)
  • Output: String

Widget: Max Entropy Sentence Splitter


Automatically generated widget from function ConstructEnglishMaximumEntropySentenceDetector in package latino. The original function signature: ConstructEnglishMaximumEntropySentenceDetector.

Widget: Split Sentences Hub


Automatically generated widget from function TokenizeSentences in package latino. The original function signature: TokenizeSentences.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Input: Tokenizer (Latino.TextMining.ITokenizer)
  • Parameter: Annotation to be tokenized (Which annotated part of the document is to be split)
    • Default value: TextBlock
  • Parameter: Annotation to be produced (How to annotate found sentences)
    • Default value: Sentence
  • Output: Annotated Document Corpus

Widget: Max Entropy Tokenizer


Automatically generated widget from function ConstructEnglishMaximumEntropyTokenizer in package latino. The original function signature: ConstructEnglishMaximumEntropyTokenizer.

Widget: Unicode Tokenizer


Automatically generated widget from function ConstructUnicodeTokenizer in package latino. The original function signature: ConstructUnicodeTokenizer.

  • Parameter: Filter (Latino.TextMining.TokenizerFilter)
    • Possible values:
      • AlphaLoose: accept tokens that contain at least one alphabetic character
      • AlphanumLoose: accept tokens that contain at least one alphanumeric character
      • AlphanumStrict: accept tokens that contain alphanumeric characters only
      • AlphaStrict: accept tokens that contain alphabetic characters only
      • None: accept all tokens
    • Default value: None
  • Parameter: Minimal Token Length (System.Int32)
    • Default value: 1
  • Output: Tokenizer

Widget: Regex Tokenizer


Automatically generated widget from function ConstructRegexTokenizer in package latino. The original function signature: ConstructRegexTokenizer.

  • Parameter: Regular Expression (System.String)
    • Default value: \p{L}+(-\p{L}+)*
  • Parameter: Ignore Unknown Tokens (System.Boolean)
  • Parameter: Ignore Case (System.Boolean)
  • Parameter: Multiline (System.Boolean)
  • Parameter: Explicit Capture (System.Boolean)
  • Parameter: Compiled (System.Boolean)
  • Parameter: Singleline (System.Boolean)
  • Parameter: Ignore Pattern Whitespace (System.Boolean)
  • Parameter: Right To Left (System.Boolean)
  • Parameter: ECMA Script (System.Boolean)
  • Parameter: Culture Invariant (System.Boolean)
  • Output: Tokenizer

Widget: Simple Tokenizer


Automatically generated widget from function ConstructSimpleTokenizer in package latino. The original function signature: ConstructSimpleTokenizer.

  • Parameter: Type (Latino.TextMining.TokenizerType)
    • Possible values:
      • AllChars: equivalent to [^\s]+
      • AlphanumOnly: equivalent to [\p{L}\d]+
      • AlphaOnly: equivalent to \p{L}+
    • Default value: AllChars
  • Parameter: Minimal Token Length (System.Int32)
    • Default value: 1
  • Output: Tokenizer

Widget: Tokenizer Hub


Automatically generated widget from function TokenizeWords in package latino. The original function signature: TokenizeWords.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Input: Tokenizer (Latino.TextMining.ITokenizer)
  • Parameter: Annotation to be tokenized (Which annotated part of the document is to be split)
    • Default value: TextBlock
  • Parameter: Annotation to be produced (How to annotate found tokens)
    • Default value: Token
  • Output: Annotated Document Corpus

Category Nltk

Widget: Line Tokenizer


Tokenize a string into its lines, optionally discarding blank lines.

  • Parameter: Blank Lines (blanklines: Indicates how blank lines should be handled. Options are:
    • discard: strip blank lines out of the token list before returning it.
      A line is considered blank if it contains only whitespace characters.
    • keep: leave all blank lines in the token list.
    • discard-eof: if the string ends with a newline, then do not generate
      a corresponding token '' after that newline.)
    • Possible values:
      • discard
      • discard-eof
      • keep
    • Default value: discard
  • Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
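
A minimal NLTK sketch of this tokenizer:

    from nltk.tokenize import LineTokenizer

    text = "one\n\ntwo\nthree\n"
    print(LineTokenizer(blanklines='discard').tokenize(text))
    # ['one', 'two', 'three']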

Widget: Regex Tokenizer


The Regex Tokenizer splits a string into substrings using a regular expression.

  • Parameter: Regular Expression (The pattern used to build this tokenizer. This pattern may safely contain capturing parentheses.)
    • Default value: \p{L}+(-\p{L}+)*
  • Parameter: Gaps (True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.)
  • Parameter: Discard empty (True if any empty tokens '' generated by the tokenizer should be discarded. Empty tokens can only be generated if Gaps is set.)

  • Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
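
A minimal NLTK sketch (the pattern here is illustrative, not the widget default):

    from nltk.tokenize import RegexpTokenizer

    # match word characters, or currency amounts, or any other non-space run
    tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
    print(tokenizer.tokenize("Good muffins cost $3.88 in New York."))
    # ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']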

Widget: S-Expression Tokenizer


The S-Expression Tokenizer is used to find parenthesized expressions in a string. In particular, it divides a string into a sequence of substrings that are either parenthesized expressions (including any nested parenthesized expressions) or other whitespace-separated tokens.

  • Parameter: Parentheses (A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.)
    • Default value: ()
  • Parameter: Strict (If true, then raise an exception when tokenizing an ill-formed sexpr.)
    • Default value: true
  • Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
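
A minimal NLTK sketch:

    from nltk.tokenize import SExprTokenizer

    print(SExprTokenizer().tokenize('(a b (c d)) e f (g)'))
    # ['(a b (c d))', 'e', 'f', '(g)']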

Widget: Simple Tokenizer


These tokenizers divide strings into substrings using the string split() method.

  • Space Tokenizer - Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').
  • Tab Tokenizer - Tokenize a string using the tab character as a delimiter, the same as s.split('\t').
  • Char Tokenizer - Tokenize a string into individual characters.
  • Whitespace Tokenizer - Tokenize a string on whitespace (space, tab, newline).
  • Blankline Tokenizer - Tokenize a string, treating any sequence of blank lines as a delimiter. Blank lines are defined as lines containing no characters, except for space or tab characters.
  • Word Punct Tokenizer - Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.

  • Parameter: Type (Select a tokenizer.

    Space Tokenizer - Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').

    Tab Tokenizer - Tokenize a string using the tab character as a delimiter, the same as s.split('\t').

    Char Tokenizer - Tokenize a string into individual characters.

    Whitespace Tokenizer - Tokenize a string on whitespace (space, tab, newline).

    Blankline Tokenizer - Tokenize a string, treating any sequence of blank lines as a delimiter. Blank lines are defined as lines containing no characters, except for space or tab characters.

    Word Punct Tokenizer - Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.)

    • Possible values:
      • Blankline Tokenizer
      • Char Tokenizer
      • Space Tokenizer
      • Tab Tokenizer
      • Whitespace Tokenizer
      • WordPunct Tokenizer
    • Default value: wordpunct_tokenizer
  • Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
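
A minimal NLTK sketch of the default WordPunct tokenizer:

    from nltk.tokenize import WordPunctTokenizer

    print(WordPunctTokenizer().tokenize("Good muffins cost $3.88\nin New York."))
    # ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']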

Widget: Stanford Tokenizer


A tokenizer divides text into a sequence of tokens, which roughly correspond to “words”.

  • Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)

Widget: Text Tiling Tokenizer


Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.

  • Parameter: Pseudosentence size (Pseudosentence size.)
    • Default value: 20
  • Parameter: Size (Size (in sentences) of the block used in the block comparison method. )
    • Default value: 10
  • Parameter: Stopwords ( A list of stopwords that are filtered out (defaults to NLTK’s stopwords corpus). Example: the, a)
    • Default value: None
  • Parameter: Smoothing width (The width of the window used by the smoothing method.)
    • Default value: 2
  • Parameter: Smoothing rounds (The number of smoothing passes.)
    • Default value: 1
  • Parameter: Similarity method (The method used for determining similarity scores: Block comparison (default) or Vocabulary introduction.)
    • Possible values:
      • Block comparison
      • Vocabulary introduction
    • Default value: BLOCK_COMPARISON
  • Parameter: Cutoff policy (The policy used to determine the number of boundaries: HC (default) or LC.)
    • Possible values:
      • HC
      • LC
    • Default value: HC
  • Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
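
A minimal NLTK sketch (assumes the stopwords corpus has been fetched with nltk.download('stopwords'); 'document.txt' is a hypothetical file whose paragraphs are separated by blank lines, as TextTiling requires):

    from nltk.tokenize import TextTilingTokenizer

    tt = TextTilingTokenizer(w=20, k=10)  # pseudosentence size, block size
    with open('document.txt') as f:       # hypothetical multi-paragraph document
        sections = tt.tokenize(f.read())  # list of topical sections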

Widget: Punkt Sentence Tokenizer


A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

  • Output: Tokenizer (A python dictionary containing the Tokenizer object and its arguments.)
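
A minimal NLTK sketch (assumes the pre-trained English Punkt model has been fetched with nltk.download('punkt')):

    from nltk.tokenize import sent_tokenize  # uses the pre-trained Punkt model

    text = ("Punkt knows that the periods in Mr. Smith and Johann S. Bach "
            "do not mark sentence boundaries. And sometimes sentences start "
            "with non-capitalized words.")
    print(sent_tokenize(text))  # two sentences, despite the abbreviation periods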

Widget: Treebank Word Tokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.

This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

This tokenizer performs the following steps:

  • split standard contractions, e.g. don't -> do n't and they'll -> they 'll

  • treat most punctuation characters as separate tokens

  • split off commas and single quotes, when followed by whitespace

  • separate periods that appear at the end of line

    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks.'''
    >>> TreebankWordTokenizer().tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.',
    'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
    >>> s = "They'll save and invest more."
    >>> TreebankWordTokenizer().tokenize(s)
    ['They', "'ll", 'save', 'and', 'invest', 'more', '.']
    

NB. this tokenizer assumes that the text is presented as one sentence per line, where each line is delimited with a newline character. The only periods to be treated as separate tokens are those appearing at the end of a line.

  • Output: Tokenizer

Widget: Tokenizer Hub


Apply the tokenizer object on the Annotated Document Corpus (adc):

  1. first select only annotations of type input_annotation,
  2. apply the tokenizer,
  3. create new annotations output_annotation with the outputs of the tokenizer.
  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Tokenizer (Python dictionary containing the Tokenizer object and its arguments.)
  • Parameter: Annotation to be tokenized (Which annotated part of the document is to be split.)
    • Default value: TextBlock
  • Parameter: Annotation to be produced (How to annotate the newly discovered tokens.)
    • Default value: Token
  • Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Example usage: LBD workflows for outlier detection

Category POS Tagging

Category Latino

Category Advanced

Widget: POS Tagger Hub (Text)


Automatically generated widget from function PosTagString in package latino. The original function signature: PosTagString.

  • Input: Text (System.Object)
  • Input: POS Tagger (OpenNLP.Tools.PosTagger.EnglishMaximumEntropyPosTagger)
  • Parameter: Output Feature Name (System.String)
    • Default value: posTag
  • Output: String

Widget: Max Entropy POS Tagger


Automatically generated widget from function ConstructEnglishMaximumEntropyPosTagger in package latino. The original function signature: ConstructEnglishMaximumEntropyPosTagger.

Category Nltk

Widget: CRF POS tagger


CRF part of speech tagger from CRFsuite

Widget: NLP4J POS tagger


POS tagger from NLP4J

Widget: NLTK Corpus to ADC Format


Extracts tagged sentences in PTB format from an NLTK corpus and converts them to ADC format.

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
  • Parameter: Annotation name (The name given to the annotations from the Penn Treebank format, for example 'POS Tag' or 'Lemma'. Annotations will be tagged in the ADC corpus under this name.)
    • Default value: POS Tag
  • Output: Annotated Document Corpus

Widget: NLTK maxent treebank tagger


Maxent treebank tagger from NLTK. It is the tagger NLTK 2 uses when the nltk.pos_tag() function is called.

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
  • Output: POS Tagger (A python dictionary containing the POS tagger )
  • Example usage: POS tagger intrinsic evaluation - experiment 5

Widget: NLTK perceptron tagger


Perceptron tagger from NLTK. It is the tagger NLTK 3.0 uses when the nltk.pos_tag() function is called.

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
  • Output: POS Tagger (A python dictionary containing the POS tagger )
  • Example usage: POS tagger intrinsic evaluation - experiment 1
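
A minimal sketch of calling this tagger through nltk.pos_tag() (assumes nltk.download('averaged_perceptron_tagger') and nltk.download('punkt')):

    import nltk

    tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
    print(nltk.pos_tag(tokens))
    # [('They', 'PRP'), ('refuse', 'VBP'), ..., ('refuse', 'NN'), ('permit', 'NN')]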

Widget: Perceptron POS tagger


Greedy Averaged Perceptron tagger, as implemented by Matthew Honnibal, with a fix that prevents crashes on zero-length tokens during training. Apart from this fix, the implementation is identical to the one in NLTK 3.1.0 and later.

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
  • Output: POS Tagger (A python dictionary containing the POS tagger)
  • Example usage: POS tagger extrinsic evaluation in gender classification task

Widget: Stanford POS tagger


Stanford POS tagger from coreNLP

Widget: TNT POS tagger


TNT part of speech tagger as implemented in NLTK

Widget: Tree Tagger


Tree Tagger by Helmut Schmid

Widget: POS Affix Tagger


A tagger that chooses a token’s tag based on a leading or trailing substring of its word string. (It is important to note that these substrings are not necessarily “true” morphological affixes). In particular, a fixed-length substring of the word is looked up in a table, and the corresponding tag is returned. Affix taggers are typically constructed by training them on a tagged corpus.

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
  • Input: Backoff Tagger (A backoff tagger, to be used by the new tagger if it encounters an unknown context.)
  • Parameter: Affix Length (The length of the affixes that should be considered during training and tagging. Use negative numbers for suffixes.)
    • Default value: -3
  • Parameter: Cutoff (If the most likely tag for a context occurs fewer than cutoff times, then exclude it from the context-to-tag table for the new tagger.)
    • Default value: 0
  • Parameter: Minimum Stem Length (Any words whose length is less than min_stem_length+abs(affix_length) will be assigned a tag of None by this tagger.)
    • Default value: 2
  • Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)
  • Example usage: POS tagging classification evaluation (copy)
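
A minimal NLTK sketch (assumes nltk.download('treebank')):

    from nltk.corpus import treebank
    from nltk.tag import AffixTagger

    train = treebank.tagged_sents()[:500]
    # tag by the last three letters of each word; words shorter than
    # min_stem_length + |affix_length| receive the tag None
    tagger = AffixTagger(train, affix_length=-3, min_stem_length=2)
    print(tagger.tag('the printers are printing'.split()))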

Widget: POS Brill’s rule-based Tagger


Brill's transformational rule-based tagger. Brill taggers use an initial tagger (such as tag.DefaultTagger) to assign an initial tag sequence to a text, and then apply an ordered list of transformational rules to correct the tags of individual tokens. These transformation rules are specified by the BrillRule interface.

Brill taggers can be created directly, from an initial tagger and a list of transformational rules; but more often, Brill taggers are created by learning rules from a training corpus, using either BrillTaggerTrainer or FastBrillTaggerTrainer.

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)

  • Input: Initial Tagger (The initial tagger. Brill taggers use an initial tagger (such as DefaultTagger) to assign an initial tag sequence to a text.)

  • Parameter: Max Rules (The maximum number of transformations to be created)

    • Default value: 200
  • Parameter: Min Score (The minimum acceptable net error reduction that each transformation must produce in the corpus.)

    • Default value: 2
  • Parameter: Templates (Templates to be used in training. Options:

    • nltkdemo18: Return 18 templates, from the original NLTK demo, in multi-feature syntax.
    • nltkdemo18plus: Return 18 templates, from the original NLTK demo, and additionally a few multi-feature ones (the motivation is easy comparison with nltkdemo18).
    • brill24: Return 24 templates of the seminal TBL paper, Brill (1995).
    • fntbl37: Return 37 templates taken from the postagging task of the fntbl distribution http://www.cs.jhu.edu/~rflorian/fntbl/ (37 is after excluding a handful which do not condition on Pos[0]; fntbl can do that but the current NLTK implementation cannot).)
    • Possible values:
      • brill24
      • fntbl37
      • nltkdemo18
      • nltkdemo18plus
    • Default value: brill24
  • Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)

  • Example usage: POS tagging classification evaluation (copy)
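
A minimal sketch using the NLTK 3 API, where the trainer is BrillTaggerTrainer and the template sets listed above live in nltk.tag.brill (assumes nltk.download('treebank')):

    from nltk.corpus import treebank
    from nltk.tag import DefaultTagger, UnigramTagger
    from nltk.tag.brill import brill24
    from nltk.tag.brill_trainer import BrillTaggerTrainer

    train = treebank.tagged_sents()[:500]
    initial = UnigramTagger(train, backoff=DefaultTagger('NN'))
    trainer = BrillTaggerTrainer(initial, brill24())
    tagger = trainer.train(train, max_rules=200, min_score=2)
    print(tagger.tag('the little dog barked'.split()))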

Widget: POS Classifier-based Tagger


A sequential tagger that uses a classifier to choose the tag for each token in a sentence. The featureset input for the classifier is generated by a feature detector function:

feature_detector(tokens, index, history) -> featureset

Where tokens is the list of unlabeled tokens in the sentence; index is the index of the token for which feature detection should be performed; and history is list of the tags for all tokens before index.

Construct a new classifier-based sequential tagger.

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
  • Input: Backoff Tagger (A backoff tagger, to be used by the new tagger if it encounters an unknown context.)
  • Input: Classifier (The classifier that should be used by the tagger. This is useful if you want to use a manually constructed classifier for POS tagging.)
  • Parameter: Cutoff Probability (If specified, then this tagger will fall back on its backoff tagger if the probability of the most likely tag is less than cutoff_prob.)
  • Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)
  • Example usage: POS tagging classification evaluation (copy)
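
A minimal NLTK sketch with the built-in naive Bayes feature classifier (assumes nltk.download('treebank')):

    from nltk.corpus import treebank
    from nltk.tag.sequential import ClassifierBasedPOSTagger

    train = treebank.tagged_sents()[:500]
    tagger = ClassifierBasedPOSTagger(train=train)
    print(tagger.tag('the little dog barked'.split()))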

Widget: POS Default Tagger


A tagger that assigns the same tag to every token.

>>> from nltk.tag.sequential import DefaultTagger
>>> default_tagger = DefaultTagger('NN')
>>> default_tagger.tag('This is a test'.split())
[('This', 'NN'), ('is', 'NN'), ('a', 'NN'), ('test', 'NN')]

This tagger is recommended as a backoff tagger, in cases where a more powerful tagger is unable to assign a tag to the word (e.g. because the word was not seen during training).

  • Parameter: Default tag (The default tag “-None-”. Set this to a different tag, such as “NN”, to change the default tag.)
    • Default value: -None-
  • Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)
  • Example usage: POS tagging classification evaluation (copy)

Widget: POS N-gram Tagger


A tagger that chooses a token's tag based on its word string and on the preceding n words' tags. In particular, a tuple (tags[i-n:i-1], words[i]) is looked up in a table, and the corresponding tag is returned. N-gram taggers are typically trained on a tagged corpus.

Train a new NgramTagger using the given training data or the supplied model. In particular, construct a new tagger whose table maps from each context (tag[i-n:i-1], word[i]) to the most frequent tag for that context. But exclude any contexts that are already tagged perfectly by the backoff tagger.

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
  • Input: Backoff Tagger (A backoff tagger, to be used by the new tagger if it encounters an unknown context.)
  • Parameter: N-gram (N-gram is a contiguous sequence of n items from a given sequence of text or speech.)
    • Default value: 1
  • Parameter: Cutoff (If the most likely tag for a context occurs fewer than cutoff times, then exclude it from the context-to-tag table for the new tagger.)
    • Default value: 0
  • Output: POS Tagger (A python dictionary containing the POS tagger object and its arguments.)
  • Example usage: POS tagging classification evaluation (copy)
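
A minimal NLTK sketch of a bigram tagger backing off to a unigram tagger (assumes nltk.download('treebank')):

    from nltk.corpus import treebank
    from nltk.tag import BigramTagger, UnigramTagger

    train = treebank.tagged_sents()[:500]
    unigram = UnigramTagger(train)                 # n = 1
    bigram = BigramTagger(train, backoff=unigram)  # n = 2, falls back on unknown contexts
    print(bigram.tag('the little dog barked'.split()))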

Widget: Display Annotation Statistics


Display statistics for a specific annotation or annotation sequence in an ADC corpus. The widget shows annotations ranked by frequency, PMI and chi-square, along with the scores they achieved.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Parameter: Statistic type (Choose what kind of statistics you would like to show)
    • Possible values:
      • Chi square test
      • frequency
      • PMI of bigrams
      • PMI of trigrams
    • Default value: frequency
  • Parameter: Annotation name (Choose annotation)
    • Default value: Token/POS Tag
  • Parameter: N-gram (Choose what kind of n-gram features you would like to score.)
    • Possible values:
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
    • Default value: 1
  • Outputs: Popup window which shows widget’s results
  • Example usage: Evaluation of POS 3-gram sequences in gender classification task

Widget: POS Tagger Evaluator


This widget can be used to evaluate NLTK POS taggers. Inputs are a POS tagger and a gold-standard corpus.

  • Input: POS Tagger (OpenNLP.Tools.PosTagger.EnglishMaximumEntropyPosTagger)
  • Input: PTB Document Corpus (Corpus in Penn Treebank format)
  • Output: Actual and predicted labels (List of actual and predicted labels (see help for details))

Widget: POS Tagger Hub


TODO

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: POS Tagger (OpenNLP.Tools.PosTagger.EnglishMaximumEntropyPosTagger)
  • Parameter: Sentence’s Annotation (System.String)
    • Default value: Sentence
  • Parameter: Element’s Annotation (System.String)
    • Default value: Token
  • Parameter: Output Feature Name (System.String)
    • Default value: POS Tag
  • Parameter: Take first k letters from POS tag
    • Possible values:
      • 1
      • 2
      • 3
      • all
    • Default value: -1
  • Output: Annotated Document Corpus
  • Example usage: download_adc_annotations_as_csv

Widget: Extract POS Tagger Name


Returns a string with a pretty POS tagger name.

  • Input: POS Tagger
  • Output: POS Tagger Name

Category Bag of Words

Category Latino

Category Advanced

Widget: Construct BOW Model (Text)


Automatically generated widget from function ConstructBowSpace in package latino. The original function signature: ConstructBowSpace.

  • Input: Textual Documents (Array of strings) (System.Object)
  • Input: Tokenizer (Latino.TextMining.ITokenizer)
  • Input: Stemmer or Lemmatizer (Tagger) (Latino.TextMining.IStemmer)
  • Input: Stopwords (Array of Stopwords) (System.Collections.Generic.List`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Parameter: Maximum N-Gram Length (System.Int32)
    • Default value: 2
  • Parameter: Minimum Word Frequency (System.Int32)
    • Default value: 5
  • Parameter: Word Weighting Type (Latino.TextMining.WordWeightType)
    • Possible values:
      • Log Df Tf Idf
      • Term Freq
      • Tf Idf
      • Tf Idf Safe
    • Default value: TfIdf
  • Parameter: Cut Low Weights Percentage (System.Double)
    • Default value: 0.2
  • Parameter: Normalize Vectors (System.Boolean)
    • Default value: true
  • Output: Bag of Words Model
  • Output: Dataset

Widget: Get Terms


Automatically generated widget from function GetVocabulary in package latino. The original function signature: GetVocabulary.

  • Input: BOW Model (Latino.TextMining.BowSpace)
  • Parameter: Index of First Retrieved Word (System.Int32)
    • Default value: 1
  • Parameter: Maximum Words Retrieved (Use 0 for no limit.)
    • Default value: 0
  • Output: Terms

Widget: Process New Documents (Text)


Automatically generated widget from function ProcessNewDocumentsFromString in package latino. The original function signature: ProcessNewDocumentsFromString.

  • Input: Documents = (Nested) List of Strings (System.Object)
  • Input: Bag of Words Model (Latino.TextMining.BowSpace)
  • Output: Dataset

Widget: Create Term Dataset


Automatically generated widget from function CreateTermDatasetFromAdc in package latino. The original function signature: CreateTermDatasetFromAdc.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Input: Bag of Words Model (Latino.TextMining.BowSpace)
  • Output: Term Dataset

Widget: Construct BOW Model and Dataset


Automatically generated widget from function ConstructBowSpace in package latino. The original function signature: ConstructBowSpace.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Parameter: Token Annotation (System.String)
    • Default value: Token
  • Parameter: Stem Feature Name (System.String)
    • Default value: stem
  • Parameter: Stopword Feature Name (System.String)
    • Default value: stopword
  • Parameter: Label Document Feature Name (System.String)
    • Default value: label
  • Parameter: Maximum N-Gram Length (System.Int32)
    • Default value: 2
  • Parameter: Minimum Word Frequency (System.Int32)
    • Default value: 5
  • Parameter: Word Weighting Type (Latino.TextMining.WordWeightType)
    • Possible values:
      • Log Df Tf Idf
      • Term Freq
      • Tf Idf
      • Tf Idf Safe
    • Default value: TfIdf
  • Parameter: Cut Low Weights Percentage (System.Double)
    • Default value: 0.2
  • Parameter: Normalize Vectors (System.Boolean)
    • Default value: true
  • Output: Bag of Words Model
  • Output: Dataset

Widget: Parse Document Corpus


Automatically generated widget from function ParseDocuments in package latino. The original function signature: ParseDocuments.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Input: Bag of Words Model (Latino.TextMining.BowSpace)
  • Output: Parsed Document Corpus

Widget: Get Vocabulary Table


Automatically generated widget from function GetVocabularyTable in package latino. The original function signature: GetVocabularyTable.

  • Input: Bag of Words Model (Latino.TextMining.BowSpace)
  • Parameter: Index of First Retrieved Word (System.Int32)
    • Default value: 1
  • Parameter: Maximum Words Retrieved (System.Int32)
    • Default value: 500
  • Output: Vocabulary Table

Widget: Create Dataset


Automatically generated widget from function ProcessNewDocumentsFromADC in package latino. The original function signature: ProcessNewDocumentsFromADC.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Input: Bag of Words Model (Latino.TextMining.BowSpace)
  • Output: Dataset

Widget: Construct BOW Model


Automatically generated widget from function ConstructBowModel in package latino. The original function signature: ConstructBowModel.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Parameter: Token Annotation (System.String)
    • Default value: Token
  • Parameter: Stem Feature Name (System.String)
    • Default value: stem
  • Parameter: Stopword Feature Name (System.String)
    • Default value: stopword
  • Parameter: Label Document Feature Name (System.String)
    • Default value: label
  • Parameter: Maximum N-Gram Length (System.Int32)
    • Default value: 2
  • Parameter: Minimum Word Frequency (System.Int32)
    • Default value: 5
  • Parameter: Word Weighting Type (Latino.TextMining.WordWeightType)
    • Possible values:
      • Log Df Tf Idf
      • Term Freq
      • Tf Idf
      • Tf Idf Safe
    • Default value: TfIdf
  • Parameter: Cut Low Weights Percentage (System.Double)
    • Default value: 0.2
  • Parameter: Normalize Vectors (System.Boolean)
    • Default value: true
  • Output: Bag of Words Model

Widget: Construct BoW Dataset and BoW Model Constructor


The Construct BoW Dataset and BoW Model Constructor widget takes an ADC data object as input and generates a sparse BoW model dataset (which can then be handed to, e.g., a classifier). The widget also takes several user-defined parameters, such as the weighting type, minimum word frequency, and n-gram length.

Besides the sparse BoW model dataset, this widget also outputs a BowModelConstructor instance. This additional object contains settings which allow repeating the feature construction steps on another document corpus. These settings include the input parameters, as well as the learned term weights and vocabulary.

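As an analogy only (not the widget's actual implementation), the same kind of sparse dataset can be sketched with scikit-learn's TfidfVectorizer; ngram_range, min_df and norm loosely mirror the 'Maximum N-Gram Length', 'Minimum Word Frequency' and 'Normalize Vectors' parameters below:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "the dog sat on the log"]
    # min_df=1 here; the widget's default cut-off of 5 would empty this toy vocabulary
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1, norm='l2')
    X = vectorizer.fit_transform(docs)  # sparse TF-IDF feature vectors
    print(X.shape, len(vectorizer.vocabulary_))
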
  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Controlled Vocabulary (List of terms which will be used for building the vocabulary. The 'Maximum N-Gram Length' parameter of this widget is also applied to the vocabulary. The final vocabulary is the intersection of the controlled vocabulary and the dataset vocabulary.)
  • Parameter: Token Annotation (The type of Annotation instances which mark the parts of the document (e.g., words, sentences or terms) that will be used for generating the vocabulary and the dataset.)
    • Default value: Token
  • Parameter: Feature Name (If present, the model will be constructed out of annotations' feature values instead of document text. For example, this is useful when we wish to build the BoW model using stems instead of the original word forms.)
    • Default value: Stem
  • Parameter: Stopword Feature Name (This is an annotation feature name which was used to tag tokens as stop words. These tokens will be excluded from the BoW representational model. If blank, no stop words will be used.)
    • Default value: StopWord
  • Parameter: Label Document Feature Name (This is the name of the document feature which will be used for class labeling examples in the dataset. If blank, the generated sparse dataset will be unlabeled.)
    • Default value: Labels
  • Parameter: Maximum N-Gram Length (The upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that 1 <= n <= max_ngram will be used.)
    • Default value: 2
  • Parameter: Minimum Word Frequency (When building the vocabulary, ignore terms that have a term frequency strictly lower than the given threshold. This value is also called cut-off in the literature.)
    • Default value: 5
  • Parameter: Word Weighting Type (The user can select among various weighting models for assigning weights to features)
    • Possible values:
      • Log Df TF-IDF
      • Term Frequency
      • TF-IDF
      • TF-IDF Safe
    • Default value: tf_idf
  • Parameter: Cut Low Weights Percentage (System.Double)
    • Default value: 0.2
  • Parameter: Normalize Vectors (The weighting methods can be further modified by vector normalization. If the user opts to use it, TextFlows performs L2 normalization.)
    • Default value: true
  • Output: Bag of Words Model Constructor (Bag of Words Model Constructor (BowModelConstructor) gathers utilities to build feature vectors from annotated document corpus.)
  • Output: BOW Model Dataset (Sparse BOW feature vectors.)
  • Example usage: Outlier document detection

Widget: Construct BoW Model Constructor


The Construct BoW Model Constructor widget takes an ADC data object as input and generates a BowModelConstructor instance. This object contains settings which allow repeating the feature construction steps on another document corpus. These settings include the input parameters, as well as the learned term weights and vocabulary. The widget also takes several user-defined parameters, such as the weighting type, minimum word frequency, and n-gram length.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Controlled Vocabulary (List of terms which will be used for building the vocabulary. The 'Maximum N-Gram Length' parameter of this widget is also applied to the vocabulary. The final vocabulary is the intersection of the controlled vocabulary and the dataset vocabulary.)
  • Parameter: Token Annotation (The type of Annotation instances which mark the parts of the document (e.g., words, sentences or terms) that will be used for generating the vocabulary and the dataset.)
    • Default value: Token
  • Parameter: Feature Name (If present, the model will be constructed out of annotations' feature values instead of document text. For example, this is useful when we wish to build the BoW model using stems instead of the original word forms.)
    • Default value: Stem
  • Parameter: Stopword Feature Name (This is an annotation feature name which was used to tag tokens as stop words. These tokens will be excluded from the BoW representational model. If blank, no stop words will be used.)
    • Default value: StopWord
  • Parameter: Label Document Feature Name (This is the name of the document feature which will be used for class labeling examples in the dataset. If blank, the generated sparse dataset will be unlabeled.)
    • Default value: Labels
  • Parameter: Maximum N-Gram Length (The upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that 1 <= n <= max_ngram will be used.)
    • Default value: 2
  • Parameter: Minimum Word Frequency (When building the vocabulary, ignore terms that have a term frequency strictly lower than the given threshold. This value is also called cut-off in the literature.)
    • Default value: 5
  • Parameter: Word Weighting Type (The user can select among various weighting models for assigning weights to features)
    • Possible values:
      • Log Df TF-IDF
      • Term Frequency
      • TF-IDF
      • TF-IDF Safe
    • Default value: tf_idf
  • Parameter: Cut Low Weights Percentage (System.Double)
    • Default value: 0.2
  • Parameter: Normalize Vectors (The weighting methods can be further modified by vector normalization. If the user opts to use it, TextFlows performs L2 normalization.)
    • Default value: true
  • Output: Bag of Words Model Constructor (Bag of Words Model Constructor (BowModelConstructor) gathers utilities to build feature vectors from annotated document corpus.)

Widget: Create BoW Dataset using the BoW Model Constructor

_images/question-mark.png

Applies an existing Bag of Words Model Constructor (with its stored vocabulary and term weights) to a new Annotated Document Corpus and outputs the resulting sparse BOW feature vectors; a minimal sketch of this reuse pattern follows the list below.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Bag of Words Model Constructor (Latino.TextMining.BowSpace)
  • Output: BOW Model Dataset (Sparse BOW feature vectors.)
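
A minimal sketch of the reuse pattern, again using scikit-learn's TfidfVectorizer as a stand-in for the Latino BowModelConstructor (the corpora are made up): fitting learns the vocabulary and term weights once; transform() then repeats the same feature construction on another corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

source = ["the cat sat on the mat", "the dog chased the cat"]
target = ["a cat and a dog"]

constructor = TfidfVectorizer()
constructor.fit(source)                   # learn vocabulary and term weights
X_target = constructor.transform(target)  # sparse BOW vectors in the same feature space
print(X_target.shape)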

Category Chunking

Category Nltk

Widget: N-gram Chunker

_images/question-mark.png

  • Input: Training Corpus (A tagged corpus included with NLTK, such as treebank, brown, cess_esp, floresta, or an Annotated Document Corpus in the standard TextFlows’ adc format)
  • Input: Backoff Chunker (A backoff chunker, to be used by the new chunker if it encounters an unknown context.)
  • Parameter: N-gram (N-gram is a contiguous sequence of n items from a given sequence of text or speech.)
    • Default value: 1
  • Output: Chunker (A python dictionary containing the Chunker object and its arguments.)

Widget: Regex Chunker

_images/question-mark.png

A grammar-based chunk parser. chunk.RegexpParser uses a set of regular expression patterns to specify the behavior of the parser. The chunking of the text is encoded using a ChunkString, and each rule acts by modifying the chunking in the ChunkString. The rules are all implemented using regular expression matching and substitution.

A grammar contains one or more clauses in the following form:

NP:
  {<DT|JJ>}          # chunk determiners and adjectives
  }<[\.VI].*>+{      # chink any tag beginning with V, I, or .
  <.*>}{<DT>         # split a chunk at a determiner
  <DT|JJ>{}<NN.*>    # merge chunk ending with det/adj
                     # with one starting with a noun

The patterns of a clause are executed in order. An earlier pattern may introduce a chunk boundary that prevents a later pattern from executing. Sometimes an individual pattern will match on multiple, overlapping extents of the input. As with regular expression substitution more generally, the chunker will identify the first match possible, then continue looking for matches after this one has ended.

The clauses of a grammar are also executed in order. A cascaded chunk parser is one having more than one clause. The maximum depth of a parse tree created by this chunk parser is the same as the number of clauses in the grammar.

  • Parameter: Grammar (Grammar: a set of regular expression patterns to specify the behavior of the parser)
    • Default value:
      NP: {<DT>? <JJ>* <NN>*}  # NP
      P: {<IN>}                # Preposition
      V: {<V.*>}               # Verb
      PP: {<P> <NP>}           # PP -> P NP
      VP: {<V> <NP|PP>*}       # VP -> V (NP|PP)*
  • Output: Chunker (A python dictionary containing the Chunker object and its arguments.)
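
For reference, this is how such a grammar is typically used with NLTK's RegexpParser on an already POS-tagged sentence (the sentence here is made up):

import nltk

grammar = r"NP: {<DT>?<JJ>*<NN>}  # chunk determiner/adjective/noun sequences"
parser = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]
print(parser.parse(sentence))  # an nltk.Tree containing one NP chunk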

Widget: Chunking Hub

_images/question-mark.png

Applies the input chunker to the Annotated Document Corpus: each sentence is parsed into chunks and the resulting chunk tags (IOB tags by default) are stored as a new token feature.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Chunker (Chunker which will be used to parse the text into chunks.)
  • Parameter: Sentence’s Annotation (System.String)
    • Default value: Sentence
  • Parameter: Element’s Annotation (Tokens whose features will be used for tagging.)
    • Default value: Token
  • Parameter: POS Feature Name (Element annotations’ POS tag feature name.)
    • Default value: POS Tag
  • Parameter: Output Feature Name (System.String)
    • Default value: IOB Tag
  • Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))

Widget: Extract Annotations from IOB tags

_images/question-mark.png

Extracts new annotations from chunked text by grouping consecutive tokens according to their IOB tag features.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Parameter: Sentence’s Annotation (Annotations which will be used to group element annotations.)
    • Default value: Sentence
  • Parameter: Element’s Annotation (Tokens whose features will be used in extraction.)
    • Default value: Token
  • Parameter: IOB Feature Name (Element annotations’ IOB tag feature name.)
    • Default value: IOB Tag
  • Parameter: POS Feature Name (Element annotations’ POS tag feature name.)
    • Default value: POS Tag
  • Parameter: Grammar Labels to be extracted (Grammar labels which will be extracted from the text as new annotations (NP,PP,VP), separated by a comma. NP - noun phrases, VP - verb phrases.)
    • Default value: NP,VP
  • Parameter: Annotation to be produced (The prefix for annotation of newly discovered tokens. Annotation names will be constructed as a combination of this prefix and the label type, e.g. “Chunk_NP”.)
    • Default value: Chunk
  • Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))

Category Stemming

Category Latino

Category Advanced

Widget: Stemming Tagger Hub (Text)

_images/question-mark.png

Automatically generated widget from function TagStringStemLemma in package latino. The original function signature: TagStringStemLemma.

  • Input: Text (System.Object)
  • Input: Token Tagger (System.Object)
  • Parameter: Output Feature Name (System.String)
    • Default value: stem
  • Output: String (string or array of strings (based on the input))

Widget: Lemma Tagger LemmaGen

_images/question-mark.png

Automatically generated widget from function ConstructLemmaSharpLemmatizer in package latino. The original function signature: ConstructLemmaSharpLemmatizer.

  • Parameter: Language (Latino.TextMining.Language)
    • Possible values:
      • Bulgarian
      • Czech
      • English
      • Estonian
      • French
      • German
      • Hungarian
      • Italian
      • Romanian
      • Serbian
      • Slovene
      • Spanish
    • Default value: English
  • Output: Lemmatizer (Tagger)
  • Example usage: Stemmer and Lemmatizer classification evaluation

Widget: Stem Tagger Snowball

_images/question-mark.png

Automatically generated widget from function ConstructSnowballStemmer in package latino. The original function signature: ConstructSnowballStemmer.

  • Parameter: Language (Latino.TextMining.Language)
    • Possible values:
      • Danish
      • Dutch
      • English
      • Finnish
      • French
      • German
      • Italian
      • Norwegian
      • Portuguese
      • Russian
      • Spanish
      • Swedish
    • Default value: English
  • Output: Stemmer (Tagger)
  • Example usage: Stemmer and Lemmatizer classification evaluation

Widget: Stemming Tagger Hub

_images/question-mark.png

Tags the given annotated document corpus with the given tagger.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Input: Token Tagger (Token annotation of the token to be tagged. If a feature name is also given, then the feature value of the selected token is tagged. Usage: 1. TokenName 2. TokenName/FeatureName. If multiple taggers are used, one line per tagger must be specified.)
  • Parameter: Token Annotation (System.String)
    • Default value: Token
  • Parameter: Output Feature Name (System.String)
    • Default value: stem
  • Output: Annotated Document Corpus

Category Nltk

Widget: ISRI Stemmer

_images/question-mark.png

ISRI Arabic stemmer, based on the algorithm Arabic Stemming Without a Root Dictionary (Information Science Research Institute, University of Nevada, Las Vegas, USA). A few minor modifications have been made to the basic ISRI algorithm.

See the source code of this module for more information. isri.stem(token) returns the Arabic root for the given token. The ISRI Stemmer requires that all tokens have Unicode string types. If you use Python IDLE on Arabic Windows you have to decode the text first using the Arabic ‘1256’ encoding.

Widget: Regex Stemmer

_images/question-mark.png

A stemmer that uses regular expressions to identify morphological affixes. Any substrings that match the regular expressions will be removed.

  • Parameter: Pattern (The regular expression that should be used to
    identify morphological affixes.)
  • Parameter: Minimum length of string (The minimum length of string to stem.)
    • Default value: 0
  • Output: Stemmer (Tagger)
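
A minimal sketch of the underlying NLTK class, with an illustrative suffix pattern and the minimum-length parameter set to 4:

from nltk.stem import RegexpStemmer

stemmer = RegexpStemmer(r"ing$|s$|e$|able$", min=4)
print(stemmer.stem("cars"))  # 'car'
print(stemmer.stem("mass"))  # 'mas'
print(stemmer.stem("was"))   # 'was' (shorter than the minimum length, left unchanged)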

Widget: RSLP Stemmer

_images/question-mark.png

A stemmer for Portuguese.

Widget: Snowball Stemmer

_images/question-mark.png

The following languages are supported:

Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.

The algorithm for English is documented here: Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.

The algorithms have been developed by Martin Porter. These stemmers are called Snowball, because Porter created a programming language with this name for creating new stemming algorithms. There is more information available at http://snowball.tartarus.org/

  • Parameter: Language (The following languages are supported: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.)
    • Possible values:
      • Danish
      • Dutch
      • English
      • Finnish
      • French
      • German
      • Hungarian
      • Italian
      • Norwegian
      • Portuguese
      • Romanian
      • Russian
      • Spanish
      • Swedish
    • Default value: danish
  • Parameter: Ignore stopwords (If set to True, stopwords are
    not stemmed and returned unchanged. Set to False by default.)
  • Output: Stemmer (Tagger)
  • Example usage: Stemmer and Lemmatizer classification evaluation
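
A minimal usage sketch of the underlying NLTK class, including the ‘Ignore stopwords’ behaviour:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english", ignore_stopwords=True)
print(stemmer.stem("running"))  # 'run'
print(stemmer.stem("having"))   # 'having' (a stopword, returned unchanged)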

Widget: Default Lemmatizer

_images/question-mark.png

Default Lemmatizer

Lemmatizer that can be used as a baseline. It does nothing and returns each word unchanged.

Widget: Lemmagen Lemmatizer

_images/question-mark.png

Lemmagen lemmatizer as implemented in Python.

Widget: Pattern Lemmatizer

_images/question-mark.png

Pattern Lemmatizer

Lemmatizes using the Pattern library’s built-in stem function.

Widget: Pattern Porter Stemmer

_images/question-mark.png

Porter stemmer from the Pattern library.

Widget: WordNet Lemmatizer

_images/question-mark.png

WordNet Lemmatizer

Lemmatize using WordNet’s built-in morphy function. Returns the input word unchanged if it cannot be found in WordNet.

  • Parameter: POS Annotation (Define the name of the part of speech annotations form ADC corpus that wordnet lemmatizer will use when trying to lemmatize words.)
    • Default value: POS Tag
  • Output: Stemmer (Tagger)
  • Example usage: Stemmer and Lemmatizer classification evaluation
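
A minimal sketch of the underlying NLTK class (it requires the WordNet data to be downloaded, e.g. via nltk.download('wordnet')); the pos argument plays the role of the ‘POS Annotation’ parameter:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("churches"))          # 'church'
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("xyzzy"))             # 'xyzzy' (not in WordNet, unchanged)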

Widget: Lancaster Stemmer

_images/question-mark.png

A word stemmer based on the Lancaster stemming algorithm.

>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('maximum')     # Remove "-um" when word is intact
'maxim'
>>> st.stem('presumably')  # Don't remove "-um" when word is not intact
'presum'
>>> st.stem('multiply')    # No action taken if word ends with "-ply"
'multiply'
>>> st.stem('provision')   # Replace "-sion" with "-j" to trigger "j" set of rules
'provid'
>>> st.stem('owed')        # Word starting with vowel must contain at least 2 letters
'ow'
>>> st.stem('ear')         # ditto
'ear'
>>> st.stem('saying')      # Words starting with consonant must contain at least 3
'say'
>>> st.stem('crying')      #     letters and one of those letters must be a vowel
'cry'
>>> st.stem('string')      # ditto
'string'
>>> st.stem('meant')       # ditto
'meant'
>>> st.stem('cement')      # ditto
'cem'

Widget: Porter Stemmer

_images/question-mark.png

This is the Porter stemming algorithm, ported to Python from the version coded up in ANSI C by the author. It follows the algorithm presented in

Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.

only differing from it at the points marked –DEPARTURE– and –NEW– below.

For a more faithful version of the Porter algorithm, see http://www.tartarus.org/~martin/PorterStemmer/

Widget: Lemmatizer Evaluator

_images/question-mark.png

This widget can be used to evaluate lemmatizers. Its inputs are a lemmatizer and a corpus on which you wish to evaluate the lemmatizer.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Lemmatizer (Lemmatizer to be evaluated)
  • Output: Actual and predicted labels (List of actual and predicted labels (see help for details))

Widget: Stem/Lemma Tagger Hub

_images/question-mark.png

Tags the given annotated document corpus with the given tagger.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Token Tagger (Token annotation of the token to be tagged. If a feature name is also given, then the feature value of the selected token is tagged. Usage: 1. TokenName 2. TokenName/FeatureName. If multiple taggers are used, one line per tagger must be specified.)
  • Parameter: Token Annotation (System.String)
    • Default value: Token
  • Parameter: POS Annotation (Name of Part of Speech annotation in ADC corpus if ADC corpus contains part of speech tags. Used by wordnet lemmatizer which uses POS tags for lemma prediction.)
    • Default value: POS Tag
  • Parameter: Output Feature Name (System.String)
    • Default value: Stem
  • Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Example usage: LBD workflows for outlier detection

Category Chunking

Widget: Chunking Hub

_images/question-mark.png

Applies the input chunker to the Annotated Document Corpus, reading the tokens’ input features (POS tags by default) and writing the resulting chunk labels to the output feature.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Chunker (Chunker which will be used to parse the text into chunks.)
  • Parameter: Input Feature Name (System.String)
    • Default value: POS Tag
  • Parameter: Output Feature Name (System.String)
    • Default value: Chunk
  • Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))

Widget: Classifier based parser

_images/question-mark.png

Constructs a classifier-based chunker.

  • Output: classifier based chunker

Widget: Regex parser

_images/question-mark.png

Constructs a regular-expression chunker from the given grammar.

  • Parameter: Grammar (System.String)
    • Default value: “NP: {<DT>?<JJ>*<NN>}”
  • Output: regex chunker

Category Dataset

Category Latino

Widget: Add Labels

_images/question-mark.png

Automatically generated widget from function AddLabelsToDocumentVectors in package latino. The original function signature: AddLabelsToDocumentVectors.

  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Input: Labels (Array of Strings) (System.Collections.Generic.List`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Output: Dataset

Widget: Extract Labels

_images/question-mark.png

Automatically generated widget from function ExtractDatasetLabels in package latino. The original function signature: ExtractDatasetLabels.

  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Output: Labels (Array of Strings)

Widget: Remove Labels

_images/question-mark.png

Automatically generated widget from function RemoveDocumentVectorsLabels in package latino. The original function signature: RemoveDocumentVectorsLabels.

  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Output: Dataset

Widget: Split

_images/question-mark.png

Automatically generated widget from function DatasetSplitSimple in package latino. The original function signature: DatasetSplitSimple.

  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Parameter: Percentage (System.Double)
    • Default value: 10
  • Parameter: Random Seed (-1 for a random, time-dependent seed)
    • Default value: -1
  • Output: Dataset with Extracted Set
  • Output: Dataset of Remaining Sets

Widget: Split to Predefined Sets

_images/question-mark.png

Automatically generated widget from function DatasetSplitPredefined in package latino. The original function signature: DatasetSplitPredefined.

  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Input: Sets (List with predefined set numbers) (System.Int32[])
  • Input: SetId (System.Int32)
  • Output: Dataset with Extracted Set
  • Output: Dataset of Remaining Sets

Widget: Dataset to Object

_images/question-mark.png

Automatically generated widget from function DatasetToObject in package latino. The original function signature: DatasetToObject.

  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Output: Standard Object Representation of Dataset (List<Tuple<int,string,Dictionary<int,double>>> explained as: (List of Examples)<(Example Tuple)<(Id) int,(Label) string,(BOW Dictionary)<(Word Id) int,(Word Weight) double>>>)

Widget: Object to Dataset

_images/question-mark.png

Automatically generated widget from function ObjectToDataset in package latino. The original function signature: ObjectToDataset.

  • Input: Standard Object Representation of Dataset (List<Tuple<int,string,Dictionary<int,double>>> explained as: (List of Examples)<(Example Tuple)<(Id) int,(Label) string,(BOW Dictionary)<(Word Id) int,(Word Weight) double>>>)
  • Output: Dataset

Category Stop Words

Category Latino

Category Advanced

Widget: Stop Word Tagger Hub (Text)

_images/question-mark.png

Automatically generated widget from function TagStringStopwords in package latino. The original function signature: TagStringStopwords.

  • Input: Text (System.Object)
  • Input: Token Tagger (string or array of strings)
  • Parameter: Output Feature Name (System.String)
    • Default value: stopword
  • Output: String (string or array of strings (based on the input))

Widget: Stop Word Sets

_images/question-mark.png

Automatically generated widget from function GetStopWords in package latino. The original function signature: GetStopWords.

  • Parameter: Language (Latino.TextMining.Language)
    • Possible values:
      • Bulgarian
      • Czech
      • Danish
      • Dutch
      • English
      • Finnish
      • French
      • German
      • Hungarian
      • Italian
      • Norwegian
      • Portuguese
      • Romanian
      • Russian
      • Serbian
      • Slovene
      • Spanish
      • Swedish
    • Default value: English
  • Output: StopWords
  • Example usage: Simple Document Preprocessing

Widget: Stop Word Tagger

_images/question-mark.png

Automatically generated widget from function ConstructStopWordsTagger in package latino. The original function signature: ConstructStopWordsTagger.

  • Input: Stopwords (List of stopwords)
  • Parameter: Ignore Case (If true, words are marked as stop words regardless of their casing.)
    • Default value: true
  • Output: Stop Word Tagger

Widget: Stop Word Tagger Hub

_images/question-mark.png

Automatically generated widget from function TagADCStopwords in package latino. The original function signature: TagADCStopwords.

  • Input: Annotated Document Corpus (LatinoInterfaces.DocumentCorpus)
  • Input: Token Tagger (System.Object)
  • Parameter: Token Annotation (System.String)
    • Default value: Token
  • Parameter: Output Feature Name (System.String)
    • Default value: stopword
  • Output: Annotated Document Corpus

Category Nltk

Widget: Stop Word Tagger

_images/question-mark.png

Constructs a python stop word tagger object.

  • Input: Stop Words (A list or string (stop words separated by new lines) of stop words.)
  • Parameter: Ignore Case (If true, words are marked as stop words regardless of their casing.)
    • Default value: true
  • Output: Stop Word Tagger (A python dictionary containing the StopWordTagger object and its arguments.)
  • Example usage: Simple Document Preprocessing

Widget: Stop Word Tagger Hub

_images/question-mark.png

Applies the stop_word_tagger object to the Annotated Document Corpus (adc):

  1. first, select only annotations of the type given by Token Annotation (element_annotation),
  2. apply the stop word tagger,
  3. create new output_feature annotations with the outputs of the stop word tagger.
  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Stop Word Tagger (A python dictionary containing the stop word tagger object and its arguments.)
  • Parameter: Token Annotation (Which annotated part of document to be searched for stopwords.)
    • Default value: Token
  • Parameter: Output Feature Name (How to annotate the newly discovered stop word features.)
    • Default value: StopWord
  • Output: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Example usage: LBD workflows for outlier detection
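
A minimal sketch of what the hub does, using a plain Python set in place of the stop word tagger object (the token list and stop word set are made up):

stop_words = {"the", "a", "of"}          # e.g. parsed from a newline-separated string
tokens = ["The", "cat", "of", "the", "house"]
ignore_case = True

def is_stop_word(token):
    return (token.lower() if ignore_case else token) in stop_words

# one (token, StopWord feature) pair per Token annotation
print([(tok, is_stop_word(tok)) for tok in tokens])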

Category Similarity Matrix

Category Latino

Widget: Calculate Similarity Matrix

_images/question-mark.png

Automatically generated widget from function CalculateSimilarityMatrix in package latino. The original function signature: CalculateSimilarityMatrix.

  • Input: Dataset (Latino.Model.IUnlabeledExampleCollection`1[[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Input: Dataset (Latino.Model.IUnlabeledExampleCollection`1[[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Parameter: Similarity Threshold (System.Double)
    • Default value: 0
  • Parameter: Full Matrix (not only Lower Triangular) (System.Boolean)
    • Default value: true
  • Output: Similarity Matrix
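
A rough equivalent of the computation, sketched with scikit-learn (the widget itself operates on Latino’s sparse vectors; the documents here are made up):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat", "the dog sat", "stock markets fell"]
X = TfidfVectorizer().fit_transform(docs)

sim = cosine_similarity(X)   # full similarity matrix
sim[sim < 0.1] = 0.0         # 'Similarity Threshold' parameter
lower = np.tril(sim)         # lower-triangular variant ('Full Matrix' = false)
print(np.round(sim, 2))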

Widget: Convert Matrix to Table

_images/question-mark.png

Automatically generated widget from function SparseMatrixToTable in package latino. The original function signature: SparseMatrixToTable.

  • Input: Sparse Matrix (Latino.SparseMatrix`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Output: Matrix Table

Category Clustering

Category Latino

Widget: KMeans Clusterer

_images/question-mark.png

Automatically generated widget from function ConstructKMeansClusterer in package latino. The original function signature: ConstructKMeansClusterer.

  • Parameter: K (Number of Clusters) (System.Int32)
    • Default value: 10
  • Parameter: Centroid Type (Latino.Model.CentroidType)
    • Possible values:
      • Avg
      • Nrm L2
      • Sum
    • Default value: NrmL2
  • Parameter: Similarity Measure (LatinoInterfaces.SimilarityModel)
    • Possible values:
      • Cosine
      • Dot Product
    • Default value: Cosine
  • Parameter: Random Seed (-1: Use Always Different) (System.Int32)
    • Default value: -1
  • Parameter: Eps (System.Double)
    • Default value: 0.0005
  • Parameter: Trials (Num of Initializations) (System.Int32)
    • Default value: 1
  • Output: Clusterer

Widget: KMeans Fast Clusterer

_images/question-mark.png

Automatically generated widget from function ConstructKMeansFastClusterer in package latino. The original function signature: ConstructKMeansFastClusterer.

  • Parameter: K (Number of Clusters) (System.Int32)
    • Default value: 10
  • Parameter: Random Seed (-1: Use Always Different) (System.Int32)
    • Default value: -1
  • Parameter: Eps (System.Double)
    • Default value: 0.0005
  • Parameter: Trials (Num of Initializations) (System.Int32)
    • Default value: 1
  • Output: Clusterer

Widget: Hierarchical Bisecting Clusterer

_images/question-mark.png

Automatically generated widget from function ConstructHierarchicalBisectingClusterer in package latino. The original function signature: ConstructHierarchicalBisectingClusterer.

  • Parameter: Min Quality (System.Double)
    • Default value: 0.2
  • Output: Clusterer

Widget: Clustering Results Info

_images/question-mark.png

Automatically generated widget from function ClusteringResultsInfo in package latino. The original function signature: ClusteringResultsInfo.

  • Input: Clustering Results (Latino.Model.ClusteringResult)
  • Output: Document Labels (Array of Cluster Ids)
  • Output: Clusters Tree

Widget: View Clusters

_images/question-mark.png

Automatically generated widget from function ViewClusters_PYTHON in package latino. The original function signature: ViewClusters_PYTHON.

  • Input: Clustering Results (System.Object)
  • Outputs: Popup window which shows widget’s results

Category Scikit

Widget: k-Means

_images/question-mark.png

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.

  • Parameter: Number of clusters (The number of clusters to form as well as the number of centroids to generate.)
    • Default value: 8
  • Parameter: Max iterations (Maximum number of iterations of the k-means algorithm for a single run.)
    • Default value: 300
  • Parameter: Tolerance (Relative tolerance with regards to inertia to declare convergence.)
    • Default value: 1e-4
  • Output: Clustering
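
A minimal sketch with scikit-learn’s KMeans on a toy TF-IDF matrix (the corpus is made up, and the number of clusters is lowered accordingly):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cat dog", "dog wolf", "stock market", "market crash"]
X = TfidfVectorizer().fit_transform(docs)

# n_clusters, max_iter and tol mirror the widget parameters above
km = KMeans(n_clusters=2, max_iter=300, tol=1e-4)
print(km.fit_predict(X))  # cluster label per document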

Widget: Clustering Hub

_images/question-mark.png

Automatically generated widget from function ClusterDocumentVectors in package latino. The original function signature: ClusterDocumentVectors.

  • Input: Clusterer (LatinoClowdFlows.IClusterer)
  • Input: Dataset (Latino.Model.IUnlabeledExampleCollection`1[[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Output: Clustering Results

Category Classification

Category Latino

Widget: Nearest Centroid Classifier

_images/question-mark.png

Automatically generated widget from function ConstructCentroidClassifier in package latino. The original function signature: ConstructCentroidClassifier.

  • Parameter: Similarity Model (LatinoInterfaces.SimilarityModel)
    • Possible values:
      • Cosine
      • Dot Product
    • Default value: Cosine
  • Parameter: Normalize Centroids (System.Boolean)
    • Default value: false
  • Output: Centroid Classifier
  • Example usage: Classifier evaluation

Widget: Naive Bayes Classifier

_images/question-mark.png

Automatically generated widget from function ConstructNaiveBayesClassifier in package latino. The original function signature: ConstructNaiveBayesClassifier.

  • Parameter: Normalize (System.Boolean)
    • Default value: false
  • Parameter: Log Sum Exp Trick (System.Boolean)
    • Default value: true
  • Output: Classifier
  • Example usage: Classifier evaluation

Widget: SVM Binary Classifier

_images/question-mark.png

Automatically generated widget from function ConstructSvmBinaryClassifier in package latino. The original function signature: ConstructSvmBinaryClassifier.

  • Parameter: C (zero implies default value ([avg. x*x]^-1))
    • Default value: 0
  • Parameter: Biased Hyperplane (System.Boolean)
    • Default value: true
  • Parameter: Kernel Type (Latino.Model.SvmLightKernelType)
    • Possible values:
      • Linear
      • Polynomial
      • Radial Basis Function
      • Sigmoid
    • Default value: Linear
  • Parameter: Kernel Parameter Gamma (System.Double)
    • Default value: 1
  • Parameter: Kernel Parameter D (System.Double)
    • Default value: 1
  • Parameter: Kernel Parameter S (System.Double)
    • Default value: 1
  • Parameter: Kernel Parameter C (System.Double)
    • Default value: 0
  • Parameter: Eps (System.Double)
    • Default value: 0.001
  • Parameter: Max Iterations (System.Int32)
    • Default value: 100000
  • Parameter: Custom Parameter String (System.String)
  • Output: Classifier

Widget: SVM Multiclass Fast Classifier

_images/question-mark.png

Automatically generated widget from function ConstructSvmMulticlassFast in package latino. The original function signature: ConstructSvmMulticlassFast.

  • Parameter: C (System.Double)
    • Default value: 5000
  • Parameter: Eps (System.Double)
    • Default value: 0.1
  • Output: Classifier

Widget: Majority Classifier

_images/question-mark.png

Automatically generated widget from function ConstructMajorityClassifier in package latino. The original function signature: ConstructMajorityClassifier.

  • Output: Classifier

Widget: Maximum Entropy Classifier

_images/question-mark.png

Automatically generated widget from function ConstructMaximumEntropyClassifier in package latino. The original function signature: ConstructMaximumEntropyClassifier.

  • Parameter: Move Data (System.Boolean)
    • Default value: false
  • Parameter: Num of Iterations (System.Int32)
    • Default value: 100
  • Parameter: CutOff (System.Int32)
    • Default value: 0
  • Parameter: Num of Threads (System.Int32)
    • Default value: 1
  • Parameter: Normalize (System.Boolean)
    • Default value: false
  • Output: Classifier
  • Example usage: Classifier evaluation

Widget: Maximum Entropy Fast Classifier

_images/question-mark.png

Automatically generated widget from function ConstructMaximumEntropyClassifierFast in package latino. The original function signature: ConstructMaximumEntropyClassifierFast.

  • Parameter: Move Data (System.Boolean)
    • Default value: false
  • Parameter: Num of Iterations (System.Int32)
    • Default value: 100
  • Parameter: CutOff (System.Int32)
    • Default value: 0
  • Parameter: Num of Threads (System.Int32)
    • Default value: 1
  • Parameter: Normalize (System.Boolean)
    • Default value: false
  • Output: Classifier

Widget: Knn Classifier

_images/question-mark.png

Automatically generated widget from function ConstructKnnClassifier in package latino. The original function signature: ConstructKnnClassifier.

  • Parameter: Similarity Model (LatinoInterfaces.SimilarityModel)
    • Possible values:
      • Cosine
      • Dot Product
    • Default value: Cosine
  • Parameter: K (Neighbourhood) (System.Int32)
    • Default value: 10
  • Parameter: Soft Voting (System.Boolean)
    • Default value: true
  • Output: Classifier

Widget: Knn Fast Classifier

_images/question-mark.png

Automatically generated widget from function ConstructKnnClassifierFast in package latino. The original function signature: ConstructKnnClassifierFast.

  • Parameter: K (Neighbourhood) (System.Int32)
    • Default value: 10
  • Parameter: Soft Voting (System.Boolean)
    • Default value: true
  • Output: Classifier
  • Example usage: Classifier evaluation

Widget: Accuracy Calculation

_images/question-mark.png

Automatically generated widget from function AccuracyClaculation in package latino. The original function signature: AccuracyClaculation.

  • Input: True Labels (System.Collections.IList)
  • Input: Predicted Labels (System.Collections.IList)
  • Output: Accuracy
  • Output: Statistics (Statistics:confusionMatrix: the first level of the confusion matrix dictionary represents the true labels (first input) while the second, inner level represents the predicted labels (second input). Statistics:additionalScores: the dictionary’s key is the label that was considered positive for the calculation, and the dictionary’s value holds the actual additional scores. A minimal sketch of the confusionMatrix layout is given below.)
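
A minimal sketch of the described confusionMatrix layout (the labels and predictions are made up; this is not the Latino implementation):

from collections import defaultdict

true_labels = ["pos", "pos", "neg", "neg"]
predicted   = ["pos", "neg", "neg", "neg"]

confusion = defaultdict(lambda: defaultdict(int))
for t, p in zip(true_labels, predicted):
    confusion[t][p] += 1  # first level: true label; inner level: predicted label

accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
print(accuracy, {k: dict(v) for k, v in confusion.items()})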

Widget: Cross Validation

_images/question-mark.png

Automatically generated widget from function CrossValidation in package latino. The original function signature: CrossValidation.

  • Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Parameter: Num of Sets (System.Int32)
    • Default value: 10
  • Parameter: Assign Sets Randomly (System.Boolean)
    • Default value: true
  • Parameter: Use Seed for Random (System.Boolean)
    • Default value: false
  • Parameter: Random Seed (System.Int32)
    • Default value: 0
  • Output: Data Object with results

Widget: Cross Validation (Predefined Splits)

_images/question-mark.png

Automatically generated widget from function CrossValidationPredefSplits in package latino. The original function signature: CrossValidationPredefSplits.

  • Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Input: Sets (List with predefined set numbers) (System.Collections.Generic.List`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Output: Data Object with results

Widget: Multiple Splits Validation

_images/question-mark.png

Automatically generated widget from function CrossValidationPredefMultiSplits in package latino. The original function signature: CrossValidationPredefMultiSplits.

  • Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Input: Multiple Set Indexes (Dictionary with multiple predefined split element indexes. {“train0”:[1,2,3],”test0”:[4,5],”train1”:[2,3,4],”test1”:[5,6]})
  • Output: Data Object with results

Widget: Predict Classification

_images/question-mark.png

Automatically generated widget from function PredictClassification in package latino. The original function signature: PredictClassification.

  • Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Output: Prediction(s)
  • Output: Labeled dataset

Widget: Prediction Info

_images/question-mark.png

Automatically generated widget from function PredictionInfo in package latino. The original function signature: PredictionInfo.

  • Input: Prediction(s) (System.Collections.Generic.List`1[[Latino.Model.Prediction`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Output: Label(s) (Array of Strings)
  • Output: Prediction Info(s)

Widget: View Classifications

_images/question-mark.png

Automatically generated widget from function ViewClasssifications_PYTHON in package latino. The original function signature: ViewClasssifications_PYTHON.

  • Input: Prediction(s) (System.Object)
  • Outputs: Popup window which shows widget’s results

Category Nltk

Widget: Naive Bayes Classifier

_images/question-mark.png

A classifier based on the Naive Bayes algorithm. In order to find the probability for a label, this algorithm first uses the Bayes rule to express P(label|features) in terms of P(label) and P(features|label):

                      P(label) * P(features|label)
P(label|features) = --------------------------------
                              P(features)

The algorithm then makes the ‘naive’ assumption that all features are independent, given the label:

                      P(label) * P(f1|label) * ... * P(fn|label)
P(label|features) = ----------------------------------------------
                                   P(features)

Rather than computing P(features) explicitly, the algorithm just calculates the numerator for each label, and normalizes them so they sum to one:

                      P(label) * P(f1|label) * ... * P(fn|label)
P(label|features) = ----------------------------------------------
                      SUM[l]( P(l) * P(f1|l) * ... * P(fn|l) )

  • Parameter: Normalize (System.Boolean)
    • Default value: false
  • Parameter: Log Sum Exp Trick (System.Boolean)
    • Default value: true
  • Output: Classifier
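
A minimal usage sketch of NLTK’s classifier with made-up feature sets (the Normalize and Log Sum Exp Trick parameters above belong to the widget wrapper, not to this call):

import nltk

train = [({"contains(cat)": True}, "animal"),
         ({"contains(stock)": True}, "finance"),
         ({"contains(dog)": True}, "animal")]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify({"contains(cat)": True}))  # 'animal'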

Category Scikit

Widget: Decision Tree Classifier

_images/scikit_Tree-icon.png

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

  • Parameter: Max features (The number of features to consider when looking for the best split: If int, then consider max_features features at each split. If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.)
    • Default value: auto
  • Parameter: Max depth (The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. )
    • Default value: 100
  • Output: Classifier
  • Example usage: LBD workflows for outlier detection

Widget: Gaussian Naive Bayes Classifier

_images/classifier_naive_bayes_image.png

Gaussian Naive Bayes. When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution.

Widget: k-Nearest Neighbours Classifier

_images/classifier_knn_image.png

Classifier implementing the k-nearest neighbors vote.

  • Parameter: Number of neighbors (Number of neighbors to use by default for k_neighbors queries.)
    • Default value: 5
  • Parameter: Algorithm (Algorithm used to compute the nearest neighbors: ‘ball_tree’ will use BallTree ‘kd_tree’ will use KDTree ‘brute’ will use a brute-force search. ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit method. Note: fitting on sparse input will override the setting of this parameter, using brute force.)
    • Possible values:
      • ball tree
      • brute
      • kd tree
      • most appropriate (automatically)
    • Default value: auto
  • Parameter: Weights (weight function used in prediction. Possible values: ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally. ‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away. [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights. Uniform weights are used by default.)
    • Possible values:
      • distance
      • uniform
    • Default value: uniform
  • Output: Classifier
  • Example usage: Classifier evaluation
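
A minimal sketch of the underlying scikit-learn class on made-up data (n_neighbors is lowered to fit the toy dataset):

from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [5, 5], [5, 6]]
y = ["a", "a", "b", "b"]

clf = KNeighborsClassifier(n_neighbors=3, algorithm="auto", weights="uniform")
clf.fit(X, y)
print(clf.predict([[4, 5]]))  # ['b']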

Widget: Logistic regression Classifier

_images/scikit_LogisticRegression.png

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

  • Parameter: Penalty (Used to specify the norm used in the penalization.)
    • Possible values:
      • l1
      • l2
    • Default value: l1
  • Parameter: C (Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.)
    • Default value: 1.0
  • Output: Classifier

Widget: Multinomial Naive Bayes Classifier

_images/classifier_naive_bayes_image.png

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

  • Parameter: Alpha (Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing). )
    • Default value: 1.0
  • Parameter: Fit prior (Whether to learn class prior probabilities or not.
    If false, a uniform prior will be used.)
  • Output: Classifier
  • Example usage: Outlier document detection

Widget: SVM Classifier

_images/classifier_svm_image.png

Implementation of a Support Vector Machine classifier using libsvm: the kernel can be non-linear, but its SMO algorithm does not scale to a large number of samples as LinearSVC does. Furthermore, SVC multi-class mode is implemented using a one-vs-one scheme, while LinearSVC uses one-vs-the-rest.

  • Parameter: C (Penalty parameter C of the error term.)
    • Default value: 1.0
  • Parameter: Degree (Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.)
    • Default value: 3
  • Parameter: Kernel (Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix.)
    • Possible values:
      • linear
      • poly
      • precomputed
      • rbf
      • sigmoid
    • Default value: rbf
  • Output: Classifier
  • Example usage: POS tagger intrinsic evaluation - experiment 1
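
A minimal sketch of the underlying scikit-learn class on made-up data; C, Kernel and Degree correspond to the widget parameters above (Degree only matters for the ‘poly’ kernel):

from sklearn.svm import SVC

X = [[0, 0], [0, 1], [5, 5], [5, 6]]
y = [0, 0, 1, 1]

clf = SVC(C=1.0, kernel="rbf", degree=3)
clf.fit(X, y)
print(clf.predict([[4, 5]]))  # [1]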

Widget: SVM Linear Classifier

_images/classifier_svm_image.png

Similar to Support Vector Classification with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better (to large numbers of samples).

  • Parameter: C (Penalty parameter C of the error term.)
    • Default value: 1.0
  • Parameter: Loss (Specifies the loss function. ‘l1’ is the hinge loss (standard SVM) while ‘l2’ is the squared hinge loss.)
    • Possible values:
      • l1
      • l2
    • Default value: l2
  • Parameter: Penalty (Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.)
    • Possible values:
      • l1
      • l2
    • Default value: l2
  • Parameter: Multi class (Determines the multi-class strategy if y contains more than two classes. ovr trains n_classes one-vs-rest classifiers, while crammer_singer optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice, rarely leads to better accuracy, and is more expensive to compute. If crammer_singer is chosen, the options loss, penalty and dual will be ignored.)
    • Possible values:
      • crammer singer
      • ovr
    • Default value: ovr
  • Output: Classifier
  • Example usage: Classifier evaluation

Widget: Apply Classifier Hub

_images/question-mark.png

Applies the input classifier to the input dataset and outputs the predictions together with the labeled dataset.

  • Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Parameter: Calculate class probabilities (Calculate classification class probabilities. May slow down algorithm prediction.)
    • Default value: true
  • Output: Prediction(s)
  • Output: Labeled dataset

Widget: Train Classifier Hub

_images/question-mark.png

Automatically generated widget from function TrainClassifier in package latino. The original function signature: TrainClassifier.

  • Input: Classifier (Latino.Model.IModel`1[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Input: Dataset (Latino.Model.LabeledDataset`2[[System.String, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089],[Latino.SparseVector`1[[System.Double, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]], Latino, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]])
  • Output: Classifier

Widget: Extract Classifier Name

_images/question-mark.png

Returns a string with a pretty, human-readable classifier name.

  • Input: Classifier
  • Output: Classifier Name

Widget: Extract Actual and Predicted Values

_images/question-mark.png

Takes as input an ADC object with already defined actual and predicted features that can be compared. Outputs a combined list of actual and predicted values which can be used, e.g., by the Classification Statistics widget.

  • Input: Predictions (Classification Predictions)
  • Input: Dataset (BoW Dataset)
  • Output: Actual and Predicted Values (List of Actual and Predicted Values)

Category Lexicology

Category Controlled Vocabularies

Widget: MeSH vocabulary builder

_images/question-mark.png

Constructs vocabulary from selected top categories in MeSH hierarchy.

  • Parameter: N-grams (Construct n-gram subsets of words from a MeSH term)
  • Output: List of MeSH terms (List of MeSH terms.)

Category Literature Based Discovery

Category Heuristic Calculation

Widget: Exclude Terms that Appear in One Domain Only

_images/question-mark.png

  • Input: Bag of Words Model Constructor (Bag of Words Model Constructor)
  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Output: Bag of Words Model Constructor with Filtered Vocabulary (Bag of Words Model Constructor (BowModelConstructor) gathers utilities to build feature vectors from annotated document corpus.)
  • Output: BOW Model Dataset (Sparse BOW feature vectors.)

Widget: Calculate Term Heuristics Scores

_images/question-mark.png

Calculate all input heuristics.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Bag of Words Model (Bag of Words Model Constructor (BowModelConstructor) gathers utilities to build feature vectors from annotated document corpus.)
  • Input: Heuristic or Heuristic list (List of heuristic names which scores will be calculated.)
  • Output: Heuristic Scores (Calculated B-Term Heuristic Scores)
  • Example usage: Literature Based Discovery (overview with vocab)

Widget: Actual and Predicted Values

_images/question-mark.png

Prepare actual and predicted values for B-term Heuristics.

  • Input: Bag of Words Model Constructor (Bag of Words Model Constructor (BowModelConstructor) gathers utilities to build feature vectors from annotated document corpus.)
  • Input: B-terms (List of bridging terms)
  • Input: Heuristic Scores (Calculated B-Term Heuristic Scores)
  • Output: Actual and Predicted Values (List of actual and predicted values for every B-term Discovery Heuristic)

Category Heuristic Specification

Widget: Frequency-based heuristics

_images/question-mark.png

Interactive widget which allows specification of frequency-based bridging term discovery heuristics.

Widget: TF-IDF-based heuristics

_images/question-mark.png

Interactive widget which allows specification of TF-IDF based bridging term discovery heuristics.

Widget: Similarity-based heuristics

_images/question-mark.png

Interactive widget which allows specification of similarity-based bridging term discovery heuristics.

  • Output: List of Selected Heuristics for Bridging Term Discovery

Widget: Outlier-based heuristics

_images/question-mark.png

Interactive widget which allows specification of outlier-based bridging term discovery heuristics.

Widget: Banded matrix-based heuristics

_images/question-mark.png

Interactive widget which allows specification of bridging term discovery heuristics based on banded matrices.

  • Output: List of Selected Heuristics for Bridging Term Discovery

Widget: Outlier-based heuristic

_images/question-mark.png

Interactive widget which allows specification of a custom outlier-based bridging term discovery heuristic by using the classifiers from the input.

  • Input: Classifier
  • Output: List of Selected Heuristics for Bridging Term Discovery

Widget: Heuristic Maximum

_images/question-mark.png

Defines a calculated heuristic that is the maximum (for every term) of the input heuristics.

  • Input: Heuristic or Heuristic list
  • Output: Heuristic Max Specification (Heuristic Maximum Specification)

Widget: Heuristic Minimum

_images/question-mark.png

Defines a calculated heuristic that is the minimum (for every term) of the input heuristics.

  • Input: Heuristic or Heuristic list
  • Output: Heuristic Min Specification (Heuristic Minimum Specification)

Widget: Heuristic Normalization

_images/question-mark.png

Defines calculated heuristics where scores are scaled to [0,1] values using the minimum and maximum scores.

  • Input: Heuristic or Heuristic list
  • Output: Normalized Heuristic or Heuristic Specifications list (Normalized Heuristic Specification or Heuristic Specifications list)
  • Example usage: LBD workflows for outlier detection

Widget: Heuristic Sum

_images/question-mark.png

Defines a calculated heuristic that is the sum (for every term) of the input heuristics.

Widget: Ensemble Average Position

_images/question-mark.png

The Ensemble Average Position score is calculated as an average of position scores of individual base heuristics.

Widget: Ensemble Heuristic Vote

_images/question-mark.png

Every term gets an integer score, which represents how many of the input heuristics voted for the term. Each input heuristic gives one vote to each term which is in the first third of its ranked list of terms.
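
An illustrative sketch of the voting rule (the heuristic names and rankings are made up):

heuristic_rankings = {
    "freq":  ["term1", "term2", "term3", "term4", "term5", "term6"],
    "tfidf": ["term2", "term1", "term5", "term3", "term4", "term6"],
}

votes = {}
for ranking in heuristic_rankings.values():
    for term in ranking[: len(ranking) // 3]:  # first third of the ranked list
        votes[term] = votes.get(term, 0) + 1

print(votes)  # {'term1': 2, 'term2': 2}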

Category Term ranking and Exploration

Widget: Explore in CrossBee

_images/question-mark.png

Explore heuristic scores and terms in CrossBee.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Input: Bag of Words Model Constructor (Bag of Words Model Constructor )
  • Input: BOW Model Dataset (Sparse BOW feature vectors)
  • Input: B-terms (List of bridging terms)
  • Input: Heuristic Scores (Calculated B-term)
  • Parameter: CrossBee API URL (URL of the CrossBee API for exploring external data. Data to be displayed in CrossBee will be available at TextFlows’ URL. This URL is sent to the CrossBee API by replacing the “{dataurl.json}” string in the supplied CrossBee API URL.)
  • Parameter: Primary Heuristic Index (Index of the primary heuristic to be analyzed as an ensemble)
    • Default value: 0
  • Output: Serialized Annotated Document Corpus (Serialized Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Output: Vocabulary
  • Output: Heuristic Scores (Calculated B-Term Heuristic Scores)
  • Output: B-terms (List of bridging terms)
  • Output: Serialized BOW Model Dataset (Serialized sparse BOW feature vectors)
  • Output: Primary Heuristic Index (Index of the primary heuristic to be analyzed as an ensemble)
  • Example usage: LBD workflows for outlier detection

Category Helpers

Category Tagging

Widget: Condition Tagger

_images/question-mark.png

Automatically generated widget from function ConstructConditionTagger in package latino. The original function signature: ConstructConditionTagger.

  • Parameter: Feature Condition (Condition which tokens to include based on their features. Format examples: -Feature1 (don’t include tokens with Feature1 set ta any value) -Feature1=Value1 (don’t include tokens with Feature1 set to the value Value1) -Feature1 +Feature2 (don’t include tokens with Feature1 set unless it has also Feature2 set) -Feature1=Value1 +Feature2 (don’t include tokens with Feature1 set to Value1 unless it has also Feature2 set to any value)...)
  • Parameter: Output Feature Value (System.String)
    • Default value: true
  • Parameter: Put token/feature text as the output feature value (If set to true, the token’s text or the token’s feature text is assigned as the output feature value.)
  • Output: Tagger

Widget: Advanced Object Viewer

_images/question-mark.png

Displays any input.

  • Input: Object (Any type of object.)
  • Parameter: Attribute (The depth of the object display)
  • Parameter: Maximum Output Length (System.Int32)
    • Default value: 5000
  • Outputs: Popup window which shows widget’s results

Widget: Random Cross Validation Sets

_images/question-mark.png

Automatically generated widget from function RandomCrossValidationSets in package latino. The original function signature: RandomCrossValidationSets.

  • Input: Example List (Not required, but if set, then it overrides parameter ‘numOfExamples’ and len(examples) is used for ‘numOfExamples’. This should be a type implementing Count, Count() or Length.)
  • Parameter: Num of Examples (This determines the length of the set id list. If input ‘examples’ is set then len(examples) is used for ‘numOfExamples’ and this setting is overridden.)
    • Default value: 100
  • Parameter: Num of Sets (System.Int32)
    • Default value: 10
  • Parameter: Assign Sets Randomly (System.Boolean)
    • Default value: true
  • Parameter: Use Seed for Random (System.Boolean)
    • Default value: false
  • Parameter: Random Seed (System.Int32)
    • Default value: 0
  • Output: Example SetIds List
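
A minimal sketch of one plausible assignment scheme matching the parameters above (the widget's actual scheme may differ in detail):

    import random

    def random_cross_validation_sets(num_examples=100, num_sets=10,
                                     assign_randomly=True, use_seed=False, seed=0):
        # Give each example an integer set id in [0, num_sets), balanced
        # across the sets; optionally shuffle so the assignment is random.
        set_ids = [i % num_sets for i in range(num_examples)]
        if assign_randomly:
            rng = random.Random(seed if use_seed else None)
            rng.shuffle(set_ids)
        return set_ids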

Widget: Random Sequential Validation Sets

_images/question-mark.png

Automatically generated widget from function RandomSequentialValidationSets in package latino. The original function signature: RandomSequentialValidationSets.

  • Input: Example List (Not required, but if set, it overrides the parameter ‘numOfExamples’ and len(examples) is used for ‘numOfExamples’. This should be a type implementing Count, Count() or Length.)
  • Parameter: Num of Examples (This determines the length of the set id list. If input ‘examples’ is set, then len(examples) is used for ‘numOfExamples’ and this setting is overridden.)
    • Default value: 100
  • Parameter: Num of Sets (System.Int32)
    • Default value: 10
  • Parameter: Assign Sets Randomly (If not set, sets are distributed exactly evenly across the whole dataset.)
    • Default value: true
  • Parameter: Use Seed for Random (System.Boolean)
    • Default value: false
  • Parameter: Random Seed (System.Int32)
    • Default value: 0
  • Parameter: Size of Train Set (May be specified as an absolute number, or a number followed by ‘%’ to denote a percentage of the whole dataset; see the sketch after this list.)
    • Default value: 40%
  • Parameter: Size of Test Set (May be specified as an absolute number, or a number followed by ‘%’ to denote a percentage of the whole dataset.)
    • Default value: 10%
  • Parameter: Size of Space Between Train and Test Set (May be specified as an absolute number, or a number followed by ‘%’ to denote a percentage of the whole dataset.)
    • Default value: 1%
  • Output: Multiple Set Indexes
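
A sketch of how the size parameters might be interpreted for a single train/gap/test window (the widget emits multiple such windows across the dataset; the exact placement logic is not shown here):

    def parse_size(size, total):
        # "40%" -> 40 percent of the dataset; "40" -> absolute count of 40.
        s = str(size)
        return int(total * float(s[:-1]) / 100) if s.endswith("%") else int(s)

    def sequential_window(num_examples, train="40%", test="10%", gap="1%", start=0):
        n_train = parse_size(train, num_examples)
        n_gap = parse_size(gap, num_examples)
        n_test = parse_size(test, num_examples)
        train_idx = list(range(start, start + n_train))
        test_start = start + n_train + n_gap
        test_idx = list(range(test_start, test_start + n_test))
        return train_idx, test_idx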

Widget: Advanced Object to String Converter

_images/question-mark.png

Displays any input.

  • Input: Object (Any type of object.)
  • Parameter: Attribute (The attribute of the object to display)
  • Parameter: Maximum Output Length (System.Int32)
    • Default value: 500000
  • Output: Object String Representation

Widget: C#.NET Snippet

_images/question-mark.png

Runs a C#.NET snippet. Inputs are available inside the code as variables named “in1” .. “inN”. Whatever you want to output must be assigned to the variable “out1” before the code terminates.

  • Input: Snippet Input Parameter(s) (inputs can be accessed as variables “in1” .. “inN” inside the code)
  • Parameter: C# Snippet Code (Input can be accessed as variables “in1” .. “inN” inside the code and output can be accessed/assigned as variable “out1” inside the code.)
    • Default value:

      // This is the C#.NET Code Snippet where you can modify the data.
      // Variables “in1” .. “inN” contain whatever you connected to the input port.
      // Input variables are correctly typed.
      // Whatever is assigned to the variable “out1” will be transferred to the output port.
      out1 = in1;

  • Parameter: Namespace Section (using directives) (System.String)
    • Default value:

      using System;
      using System.Collections.Generic;
      using System.Linq;
      using Latino;
      using Latino.TextMining;
      using LatinoInterfaces;

  • Parameter: Additional References (imports) (System.String)
    • Default value:

      System.dll
      System.Xml.dll
      System.Core.dll
      workflowstextflows_dot_netbinLatino.dll
      workflowstextflows_dot_netbinLatinoWorkflows.dll
      workflowstextflows_dot_netbinLatinoInterfaces.dll

  • Output: out (output can be accessed/assigned as variable “out1” inside the code)
  • Output: Console Output
  • Output: Possible compile/runtime errors
  • Output: Generated Code

Widget: Display Table

_images/question-mark.png

Automatically generated widget from function ShowTable_PYTHON in package latino. The original function signature: ShowTable_PYTHON.

  • Input: Table (System.Object)
  • Outputs: Popup window which shows widget’s results

Widget: Get Multi Set Indexes

_images/question-mark.png

Generates multiple set indexes from a list of predefined set numbers. See the widgets “Cross Validation (Predefined Splits)” and “Multiple Splits Validation”.

  • Input: Sets (List with predefined set numbers) (System.Collections.Generic.List`1[[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]])
  • Output: Multiple Set Indexes

Widget: Flatten String Hierarchy

_images/question-mark.png

Automatically generated widget from function FlattenObjectToStringArray in package latino. The original function signature: FlattenObjectToStringArray.

  • Input: data (System.Object)
  • Output: flatData

Widget: Generate Integer Range

_images/question-mark.png

Automatically generated widget from function GenerateIntegerRange in package latino. The original function signature: GenerateIntegerRange.

  • Parameter: Start (System.Int32)
    • Default value: 0
  • Parameter: Stop (System.Int32)
    • Default value: 10
  • Parameter: Step (System.Int32)
    • Default value: 1
  • Output: Range

Widget: Python Snippet

_images/question-mark.png

Runs a Python snippet. Inputs are available inside the code as variables named “in1” .. “inN”. Whatever you want to output must be assigned to the variable “out1” before the code terminates.

  • Input: in (inputs can be accessed as variables “in1” .. “inN” inside the code)
  • Parameter: Python Snippet Code (Input can be accessed as variables “in1” .. “inN” inside the code and output can be accessed/assigned as variable “out1” inside the code.)
    • Default value:

      # This is the Python Code Snippet where you can modify the data however is needed.
      # Variables “in1” .. “inN” contain whatever you connected to the input port.
      # Whatever is assigned to the variable “out1” will be transferred to the output port.
      out1 = in1

  • Output: out (output can be accessed/assigned as variable “out1” inside the code)
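
For example, a snippet body that assumes the input port received a list of strings and outputs their lengths:

    # Hypothetical example: in1 is assumed to be a list of strings.
    out1 = [len(s) for s in in1]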

Widget: Split Object

_images/question-mark.png

Automatically generated widget from function SplitObject_PYTHON in package latino. The original function signature: SplitObject_PYTHON.

  • Input: object (System.Object)
  • Parameter: Object Modifier (If one wants to extract an object’s attributes, a leading dot should be used; see the sketch after this list.)
  • Output: object
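
A sketch of what a modifier with a leading dot might do (a hypothetical helper; the widget's actual modifier syntax may support more than this):

    def apply_modifier(obj, modifier):
        # ".name" -> read the attribute "name"; anything else is
        # treated as a key or index into the object.
        if modifier.startswith("."):
            return getattr(obj, modifier[1:])
        return obj[modifier]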

Category Noise Handling

Category Noise Filters

Widget: Classification Filter

_images/CF-filter-black.png

A widget which uses a classifier as a tool for detecting noisy instances in data.

  • Input: Learner
  • Input: Dataset
  • Parameter: Timeout
    • Default value: 300
  • Parameter: Number of Folds for Cross-Validation
    • Possible values:
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
    • Default value: 10
  • Output: Noise instances
  • Example usage: Outlier document detection

Widget: Matrix Factorization Filter

_images/CF-filter-black.png
  • Input: Dataset
  • Parameter: Threshold
    • Default value: 10
  • Output: Noise instances

Widget: Saturation Filter

_images/SF-filter_1.png

Widget implementing a saturation filter used to eliminate noisy training examples from labeled data. Reference: http://www.researchgate.net/publication/228898399

  • Input: Dataset
  • Parameter: Type of Saturation Filtering
    • Possible values:
      • Normal
      • Pre-pruned
    • Default value: normal
  • Output: Noise instances

Widget: HARF

_images/HARF_60-48-RF.png

High Agreement Random Forest

  • Parameter: Agreement Level
    • Possible values:
      • 60
      • 70
      • 80
      • 90
    • Default value: 70
  • Output: HARF Classifier

Widget: NoiseRank

_images/NoiseRank3.png

Widget implementing an ensemble-based noise ranking methodology for explicit noise and outlier identification. Reference: http://dx.doi.org/10.1007/s10618-012-0299-1

  • Input: Dataset
  • Input: Noisy Instances
  • Output: All Noise
  • Output: Selected Instances
  • Output: Selected Indices
  • Example usage: Outlier document detection

Category Performance Evaluation

Widget: Aggregate Detection Results

_images/question-mark.png

Aggregates the results of the detection of noisy instances in data.

  • Input: Positive Indices
  • Input: Detected Instances
  • Output: Aggregated Detection Results

Widget: Classification statistics

_images/question-mark.png

Calculates various classification statistics from true and predicted labels; a minimal sketch of the computation follows the output list. Labels can be provided in two ways:

  1. as a single pair: [y_true, y_predicted]
  2. per fold: [[y_true_1, y_predicted_1], [y_true_2, y_predicted_2], ...]
  • Input: True and predicted labels (List of true and predicted labels (see help for details))
  • Output: Classification accuracy
  • Output: Precision
  • Output: Recall
  • Output: F1 (F1 measure)
  • Output: AUC
  • Output: Confusion matrix
  • Example usage: COMTRADE demo
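
A minimal sketch of the computation for the single-pair format with binary labels (the widget likely covers more, e.g. AUC, the confusion matrix, and multi-class input; this only illustrates the four basic statistics):

    def classification_statistics(y_true, y_pred, positive=1):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == positive and p == positive)
        tn = sum(1 for t, p in pairs if t != positive and p != positive)
        fp = sum(1 for t, p in pairs if t != positive and p == positive)
        fn = sum(1 for t, p in pairs if t == positive and p != positive)
        accuracy = (tp + tn) / len(pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return accuracy, precision, recall, f1

For the fold format, the same statistics would be computed per [y_true_i, y_predicted_i] pair and then aggregated.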

Widget: Evaluate Detection Algorithms

_images/question-mark.png
  • Input: Noisy Instances
  • Input: Detected Noise
  • Parameter: Beta parameter for F-measure
    • Default value: 1
  • Output: Noise Detection Performance
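
The beta parameter weights recall against precision in the standard F-measure; a small illustration of the standard formula (not widget-specific code):

    def f_beta(precision, recall, beta=1.0):
        # beta > 1 favours recall, beta < 1 favours precision; beta = 1 gives F1.
        b2 = beta ** 2
        denom = b2 * precision + recall
        return (1 + b2) * precision * recall / denom if denom else 0.0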

Widget: Evaluate Repeated Detection

_images/question-mark.png
  • Input: Algorithm Performances
  • Parameter: F-measure Beta-parameter
    • Default value: 1
  • Output: Performance Results

Widget: Evaluation Results to 2d Table

_images/question-mark.png

A table that can be used in workflows with nested loops. You can define the names on the x and y axes, and choose the evaluation metric to show from a dropdown menu.

  • Input: Evaluation Results
  • Parameter: Evaluation metric (Choose the evaluation measurement you would like to show in the table.)
    • Possible values:
      • accuracy
      • auc
      • fscore
      • precision
      • recall
    • Default value: accuracy
  • Outputs: Popup window which shows widget’s results
  • Example usage: POS tagger intrinsic evaluation - experiment 5

Widget: Evaluation Results to Table

_images/question-mark.png

Widget: Performance Chart

_images/question-mark.png

Widget: VIPER: Visual Performance Evaluation

_images/question-mark.png

VIPER: performance evaluation in the Precision-Recall (PR) space. An interactive widget showing the PR plot, which can also be saved as an image or printed.

  • Input: Algorithm Performance
  • Parameter: eps-proximity evaluation parameter [%]
    • Possible values:
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • Do not use eps-proximity evaluation
    • Default value: 0.05
  • Outputs: Popup window which shows widget’s results
  • Example usage: POS tagging classification evaluation (copy)

Widget: Extract Actual and Predicted features

_images/question-mark.png

Takes as input an ADC object with predicted features and an ADC object with actual features (gold standard). The output is a list containing a list of predicted features and a list of actual features.

  • Input: Annotated Document Corpus (Annotated Document Corpus (workflows.textflows.DocumentCorpus))
  • Parameter: Predicted annotation (System.String)
    • Default value: POS tag
  • Parameter: Actual annotation (System.String)
    • Default value: POS tag
  • Parameter: Lowercase (Convert features to lowercase)
    • Default value: False
  • Output: Actual and Predicted Values (List of Actual and Predicted Values)

Category Visual performance evaluation (ViperCharts)

Category Column charts

Widget: Column chart

_images/question-mark.png

Standard graphical presentation of algorithm performance. Also referred to as a bar chart. Visualizes the values of one or more performance measures of the evaluated algorithms.

  • Input: Performance results
  • Outputs: Popup window which shows widget’s results

Category Curve charts

Widget: Lift curves

_images/question-mark.png

The Lift curve widget plots the true positive rate (also found in ROC and PR curves) against the predicted positive rate (the fraction of examples classified as positive). Each point represents the classifier performance for a given threshold or ranking cut-off point; a sketch of the construction follows the list. http://viper.ijs.si/types/curve/

  • Input: Performance results
  • Parameter: Chart title
  • Outputs: Popup window which shows widget’s results
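
A sketch of the construction described above, assuming binary 0/1 labels and real-valued scores (one point per ranking cut-off; at least one positive example is assumed):

    def lift_curve_points(y_true, scores):
        # Sort examples by decreasing score; after each cut-off, record
        # (predicted positive rate, true positive rate).
        ranked = sorted(zip(scores, y_true), reverse=True)
        n, n_pos = len(ranked), sum(label for _, label in ranked)
        points, tp = [], 0
        for i, (_, label) in enumerate(ranked, start=1):
            tp += label
            points.append((i / n, tp / n_pos))
        return points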

Widget: ROC curves

_images/question-mark.png

A widget which illustrates the trade-off between the true positive rate and the true negative rate of a classifier. Each point represents the classifier performance for a given threshold or ranking cut-off point. http://viper.ijs.si/types/curve/

  • Input: Performance results
  • Parameter: Chart title
  • Outputs: Popup window which shows widget’s results

Widget: ROC hull curves

_images/question-mark.png

The ROC Hull chart widget plots the upper convex hull of the ROC chart. Each point represents the classifier performance for a given threshold or ranking cut-off point. Points on the ROC Hull represent an optimal performance of the classifier for certain misclassification costs. http://viper.ijs.si/types/curve/

  • Input: Performance results
  • Parameter: Chart title
  • Outputs: Popup window which shows widget’s results

Widget: PR curves

_images/question-mark.png

A widget which provides the PR (precision-recall) curve. It presents the trade-off between precision (the fraction of examples classified as positive that are truly positive) and recall, the true positive rate. Each point represents the classifier performance for a given threshold or ranking cut-off point. http://viper.ijs.si/types/curve/

  • Input: Performance results
  • Parameter: Chart title
  • Outputs: Popup window which shows widget’s results

Widget: Cost curves

_images/question-mark.png

The Cost curve widget plots the normalized expected cost of the classifier as a function of the skew (fraction of positive examples multiplied by the cost of misclassifying a positive example) of the data on which it is deployed. Lines and points on the cost curve correspond to points and lines on the ROC curve of the classifier. http://viper.ijs.si/types/curve/

  • Input: Performance results
  • Parameter: Chart title
  • Outputs: Popup window which shows widget’s results
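
Following the standard cost-curve construction, each classifier operating point (FPR, TPR) becomes a line in cost space; a sketch of one such line (the widget's exact normalization may differ):

    def cost_line(fpr, tpr, pc_positive):
        # Normalized expected cost at skew pc_positive (the probability-cost
        # of the positive class): false negatives weighted by the skew,
        # false positives by its complement.
        fnr = 1.0 - tpr
        return fnr * pc_positive + fpr * (1.0 - pc_positive)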

Widget: Kendall curves

_images/question-mark.png

The Kendall chart widget presents the difference between the normalized expected cost of the classifier and the normalized expected cost of an ideal classifier. Costs for both classifiers are calculated using the rate-driven threshold choice method. http://viper.ijs.si/types/curve/

  • Input: Performance results
  • Parameter: Chart title
  • Outputs: Popup window which shows widget’s results

Widget: Rate driven curves

_images/question-mark.png

The Rate-Driven chart widget plots the expected loss for the classifier as a function of the skew (fraction of positive examples multiplied by the cost of misclassifying a positive example) of the data on which it is deployed. The cost is calculated using the rate-driven threshold choice method. http://viper.ijs.si/types/curve/

  • Input: Performance results
  • Parameter: Chart title
  • Outputs: Popup window which shows widget’s results

Category Scatter charts

Widget: ROC space

_images/question-mark.png

Scatter chart - ROC space. Provides an easy and intuitive visual performance evaluation in terms of Recall, Precision and F-measure. By introducing the F-isolines into the precision-recall space, the 2-dimensional graphic representation reveals information about an additional, third evaluation measure. http://viper.ijs.si/types/scatter/

  • Input: Performance results
  • Outputs: Popup window which shows widget’s results

Widget: PR space

_images/question-mark.png

Scatter chart - Precision-Recall space. Provides an easy and intuitive visual performance evaluation in terms of Recall, Precision and F-measure. By introducing the F-isolines into the precision-recall space, the 2-dimensional graphic representation reveals information about an additional, third evaluation measure. http://viper.ijs.si/types/scatter/

  • Input: Performance results
  • Outputs: Popup window which shows widget’s results
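
The F-isolines mentioned above are curves of constant F-measure in the precision-recall plane. For F1 = 2PR/(P+R), precision along an isoline can be written as a function of recall (a standard derivation, not widget code):

    def f1_isoline_precision(recall, f1):
        # From f1 = 2*p*r / (p + r), solve for p given r.
        # Only defined where 2*recall > f1.
        return f1 * recall / (2 * recall - f1)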

Category Utilities

Widget: Prepare performance curve data

_images/question-mark.png
  • Input: Actual and Predicted values
  • Parameter: Prediction Type (Prediction scores or ranks)
    • Possible values:
      • Ranks
      • Scores
    • Default value: -score
  • Output: Performance curve data