1. Introduction
The [ASIMOV] Platform is a polyglot development platform for trustworthy, neurosymbolic AI.
This specification defines the algorithm for resolving data source URIs to ASIMOV modules using pattern matching. The resolution process enables the platform to automatically discover and select modules that are capable of extracting and transforming specific data sources into knowledge graph datasets.
1.1. Overview
The ASIMOV Module Resolution Specification defines a standardized algorithm for matching URIs against module capability declarations to determine which modules can handle specific resources. The resolution process enables:
- Automatic Module Selection: Given a URI, the platform can automatically select appropriate modules for processing
- Pattern-Based Matching: Supports exact matches, prefixes, and parameterized patterns for flexible resource handling
- Conflict Resolution: Provides deterministic rules for selecting modules when multiple candidates are available
- Extensible Architecture: Allows modules to declare new resource types and patterns
1.2. Scope
This specification covers:
- The URI tokenization and normalization process
- Pattern matching algorithms for different handler types
- Module selection and conflict resolution rules
- The data structures and state machines used in resolution
This specification does not cover:
- The format of module manifests (see [ASIMOV-MMS])
- Runtime execution of selected modules
- Inter-module communication protocols
1.3. Conformance
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
A conforming resolver implements the resolution algorithm defined in this specification and, for every valid input URI, returns the set of modules that the algorithm prescribes.
2. Resolution Algorithm
2.1. URI Tokenization
The resolution process begins with tokenizing the input URI into a sequence of sections that can be matched against module patterns.
2.1.1. Tokenization Process
Given a URI, the tokenizer MUST:
- Extract the scheme: The protocol portion before the first colon (e.g., https, file, near)
- Parse the authority: For hierarchical URIs, extract and reverse the domain components
- Extract path segments: Split the path on forward slashes, ignoring empty segments
- Extract query parameters: Parse the query string into name-value pairs
2.1.2. Section Types
The tokenizer produces the following section types:
- Protocol: The URI scheme (e.g., https, file, near)
- Domain: A single domain component in reverse order (e.g., com, example from example.com)
- Path: A single path segment (e.g., search, users from /search/users)
- QueryParamName: The name of a query parameter (e.g., q from ?q=value)
- QueryParamValue: The value of a query parameter (e.g., value from ?q=value)
2.1.3. Normalization Rules
During tokenization, the following normalization rules MUST be applied:
- www Removal: For HTTP/HTTPS URIs, remove a leading www. from the domain
- Domain Reversal: Domain components are stored in reverse order (TLD first)
- Empty Segment Filtering: Empty path segments are ignored
- Query Parameter Ordering: Query parameters are processed in the order they appear
2.1.4. Tokenization Examples
    # Input:  https://example.com/search?q=test
    # Output: [Protocol("https"), Domain("com"), Domain("example"),
    #          Path("search"), QueryParamName("q"), QueryParamValue("test")]

    # Input:  near://account/alice.near
    # Output: [Protocol("near"), Path("account"), Path("alice.near")]

    # Input:  file:///path/to/file.txt
    # Output: [Protocol("file"), Path("path"), Path("to"), Path("file.txt")]
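The tokenization and normalization rules above can be illustrated with a short, non-normative Python sketch. The function name `tokenize` and the `(section_type, value)` tuple encoding are assumptions made for illustration. The sketch also assumes that only `http` and `https` authorities are split into Domain sections, and that for other schemes the authority is treated as the first path segment, which is what the `near://account/alice.near` example suggests. Ports, user information, and percent-encoding policy are not addressed.

```python
from urllib.parse import urlsplit, parse_qsl


def tokenize(uri: str) -> list[tuple[str, str]]:
    """Split a URI into (section_type, value) pairs (illustrative encoding)."""
    parts = urlsplit(uri)
    if not parts.scheme:
        raise ValueError(f"URI has no scheme: {uri!r}")
    sections = [("Protocol", parts.scheme)]

    host, path = parts.netloc, parts.path
    if parts.scheme in ("http", "https"):
        # www removal, then domain reversal (TLD first).
        if host.startswith("www."):
            host = host[len("www."):]
        sections += [("Domain", label) for label in reversed(host.split(".")) if label]
    elif host:
        # Assumption: a non-HTTP authority becomes the first path segment.
        path = f"/{host}{path}"

    # Empty segment filtering.
    sections += [("Path", segment) for segment in path.split("/") if segment]

    # Query parameters are emitted in the order they appear.
    # Note: parse_qsl percent-decodes names and values.
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        sections.append(("QueryParamName", name))
        sections.append(("QueryParamValue", value))
    return sections


for example in ("https://example.com/search?q=test",
                "near://account/alice.near",
                "file:///path/to/file.txt"):
    print(example, "->", tokenize(example))
```

Run against the three inputs in this section, the sketch yields the same section sequences as the listed outputs, expressed as tuples.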
2.2. Pattern Types
Modules can declare different types of patterns for matching URIs:
2.2.1. Protocol Patterns
Protocol patterns match URIs based on their scheme. A protocol pattern matches any URI that begins with the specified protocol, effectively acting as a prefix match.
Example:
    handles:
      url_protocols:
        - near
        - ipfs
2.2.2. Prefix Patterns
Prefix patterns match URIs that begin with a specific prefix. The matching is exact up to the end of the declared prefix, and any additional path segments or query parameters are ignored.
Example:
    handles:
      url_prefixes:
        - https://api.github.com/
        - https://example.com/api/v1/
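As a non-normative illustration, prefix matching can be expressed over tokenized sections: a prefix pattern matches when its sections are the leading sections of the URI. The function name `is_prefix_match` and the tuple encoding of sections are hypothetical. A protocol pattern is the special case where the pattern consists of a single Protocol section.

```python
def is_prefix_match(pattern_sections, uri_sections):
    """True when the URI's sections begin with every section of the pattern."""
    return uri_sections[:len(pattern_sections)] == pattern_sections


# https://api.github.com/ tokenized by hand (hypothetical tuple encoding).
prefix = [("Protocol", "https"), ("Domain", "com"),
          ("Domain", "github"), ("Domain", "api")]
uri = prefix + [("Path", "repos"), ("Path", "owner"), ("Path", "name")]
other = [("Protocol", "https"), ("Domain", "com"), ("Domain", "example")]

print(is_prefix_match(prefix, uri))    # True
print(is_prefix_match(prefix, other))  # False
```

One consequence of comparing whole sections in this sketch is that a prefix ending in the segment `v1` does not match a URI whose corresponding segment is `v10`, which a plain string-prefix comparison would accept.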
2.2.3. Parameterized Patterns
Parameterized patterns match URIs with variable components, allowing extraction of parameters from the URI structure.
2.2.3.1. Pattern Syntax
Parameterized patterns use the following syntax:
- * in domain position: Matches zero or more subdomains
- :name in path position: Matches any single path segment
- :name in query value position: Matches any query parameter value
Example:
    handles:
      url_patterns:
        - https://*.example.com/users/:id
        - https://search.example.com/?q=:query
2.2.3.2. Wildcard Matching
Wildcard domain patterns (*) match zero or more subdomain components. This enables matching of URIs with varying numbers of subdomains.
Wildcard path patterns (:name) match exactly one path segment with any value.
Wildcard query patterns (:name) match any value for a specific query parameter name.
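The wildcard rules can be sketched, non-normatively, as a small recursive matcher that also extracts the named parameters. The function name `match_pattern`, the section-type names used for wildcards in the tuple encoding, and the requirement that the pattern cover the whole input are assumptions of this sketch; this specification does not state whether trailing sections are permitted after a parameterized pattern.

```python
def match_pattern(pattern, sections, params=None):
    """Return the extracted parameters if `sections` match `pattern`, else None."""
    params = dict(params or {})
    if not pattern:
        return params if not sections else None
    kind, value = pattern[0]

    if kind == "WildcardDomain":
        # Zero subdomains: skip the wildcard entirely.
        result = match_pattern(pattern[1:], sections, params)
        if result is not None:
            return result
        # One or more: consume one Domain section and stay on the wildcard.
        if sections and sections[0][0] == "Domain":
            return match_pattern(pattern, sections[1:], params)
        return None

    if not sections:
        return None
    input_kind, input_value = sections[0]
    if kind == "WildcardPath" and input_kind == "Path":
        params[value] = input_value
    elif kind == "WildcardQueryParamValue" and input_kind == "QueryParamValue":
        params[value] = input_value
    elif (kind, value) != (input_kind, input_value):
        return None
    return match_pattern(pattern[1:], sections[1:], params)


# https://*.example.com/users/:id, encoded by hand.
pattern = [("Protocol", "https"), ("Domain", "com"), ("Domain", "example"),
           ("WildcardDomain", "*"), ("Path", "users"), ("WildcardPath", "id")]
# https://api.example.com/users/42
sections = [("Protocol", "https"), ("Domain", "com"), ("Domain", "example"),
            ("Domain", "api"), ("Path", "users"), ("Path", "42")]
print(match_pattern(pattern, sections))  # {'id': '42'}
```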
2.2.4. File Extension Patterns
File extension patterns match URIs with the file:// scheme based on the file extension. The extension is extracted from the last path segment.
Example:
    handles:
      file_extensions:
        - csv
        - json
        - tar.gz
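Because a declared extension such as tar.gz spans more than one dot-separated component, an implementation has to consider several candidate extensions for one file name. The following non-normative sketch shows one way to do that; the function names, the dictionary-based index, and the longest-candidate-first lookup order are assumptions, not requirements of this specification.

```python
def extension_candidates(filename):
    """List candidate extensions, longest first (e.g. 'tar.gz' before 'gz')."""
    parts = filename.split(".")
    return [".".join(parts[i:]) for i in range(1, len(parts))]


def modules_for_file(path_segments, extension_index):
    """Look up modules by the extension of the last path segment."""
    if not path_segments:
        return []
    for extension in extension_candidates(path_segments[-1]):
        if extension in extension_index:
            return list(extension_index[extension])
    return []


# Hypothetical index built from the declaration above.
index = {"csv": ["csv-module"], "json": ["csv-module"], "tar.gz": ["csv-module"]}
print(extension_candidates("archive.tar.gz"))                  # ['tar.gz', 'gz']
print(modules_for_file(["path", "to", "data.csv"], index))     # ['csv-module']
```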
2.3. Resolution State Machine
The resolution algorithm uses a finite state machine to track possible matches as it processes the tokenized URI.
2.3.1. State Representation
Each state in the resolution process is represented by a node that contains:
- Transitions: A mapping from section types to destination nodes
- Modules: A set of modules that can handle URIs reaching this state
- Free Moves: Special transitions that match any input without consuming it
2.3.2. State Transitions
The state machine processes input sections sequentially, following these rules:
- Start with root states: Initialize with all root nodes whose patterns match the first input section
- Process remaining input: For each subsequent input section, find all reachable states
- Follow free moves: After each transition, follow any available free move transitions
- Collect results: Gather all modules from states reached after processing all input
2.3.3. Free Move Semantics
Free moves are special transitions that enable:
- Prefix matching: Allowing additional path segments beyond the declared prefix
- Protocol matching: Treating protocols as prefixes that match any URI with that scheme
- Wildcard domain repetition: Enabling * patterns to match multiple subdomain levels
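The state machine described in this section can be sketched, non-normatively, as a set-of-active-states simulation. Everything named here (`Node`, `Resolver`, `section_matches`, the tuple encoding) is hypothetical. The sketch makes three simplifying assumptions: each registered pattern gets its own chain of nodes with no sharing; a free move is stored as a per-node target that is taken for any input section, which is used for protocol and prefix patterns; and wildcard domain repetition is modelled as a self-loop transition rather than as a free move. Other readings of the free move semantics are possible.

```python
from dataclasses import dataclass, field
from typing import Optional

WILDCARDS = {"WildcardDomain": "Domain", "WildcardPath": "Path",
             "WildcardQueryParamValue": "QueryParamValue"}


def section_matches(pattern, section):
    """Literal sections match exactly; wildcards match by section type."""
    return pattern == section or WILDCARDS.get(pattern[0]) == section[0]


@dataclass
class Node:
    transitions: dict = field(default_factory=dict)  # pattern section -> node index
    modules: set = field(default_factory=set)
    free_move: Optional[int] = None  # taken for any input section (assumption)


class Resolver:
    def __init__(self):
        self.nodes = []
        self.roots = []  # (first pattern section, node index)

    def add(self, module, pattern, prefix=False):
        """Register a non-empty pattern as its own chain of nodes."""
        current = None
        for section in pattern:
            if section[0] == "WildcardDomain" and current is not None:
                # Self-loop: zero or more extra Domain sections.
                self.nodes[current].transitions[section] = current
                continue
            self.nodes.append(Node())
            nxt = len(self.nodes) - 1
            if current is None:
                self.roots.append((section, nxt))
            else:
                self.nodes[current].transitions[section] = nxt
            current = nxt
        self.nodes[current].modules.add(module)
        if prefix:
            self.nodes[current].free_move = current  # accept any further input

    def resolve(self, sections):
        if not sections:
            return set()
        active = {n for pat, n in self.roots if section_matches(pat, sections[0])}
        for section in sections[1:]:
            reached = set()
            for n in active:
                node = self.nodes[n]
                reached.update(t for pat, t in node.transitions.items()
                               if section_matches(pat, section))
                if node.free_move is not None:
                    reached.add(node.free_move)
            active = reached
            if not active:
                break  # no state is reachable any more
        return set().union(*(self.nodes[n].modules for n in active))


https_example = [("Protocol", "https"), ("Domain", "com"), ("Domain", "example")]
resolver = Resolver()
resolver.add("Module A", [("Protocol", "https")], prefix=True)
resolver.add("Module B", https_example, prefix=True)
resolver.add("Module C", https_example + [("Path", "api"), ("WildcardPath", "endpoint")])
print(sorted(resolver.resolve(https_example + [("Path", "api"), ("Path", "users")])))
```

With one protocol pattern, one prefix pattern, and one parameterized pattern registered as above, the printed result for `https://example.com/api/users` contains all three modules.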
2.4. Resolution Process
2.4.1. Input Processing
The resolution process follows these steps:
- Tokenize URI: Convert the input URI into a sequence of sections
- Handle file extensions: For file:// URIs, check file extension patterns first
- Initialize state set: Find all root states that match the first input section
- Process input sequence: For each remaining input section, advance the state machine
- Collect results: Gather all modules from final states
2.4.2. Matching Rules
Section matching follows these precedence rules:
- Exact matches: Literal sections match exactly
- Wildcard matches: Wildcard sections match corresponding input types
- Free moves: Always match without consuming input
The matching function for sections is defined as:
    matches(pattern_section, input_section) :=
        pattern_section == input_section
        OR (pattern_section == WildcardDomain AND input_section is Domain)
        OR (pattern_section == WildcardPath AND input_section is Path)
        OR (pattern_section == WildcardQueryParamValue AND input_section is QueryParamValue)
        OR pattern_section == FreeMove
2.4.3. Conflict Resolution
When multiple modules match a URI, the resolver returns all matching modules. The selection of which module to use for processing is left to higher-level platform components.
For components that need to choose among the returned modules, the following precedence rules are RECOMMENDED:
- Specificity: More specific patterns take precedence over less specific ones
- Pattern type precedence: Parameterized patterns > Prefix patterns > Protocol patterns
- Path length: Longer paths take precedence over shorter ones
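A higher-level component could apply the recommended precedence rules with a sort key, as in the non-normative sketch below. This specification does not define how specificity is measured or how the three rules combine. The sketch assumes that specificity is the number of literal (non-wildcard) sections and that the rules are applied in the order listed. The candidate tuple layout and all names are hypothetical.

```python
TYPE_RANK = {"pattern": 2, "prefix": 1, "protocol": 0}


def rank(candidates):
    """Order module names from most to least preferred.

    Each candidate is (module_name, pattern_type, pattern_sections).
    """
    def key(candidate):
        _, pattern_type, sections = candidate
        # Assumption: specificity = number of non-wildcard sections.
        literal = sum(1 for kind, _ in sections if not kind.startswith("Wildcard"))
        path_length = sum(1 for kind, _ in sections if kind in ("Path", "WildcardPath"))
        return (literal, TYPE_RANK[pattern_type], path_length)

    return [c[0] for c in sorted(candidates, key=key, reverse=True)]


base = [("Protocol", "https"), ("Domain", "com"), ("Domain", "example")]
candidates = [
    ("Module A", "protocol", [("Protocol", "https")]),
    ("Module B", "prefix", base),
    ("Module C", "pattern", base + [("Path", "api"), ("WildcardPath", "endpoint")]),
]
print(rank(candidates))  # ['Module C', 'Module B', 'Module A']
```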
3. Examples
3.1. Basic Resolution Examples
3.1.1. Protocol Resolution
    # Module declares:
    handles:
      url_protocols:
        - near

    # Resolves:
    near://account/alice.near -> [near-module]
    near://tx/ABC123 -> [near-module]
    near -> [near-module]
3.1.2. Prefix Resolution
    # Module declares:
    handles:
      url_prefixes:
        - https://api.github.com/

    # Resolves:
    https://api.github.com/ -> [github-module]
    https://api.github.com/users -> [github-module]
    https://api.github.com/repos/owner/name -> [github-module]
3.1.3. Pattern Resolution
    # Module declares:
    handles:
      url_patterns:
        - https://youtube.com/watch?v=:video_id

    # Resolves:
    https://youtube.com/watch?v=ABC123 -> [youtube-module]
3.2. Advanced Resolution Examples
3.2.1. Wildcard Domains
    # Module declares:
    handles:
      url_patterns:
        - https://*.example.com/api/:endpoint

    # Resolves:
    https://example.com/api/users -> [api-module]
    https://api.example.com/api/users -> [api-module]
    https://v1.api.example.com/api/users -> [api-module]
3.2.2. Multiple Handlers
    # Module A declares:
    handles:
      url_protocols:
        - https

    # Module B declares:
    handles:
      url_prefixes:
        - https://example.com/

    # Module C declares:
    handles:
      url_patterns:
        - https://example.com/api/:endpoint

    # Resolution:
    https://example.com/api/users -> [Module A, Module B, Module C]
    https://example.com/page -> [Module A, Module B]
    https://other.com/page -> [Module A]
3.2.3. File Extensions
    # Module declares:
    handles:
      file_extensions:
        - csv
        - tar.gz

    # Resolves:
    file:///path/to/data.csv -> [csv-module]
    file:///archive.tar.gz -> [csv-module]
3.3. Complex Resolution Scenario
Consider a comprehensive example with multiple module types:
    # Search module
    name: search-aggregator
    handles:
      url_patterns:
        - https://google.com/search?q=:query
        - https://bing.com/search?q=:query

    # Social media module
    name: social-scraper
    handles:
      url_prefixes:
        - https://twitter.com/
        - https://x.com/
      url_patterns:
        - https://youtube.com/watch?v=:video_id

    # NEAR module
    name: near-integration
    handles:
      url_protocols:
        - near
      url_patterns:
        - https://explorer.near.org/accounts/:account

    # File processor module
    name: data-processor
    handles:
      file_extensions:
        - csv
        - json
Resolution results:
- https://google.com/search?q=ASIMOV → [search-aggregator]
- https://x.com/username → [social-scraper]
- https://youtube.com/watch?v=ABC123 → [social-scraper]
- near://account/alice.near → [near-integration]
- https://explorer.near.org/accounts/alice.near → [near-integration]
- file:///data/export.csv → [data-processor]
4. Implementation Considerations
4.1. Data Structures
4.1.1. Resolver State
A conforming resolver implementation MUST maintain:
- Module registry: A mapping from module names to module metadata
- File extension index: A mapping from file extensions to lists of capable modules
- State machine nodes: A collection of nodes representing the resolution state space
- Root node registry: A mapping from initial sections to starting nodes
4.1.2. Node Structure
Each node in the state machine MUST contain:
- Transition table: A mapping from section types to destination node identifiers
- Module set: A collection of modules that can handle URIs reaching this node
- Free move target: An optional reference to a node reachable via free move
4.1.3. Memory Management
Implementations SHOULD consider:
- Shared module references: Avoid duplicating module metadata across nodes
- Compact node representation: Use efficient data structures for transition tables
- Lazy evaluation: Only compute reachable states when needed
4.2. Performance Considerations
4.2.1. Algorithmic Complexity
The resolution algorithm has the following complexity characteristics:
- Time complexity: O(n × m) where n is the number of input sections and m is the number of active states
- Space complexity: O(k) where k is the total number of registered patterns
- Preprocessing: O(p) where p is the number of patterns to register
4.2.2. Optimization Strategies
Implementations MAY employ:
- Early termination: Stop processing when no more states are reachable
- State deduplication: Merge identical states during construction
- Transition caching: Cache frequently used transition computations
- Batch processing: Process multiple URIs in batches to amortize setup costs
4.3. Error Handling
4.3.1. Invalid URIs
The resolver MUST handle invalid URIs gracefully:
- Malformed URIs: Return an error indicating the URI cannot be parsed
- Empty URIs: Return an error indicating the URI is empty
- Unsupported schemes: Attempt resolution, which may return no results
4.3.2. Resolution Failures
When no modules can handle a URI:
- Return empty result: The resolver SHOULD return an empty list rather than an error
- Logging: Implementations MAY log unsuccessful resolution attempts for debugging
- Fallback modules: Implementations MAY provide fallback modules for common cases
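The distinction between invalid input, which is reported as an error, and an unmatched URI, which yields an empty list, can be sketched as follows. This is a non-normative illustration: `InvalidUriError`, `checked_resolve`, and the `lookup` callable standing in for the resolution algorithm are hypothetical names, and treating a missing scheme as a parse failure is an assumption.

```python
from urllib.parse import urlsplit


class InvalidUriError(ValueError):
    """Raised when the input cannot be treated as a URI at all."""


def checked_resolve(uri, lookup):
    """Validate `uri`, then delegate to `lookup` (a stand-in for resolution)."""
    if not uri.strip():
        raise InvalidUriError("URI is empty")
    try:
        parts = urlsplit(uri)
    except ValueError as exc:
        raise InvalidUriError(f"URI cannot be parsed: {uri!r}") from exc
    if not parts.scheme:
        raise InvalidUriError(f"URI cannot be parsed (no scheme): {uri!r}")
    # No match is not an error: an empty list is returned as-is.
    return list(lookup(uri))


def demo_lookup(uri):
    # Hypothetical resolver that only knows the near scheme.
    return ["near-module"] if uri.startswith("near:") else []


print(checked_resolve("near://account/alice.near", demo_lookup))  # ['near-module']
print(checked_resolve("gopher://example.org/", demo_lookup))      # []
```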
5. Security Considerations
5.1. Pattern Injection
Implementations MUST prevent pattern injection attacks:
- Input validation: Validate all pattern strings before registration
- Sanitization: Remove or escape potentially dangerous characters
- Pattern limits: Impose reasonable limits on pattern complexity
5.2. Resource Consumption
The resolution algorithm MUST protect against resource exhaustion:
- State explosion: Limit the number of active states during resolution
- Pattern complexity: Impose limits on pattern depth and branching factor
- Memory usage: Implement bounds on memory consumption for large pattern sets
5.3. URI Validation
Input URIs SHOULD be validated before processing:
- Scheme validation: Ensure schemes conform to RFC 3986
- Length limits: Impose reasonable limits on URI length
- Character encoding: Handle Unicode characters appropriately
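Two of these checks are simple enough to sketch. The non-normative example below tests the scheme against the RFC 3986 grammar (a letter followed by letters, digits, "+", "-", or ".") and enforces a length limit. The function name `validate_uri` and the limit of 8192 characters are assumptions; this specification does not fix a value. Unicode handling is left out of the sketch.

```python
import re

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")
MAX_URI_LENGTH = 8192  # hypothetical limit


def validate_uri(uri, max_length=MAX_URI_LENGTH):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(uri) > max_length:
        problems.append("URI exceeds length limit")
    scheme, separator, _ = uri.partition(":")
    if not separator or not SCHEME_RE.fullmatch(scheme):
        problems.append("scheme does not conform to RFC 3986")
    return problems


print(validate_uri("https://example.com/search?q=test"))  # []
print(validate_uri("1http://example.com/"))
```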
6. IANA Considerations
This specification does not require any IANA registrations.
7. Acknowledgments
The editors would like to thank the ASIMOV Platform community for their contributions and feedback during the development of this specification.
8. Changes
This section will document changes between versions of this specification.
8.1. Version 1.0
Initial version of the ASIMOV Module Resolution Specification.