AsciiDoc Parser - Component Specification

1. Introduction

This specification defines the AsciiDocParser - a lightweight component for parsing AsciiDoc documents. The parser is intentionally tailored to the requirements of this project and is not a complete Asciidoctor-compatible parser.

1.1. Purpose

The AsciiDocParser serves to:

Capture document structure: Extract sections and build hierarchical structure
Resolve includes: Process include::[] directives recursively with source mapping
Identify elements: Recognize code blocks, tables, images, admonitions, and PlantUML as addressable blocks
Manage attributes: Set document attributes and resolve them in paths/content
Capture cross-references: Collect [[anchor]] definitions and <<anchor>> / xref:[] references for link validation
Source file mapping: Capture line numbers and source file for each element

1.2. Scope Limitations

The parser is not a complete Asciidoctor renderer. It:

Does not render HTML/PDF
Does not parse inline formatting (bold, italic, monospace)
Does not analyze table contents in detail
Does not process complex list structures
Does not support ifeval::[] conditional evaluation

2. Technical Debt

TD-ADOC-001: ifeval Conditional Not Supported

The ifeval::[] directive is not supported in this version. ifdef::attr[] / ifndef::attr[] / endif::[] are fully supported (Issue #14).

Priority: Low (not required for MVP)

3. Supported AsciiDoc Features

3.1. Fully Supported

Feature	Description	Usage
Sections	`= Title` to `====== Level 5`	Document structure
Document Header	Title and attributes before first content	Metadata
Document Attributes	`:attribute: value`	Configuration, metadata
Attribute References	`{attribute}` in text and paths	Dynamic values
Include Directive	`include::path[]` with attribute substitution	Document composition
Source Blocks	`[source,language]` with `----`	Element extraction
PlantUML Blocks	`[plantuml,name,format]` with `----`	Element extraction
Images	`image::path[alt]` (block) and `image:path[alt]` (inline)	Element extraction
Admonitions	`NOTE:`, `TIP:`, `WARNING:`, `CAUTION:`, `IMPORTANT:`	Element extraction
Cross-References	`[[anchor]]` and `xref:file.adoc[]`	Link capture

3.2. Recognized (Not Parsed in Detail)

Feature	Description	Treatment
Tables	`|===` block tables	Recognized as block, content not analyzed
Listing Blocks	`----` without `[source]`	Recognized as generic block
Sidebar Blocks	`****` blocks	Recognized as block
Example Blocks	`====` blocks	Recognized as block
Quote Blocks	`____` blocks	Recognized as block

3.3. Not Supported

ifeval::[] conditional evaluation - see Technical Debt
Inline formatting (bold, italic, mono)
Footnotes
Bibliography
Index entries
Complex table formatting (colspan, rowspan)
Passthrough blocks (++++)

4. Include Resolution

The AsciiDocParser supports recursive resolution of include::[] directives with complete source mapping.

4.1. Syntax

include::path/to/file.adoc[]
include::{includedir}/file.adoc[]
include::chapter.adoc[leveloffset=+1]
include::code.py[lines=5..10]

4.2. Attribute Substitution in Paths

Attributes are substituted before path resolution:

:includedir: chapters
:lang: en

include::{includedir}/{lang}/intro.adoc[]
// Resolves to: chapters/en/intro.adoc
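
The substitution step can be sketched in a few lines of Python. This is a minimal illustration, not the parser's actual code: the function name `substitute_attributes` is hypothetical, and the choice to turn unknown names into an empty string follows the rule stated under Attribute Resolution.

```python
import re

# Same expression as ATTRIBUTE_REF_PATTERN in the implementation notes.
ATTRIBUTE_REF = re.compile(r'\{([a-zA-Z0-9_-]+)\}')


def substitute_attributes(path: str, attributes: dict[str, str]) -> str:
    """Replace each {name} in an include path with its current value.

    Unknown names become an empty string here; a real parser would
    also log a warning at this point.
    """
    return ATTRIBUTE_REF.sub(lambda m: attributes.get(m.group(1), ""), path)


attrs = {"includedir": "chapters", "lang": "en"}
resolved = substitute_attributes("{includedir}/{lang}/intro.adoc", attrs)
print(resolved)  # chapters/en/intro.adoc
```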

4.3. Include Options

Option	Description	Support
`leveloffset=+n`	Increase section level by n	✓ Full
`leveloffset=-n`	Decrease section level by n	✓ Full
`lines=n..m`	Include only lines n to m	✓ Full
`tag=name`	Include only tagged region	✗ Not supported
`indent=n`	Add indentation	✗ Not supported
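
As a rough sketch of how the supported options might be applied, the helpers below parse the option string and handle a single `lines=n..m` range and a relative `leveloffset`. All three function names are hypothetical, and clamping the shifted level at 0 is an assumption that this specification does not state.

```python
def parse_include_options(raw: str) -> dict[str, str]:
    """Split the text between [ and ] into key/value pairs."""
    options: dict[str, str] = {}
    for part in raw.split(","):
        key, sep, value = part.partition("=")
        if sep:  # ignore fragments without "="
            options[key.strip()] = value.strip()
    return options


def apply_lines(text: str, spec: str) -> str:
    """Keep only lines n..m (1-based, inclusive) of the included text."""
    start, end = (int(n) for n in spec.split(".."))
    return "\n".join(text.splitlines()[start - 1:end])


def shift_level(level: int, offset: str) -> int:
    """Apply a relative leveloffset such as '+1' or '-1' to a section level."""
    return max(0, level + int(offset))  # assumption: never go below level 0


options = parse_include_options("leveloffset=+1, lines=2..3")
print(options)
print(apply_lines("a\nb\nc\nd", options["lines"]))
print(shift_level(0, options["leveloffset"]))
```

The sketch only covers one contiguous range per `lines` option; multiple ranges would need a different separator strategy.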

4.4. Source Mapping

For each element, the original source file and line number are captured:

@dataclass
class SourceLocation:
    """Position in source document."""
    file: Path              # Original file (not resolved include file)
    line: int               # 1-based line number in this file
    resolved_from: Path | None  # If included via include directive

4.5. Circular Includes

The parser detects and prevents circular include chains:

Scenario: Circular includes are detected
  Given File A contains "include::B.adoc[]"
  And File B contains "include::A.adoc[]"
  When the parser processes File A
  Then a CircularIncludeError is thrown
  And the include chain is specified in the error
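
One way to detect such a cycle is to carry the chain of files currently being resolved and check it before descending. The sketch below assumes an in-memory `files` mapping instead of disk access, and the `resolve` function and the exception's constructor signature are illustrative rather than the parser's real API.

```python
import re

INCLUDE = re.compile(r'^include::(.+?)\[(.*?)\]$')


class CircularIncludeError(Exception):
    """Raised when a file is included again while it is still being resolved."""

    def __init__(self, chain: list[str]):
        self.chain = chain
        super().__init__(" -> ".join(chain))


def resolve(name: str, files: dict[str, str], chain: tuple[str, ...] = ()) -> list[str]:
    """Return the lines of `name` with every include expanded in place.

    `files` maps file names to their text; it stands in for disk access.
    `chain` holds the files currently being resolved, outermost first.
    """
    if name in chain:
        raise CircularIncludeError([*chain, name])
    lines: list[str] = []
    for line in files[name].splitlines():
        match = INCLUDE.match(line)
        if match:
            lines.extend(resolve(match.group(1), files, (*chain, name)))
        else:
            lines.append(line)
    return lines


ok = resolve("main.adoc", {"main.adoc": "= Main\ninclude::c.adoc[]\nend", "c.adoc": "text"})
print(ok)  # ['= Main', 'text', 'end']
```

Because only the active chain is tracked, including the same file twice in sequence remains legal; only a file that includes itself directly or indirectly triggers the error.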

5. Document Attributes

5.1. Syntax

// Set attribute
:author: John Doe
:revdate: 2024-01-15
:imagesdir: ./images

// Unset attribute
:!draft:

// Attribute reference
The author is {author}.

5.2. Standard Attributes

Attribute	Description	Default Value
`doctype`	Document type (article, book, etc.)	`article`
`imagesdir`	Base path for images	`.` (current directory)
`includedir`	Base path for includes	`.` (current directory)
`leveloffset`	Global section level offset	`0`

5.3. jbake Attributes

For integration with jbake, the following attributes are extracted as metadata:

:jbake-title: My Document
:jbake-type: page_toc
:jbake-status: published
:jbake-menu: main
:jbake-order: 5
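
Extracting these attributes could look like the sketch below. The function name is hypothetical, and stripping the `jbake-` prefix from the keys is an assumption; the specification only says the attributes are extracted as metadata.

```python
def extract_jbake_metadata(attributes: dict[str, str]) -> dict[str, str]:
    """Collect attributes named jbake-* and strip the prefix from their keys."""
    prefix = "jbake-"
    return {
        name[len(prefix):]: value
        for name, value in attributes.items()
        if name.startswith(prefix)
    }


metadata = extract_jbake_metadata({
    "author": "John Doe",
    "jbake-title": "My Document",
    "jbake-type": "page_toc",
    "jbake-order": "5",
})
print(metadata)
```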

6. Extractable Elements

6.1. Source Blocks (Code)

[source,python]
.Optional Title
----
def hello():
    print("Hello, World!")
----

Table 1. Extracted Information
Attribute	Value
`type`	`code`
`language`	`python`
`title`	`Optional Title` (or empty)
`source_location`	Source file and line number
`content`	Raw content (without delimiter)
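
A line-based extraction of these fields might be sketched as follows. It returns plain dictionaries to stay self-contained, the function name `extract_source_blocks` is hypothetical, and it only handles a `[source]` line followed by an optional `.Title` line and a `----` delimited body.

```python
import re

SOURCE_BLOCK = re.compile(r'^\[source(?:,\s*(\w+))?\]$')
LISTING_DELIMITER = re.compile(r'^-{4,}$')


def extract_source_blocks(text: str) -> list[dict]:
    """Return one dict per [source] block with the fields from the table above."""
    lines = text.splitlines()
    blocks: list[dict] = []
    i = 0
    while i < len(lines):
        match = SOURCE_BLOCK.match(lines[i])
        if not match:
            i += 1
            continue
        start_line = i + 1                  # 1-based line of the [source] line
        language = match.group(1) or ""
        i += 1
        title = ""
        if i < len(lines) and lines[i].startswith(".") and not lines[i].startswith(".."):
            title = lines[i][1:]
            i += 1
        if i < len(lines) and LISTING_DELIMITER.match(lines[i]):
            delimiter = lines[i]
            i += 1
            content: list[str] = []
            while i < len(lines) and lines[i] != delimiter:
                content.append(lines[i])
                i += 1
            i += 1                          # skip the closing delimiter
            blocks.append({
                "type": "code",
                "language": language,
                "title": title,
                "line": start_line,
                "content": "\n".join(content),
            })
    return blocks


sample = '[source,python]\n.Optional Title\n----\ndef hello():\n    print("Hello, World!")\n----\n'
blocks = extract_source_blocks(sample)
print(blocks[0]["language"], blocks[0]["title"])
```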

6.2. PlantUML Blocks

[plantuml, diagram-name, svg]
----
@startuml
Alice -> Bob: Hello
@enduml
----

Table 2. Extracted Information
Attribute	Value
`type`	`plantuml`
`name`	`diagram-name`
`format`	`svg`
`source_location`	Source file and line number
`content`	PlantUML source code

6.3. Tables

.Table Title
[cols="1,2,3"]
|===
| Header 1 | Header 2 | Header 3

| Cell 1   | Cell 2   | Cell 3
|===

Table 3. Extracted Information
Attribute	Value
`type`	`table`
`title`	`Table Title` (or empty)
`columns`	Number of columns (from cols attribute or header)
`rows`	Number of data rows
`source_location`	Source file and line number

Table contents are not parsed in detail. Only structural metadata is captured.

6.4. Images

// Block image
image::path/to/image.png[Alt Text, 400, 300]

// With title
.Image Title
image::diagram.svg[Architecture Diagram]

Table 4. Extracted Information
Attribute	Value
`type`	`image`
`src`	`path/to/image.png` (with `imagesdir` resolved)
`alt`	`Alt Text`
`title`	`Image Title` (or empty)
`width`	`400` (or empty)
`height`	`300` (or empty)
`source_location`	Source file and line number

6.5. Admonitions

NOTE: This is a note.

WARNING: This is a warning.

[TIP]
====
This is a multi-line tip.
With multiple paragraphs.
====

Table 5. Extracted Information
Attribute	Value
`type`	`admonition`
`admonition_type`	`NOTE`, `TIP`, `WARNING`, `CAUTION`, or `IMPORTANT`
`source_location`	Source file and line number
`content`	Admonition content (raw text)

7. Cross-References

The parser captures all cross-references for later link validation.

7.1. Syntax

// Internal reference
<<section-anchor>>
<<section-anchor,Custom Text>>

// External reference (xref)
xref:other-file.adoc#anchor[Link Text]
xref:other-file.adoc[]

7.2. Captured Information

@dataclass
class CrossReference:
    """A captured cross-reference."""
    type: Literal["internal", "external"]
    target: str                 # Anchor or file#anchor
    text: str | None           # Optional link text
    source_location: SourceLocation
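
To illustrate how both reference forms might be collected from a line, the sketch below applies the two cross-reference patterns from the implementation notes and orders the hits by position. It returns plain dictionaries instead of `CrossReference` instances so that it runs on its own; the function name and the `line` key are hypothetical.

```python
import re

XREF_INTERNAL = re.compile(r'<<([^,>]+)(?:,([^>]+))?>>')
XREF_EXTERNAL = re.compile(r'xref:([^#\[]+)?(?:#([^\[]+))?\[([^\]]*)\]')


def capture_cross_references(line: str, line_no: int) -> list[dict]:
    """Return the references on one line, ordered by position in the line."""
    found: list[tuple[int, dict]] = []
    for m in XREF_INTERNAL.finditer(line):
        found.append((m.start(), {
            "type": "internal",
            "target": m.group(1).strip(),
            "text": m.group(2).strip() if m.group(2) else None,
            "line": line_no,
        }))
    for m in XREF_EXTERNAL.finditer(line):
        file_part, anchor = m.group(1) or "", m.group(2)
        target = f"{file_part}#{anchor}" if anchor else file_part
        found.append((m.start(), {
            "type": "external",
            "target": target,
            "text": m.group(3) or None,     # empty brackets mean no link text
            "line": line_no,
        }))
    return [ref for _, ref in sorted(found, key=lambda item: item[0])]


refs = capture_cross_references("See <<section-a>> and xref:other.adoc#anchor[Link].", 1)
print([(r["type"], r["target"]) for r in refs])
```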

8. Data Models

8.1. AsciidocDocument

@dataclass
class AsciidocDocument:
    """Represents a parsed AsciiDoc document."""
    file_path: Path
    title: str
    attributes: dict[str, str]
    sections: list[AsciidocSection]
    elements: list[AsciidocElement]
    cross_references: list[CrossReference]
    includes: list[IncludeInfo]         # All resolved includes

8.2. AsciidocSection

@dataclass
class AsciidocSection:
    """A section in the document."""
    title: str
    level: int                          # 0-5 (0 = document title)
    anchor: str | None                  # [[anchor]] if present
    source_location: SourceLocation
    path: str                           # Hierarchical path
    children: list[AsciidocSection]

8.3. AsciidocElement

@dataclass
class AsciidocElement:
    """An extractable element."""
    type: Literal["code", "plantuml", "mermaid", "ditaa", "table", "image", "admonition", "list"]
    source_location: SourceLocation
    attributes: dict[str, Any]          # Type-specific attributes
    parent_section: str                 # Path of containing section

8.4. IncludeInfo

@dataclass
class IncludeInfo:
    """Information about a resolved include."""
    source_location: SourceLocation     # Where the include is located
    target_path: Path                   # Resolved target path
    options: dict[str, str]             # leveloffset, lines, etc.

9. Parser Behavior

9.1. Attribute Resolution

Attributes are resolved during parsing:

Set standard attributes (doctype, imagesdir, etc.)
Parse and set header attributes
For each attribute reference {name}, insert the current value
Treat unknown attributes as empty string (with warning)
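
The four steps above can be sketched as a single pass over the lines. This is an illustration under assumptions: the function name `resolve_attributes` and the logger name are hypothetical, and the sketch substitutes references on every line, whereas a real parser would probably skip the bodies of code blocks.

```python
import logging
import re

logger = logging.getLogger("asciidoc_parser")  # hypothetical logger name

ATTRIBUTE = re.compile(r'^:([a-zA-Z0-9_-]+):\s*(.*)$')
ATTRIBUTE_UNSET = re.compile(r'^:!([a-zA-Z0-9_-]+):$')
ATTRIBUTE_REF = re.compile(r'\{([a-zA-Z0-9_-]+)\}')

DEFAULTS = {"doctype": "article", "imagesdir": ".", "includedir": ".", "leveloffset": "0"}


def resolve_attributes(text: str) -> tuple[dict[str, str], list[str]]:
    """Return the final attribute table and the lines with references resolved."""
    attributes = dict(DEFAULTS)             # step 1: standard attributes
    output: list[str] = []

    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in attributes:
            logger.warning("unknown attribute %r", name)    # step 4
            return ""
        return attributes[name]

    for line in text.splitlines():
        unset = ATTRIBUTE_UNSET.match(line)
        if unset:
            attributes.pop(unset.group(1), None)
            continue
        entry = ATTRIBUTE.match(line)
        if entry:                           # step 2: set the attribute
            attributes[entry.group(1)] = ATTRIBUTE_REF.sub(lookup, entry.group(2))
            continue
        output.append(ATTRIBUTE_REF.sub(lookup, line))      # step 3
    return attributes, output


attrs, lines = resolve_attributes(
    ":author: John Doe\n:project: MCP Server\n\n= {project} Documentation\n\nAuthor: {author}"
)
print(lines[1])  # = MCP Server Documentation
```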

9.2. Error Handling

Situation	Behavior
Include file not found	`IncludeNotFoundError` with path and source file
Circular include	`CircularIncludeError` with include chain
Invalid attribute syntax	Log warning, ignore line
Invalid UTF-8 encoding	`EncodingError` with file hint
Empty file	Return empty document (no sections)
File without sections	Entire content as implicit root section

9.3. Performance Requirements

Parsing a single file (without includes): < 50ms
Parsing a document with 50 include files: < 2s
Memory consumption: < 10KB per parsed file (without content)
Include depth: max. 20 levels (configurable)

10. Acceptance Criteria

10.1. AC-ADOC-01: Section Extraction

Scenario: Sections are correctly extracted
  Given an AsciiDoc file with the following content:
    """
    = Main Title

    == Chapter 1

    Text...

    == Chapter 2

    === Subchapter
    """
  When the parser processes the file
  Then 4 sections are extracted
  And the hierarchy is:
    | path                        | level |
    | /main-title                 | 0     |
    | /main-title/chapter-1       | 1     |
    | /main-title/chapter-2       | 1     |
    | /main-title/chapter-2/subchapter | 2 |
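
The hierarchy in this scenario could be computed with a stack of open ancestors, as in the sketch below. The slug rule (lowercase, alphanumeric runs joined by hyphens) is inferred from the paths in the table and is an assumption, as are the names `slugify` and `section_paths`. The sketch also ignores block context, so a title-like line inside a code block would be counted.

```python
import re

SECTION = re.compile(r'^(={1,6})\s+(.+?)(?:\s+=*)?$')


def slugify(title: str) -> str:
    """Lowercase the title and join alphanumeric runs with hyphens (assumed rule)."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def section_paths(text: str) -> list[tuple[str, int]]:
    """Return (path, level) for every section title, in document order."""
    stack: list[str] = []                   # slugs of the currently open ancestors
    result: list[tuple[str, int]] = []
    for line in text.splitlines():
        match = SECTION.match(line)
        if not match:
            continue
        level = len(match.group(1)) - 1     # "=" is level 0, "==" is level 1, ...
        del stack[level:]                   # close sections at this level or deeper
        stack.append(slugify(match.group(2)))
        result.append(("/" + "/".join(stack), level))
    return result


doc = "= Main Title\n\n== Chapter 1\n\nText...\n\n== Chapter 2\n\n=== Subchapter\n"
for path, level in section_paths(doc):
    print(level, path)
```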

10.2. AC-ADOC-02: Attribute Resolution

Scenario: Attributes are correctly resolved
  Given an AsciiDoc file with the following content:
    """
    :author: John Doe
    :project: MCP Server

    = {project} Documentation

    Author: {author}
    """
  When the parser processes the file
  Then the document title is "MCP Server Documentation"
  And attributes["author"] is "John Doe"

10.3. AC-ADOC-03: Include Resolution

Scenario: Includes are recursively resolved
  Given a main file "main.adoc":
    """
    = Main Document

    include::chapter.adoc[leveloffset=+1]
    """
  And an include file "chapter.adoc":
    """
    = Chapter

    Chapter content.
    """
  When the parser processes "main.adoc"
  Then the document contains 2 sections
  And the section "Chapter" has level 1 (due to leveloffset)
  And the source_location of "Chapter" points to "chapter.adoc"

10.4. AC-ADOC-04: Circular Include Detection

Scenario: Circular includes are detected
  Given a file "a.adoc" with "include::b.adoc[]"
  And a file "b.adoc" with "include::a.adoc[]"
  When the parser processes "a.adoc"
  Then a CircularIncludeError is thrown
  And the error message contains "a.adoc -> b.adoc -> a.adoc"

10.5. AC-ADOC-05: Source Block Extraction

Scenario: Source blocks are extracted
  Given an AsciiDoc file with a Python source block
  When the parser processes the file
  Then elements contains an entry of type "code"
  And its language equals "python"
  And source_location points to the correct file and line

10.6. AC-ADOC-06: PlantUML Extraction

Scenario: PlantUML blocks are extracted as their own type
  Given an AsciiDoc file with:
    """
    [plantuml, my-diagram, svg]
    ----
    @startuml
    A -> B
    @enduml
    ----
    """
  When the parser processes the file
  Then elements contains an entry of type "plantuml"
  And name equals "my-diagram"
  And format equals "svg"

10.7. AC-ADOC-07: Admonition Extraction

Scenario: Admonitions are extracted
  Given an AsciiDoc file with "WARNING: Important notice"
  When the parser processes the file
  Then elements contains an entry of type "admonition"
  And admonition_type equals "WARNING"

10.8. AC-ADOC-08: Cross-Reference Capture

Scenario: Cross-references are captured
  Given an AsciiDoc file with:
    """
    See <<section-a>> and xref:other.adoc#anchor[Link].
    """
  When the parser processes the file
  Then cross_references contains 2 entries
  And the first is type="internal", target="section-a"
  And the second is type="external", target="other.adoc#anchor"

10.9. AC-ADOC-09: Attribute Substitution in Include Paths

Scenario: Attributes in include paths are resolved
  Given an AsciiDoc file with:
    """
    :chaptersdir: chapters

    include::{chaptersdir}/intro.adoc[]
    """
  And a file "chapters/intro.adoc" exists
  When the parser processes the file
  Then "chapters/intro.adoc" is successfully included

11. Interfaces

11.1. Parser Interface

class AsciidocParser:
    """Parser for AsciiDoc documents."""

    def __init__(self, base_path: Path, max_include_depth: int = 20):
        """
        Initializes the parser.

        Args:
            base_path: Base path for relative include resolution
            max_include_depth: Maximum include depth
        """
        ...

    def parse_file(self, file_path: Path) -> AsciidocDocument:
        """
        Parses an AsciiDoc file with include resolution.

        Raises:
            FileNotFoundError: File does not exist
            CircularIncludeError: Circular include detected
            IncludeNotFoundError: Include file not found
        """
        ...

    def get_section(self, doc: AsciidocDocument, path: str) -> AsciidocSection | None:
        """Finds a section by its hierarchical path."""
        ...

    def get_elements(
        self,
        doc: AsciidocDocument,
        element_type: str | None = None
    ) -> list[AsciidocElement]:
        """Returns all elements, optionally filtered by type."""
        ...

    def validate_cross_references(
        self,
        doc: AsciidocDocument
    ) -> list[ValidationError]:
        """Checks all cross-references for validity."""
        ...

12. Implementation Notes

12.1. Regex Patterns

# Section (Level 0-5)
SECTION_PATTERN = r'^(={1,6})\s+(.+?)(?:\s+=*)?$'

# Document attribute
ATTRIBUTE_PATTERN = r'^:([a-zA-Z0-9_-]+):\s*(.*)$'

# Unset attribute
ATTRIBUTE_UNSET_PATTERN = r'^:!([a-zA-Z0-9_-]+):$'

# Attribute reference
ATTRIBUTE_REF_PATTERN = r'\{([a-zA-Z0-9_-]+)\}'

# Include directive
INCLUDE_PATTERN = r'^include::(.+?)\[(.*?)\]$'

# Source block start
SOURCE_BLOCK_PATTERN = r'^\[source(?:,\s*(\w+))?\]$'

# PlantUML block start
PLANTUML_PATTERN = r'^\[plantuml(?:,\s*([^,\]]+))?(?:,\s*(\w+))?\]$'

# Mermaid block start (Issue #122)
MERMAID_PATTERN = r'^\[mermaid(?:,\s*([^,\]]+))?(?:,\s*(\w+))?\]$'

# Ditaa block start (Issue #122)
DITAA_PATTERN = r'^\[ditaa(?:,\s*([^,\]]+))?(?:,\s*(\w+))?\]$'

# Block delimiter
BLOCK_DELIMITER = r'^(-{4,}|={4,}|\*{4,}|_{4,})$'

# Block image
BLOCK_IMAGE_PATTERN = r'^image::(.+?)\[(.*)\]$'

# Admonition (short form)
ADMONITION_SHORT_PATTERN = r'^(NOTE|TIP|WARNING|CAUTION|IMPORTANT):\s*(.+)$'

# Cross-reference (internal)
XREF_INTERNAL_PATTERN = r'<<([^,>]+)(?:,([^>]+))?>>'

# Cross-reference (external)
XREF_EXTERNAL_PATTERN = r'xref:([^#\[]+)?(?:#([^\[]+))?\[([^\]]*)\]'

# Anchor
ANCHOR_PATTERN = r'^\[\[([^\]]+)\]\]$'

# Table start/end
TABLE_DELIMITER = r'^\|===$'

12.2. State Machine for Parsing

12.3. Include Resolution Algorithm
