AsciiDoc Parser - Component Specification

1. Introduction

This specification defines the AsciiDocParser - a lightweight component for parsing AsciiDoc documents. The parser is intentionally tailored to the requirements of this project and is not a complete Asciidoctor-compatible parser.

1.1. Purpose

The AsciiDocParser serves to:

  1. Capture document structure: Extract sections and build hierarchical structure

  2. Resolve includes: Process include::[] directives recursively with source mapping

  3. Identify elements: Recognize code blocks, tables, images, admonitions, and PlantUML as addressable blocks

  4. Manage attributes: Set document attributes and resolve them in paths/content

  5. Capture cross-references: Collect [[anchor]] definitions and <<anchor>> / xref:[] references for link validation

  6. Map sources: Record the source file and line number of each element

1.2. Scope Limitations

The parser is not a complete Asciidoctor renderer. It:

  • Does not render HTML/PDF

  • Does not parse inline formatting (bold, italic, monospace)

  • Does not analyze table contents in detail

  • Does not process complex list structures

  • Does not support ifeval::[] conditional evaluation

2. Technical Debt

TD-ADOC-001: ifeval Conditional Not Supported

The ifeval::[] directive is not supported in this version. ifdef::attr[] / ifndef::attr[] / endif::[] are fully supported (Issue #14).

Priority: Low (not required for MVP)

3. Supported AsciiDoc Features

3.1. Fully Supported

Feature               Description                                            Usage
Sections              = Title to ====== Level 5                              Document structure
Document Header       Title and attributes before first content              Metadata
Document Attributes   :attribute: value                                      Configuration, metadata
Attribute References  {attribute} in text and paths                          Dynamic values
Include Directive     include::path[] with attribute substitution            Document composition
Source Blocks         [source,language] with ----                            Element extraction
PlantUML Blocks       [plantuml,name,format] with ----                       Element extraction
Images                image::path[alt] (block) and image:path[alt] (inline)  Element extraction
Admonitions           NOTE:, TIP:, WARNING:, CAUTION:, IMPORTANT:            Element extraction
Cross-References      [[anchor]], <<anchor>>, and xref:file.adoc[]           Link capture

3.2. Recognized (Not Parsed in Detail)

Feature         Description            Treatment
Tables          |=== block tables      Recognized as block, content not analyzed
Listing Blocks  ---- without [source]  Recognized as generic block
Sidebar Blocks  **** blocks            Recognized as block
Example Blocks  ==== blocks            Recognized as block
Quote Blocks    ____ blocks            Recognized as block

3.3. Not Supported

  • ifeval::[] conditional evaluation - see Technical Debt

  • Inline formatting (bold, italic, mono)

  • Footnotes

  • Bibliography

  • Index entries

  • Complex table formatting (colspan, rowspan)

  • Passthrough blocks (++++)

4. Include Resolution

The AsciiDocParser supports recursive resolution of include::[] directives with complete source mapping.

4.1. Syntax

include::path/to/file.adoc[]
include::{includedir}/file.adoc[]
include::chapter.adoc[leveloffset=+1]
include::code.py[lines=5..10]

4.2. Attribute Substitution in Paths

Attributes are substituted before path resolution:

:includedir: chapters
:lang: en

include::{includedir}/{lang}/intro.adoc[]
// Resolves to: chapters/en/intro.adoc
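
The substitution step above can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names `substitute_attributes` and `resolve_include_target` are hypothetical, and unknown attributes are assumed to resolve to an empty string as described in section 9.1.

```python
import re
from pathlib import PurePosixPath

# Same shape as ATTRIBUTE_REF_PATTERN in section 12.1.
ATTRIBUTE_REF = re.compile(r"\{([a-zA-Z0-9_-]+)\}")


def substitute_attributes(text: str, attributes: dict[str, str]) -> str:
    """Replace each {name} with its value; unknown names become ''."""
    return ATTRIBUTE_REF.sub(lambda m: attributes.get(m.group(1), ""), text)


def resolve_include_target(raw_target: str, attributes: dict[str, str]) -> PurePosixPath:
    """Substitute attributes first, then build the path (hypothetical helper)."""
    return PurePosixPath(substitute_attributes(raw_target, attributes))


attrs = {"includedir": "chapters", "lang": "en"}
resolved = resolve_include_target("{includedir}/{lang}/intro.adoc", attrs)
print(resolved)
```

Substituting before building the path keeps the path logic unaware of attribute syntax, which is one plausible reason for the ordering stated above.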

4.3. Include Options

Option          Description                  Support
leveloffset=+n  Increase section level by n  ✓ Full
leveloffset=-n  Decrease section level by n  ✓ Full
lines=n..m      Include only lines n to m    ✓ Full
tag=name        Include only tagged region   ✗ Not supported
indent=n        Add indentation              ✗ Not supported
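
A possible way to handle the three supported options is sketched below. The function names are illustrative assumptions, and the sketch only covers a single `lines=n..m` range, as listed in the table; multi-range values are out of scope here.

```python
def parse_include_options(raw: str) -> dict[str, str]:
    """Split the text between [ and ] into key/value pairs."""
    options: dict[str, str] = {}
    for part in raw.split(","):
        key, sep, value = part.partition("=")
        if sep:  # ignore fragments without '='
            options[key.strip()] = value.strip()
    return options


def select_lines(lines: list[str], spec: str) -> list[str]:
    """Apply lines=n..m, where n and m are 1-based and inclusive."""
    start, _, end = spec.partition("..")
    return lines[int(start) - 1 : int(end)]


def shift_level(level: int, offset: str) -> int:
    """Apply leveloffset=+n or leveloffset=-n to a section level."""
    return level + int(offset)


options = parse_include_options("leveloffset=+1")
excerpt = select_lines([f"line {n}" for n in range(1, 13)], "5..10")
```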

4.4. Source Mapping

For each element, the original source file and line number are captured:

@dataclass
class SourceLocation:
    """Position in source document."""
    file: Path              # Original file (not resolved include file)
    line: int               # 1-based line number in this file
    resolved_from: Path | None  # If included via include directive

4.5. Circular Includes

The parser detects and prevents circular include chains:

Scenario: Circular includes are detected
  Given File A contains "include::B.adoc[]"
  And File B contains "include::A.adoc[]"
  When the parser processes File A
  Then a CircularIncludeError is raised
  And the include chain is specified in the error
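
One way to detect such chains is to carry the stack of files currently being resolved and check each new target against it. The sketch below assumes a hypothetical `targets_of` callback that returns the include targets of a file, so that it runs without touching the file system; the real parser would read the files instead.

```python
from collections.abc import Callable
from pathlib import Path


class CircularIncludeError(Exception):
    """Raised when a file includes itself directly or transitively."""

    def __init__(self, chain: list[Path]) -> None:
        self.chain = chain
        super().__init__(" -> ".join(p.name for p in chain))


def walk_includes(
    path: Path,
    targets_of: Callable[[Path], list[Path]],
    stack: tuple[Path, ...] = (),
) -> list[Path]:
    """Return files in include order; raise if a file is already on the stack."""
    if path in stack:
        raise CircularIncludeError([*stack, path])
    visited = [path]
    for target in targets_of(path):
        visited.extend(walk_includes(target, targets_of, (*stack, path)))
    return visited


graph = {Path("a.adoc"): [Path("b.adoc")], Path("b.adoc"): [Path("a.adoc")]}
try:
    walk_includes(Path("a.adoc"), lambda p: graph.get(p, []))
except CircularIncludeError as err:
    message = str(err)
```

Checking the active stack rather than a global "seen" set means the same file may legitimately be included twice from different places without being reported as a cycle.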

5. Document Attributes

5.1. Syntax

// Set attribute
:author: John Doe
:revdate: 2024-01-15
:imagesdir: ./images

// Unset attribute
:!draft:

// Attribute reference
The author is {author}.
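
The set and unset forms above could be applied to an attribute dictionary roughly as follows. The patterns mirror those in section 12.1; `apply_attribute_line` is a hypothetical helper name.

```python
import re

ATTRIBUTE = re.compile(r"^:([a-zA-Z0-9_-]+):\s*(.*)$")
ATTRIBUTE_UNSET = re.compile(r"^:!([a-zA-Z0-9_-]+):$")


def apply_attribute_line(line: str, attributes: dict[str, str]) -> bool:
    """Apply a set/unset entry; return True if the line was an attribute entry."""
    if m := ATTRIBUTE_UNSET.match(line):
        attributes.pop(m.group(1), None)  # unsetting an absent name is a no-op
        return True
    if m := ATTRIBUTE.match(line):
        attributes[m.group(1)] = m.group(2).strip()
        return True
    return False


attrs = {"draft": ""}
for line in [":author: John Doe", ":imagesdir: ./images", ":!draft:", "The author is {author}."]:
    apply_attribute_line(line, attrs)
```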

5.2. Standard Attributes

Attribute    Description                          Default Value
doctype      Document type (article, book, etc.)  article
imagesdir    Base path for images                 . (current directory)
includedir   Base path for includes               . (current directory)
leveloffset  Global section level offset          0

5.3. jbake Attributes

For integration with jbake, the following attributes are extracted as metadata:

:jbake-title: My Document
:jbake-type: page_toc
:jbake-status: published
:jbake-menu: main
:jbake-order: 5
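
Extracting this metadata can be as simple as filtering the attribute dictionary by prefix. In the sketch below, stripping the `jbake-` prefix from the keys is an assumption made for readability; the specification does not say whether the prefix is kept.

```python
JBAKE_PREFIX = "jbake-"


def extract_jbake_metadata(attributes: dict[str, str]) -> dict[str, str]:
    """Collect jbake-* attributes, keyed without the prefix (assumed convention)."""
    return {
        name[len(JBAKE_PREFIX):]: value
        for name, value in attributes.items()
        if name.startswith(JBAKE_PREFIX)
    }


attributes = {
    "jbake-title": "My Document",
    "jbake-type": "page_toc",
    "jbake-order": "5",
    "author": "John Doe",
}
metadata = extract_jbake_metadata(attributes)
```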

6. Extractable Elements

6.1. Source Blocks (Code)

[source,python]
.Optional Title
----
def hello():
    print("Hello, World!")
----


Table 1. Extracted Information
Attribute        Value
type             code
language         python
title            Optional Title (or empty)
source_location  Source file and line number
content          Raw content (without delimiter)
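
A line-based extraction of these fields might look like the sketch below. It assumes the block title sits between the attribute list and the `----` delimiter, as in the example above, and it reports only a line number instead of a full SourceLocation; `extract_source_blocks` is a hypothetical name.

```python
import re

SOURCE_BLOCK = re.compile(r"^\[source(?:,\s*(\w+))?\]$")
LISTING_DELIMITER = re.compile(r"^-{4,}$")


def extract_source_blocks(lines: list[str]) -> list[dict]:
    """Collect [source] blocks with language, title, start line, and raw content."""
    blocks = []
    i = 0
    while i < len(lines):
        header = SOURCE_BLOCK.match(lines[i])
        if header is None:
            i += 1
            continue
        block = {"type": "code", "language": header.group(1) or "", "title": "", "line": i + 1}
        i += 1
        # Optional ".Title" line between the attribute list and the delimiter.
        if i < len(lines) and re.match(r"^\.[^\s.]", lines[i]):
            block["title"] = lines[i][1:].strip()
            i += 1
        if i < len(lines) and LISTING_DELIMITER.match(lines[i]):
            delimiter = lines[i]
            i += 1
            body = []
            while i < len(lines) and lines[i] != delimiter:
                body.append(lines[i])
                i += 1
            i += 1  # step past the closing delimiter
            block["content"] = "\n".join(body)
            blocks.append(block)
    return blocks


sample = [
    "[source,python]",
    ".Optional Title",
    "----",
    "def hello():",
    '    print("Hello, World!")',
    "----",
]
blocks = extract_source_blocks(sample)
```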

6.2. PlantUML Blocks

[plantuml, diagram-name, svg]
----
@startuml
Alice -> Bob: Hello
@enduml
----

Table 2. Extracted Information
Attribute        Value
type             plantuml
name             diagram-name
format           svg
source_location  Source file and line number
content          PlantUML source code

6.3. Tables

.Table Title
[cols="1,2,3"]
|===
| Header 1 | Header 2 | Header 3

| Cell 1   | Cell 2   | Cell 3
|===

Table 3. Extracted Information
Attribute        Value
type             table
title            Table Title (or empty)
columns          Number of columns (from cols attribute or header)
rows             Number of data rows
source_location  Source file and line number

Table contents are not parsed in detail. Only structural metadata is captured.
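
Deriving the column count without parsing cell contents could work as sketched below. The handling of a bare number (`cols="3"`) and of the `n*` repeat prefix follows common AsciiDoc usage and is an assumption here, as are both function names.

```python
import re


def columns_from_cols(cols: str) -> int:
    """Count columns in a cols value such as "1,2,3", "3*", or "3"."""
    if cols.strip().isdigit():  # a bare number gives the column count directly
        return int(cols)
    count = 0
    for spec in cols.split(","):
        repeat = re.match(r"\s*(\d+)\*", spec)  # "2*" stands for two columns
        count += int(repeat.group(1)) if repeat else 1
    return count


def columns_from_header(row: str) -> int:
    """Fallback: count cell separators in the first row, ignoring escaped pipes."""
    return len(re.findall(r"(?<!\\)\|", row))


from_cols = columns_from_cols("1,2,3")
from_header = columns_from_header("| Header 1 | Header 2 | Header 3")
```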

6.4. Images

// Block image
image::path/to/image.png[Alt Text, 400, 300]

// With title
.Image Title
image::diagram.svg[Architecture Diagram]

Table 4. Extracted Information
Attribute        Value
type             image
src              path/to/image.png (with {imagesdir} resolved)
alt              Alt Text
title            Image Title (or empty)
width            400 (or empty)
height           300 (or empty)
source_location  Source file and line number

6.5. Admonitions

NOTE: This is a note.

WARNING: This is a warning.

[TIP]
====
This is a multi-line tip.
With multiple paragraphs.
====

Table 5. Extracted Information
Attribute        Value
type             admonition
admonition_type  NOTE, TIP, WARNING, CAUTION, or IMPORTANT
source_location  Source file and line number
content          Admonition content (raw text)
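
Both admonition forms can be recognized line by line. In the sketch below, the short-form pattern comes from section 12.1, while `ADMONITION_BLOCK` and the exact `====` delimiter check are assumptions added for the block form.

```python
import re

ADMONITION_SHORT = re.compile(r"^(NOTE|TIP|WARNING|CAUTION|IMPORTANT):\s*(.+)$")
# Hypothetical pattern for the block form; not part of section 12.1.
ADMONITION_BLOCK = re.compile(r"^\[(NOTE|TIP|WARNING|CAUTION|IMPORTANT)\]$")


def extract_admonitions(lines: list[str]) -> list[dict]:
    """Collect short-form and [TYPE] + ==== block admonitions with raw content."""
    found = []
    i = 0
    while i < len(lines):
        short = ADMONITION_SHORT.match(lines[i])
        block = ADMONITION_BLOCK.match(lines[i])
        if short:
            found.append({"type": "admonition", "admonition_type": short.group(1),
                          "content": short.group(2), "line": i + 1})
        elif block and i + 1 < len(lines) and lines[i + 1] == "====":
            end = i + 2
            while end < len(lines) and lines[end] != "====":
                end += 1
            found.append({"type": "admonition", "admonition_type": block.group(1),
                          "content": "\n".join(lines[i + 2:end]), "line": i + 1})
            i = end  # continue after the closing delimiter
        i += 1
    return found


sample = [
    "NOTE: This is a note.", "",
    "WARNING: This is a warning.", "",
    "[TIP]", "====", "This is a multi-line tip.", "With multiple paragraphs.", "====",
]
admonitions = extract_admonitions(sample)
```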

7. Cross-References

The parser captures all cross-references for later link validation.

7.1. Syntax

// Internal reference
<<section-anchor>>
<<section-anchor,Custom Text>>

// External reference (xref)
xref:other-file.adoc#anchor[Link Text]
xref:other-file.adoc[]

7.2. Captured Information

@dataclass
class CrossReference:
    """A captured cross-reference."""
    type: Literal["internal", "external"]
    target: str                 # Anchor or file#anchor
    text: str | None           # Optional link text
    source_location: SourceLocation

8. Data Models

8.1. AsciidocDocument

@dataclass
class AsciidocDocument:
    """Represents a parsed AsciiDoc document."""
    file_path: Path
    title: str
    attributes: dict[str, str]
    sections: list[AsciidocSection]
    elements: list[AsciidocElement]
    cross_references: list[CrossReference]
    includes: list[IncludeInfo]         # All resolved includes

8.2. AsciidocSection

@dataclass
class AsciidocSection:
    """A section in the document."""
    title: str
    level: int                          # 0-5 (0 = document title)
    anchor: str | None                  # [[anchor]] if present
    source_location: SourceLocation
    path: str                           # Hierarchical path
    children: list[AsciidocSection]

8.3. AsciidocElement

@dataclass
class AsciidocElement:
    """An extractable element."""
    type: Literal["code", "plantuml", "mermaid", "ditaa", "table", "image", "admonition", "list"]
    source_location: SourceLocation
    attributes: dict[str, Any]          # Type-specific attributes
    parent_section: str                 # Path of containing section

8.4. IncludeInfo

@dataclass
class IncludeInfo:
    """Information about a resolved include."""
    source_location: SourceLocation     # Where the include is located
    target_path: Path                   # Resolved target path
    options: dict[str, str]             # leveloffset, lines, etc.

9. Parser Behavior

9.1. Attribute Resolution

Attributes are resolved during parsing:

  1. Set standard attributes (doctype, imagesdir, etc.)

  2. Parse and set header attributes

  3. For each attribute reference {name}, insert the current value

  4. Treat unknown attributes as empty string (with warning)
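
The four steps above can be sketched as follows. The default values are taken from section 5.2; the logger name and the function name are assumptions, and the sketch resolves one text fragment at a time rather than tracking attribute changes through the document.

```python
import logging
import re

logger = logging.getLogger("asciidoc_parser")  # hypothetical logger name
ATTRIBUTE_REF = re.compile(r"\{([a-zA-Z0-9_-]+)\}")

# Step 1: standard attributes and their defaults (section 5.2).
DEFAULTS = {"doctype": "article", "imagesdir": ".", "includedir": ".", "leveloffset": "0"}


def resolve_references(text: str, header_attributes: dict[str, str]) -> str:
    """Resolve {name} references; unknown names become '' and log a warning."""
    # Step 2: header attributes override the defaults.
    attributes = {**DEFAULTS, **header_attributes}

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in attributes:
            # Step 4: unknown attribute, warn and substitute nothing.
            logger.warning("Unknown attribute %r resolved to empty string", name)
            return ""
        return attributes[name]  # Step 3: insert the current value.

    return ATTRIBUTE_REF.sub(replace, text)


title = resolve_references("= {project} Documentation", {"project": "MCP Server"})
```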

9.2. Error Handling

Situation                 Behavior
Include file not found    IncludeNotFoundError with path and source file
Circular include          CircularIncludeError with include chain
Invalid attribute syntax  Log warning, ignore line
Invalid UTF-8 encoding    EncodingError with file hint
Empty file                Return empty document (no sections)
File without sections     Entire content as implicit root section
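
As an illustration of the encoding case, the decode step could wrap Python's UnicodeDecodeError in the parser's own exception. The message format and the `decode_source` helper are assumptions; only the exception name comes from the table.

```python
class EncodingError(Exception):
    """Raised when a source file is not valid UTF-8."""


def decode_source(data: bytes, file_name: str) -> str:
    """Decode file contents, naming the file and byte offset on failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise EncodingError(
            f"{file_name}: invalid UTF-8 at byte offset {exc.start}"
        ) from exc


text = decode_source("Grüße".encode("utf-8"), "intro.adoc")
```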

9.3. Performance Requirements

  • Parsing a single file (without includes): < 50ms

  • Parsing a document with 50 include files: < 2s

  • Memory consumption: < 10KB per parsed file (without content)

  • Include depth: max. 20 levels (configurable)

10. Acceptance Criteria

10.1. AC-ADOC-01: Section Extraction

Scenario: Sections are correctly extracted
  Given an AsciiDoc file with the following content:
    """
    = Main Title

    == Chapter 1

    Text...

    == Chapter 2

    === Subchapter
    """
  When the parser processes the file
  Then 4 sections are extracted
  And the hierarchy is:
    | path                        | level |
    | /main-title                 | 0     |
    | /main-title/chapter-1       | 1     |
    | /main-title/chapter-2       | 1     |
    | /main-title/chapter-2/subchapter | 2 |
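
The paths in this scenario can be derived with a stack of open ancestor sections. The slug rule below (lowercase, non-alphanumerics collapsed to hyphens) is an assumption that happens to match the expected paths; the sketch also ignores block context, so a heading-like line inside a listing block would be miscounted.

```python
import re

SECTION = re.compile(r"^(={1,6})\s+(.+?)(?:\s+=*)?$")


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def section_paths(lines: list[str]) -> list[tuple[str, int]]:
    """Return (path, level) for each section title, in document order."""
    stack: list[str] = []  # slugs of the currently open ancestor sections
    result = []
    for line in lines:
        match = SECTION.match(line)
        if match is None:
            continue
        level = len(match.group(1)) - 1  # "=" is level 0, "==" is level 1, ...
        del stack[level:]  # close sections at the same or a deeper level
        stack.append(slugify(match.group(2)))
        result.append(("/" + "/".join(stack), level))
    return result


content = ["= Main Title", "", "== Chapter 1", "", "Text...", "",
           "== Chapter 2", "", "=== Subchapter"]
paths = section_paths(content)
```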

10.2. AC-ADOC-02: Attribute Resolution

Scenario: Attributes are correctly resolved
  Given an AsciiDoc file with the following content:
    """
    :author: John Doe
    :project: MCP Server

    = {project} Documentation

    Author: {author}
    """
  When the parser processes the file
  Then the document title is "MCP Server Documentation"
  And attributes["author"] is "John Doe"

10.3. AC-ADOC-03: Include Resolution

Scenario: Includes are recursively resolved
  Given a main file "main.adoc":
    """
    = Main Document

    include::chapter.adoc[leveloffset=+1]
    """
  And an include file "chapter.adoc":
    """
    = Chapter

    Chapter content.
    """
  When the parser processes "main.adoc"
  Then the document contains 2 sections
  And the section "Chapter" has level 1 (due to leveloffset)
  And the source_location of "Chapter" points to "chapter.adoc"

10.4. AC-ADOC-04: Circular Include Detection

Scenario: Circular includes are detected
  Given a file "a.adoc" with "include::b.adoc[]"
  And a file "b.adoc" with "include::a.adoc[]"
  When the parser processes "a.adoc"
  Then a CircularIncludeError is raised
  And the error message contains "a.adoc -> b.adoc -> a.adoc"

10.5. AC-ADOC-05: Source Block Extraction

Scenario: Source blocks are extracted
  Given an AsciiDoc file with a Python source block
  When the parser processes the file
  Then elements contains an entry of type "code"
  And its language equals "python"
  And source_location points to the correct file and line

10.6. AC-ADOC-06: PlantUML Extraction

Scenario: PlantUML blocks are extracted as their own type
  Given an AsciiDoc file with:
    """
    [plantuml, my-diagram, svg]
    ----
    @startuml
    A -> B
    @enduml
    ----
    """
  When the parser processes the file
  Then elements contains an entry of type "plantuml"
  And name equals "my-diagram"
  And format equals "svg"

10.7. AC-ADOC-07: Admonition Extraction

Scenario: Admonitions are extracted
  Given an AsciiDoc file with "WARNING: Important notice"
  When the parser processes the file
  Then elements contains an entry of type "admonition"
  And admonition_type equals "WARNING"

10.8. AC-ADOC-08: Cross-Reference Capture

Scenario: Cross-references are captured
  Given an AsciiDoc file with:
    """
    See <<section-a>> and xref:other.adoc#anchor[Link].
    """
  When the parser processes the file
  Then cross_references contains 2 entries
  And the first is type="internal", target="section-a"
  And the second is type="external", target="other.adoc#anchor"

10.9. AC-ADOC-09: Attribute Substitution in Include Paths

Scenario: Attributes in include paths are resolved
  Given an AsciiDoc file with:
    """
    :chaptersdir: chapters

    include::{chaptersdir}/intro.adoc[]
    """
  And a file "chapters/intro.adoc" exists
  When the parser processes the file
  Then "chapters/intro.adoc" is successfully included

11. Interfaces

11.1. Parser Interface

class AsciidocParser:
    """Parser for AsciiDoc documents."""

    def __init__(self, base_path: Path, max_include_depth: int = 20):
        """
        Initializes the parser.

        Args:
            base_path: Base path for relative include resolution
            max_include_depth: Maximum include depth
        """
        ...

    def parse_file(self, file_path: Path) -> AsciidocDocument:
        """
        Parses an AsciiDoc file with include resolution.

        Raises:
            FileNotFoundError: File does not exist
            CircularIncludeError: Circular include detected
            IncludeNotFoundError: Include file not found
        """
        ...

    def get_section(self, doc: AsciidocDocument, path: str) -> AsciidocSection | None:
        """Finds a section by its hierarchical path."""
        ...

    def get_elements(
        self,
        doc: AsciidocDocument,
        element_type: str | None = None
    ) -> list[AsciidocElement]:
        """Returns all elements, optionally filtered by type."""
        ...

    def validate_cross_references(
        self,
        doc: AsciidocDocument
    ) -> list[ValidationError]:
        """Checks all cross-references for validity."""
        ...
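
A lookup such as get_section could be implemented as a recursive descent over the section tree. The sketch below uses a reduced `Section` record and a free function `find_section`, both hypothetical, and assumes that a child's path always extends its parent's path.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    """Reduced stand-in for AsciidocSection."""
    title: str
    path: str
    children: list[Section] = field(default_factory=list)


def find_section(sections: list[Section], path: str) -> Section | None:
    """Return the section with the given hierarchical path, or None."""
    for section in sections:
        if section.path == path:
            return section
        # Only descend into subtrees whose path is a prefix of the target.
        if path.startswith(section.path + "/"):
            found = find_section(section.children, path)
            if found is not None:
                return found
    return None


root = Section("Main Title", "/main-title", [
    Section("Chapter 1", "/main-title/chapter-1"),
    Section("Chapter 2", "/main-title/chapter-2", [
        Section("Subchapter", "/main-title/chapter-2/subchapter"),
    ]),
])
match = find_section([root], "/main-title/chapter-2/subchapter")
```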

12. Implementation Notes

12.1. Regex Patterns

# Section (Level 0-5)
SECTION_PATTERN = r'^(={1,6})\s+(.+?)(?:\s+=*)?$'

# Document attribute
ATTRIBUTE_PATTERN = r'^:([a-zA-Z0-9_-]+):\s*(.*)$'

# Unset attribute
ATTRIBUTE_UNSET_PATTERN = r'^:!([a-zA-Z0-9_-]+):$'

# Attribute reference
ATTRIBUTE_REF_PATTERN = r'\{([a-zA-Z0-9_-]+)\}'

# Include directive
INCLUDE_PATTERN = r'^include::(.+?)\[(.*?)\]$'

# Source block start
SOURCE_BLOCK_PATTERN = r'^\[source(?:,\s*(\w+))?\]$'

# PlantUML block start
PLANTUML_PATTERN = r'^\[plantuml(?:,\s*([^,\]]+))?(?:,\s*(\w+))?\]$'

# Mermaid block start (Issue #122)
MERMAID_PATTERN = r'^\[mermaid(?:,\s*([^,\]]+))?(?:,\s*(\w+))?\]$'

# Ditaa block start (Issue #122)
DITAA_PATTERN = r'^\[ditaa(?:,\s*([^,\]]+))?(?:,\s*(\w+))?\]$'

# Block delimiter
BLOCK_DELIMITER = r'^(-{4,}|={4,}|\*{4,}|_{4,})$'

# Block image
BLOCK_IMAGE_PATTERN = r'^image::(.+?)\[(.*)\]$'

# Admonition (short form)
ADMONITION_SHORT_PATTERN = r'^(NOTE|TIP|WARNING|CAUTION|IMPORTANT):\s*(.+)$'

# Cross-reference (internal)
XREF_INTERNAL_PATTERN = r'<<([^,>]+)(?:,([^>]+))?>>'

# Cross-reference (external)
XREF_EXTERNAL_PATTERN = r'xref:([^#\[]+)?(?:#([^\[]+))?\[([^\]]*)\]'

# Anchor
ANCHOR_PATTERN = r'^\[\[([^\]]+)\]\]$'

# Table start/end
TABLE_DELIMITER = r'^\|===$'

12.2. State Machine for Parsing

[Diagram: AsciiDoc parser states]

12.3. Include Resolution Algorithm

[Diagram: include resolution]