Markdown Parser - Component Specification

1. Introduction

This specification defines the MarkdownParser - a lightweight component for parsing GitHub Flavored Markdown (GFM) documents. The parser is intentionally tailored to the requirements of this project and is not a complete GFM parser.

1.1. Purpose

The MarkdownParser serves to:

Capture document structure: Extract headings and build hierarchical structure
Identify elements: Recognize code blocks, tables, and images as addressable blocks
Read metadata: Parse YAML frontmatter
Source file mapping: Capture line numbers for each element

1.2. Scope Limitations

The parser is not a complete Markdown renderer. It:

Does not render HTML
Does not parse inline formatting (bold, italic, inline links)
Does not analyze table contents
Does not process nested lists in detail

2. Supported GFM Features

2.1. Fully Supported

Feature	Description	Usage
ATX Headings	`# H1` to `###### H6`	Document structure
Fenced Code Blocks	```language … ```	Element extraction
YAML Frontmatter	`---` block at file start	Metadata
Images	`![alt](url)` and `![alt](url "title")`	Element extraction

2.2. Recognized (Not Parsed in Detail)

Feature	Description	Treatment
Tables	GFM pipe tables	Recognized as block, content not analyzed
Task Lists	`- [ ]` and `- [x]`	Recognized as list
Blockquotes	`> quoted text`	Recognized as block

2.3. Not Supported

Setext Headings (underlined with === or ---)
Inline HTML
Footnotes
Definition Lists
Math blocks (LaTeX)

3. Document Structure via Folder Hierarchy

Unlike the AsciiDoc parser, the MarkdownParser uses no include directives. Instead, the folder structure represents the document hierarchy.

3.1. Structure Rules

docs/
├── index.md              # Document introduction (optional)
├── 01_introduction/
│   ├── index.md          # Chapter introduction (optional)
│   ├── 01_overview.md
│   └── 02_goals.md
├── 02_architecture/
│   ├── 01_constraints.md
│   └── 02_decisions.md
└── 03_appendix.md        # Single file as chapter

3.2. Sorting Rules

The order of files and folders is determined by name, using the following rules:

Numeric prefixes are compared by numeric value, not character by character: 01_, 02_, … 10_, 11_ (so 2_ sorts before 10_)
index.md or README.md always appears first in a folder
Folders and files at the same level are sorted together

Table 1. Sorting Example
File System	Sorted Order
02_details.md	README.md (1st, special rule)
README.md	01_intro.md (2nd, numeric)
01_intro.md	02_details.md (3rd, numeric)
10_appendix.md	10_appendix.md (4th, numeric)
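
The sorting rules can be expressed as a sort key. The following is a minimal sketch, not the project's implementation: `name_key` and `path_key` are hypothetical helpers, matching `index.md` and `README.md` case-insensitively is an assumption, and so is placing names without a numeric prefix after prefixed ones. The `order` frontmatter field is not considered here.

```python
import re
from pathlib import PurePosixPath

# Names that always sort first within their folder (section 3.2).
INDEX_NAMES = {"index.md", "readme.md"}

# Leading digits followed by an underscore, e.g. "01_" or "10_".
_PREFIX = re.compile(r"^(\d+)_")


def name_key(name: str) -> tuple:
    """Sort key for a single file or folder name (hypothetical helper)."""
    if name.lower() in INDEX_NAMES:
        return (0, 0, "")
    match = _PREFIX.match(name)
    if match:
        # Compare the prefix by numeric value so that 2_ sorts before 10_.
        return (1, int(match.group(1)), name[match.end():].lower())
    # Assumption: names without a numeric prefix sort after prefixed ones.
    return (2, 0, name.lower())


def path_key(relative_path: str) -> tuple:
    """Sort key for a path relative to the docs root, compared level by level."""
    return tuple(name_key(part) for part in PurePosixPath(relative_path).parts)


files = ["10_z.md", "2_b.md", "1_a.md", "README.md"]
print(sorted(files, key=name_key))
```

Because `path_key` compares one path component at a time, folders and files at the same level are sorted together, and every file in a folder stays grouped under that folder.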

3.3. Hierarchy Mapping

File System	Hierarchical Path	Heading Level
`docs/index.md` → `# Title`	`/title`	H1
`docs/01_intro/index.md` → `# Intro`	`/intro`	H1 (Chapter)
`docs/01_intro/01_overview.md` → `# Overview`	`/intro/overview`	H1 (becomes H2 in context)
`docs/01_intro/01_overview.md` → `## Details`	`/intro/overview/details`	H2 (becomes H3 in context)

4. YAML Frontmatter

The parser supports YAML frontmatter at the file start for metadata.

4.1. Format

---
title: Document Title
author: John Doe
date: 2024-01-15
tags: [architecture, design]
custom_field: arbitrary value
---

# Heading

Content...
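
Before the YAML text is handed to a YAML library, the frontmatter block has to be separated from the body without losing track of source line numbers. The sketch below uses only the standard library. `split_frontmatter` is a hypothetical helper, and requiring the opening `---` on the very first line is an assumption.

```python
import re

# Assumption: frontmatter must start on the very first line of the file.
FRONTMATTER = re.compile(r"\A---[ \t]*\n(.*?)\n---[ \t]*(?:\n|\Z)", re.DOTALL)


def split_frontmatter(text: str) -> tuple[str, str, int]:
    """Return (yaml_text, body, body_start_line); hypothetical helper.

    body_start_line is the 1-based line number of the first body line, so
    that line numbers reported for headings still refer to the source file.
    """
    match = FRONTMATTER.match(text)
    if match is None:
        return "", text, 1
    consumed = text[: match.end()]
    return match.group(1), text[match.end():], consumed.count("\n") + 1


source = "---\ntitle: Document Title\ndraft: true\n---\n\n# Heading\n"
yaml_text, body, first_line = split_frontmatter(source)
print(repr(yaml_text))
print(first_line)
```

The returned `yaml_text` could then be passed to `yaml.safe_load` (PyYAML) or to the equivalent loader in ruamel.yaml.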

4.2. Supported Data Types

String: title: "My Document"
Number: order: 42
Boolean: draft: true
Date: date: 2024-01-15
List: tags: [a, b, c]
Nested Objects: author: { name: "John", email: "john@example.com" }

4.3. Reserved Fields

Field	Description	Default Value
`title`	Document title (overrides H1)	First H1 or filename
`order`	Explicit sort order	Alphabetical by filename
`draft`	Document is a draft (will be ignored)	`false`
`exclude`	Exclude document from index	`false`
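
One way to apply these defaults after the frontmatter has been loaded into a dictionary is sketched below. `resolve_reserved` is a hypothetical helper, and using the file stem (the filename without its extension) as the last-resort title is an assumption.

```python
from pathlib import Path
from typing import Any, Optional


def resolve_reserved(
    frontmatter: dict[str, Any],
    first_h1: Optional[str],
    file_path: Path,
) -> dict[str, Any]:
    """Apply the reserved-field defaults (hypothetical helper)."""
    # Title precedence: frontmatter, then the first H1, then the filename.
    title = frontmatter.get("title") or first_h1 or file_path.stem
    return {
        "title": title,
        # None means "no explicit order": fall back to filename sorting.
        "order": frontmatter.get("order"),
        "draft": bool(frontmatter.get("draft", False)),
        "exclude": bool(frontmatter.get("exclude", False)),
    }


meta = resolve_reserved({"draft": True}, first_h1="Overview", file_path=Path("01_overview.md"))
print(meta["title"], meta["draft"], meta["exclude"])
```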

5. Extractable Elements

The parser identifies the following elements as standalone, addressable blocks:

5.1. Code Blocks

```python
def hello():
    print("Hello, World!")
```

Table 2. Extracted Information
Attribute	Value
`type`	`code`
`language`	`python`
`start_line`	Line of opening ```
`end_line`	Line of closing ```
`content`	Raw content (without fence)
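
The attributes above can be collected with a small line-by-line state machine. This is a minimal sketch under stated assumptions: `extract_code_blocks` is a hypothetical helper, indentation before a fence is ignored, and a block that is never closed is dropped.

```python
import re

# Opening fence: three or more backticks or tildes, then an optional language.
FENCE_OPEN = re.compile(r"^(`{3,}|~{3,})\s*(\w*)\s*$")


def extract_code_blocks(text: str) -> list[dict]:
    """Collect fenced code blocks with 1-based line numbers (sketch)."""
    blocks = []
    fence = None    # opening fence string while inside a block, else None
    current = None  # block being collected
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if fence is None:
            match = FENCE_OPEN.match(stripped)
            if match:
                fence = match.group(1)
                current = {"type": "code", "language": match.group(2),
                           "start_line": number, "lines": []}
        elif stripped.startswith(fence) and set(stripped) == {fence[0]}:
            # Closing fence: same character, at least as long as the opening.
            current["end_line"] = number
            current["content"] = "\n".join(current.pop("lines"))
            blocks.append(current)
            fence = current = None
        else:
            current["lines"].append(line)
    # Assumption: a block that is never closed is dropped.
    return blocks


fence_marker = "`" * 3  # built at runtime to keep this example readable
sample = "\n".join([
    "# Title", "", fence_marker + "python",
    "def hello():", '    print("Hello, World!")', fence_marker,
])
block = extract_code_blocks(sample)[0]
print(block["language"], block["start_line"], block["end_line"])
```

Tracking the opening fence string lets a block opened with four backticks contain a line of three backticks without being closed early.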

5.2. Tables

| Header 1 | Header 2 |
|----------|----------|
| Cell 1   | Cell 2   |

Table 3. Extracted Information
Attribute	Value
`type`	`table`
`start_line`	First line of table
`end_line`	Last line of table
`columns`	Number of columns (from header)
`rows`	Number of data rows

Table contents are not parsed in detail. Only structural metadata is captured.
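
Structural recognition of a pipe table might look like the sketch below. `find_tables` and `split_cells` are hypothetical helpers. The sketch requires leading and trailing pipes on every row (as `TABLE_ROW_PATTERN` in section 10.2 does), does not handle escaped pipes inside cells, and does not skip tables that appear inside fenced code blocks.

```python
import re

TABLE_ROW = re.compile(r"^\|(.+)\|$")
# One cell of the delimiter row, e.g. "---", ":---" or ":---:".
DELIMITER_CELL = re.compile(r"^:?-+:?$")


def split_cells(line: str) -> list[str]:
    """Split a table row into cell texts; escaped pipes are not handled."""
    return [cell.strip() for cell in TABLE_ROW.match(line.strip()).group(1).split("|")]


def find_tables(text: str) -> list[dict]:
    """Return structural metadata for each pipe table (sketch)."""
    lines = text.splitlines()
    tables = []
    i = 0
    while i < len(lines) - 1:
        header, delimiter = lines[i].strip(), lines[i + 1].strip()
        is_table = (
            TABLE_ROW.match(header)
            and TABLE_ROW.match(delimiter)
            and all(DELIMITER_CELL.match(c) for c in split_cells(delimiter))
        )
        if not is_table:
            i += 1
            continue
        end = i + 1  # index of the last line that belongs to the table
        while end + 1 < len(lines) and TABLE_ROW.match(lines[end + 1].strip()):
            end += 1
        tables.append({
            "type": "table",
            "start_line": i + 1,  # converted to 1-based line numbers
            "end_line": end + 1,
            "columns": len(split_cells(header)),
            "rows": end - (i + 1),  # lines after the delimiter row
        })
        i = end + 1
    return tables


sample = "| Header 1 | Header 2 |\n|----------|----------|\n| Cell 1   | Cell 2   |\n"
print(find_tables(sample))
```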

5.3. Images

![Alt Text](path/to/image.png "Optional Title")

Table 4. Extracted Information
Attribute	Value
`type`	`image`
`alt`	`Alt Text`
`src`	`path/to/image.png`
`title`	`Optional Title` (or empty)
`line`	Line number
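
As an illustration, the attributes above can be read with the `IMAGE_PATTERN` from section 10.2 applied line by line. `extract_images` is a hypothetical helper, and the sketch does not skip images that appear inside fenced code blocks.

```python
import re

# Same pattern as IMAGE_PATTERN in section 10.2.
IMAGE_PATTERN = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)(?:\s+"([^"]*)")?\)')


def extract_images(text: str) -> list[dict]:
    """Return one entry per image reference, with its 1-based line number."""
    images = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in IMAGE_PATTERN.finditer(line):
            alt, src, title = match.groups()
            images.append({"type": "image", "alt": alt, "src": src,
                           "title": title or "", "line": number})
    return images


sample = 'Intro\n\n![Alt Text](path/to/image.png "Optional Title")\n![Logo](logo.svg)\n'
for image in extract_images(sample):
    print(image)
```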

6. Data Models

6.1. MarkdownDocument

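from __future__ import annotations  # allows references to classes defined later

from dataclasses import dataclass
from pathlib import Path
from typing import Any, Literal

@dataclass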
class MarkdownDocument:
    """Represents a parsed Markdown document."""
    file_path: Path
    frontmatter: dict[str, Any]
    title: str
    sections: list[MarkdownSection]
    elements: list[MarkdownElement]

6.2. MarkdownSection

@dataclass
class MarkdownSection:
    """A section (heading) in the document."""
    title: str
    level: int                    # 1-6
    start_line: int
    end_line: int
    path: str                     # Hierarchical path
    children: list[MarkdownSection]

6.3. MarkdownElement

@dataclass
class MarkdownElement:
    """An extractable element (code, table, image)."""
    type: Literal["code", "table", "image"]
    start_line: int
    end_line: int
    attributes: dict[str, Any]    # Type-specific attributes
    parent_section: str           # Path of containing section

6.4. FolderDocument

@dataclass
class FolderDocument:
    """A document from multiple Markdown files."""
    root_path: Path
    documents: list[MarkdownDocument]  # Sorted
    structure: DocumentStructure       # Combined hierarchy

7. Parser Behavior

7.1. Error Handling

Situation	Behavior
Invalid frontmatter	Log warning, treat frontmatter as empty
File not readable	Throw exception, skip file
Invalid UTF-8 encoding	Throw exception with encoding hint
Empty file	Return empty document (no sections)
File without headings	Entire content as implicit root section
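
The file-level rows of this table could be handled as sketched below. The interpretation is an assumption: reading a single file raises, while reading a folder logs unreadable files and skips them. `read_markdown` and `read_folder` are hypothetical helpers, and raising `ValueError` for the encoding error is likewise an assumption.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def read_markdown(file_path: Path) -> str:
    """Read a file as UTF-8, adding an encoding hint on failure (sketch)."""
    try:
        return file_path.read_text(encoding="utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"{file_path}: not valid UTF-8 at byte {exc.start}; "
            "re-save the file as UTF-8"
        ) from exc


def read_folder(paths: list[Path]) -> dict[Path, str]:
    """Read several files; unreadable files are logged and skipped."""
    contents = {}
    for path in paths:
        try:
            contents[path] = read_markdown(path)
        except OSError as exc:
            # Covers missing files and permission errors.
            logger.warning("Skipping %s: %s", path, exc)
    return contents
```

In this sketch an encoding error is not swallowed by `read_folder`, so it reaches the caller together with the hint.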

7.2. Performance Requirements

Parsing a single file: < 50 ms
Parsing a folder with 100 files: < 2 s
Memory consumption: < 10 KB per parsed file (without content)

8. Acceptance Criteria

8.1. AC-MD-01: Heading Extraction

Scenario: Headings are correctly extracted
  Given a Markdown file with the following content:
    """
    # Main Title

    ## Subchapter 1

    Text...

    ## Subchapter 2

    ### Sub-subchapter
    """
  When the parser processes the file
  Then 4 sections are extracted
  And the hierarchy is:
    | path                              | level |
    | /main-title                       | 1     |
    | /main-title/subchapter-1          | 2     |
    | /main-title/subchapter-2          | 2     |
    | /main-title/subchapter-2/sub-subchapter | 3 |
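
A sketch of how the paths in this scenario could be derived with a stack of open headings follows. `slugify` and `extract_sections` are hypothetical helpers. The slug scheme (lowercase ASCII letters and digits joined by hyphens) is an assumption inferred from the expected paths, and the sketch does not skip `#` lines inside fenced code blocks.

```python
import re

# Same pattern as HEADING_PATTERN in section 10.2.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)(?:\s+#*)?$")


def slugify(title: str) -> str:
    """Lowercase and join alphanumeric runs with hyphens (assumed scheme)."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def extract_sections(text: str) -> list[tuple[str, int, int]]:
    """Return (path, level, start_line) for each ATX heading."""
    sections = []
    stack = []  # (level, path) of the currently open ancestor headings
    for number, line in enumerate(text.splitlines(), start=1):
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1] if stack else ""
        path = f"{parent}/{slugify(match.group(2))}"
        stack.append((level, path))
        sections.append((path, level, number))
    return sections


sample = "# Main Title\n\n## Subchapter 1\n\nText...\n\n## Subchapter 2\n\n### Sub-subchapter\n"
sections = extract_sections(sample)
for path, level, line in sections:
    print(path, level, line)
```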

8.2. AC-MD-02: Frontmatter Parsing

Scenario: YAML frontmatter is correctly parsed
  Given a Markdown file with the following content:
    """
    ---
    title: My Document
    author: John Doe
    tags: [design, architecture]
    ---

    # Content
    """
  When the parser processes the file
  Then frontmatter["title"] equals "My Document"
  And frontmatter["author"] equals "John Doe"
  And frontmatter["tags"] is a list with 2 elements

8.3. AC-MD-03: Code Block Extraction

Scenario: Fenced code blocks are extracted
  Given a Markdown file with a Python code block
  When the parser processes the file
  Then elements contains an entry of type "code"
  And its language equals "python"
  And start_line and end_line are correctly set

8.4. AC-MD-04: Table Recognition

Scenario: GFM tables are recognized as blocks
  Given a Markdown file with a 3-column table
  When the parser processes the file
  Then elements contains an entry of type "table"
  And columns equals 3

8.5. AC-MD-05: Folder Structure

Scenario: Folder hierarchy is correctly mapped
  Given a folder with the following structure:
    | Path                    |
    | index.md                |
    | 01_intro/index.md       |
    | 01_intro/01_details.md  |
    | 02_chapter.md           |
  When the parser processes the folder
  Then the document order is:
    | index.md                |
    | 01_intro/index.md       |
    | 01_intro/01_details.md  |
    | 02_chapter.md           |

8.6. AC-MD-06: Sorting with Prefixes

Scenario: Numeric prefixes are correctly sorted
  Given a folder with files: 10_z.md, 2_b.md, 1_a.md, README.md
  When the parser processes the folder
  Then the order is: README.md, 1_a.md, 2_b.md, 10_z.md

9. Interfaces

9.1. Parser Interface

class MarkdownParser:
    """Parser for GitHub Flavored Markdown documents."""

    def parse_file(self, file_path: Path) -> MarkdownDocument:
        """Parses a single Markdown file."""
        ...

    def parse_folder(self, folder_path: Path) -> FolderDocument:
        """Parses a folder with Markdown files."""
        ...

    def get_section(self, doc: MarkdownDocument, path: str) -> MarkdownSection | None:
        """Finds a section by its hierarchical path."""
        ...

    def get_elements(
        self,
        doc: MarkdownDocument,
        element_type: str | None = None
    ) -> list[MarkdownElement]:
        """Returns all elements, optionally filtered by type."""
        ...
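
The path lookup behind `get_section` could be implemented as a depth-first search over the section tree. The sketch below is an assumption about one possible approach, not the specified implementation. `Section` is a reduced stand-in for `MarkdownSection`, and `find_section` is a hypothetical free function.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Section:
    """Reduced stand-in for MarkdownSection with only the fields used here."""
    title: str
    path: str
    children: list["Section"] = field(default_factory=list)


def find_section(sections: list[Section], path: str) -> Optional[Section]:
    """Depth-first search for a section whose path matches exactly."""
    for section in sections:
        if section.path == path:
            return section
        # Only descend into branches that can contain the target path.
        if path.startswith(section.path + "/"):
            found = find_section(section.children, path)
            if found is not None:
                return found
    return None


tree = [Section("Intro", "/intro", [Section("Overview", "/intro/overview")])]
print(find_section(tree, "/intro/overview").title)
print(find_section(tree, "/missing"))
```

Returning `None` for an unknown path mirrors the `MarkdownSection | None` return type of the interface.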

10. Implementation Notes

10.1. Recommended Libraries

PyYAML or ruamel.yaml: For frontmatter parsing
regex (instead of re): For better Unicode support

10.2. Regex Patterns

# ATX Heading
HEADING_PATTERN = r'^(#{1,6})\s+(.+?)(?:\s+#*)?$'

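# Fenced Code Block (opening); group 2 is the language, or empty if none is given
CODE_FENCE_OPEN = r'^(`{3,}|~{3,})(\w*)$'

# YAML Frontmatter (compile with re.DOTALL so that "." also matches newlines)
FRONTMATTER_PATTERN = r'^---\s*\n(.*?)\n---\s*\n'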

# Image
IMAGE_PATTERN = r'!\[([^\]]*)\]\(([^)\s]+)(?:\s+"([^"]*)")?\)'

# Table Row
TABLE_ROW_PATTERN = r'^\|(.+)\|$'

10.3. State Machine for Parsing
