# `PtcRunner.Chunker`
[🔗](https://github.com/andreasronge/ptc_runner/blob/main/lib/ptc_runner/chunker.ex#L1)

Text chunking utilities for RLM preprocessing.

Splits text into chunks by lines, characters, or approximate tokens.
Moving chunking out of LLM-generated code and into this module avoids
typos in generated chunking logic and enables proper tokenization.

## Examples

    iex> PtcRunner.Chunker.by_lines("a\nb\nc\nd", 2)
    ["a\nb", "c\nd"]

    iex> PtcRunner.Chunker.by_chars("hello world", 5)
    ["hello", " worl", "d"]

    iex> PtcRunner.Chunker.by_tokens("hello world test", 2)
    ["hello wo", "rld test"]

## Options

All functions accept these options:

- `:overlap` - sliding window overlap (default: 0)
- `:metadata` - return `chunk_with_metadata()` maps with the keys `:text`, `:index`, `:lines`, `:chars`, and `:tokens` instead of plain strings (default: false)

`by_tokens/3` also accepts:

- `:tokenizer` - `:simple` (4 chars/token), `:cl100k` (listed in the `tokens_opt()` type), or a custom function that takes a string and returns a token count (default: `:simple`)

# `chars_opt`

```elixir
@type chars_opt() :: {:overlap, non_neg_integer()} | {:metadata, boolean()}
```

# `chunk`

```elixir
@type chunk() :: String.t()
```

# `chunk_with_metadata`

```elixir
@type chunk_with_metadata() :: %{
  text: String.t(),
  index: non_neg_integer(),
  lines: non_neg_integer(),
  chars: non_neg_integer(),
  tokens: non_neg_integer()
}
```
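
To make the shape of this map concrete, here is a minimal sketch that annotates a list of plain chunks. `MetadataSketch` is a hypothetical module, not part of `PtcRunner.Chunker`, and the counting rules (newline-separated lines, graphemes for `:chars`, 4 characters per token rounded up) are assumptions; the library may count differently.

```elixir
defmodule MetadataSketch do
  # Hypothetical helper, not part of PtcRunner.Chunker.
  # Assumptions: :lines counts newline-separated lines, :chars counts
  # graphemes, and :tokens applies a 4-characters-per-token estimate
  # rounded up.
  def annotate(chunks) do
    chunks
    |> Enum.with_index()
    |> Enum.map(fn {text, index} ->
      chars = String.length(text)

      %{
        text: text,
        index: index,
        lines: length(String.split(text, "\n")),
        chars: chars,
        tokens: div(chars + 3, 4)
      }
    end)
  end
end

annotated = MetadataSketch.annotate(["a\nb", "c\nd"])
# The first element is %{text: "a\nb", index: 0, lines: 2, chars: 3, tokens: 1}
```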

# `lines_opt`

```elixir
@type lines_opt() :: {:overlap, non_neg_integer()} | {:metadata, boolean()}
```

# `result`

```elixir
@type result() :: [chunk()] | [chunk_with_metadata()]
```

# `tokens_opt`

```elixir
@type tokens_opt() ::
  {:overlap, non_neg_integer()}
  | {:metadata, boolean()}
  | {:tokenizer, :simple | :cl100k | (String.t() -> non_neg_integer())}
```

# `by_chars`

```elixir
@spec by_chars(String.t() | nil, pos_integer(), [chars_opt()]) :: result()
```

Splits text into chunks by character count.

Uses `String.graphemes/1` for unicode-safe splitting.

## Examples

    iex> PtcRunner.Chunker.by_chars("hello world", 5)
    ["hello", " worl", "d"]

    iex> PtcRunner.Chunker.by_chars("abcdef", 3, overlap: 1)
    ["abc", "cde", "ef"]

    iex> PtcRunner.Chunker.by_chars(nil, 10)
    []

    iex> PtcRunner.Chunker.by_chars("", 10)
    []

    iex> PtcRunner.Chunker.by_chars("hi", 100)
    ["hi"]
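
The following sketch illustrates why splitting on graphemes matters. It uses only standard-library functions and covers the no-overlap case; it is an illustration of the idea, not the module's actual implementation.

```elixir
# "e" followed by U+0301 (combining acute accent) renders as one character.
text = "e\u0301e\u0301e\u0301"

String.length(text)              # 3 graphemes
length(String.codepoints(text))  # 6 codepoints
byte_size(text)                  # 9 bytes

# Cutting at a byte offset can land inside a multi-byte codepoint.
broken = binary_part(text, 0, 2)
String.valid?(broken)            # false

# Chunking the grapheme list keeps every accented letter intact.
chunks =
  text
  |> String.graphemes()
  |> Enum.chunk_every(2)
  |> Enum.map(&Enum.join/1)
# chunks == ["e\u0301e\u0301", "e\u0301"]
```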

# `by_lines`

```elixir
@spec by_lines(String.t() | nil, pos_integer(), [lines_opt()]) :: result()
```

Splits text into chunks by line count.

## Examples

    iex> PtcRunner.Chunker.by_lines("a\nb\nc\nd\ne", 2)
    ["a\nb", "c\nd", "e"]

    iex> PtcRunner.Chunker.by_lines("a\nb\nc\nd", 2, overlap: 1)
    ["a\nb", "b\nc", "c\nd"]

    iex> PtcRunner.Chunker.by_lines(nil, 10)
    []

    iex> PtcRunner.Chunker.by_lines("", 10)
    []

    iex> PtcRunner.Chunker.by_lines("short", 100)
    ["short"]
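
The overlap examples above are consistent with a sliding window in which each chunk starts `size - overlap` lines after the previous one and chunking stops as soon as a chunk reaches the last line. The sketch below implements that reading; `LineWindowSketch` is a hypothetical module and the stopping rule is inferred from the examples, not taken from the library source.

```elixir
defmodule LineWindowSketch do
  # Hypothetical helper, not part of PtcRunner.Chunker.
  # Assumption: the window advances by (size - overlap) lines and stops
  # once a chunk contains the final line.
  def by_lines(text, size, overlap \\ 0)
      when size > 0 and overlap >= 0 and overlap < size do
    if text == "" do
      []
    else
      text
      |> String.split("\n")
      |> window(size, size - overlap, [])
      |> Enum.map(&Enum.join(&1, "\n"))
    end
  end

  defp window(lines, size, step, acc) do
    {chunk, rest} = Enum.split(lines, size)

    case rest do
      # Nothing left after this chunk: it reached the end, so stop here.
      [] -> Enum.reverse([chunk | acc])
      _ -> window(Enum.drop(lines, step), size, step, [chunk | acc])
    end
  end
end

LineWindowSketch.by_lines("a\nb\nc\nd", 2, 1)
# ["a\nb", "b\nc", "c\nd"]
```

Note that `Enum.chunk_every/4` with an empty leftover list would emit an extra trailing `["d"]` chunk here, which is why the sketch uses an explicit stopping condition.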

# `by_tokens`

```elixir
@spec by_tokens(String.t() | nil, pos_integer(), [tokens_opt()]) :: result()
```

Splits text into chunks by approximate token count.

Uses a tokenizer to estimate token count. The default `:simple` tokenizer
uses a heuristic of 4 characters per token.

## Examples

    iex> PtcRunner.Chunker.by_tokens("hello world test", 2)
    ["hello wo", "rld test"]

    iex> PtcRunner.Chunker.by_tokens("abcdefghijklmnop", 2, overlap: 1)
    ["abcdefgh", "efghijkl", "ijklmnop"]

    iex> PtcRunner.Chunker.by_tokens(nil, 10)
    []

    iex> PtcRunner.Chunker.by_tokens("", 10)
    []

    iex> PtcRunner.Chunker.by_tokens("hi", 100)
    ["hi"]

Custom tokenizer example:

    iex> tokenizer = fn text -> div(String.length(text), 2) end
    iex> PtcRunner.Chunker.by_tokens("abcdefgh", 2, tokenizer: tokenizer)
    ["abcd", "efgh"]
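
One way to read the examples above is that the token budget is converted into a character window using an average characters-per-token ratio. The sketch below follows that interpretation without overlap; `TokenSketch` is a hypothetical module, and both the rounding and the way the tokenizer is applied (once, to the whole text) are assumptions, not the library's documented behavior.

```elixir
defmodule TokenSketch do
  # Hypothetical helpers, not part of PtcRunner.Chunker.

  # The documented :simple heuristic: about 4 characters per token.
  # Rounding up is an assumption.
  def simple(text), do: div(String.length(text) + 3, 4)

  # Assumption: the tokenizer runs once on the whole text to derive an
  # average characters-per-token ratio, which turns the token budget
  # into a character window.
  def by_tokens(text, max_tokens, tokenizer \\ &__MODULE__.simple/1) do
    chars = String.length(text)
    total_tokens = max(tokenizer.(text), 1)
    window = max(div(chars * max_tokens, total_tokens), 1)

    text
    |> String.graphemes()
    |> Enum.chunk_every(window)
    |> Enum.map(&Enum.join/1)
  end
end

TokenSketch.by_tokens("hello world test", 2)
# ["hello wo", "rld test"]

half = fn text -> div(String.length(text), 2) end
TokenSketch.by_tokens("abcdefgh", 2, half)
# ["abcd", "efgh"]
```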

