Antal István

PDF to Markdown on a Mac

Feeding a PDF straight to an LLM chat app works, but it is wasteful. Most apps fall back to a visual pipeline that rasterises every page and burns tokens describing layout you do not care about. Convert the PDF to Markdown first and the same document becomes searchable, diffable, and cheap to feed into a model.

marker does this conversion with surprisingly good handling of tables, equations, and code blocks. On Apple Silicon it runs locally on the GPU through MPS, so documents never leave the machine. The latest releases have a regression on MPS, so pin to 1.8.0 until issue #960 is resolved.

Install

uv tool install --python 3.12 'marker-pdf==1.8.0' --with psutil
printf '\nexport TORCH_DEVICE=mps\nexport PYTORCH_ENABLE_MPS_FALLBACK=1\n' >> ~/.zshrc

Open a new shell so the environment variables take effect. TORCH_DEVICE=mps tells marker to run its PyTorch models on the Metal (MPS) backend, and PYTORCH_ENABLE_MPS_FALLBACK=1 lets operations that are not yet implemented for MPS run on the CPU instead of raising an error.
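A quick way to confirm that the new shell picked up both variables is sketched below. The function name check_marker_env is made up for this example; it only uses printenv, which reports exported variables and exits non-zero when one is missing.

```shell
#!/bin/sh
# check_marker_env (hypothetical helper): print each variable the install
# step exports, and return non-zero if any of them is missing.
check_marker_env() {
  missing=0
  for name in TORCH_DEVICE PYTORCH_ENABLE_MPS_FALLBACK; do
    # printenv exits non-zero when the variable is not exported.
    if value=$(printenv "$name"); then
      echo "$name=$value"
    else
      echo "$name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_marker_env || echo "Open a new shell or run: source ~/.zshrc" >&2
```

If either variable is reported as not set, the export lines probably did not reach ~/.zshrc, or the current shell predates them.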

Convert a single PDF

marker_single input.pdf --output_dir ./out

The output directory will contain a Markdown file, extracted images, and a JSON metadata sidecar.
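To see what a run actually produced, a small helper can count the files by kind. This is only a sketch: summarize_output is a made-up name, and the image extensions it looks for (png, jpg, jpeg) are an assumption, so adjust the patterns if your output uses something else. It searches at any depth, so it does not depend on how the files are nested under the output directory.

```shell
#!/bin/sh
# summarize_output DIR (hypothetical helper): count Markdown, JSON, and image
# files under DIR at any depth. The image extensions are an assumption.
summarize_output() {
  dir=$1
  # tr strips the padding that some wc implementations add.
  md=$(find "$dir" -type f -name '*.md' | wc -l | tr -d ' ')
  json=$(find "$dir" -type f -name '*.json' | wc -l | tr -d ' ')
  images=$(find "$dir" -type f \( -name '*.png' -o -name '*.jpg' -o -name '*.jpeg' \) | wc -l | tr -d ' ')
  echo "markdown=$md json=$json images=$images"
}

# Only summarise if a conversion has already written to ./out.
if [ -d ./out ]; then
  summarize_output ./out
fi
```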

Convert a folder of PDFs

marker ./pdfs --output_dir ./out --workers 4

Tune --workers to the memory you have. On an M-series Mac with plenty of RAM, 4 is a reasonable starting point; push it much higher and the machine will start swapping.
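One rough way to pick a starting value is to divide the memory left after a system reserve by a per-worker budget. The sketch below does that; suggest_workers is a made-up helper, and both defaults (5 GB per worker, 8 GB reserved) are assumptions to replace with what you measure on your own machine.

```shell
#!/bin/sh
# suggest_workers TOTAL_GB [PER_WORKER_GB] [RESERVED_GB] (hypothetical helper)
# Prints how many workers fit in memory after reserving some for the system.
# The defaults (5 GB per worker, 8 GB reserved) are assumptions, not measurements.
suggest_workers() {
  total_gb=$1
  per_worker_gb=${2:-5}
  reserved_gb=${3:-8}
  workers=$(( (total_gb - reserved_gb) / per_worker_gb ))
  # Always run at least one worker.
  if [ "$workers" -lt 1 ]; then
    workers=1
  fi
  echo "$workers"
}

# On macOS, hw.memsize reports physical memory in bytes; skipped elsewhere.
if mem_bytes=$(sysctl -n hw.memsize 2>/dev/null); then
  total_gb=$(( mem_bytes / 1024 / 1024 / 1024 ))
  echo "Suggested --workers: $(suggest_workers "$total_gb")"
fi
```

With these defaults a 32 GB machine gets 4 and a 16 GB machine gets 1. Treat the result as a first guess and watch memory pressure during a real batch.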

