For anyone who can't wait any longer, please give my Dots.OCR.Runner a try: https://apps.apple.com/us/app/dots-ocr-runner/id6753667495
It's implemented on top of llama.cpp. From my experience with this implementation, the most challenging part is the vision encoder, which has 42 layers and consumes a fair amount of memory. With the ggml framework, we can manually control the inference flow to limit the memory peak.
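To illustrate the idea (not the actual ggml API), here is a minimal sketch of the scheduling trick: instead of building one graph that holds intermediate buffers for all 42 encoder layers at once, you run the encoder layer by layer so that only the current layer's activations are alive at any moment. The names `N_LAYERS`, `run_layer`, and `encode` are hypothetical stand-ins for the real per-layer graph evaluation.

```python
# Hypothetical sketch: bound peak memory by evaluating a deep vision
# encoder one layer at a time. Only the current activation survives
# between steps; the previous one becomes unreachable and can be freed.
N_LAYERS = 42  # the dots.ocr vision encoder depth mentioned above

def run_layer(layer_idx, x):
    # Stand-in for one transformer block; in a real ggml port this would
    # build and compute a small graph covering just this layer.
    return [v + layer_idx for v in x]  # placeholder math

def encode(x):
    for i in range(N_LAYERS):
        x = run_layer(i, x)  # old activation dropped -> memory reusable
    return x
```

With this structure the peak working set is roughly one layer's activations rather than 42 layers' worth, at the cost of managing the per-layer compute loop yourself.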
This article was very helpful to me. Looking forward to the upcoming parts!