First thought: Wow, that looks just like how Syndicate works.
Second thought: that's a terrible idea (at least in 2025).
There's a tutorial (this one, I think https://youtu.be/i_XV78N7Zuo) on how to make a tool to compose your tiles.
If you want to make a tile-space renderer, that's harder, but having done it, I can probably talk you through it. You need to look through tile-space diagonally to make in-front/behind work correctly. The way I'd do it today would probably be to 'shoot rays' along the view direction into the tile-space, and for each ray record the first tile fragment it hits, or however many fragments are necessary to completely obscure the view. Then "just" render from that look-up table. (I say "just" because there's a fruity view(x, y) to tile(x, y, z) transform, and you still need to render transient objects at the correct depth. Also, scrolling/panning: do you only do that by tile, or do you also do sub-tile-fragment panning?)
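To make the look-up-table idea a bit more concrete, here's a rough Python sketch. Treat all of it as assumption for illustration, not a definitive implementation: the projection (tile (x, y, z) lands on view cell (x - y, x + y - 2z), so tiles that differ by (1, 1, 1) sit on the same view cell, with higher z nearer the camera), the `Tile` type, the `world` dict, and the names `tile_to_view` and `build_lookup` are all made up here.

```python
from typing import NamedTuple


class Tile(NamedTuple):
    """Hypothetical tile record: a name plus whether it fully hides what is behind it."""
    name: str
    opaque: bool


def tile_to_view(x, y, z):
    """Assumed isometric-style projection from tile space to view cells."""
    return (x - y, x + y - 2 * z)


def build_lookup(world, depth, view_cells):
    """For each view cell, walk diagonally through tile space, nearest tile first.

    world      -- dict mapping (x, y, z) to Tile; missing keys are empty space
    depth      -- number of z levels to search (z = 0 .. depth - 1)
    view_cells -- iterable of (u, v) view cells to fill in
    Returns a dict mapping (u, v) to the list of tile positions to draw,
    ordered front to back, stopping at the first opaque tile.
    """
    table = {}
    for (u, v) in view_cells:
        # Under the assumed projection only cells with u + v even map to tiles.
        if (u + v) % 2:
            continue
        # The tile on this ray at z = 0; the ray then steps by (1, 1, 1).
        x0, y0 = (u + v) // 2, (v - u) // 2
        hits = []
        for z in range(depth - 1, -1, -1):  # highest z is nearest the camera
            pos = (x0 + z, y0 + z, z)
            tile = world.get(pos)
            if tile is None:
                continue
            hits.append(pos)
            if tile.opaque:
                break  # nothing behind this tile can be seen
        if hits:
            table[(u, v)] = hits
    return table


world = {
    (0, 0, 0): Tile("floor", True),
    (1, 1, 1): Tile("glass", False),  # see-through, so the floor behind it is kept
    (1, 0, 0): Tile("wall", True),
}
table = build_lookup(world, depth=3, view_cells=[(0, 0), (1, 1)])
```

To render, you'd draw each cell's list back to front (reversed). A moving object would need slotting into that order by its own position along the ray, which this sketch doesn't attempt, and neither does it deal with sub-tile panning.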
If you can get away with just stacking some tilemaps, do that instead, but ask if you need more.
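For comparison, the stacked-tilemap route can be about this simple. This is only a sketch under my own assumptions: each layer is a plain 2D list with `None` for empty cells, layers are listed bottom to top, and `draw_order` is a made-up name, not something from any engine.

```python
def draw_order(layers):
    """Return (layer_index, row, col, tile) tuples in painter's order.

    layers -- list of equally sized 2D lists, bottom layer first;
              None marks an empty cell that is skipped.
    Drawing the result in sequence paints upper layers over lower ones.
    """
    order = []
    for index, layer in enumerate(layers):
        for row, cells in enumerate(layer):
            for col, tile in enumerate(cells):
                if tile is not None:
                    order.append((index, row, col, tile))
    return order


ground = [["grass", "grass"]]
props = [[None, "tree"]]
calls = draw_order([ground, props])
```

There's no depth look-up at all here, just a fixed layer order, which is why it's worth trying first.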