Abstract
Long Short-Term Memory (LSTM) networks can retain memory and learn from data sequences, achieving state-of-the-art accuracy in many applications such as speech recognition, natural language processing and video classification. Field-Programmable Gate Arrays (FPGAs) have been used to speed up LSTM inference, but FPGA-based LSTM accelerators are limited by the size of on-chip memory and the bandwidth of external memory on FPGA boards. We propose a novel hardware architecture to overcome data dependency, together with a new blocking-batching strategy that reuses LSTM weights fetched from external memory, to optimize the performance of systems whose on-chip memory is too small to hold large machine learning models. Evaluation results show that our architecture achieves 20.8 GOPS/W, which is among the highest for FPGA-based LSTM designs that store weights in off-chip memory. Our design achieves 1.65 times higher performance-per-watt and 2.48 times higher performance-per-DSP than the current state-of-the-art LSTM designs with weights stored in off-chip memory. Compared with CPU and GPU implementations, our FPGA implementation is 23.7 and 1.3 times faster while consuming 208 and 19.2 times less energy, respectively, which shows that our approach enables large LSTM systems to be processed efficiently on FPGAs with high performance and low power consumption.
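To illustrate the general idea behind the blocking-batching strategy mentioned above, the following minimal NumPy sketch shows how a weight tile fetched once from external memory can be reused across a whole batch of inputs before the next tile is loaded. This is only an assumed software analogue for illustration; the function name, tile sizes and shapes are hypothetical and do not reflect the actual hardware implementation described in the paper.

```python
import numpy as np

def blocked_batched_matvec(W, X, block_rows, block_cols):
    """Illustrative sketch (not the paper's design) of blocking-batching weight reuse.

    W : (R, C) weight matrix held in 'external memory' (here, host DRAM).
    X : (B, C) batch of input vectors.
    Each (block_rows x block_cols) tile of W is fetched once and reused across
    all B inputs before the next tile is loaded, amortizing off-chip bandwidth
    over the batch.
    """
    R, C = W.shape
    B = X.shape[0]
    Y = np.zeros((B, R))
    for r0 in range(0, R, block_rows):
        for c0 in range(0, C, block_cols):
            tile = W[r0:r0 + block_rows, c0:c0 + block_cols]  # one off-chip fetch
            # Reuse the fetched tile for every sample in the batch.
            Y[:, r0:r0 + block_rows] += X[:, c0:c0 + block_cols] @ tile.T
    return Y
```

Under this scheme, the number of weight fetches depends only on the matrix and tile sizes, not on the batch size, which is why batching inputs improves the reuse of each fetched weight block.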