Can foundational language models be useful in automating cybersecurity tasks? Addressing this open question requires a systematic and comprehensive evaluation of large language models (LLMs) across diverse cyber operational tasks (e.g., incident response, threat identification, forensic analysis), as well as an understanding of their risks and limitations. A significant challenge lies in the absence of a standard benchmark dataset encompassing real-life cyber operational tasks that can be processed by LLMs. This paper tackles this challenge by conducting a preliminary study toward the evaluation and understanding of LLMs for cyber operations automation. To that end, we first identify a list of defensive cyber operational tasks of increasing complexity and suggest the creation of new datasets to accomplish these tasks. Second, we review recent works leveraging LLMs for downstream cyber operational tasks to identify research gaps and open problems. Third, we propose a framework for understanding and benchmarking cyber operational tasks, and report potential solutions and research directions for the reliable evaluation of LLMs. Finally, this paper serves as an open call to cybersecurity researchers and professionals to contribute to the development of an open-source evaluation framework, paving the way for the trustworthy use of foundation models in the cyber domain.