Purpose: To investigate the ability of generative artificial intelligence models to answer ophthalmology board-style questions.

Design: Experimental study.

Methods: This study evaluated three large language models (LLMs) with chat interfaces: Bing Chat (Microsoft) and ChatGPT-3.5 and ChatGPT-4.0 (OpenAI), using 250 questions from the Basic and Clinical Science Course (BCSC) Self-Assessment Program (SAP). Whereas ChatGPT is trained on information last updated in 2021, Bing Chat incorporates more recently indexed internet search results to generate its answers. Performance was compared with that of human respondents. Questions were categorized by complexity and by phase of patient care, and instances of information fabrication (hallucination) or non-logical reasoning were documented.

Main outcome measures: The primary outcome was response accuracy. Secondary outcomes were performance within question subcategories and the frequency of hallucinations.

Results: Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), while ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled in workup-type questions relative to diagnostic questions (OR = 3.89, 95% CI 1.19-14.73, p = 0.03) but struggled with image interpretation relative to single-step reasoning questions (OR = 0.14, 95% CI 0.05-0.33, p < 0.01). Compared with single-step questions, Bing Chat also had difficulty with image interpretation (OR = 0.18, 95% CI 0.08-0.44, p < 0.01) and multi-step reasoning (OR = 0.30, 95% CI 0.11-0.84, p = 0.02). ChatGPT-3.5 had the highest rate of hallucinations or non-logical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%).

Conclusions: LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform comparably to human respondents on questions from the BCSC SAP. The frequency of hallucinations and non-logical reasoning suggests room for improvement in the performance of conversational agents in the medical domain.
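Note on automating a comparable evaluation: the study posed questions through the models' chat interfaces, so the Python sketch below is only an illustration of how a similar multiple-choice evaluation could be scripted against the OpenAI API. The model name, prompt format, example question, and scoring rule are assumptions for illustration, not the study's protocol; BCSC SAP items are proprietary and are not reproduced.

```python
# Minimal sketch (assumed setup): score one board-style multiple-choice question
# via the OpenAI API. Requires the `openai` package (v1+) and OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

def ask_mcq(stem: str, choices: dict[str, str], model: str = "gpt-4") -> str:
    """Pose a single multiple-choice question and return the model's raw reply."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    prompt = f"{stem}\n{options}\nAnswer with the single letter of the best option."
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic-as-possible answers for scoring
    )
    return resp.choices[0].message.content.strip()

# Hypothetical question for illustration only (not a BCSC SAP item).
reply = ask_mcq(
    "Which retinal layer contains the photoreceptor cell bodies?",
    {"A": "Ganglion cell layer", "B": "Outer nuclear layer",
     "C": "Inner plexiform layer", "D": "Nerve fiber layer"},
)
print("correct" if reply.upper().startswith("B") else f"incorrect: {reply}")
```

In practice, responses would be logged alongside question metadata (complexity, phase of care, image dependence) so that accuracy and hallucination frequency could be tabulated by subcategory, as reported in the Results.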
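The abstract does not state how the odds ratios were estimated; one conventional approach, sketched below under that assumption, is a Wald odds ratio with an approximate 95% confidence interval computed from a 2x2 table of correct versus incorrect answers in two question categories. The counts used here are illustrative only and are not the study's data.

```python
import math

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Wald odds ratio and approximate 95% CI from a 2x2 table.
    a/b = correct/incorrect in the category of interest,
    c/d = correct/incorrect in the reference category."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Illustrative counts only (not taken from the study):
# 40/50 "workup" questions correct vs. 60/100 "diagnostic" questions correct.
print(odds_ratio_ci(a=40, b=10, c=60, d=40))
```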