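Search engine indexing
Computer science
Upstream (networking)
Pipeline transport
Natural language
Data science
Information retrieval
Artificial intelligence
Computer network
Engineering
Environmental engineering
Authors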
James Lee Martin,M Nur Arif Zanuri,Muthu Kumar Sockalingam,Eric Andersen
Identifier
DOI:10.2523/iptc-23626-ea
Abstract
Large Language Models (LLMs) are currently attracting enormous interest in many domains. Their general nature and their ability to "understand" natural language have already stimulated multiple areas of research at our company. Here we demonstrate a Natural Language querying system that searches a large repository of unstructured exploration data. The system supports follow-up querying on the returned results, plus automatic summarization of content. It is integrated into our novel end-to-end data-mining platform, which continuously mines our unstructured exploration data for changes and indexes the results. Central to our method are the enrichment processes that run before the LLM is used. Our approach avoids the usual "chunking" techniques, which in our experience yield inferior results, especially across the multiple domain areas of Exploration. By integrating our novel ontology-model AI into the enrichment of the initial index, we drastically boost the search performance of the LLM steps. To perform the search, key parts of our unstructured data, plus the query itself, need to be transformed into vector form. This is done using the embedding feature of the LLM. For this work, we had around 500,000 embeddings to calculate. To improve performance, these were indexed in a leading Analytics Engine as a vector object, allowing fast search via cosine or Euclidean similarity. Our current search time across 500,000 embeddings is under 20 milliseconds. A custom dashboard runs fresh searches of the vector datastore and returns the top matches for further interrogation and analysis. This includes follow-up Natural Language question support on the returned matches for summarization tasks and other customised querying.
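The vector search step described above can be illustrated with a minimal sketch. It assumes embeddings are already available as plain lists of floats; the function names, the toy index, and the document identifiers are all hypothetical, and a brute-force scan stands in for the Analytics Engine's vector index, which the abstract does not name or describe.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, index, k=3, metric="cosine"):
    """Return the k best (doc_id, score) pairs, best match first.

    `index` maps a document id to its embedding vector (hypothetical layout).
    """
    if metric == "cosine":
        scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # higher is better
    elif metric == "euclidean":
        scored = [(doc_id, euclidean_distance(query, vec)) for doc_id, vec in index.items()]
        scored.sort(key=lambda pair: pair[1])  # lower is better
    else:
        raise ValueError(f"unknown metric: {metric}")
    return scored[:k]

# Toy three-dimensional "embeddings"; real LLM embeddings have far more dimensions.
toy_index = {
    "well_report": [0.9, 0.1, 0.0],
    "seismic_survey": [0.1, 0.9, 0.1],
    "basin_study": [0.5, 0.5, 0.1],
}
query_vector = [0.8, 0.2, 0.0]

matches = top_k(query_vector, toy_index, k=2, metric="cosine")
for doc_id, score in matches:
    print(doc_id, round(score, 3))
```

A linear scan like this grows with the number of embeddings, so it would not by itself reach the reported search times at 500,000 vectors; a dedicated vector index is the usual way to get there.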
Since our exploration-specific ontology model can tag each piece of data with over 40 exploration-specific labels, we are able to cross-examine the results returned by the LLM against those tags. Agreement across a range of queries, from targeted, highly specific questions to general, open-ended ones, was surprisingly good. Natural Language based querying of our unstructured data is opening a whole new approach to data discovery in our company. Tailoring it to the exploration domain has required specific domain expertise and a novel ontology model to ensure relevant prompts and query results. Obtaining search results quickly has also required expertise and fine-tuning. Future directions include ingesting more data, scaling the support infrastructure, and further capability enhancement.
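One way the cross-check between search results and ontology tags might be quantified is sketched below. The abstract does not say how agreement was measured, so the metric here (the fraction of returned documents carrying an expected tag) is an assumption, as are the function name, the tag labels, and the document ids.

```python
def tag_agreement(returned_ids, doc_tags, expected_tag):
    """Fraction of returned documents whose ontology tags include expected_tag.

    Hypothetical metric: documents with no tag entry count as disagreements.
    """
    if not returned_ids:
        return 0.0
    hits = sum(
        1 for doc_id in returned_ids
        if expected_tag in doc_tags.get(doc_id, set())
    )
    return hits / len(returned_ids)

# Illustrative tags only; the real model assigns over 40 exploration-specific labels.
doc_tags = {
    "doc_a": {"seismic", "well_log"},
    "doc_b": {"geochemistry"},
    "doc_c": {"seismic"},
}
returned = ["doc_a", "doc_b", "doc_c", "doc_d"]  # doc_d has no tags on record

score = tag_agreement(returned, doc_tags, "seismic")
print(score)
```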