As the trend to open up data and provide them freely on the Internet has intensified, in volume as well as in the quality and value of the data made available, the linked data community has seized the opportunity to combine, cross-reference, and analyse unprecedented volumes of high-quality data and to build innovative applications. This has caused a tremendous network effect, adding value and creating new opportunities for everybody, including the original data providers. But most of the low-hanging fruit has been picked, and it is time to move on to the next step: combining, cross-indexing and, in general, making the best of all public data, regardless of their size, update rate, and schema. This means accepting that centrally managed repositories (even distributed ones) cannot meet the challenges ahead, and that we need to develop the infrastructure for efficiently querying loose federations of data sources at a large scale.
SemaGrow carried out fundamental database research and developed methods and infrastructure, laying the foundations for the scalable, efficient, and robust data services needed by the data-intensive Science of 2020. The SemaGrow project used the large-scale and complex agricultural data service ecosystem as a testbed for its technologies. During the first reporting period, the project engaged consortium members and external stakeholders to identify user needs and relevant data sources and to draft requirements and the system architecture. To engage stakeholders and translate user needs into requirements, the project focused on defining different types of use cases based on stakeholder requirements. Documenting the outcomes of the stakeholder workshops allowed the project to pinpoint how to develop and then evaluate its results. Through the definition of use cases, the ideas and requirements expressed in these workshops were translated into the application descriptions that form the basis for the development of service demonstrators and connect the needs of the stakeholders to the technological developments of the project. Three categories of use cases are considered in the SemaGrow project, addressing a diverse group of stakeholders and a wide area of application:
- Heterogeneous Data Collections and Streams, focusing on data-intensive experiments in the domain of agricultural and forestry modelling. SemaGrow technologies are used to prepare suitable input datasets for modelling experiments from the wealth of heterogeneous big data collections and streams available. The stakeholders are the modellers who prepare the experiments and the IT personnel who support them. These use cases validate the integrative semantic capabilities provided by the Semagrow Stack as well as its ability to search and retrieve large data volumes.
- Reactive Data Analysis, focusing on the AGRIS portal that serves scientific bibliography and relevant Web resources. SemaGrow technologies are used to federate the diverse endpoints used to search for and retrieve resources that are semantically relevant to a given AGRIS bibliography item. The stakeholders are the Web developers who maintain the portal and the data analysis experts who experiment to refine the relevance scoring mechanism. These use cases validate the ease of adding members to the federation and the reactivity of the Semagrow Stack when searching through big data in order to find results that are not necessarily voluminous.
- Reactive Resource Discovery, focusing on the Linked Open Data Hub (LODH) application that serves diverse bibliographical and educational resources over a simple REST API. The stakeholders are the LOD professionals and enthusiasts who put together Web applications using Web services and the LODH end-users who are searching for resources. These use cases validate the ease of developing Web applications over the SemaGrow Stack and the ability to federate and semantically integrate a large number of heterogeneous data sources, although the contents of any individual data source do not constitute big data by themselves.
The assumption is that the efficiency of the stakeholders in these use cases will significantly improve when replacing current methods of data access with SemaGrow technologies, without otherwise affecting their workflow. That is to say, for example, that agricultural and forestry modelling software, the AGRIS portal, and the clients developed for the ADS Web application will not need to be re-developed in order to take advantage of SemaGrow technologies. Our rigorous testing and evaluation activities validated both the increase in stakeholder efficiency and the effort required to adopt SemaGrow as a data access infrastructure.
By addressing these use cases, the project evaluated its approach to the following major technical challenges:
- Finding small results in big data: in this “needle in a haystack” situation, results from different big datasets are joined in order to retrieve a result set that does not constitute big data by itself. The challenge here is to have an intelligent query execution planner that guides the query execution engine along a query execution plan that never retrieves large quantities of results that do not contribute to the end result because they do not join with subsequent query patterns. Such a query execution planner relies heavily on the availability of accurate instance-level metadata.
- Big results from big data: in this situation the result set constitutes big data, and no amount of query plan optimization can avoid this, since this is what has been requested by the client. Besides the challenges for the query execution engine, handling this situation is also relevant to the ability of the histogram maintenance mechanism to efficiently handle query feedback that is big data.
- Integrating heterogeneous data: accepting that some schemas might be better suited to a given dataset and application and that there is no consensus about a “universal” schema or vocabulary for any given application, the project developed technologies that allow data providers to publish in the manner and form that best suits their processes and purposes, and data consumers to query in the manner and form that best suits theirs.
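To make the “needle in a haystack” challenge more concrete, the following is a minimal sketch of how a planner might use cardinality estimates to order joins so that large intermediate results are avoided. It is not the SemaGrow planner: the `Pattern` class, the `greedy_join_order` function, the pattern names, and the row estimates are all hypothetical, and the greedy heuristic stands in for whatever optimization strategy a real federated query engine would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """A query pattern with a row estimate (hypothetical, for illustration)."""
    name: str
    variables: frozenset
    estimated_rows: int  # assumed to come from instance-level metadata


def greedy_join_order(patterns):
    """Order patterns so that the most selective, connected one runs next."""
    remaining = list(patterns)
    if not remaining:
        return []
    # Start from the pattern expected to return the fewest rows.
    first = min(remaining, key=lambda p: p.estimated_rows)
    order = [first]
    remaining.remove(first)
    bound = set(first.variables)
    while remaining:
        # Prefer patterns sharing a variable with what is already bound,
        # so that each step is a join rather than a cross product.
        connected = [p for p in remaining if p.variables & bound]
        candidates = connected or remaining
        nxt = min(candidates, key=lambda p: p.estimated_rows)
        order.append(nxt)
        remaining.remove(nxt)
        bound |= nxt.variables
    return order


# Made-up estimates: one highly selective pattern and two very large ones.
patterns = [
    Pattern("publications", frozenset({"pub", "topic"}), 5_000_000),
    Pattern("topic_filter", frozenset({"topic"}), 1),
    Pattern("authors", frozenset({"pub", "author"}), 8_000_000),
]
plan = [p.name for p in greedy_join_order(patterns)]
print(plan)
```

In this sketch the selective `topic_filter` pattern is evaluated first, so the two large patterns are only ever probed with bindings that can contribute to the final result. The quality of such a plan depends entirely on how accurate the row estimates are, which is why the planner described above relies on accurate instance-level metadata.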