Dataset Search: Google Launches New Search Engine to Help Scientists Find Datasets




Google's goal has always been to organize the world's information, and its first target was the commercial web. Now it wants to do the same for the scientific community with a new search engine for datasets.

The service, called Dataset Search, launches today and will be a companion to Google Scholar, the company's popular search engine for academic studies and reports. Institutions that publish their data online, such as universities and governments, will need to include metadata tags in their web pages that describe their data, including who created it, when it was published, and so on. This information will then be indexed by Google's search engine and combined with information from the Knowledge Graph. (So if dataset X was published by CERN, some information about the institute will also be included in the search results.)
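As a rough illustration of what such metadata tags could look like, here is a minimal Python sketch that builds a dataset description and wraps it in a script tag for a web page. The article does not name the tag vocabulary; the sketch assumes the schema.org Dataset type expressed as JSON-LD, and the helper names, the dataset, the publisher, and the URL are all hypothetical placeholders.

```python
import json


def dataset_jsonld(name, description, creator, date_published, url):
    """Build a dataset description as a JSON-LD dict (assumes schema.org's Dataset type)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # Who created the data and when it was published, as the article describes.
        "creator": {"@type": "Organization", "name": creator},
        "datePublished": date_published,
    }


def as_script_tag(metadata):
    """Serialize the metadata into a script tag that can be placed in the page's HTML."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical example values, not taken from the article.
example = dataset_jsonld(
    name="Ocean surface temperature readings",
    description="Monthly ocean surface temperature measurements.",
    creator="Example University",
    date_published="2018-01-15",
    url="https://example.org/datasets/ocean-temperature",
)

print(as_script_tag(example))
```

A crawler that understands this vocabulary could read the tag and learn the dataset's name, creator, and publication date without the data itself having to move from the institution's own site.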

Speaking to The Verge, Natasha Noy, a Google AI researcher who helped create Dataset Search, says the goal is to unify the tens of thousands of different repositories for datasets online. "We want this data to be discoverable, but keep it where it is," says Noy.

At present, the publication of datasets is extremely fragmented. Different scientific domains have their own preferred repositories, as do different governments and local authorities. "Scientists say, 'I know where I have to go to find my data, but that's not what I always want,'" says Noy. "Once they step outside their own community, that's when it becomes difficult."

Noy cites the example of a climate scientist she spoke to recently, who told her she had been looking for a specific dataset on ocean temperatures for an upcoming study but could not find it anywhere. She did not track it down until she ran into a colleague at a conference who recognized the dataset and told her where it was hosted. Only then could she continue her work. "And it's not even a particularly boutique repository," says Noy. "The dataset was well written up in a pretty prominent place, but it was still hard to find."


An example of a search for weather records in Google Dataset Search.
Image: Google

The initial version of Dataset Search will cover the environmental and social sciences, government data, and datasets from news organizations such as ProPublica. However, if the service becomes popular, the amount of data it indexes should quickly snowball as institutions and scientists scramble to make their information accessible.

This should be helped by the recent growth of open data initiatives around the world. "I think in recent years, the number of repositories has exploded," says Noy. She attributes this to the growing importance of data in the scientific literature, which means journals now require authors to publish datasets, as well as "government regulations in the United States and Europe and the general rise of the open data movement."

Google's involvement should help this project succeed, says Jeni Tennison, CEO of the Open Data Institute (ODI). "Dataset search has always been a difficult thing to support, and I hope Google will make things easier," she says.

According to Tennison, to build a decent search engine, you need to know how to create user-friendly systems and understand what people mean by what they type. Google obviously knows what it is doing in both of these departments.

In fact, says Tennison, ideally Google would publish its own dataset on how Dataset Search is used. Although the metadata tags the company uses to make datasets visible to its search crawlers are an open standard (which means that any competitor, such as Bing or Yandex, can also use them to build a competing service), search engines depend on their users to provide data on what they are doing.

"Simply understanding how people search is important … what kinds of terms they use, how they express them," says Tennison. "If we want to understand how people search for data and make it more accessible, it would be good if Google opened up its own data on this."

In other words, Google should publish a dataset about dataset search that would itself be indexed by Dataset Search. What could be more appropriate?
