Jindřich Bär

Collecting the gold-standard data for benchmarking

This Python notebook shows the process of benchmarking the search result ranking for the Charles Explorer application. It is a part of my diploma thesis at the Faculty of Mathematics and Physics, Charles University, Prague.

Find out more about the thesis in the GitHub repository.

Made by Jindřich Bär, 2024.

In the previous post, we picked the best queries for our benchmarking. To recap, we’ll be comparing the search result rankings of (different configurations of) Charles Explorer and the Elsevier Scopus search engine.

Our target is to explore the usability of “local” (university-wide) graph data about publications, their authors and their collaborations for improving the search result ranking in the Charles Explorer application. We’re using Elsevier Scopus as the benchmark, as it is a widely used and well-established search engine for scientific publications - and contains data about citations and other metrics that we don’t have in our local graph data.

For this purpose, we have selected a set of queries returning a representative set of results (their distribution among the faculties is close to the distribution of the whole dataset). We will run these queries in both search engines and compare the results.

To start, we can load both the dataset of the queries and the dataset of the search results from the default Charles Explorer configuration.

import pandas as pd

queries = pd.read_csv('./best_queries.csv')
charles_explorer_results = pd.read_csv('./search_results.csv')

# Search results contain a superset of the queries in the best queries file.
# We begin by filtering the search results to only include the queries in the best queries file.
charles_explorer_results[charles_explorer_results['query'].isin(queries['0'])].to_csv('./filtered_search_results.csv', index=False)

charles_explorer_results = pd.read_csv('./filtered_search_results.csv')

charles_explorer_results[charles_explorer_results['query'] == 'thrombolytic']['name']

   Thrombolytic therapy in acute pulmonary embolism
  Guidelines for Intravenous Thrombolytic Therap...
  Pulmonary embolism - thrombolytic and anticoag...
  Pulmonary embolism - thrombolytic and anticoag...
  Thrombolytic therapy of acute central retinal ...
                              ...                        
  Intravenous Thrombolysis in Unknown-Onset Stro...
  Variable penetration of primary angioplasty in...
  Prospective Single-Arm Trial of Endovascular M...
  "Stent 4 Life" Targeting PCI at all who will b...
  How does the primary coronary angioplasty effe...
Name: name, Length: 87, dtype: object

An experiment showed that, for the query thrombolytic, only 3 of the 30 Charles Explorer results collected originally were listed among the 50 most relevant search results in Elsevier Scopus.

To provide more complete data for the benchmark, we sample the top 100 search results for each query in both search engines - so we get better insight into whether some publications are missing or whether they are just ranked lower.
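The difference between "missing" and "ranked lower" can be illustrated with a small overlap count. This is only a sketch: the function name `overlap_at_k` and the identifiers in the lists are hypothetical, and it assumes that results from both engines can be matched by some shared key (for example a normalized title), which the notebook has not established at this point.

```python
def overlap_at_k(reference, candidate, k):
    """Count how many reference items appear among the first k candidate items."""
    top_k = set(candidate[:k])
    return sum(1 for item in reference if item in top_k)

# Made-up identifiers, for illustration only.
charles_explorer_top = ["a", "b", "c", "d"]
scopus_ranked = ["x", "b", "y", "a", "z", "c"]

shallow = overlap_at_k(charles_explorer_top, scopus_ranked, k=3)
deep = overlap_at_k(charles_explorer_top, scopus_ranked, k=6)
print(shallow, deep)
```

In this toy example, a cutoff of 3 finds one shared item, while a cutoff of 6 finds three. Item "d" is absent from the other list at any depth, which is the case a deeper sample helps to tell apart.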

Since we first collected only the top 30 search results, we now have to rerun the result collection process (on the selected queries).

from utils.charles_explorer import get_dataframe_for_queries

charles_explorer_results = get_dataframe_for_queries(queries['0'].to_list())
charles_explorer_results.to_csv('./filtered_search_results.csv', index=False)

Processed 0 queries
Processed 30 queries
Processed 60 queries
Processed 90 queries
Processed 120 queries
Processed 150 queries

The dataset is now missing the ranking order for the search results. The records are however sorted by the “relevance” score (the search result ordering is the same as in the application), so we can easily add the ranking order back.

charles_explorer_results = pd.read_csv('./filtered_search_results.csv')

charles_explorer_results['ranking'] = None

for query in charles_explorer_results['query'].unique():
    # Rows of one query are already in relevance order, so we number them 1..n.
    mask = charles_explorer_results['query'] == query
    charles_explorer_results.loc[mask, 'charles_explorer_position'] = range(1, int(mask.sum()) + 1)

charles_explorer_results[charles_explorer_results['query'] == 'biology']

	id	year	name	faculty	faculty_name	query	ranking	charles_explorer_position
2404	478570	2005.0	Developmental biology for medics	11130	Second Faculty of Medicine	biology	None	1.0
2405	129678	2010.0	Basic statistics for biologists (Statistics wi...	11320	Faculty of Mathematics and Physics	biology	None	2.0
2406	439661	2001.0	Symposium 'Electromagnetic Aspects of Selforga...	11510	Faculty of Physical Education and Sport	biology	None	3.0
2407	121104	2009.0	Jesuit´s other face: Bohuslaus Balbinus as bio...	11310	Faculty of Science	biology	None	4.0
2408	37559	1998.0	Review of - Carbohydrates - Structure and Biology	11310	Faculty of Science	biology	None	5.0
...	...	...	...	...	...	...	...	...
2499	73261	2005.0	Modern X-ray imaging techniques and their use ...	-1	Unknown faculty	biology	None	96.0
2500	31897	2003.0	Teaching tasks for biology education	11310	Faculty of Science	biology	None	97.0
2501	566168	2019.0	Hands-on activities in biology: students' opinion	11310	Faculty of Science	biology	None	98.0
2502	279719	2013.0	One of the Overlooked Themes in High School Pr...	11310	Faculty of Science	biology	None	99.0
2503	551485	2018.0	Evolutionary biology : In the history, today a...	11310	Faculty of Science	biology	None	100.0

100 rows × 8 columns
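As a side note, the same per-query numbering can be sketched without an explicit loop, using pandas `groupby` with `cumcount`. The toy DataFrame below is made up for illustration. Unlike the loop above, this variant produces integer positions rather than floats.

```python
import pandas as pd

# Toy data standing in for the real results (values are made up).
results = pd.DataFrame({
    "query": ["biology", "biology", "physics", "biology", "physics"],
    "name": ["A", "B", "C", "D", "E"],
})

# cumcount() numbers the rows of each group from 0, in their existing order,
# so adding 1 gives a 1-based position within each query.
results["charles_explorer_position"] = results.groupby("query").cumcount() + 1
print(results)
```

This relies on the same assumption as the loop: the rows of each query must already be sorted by relevance.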

Loading the Elsevier Scopus search results

We start by loading the dataset of the search results from the Elsevier Scopus search engine.

The Scopus Advanced search feature allows us to submit search queries in a special query language. This query language offers a set of Prolog-like functors[1], each connected to a specific attribute (or a set of attributes) of the publication record. The arguments of these functors are used in a substring search on the specified fields.

Apart from this, the query language also supports logical operators, such as AND, OR, and AND NOT.

We will use two of the available functors: TITLE-ABS-KEY and AF-ID:

TITLE-ABS-KEY searches for the specified substring in the title, abstract, and keywords of the publication record. In this regard, it is similar to the full-text search in Charles Explorer, which searches in the same fields.
AF-ID filters the search results by the affiliation ID of the author. This is useful for restricting the search results to publications where at least one of the authors is affiliated with Charles University. Since Elsevier Scopus contains many records not affiliated with Charles University, while Charles Explorer contains only Charles University records, this gives us more comparable sets of search results.
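To make the composition of functors and logical operators concrete, here is a minimal sketch of a query builder. The helper names `title_abs_key` and `build_query` are hypothetical, and the affiliation ID is the one used in this notebook's request code. Only the operators and functors mentioned above are used, and the sketch does not escape quotes inside the search terms.

```python
CHARLES_UNIVERSITY_AF_ID = "60016605"  # affiliation ID used in this notebook's queries

def title_abs_key(term):
    """Wrap a search term in the TITLE-ABS-KEY functor."""
    return f'TITLE-ABS-KEY ( "{term}" )'

def build_query(include, exclude=(), affiliation_id=CHARLES_UNIVERSITY_AF_ID):
    """Join the affiliation filter and the included terms with AND,
    then append every excluded term with AND NOT."""
    parts = [f"AF-ID ( {affiliation_id} )"]
    parts += [title_abs_key(term) for term in include]
    query = " AND ".join(parts)
    for term in exclude:
        query += f" AND NOT {title_abs_key(term)}"
    return query

example = build_query(["thrombolytic"], exclude=["stroke"])
print(example)
```

The benchmark itself only needs the simplest case, a single term combined with the affiliation filter.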

By calling the Scopus API, we can get the search results in JSON format. We can then parse the JSON and load the search results into a pandas DataFrame.

import json
import subprocess
import pandas as pd

def get_query_string(query):
    return f"AF-ID ( 60016605 ) AND TITLE-ABS-KEY ( \"{query}\" )"

def get_query_object(query, limit=10, offset=0):
    return {
        "documentClassificationEnum": "primary",
        "query": get_query_string(query),
        "sort": "r-f",
        "itemcount": limit,
        "offset": offset,
        "showAbstract": False
    }

def get_curl_call(query, limit=10, offset=0):
    return f"""curl\
 'https://www.scopus.com/api/documents/search'\
 --compressed\
 -X POST\
 -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0'\
 -H 'Accept: */*'\
 -H 'Accept-Language: en,cs;q=0.7,en-US;q=0.3'\
 -H 'Accept-Encoding: gzip, deflate, br, zstd'\
 -H 'content-type: application/json'\
 -H 'Origin: https://www.scopus.com'\
 -H 'Connection: keep-alive'\
 -H 'Cookie: ######################################### Cookies have been removed for security reasons. #########################################' \
 -H 'Sec-Fetch-Dest: empty'\
 -H 'Sec-Fetch-Mode: cors'\
 -H 'Sec-Fetch-Site: same-origin'\
 -H 'Priority: u=4'\
 -H 'Pragma: no-cache'\
 -H 'Cache-Control: no-cache'\
 -H 'TE: trailers' \
 --data-raw '{json.dumps(get_query_object(query, limit, offset))}'"""

def call_scopus_api(query, limit=10, offset=0):
    result = subprocess.run(get_curl_call(query, limit, offset), check=True, shell=True, stdout=subprocess.PIPE)
    return json.loads(result.stdout.decode('utf-8'))

df = pd.DataFrame()

for query in queries['0']:
    print(query)
    response = call_scopus_api(query, limit=100)
    rankings = pd.DataFrame.from_dict(response['items'])

    rankings['query'] = query
    rankings['ranking'] = range(0, len(rankings))

    df = pd.concat([
        df, 
        rankings    
    ])
    df.to_csv('./scopus_results.csv', index=False)

import ast

df = pd.read_csv('./scopus_results.csv')
df['citationCount'] = df['citations'].apply(lambda x: ast.literal_eval(x).get('count'))
df['referenceCount'] = df['references'].apply(lambda x: ast.literal_eval(x).get('count'))

df

	query	ranking	links	citations	references	totalAuthors	freetoread	eid	subjectAreas	authors	...	abstractAvailable	publicationStage	sourceRelationship	pubYear	databaseDocumentIds	titles	source	title	citationCount	referenceCount
0	physics	0	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 0, 'link': 'https://www.scopus.com/a...	{'count': 11, 'link': 'https://www.scopus.com/...	2.0	False	2-s2.0-85130750924	[{'code': 31, 'displayName': 'Physics and Astr...	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	True	final	{'issue': '', 'volume': '2458', 'articleNumber...	2022.0	{'SCP': '85130750924', 'PUI': '638086293', 'SC...	['Ideas of Various Groups of Experts as a Star...	{'active': True, 'publisher': 'American Instit...	Ideas of Various Groups of Experts as a Starti...	0	11
1	physics	1	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 1, 'link': 'https://www.scopus.com/a...	{'count': 3, 'link': 'https://www.scopus.com/a...	2.0	True	2-s2.0-85072187757	[{'code': 31, 'displayName': 'Physics and Astr...	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	True	final	{'issue': '1', 'volume': '1286', 'articleNumbe...	2019.0	{'SCP': '85072187757', 'PUI': '629310532', 'SC...	['Practical Course on School Experiments for F...	{'active': True, 'publisher': 'Institute of Ph...	Practical Course on School Experiments for Fut...	1	3
2	physics	2	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 0, 'link': 'https://www.scopus.com/a...	{'count': 2, 'link': 'https://www.scopus.com/a...	4.0	False	2-s2.0-85099716344	[{'code': 31, 'displayName': 'Physics and Astr...	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	False	final	{'issue': '', 'volume': '', 'articleNumber': '...	2020.0	{'SCP': '85099716344', 'PUI': '633981292', 'SC...	['Collection of solved physics problems and co...	{'active': False, 'publisher': 'Slovak Physica...	Collection of solved physics problems and coll...	0	2
3	physics	3	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 8, 'link': 'https://www.scopus.com/a...	{'count': 77, 'link': 'https://www.scopus.com/...	2.0	False	2-s2.0-85099862464	[{'code': 33, 'displayName': 'Social Sciences'}]	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	True	final	{'issue': '4', 'volume': '43', 'articleNumber'...	2021.0	{'SNGEO': '2021003302', 'SCP': '85099862464', ...	['Physics demonstrations: who are the students...	{'active': True, 'publisher': 'Routledge', 'pu...	Physics demonstrations: who are the students a...	8	77
4	physics	4	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 0, 'link': 'https://www.scopus.com/a...	{'count': 17, 'link': 'https://www.scopus.com/...	4.0	False	2-s2.0-85130747100	[{'code': 31, 'displayName': 'Physics and Astr...	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	True	final	{'issue': '', 'volume': '2458', 'articleNumber...	2022.0	{'SCP': '85130747100', 'PUI': '638086272', 'SC...	['Use of the Collection of Solved Problems in ...	{'active': True, 'publisher': 'American Instit...	Use of the Collection of Solved Problems in Ph...	0	17
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7810	specific	95	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 3, 'link': 'https://www.scopus.com/a...	{'count': 96, 'link': 'https://www.scopus.com/...	1.0	True	2-s2.0-85053545422	[{'code': 33, 'displayName': 'Social Sciences'}]	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	True	final	{'issue': '3', 'volume': '50', 'articleNumber'...	2018.0	{'SCP': '85053545422', 'PUI': '623972267', 'SC...	['Academia without contention? The legacy of C...	{'active': True, 'publisher': 'Sociologicky Us...	Academia without contention? The legacy of Cze...	3	96
7811	specific	96	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 8, 'link': 'https://www.scopus.com/a...	{'count': 24, 'link': 'https://www.scopus.com/...	2.0	False	2-s2.0-34547786691	[{'code': 23, 'displayName': 'Environmental Sc...	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	True	final	{'issue': '2', 'volume': '55', 'articleNumber'...	2007.0	{'SCP': '34547786691', 'PUI': '47225913', 'SNC...	['Specific pollution of surface water and sedi...	{'active': True, 'publisher': '', 'publication...	Specific pollution of surface water and sedime...	8	24
7812	specific	97	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 0, 'link': 'https://www.scopus.com/a...	{'count': 56, 'link': 'https://www.scopus.com/...	2.0	False	2-s2.0-85193778000	[{'code': 33, 'displayName': 'Social Sciences'...	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	True	final	{'issue': '', 'volume': '', 'articleNumber': '...	2023.0	{'SCP': '85193778000', 'SRC-OCC-ID': '95648141...	['Reflective approaches to professionalisation...	{'active': False, 'publisher': 'Springer Inter...	Reflective approaches to professionalisation t...	0	56
7813	specific	98	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 0, 'link': 'https://www.scopus.com/a...	{'count': 0, 'link': 'https://www.scopus.com/a...	1.0	False	2-s2.0-85185678534	[{'code': 33, 'displayName': 'Social Sciences'}]	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	True	final	{'issue': '', 'volume': '14', 'articleNumber':...	2023.0	{'SCP': '85185678534', 'SRC-OCC-ID': '95444925...	['THE CONCEPT OF DUE DILIGENCE IN THE CONTEXT ...	{'active': True, 'publisher': 'Czech Society o...	THE CONCEPT OF DUE DILIGENCE IN THE CONTEXT OF...	0	0
7814	specific	99	[{'rel': 'self', 'type': 'GET', 'href': 'https...	{'count': 1, 'link': 'https://www.scopus.com/a...	{'count': 70, 'link': 'https://www.scopus.com/...	3.0	False	2-s2.0-84937715409	[{'code': 16, 'displayName': 'Chemistry'}]	[{'links': [{'rel': 'self', 'type': 'GET', 'hr...	...	True	final	{'issue': '7', 'volume': '109', 'articleNumber...	2015.0	{'SCP': '84937715409', 'SNCHEM': '2015122725',...	['Protease-activated receptors: Activation, in...	{'active': True, 'publisher': 'Czech Society o...	Protease-activated receptors: Activation, inhi...	1	70

7815 rows × 24 columns

df.to_csv("./scopus_results.csv", index=False)
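The citations and references columns need `ast.literal_eval` because the nested dictionaries from the API response are written to the CSV file as their Python text representation and come back as plain strings. The sketch below shows a more defensive variant of the extraction. The helper name `extract_count` is hypothetical, and it assumes that missing cells should simply yield None.

```python
import ast

def extract_count(cell):
    """Return the 'count' field of a stringified dict, or None if unavailable."""
    if not isinstance(cell, str):
        return None  # e.g. NaN read from an empty CSV cell
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None  # the cell does not hold a valid Python literal
    return parsed.get("count") if isinstance(parsed, dict) else None

print(extract_count("{'count': 11, 'link': 'https://example.org'}"))
print(extract_count(float("nan")))
```

With such a helper, the two assignments above could be written as `df['citations'].apply(extract_count)` and `df['references'].apply(extract_count)`.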

With the data loaded into the DataFrame, we can now explore, for example, the correlation between the numerical attributes of the search results and their position in the relevance ranking.

df.select_dtypes(('int', 'float')).corr()

	ranking	totalAuthors	scopusId	pubYear	citationCount	referenceCount
ranking	1.000000	0.038005	0.081229	0.109848	0.062467	0.053487
totalAuthors	0.038005	1.000000	0.033948	0.040538	0.113336	0.094358
scopusId	0.081229	0.033948	1.000000	0.806411	0.015393	0.243830
pubYear	0.109848	0.040538	0.806411	1.000000	0.033019	0.283521
citationCount	0.062467	0.113336	0.015393	0.033019	1.000000	0.218415
referenceCount	0.053487	0.094358	0.243830	0.283521	0.218415	1.000000

We can see that the ranking column is only very weakly correlated with the citationCount and referenceCount columns (coefficients 0.06 and 0.05). Its strongest correlation is with the pubYear column, and even that is weak (0.11). This suggests that the default Scopus ranking is driven mostly by the full-text match and applies little reranking based on these metrics.
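Since ranking is an ordinal position and the table above pools all queries together, a rank-based coefficient computed per query may be a useful cross-check. The sketch below uses made-up toy values with the same column names. The helper `spearman` is hypothetical and relies on the fact that Spearman's coefficient equals Pearson's coefficient computed on ranks.

```python
import pandas as pd

# Toy data (made-up values) with the same column names as the Scopus DataFrame.
toy = pd.DataFrame({
    "query": ["a"] * 4 + ["b"] * 4,
    "ranking": [0, 1, 2, 3, 0, 1, 2, 3],
    "citationCount": [50, 10, 5, 0, 0, 2, 9, 100],
})

def spearman(x, y):
    # Spearman's coefficient is Pearson's coefficient applied to the ranks.
    return x.rank().corr(y.rank())

# One coefficient per query instead of a single pooled value.
per_query = {
    query: spearman(group["ranking"], group["citationCount"])
    for query, group in toy.groupby("query")
}
print(per_query)
```

In the toy data, citations fall monotonically with the position for query "a" and rise for query "b", so the two coefficients have opposite signs. Pooling such queries would hide both trends, which is why a per-query view can be informative.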
