Jindřich Bär

Collecting the gold-standard data for benchmarking

This Python notebook shows the process of benchmarking the search result ranking for the Charles Explorer application. It is a part of my diploma thesis at the Faculty of Mathematics and Physics, Charles University, Prague.

Find out more about the thesis in the GitHub repository.

Made by Jindřich Bär, 2024.

In the previous post, we picked the best queries for our benchmark. To recap, we’ll be comparing the search result ranking of (different configurations of) Charles Explorer with that of the Elsevier Scopus search engine.

Our goal is to explore whether “local” (university-wide) graph data about publications, their authors, and their collaborations can improve the search result ranking in the Charles Explorer application. We’re using Elsevier Scopus as the benchmark because it is a widely used and well-established search engine for scientific publications, and it contains data about citations and other metrics that we don’t have in our local graph data.

For this purpose, we have selected a set of queries returning a representative set of results (their distribution among the faculties is close to the distribution of the whole dataset). We will run these queries in both search engines and compare the results.

To start, we can load both the dataset of the queries and the dataset of the search results from the default Charles Explorer configuration.

import pandas as pd

queries = pd.read_csv('./best_queries.csv')
charles_explorer_results = pd.read_csv('./search_results.csv')

# The search results cover a superset of the queries in the best queries file,
# so we keep only the rows whose query appears in that file.
charles_explorer_results[charles_explorer_results['query'].isin(queries['0'])].to_csv('./filtered_search_results.csv', index=False)
charles_explorer_results = pd.read_csv('./filtered_search_results.csv')

charles_explorer_results[charles_explorer_results['query'] == 'thrombolytic']['name']
7603     Thrombolytic therapy in acute pulmonary embolism
7604    Guidelines for Intravenous Thrombolytic Therap...
7605    Pulmonary embolism - thrombolytic and anticoag...
7606    Pulmonary embolism - thrombolytic and anticoag...
7607    Thrombolytic therapy of acute central retinal ...
                              ...                        
7685    Intravenous Thrombolysis in Unknown-Onset Stro...
7686    Variable penetration of primary angioplasty in...
7687    Prospective Single-Arm Trial of Endovascular M...
7688    "Stent 4 Life" Targeting PCI at all who will b...
7689    How does the primary coronary angioplasty effe...
Name: name, Length: 87, dtype: object

An experiment showed that, for the query thrombolytic, only 3 of the top 30 Charles Explorer results appeared among the 50 most relevant search results in Elsevier Scopus.

To provide more complete data for the benchmark, we sample the top 100 search results for each query in both search engines - so we get better insight into whether some publications are missing or whether they are just ranked lower.
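The kind of overlap check mentioned above can be sketched as follows. This is only an illustration, not the code used for the experiment: the function names `normalize` and `overlap_at_k` and the toy titles are hypothetical, and matching publications by their normalized title is an assumption.

```python
def normalize(title):
    """Lower-case and collapse whitespace so trivially different titles match."""
    return " ".join(title.lower().split())


def overlap_at_k(ours, theirs, k_ours=30, k_theirs=50):
    """Count how many of our top `k_ours` titles occur in their top `k_theirs`."""
    their_top = {normalize(t) for t in theirs[:k_theirs]}
    return sum(normalize(t) in their_top for t in ours[:k_ours])


# Toy data (made up for illustration), ordered from most to least relevant.
charles_explorer_titles = ["Paper A", "Paper B", "Paper C", "Paper D"]
scopus_titles = ["paper c", "Paper X", "Paper  A", "Paper Y"]

shared = overlap_at_k(charles_explorer_titles, scopus_titles, k_ours=3, k_theirs=3)
print(shared)
```

A low count for small cutoffs does not say whether the remaining publications are missing from the other engine or merely ranked lower, which is why a deeper result list helps.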

Since we initially collected only the top 30 search results, we now have to rerun the result collection process (on the selected queries).

from utils.charles_explorer import get_dataframe_for_queries

charles_explorer_results = get_dataframe_for_queries(queries['0'].to_list())
charles_explorer_results.to_csv('./filtered_search_results.csv', index=False)
Processed 0 queries
Processed 30 queries
Processed 60 queries
Processed 90 queries
Processed 120 queries
Processed 150 queries

The dataset is now missing the ranking order for the search results. The records are, however, sorted by the “relevance” score (in the same order as in the application), so we can easily add the ranking order back.

charles_explorer_results = pd.read_csv('./filtered_search_results.csv')

# Placeholder column; it is left empty in this step (see the output below).
charles_explorer_results['ranking'] = None

# The rows are already sorted by relevance within each query,
# so the 1-based position is simply the row's order within its query.
for query in charles_explorer_results['query'].unique():
    mask = charles_explorer_results['query'] == query
    charles_explorer_results.loc[mask, 'charles_explorer_position'] = range(1, int(mask.sum()) + 1)

charles_explorer_results[charles_explorer_results['query'] == 'biology']
      id      year    name                                               faculty  faculty_name                             query    ranking  charles_explorer_position
2404  478570  2005.0  Developmental biology for medics                   11130    Second Faculty of Medicine               biology  None     1.0
2405  129678  2010.0  Basic statistics for biologists (Statistics wi...  11320    Faculty of Mathematics and Physics       biology  None     2.0
2406  439661  2001.0  Symposium 'Electromagnetic Aspects of Selforga...  11510    Faculty of Physical Education and Sport  biology  None     3.0
2407  121104  2009.0  Jesuit´s other face: Bohuslaus Balbinus as bio...  11310    Faculty of Science                       biology  None     4.0
2408  37559   1998.0  Review of - Carbohydrates - Structure and Biology  11310    Faculty of Science                       biology  None     5.0
...   ...     ...     ...                                                ...      ...                                      ...      ...      ...
2499  73261   2005.0  Modern X-ray imaging techniques and their use ...  -1       Unknown faculty                          biology  None     96.0
2500  31897   2003.0  Teaching tasks for biology education               11310    Faculty of Science                       biology  None     97.0
2501  566168  2019.0  Hands-on activities in biology: students' opinion  11310    Faculty of Science                       biology  None     98.0
2502  279719  2013.0  One of the Overlooked Themes in High School Pr...  11310    Faculty of Science                       biology  None     99.0
2503  551485  2018.0  Evolutionary biology : In the history, today a...  11310    Faculty of Science                       biology  None     100.0

100 rows × 8 columns
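As an aside, the per-query position can also be computed without an explicit loop. The sketch below uses a made-up toy frame (the values are not from the dataset) and assumes, as in the notebook, that the rows are already sorted by relevance within each query.

```python
import pandas as pd

# Toy frame (made up): rows are already sorted by relevance within each query.
results = pd.DataFrame({
    "query": ["biology", "biology", "biology", "physics", "physics"],
    "name": ["b1", "b2", "b3", "p1", "p2"],
})

# cumcount() numbers the rows of each group 0, 1, 2, ... in their current order,
# so adding 1 gives a 1-based position per query.
results["charles_explorer_position"] = results.groupby("query").cumcount() + 1

positions = results["charles_explorer_position"].tolist()
print(positions)
```

Unlike the loop, this variant produces integer positions rather than floats, which may matter if later steps compare the column against the stored CSV.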

Loading the Elsevier Scopus search results

We start by loading the dataset of the search results from the Elsevier Scopus search engine.

The Scopus Advanced search feature allows us to submit search queries in a special query language. This language offers a set of Prolog-like functors[1], each tied to a specific attribute (or a set of attributes) of the publication record. The arguments of these functors are used in a substring search on the specified fields.

Apart from this, the query language also supports logical operators, such as AND, OR, and AND NOT.

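We will use two of the available functors: TITLE-ABS-KEY, which searches the title, abstract, and keywords of a publication, and AF-ID, which restricts the results to a given affiliation ID.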
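A small sketch of how the two functors combine with a logical operator into one query string. The helper names `functor` and `combine` are hypothetical and only illustrate the string layout; the affiliation ID is the one used in the notebook's own code.

```python
def functor(name, argument):
    """Render one functor call, e.g. TITLE-ABS-KEY ( "x" )."""
    return f"{name} ( {argument} )"


def combine(operator, *terms):
    """Join terms with a logical operator such as AND, OR, or AND NOT."""
    return f" {operator} ".join(terms)


query = combine(
    "AND",
    functor("AF-ID", "60016605"),
    functor("TITLE-ABS-KEY", '"thrombolytic"'),
)
print(query)
# AF-ID ( 60016605 ) AND TITLE-ABS-KEY ( "thrombolytic" )
```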

By calling the Scopus API, we can get the search results in JSON format. We can then parse the JSON and load the search results into a pandas DataFrame.

import json
import subprocess
import pandas as pd

def get_query_string(query):
    return f"AF-ID ( 60016605 ) AND TITLE-ABS-KEY ( \"{query}\" )"

def get_query_object(query, limit=10, offset=0):
    return {
        "documentClassificationEnum": "primary",
        "query": get_query_string(query),
        "sort": "r-f",
        "itemcount": limit,
        "offset": offset,
        "showAbstract": False
    }

def get_curl_call(query, limit=10, offset=0):
    return f"""curl\
 'https://www.scopus.com/api/documents/search'\
 --compressed\
 -X POST\
 -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0'\
 -H 'Accept: */*'\
 -H 'Accept-Language: en,cs;q=0.7,en-US;q=0.3'\
 -H 'Accept-Encoding: gzip, deflate, br, zstd'\
 -H 'content-type: application/json'\
 -H 'Origin: https://www.scopus.com'\
 -H 'Connection: keep-alive'\
 -H 'Cookie: ######################################### Cookies have been removed for security reasons. #########################################' \
 -H 'Sec-Fetch-Dest: empty'\
 -H 'Sec-Fetch-Mode: cors'\
 -H 'Sec-Fetch-Site: same-origin'\
 -H 'Priority: u=4'\
 -H 'Pragma: no-cache'\
 -H 'Cache-Control: no-cache'\
 -H 'TE: trailers' \
 --data-raw '{json.dumps(get_query_object(query, limit, offset))}'"""

def call_scopus_api(query, limit=10, offset=0):
    result = subprocess.run(get_curl_call(query, limit, offset), check=True, shell=True, stdout=subprocess.PIPE)
    return json.loads(result.stdout.decode('utf-8'))

df = pd.DataFrame()

for query in queries['0']:
    print(query)
    response = call_scopus_api(query, limit=100)
    rankings = pd.DataFrame.from_dict(response['items'])

    rankings['query'] = query
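    # 0-based position in the Scopus result list
    # (note that charles_explorer_position above is 1-based).
    rankings['ranking'] = range(0, len(rankings))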

    df = pd.concat([
        df, 
        rankings    
    ])
    df.to_csv('./scopus_results.csv', index=False)

import ast

df = pd.read_csv('./scopus_results.csv')
df['citationCount'] = df['citations'].apply(lambda x: ast.literal_eval(x).get('count'))
df['referenceCount'] = df['references'].apply(lambda x: ast.literal_eval(x).get('count'))

df
query ranking links citations references totalAuthors freetoread eid subjectAreas authors ... abstractAvailable publicationStage sourceRelationship pubYear databaseDocumentIds titles source title citationCount referenceCount
0 physics 0 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 0, 'link': 'https://www.scopus.com/a... {'count': 11, 'link': 'https://www.scopus.com/... 2.0 False 2-s2.0-85130750924 [{'code': 31, 'displayName': 'Physics and Astr... [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... True final {'issue': '', 'volume': '2458', 'articleNumber... 2022.0 {'SCP': '85130750924', 'PUI': '638086293', 'SC... ['Ideas of Various Groups of Experts as a Star... {'active': True, 'publisher': 'American Instit... Ideas of Various Groups of Experts as a Starti... 0 11
1 physics 1 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 1, 'link': 'https://www.scopus.com/a... {'count': 3, 'link': 'https://www.scopus.com/a... 2.0 True 2-s2.0-85072187757 [{'code': 31, 'displayName': 'Physics and Astr... [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... True final {'issue': '1', 'volume': '1286', 'articleNumbe... 2019.0 {'SCP': '85072187757', 'PUI': '629310532', 'SC... ['Practical Course on School Experiments for F... {'active': True, 'publisher': 'Institute of Ph... Practical Course on School Experiments for Fut... 1 3
2 physics 2 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 0, 'link': 'https://www.scopus.com/a... {'count': 2, 'link': 'https://www.scopus.com/a... 4.0 False 2-s2.0-85099716344 [{'code': 31, 'displayName': 'Physics and Astr... [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... False final {'issue': '', 'volume': '', 'articleNumber': '... 2020.0 {'SCP': '85099716344', 'PUI': '633981292', 'SC... ['Collection of solved physics problems and co... {'active': False, 'publisher': 'Slovak Physica... Collection of solved physics problems and coll... 0 2
3 physics 3 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 8, 'link': 'https://www.scopus.com/a... {'count': 77, 'link': 'https://www.scopus.com/... 2.0 False 2-s2.0-85099862464 [{'code': 33, 'displayName': 'Social Sciences'}] [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... True final {'issue': '4', 'volume': '43', 'articleNumber'... 2021.0 {'SNGEO': '2021003302', 'SCP': '85099862464', ... ['Physics demonstrations: who are the students... {'active': True, 'publisher': 'Routledge', 'pu... Physics demonstrations: who are the students a... 8 77
4 physics 4 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 0, 'link': 'https://www.scopus.com/a... {'count': 17, 'link': 'https://www.scopus.com/... 4.0 False 2-s2.0-85130747100 [{'code': 31, 'displayName': 'Physics and Astr... [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... True final {'issue': '', 'volume': '2458', 'articleNumber... 2022.0 {'SCP': '85130747100', 'PUI': '638086272', 'SC... ['Use of the Collection of Solved Problems in ... {'active': True, 'publisher': 'American Instit... Use of the Collection of Solved Problems in Ph... 0 17
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7810 specific 95 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 3, 'link': 'https://www.scopus.com/a... {'count': 96, 'link': 'https://www.scopus.com/... 1.0 True 2-s2.0-85053545422 [{'code': 33, 'displayName': 'Social Sciences'}] [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... True final {'issue': '3', 'volume': '50', 'articleNumber'... 2018.0 {'SCP': '85053545422', 'PUI': '623972267', 'SC... ['Academia without contention? The legacy of C... {'active': True, 'publisher': 'Sociologicky Us... Academia without contention? The legacy of Cze... 3 96
7811 specific 96 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 8, 'link': 'https://www.scopus.com/a... {'count': 24, 'link': 'https://www.scopus.com/... 2.0 False 2-s2.0-34547786691 [{'code': 23, 'displayName': 'Environmental Sc... [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... True final {'issue': '2', 'volume': '55', 'articleNumber'... 2007.0 {'SCP': '34547786691', 'PUI': '47225913', 'SNC... ['Specific pollution of surface water and sedi... {'active': True, 'publisher': '', 'publication... Specific pollution of surface water and sedime... 8 24
7812 specific 97 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 0, 'link': 'https://www.scopus.com/a... {'count': 56, 'link': 'https://www.scopus.com/... 2.0 False 2-s2.0-85193778000 [{'code': 33, 'displayName': 'Social Sciences'... [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... True final {'issue': '', 'volume': '', 'articleNumber': '... 2023.0 {'SCP': '85193778000', 'SRC-OCC-ID': '95648141... ['Reflective approaches to professionalisation... {'active': False, 'publisher': 'Springer Inter... Reflective approaches to professionalisation t... 0 56
7813 specific 98 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 0, 'link': 'https://www.scopus.com/a... {'count': 0, 'link': 'https://www.scopus.com/a... 1.0 False 2-s2.0-85185678534 [{'code': 33, 'displayName': 'Social Sciences'}] [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... True final {'issue': '', 'volume': '14', 'articleNumber':... 2023.0 {'SCP': '85185678534', 'SRC-OCC-ID': '95444925... ['THE CONCEPT OF DUE DILIGENCE IN THE CONTEXT ... {'active': True, 'publisher': 'Czech Society o... THE CONCEPT OF DUE DILIGENCE IN THE CONTEXT OF... 0 0
7814 specific 99 [{'rel': 'self', 'type': 'GET', 'href': 'https... {'count': 1, 'link': 'https://www.scopus.com/a... {'count': 70, 'link': 'https://www.scopus.com/... 3.0 False 2-s2.0-84937715409 [{'code': 16, 'displayName': 'Chemistry'}] [{'links': [{'rel': 'self', 'type': 'GET', 'hr... ... True final {'issue': '7', 'volume': '109', 'articleNumber... 2015.0 {'SCP': '84937715409', 'SNCHEM': '2015122725',... ['Protease-activated receptors: Activation, in... {'active': True, 'publisher': 'Czech Society o... Protease-activated receptors: Activation, inhi... 1 70

7815 rows × 24 columns

df.to_csv("./scopus_results.csv", index=False)
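The `ast.literal_eval` step above is needed because nested objects such as `citations` end up in the CSV as the string form of a Python dict. Below is a minimal sketch with made-up cell values (the `link` URL is a placeholder). The helper `nested_count` and the way it treats missing or malformed cells are assumptions, not part of the notebook.

```python
import ast


def nested_count(cell):
    """Parse a stringified dict from the CSV and return its 'count' field.

    Returns None for cells that are missing or cannot be parsed
    (an assumption about how one might want to treat such rows).
    """
    if not isinstance(cell, str):
        return None  # e.g. NaN for an empty CSV cell
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None
    return parsed.get("count") if isinstance(parsed, dict) else None


cell = "{'count': 8, 'link': 'https://example.org/placeholder'}"
print(nested_count(cell))           # 8
print(nested_count(float("nan")))   # None
print(nested_count("{'count': "))   # None (truncated cell)
```

With such a helper, a column could be converted with `df['citations'].apply(nested_count)` without failing on the first empty cell.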

With the data loaded into the DataFrame, we can now explore, for example, how the numerical attributes of the search results correlate with the ranking position.

df.select_dtypes(('int', 'float')).corr()
                 ranking  totalAuthors  scopusId   pubYear  citationCount  referenceCount
ranking         1.000000      0.038005  0.081229  0.109848       0.062467        0.053487
totalAuthors    0.038005      1.000000  0.033948  0.040538       0.113336        0.094358
scopusId        0.081229      0.033948  1.000000  0.806411       0.015393        0.243830
pubYear         0.109848      0.040538  0.806411  1.000000       0.033019        0.283521
citationCount   0.062467      0.113336  0.015393  0.033019       1.000000        0.218415
referenceCount  0.053487      0.094358  0.243830  0.283521       0.218415        1.000000

We can see that the ranking column is only very weakly correlated with the citationCount and referenceCount columns. Its strongest correlation is with the pubYear column, and even that is weak (correlation coefficient 0.11). This suggests that the default Scopus ranking is driven mostly by the full-text match and involves little reranking based on these attributes.
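One caveat: `corr()` computes the Pearson coefficient by default, which only captures linear association. Since `ranking` is an ordinal position, a rank-based (Spearman) coefficient may be a useful cross-check. The sketch below runs on made-up numbers, not on the Scopus data, and only shows how the two coefficients can differ.

```python
import pandas as pd

# Made-up data: citationCount falls monotonically, but not linearly, with ranking.
toy = pd.DataFrame({
    "ranking": [0, 1, 2, 3, 4],
    "citationCount": [1000, 100, 10, 5, 1],
})

# Default: Pearson correlation on the raw values.
pearson = toy.corr().loc["ranking", "citationCount"]

# Spearman correlation is the Pearson correlation of the ranks.
spearman = toy.rank().corr().loc["ranking", "citationCount"]

print(round(pearson, 3), round(spearman, 3))
```

On the toy data the Spearman coefficient reaches -1 (a perfectly monotonic relationship), while the Pearson coefficient stays weaker because the relationship is not linear.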
