This Python notebook shows the process of benchmarking the search result ranking for the Charles Explorer application. It is a part of my diploma thesis at the Faculty of Mathematics and Physics, Charles University, Prague.
Find out more about the thesis in the GitHub repository.
Made by Jindřich Bär, 2024.
In the previous post, we have picked the best queries for our benchmarking. To recap, we’ll be comparing the search results ranking between (different configurations of) Charles Explorer and the Elsevier Scopus search engine.
Our target is to explore the usability of “local” (university-wide) graph data about publications, their authors and their collaborations for improving the search result ranking in the Charles Explorer application. We’re using Elsevier Scopus as the benchmark, as it is a widely used and well-established search engine for scientific publications - and contains data about citations and other metrics that we don’t have in our local graph data.
For this purpose, we have selected a set of queries returning a representative set of results (their distribution among the faculties is close to the distribution of the whole dataset). We will run these queries in both search engines and compare the results.
To start, we can load both the dataset of the queries and the dataset of the search results from the default Charles Explorer configuration.
import pandas as pd
queries = pd.read_csv('./best_queries.csv')
charles_explorer_results = pd.read_csv('./search_results.csv')
# Search results contain a superset of the queries in the best queries file.
# We begin by filtering the search results to only include the queries in the best queries file.
charles_explorer_results[charles_explorer_results['query'].isin(queries['0'])].to_csv('./filtered_search_results.csv', index=False)
charles_explorer_results = pd.read_csv('./filtered_search_results.csv')
charles_explorer_results[charles_explorer_results['query'] == 'thrombolytic']['name']
7603 Thrombolytic therapy in acute pulmonary embolism
7604 Guidelines for Intravenous Thrombolytic Therap...
7605 Pulmonary embolism - thrombolytic and anticoag...
7606 Pulmonary embolism - thrombolytic and anticoag...
7607 Thrombolytic therapy of acute central retinal ...
...
7685 Intravenous Thrombolysis in Unknown-Onset Stro...
7686 Variable penetration of primary angioplasty in...
7687 Prospective Single-Arm Trial of Endovascular M...
7688 "Stent 4 Life" Targeting PCI at all who will b...
7689 How does the primary coronary angioplasty effe...
Name: name, Length: 87, dtype: object
An experiment showed that, for this set of publications for the query thrombolytic, only 3 of the 30 above were listed in the top 50 most relevant search results in Elsevier Scopus.
To provide more complete data for the benchmark, we sample the top 100 search results for each query in both search engines - so we get better insight into whether some publications are missing or whether they are just ranked lower.
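The effect of a deeper cutoff can be illustrated with a small sketch. It assumes that publications are matched between the two engines by a normalized title, which is a simplification, and the helper name overlap_at_k and the toy titles are hypothetical rather than part of the notebook.

```python
import pandas as pd

def overlap_at_k(ours: pd.Series, theirs: pd.Series, k: int) -> int:
    """Count titles that appear in the top-k of both ranked lists."""
    # Assumption: a trimmed, case-folded title is a good enough matching key.
    def top_k_keys(titles: pd.Series) -> set:
        return set(titles.head(k).str.strip().str.casefold())
    return len(top_k_keys(ours) & top_k_keys(theirs))

# Toy data, not real search results.
ours = pd.Series(["Paper A", "Paper B", "Paper C", "Paper D"])
theirs = pd.Series(["paper d", "Paper X", "Paper A ", "Paper Y"])

print(overlap_at_k(ours, theirs, k=2))  # 0: nothing shared at a shallow cutoff
print(overlap_at_k(ours, theirs, k=4))  # 2: two papers were only ranked lower
```

With the shallow cutoff the two lists look disjoint, while the deeper cutoff shows that two of the publications are present in both and merely ranked differently.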
Since we initially collected only the top 30 search results, we now have to rerun the result collection process (on the selected queries).
from utils.charles_explorer import get_dataframe_for_queries
charles_explorer_results = get_dataframe_for_queries(queries['0'].to_list())
charles_explorer_results.to_csv('./filtered_search_results.csv', index=False)
Processed 0 queries
Processed 30 queries
Processed 60 queries
Processed 90 queries
Processed 120 queries
Processed 150 queries
The dataset is now missing the ranking order for the search results. The records are however sorted by the “relevance” score (the search result ordering is the same as in the application), so we can easily add the ranking order back.
charles_explorer_results = pd.read_csv('./filtered_search_results.csv')
charles_explorer_results['ranking'] = None
for query in charles_explorer_results['query'].unique():
    charles_explorer_results.loc[charles_explorer_results['query'] == query, 'charles_explorer_position'] = range(1, len(charles_explorer_results.loc[charles_explorer_results['query'] == query]) + 1)
charles_explorer_results[charles_explorer_results['query'] == 'biology']
| | id | year | name | faculty | faculty_name | query | ranking | charles_explorer_position |
---|---|---|---|---|---|---|---|---|
2404 | 478570 | 2005.0 | Developmental biology for medics | 11130 | Second Faculty of Medicine | biology | None | 1.0 |
2405 | 129678 | 2010.0 | Basic statistics for biologists (Statistics wi... | 11320 | Faculty of Mathematics and Physics | biology | None | 2.0 |
2406 | 439661 | 2001.0 | Symposium 'Electromagnetic Aspects of Selforga... | 11510 | Faculty of Physical Education and Sport | biology | None | 3.0 |
2407 | 121104 | 2009.0 | Jesuit´s other face: Bohuslaus Balbinus as bio... | 11310 | Faculty of Science | biology | None | 4.0 |
2408 | 37559 | 1998.0 | Review of - Carbohydrates - Structure and Biology | 11310 | Faculty of Science | biology | None | 5.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2499 | 73261 | 2005.0 | Modern X-ray imaging techniques and their use ... | -1 | Unknown faculty | biology | None | 96.0 |
2500 | 31897 | 2003.0 | Teaching tasks for biology education | 11310 | Faculty of Science | biology | None | 97.0 |
2501 | 566168 | 2019.0 | Hands-on activities in biology: students' opinion | 11310 | Faculty of Science | biology | None | 98.0 |
2502 | 279719 | 2013.0 | One of the Overlooked Themes in High School Pr... | 11310 | Faculty of Science | biology | None | 99.0 |
2503 | 551485 | 2018.0 | Evolutionary biology : In the history, today a... | 11310 | Faculty of Science | biology | None | 100.0 |
100 rows × 8 columns
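The same per-query numbering can also be written without an explicit loop. The following is a minimal sketch on a toy frame (the rows and names are made up); it relies only on the rows already being in relevance order, as stated above. Note that it yields integer positions rather than the floats shown in the table.

```python
import pandas as pd

# Toy frame standing in for the search results; rows are already in relevance order.
results = pd.DataFrame({
    "query": ["biology", "biology", "physics", "biology", "physics"],
    "name": ["b1", "b2", "p1", "b3", "p2"],
})

# cumcount() numbers the rows inside each group from 0 in their current order,
# so adding 1 gives a 1-based position per query.
results["charles_explorer_position"] = results.groupby("query").cumcount() + 1
print(results)
```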
We start by loading the dataset of the search results from the Elsevier Scopus search engine.
The Scopus Advanced search feature allows us to use a special query language to submit the search queries. This query language offers a set of Prolog-like functors[1], each connected to a specific attribute - or a set of attributes - of the publication record. The arguments of these functors are used in a substring search on the specified fields.
Apart from this, the query language also supports logical operators, such as AND, OR, and AND NOT.
We will use two of the available functors, TITLE-ABS-KEY and AF-ID:
- TITLE-ABS-KEY searches for the specified substring in the title, abstract, and keywords of the publication record. In this regard, it is similar to the full-text search in Charles Explorer, which searches in the same fields.
- AF-ID filters the search results by the affiliation ID of the author. This is useful for restricting the search results to publications where at least one of the authors is affiliated with Charles University. Since Elsevier Scopus contains many records not affiliated with Charles University (whereas Charles Explorer contains only records affiliated with the university), this helps us get more comparable sets of search results.

By calling the Scopus API, we can get the search results in JSON format. We can then parse the JSON and load the search results into a pandas DataFrame.
import json
import subprocess
import pandas as pd
def get_query_string(query):
    # AF-ID restricts the results by affiliation; TITLE-ABS-KEY matches the quoted query.
    return f"AF-ID ( 60016605 ) AND TITLE-ABS-KEY ( \"{query}\" )"

def get_query_object(query, limit=10, offset=0):
    # JSON body of the search request.
    return {
        "documentClassificationEnum": "primary",
        "query": get_query_string(query),
        "sort": "r-f",
        "itemcount": limit,
        "offset": offset,
        "showAbstract": False
    }

def get_curl_call(query, limit=10, offset=0):
    # The indentation of the continuation lines matters: a backslash at the end of a line
    # joins it with the next one, so the leading spaces are what separates the arguments.
    return f"""curl \
        'https://www.scopus.com/api/documents/search' \
        --compressed \
        -X POST \
        -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0' \
        -H 'Accept: */*' \
        -H 'Accept-Language: en,cs;q=0.7,en-US;q=0.3' \
        -H 'Accept-Encoding: gzip, deflate, br, zstd' \
        -H 'content-type: application/json' \
        -H 'Origin: https://www.scopus.com' \
        -H 'Connection: keep-alive' \
        -H 'Cookie: ######################################### Cookies have been removed for security reasons. #########################################' \
        -H 'Sec-Fetch-Dest: empty' \
        -H 'Sec-Fetch-Mode: cors' \
        -H 'Sec-Fetch-Site: same-origin' \
        -H 'Priority: u=4' \
        -H 'Pragma: no-cache' \
        -H 'Cache-Control: no-cache' \
        -H 'TE: trailers' \
        --data-raw '{json.dumps(get_query_object(query, limit, offset))}'"""

def call_scopus_api(query, limit=10, offset=0):
    # Run the curl command in a shell and parse its standard output as JSON.
    result = subprocess.run(get_curl_call(query, limit, offset), check=True, shell=True, stdout=subprocess.PIPE)
    return json.loads(result.stdout.decode('utf-8'))
df = pd.DataFrame()
for query in queries['0']:
    print(query)
    response = call_scopus_api(query, limit=100)
    rankings = pd.DataFrame.from_dict(response['items'])
    rankings['query'] = query
    # Zero-based position of each item in the order returned by Scopus.
    rankings['ranking'] = range(0, len(rankings))
    df = pd.concat([
        df,
        rankings
    ])
df.to_csv('./scopus_results.csv', index=False)
import ast
df = pd.read_csv('./scopus_results.csv')
df['citationCount'] = df['citations'].apply(lambda x: ast.literal_eval(x).get('count'))
df['referenceCount'] = df['references'].apply(lambda x: ast.literal_eval(x).get('count'))
df
| | query | ranking | links | citations | references | totalAuthors | freetoread | eid | subjectAreas | authors | ... | abstractAvailable | publicationStage | sourceRelationship | pubYear | databaseDocumentIds | titles | source | title | citationCount | referenceCount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | physics | 0 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 0, 'link': 'https://www.scopus.com/a... | {'count': 11, 'link': 'https://www.scopus.com/... | 2.0 | False | 2-s2.0-85130750924 | [{'code': 31, 'displayName': 'Physics and Astr... | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | True | final | {'issue': '', 'volume': '2458', 'articleNumber... | 2022.0 | {'SCP': '85130750924', 'PUI': '638086293', 'SC... | ['Ideas of Various Groups of Experts as a Star... | {'active': True, 'publisher': 'American Instit... | Ideas of Various Groups of Experts as a Starti... | 0 | 11 |
1 | physics | 1 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 1, 'link': 'https://www.scopus.com/a... | {'count': 3, 'link': 'https://www.scopus.com/a... | 2.0 | True | 2-s2.0-85072187757 | [{'code': 31, 'displayName': 'Physics and Astr... | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | True | final | {'issue': '1', 'volume': '1286', 'articleNumbe... | 2019.0 | {'SCP': '85072187757', 'PUI': '629310532', 'SC... | ['Practical Course on School Experiments for F... | {'active': True, 'publisher': 'Institute of Ph... | Practical Course on School Experiments for Fut... | 1 | 3 |
2 | physics | 2 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 0, 'link': 'https://www.scopus.com/a... | {'count': 2, 'link': 'https://www.scopus.com/a... | 4.0 | False | 2-s2.0-85099716344 | [{'code': 31, 'displayName': 'Physics and Astr... | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | False | final | {'issue': '', 'volume': '', 'articleNumber': '... | 2020.0 | {'SCP': '85099716344', 'PUI': '633981292', 'SC... | ['Collection of solved physics problems and co... | {'active': False, 'publisher': 'Slovak Physica... | Collection of solved physics problems and coll... | 0 | 2 |
3 | physics | 3 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 8, 'link': 'https://www.scopus.com/a... | {'count': 77, 'link': 'https://www.scopus.com/... | 2.0 | False | 2-s2.0-85099862464 | [{'code': 33, 'displayName': 'Social Sciences'}] | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | True | final | {'issue': '4', 'volume': '43', 'articleNumber'... | 2021.0 | {'SNGEO': '2021003302', 'SCP': '85099862464', ... | ['Physics demonstrations: who are the students... | {'active': True, 'publisher': 'Routledge', 'pu... | Physics demonstrations: who are the students a... | 8 | 77 |
4 | physics | 4 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 0, 'link': 'https://www.scopus.com/a... | {'count': 17, 'link': 'https://www.scopus.com/... | 4.0 | False | 2-s2.0-85130747100 | [{'code': 31, 'displayName': 'Physics and Astr... | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | True | final | {'issue': '', 'volume': '2458', 'articleNumber... | 2022.0 | {'SCP': '85130747100', 'PUI': '638086272', 'SC... | ['Use of the Collection of Solved Problems in ... | {'active': True, 'publisher': 'American Instit... | Use of the Collection of Solved Problems in Ph... | 0 | 17 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7810 | specific | 95 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 3, 'link': 'https://www.scopus.com/a... | {'count': 96, 'link': 'https://www.scopus.com/... | 1.0 | True | 2-s2.0-85053545422 | [{'code': 33, 'displayName': 'Social Sciences'}] | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | True | final | {'issue': '3', 'volume': '50', 'articleNumber'... | 2018.0 | {'SCP': '85053545422', 'PUI': '623972267', 'SC... | ['Academia without contention? The legacy of C... | {'active': True, 'publisher': 'Sociologicky Us... | Academia without contention? The legacy of Cze... | 3 | 96 |
7811 | specific | 96 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 8, 'link': 'https://www.scopus.com/a... | {'count': 24, 'link': 'https://www.scopus.com/... | 2.0 | False | 2-s2.0-34547786691 | [{'code': 23, 'displayName': 'Environmental Sc... | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | True | final | {'issue': '2', 'volume': '55', 'articleNumber'... | 2007.0 | {'SCP': '34547786691', 'PUI': '47225913', 'SNC... | ['Specific pollution of surface water and sedi... | {'active': True, 'publisher': '', 'publication... | Specific pollution of surface water and sedime... | 8 | 24 |
7812 | specific | 97 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 0, 'link': 'https://www.scopus.com/a... | {'count': 56, 'link': 'https://www.scopus.com/... | 2.0 | False | 2-s2.0-85193778000 | [{'code': 33, 'displayName': 'Social Sciences'... | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | True | final | {'issue': '', 'volume': '', 'articleNumber': '... | 2023.0 | {'SCP': '85193778000', 'SRC-OCC-ID': '95648141... | ['Reflective approaches to professionalisation... | {'active': False, 'publisher': 'Springer Inter... | Reflective approaches to professionalisation t... | 0 | 56 |
7813 | specific | 98 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 0, 'link': 'https://www.scopus.com/a... | {'count': 0, 'link': 'https://www.scopus.com/a... | 1.0 | False | 2-s2.0-85185678534 | [{'code': 33, 'displayName': 'Social Sciences'}] | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | True | final | {'issue': '', 'volume': '14', 'articleNumber':... | 2023.0 | {'SCP': '85185678534', 'SRC-OCC-ID': '95444925... | ['THE CONCEPT OF DUE DILIGENCE IN THE CONTEXT ... | {'active': True, 'publisher': 'Czech Society o... | THE CONCEPT OF DUE DILIGENCE IN THE CONTEXT OF... | 0 | 0 |
7814 | specific | 99 | [{'rel': 'self', 'type': 'GET', 'href': 'https... | {'count': 1, 'link': 'https://www.scopus.com/a... | {'count': 70, 'link': 'https://www.scopus.com/... | 3.0 | False | 2-s2.0-84937715409 | [{'code': 16, 'displayName': 'Chemistry'}] | [{'links': [{'rel': 'self', 'type': 'GET', 'hr... | ... | True | final | {'issue': '7', 'volume': '109', 'articleNumber... | 2015.0 | {'SCP': '84937715409', 'SNCHEM': '2015122725',... | ['Protease-activated receptors: Activation, in... | {'active': True, 'publisher': 'Czech Society o... | Protease-activated receptors: Activation, inhi... | 1 | 70 |
7815 rows × 24 columns
df.to_csv("./scopus_results.csv", index=False)
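The ast.literal_eval calls above are needed because the nested objects do not survive the CSV round trip as dictionaries. The sketch below illustrates this on a made-up one-row frame (the link is a placeholder): the cell comes back as the string form of a Python dict, which is not valid JSON, but can be parsed as a Python literal.

```python
import ast
import io
import json
import pandas as pd

# Made-up record shaped like the 'citations' column; the link is a placeholder.
original = pd.DataFrame({"citations": [{"count": 8, "link": "https://example.org/a"}]})

# Round-trip through CSV in memory instead of touching the file system.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

cell = restored.loc[0, "citations"]
print(type(cell).__name__)  # the dict has become a plain string

# The string uses Python's single-quoted repr, so a JSON parser rejects it.
try:
    json.loads(cell)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# literal_eval safely evaluates Python literals (dicts, lists, numbers, strings).
citation_count = ast.literal_eval(cell).get("count")
print(parsed_as_json, citation_count)
```

One caveat under the same assumptions: a missing cell is read back as a float NaN rather than a string, and literal_eval would raise on it, so columns with gaps would need a guard before parsing.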
With the data loaded into the DataFrame, we can now explore e.g. the correlation between the numerical attributes of the search results and their ranking position.
df.select_dtypes(('int', 'float')).corr()
| | ranking | totalAuthors | scopusId | pubYear | citationCount | referenceCount |
---|---|---|---|---|---|---|
ranking | 1.000000 | 0.038005 | 0.081229 | 0.109848 | 0.062467 | 0.053487 |
totalAuthors | 0.038005 | 1.000000 | 0.033948 | 0.040538 | 0.113336 | 0.094358 |
scopusId | 0.081229 | 0.033948 | 1.000000 | 0.806411 | 0.015393 | 0.243830 |
pubYear | 0.109848 | 0.040538 | 0.806411 | 1.000000 | 0.033019 | 0.283521 |
citationCount | 0.062467 | 0.113336 | 0.015393 | 0.033019 | 1.000000 | 0.218415 |
referenceCount | 0.053487 | 0.094358 | 0.243830 | 0.283521 | 0.218415 | 1.000000 |
We can see that the ranking column is only very weakly correlated with the citationCount and referenceCount columns (coefficients 0.06 and 0.05). Its strongest correlation is with the pubYear column, and even that is weak (correlation coefficient 0.11). This suggests that the default Scopus ranking is mostly influenced by the full-text search and does not take much reranking into account.
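The coefficients above are Pearson correlations, the default of DataFrame.corr, which only measure linear relationships. Since ranking is an ordinal position and citation counts tend to be heavily skewed, a rank-based coefficient might be a useful cross-check. The sketch below uses made-up numbers, not the Scopus data, to show how the two can differ; we have not run this variant on the real results.

```python
import pandas as pd

# Made-up data: citationCount grows monotonically with ranking, but not linearly.
toy = pd.DataFrame({
    "ranking": [0, 1, 2, 3, 4],
    "citationCount": [1, 2, 4, 8, 1000],
})

pearson = toy.corr(method="pearson").loc["ranking", "citationCount"]
spearman = toy.corr(method="spearman").loc["ranking", "citationCount"]

# Spearman compares ranks, so a perfectly monotonic relation scores 1.0
# even when a single outlier drags the Pearson coefficient down.
print(round(pearson, 3), round(spearman, 3))
```

On the real data, the analogous call would be df.select_dtypes('number').corr(method='spearman').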