
PublicationsRetriever



Description & basic information

A Java program that retrieves the Document and Dataset Urls from the given Publication-Web-Pages. Optionally, it can also download the full-texts and/or upload them to an S3 Object Store.
Afterwards, these full-text documents are mined by other pieces of software, in order to enrich a much more complete set of OpenAIRE publications with inference links in the OpenAIRE Graph.

This program is used either as a stand-alone download-tool for full-texts and datasets, or as a library for the UrlsWorker’s code, of OpenAIRE’s “PDF Aggregation Service”.

The PublicationsRetriever takes as input the PubPages with their IDs, in JSON format, and produces an output, also in JSON format, which contains the IDs, the PubPages, the Document or Dataset Urls, a series of informative booleans, the MD5 “fileHash”, the “fileSize” and a “comment”.
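The “booleans”, as named in the sample output below, are:
wasUrlChecked
wasUrlValid
wasDocumentOrDatasetAccessible
wasDirectLink
couldRetry

Note: the values of the above “booleans” are Strings: “true”, “false” or “N/A”.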

The “comment” can have the following values:

Sample JSON-input:

{"id":"dedup_wf_001::83872a151fd78b045e62275ca626ec94","url":"https://zenodo.org/record/884160"}
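The input format above can be illustrated with a minimal sketch that extracts the “id” and the “url” from one input line. The class IdUrlLineSketch and its parse method are hypothetical names, not part of PublicationsRetriever, and the sketch uses a regular expression instead of a real JSON parser only to stay dependency-free. It assumes the key order of the sample ("id" first, then "url") and values without escaped quotes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not the program's own loader): reads one id-url pair. */
final class IdUrlLineSketch {

    // Assumption: "id" comes before "url" and neither value contains an escaped quote.
    private static final Pattern ID_URL = Pattern.compile(
            "\\{\\s*\"id\"\\s*:\\s*\"([^\"]*)\"\\s*,\\s*\"url\"\\s*:\\s*\"([^\"]*)\"\\s*\\}");

    /** Returns {id, url}, or an empty Optional when the line has a different shape. */
    static Optional<String[]> parse(String line) {
        Matcher matcher = ID_URL.matcher(line.trim());
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { matcher.group(1), matcher.group(2) });
    }

    public static void main(String[] args) {
        String sample = "{\"id\":\"dedup_wf_001::83872a151fd78b045e62275ca626ec94\","
                + "\"url\":\"https://zenodo.org/record/884160\"}";
        parse(sample).ifPresent(pair -> System.out.println(pair[0] + " -> " + pair[1]));
    }
}
```

For anything beyond a quick check, a proper JSON library is the safer choice, since a regular expression does not handle escaping or reordered keys.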

Sample JSON-output (with downloading of the full-texts):

{"id":"dedup_wf_001::83872a151fd78b045e62275ca626ec94","sourceUrl":"https://zenodo.org/record/884160","docUrl":"https://zenodo.org/record/884160/files/Data_for_Policy_2017_paper_55.pdf","wasUrlChecked":"true","wasUrlValid":"true","wasDocumentOrDatasetAccessible":"true","wasDirectLink":"false","couldRetry":"true","fileHash":"4e38a82fe1182e62b1c752b50f5ea59b","fileSize":"263917","comment":"/home/lampros/PublicationsRetriever/target/../example/sample_output/DocFiles/dedup_wf_001::83872a151fd78b045e62275ca626ec94.pdf"}
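The “fileHash” and “fileSize” fields of the output can be illustrated with a minimal sketch that computes an MD5 hex digest and the byte count of some content, using only the Java standard library. FileHashSketch and md5Hex are hypothetical names, and the sketch hashes an in-memory byte array; how PublicationsRetriever itself computes these values is not shown in this document.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: MD5 hex digest and size of a byte array. */
final class FileHashSketch {

    static String md5Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(content);
            // %032x left-pads with zeros, so the result always has 32 hex characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("fileHash: " + md5Hex(content));
        System.out.println("fileSize: " + content.length);
    }
}
```

For large full-texts, feeding the digest from a stream (for example with java.security.DigestInputStream) avoids loading the whole file into memory.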


Explanation of some keywords:
PubPage: the web page with the publication’s information.
DocUrl: the url of the fulltext-document-file.
DatasetUrl: the url of the dataset-file.
DocOrDatasetUrl: the url of the document or the dataset file.
Full-text: the document containing all the text of a publication.
DocFileFullPath: the full-storage-path of the fulltext-document-file.
ErrorCause: the cause of the failure of retrieving the docUrl or the docFile.

The program’s execution process can be found here.
This program utilizes multiple threads to speed up the process, while using politeness-delays between same-domain connections, in order to avoid overloading the data-providers.
If no IDs are available for the input, the user should provide a file containing just urls (one url per line) and specify that they wish to process a data-set with no IDs, by changing the “util.url.LoaderAndChecker.useIdUrlPairs” variable to “false”.
If you want to run it with distributed execution on multiple VMs, you may give a different starting-number for the docFiles in each instance (see the run-instructions below).
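The politeness-delays between same-domain connections, mentioned above, can be sketched as follows. This is an illustration under stated assumptions, not the program's actual implementation: the class PolitenessSketch, its methods and the delay value are hypothetical, and the full host name is used as the "domain" key, which is a simplification. Each thread reserves the next free time slot for a domain atomically and then sleeps until that slot begins.

```java
import java.net.URI;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a per-domain politeness delay shared by many threads. */
final class PolitenessSketch {

    private final long delayMillis;
    // For each domain: the earliest time (ms) at which the next connection may start.
    private final ConcurrentHashMap<String, Long> nextAllowedTime = new ConcurrentHashMap<>();

    PolitenessSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Atomically reserves the next slot for the domain; returns the wait time in ms. */
    long reserveSlot(String domain, long nowMillis) {
        long[] wait = new long[1];
        nextAllowedTime.compute(domain, (key, next) -> {
            long start = (next == null) ? nowMillis : Math.max(next, nowMillis);
            wait[0] = start - nowMillis;
            return start + delayMillis; // the following caller must wait past this point
        });
        return wait[0];
    }

    /** Blocks the calling thread until it is polite to connect to the url's host. */
    void awaitTurn(String url) throws InterruptedException {
        String host = URI.create(url).getHost();
        String domain = (host != null) ? host : url; // fall back to the raw url
        long wait = reserveSlot(domain, System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

Connections to different domains never wait for each other in this sketch, so the threads stay busy while each individual data-provider sees at most one new connection per delay interval.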

Disclaimers:

Install & Run (using MAVEN)

To install the application, navigate to the directory of the project, where the pom.xml is located.
Then enter this command in the terminal:
mvn clean install

To run the application, navigate to the target directory, which will be created by MAVEN, and run the executable JAR file with the appropriate run-command.

Run with standard input/output:
java -jar publications_retriever-1.2-SNAPSHOT.jar arg1:'-inputFileFullPath' arg2:<inputFile> arg3:'-retrieveDataType' arg4:'<dataType: document | dataset | all>' arg5:'-downloadDocFiles' arg6:'-fileNameType' arg7:'idName' arg8:'-firstFileNum' arg9:'NUM' arg10:'-docFilesStorage' arg11:'storageDir' < stdIn:'inputJsonFile' > stdOut:'outputJsonFile'

Run tests with custom input/output:

Arguments explanation:

Note: In order to access the S3ObjectStore, you should provide the file “S3_credentials.txt”, inside the working directory, which must contain the endpoint, the accessKey, the secretKey, the region and the bucket, in that order, separated by commas.
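The layout of “S3_credentials.txt” described above can be illustrated with a minimal parsing sketch. S3CredentialsSketch is a hypothetical class, not the program's own reader, and it assumes that the five values sit on a single line and contain no commas themselves.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical holder for the five comma-separated values of S3_credentials.txt. */
final class S3CredentialsSketch {

    final String endpoint;
    final String accessKey;
    final String secretKey;
    final String region;
    final String bucket;

    private S3CredentialsSketch(String[] fields) {
        // Order as stated in the note: endpoint, accessKey, secretKey, region, bucket.
        this.endpoint = fields[0];
        this.accessKey = fields[1];
        this.secretKey = fields[2];
        this.region = fields[3];
        this.bucket = fields[4];
    }

    static S3CredentialsSketch parse(String content) {
        // The -1 limit keeps trailing empty fields, so "a,b,c,d," is rejected below.
        String[] fields = content.trim().split(",", -1);
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                    "Expected 5 comma-separated values, found " + fields.length);
        }
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
            if (fields[i].isEmpty()) {
                throw new IllegalArgumentException("Value " + (i + 1) + " is empty");
            }
        }
        return new S3CredentialsSketch(fields);
    }

    /** Reads the file from the given path (Files.readString requires Java 11+). */
    static S3CredentialsSketch load(Path file) throws IOException {
        return parse(Files.readString(file));
    }
}
```

Validating the field count up front turns a misordered or incomplete credentials file into an immediate, readable error instead of a failed S3 request later on.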

Example

You can check the functionality of PublicationsRetriever by running an example.
Type ./runExample.sh in the terminal and hit ENTER.
Then you can see the results in the example/sample_output directory.
The above script will run the following commands:

Customizations