openaccess_epub.article package
Module contents
openaccess_epub.article defines an abstract article representation.
The openaccess_epub.article module contains the Article class, which is instantiated once per article XML file. Basic structural elements of the article are collected, and metadata is parsed according to DTD rules into a Python data structure. The Article class forms a basic unit in the procedure of analyzing articles and in the conversion to EPUB.
- class openaccess_epub.article.Article(xml_file, validation=True)
Abstract class for journal article; parses XML to data structure.
The Article class operates at an abstract level to perform processing tasks common to all journal articles. It first parses the journal article XML into an lxml.etree structure, then inspects the file to discover the DTD and version under which the article was published. It optionally validates the article against its DTD, then (if successful) recursively parses all metadata into a tree data structure. This allows easy access to nested elements using the following strategy: “Article.metadata.front.journal_meta.publisher”
Parameters: - xml_file (str) – Path to the XML file to parse.
- validation (bool, optional) – DTD validation is performed when this evaluates to True; its use is strongly advised.
- doi
str
The full DOI string for the article.
- dtd
lxml.etree.DTD object
The parsed DTD object used for validation and metadata parsing.
- dtd_name
str
The name of the DTD, such as “JPTS”.
- dtd_version
float
The version of the DTD, such as 3.0.
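The documentation does not say how the DTD name and version are discovered. The following is a minimal sketch of one possible approach, assuming the article's DOCTYPE carries an NLM-style public identifier. The helper identify_dtd, the regular expression, and the SHORT_NAMES mapping are hypothetical and are not part of openaccess_epub.

```python
import re

# Hypothetical pattern: captures the DTD label and a "vX.Y" version number
# from a DOCTYPE public identifier.
PUBLIC_ID_PATTERN = re.compile(r"DTD (?P<label>.+?) v(?P<version>\d+\.\d+)")

# Assumed mapping from the label in the identifier to a short DTD name.
SHORT_NAMES = {"Journal Publishing DTD": "JPTS"}


def identify_dtd(public_id):
    """Return (dtd_name, dtd_version), or (None, None) if unrecognized."""
    match = PUBLIC_ID_PATTERN.search(public_id)
    if match is None:
        return None, None
    name = SHORT_NAMES.get(match.group("label"))
    if name is None:
        return None, None
    return name, float(match.group("version"))


name, version = identify_dtd("-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN")
print(name, version)
```

Returning a float for the version mirrors the documented type of dtd_version; the real detection logic may differ.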
- metadata
namedtuple object
The metadata attribute is a tree structure of nested namedtuples. For JPTS, the metadata holds two attributes, ‘front’ and ‘back’. Each namedtuple under metadata possesses: an attribute for every allowed child element defined by the DTD, a dictionary of XML attributes held in the ‘attrs’ attribute, and a ‘node’ attribute for the lxml.etree Element itself. If any would-be attribute name conflicts with a Python keyword, it is prefixed with ‘l’.
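The nested-namedtuple idea can be illustrated with a small, simplified sketch. It is not the package's implementation: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, creates fields only for child elements actually present (rather than every child the DTD allows), and keeps only the last occurrence of a repeated tag. The helpers safe_name and build and the sample XML are hypothetical.

```python
import keyword
import xml.etree.ElementTree as ET
from collections import namedtuple


def safe_name(tag):
    """Turn an XML tag into a usable Python attribute name."""
    name = tag.replace("-", "_")
    # Prefix names that collide with Python keywords with 'l'.
    if keyword.iskeyword(name):
        name = "l" + name
    return name


def build(element):
    """Recursively convert an element into a nested namedtuple."""
    children = {}
    for child in element:
        # Simplification: a repeated tag keeps only its last occurrence.
        children[safe_name(child.tag)] = build(child)
    fields = list(children) + ["attrs", "node"]
    node_type = namedtuple(safe_name(element.tag), fields)
    return node_type(*children.values(), attrs=dict(element.attrib), node=element)


# Simplified fragment for illustration only; it is not DTD-valid.
SAMPLE = """
<article>
  <front>
    <journal-meta>
      <publisher><publisher-name>Example Press</publisher-name></publisher>
    </journal-meta>
  </front>
  <back>
    <def-list><def-item><def id="d1"/></def-item></def-list>
  </back>
</article>
"""

metadata = build(ET.fromstring(SAMPLE))
print(metadata.front.journal_meta.publisher.publisher_name.node.text)
print(metadata.back.def_list.def_item.ldef.attrs)
```

Here the `def` element collides with a Python keyword, so it is reached as `ldef`.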
- publisher
str
A standardized, concise name for the publisher of the article, such as “PLoS” or “Frontiers”.
- get_DOI()
This method defines how the Article tries to detect the DOI.
It attempts to determine the article DOI string by DTD-appropriate inspection of the article metadata. This method should be made as flexible as necessary to properly collect the DOI for any XML publishing specification.
Returns: doi (str or None) – The full (publisher/article) DOI string for the article, or None on failure.
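As a rough illustration of DTD-appropriate inspection, the sketch below looks up the DOI in a JPTS-style document, where it is conventionally stored in an article-id element with pub-id-type="doi" under front/article-meta. The function find_doi is hypothetical, the DOI in the sample is made up, and xml.etree.ElementTree is used here in place of lxml.

```python
import xml.etree.ElementTree as ET


def find_doi(article_root):
    """Return the DOI string from a JPTS-style article element, or None."""
    path = "front/article-meta/article-id[@pub-id-type='doi']"
    element = article_root.find(path)
    if element is None or not element.text:
        return None
    return element.text.strip()


# Simplified sample; the DOI value is fictitious.
SAMPLE = """
<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmid">0000000</article-id>
      <article-id pub-id-type="doi">10.1371/journal.example.0000000</article-id>
    </article-meta>
  </front>
</article>
"""

print(find_doi(ET.fromstring(SAMPLE)))
print(find_doi(ET.fromstring("<article><front/></article>")))
```

Returning None rather than raising matches the documented failure behavior of get_DOI.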
- get_publisher()
This method defines how the Article tries to determine the publisher of the article.
This method relies on the success of the get_DOI method to fetch the full DOI for the article. It then takes the DOI prefix, which corresponds to the publisher, and uses it to attempt to load the correct publisher-specific code. This may fail: if the DOI is not mapped to a code file, if the DOI is mapped but the code file cannot be located, or if the mapped code file is malformed, this method will log an informative error message and return None. This method will not try to infer the publisher from any metadata other than the DOI of the article.
Returns: publisher (Publisher instance or None)
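The prefix-based lookup can be sketched as follows. This is a simplified illustration, not the package's code: it returns a publisher name instead of a Publisher instance, and the PREFIX_TO_PUBLISHER table and publisher_from_doi function are hypothetical.

```python
import logging

log = logging.getLogger(__name__)

# Illustrative table only; the package's actual mapping may differ.
PREFIX_TO_PUBLISHER = {
    "10.1371": "PLoS",
    "10.3389": "Frontiers",
}


def publisher_from_doi(doi):
    """Map the DOI prefix to a publisher name, or return None on failure."""
    if not doi:
        log.error("No DOI available; cannot determine publisher")
        return None
    # The prefix is everything before the first slash.
    prefix = doi.split("/", 1)[0]
    name = PREFIX_TO_PUBLISHER.get(prefix)
    if name is None:
        log.error("DOI prefix %s is not mapped to a publisher", prefix)
    return name


print(publisher_from_doi("10.1371/journal.example.0000000"))
print(publisher_from_doi("10.9999/unmapped"))
```

As in the documented method, nothing other than the DOI is consulted, so an unmapped prefix yields None.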
- openaccess_epub.article.dtd_tuple
alias of DTD_Tuple
- openaccess_epub.article.iskeyword()
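Tests whether a string is a Python keyword. The docstring shown is that of the underlying set membership method: x.__contains__(y) <==> y in x.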