quantlaw.de_extract package¶
Submodules¶
quantlaw.de_extract.load_statute_names module¶
quantlaw.de_extract.statutes_abstract module¶
-
class
quantlaw.de_extract.statutes_abstract.
StatusMatch
(text: str, start: int, end: int)[source]¶ Bases:
object
Base class to report the areas of citations to German statutes and regulations. A match is also reported if a trigger, e.g. ‘§’, is found but is not followed by a citation.
-
class
quantlaw.de_extract.statutes_abstract.
StatutesMatchWithMainArea
(suffix_len: int, law_len: int, law_match_type: str, *args, **kwargs)[source]¶ Bases:
quantlaw.de_extract.statutes_abstract.StatusMatch
Class to report the areas of citations to German statutes and regulations where a main area follows the trigger, e.g. “§ 123”, where “123” is the main area.
-
has_main_area
()[source]¶ Returns True if the match has a main area and thus its content can be parsed by StatutesParser
-
-
class
quantlaw.de_extract.statutes_abstract.
StatutesProcessor
(laws_lookup: dict)[source]¶ Bases:
object
Abstract class to extract and parse statute references. The abstract class provides the names by which laws are cited.
-
laws_lookup
¶ A dictionary of the law names to extract. Keys are the names used in the source text to cite laws. Values are unique identifiers of laws. For optimal results it is recommended to make the list as exhaustive as possible. This reduces the chance that references are falsely treated as internal references within a law because the name of the referenced law is not recognized. The names of the laws should be provided in a stemmed format using the stemmer provided in quantlaw.de_extract.stemming.stem_law_name.
-
quantlaw.de_extract.statutes_areas module¶
-
class
quantlaw.de_extract.statutes_areas.
StatutesExtractor
(laws_lookup: dict)[source]¶ Bases:
quantlaw.de_extract.statutes_abstract.StatutesProcessor
Class to find areas of citations to German statutes and regulations
-
find_all
(text: str, pos: int = 0)[source]¶ Like search but returns a generator of all matches found in text
-
get_dict_law_name_len
(test_str)[source]¶ Determines if the test_str starts with a law name given with self.laws_lookup.
Returns: The length of the matched law name or 0.
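The prefix check described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the function name `dict_law_name_len`, the sample lookup and its identifiers are hypothetical, and stemming as well as word-boundary checks are left out.

```python
def dict_law_name_len(test_str: str, laws_lookup: dict) -> int:
    """Return the length of the longest key of laws_lookup that test_str
    starts with, or 0 if no key matches (simplified: no stemming)."""
    best = 0
    for name in laws_lookup:
        if test_str.startswith(name) and len(name) > best:
            best = len(name)
    return best


# Hypothetical lookup: keys are cited names, values are unique law identifiers.
laws_lookup = {"BGB": "bgb", "SGB": "sgb", "SGB V": "sgb_5"}

print(dict_law_name_len("SGB V in der Fassung", laws_lookup))  # 5
print(dict_law_name_len("des Gesetzes", laws_lookup))          # 0
```

Preferring the longest key means “SGB V” is matched as a whole rather than stopping at “SGB”.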
-
static
get_eu_law_name_len
(test_str) → int[source]¶ Returns: The length of the law name of European legislation in chars, or 0 if no law name of this type was found.
-
static
get_ignore_law_name_len
(test_str)[source]¶ Returns: The length of a law name to ignore in chars, or 0 if no law name of this type was found.
-
static
get_no_suffix_ignore_law_name_len
(test_str) → int[source]¶ Returns: Length of the law name in chars, if no suffix is present that connects the main area with the law name, or 0 if no law name of this type was found.
-
static
get_sgb_law_name_len
(test_str) → int[source]¶ Returns: The length of the SGB law name in chars, or 0 if no law name of this type was found.
-
quantlaw.de_extract.statutes_areas_patterns module¶
quantlaw.de_extract.statutes_parse module¶
-
exception
quantlaw.de_extract.statutes_parse.
NoUnitMatched
[source]¶ Bases:
Exception
Exception is raised if a unit in a reference cannot be parsed.
-
class
quantlaw.de_extract.statutes_parse.
StatutesParser
(laws_lookup: dict)[source]¶ Bases:
quantlaw.de_extract.statutes_abstract.StatutesProcessor
Class to parse the content of a reference area identified by StatutesExtractor
-
static
fix_errors_in_citation
(citation)[source]¶ Fix some common inconsistencies in the references such as double spaces.
-
static
infer_units
(reference_path, prev_reference_path)[source]¶ In some enumerations a numeric value is not directly prefixed by the corresponding unit. E.g. “§ 123 Abs. 1 S. 2, 3 S. 4”. In this case “3” is not prefixed with its unit. Instead, it can be inferred from the whole citation that its unit is the next higher unit above “S.”, hence “Abs.”. These inferred units are added to the parsed data.
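A simplified sketch of this inference, under stated assumptions: it is not the library's code, the helper name `infer_first_unit` is hypothetical, a missing unit is represented as `None`, and only the first component of a path is handled.

```python
def infer_first_unit(reference_path, prev_reference_path):
    """Fill in a missing unit (None) on the first component of reference_path.

    Simplified assumption: the missing unit is the one that directly precedes,
    in prev_reference_path, the unit of the following component. If there is
    no following component, the last unit of prev_reference_path is reused.
    """
    if not reference_path or reference_path[0][0] is not None:
        return reference_path  # nothing to infer
    prev_units = [unit for unit, _ in prev_reference_path]
    if len(reference_path) > 1:
        next_unit = reference_path[1][0]
        idx = prev_units.index(next_unit)  # raises ValueError if unknown
        if idx == 0:
            raise ValueError("no higher unit available in the previous path")
        inferred = prev_units[idx - 1]
    else:
        inferred = prev_units[-1]
    return [[inferred, reference_path[0][1]]] + reference_path[1:]


# "§ 123 Abs. 1 S. 2, 3 S. 4": the second part starts with a bare "3".
prev = [["§", "123"], ["Abs", "1"], ["Satz", "2"]]
current = [[None, "3"], ["Satz", "4"]]
print(infer_first_unit(current, prev))  # [['Abs', '3'], ['Satz', '4']]
```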
-
static
is_numb
(token: str)[source]¶ Returns: True if the token is a ‘numeric’ value of the reference.
-
static
is_pre_numb
(token: str)[source]¶ Returns: True if the token is a number that comes before the unit. E.g. ‘erster Halbsatz’
-
parse_law
(law_text: str, match_type: str, current_lawid: str = None)[source]¶ Parses the law information from a reference found by StatutesMatchWithMainArea
Parameters: - main_text – E.g. “§ 123 Abs. 4 und 5 Nr. 6”
- law_text – E.g. “BGB”
- match_type – E.g. “dict”
Returns: The key of the parsed law.
-
parse_main
(main_text: str) → list[source]¶ Parses a string containing a reference to a specific section within a given law. E.g. “§ 123 Abs. 4 Satz 5 und 6”. The parsed information is formatted into lists nested in lists nested in lists.
The outer list is a list of references.
References are lists of path components. A path component is e.g. “Abs. 4”.
A path component is represented by a list with two elements: The first contains the unit the second the value.
The example above would be represented as [[['§', '123'], ['Abs', '4'], ['Satz', '5']], [['§', '123'], ['Abs', '4'], ['Satz', '6']]].
Parameters: main_text – string to parse Returns: The parsed reference.
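To make the nesting concrete, the following sketch walks the documented example result. It does not call the library; `parsed` is copied from the example above and `format_reference` is a hypothetical helper.

```python
# Documented result for "§ 123 Abs. 4 Satz 5 und 6":
# outer list = references, inner lists = path components [unit, value].
parsed = [
    [["§", "123"], ["Abs", "4"], ["Satz", "5"]],
    [["§", "123"], ["Abs", "4"], ["Satz", "6"]],
]


def format_reference(reference):
    """Join the [unit, value] components of one reference into a string."""
    return " ".join(f"{unit} {value}" for unit, value in reference)


labels = [format_reference(ref) for ref in parsed]
print(labels)  # ['§ 123 Abs 4 Satz 5', '§ 123 Abs 4 Satz 6']
```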
-
static
split_citation_into_enum_parts
(citation)[source]¶ A citation can contain references to multiple parts of the law. E.g. ‘§§ 20 und 35’ or ‘Art. 3 Abs. 1 Satz 1, Abs. 3 Satz 1’. The citation is split into parts so that each referenced section of the law is separated. E.g. ‘§§ 20’ and ‘35’ resp. ‘Art. 3 Abs. 1 Satz 1’ and ‘Abs. 3 Satz 1’. However, ranges are not split: E.g. “§§ 1 bis 10” will not be split.
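The splitting behaviour can be approximated with a regular expression. This is a rough sketch, not the library's implementation: the separator set (comma, “und”, “oder”) and the name `split_enum_parts` are assumptions.

```python
import re

# Assumed separators: comma, "und", "oder". "bis" (a range) is deliberately
# absent, so ranges stay in one piece.
_ENUM_SEPARATOR = re.compile(r"\s*,\s*|\s+(?:und|oder)\s+")


def split_enum_parts(citation: str) -> list:
    """Split a citation into its enumerated parts, keeping ranges intact."""
    return [part for part in _ENUM_SEPARATOR.split(citation) if part]


print(split_enum_parts("§§ 20 und 35"))
# ['§§ 20', '35']
print(split_enum_parts("Art. 3 Abs. 1 Satz 1, Abs. 3 Satz 1"))
# ['Art. 3 Abs. 1 Satz 1', 'Abs. 3 Satz 1']
print(split_enum_parts("§§ 1 bis 10"))
# ['§§ 1 bis 10']
```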
-
static
split_citation_part
(string: str)[source]¶ The string is tokenized. Tokens are identified as units or values. Pairs are built to connect the units with their respective values. If the unit cannot be identified (and must be inferred later), None is returned as the unit.
Parameters: string – A string that is part of a reference and cites one part of a statute.
Returns: A generator of tuples, each containing the unit (or None) and the respective value.
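A minimal sketch of the pairing idea follows. It assumes that any token starting with a digit is a value and every other token is a unit. The real tokenizer is certainly more elaborate (for instance, numbers that precede their unit such as ‘erster Halbsatz’ are not handled here), and the name `split_part` is hypothetical.

```python
def split_part(string: str):
    """Yield (unit, value) tuples; unit is None if no unit precedes a value."""
    unit = None
    for token in string.split():
        if token[0].isdigit():
            # A value closes the current pair; the pending unit is consumed.
            yield unit, token
            unit = None
        else:
            unit = token


print(list(split_part("Abs. 3 Satz 1")))  # [('Abs.', '3'), ('Satz', '1')]
print(list(split_part("35")))             # [(None, '35')]
```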
-
static
split_parts_accidently_joined
(reference_paths)[source]¶ Reformats the parsed references to separate accidentally joined references. E.g. the original reference “§ 123 § 126” will not be split by split_citation_into_enum_parts because the separation is, incorrectly, not indicated by a ‘,’, ‘or’ etc. Only the repeated unit ‘§’ shows that the citation contains references to two parts of statutes. This function accounts for the case that the unit ‘§’ or ‘Art’ appears twice in the same reference path and splits the path into several elements.
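The repair step might look roughly like the sketch below. It assumes paths are lists of `[unit, value]` components as in the parse_main example; `split_joined_paths` and `TOP_UNITS` are hypothetical names, and the library's actual logic may differ.

```python
TOP_UNITS = ("§", "Art")  # units whose repetition signals a new reference


def split_joined_paths(reference_paths):
    """Start a new path whenever '§' or 'Art' repeats within one path."""
    result = []
    for path in reference_paths:
        current = []
        for unit, value in path:
            repeated = unit in TOP_UNITS and any(u == unit for u, _ in current)
            if repeated:
                result.append(current)
                current = []
            current.append([unit, value])
        if current:
            result.append(current)
    return result


# "§ 123 § 126" parsed as one path is separated into two paths.
print(split_joined_paths([[["§", "123"], ["§", "126"]]]))
# [[['§', '123']], [['§', '126']]]
```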
-
static
stem_unit
(unit: str)[source]¶ Brings a unit into a standard format, e.g. by removing abbreviations, grammatical differences, spelling errors, etc.
Parameters: unit – A string containing a unit that should be converted into a standard format.
Returns: Unit in a standard format as string. E.g. §, Art, Nr, Halbsatz, Anhang, …
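Unit normalization of this kind is often done with an alias table. The sketch below is illustrative only: the alias mapping and the name `normalize_unit` are assumptions and do not reproduce the library's rules.

```python
# Hypothetical alias table: lower-cased spelling variants -> standard unit.
_UNIT_ALIASES = {
    "§§": "§",
    "artikel": "Art",
    "art": "Art",
    "absatz": "Abs",
    "abs": "Abs",
    "s": "Satz",
    "satz": "Satz",
    "nummer": "Nr",
    "nr": "Nr",
}


def normalize_unit(unit: str) -> str:
    """Strip whitespace and a trailing dot, then map known variants."""
    cleaned = unit.strip().rstrip(".")
    return _UNIT_ALIASES.get(cleaned.lower(), cleaned)


print(normalize_unit("Abs."))      # Abs
print(normalize_unit("S."))        # Satz
print(normalize_unit("Halbsatz"))  # Halbsatz
```

Unknown units are returned unchanged apart from the stripped dot, so the sketch never raises on unfamiliar input.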
-
static