NextGen-Core¶
This document provides a brief overview of coala’s NextGen-Core. The NextGen-Core promises to lift many limitations of the old core and to improve efficiency and performance.
What is new?¶
The following new features have been added as a part of the NextGen-Core:
- Easier Interface
- Official support for virtual files
- Improved dependency system
- New base bear type: DependencyBear
- Ability to modify bear dependencies at runtime
- Superior caching
What has changed?¶
- Global Bears are now called Project Bears, and Local Bears are now known as File Bears. Both the ProjectBear and the FileBear classes inherit from the new base class coalib.core.Bear.
- There is no need to pass HiddenResults between different bears to hide results from the result callback (made possible by the new dependency management system). Only the results that were explicitly requested by passing the needed bears are passed now. The dependency bears can pass arbitrary Python objects, not just Result objects.
- The former run function that was inherited by all the bears to run code analysis is now replaced by the analyze function.
Easier Interface¶
Running a coala session with the NextGen-Core can be done by accessing only one function, core.run(bears). The run method takes the arguments bears, result_callback, cache and executor (the last two are None by default) and initiates a coala session.
- The bears argument contains the list of bears to be run for a coala session.
- The result_callback is a function that is called on each result as soon as it’s available. It should have the following signature:

      def result_callback(result):
          pass

- The cache argument, if provided, enables caching and runs the session using the given cache to store the bear results. The default value of this parameter is None, in which case coala runs without a cache.
- The executor argument is used to provide a custom executor (which is closed after the core is closed) in which the passed bears are to be run. If this argument is not provided, a ProcessPoolExecutor is used with as many processes as there are cores available on the system.
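The calling convention described above can be illustrated without coala itself. The following is a minimal sketch, not coala's implementation: run_sketch and the two toy "bears" are hypothetical names, and a thread pool is used as the fallback executor purely to keep the example simple, whereas the core defaults to ProcessPoolExecutor.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_sketch(bears, result_callback, executor=None):
    """Hypothetical stand-in for core.run: runs callables, reports results."""
    if executor is None:
        # Assumption for this sketch only: a thread pool avoids pickling
        # concerns. The real core falls back to ProcessPoolExecutor.
        executor = ThreadPoolExecutor()
    try:
        futures = [executor.submit(bear) for bear in bears]
        for future in as_completed(futures):
            for result in future.result():
                # Called once per result, after the producing task finished.
                result_callback(result)
    finally:
        # The executor is closed once the run is over.
        executor.shutdown()


def spacing_bear():
    return ['trailing whitespace in line 3']


def naming_bear():
    return ['variable name too short', 'constant not upper case']


collected = []
run_sketch([spacing_bear, naming_bear], collected.append)
print(sorted(collected))
```

The callback here is simply collected.append, which shows that any callable accepting a single result fits the expected signature.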
Bears in the NextGen-Core are implemented differently from the old bears. Keep the following points in mind while writing NextGen bears:
- Every bear has an analyze function to perform code analysis instead of the run function that was there in the old bears.
- The new bears must be constructible with section and file_dict as parameters. Default parameters are allowed but discouraged, as you have no control over them when your bear is used as a dependency.

      class TestBear(Bear):
          def analyze(self, bear, section_name, file_dict):
              return "Some analysis"

More details can be found at the API Docs.
Official support for virtual files¶
IDEs like IntelliJ use virtual files to represent files in a filesystem (VFS) and perform operations on them. Hence NextGen-Core provides official support for virtual files. Bears have to point to the right file data objects when run, whether they are real files or virtual ones. This makes coala easier to integrate with IDEs.
Task Objects¶
Task objects are the representation of tasks performed by bears. Structure-wise they are a tuple containing tuples of positional arguments and dicts of keyword arguments to the execute_task function, which itself calls analyze with them. Task objects are also central to the caching mechanism, as their hash values are stored in the cache along with the bear results and are looked up during each coala run to fetch the results.
To get a clear picture of what task objects for a bear might look like, take a look at the following example FileBear:
    from coalib.core.FileBear import FileBear


    class SomeFileBear(FileBear):
        def analyze(self, file, filename, filename_prefix: str='',
                    filename_suffix: str=''):
            yield 'Some analysis result'

Its corresponding task object would look like the one below:
    [
        [(file, filename), {'filename_prefix': "", 'filename_suffix': ""}],
    ]

These task objects can then be offloaded by bears, through the generate_tasks method, to be executed in a Python pool.
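To make the structure of a task object concrete, here is a minimal sketch of how positional and keyword arguments could be unpacked into analyze. It does not use coala: execute_task_sketch is a hypothetical stand-in for execute_task, and the toy analyze function only mimics the signature of the example bear.

```python
def analyze(file, filename, filename_prefix='', filename_suffix=''):
    # Toy analysis: report the decorated file name and the line count.
    yield '{}{}{}: {} lines'.format(
        filename_prefix, filename, filename_suffix, len(file))


def execute_task_sketch(args, kwargs):
    """Hypothetical stand-in for execute_task: unpack and call analyze."""
    return list(analyze(*args, **kwargs))


# One task object: a tuple of positional arguments plus a dict of keyword
# arguments.
file_lines = ('import os\n', 'print(os.getcwd())\n')
task = ((file_lines, 'example.py'),
        {'filename_prefix': 'src/', 'filename_suffix': ''})

task_results = execute_task_sketch(*task)
print(task_results)
```

Because a task object carries everything analyze needs, it can be handed to a worker in a pool and later hashed for caching.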
Improved Dependency System¶
The NextGen-Core introduces a better dependency management system than the one used by the old core. It features the following improvements:
- A bear specifies its bear dependencies in BEAR_DEPS.
- The DependencyTracker class handles dependency management. Dependencies are added and resolved by this class, and it checks for circular dependencies.
- Dependency relations between two objects are tracked using a directed graph. When two nodes are connected with a directed edge they form a dependency relation. The NextGen-Core lifts the limitation of specifying LocalBears as dependencies of GlobalBears.
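The idea of tracking dependencies in a directed graph, resolving them, and checking for cycles can be sketched as follows. This is a hypothetical illustration only: DependencyGraphSketch and its methods are invented for this example and are not the API of coala's DependencyTracker.

```python
from collections import defaultdict


class DependencyGraphSketch:
    """Hypothetical illustration; not the DependencyTracker API."""

    def __init__(self):
        # Maps a dependency to the set of objects that depend on it.
        self._dependants = defaultdict(set)

    def add(self, dependency, dependant):
        self._dependants[dependency].add(dependant)

    def resolve(self, dependency):
        """Remove a finished dependency; return dependants now unblocked."""
        dependants = self._dependants.pop(dependency, set())
        still_blocked = set().union(*self._dependants.values())
        return dependants - still_blocked

    def has_cycle(self):
        visiting, done = set(), set()

        def visit(node):
            if node in done:
                return False
            if node in visiting:
                return True  # reached a node that is still on the path
            visiting.add(node)
            found = any(visit(n) for n in self._dependants.get(node, ()))
            visiting.discard(node)
            done.add(node)
            return found

        return any(visit(node) for node in list(self._dependants))


graph = DependencyGraphSketch()
graph.add('FooBear', 'BarBear')
graph.add('BazBear', 'BarBear')
unblocked_first = graph.resolve('FooBear')   # BarBear still waits on BazBear
unblocked_second = graph.resolve('BazBear')  # now BarBear can be scheduled

cyclic = DependencyGraphSketch()
cyclic.add('A', 'B')
cyclic.add('B', 'A')
print(unblocked_first, unblocked_second, cyclic.has_cycle())
```

A dependant is only reported as unblocked once none of its dependencies remain in the graph, which mirrors the rule that a bear is scheduled only after all of its dependencies have been resolved.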
The initialize_dependencies method in Core receives the bears that are to be run and processes bear dependencies using a consumer-based system so that each dependency bear has only one instance per section and file-dict. It returns a set of dependency bears along with those bears that don’t have any dependencies or whose dependencies have been resolved (these are the ones that are scheduled to be run). Before the bears are run, the dependency tracking is initialized in the __init__ method of the class Session, which is responsible for running coala sessions.
Only the tasks of bears that have no dependencies, or whose dependencies have been resolved, are scheduled for execution. Before executing any task coala looks it up in the cache. In case of a hit, the existing results stored in the cache for the corresponding task arguments are retrieved using the execute_task_with_cache method. In case of a miss, or if coala is run without a cache, the task is executed. Bears without any running tasks are cleaned up from the state of the ongoing run by resolving the dependencies on them, scheduling dependent bears and removing the bear from the running_tasks dict.
Even though bears still have to pass Result instances to communicate with coala, it is now possible to pass arbitrary Python objects. Dependency bears benefit from this because they can now pass data according to their needs without being bound to Result objects only.
The dependency results lie inside self.dependency_results and can be accessed that way. However, this is highly discouraged since it bypasses caching and could yield unexpected results when the core is run multiple times in a row.
DependencyBear¶
Handling of bear dependencies by the old core wasn’t effective. The old core used a queuing mechanism to communicate between bear runs. The NextGen-Core improves on this.
A new bear type, DependencyBear, was introduced. It makes it more convenient for bear developers to write dependency bears by passing the dependency results using task objects. This technique of handling dependencies makes it possible for the DependencyBear to support caching.
This bear serves as a base class which parallelizes tasks for each dependency result. A bear dependent on other bears can specify its dependencies in BEAR_DEPS. For example, suppose there are two bears, FooBear and BarBear, and BarBear depends on FooBear. This can be written as
    class BarBear(DependencyBear):
        BEAR_DEPS = {FooBear}

This solves the dependency issues of GlobalBears on LocalBears that were there in the old core. Now that the new dependency management is in place, GlobalBears won’t be stalled due to the termination of a LocalBear run. This removes the synchronization problems faced by the old core.
Multiple bears can be included as dependencies of a bear in the BEAR_DEPS field. The results of the dependency bears are saved in a dictionary called _dependency_results, which is initialized in the __init__() method of the class Bear and can be accessed through dependency_results, which also belongs to the same class.
Writing a DependencyBear¶
Consider a bear that depends on a project bear, FizzBear, and a file bear, BuzzBear. The corresponding DependencyBear, which we will call FizzBuzzBear, looks like the following:
    from coalib.core.DependencyBear import DependencyBear
    from coalib.core.FileBear import FileBear
    from coalib.core.ProjectBear import ProjectBear


    class FizzBear(ProjectBear):
        def analyze(self, file, filename):
            yield 'Fizz analysis'


    class BuzzBear(FileBear):
        def analyze(self, file, filename):
            yield 'Buzz analysis'


    class FizzBuzzBear(DependencyBear):
        BEAR_DEPS = {FizzBear, BuzzBear}

        def analyze(self, dependency_bear, dependency_result, a_number=100):
            yield '{} ({}) - {}'.format(
                dependency_bear.name, a_number, dependency_result)

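How a DependencyBear could turn dependency results into tasks can be sketched without coala. In the following example everything is a stand-in: FizzStub and BuzzStub replace real bear classes, and generate_dependency_tasks is a hypothetical helper that emits one task object per dependency result, which is the parallelization level described above.

```python
class FizzStub:
    name = 'FizzBear'


class BuzzStub:
    name = 'BuzzBear'


def generate_dependency_tasks(dependency_results):
    """Yield one task object per individual dependency result."""
    for bear, results in dependency_results.items():
        for result in results:
            yield (bear, result), {}


def analyze(dependency_bear, dependency_result, a_number=100):
    yield '{} ({}) - {}'.format(
        dependency_bear.name, a_number, dependency_result)


# Bear types map to the list of results each dependency produced.
dependency_results = {
    FizzStub: ['Fizz analysis'],
    BuzzStub: ['Buzz analysis'],
}
messages = [message
            for args, kwargs in generate_dependency_tasks(dependency_results)
            for message in analyze(*args, **kwargs)]
print(messages)
```

Since each dependency result travels inside a task object, it becomes part of what is hashed for the cache, which is why this approach stays compatible with caching.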
Ability To Modify Bear Dependencies At Runtime¶
A bear might depend on multiple bears before its execution can begin. Bear.BEAR_DEPS is just a set of bear classes that need to be executed before that bear can run. Once all these dependencies have run, their results are added to self.dependency_results. The results take the form of a dictionary with the bear types as keys and their corresponding results (as lists) as values. From the previous example, if we access the BEAR_DEPS of BarBear we get {<class 'coalib.core.Bear.FooBear'>}.
In the __init__() method of the class Bear, the dependencies specified in BEAR_DEPS are copied to every bear instance, which makes runtime modifications possible.
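The effect of copying class-level dependencies onto each instance can be shown with plain Python classes. This is a minimal sketch under the assumption that a simple set copy is enough to illustrate the idea; the classes below are stand-ins and do not inherit from coala's Bear.

```python
class FooBear:
    pass


class ExtraBear:
    pass


class BarBear:
    BEAR_DEPS = {FooBear}

    def __init__(self):
        # Copy the class-level set so each instance owns its dependencies.
        self.BEAR_DEPS = set(self.BEAR_DEPS)


first, second = BarBear(), BarBear()
first.BEAR_DEPS.add(ExtraBear)  # runtime change on one instance only

print(ExtraBear in first.BEAR_DEPS, ExtraBear in second.BEAR_DEPS)
```

Without the copy, adding a dependency on one instance would mutate the shared class attribute and silently affect every other instance of the bear.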
Override bears¶
A NextGen bear has to have the following functions to perform analysis:
- analyze: This method contains the code that performs the actual code analysis routine that the bear is used for.
- generate_tasks: This method is part of the parent Bear class and returns tuples containing the positional arguments as a tuple and the keyword arguments in the form of a dict. These are the task objects that are scheduled and executed by the core. If this method is not implemented, NotImplementedError is raised. Keep in mind that you need to implement generate_tasks yourself only if the other bear base classes don’t offer the right parallelization level.
A bear inheriting from the class FileBear can parallelize tasks for each file given. A bear inheriting from the class DependencyBear can parallelize tasks for each dependency result. A bear inheriting from the class ProjectBear does not parallelize tasks for each file, as it runs on the whole codebase given.
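The difference between these parallelization levels comes down to how many task objects are generated. The sketch below contrasts a per-file split with a single project-wide task; the helper names and the exact argument shapes are assumptions made for illustration and are not taken from coala's base classes.

```python
def file_level_tasks(file_dict):
    """One task per file, as a file-oriented bear would parallelize."""
    return [((content, filename), {})
            for filename, content in file_dict.items()]


def project_level_tasks(file_dict):
    """A single task that receives the whole file dict at once."""
    return [((file_dict,), {})]


file_dict = {'a.py': ('x = 1\n',), 'b.py': ('y = 2\n', 'z = 3\n')}
per_file = file_level_tasks(file_dict)
per_project = project_level_tasks(file_dict)
print(len(per_file), len(per_project))
```

More tasks mean more opportunities to run work in parallel and finer-grained cache entries, while a single task suits analyses that need to see every file together.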
Let’s write our own bear with a custom generate_tasks method. We will call this bear PairWiseDependencyBear. It compares the results generated by two of its dependency bears. (This kind of bear might be useful for code clone detection.)
    # This bear provides some code analysis
    class SomeDependencyBear(Bear):
        def analyze(self, bear, section_name, file_dict):
            yield 'Some analysis result'


    # This bear provides some code analysis
    class SomeOtherDependencyBear(Bear):
        def analyze(self, bear, section_name, file_dict):
            yield 'Some more analysis result'


    # This bear depends on the above bears and performs some
    # more analysis after receiving their results
    class PairWiseDependencyBear(Bear):
        BEAR_DEPS = {SomeDependencyBear, SomeOtherDependencyBear}

        # receives the positional arguments built in generate_tasks
        def analyze(self, result, result_length):
            return 'More analysis'

        def generate_tasks(self):
            # dependency_results maps each bear type to its list of results
            results = self.dependency_results[SomeDependencyBear]
            other_results = self.dependency_results[SomeOtherDependencyBear]

            # keep the results common to both dependency bears
            similar_results = [a for a, b in zip(results, other_results)
                               if a == b]

            # returns task objects containing the results common to both
            # dependency bears and their corresponding lengths
            return (((i, len(i)), {}) for i in similar_results)

Superior Caching¶
The NextGen-Core’s caching mechanism is based on task objects. Bears can offload tasks via generate_tasks(), which get executed by a Python pool. Structure-wise the cache is a dictionary-like object with bear types and cache tables as key-value pairs. The cache tables themselves are dictionary-like objects that map the hash values of the task objects (generated by PersistentHash.persistent_hash) to the bear results.
At the time of scheduling the bears, the core performs a cache lookup. If the parameters to execute_task() are the same as in the previous run (in other words, if an identical task object is found in the cache), the core fetches the cached results for that task instead of executing the bear again.
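The cache layout and lookup described above can be sketched in a few lines. This is not coala's implementation: task_hash is a stand-in for PersistentHash.persistent_hash built from pickle and hashlib, execute_with_cache is a hypothetical helper, and a plain string is used in place of a bear type.

```python
import hashlib
import pickle


def task_hash(task):
    # Stand-in for PersistentHash.persistent_hash (assumption: any stable
    # hash of the pickled task object is good enough for this sketch).
    return hashlib.sha1(pickle.dumps(task)).hexdigest()


calls = []


def analyze(text, uppercase=False):
    calls.append(text)  # records real executions, for demonstration only
    return [text.upper() if uppercase else text]


def execute_with_cache(cache, bear_type, task):
    table = cache.setdefault(bear_type, {})  # one cache table per bear type
    key = task_hash(task)
    if key not in table:  # miss: run the analysis and remember the result
        args, kwargs = task
        table[key] = analyze(*args, **kwargs)
    return table[key]


cache = {}
task = (('hello',), {'uppercase': True})
first = execute_with_cache(cache, 'EchoBear', task)
second = execute_with_cache(cache, 'EchoBear', task)  # served from cache
print(first, second, len(calls))
```

The second call never reaches analyze because the identical task object hashes to the same key, which is exactly why analyze must not depend on anything outside its task object.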
The NextGen-Core expects the analyze function of each bear to provide results that depend only on the input parameters. In other words, analyze should be a pure mapping from its parameters to results. Using volatile values like time-dependent data without putting them into the task objects is prohibited, since it might lead to unpredictable behaviour in coala.