NextGen-Core
============

This document provides a brief overview of coala's NextGen-Core.
coala's NextGen-Core promises to lift many limitations of the old core
and to deliver better efficiency and performance.

What is new?
------------

The following new features have been added as a part of the NextGen-Core:

- Easier Interface
- Official support for virtual files
- Improved dependency system
- New base bear type: ``DependencyBear``
- Ability to modify bear dependencies at runtime
- Superior caching

What has changed?
-----------------

- ``Global Bears`` are now called ``Project Bears`` and ``Local Bears`` are
  now known as ``File Bears``. Both the ``ProjectBear`` and the ``FileBear``
  classes inherit from the new base class ``coalib.core.Bear``.
- There is no longer any need to pass ``HiddenResults`` between bears to
  hide results from the result callback (made possible by the new
  dependency management system). Only the results of bears that were
  explicitly requested are passed to the callback now. Dependency bears can
  pass arbitrary Python objects, not just ``Result`` objects.
- The former ``run`` function that was inherited by all the bears to run
  code analysis is now replaced by the ``analyze`` function.

Easier Interface
----------------

Running a coala session with the NextGen-Core can be done by accessing only
one function, ``core.run(bears)``. The ``run`` method takes the arguments
``bears``, ``result_callback``, ``cache`` and ``executor`` (the last two are
``None`` by default) and initiates a coala session.

* The ``bears`` argument contains the list of bears to be run for a coala
  session.
* The ``result_callback`` is a function that is called on each result as
  soon as it's available. It should have the following signature:

  ::

      def result_callback(result):
          pass

* The ``cache`` argument, if provided, enables caching and runs the session
  using the cache provided to store the bear results.
  The default value of this parameter is ``None``, in which case coala runs
  without a cache.
* The ``executor`` argument is used to provide a custom executor (which is
  closed after the core is closed) in which the passed bears are to be run.
  If this argument is not provided, a ``ProcessPoolExecutor`` is used with
  as many processes as cores available on the system.

Bears in the NextGen-Core are implemented differently from the old bears.
Keep the following points in mind while writing NextGen bears:

* Every bear has an ``analyze`` function to perform code analysis instead of
  the ``run`` function that was there in the old bears.
* New bears must be constructible with ``section`` and ``file_dict`` as
  parameters. Parameters with default values are allowed but discouraged,
  as you have no control over them when your bear is used as a dependency.

::

    from coalib.core.Bear import Bear


    class TestBear(Bear):

        def analyze(self, bear, section_name, file_dict):
            return "Some analysis"

More details can be found at the `API Docs `_.

Official support for virtual files
----------------------------------

IDEs like IntelliJ use a virtual file system (VFS) to represent files and
perform operations on them. Hence the NextGen-Core provides official support
for virtual files. Bears have to point to the right file data objects when
run, whether they are real files or virtual ones. This makes coala easier to
integrate with IDEs.

Task Objects
------------

Task objects are the representation of tasks performed by bears.
Structure-wise they are a tuple containing tuples of positional arguments
and dicts of keyword arguments to the ``execute_task`` function, which
itself calls ``analyze`` with them. Task objects also underpin the caching
mechanism: their hash values are stored in the cache along with the bear
results and are looked up during each coala run to fetch the results.

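As a rough illustration of the structure described above, the following
standalone sketch mimics how a task object (positional arguments plus
keyword arguments) is unpacked into an ``analyze`` call. ``SketchBear`` and
its simplified ``execute_task`` are hypothetical stand-ins written for this
example and are not coala's actual implementation.

```python
class SketchBear:
    """Hypothetical stand-in for a bear; not part of coala."""

    def analyze(self, filename, prefix='', suffix=''):
        # Results depend only on the task arguments.
        yield prefix + filename + suffix

    def execute_task(self, args, kwargs):
        # Assumed simplification: unpack the task object into analyze.
        return list(self.analyze(*args, **kwargs))


# A task object: a tuple of positional arguments and a dict of
# keyword arguments.
task = (('main.py',), {'prefix': 'checked: ', 'suffix': ''})

bear = SketchBear()
results = bear.execute_task(*task)
```

Because the task object fully describes the call, the same object can serve
as a lookup key for previously computed results.
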
To get a clearer picture of what the task objects of a bear might look
like, consider the following example ``FileBear``:

::

    from coalib.core.FileBear import FileBear


    class SomeFileBear(FileBear):

        def analyze(self, file, filename,
                    filename_prefix: str='',
                    filename_suffix: str=''):
            yield 'Some analysis result'

Its corresponding task object would look like the one below:

::

    [
        [(file, filename), {'filename_prefix': "", 'filename_suffix': ""}],
    ]

Bears offload these task objects through the ``generate_tasks`` method, and
the tasks are then executed in a Python pool.

Improved Dependency System
--------------------------

The NextGen-Core introduces a better dependency management system than the
one used by the old core. It features the following improvements:

* A bear specifies its bear dependencies in ``BEAR_DEPS``.
* The ``DependencyTracker`` class handles dependency management. It adds and
  resolves dependencies and checks for circular dependencies.
* Dependency relations between two objects are tracked using a directed
  graph. When two nodes are connected with a directed edge, they form a
  dependency relation.

The NextGen-Core lifts the limitation on specifying ``LocalBear``\s as
dependencies of ``GlobalBear``\s.

The ``initialize_dependencies`` method in ``Core`` receives the bears that
are to be run and processes bear dependencies using a consumer-based system,
so that each dependency bear has only one instance per section and
file-dict. It returns a set of dependency bears along with those bears that
don't have any dependencies or whose dependencies have been resolved (these
are the ones that are scheduled to be run).

Before the bears are run, the dependency tracking is initialized in the
``__init__`` method of the class ``Session``, which is responsible for
running coala sessions. Only the tasks of bears that have no dependencies,
or whose dependencies have been resolved, are scheduled for execution.
Before executing any task, coala looks it up in the cache.
In case of a hit, the existing results stored in the cache for the
corresponding task arguments are retrieved using the
``execute_task_with_cache`` method. In case of a miss, or if coala is run
without a cache, the task is executed. A bear without any running tasks is
cleaned up from the state of the ongoing run: its dependencies are resolved,
dependent bears are scheduled, and the bear is removed from the
``running_tasks`` dict.

Even though bears still have to pass ``Result`` instances to communicate
with coala, it is now possible to pass arbitrary Python objects between
bears. Dependency bears benefit from this because they can now pass data
according to their needs without being bound to ``Result`` objects. The
dependency results are stored in ``self.dependency_results`` and can be
accessed that way. **But this is highly discouraged since it bypasses
caching and could yield unexpected results when the core is run multiple
times in a row.**

DependencyBear
--------------

The old core did not handle bear dependencies effectively: it used a
queuing mechanism to communicate between bear runs. The NextGen-Core
improves on this. A new bear type, ``DependencyBear``, makes it more
convenient for bear developers to write dependency bears by passing the
dependency results using task objects. This technique of handling
dependencies makes it possible for the ``DependencyBear`` to support
caching. This bear serves as a base class which parallelizes tasks for each
dependency result.

A bear that depends on other bears can specify its dependencies in
``BEAR_DEPS``. For example, take two bears, ``FooBear`` and ``BarBear``,
where ``BarBear`` depends on ``FooBear``. This can be written as

::

    class BarBear(DependencyBear):

        BEAR_DEPS = {FooBear}

This solves the dependency issues of ``GlobalBear``\s on ``LocalBear``\s
that were there in the old core. Now that the new dependency management is
in place, ``GlobalBear``\s won't be stalled due to the termination of a
``LocalBear`` run.
This removes the synchronization problems faced by the old core.

Multiple bears can be included as dependencies of a bear in the
``BEAR_DEPS`` field. The results of the dependency bears are saved in a
dictionary called ``_dependency_results``, which is initialized in the
``__init__()`` method of the class ``Bear`` and can be accessed via
``dependency_results``, which also belongs to that class.

Writing a DependencyBear
------------------------

Consider a bear that depends on a project bear ``Fizz`` and a file bear
``Buzz``. The corresponding ``DependencyBear``, let's call it ``FizzBuzz``,
will look like the following:

::

    class FizzBear(ProjectBear):

        def analyze(self, file, filename):
            yield 'Fizz analysis'

::

    class BuzzBear(FileBear):

        def analyze(self, file, filename):
            yield 'Buzz analysis'

::

    class FizzBuzzBear(DependencyBear):

        BEAR_DEPS = {FizzBear, BuzzBear}

        def analyze(self, dependency_bear, dependency_result, a_number=100):
            yield '{} ({}) - {}'.format(
                dependency_bear.name, a_number, dependency_result)

Ability To Modify Bear Dependencies At Runtime
----------------------------------------------

A bear might depend on multiple bears before its execution can begin.
``Bear.BEAR_DEPS`` is just a set of bear classes that need to be executed
before that bear can run. Once all these dependencies have run, their
results are added to ``self.dependency_results``. The results are in the
form of a dictionary with the types of the bears and their corresponding
results (in the form of a list) as *key-value* pairs.

From the previous example, if we access the ``BEAR_DEPS`` of ``BarBear``,
we get a set containing the ``FooBear`` class. In the ``__init__()`` method
of the class ``Bear``, the dependencies specified in ``BEAR_DEPS`` are
copied to every bear instance, which makes runtime modifications possible.

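The copy-on-construction behaviour described above can be sketched in plain
Python. ``SketchBear`` and ``BazBear`` are hypothetical names used only for
illustration; the snippet assumes nothing about coala beyond what this
section states and is not its actual implementation.

```python
class SketchBear:
    """Hypothetical stand-in for a bear base class; not part of coala."""

    BEAR_DEPS = set()

    def __init__(self):
        # Copy the class-level set so each instance owns its dependencies.
        self.BEAR_DEPS = set(self.BEAR_DEPS)


class FooBear(SketchBear):
    pass


class BazBear(SketchBear):
    pass


class BarBear(SketchBear):
    BEAR_DEPS = {FooBear}


bar = BarBear()
# Runtime modification affects this instance only.
bar.BEAR_DEPS.add(BazBear)

# A fresh instance still starts from the class-level declaration.
other = BarBear()
```

Copying in the constructor keeps the class attribute as an immutable
declaration of defaults, while each instance can be adjusted independently.
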
Override bears
--------------

A NextGen bear needs the following functions to perform analysis:

- ``analyze``: This method contains the code that performs the actual code
  analysis routine that the bear is used for.
- ``generate_tasks``: This method is part of the parent ``Bear`` class and
  returns tuples, each containing the positional arguments as a tuple and
  the keyword arguments as a dict. These are the task objects that are
  scheduled and executed by the core. If this method is not implemented,
  ``NotImplementedError`` is raised. Keep in mind that you need to
  implement ``generate_tasks`` only if the other bear base classes don't
  offer the right parallelization level.

A bear inheriting from the class ``FileBear`` can parallelize tasks for each
file given. A bear inheriting from the class ``DependencyBear`` can
parallelize tasks for each dependency result. A bear inheriting from the
class ``ProjectBear`` does not parallelize tasks for each file, as it runs
on the whole codebase given.

Let's write our own bear with a custom ``generate_tasks`` method. We will
call this bear ``PairWiseDependencyBear``; it compares the results generated
by two of its dependency bears. (This kind of bear might be useful for code
clone detection.)

::

    from coalib.core.Bear import Bear


    # This bear provides some code analysis
    class SomeDependencyBear(Bear):

        def analyze(self, bear, section_name, file_dict):
            yield 'Some analysis result'

::

    # This bear provides some more code analysis
    class SomeOtherDependencyBear(Bear):

        def analyze(self, bear, section_name, file_dict):
            yield 'Some more analysis result'

::

    # This bear depends on the above bears and performs some
    # more analysis after receiving their results
    class PairWiseDependencyBear(Bear):

        BEAR_DEPS = {SomeDependencyBear, SomeOtherDependencyBear}

        # receives the positional arguments of each generated task
        def analyze(self, result, length):
            return 'More analysis'

        def generate_tasks(self):
            # dependency_results maps each bear type to its list of results
            results = self.dependency_results[SomeDependencyBear]
            other_results = self.dependency_results[SomeOtherDependencyBear]

            # keep the results that both dependency bears agree on
            similar_results = [a for a, b in zip(results, other_results)
                               if a == b]

            # returns task objects containing the results common to
            # both dependency bears and their corresponding lengths
            return (((i, len(i)), {}) for i in similar_results)

Superior Caching
----------------

The NextGen-Core's caching mechanism is based on task objects. Bears can
offload tasks via ``generate_tasks()``, which get executed by a Python pool.
Structure-wise, the cache is a dictionary-like object with bear types and
cache tables as key-value pairs. The cache tables themselves are
dictionary-like objects that map the hash values of the task objects
(generated by ``PersistentHash.persistent_hash``) to the bear results.

At the time of scheduling the bears, the core performs a cache lookup: it
looks for identical task objects in the cache and fetches their
corresponding results if found. If the parameters to ``execute_task()`` are
the same as those of the previous run, the cached results of that bear are
used instead of executing the bear again.

The NextGen-Core expects the ``analyze`` function of each bear to provide
results that depend only on the input parameters.
In other words, ``analyze`` must act as a mapping from its parameters to
results. Using volatile values such as time-dependent data without putting
them into the task objects is prohibited, since it might lead to unexpected
behaviour in coala.
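
The cache layout and lookup described in this section can be sketched as
follows. This is a minimal illustration under stated assumptions:
``task_hash`` is a stand-in built on ``hashlib`` and ``pickle`` rather than
coala's ``PersistentHash.persistent_hash``, and ``ToyBear`` and ``run_task``
are hypothetical names that do not exist in coala.

```python
import hashlib
import pickle


def task_hash(task):
    """Stand-in hash of a task object (assumption, not coala's function)."""
    return hashlib.sha1(pickle.dumps(task, protocol=4)).digest()


class ToyBear:
    """Hypothetical bear that counts how often analyze really runs."""

    calls = 0

    def analyze(self, text, prefix=''):
        type(self).calls += 1
        # The result depends only on the task arguments.
        return [prefix + text.upper()]


def run_task(bear, task, cache):
    """Return cached results for an identical task, else run and store."""
    args, kwargs = task
    # The cache maps bear types to cache tables.
    table = cache.setdefault(type(bear), {})
    key = task_hash(task)
    if key not in table:
        # Cache miss: execute the task and remember its results.
        table[key] = bear.analyze(*args, **kwargs)
    return table[key]


cache = {}
bear = ToyBear()
task = (('hello',), {'prefix': '> '})

first = run_task(bear, task, cache)    # miss: analyze runs
second = run_task(bear, task, cache)   # hit: served from the cache
third = run_task(bear, (('world',), {'prefix': '> '}), cache)  # new task
```

The second call never reaches ``analyze``, which is why results must depend
only on the task arguments: anything volatile that is not part of the task
object would be frozen at the value seen during the first run.
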