wiki:SoC2017

Version 9 (modified by awulkiew, 6 years ago) ( diff )

Add info about Boost.Geometry competency test.

Boost Google Summer of Code 2017

Welcome to the Boost C++ Libraries' home page for Google Summer of Code (GSoC) 2017. This page provides information about suggested student projects, proposal submission templates, advice on writing good proposals, and links to information on getting started writing with Boost.

Quick summary of policies and processes for this year

After many extremely successful years of GSoC at Boost during which even some of the old timers have been wowed by the C++ code some of the students have brought us, we are minded to keep in place the more rigorous candidate selection process which involves preferentially selecting over all others every GSoC Boost candidate who takes a C++ programming aptitude test or provides links to at least 1,000 lines (excluding comments and whitespace) non-coursework C++ library (not application nor solution) open source code. Note if following the second route, code should have been open sourced at least three months ago, and show a log of commits improving the library over time.

What students should do now

  1. Students should review the list of ideas from previous GSoCs and the archives of the Boost developer's mailing list relating to GSoC (tip: try searching boost-dev for subjects tagged [gsoc17] or [gsoc16] etc). You may find this searchable archive of boost-dev useful.
  1. If you wish to proceed, you need to join the Boost Developer's mailing list and find a mentor who will be an experienced Boost developer in one of the Boost libraries listed at http://www.boost.org/doc/libs/1_63_0/. Read the Boost Discussion Policy in full, and once read in full go to http://lists.boost.org/mailman/listinfo.cgi/boost and subscribe.
  1. After as an absolute minimum reading all posts tagged [gsoc17], students should write a well researched and intelligent message with [gsoc17] at the front of the subject line to that developer's mailing list seeking a mentor, and be as flexible as possible in finding a topic that both they and the mentor is interested in upon which to base a GSoC project proposal text to be submitted to Google.

As a general rule, a well written and researched proposal to extend or improve an existing mature Boost library is likely to be much better received that student originated ideas for new libraries or facilities.

  1. Once a potential mentor and project idea is found, the student must write a project proposal which should follow this submission template.

Potential mentors may add precanned project ideas with programming competency tests to this page below as GSoC approaches. These may prove useful in starting a discussion with potential mentor(s) whom the student should approach directly.

Github's for standalone GSoCs past and present

Since 2013 with Boost's transition to git we have kept a single umbrella org on github for those GSoCs which are fairly self standing. Incremental extensions to existing libraries usually enter that library's main git repo as an experimental branch. Here are those orgs:

Students may find examining past GSoC source code and commit histories of use.

Historical GSoC Ideas pages for years 2006 to now

Suggested GSoC project proposals

To any potential mentor adding a proposal HERE please use this template

1. Static map (boost::static_map<Key, T, ConstExprHash = boost::constexpr_hash<Key>, Pred = boost::constexpr_equal_to<Key>>)

Potential mentors: Niall Douglas

Background

Even with all of Boost's facilities and using C++ 14, this program represents the currently best available method of implementing a static constant associative map of keys to values (play with it in an online compiler here and see the assembler generated here):

#include <initializer_list>
#include <experimental/string_view>
#include <map>
#include <unordered_map>
#include <iostream>

using std::experimental::string_view;

enum class weekday { sunday, monday, tuesday, wednesday, thursday, friday, saturday };

// initializer_list, pair and string_view are constexpr, so this list can be constexpr
// (i.e. lives in the mind of the compiler only and has zero representation in generated code)
#define STRING_VIEW(str) { str, sizeof(str)-1 }
constexpr std::initializer_list<std::pair<const string_view, weekday>> string_to_weekday {
    { STRING_VIEW("sunday"),    weekday::sunday },
    { STRING_VIEW("monday"),    weekday::monday },
    { STRING_VIEW("tuesday"),   weekday::tuesday },
    { STRING_VIEW("wednesday"), weekday::wednesday },
    { STRING_VIEW("thursday"),  weekday::thursday },
    { STRING_VIEW("friday"),    weekday::friday },
    { STRING_VIEW("saturday"),  weekday::saturday }
};

int main(void)
{
  {
    // Calls malloc() at least 7 times
    static const std::map<string_view, weekday> to_weekday1 = string_to_weekday;
    std::cout << "'monday' maps to " << static_cast<int>(to_weekday1.at("monday")) << std::endl;
    std::cout << "'friday' maps to " << static_cast<int>(to_weekday1.at("friday")) << std::endl;
    // Calls free() at least 7 times
  }
  {
    // Calls malloc() at least 8 times
    static const std::unordered_map<string_view, weekday> to_weekday2 = string_to_weekday;
    std::cout << "'monday' maps to " << static_cast<int>(to_weekday2.at("monday")) << std::endl;
    std::cout << "'friday' maps to " << static_cast<int>(to_weekday2.at("friday")) << std::endl;
    // Calls free() at least 8 times
  }
  return 0;
}

As you can see, we have either the choice of linearly scanning the associative array string_to_weekday for keys (which has linear complexity to number of items per lookup), or we can copy at runtime the associative array into an associative map which can thereafter be queried at runtime in either O(log N) or O(1) complexities respectively. Both these options generate very significant amounts of runtime code.

Almost every seasoned C++ programmer has at least once (and often many times throughout a career) written a static constant associative map which is initialised once from static constant data usually at process start, and is thereafter never modified. Given that C++ 14 can implement an associative map where key lookups are constexpr when possible, it would be a very worthwhile addition to Boost and C++ in general if we gained a high quality implementation of a static map class.

GSoC project proposal

  1. To seek consensus from the Boost Developer's mailing list on a suitable design for a Boost static_map class with the following design features:
    1. Whose number of items and keys are completely fixed from construction onwards.
    2. All features of which can be used in constant expressions (i.e. all member functions are marked constexpr).
    3. Which can be completely statically initialised in the mind of the compiler, or in static global storage.
    4. Values, though not keys nor number of items, are modifiable.
    5. Which performs constexpr key to value lookup to the maximum possible extent.

It turns out this list of requirements is in fact quite challenging to implement, and even seasoned Boost programmers have to stop and ponder this one. To demonstrate the difficulty in code:

constexpr std::pair<int, const char *> map_data[] = {
  { 5, "apple" },
  { 8, "pear" },
  { 0, "banana" }
};
// Easy: generates no runtime code
constexpr auto cmap = make_static_map(map_data);
// Easy: generates no runtime code, this is as if literal "apple"
constexpr const char *what_is_5 = cmap[5];
// Easy: generates no runtime code, this is as if literal "apple"
const char *what_is_0 = std::get<0>(cmap);

// Challenging: needs to only generate code loading immediately from a memory location.
// It must NOT generate any additional runtime overhead like hashing nor searching.
const char *what_is_8 = cmap[8];

The reason why the last statement is challenging is because of the rules of when a constexpr-capable code path is executed by the compiler: the compiler only has to constexpr execute when all the inputs are constant expressions and all the outputs will be used in constant expressions, and it only may constexpr execute when all the inputs are constant expressions. The outcome therefore depends entirely on which compiler version you are using and a fair bit of luck - sometimes GCC but not clang will work, sometimes it's the opposite. And Visual Studio is always a surprise!

  1. To implement a static_map class which runs on at least two major C++ compilers.
    • Hopefully in C++ 14, however upcoming features in C++ 1z do make the problem considerably easier to fix.
  1. To implement a comprehensive unit test suite for the static_map class, including tests ensuring no runtime overhead is generated for the challenging use case exampled above.
  1. To configure per-commit continuous integration for the unit test suite on at least one of the major public open source CI services (e.g. Travis, Appveyor).
  1. To write documentation to Boost quality levels for the new container class, including time and space complexity guarantees and benchmarks, and exception guarantees for each API and each use of each API.

Potential project extension funded by Boost

  • static_multimap (keys are not unique)
  • static_set and static_multiset (keys only)
  • static_ordered_map and static_ordered_multimap (iteration is in lexical order of keys)
  • static_bimap (values are also keys)
  • static_index<T...> (multi-column table with some columns being hash indexed)

Programming competency test

  1. Study the toy static_map implementation at https://goo.gl/eO7ooa. Note that with optimisation enabled, clang executes the non-constexpr statement const char *foo=cmap[5]; entirely as a constant expression and we get our desired behaviour. HOWEVER neither GCC nor VS2017 RC1 do so (you can select GCC on godbolt using the dropdown menu).
  1. Attempt to reimplement the toy static_map implementation so it executes the non-constexpr statement const char *foo=cmap[5]; entirely as a constant expression under optimisation on one of GCC or VS2017.
    • Note it need not work well with clang.
    • You are allowed use C++ 17 only features if that helps you.
    • Hints: You will likely see the best success in this task if you completely ignore the toy static_map we showed you and start again fresh, testing each layer you add to ensure GCC or MSVC is optimising away your code entirely. Also be aware that MSVC will often do the right thing if you sprinkle forceinline around, its inliner gives up very early compared to clang or GCC.
  1. Submission of the programming test should be via copying and pasting the code you wrote into the end of the proposal you submit to Google. Note that we do not expect you to succeed in the task, we are much more interested in how you approach the problem, so please describe step by step how you approached the problem.

2. Boost.uBLAS: linear algebra and matrix computations

Potential mentors: David Bellot

All projects with Boost.uBLAS requires knowledge of C++11.

Background

uBLAS is a library for linear algebra and matrix computations. Using recursive templates, it allows the compiler to optimize any complex linear algebra expressions as if it were written by hand by the programmer. Basic classes are matrix and vector. The library has all the basic functionalities and a few standard algorithms. We would like to improve the functionality of this library by adding new algorithms and functionality especially in the field of data analysis and machine learning.

PROJECT 1 : Add Multicore and GPU computations to uBLAS

The project description is simple: add support of multicore parallel and GPU computations to uBlas ! The realization is not straightforward though. Boost supports parallel/GPU computations thanks to the Boost.Compute library (http://www.boost.org/doc/libs/1_63_0/libs/compute/doc/html/index.html). Boost.uBlas is CPU only. If the compilers is able to vectorize, uBlas can take benefit of it. Here we want to extend Boost to the support of parallel architecture and GPU computations to enable it to do big data or deep learning computations.

The student will have to first understand how ublas works and how it generates and optimizes code with the expression template mechanism and then start adding options to enable the use of Boost.Compute. Test will be done on multicore systems and graphics card or computers which support Boost.Compute (through OpenCL for example).

We expect to see the basic matrix operations to be implemented like this. The code will have to be thoroughly documented and a tutorial document provided. We prefer quality of the implementation to exhaustivity.

For exceptionally good and fast students, extensions to support other library will be considered, like nVidia CUDA for example.

In other words it will have to be clean and super fast !

PROJECT 2 : Data.Frames in boost.uBlas

Languages like R or Python (with Pandas) uses the notion of Data Frame and have many aggregation or grouping algorithm to generate all sorts of statistics on huge matrices. As it became a very important topic we would like to have similar functions in uBLAS. For example you can see libraries like Pandas ( http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html) or a very powerful R package name data.table ( http://cran.r-project.org/web/packages/data.table/index.html). Having similar functionalities in ublas would be a must!

The project will require the student to understand the basics of R data.frame and see what kind of limitations arise when it has to be implemented with a template meta-program in C++. However, the project will require the student to also identify all the possible optimizations that can't be done with generic purpose data.frame in R and Python because of missing information (like column types), etc...

Finally, the student is expected to implement algorithms on the data.frame that can potentially be re-used on matrices too like subset selection with generic operators, statistics and summaries. Understanding memory management, alignment, optimizations, vector processing is not mandatory but most welcome.

Understanding C++ expression template and C++ meta-programming is required.

The student will start by studying existing implementations and propose a design. Then he or she will implement a prototype with tests and benchmarks. The final stage will be a thorough integration into ublas, and especially writing examples and documentation.

Potential project extension funded by Boost

Not as much detail as the GSoC proposal itself, but enough detail that such an extension could be specced out.

If there is no good extension, simply say N/A here.

Programming competency test

Implement a small library in C++ using expression templates and C++17 features like generic lambdas to:

  1. represent a matrix of numerical types (int, long, float, double, complex,....)
  2. compute simple algebraic expressions including + and *
  3. fit in one header file

3. Boost.Geometry

Potential mentors: Vissarion Fysikopoulos, Adam Wulkiewicz

All projects requires knowledge of C++ template metaprogramming techniques.

Background

Boost.Geometry part of collection of the Boost C++ Libraries, defines concepts, primitives and algorithms for solving geometry problems.

See http://www.boost.org/libs/geometry

PROJECT 1. Filtering of compare distance predicates.

In some algorithms there is the need to compare two distances of two point pairs. That is, use the predicate compare_distance(p1, p2, q1, q2) that returns (1, 0, -1) if the length of segment (p1, p2) is larger, equal or smaller respectively than the length of segment (q1, q2). Since, computing distances could be costly especially when computing on the ellipsoid (i.e. geographic computations) we would like to avoid that computation whenever possible.

One approach could be to compute the 3D cartesian distances first and if these numbers are not "far enough" then fall back to expensive geographic distance calculation otherwise return the result by comparing those numbers obtained by less expensive cartesian compuation. There are some issues that should be further clarified, e.g. declare the "far enough" value that works in practice. Several other approaches could be used and tested, e.g. perform a local spheroid approximation and return the 2D distance.

The project will require the student to understand parts of Boost.Geometry, in particular algorithms and strategies. Some knowledge of computational geometry is a plus.

PROJECT 2. R-tree serialization.

The goal is to implement serialization of Boost.Geometry R-tree with Boost.Serialization library. It should be done in a form of optional extension, i.e. Boost.Geometry shouldn't depend on Boost.Serialization by default. The example result could look like this:

#include <boost/archive/binary_oarchive.hpp>
#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>
#include <boost/geometry/serialization/index/rtree.hpp>

int main() {
    namespace bg = boost::geometry;
    namespace bgi = bg::index; 
    typedef bg::model::point<double, 2, bg::cs::cartesian> point_type;
    typedef bg::model::box<point_type> box_type;

    bgi::rtree<box_type, bgi::rstar<16> > rtree;

    // fill rtree with data

    std::ofstream file("serialized_rtree.bin",
                       std::ios::binary | std::ios::trunc);
    boost::archive::binary_oarchive archive(file);
    archive << rtree;
}

The project will require the student to understand parts of Boost.Geometry, in particular primitives and the internals of R-tree and the interface of Boost.Serialization library.

PROJECT 3. Nearly antipodal points distance accuracy improvement.

Compute accurately the distance between two nearly antipodal points on the ellipsoid of revolution. Currently Boost.Geometry does not handle this case correctly. A solution is proposed in [1] as a solution of a so called astroid problem. It is also implemented in C++ in [2].

Programming competency test

As mentioned above you may choose to either provide links to existing library you developed or take the competency test. In case of the latter the requirements are listed below.

PROJECT 1 and 3 Implement a library that provides:

  • a representation of points on Earth's surface
  • a function to compute the shortest distance between two given points

PROJECT 2 Implement an addition for Boost.Geometry that provides:

  • serialization of Boost.Geometry models, minimally Point models in arbitrary dimension (model::point, model::point_xy)
  • handling of errors related to saving and loading of incompatible types

In both cases the code should be as generic as possible and fit in one header file.

Also provide main.cpp file that demonstrate the use of your code, i.e. (1, 3) usage of distance function and comparison with distance functions in Boost.Geometry or (2) the serialization of various point types.

Potential project extension funded by Boost

N/A

To any potential mentor adding a proposal HERE please use this template

Note: See TracWiki for help on using the wiki.