Releases: databricks/koalas

Version 0.25.0

09 Jan 03:30

loc and iloc indexers improvement

We improved the loc and iloc indexers. loc now supports scalar values as indexers (#1172).

>>> import databricks.koalas as ks
>>>
>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df.loc['sidewinder']
max_speed    7
shield       8
Name: sidewinder, dtype: int64
>>> df.loc['sidewinder', 'max_speed']
7

In addition, a Series derived from a different DataFrame can be used as an indexer (#1155).

>>> import databricks.koalas as ks
>>>
>>> ks.options.compute.ops_on_diff_frames = True
>>> 
>>> df1 = ks.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]},
...                    index=[20, 10, 30, 0, 50])
>>> df2 = ks.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]},
...                    index=[20, 10, 30, 0, 50])
>>> df1.A.loc[df2.A > -3].sort_index()
10    1
20    0
30    2
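
The label matching in this example can be sketched in plain Python. This is only an illustration of how a boolean mask from one frame selects rows of another by index label, with dicts standing in for Series; the helper name loc_with_mask is hypothetical and this is not how Koalas implements it.

```python
def loc_with_mask(values, mask):
    """Keep the entries of `values` whose index label maps to True in `mask`.

    Both arguments are dicts of {index label: value}, standing in for two
    Series that come from different DataFrames (hypothetical helper).
    """
    return {label: value for label, value in values.items() if mask.get(label, False)}


# Same data as df1.A and df2.A in the example above.
df1_a = {20: 0, 10: 1, 30: 2, 0: 3, 50: 4}
df2_a = {20: 0, 10: -1, 30: -2, 0: -3, 50: -4}

# Equivalent of `df2.A > -3`, keyed by index label.
mask = {label: value > -3 for label, value in df2_a.items()}

# Equivalent of `.sort_index()` on the selected rows.
selected = dict(sorted(loc_with_mask(df1_a, mask).items()))
print(selected)  # {10: 1, 20: 0, 30: 2}
```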

Lastly, when a slice is used, loc now follows the natural order of the index, matching pandas' behavior (#1159, #1174, #1179). See the example below.

>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64
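
To make "natural order" concrete: 'sidewinder' sorts alphabetically between 'cobra' and 'viper', yet it is not returned, because the slice follows the order in which the labels are stored. The following plain-Python sketch mimics that rule; loc_slice is a hypothetical helper that assumes unique labels, not Koalas code.

```python
def loc_slice(index, values, start, stop):
    """Label slice that follows the stored order of `index`, both ends inclusive.

    Hypothetical helper: assumes every label in `index` is unique.
    """
    first = index.index(start)
    last = index.index(stop)
    return list(zip(index[first:last + 1], values[first:last + 1]))


index = ['cobra', 'viper', 'sidewinder']
max_speed = [1, 4, 7]

print(loc_slice(index, max_speed, 'cobra', 'viper'))
# [('cobra', 1), ('viper', 4)]
```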

Other new features and improvements

We added the following new features:

koalas.Series:

koalas.Index

koalas.MultiIndex:

Other improvements

  • Add from_pandas support for Index/MultiIndex. (#1170)
  • Add a hidden column __natural_order__. (#1146)
  • Introduce _LocIndexerLike and consolidate some logic. (#1149)
  • Refactor LocIndexerLike.__getitem__. (#1152)
  • Remove sort in GroupBy._reduce_for_stat_function. (#1147)
  • Randomize index in tests and fix some window-like functions. (#1151)
  • Explicitly don't support Index.duplicated (#1131)
  • Fix DataFrame._repr_html_(). (#1177)

Version 0.24.0

19 Dec 07:18

NumPy's universal function (ufunc) compatibility

We added NumPy ufunc compatibility for Koalas DataFrame (#1127). Virtually all ufuncs are now supported. See the example below:

>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> np.log(kdf)
         id
0       NaN
1  0.000000
2  0.693147
3  1.098612
4  1.386294
5  1.609438
6  1.791759
7  1.945910
8  2.079442
9  2.197225

Other new features and improvements

We added the following new features:

koalas:

koalas.DataFrame:

koalas.Index

koalas.MultiIndex:

koalas.SeriesGroupBy

koalas.DataFrameGroupBy

Other improvements

  • Setting index name / names for Series (#1079)
  • Disable 'str' for 'SeriesGroupBy', disable 'DataFrame' for 'GroupBy' (#1097)
  • Support 'compute.ops_on_diff_frames' for NumPy ufunc compat in Series (#1128)
  • Support arithmetic and comparison APIs on same DataFrames (#1129)
  • Fix rename() for Index to support MultiIndex also (#1125)
  • Set the upper-bound for pandas. (#1137)
  • Fix _cum() for Series to work properly (#1113)
  • Fix value_counts() to work properly when dropna is True (#1116, #1142)

Version 0.23.0

05 Dec 01:56

NumPy's universal function (ufunc) compatibility

We added NumPy ufunc compatibility for Koalas Series (#1096, #1106). Virtually all ufuncs are now supported. See the example below:

>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> kser = np.sqrt(kdf.id)
>>> type(kser)
<class 'databricks.koalas.series.Series'>
>>> kser
0    0.000000
1    1.000000
2    1.414214
3    1.732051
4    2.000000
5    2.236068
6    2.449490
7    2.645751
8    2.828427
9    3.000000

Other new features and improvements

We added the following new features:

koalas:

koalas.DataFrame:

koalas.Series:

koalas.Index

koalas.MultiIndex:

Other improvements

  • Fix comparison operators to treat NULL as False (#1029)
  • Make corr return koalas.DataFrame (#1069)
  • Include link to Help Thirsty Koalas Fund (#1082)
  • Add Null handling for different frames (#1083)
  • Allow Series.__getitem__ to take boolean Series (#1075)
  • Produce correct output against multiIndex when 'compute.ops_on_diff_frames' is enabled (#1089)
  • Fix idxmax() / idxmin() for Series to work properly (#1078)

Version 0.22.0

14 Nov 05:37

Enable Arrow 0.15.1+

Apache Arrow 0.15.0 did not work well with PySpark 2.4, so it was disabled in the previous version.
With Arrow 0.15.1, it now works in Koalas (#902).

Expanding and Rolling

We also added expanding() and rolling() APIs to groupby(), Series and DataFrame (#985, #991, #990, #1015, #996, #1034, #1037). They support the following functions:

  • min
  • max
  • sum
  • mean
  • std
  • var
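
The difference between the two window types can be sketched in plain Python. This is an illustration of the window semantics only, not of Koalas' implementation; the helper names are hypothetical, and None marks positions where a rolling window is not yet full (pandas reports NaN there by default).

```python
def rolling(values, window, func):
    """Apply `func` to each fixed-size trailing window; None until the window is full."""
    return [
        func(values[i - window + 1:i + 1]) if i + 1 >= window else None
        for i in range(len(values))
    ]


def expanding(values, func):
    """Apply `func` to everything seen so far, so the window grows by one each step."""
    return [func(values[:i + 1]) for i in range(len(values))]


data = [1, 2, 3, 4]
print(rolling(data, 2, sum))   # [None, 3, 5, 7]
print(expanding(data, sum))    # [1, 3, 6, 10]
print(expanding(data, max))    # [1, 2, 3, 4]
```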

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

Documentation

We added a "Best Practices" section to the documentation (#1041) for Koalas users to read and follow. Please see https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html

Other new features and improvements

We added the following new features:

koalas.DataFrame:

koalas.Series:

koalas.MultiIndex:

Along with the following improvements:

  • Introduce column_scols in InternalFrame as a substitute for data_columns. (#956)
  • Fix different index level assignment when 'compute.ops_on_diff_frames' is enabled (#1045)
  • Fix DataFrame.melt function & Add doctest case for melt function (#987)
  • Enable creating Index from list like 'Index([1, 2, 3])' (#986)
  • Fix combine_frames to handle where the right hand side arguments are modified Series (#1020)
  • setup.py should support Python 2 to show a proper error message. (#1027)
  • Remove Series.schema. (#993)

Version 0.21.0

31 Oct 06:12

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

Documentation

We now have an installation guide, design principles and an FAQ in our public documentation (#914, #944, #963, #964)

Other new features and improvements

We added the following new features:

koalas

koalas.DataFrame:

koalas.Series:

koalas.Index:

koalas.MultiIndex:

koalas.Expanding

Along with the following improvements:

  • Fix passing options as keyword arguments (#968)
  • Make is_monotonic~ work properly for index (#930)
  • Fix Series.__getitem__ to work properly (#934)
  • Fix reindex when all the given columns are included in the existing columns (#975)
  • Add datetime as the equivalent python type to TimestampType (#957)
  • Fix is_unique to respect the current Spark column (#981)
  • Fix bug when assigning None as the name of an Index (#974)
  • Use name_like_string instead of str directly. (#942, #950)

Version 0.20.0

15 Oct 13:37

Disable Arrow 0.15

Apache Arrow 0.15.0, which Koalas depends on to execute Pandas UDFs, was released on the 5th of October, 2019, but the Spark community reported an issue with PyArrow 0.15.

We decided to set an upper bound on the pyarrow version to avoid such issues until we are sure that Koalas works fine with it.

  • Set an upper bound for pyarrow version. (#918)
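
As a rough illustration of what an upper bound means in practice, the sketch below checks whether a version string falls below a bound. The helper names are hypothetical, and the exclusive bound of 0.15 is an assumption taken from the heading above; the exact constraint used in #918 is not stated here.

```python
def parse_version(version):
    """Turn a dotted version string such as '0.14.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))


def below_upper_bound(version, upper_bound='0.15'):
    """Return True when `version` is strictly below `upper_bound` (assumed exclusive)."""
    return parse_version(version) < parse_version(upper_bound)


print(below_upper_bound('0.14.1'))  # True
print(below_upper_bound('0.15.0'))  # False
```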

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

Other new features and improvements

We added the following new features:

koalas.DataFrame:

koalas.Series:

koalas.GroupBy:

Along with the following improvements:

  • Implement nested renaming for groupby agg (#904)
  • Add 'index_col' parameter to DataFrame.to_spark (#906)
  • Add more options to read_csv (#916)
  • Add NamedAgg (#911)
  • Enable DataFrame setting value as list of labels (#905)

Version 0.19.0

04 Oct 05:08

Koalas Logo

We now have an official logo!

You can see the cute logo in our documentation as well.

Documentation

We also improved the documentation: https://koalas.readthedocs.io/en/latest/

  • Added the logo (#831)
  • Added a Jupyter notebook for the 10 min tutorial (#843)
  • Added the tutorial to the documentation (#853)
  • Add some examples for plot implementations in their docstrings (#847)
  • Move contribution guide to the official documentation site (#841)

Binder integration for the 10 min tutorial

You can run a live Jupyter notebook for the 10 min tutorial from Binder.

Multi-index columns support

We continue improving multi-index columns support. We made the following APIs support multi-index columns:

Plots

We also continue adding plot APIs as follows:

For DataFrame:

Other new features and improvements

We added the following new features:

koalas.DataFrame:

koalas.Series:

koalas.DataFrameGroupBy:

koalas.SeriesGroupBy:

Along with the following improvements:

  • Add squeeze argument to read_csv (#812)
  • Raise a more helpful error for duplicated columns in Join (#820)
  • Issue with ks.merge to Series (#818)
  • Fix MultiIndex.to_pandas() and __repr__(). (#832)
  • Add unit and origin options for to_datetime (#839)
  • Fix wrong error raised in DataFrame.fillna (#844)
  • Allow str and list in aggfunc in DataFrameGroupby.agg (#828)
  • Add index_col argument to to_koalas(). (#863)

Version 0.18.0

19 Sep 07:42

Multi-index columns support

We continue improving multi-index columns support (#793, #776). We made the following APIs support multi-index columns:

Also, the name of a Series or Index can now be set to a tuple or None. (#776)

>>> import databricks.koalas as ks
>>> kser = ks.Series([1, 2, 3])
>>> kser.name = ('a', 'b')
>>> kser
0    1
1    2
2    3
Name: (a, b), dtype: int64

Plots

We also continue adding plot APIs as follows:

For Series:

For DataFrame:

  • plot.hist() (#780)

Options

In addition, we added support for namespace access in options (#785).

>>> import databricks.koalas as ks
>>> ks.options.display.max_rows
1000
>>> ks.options.display.max_rows = 10
>>> ks.options.display.max_rows
10

See also the User Guide in our project docs.
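
For readers curious how attribute-style access can sit on top of dotted option names such as display.max_rows, here is a minimal plain-Python sketch. The class OptionNamespace and the flat dict are hypothetical and are not taken from the Koalas source; the sketch only shows the general technique.

```python
class OptionNamespace:
    """Attribute-style view over a flat dict of dotted option keys (hypothetical)."""

    def __init__(self, store, prefix=''):
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, '_store', store)
        object.__setattr__(self, '_prefix', prefix)

    def __getattr__(self, name):
        key = self._prefix + name
        if key in self._store:
            return self._store[key]  # a concrete option value
        if any(k.startswith(key + '.') for k in self._store):
            return OptionNamespace(self._store, key + '.')  # a nested namespace
        raise AttributeError(key)

    def __setattr__(self, name, value):
        key = self._prefix + name
        if key not in self._store:
            raise AttributeError(key)  # reject unknown options
        self._store[key] = value


store = {'display.max_rows': 1000, 'compute.ops_on_diff_frames': False}
options = OptionNamespace(store)
options.display.max_rows = 10
print(options.display.max_rows)   # 10
print(store['display.max_rows'])  # 10
```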

Other new features and improvements

We added the following new features:

koalas.DataFrame:

koalas.indexes.Index/MultiIndex

Along with the following improvements:

  • Add index_col for read_json (#797)
  • Add index_col for spark IO reads (#769, #775)
  • Add "sep" parameter for read_csv (#777)
  • Add axis parameter to dataframe.diff (#774)
  • Add read_json and let to_json use spark.write.json (#753)
  • Use spark.write.csv in to_csv of Series and DataFrame (#749)
  • Handle TimestampType separately when converting to pandas' dtype. (#798)
  • Fix spark_df when set_index(.., drop=False). (#792)

Backward compatibility

  • We removed some parameters in DataFrame.to_csv and DataFrame.to_json to allow distributed writing (#749, #753)

Version 0.17.0

05 Sep 07:19

Options

We started using options to configure Koalas' behavior. The following options are now available:

  • display.max_rows (#714, #742)
  • compute.max_rows (#721, #736)
  • compute.shortcut_limit (#717)
  • compute.ops_on_diff_frames (#725)
  • compute.default_index_type (#723)
  • plotting.max_rows (#728)
  • plotting.sample_ratio (#737)

The full list and the description of each option are in the User Guide of our project docs.

Plots

We continue adding plot APIs as follows:

For Series:

  • plot.area() (#704)

For DataFrame:

Multi-index columns support

We also continue improving multi-index columns support. We made the following APIs support multi-index columns:

  • koalas.concat() (#680)
  • koalas.get_dummies() (#695)
  • DataFrame.pivot_table() (#635)

Other new features and improvements

We added the following new features:

koalas:

  • read_sql_table() (#741)
  • read_sql_query() (#741)
  • read_sql() (#741)

koalas.DataFrame:

Along with the following improvements:

  • GroupBy.apply should return Koalas DataFrame instead of pandas DataFrame (#731)
  • Fix rpow and rfloordiv to use proper operators in Series (#735)
  • Fix rpow and rfloordiv to use proper operators in DataFrame (#740)
  • Add schema inference support at DataFrame.transform (#732)
  • Add Option class to support type check and value check in options (#739)
  • Added missing tests (#687, #692, #694, #709, #711, #730, #729, #733, #734)

Backward compatibility

  • We renamed two of the default index names from one-by-one and distributed-one-by-one to sequence and distributed-sequence respectively. (#679)
  • We moved the configuration for enabling operations on different DataFrames from the environment variable to the option. (#725)
  • We moved the configuration for the default index from the environment variable to the option. (#723)
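
A migration along these lines can be sketched as a small translation from the old settings to the new option names. The function options_from_environment is a hypothetical helper, and treating the string 'true' as the enabling value is an assumption; the environment variable names OPS_ON_DIFF_FRAMES and DEFAULT_INDEX are the ones described in the 0.16.0 notes.

```python
# Old default index names and the names that replace them in this release.
RENAMED_INDEX_TYPES = {
    'one-by-one': 'sequence',
    'distributed-one-by-one': 'distributed-sequence',
}


def options_from_environment(environ):
    """Map the old environment variables to the new option keys (hypothetical helper)."""
    options = {}
    if 'OPS_ON_DIFF_FRAMES' in environ:
        # Assumption: the variable was enabled with the string 'true'.
        options['compute.ops_on_diff_frames'] = environ['OPS_ON_DIFF_FRAMES'].lower() == 'true'
    if 'DEFAULT_INDEX' in environ:
        old_name = environ['DEFAULT_INDEX']
        # Names that were not renamed, such as 'distributed', pass through unchanged.
        options['compute.default_index_type'] = RENAMED_INDEX_TYPES.get(old_name, old_name)
    return options


old_settings = {'OPS_ON_DIFF_FRAMES': 'true', 'DEFAULT_INDEX': 'one-by-one'}
print(options_from_environment(old_settings))
# {'compute.ops_on_diff_frames': True, 'compute.default_index_type': 'sequence'}
```

Each resulting value could then be assigned through ks.options, for example ks.options.compute.ops_on_diff_frames = True.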

Version 0.16.0

22 Aug 06:35

Firstly, we introduced a new mode that enables operations on different DataFrames (#633). This mode can be enabled by setting the OPS_ON_DIFF_FRAMES environment variable to true, as below:

>>> import databricks.koalas as ks
>>>
>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
    id
0 -5.0
1 -3.0
2 -1.0
3  NaN
4  NaN

>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
   id  new_col
0   0      1.0
1   1      2.0
3   3      4.0
2   2      3.0
4   4      NaN
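
The NaN rows in both outputs come from aligning the two frames on their index labels: a label present on only one side has no partner value. A plain-Python sketch of that alignment for the subtraction example follows; subtract_aligned is a hypothetical helper with dicts standing in for the id columns, not Koalas code.

```python
import math


def subtract_aligned(left, right):
    """Subtract two {index label: value} dicts after aligning on the union of labels.

    Labels missing from either side yield NaN (hypothetical helper).
    """
    labels = sorted(set(left) | set(right))
    return {
        label: (float(left[label] - right[label])
                if label in left and label in right else math.nan)
        for label in labels
    }


kdf1_id = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}   # stands in for ks.range(5)
kdf2_id = {0: 5, 1: 4, 2: 3}               # stands in for ks.DataFrame({'id': [5, 4, 3]})

result = subtract_aligned(kdf1_id, kdf2_id)
print(result)
# {0: -5.0, 1: -3.0, 2: -1.0, 3: nan, 4: nan}
```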

Secondly, we introduced a default index and disallowed Koalas DataFrames with no index internally (#639, #655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting the DEFAULT_INDEX environment variable to one of three types:

  • (default) one-by-one: It implements a one-by-one sequence using a Window function without
    specifying a partition. This index type should be avoided when the data is large.

    >>> ks.range(3)
       id
    0   0
    1   1
    2   2
  • distributed-one-by-one: It implements a one-by-one sequence by group-by and
    group-map approach. It still generates a one-by-one sequential index globally.
    If the default index must be a one-by-one sequence in a large dataset, this
    index can be used.

    >>> ks.range(3)
       id
    0   0
    1   1
    2   2
  • distributed: It implements a monotonically increasing sequence simply by using
    Spark's monotonically_increasing_id function. If the index does not have to be
    a one-by-one sequence, this index can be used. Performance-wise, this index
    has almost no penalty compared to other index types.

    >>> ks.range(3)
                 id
    25769803776   0
    60129542144   1
    94489280512   2
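
The large values in the distributed example are not arbitrary. Spark documents that monotonically_increasing_id currently places the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below decodes the three index values shown above under that layout; split_id is a hypothetical helper.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition


def split_id(value):
    """Split a monotonically_increasing_id value into (partition ID, record number)."""
    return value >> RECORD_BITS, value & ((1 << RECORD_BITS) - 1)


for value in (25769803776, 60129542144, 94489280512):
    print(value, split_id(value))
# 25769803776 (3, 0)
# 60129542144 (7, 0)
# 94489280512 (11, 0)
```

Each row here is the first record of a different partition, which is why the index increases but is not consecutive.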

Thirdly, we implemented many plot APIs in Series as follows:

See the example below:

import databricks.koalas as ks

ks.range(10).to_pandas().id.plot.pie()

Fourthly, we continued to rapidly improve multi-index columns support. Multi-index columns are now supported in the following APIs:

  • DataFrame.sort_index() (#637)
  • GroupBy.diff() (#653)
  • GroupBy.rank() (#653)
  • Series.any() (#652)
  • Series.all() (#652)
  • DataFrame.any() (#652)
  • DataFrame.all() (#652)
  • DataFrame.assign() (#657)
  • DataFrame.drop() (#658)
  • DataFrame.reindex() (#659)
  • Series.quantile() (#663)
  • Series.transform() (#663)
  • DataFrame.select_dtypes() (#662)
  • DataFrame.transpose() (#664)

Lastly, we added new functionality over the past weeks, especially for groupby. We added the following features:

koalas.DataFrame

koalas.groupby.GroupBy:

Along with the following improvements:

  • Add a basic infrastructure for configurations. (#645)
  • Always use column_index. (#648)
  • Allow omitting type hints in GroupBy.transform, filter, apply (#646)