Releases: databricks/koalas
Version 0.25.0
loc and iloc indexers improvement
We improved the loc and iloc indexers. Now, loc can support scalar values as indexers (#1172).
>>> import databricks.koalas as ks
>>>
>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df.loc['sidewinder']
max_speed 7
shield 8
Name: sidewinder, dtype: int64
>>> df.loc['sidewinder', 'max_speed']
7
In addition, Series derived from a different Frame can be used as indexers (#1155).
>>> import databricks.koalas as ks
>>>
>>> ks.options.compute.ops_on_diff_frames = True
>>>
>>> df1 = ks.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]},
... index=[20, 10, 30, 0, 50])
>>> df2 = ks.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]},
... index=[20, 10, 30, 0, 50])
>>> df1.A.loc[df2.A > -3].sort_index()
10 1
20 0
30 2
Lastly, when a slice is used, loc now follows the natural order of the index, identically to pandas (#1159, #1174, #1179). See the example below.
>>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df.loc['cobra':'viper', 'max_speed']
cobra 1
viper 4
Name: max_speed, dtype: int64
Other new features and improvements
We added the following new features:
koalas.Series:
- get (#1153)
koalas.Index
koalas.MultiIndex:
Other improvements
- Add support from_pandas for Index/MultiIndex. (#1170)
- Add a hidden column __natural_order__. (#1146)
- Introduce _LocIndexerLike and consolidate some logic. (#1149)
- Refactor LocIndexerLike.__getitem__. (#1152)
- Remove sort in GroupBy._reduce_for_stat_function. (#1147)
- Randomize index in tests and fix some window-like functions. (#1151)
- Explicitly don't support Index.duplicated (#1131)
- Fix DataFrame._repr_html_(). (#1177)
Version 0.24.0
NumPy's universal function (ufunc) compatibility
We added NumPy ufunc compatibility (#1127). Virtually all ufuncs are now supported on Koalas DataFrame. See the example below:
>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> np.log(kdf)
id
0 NaN
1 0.000000
2 0.693147
3 1.098612
4 1.386294
5 1.609438
6 1.791759
7 1.945910
8 2.079442
9 2.197225
Other new features and improvements
We added the following new features:
koalas:
- to_numeric (#1060)
koalas.DataFrame:
koalas.Index
koalas.MultiIndex:
koalas.SeriesGroupBy:
- head (#1050)
koalas.DataFrameGroupBy:
- head (#1050)
Other improvements
- Setting index name / names for Series (#1079)
- Disable 'str' for 'SeriesGroupBy', disable 'DataFrame' for 'GroupBy' (#1097)
- Support 'compute.ops_on_diff_frames' for NumPy ufunc compat in Series (#1128)
- Support arithmetic and comparison APIs on same DataFrames (#1129)
- Fix rename() for Index to support MultiIndex also (#1125)
- Set the upper-bound for pandas. (#1137)
- Fix _cum() for Series to work properly (#1113)
- Fix value_counts() to work properly when dropna is True (#1116, #1142)
Version 0.23.0
NumPy's universal function (ufunc) compatibility
We added NumPy ufunc compatibility (#1096, #1106). Virtually all ufuncs are now supported on Koalas Series. See the example below:
>>> import databricks.koalas as ks
>>> import numpy as np
>>> kdf = ks.range(10)
>>> kser = np.sqrt(kdf.id)
>>> type(kser)
<class 'databricks.koalas.series.Series'>
>>> kser
0 0.000000
1 1.000000
2 1.414214
3 1.732051
4 2.000000
5 2.236068
6 2.449490
7 2.645751
8 2.828427
9 3.000000
Other new features and improvements
We added the following new features:
koalas:
- option_context (#1077)
koalas.DataFrame:
koalas.Series:
koalas.Index:
- symmetric_difference (#953, #1059)
- to_numpy (#1058)
- transpose (#1056)
- T (#1056)
- dropna (#938)
- shape (#1085)
- value_counts (#949)
koalas.MultiIndex:
- symmetric_difference (#953, #1059)
- to_numpy (#1058)
- transpose (#1056)
- T (#1056)
- dropna (#938)
- shape (#1085)
- value_counts (#949)
Other improvements
- Fix comparison operators to treat NULL as False (#1029)
- Make corr return koalas.DataFrame (#1069)
- Include link to Help Thirsty Koalas Fund (#1082)
- Add Null handling for different frames (#1083)
- Allow Series.__getitem__ to take boolean Series (#1075)
- Produce correct output against MultiIndex when 'compute.ops_on_diff_frames' is enabled (#1089)
- Fix idxmax() / idxmin() for Series to work properly (#1078)
Version 0.22.0
Enable Arrow 0.15.1+
Apache Arrow 0.15.0 did not work well with PySpark 2.4 so it was disabled in the previous version.
With Arrow 0.15.1, it now works in Koalas (#902).
Expanding and Rolling
We also added expanding() and rolling() APIs in all groupby(), Series and Frame (#985, #991, #990, #1015, #996, #1034, #1037)
min
max
sum
mean
std
var
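To make the difference between the two window types concrete, here is a plain-Python sketch of what a rolling sum and an expanding sum compute. It assumes the pandas convention that a rolling window yields NaN until it is full. The names rolling_sum and expanding_sum are hypothetical helpers for illustration, not Koalas functions.

```python
import math


def rolling_sum(values, window):
    """Sum over a fixed-size trailing window; NaN until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(math.nan)  # not enough observations yet
        else:
            out.append(float(sum(values[i + 1 - window:i + 1])))
    return out


def expanding_sum(values):
    """Sum over everything seen so far; the window grows with each row."""
    out, total = [], 0.0
    for v in values:
        total += v
        out.append(total)
    return out


data = [1, 2, 3, 4]
print(rolling_sum(data, 2))  # [nan, 3.0, 5.0, 7.0]
print(expanding_sum(data))   # [1.0, 3.0, 6.0, 10.0]
```

The same pattern applies to the other listed aggregations: only the function applied to each window changes.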
Multi-index columns support
We continue improving multi-index columns support. We made the following APIs support multi-index columns:
Documentation
We added a "Best Practices" section to the documentation (#1041) for Koalas users to read and follow. Please see https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html
Other new features and improvements
We added the following new features:
koalas.DataFrame:
koalas.Series:
koalas.MultiIndex:
Along with the following improvements:
- Introduce column_scols in InternalFrame as a substitute for data_columns. (#956)
- Fix different index level assignment when 'compute.ops_on_diff_frames' is enabled (#1045)
- Fix DataFrame.melt function and add a doctest case for it (#987)
- Enable creating Index from list like 'Index([1, 2, 3])' (#986)
- Fix combine_frames to handle the case where the right-hand side arguments are modified Series (#1020)
- setup.py should support Python 2 to show a proper error message. (#1027)
- Remove Series.schema. (#993)
Version 0.21.0
Multi-index columns support
We continue improving multi-index columns support. We made the following APIs support multi-index columns:
Documentation
We now have an installation guide, design principles and an FAQ in our public documentation (#914, #944, #963, #964).
Other new features and improvements
We added the following new features:
koalas:
- merge (#969)
koalas.DataFrame:
koalas.Series:
koalas.Index:
koalas.MultiIndex:
koalas.Expanding:
- count (#978)
Along with the following improvements:
- Fix passing options as keyword arguments (#968)
- Make is_monotonic~ work properly for index (#930)
- Fix Series.__getitem__ to work properly (#934)
- Fix reindex when all the given columns are included in the existing columns (#975)
- Add datetime as the equivalent python type to TimestampType (#957)
- Fix is_unique to respect the current Spark column (#981)
- Fix bug when assigning None to the name of an Index (#974)
- Use name_like_string instead of str directly. (#942, #950)
Version 0.20.0
Disable Arrow 0.15
Apache Arrow 0.15.0, which Koalas depends on to execute Pandas UDFs, was released on October 5, 2019, but the Spark community reported an issue with PyArrow 0.15.
We decided to set an upper bound on the pyarrow version to avoid such issues until we are sure that Koalas works fine with it.
- Set an upper bound for pyarrow version. (#918)
Multi-index columns support
We continue improving multi-index columns support. We made the following APIs support multi-index columns:
Other new features and improvements
We added the following new features:
koalas.DataFrame:
- xs (#892)
koalas.Series:
koalas.GroupBy:
- shift (#910)
Along with the following improvements:
Version 0.19.0
Koalas Logo
We now have an official logo!
You can see the cute logo in our documentation as well.
Documentation
We also improved the documentation: https://koalas.readthedocs.io/en/latest/
- Added the logo (#831)
- Added a Jupyter notebook for 10 min tutorial (#843)
- Added the tutorial to the documentation (#853)
- Added some examples for plot implementations in their docstrings (#847)
- Moved the contribution guide to the official documentation site (#841)
Binder integration for the 10 min tutorial
You can run a live Jupyter notebook for the 10 min tutorial from Binder.
Multi-index columns support
We continue improving multi-index columns support. We made the following APIs support multi-index columns:
- transform (#800)
- round (#802)
- unique (#809)
- duplicated (#803)
- assign (#811)
- merge (#825)
- plot (#830)
- groupby and its functions (#833)
- update (#848)
- join (#848)
- drop_duplicates (#856)
- dtype (#858)
- filter (#859)
- dropna (#857)
- replace (#860)
Plots
We also continue adding plot APIs as follows:
For DataFrame:
- plot.kde() (#784)
Other new features and improvements
We added the following new features:
koalas.DataFrame:
koalas.Series:
koalas.DataFrameGroupBy:
koalas.SeriesGroupBy:
Along with the following improvements:
- Add squeeze argument to read_csv (#812)
- Raise a more helpful error for duplicated columns in Join (#820)
- Issue with ks.merge to Series (#818)
- Fix MultiIndex.to_pandas() and __repr__(). (#832)
- Add unit and origin options for to_datetime (#839)
- Fix wrong error raised in DataFrame.fillna (#844)
- Allow str and list in aggfunc in DataFrameGroupby.agg (#828)
- Add index_col argument to to_koalas(). (#863)
Version 0.18.0
Multi-index columns support
We continue improving multi-index columns support (#793, #776). We made the following APIs support multi-index columns:
Also, we can now set a tuple or None as the name of a Series or Index. (#776)
>>> import databricks.koalas as ks
>>> kser = ks.Series([1, 2, 3])
>>> kser.name = ('a', 'b')
>>> kser
0 1
1 2
2 3
Name: (a, b), dtype: int64
Plots
We also continue adding plot APIs as follows:
For Series:
- plot.kde() (#767)
For DataFrame:
- plot.hist() (#780)
Options
In addition, we added support for namespace access in options (#785).
>>> import databricks.koalas as ks
>>> ks.options.display.max_rows
1000
>>> ks.options.display.max_rows = 10
>>> ks.options.display.max_rows
10
See also the User Guide in our project docs.
Other new features and improvements
We added the following new features:
koalas.DataFrame:
koalas.indexes.Index/MultiIndex
- is_boolean (#795)
- is_categorical (#795)
- is_floating (#795)
- is_integer (#795)
- is_interval (#795)
- is_numeric (#795)
- is_object (#795)
Along with the following improvements:
- Add index_col for read_json (#797)
- Add index_col for spark IO reads (#769, #775)
- Add "sep" parameter for read_csv (#777)
- Add axis parameter to dataframe.diff (#774)
- Add read_json and let to_json use spark.write.json (#753)
- Use spark.write.csv in to_csv of Series and DataFrame (#749)
- Handle TimestampType separately when converting to pandas' dtype. (#798)
- Fix spark_df when set_index(.., drop=False). (#792)
Backward compatibility
Version 0.17.0
Options
We started using options to configure Koalas' behavior. Now we have the following options:
- display.max_rows (#714, #742)
- compute.max_rows (#721, #736)
- compute.shortcut_limit (#717)
- compute.ops_on_diff_frames (#725)
- compute.default_index_type (#723)
- plotting.max_rows (#728)
- plotting.sample_ratio (#737)
We can also see the list and their descriptions in the User Guide of our project docs.
Plots
We continue adding plot APIs as follows:
For Series:
- plot.area() (#704)
For DataFrame:
- plot.line() (#686)
- plot.bar() (#695)
- plot.barh() (#698)
- plot.pie() (#703)
- plot.area() (#696)
- plot.scatter() (#719)
Multi-index columns support
We also continue improving multi-index columns support. We made the following APIs support multi-index columns:
Other new features and improvements
We added the following new features:
koalas:
koalas.DataFrame:
- style (#712)
Along with the following improvements:
- GroupBy.apply should return Koalas DataFrame instead of pandas DataFrame (#731)
- Fix rpow and rfloordiv to use proper operators in Series (#735)
- Fix rpow and rfloordiv to use proper operators in DataFrame (#740)
- Add schema inference support at DataFrame.transform (#732)
- Add Option class to support type check and value check in options (#739)
- Added missing tests (#687, #692, #694, #709, #711, #730, #729, #733, #734)
Backward compatibility
- We renamed two of the default index names from one-by-one and distributed-one-by-one to sequence and distributed-sequence respectively. (#679)
- We moved the configuration for enabling operations on different DataFrames from the environment variable to the option. (#725)
- We moved the configuration for the default index from the environment variable to the option. (#723)
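When migrating a setup that used the old default index names, the rename can be captured in a small lookup. The sketch below is plain Python. RENAMED_INDEX_TYPES and new_index_type_name are hypothetical names introduced here for illustration, and the commented-out Koalas call assumes the pandas-style set_option entry point together with the compute.default_index_type option listed above.

```python
# Old default index names mapped to the names used from 0.17.0 on.
# 'distributed' is not part of the rename, so it is assumed to be unchanged.
RENAMED_INDEX_TYPES = {
    "one-by-one": "sequence",
    "distributed-one-by-one": "distributed-sequence",
}


def new_index_type_name(old_name):
    """Return the 0.17.0 name for a default index type given its old name."""
    return RENAMED_INDEX_TYPES.get(old_name, old_name)


print(new_index_type_name("one-by-one"))              # sequence
print(new_index_type_name("distributed-one-by-one"))  # distributed-sequence
print(new_index_type_name("distributed"))             # distributed

# With Koalas 0.17.0 installed, the translated value would go to the option
# rather than the environment variable (assumed usage, not run here):
#   import databricks.koalas as ks
#   ks.set_option("compute.default_index_type", new_index_type_name("one-by-one"))
```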
Version 0.16.0
Firstly, we introduced a new mode to enable operations on different DataFrames (#633). This mode can be enabled by setting the OPS_ON_DIFF_FRAMES environment variable to true, as below:
>>> import databricks.koalas as ks
>>>
>>> kdf1 = ks.range(5)
>>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
>>> (kdf1 - kdf2).sort_index()
id
0 -5.0
1 -3.0
2 -1.0
3 NaN
4 NaN
>>> import databricks.koalas as ks
>>>
>>> kdf = ks.range(5)
>>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
>>> kdf
id new_col
0 0 1.0
1 1 2.0
3 3 4.0
2 2 3.0
4 4 NaN
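The examples above do not show how the variable is set. One way to do it from within Python is sketched below, assuming Koalas 0.16.0 reads OPS_ON_DIFF_FRAMES from the process environment. Setting it before importing Koalas is a cautious assumption, not a documented requirement, and subtract_different_frames is a hypothetical wrapper around the first example.

```python
import os

# Assumption: the variable is read from the process environment, so it is
# set before Koalas is imported to be on the safe side.
os.environ["OPS_ON_DIFF_FRAMES"] = "true"


def subtract_different_frames():
    """Repeat the first example above; needs Koalas 0.16.0 and Spark."""
    import databricks.koalas as ks

    kdf1 = ks.range(5)
    kdf2 = ks.DataFrame({'id': [5, 4, 3]})
    return (kdf1 - kdf2).sort_index()
```

Calling subtract_different_frames() in an environment with Koalas and Spark available should reproduce the first output shown above. Exporting the variable in the shell before starting Python is an equivalent alternative.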
Secondly, we also introduced a default index and disallowed Koalas DataFrames with no index internally (#639, #655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting the DEFAULT_INDEX environment variable to one of three types:
- one-by-one (default): It implements a one-by-one sequence by Window function without specifying partition. This index type should be avoided when the data is large.
  >>> ks.range(3)
     id
  0   0
  1   1
  2   2
- distributed-one-by-one: It implements a one-by-one sequence by group-by and group-map approach. It still generates a one-by-one sequential index globally. If the default index must be a one-by-one sequence in a large dataset, this index can be used.
  >>> ks.range(3)
     id
  0   0
  1   1
  2   2
- distributed: It implements a monotonically increasing sequence simply by using Spark's monotonically_increasing_id function. If the index does not have to be a one-by-one sequence, this index can be used. Performance-wise, this index has almost no penalty compared to other index types.
  >>> ks.range(3)
               id
  25769803776   0
  60129542144   1
  94489280512   2
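The large index values in the distributed example follow from how Spark documents monotonically_increasing_id: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below decodes the three values shown. The helper names are hypothetical, and the partition numbers come from the arithmetic alone, not from inspecting Koalas.

```python
PARTITION_SHIFT = 33  # lower 33 bits hold the row number within a partition


def monotonic_id(partition_id, row_in_partition):
    """Compose an ID using the documented bit layout."""
    return (partition_id << PARTITION_SHIFT) + row_in_partition


def split_id(value):
    """Recover (partition_id, row_in_partition) from a generated ID."""
    return value >> PARTITION_SHIFT, value & ((1 << PARTITION_SHIFT) - 1)


for index_value in (25769803776, 60129542144, 94489280512):
    print(index_value, split_id(index_value))
# 25769803776 (3, 0)
# 60129542144 (7, 0)
# 94489280512 (11, 0)
```

Each value is the first row of a different partition, which is why the index is increasing but not consecutive.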
Thirdly, we implemented many plot APIs in Series as follows:
See the example below:
import databricks.koalas as ks
ks.range(10).to_pandas().id.plot.pie()
Fourthly, we continued to rapidly improve multi-index columns support. Multi-index columns are now supported in the following APIs:
- DataFrame.sort_index() (#637)
- GroupBy.diff() (#653)
- GroupBy.rank() (#653)
- Series.any() (#652)
- Series.all() (#652)
- DataFrame.any() (#652)
- DataFrame.all() (#652)
- DataFrame.assign() (#657)
- DataFrame.drop() (#658)
- DataFrame.reindex() (#659)
- Series.quantile() (#663)
- Series.transform() (#663)
- DataFrame.select_dtypes() (#662)
- DataFrame.transpose() (#664)
Lastly, we added new functionality over the past weeks, especially for groupby. We added the following features:
koalas.DataFrame
koalas.groupby.GroupBy:
Along with the following improvements: