Version 1.2.0
Non-named Series support
We added support for non-named Series (#1712). Previously, Koalas automatically named a Series "0" when no name was specified or the name was set to None, whereas pandas allows a Series to have no name.
For example:
>>> ks.__version__
'1.1.0'
>>> kser = ks.Series([1, 2, 3])
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64
>>> kser.name = None
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64
Now the Series remains non-named:
>>> ks.__version__
'1.2.0'
>>> ks.Series([1, 2, 3])
0    1
1    2
2    3
dtype: int64
>>> kser = ks.Series([1, 2, 3], name="a")
>>> kser.name = None
>>> kser
0    1
1    2
2    3
dtype: int64
More stable "distributed-sequence" default index
Previously, the "distributed-sequence" default index sometimes produced wrong values or even raised an exception. For example, the code below:
>>> from databricks import koalas as ks
>>> ks.options.compute.default_index_type = 'distributed-sequence'
>>> ks.range(10).reset_index()
failed with an exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
...
File "/.../koalas/databricks/koalas/internal.py", line 620, in offset
current_partition_offset = sums[id.iloc[0]]
KeyError: 103
We investigated and made the default index type more stable (#1701). It is now much less likely to hit such cases.
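Conceptually, a "distributed-sequence" index counts the rows in each partition and then shifts each partition's local positions by the total row count of all preceding partitions. A pure-Python sketch of that offset logic (the partition representation and function name here are illustrative, not Koalas' actual internals):

```python
from itertools import accumulate

def distributed_sequence(partitions):
    """Assign a global, consecutive index across partitions by
    offsetting each partition's local positions with the total
    row count of all preceding partitions."""
    sizes = [len(p) for p in partitions]
    # offsets[i] = number of rows in partitions 0..i-1
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        (offset + i, row)
        for offset, part in zip(offsets, partitions)
        for i, row in enumerate(part)
    ]

# three uneven "partitions" of data
parts = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = distributed_sequence(parts)
# → [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f')]
```

The bug class fixed in #1701 lived in this offset lookup step, where a partition could fail to find its precomputed offset.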
Improve testing infrastructure
We changed the testing infrastructure to use pandas' testing utilities for exact checks (#1722). The comparisons now also cover index/column types and names, so we can follow pandas more strictly.
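As an illustration, pandas' testing utilities compare names and index types exactly; a small sketch of the kind of strict check involved:

```python
import pandas as pd
from pandas.testing import assert_series_equal

left = pd.Series([1, 2, 3], name="a")
right = pd.Series([1, 2, 3], name="a")

# check_names=True (the default) also compares Series and index names
assert_series_equal(left, right, check_names=True)

# a name mismatch now fails the comparison
mismatch_detected = False
try:
    assert_series_equal(left, right.rename("b"))
except AssertionError:
    mismatch_detected = True
```

Checks like these catch subtle divergences from pandas (e.g. a wrongly named result) that value-only comparisons would miss.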
Other new features and improvements
We added the following new features:

DataFrame:
- last_valid_index (#1705)

Series:
- last_valid_index (#1705)

GroupBy:
- cumcount (#1702)
Other improvements
- Refine Spark I/O. (#1667)
  - Set partitionBy explicitly in to_parquet.
  - Add mode and partition_cols to to_csv and to_json.
  - Fix type hints to use Optional.
- Make read_excel read from DFS if the underlying Spark is 3.0.0 or above. (#1678, #1693, #1694, #1692)
- Support callable instances to apply as a function, and fix groupby.apply to keep the index when possible (#1686)
- Bug fixing for hasnans when non-DoubleType. (#1681)
- Support axis=1 for DataFrame.dropna(). (#1689)
- Allow assigning index as a column (#1696)
- Try to read pandas metadata in read_parquet if index_col is None. (#1695)
- Include pandas Index object in dataframe indexing options (#1698)
- Unified PlotAccessor for DataFrame and Series (#1662)
- Fix SeriesGroupBy.nsmallest/nlargest. (#1713)
- Fix DataFrame.size to consider its number of columns. (#1715)
- Fix first_valid_index() for Empty object (#1704)
- Fix index name when groupby.apply returns a single row. (#1719)
- Support subtraction of date/timestamp with literals. (#1721)
- DataFrame.reindex(fill_value) does not fill existing NaN values (#1723)