Version 1.2.0

@ueshin released this 28 Aug 08:34

Non-named Series support

We added support for non-named Series (#1712). Previously, Koalas automatically named a Series "0" if no name was specified or the name was set to None, whereas pandas allows a Series without a name.

For example:

>>> ks.__version__
'1.1.0'
>>> kser = ks.Series([1, 2, 3])
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64
>>> kser.name = None
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64

Now the Series is left non-named:

>>> ks.__version__
'1.2.0'
>>> ks.Series([1, 2, 3])
0    1
1    2
2    3
dtype: int64
>>> kser = ks.Series([1, 2, 3], name="a")
>>> kser.name = None
>>> kser
0    1
1    2
2    3
dtype: int64

More stable "distributed-sequence" default index

Previously "distributed-sequence" default index had sometimes produced wrong values or even raised an exception. For example, the codes below:

>>> from databricks import koalas as ks
>>> ks.options.compute.default_index_type = 'distributed-sequence'
>>> ks.range(10).reset_index()

failed with the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  ...
  File "/.../koalas/databricks/koalas/internal.py", line 620, in offset
    current_partition_offset = sums[id.iloc[0]]
KeyError: 103

We investigated the issue and made the default index type more stable (#1701). It is now unlikely to run into such situations and should be stable enough.
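
With 1.2.0, the same snippet is expected to run cleanly; a minimal sketch (the output shown is illustrative):

>>> from databricks import koalas as ks
>>> ks.options.compute.default_index_type = 'distributed-sequence'
>>> ks.range(10).reset_index().head(3)
   index  id
0      0   0
1      1   1
2      2   2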

Improve testing infrastructure

We changed the testing infrastructure to use pandas' testing utilities for exact checks (#1722). It now compares even index/column types and names, so that we can follow pandas' behavior more strictly.
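
For example, an exact check with pandas' testing utilities looks roughly like this (a minimal sketch; the frame and index name are made up for illustration):

>>> import pandas as pd
>>> from pandas.testing import assert_frame_equal
>>> from databricks import koalas as ks
>>> pdf = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index([0, 1, 2], name="idx"))
>>> kdf = ks.from_pandas(pdf)
>>> # raises AssertionError if values, dtypes, or index/column names differ
>>> assert_frame_equal(kdf.to_pandas(), pdf)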

Other new features and improvements

We added the following new features:

DataFrame:

  • last_valid_index (#1705), as sketched below
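
A minimal sketch of DataFrame.last_valid_index (the frame is made up for illustration):

>>> from databricks import koalas as ks
>>> import numpy as np
>>> kdf = ks.DataFrame({"a": [1.0, 2.0, np.nan, np.nan]})
>>> # index label of the last row holding a non-NaN value
>>> kdf.last_valid_index()
1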

Other improvements

  • Refine Spark I/O. (#1667)
    • Set partitionBy explicitly in to_parquet.
    • Add mode and partition_cols to to_csv and to_json (see the sketch after this list).
    • Fix type hints to use Optional.
  • Make read_excel read from DFS if the underlying Spark is 3.0.0 or above. (#1678, #1693, #1694, #1692)
  • Support callable instances as functions in apply, and fix groupby.apply to keep the index when possible (#1686)
  • Fix hasnans for non-DoubleType columns. (#1681)
  • Support axis=1 for DataFrame.dropna() (see the sketch after this list). (#1689)
  • Allow assigning an index as a column (#1696)
  • Try to read pandas metadata in read_parquet if index_col is None. (#1695)
  • Include pandas Index object in dataframe indexing options (#1698)
  • Unified PlotAccessor for DataFrame and Series (#1662)
  • Fix SeriesGroupBy.nsmallest/nlargest. (#1713)
  • Fix DataFrame.size to consider its number of columns. (#1715)
  • Fix first_valid_index() for Empty object (#1704)
  • Fix index name when groupby.apply returns a single row. (#1719)
  • Support subtraction of date/timestamp with literals. (#1721)
  • Fix DataFrame.reindex(fill_value) so it does not fill existing NaN values (#1723)
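
To illustrate a couple of the items above, a minimal sketch (the output path is a hypothetical placeholder; behavior assumes Koalas 1.2.0):

>>> from databricks import koalas as ks
>>> import numpy as np
>>> kdf = ks.DataFrame({"year": [2019, 2019, 2020],
...                     "value": [1.0, np.nan, 3.0],
...                     "empty": [np.nan, np.nan, np.nan]})
>>> # axis=1 in dropna (#1689): drop columns consisting only of NaN
>>> kdf.dropna(axis=1, how="all")
   year  value
0  2019    1.0
1  2019    NaN
2  2020    3.0
>>> # mode and partition_cols in to_csv (#1667); the path is a placeholder
>>> kdf.to_csv("/tmp/koalas_out", mode="overwrite", partition_cols=["year"])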