Version 1.2.0
Non-named Series support
We added support for non-named Series (#1712). Previously, Koalas automatically named a Series "0" when no name was specified or the name was set to None, whereas pandas allows a Series to have no name.
For example:
>>> ks.__version__
'1.1.0'
>>> kser = ks.Series([1, 2, 3])
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64
>>> kser.name = None
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64
Now the Series remains non-named:
>>> ks.__version__
'1.2.0'
>>> ks.Series([1, 2, 3])
0    1
1    2
2    3
dtype: int64
>>> kser = ks.Series([1, 2, 3], name="a")
>>> kser.name = None
>>> kser
0    1
1    2
2    3
dtype: int64
More stable "distributed-sequence" default index
Previously, the "distributed-sequence" default index sometimes produced wrong values or even raised an exception. For example, the code below:
>>> from databricks import koalas as ks
>>> ks.options.compute.default_index_type = 'distributed-sequence'
>>> ks.range(10).reset_index()
failed with an exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
...
File "/.../koalas/databricks/koalas/internal.py", line 620, in offset
current_partition_offset = sums[id.iloc[0]]
KeyError: 103
We investigated and made the default index type more stable (#1701). It is now much less likely to hit such cases.
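Conceptually, a "distributed-sequence" index counts the rows in each partition and then shifts each partition's local positions by the total row count of all preceding partitions. A pure-Python sketch of that offset logic (the partition representation and function name here are illustrative, not Koalas' actual internals):

```python
from itertools import accumulate

def distributed_sequence(partitions):
    """Assign a global, consecutive index across partitions by
    offsetting each partition's local positions with the total
    row count of all preceding partitions."""
    sizes = [len(p) for p in partitions]
    # offsets[i] = number of rows in partitions 0..i-1
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        (offset + i, row)
        for offset, part in zip(offsets, partitions)
        for i, row in enumerate(part)
    ]

# three uneven "partitions" of data
parts = [["a", "b"], ["c"], ["d", "e", "f"]]
indexed = distributed_sequence(parts)
# → [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f')]
```

The bug class fixed in #1701 lived in this offset lookup step, where a partition could fail to find its precomputed offset.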
Improve testing infrastructure
We changed the testing infrastructure to use pandas' testing utilities for exact checks (#1722). The comparisons now also cover index/column types and names, so we can follow pandas more strictly.
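As an illustration, pandas' testing utilities compare names and index types exactly; a small sketch of the kind of strict check involved:

```python
import pandas as pd
from pandas.testing import assert_series_equal

left = pd.Series([1, 2, 3], name="a")
right = pd.Series([1, 2, 3], name="a")

# check_names=True (the default) also compares Series and index names
assert_series_equal(left, right, check_names=True)

# a name mismatch now fails the comparison
mismatch_detected = False
try:
    assert_series_equal(left, right.rename("b"))
except AssertionError:
    mismatch_detected = True
```

Checks like these catch subtle divergences from pandas (e.g. a wrongly named result) that value-only comparisons would miss.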
Other new features and improvements
We added the following new features:

DataFrame:
- last_valid_index (#1705)

Series:
- last_valid_index (#1705)

GroupBy:
- cumcount (#1702)
Other improvements
- Refine Spark I/O. (#1667)
  - Set partitionBy explicitly in to_parquet.
  - Add mode and partition_cols to to_csv and to_json.
  - Fix type hints to use Optional.
- Make read_excel read from DFS if the underlying Spark is 3.0.0 or above. (#1678, #1693, #1694, #1692)
- Support callable instances to apply as a function, and fix groupby.apply to keep the index when possible (#1686)
- Bug fixing for hasnans when non-DoubleType. (#1681)
- Support axis=1 for DataFrame.dropna(). (#1689)
- Allow assigning index as a column (#1696)
- Try to read pandas metadata in read_parquet if index_col is None. (#1695)
- Include pandas Index object in dataframe indexing options (#1698)
- Unified PlotAccessor for DataFrame and Series (#1662)
- Fix SeriesGroupBy.nsmallest/nlargest. (#1713)
- Fix DataFrame.size to consider its number of columns. (#1715)
- Fix first_valid_index() for Empty object (#1704)
- Fix index name when groupby.apply returns a single row. (#1719)
- Support subtraction of date/timestamp with literals. (#1721)
- DataFrame.reindex(fill_value) does not fill existing NaN values (#1723)