Categories
Creating Datasets - Full DISTINCT List of Values for Columns

Currently, when creating datasets, the dataset size limit is also limiting the list of values users can see in the column distribution and filter values available. This does not provide a holistic picture to users on data available.
We need the column list of values to be separate from the dataset row limit to allow a full distinct list of values for accurate dataset creation.
Comments
-
Can you be more specific as to where you see that? Filters on live datasets send a DISTINCT query to the underlying source so any query row limits are applied on the distinct list of values. Some surfaces (e.g., grouping wizard) use sampling (which we plan to change), but these are not consumer-visible capabilities. An example will help in understand this request better.
0 -
Yes, thanks for the follow-up. For example, we connect directly to a database and go to create a Dataset based off of an entire table. The table holds 16 million records. We don't ultimately need the dataset to hold those 16 million records. However, when we go to create filters on fields and a LOV is provided to select from, OR we are viewing the dataset overview where it lists all the fields, the distinct column values, % NULL, we are not seeing the full distinct set of values, but merely a subset. Does that help?
0 -
Are you talking about Workbooks or the Dataset editor?
For Workbook filters (and for Filters in Dataset definitions), List filters are paginated to retrieve 500 values at a time. However, while the list will only show the first 'page' (until the next page fetch is triggered by a user action such as scroll), the search will still look at the complete list. This assumes that these values are actually in the dataset (when cached) or that the dataset is live. The scenarios where I think you might not get a full list are: 1) the dataset is cached, and the number of rows in the source exceeds the dataset cache limits; therefore, not all rows are present in the cache (to fix it, indicate that your dataset is live), 2) the distinct LOV is greater than the query row limit.
From your response, are you not bringing the entire 16 million? if so, how are you slicing those down?
I assume you are experiencing something else, have a different use case, or are referring to a different surface?
0 -
Can you post a picture of the experience you are looking at? This could be a few places such as the dataset editor "quality insight" tiles which do provide a sample/estimate or in the grouping dialog? Or it could be when you are working in a workbook after creating the datasets. We want to understand the ask and need a bit more detail.
0 -
We're seeing the issue in the Dataset editor. For large tables (our example is 169 million rows), we either receive an error when we try to filter on a column and open the select list (screenshots), or we do not see the full distinct list of values to pick from. For users that don't know SQL to further filter down the table, we want them to be able to use the user friendly filter LOV instead of having to add WHERE clauses into a SQL statement. One last screenshot below shows the dataset editor.
0