Differences from SAS language Data Step#

This page summarizes the key differences between the SAS language DATA step and limulus.

IO Support#

Standard support: pyarrow.Table / polars.DataFrame
Optional support: pandas.DataFrame (pip install limulus[pandas])
session.loads() accepts all three formats listed above

Support for sas7bdat input/output is not planned, as the format specification is proprietary.
If you need to work with sas7bdat files, load them first with another library, for example pandas (pandas.read_sas), and pass the result to session.loads().

Type Mapping#

Arrow Type	SAS language Equivalent	Python
`int64`, etc.	Numeric	`int`
`float64`, etc.	Numeric	`float`
`string`	Character	`str`
`bool`	Numeric (0/1)	`bool`
`date`, etc.	Numeric	`datetime`

Arrow supports a wide variety of data types.
Complex types that have no SAS equivalent, such as structs, are not intended for use in DATA step logic, but they can be carried through as columns without issue.

Behavioral Differences#

Labels#

Dataset labels and column labels are supported, but they are stored as Arrow metadata rather than as a separate display-layer construct. Dataset labels live in schema metadata under memlabel, and column labels live in per-field custom metadata.

Dataset Option Type Preservation#

Some row-oriented execution paths materialize rows before the final Arrow table is rebuilt. As a current limitation, Arrow physical types are not always preserved exactly after SET-based processing, including cases that use source dataset options such as firstobs= or obs=. The logical values are preserved, but numeric columns may be widened, for example from float32 to float64.

If exact physical types matter, convert types at the final materialized output stage: prefer DatasetView.cast(...) or DatasetView.astype(...) as the last transformation before using the result.

SQL API#

Session.sql() is available for read-oriented queries and CREATE TABLE ... AS ... style result persistence. The feature is backed by the Polars SQL engine and is intended as a practical session-level query helper rather than a full PROC SQL reimplementation.

length#

Character columns are variable-length by default, so character values are not truncated.

BY Groups#

Pre-sorting is not required to use BY groups.

Missing Values#

SAS language numeric missing values are written with the special . symbol; limulus represents them as Arrow null.
SAS language character missing values are ""; in limulus, following the Arrow specification, "" and null are distinct values.
Use missing() to test for "" and null together.
To assign null, use =. for numeric types, or =None for either numeric or character types. (Support for call missing is also under consideration.)
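The combined check can be pictured in plain Python. The helper `is_missing` below is hypothetical and only mirrors the behavior described above (null and "" both count as missing); it is not limulus's implementation, and how missing() treats other inputs, such as whitespace-only strings, is not covered here.

```python
def is_missing(value):
    """Hypothetical stand-in for missing(): True for null (None) or ""."""
    return value is None or value == ""


# None and "" are distinct values, yet both count as missing.
values = [None, "", 0, "a"]
flags = [is_missing(v) for v in values]

print(flags)
```

Note that the numeric value 0 is not missing, which is why a plain truthiness test would be the wrong check here.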

Data Joining#

Merge operates like a SQL join: if the input datasets contain duplicate column names, an error is raised.

Array#

A single array can contain both numeric and character variables, which the SAS language does not allow.

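# Assumes `session` is an existing limulus session with the iris dataset loaded.
# Outputs one row per array element holding that variable's name;
# `stop` ends the step after the first input row.
result = session.submit("""
data out;
  set iris;
  array var [*] species sepal_length;
  do i = 1 to dim(var);
    vname = vname(var[i]);
    output;
  end;
  stop;
  keep vname;
run;
""")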

Additional Operators#

SAS language operator	Additional form in limulus
`=`	`==`
`^=`	`!=`
`||`	`+`

Custom Functions#

lead function
Returns the value from a later row, the opposite of lag. Useful when you need the next value, for example to compute differences between consecutive rows.

shift function
A general-purpose shift function: negative values behave like lag, positive values like lead.

apply function
Allows calling Python lambda functions or custom functions from within a Data Step.
Useful for extending functionality not covered by limulus’s built-in functions, such as complex string operations.
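The lead and shift semantics described above can be sketched over a plain Python list. The function below is a hypothetical illustration, not limulus's implementation: its name, its signature, and the choice to fill out-of-range positions with None are assumptions.

```python
def shift(values, n):
    """Return values shifted by n rows.

    Positive n looks ahead (like lead), negative n looks back (like lag).
    Positions with no source row are filled with None (an assumption).
    """
    size = len(values)
    result = []
    for i in range(size):
        j = i + n
        result.append(values[j] if 0 <= j < size else None)
    return result


prices = [10, 20, 30]
next_price = shift(prices, 1)    # lead by one row
prev_price = shift(prices, -1)   # lag by one row

# Difference to the next row, skipping the last row where no next value exists.
diffs = [nxt - cur for cur, nxt in zip(prices, next_price) if nxt is not None]

print(next_price, prev_price, diffs)
```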

Unsupported Features#

SAS language Feature	Notes
Attrib	Full ATTRIB parity is not implemented yet; use dataset labels and LABEL statements for supported metadata cases
Data transposition	Handle on the Python side; `retain` workaround available; `transpose` API planned separately
Numeric format (`PUT(x, 8.2)`)	Planned for future implementation; `apply` can be used as a workaround
Format (`FORMAT`, `INFORMAT`)	Handle with if statements and merge; to be revisited
Macro variables (`%let`, `&var`)	Planned; use Python f-strings as an alternative
`INFILE` / `FILE`	Not planned; handle on the Python side
`INPUT` / `DATALINES`	Not planned; handle on the Python side
Colon-based options (`=:`, `aa:`)	Workarounds available; under consideration
Range notation (`a-z`, `a1-a3`)	Workarounds available; under consideration
Variable groups (`_numeric_`, `_character_`, `_all_`)	Workarounds available; under consideration
Mnemonic operators (e.g., `eq`)	May be implemented if there is demand
CALL subroutines	`symputx` and `missing` are planned
Dictionary tables	Planned for future implementation

In general, features that have no practical workaround and are commonly used will be prioritized.
For features that are costly to implement and have good Python-side alternatives, those alternatives are recommended instead, freeing resources for areas like performance.
For example, external file I/O (INFILE / FILE) is well served by pandas, polars, or jinja2.

Sorting and Column Reordering#

Operations that the SAS language typically delegates to procedures, such as sorting, are executed through the column-oriented API.
A high-performance polars-based wrapper is planned, but currently only basic operations are available.
Column reordering can be done with select or keep.

session.dataset("ds").sort("x")
session.dataset("ds").sort(["x", "y"])
session.dataset("ds").sort(["x"], nodupkey=True)
session.dataset("ds").sort([("x", "Ascending"), ("y", "Descending")])

session.dataset("ds").select(["x", "y"])
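For readers unfamiliar with the option, the following plain-Python sketch shows what `nodupkey=True` would mean if it mirrors PROC SORT NODUPKEY, which is an assumption; the behavior is not spelled out here. The helper `sort_nodupkey` and the sample rows are hypothetical and do not use limulus.

```python
def sort_nodupkey(rows, keys):
    """Sort rows by the key columns, then keep the first row per distinct key."""
    # sorted() is stable, so rows with equal keys keep their input order.
    ordered = sorted(rows, key=lambda row: tuple(row[k] for k in keys))
    seen = set()
    unique = []
    for row in ordered:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"x": 2, "y": "b"},
    {"x": 1, "y": "a"},
    {"x": 2, "y": "c"},
]
result = sort_nodupkey(rows, ["x"])

print(result)
```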

Recommended Patterns#

%LET macro variables → Python variables + f-string to build DSL strings
INFILE → pass Polars / Pandas to session.loads()
FORMAT → handle with if statements and merge
Missing value checks → use the missing() function, or cmiss() / nmiss()
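The first pattern might look like the following sketch, in which Python variables stand in for %LET macro variables. The dataset and column names are illustrative, and the final session.submit() call is left as a comment so the snippet runs without a session.

```python
# Python variables take the place of %LET macro variables.
source = "iris"                           # illustrative dataset name
keep_cols = ["species", "sepal_length"]   # illustrative column names

# Join the list first so the f-string expression stays simple.
keep_list = " ".join(keep_cols)

code = f"""
data out;
  set {source};
  keep {keep_list};
run;
"""

print(code)

# With a live session, the string would then be submitted:
# result = session.submit(code)
```

Since the substitution is ordinary string formatting, values taken from untrusted input should be validated before they are interpolated into the code string.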
