Spline: Data Lineage For Spark Structured Streaming (2018)

Vaclav Kosar and Marek Novotny's presentation at Spark + AI Summit 2018 of a proof of concept (POC) of a data lineage tool for Structured Streaming.

Data lineage tracking is one of the significant problems that companies in highly regulated industries face. These companies are required to have a good understanding of how data flows through their systems to comply with strict regulatory frameworks. Many of these organizations also utilize big and fast data technologies such as Hadoop, Apache Spark and Kafka. Spark has become one of the most popular engines for big data computing. In recent releases, Spark also provides the Structured Streaming component, which allows for real-time analysis and processing of streamed data from many sources. Spline is a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans in a lightweight, unobtrusive and easy to use manner.

Additionally, Spline offers a modern user interface that allows non-technical users to understand the logic of Apache Spark applications. In this presentation we cover the support of Spline for Structured Streaming and we demonstrate how data lineage can be captured for streaming applications.

Session hashtag: #SAISExp18

Presentation:

Spline presentation PDF download is here and the conference page is here.

Spline: Data Lineage For Spark Structured Streaming from Vaclav Kosar

Vaclav Kosar and Marek Novotny at Spark Summit 2018

About Authors

About Vaclav Kosar

Vaclav Kosar

Vaclav is a programming and analytics enthusiast. He currently forges big data software for ABSA R&D, focusing on the crucial data lineage project Spline. He studied electronics, physics and mathematics.

About Marek Novotny

Marek obtained bachelor's and master's degrees in computer science at Charles University in Prague. His master's studies focused mainly on the development of distributed and dependable systems. In 2013, Marek joined ABSA Capital in Prague to develop a scalable data integration platform and a framework for calculating regulatory reports. While working on those two projects, he gained experience with many NoSQL and distributed technologies (e.g. Kafka, ZooKeeper, Spark). Nowadays, he is a member of the Big Data Engineering team and focuses primarily on the development of the Spline project.

Marek Novotny


Presentation Text

Spline: Data Lineage for Spark Structured Streaming

Marek Novotny (ABSA), Vaclav Kosar (ABSA)

#SAISExp18

About Us

  • ABSA is a Pan-African financial services provider
      • With Apache Spark at the core of its data engineering
  • We try to fill gaps in the Hadoop ecosystem
      • Contributions to Apache Spark
      • Spark-related open-source projects (github.com/AbsaOSS)
          • ABRiS - Avro SerDe for structured APIs (#SAISDev5)
          • Cobrix - Cobol data source
          • Atum - Completeness and accuracy library
          • Spline - Data lineage tracking and visualization tool (#EUent3)

  • How is data calculated?
  • What is the schema and format of streamed data?
  • To make Spark BCBS (Clarity) compliant
  • To communicate with business people
  • Online documentation of
      • Job dependencies
      • Spark SQL job details
      • Attributes occurring in the logic

Lineage Tracking of Batch Jobs

  • Dataset-oriented
  • Leverages execution plans
  • Structured APIs only
      • SQL
      • DataFrames
      • Datasets
  • UDFs and lambdas are considered as black boxes

Lineage Tracking of Streaming Jobs

  • Structured Streaming only
  • Source-oriented (topic)
  • Evolves in time

Structured Streaming Support

  • StreamingQueryManager
      • Information about start
      • Can provide execution plans
      • Information about progress
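
As a rough illustration of what the progress information can carry, the following Python sketch pulls the source and sink descriptions out of a progress event serialized as JSON. The field names (`sources`, `sink`, `description`, `startOffset`, `endOffset`, `timestamp`) follow the JSON shape of Spark's `StreamingQueryProgress`; the sample payload, the sink name and the `summarize_progress` helper are made up for illustration and are not part of Spline.

```python
import json

# Hypothetical sample payload, shaped like the JSON form of a
# Structured Streaming progress event. All values are invented.
SAMPLE_PROGRESS = """
{
  "id": "q-1",
  "runId": "r-1",
  "timestamp": "2018-09-24T10:00:00.000Z",
  "batchId": 7,
  "sources": [
    {
      "description": "KafkaSource[Subscribe[temperature]]",
      "startOffset": {"temperature": {"0": 100}},
      "endOffset": {"temperature": {"0": 150}},
      "numInputRows": 50
    }
  ],
  "sink": {"description": "hypothetical-sink"}
}
"""


def summarize_progress(progress_json):
    """Keep only the parts of a progress event that matter for lineage."""
    event = json.loads(progress_json)
    return {
        "query_id": event["id"],
        "timestamp": event["timestamp"],
        # One (description, start offset, end offset) triple per source.
        "sources": [
            (s["description"], s.get("startOffset"), s.get("endOffset"))
            for s in event.get("sources", [])
        ],
        # The sink is reported by description only, without offsets.
        "sink": event.get("sink", {}).get("description"),
    }


summary = summarize_progress(SAMPLE_PROGRESS)
```

Note that the event describes offsets per source but only a description for the sink, which is the gap discussed later in the talk.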

StreamingQueryManager

  • MicroBatch

[Diagram labels: Transformations, Start, Execution Plans, Progress, Query, Spline Streaming Listener, Lineage Model, Event details, Spline UI]

Interval View

  • Displays data flow in a fixed interval
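
The idea of an interval view can be sketched in a few lines of Python. This is only an assumption about how such a view might be computed, not Spline's implementation: progress events whose timestamps fall into the selected interval are kept, and a writing job is linked to a reading job when the writer's sink is one of the reader's sources. The `ProgressEvent` type, the `interval_view` function, the reader job `R` and the endpoint names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProgressEvent:
    job: str             # name of the streaming job
    timestamp: datetime  # when the progress was reported
    sources: tuple       # endpoints (e.g. topics) the job read from
    sink: str            # endpoint the job wrote to


def interval_view(events, start, end):
    """Return sorted (writer, reader, endpoint) edges for start <= timestamp < end."""
    in_window = [e for e in events if start <= e.timestamp < end]
    edges = set()
    for writer in in_window:
        for reader in in_window:
            # Link two different jobs when one writes what the other reads.
            if writer.job != reader.job and writer.sink in reader.sources:
                edges.add((writer.job, reader.job, writer.sink))
    return sorted(edges)


events = [
    ProgressEvent("W1", datetime(2018, 9, 24, 10, 5), ("raw",), "temperature"),
    ProgressEvent("R", datetime(2018, 9, 24, 10, 20), ("temperature",), "hourly"),
    ProgressEvent("W2", datetime(2018, 9, 24, 11, 30), ("raw",), "temperature"),
]

# Only W1 and R reported progress between 10:00 and 11:00, so only they are linked.
edges = interval_view(
    events, datetime(2018, 9, 24, 10, 0), datetime(2018, 9, 24, 11, 0)
)
```

Because the link depends only on timestamps, a job that wrote slightly outside the selected interval would be missed even if the reader consumed its data.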

Demo

What is temperature per hour in Prague?

  • Use Case Output 2018-09-24
  • Select Interval View
  • Select Interval
  • Select Sink
  • Find Highlighted Sink
  • Review The Lineage
  • Change The Interval
  • Observe New Lineage
  • Select A Job
  • Drill Down
  • Review Job Details
  • Select An Operation
  • See Operation Attributes

Interval View Limitations

  • Edge case (delayed read, early write)
  • Job W1 should be linked
  • Job W2 should not be linked

Beyond The Interval View

  • Instead of timestamps, use addresses of rows
  • Structured Streaming has addresses (offsets) on each source, but not on sinks
  • Most sinks are also sources and thus could return offsets

Offset-Based Linking

  • Jobs are linked when progress offsets overlap
  • Offset timestamp doesn’t matter
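
A minimal Python sketch of the overlap test follows. It assumes that both the writing and the reading job can report, per partition, the half-open offset range `[start, end)` they touched. As the slides note, sinks do not currently report offsets, so the written ranges here are hypothetical, as are the function names and the sample values.

```python
def ranges_overlap(a, b):
    """Half-open ranges [start, end) overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def offsets_overlap(written, read):
    """Compare two {partition: (start_offset, end_offset)} maps.

    Jobs are considered linked when, on at least one shared partition,
    the written range and the read range overlap.
    """
    shared = written.keys() & read.keys()
    return any(ranges_overlap(written[p], read[p]) for p in shared)


# Hypothetical offsets on partition 0 of one topic.
w1_written = {0: (100, 150)}   # job W1 wrote offsets 100..149
w2_written = {0: (150, 200)}   # job W2 wrote offsets 150..199
reader_read = {0: (120, 150)}  # the reading job consumed offsets 120..149

linked_w1 = offsets_overlap(w1_written, reader_read)
linked_w2 = offsets_overlap(w2_written, reader_read)
```

With these sample values W1 is linked to the reader and W2 is not, regardless of when either job reported its progress.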

Conclusion

  • Spline: data lineage tracking tool
  • New support for Structured Streaming
  • Demo POC: Interval View
  • Proposed generalization: offset-based linking

Future Plans

  • Release Interval View in Spline
  • After changes to Spark:
  • Offset-based linking for micro-batch streaming
  • Continuous streaming support
  • Support for dataset checkpoints

Questions

  • Now is a good time
  • Or feel free to contact us
  • Marek Novotny
  • Vaclav Kosar

Created on 04 Oct 2018. Updated on: 06 Jun 2022.
Thank you
