The original author and maintainer (James / @purpleidea) has left Amazon and the project is currently unmaintained. If you are interested in this project please either contact Amazon, or the original author who has a mirror at: yesiscan.
`yesiscan` is a tool for performing automated license scanning. It usually takes a file path or git URL as input and returns the list of discovered license information.
It does not generally implement any individual license identification algorithms itself, and instead pulls in many different backends to complete this work for it.
It has a novel architecture that makes it unique in the license analysis space, and which can be easily extended.
If you choose to run it as a webui, the homepage looks like this:
The `yesiscan` project is implemented as a library. This makes it easy to consume and re-use as either a library, CLI, API, WEBUI, BOTUI, or however else you'd like to use it. It is composed of a number of interfaces that roughly approximate the logical design.
Parsers are where everything starts. A parser takes input in whatever format you'd like, and returns a set of iterators. (More on iterators shortly.) The parser is where you tell `yesiscan` how to perform the work that you want. A simple parser might expect a URI like https://github.com/purpleidea/yesiscan/ and error on other formats. A more complex parser might search through the text of an email or chat room to look for useful iterators to build. Lastly, you might prefer to implement a specific API that takes the place of a parser and gives the user direct control over which iterators to create.
Iterators are self-contained programs which know how to traverse through their given data. For example, the most well-known iterator is a file system iterator that can recursively traverse a directory tree. Iterators do this with their recurse method, which applies a particular scanning function to everything that it travels over. (More on scanning functions shortly.) In addition, the recurse method can also return new iterators. This allows iterators to be composable and to perform individual tasks succinctly. For example, the git iterator knows how to download and store git repositories, and then return a new file system iterator at the location where it cloned the repository. The zip iterator knows how to decompress and unarchive zip files. The http iterator knows how to download a file over https. Future iterators will be able to look inside RPMs, and so much more.
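To make that division of labour concrete, here is a rough sketch of what these interface shapes could look like. The names and signatures are illustrative guesses, not the library's real definitions:

```go
// A rough sketch of the parser/iterator split described above. These
// names and signatures are illustrative guesses, not the library's
// real interfaces.
package sketch

// ScanFunc is applied to every file that an iterator travels over.
type ScanFunc func(path string, data []byte) error

// Parser turns user input (a URI, an email body, etc.) into iterators.
type Parser interface {
	Parse(input string) ([]Iterator, error)
}

// Iterator traverses its data (a directory, a git repo, a zip file)
// and may return new iterators, e.g. a git iterator returning a
// filesystem iterator rooted at the location of the fresh clone.
type Iterator interface {
	Recurse(scan ScanFunc) ([]Iterator, error)
}
```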
The filesystem iterator knows how to find git submodules and zip files, and how to open regular files for scanning. It is the cornerstone of all the iterators, since we eventually end up with an fs iterator to do the actual work.
The zip iterator can decompress and extract zip files. It uses a heuristic to decide whether a file should be extracted or not. It usually does the right thing, but if you can find a corner case where it does not, please let us know. It also handles java `.jar` and python `.whl` files since those are basically zip files in disguise.
The tar iterator can extract tar files. It uses a heuristic to decide whether a file should be extracted or not. It usually does the right thing, but if you can find a corner case where it does not, please let us know. It only extracts regular files and directories. Symlinks and other special files will not be extracted, nor will they be scanned, as they have zero bytes of data anyway.
The gzip iterator can decompress gzip files. While the gzip format allows multistream so that multiple files could exist inside one `.gzip` file, this is not currently supported and probably not desired here. This does what you expect and can match extensions like `.gz`, `.gzip`, and even `.tgz`. In the last case it will create a new file with a `.tar` extension so that the tar iterator can open it cleanly.
The bzip2 iterator can decompress bzip and bzip2 files. This does what you expect and can match extensions like `.bz`, `.bz2`, `.bzip2`, and even `.tbz` and `.tbz2`. In the last two cases it will create a new file with a `.tar` extension so that the tar iterator can open it cleanly.
The http iterator can download files from http sources. Because many git sources actually present as https URLs, we use a heuristic to decide what to download. If you aren't getting the behaviour you expect, please let us know. Plain http (not https) URLs are disabled by default.
The git iterator is able to recursively clone all of your git repository needs. It does this with a pure-golang implementation to avoid you needing a special installation on your machine. This correctly handles git submodules, including those which use relative git submodule URLs. There is currently a small bug or missing feature in the pure-golang version, and for compatibility with all repositories, we currently make a single exec call to `git` in some of those cases. As a result, this will use the `git` binary that is found in your `$PATH`.
The scanning function is the core place where the coordination of work is done. In contrast to many other tools that perform file iteration and scanning as part of the same process or binary, we've separated these parts. This is because it is silly for multiple tools to contain the same file iteration logic, instead of just having one single implementation of it. Secondly, if we wanted to scan a directory with two different tools, we'd have to iterate over it twice, read the contents from disk twice, and so on. This is inefficient and wasteful if you are interested in analysis from multiple sources. Instead, our scanning function performs the read from disk that all our different backends (if they support it) can use, so this work doesn't need to be needlessly repeated. (More on backends shortly.) The data is then passed to all of the selected backends in parallel. The second important part of the scanning function is that it caches results in a datastore of your choice. This is done so that repeated queries do not have to perform the intensive work that is normally required to scan each file. (More on caching shortly.)
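Here is a minimal sketch of that fan-out idea, assuming hypothetical `Backend` and `Cache` interfaces (the library's real interfaces and scanning function differ):

```go
// A minimal sketch of the fan-out idea described above. The Backend and
// Cache interfaces here are hypothetical stand-ins, not the library's
// real definitions.
package sketch

import (
	"os"
	"sync"
)

// Backend is a hypothetical license-scanning backend.
type Backend interface {
	Name() string
	ScanData(data []byte) (string, error)
}

// Cache is a hypothetical results datastore.
type Cache interface {
	Get(path, backend string) (string, bool)
	Put(path, backend, result string)
}

// scanFile reads a file once, then feeds the bytes to every backend in
// parallel, consulting the cache to skip work that was already done.
func scanFile(path string, backends []Backend, cache Cache) (map[string]string, error) {
	data, err := os.ReadFile(path) // one read, shared by all backends
	if err != nil {
		return nil, err
	}
	results := make(map[string]string)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, b := range backends {
		if r, ok := cache.Get(path, b.Name()); ok { // cache hit: skip the scan
			mu.Lock()
			results[b.Name()] = r
			mu.Unlock()
			continue
		}
		wg.Add(1)
		go func(b Backend) {
			defer wg.Done()
			r, err := b.ScanData(data)
			if err != nil {
				return // a real implementation would collect errors too
			}
			mu.Lock()
			results[b.Name()] = r
			mu.Unlock()
			cache.Put(path, b.Name(), r)
		}(b)
	}
	wg.Wait()
	return results, nil
}
```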
The backends perform the actual license analysis work. The `yesiscan` project doesn't really implement any core scanning algorithms. Instead, we provide a way to re-use all the existing license scanning projects out there. Ideal backends will support a small interface that lets us pass byte array pointers in, and get results out, but there are additional interfaces that we support if we want to reuse an existing tool that doesn't support this sort of modern API. Sadly, most don't, because most software authors focus on the goals for their individual tool, instead of a greater composable ecosystem. We don't blame them for that, but we want to provide a mechanism where someone can write a new algorithm, drop it into our project, and avoid having to deal with all the existing boilerplate around filesystem traversal, git cloning, archive unpacking, and so on. Each backend may return results about its analysis in a standard format. (More on results shortly.) In addition to the well-known, obvious backends, there are some "special" backends as well. These can import data from curated databases, snippet repositories, internal corporate ticket systems, and so on. Even if your backend isn't generally useful worldwide, we'd like you to consider submitting and maintaining it here in this repository so that we can share ideas, and potentially get new ideas about design and API limitations from doing so.
The google license classifier backend wraps the google license classifier project. It is a pure golang backend, which is nice, although the API uses files on disk for intermediate processing. This is suboptimal for most cases, but it does make examination of incredibly large files possible. Some of the results are spurious, so use it with a lower confidence interval.
Cran is a backend for `DESCRIPTION` files, which are text files that store important R package metadata. It finds the names of licenses in the `License` field of the text file.
Pom is a backend for parsing Project Object Model or POM files. It finds names of licenses in the `licenses` field of the `pom.xml` files which are commonly used by the Maven Project. This parser sometimes cannot identify licenses when the name is written in its full form.
This is a simple, pure-golang SPDX parser. It should find anything that is a valid SPDX identifier. It was written from scratch for this project since the upstream version wasn't optimal. It shouldn't have any bugs, but if you find any issues, please report them!
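As a toy illustration only (the real parser is a from-scratch implementation that finds bare identifiers, not just tag lines), a naive approach might grep for the conventional SPDX tag:

```go
// A toy sketch only: grep for the conventional SPDX tag line. The real
// parser in this project is a from-scratch implementation that also
// finds bare identifiers.
package sketch

import "regexp"

// spdxTag matches lines like: SPDX-License-Identifier: Apache-2.0
var spdxTag = regexp.MustCompile(`SPDX-License-Identifier:\s*([A-Za-z0-9 .()+-]+)`)

// findSPDX returns every SPDX license expression tagged in the input.
func findSPDX(data []byte) []string {
	var ids []string
	for _, m := range spdxTag.FindAllSubmatch(data, -1) {
		ids = append(ids, string(m[1]))
	}
	return ids
}
```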
This wraps the askalono project which is written in rust. It shells out to the binary to accomplish the work. There's no reason this couldn't be easily replaced with a pure-golang version, although we decided to use this because it was already built and it serves as a good example on how to write a backend that runs an exec. Due to a limitation of the tool, it cannot properly detect more than one license in a file at a time. As a result, benefit from its output, but make sure to use other backends in conjunction with this one. The `askalono` binary needs to be installed into your `$PATH` for this to work. To install it run: `cargo install askalono-cli`. It will download and build a version for you and put it into `~/.cargo/bin/`. Either add that directory to your `$PATH` or copy the `askalono` binary to somewhere appropriate like `~/bin/`.
This wraps the ScanCode project which is written mostly in python. It is a venerable project in this space, but it is slower than the other tools and is a bit clunky to install. To install it, first download the latest release, then extract it into `/opt/scancode/`, and then add a symlink to the main entrypoint in your `~/bin/` so that it shows up in your `$PATH` where we look for it. Run it with `--help` once to get it to initialize if you want. This looks roughly like this:
wget https://github.com/nexB/scancode-toolkit/releases/download/v30.1.0/scancode-toolkit-30.1.0_py36-linux.tar.xz
tar -xf scancode-toolkit-30.1.0_py36-linux.tar.xz
sudo mv scancode-toolkit-30.1.0/ /opt/scancode/
cd ~/bin/ && ln -s /opt/scancode/scancode
cd - && rm scancode-toolkit-30.1.0_py36-linux.tar.xz
scancode --help
In the future a more optimized scancode backend could be written to improve performance when running on large quantities of files, using the directory interface, and also perhaps even spawning it as a server. Re-writing the core detection algorithm in golang would be a valuable project.
Bitbake is a build system that is commonly used by the Yocto project. It has these `.bb` metadata files that contain `LICENSE=` tags. This backend looks for them and includes them in the result. It tries to read them as SPDX IDs where possible.
Regexp is a backend that lets you match based on regular expressions. Nobody likes to do this, but it's very common. Put a config file at `~/.config/yesiscan/regexp.json` and then run the tool. An example file can be found in [examples/regexp.json](examples/regexp.json). You can override the default path with the `--regexp-path` command line flag.
The caching layer will be coming soon! Please stay tuned =D
Each backend can return a result "struct" about what it finds. These results are collected and eventually presented to the user with a display function. (More on display functions shortly.) Results contain license information (more on licenses shortly) and other data, such as the confidence interval of each determination.
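As a rough sketch (the field names here are invented for illustration), such a result might carry something like:

```go
// Illustrative only: the field names here are invented, and the real
// result struct in the library differs.
package sketch

// Result sketches what a backend might return for one determination.
type Result struct {
	Backend    string  // which backend produced this determination
	Path       string  // the file that the determination applies to
	LicenseID  string  // e.g. an SPDX identifier like "Apache-2.0"
	Confidence float64 // 0.0 to 1.0: how sure the backend is
}
```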
Display functions show information about the results. They can show as much or as little information about the results as they want. At the moment, only a simple text output display function has been implemented, but eventually you should be able to generate beautiful static html pages (with expandable sections for when you want to dig deeper into some analysis) and even send output as an API response or to a structured file.
Licenses are the core of what we usually want to identify. It's important for most big companies to know what licenses are in a product so that they can comply with their internal license usage policies and the expectations of the licenses. For example, many licenses have attribution requirements, and it is usually common to include a `legal/NOTICE` file with these texts. It's also quite common for large companies to want to avoid the `GPL` family of licenses, because including a library under one of these licenses would force the company to have to release the source code for software using that library, and most companies prefer to keep their source proprietary. While some might argue that it is ideologically or ethically wrong to consume many dependencies and benefit financially, without necessarily giving back to those projects, that discussion is out of scope for this project; please have it elsewhere. This project is about "knowing what you have". If people don't want to have their dependencies taken and made into proprietary software, then they should choose different software licenses! This project contains a utility library for dealing with software licenses. It was designed to be used independently of this project if and when someone else has a use for it. If need be, we can spin it out into a separate repository.
Make sure you've cloned the project with `--recursive`. This is necessary because the project uses git submodules. The project also uses the go mod system, but the author thinks that forcing developers to pin dependencies is a big mistake, and prefers the `vendor/` + git submodules approach that was easy with earlier versions of golang. If you forgot to use `--recursive`, you can instead run `git submodule init && git submodule update` in your project git root directory to fix this. To then build this project, you will need golang version `1.17` or greater. To build this project as a CLI, you will want to enter the `cmd/yesiscan/` directory and first run `go generate` to set the program name and build version. You can then produce the binary by running `go build`.
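Putting those steps together, a fresh build might look like this:
git clone --recursive https://github.com/purpleidea/yesiscan/
cd yesiscan/cmd/yesiscan/
go generate
go build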
Just run the binary with whatever input you want. For example:
yesiscan https://github.com/purpleidea/mgmt/
Just run the binary in `web` mode. Then you can launch your web browser and use it normally. For example:
yesiscan web
xdg-open http://localhost:8000/
You can store your default configuration options in a `~/.config/yesiscan/config.json` file. This location can be overridden by the `--config-path` argument. If this file exists, then these values will be used as defaults. The below flags can override any of these. The following keys are supported:
- `auto-config-uri`
- `auto-config-cookie-path`
- `auto-config-expiry-seconds`
- `auto-config-force-update`
- `auto-config-binary-version`
- `quiet`
- `regexp-path`
- `output-type`
- `output-path`
- `output-template`
- `output-s3bucket`
- `region`
- `profiles`
- `backends`
- `binaries`
- `configs`
These keys should all be top-level keys in a single json dictionary. More information on some of these keys is given below.
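As a loose illustration only (the key names come from the list above, but every value and backend name here is made up), a `config.json` could look something like:

```json
{
	"quiet": false,
	"profiles": ["default"],
	"backends": {
		"spdx": true,
		"askalono": false
	},
	"binaries": {
		"linux-amd64-<version>": "https://example.com/yesiscan"
	},
	"configs": {
		"~/.config/yesiscan/regexp.json": "https://example.com/regexp.json"
	}
}
```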
This key should be a list of "profiles" to use. See the Profiles section below for more information.
These keys should be a dictionary of backend names to boolean `true` or `false` values representing the enabled state of that backend. If you don't specify a backend here, then whether that backend will be enabled is undefined and will depend on which backend flags you use. As a result, it is always recommended to be explicit about which backends you want to enable.
This key is a map which lists the available binaries for a particular yesiscan version. The value of each map entry is a direct URI to the binary in question. The keys in this map have the following pattern: `$OS-$ARCH-$VERSION`, where `$OS` is the specific operating system used, such as `linux`, `darwin`, or `windows`, where `$ARCH` might be `amd64` or `arm64`, and where `$VERSION` is the special short version string as seen by running the program with the `version` arg.
These keys should be a dictionary of destination file names to source URI paths. This map of files will be downloaded to the destination paths from the source URI paths. The destination file paths accept the tilde (`~`) character for `$HOME` directory path expansion. The destination paths must all be rooted under the parent directory of the main config file. This prevents using this tool to write to `/etc/passwd` or `~/.ssh/id_rsa`, for example. The source URIs will try to use the cookie path if it is specified. Overall, this feature is helpful for pulling down multiple files for use in concert with a specific config that is likely brought in via the auto config mechanism.
You can add flags to tell it which backends to include or remove. They're all included by default unless you choose which ones to exclude with the `--no-backend` variants. However, if you use any of the `--yes-backend` variants, then you have to specify each backend that you want individually. You can get the full list of these flags with the `--help` flag.
This is a special URI which, if set, will cause the tool to try to pull a config from that location on startup. It will use the cookie file stored at `--auto-config-cookie-path` if specified. If successful, it will check whether the config is different from what is currently stored. If it differs, it will validate that it is a valid json config, and if so, it will replace (overwrite!) the current config and then run with that! For example: `--auto-config-uri 'https://example.com/config.json'`.
This is a special path which, if set, will point to a netscape/libcurl style cookie file to use when making the GET download requests. This is useful if you store your config behind some gateway that needs a magic cookie for auth. It accepts the tilde (`~`) character for `$HOME` directory path expansion. We only read from this path, and expect another tool to have previously written the cookie file there. For example: `--auto-config-cookie-path '~/.secret/cookie'`.
This value, if set, is the minimum number of seconds to wait between automatic updates of the configuration. If this is set to zero, then updates will always be attempted. If this is negative, then updates will never be attempted unless you forcefully request them with `--auto-config-force-update`.
If this flag is specified, then we will always attempt to update the auto config on each run.
If this flag is specified, we will attempt to replace the current binary with this version of the program if it exists in our config. To override this setting in the remote config, you can specify this with the empty string `''` as the arg so that we will avoid replacing the requested version. These versions are stored in a giant map in the main config file in the `binaries` section shown above.
If this flag is specified, no scan is done. The auto config code will execute though. This is useful to get the config up-to-date without running a scan. It can be combined with `--auto-config-force-update` for some guaranteed updates!
When this boolean flag is enabled, all log messages will be suppressed.
This is the path to the regexp rules file as used by the regexp backend. If it is not specified, then we will automatically look for a file at `~/.config/yesiscan/regexp.json`.
This is the path to the main `config.json` file. If it is not specified, then we will automatically look for a file at `~/.config/yesiscan/config.json`.
When run with `--output-type html`, the scan results will be output in html. When run with `--output-type text`, the scan results will be in plain text. This requires that you also specify `--output-path`, `--output-template`, or `--output-s3bucket`. If you don't specify this, it will default to `html`.
When run with `--output-path <path>`, the scan results will be saved to a file. This will overwrite whatever file contents are already there, so please use it carefully. If you specify `-` as the file path, then stdout will be used. This will also cause the quiet flag to be enabled.
When run with `--output-template <path>`, the scan results will be saved to a file. This will overwrite whatever file contents are already there, so please use it carefully. If you specify `-` as the file path, then stdout will be used. This will also cause the quiet flag to be enabled. This option is identical to the `--output-path` option, except that it accepts named format strings. Each named format string must be surrounded by curly braces. Certain dangerous values will be stripped from the output template, so don't try to be malicious or strange. The list of valid format string names is as follows.
- "date": Returns the RFC3339 date with colons changed to dashes.
If you specify this flag with the name of an AWS S3 bucket, then the report will be uploaded to this location. You must have previously created an AWS account and have installed the credentials triple on the machine where you are running this tool. It is recommended that you use a dedicated (not shared) S3 bucket with this tool, as it will control the internal namespace and could potentially overwrite a file that you have already stored there. After the file is written, it will return a presigned URL that you can share with others. It will also return a public URL that you can share as well. This URL will only work if you have public access settings configured for your bucket. To configure those, you can refer to the settings below. The public object URLs that are generated are pseudo-hard to guess, but not impossible. The advantage they have over the presigned URLs is that they don't expire, whereas the presigned URLs expire after seven days. This is an Amazon imposed limit.
Public access settings you may or may not want to set.
For more info please refer to the [AWS docs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/configuring-block-public-access-account.html).

This is the S3 region that is used for uploading files to S3 buckets.
This flag is used by the web variant to tell the server where to listen. You can specify a port or both a port and ip address. For example, try: `127.0.0.1:8000` or `:8000`.
This flag may be used multiple times to enable different profiles. This is used by both the regular cli and also the web variant. The profiles system is described below.
Most users might want to filter their results so that not all licenses are shown. For this you may specify one or more `--profile <name>` parameters. If the `<name>` corresponds to a `<name>.json` file in your `~/.config/yesiscan/profiles/` directory, then it will use that file to render the profile. The contents of that file should be in a similar format to the example file in [examples/profile.json](examples/profile.json). You get to pick a comment for personal use, a list of SPDX license IDs, and whether this is an exclude list or an include list. If you don't specify any profiles, you will get the default profile. `default` is also a built-in name, so you can add this profile to your above set by doing `--profile default`, and if there is no such user-defined profile, then the built-in default will be displayed.
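As a loose sketch only (these field names are invented; see [examples/profile.json](examples/profile.json) for the real format), a profile file could look something like:

```json
{
	"comment": "hide the permissive licenses we don't worry about",
	"licenses": ["MIT", "Apache-2.0", "BSD-3-Clause"],
	"exclude": true
}
```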
If you source the bash-autocompletion stub, then you will get autocompletion of the common flags! Download the stub from https://github.com/urfave/cli/blob/main/autocomplete/bash_autocomplete and put it somewhere like `/etc/profile.d/yesiscan`. The name of the file must match the name of the program! Things should just work, but if they don't, you may want to add a stub in your `~/.bashrc` like:
# force yesiscan bash-autocompletion to work
if [ -e /etc/profile.d/yesiscan ]; then
    source /etc/profile.d/yesiscan
fi
This project uses `gofmt -s` and `goimports -s` to format all code. We follow the mgmt style guide even though we don't yet have all the automated tests that the mgmt project does. Commit messages should start with a short, lowercase prefix, followed by a colon. This prefix should keep things organized a bit when perusing logs.
Copyright Amazon.com Inc or its affiliates and the yesiscan project contributors. Written by James Shubin [email protected] and the project contributors.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
We will never require a CLA to submit a patch. All contributions follow the `inbound == outbound` rule.
This is not an official Amazon product. Amazon does not offer support for this project.
James Shubin, while employed by Amazon.ca, came up with the initial design, project name, and implementation. James had the idea for a soup can as the logo, which Sonia Xu implemented beautifully. She had the idea to do the beautiful vertical lines and layout of it all.
Happy hacking!