StreamSets Data Collector authentication through LDAP

StreamSets Data Collector (SDC) supports user authentication based on files or LDAP. By default, Data Collector uses file-based authentication. This post details how to switch to authentication against your company's LDAP server.

To enable LDAP authentication you need to perform the following tasks:
- Configure the LDAP properties in the Data Collector configuration by editing the $SDC_CONF/sdc.properties file:
     - set the value of the http.authentication.login.module property to ldap
     - set the value of the http.authentication.ldap.role.mapping property to map your LDAP groups to Data Collector roles, following this syntax:
            <LDAP_group>:<SDC_role>,<additional_SDC_role>,<additional_SDC_role>
        Multiple roles can be mapped to the same group and vice versa. Use semicolons to separate LDAP groups and commas to separate Data Collector roles. Here's an example:
            http.authentication.ldap.role.mapping=LDAP000:admin;LDAP001:creator,manager;LDAP002:guest
        The available roles are the same ones (admin, manager, creator, guest) provided by default in SDC for file-based authentication.
        This property is empty by default, but setting it is mandatory when http.authentication.login.module=ldap.
- Configure the LDAP connection information by editing the $SDC_CONF/ldap-login.conf file, as in the following example:
     ldap {
         com.streamsets.datacollector.http.LdapLoginModule required
         debug="false"
         useLdaps="false"
         contextFactory="com.sun.jndi.ldap.LdapCtxFactory"
         hostname="ldaphost.yourcompany.com"
         port="389"
         bindDn=""
         bindPassword=""
         authenticationMethod="simple"
         forceBindingLogin="true"
         userBaseDn="ou=ldappages,o=yourcompany.com"
         userRdnAttribute="uid"
         userIdAttribute="mail"
         userPasswordAttribute="userPassword"
         userObjectClass="person"
         roleBaseDn="ou=yourcompanygroups,o=yourcompany.com"
         roleNameAttribute="cn"
         roleMemberAttribute="uniquemember"
         roleObjectClass="groupOfUniqueNames";
     };
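The first task above can be sketched as a few shell commands. This is a minimal sketch, not the official installation procedure: the $SDC_CONF path, the demo directory fallback, and the group names are assumptions to adjust for your environment.

```shell
# Sketch: switch SDC to LDAP authentication in sdc.properties.
# SDC_CONF is an assumption -- point it at your real config directory.
SDC_CONF="${SDC_CONF:-/tmp/sdc-conf-demo}"
mkdir -p "$SDC_CONF"

# Simulate the shipped defaults (file-based authentication):
cat > "$SDC_CONF/sdc.properties" <<'EOF'
http.authentication.login.module=file
http.authentication.ldap.role.mapping=
EOF

# Switch the login module to ldap and map LDAP groups to SDC roles:
sed -i 's/^http\.authentication\.login\.module=.*/http.authentication.login.module=ldap/' "$SDC_CONF/sdc.properties"
sed -i 's/^http\.authentication\.ldap\.role\.mapping=.*/http.authentication.ldap.role.mapping=LDAP000:admin;LDAP001:creator,manager;LDAP002:guest/' "$SDC_CONF/sdc.properties"

# Show the result:
grep '^http\.authentication' "$SDC_CONF/sdc.properties"
```

Running the sketch prints the two rewritten properties, which is a quick way to confirm the edit before restarting SDC.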

where:
  • debug: enables debugging.
  • useLdaps: enables LDAP over SSL (LDAPS).
  • contextFactory: the initial LDAP context factory. You can usually leave the default value, com.sun.jndi.ldap.LdapCtxFactory.
  • hostname: the LDAP server host name.
  • port: the LDAP server port.
  • bindDn: the root distinguished name.
  • bindPassword: the connection password. The value can be set directly here, or stored in a separate file that this property references.
  • authenticationMethod: the authentication method. You can usually leave the default value, simple.
  • forceBindingLogin: determines whether binding login checks are performed. When true, SDC passes the credentials entered in the login form to the LDAP server for authentication (a bind). When false, SDC performs the password check itself, based on the information retrieved from the LDAP server.
  • userBaseDn: the base distinguished name under which user accounts are located.
  • userRdnAttribute: the name of the username attribute.
  • userIdAttribute: the name of the user ID attribute.
  • userPasswordAttribute: the name of the attribute where the user password is stored.
  • userObjectClass: the name of the user object class.
  • roleBaseDn: the base distinguished name to search for role membership.
  • roleNameAttribute: the name of the attribute for roles.
  • roleMemberAttribute: the name of the attribute that lists the members (user names) of a role.
  • roleObjectClass: the role object class.
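A quick way to double-check that your ldap-login.conf defines the options described above is a grep loop. The sketch below first writes the example file from this post (all values are placeholders) into a demo directory, then verifies that the key options are present; adapt the path and the option list to your setup.

```shell
# Write the example ldap-login.conf from this post (placeholder values):
CONF_DIR="${SDC_CONF:-/tmp/sdc-conf-demo}"
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/ldap-login.conf" <<'EOF'
ldap {
    com.streamsets.datacollector.http.LdapLoginModule required
    debug="false"
    useLdaps="false"
    contextFactory="com.sun.jndi.ldap.LdapCtxFactory"
    hostname="ldaphost.yourcompany.com"
    port="389"
    bindDn=""
    bindPassword=""
    authenticationMethod="simple"
    forceBindingLogin="true"
    userBaseDn="ou=ldappages,o=yourcompany.com"
    userRdnAttribute="uid"
    userIdAttribute="mail"
    userPasswordAttribute="userPassword"
    userObjectClass="person"
    roleBaseDn="ou=yourcompanygroups,o=yourcompany.com"
    roleNameAttribute="cn"
    roleMemberAttribute="uniquemember"
    roleObjectClass="groupOfUniqueNames";
};
EOF

# Verify that every option we care about is actually defined:
for opt in hostname port userBaseDn userObjectClass roleBaseDn roleObjectClass; do
    grep -q "^[[:space:]]*$opt=" "$CONF_DIR/ldap-login.conf" \
        && echo "ok: $opt" || echo "MISSING: $opt"
done
```

A "MISSING" line in the output points at an option you still need to fill in before restarting SDC.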

To check for the proper object classes, attribute names, and values in your company's LDAP, you can use the ldapsearch command-line utility from a Linux machine. This is the command syntax to retrieve the information of a given user, including the full list of attributes for the user object:

ldapsearch -H ldap://<host>:<port> -D "<bindDn>" -x -w '<bindPassword>' -b "<userBaseDn>" "<filter>"

Example:

ldapsearch -H ldap://ldap.googlielmo.org:389 -D "" -x -w 'ldap123' -b "ou=ldap,o=googlielmo.org" "mail=john.smith@googlielmo.org"
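The group (role) entries deserve the same check. Assuming ldapsearch returns LDIF like the sample below (the DNs and member names are made up for illustration), you can pull out the values of the roleMemberAttribute to confirm who belongs to a given group:

```shell
# Sample LDIF for a group entry, as ldapsearch might return it
# (hypothetical DNs and members):
cat > /tmp/group.ldif <<'EOF'
dn: cn=LDAP001,ou=yourcompanygroups,o=yourcompany.com
objectClass: groupOfUniqueNames
cn: LDAP001
uniquemember: uid=jsmith,ou=ldappages,o=yourcompany.com
uniquemember: uid=mrossi,ou=ldappages,o=yourcompany.com
EOF

# Extract the member DNs (the values of roleMemberAttribute):
awk -F': ' '$1 == "uniquemember" { print $2 }' /tmp/group.ldif
```

This prints one DN per member, so you can verify that the users you expect to map to an SDC role are really listed under the group before wiring it into http.authentication.ldap.role.mapping.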

Finally, don't forget to restart SDC to apply the configuration changes above.
